Difference between revisions of "Metadata"

== User ==
 
 
This is the internal database of users of VALEP. Users can be created, edited, and deleted by the admin only. They receive a username and password and a specific user role inside of VALEP.
 
  
 

Revision as of 14:28, 27 May 2021

VALEP uses a relational database as well as parsing tools that ensure consistent metadata. A person, for example, can be identified as the author of a document only if the person is already stored in VALEP's table of persons. A date is only accepted by VALEP if it is specified in accordance with the so-called EDTF format. On the other hand, (almost) all free text metadata categories allow the whole range of Unicode symbols, e.g. Hong Qian can alternatively be spelled as 洪谦.


Date

Dates and times are specified in VALEP using all levels of the highly flexible Extended Date Time Format (EDTF): see the [detailed specification].

  • To specify a simple date use the Year-Month-Day format, e.g. 2020-12-02 (and make sure that you separate the digits with a hyphen - rather than any other symbol)
  • For a range of dates use / between the dates, e.g. 1900-12-24/1900-12-31
  • Entire months and years can be specified in an obvious way as 1932-10 (= October 1932) and 1968 (= the entire year 1968)
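
The three forms listed above can be checked with a few lines of Python. This is only a minimal sketch, not VALEP's actual parser: it covers just the simple date, month, year, and range forms mentioned here (full EDTF allows much more), the function names are hypothetical, and it does not check that a range starts before it ends.

```python
import re
from datetime import date

# YYYY, YYYY-MM or YYYY-MM-DD, with hyphens as the only separator.
_SIMPLE_DATE = re.compile(r"([0-9]{4})(?:-([0-9]{2})(?:-([0-9]{2}))?)?")


def is_simple_date(text: str) -> bool:
    """Accept a year, a year-month, or a full calendar date."""
    match = _SIMPLE_DATE.fullmatch(text)
    if match is None:
        return False
    year, month, day = match.groups()
    try:
        # Missing month/day default to 1, so only the given parts are checked.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False
    return True


def is_valid_date_entry(text: str) -> bool:
    """Accept a single date or a range written as start/end."""
    parts = text.split("/")
    if len(parts) > 2:
        return False
    return all(is_simple_date(part) for part in parts)
```

With this sketch, 2020-12-02, 1932-10, 1968 and 1900-12-24/1900-12-31 are accepted, while 2020.12.02 is rejected because it uses a separator other than the hyphen.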

Language

Languages are stored in VALEP in a csv file that can be edited by the admin. The list was initially taken from the Library of Congress and currently comprises about 500 languages. To select a language, just type some characters of the name of the language and select the language from the list that pops up (click once or press return). If the language you are going to use does not show up in the list, please send an email to Christian Damböck.

Location

For locations we use a powerful internal tool, which is based on a hierarchical structure of areas. Each city must be located inside an area. A concrete address, in turn, is always based on a city.

Area

Areas are hierarchically structured. This means two different things.

  • Main areas are specified as nested boxes: Lower Austria is inside Austria and Austria is inside Europe, so Lower Austria also belongs to Europe. Because Austria does not belong to Asia, Lower Austria is not part of Asia either.
  • Special areas do not fit into the nested structure of main areas; examples are:
    • The English Speaking World which includes regions from all over the world (USA, Canada, Great Britain, etc.)
    • The Habsburg Empire which includes Austria and parts of other states (Hungary, Czech Republic, Italy, Ukraine, etc.)

The main purpose of this complex combined structure of areas and special areas is that one can filter all documents that were produced in a certain - arbitrarily complex - region (implementation of this filter feature is still pending).
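
The combination of nested main areas and special areas can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the dictionaries, the function names, and the sample entries (which only partly mirror the examples above) are hypothetical and do not reflect VALEP's internal tables.

```python
from typing import Optional

# Hypothetical sample data: each main area points to the area that contains it.
MAIN_PARENT: dict[str, Optional[str]] = {
    "Lower Austria": "Austria",
    "Austria": "Europe",
    "Great Britain": "Europe",
    "Europe": None,
    "Asia": None,
    "USA": "North America",
    "Canada": "North America",
    "North America": None,
}

# Hypothetical sample data: a special area is a free collection of main areas.
SPECIAL_AREAS: dict[str, set[str]] = {
    "English Speaking World": {"USA", "Canada", "Great Britain"},
}


def is_inside(area: str, container: str) -> bool:
    """Walk up the nested main areas until the container is found or the top is reached."""
    current: Optional[str] = area
    while current is not None:
        if current == container:
            return True
        current = MAIN_PARENT.get(current)
    return False


def in_region(area: str, region: str) -> bool:
    """True if the area lies in a main area or in any member of a special area."""
    if region in SPECIAL_AREAS:
        return any(is_inside(area, member) for member in SPECIAL_AREAS[region])
    return is_inside(area, region)
```

A filter over documents could then call in_region for the area of each document's location, which works the same way for main areas and special areas.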

City

A city is a geographical entity that belongs to an area. An address must always be connected to a city, rather than an area.

Address

An address must specify at least a city but may optionally add further information about district, zip code, road, number of building, etc. Thus, Vienna is one address; Vienna 1080, Alserstraße 23/23 is another one.

Authority: Person and Institution

Persons and Institutions (= authorities) can be optionally used for the specification of all metadata categories that cover authors, creators, receivers, issuers, and those being involved in the development of a document. In almost all these cases it is also possible to specify several persons/institutions. Just type in one or more characters that belong to the name of the person/institution and then select the name from the list, using the mouse (single click) or keyboard (press return).

The difference between persons and institutions, at the level of VALEP's data structure, is only a matter of complexity:

  • Institutions are characterized only by a Name and an Abbreviated name, together with an optional address, URL and description
  • Persons, by contrast, add the following to the Abbreviated Name and the optional address, URL and description:
    • First Name and Surname
    • Date of Birth and Date of Death, both optional fields being specified in the EDTF format (see date)
    • Profession, an optional field that covers a brief description of a person's professional occupation and biography (as it might be used in a name index)
    • Two optional fields that cover a Short biography and a long biography
    • An optional list of Institutions that might associate, for example, Rudolf Carnap, with institutions such as the Vienna Circle, Logical Empiricism, the journal Erkenntnis or the German Youth Movement. These institutions must belong to the table of Institutions as described above.
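
The two authority types can be pictured as two record types that share a few fields. The following Python sketch is only an illustration: the class and field names are hypothetical, and treating First Name, Surname and Abbreviated Name as required is an assumption, since the text above only states which fields are optional.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Institution:
    name: str
    abbreviated_name: str
    address: Optional[str] = None
    url: Optional[str] = None
    description: Optional[str] = None


@dataclass
class Person:
    first_name: str
    surname: str
    abbreviated_name: str
    address: Optional[str] = None
    url: Optional[str] = None
    description: Optional[str] = None
    date_of_birth: Optional[str] = None   # EDTF string
    date_of_death: Optional[str] = None   # EDTF string
    profession: Optional[str] = None
    short_biography: Optional[str] = None
    long_biography: Optional[str] = None
    institutions: list[str] = field(default_factory=list)  # names of Institutions


def unknown_institutions(person: Person, table: dict[str, Institution]) -> list[str]:
    """Return the associated institution names that are missing from the table."""
    return [name for name in person.institutions if name not in table]
```

The helper reflects the rule that associated institutions must already belong to the table of Institutions: a non-empty result would mean the person record cannot be stored as is.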

Event

An Event is specified here by a Name and an optional Description, together with the following:

  • An optional location and optional date
  • An event type that needs to be chosen from an internally predefined list that includes items such as Conference or Discussion Circle Meeting

Events can be added by all users of VALEP.

Typeface

Specifies the way in which a text was produced. The options are

  • Long Hand
  • Short Hand
  • Machine Written
  • Printed
  • Electronic
  • Mixed

These options are fixed and cannot be changed by users of VALEP.

Card File

Each instance of the document category File Card must be associated with a certain Card File. The latter is identified in VALEP by a Name and by the Person or Institution that owns the Card File. Optionally, the Typeface of a card file can be pre-specified and a Description can be added.

Document format

A Document Format is characterized by a Name and an optional Description and must be associated with one of the following categories:

  • Text/2D/3D Object (= all document categories except Photograph, Audio, Video)
  • Photograph
  • Audio
  • Video

Examples might be A4 or letter for text documents, vinyl disc for Audio.

Document formats can be edited or added by the admin only.

Document Status

There are three possible values: Beginner, Advanced, and Admin. The value restricts who may edit and delete general documents and versions.

  • Users of the type Beginner can edit and delete only those general documents and their versions that have the status Beginner
  • Users of the type Advanced can edit and delete only those general documents and their versions that either have the status Beginner or Advanced
  • Users of the type Admin can edit and delete all general documents and versions
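
The three rules above amount to an ordered comparison between the type of the user and the status of the document. A minimal Python sketch, assuming a hypothetical numeric ranking and function name (VALEP's actual implementation is not documented here):

```python
# Hypothetical ranking: a higher number means more rights.
LEVEL = {"Beginner": 0, "Advanced": 1, "Admin": 2}


def can_edit_or_delete(user_type: str, document_status: str) -> bool:
    """A user may edit or delete documents whose status does not exceed the user's type."""
    return LEVEL[document_status] <= LEVEL[user_type]
```

For example, an Advanced user may edit a document with status Beginner or Advanced but not one with status Admin.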

Copying process

A Copying process is characterized by a Name and an optional Description and must be associated with one of the following categories:

  • Text/2D/3D Object (= all document categories except Photograph, Audio, Video)
  • Photograph
  • Audio
  • Video

Examples might be blue print or Xerox for text documents. New items can always be added to the list.

Data types in VALEP

These are the data types being used in VALEP, here specified in the way in which they are covered in the following two sections:

  • Enum specifies a metadata category where the user must choose one value from an internal predefined list
  • Boolean can be either true or false
  • Name means that the user needs to choose a data set from the table Name that can be edited in the admin section
  • Authority means that the user needs to choose a data set either from the table Person or Institution
  • Date and Location need to be specified, according to the rules being described above
  • Unicode (X) means that the user can specify text using the whole range of Unicode symbols; the text is limited to X characters
  • Simple (X) means that the user can specify text only by using a restricted set of characters that include [A … Z] [a … z] [1 … 0] .,;:-+=*/\~#@§$%!?&(){}[]<>|^°´`‘“
  • NN means that the content of the data field must not be null (per default, data fields can always be empty in VALEP)
  • UQ means that the content of the data field must be unique, among all instances of the respective data sets (per default, data fields need not be unique in VALEP)
  • (n) means that a data field might contain several instances of data of the specified type (per default, data fields contain either zero or 1 instance of data of the specified type)
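
To show how the modifiers combine, here is a minimal Python sketch of a check for a Unicode (X) field with the NN and (n) modifiers. The class, the function, and the error messages are hypothetical; UQ is left out because it needs access to all stored data sets, and counting characters as Unicode code points is an assumption about how VALEP measures the limit X.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextFieldSpec:
    max_len: int             # the X in Unicode (X)
    not_null: bool = False   # NN
    multiple: bool = False   # (n)


def check_text_field(spec: TextFieldSpec, values: list[str]) -> list[str]:
    """Return a list of rule violations; an empty list means the values are fine."""
    errors = []
    if spec.not_null and not values:
        errors.append("value required (NN)")
    if not spec.multiple and len(values) > 1:
        errors.append("only one value allowed")
    if any(len(value) > spec.max_len for value in values):
        errors.append(f"text longer than {spec.max_len} characters")
    return errors
```

Under these assumptions, a field declared as Unicode (300) accepts an empty list of values, whereas the same field with NN does not.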

All metadata (archive tree)

The archive tree develops from a root called Archives. The leaves of the tree are files.

Nodes of the archive tree

  • might contain a Description
  • all except Collections must contain a Title and optionally contain a Long Title
  • all nodes of the archive tree are child unique, regarding their title, i.e., they are unique among all instances of children of their parent
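
Child uniqueness can be checked per parent by looking for repeated titles among its children. A minimal Python sketch with a hypothetical function name, assuming that titles are compared as exact strings (whether VALEP ignores case or whitespace is not stated here):

```python
def duplicate_titles(child_titles: list[str]) -> set[str]:
    """Return the titles that occur more than once among the children of one parent."""
    seen: set[str] = set()
    duplicates: set[str] = set()
    for title in child_titles:
        if title in seen:
            duplicates.add(title)
        seen.add(title)
    return duplicates
```

The same title may still appear under different parents, because the check only ever looks at the children of a single node.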

Archive

These are the top level nodes of the archive tree. They typically represent a physical archive that, in turn, might represent either a public institution (e.g., university archive, state archive) or a private collection held by a private institution or person. But an archive might also house digital collections, of course, that disintegrate into electronic files and folders. The Archive node in itself only stores metadata that identify an archive inside VALEP:

  • Title Unicode (300)
  • Long Title Unicode (300)
  • Description Unicode (30,000)
  • URL Unicode (300)
  • Address (is currently Unicode (300) but in a future implementation it will become Location)
  • Private Collection Boolean (indicates that an archive is not a public institution)
  • Owners User(n)

Collection

These second level nodes must have an Archive as parent. They are not characterized by a Title but rather by an Authority that specifies the respective collection, e.g., Carnap collection or Vienna Circle collection.

  • Collection Authority
  • Description Unicode (30,000)
  • Owners User(n)

Digitization

These are third level nodes and must have a Collection as parent. They represent instances of digitization of an archive that form a unit of some kind. They might cover the material that was digitized by a particular person or group, using a certain digitization method, e.g. Karl's compact camera files or the archive's original scans.

  • Title Unicode (300)
  • Long Title Unicode (300)
  • Description Unicode (30,000)
  • Date (is currently Unicode (300) but in a future implementation it will become Date)
  • Signature type Enum, options are No signature proposals and Signature like Folder name (in a future implementation, this will be available as a technique to specify the Signature of a version)
  • Source type Enum, options are Microfiche, Original, Paper Copy and Other (the purpose of this data field is basically to distinguish original sources from microfiche and paper copies)
  • Digitization Type Enum, options are, among others, Compact camera handheld or Scan
  • Producer Unicode (300), refers to the person or group that produced the digitization
  • Owner is Producer Boolean if yes, then the owner (next field) is also the producer of a digitization (is not entirely consistent and might be removed in a future version)
  • Owners User(n)

Box (recursive)

Each box must have either a digitization or a box as parent. Therefore, boxes might construct several levels of the archive tree, viz. they can be nested / boxes may contain other boxes. Additionally, it is an idiosyncratic feature of VALEP that boxes may only contain boxes and folders but no files.

  • Title Unicode (300)
  • Description Unicode (30,000)
  • Owners User(n)

Folder

Each folder must have either a box or a digitization as parent. Folders may only contain files and versions.

  • Title Unicode (300)
  • Description Unicode (30,000)
  • Owners User(n)

File

Files can only be contained in folders. In other words, boxes cannot contain files or files and folders/boxes at the same time. This is a difference between VALEP and the usual nested file structure in computer systems; it aims at making the nested structure more transparent and strict. Physical archives might house any kind of objects. VALEP, however, only allows digitizations to be stored in files of the following types:

  • Photograph (jpg)
  • Text (pdf)
  • Audio (mp3)
  • Video (mp4)

Any file is first characterized by the following metadata

  • Title Unicode (300)
  • Description Unicode (30,000)
  • Open Boolean If Yes, the content of published files can be viewed in the file viewer / If no, only the metadata of a published file is accessible
  • Owners User(n)
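
The containment rules of the archive tree, as described in the sections from Archive to File, can be summarized as a table of allowed parents. The Python sketch below is an illustration only; the dictionary and function names are hypothetical, and "Archives" stands for the root of the tree.

```python
# For each node type, the node types that may be its parent.
ALLOWED_PARENTS: dict[str, set[str]] = {
    "Archive": {"Archives"},          # top level nodes below the root
    "Collection": {"Archive"},
    "Digitization": {"Collection"},
    "Box": {"Digitization", "Box"},   # boxes can be nested
    "Folder": {"Box", "Digitization"},
    "File": {"Folder"},               # files live in folders only
}


def can_contain(parent_type: str, child_type: str) -> bool:
    """True if a node of parent_type may directly contain a node of child_type."""
    return parent_type in ALLOWED_PARENTS.get(child_type, set())
```

Keeping the rules in one table makes the idiosyncratic restriction visible at a glance: "Box" never appears among the allowed parents of "File".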

All metadata (general documents)

Information on archive items is stored in VALEP in the context of general documents (this section) and versions (next section). General documents contain only metadata about an archival item, whereas the document becomes connected with files of the archive tree only via versions.

The nomenclature for general documents is specified in VALEP by means of several csv documents and therefore can be easily edited by the admin. The general structure is this.

  • There is a fixed list of 49 metadata categories being used in VALEP
  • There is a flexible list of document categories which currently contains 13 items
  • There is a flexible document categories table that associates document categories with those items of the list of 49 metadata categories being used in the respective document category
  • There is a flexible list of document types that adds to each document category a list of different document types to which the category disintegrates

Metadata categories

These are the metadata categories for general documents

  1. Document Category Enum, NN
  2. Title Unicode (300)
  3. Title (alternative, long) Unicode (1,000)
  4. Description Unicode (30,000)
  5. Document Type Enum
  6. Card file Card File, NN
  7. URL Unicode (300)
  8. Author / Sender Authority(n)
  9. Receiver Authority(n)
  10. Involved Authority(n)
  11. Event Event
  12. Related Events Event(n)
  13. Date Date(n)
  14. Location / Place of Record Location(n)
  15. Place of Posting Location
  16. Language Language(n)
  17. Typeface Enum
  18. Document format Document Format
  19. Scope Simple (30) (e.g., 30 pp, 210 min.) (should become Unicode in a later version)
  20. Document status Enum
  21. Publisher Unicode (100)
  22. Place of Publication Unicode (100)
  23. Series Editor Authority(n)
  24. Series Title Unicode (300)
  25. Volume (Series) Simple (30)
  26. Number of volumes Simple (30)
  27. Edition Simple (30)
  28. Date of first edition Date
  29. Place of first edition Unicode (100)
  30. Publisher of first edition Unicode (100)
  31. ISBN Simple (50)
  32. DOI Simple (50)
  33. Autonomous publication Unicode (300)
  34. Volume (Journal) Simple (30)
  35. Issue Simple (30)
  36. Original Publication Unicode (300)
  37. ISSN Simple (50)

Document categories and document types

In the present nomenclature, VALEP features 13 document categories and 78 document types. In the following list we add to the document categories the abbreviations used in VALEP, e.g., (M) for Manuscript / Chronicle / Object. We list the document types and mention only some of the features outlined in the document categories table (viz. which metadata categories belong to a document category).

  • Manuscript / Chronicle / Object (M)
    No receiver, no place of posting (see letter), no event (see Minute or Memo), no publication data (see Book or Article)
    This is the main document category that covers all kinds of manuscripts, chronicles, and notes, but also financial records, and all kinds of 2D and 3D objects that might become characterized with the metadata that belong to this category
    • General Manuscript
    • Book Manuscript
    • Article Manuscript
    • Lecture Manuscript
    • Sketch
    • Note
    • Diary
    • Chronicle
    • Calendar
    • Financial Record
    • Accounting
    • Map
    • Internet Object
    • Other 2D Object
    • 3D Object
  • Minute (During Event) (Minute)
    Similar to (M) but includes event; date and location are either directly entered or covered by the event (the user is responsible for consistency)
    • Minute
    • Lecture Notes
    • Discussion Protocol
  • Photo Series (During Event) (PhotoS)
    Similar to (Minute) but, as in (Photo), no language, typeface, document format, and scope
    • Photo Series (the specific type of the series might be specified by the type of the event that it is documenting)
  • Memo / Speech (Before or after Event) (M/S)
    Similar to (Minute) but the Event is to be distinguished here from the date and location of the production of the speech or memo
    • Memo (after event)
    • Speech / lecture (before event)
  • Letter / Issued document (L)
    Similar to (M) but includes receiver and place of posting
    Letters usually might not have a title (which is optional), but the category also includes all varieties of issued documents (bills, tickets, certificates etc.) that mostly might have a title
    • Letter
    • Post card
    • Picture post card
    • Telegram
    • Email
    • Bill
    • Ticket
    • Prescription
    • Confirmation of Payment
    • General Certificate
    • Personal Document
    • School Certificate
    • Testament
  • Photograph (Photo)
    Similar to (PhotoS) but instead of an event it covers date and location
    • Analog Photograph
    • Diapositive
    • Digital Photograph
  • File Card (FC)
    Similar to (M) but is bound to a card file
    • General File Card
    • Addresses / Biographical Notes
    • Bibliographical Notes
    • Private Matters
    • Business / Financial Matters
  • Book / other printed matter (B)
    Similar to (M) but instead of a location it covers a range of bibliographical data (20-31)
    This category also covers all kinds of printed material that does not unequivocally belong to a periodical of some kind
    • Book
    • Edited book
    • Handbook
    • Web page
    • General printed matter
    • Newspaper clipping (as long as it cannot be identified as an Article)
    • Bulk mail
    • Promotion brochure
    • Letter head
    • Envelope
    • Calling card
  • Article (A)
    Similar to (B) but covers the bibliographical data (32-36)
    • Journal article
    • Handbook article
    • Newspaper article
    • Web article
  • Proceedings (Book) (PrB)
    Similar to (B) but also covers an event (e.g., the conference whose contributions are published in the proceedings)
    • Conference
    • Other Event
  • Proceedings (Article) (PrA)
    Similar to (A) but also covers an event (e.g., the conference whose contributions are published in the proceedings)
    • Conference
    • Other event
  • Audio (Audio)
    Similar to (M) but also covers an event, no typeface
    Will typically but not necessarily be instantiated by versions that contain audio files (mp3)
    • Interview
    • Lecture
    • Conference
    • Discussion circle meeting
    • Radio broadcast
    • Podcast
    • Music
    • Other
  • Video (Video)
    Similar to (M) but also covers an event, no typeface
    Will typically but not necessarily be instantiated by versions that contain video files (mp4)
    • Interview
    • Lecture
    • Conference
    • Discussion circle meeting
    • TV broadcast
    • Podcast
    • Documentary
    • Other movie
    • Other

All metadata (versions)

Versions connect files of the archive tree with documents. A version is a container that consists of a non-empty sequence of files that belong to a folder. Sequences of files must not have gaps. If A1 ... An is the alphabetically ordered sequence of all files of a folder, then a version must always be characterized by a sequence Ai ... Aj with 1 ≤ i ≤ j ≤ n. There are six possible types of versions in VALEP:

  • Original
  • Copy
  • Written Duplicate
  • Transcription
  • Translation
  • Commentary
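
The rule that a version must be a gap-free sequence Ai ... Aj of the alphabetically ordered files of a folder can be sketched in Python as follows. The function name is hypothetical, and two assumptions are made: "alphabetically ordered" is approximated by Python's default string sorting, which may differ from the collation VALEP uses, and file titles are unique within the folder.

```python
def is_valid_version(folder_files: list[str], version_files: list[str]) -> bool:
    """Check that version_files is a non-empty run Ai ... Aj without gaps."""
    if not version_files:
        return False  # a version must contain at least one file
    ordered = sorted(folder_files)
    try:
        start = ordered.index(version_files[0])
    except ValueError:
        return False  # the first file does not belong to the folder
    # The files following the start position must match the version exactly.
    return ordered[start:start + len(version_files)] == list(version_files)
```

For a folder with the files a.pdf, b.pdf, c.pdf and d.pdf, the sequence b.pdf, c.pdf is a valid version under this sketch, whereas a.pdf, c.pdf is not, because it skips b.pdf.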

A document might contain several versions of any type. Most versions are characterized by the following metadata:

  1. Version Type Enum as specified above, NN
  2. Copying process Copying Process (available only for versions of type Copy)
  3. Signature Unicode (300) (Note that signatures of a document represent their location in an archive and therefore cannot be associated here with the general document but only with the version; different versions of the same document might, of course, have different signatures)
  4. Specific Comments on this version Unicode (30,000)
  5. Version URL Unicode (300)

The following metadata is available only for versions of type Written duplicate, Transcription, Translation, and Commentary, and for all versions of a document of the category Photograph or Photo series

  1. Document format (version) Document format
  2. Typeface (version) Typeface
  3. Author / Developer (version) Authority
  4. Date (version) Date
  5. Location (version) Location
  6. Language (version) Language
  7. Scope (version) Simple (30)