The Arapesh Grammar and Digital Language Archive

About the Archive

Standards & Infrastructure

The Arapesh Grammar and Digital Language Archive (AGDLA) is based entirely on open standards and technologies. AGDLA integrates four types of archival and scholarly content: (1) recordings and transcriptions of linguistic events (these constitute the text collection); (2) field notebooks that contain additional data, scholarly reflection, and analytical notes on various aspects of the language; (3) a lexicon that provides glossing and morphological analysis of Arapesh vocabulary items; and (4) a description and control database (the AGDLA catalog) that is used to document and administer the project contents.

Linguistic Events: Recordings, Sound Files, and Transcriptions

The original field recordings were made on magnetic cassette tapes. These analog tapes were digitized using ProTools and captured in the WAVE format at a 48 kHz sampling rate and a bit depth of 24. After digitization, the recordings were segmented into individual linguistic events and, when necessary, further segmented into smaller units to keep them within the processing limits of ELAN (http://www.lat-mpi.eu/tools/elan/). ELAN is an open source software tool, developed at the Max Planck Institute for Psycholinguistics (Nijmegen, Netherlands), that allows researchers to create complex synchronized annotations of audio or video streams, with the annotations coded in Extensible Markup Language (XML). The AGDLA project uses ELAN to create transcriptions of linguistic events and to tokenize (approximately window-sized) utterances into word-sized units. Each recorded linguistic event is thus represented by one or more WAVE files and corresponding ELAN/XML transcriptions.

Using Extensible Stylesheet Language Transformations (XSLT), attested forms (i.e., tokens) are extracted from each transcription. The extracted tokens are sorted and duplicate strings are eliminated. For each distinct token in an individual transcription, a complete address is retained: the ELAN file name (and thus, indirectly, the linguistic event) and the identifiers of all utterances within which the token occurs. The lists of extracted tokens constitute the basis for the lexicon, a grammatical database that incorporates all attested realizations of a lexeme in the text collection and hence also serves as a concordance. The XML markup used to represent the token lists matches the markup used for the lexicon in both tag names and structures. Because addressing data is retained in the token lists, the markup enables us to represent inflected forms within lexicon entries that reference each token occurrence in each utterance of each transcription.
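The extraction step can be sketched in a few lines. The project performs it with XSLT; the Python below is only an illustrative stand-in, not the project's stylesheet. It assumes that each utterance is an ALIGNABLE_ANNOTATION in the ELAN file and, as a simplification, that tokens can be recovered by splitting the utterance text on whitespace. The function name, the sample document, and its placeholder tokens are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical miniature ELAN document. The tier name, annotation IDs, and
# tokens are placeholders, not real AGDLA data.
SAMPLE_EAF = """\
<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="utterance" LINGUISTIC_TYPE_REF="default-lt">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>beta alpha beta</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>gamma beta</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""


def extract_tokens(eaf_xml, file_name):
    """Return sorted, de-duplicated tokens, each with its full address.

    An address is the ELAN file name plus the IDs of every utterance in
    which the token occurs.
    """
    root = ET.fromstring(eaf_xml)
    utterances_by_token = defaultdict(set)
    for annotation in root.iter("ALIGNABLE_ANNOTATION"):
        utterance_id = annotation.get("ANNOTATION_ID")
        text = annotation.findtext("ANNOTATION_VALUE", default="")
        # Assumption: whitespace splitting stands in for the word-level
        # tokenization that the project carries out in ELAN.
        for token in text.split():
            utterances_by_token[token].add(utterance_id)
    return [
        {
            "token": token,
            "file": file_name,
            "utterances": sorted(utterances_by_token[token]),
        }
        for token in sorted(utterances_by_token)
    ]
```

Calling `extract_tokens(SAMPLE_EAF, "event01.eaf")` yields one record per distinct token; a token that occurs in two utterances carries both utterance identifiers, which is what lets the list double as a concordance.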

Lexicon

The lexicon is based on an XML schema developed using Relax NG (ISO/IEC 19757-2; see http://www.relaxng.org/). The lexicon schema represents the major lexical categories of Arapesh in a language-specific way. There are four major categories in the lexicon: noun, predicate (branching into 'verb' and 'other predicate'), proform, and a fourth category that subsumes a series of minor category types. Each category is further differentiated according to its defining formal characteristics. Within each entry, inflected forms attested in the transcriptions are recorded as described above, by including the address of each form in the transcription and the corresponding WAVE file. In addition, the addresses of corresponding notebook entries are recorded when relevant to the entry's interpretation. Each lexicon entry thus references not only the defining formal and semantic characteristics of a lexical element, but also the locations of its tokens attested in the transcriptions and of related notebook entries. This addressing links each entry to one or more tokens in context and to related notebook entries. When the structure is complete, users will be able to follow the inflected forms in each lexicon entry to any linked recording, transcription, or notebook entry.
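As a rough illustration of the addressing described above, the following Python sketch models a lexicon entry that carries its attested forms and their addresses. It is not the Relax NG schema itself: the class names, field names, and the short category labels are assumptions made for the example, and the entry contents are placeholders. It requires Python 3.9 or later.

```python
from dataclasses import dataclass, field

# Hypothetical short labels for the four major categories named in the text.
CATEGORIES = {"noun", "predicate", "proform", "minor"}


@dataclass(frozen=True)
class TokenAddress:
    """Where a token occurs: an ELAN file and an utterance within it."""
    elan_file: str
    utterance_id: str


@dataclass(frozen=True)
class NotebookAddress:
    """A notebook page relevant to interpreting an entry."""
    notebook_id: str
    page: int


@dataclass
class LexiconEntry:
    headword: str
    category: str
    gloss: str
    # Maps each attested inflected form to the addresses of its occurrences.
    forms: dict[str, list[TokenAddress]] = field(default_factory=dict)
    notebook_refs: list[NotebookAddress] = field(default_factory=list)

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")

    def add_occurrence(self, form, address):
        """Record one attested occurrence, ignoring exact duplicates."""
        occurrences = self.forms.setdefault(form, [])
        if address not in occurrences:
            occurrences.append(address)

    def concordance(self):
        """List every (form, file, utterance) triple, sorted."""
        return sorted(
            (form, address.elan_file, address.utterance_id)
            for form, addresses in self.forms.items()
            for address in addresses
        )
```

Keeping addresses as small immutable values means the same address can be compared, de-duplicated, and later resolved through the catalog without the entry needing to know where the files live.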

Notebooks, Fortune's 1942 Arapesh Grammar, Robert J. Conrad Field Materials

Fourteen of Dobrin's field notebooks are represented in the digital archive. Each page on which information was recorded was scanned on an Epson Expression 1640XL at 600 dpi in 24-bit color and archived in the TIFF format. Working reference (or thumbnail) and full size copies were derived in JPEG. Full size copies are 150 dpi. Each notebook is represented using the Metadata Encoding and Transmission Standard (METS) and includes descriptive, administrative, file inventory, and structural metadata. In addition to representing an archival package of the data, the METS document enables specific addressing of each notebook down to the page level and thus supports linking individual notebook pages to lexicon entries. The METS documents and corresponding page images are made accessible through a "page turner" based on Cocoon (2.1; see the Apache Cocoon project: http://cocoon.apache.org), using Saxon (the XSLT and XQuery processor: http://saxon.sourceforge.net) and XSLT 2.0.
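Page-level addressing through METS can be illustrated with a short Python sketch. The element and attribute names (fileSec, file, FLocat, structMap, div, fptr) come from the METS schema, but the sample document, its identifiers, the example.org URLs, and the function name are hypothetical. The archive's own page turner uses Cocoon and XSLT 2.0 rather than Python.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Hypothetical two-page notebook; IDs and URLs are placeholders.
SAMPLE_METS = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink" OBJID="notebook01">
  <mets:fileSec>
    <mets:fileGrp USE="reference">
      <mets:file ID="f001" MIMETYPE="image/jpeg">
        <mets:FLocat LOCTYPE="URL" xlink:href="https://example.org/notebook01/p001.jpg"/>
      </mets:file>
      <mets:file ID="f002" MIMETYPE="image/jpeg">
        <mets:FLocat LOCTYPE="URL" xlink:href="https://example.org/notebook01/p002.jpg"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="physical">
    <mets:div TYPE="notebook">
      <mets:div TYPE="page" ORDER="1"><mets:fptr FILEID="f001"/></mets:div>
      <mets:div TYPE="page" ORDER="2"><mets:fptr FILEID="f002"/></mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""


def page_urls(mets_xml):
    """Map each page's ORDER in the structural map to its image URL."""
    root = ET.fromstring(mets_xml)
    href = "{%s}href" % NS["xlink"]

    # File inventory: file ID -> location.
    locations = {}
    for file_el in root.iterfind(".//mets:file", NS):
        flocat = file_el.find("mets:FLocat", NS)
        if flocat is not None:
            locations[file_el.get("ID")] = flocat.get(href)

    # Structural map: page order -> file pointer -> location.
    pages = {}
    for div in root.iterfind(".//mets:div[@TYPE='page']", NS):
        fptr = div.find("mets:fptr", NS)
        if fptr is not None and fptr.get("FILEID") in locations:
            pages[int(div.get("ORDER"))] = locations[fptr.get("FILEID")]
    return pages
```

With a mapping like this, a lexicon address consisting of a notebook identifier and a page number can be turned into the URL of the page image to display.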

Reo Fortune's 1942 grammar of Mountain Arapesh was scanned by UVA's Library Digital Services at 600 dpi in 24-bit color. Working reference (or thumbnail) and full size copies were derived in JPEG. Full size copies are 150 dpi. The grammar is represented using the Metadata Encoding and Transmission Standard (METS) and includes descriptive, administrative, file inventory, and structural metadata.

Robert J. Conrad's printed Bukiyip and Southern Arapesh materials were scanned by the University of Virginia Library's Rare Materials Digital Services in 24-bit color at 600 or 400 dpi, depending on the size of the originals.

Description and Control Database (Catalog)

The description and control database is used to provide intellectual description of individual linguistic events and related representations of them: tape recordings, WAVE files, and transcriptions. The field notebooks are also described. In addition to providing intellectual access to the resources, the database serves to manage the digital representations and to provide a nexus by which addresses in the lexicon can be resolved to individual recordings, transcriptions, and notebook pages.

The database has five major types of records: recorded events, participants, WAVE/ELAN (transcription) files, tapes, and notebooks. For each recorded event, a title, type, date, location, participants, duration, primary speech variety, and brief description are provided. Each participant has a name entry and, when necessary, other designations used for the individual. In addition, the approximate age, sex, village, and, when possible, a brief description are provided. For each WAVE/ELAN pair (the two files correspond one-to-one), URLs are provided for the WAVE file and the ELAN XML transcription. Each WAVE/ELAN pair is linked to the corresponding notebook and page sequence. Finally, for WAVE/ELAN entries representing a part of a recorded event, a sequence number is recorded to keep the segments of multipartite recorded events in order. For each tape, an identifier is provided along with a link to a page image of the cassette label. Each notebook has a unique identifier and a URL for the corresponding METS file. The WAVE/ELAN files and tape cover scans are linked to the appropriate recorded event entries.

The database is in PostgreSQL, an open source and widely used and supported SQL platform. PHP is used to provide the maintenance interface.
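The role of the sequence number can be shown with a small schema sketch. To keep the example self-contained and runnable, it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL. The table names, column names, and functions are hypothetical and do not reproduce the actual catalog schema; the participant and tape tables are omitted for brevity.

```python
import sqlite3

# Hypothetical, simplified schema covering three of the five record types.
SCHEMA = """
CREATE TABLE recorded_event (
    id             INTEGER PRIMARY KEY,
    title          TEXT NOT NULL,
    event_type     TEXT,
    event_date     TEXT,
    location       TEXT,
    speech_variety TEXT,
    description    TEXT
);
CREATE TABLE notebook (
    id       TEXT PRIMARY KEY,
    mets_url TEXT NOT NULL
);
CREATE TABLE wave_elan (
    id            INTEGER PRIMARY KEY,
    event_id      INTEGER NOT NULL REFERENCES recorded_event(id),
    sequence_no   INTEGER NOT NULL DEFAULT 1,
    wave_url      TEXT NOT NULL,
    elan_url      TEXT NOT NULL,
    notebook_id   TEXT REFERENCES notebook(id),
    page_sequence TEXT,
    UNIQUE (event_id, sequence_no)
);
"""


def open_catalog():
    """Create an in-memory catalog with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def segments_for_event(conn, event_id):
    """Return (sequence_no, wave_url, elan_url) rows in playback order."""
    cursor = conn.execute(
        "SELECT sequence_no, wave_url, elan_url "
        "FROM wave_elan WHERE event_id = ? ORDER BY sequence_no",
        (event_id,),
    )
    return cursor.fetchall()
```

Ordering by the sequence number returns the segments of a multipartite event in their original order regardless of the order in which the rows were entered.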

Website

The W3C standards HTML 4.01 Strict and CSS are used for all Web content.

Creation and Maintenance

All digital resources are mounted on an Apple Mac G4 server with a dual-core processor and approximately 1.5 terabytes of disk space. The operating system is Mac OS X 10.4. All files are backed up nightly to a Western Digital 1.5-terabyte external drive.