Ulrike Henny and Frederike Neuber, in collaboration with the members of the IDE; Version 1.0, February 2017.[1]

Preliminary remarks

Scope

Purpose. This paper provides a framework for the description and evaluation of digital text collections for humanities research as conducted, for example, in literary studies, linguistics, and history. The guidelines aim to be applicable to various types of humanities text collections and to promote consistency when reviewing text collections. Furthermore, these guidelines seek to encourage a livelier discourse about the reliability and sustainability of digital text collections, their methodological frameworks and the transfer of methods between individual disciplines when building text collections in the Digital Humanities.

In this context, ‘digital text collection’, as a rather generic term, refers to digital resources that involve the collection, structuring and enrichment of textual data. The term thus covers a very heterogeneous range of resources that pursue various, often not clearly distinct, purposes and engage with the text on different scholarly levels. A few examples of types of text collections, their objectives and general characteristics are outlined here:

Digital text collections can be understood as inventories of particular types of textual sources that have been brought together according to certain unifying criteria such as the language of the texts, their provenance, date of origin, genre, authorship, topic and many other factors external to the text (metadata) or internal to it (properties derived from the texts themselves). Such collections do not necessarily follow an explicit research scope. In that case, they can represent central access points that provide textual data for collection-external research and also integrate new data that conforms to the general characteristics of the collection. In this respect, collections can function as or be embedded in repositories for text data. In other cases, in particular if the collection is conceived as a scholarly resource that is built to address a certain research question, its content is determined by carefully chosen selection and design criteria that are meant to be derived from, and adequate for, the research object and purpose of the collection.[2] A few examples of collection design principles are completeness (e.g. if the corpus aims to represent the work of an author as a whole), representativeness (if the corpus claims to be representative of a specific subject domain and functions as a reference for that domain) and balance (e.g. if the corpus is built to allow for contrastive analyses between its components such as different text genres or regional language varieties). In empirically oriented research areas, a text collection can serve as a data basis for testing hypotheses.

Engagement with text. Text collections engage with text in many and various ways. In general, text can be transcribed written text or speech, whereby the former can be associated with a document or another material text carrier. The focus of these guidelines is on computationally processable digital text, while collections of images without transcripts are not considered.[3] Depending on the disciplinary perspective and theoretical approach, text, as a coherent stretch of written information, can be perceived in various ways: e.g. as content and meaning, as linguistic code, or as a set of glyphs and graphs.[4] The chosen perspective determines which characteristics of the text are relevant for the collection and how the text is represented. A collection may, for example, engage with the content of the texts (e.g. topics, named entities), explore the way a text communicates information (e.g. through language, layout, or writing structure), or do both.

Editorial enrichment. Usually, text collections include and are structured according to a set of metadata. They can furthermore include manually or automatically applied annotations of the textual data. Yet the depth of the textual enrichment can vary considerably from collection to collection. Generally, the more a collection aims to meet scholarly standards, the more elaborate and specific the metadata and annotations tend to be.

Constitution. Collections can range in size from a few to over a million texts, and in kind from special purpose collections to general compilations. Larger collections, in particular, tend to build on existing datasets, while others are newly created. Text collections are built on basic (full) texts or transcripts that can be recorded manually and/or (semi-)automatically. Text collections can be created by single editors, by an editorial team or by the community on crowdsourcing platforms. Some collections can be considered complete or finalized, while others are conceived as dynamic and growing.

Access and publication. Regarding the provision of a text collection, it may be presented as a simple bundle of text files for download but may also include a user interface for browsing, searching and viewing texts. If a user interface is part of a text collection, it can include specialized analysis and/or visualization tools, e.g. for concordances and collocations, for the occurrence of topics or for the exploration of entity-relations and networks.

Documentation and quality. If a text collection claims to be a scholarly resource, it is of utmost importance that the approaches, decisions and factors that guided its creation and constitution (i.e. in the first instance the criteria for text selection, the provenance of the texts and their treatment) are documented and explained. Likewise, the importance of quality checks of the texts, metadata and annotations increases with the scholarly claim of the text collection.

Traditions, concepts and terminology. In the (digital) humanities, there is a range of different terms that refer to text collections as outlined above, depending on the kind of collection and on the disciplinary background: for example ‘collection’, ‘corpus’ and ‘archive’. In addition, each term may have a number of different meanings. Their usage can rely on commonly established terminologies of the individual disciplines. For instance, in historical sciences, a ‘corpus’ or a ‘collection’ (e.g. in German: ‘Quellenkorpus’, ‘Quellensammlung’) describes a set of (written) testimonies, which are not necessarily represented in the corpus as full texts but often as excerpts or regesta.[5] In other disciplines, e.g. philosophy and theology, a ‘corpus’ can represent or function as a canon, in that the corpus contains a textual transmission that the editors define as complete and/or authoritative.[6] In linguistics, particularly in the dedicated sub-discipline ‘corpus linguistics’ with its own inventory of concepts and meanings, ‘corpus’ has a narrower definition. Rather than being solely a term to name a set of sources, ‘corpus’, in linguistics, refers to a whole methodological framework that allows studying languages and their varieties. Besides the varying usage of the terms in individual disciplines, in everyday practice the terms are often used with rather fuzzy meanings. Often, the self-established terminologies of individual collections, whose meanings differ from project to project (e.g. if a resource calls itself a ‘text archive’, ‘digital library’, or ‘repository’), complicate a mutual understanding of the terms used to identify different types of text collections.

Typology of text collections. These guidelines do not establish a further categorization of ‘text collection’, because there is as yet no generally known and accepted typology that would be valid for resources from different disciplines. Instead, these guidelines operate with the term ‘collection’ since it is broad enough to be applicable to all sorts of assemblages of digital text resources and is less tied to a particular discipline than, for instance, the term ‘corpus’. However, one type of digital resource that is also dedicated to text but is not specifically considered in these guidelines is the scholarly digital edition. Criteria that consider the methodological traditions of textual scholarship, including canonized editorial tasks (e.g. record of transmission history, creation of critical apparatuses and stemmata, provision of a commentary), have already been developed and can be consulted in a dedicated catalogue of criteria for digital scholarly editions.

Application

Methodological perspectives and directions. The evaluation of text collections is thus based on established methods from the respective disciplines as well as on interdisciplinary methods from the Digital Humanities in general that are beginning to take shape. Reviewing text collections will help to disseminate and canonize approved disciplinary methods and approaches. Against the background of the Digital Humanities’ interdisciplinary character, evaluating text collections will furthermore make it possible to observe whether methods are shared and best practices adopted across different disciplines.[7] Finally, in the long term, evaluating text collections can lead to a more differentiated understanding of what types of ‘text collections’ exist and in what way they differ from each other. Hence, guidelines such as these should ideally be updated continuously to reflect the results of ongoing methodological discussions.

General and specific criteria. On the one hand, there are general criteria that are applicable to any kind of text collection. These criteria should all be addressed in the review. On the other hand, there are specific and detail-oriented criteria, which can only be applied to certain kinds of text collections. The different levels of generality of individual criteria should be kept in mind when reviewing a text collection.

Parameters. Text collections should be reviewed considering their individual circumstances: goals set by the creators, financial resources and staff, the run-time of a project and previously existing resources that a collection might rely on. It should be appraised positively when a text collection is very extensive or complete, e.g. rich in content, in critical information and functionalities. But it must also be recognized when only narrower, self-defined aims are met. Where self-defined ambitious goals are not fulfilled, the temporality of the presentation and its discussion should be taken into account. Publication and review are often valid only for a moment in an ongoing, open process of (scholarly) engagement. It is therefore not necessary that a text collection has reached a final state and is concluded at the time it is reviewed. Still, the reviewer should examine whether the text collection has, at the point of its review, reached sufficient maturity and consistency to be a worthwhile object of review and to what extent it can be considered completed. The review of a digital collection should also serve as a professional comment in the development process of a text collection and as a suggestion for improvement.

1. Opening the review

1.1 Bibliographic identification of the reviewed text collection. A text collection should be identified in terms similar to traditional bibliographic descriptions: a title, the responsible editors, other responsible persons and institutions, the dates of its publication (initial, versions, last modification), and the address (a web URL or another identifier such as a DOI, URN or PURL) should all be evident.

1.2 General introduction. The subject of the text collection should be described briefly. What is the academic (disciplinary or interdisciplinary) or non-academic scope and context of the text collection? How does it relate to other printed or digital resources, to its predecessors or to similar projects? What desiderata and possible research questions does it address?

1.3 General parameters. Who are the creators, the participating institutions and staff? What were their responsibilities? Are there content-related connections to other projects? What financial, personnel and time resources were available for the project?

1.4 Transparency. Are the general parameters easily accessible? Does the text collection provide an imprint, institutional or personal contact information?

2. Aims and content

2.1 Documentation. Is there a description of the aims and contents of the text collection? If not, are these points self-evident from the content and presentation of the text collection?

2.2 Purpose. What is the purpose of the text collection? Is it a general purpose collection or does it support specific research interests? If the latter, to which field(s) of research does it contribute?

2.3 Mission. What does the text collection want to accomplish? What does it promise explicitly? What does it merely suggest by self-classification (e.g. ‘collection’, ‘corpus’, ‘digital archive’, ‘digital library’, ‘portal’, etc.)?

2.4 Audience. What is the text collection’s target audience? Is it, for example, meant to be used in research, in teaching or intended for the interest of the general public?

2.5 Content. What is the subject of the text collection? What content is published? Characterise the information presented (e.g. introduction or commentary, metadata, sampled or full text transcriptions, summaries, translations, markup, annotations, analytic output, context material, bibliography). Is relevant content missing?

3. Methods

3.1 Design, selection and composition

3.1.1 Design. Has the text collection been created according to a deliberate and methodical design? Does its design reflect its purpose? Which principles guide the design of the text collection? Does it, for example, aim at completeness, representativeness, balance, or exemplarity?

3.1.2 Selection criteria. Are the principles of selection explained? Are definitions given for the type of texts that the text collection is composed of? Do the selection criteria correspond to the purpose of the text collection and to its design principles? What kind of criteria have been chosen for the text collection (e.g. external criteria such as author, country, epoch vs. internal criteria such as topic, style, linguistic characteristics)? Are the criteria easy to establish or complex? How are the different selection criteria connected to each other? Has a sampling technique been applied and if so, is it justified?

3.1.3 Text selection. How extensive and how complete is the selection or sample within the context of the text collection? What obstacles may the creators of the text collection have faced in the acquisition of materials (lost, unavailable or inaccessible texts, e.g. due to copyright restrictions) and is this commented on?

3.1.4 Size. How large is the text collection? How many texts and how much text (if accessible e.g. in number of word tokens and word types) are included? Quantify other types of content that the text collection may include. Are reasons given for the size of the text collection? Is the size in accord with the aims of the text collection? If it seeks to make statistical analyses possible, are the text collection’s size and amount of text appropriate?
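Where the underlying data is accessible, a reviewer can estimate these figures directly. The following is a minimal sketch in Python, assuming the texts are available as plain strings and that a naive, lowercased regular-expression tokenization is acceptable; the function name `size_profile` and the sample sentences are hypothetical and serve only as an illustration. Collections with their own tokenization or lemmatization rules will yield different counts.

```python
import re
from collections import Counter


def size_profile(texts):
    """Count texts, word tokens and word types in a list of plain-text strings."""
    counts = Counter()
    for text in texts:
        # Naive tokenization (an assumption): lowercased runs of word characters.
        counts.update(re.findall(r"\w+", text.lower()))
    return {
        "texts": len(texts),
        "tokens": sum(counts.values()),  # every occurrence of a word form
        "types": len(counts),            # distinct word forms
    }


# Hypothetical two-text sample, for illustration only.
sample = ["The cat sat on the mat.", "The dog sat."]
profile = size_profile(sample)
print(profile)
```

Reporting both figures matters because two collections with the same number of tokens can differ considerably in vocabulary size.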

3.1.5 Composition. What components is the text collection made up of? Can sub-collections be identified? How do the components relate to each other? Is the number and amount of text balanced for the different components or are there major and minor components? Is the composition of the text collection adequate for its purpose?
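One simple way to make the balance between components visible is to compare each component's share of the total amount of text. The sketch below assumes that token counts per component are already known; the component names and figures are invented for illustration and do not describe any real collection.

```python
def component_shares(token_counts):
    """Return each component's share of the total token count (values sum to 1)."""
    total = sum(token_counts.values())
    if total == 0:
        raise ValueError("the collection contains no tokens")
    return {name: count / total for name, count in token_counts.items()}


# Hypothetical token counts per sub-collection, for illustration only.
counts = {"novels": 600, "plays": 300, "poems": 100}
shares = component_shares(counts)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

Whether an uneven distribution is a problem depends on the design principles of the collection: a reference corpus may require balance, while a complete-works collection simply mirrors what was written.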

3.1.6 Data acquisition and integration. How does the project build up the dataset? Does it record or transcribe the textual data for the first time or have parts of the text collection been taken from other resources? If so, what kind of material has been taken over (e.g. full texts, annotations, metadata, etc.)? To what extent has data from other resources been adapted and transformed to the new context? How much effort has been put into the critical assessment of the material? How would you describe the proportion of previous to new work?

3.1.7 Quality assurance. Has the quality of the data (transcriptions, metadata, annotations, etc.) been checked? If yes, how was this done and are the results communicated? If texts are obtained by digitization or re-use from other collections, has the quality of the texts been checked and how has this been done? Is the quality sufficient for the main research interests in the material?

3.1.8 Typology. Bearing in mind the self-description, purpose, design and content of the text collection, can it be considered a certain type of text collection according to some established typology (a canon, the complete works or œuvre, a primary source collection, a reference corpus, a descriptive/normative corpus, a parallel or contrastive corpus, a monitor corpus, etc.)? Is the text collection designed to support qualitative and/or quantitative research?

3.2 Data Modelling

3.2.1 Theory and method. What is the theoretical stance behind the text collection? How does the text collection deal with the texts that it is built up of? What perspective on the text do the transcription rules convey, e.g. a linguistic, a materialistic or a semiotic perspective? Are the communicative aspect and form of the texts at the center of the study and/or does the text collection explore the textual contents? Is the chosen transcription technique adequate for the purpose of the collection?

3.2.2 Annotations. Does the text collection include annotations of the texts? If yes, what kind of annotations are provided (e.g. layers of linguistic annotations)? How are the annotations linked to the texts themselves (directly embedded or stand-off)? Have annotations been added automatically or manually? Which tools have been used to add annotations? How specific are the annotations for the project and how useful are they for re-use?
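The difference between embedded and stand-off annotation can be illustrated with a small sketch. It assumes a single sentence annotated with two named entities; the element names `persName` and `placeName` follow TEI usage, while the sentence, the offset-based stand-off format and the helper `apply_standoff` are hypothetical and not taken from any particular collection.

```python
text = "Goethe visited Rome."

# Embedded annotation: the markup is woven directly into the text.
embedded = "<persName>Goethe</persName> visited <placeName>Rome</placeName>."

# Stand-off annotation: the text stays untouched and the annotations
# refer to it by character offsets (start inclusive, end exclusive).
standoff = [
    {"start": 0, "end": 6, "label": "persName"},
    {"start": 15, "end": 19, "label": "placeName"},
]


def apply_standoff(text, annotations):
    """Render stand-off annotations as inline tags (assumes non-overlapping spans)."""
    parts, position = [], 0
    for ann in sorted(annotations, key=lambda a: a["start"]):
        parts.append(text[position:ann["start"]])
        span = text[ann["start"]:ann["end"]]
        parts.append(f"<{ann['label']}>{span}</{ann['label']}>")
        position = ann["end"]
    parts.append(text[position:])
    return "".join(parts)


print(apply_standoff(text, standoff))
```

Stand-off annotation keeps the base text reusable and allows several, possibly overlapping, annotation layers, whereas embedded markup is easier to read and to process with standard XML tools.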

3.2.3 Metadata. Does the text collection include metadata on the collection as a whole and on the individual texts? What kind of metadata are provided and how detailed are they?

3.2.4 Data modelling. How is the methodological approach technically implemented? Which data formats and models are used? Does the text collection make use of standardized practices for text encoding, metadata and annotation schemes (e.g. TEI, tagsets for part-of-speech)? If not, is the deviation from existing standards sufficiently justified and is the project specific data model documented and available through a formal schema?

3.2.5 Linked data and community standards. Is the text collection connected with other resources through Linked Open Data (LOD) techniques (e.g. authority files)? Does it use community standards (like the CLARIN CMDI, the Europeana Data Model or similar initiatives) that enable its integration into common data infrastructures?

4. Provision

4.1 Data access and export formats. Is the basic or underlying data of the text collection accessible and if so, how? Is it provided and exportable for each single object and/or for the whole text collection? Which export formats are offered and downloadable (e.g. TCF, TXT, XML)?

4.2 Technical interfaces. Are there technical interfaces, such as OAI-PMH or REST APIs, which allow the reuse of the data of the text collection in other contexts? Can you harvest or download the data?
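To check whether a collection can be harvested, a reviewer can try a basic OAI-PMH request. The sketch below only builds the request URLs and performs no network access. The base URL is a placeholder and would have to be replaced by the endpoint documented by the collection; the helper `oai_request_url` is hypothetical. `Identify` and `ListRecords` are standard OAI-PMH verbs, and `oai_dc` is the Dublin Core metadata format that every OAI-PMH repository is required to support.

```python
from urllib.parse import urlencode

# Placeholder endpoint (an assumption); use the collection's documented base URL.
BASE_URL = "https://example.org/oai"


def oai_request_url(base_url, verb, **params):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = urlencode({"verb": verb, **params})
    return f"{base_url}?{query}"


# Ask the repository to describe itself.
identify_url = oai_request_url(BASE_URL, "Identify")

# Request all records in unqualified Dublin Core.
records_url = oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")

print(identify_url)
print(records_url)
```

Opening such a URL in a browser or with an HTTP client should return an XML response if the interface is available.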

4.3 Analytical data. Does the text collection provide analytical data? If yes, what kind of analyses have been undertaken, with which methods and for what purpose?

4.4 Re-use scenarios. Can you use the data with other tools useful for this kind of content? Can you integrate the content in other systems, e.g. aggregating content from several sources? Are you aware of any project which re-used the data of the text collection in the context of its own research and/or beyond the initially intended research scope?

5. User Interface

5.1 Text and Interface. Does the text collection have a dedicated user interface designed for the collection at hand in which the texts of the collection are represented and/or in which the data is analyzable? If not, is the textual data provided via a common repository or a standard interface (e.g. GitHub) or is it offered as a bundle of files available for download? Does the text collection give instructions on how to use the data in case it is not accessible via a user interface?

If the text collection does not have a dedicated user interface, the remaining criteria of section 5 can be skipped.

5.2 Technical infrastructure. Which technologies are used for the publication of the text collection? Why are these technologies used (e.g. as a trade-off between local conditions and best practices)?

5.3 Usability. Is the interface of the text collection clearly arranged so that the user can quickly identify the purpose, the content and the main access methods of the text collection? Is the interface in line with common visual patterns? Is the user at any time made aware of what content is currently displayed, of their position in the overall architecture of the text collection, and how other content can be accessed?

5.4 Access modes. Is it possible to browse through the entirety of the content? Is there a simple and/or an advanced search interface? Does the search provide useful results when searching without specific knowledge of the content? Is the content presented in any other formats that provide an overview of the text collection and support access to the materials, such as compilations, indices or registers?

5.5 Analysis. Does the text collection integrate tools for analyses of the data on-the-fly? If so, what tools are used and have they been developed in the context of the text collection? If analyses are presented via the user interface, are they comprehensible and how useful are they for the audience of the text collection?

5.6 Visualization. Does the text collection provide particular visualizations of the data (e.g. networks, charts, treemaps, word clouds)? Are the visualizations self-explanatory and, if not, are they accompanied by explanations and/or interpretations? Are the visualizations helpful for the understanding of the texts and analyses? Do the visualizations lead to new research insights?

5.7 User empowerment. To what extent can the user alter the interface in order to affect the outcomes of representation and analysis of the text collection? Can the user change the presentation of the materials, e.g. add or remove interpretational layers? To what extent can users apply their own queries to the search mode or customize the data analysis (e.g. choose parameters) in order to follow personal research interests? Is there a personalisation mode that enables users, for example, to create their own sub-collections of the existing text collection?

6. Preservation

6.1 Documentation and associated texts. Does the text collection provide introductory or explanatory texts? Is there sufficient documentation of the project, the text collection, and its technical implementation? Are the sources and the selection of the materials described? Are the applied methods clearly explained?

6.2 Rights and licences. Does the text collection provide sufficient information on rights and restrictions for the reuse of its different parts (e.g. text data, annotations, commentary)? Does the text collection use a rights model suitable for scholarly reuse of the data? Is a specific licence model (e.g. Creative Commons) in use?

6.3 Identification and citation. Are there persistent identifiers for the text collection? Which level of the content structure do they address (e.g. the whole collection, single texts, parts of the texts, tokens)? Which resolving mechanisms and naming systems are used? Does the text collection supply citation guidelines?

6.4 Long term usage. What are the text collection’s prospects for long term use? Is the text collection complete or does it promise further modifications and additions? Is there institutional support for the curation and sustainment of the text collection? Is the basic data archived? Is there a plan to provide continuous access to the presentation?

7. Conclusion

7.1 General characteristics. According to the descriptions outlined in these guidelines, what would you consider the main characteristics of the reviewed text collection, e.g. regarding its purpose, its design principles, and its conception of ‘text’? Can you specify the type of text collection further, e.g. from a disciplinary perspective?

7.2 Realisation of aims. To what extent has the text collection successfully accomplished its aims?

7.3 Scholarly quality. Is the text collection sufficiently documented? Is it citable and transparent? How is the quality of the content (e.g. texts, annotation, metadata)? How would you describe the academic quality of the resource in general?

7.4 Achievements and scholarly contribution. What does the text collection contribute to the current state of knowledge of its topic? To what extent does the text collection contribute to current scholarship in its target field?

7.5 Methodological impact. What does the text collection contribute to best practices regarding the design and usage of text collections in general?

7.6 Particularities. Which features merit special attention for noteworthiness and/or innovation, even if they are beyond the scope of these general criteria? Think e.g. of innovative modelling and annotation techniques, features and tools for re-using the data, options for visualisations and interactive usage, etc.

7.7 Suggestions for improvement. Taking into account strengths and weaknesses of the project, what should be considered for further improvement? What would be nice and useful additions?

 

 

References

  • Cohen, Daniel J. and Joan Fragaszy Troyano (eds.). Closing the Evaluation Gap. Journal of Digital Humanities Vol. 1, No. 4, 2012. <http://journalofdigitalhumanities.org/1-4/>
  • Geyken, Alexander et al. TEI und Textkorpora: Fehlerklassifikation und Qualitätskontrolle vor, während und nach der Texterfassung im Deutschen Textarchiv. Jahrbuch für Computerphilologie, 2012. <http://computerphilologie.digital-humanities.de/jg09/geykenetal.pdf>
  • Henny, Ulrike and Christof Schöch. How good are our texts, really? Quality assurance for literary texts from various sources [Blog post]. CLiGS, February 27th, 2016. <http://cligs.hypotheses.org/371>
  • Schöch, Christof. Aufbau von Datensammlungen. Einführung in die Digital Humanities, edited by Fotis Jannidis, Malte Rehbein and Hubertus Kohle, Stuttgart, Metzler, 2017, pp. 223-233.
  • Schreibman, Susan, Laura Mandell and Stephen Olsen (eds.). Evaluating Digital Scholarship [Special section]. Profession, 2011, pp. 123-201. <http://www.mlajournals.org/toc/prof/2011/1>
  • Wynne, Martin (ed.). Developing Linguistic Corpora: a Guide to Good Practice, Oxford, Oxbow Books, 2005. <http://ota.ox.ac.uk/documents/creating/dlc/>
  • Unsworth, John. Computational Work with Very Large Text Collections. Interoperability, Sustainability, and the TEI. Journal of the Text Encoding Initiative, issue 1, 2011. <http://jtei.revues.org/215>



[1] These guidelines are a spin-off derived from the Criteria for Reviewing Scholarly Digital Editions authored by Patrick Sahle and published by the IDE. The following parts of the Criteria for Reviewing Scholarly Digital Editions have been partly or entirely integrated into these guidelines for digital text collections: Application, 1.2/3/4/5, 3.1/2/3/7, 4.1/2/8/9/12/13/15/16, 5.2/4/5/7. Other sources that have been consulted for the development of new criteria for text collections are given in the bibliography.

[2] There may be so-called ‘disposable’ text collections which are compiled ad hoc and used for experimental purposes. These review guidelines concentrate on collections that are prepared to be more mature and durable resources and are meant to be published.

[3] Collections of smaller parts or samples from texts, for example, which do not represent a textual expression in its entirety, are within the scope of these guidelines.

[4] Sahle, Patrick (2013). Digitale Editionsformen. Zum Umgang mit der Überlieferung unter den Bedingungen des Medienwandels. Teil 3: Textbegriffe und Recodierung. Schriften des Instituts für Dokumentologie und Editorik, 9. BoD, Norderstedt, p. 9ff.

[5] E.g. diplomatic documents on imperial or papal powers.

[6] E.g. the collection of the surviving and transmitted works of Aristotle in the Corpus Aristotelicum or the arrangement of sacred epistles as Corpus Paulinum as part of the biblical canon.

[7] It can already be noted that the general increase in digital, corpus-based and quantitative studies brings the various humanities disciplines closer together. One example of an attempt to make corpus-related methods from linguistics fruitful for another humanities discipline is the CLARIN-D working group for History. Cf. http://www.clarin-d.de/en/disciplines/history

 
