Machine Learning and Data Mining for Digital Scholarly Editions

University of Rostock, 9-10 June 2022

In several areas of the Digital Humanities, Data Mining and Machine Learning techniques are increasingly applied and discussed, for example to process and extract information from digital images of humanistic sources, or to analyze full texts relevant to the Humanities that have been extracted from images or are born digital. The goal of both Data Mining and Machine Learning is to develop and apply effective and intelligent methods to detect, extract and structure information from large amounts of data that could not reasonably be processed with other methods (Alpaydin 2014, Han et al. 2012).

As a subfield of Digital Humanities, Digital Scholarly Editing is no exception to this trend. Data Mining and Machine Learning methods have been used for several tasks in the Digital Scholarly Editing workflow, for example to prepare the transcription and scholarly description of texts by recognizing and classifying text from image data (Boenig et al. 2016, Kestemont et al. 2017, Reul et al. 2019, Brusuelas 2021), to automatically compare text witnesses and reconstruct their history (Nassourou 2013, Hoenen 2018), or to enrich edited texts with information about mentioned entities, topics, or other content- and discourse-related phenomena (Koncar et al. 2020, Haeder 2020).

However, these methods are not yet as widely used in Digital Scholarly Editing as in other Digital Humanities subfields such as Computational Linguistics or Computational Literary Studies, nor are they discussed as intensely and fundamentally. There may be several reasons for this. Digital Scholarly Editions, which can be defined as “the critical representation of historic documents” in “editions that are guided by a digital paradigm in their theory, method and practice” (Sahle 2016), have particularly high demands regarding the accuracy of transcription and annotation. At the same time, the focus is often on historic and handwritten texts, which makes computational treatment more challenging. Also, the amount and extent of edited materials are often smaller than those of other text or image collections that are analyzed.

The main purpose of this conference is to foster the discussion on Machine Learning and Data Mining techniques in the area of Digital Scholarly Editing. The following questions can be addressed:

  • Where can Machine Learning and Data Mining be usefully and meaningfully applied in a Digital Scholarly Editing workflow?
  • How are Machine Learning and Data Mining already used for the creation of Digital Scholarly Editions and what are potential use cases for the future?
  • What are challenges in Digital Scholarly Editing that can be successfully addressed by using Machine Learning and Data Mining?
  • Do editions pose special challenges to the application of Machine Learning and Data Mining that need to be overcome?
  • What are biases or side effects when applying Machine Learning and Data Mining methods to historical data/texts?
  • How does the use of Machine Learning and Data Mining change the way editors work and the way editions are created? Does it change the role of the editor? How does it change the methods of editing?
  • How does Digital Scholarly Editing relate to other Digital Humanities subfields regarding the application of Machine Learning and Data Mining?
  • How can a critical engagement with Machine Learning and Data Mining techniques in Digital Scholarly Editing be developed and encouraged?

We are interested in a wide range of topics where Machine Learning and Data Mining can be used in the Digital Scholarly Editing workflow, for example pattern recognition in image analysis, OCR, NLP (tokenization, lemmatization, part-of-speech tagging, NER), topic modeling, sentiment analysis, clustering and classification tasks which prepare transcription, interpretation, text constitution, annotation, and commentary. We encourage proposals that go beyond the presentation of specific research projects towards more general reflections about Machine Learning and Data Mining for Digital Scholarly Editions.

Proposals

Papers of 4,000 to 6,000 words (not counting the bibliographic references) should be submitted to ml-dse@i-d-e.de as .odt or .docx by 1 March 2022 (previously 10 February). We only accept papers in English. Please see the submission guidelines. The proposals will be carefully reviewed by the scientific committee, and authors will be notified about acceptance in April 2022.

The conference will be held on 9 and 10 June 2022. After the conference, papers can be revised. The final version should be submitted for publication by 31 August 2022. The proceedings of the conference will be published Open Access in the IDE’s book series SIDE.

Authors whose proposals are based on research data are encouraged to also publish the data sets, for example on GitHub. A Zenodo community will be created to bundle and archive the data sets related to the conference.

Important dates

1 March 2022 (previously 10 February): Submission of papers
19 April 2022 (previously 10 April): Notification of acceptance/rejection
9-10 June 2022: Conference
31 August 2022: Submission of final full papers
early 2023: Publication of proceedings

Venue

The conference will be held at the University of Rostock in northern Germany. It is planned as an on-site event until further notice, depending on how the pandemic develops. It will also be possible to present virtually.

Keynote

Prof. Dr. Roger Labahn (University of Rostock)

Organizers

Dr. Bernhard Geiger (Know-Center Graz)
Jun.-Prof. Ulrike Henny-Krahmer (University of Rostock)
Fabian Kaßner (University of Rostock)
Marc Lemke (University of Rostock)
Gerlinde Schneider (University of Graz)
Dr. Martina Scholger (University of Graz)

Scientific Committee

Dr. Helena Bermúdez-Sabel (University of Neuchâtel)
Hannah Busch (Royal Netherlands Academy of Arts & Sciences (KNAW))
PD Dr. Katrin Dennerlein (University of Würzburg)
Dr. Bernhard Geiger (Know-Center Graz)
Prof. Dr. Denis Helic (Graz University of Technology)
Jun.-Prof. Ulrike Henny-Krahmer (University of Rostock)
Prof. Dr. Tobias Hodel (University of Bern)
Fabian Kaßner (University of Rostock)
Ass. Prof. Dr. Roman Kern (Know-Center Graz)
Prof. Dr. Mike Kestemont (University of Antwerp)
Marc Lemke (University of Rostock)
Prof. Dr. Fotis Jannidis (University of Würzburg)
Prof. Dr. Manuel Portela (University of Coimbra)
Prof. Dr. Patrick Sahle (University of Wuppertal)
Gerlinde Schneider (University of Graz)
Dr. Martina Scholger (University of Graz)
Prof. Dr. Georg Vogeler (University of Graz)

The conference is co-organized by the Institut für Dokumentologie und Editorik, DH Rostock, the Know-Center and the Centre for Information Modelling at the University of Graz. It is funded by the University of Rostock and supported by the NEISS project.

References

Alpaydin, Ethem (2014): Introduction to machine learning. 3rd ed. Cambridge, Mass.: MIT Press.

Boenig, Matthias, Kay-Michael Würzner, Arne Binder, and Uwe Springmann (2016): “Über den Mehrwert der Vernetzung von OCR-Verfahren zur Erfassung von Texten des 17. Jahrhunderts.” In: DHd2016. Konferenzabstracts. https://www.dhd2016.de/abstracts/vortr%C3%A4ge-032.html

Brusuelas, James H. (2021): “Scholarly Editing and AI: Machine Predicted Text and Herculaneum Papyri.” magazén 2 (1). http://doi.org/10.30687/mag/2724-3923/2021/03/002

Haeder, Tamara (2020): Zurück in die Zukunft: Named-Entity-Recognition für digitale (historisch-kritische) Editionen. Master’s thesis. Hochschule Darmstadt.

Han, Jiawei, Micheline Kamber, and Jian Pei (2012): Data Mining: Concepts and Techniques. 3rd ed. Amsterdam [et al.]: Elsevier.

Hoenen, Armin (2018): “From Manuscripts to Archetypes through Iterative Clustering.” In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018): 712-718. https://www.aclweb.org/anthology/L18-1114.pdf

Kestemont, Mike, Vincent Christlein, and Dominique Stutzmann (2017): “Artificial Paleography: Computational Approaches to Identifying Script Types in Medieval Manuscripts.” Speculum 92 (S1): 86-109. https://doi.org/10.1086/694112

Koncar, Philipp, Alexandra Fuchs, Elisabeth Hobisch, Bernhard C. Geiger, Martina Scholger, and Denis Helic (2020): “Text Sentiment in the Age of Enlightenment: an analysis of spectator periodicals.” Applied Network Science 5, Article Number: 33. https://doi.org/10.1007/s41109-020-00269-z

Nassourou, Mohamadou (2013): Computer-Supported Textual Criticism. Theory, Automatic Reconstruction of an Archetype. Norderstedt: Books on Demand.

Reul, Christian, Uwe Springmann, Christoph Wick, and Frank Puppe (2019): “State of the Art Optical Character Recognition of 19th Century Fraktur Scripts using Open Source Engines.” In: DHd2019. Konferenzabstracts. https://doi.org/10.5281/zenodo.4622016

Sahle, Patrick (2016): “What is a Scholarly Digital Edition?” In: Digital Scholarly Editing: Theories and Practices. Edited by Matthew James Driscoll and Elena Pierazzo. Cambridge: Open Book Publishers, 19-39. http://dx.doi.org/10.11647/OBP.0095.02
