Marco Passarotti

Digital Classical Philology

The Project of the Index Thomisticus Treebank

De Gruyter Saur | 2019
Marco Passarotti, Università Cattolica del Sacro Cuore, Milano

Note: The author gratefully acknowledges the support of the project LiLa (Linking Latin. Building a Knowledge Base of Linguistic Resources for Latin). This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme, grant agreement No 769994.

Open Access. © 2019 Marco Passarotti, published by De Gruyter. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Abstract: The paper introduces the project of the Index Thomisticus Treebank (IT-TB). The IT-TB is a dependency-based treebank built on the corpus of the Index Thomisticus (IT) by Father Roberto Busa, which includes the opera omnia of Thomas Aquinas, for a total of approximately 11 million words. Currently, the IT-TB is the largest Latin treebank available, with more than 350,000 nodes in around 17,000 sentences. The annotation covers the whole of Books 1, 2 and 3 of Summa contra Gentiles, plus excerpts from Scriptum super Sententiis Magistri Petri Lombardi and Summa Theologiae. The paper details the multi-layer annotation style of the IT-TB and its background theoretical motivations. The conversion process to the now widely used Universal Dependencies style is described as well. Across more than a decade, the project has developed a number of linguistic resources and NLP tools for Latin connected to the IT-TB. As for the resources, the paper presents the syntax-based subcategorization lexicon IT-VaLex and the valency lexicon Latin Vallex. As for the tools, the automatic dependency parsing process is described, highlighting the core issue of portability of NLP tools across the wide diachronic and diatopic span of Latin texts. A section is dedicated to automatic morphological analysis of Latin, introducing the analyzer Lemlat and its recent enhancement with information on derivational morphology and a new set of lexical entries covering a large Onomasticon (from the Forcellini dictionary) and Medieval Latin (from the Du Cange glossary).

1 Introduction

The name of the Italian Jesuit Roberto Busa is quoted in almost every introduction to Computational Linguistics or Digital Humanities. His often recounted