The lemmatized corpus of the published letters of the Datini correspondence


A computerized textual database of the published letters

The project involved the creation of a textual database of approximately 4,000 published letters preserved in the Datini archive. It began in 2003, when the work was entrusted to the Opera del Vocabolario Italiano institute of the CNR, and it was delivered and tested in October 2005.

It was implemented through:

  • identification and digitization of the published letters to be entered in the database, chosen so that they form an organic corpus;
  • OCR processing and generation of the files into which, after correction of the output, the markup for the GATTO 3.3 software was inserted (software prepared at the CNR Opera del Vocabolario Italiano institute by D. Iorio-Fili and granted free of charge);
  • the development of a customized version of the same software, with the functions required by the State Archives of Prato;
  • the creation of a database of all the texts, with the ability to search for the word forms they contain;
  • lemmatization of terms belonging to certain sectors of the vocabulary: historical-economic, commercial (including maritime trade), military, historical-legal, and technical terms;
  • the identification and indexing of personal names (anthroponyms) and place names;
  • the creation of hyperlemmas (second-level lemmas) for categories of terms of particular significance for research;
  • the creation of a web-compatible version, accessible from the Datini project website.
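
The relationship between word forms, lemmas and hyperlemmas described in the list above can be illustrated with a minimal Python sketch. It is not the project's software and does not reflect its internal data model: the two lookup tables, the sample words and the "coins" category are hypothetical placeholders chosen only for illustration, and none of them is taken from the Datini corpus.

```python
from collections import defaultdict

# Hypothetical sample entries, NOT taken from the Datini corpus.
# Each inflected form points to its lemma (first level).
FORM_TO_LEMMA = {
    "fiorini": "fiorino",
    "fiorino": "fiorino",
    "ducati": "ducato",
    "ducato": "ducato",
}

# Each lemma may point to a hyperlemma (second level); "coins" is an
# invented category used only for this example.
LEMMA_TO_HYPERLEMMA = {
    "fiorino": "coins",
    "ducato": "coins",
}


def count_by_level(tokens):
    """Count occurrences per form, per lemma and per hyperlemma."""
    forms = defaultdict(int)
    lemmas = defaultdict(int)
    hyperlemmas = defaultdict(int)
    for token in tokens:
        form = token.lower()
        forms[form] += 1
        lemma = FORM_TO_LEMMA.get(form)
        if lemma is None:
            # Forms outside the lemmatized sectors stay unlemmatized.
            continue
        lemmas[lemma] += 1
        hyper = LEMMA_TO_HYPERLEMMA.get(lemma)
        if hyper is not None:
            hyperlemmas[hyper] += 1
    return dict(forms), dict(lemmas), dict(hyperlemmas)


sample = ["fiorini", "e", "ducati", "e", "fiorino"]
forms, lemmas, hyperlemmas = count_by_level(sample)
```

In this sketch every token counts as an occurrence of its form, but only tokens found in the lookup table count as lemmatized, which mirrors the selective lemmatization by vocabulary sector described above.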

Some summary data:
A lemmatized corpus of the Datini correspondence was produced, built with the same program that manages the corpus of the “Tesoro della Lingua Italiana delle Origini (TLIO)”, in a specially dedicated version that can be queried via the web.

The corpus consists of:

  • 2,511 texts
  • 45,259 forms
  • 977,034 occurrences, of which 126,663 are lemmatized
  • 6,510 lemmas and 22 hyperlemmas.
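
As a rough aid to reading these figures, the following Python sketch derives a few ratios from them. Only the five counts come from the list above; the variable names and the choice of ratios are illustrative assumptions, not statistics published by the project.

```python
# Counts taken from the summary list above.
TEXTS = 2511
FORMS = 45259
OCCURRENCES = 977034
LEMMATIZED_OCCURRENCES = 126663
LEMMAS = 6510

# Derived ratios (illustrative choices, not project statistics).
lemmatized_share = LEMMATIZED_OCCURRENCES / OCCURRENCES
occurrences_per_text = OCCURRENCES / TEXTS
occurrences_per_form = OCCURRENCES / FORMS
lemmatized_per_lemma = LEMMATIZED_OCCURRENCES / LEMMAS

print(f"Lemmatized share of occurrences: {lemmatized_share:.2%}")
print(f"Average occurrences per text:    {occurrences_per_text:.1f}")
print(f"Average occurrences per form:    {occurrences_per_form:.1f}")
print(f"Lemmatized occurrences per lemma: {lemmatized_per_lemma:.1f}")
```

About 13% of all occurrences are lemmatized, which is consistent with lemmatization being limited to selected sectors of the vocabulary rather than applied to every word.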


Scientific responsibility for the project, direction of the work and testing: Diana Toccafondi (State Archives of Prato)

Scientific advice and collaboration on the identification of the texts: Jérôme Hayez (CNRS – Paris)

Implementation of the database: Istituto Opera del Vocabolario Italiano of the National Research Council (CNR-OVI), Via di Castello 46, Florence
Software Development: Domenico Iorio-Fili
Responsible for the philological and lexicographical aspects: Pär Larson

Lemmatization, data entry and checking: Paolo Squillacioti (responsible for data entry and checking), Elena Artale (responsible for lemmatization), Mariafrancesca Giuliani, Rossella Mosti