Historical Manuscript Sections Detection 1.0
This package provides a script together with training and test data for
recognizing sections of historical documents. Recognition models are
first trained on manually marked sections in the training documents. The
same sections can then be recognized in the test data. The script
supports recognition of the 'intitulatio', 'publicatio', 'inscriptio',
'narratio', 'arenga', 'dispositio', 'corroboratio', and 'datatio'
sections of medieval diplomatic manuscripts. It supports three methods
based on cosine distance, TF-IDF weighting, and an adapted Viterbi
algorithm.
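As an illustration of the TF-IDF and cosine distance idea, the following
minimal sketch compares a candidate text with a few example phrases and
reports the closest one. It is not the code of the script itself: the
script relies on scikit-learn, while this sketch uses only the standard
library, and the function names, the whitespace tokenization, the
smoothed IDF formula and the Latin sample phrases are all assumptions
made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Build TF-IDF vectors (as dicts) for the tokenized example phrases."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    # Smoothed inverse document frequency (an assumed weighting variant).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
           for t, df in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()}
               for tokens in token_lists]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_example(examples, candidate):
    """Return (index, score) of the example phrase closest to the candidate."""
    token_lists = [e.lower().split() for e in examples]
    vectors, idf = tfidf_vectors(token_lists)
    counts = Counter(candidate.lower().split())
    # Words unseen in the examples carry no weight in this sketch.
    cand_vec = {t: c * idf[t] for t, c in counts.items() if t in idf}
    scores = [cosine(cand_vec, v) for v in vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical example phrases, not taken from the package data.
examples = [
    "in nomine domini amen",
    "datum anno domini millesimo trecentesimo",
    "notum facimus universis presentes litteras inspecturis",
]
index, score = most_similar_example(examples, "datum et actum anno domini")
print(index, round(score, 2))
```

Here the second phrase (index 1) is selected, because it shares the most
highly weighted words with the candidate.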
The data were taken from the edition Urkunden- und Quellenbuch
zur Geschichte der altluxemburgischen Territorien IX. Die Urkunden Graf
Johanns des Blinden (1310-1346). Teil 2: Die Urkunden aus den Archives
Générales du Royaume, Brüssel, ed. Aloyse von Estgen, Michel Pauly,
Hérold Pettiau, Jean Schroeder, Luxemburg 2009 (= Publications du
CLUDEM 29). Only the charters written in Latin were chosen for the
project, because Latin orthography is much more consistent than that of
the vernacular languages. Thanks to the edition rules, the Latin
orthography of the data in this edition is also consistent. The system
is also prepared for newly edited manuscripts from monasterium.net.
The script is licensed under the Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International License.
To view a copy of this license, visit
Charles University in Prague
Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
Malostranské náměstí 25
Charles University in Prague
Faculty of Arts
Department of Auxiliary Sciences of History
Praha 1, 116 38
Python script for automatic recognition of sections of medieval
manuscripts. The script requires the scikit-learn package.
The script can be run using the following command:
python scripts/parser.py -t data/train.xml -i data/heldout.xml
The script supports the following settings:
--tagged: Set of manuscripts with marked sections in
the CEI format.
--input: Set of plain CEI documents for testing.
--classtype: Manuscript class which should be used for
training and testing. This class should be set for
--names: By default, 'c', 'persName', 'rolename',
'measure' and 'placeName' tags are removed from the
training data, if they are present. They are
preserved when the 'names' option is set.
--variance: Reduce different writing variants to a
single one (e.g. replace 'ae' by 'e' or 'y' by 'ii').
--lemmas: Use lemmas instead of word forms for
training and testing. A file with lemmas generated by the
LemlatTXT software should be given as input.
--method: Method which should be used for calculation
of the phrases. Either 'cosine', 'tfidf' or 'graph' can
--max: In the cosine distance algorithm, the distance
between example phrases and text sections can be
calculated over the "max" most similar phrases.
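The preprocessing described for the 'names' and 'variance' settings can
be sketched as follows. This is only an illustration under stated
assumptions, not the code of the script: it assumes that removing a tag
means dropping the element together with its content, it uses only the
two replacements named above as examples, and the helper names and the
sample fragment are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Tags listed in the description of the 'names' setting.
NAME_TAGS = {"c", "persName", "rolename", "measure", "placeName"}
# The two example replacements from the 'variance' setting.
VARIANTS = [("ae", "e"), ("y", "ii")]


def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}persName'."""
    return tag.rsplit("}", 1)[-1]


def plain_text(element, keep_names=False):
    """Collect the text of an element, skipping name-like children
    (with their content) unless keep_names is set."""
    parts = [element.text or ""]
    for child in element:
        if keep_names or local_name(child.tag) not in NAME_TAGS:
            parts.append(plain_text(child, keep_names))
        # Text following the child belongs to the parent and is kept.
        parts.append(child.tail or "")
    return "".join(parts)


def normalize_variants(text):
    """Lowercase, reduce writing variants and collapse whitespace."""
    text = text.lower()
    for variant, replacement in VARIANTS:
        text = text.replace(variant, replacement)
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical fragment, not taken from the package data.
fragment = ET.fromstring(
    "<p>Nos <persName>Johannes</persName> abbas monastery sanctae crucis</p>"
)
print(normalize_variants(plain_text(fragment)))
# nos abbas monasterii sancte crucis
print(normalize_variants(plain_text(fragment, keep_names=True)))
# nos johannes abbas monasterii sancte crucis
```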
Training documents in the CEI format with marked text sections.
Untagged test documents in the CEI format.
Test documents in the CEI format with marked text sections
which can be used for evaluation.
All words from the training and heldout data, lemmatized by the
LemlatTXT software.
Example output for the train.xml and heldout.xml documents. The most
probable text section for each section type is returned
together with the similarity score and the id of the training
document with the most similar section.
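The selection described above can be pictured with a short sketch. It
assumes that every candidate text section has already been scored
against the training data, and it simply keeps the highest-scoring
candidate for each section type. The tuple layout, the function name,
the scores and the document ids are hypothetical and do not reflect the
actual output format of the script.

```python
def most_probable_sections(scored):
    """Keep the best candidate per section type.

    scored: iterable of (section_type, text, score, train_id) tuples.
    Returns {section_type: (text, score, train_id)}.
    """
    best = {}
    for section, text, score, train_id in scored:
        if section not in best or score > best[section][1]:
            best[section] = (text, score, train_id)
    return best


# Hypothetical scored candidates for one test document.
scored = [
    ("datatio", "datum anno domini", 0.75, "doc12"),
    ("datatio", "in nomine domini", 0.19, "doc03"),
    ("intitulatio", "nos johannes dei gratia", 0.64, "doc07"),
]
result = most_probable_sections(scored)
for section, (text, score, train_id) in sorted(result.items()):
    print(section, text, score, train_id)
```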
This package was created in the Digital Editing of Medieval Manuscripts
Training Programme.
This research has been supported by the Czech Science Foundation
(grant n. P103/12/G084).