Historical Manuscript Sections Detection 1.0
Petra Galuščáková

1. Description
This package provides a script together with training and test data for
recognizing sections of historical documents. Recognition models are
first trained on manually marked sections in the training documents; the
same sections can then be recognized in the test data. The script
supports recognition of the 'intitulatio', 'publicatio', 'inscriptio',
'narratio', 'arenga', 'dispositio', 'corroboratio', and 'datatio'
sections of medieval diplomatic manuscripts. It supports three methods,
based on cosine distance, TF-IDF weighting, and an adapted Viterbi
algorithm.
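
To illustrate the general idea behind the cosine distance and TF-IDF
methods, the following is a minimal, self-contained sketch and not the
code of the packaged script. It assigns a candidate phrase the section
type of the most similar training phrase. The function names, the
smoothed IDF formula and the short Latin phrases are illustrative
assumptions; the packaged script relies on scikit-learn, whereas this
sketch uses only the Python standard library.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumption): frequent terms keep a small weight.
    idf = {tok: math.log((1 + n_docs) / (1 + cnt)) + 1.0
           for tok, cnt in doc_freq.items()}
    return [{tok: tf * idf[tok] for tok, tf in Counter(doc).items()}
            for doc in docs]


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(candidate, training):
    """Return (score, label) of the training phrase closest to candidate."""
    docs = [text.split() for _, text in training] + [candidate.split()]
    vectors = tfidf_vectors(docs)
    cand_vec = vectors[-1]
    scored = [(cosine(cand_vec, vec), label)
              for vec, (label, _) in zip(vectors, training)]
    return max(scored)


# Hypothetical training phrases, invented for this example only.
training = [
    ("intitulatio", "nos johannes dei gratia boemie rex"),
    ("publicatio", "notum facimus universis presentes litteras inspecturis"),
    ("datatio", "datum anno domini millesimo trecentesimo"),
]
candidate = "notum facimus universis quod"

best_score, best_label = classify(candidate, training)
print(best_label, round(best_score, 3))
```

In this toy setting the candidate shares words only with the
'publicatio' phrase, so that label is returned together with its
similarity score.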

2. Preamble

2.1 Source
The source data were taken from the edition Urkunden- und Quellenbuch
zur Geschichte der altluxemburgischen Territorien IX. Die Urkunden Graf
Johanns des Blinden (1310-1346). Teil 2: Die Urkunden aus den Archives
Générales du Royaume, Brüssel, ed. Aloyse von Estgen, Michel Pauly,
Hérold Pettiau, Jean Schroeder, Luxemburg 2009 (= Publications du
CLUDEM 29). Only the charters written in Latin were chosen for the
project, because Latin orthography is much more consistent than that of
the vernacular languages. Thanks to the editorial rules, the Latin
orthography in this edition is consistent as well. The system is
nevertheless also prepared for newly edited manuscripts of [1].

2.2. License
The script is licensed under the Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International License.
To view a copy of this license, visit

3. Authors
Petra Galuščáková

Charles University in Prague
Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
Malostranské náměstí 25
118 00
Prague 1
Czech Republic

Lucie Neužilová

Charles University in Prague
Faculty of Arts
Department of Auxiliary Sciences of History
Palackého 2
Praha 1, 116 38

4. Data
4.1. Scripts
A Python script for automatic recognition of sections of medieval
manuscripts. The script requires the scikit-learn package [2].
The script can be run using the following command:
python scripts/ -t data/train.xml -i data/heldout.xml
The script supports the following settings:
--tagged: Set of manuscripts with marked sections in
the CEI format [3].
--input: Set of plain CEI documents for testing.
--classtype: Manuscript class which should be used for
training and testing. This class should be set for
documents using tag.
--names: By default, the 'c', 'persName', 'rolename',
'measure' and 'placeName' tags are removed from the
training data if they are present. With the 'names'
option, these names are preserved.
--variance: Reduce different spelling variants to a
single one (e.g. replace 'ae' by 'e' or 'y' by 'ii').
--lemmas: Use lemmas instead of word forms for
training and testing. A file with lemmas generated by
the LemlatTXT software [4] should be given on the input.
--method: Method which should be used for the phrase
calculation. One of 'cosine', 'tfidf' or 'graph' can
be used.
--max: In the cosine distance algorithm, the distance
between example phrases and text sections can be
calculated over the "max" most similar phrases.
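
The preprocessing described for the names and variance settings can be
sketched roughly as follows. This is an illustration under stated
assumptions, not the packaged implementation: it assumes that removing
a tag means dropping the element together with its content, it applies
only the two replacements named above, and the helper names and the
sample fragment are hypothetical.

```python
import xml.etree.ElementTree as ET

# Tag names taken from the description of the names setting.
NAME_TAGS = {"c", "persName", "rolename", "measure", "placeName"}


def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}persName'."""
    return tag.rsplit("}", 1)[-1]


def text_without(elem, drop):
    """Collect the text of elem, skipping elements whose tag is in drop.

    The tail text that follows a skipped element is still kept.
    """
    parts = [elem.text or ""]
    for child in elem:
        if local_name(child.tag) not in drop:
            parts.append(text_without(child, drop))
        parts.append(child.tail or "")
    return "".join(parts)


def normalize_variants(text):
    """Apply the two example replacements: 'ae' -> 'e' and 'y' -> 'ii'."""
    return text.replace("ae", "e").replace("y", "ii")


# Hypothetical fragment, invented for this example only.
fragment = ET.fromstring(
    "<p>Nos <persName>Johannes</persName> rex Boemiae</p>"
)

# Default behaviour as assumed here: name elements are dropped.
stripped = " ".join(text_without(fragment, NAME_TAGS).split())
# Behaviour with names preserved: nothing is dropped.
preserved = " ".join(text_without(fragment, set()).split())

print(stripped)
print(preserved)
print(normalize_variants(stripped))
```

Passing an empty set as the second argument mimics preserving the
names, so the same helper covers both cases.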

4.2 Data
* train.xml
Training documents in the CEI format with marked text sections.
* heldout.xml
Untagged test documents in the CEI format.
* heldout-tagged.xml
Test documents in the CEI format with marked text sections
which can be used for evaluation.
* train+heldout-lemmas.xml
All words from training and heldout data lemmatized by the
LemlatTXT software [4].
* example-output.csv
Example output for train.xml and heldout.xml documents. Most
probable text section for each section type is returned
together with the similarity score and id of the training
document with the most similar section.

5. Acknowledgements
This package was created in the Digital Editing of Medieval Manuscripts
Training Programme [5].
This research has been supported by the Czech Science Foundation
(grant no. P103/12/G084).

6. References