Resources and tools

sPeriodika Corpus

5 July 2025

The corpus, created as part of our research programme, contains texts from Slovenian periodicals published between 1771 and 1914. It was compiled from texts retrieved from dLib.si, the digital library service of the National and University Library of Slovenia. The retrieved texts came from optical character recognition (OCR) of PDF files and from plain-text files. During corpus preparation, the texts were further cleaned and processed, lemmatised, part-of-speech tagged, and annotated with named entities. The corpus is available in the CLARIN.SI repository and can be searched with the noSketch Engine concordancer; it contains 910,064,957 tokens and 708,306,576 words from 216 periodical publications.