Simone Torsani, Dipartimento di Lingue e Culture Moderne, Università di Genova


A corpus of the Italian local press. This paper introduces CoSIL, a corpus of articles from Italian local newspapers containing about 180,000 texts and 66,000,000 words. The corpus was built to provide researchers with a freely downloadable, balanced corpus of journalistic texts and with material for linguistic research on the online local press, now a pervasive source of information. In addition to the objectives behind the construction of the corpus, the paper describes its design and development, focusing on representativeness and balance.

