Rodolfo Delmonte, Ca' Foscari University Venice
Keywords: TextToSpeech, Prosody, NLP, Poetry


In this paper we present SPARSAR, a system for English poetry recital. The system is parasitic on Text-to-Speech (TTS) systems available both online and on Macintosh computers. For any input text it creates prosodic parameters and phonetic transcriptions that the TTS can use, in order to normalize and improve current systems, which are statistically based. To show the inability of TTS systems to produce semantically coherent and expressive readings, Italian texts will be used first, and their critical points indicated and discussed. The SPARSAR architecture will then be introduced and its three layers presented in detail. The ability of the system to generate appropriate prosodic parameters will be discussed in relation to a poem by Sylvia Plath, Edge. The peculiarity of this poem is its richness in enjambments, which are not captured at all by statistically based TTS. Finally, the latest work on Elizabethan poetry will be presented, and a sonnet by Shakespeare will be transcribed and annotated with prosodic parameter values by the system; in particular, it will be shown how the lack of a specific component to account for contractions and rhyming violations leaves even the best commercial TTS systems unable to pronounce words correctly.

Author Biography

Rodolfo Delmonte, Ca' Foscari University Venice
Associate Professor of Linguistics and Computational Linguistics in the Department of Language Studies, retired since 2016.

