Odyssey 2026

Keynote Speakers

Prof. Dr. Rita Singh – School of Computer Science, Carnegie Mellon University

Keynote Title: Genetic information in human voice: how much do we know today and how much more will technology uncover?

Abstract: Speech arises from a tightly integrated biomechanical and cognitive–motor system in which cortical speech areas, basal ganglia, and cerebellum jointly control the larynx and vocal tract to generate highly individualized acoustic output. Contemporary neurobiological models frame speech production as a hierarchically organized sensorimotor loop that transforms abstract linguistic plans into finely timed articulatory gestures, with cortico–basal ganglia–cerebellar circuits supporting both motor sequencing and higher-order cognitive control. Building on this framework, the talk will link inter-individual variation in vocal biomechanics (laryngeal tissue properties, vocal tract morphology, and neuromuscular control) to underlying genetic architecture. Twin and developmental studies already show substantial heritability for fundamental frequency and related voice-quality measures, indicating a genetic contribution to key acoustic parameters. Recent genome-wide association work has identified specific loci, including variants in ABCC9, that modulate median voice pitch across populations, suggesting that common variants in ion channels and other pathways systematically shape habitual phonation.
Against this backdrop, the talk will discuss emerging efforts to connect detailed acoustic features (e.g., source–filter characteristics, prosodic dynamics, micro-perturbation measures) to cytogenetic and genomic information, including voice–genomics profiling studies and AI-based analyses of voices in monogenic and chromosomal syndromes. These data, combined with the statistical arguments in my paper “Human Voice is Unique,” quantify the improbability that two individuals share indistinguishable voices once the full multidimensional acoustic space is considered, and motivate treating the voice as a rich, partially decoded phenotypic readout of genotype. The talk will conclude by assessing how far current evidence justifies inferring genetic information from voice, outlining realistic near-term capabilities and limitations, and sketching future directions, such as large-scale voice–genome cohorts, mechanistic models linking genes to vocal biomechanics, and ethically grounded machine-learning methods, that will determine how much more our voices can reveal about our genomes.

Biography: Rita Singh is a Research Professor at CMU’s School of Computer Science (Language Technologies Institute), with affiliations to three other departments. At CMU, she leads the Center for Voice Intelligence and Security, and co-leads the Machine Learning for Signal Processing and Robust Speech Processing research groups. She has worked on speech and audio processing for over two decades. Since 2014, her work has focused on developing the science of profiling humans from their voice, a niche area at the intersection of Artificial Intelligence and Voice Forensics. The technology pioneered by her group has led to three world firsts: In 2018, her team created the world’s first voice-based profiling system, demonstrated live at the World Economic Forum. In 2019, her group created the world’s first instance of a human voice, that of the artist Rembrandt, generated from evidence in facial images. In 2020, her team conceptualized and enabled the first voice-based detection system for COVID-19. She is the author of the book “Profiling Humans from their Voice,” published by Springer Nature in 2019.

Prof. Dr. Lukáš Burget – Brno University of Technology

Keynote Title: From Single-Channel Foundations to Multi-Speaker and Multi-Modal Understanding

Abstract: Recent advances in foundation models such as Whisper and WavLM have transformed automatic speech recognition, yet most systems still assume a single, clean speaker. This talk traces the progression toward models that can process and understand natural multi-speaker conversations. I will discuss how large pre-trained speech models can be extended to multi-channel input and spatially aware processing, how speaker diarization and recognition can be unified within a single framework, and how efficient model compression enables real-time deployment. Together, these developments move the field from modular, error-prone pipelines toward integrated systems capable of attributing and transcribing overlapping speech in realistic acoustic conditions. Looking ahead, I will outline ongoing efforts to extend these ideas beyond audio, toward audio-visual modeling and toward combining speech encoders with large language model decoders that can summarize or interpret conversations. These directions reflect a broader goal in speech technology: bringing machines closer to understanding who is speaking, what is being said, and ultimately what it means.

Biography: Lukáš Burget is an associate professor at the Faculty of Information Technology, Brno University of Technology (FIT BUT), and the scientific director of the BUT Speech@FIT group. He is recognized internationally as a leading expert in speech processing. He received his Ph.D. from Brno University of Technology in 2004. From 2000 to 2002, he was a visiting researcher at OGI, Portland, USA, and in 2011–2012, he spent a sabbatical at SRI International in Menlo Park, USA. His research interests include various areas of speech processing, such as acoustic modeling for speech, speaker, and language recognition. He has played a leading role in several JSALT workshops, notably leading the 2008 team that introduced i-vectors, the first widely adopted speech embeddings, and contributing in 2009 to the creation of the widely used Kaldi ASR toolkit. In 2006, he co-founded Phonexia, a company that now employs over 50 full-time staff and delivers speech technologies to clients in more than 60 countries.

Prof. Dr. Björn Schuller – TUM School of Medicine and Health, TUM School of Computation, Information and Technology

Keynote Title: Every breath you take: From Vocal Chords to Health Scores

Abstract: Our voices are more than carriers of words – they are rich, noisy, beautifully imperfect biosensors. Every breath you take and laugh, sigh, or stutter you utter encodes a chord progression of physiology and psychology: from affective mood to fatigue, calmness, or gleeful states. With AI’s help, we can pick up sharp irregularities and notice when voices go just a little flat before your health does. Moving beyond classic speaker ID, we will explore how we can turn vocal cords into health scores using modern affective and health-aware AI. Such AI can tell you who you are, how you feel, and how you’re doing: from detecting anxiety, burnout, and cognitive load to depression, all the way to links with yourself, your behaviour, and chronic conditions. We will discuss how we build models that stay robust in the wild—across devices, languages, and accents—while keeping them clinically meaningful. On the technical side, we will move from representation and neural architecture learning for paralinguistics to self-supervised learning at scale, and examine how large reasoning models change the game for speaker characterisation. On the societal side, we shall tackle the uncomfortable but essential questions: privacy, bias, efficiency, and explainability—just to name a few of the most pressing aspects. Expect a tour from lab demos to real-world deployments in everyday devices—highlighting where “Computational Paralinguistics” rocks already, and what more it will take to make voice a trusted instrument in the future health orchestra.

Biography: Björn W. Schuller received his diploma, doctoral degree, habilitation, and Adjunct Teaching Professorship in Machine Intelligence and Signal Processing, all in EE/IT, from TUM in Munich/Germany, where he is Full Professor and Chair of Health Informatics. He is also Full Professor of Artificial Intelligence and the Head of GLAM at Imperial College London/UK, co-founding CEO and current CSO of audEERING – an Audio Intelligence company based near Munich and in Berlin/Germany, Core Member in the Munich Data Science Institute (MDSI), and Principal Investigator in the Munich Center for Machine Learning (MCML), amongst other professorships and affiliations. Previous positions include Full Professor at the Universities of Augsburg and Passau/Germany, Key Researcher at Joanneum Research in Graz/Austria, and a research stay at CNRS-LIMSI in Orsay/France. He is a Fellow of the ACM, Fellow of the IEEE and Golden Core Awardee of the IEEE Computer Society, Fellow of the BCS, Fellow of ELLIS, Fellow of the ISCA, Fellow and President-Emeritus of the AAAC, and an Elected Full Member of Sigma Xi. He has (co-)authored 1,500+ publications (80,000+ citations, h-index 123), is Field Chief Editor of Frontiers in Digital Health and Editor in Chief of AI Open, and was Editor in Chief of the IEEE Transactions on Affective Computing, amongst manifold further commitments and service to the community. His 50+ awards include having been honoured as one of 40 extraordinary scientists under the age of 40 by the WEF in 2015. He was named an ACM Distinguished Speaker for the term 2024–2027 and was an IEEE Signal Processing Society Distinguished Lecturer in 2024. He has served as a consultant for companies such as Barclays, GN, Huawei, Informetis, and Samsung. Schuller has made more than 300 public press appearances, including in Newsweek, Scientific American, and Times.

Prof. Dr. Daniel Ramos – Universidad Autónoma de Madrid

Keynote Title:

Abstract:

Biography: