
Prof. Dr. Rita Singh – School of Computer Science, Carnegie Mellon University
Keynote Title: Genetic information in human voice: how much do we know today and how much more will technology uncover?
Abstract:
Speech arises from a tightly integrated biomechanical and cognitive–motor system in which cortical speech areas, basal ganglia, and cerebellum jointly control the larynx and vocal tract to generate highly individualized acoustic output. Contemporary neurobiological models frame speech production as a hierarchically organized sensorimotor loop that transforms abstract linguistic plans into finely timed articulatory gestures, with cortico–basal ganglia–cerebellar circuits supporting both motor sequencing and higher-order cognitive control. Building on this framework, the talk will link inter-individual variation in vocal biomechanics (laryngeal tissue properties, vocal tract morphology, and neuromuscular control) to underlying genetic architecture. Twin and developmental studies already show substantial heritability for fundamental frequency and related voice-quality measures, indicating a genetic contribution to key acoustic parameters. Recent genome-wide association work has identified specific loci, including variants in ABCC9, that modulate median voice pitch across populations, suggesting that common variants in ion channels and other pathways systematically shape habitual phonation.
Against this backdrop, the talk will discuss emerging efforts to connect detailed acoustic features (e.g., source–filter characteristics, prosodic dynamics, micro-perturbation measures) to cytogenetic and genomic information, including voice–genomics profiling studies and AI-based analyses of voices in monogenic and chromosomal syndromes. These data, combined with the statistical arguments in my paper “Human Voice is Unique,” quantify the improbability that two individuals share indistinguishable voices once the full multidimensional acoustic space is considered, and motivate treating the voice as a rich, partially decoded phenotypic readout of genotype. The talk will conclude by assessing how far current evidence justifies inferring genetic information from voice, outlining realistic near-term capabilities and limitations, and sketching future directions, such as large-scale voice–genome cohorts, mechanistic models linking genes to vocal biomechanics, and ethically grounded machine-learning methods, to determine how much more our voices can reveal about our genomes.
Biography:
Rita Singh is a Research Professor at CMU’s School of Computer Science / Language Technologies Institute, with affiliations to three other departments. At CMU, she leads the Center for Voice Intelligence and Security, and co-leads the Machine Learning for Signal Processing and Robust Speech Processing research groups. She has worked on speech and audio processing for over two decades. Since 2014, her work has focused on developing the science of profiling humans from their voice, a niche area at the intersection of Artificial Intelligence and Voice Forensics. The technology pioneered by her group has led to three world firsts. In 2018, her team created the world’s first voice-based profiling system, demonstrated live at the World Economic Forum. In 2019, her group created the world’s first instance of a human voice – that of the artist Rembrandt – generated from evidence in facial images. In 2020, her team conceptualized and enabled the first voice-based detection system for COVID-19. She is the author of the book “Profiling Humans from their Voice,” published by Springer-Nature in 2019.

Prof. Dr. Lukáš Burget – Brno University of Technology
Keynote Title: From Single-Channel Foundations to Multi-Speaker and Multi-Modal Understanding
Abstract:
Recent advances in foundation models such as Whisper and WavLM have transformed automatic speech recognition, yet most systems still assume a single, clean speaker. This talk traces the progression toward models that can process and understand natural multi-speaker conversations. I will discuss how large pre-trained speech models can be extended to multi-channel input and spatially aware processing, how speaker diarization and recognition can be unified within a single framework, and how efficient model compression enables real-time deployment. Together, these developments move the field from modular, error-prone pipelines toward integrated systems capable of attributing and transcribing overlapping speech in realistic acoustic conditions. Looking ahead, I will outline ongoing efforts to extend these ideas beyond audio, toward audio-visual modeling and toward combining speech encoders with large language model decoders that can summarize or interpret conversations. These directions reflect a broader goal in speech technology: bringing machines closer to understanding who is speaking, what is being said, and ultimately what it means.
Biography:
Lukáš Burget is an associate professor at the Faculty of Information Technology, Brno University of Technology (FIT BUT), and the scientific director of the BUT Speech@FIT group. He is recognized internationally as a leading expert in speech processing. He received his Ph.D. from Brno University of Technology in 2004. From 2000 to 2002, he was a visiting researcher at OGI, Portland, USA, and in 2011–2012, he spent a sabbatical at SRI International in Menlo Park, USA. His research interests include various areas of speech processing, such as acoustic modeling for speech, speaker, and language recognition. He has played a leading role in several JSALT workshops, notably leading the 2008 team that introduced i-vectors, the first widely adopted speech embeddings, and contributing in 2009 to the creation of the widely used Kaldi ASR toolkit. In 2006, he co-founded Phonexia, a company that now employs over 50 full-time staff and delivers speech technologies to clients in more than 60 countries.

Prof. Dr. Björn Schuller – TUM School of Medicine and Health, TUM School of Computation, Information and Technology
Keynote Title: Every breath you take: From Vocal Chords to Health Scores
Abstract:
Our voices are more than carriers of words – they are rich, noisy, beautifully imperfect biosensors. Every breath you take and laugh, sigh, or stutter you utter encodes a chord progression of physiology and psychology: from affective mood to fatigue, calmness, or gleeful states. With AI’s help, we can pick up sharp irregularities and notice when voices go just a little flat before your health does. Moving beyond classic speaker ID, we will explore how we can turn vocal cords into health scores using modern affective and health-aware AI. Such AI can tell you who you are, how you feel, and how you’re doing: from detecting anxiety, burnout, and cognitive load to depression, all the way to links with your behaviour and chronic conditions. We will discuss how we build models that stay robust in the wild—across devices, languages, and accents—while keeping them clinically meaningful. On the technical side, we will move from representation and neural architecture learning for paralinguistics to self-supervised learning at scale, and discuss how large reasoning models change the game for speaker characterisation. On the societal side, we shall tackle the uncomfortable but essential questions of privacy, bias, efficiency, and explainability—to name just a few. Expect a tour from lab demos to real-world deployments in everyday devices—highlighting where “Computational Paralinguistics” already rocks, and what more it will take to make voice a trusted instrument in the future health orchestra.
Biography:
Björn W. Schuller received his diploma, doctoral degree, habilitation, and Adjunct Teaching Professorship in Machine Intelligence and Signal Processing, all in EE/IT, from TUM in Munich/Germany, where he is Full Professor and Chair of Health Informatics. He is also Full Professor of Artificial Intelligence and Head of GLAM at Imperial College London/UK, co-founding CEO and current CSO of audEERING – an Audio Intelligence company based near Munich and in Berlin/Germany – Core Member of the Munich Data Science Institute (MDSI), and Principal Investigator in the Munich Center for Machine Learning (MCML), amongst other professorships and affiliations. Previous stays include Full Professor at the Universities of Augsburg and Passau/Germany, Key Researcher at Joanneum Research in Graz/Austria, and CNRS-LIMSI in Orsay/France. He is a Fellow of the ACM, Fellow of the IEEE and Golden Core Awardee of the IEEE Computer Society, Fellow of the BCS, Fellow of ELLIS, Fellow of the ISCA, Fellow and President-Emeritus of the AAAC, and an Elected Full Member of Sigma Xi. He has (co-)authored 1,500+ publications (80,000+ citations, h-index 123), is Field Chief Editor of Frontiers in Digital Health and Editor-in-Chief of AI Open, and was Editor-in-Chief of the IEEE Transactions on Affective Computing, amongst manifold further commitments and service to the community. His 50+ awards include being honoured as one of 40 extraordinary scientists under the age of 40 by the WEF in 2015. He was named an ACM Distinguished Speaker for the term 2024–2027 and was an IEEE Signal Processing Society Distinguished Lecturer in 2024. He has served as a consultant to companies such as Barclays, GN, Huawei, Informetis, and Samsung. Schuller counts more than 300 public press appearances, including in Newsweek, Scientific American, and the Times.

Prof. Dr. Daniel Ramos – Universidad Autonoma de Madrid
Keynote Title: Rigorous Forensic Automatic Speaker Recognition: Bayesian Decision Theory, Probabilistic Calibration and Case-Specific Validation
Abstract:
The use of automatic speaker recognition systems in forensic science has undergone a dramatic improvement in recent years in terms of scientific rigor, objectivity, and consensus. As a result, the discipline has become strongly aligned with the recommendations of the recently released ISO 21043 standard for forensic sciences. In this talk, we will identify three elements that are now essential for the proper and standardized use of automatic speaker recognition systems in forensic science. First, the adoption of a Bayesian decision-theoretical framework ensures the logical incorporation of system information, expressed as a likelihood ratio, into the decision-making process of judges or juries. Second, the probabilistic calibration of likelihood ratios ensures that decisions are made optimally by the fact finder. Third, the strict and systematic validation of systems under case-specific conditions ensures that forensic casework is conducted with sufficient quality. In this context, the contribution and impact of the speaker recognition community have been paramount, with the proposal of techniques such as score-based likelihood ratios, proper scoring rules for validation, and probabilistic calibration. These methods have since been progressively adopted in other areas, including forensic biometrics, forensic chemistry, and forensic DNA profiling, and now extend to an ever-growing range of forensic disciplines.
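As a minimal illustration of the Bayesian decision-theoretical framework described above, a likelihood ratio reported by a system multiplies the fact finder's prior odds to give posterior odds. The sketch below is a toy example with hypothetical numbers, not casework values or any institution's actual procedure:

```python
# Minimal sketch of Bayesian evidence updating with a likelihood ratio (LR).
# All numbers here are hypothetical illustrations, not forensic casework values.

def posterior_odds(prior_odds: float, lr: float) -> float:
    """Bayes' rule in odds form: posterior odds = LR x prior odds."""
    return lr * prior_odds

def odds_to_probability(odds: float) -> float:
    """Convert odds in favour of a hypothesis into a probability."""
    return odds / (1.0 + odds)

# Suppose the fact finder's prior odds for the same-speaker hypothesis
# are 1:100, and a calibrated system reports LR = 500 for the evidence.
post = posterior_odds(prior_odds=1 / 100, lr=500.0)
print(post)                       # posterior odds of about 5:1
print(odds_to_probability(post))  # posterior probability of about 0.83
```

The division of labour this illustrates is the key point of the framework: the system reports only the likelihood ratio (the weight of the evidence), while the prior odds, and hence the decision, remain with the fact finder. Calibration ensures that the reported LR can validly be used in exactly this multiplication.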
Biography:
Dr. Daniel Ramos is an Associate Professor at the Universidad Autónoma de Madrid and a member of the AUDIAS research group (https://audias.ii.uam.es/). His research focuses on forensic evidence evaluation and validation, speech and signal processing, machine learning, and artificial intelligence. He has visited institutions worldwide with a strong research focus on forensic interpretation and Bayesian machine learning, including the University of Lausanne (Switzerland), the University of Stellenbosch (South Africa), the University of Edinburgh (UK), the Netherlands Forensic Institute (NFI), and the University of Cambridge (UK). He has also been a visiting professor in probabilistic machine learning at the Universidad de Buenos Aires (Argentina). Dr. Ramos has collaborated with various forensic institutions, notably in the long term with the Spanish Guardia Civil and the NFI, as well as with the Institute of Forensic Research in Kraków (Poland) and the International Forensic Research Institute at Florida International University (USA). An associate member of the ENFSI Forensic Speech and Audio Working Group, Dr. Ramos has pioneered the use of likelihood ratios in forensic speaker recognition, as well as in other forensic disciplines such as forensic chemistry. His contributions to the validation of forensic likelihood ratios now serve as the basis for guidelines at different forensic institutions across Europe. He has been invited to multiple scientific events related to forensic science, notably by the NFI and the National Institute of Standards and Technology (NIST) and its Organization of Scientific Area Committees (OSAC), which drives the development of standards for forensic science in the USA.
