
Prof. Dr. Rita Singh – School of Computer Science, Carnegie Mellon University
Keynote Title: Genetic information in human voice: how much do we know today and how much more will technology uncover?
Abstract:
Speech arises from a tightly integrated biomechanical and cognitive–motor system in which cortical speech areas, basal ganglia, and cerebellum jointly control the larynx and vocal tract to generate highly individualized acoustic output. Contemporary neurobiological models frame speech production as a hierarchically organized sensorimotor loop that transforms abstract linguistic plans into finely timed articulatory gestures, with cortico–basal ganglia–cerebellar circuits supporting both motor sequencing and higher-order cognitive control. Building on this framework, the talk will link inter-individual variation in vocal biomechanics (laryngeal tissue properties, vocal tract morphology, and neuromuscular control) to underlying genetic architecture. Twin and developmental studies already show substantial heritability for fundamental frequency and related voice-quality measures, indicating a genetic contribution to key acoustic parameters. Recent genome-wide association work has identified specific loci, including variants in ABCC9, that modulate median voice pitch across populations, suggesting that common variants in ion channels and other pathways systematically shape habitual phonation.
Against this backdrop, the talk will discuss emerging efforts to connect detailed acoustic features (e.g., source–filter characteristics, prosodic dynamics, micro-perturbation measures) to cytogenetic and genomic information, including voice–genomics profiling studies and AI-based analyses of voices in monogenic and chromosomal syndromes. These data, combined with the statistical arguments in my paper “Human Voice is Unique,” quantify the improbability that two individuals share indistinguishable voices once the full multidimensional acoustic space is considered, and motivate treating the voice as a rich, partially decoded phenotypic readout of genotype. The talk will conclude by assessing how far current evidence justifies inferring genetic information from voice, outlining realistic near-term capabilities and limitations, and sketching future directions, such as large-scale voice–genome cohorts, mechanistic models linking genes to vocal biomechanics, and ethically grounded machine-learning methods, to determine how much more our voices can reveal about our genomes.
Biography:
Rita Singh is a Research Professor at CMU’s School of Computer Science / Language Technologies Institute, with affiliations to three other departments. At CMU, she leads the Center for Voice Intelligence and Security, and co-leads the Machine Learning for Signal Processing and Robust Speech Processing research groups. She has worked on speech and audio processing for over two decades. Since 2014, her work has focused on developing the science of profiling humans from their voice, a niche area at the intersection of Artificial Intelligence and Voice Forensics. The technology pioneered by her group has led to three world firsts. In 2018, her team created the world’s first voice-based profiling system, demonstrated live at the World Economic Forum. In 2019, her group created the world’s first instance of a human voice – that of the artist Rembrandt – generated from evidence in facial images. In 2020, her team conceptualized and enabled the first voice-based detection system for COVID-19. She is the author of the book “Profiling Humans from their Voice,” published by Springer-Nature in 2019.

Prof. Dr. Lukáš Burget – Brno University of Technology
Keynote Title: From Single-Channel Foundations to Multi-Speaker and Multi-Modal Understanding
Abstract:
Recent advances in foundation models such as Whisper and WavLM have transformed automatic speech recognition, yet most systems still assume a single, clean speaker. This talk traces the progression toward models that can process and understand natural multi-speaker conversations. I will discuss how large pre-trained speech models can be extended to multi-channel input and spatially aware processing, how speaker diarization and recognition can be unified within a single framework, and how efficient model compression enables real-time deployment. Together, these developments move the field from modular, error-prone pipelines toward integrated systems capable of attributing and transcribing overlapping speech in realistic acoustic conditions. Looking ahead, I will outline ongoing efforts to extend these ideas beyond audio, toward audio-visual modeling and toward combining speech encoders with large language model decoders that can summarize or interpret conversations. These directions reflect a broader goal in speech technology: bringing machines closer to understanding who is speaking, what is being said, and ultimately what it means.
Biography:
Lukáš Burget is an associate professor at the Faculty of Information Technology, Brno University of Technology (FIT BUT), and the scientific director of the BUT Speech@FIT group. He is recognized internationally as a leading expert in speech processing. He received his Ph.D. from Brno University of Technology in 2004. From 2000 to 2002, he was a visiting researcher at OGI, Portland, USA, and in 2011–2012, he spent a sabbatical at SRI International in Menlo Park, USA. His research interests include various areas of speech processing, such as acoustic modeling for speech, speaker, and language recognition. He has played a leading role in several JSALT workshops, notably leading the 2008 team that introduced i-vectors, the first widely adopted speech embeddings, and contributing in 2009 to the creation of the widely used Kaldi ASR toolkit. In 2006, he co-founded Phonexia, a company that now employs over 50 full-time staff and delivers speech technologies to clients in more than 60 countries.

Prof. Dr. Björn Schuller – TUM School of Medicine and Health, TUM School of Computation, Information and Technology
Keynote Title: Every breath you take: From Vocal Chords to Health Scores
Abstract:
Our voices are more than carriers of words – they are rich, noisy, beautifully imperfect biosensors. Every breath you take and laugh, sigh, or stutter you utter encodes a chord progression of physiology and psychology: from affective mood to fatigue, calmness, or gleeful states. With AI’s help, we can pick up sharp irregularities and notice when voices go just a little flat before your health does. Moving beyond classic speaker ID, we will explore how we can turn vocal cords into health scores using modern affective and health-aware AI. Such AI can tell you who you are, how you feel, and how you’re doing: from detecting anxiety, burnout, and cognitive load to depression, all the way to links with your behaviour and chronic conditions. We will discuss how we build models that stay robust in the wild—across devices, languages, and accents—while keeping them clinically meaningful. On the technical side, we will move from representation and neural architecture learning for paralinguistics to self-supervised learning at scale, and discuss how large reasoning models change the game for speaker characterisation. On the societal side, we shall tackle the uncomfortable but essential questions of privacy, bias, efficiency, and explainability—to name just a few. Expect a tour from lab demos to real-world deployments in everyday devices—highlighting where “Computational Paralinguistics” already rocks, and what more it will take to make voice a trusted instrument in the future health orchestra.
Biography:
Björn W. Schuller received his diploma, doctoral degree, habilitation, and Adjunct Teaching Professorship in Machine Intelligence and Signal Processing, all in EE/IT, from TUM in Munich/Germany, where he is Full Professor and Chair of Health Informatics. He is also Full Professor of Artificial Intelligence and Head of GLAM at Imperial College London/UK, co-founding CEO and current CSO of audEERING – an Audio Intelligence company based near Munich and in Berlin/Germany – Core Member of the Munich Data Science Institute (MDSI), and Principal Investigator in the Munich Center for Machine Learning (MCML), amongst other professorships and affiliations. Previous stays include Full Professor at the Universities of Augsburg and Passau/Germany, Key Researcher at Joanneum Research in Graz/Austria, and CNRS-LIMSI in Orsay/France. He is a Fellow of the ACM, Fellow of the IEEE and Golden Core Awardee of the IEEE Computer Society, Fellow of the BCS, Fellow of ELLIS, Fellow of the ISCA, Fellow and President-Emeritus of the AAAC, and an Elected Full Member of Sigma Xi. He has (co-)authored 1,500+ publications (80,000+ citations, h-index 123), is Field Chief Editor of Frontiers in Digital Health and Editor-in-Chief of AI Open, and was Editor-in-Chief of the IEEE Transactions on Affective Computing, amongst manifold further commitments and service to the community. His 50+ awards include being honoured as one of 40 extraordinary scientists under the age of 40 by the WEF in 2015. He was named an ACM Distinguished Speaker for the term 2024–2027 and was an IEEE Signal Processing Society Distinguished Lecturer in 2024. He has served as a consultant to companies such as Barclays, GN, Huawei, Informetis, and Samsung. Schuller counts more than 300 public press appearances, including in Newsweek, Scientific American, and the Times.

Prof. Dr. Daniel Ramos – Universidad Autonoma de Madrid
Keynote Title: Rigorous Forensic Automatic Speaker Recognition: Bayesian Decision Theory, Probabilistic Calibration and Case-Specific Validation
Abstract:
The use of automatic speaker recognition systems in forensic science has undergone a dramatic improvement in recent years in terms of scientific rigor, objectivity, and consensus. As a result, the discipline has become strongly aligned with the recommendations of the recently released ISO 21043 standard for forensic sciences. In this talk, we will identify three elements that are now essential for the proper and standardized use of automatic speaker recognition systems in forensic science. First, the adoption of a Bayesian decision-theoretical framework ensures the logical incorporation of system information, expressed as a likelihood ratio, into the decision-making process of judges or juries. Second, the probabilistic calibration of likelihood ratios ensures that decisions are made optimally by the fact finder. Third, the strict and systematic validation of systems under case-specific conditions ensures that forensic casework is conducted with sufficient quality. In this context, the contribution and impact of the speaker recognition community have been paramount, with the proposal of techniques such as score-based likelihood ratios, proper scoring rules for validation, and probabilistic calibration. These methods have since been progressively adopted in other areas, including forensic biometrics, forensic chemistry, and forensic DNA profiling, and now extend to an ever-growing range of forensic disciplines.
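As a minimal illustration of the Bayesian decision-theoretical framework described above, a likelihood ratio reported by a system multiplies the fact finder's prior odds to give posterior odds. The sketch below is a toy example with hypothetical numbers, not casework values or any institution's actual procedure:

```python
# Minimal sketch of Bayesian evidence updating with a likelihood ratio (LR).
# All numbers here are hypothetical illustrations, not forensic casework values.

def posterior_odds(prior_odds: float, lr: float) -> float:
    """Bayes' rule in odds form: posterior odds = LR x prior odds."""
    return lr * prior_odds

def odds_to_probability(odds: float) -> float:
    """Convert odds in favour of a hypothesis into a probability."""
    return odds / (1.0 + odds)

# Suppose the fact finder's prior odds for the same-speaker hypothesis
# are 1:100, and a calibrated system reports LR = 500 for the evidence.
post = posterior_odds(prior_odds=1 / 100, lr=500.0)
print(post)                       # posterior odds of about 5:1
print(odds_to_probability(post))  # posterior probability of about 0.83
```

The division of labour this illustrates is the key point of the framework: the system reports only the likelihood ratio (the weight of the evidence), while the prior odds, and hence the decision, remain with the fact finder. Calibration ensures that the reported LR can validly be used in exactly this multiplication.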
Biography:
Dr. Daniel Ramos is an Associate Professor at the Universidad Autónoma de Madrid and a member of the AUDIAS research group (https://audias.ii.uam.es/). His research focuses on forensic evidence evaluation and validation, speech and signal processing, machine learning, and artificial intelligence. He has visited institutions worldwide with a strong research focus on forensic interpretation and Bayesian machine learning, including the University of Lausanne (Switzerland), the University of Stellenbosch (South Africa), the University of Edinburgh (UK), the Netherlands Forensic Institute (NFI), and the University of Cambridge (UK). He has also been a visiting professor in probabilistic machine learning at the Universidad de Buenos Aires (Argentina). Dr. Ramos has collaborated with various forensic institutions, notably in the long term with the Spanish Guardia Civil and the NFI, as well as with the Institute of Forensic Research in Kraków (Poland) and the International Forensic Research Institute at Florida International University (USA). An associate member of the ENFSI Forensic Speech and Audio Working Group, Dr. Ramos has pioneered the use of likelihood ratios in forensic speaker recognition, as well as in other forensic disciplines such as forensic chemistry. His contributions to the validation of forensic likelihood ratios now serve as the basis for guidelines at different forensic institutions across Europe. He has been invited to multiple scientific events related to forensic science, notably by the NFI and the National Institute of Standards and Technology (NIST) and its Organization of Scientific Area Committees (OSAC), which drives the development of standards for forensic science in the USA.
