Visiting Research Fellow: Dr Iván López-Espejo on Speech Enhancement Technology and Hands-On Research

Speech enhancement technologies improve both the quality and intelligibility of noisy speech, which is crucial for hearing supports such as hearing aids in everyday life. Successfully implementing this enhancement remains an ongoing challenge, however, as gains measured in the laboratory often do not translate to real-life settings. Visiting Fellow Dr Iván López-Espejo, who works in the field of telecommunications and electronic engineering, came to Trinity College Dublin in October 2024 to conduct an experiment on the intelligibility of noisy speech with English-speaking volunteers. Discussing his findings with researchMATTERS, he talked about the challenges and rewards of hands-on research, forging connections with colleagues in Dublin, and future advancements in speech enhancement technology.

López-Espejo is currently a research fellow at the University of Granada, in his hometown in Spain. The inspiration for his visiting fellowship came from his time at Aalborg University in Denmark, where his research also revolved around speech enhancement. At that time, he explains, he and his colleagues were training a series of neural-network-based speech enhancement models, which performed very well according to the objective metrics designed to automatically evaluate a system's performance. However, they soon realised that these improvements did not carry over to the everyday conditions they were intended for. “We ran a test in Denmark with real people, and we found that there was actually a very poor correlation between what objective metrics predict in terms of intelligibility and what happens in real life. And what happens in real life is that these systems still don't really work very well, especially when the level of noise is very, very high.”
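For readers curious what such a check looks like in practice, here is a minimal sketch in Python of the kind of comparison the researchers describe: scoring enhanced utterances with an objective intelligibility metric and correlating those predictions with listener results. STOI (via the open-source pystoi package) stands in for the objective metric; the article does not name the specific metrics the team used, and all signals and listener scores below are random placeholders rather than study data.

```python
# A minimal sketch of the check described above: comparing what an objective
# intelligibility metric predicts with what listeners actually report.
# STOI (via the open-source pystoi package) stands in for "objective metric";
# all signals and listener scores are random placeholders, not study data.
import numpy as np
from scipy.stats import pearsonr
from pystoi import stoi

fs = 16000  # sample rate in Hz

# Hypothetical per-utterance (clean, enhanced) signal pairs; in a real
# evaluation these would come from the speech enhancement system under test.
rng = np.random.default_rng(0)
utterances = [(rng.standard_normal(3 * fs), rng.standard_normal(3 * fs))
              for _ in range(20)]

# Objective prediction: STOI score in [0, 1] for each utterance.
metric_scores = np.array([stoi(clean, enhanced, fs, extended=False)
                          for clean, enhanced in utterances])

# Subjective ground truth: fraction of words listeners identified correctly
# (placeholder values standing in for a real listening test).
listener_scores = rng.uniform(0.0, 1.0, size=len(utterances))

# A low Pearson correlation is exactly the mismatch the researchers observed:
# the metric's predictions fail to track human intelligibility.
r, p = pearsonr(metric_scores, listener_scores)
print(f"objective metric vs. listeners: r = {r:.2f} (p = {p:.3f})")
```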

Speech enhancement technology is a rapidly evolving field, which makes experiments crucial to ensuring that advancements are having the desired effect. López-Espejo explains that speech enhancement was previously based on a different paradigm of physical models, which were quite limited and constrained. With the introduction of neural networks, this has completely changed. “Today we can consider employing very effective speech enhancement systems in most situations where the levels of noise are not very high.”

There are, however, certain scenarios where speech enhancement still needs to be improved. Speech enhancement neural networks typically learn from examples, so when they encounter a type of noise they have not seen during training, they find it difficult to remove the background interference. Improving the generalisation ability of these systems is therefore important. Additionally, speech enhancement systems still work quite poorly at high noise levels: noisy parties or restaurants can remain difficult for people who require hearing aids, leaving them feeling excluded from social events.

Indeed, López-Espejo gives one example of an Italian colleague who noticed an additional element that may be important for speech enhancement: sight! “To improve the quality and intelligibility of speech for hearing aid users, he designed a prototype of a hearing aid with a camera attached. The idea was not only to exploit the audio information, but also the visual information. That’s something that we humans do: improve our recognition performance or rates by paying attention to the lips of a person, or their gestures, and so on.” This system, also based on neural networks, attempted to exploit not only audio but also visual signals to further improve the intelligibility and quality of speech for hearing aid users.

His time researching in Denmark, then, convinced López-Espejo that further practical research was needed to interrogate speech enhancement technologies. “It's very important to confirm the performance of these systems using a panel of subjects,” he explains. “However, many people just stick to objective metrics. They don't perform real tests, because they’re time consuming and costly in terms of resources, money, energy. So there are only a few people pointing out the drawbacks or the flaws of these metrics, and that is why I got in touch with colleagues in Norway.” These researchers in Norway had reached the same conclusion: the objective metrics were not very accurate. Along with López-Espejo, they designed a preliminary study, published last year at a conference held by Aalborg University.

Following that conference, they planned a more in-depth study revolving around listening effort and reaction times. Trinity became López-Espejo’s university of choice, as Dublin had been the host city for INTERSPEECH 2023, the biggest international conference on speech technologies, jointly organised by the ISCA (International Speech Communication Association) and Trinity’s SFI ADAPT Centre for AI-Driven Digital Content Technology. One of the conference’s three chairs was Naomi Harte, Professor in Speech Technology in Trinity’s School of Engineering.

One of López-Espejo’s own PhD students presented at INTERSPEECH, and this connection prompted him to contact Prof. Harte. Additionally, he notes, “English is a worldwide language. So I think it's a very good thing to endorse this hypothesis with a new intelligibility test run in English. So basically, it’s a collaboration between me, Naomi Harte, and colleagues in Norway who work at SINTEF.” SINTEF stands for Stiftelsen for industriell og teknisk forskning, in English “The Foundation for Industrial and Technical Research”.

In Trinity, López-Espejo was able to use his research space in Stack B to conduct an intelligibility test with native English speakers, recruiting volunteers from around the university and the wider city. Each volunteer was presented with a series of noisy sentences, in which the spoken words were affected by background noises and other audio interference. As well as measuring how easy or difficult it was for volunteers to correctly identify the words in each sentence, he measured how quickly they did so: these reaction times, he highlights, are useful as an approximation of listening effort and, in turn, could be used to evaluate speech enhancement systems. He ran tests for approximately five weeks, factoring in breaks for the Christmas and New Year period.
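As a rough illustration of the kind of data such a test produces, here is a minimal sketch of how per-subject word-identification accuracy and reaction times (the listening-effort proxy mentioned above) might be summarised. The trial records, field names and values are all hypothetical; the article does not describe the actual analysis pipeline.

```python
# A minimal sketch, in plain Python, of summarising per-subject results from
# an intelligibility test. All records and values here are hypothetical.
from statistics import mean

# One record per (subject, sentence) trial: was the sentence's keyword
# identified correctly, and how long did the response take (seconds)?
trials = [
    {"subject": "S01", "correct": True,  "rt": 1.42},
    {"subject": "S01", "correct": False, "rt": 2.89},
    {"subject": "S02", "correct": True,  "rt": 1.10},
    {"subject": "S02", "correct": True,  "rt": 1.55},
]

def summarise(trials):
    """Per-subject intelligibility (accuracy) and a listening-effort proxy
    (mean reaction time on correctly identified sentences)."""
    by_subject = {}
    for t in trials:
        by_subject.setdefault(t["subject"], []).append(t)
    summary = {}
    for subject, ts in by_subject.items():
        correct_rts = [t["rt"] for t in ts if t["correct"]]
        summary[subject] = {
            "accuracy": sum(t["correct"] for t in ts) / len(ts),
            "mean_rt": mean(correct_rts) if correct_rts else None,
        }
    return summary

print(summarise(trials))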

This visiting research trip gave him an invaluable opportunity to focus on research away from his busy schedule at his home university. Alongside collaborating with his colleagues in Norway and teaching undergraduate courses, he supervises three PhD candidates. One is working on speaker verification anti-spoofing, trying to spot when someone is using a fake voice to break through privacy protections. Another is researching speech decoding from EEG signals, using a cap fitted with electrodes to collect signals from the brain: “So the person can imagine speech, and without saying a single word their imagined speech will automatically be processed by a neural network that will try to regenerate or generate the speech. Someone, for example, who has lost their ability to speak, might be able to speak again.” His third PhD student is working on music information retrieval, producing a system that lets you sing a song in English and then automatically generates the same song with the lyrics translated into Japanese.

“I was swamped when I was in Spain,” López-Espejo adds. “I couldn't find the time to run another test. So going to Ireland was a great opportunity to just focus on it.” He was, moreover, very pleased with the results. He successfully ran the tests with 14 male and 14 female subjects, which was all the more satisfying because at the very beginning he had struggled to recruit male subjects. “And all of them were native English speakers with good hearing. So I could check the results from all of them, and it was very useful as they made sense individually, and when you take a look at the aggregated results across all subjects they also support our hypothesis.” He is now in the process of sharing his results with his colleagues so they can begin the analysis and writing processes. They plan to submit their findings to the Journal of the Acoustical Society of America, and look forward to further exploring whether these trends generalise across languages by comparing their English and Norwegian results.

“I would love to return to Dublin,” López-Espejo concludes. “It's been a very nice experience. The research environment is great. It actually reminds me of my time in Aalborg in Denmark. And I love the campus, it’s beautiful.” And because the technology is rapidly improving, López-Espejo knows that there will be plenty more research to be carried out on the future of speech enhancement technology. “Perhaps we will need another change or another shift in paradigm, in the same way we moved from classical models to neural networks. Honestly, we don't know yet.”

- Profile by Dr Sarah Cullen

 

Iván López-Espejo

Iván López-Espejo received M.Sc. degrees in Telecommunications Engineering and Electronics Engineering (2011 and 2013, respectively), and a Ph.D. in Information and Communications Technology (2017), all from the University of Granada, Spain. In 2018, he led the speech technology team at Veridas, a biometrics company in Pamplona, Spain. From 2019 to 2022, he was a postdoctoral researcher in the Artificial Intelligence and Sound section of the Department of Electronic Systems at Aalborg University, Denmark. He subsequently served as an Assistant Professor at Aalborg University (late 2022–early 2024) and a Visiting Scholar at the University of Texas at Dallas, USA, where he was also the Principal Investigator (PI) of a research project funded by a Marie Curie Global Fellowship. He is currently a Ramón y Cajal Fellow and PI at the University of Granada. His research interests span speech enhancement, perceptual aspects of signal processing, music information retrieval, robust speech recognition and keyword spotting, multi-channel speech processing, and speaker verification, including anti-spoofing techniques.