Gencho: Room Impulse Response Generation from Reverberant Speech and Text via Diffusion Transformers
Jackie Lin1,2 Jiaqi Su2 Nishit Anand2,3 Zeyu Jin2 Minje Kim1 Paris Smaragdis4
1University of Illinois Urbana-Champaign
2Adobe Research
3University of Maryland
4MIT
Corresponding author's e-mail: jackiel4@illinois.edu
Abstract
Blind room impulse response (RIR) estimation is a core task for capturing and transferring acoustic properties; yet existing methods often suffer from limited modeling capability and degraded performance under unseen conditions. Moreover, emerging generative audio applications call for more flexible impulse response generation methods. We propose Gencho, a diffusion-transformer-based model that predicts complex spectrogram RIRs from reverberant speech. A structure-aware encoder leverages isolation between early and late reflections to encode the input audio into a robust representation for conditioning, while the diffusion decoder generates diverse and perceptually realistic impulse responses from it. Gencho integrates modularly with standard speech processing pipelines for acoustic matching. Results show richer generated RIRs than non-generative baselines while maintaining strong performance in standard RIR metrics. We further demonstrate its application to text-conditioned RIR generation, highlighting Gencho's versatility for controllable acoustic simulation and generative audio tasks.
The audio examples below are the target RIRs, input reverberant speech, and the generated RIRs and resynthesized reverberant speech.
We evaluate the performance of Gencho on RIRs from unseen datasets, specifically using RIRs from the OpenAIR library and unseen clean speech utterances from the DAPS dataset. The groups of audio examples are organized by environment; in each pair of rows, the top row is the impulse response and the bottom row is the corresponding reverberant speech.
| Dry Speech | Input & Target | FiNS | FiNS+LN | FiNS+LN+AS | Gencho | Gencho+AS |
|---|---|---|---|---|---|---|
We evaluate the performance of Gencho on real-world recordings from the Device and Produced Speech (DAPS) dataset. In DAPS, recording devices were placed in various rooms, and clean speech was played aloud in each room.
We first use an audio source separation model to separate each recording into background noise (BK), early speech (ES), and late speech (LS). Depending on the model, the inputs are either the single-channel reverberant mixture (ES+LS) or two channels ((ES), (ES+LS)). Below, we show the results of each model. Each audio sample consists of the original recording concatenated with a resynthesized mix created from the clean DAPS speech, the generated IR, and the noise separated in the source-separation step. The order of the ground-truth mixture and the resynthesized mixture is randomized. Try guessing which half of the audio sample is which!
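The resynthesis step above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the function name `resynthesize` and the toy random signals are our own assumptions, standing in for real clean speech, a model-generated IR, and the separated noise track.

```python
import numpy as np
from scipy.signal import fftconvolve

def resynthesize(clean_speech: np.ndarray, rir: np.ndarray,
                 noise: np.ndarray) -> np.ndarray:
    """Convolve clean speech with an IR, then add the separated noise back."""
    reverberant = fftconvolve(clean_speech, rir)[: len(clean_speech)]
    return reverberant + noise[: len(reverberant)]

# Toy signals standing in for real audio (1 s at 16 kHz).
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
rir = np.exp(-np.linspace(0, 8, 4000)) * rng.standard_normal(4000)  # decaying IR
noise = 0.01 * rng.standard_normal(16000)

mix = resynthesize(speech, rir, noise)
```

In a real setup the signals would share the recording's sample rate, and the resynthesized mix would be loudness-matched to the original before concatenation.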
| Location (Device) | Clean | FiNS | FiNS+LN | FiNS+LN+AS | Gencho | Gencho+AS | FiNS+LN+AS -> Gencho+AS | Mixture |
|---|---|---|---|---|---|---|---|---|
Our diffusion-based IR generator enables the creation of diverse, novel impulse responses under weak guidance and can be extended to a broad range of multi-modal applications. We adapt the model for text-controllable IR generation (i.e., text-to-IR) by replacing the audio encoder with a Flan-T5-XXL text encoder. The diffusion model cross-attends to the text embedding sequence.
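As a rough sketch of how a diffusion decoder can cross-attend to a text embedding sequence, the single-head attention below lets latent IR tokens (queries) attend over text-encoder outputs (keys/values). The shapes, projection matrices, and function name are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cross_attention(ir_tokens, text_emb, wq, wk, wv):
    """Single-head cross-attention: IR latents attend over text embeddings."""
    q = ir_tokens @ wq                    # (T_ir, d) queries from diffusion latents
    k = text_emb @ wk                     # (T_txt, d) keys from text encoder
    v = text_emb @ wv                     # (T_txt, d) values from text encoder
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)    # softmax over text positions
    return w @ v                          # (T_ir, d) text-conditioned features

rng = np.random.default_rng(0)
d = 16
ir_lat = rng.standard_normal((32, d))    # 32 latent IR tokens (assumed)
txt = rng.standard_normal((6, d))        # 6 text-embedding tokens (assumed)
wq, wk, wv = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
out = cross_attention(ir_lat, txt, wq, wk, wv)
```

Because the conditioning enters only through the keys and values, the same diffusion backbone can accept either audio-encoder or text-encoder embeddings without architectural changes.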
Below are the results of text-controllable IR generation. We generate four variations of IRs for each text prompt. Note the diversity of generated IRs for the same text prompt.
| Dry Speech | Text Prompt | Gencho Text2IR | Gencho Text2IR Speech |
|---|---|---|---|
Clean speech and real-world recordings are from the DAPS dataset. IRs are from OpenAIR.