Gencho: Room Impulse Response Generation from Reverberant Speech and Text via Diffusion Transformers

Jackie Lin1,2, Jiaqi Su2, Nishit Anand2,3, Zeyu Jin2, Minje Kim1, Paris Smaragdis4

1University of Illinois Urbana-Champaign
2Adobe Research
3University of Maryland
4MIT

Corresponding author's e-mail: jackiel4@illinois.edu
Abstract Blind room impulse response (RIR) estimation is a core task for capturing and transferring acoustic properties, yet existing methods often suffer from limited flexibility and mode collapse under unseen conditions. Furthermore, emerging generative audio applications call for more flexible impulse response generation methods. We propose Gencho, a complex-spectrogram diffusion-transformer model that estimates RIRs directly from reverberant speech. A structure-aware encoder separates early and late reflections to provide robust conditioning, while the diffusion decoder generates diverse and perceptually realistic responses. Gencho integrates modularly with standard speech-processing pipelines for end-to-end acoustic matching. Results show improved generalization and richer generated IRs compared to non-generative baselines. Lastly, we develop a text-controllable IR generation proof-of-concept, demonstrating Gencho's versatility for controllable acoustic simulation and generative audio applications.
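To make the representation concrete, below is a minimal sketch of mapping an RIR to and from the 2-channel real/imaginary complex spectrogram that a diffusion transformer like Gencho can operate on. The STFT parameters and channel layout here are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: RIR <-> complex spectrogram, assuming illustrative STFT settings
# (n_fft, hop) that are NOT necessarily those used by Gencho.
import torch

def rir_to_complex_spec(rir: torch.Tensor, n_fft: int = 512, hop: int = 128) -> torch.Tensor:
    """Map a time-domain RIR of shape (T,) to a (2, F, N) real/imag spectrogram."""
    spec = torch.stft(rir, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    return torch.stack([spec.real, spec.imag])  # channel 0: real, channel 1: imag

def complex_spec_to_rir(spec: torch.Tensor, n_fft: int = 512, hop: int = 128) -> torch.Tensor:
    """Invert a (2, F, N) real/imag spectrogram back to a time-domain RIR."""
    complex_spec = torch.complex(spec[0], spec[1])
    return torch.istft(complex_spec, n_fft=n_fft, hop_length=hop,
                       window=torch.hann_window(n_fft))
```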
The audio examples below include the target RIRs, the input reverberant speech, and the generated RIRs together with the resynthesized reverberant speech.
We evaluate the performance of Gencho on RIRs from unseen datasets, specifically RIRs from the OpenAIR library and unseen clean speech utterances from the DAPS dataset. The groups of audio examples are organized by environment; in each pair of rows, the top row is the impulse response and the bottom row is the corresponding reverberant speech.
| Dry Speech | Input & Target | FiNS | FiNS+LN | FiNS+LN+AS | Gencho | Gencho+AS |
|---|---|---|---|---|---|---|
We also evaluate Gencho on real-world recordings from the Device and Produced Speech (DAPS) dataset, in which clean speech was played aloud in a variety of rooms and captured by varied recording devices.
We first use an audio source separation model to separate each recording into background noise (BK), early speech (ES), and late speech (LS). Depending on the model, the inputs are either the single-channel reverberant mixture (ES+LS) or two channels, (ES) and (ES+LS).
Below, we show the results of each model. Each audio sample consists of two halves: the original recording and a resynthesized mix built from the clean DAPS speech, the generated IR, and the noise separated in the source separation step (see the sketch after the table). The order of the ground-truth mixture and the resynthesized mixture is randomized. Try guessing which half of the audio sample is which!
| Location (Device) | Clean | FiNS | FiNS+LN | FiNS+LN+AS | Gencho | Gencho+AS | FiNS+LN+AS -> Gencho+AS | Mixture |
|---|---|---|---|---|---|---|---|---|
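As a rough illustration of the resynthesis used for these examples, the sketch below convolves clean DAPS speech with a generated IR, adds back the separated background noise, and concatenates the real and resynthesized mixtures in random order. The helper names are hypothetical; only the convolve-and-add structure reflects the description above.

```python
# Sketch of the listening-example construction. `recording`, `clean`, `ir`,
# and `noise` are assumed to be 1-D float arrays at a common sample rate.
import numpy as np
from scipy.signal import fftconvolve

def resynthesize(clean: np.ndarray, ir: np.ndarray, noise: np.ndarray) -> np.ndarray:
    """Apply a (generated) IR to clean speech, then re-add the separated noise."""
    wet = fftconvolve(clean, ir, mode="full")[: len(clean)]
    return wet + noise[: len(wet)]

def make_ab_sample(recording: np.ndarray, resynth: np.ndarray,
                   rng: np.random.Generator) -> np.ndarray:
    """Concatenate the real recording and the resynthesized mix in random order."""
    halves = [recording, resynth]
    rng.shuffle(halves)
    return np.concatenate(halves)

# Usage (hypothetical inputs):
# rng = np.random.default_rng(0)
# sample = make_ab_sample(recording, resynthesize(clean, ir, noise), rng)
```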
Our diffusion-based IR generator enables the creation of diverse, novel impulse responses under weak guidance and can be extended to a broad range of multi-modal applications. We adapt the model for text-controllable IR generation (i.e., text-to-IR) by replacing the audio encoder with a Flan-T5-XXL text encoder. The diffusion model cross-attends to the text embedding sequence.
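A minimal sketch of this conditioning path, assuming the Hugging Face transformers T5 encoder (a smaller flan-t5-base checkpoint stands in here for the paper's Flan-T5-XXL) and a hypothetical `diffusion_model.sample` call:

```python
# Sketch: encode a text prompt and hand the embedding sequence to the
# diffusion transformer's cross-attention. Checkpoint choice and the
# sampling call are illustrative assumptions.
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
text_encoder = T5EncoderModel.from_pretrained("google/flan-t5-base").eval()

prompt = "a large stone cathedral with long, bright reverberation"
tokens = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    text_emb = text_encoder(**tokens).last_hidden_state  # (1, L, D)

# The diffusion transformer would cross-attend to `text_emb` in each block:
# ir_spec = diffusion_model.sample(cond=text_emb)  # hypothetical API
```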
Below are the results of text-controllable IR generation. We generate four variations of IRs for each text prompt. Note the diversity of the generated IRs for the same text prompt.
| Dry Speech | Text Prompt | Gencho Text2IR | Gencho Text2IR Speech |
|---|---|---|---|
Clean speech and real-world recordings are from the DAPS clean speech corpus. IRs are from OpenAIR.