Gencho: Room Impulse Responses Generation from Reverberant Speech and Text via Diffusion Transformers

Jackie Lin1,2   Jiaqi Su2   Nishit Anand2,3   Zeyu Jin2   Minje Kim1   Paris Smaragdis4

Corresponding author's E-mail: jackiel4@illinois.edu.

1University of Illinois Urbana-Champaign
2Adobe Research
3University of Maryland
4MIT

[Figure: Model method diagram]


Abstract

Blind room impulse response (IR) estimation is a core task for capturing and transferring acoustic properties, yet existing methods often suffer from limited flexibility and mode collapse under unseen conditions. Furthermore, emerging generative audio applications call for more flexible impulse response generation methods. We propose Gencho, a complex-spectrogram diffusion-transformer model that estimates RIRs directly from reverberant speech. A structure-aware encoder separates early and late reflections to provide robust conditioning, while the diffusion decoder generates diverse and perceptually realistic responses. Gencho integrates modularly with standard speech-processing pipelines for end-to-end acoustic matching. Results show improved generalization and richer generated IRs than non-generative baselines. Lastly, we develop a text-controllable IR generation proof of concept, demonstrating Gencho's versatility for controllable acoustic simulation and generative audio applications.
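To make the architecture concrete, here is a minimal, hypothetical sketch of Gencho-style inference in PyTorch: a small denoiser stands in for the diffusion transformer and iteratively refines a noise sample conditioned on an embedding of the reverberant speech. Every module, shape, and schedule below is an illustrative assumption, not the released implementation.

```python
import torch

class TinyDenoiser(torch.nn.Module):
    """Stand-in for the diffusion transformer: predicts noise from the
    noisy IR representation, the timestep, and the speech conditioning."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim * 2 + 1, 256), torch.nn.SiLU(),
            torch.nn.Linear(256, dim))

    def forward(self, x_t, t, cond):
        t_emb = t.expand(x_t.shape[0], 1)
        return self.net(torch.cat([x_t, cond, t_emb], dim=-1))

@torch.no_grad()
def sample_ir(denoiser, cond, steps=50):
    """Plain DDPM-style ancestral sampling over a flattened
    IR representation (illustrative only)."""
    x = torch.randn_like(cond)                  # start from pure noise
    betas = torch.linspace(1e-4, 0.02, steps)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)
    for i in reversed(range(steps)):
        t = torch.tensor([i / steps])
        eps = denoiser(x, t, cond)              # predicted noise
        a_bar, b = alpha_bars[i], betas[i]
        x = (x - b / (1 - a_bar).sqrt() * eps) / (1 - b).sqrt()
        if i > 0:                               # add noise except at the last step
            x = x + b.sqrt() * torch.randn_like(x)
    return x

cond = torch.randn(1, 64)   # stand-in for the speech-encoder embedding
ir_repr = sample_ir(TinyDenoiser(), cond)
```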


Results


The audio examples below present the target RIRs and input reverberant speech alongside the generated RIRs and the corresponding resynthesized reverberant speech.



I. RIR Generation

We evaluate Gencho on RIRs from unseen datasets, specifically RIRs from the OpenAIR library and unseen clean speech utterances from the DAPS dataset. The audio examples are grouped by environment; within each pair of rows, the top row contains the impulse responses and the bottom row contains the corresponding reverberant speech.

Legend
  • Dry Speech - Clean, non-reverberant speech recordings
  • Input & Target - Ground truth IR and reverberant speech
  • FiNS - FiNS baseline model for blind IR estimation
  • FiNS+LN - Updated FiNS with Layer Normalization
  • FiNS+LN+AS - Updated FiNS with Layer Normalization and Audio Source separation (AS). This model takes two channels of input: early-reverberation speech and full-reverberation speech (see the sketch after this legend).
  • Gencho (ours) - "Gen(erative) (e)cho": our proposed diffusion-transformer blind IR estimation model (single-channel input). Shares the FiNS+LN encoder architecture.
  • Gencho+AS (ours) - Gencho with Audio Source separation (AS) as described above. Shares the FiNS+LN+AS encoder architecture.
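For intuition about the +AS conditioning, here is a hedged sketch of one way the two channels could be constructed, assuming "early-reverberation speech" means dry speech convolved with only the first few milliseconds of the RIR. The 5 ms boundary and all signals are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np
from scipy.signal import fftconvolve

def make_as_channels(dry, rir, sr=48000, early_ms=5.0):
    """Build the (early, full) conditioning pair from dry speech and an RIR."""
    n_early = int(sr * early_ms / 1000)
    early = fftconvolve(dry, rir[:n_early])[: len(dry)]   # ES channel
    full = fftconvolve(dry, rir)[: len(dry)]              # ES + LS channel
    return np.stack([early, full])                        # shape (2, T)

sr = 48000
dry = np.random.randn(sr)                                 # 1 s stand-in speech
rir = np.exp(-np.linspace(0, 8, sr // 2)) * np.random.randn(sr // 2)
two_channel_input = make_as_channels(dry, rir, sr)
```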
Table columns: Dry Speech | Input & Target | FiNS | FiNS+LN | FiNS+LN+AS | Gencho | Gencho+AS


II. Blind Estimation on Real-World Recordings

We evaluate the performance of Gencho on real-world recordings from the Device and Produced Speech (DAPS) dataset, in which clean speech was played aloud in various rooms and captured by a variety of recording devices.
We first use an audio source separation model to separate each recording into background noise (BK), early speech (ES), and late speech (LS). Depending on the model, the input is either the single-channel reverberant mixture (ES+LS) or two channels (ES, ES+LS).

Below, we show the results of each model. Each audio sample consists of the original recording concatenated with a resynthesized audio mix built from the clean DAPS speech, the generated IR, and the noise separated in the source separation step. The order of the ground-truth mixture and the resynthesized mixture is randomized. Try guessing which half of each audio sample is which!
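For reference, a minimal sketch of this resynthesis step, assuming a straightforward convolution of the clean speech with the generated IR followed by re-adding the separated noise stem; the peak normalization is an assumption for illustration.

```python
import numpy as np
from scipy.signal import fftconvolve

def resynthesize(clean, gen_ir, noise):
    """Convolve clean speech with the generated IR and add back the noise stem."""
    wet = fftconvolve(clean, gen_ir)[: len(clean)]
    wet = wet / (np.max(np.abs(wet)) + 1e-8)    # peak-normalize (assumption)
    return wet + noise[: len(wet)]

sr = 48000
clean = np.random.randn(2 * sr)                 # stand-in clean DAPS speech
gen_ir = np.exp(-np.linspace(0, 6, sr)) * np.random.randn(sr)
noise = 0.01 * np.random.randn(2 * sr)          # stand-in separated BK stem
mix = resynthesize(clean, gen_ir, noise)
```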

Legend
  • Location (Device) - Name of the recording location and device type
  • Clean - Clean dry speech that was played aloud in the room
  • FiNS - FiNS baseline model for blind IR estimation. 1ch input - (ES+LS)
  • FiNS+LN - Updated FiNS with Layer Normalization. 1ch input - (ES+LS)
  • FiNS+LN+AS - Updated FiNS with Layer Normalization and Audio Source separation (AS). 2ch input - (ES, ES+LS)
  • Gencho (ours) - Proposed diffusion-transformer blind IR estimation model. Shares FiNS+LN encoder architecture. 1ch input - (ES+LS)
  • Gencho+AS (ours) - Proposed diffusion-transformer blind IR estimation model. Shares FiNS+LN+AS encoder architecture. 2ch input - (ES, ES+LS)
  • FiNS+LN+AS -> Gencho+AS - Cascade of the two models. Same as Gencho+AS, but with an additional input: the first 5 ms of the FiNS+LN+AS output. This tests whether combining both models, with the regression model generating the early reflections as a prompt for the diffusion model, improves results (see the sketch after this legend).
  • Mixture - BK + ES + LS
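To clarify the cascaded variant, here is an illustrative sketch of its data flow. Both models are stubbed out with random placeholders; only the 5 ms prompt handoff is taken from the description above.

```python
import numpy as np

sr = 48000
fins_ln_as = lambda x: np.random.randn(sr // 2)         # stub: regression IR estimate
gencho_as = lambda x, prompt: np.concatenate(
    [prompt, np.random.randn(sr // 2 - len(prompt))])   # stub: diffusion IR

x = np.random.randn(2, sr)                  # (ES, ES+LS) conditioning input
early = fins_ln_as(x)[: int(0.005 * sr)]    # first 5 ms as the early-reflection prompt
ir = gencho_as(x, early)                    # diffusion model completes the IR
```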
Table columns: Location (Device) | Clean | FiNS | FiNS+LN | FiNS+LN+AS | Gencho | Gencho+AS | FiNS+LN+AS -> Gencho+AS | Mixture


III. Text-Controllable RIR Generation

Our diffusion-based IR generator enables the creation of diverse, novel impulse responses under weak guidance and can be extended to a broad range of multi-modal applications. We adapt the model for text-controllable IR generation (i.e., text-to-IR) by replacing the audio encoder with a Flan-T5-XXL text encoder. The diffusion model cross-attends to the text embedding sequence.
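A minimal sketch of this text-conditioning path, using a small Flan-T5 checkpoint as a stand-in for the Flan-T5-XXL encoder and a generic cross-attention layer in place of the actual diffusion transformer; the IR latent tokens are illustrative placeholders.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# "google/flan-t5-small" stands in for the Flan-T5-XXL encoder.
tok = AutoTokenizer.from_pretrained("google/flan-t5-small")
enc = T5EncoderModel.from_pretrained("google/flan-t5-small")

prompt = "a large stone cathedral with long, bright reverberation"
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    text_emb = enc(**ids).last_hidden_state          # (1, seq_len, d_model)

# Cross-attention: IR latent tokens (queries) attend to the text
# embeddings (keys/values), as the diffusion model does at each layer.
xattn = torch.nn.MultiheadAttention(embed_dim=text_emb.shape[-1],
                                    num_heads=4, batch_first=True)
ir_tokens = torch.randn(1, 128, text_emb.shape[-1])  # stand-in IR latents
fused, _ = xattn(ir_tokens, text_emb, text_emb)
```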

Below are the results of text-controllable IR generation. We generate four IR variations for each text prompt. Note the diversity of the generated IRs for the same text prompt.


Table columns: Dry Speech | Text Prompt | Gencho Text2IR | Gencho Text2IR Speech

Clean speech and real-world recordings are from the DAPS corpus. IRs are from OpenAIR.