LUCIER'S ROOM
Research Notes
Living document. Written first for myself, then for anyone curious. Updated as the work changes.
The original
Alvin Lucier's I Am Sitting in a Room (1969). A voice is recorded, played back into a room, re-recorded, repeated 32 times. The room's resonant frequencies destroy the speech and replace it with themselves. The room is passive. It doesn't interpret. It just has a geometry.
What is the LLM analog of this?
The problem with the metaphor/analogy
A physical room is often modeled as a fixed filter - but that's an approximation that holds only under stable conditions. Fixed source, fixed receiver, stable temperature, no movement. In practice room acoustics vary with all of these.
What Lucier's piece actually relies on is that the room is passive. No goal. No interpretation. It reshapes whatever passes through it consistently enough that the resonant frequencies accumulate across 32 iterations.
An LLM is not passive. Its weights are fixed within a single run - same parameters, same architecture. But what those weights compute depends entirely on what you feed in. The same model, different input, completely different computational path through its layers. The room changes shape depending on what you say to it. And if you change models - different training run, different size, different fine-tune - you have a different room altogether.
Two projects
The first treats the model as a room with a worldview. Same text fed in every iteration as a constant input. The model's own previous outputs accumulate as context. Over 32 iterations the response drifts. What the model amplifies, what it drops, what it converges toward - that's the room's resonant signature.
Even with no instructions, no system prompt, the model cannot receive text without "interpreting" it. Lucier's room had no semantics. This room has been shaped by its training data and has opinions. I haven't resolved whether that makes this a failed analog or an honest description of a different kind of room - a room that doesn't filter but interprets. It's the most interesting open question in the work.
The second goes inside the model. Instead of capturing what the model outputs as text, it captures what happens in the model's internal mathematics at each iteration - residual stream activations, attention patterns, logit distributions, how the representation of the original text drifts through the model's geometry over 32 passes.
This is LUCIER'S ROOM proper.
What going inside means
A language model processes text as sequences of numbers. Each word (or fragment of a word) becomes a vector - a point in a very high-dimensional space. As the text passes through the model's layers those vectors are transformed, mixed, and projected. By the final layer the model produces a probability distribution over its entire vocabulary: a ranking of what word it thinks comes next.
What gets captured at each iteration is the state of those internal vectors - the residual stream - at different points in the model's depth, and the shape of the probability distribution at the end. The residual stream is the running total of everything the model has computed so far; it accumulates contributions from each layer.
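The key measurement is cosine distance (assuming that's a good enough measure): how far has the model's internal representation of the original text drifted from its starting point as the iterations go on? A cosine distance of 0 means the two representations point in the same direction (cosine ignores magnitude, so this is not quite the same as identical). A cosine distance approaching 1 means they are close to orthogonal: the model is now processing something geometrically unrelated to what it started with. Values above 1, up to a maximum of 2, mean the vectors point in opposing directions.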
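A minimal sketch of the drift measurement, assuming the residual stream for one iteration is available as a [positions, d_model] array and is averaged over positions before comparison. The function names, the toy shapes, and the random data are illustrative assumptions, not the project's actual code.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Return 1 - cos(angle between a and b); the result lies in [0, 2]."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return float(1.0 - np.dot(a, b) / denom)

def mean_residual(resid: np.ndarray) -> np.ndarray:
    """Average a [positions, d_model] residual-stream array over positions."""
    return resid.mean(axis=0)

# Toy data standing in for captured activations (hypothetical shapes).
rng = np.random.default_rng(0)
seed_resid = rng.normal(size=(12, 768))
later_resid = seed_resid + 0.5 * rng.normal(size=(12, 768))

drift = cosine_distance(mean_residual(seed_resid), mean_residual(later_resid))
```

Because cosine distance ignores vector length, a representation that only grows or shrinks in magnitude registers as no drift at all. That is one reason to treat it as a good-enough measure rather than a complete one.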
The models as rooms
Different models produce different rooms. This turned out to matter more than expected.
GPT-2 was released in 2019 and trained on WebText - pages reached through outbound links posted on Reddit. Its resonant frequencies are conversational, forum-like. It collapses toward a particular kind of internet register fast.
Pythia was trained on the Pile, a deliberately mixed corpus, and its register is more distributed. Smaller models converge to attractors faster; larger ones take longer.
Qwen2.5 Instruct models are fine-tuned to be helpful assistants. Even without explicit instructions they drift toward response-like outputs.
Loop modes
Five different feedback structures, each producing mechanistically different activation patterns:
chain - output of iteration 1 becomes input to iteration 2. Seed appears once and never again. Direct Lucier structure.
accumulate - seed kept at the start of every iteration, outputs accumulate as context. The seed never disappears but gets buried.
attention-morph - words in the input are selectively replaced by words from the output, based on which words the model attended to most strongly. A threshold rises over iterations. The model is editing its own input.
instruct-compress - generate a long output (600 tokens), then apply an instruction: summarise in one paragraph. On instruct-tuned models this works - and importantly, the chat template is applied correctly so the model receives the instruction in the format it was trained on. On base models (GPT-2, Pythia) the instruction has no effect - the model treats it as more text to continue from.
kv-compress - works directly on the model's internal state. Captures the key-value cache, prunes it by attention weight, injects the pruned version back in. Compression happens on the computation, not the words.
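The first two loop modes differ only in what is fed back, which a short sketch makes concrete. Here `generate` is a hypothetical stand-in for a model call, `toy_generate` is a dummy used so the loops can run without a model, and joining the context with newlines is an assumption. Context-window truncation is not handled.

```python
from typing import Callable, List

def run_chain(seed: str, generate: Callable[[str], str], iterations: int = 32) -> List[str]:
    """chain: each output becomes the next input; the seed is seen only once."""
    outputs: List[str] = []
    text = seed
    for _ in range(iterations):
        text = generate(text)
        outputs.append(text)
    return outputs

def run_accumulate(seed: str, generate: Callable[[str], str], iterations: int = 32) -> List[str]:
    """accumulate: the seed stays first; all earlier outputs pile up after it."""
    outputs: List[str] = []
    for _ in range(iterations):
        context = "\n".join([seed] + outputs)
        outputs.append(generate(context))
    return outputs

def toy_generate(prompt: str) -> str:
    """Dummy model: returns the last three words of whatever it is given."""
    return " ".join(prompt.split()[-3:])
```

In chain the seed survives only through what the first output kept of it. In accumulate the seed is always present, but its share of the context shrinks with every iteration.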
The audio expansion - three paths
The third phase adds audio. A voice recording is fed in alongside the text. At each of the 32 iterations the audio is transformed using information derived from the model's internal state.
How do you turn a language model's internal geometry into an acoustic filter?
Path 1 derives a single impulse response from the model's weight matrices before any text is processed. Computed once, applied identically across all 32 iterations. Closest structural analog to Lucier. One room, one fixed signature.
Path 2 derives a new impulse response at every iteration from the model's activation state as it processes that iteration's text. Because the model's computation genuinely changes with each input, these 32 IRs are 32 different rooms. The drift in the IR is the content.
Path 3 maps the model's probability distribution over its vocabulary to a spectral envelope - a shaping of the audio's frequency content. As the model converges toward its attractor vocabulary the distribution narrows. The voice is progressively constrained by the model's statistical certainty.
All three run simultaneously on the same model pass, producing three parallel audio chains from the same 32 iterations.
Voice input
Two methods. Upload a WAV file, or use F5-TTS voice cloning - upload a short reference recording, the system clones the voice and synthesizes the model's text output in it at each iteration. The model generates text; the text is synthesized as speech in the voice of the person who started the process; that audio is convolved with an IR derived from the model's own geometry.
The voice is a variable in a way it isn't in Lucier. In Lucier it is his voice, in his room. Here it can be a recording made in a specific room (adding a second acoustic layer before the model touches it), or a synthesized voice driven by the model's own output.
Why these four hooks
hook_resid_post is the right primary probe - the residual stream after both attention and MLP have contributed. The most information-dense single read point per layer. It's what produces the drift curves, the per-layer fan lines, and the seed token heatmap.
hook_pattern gives the attention distribution at the final token position - which tokens the model is attending to when predicting what comes next. This drives the attention-morph mode (words the model attends to strongly survive; others are replaced) and the attention-to-seed metric.
hook_k and hook_v are there for a specific instrumental reason: the kv-compress loop mode needs them. It captures the full KV cache, prunes positions by attention weight, and re-injects the pruned state. Without intercepting K and V, that mode doesn't work. They're not hooked for observation - they're hooked because the system needs to manipulate them.
hook_q is not hooked because nothing in the app needs to intercept it. hook_pattern already gives the result of the Q-K dot products after softmax - the attention distribution is visible without capturing the raw queries. Q without K doesn't tell you much. Adding it would only make sense for sonification work that wanted to compute Q-K compatibility scores directly rather than reading the resulting pattern - a future expansion maybe, not a current need.
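The pruning step that kv-compress relies on can be sketched on toy arrays. This is an assumption-heavy illustration: it handles one head of one layer, takes the final-token attention weights as the pruning score, and uses a hypothetical `keep_ratio` parameter. How the real system chooses its threshold, handles multiple heads and layers, and re-injects the pruned cache is not specified in these notes.

```python
import numpy as np

def prune_kv(keys: np.ndarray, values: np.ndarray, attn: np.ndarray,
             keep_ratio: float = 0.5):
    """Keep the most-attended cache positions, preserving their original order.

    keys, values: [positions, d_head] arrays for a single head.
    attn: [positions] attention weights from the final token position.
    """
    n_positions = attn.shape[0]
    n_keep = max(1, int(round(n_positions * keep_ratio)))
    # Indices of the n_keep largest weights, then sorted back into sequence order.
    top = np.argsort(attn)[-n_keep:]
    kept = np.sort(top)
    return keys[kept], values[kept], kept

keys = np.arange(12, dtype=float).reshape(6, 2)
values = keys * 10.0
attn = np.array([0.05, 0.30, 0.10, 0.25, 0.05, 0.25])
pruned_k, pruned_v, kept = prune_kv(keys, values, attn, keep_ratio=0.5)
```

Restoring sequence order after selection matters because the surviving keys and values still need to line up with their positions when the cache is reused.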
Trajectory mode
In standard mode, activations are captured once per iteration on the full context. In trajectory mode, the residual stream is also captured at each token step during generation - one measurement per generated token, not just per iteration.
This gives a much higher-resolution picture of how the model's internal state moves as it generates each word. The per-iteration measurement shows the room's signature after each full pass. Trajectory mode shows the moment-to-moment drift within a single generation step.
It's slower - one extra forward pass per token - and only available on modes that use streaming generation.
Audio technical specifics
Format: WAV, 44100Hz, mono, 32-bit float throughout.
Input: Any WAV (mono or stereo, any sample rate), converted internally to 44100Hz mono on load.
IR length: 4096 samples = 93ms at 44100Hz. All kernels are padded or trimmed to this length before convolution. For mean-residual-direct on small models (d_model=768, e.g. pythia-160m and gpt2), the kernel is 768 samples and zero-padded to 4096.
Convolution: scipy.signal.fftconvolve(audio, kernel, mode='full'), output trimmed to input length. Path 3 uses overlap-add STFT rather than direct convolution - the spectral envelope is applied in the frequency domain.
Generation defaults: temperature 0.8, max_new_tokens defaults to seed token length (~115 tokens depending on model tokenizer).
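The IR length and convolution settings above can be sketched as follows. The helper names `fit_kernel` and `apply_ir` are illustrative, and the sketch applies no gain normalisation, which the real pipeline may or may not do.

```python
import numpy as np
from scipy.signal import fftconvolve

IR_LENGTH = 4096  # samples; 4096 / 44100 is roughly 93 ms

def fit_kernel(kernel: np.ndarray, length: int = IR_LENGTH) -> np.ndarray:
    """Zero-pad or trim a 1-D kernel to exactly `length` samples."""
    kernel = np.asarray(kernel, dtype=np.float32)
    if kernel.shape[0] >= length:
        return kernel[:length]
    return np.pad(kernel, (0, length - kernel.shape[0]))

def apply_ir(audio: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Convolve mono audio with the fitted kernel and trim to the input length."""
    wet = fftconvolve(audio, fit_kernel(kernel), mode="full")
    return wet[: audio.shape[0]].astype(np.float32)
```

A 768-sample kernel from a d_model=768 model ends up as 768 values followed by 3328 zeros. Trimming the 'full' output to the input length discards the reverb tail that would otherwise extend past the end of the recording.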
What the interface shows
Both LUCIER'S ROOM and Lucier Expanded use the same three-column layout: controls left, text drift centre, visualisations right. Four panels on the right, all reading from the same activation snapshot:
- Hidden state drift - the primary signal. Solid line: cosine distance of the mean residual stream from the seed reference. Fan lines: per-layer drift. Dashed line: fraction of attention directed at seed token positions. In trajectory mode, purple dots show per-token drift during generation.
- IR waveform / envelope - time-domain display of the current IR kernel (paths 1 and 2) or spectral envelope (path 3). Path 1 is static across iterations; path 2 updates every iteration.
- Seed token drift - heatmap. Each row is one token position in the seed. Column is iteration. Brightness encodes how far that position's representation has drifted.
- Logit distribution - top-k token probability heatmap. Rows are tokens, columns are iterations. Tokens stabilizing across iterations are the model's attractor vocabulary.
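The dashed attention-to-seed line in the first panel can be computed roughly as below. This sketch assumes the seed occupies the first `seed_len` positions of the context (as in accumulate mode), that the pattern for one layer has shape [heads, query_pos, key_pos], and that heads are simply averaged. The averaging choice is an assumption, not something these notes specify.

```python
import numpy as np

def attention_to_seed(pattern: np.ndarray, seed_len: int) -> float:
    """Fraction of final-token attention that lands on the seed positions.

    pattern: [heads, query_pos, key_pos] attention weights for one layer,
             where each query row sums to 1.
    """
    final_row = pattern[:, -1, :]                     # attention from the last token, per head
    per_head = final_row[:, :seed_len].sum(axis=-1)   # mass on the seed, per head
    return float(per_head.mean())

# Toy pattern: 2 heads, 3 positions, only the final query row filled in.
pattern = np.zeros((2, 3, 3))
pattern[0, -1] = [0.5, 0.25, 0.25]
pattern[1, -1] = [0.25, 0.25, 0.5]
fraction = attention_to_seed(pattern, seed_len=2)
```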
When audio from a completed run plays back, the corresponding words in the seed text highlight in real time, synchronized to the audio duration. The body and the text run on the same timeline.
A score view (View C) shows all 32 iterations as a table: full output text, cosine distance from seed, top predicted tokens.
What the signals actually tell you
Four signals are captured. I track three of them closely; the fourth, the KV cache, I'm less sure how to read:
Residual stream drift - direct measurement of how the model's internal representation of the original text changes as context accumulates. If the curve plateaus, the model has found an attractor. Whether that's meaningful or a repetition artifact of a small fine-tuned model is not always clear from the curve alone.
Logit distribution - the model's probability mass over its vocabulary. Watching it narrow over 32 iterations is watching the room's resonant frequencies assert themselves.
Seed token drift - how far the representation at each position in the original text drifts across iterations. Which words survive; which decay.
KV cache - I'm less certain. It's a computational artifact. Drift in KV cache state tells you something about how the model's processing has changed but what exactly is harder to interpret cleanly.
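One way to put a number on the narrowing of the logit distribution described above is Shannon entropy, which falls as probability mass concentrates on fewer tokens. Using entropy here is a suggestion on my part, not a measurement the notes say is implemented, and the helper names are illustrative.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert raw logits to probabilities, shifting by the max for stability."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def entropy_bits(probs: np.ndarray) -> float:
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    p = probs[probs > 0]
    return float(-(p * np.log2(p)).sum())

flat = softmax(np.zeros(8))                               # uniform over 8 tokens
peaked = softmax(np.array([10.0, 0, 0, 0, 0, 0, 0, 0]))   # mass on one token
```

Computed per iteration, a falling entropy curve would show the vocabulary collapsing toward the attractor tokens. It would not by itself separate a genuine attractor from a repetition loop.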
Open questions
Is the room that interprets a failed analog or just a different piece? A model cannot receive text without doing something semantic with it. Lucier's room had no semantics. I haven't resolved whether that makes this work a broken version of the same idea or an honest description of a different kind of space.
What is the voice? In Lucier it's his voice, in his room. Here it can be a recording made in a specific room, which adds a second acoustic layer before the model touches it. Or a voice synthesized from a reference, driven by the model's own output. Each is a different claim and I haven't decided which one I want to make.
When is a plateau an attractor and when is it collapse? Small instruct models sometimes lock into repetitive output on the first iteration. The measurements don't clearly distinguish a genuine attractor from a fine-tuning artifact. I want better tools for this.
References
- Lucier, Alvin. I Am Sitting in a Room. 1969. Performed at the Guggenheim Museum, New York.
- Nanda, Neel et al. TransformerLens. https://github.com/TransformerLensOrg/TransformerLens
- Gao, Leo et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. 2020. https://arxiv.org/abs/2101.00027
- Biderman, Stella et al. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling. 2023. https://arxiv.org/abs/2304.01373
- Kowalczyk, Konrad et al. An Efficient Parameterization of the Room Transfer Function. 2015. https://arxiv.org/pdf/1505.04385 - on why the LTI assumption for rooms is an approximation that breaks in practice
- Prawda, Karolina et al. Time Variance in Measured Room Impulse Responses. FA2023. https://dael.euracoustics.org/confs/fa2023/data/articles/000398.pdf
Last updated: 2026-05-30. Ongoing.