A Dual-Stream GRU-Conformer Architecture for Brain-to-Text Decoding from Utah Array Recordings

Anonymous submission to Interspeech 2026

Abstract

Decoding intended speech from intracortical neural signals is critical for restoring communication in individuals with Amyotrophic Lateral Sclerosis (ALS). Existing approaches concatenate multiunit threshold crossings and spike band power into a single stream, conflating two complementary neural modalities.

We propose a Dual-Stream GRU-Conformer that encodes each feature independently through parallel bidirectional GRU branches, exchanges cross-modal information via cross-stream attention, and fuses representations through a Conformer encoder. At inference, a triple-LM rescoring pipeline combining a 5-gram language model, Whisper-large-v3, and Qwen2.5-72B-Instruct reranks beam search hypotheses.

Evaluated on the Brain-to-Text Benchmark '24, our system achieves a word error rate (WER) of 9.38%, outperforming the NPTL baseline (9.76%).

Index Terms: human-computer interaction, Utah Array, brain-to-text

Proposed Method

System Overview

The system follows the cascade paradigm: a neural encoder trained with connectionist temporal classification (CTC) maps intracortical feature sequences to per-timestep phoneme logits, which a language-model pipeline then decodes into words.

Input Preprocessing

Each input trial is represented as X ∈ ℝ^(T×256), where T is the number of 10 ms time bins and 256 is the number of electrode channels:

  • First 128 channels: multiunit threshold crossings (tx1) — proxy for spike count
  • Remaining 128 channels: spike band power (spikePow) — high-frequency local field potential energy

A per-day affine calibration layer is applied to compensate for inter-session non-stationarities. Both streams are then temporally compressed with a sliding window (kernel K=32 bins, stride S=4 bins), giving an effective frame rate of ~25 Hz (one frame every 4 × 10 ms = 40 ms).
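The windowing arithmetic can be sketched in NumPy. This is a minimal sketch, assuming each window is flattened into one vector per frame and that no padding is applied; the function name `window_frames` and the 500-bin example trial are illustrative and not taken from the system.

```python
import numpy as np

def window_frames(x, kernel=32, stride=4):
    """Stack overlapping windows of `kernel` bins, taken every `stride` bins.

    x: (T, C) array of binned features for one stream, with T >= kernel.
    Returns an array of shape (num_frames, kernel * C),
    where num_frames = (T - kernel) // stride + 1.
    """
    n_bins, _ = x.shape
    num_frames = (n_bins - kernel) // stride + 1
    starts = np.arange(num_frames) * stride
    return np.stack([x[s:s + kernel].reshape(-1) for s in starts])

# Hypothetical trial: 500 bins of 10 ms (5 s) for one 128-channel stream.
rng = np.random.default_rng(0)
x_tx1 = rng.poisson(1.0, size=(500, 128)).astype(np.float32)

frames = window_frames(x_tx1)
frame_rate_hz = 1.0 / (4 * 0.010)  # stride of 4 bins x 10 ms per bin

print(frames.shape, frame_rate_hz)
```

With these settings a 5 s trial yields (500 - 32) // 4 + 1 = 118 frames, each covering 320 ms of context.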

Dual-Stream GRU Encoder

Rather than concatenating the two feature modalities, each stream is processed independently:

  • H^tx1 = BiGRU_tx1(X^tx1) — 5 layers, hidden size 512 per direction
  • H^sp = BiGRU_sp(X^sp) — 5 layers, hidden size 512 per direction

Cross-Stream Attention

A bidirectional cross-stream attention module allows each modality to attend to information in the other:

  • H̃^tx1 = H^tx1 + MHA(H^tx1, H^sp, H^sp)
  • H̃^sp = H^sp + MHA(H^sp, H^tx1, H^tx1)
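The two residual updates above can be illustrated with a small NumPy sketch. It assumes a single attention head, randomly initialised projection matrices, and toy dimensions; the actual module is multi-head and its projections are learned. The names `cross_attend` and `init_projections` are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query_stream, context_stream, w_q, w_k, w_v):
    """Residual single-head cross-attention: H_q + Attn(H_q, H_c, H_c)."""
    q = query_stream @ w_q
    k = context_stream @ w_k
    v = context_stream @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (T_q, T_c) scaled dot products
    return query_stream + softmax(scores) @ v

rng = np.random.default_rng(0)
T, d = 50, 64  # toy sizes; a BiGRU with 512 units per direction would give d = 1024

def init_projections():
    # Hypothetical random weights standing in for learned Q/K/V projections.
    return [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]

h_tx1 = rng.standard_normal((T, d))
h_sp = rng.standard_normal((T, d))

# Both updates read the original (pre-attention) streams, as in the equations above.
h_tx1_tilde = cross_attend(h_tx1, h_sp, *init_projections())
h_sp_tilde = cross_attend(h_sp, h_tx1, *init_projections())
```

Each direction has its own projections in this sketch, so the tx1-to-spikePow and spikePow-to-tx1 attention maps are not tied.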

Conformer Encoder

The cross-attended streams are concatenated, projected to dimension D_c = 512, and passed through N=2 Conformer blocks (8-head MHSA, depthwise convolution with kernel size 15 and GLU). The output is projected to C+1 = 41 logits for CTC (C = 40 output classes plus the CTC blank).
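To show how 41-way per-frame logits map to a label sequence, here is a minimal greedy CTC decoding sketch. The system itself decodes with beam search under a language model, so this is only an illustration of the CTC collapse rule. Treating index 0 as the blank is an assumption.

```python
BLANK = 0  # assumption: index 0 is the CTC blank; indices 1..40 are the output classes

def ctc_greedy_decode(frame_logits):
    """Pick the argmax class per frame, merge repeats, then drop blanks."""
    best_path = [max(range(len(row)), key=row.__getitem__) for row in frame_logits]
    decoded, previous = [], None
    for idx in best_path:
        if idx != previous and idx != BLANK:
            decoded.append(idx)
        previous = idx
    return decoded

def one_hot_logits(index, num_classes=41):
    # Helper for the toy example: a logit row whose argmax is `index`.
    return [1.0 if c == index else 0.0 for c in range(num_classes)]

# Toy best path: blank, 3, 3, blank, 3, 7, 7, blank
toy_logits = [one_hot_logits(i) for i in [0, 3, 3, 0, 3, 7, 7, 0]]
labels = ctc_greedy_decode(toy_logits)
print(labels)
```

The blank between the two runs of class 3 is what lets CTC emit the same class twice in a row.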

Triple-LM Rescoring

At inference, beam search under a Kaldi 5-gram LM produces a 100-best list of hypotheses. Each hypothesis is rescored by linearly combining:

  • CTC log-likelihood
  • 5-gram LM score
  • Whisper-large-v3 log-probability (speech-domain prior)
  • Log-perplexity under Qwen2.5-72B-Instruct (quantized to 4-bit NF4)
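The linear combination can be sketched as follows. The weights and scores are invented for illustration; the tuned values are not given here. Because log-perplexity is lower for better hypotheses, this sketch assumes it enters with a negative weight.

```python
def combined_score(hypothesis, weights):
    """Weighted sum of the per-model scores attached to one hypothesis."""
    return sum(weights[name] * hypothesis["scores"][name] for name in weights)

def rescore(hypotheses, weights):
    """Return the hypotheses sorted from best to worst combined score."""
    return sorted(hypotheses, key=lambda h: combined_score(h, weights), reverse=True)

# Hypothetical weights (not the tuned values). Log-probabilities are "higher is better";
# log-perplexity is "lower is better", hence the negative weight.
weights = {"ctc": 1.0, "ngram": 0.5, "whisper": 0.3, "llm_log_ppl": -1.0}

# Hypothetical 2-best list standing in for the 100-best list.
n_best = [
    {"text": "i want to go hone",
     "scores": {"ctc": -11.5, "ngram": -26.0, "whisper": -22.0, "llm_log_ppl": 5.5}},
    {"text": "i want to go home",
     "scores": {"ctc": -12.0, "ngram": -20.0, "whisper": -15.0, "llm_log_ppl": 3.0}},
]

ranked = rescore(n_best, weights)
print(ranked[0]["text"])
```

In this toy case the second hypothesis has a slightly worse CTC score but wins once the three language-model terms are added.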

Experiments

Dataset

Brain-to-Text Benchmark '24: 12,100 sentences of intended speech recorded from a single ALS participant via 256 Utah Array electrodes in speech-related motor cortex (area 6V). Split: 8,800 training / 880 test samples. WER is evaluated on 1,200 held-out sentences.
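For reference, WER is the word-level edit distance between hypothesis and reference divided by the number of reference words. The sketch below assumes corpus-level aggregation (total errors over total reference words) and whitespace tokenisation; the benchmark's exact text normalisation is not reproduced here.

```python
def word_errors(reference, hypothesis):
    """Return (edit distance in words, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1], len(ref)

def corpus_wer(pairs):
    """Total word errors divided by total reference words."""
    errors = words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    return errors / words

pairs = [
    ("the quick brown fox", "the quick brown box"),  # 1 substitution
    ("hello world", "hello"),                        # 1 deletion
]
wer = corpus_wer(pairs)
print(f"{100 * wer:.2f}% WER")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.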

Results

Method | WER (%)
NPTL PyTorch Baseline (5-gram + OPT-6.7B) | 9.76
Dual-stream GRU-Conformer (ours) | 9.38

Data Augmentation

Four-stage augmentation pipeline:

  1. Additive noise: white noise + per-channel constant offset
  2. Temporal CutMix: pastes contiguous temporal segments across samples
  3. Input Mixup: linearly interpolates pairs of training samples (Beta(0.3, 0.3))
  4. SpecAugment: masks up to 20 random time spans and 2 random channel spans (up to 40 channels each)
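Stages 3 and 4 can be sketched in NumPy as below. Several details are assumptions: the maximum width of a time span (10 bins here), truncation of the two trials to a common length before mixing, and zero-filling of masked regions. How the mixing coefficient is applied to the CTC targets is not specified above, so the sketch simply returns it.

```python
import numpy as np

def input_mixup(x_a, x_b, rng, alpha=0.3):
    """Interpolate two trials with a coefficient drawn from Beta(alpha, alpha)."""
    lam = float(rng.beta(alpha, alpha))
    n_bins = min(len(x_a), len(x_b))  # assumption: truncate to the shorter trial
    return lam * x_a[:n_bins] + (1.0 - lam) * x_b[:n_bins], lam

def mask_spans(x, rng, max_time_spans=20, max_time_width=10,
               n_chan_spans=2, max_chan_width=40):
    """Zero out random time spans and channel spans (SpecAugment-style)."""
    x = x.copy()
    n_bins, n_chan = x.shape
    for _ in range(int(rng.integers(0, max_time_spans + 1))):  # "up to" 20 spans
        width = int(rng.integers(1, max_time_width + 1))       # width limit is assumed
        start = int(rng.integers(0, max(n_bins - width, 0) + 1))
        x[start:start + width, :] = 0.0
    for _ in range(n_chan_spans):
        width = int(rng.integers(1, max_chan_width + 1))       # up to 40 channels
        start = int(rng.integers(0, max(n_chan - width, 0) + 1))
        x[:, start:start + width] = 0.0
    return x

rng = np.random.default_rng(0)
trial_a = rng.standard_normal((200, 256))  # hypothetical trials of different lengths
trial_b = rng.standard_normal((180, 256))

mixed, lam = input_mixup(trial_a, trial_b, rng)
masked = mask_spans(mixed, rng)
```

With Beta(0.3, 0.3) most draws of the coefficient fall near 0 or 1, so a mixed trial is usually dominated by one of its two sources.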

Optimization

  • AdamW (β1=0.9, β2=0.98, ε=10^-8, weight decay=10^-4)
  • 5,000-step linear warmup to 3×10^-4, cosine decay to 1%
  • Trained on 8× A100 40GB GPUs
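The learning-rate schedule can be written as a single function of the step index. This sketch reads "cosine decay to 1%" as decaying to 1% of the peak rate, which is an assumption, and the total of 100,000 steps is a placeholder because the training length is not stated.

```python
import math

def learning_rate(step, total_steps, peak=3e-4, warmup=5000, floor_frac=0.01):
    """Linear warmup to `peak`, then cosine decay to `floor_frac * peak`."""
    if step < warmup:
        return peak * step / warmup
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return peak * (floor_frac + (1.0 - floor_frac) * cosine)

TOTAL_STEPS = 100_000  # hypothetical; not given in the text

for step in (0, 2500, 5000, 52500, 100_000):
    print(step, f"{learning_rate(step, TOTAL_STEPS):.2e}")
```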

Technical Report: Dual-Stream GRU-Conformer (SMU Research Framing - Do Tri Nhan, March 2026)

Dataset Overview

  • Source: Dryad Repository — High-performance speech neuroprosthesis dataset
  • Baseline: 9.1% WER (50-word vocab) / 23.8% WER (125,000-word vocab)
  • Total: 10,880 trials — Train: 8,800 / Test: 880 / Competition Holdout: 1,200 (unlabeled)
  • Neural activity from motor cortex at 20ms bin size; features: 128-ch tx1 + 128-ch spikePow

Feature | Meaning | Details
spikePow | Spike band power | Mean squared voltage, high-pass filtered at 250 Hz, linear regression denoised
tx1–tx4 | Threshold crossing | Counts of voltage crossings at -3.5, -4.5, -5.5, -6.5 × RMS threshold
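As a rough illustration of how the two feature types are derived from a single channel, the sketch below counts threshold crossings and computes mean squared voltage per bin. It assumes the voltage trace is already filtered, that the RMS is taken over the whole trace, and a 30 kHz sampling rate; the linear regression denoising step is omitted. These choices are illustrative and not taken from the dataset's processing code.

```python
import numpy as np

def threshold_crossings(voltage, fs_hz, bin_ms=20.0, multiplier=-3.5):
    """Count downward crossings of multiplier * RMS in each time bin."""
    threshold = multiplier * np.sqrt(np.mean(voltage ** 2))
    below = voltage < threshold
    onsets = np.flatnonzero(below[1:] & ~below[:-1]) + 1  # first sample below threshold
    samples_per_bin = int(round(fs_hz * bin_ms / 1000.0))
    n_bins = len(voltage) // samples_per_bin
    onsets = onsets[onsets < n_bins * samples_per_bin]    # drop the partial last bin
    return np.bincount(onsets // samples_per_bin, minlength=n_bins)

def spike_band_power(voltage, fs_hz, bin_ms=20.0):
    """Mean squared voltage in each time bin."""
    samples_per_bin = int(round(fs_hz * bin_ms / 1000.0))
    n_bins = len(voltage) // samples_per_bin
    trimmed = voltage[:n_bins * samples_per_bin]
    return (trimmed.reshape(n_bins, samples_per_bin) ** 2).mean(axis=1)

# Synthetic 40 ms trace at an assumed 30 kHz: three negative spikes on a flat baseline.
fs_hz = 30_000
voltage = np.zeros(1200)
voltage[[100, 700, 900]] = -10.0

tx1_counts = threshold_crossings(voltage, fs_hz)  # -3.5 x RMS
power = spike_band_power(voltage, fs_hz)
print(tx1_counts, power)
```

Passing `multiplier=-4.5`, `-5.5`, or `-6.5` gives the tx2 to tx4 variants listed in the table.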

Baseline Approach (Hybrid)

  1. Acoustic Model: RNN (GRU) + CTC loss
  2. First Pass: Beam Search + 5-gram Language Model
  3. Second Pass: N-best hypotheses rescoring using LLMs (OPT 6.7B or Llama)

Experimental Log

ID | Model/Backbone | Approach | Result/WER
Sub 1 | Baseline Text | Leaderboard submission | 15%
Sub 3 | GRU + Llama Rescore | Reproduced 2024 Baseline | 9.76%
Sub 4 | Zipformer | Heavy backbone experiment | 17%
Sub 5 | Conformer (Deep) | 170M params | Overfit
Sub 6a-d | Conformer (Scaled) | Reduced params (33M), added layers | Slow convergence
Sub 7a | SwiGLU + RoPE + MixUp | Advanced Regularization | Instability
Sub 7b | Parameter-Limited | Params ≤ N×3000 (24M) | CER 25% (Stalled)
Sub 8 | Dual-Stream Cross-Attention | tx1/spikePow attention fusion | Failed
Sub 8a/b | Dual-Stream GRU | Pipeline parity with architecture swap | 9.38%
Sub 14 | WhisperBrain | Whisper decoder as LM | Failed (>200%)
Sub 16 | Dual Conformer CTC | Gaussian smoothing + LayerDrop | Failed
Sub 17 | Enhanced Dual-Stream | Cross-stream attention + AdamW | Progressing…

Evaluation Strategies

  1. Standard Eval: 5-gram LM + OPT-6.7B rescoring
  2. Whisper LM: Whisper decoder as speech-domain language model fed silent audio
  3. Strong LLM: Two-stage — Llama-3.1-8B perplexity scoring + Instruction Correction for phoneme confusion
  4. Triple-LM: Ensemble of 5-gram + Whisper + Llama-70B (requires 2×40GB GPUs)

Future Directions

Focus Area | Proposed Hypothesis / Implementation
Feature Decoupling | Separate tx1 and spikePow into parallel encoders to learn distinct temporal dynamics before fusion
Constrained Decoding | Apply 50-word fixed vocabulary constraint during beam search
Backbone Evolution | Transition from GRU to Conformer backbone while maintaining CTC pipeline stability
Ensemble Methods | Cross-model ensembling combining multiple LLM rescorers (Llama + OPT)
Augmentation Logic | Formulate theoretical justification for each augmentation choice

Backlog: Test-Time Augmentation (TTA) with white noise averaging (potential 5–10% WER reduction) · WandB integration · Phoneme-to-Phoneme mapping codecs · Constrained beam search for closed 50-vocabulary

Discussion: Why Does Whisper ASR Fail?

Frozen Whisper decoder yields WER >200% (Sub 14/15). Hypotheses:

  1. The latent neural feature distribution differs significantly from the speech-derived Mel-spectrograms that Whisper was trained on
  2. 8,800 samples are insufficient to re-align the Whisper decoder without catastrophic forgetting or extreme overfitting

Research Proposal: Biosignal-Enabled Speech Synthesis (SMU PhD)

Research Scope

Focuses on Biosignal-Enabled Speech Synthesis — enabling spoken communication by predicting acoustic speech signals from biosignals, for:

  • Individuals after total laryngectomy (permanent loss of vocal fold function)
  • Patients with ALS or locked-in syndrome
  • Individuals with neurological/neuromuscular disorders

Comparative Taxonomy: BCI vs SSI

Interface | Brain-Computer Interface (BCI) | Silent Speech Interfaces (SSI)
Signal Input | EEG, ECoG, fMRI, MEG, Utah array, Neuralink | EMA, EMG, High-Speed Nasopharyngoscopy, Lip video
Data Attribute | Very low amplitude (10–100 µV), low SNR | Higher amplitude (0.1–5 mV), superior SNR
Physiological Origin | Central nervous system | Peripheral nervous system
Frequency | Non-invasive: 0.5–40 Hz; Invasive: kHz | 20–500 Hz

Silent Speech Interfaces (SSI) with EMG

The Ground-Truth Paradox (Gaddy et al.): Models trained on vocalized facial EMG must generalize to silent facial EMG — a severe domain shift (WER 66% baseline).

Transfer Strategy: Silent EMG → Vocalized EMG → Audio mapping achieves ~3.6% WER on closed vocabulary (20% relative reduction on open vocabulary).

MONA + LISA (2024 SOTA):

Test Set | Previous SOTA (WER) | MONA + LISA (WER)
Silent Speech (Open Vocab) | 28.8% | 12.2%
Vocalized EMG | 23.3% | 3.7%

Brain-Computer Interface (BCI) with Invasive Utah Array

Dataset: Brain-to-Text Benchmark '24 (80 GB, 12,100 sentences, 256-electrode Utah array, ALS participant, area 6V motor cortex).

Leaderboard:

Method | WER (%)
Baseline (Willett et al.) | 9.1 (50-word) / 23.8 (125k-word)
Innerspeech | 10.08
Stanford LISA | 8.93
Okubo Lab (CIBR) | 8.26
UCLA NECL | 5.68
BraIn-to-Text (BIT) | 5.10

BCI with Non-Invasive EEG

Speech Modalities:

  • Overt/Spoken Speech: Clear ground truth but prone to muscle artifacts
  • Auditory Perception: Analyzing neural responses to heard speech (Brennan dataset)
  • Covert/Imagined Speech: Most complex, lowest SNR — core focus for assistive communication

Technical Bottlenecks: Low SNR · Generalization deficit · Temporal alignment paradox (no physical onset in imagined speech)

Key Datasets: Chisco (largest open-source imagined speech) · MSEEG · VocalMind

Architecture Evolution:

  • Traditional: Raw EEG → Feature Extraction → Neural Decoder → Mel-spectrogram → Vocoder → Waveform
  • Modern (2024–2025): Diffusion-based generation (Brain2Speech Diffusion on ECoG/EEG) · Hybrid Fusion with DTW aligning imagined neural sequences with ASR-TTS pipelines