Papers
arxiv:2510.09095

Neural Codecs as Biosignal Tokenizers

Published on Oct 10, 2025
Authors:
,
,
,
,
,
,

Abstract

Neurophysiological recordings such as EEG yield high-dimensional, noisy time-series data that require extensive pre-processing and handcrafted feature extraction to reveal meaningful information. Recently, there has been a surge of interest in applying representation learning techniques from large pre-trained (foundation) models to effectively decode and interpret biosignals. We discuss the challenges posed for incorporating such methods and introduce BioCodec, an alternative representation learning framework inspired by neural codecs to capture low-level signal characteristics in the form of discrete tokens. Pre-trained on thousands of EEG hours, BioCodec shows efficacy across multiple downstream tasks, ranging from clinical diagnostic tasks and sleep physiology to decoding speech and motor imagery, particularly in low-resource settings. Additionally, we provide a qualitative analysis of codebook usage and estimate the spatial coherence of codebook embeddings from EEG connectivity. Notably, we also document the suitability of our method to other biosignal data, i.e., electromyographic (EMG) signals. Overall, the proposed approach provides a versatile solution for biosignal tokenization that performs competitively with state-of-the-art models. The source code and model checkpoints are shared.

AI-generated summary

Neurophysiological recordings such as electroencephalography (EEG) offer accessible and minimally invasive means of estimating physiological activity for applications in healthcare, diagnostic screening, and even immersive entertainment. However, these recordings yield high-dimensional, noisy time-series data that typically require extensive pre-processing and handcrafted feature extraction to reveal meaningful information. Recently, there has been a surge of interest in applying representation learning techniques from large pre-trained (foundation) models to effectively decode and interpret biosignals. We discuss the challenges posed for incorporating such methods and introduce BioCodec, an alternative representation learning framework inspired by neural codecs to capture low-level signal characteristics in the form of discrete tokens. Pre-trained on thousands of EEG hours, BioCodec shows efficacy across multiple downstream tasks, ranging from clinical diagnostic tasks and sleep physiology to decoding speech and motor imagery, particularly in low-resource settings. Additionally, we provide a qualitative analysis of codebook usage and estimate the spatial coherence of codebook embeddings from EEG connectivity. Notably, we also document the suitability of our method to other biosignal data, i.e., electromyographic (EMG) signals. Overall, the proposed approach provides a versatile solution for biosignal tokenization that performs competitively with state-of-the-art models. The source code and model checkpoints are shared.

Community

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2510.09095 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2510.09095 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2510.09095 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.