Last Updated on September 11, 2022
Current technology for decoding brain signals into natural language relies on invasively placing electrodes on the surface of the brain. These devices can decode phonemes, speech sounds, hand gestures, and articulatory movements, at a rate of around 15 words per minute.
Invasive devices achieve a high signal-to-noise ratio because they sit close to the source. However, they require brain surgery and are hard to maintain, so they are not a long-term solution.
Non-invasive devices, by contrast, can be built as wearables, which removes many of the risks and makes maintenance easy. The drawback is added noise, which makes the brain signals harder to read. This noise varies massively between sessions and between individuals.
Previous approaches, which were tuned to each individual, were time-consuming and limited in how well they coped with this noise. Instead, Meta introduced a convolutional neural network trained with a contrastive objective to predict the deep representations of the audio waveform. A contrastive objective trains a network to give matching pairs of inputs similar representations and non-matching pairs dissimilar ones, so the learned representations are useful for telling similar inputs apart.
This means that, given a sample of brain activity, the model predicts the latent representation of the speech the participant heard. The paper validates the approach on four public M/EEG datasets by decoding 3-second audio segments from the brain activity of 169 participants. Two training objectives are relevant here. Regression is a common technique for predicting a continuous value, such as a price or quantity, from a set of data; in this setting it would mean predicting the speech representation directly. A contrastive loss instead trains the network to score matching pairs higher than non-matching pairs, for example a brain recording paired with the audio segment that was actually playing rather than with a different segment. Meta opted for a contrastive loss to match the latent representations of the M/EEG recordings with those of the speech.
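To make the contrastive idea concrete, here is a minimal sketch in plain Python. It is not Meta's implementation: the function names, the plain dot-product similarity, and the tiny 2-dimensional latents are all assumptions chosen for illustration. For each brain latent, the loss is the cross-entropy of picking the matching speech latent out of all speech latents in the batch.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(brain_latents, speech_latents):
    """Average cross-entropy of selecting the matching speech latent.

    Assumption for this sketch: brain_latents[i] and speech_latents[i]
    come from the same segment; every other speech latent in the batch
    acts as a negative example.
    """
    total = 0.0
    for i, z in enumerate(brain_latents):
        # Similarity of this brain latent to every candidate speech latent.
        logits = [dot(z, y) for y in speech_latents]
        # Numerically stable log-sum-exp over the candidates.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(s - m) for s in logits))
        # Negative log-probability assigned to the true match.
        total += log_norm - logits[i]
    return total / len(brain_latents)


# Toy batch: two segments with 2-dimensional latents (hypothetical values).
brain = [[1.0, 0.0], [0.0, 1.0]]
speech_matched = [[1.0, 0.0], [0.0, 1.0]]
speech_shuffled = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = contrastive_loss(brain, speech_matched)
loss_shuffled = contrastive_loss(brain, speech_shuffled)
```

In this toy batch the loss is lower when each brain latent lines up with its own speech latent than when the pairs are shuffled, which is the behaviour the objective rewards during training.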
The speech representations come from a model pretrained on 56k hours of speech, and the M/EEG recordings come from participants who listened to audiobooks. From a 3-second sample of M/EEG signals, the model identifies the matching audio segment with up to 72.5% top-10 accuracy for MEG and up to 19.1% top-10 accuracy for EEG.
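Top-10 accuracy here means that the true audio segment is among the ten candidates the model scores highest. A minimal sketch of that metric, assuming dot-product scoring and hypothetical toy vectors (none of the names or values below come from the paper):

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_accuracy(predicted, candidates, k=10):
    """Fraction of predictions whose true segment ranks in the top k.

    Assumption for this sketch: predicted[i] is the latent decoded from
    the brain recording of segment i, and candidates[i] is the speech
    latent of that same segment.
    """
    hits = 0
    for i, z in enumerate(predicted):
        scores = [dot(z, c) for c in candidates]
        # Rank = number of candidates scoring strictly higher than the true one.
        rank = sum(1 for s in scores if s > scores[i])
        if rank < k:
            hits += 1
    return hits / len(predicted)


# Three candidate segments; the third prediction points at the wrong one.
candidates = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
predicted = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

top1 = top_k_accuracy(predicted, candidates, k=1)
top2 = top_k_accuracy(predicted, candidates, k=2)
```

With only three candidates the toy example uses k=1 and k=2; the same function with k=10 over a much larger candidate pool corresponds to the kind of figure reported above.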
On potential societal impacts, Meta argues that the approach would be difficult to adapt to decode brain signals from non-consenting participants, as “teeth clenching, eye blinks and other muscle movements are known to massively corrupt these signals, and thus presumably provide a simple way to counter downstream analyses.”