Transformer (Neural Network)
The Transformer is a foundational architecture in NLP and has influenced many other domains. It was introduced in the 2017 paper "Attention Is All You Need" and builds on the attention mechanism introduced three years earlier in work on neural machine translation.
Modern LLMs are based on the Transformer architecture, which lets them process and generate text efficiently. Transformers use mechanisms like self-attention to weigh the importance of different tokens in a sequence, enabling them to capture context better than earlier recurrent models.
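As a rough illustration, here is a minimal NumPy sketch of single-head scaled dot-product self-attention, the core operation described above. It is not the full multi-head version from the paper; the projection matrices `Wq`, `Wk`, `Wv` and the random token embeddings are made-up placeholders.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Minimal single-head scaled dot-product self-attention (sketch).

    X: (seq_len, d_model) token embeddings.
    Wq, Wk, Wv: projection matrices, here all (d_model, d_k).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # similarity of every token with every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: data-dependent averaging weights
    return weights @ V                          # weighted average of value vectors per token

# Toy example: 4 tokens with 8-dimensional embeddings (random placeholders).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one context-aware vector per token
```

Every output row is a weighted average over all value vectors, with weights computed from the data itself, which is why attention captures relationships between any pair of positions regardless of distance.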
Key features of Transformers:
- Self-Attention Mechanism: captures contextual relationships between all positions in a sequence (illustrated in the sketch above)
- Parallelization: processes an entire sequence simultaneously rather than token by token
- Positional Encoding: provides information about the position of each token in the input (see the sketch after this list)
- Encoder-Decoder Structure: stacks of self-attention and feed-forward neural network layers
- Scalability: scales up effectively with more data, parameters, and compute
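Because self-attention itself is order-agnostic, the original paper adds sinusoidal positional encodings to the token embeddings. Below is a small NumPy sketch of that formula; the sequence length and model dimension used in the example are arbitrary illustrative values.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Positional encodings from "Attention Is All You Need":
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # even dimension indices
    angles = positions / (10000 ** (dims / d_model))   # (seq_len, d_model / 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Each token embedding gets the encoding for its position added to it,
# so the same word at different positions produces different model inputs.
pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16)
```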
Attention is a brilliant (data-dependent) weighted average operation. It is a form of global pooling, a reduction, communication. It is a way to aggregate relevant information from multiple nodes (tokens, image patches, or etc.). It is expressive, powerful, has plenty of parallelism, and is efficiently optimizable.
– Andrej Karpathy