LLM (Large Language Model)
Large Language Models (LLMs) are advanced neural network models designed to understand, generate, and process natural language. They are trained on vast amounts of text using unsupervised or semi-supervised learning to predict the next word in a sequence, which teaches them patterns, grammar, context, and semantics.
A model consists of its parameters (or weights) and an executable that runs on those parameters.
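In practice the parameters are weight files that get downloaded once, and the "executable" is inference code that runs on them. A minimal sketch, assuming the Hugging Face transformers library and reusing the granite model from the model-card example further down:

```python
# Minimal sketch (assumes the Hugging Face transformers library is installed):
# the weights are downloaded once, then inference code runs on top of them.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-3.1-8b-instruct"        # model from the example below
tokenizer = AutoTokenizer.from_pretrained(model_id)      # vocabulary + tokenization rules
model = AutoModelForCausalLM.from_pretrained(model_id)   # the parameters/weights (~8B of them)

inputs = tokenizer("The capital of France is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=10)     # run the "executable" on the weights
print(tokenizer.decode(output[0], skip_special_tokens=True))
```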
Notable examples:
- GPT-4 (OpenAI)
- Claude (Anthropic)
- LLaMA (Meta)
- DeepSeek (DeepSeek)
- Grok (xAI)
- Gemini (Google)
- Mistral (Mistral AI)
Model Training
The neural network of an LLM is trained to predict the next token in a sequence. Because prediction and compression are closely related, training a model can be thought of as a lossy compression of the training dataset. The training stages are:
Pretraining: Train the model on a giant dataset in an expensive and lengthy process
- Download and preprocess the internet
  - Involves crawling, extracting, filtering, etc.
  - E.g. the FineWeb dataset
- Tokenization: represent pieces of text as unique token IDs
  - Trade-off between sequence length and symbol (vocabulary) size
  - Run byte-pair encoding to mint new symbols for common byte sequences
  - See e.g. tiktokenizer
- Training (see the sketch after this list)
  - Take windows/sequences of tokens from the training set
  - Try to predict the next token for each context
- Result: base model. The LLM is now a "document generator".
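To make tokenization and next-token prediction concrete, here is a minimal sketch using the tiktoken library; the choice of the gpt2 encoding and the tiny window size are illustrative assumptions, not details of any particular model.

```python
import tiktoken

# Tokenization: map text to a sequence of integer token IDs (byte-pair encoding).
enc = tiktoken.get_encoding("gpt2")   # illustrative choice of encoding
text = "Large Language Models are trained to predict the next token in a sequence."
tokens = enc.encode(text)
print(tokens)                              # a list of integer token IDs
print([enc.decode([t]) for t in tokens])   # the text piece each ID stands for

# Training data: slide a window over the token stream. For each position the
# input is the context window and the target is the token that follows it.
window = 4   # real models use context windows of thousands of tokens
for i in range(len(tokens) - window):
    context = tokens[i : i + window]
    target = tokens[i + window]
    print(f"context={context} -> predict {target} ({enc.decode([target])!r})")
```

Pretraining repeats this over billions of such windows, adjusting the weights so the correct next token gets a higher probability.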
Fine-tuning: LLM becomes an assistant that can answer questions
- Trained on datasets collected manually from people (see the chat-template sketch after this list)
  - People write Q&A responses based on labeling instructions
  - Alternatively: humans compare model answers and select the best one
- Focus on quality over quantity; this stage is about alignment, i.e. changing the output format into Q&A style
- Cheap and quick process, happens every week or so
- Result: assistant model
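To show what a single fine-tuning example looks like, the sketch below renders one hand-written Q&A pair into the training string produced by a chat template. It assumes the Hugging Face transformers library and reuses the granite model from the model-card example below; the question and answer are made up.

```python
from transformers import AutoTokenizer

# One hand-written example from a (hypothetical) labeling effort.
messages = [
    {"role": "user", "content": "What is byte-pair encoding?"},
    {"role": "assistant", "content": "A scheme that merges frequent byte sequences into single tokens."},
]

# The chat template turns the conversation into the exact string the model is fine-tuned on.
tok = AutoTokenizer.from_pretrained("ibm-granite/granite-3.1-8b-instruct")
print(tok.apply_chat_template(messages, tokenize=False))
```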
Reinforcement Learning: LLM improves by practicing
- The model practices on many problems and reinforces the solution strategies that work best for it (see the sketch below)
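The loop can be sketched as rejection sampling: generate several attempts per problem, keep the ones that earn reward, and fine-tune on those. Everything below is a toy illustration (the "model" is a random guesser at arithmetic), not any lab's actual RL recipe.

```python
import random

# Toy stand-ins for illustration only: a real setup samples text from the LLM
# and grades the final answer; here the "model" just guesses sums.
def sample_solution(problem):
    a, b = problem
    guess = a + b + random.choice([-1, 0, 0, 0, 1])   # sometimes right, sometimes wrong
    return (f"I think {a} + {b} = {guess}", guess)

def reward(answer, correct):
    return 1.0 if answer == correct else 0.0

def practice(problems, attempts_per_problem=8):
    """Keep attempts that earned reward; fine-tuning on them reinforces what worked."""
    winners = []
    for a, b in problems:
        attempts = [sample_solution((a, b)) for _ in range(attempts_per_problem)]
        good = [text for text, ans in attempts if reward(ans, a + b) > 0]
        if good:
            winners.append(random.choice(good))   # one working solution per problem
    return winners

print(practice([(2, 3), (10, 7)]))
```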
Reading a model card
Example: ibm-granite/granite-3.1-8b-instruct
Metric | 8B Dense | Explanation |
---|---|---|
Embedding size | 4096 | Embedding vector dimension for input tokens |
# layers | 40 | 40 Transformer blocks |
# attention heads | 32 | 32 heads per attention layer |
Attention head size | 128 | 128 dimensions per head; 4096 = 32 x 128 |
# KV heads | 8 | Key/value projection pairs, shared across the 32 query heads (grouped-query attention) |
MLP hidden size | 12800 | Hidden layer size of the MLP (feed-forward network) |
MLP activation | SwiGLU | Activation function of the MLP (Swish-Gated Linear Unit) |
# experts | - | Specialized sub-networks in a Mixture-of-Experts (MoE) model; not used here |
MoE TopK | - | Number of experts activated per token in an MoE model; not used here |
Initialization std | | Standard deviation used for parameter initialization |
Sequence length | 128k | Context window (max tokens the model can process at once) |
Position embedding | RoPE | Rotary Position Embedding for encoding token positions |
# Parameters | 8.1B | Total parameters/weights |
# Active parameters | 8.1B | Parameters used during a single forward pass (all of them, since there is no MoE) |
# Training tokens | 12T | 12 trillion training tokens |
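To see how these numbers fit together, the following sketch estimates the total parameter count from the table values. The vocabulary size is not listed above, so the ~49K used here is an assumption; layer norms, biases, and other small terms are ignored.

```python
# Rough parameter-count estimate from the model-card values above.
# Assumption: vocab_size ~ 49_000 (not listed in the table); layer norms,
# biases, and other small terms are ignored.
d_model    = 4096      # embedding size
n_layers   = 40
n_heads    = 32
head_dim   = 128       # 4096 = 32 * 128
n_kv_heads = 8         # grouped-query attention: K/V are smaller than Q
mlp_hidden = 12800
vocab_size = 49_000    # assumption, not from the table

# Attention: Q and output projections are d_model x d_model,
# K and V only project to n_kv_heads * head_dim dimensions.
attn = 2 * d_model * (n_heads * head_dim) + 2 * d_model * (n_kv_heads * head_dim)

# SwiGLU MLP uses three matrices: gate and up (d_model -> mlp_hidden), down (mlp_hidden -> d_model).
mlp = 3 * d_model * mlp_hidden

# Token embedding matrix (output head omitted, assuming tied weights).
embed = vocab_size * d_model

total = n_layers * (attn + mlp) + embed
print(f"~{total / 1e9:.1f}B parameters")   # lands near the 8.1B listed above
```

That the estimate lands near the listed 8.1B is a useful sanity check that the table entries are consistent with one another.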