LLM (Large Language Model)
Large Language Models (LLMs) are advanced neural network models designed to understand, generate, and process natural language. They are trained on vast amounts of text with unsupervised or self-supervised learning to predict the next word in a sequence, which teaches them patterns of grammar, context, and semantics.
A model consists of its parameters (or weights) and an executable that runs on those parameters.
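As a minimal sketch of both points, the snippet below loads a small open checkpoint with the Hugging Face transformers library ("gpt2" is just an illustrative stand-in for any causal LLM) and asks it for the most likely next token: the downloaded weight files are the parameters, and the library code is the executable that runs on them.

```python
# Minimal next-token prediction sketch using Hugging Face transformers.
# "gpt2" is only an illustrative stand-in for any causal LLM checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")   # downloads the parameters (weights)
model.eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # one score per vocabulary token, per position

next_id = int(logits[0, -1].argmax())      # most likely continuation after the last token
print(tokenizer.decode([next_id]))         # likely something like " Paris"
```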
Notable examples:
- GPT-4 (OpenAI)
- Claude (Anthropic)
- LLaMA (Meta)
- DeepSeek (DeepSeek)
- Grok (xAI)
- Gemini (Google)
- Mistral (Mistral AI)
Model Training
The neural network of an LLM is trained to predict the next word (token) in a sequence. Because prediction and compression are closely related, training a model can be thought of as a lossy compression of the training dataset. The training stages are:
Pretraining: Train model on a giant dataset in an expensive and lengthy process
- Download and preprocess the internet.
- Involves crawling, extracting, filtering, etc.
- E.g. FineWeb Dataset
- Tokenization: Represent tokens of text as unique IDs
- Trade-off between sequence length and vocabulary size: more symbols mean shorter token sequences
- Run byte-pair encoding to mint new symbols for common byte sequences (see the sketch after this list)
- See e.g. tiktokenizer
- Training
- Take windows/sequences of tokens from the training set
- Try to predict the next token for each context (window construction is shown in the sketch below)
- Result: Base model. LLM is now a "document generator".
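A minimal sketch of the tokenization and training steps above, in plain Python on a toy string (illustration only; real pipelines do the same at vastly larger scale): byte-pair encoding mints new token ids for the most frequent adjacent pairs, and fixed-length windows are then sliced from the token stream so each context predicts the next token.

```python
# Toy byte-pair encoding + training-window construction (illustration only).
from collections import Counter

def merge(ids, pair, new_id):
    """Replace every occurrence of the adjacent `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "the cat sat on the mat because the mat was flat"
ids = list(text.encode("utf-8"))   # start from raw bytes, token ids 0..255

# Byte-pair encoding: mint new symbols for the most frequent adjacent pairs.
# Each merge shortens the sequence but enlarges the vocabulary (the trade-off above).
vocab_size = 256
for _ in range(5):
    pair = Counter(zip(ids, ids[1:])).most_common(1)[0][0]
    ids = merge(ids, pair, vocab_size)
    print(f"minted symbol {vocab_size} for pair {pair}; sequence length is now {len(ids)}")
    vocab_size += 1

# Training windows: each fixed-length context is paired with the token that follows it.
context_len = 8
pairs = [(ids[i : i + context_len], ids[i + context_len])
         for i in range(len(ids) - context_len)]
print(f"{len(pairs)} (context, next-token) training examples")
```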
Fine-tuning: LLM becomes an assistant that can answer questions
- Trained on manually collected datasets from people
- People write Q&A responses based on labeling instructions (an example record shape is sketched after this list)
- Alternatively: Humans can compare answers and select the best one
- Focus on quality over quantity; the goal is alignment, i.e. changing the model's response format and behavior
- Cheap and quick process, happens every week or so
- Result: assistant model
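To make the data formats concrete, here are two hypothetical fine-tuning records (the field names are assumptions for illustration, not any particular vendor's schema): a supervised Q&A example written by a labeler, and a preference comparison between two candidate answers.

```python
# Hypothetical fine-tuning records (shapes only; field names are assumed for illustration).
sft_example = {
    # Supervised Q&A written by a human labeler following labeling instructions.
    "messages": [
        {"role": "user", "content": "Why is the sky blue?"},
        {"role": "assistant", "content": "Shorter (blue) wavelengths of sunlight scatter more strongly in the atmosphere."},
    ],
}

preference_example = {
    # A human compares two model answers to the same prompt and selects the better one.
    "prompt": "Summarize the article in one sentence.",
    "chosen": "A concise, accurate one-sentence summary.",
    "rejected": "A rambling answer that misses the main point.",
}
```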
Reinforcement Learning
- The model practices on problems and reinforces the solutions that work best for it (a conceptual sketch follows)
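A conceptual sketch of that idea, assuming hypothetical model.generate and model.reinforce helpers rather than any real training API: sample several attempts at a problem with a checkable answer and reinforce the ones that reach it.

```python
# Conceptual RL sketch; `model.generate` and `model.reinforce` are hypothetical helpers.
def rl_step(model, problem, correct_answer, n_samples=8):
    attempts = [model.generate(problem) for _ in range(n_samples)]        # practice attempts
    winners = [a for a in attempts if a.final_answer == correct_answer]   # solutions that worked
    for solution in winners:
        model.reinforce(solution)       # raise the probability of these token sequences
    return len(winners) / n_samples     # fraction of successful attempts, a rough reward signal
```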
Reading a model card
Example: ibm-granite/granite-3.1-8b-instruct
| Metric | 8B Dense | Explanation |
|---|---|---|
| Embedding Size | 4096 | embedding vector dimension for input tokens |
| # layers | 40 | 40 Transformer blocks |
| # attention heads | 32 | 32 heads in Attention |
| Attention head size | 128 | 128 dimensions per head, 4096 = 32x128 |
| # KV heads | 8 | Key-value heads; the 32 query heads share 8 K/V pairs (grouped-query attention) |
| MLP hidden size | 12800 | Hidden layer size of the multi-layer perceptron (feed-forward network, FFN) |
| MLP activation | SwiGLU | Activation function for MLP (Swish-Gated Linear Unit) |
| # experts | - | specialized NNs in Mixture-of-Experts (MoE) |
| MoE TopK | - | Number of experts activated per token in MoE |
| Initialization std | | Standard deviation used for parameter initialization |
| Sequence length | 128k | Context window (max tokens model can process at once) |
| Position embedding | RoPE | Rotary Position Embedding for token position encoding |
| # Parameters | 8.1B | total parameters/weights |
| # Active parameters | 8.1B | Parameters used during a single forward pass (all of them, since this is a dense model) |
| # Training tokens | 12T | 12 trillion training tokens |
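As a sanity check, the table's numbers roughly reproduce the reported parameter count. The arithmetic below assumes tied input/output embeddings and a vocabulary of about 49k tokens (not listed above), and ignores small terms such as norms and biases.

```python
# Rough parameter-count check from the model-card numbers above.
d_model    = 4096      # embedding size
n_layers   = 40
n_heads    = 32
head_dim   = 128       # 32 * 128 = 4096
n_kv_heads = 8
mlp_hidden = 12800
vocab_size = 49_000    # assumption; not listed in the table above

# Attention: Q and output projections are d_model x d_model,
# K and V projections are d_model x (n_kv_heads * head_dim).
attn = 2 * d_model * d_model + 2 * d_model * (n_kv_heads * head_dim)

# SwiGLU MLP uses three matrices: gate and up (d_model x mlp_hidden) plus down (mlp_hidden x d_model).
mlp = 3 * d_model * mlp_hidden

total = n_layers * (attn + mlp) + vocab_size * d_model   # + token embeddings
print(f"{total / 1e9:.2f}B parameters")                  # roughly the 8.1B reported above
```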