LLM (Large Language Model)
Large Language Models (LLMs) are advanced neural network models designed to understand, generate, and process natural language. They are trained on vast amounts of text with unsupervised or self-supervised learning to predict the next word in a sequence, which teaches them patterns of grammar, context, and semantics.
A model consists of its parameters (or weights) and an executable that runs on those parameters.
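As a minimal sketch of both points, the snippet below loads a small open checkpoint with the Hugging Face transformers library ("gpt2" is just an illustrative stand-in for any causal LLM) and asks it for the most likely next token: the downloaded weight files are the parameters, and the library code is the executable that runs on them.

```python
# Minimal next-token prediction sketch using Hugging Face transformers.
# "gpt2" is only an illustrative stand-in for any causal LLM checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")   # downloads the parameters (weights)
model.eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # one score per vocabulary token, per position

next_id = int(logits[0, -1].argmax())      # most likely continuation after the last token
print(tokenizer.decode([next_id]))         # likely something like " Paris"
```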
Notable examples:
- GPT-4 (OpenAI)
- Claude (Anthropic)
- LLaMA (Meta)
- DeepSeek (DeepSeek)
- Grok (xAI)
- Gemini (Google)
- Mistral (Mistral AI)
Model Training
The neural network of an LLM is trained to predict the next word (token) in a sequence. Because prediction and compression are closely related, training a model can be thought of as a lossy compression of the training dataset. The training stages are:
Pretraining: Train model on a giant dataset in an expensive and lengthy process
- Download and preprocess the internet.
- Involves crawling, extracting, filtering, etc.
- E.g. FineWeb Dataset
- Tokenization: Represent tokens of text as unique IDs
- Trade-off between sequence length and vocabulary size: more symbols mean shorter token sequences
- Run byte-pair encoding to mint new symbols for common byte sequences (see the sketch after this list)
- See e.g. tiktokenizer
- Training
- Take windows/sequences of tokens from the training set
- Try to predict the next token for each context (window construction is shown in the sketch below)
- Result: Base model. LLM is now a "document generator".
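A minimal sketch of the tokenization and training steps above, in plain Python on a toy string (illustration only; real pipelines do the same at vastly larger scale): byte-pair encoding mints new token ids for the most frequent adjacent pairs, and fixed-length windows are then sliced from the token stream so each context predicts the next token.

```python
# Toy byte-pair encoding + training-window construction (illustration only).
from collections import Counter

def merge(ids, pair, new_id):
    """Replace every occurrence of the adjacent `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "the cat sat on the mat because the mat was flat"
ids = list(text.encode("utf-8"))   # start from raw bytes, token ids 0..255

# Byte-pair encoding: mint new symbols for the most frequent adjacent pairs.
# Each merge shortens the sequence but enlarges the vocabulary (the trade-off above).
vocab_size = 256
for _ in range(5):
    pair = Counter(zip(ids, ids[1:])).most_common(1)[0][0]
    ids = merge(ids, pair, vocab_size)
    print(f"minted symbol {vocab_size} for pair {pair}; sequence length is now {len(ids)}")
    vocab_size += 1

# Training windows: each fixed-length context is paired with the token that follows it.
context_len = 8
pairs = [(ids[i : i + context_len], ids[i + context_len])
         for i in range(len(ids) - context_len)]
print(f"{len(pairs)} (context, next-token) training examples")
```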
Fine-tuning: LLM becomes an assistant that can answer questions
- Trained on manually collected datasets from people
- People write Q&A responses based on labeling instructions (an example record shape is sketched after this list)
- Alternatively: Humans can compare answers and select the best one
- Focus on quality over quantity; the goal is alignment, i.e. changing the model's response format and behavior
- Cheap and quick process, happens every week or so
- Result: assistant model
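To make the data formats concrete, here are two hypothetical fine-tuning records (the field names are assumptions for illustration, not any particular vendor's schema): a supervised Q&A example written by a labeler, and a preference comparison between two candidate answers.

```python
# Hypothetical fine-tuning records (shapes only; field names are assumed for illustration).
sft_example = {
    # Supervised Q&A written by a human labeler following labeling instructions.
    "messages": [
        {"role": "user", "content": "Why is the sky blue?"},
        {"role": "assistant", "content": "Shorter (blue) wavelengths of sunlight scatter more strongly in the atmosphere."},
    ],
}

preference_example = {
    # A human compares two model answers to the same prompt and selects the better one.
    "prompt": "Summarize the article in one sentence.",
    "chosen": "A concise, accurate one-sentence summary.",
    "rejected": "A rambling answer that misses the main point.",
}
```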
Reinforcement Learning
- The model practices on problems and reinforces the solutions that work best for it (a conceptual sketch follows)
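A conceptual sketch of that idea, assuming hypothetical model.generate and model.reinforce helpers rather than any real training API: sample several attempts at a problem with a checkable answer and reinforce the ones that reach it.

```python
# Conceptual RL sketch; `model.generate` and `model.reinforce` are hypothetical helpers.
def rl_step(model, problem, correct_answer, n_samples=8):
    attempts = [model.generate(problem) for _ in range(n_samples)]        # practice attempts
    winners = [a for a in attempts if a.final_answer == correct_answer]   # solutions that worked
    for solution in winners:
        model.reinforce(solution)       # raise the probability of these token sequences
    return len(winners) / n_samples     # fraction of successful attempts, a rough reward signal
```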
Reading a model card
Example: ibm-granite/granite-3.1-8b-instruct
| Metric | 8B Dense | Explanation |
|---|---|---|
| Embedding Size | 4096 | embedding vector dimension for input tokens |
| # layers | 40 | 40 Transformer blocks |
| # attention heads | 32 | 32 heads in Attention |
| Attention head size | 128 | 128 dimensions per head, 4096 = 32x128 |
| # KV heads | 8 | Key-value heads; the 32 query heads share 8 K/V pairs (grouped-query attention) |
| MLP hidden size | 12800 | Hidden layer size of the multi-layer perceptron (feed-forward network, FFN) |
| MLP activation | SwiGLU | Activation function for MLP (Swish-Gated Linear Unit) |
| # experts | - | specialized NNs in Mixture-of-Experts (MoE) |
| MoE TopK | - | Number of experts activated per token in MoE |
| Initialization std | | Standard deviation used for parameter initialization |
| Sequence length | 128k | Context window (max tokens model can process at once) |
| Position embedding | RoPE | Rotary Position Embedding for token position encoding |
| # Parameters | 8.1B | total parameters/weights |
| # Active parameters | 8.1B | Parameters used during a single forward pass (all of them, since this is a dense model) |
| # Training tokens | 12T | 12 trillion training tokens |
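As a sanity check, the table's numbers roughly reproduce the reported parameter count. The arithmetic below assumes tied input/output embeddings and a vocabulary of about 49k tokens (not listed above), and ignores small terms such as norms and biases.

```python
# Rough parameter-count check from the model-card numbers above.
d_model    = 4096      # embedding size
n_layers   = 40
n_heads    = 32
head_dim   = 128       # 32 * 128 = 4096
n_kv_heads = 8
mlp_hidden = 12800
vocab_size = 49_000    # assumption; not listed in the table above

# Attention: Q and output projections are d_model x d_model,
# K and V projections are d_model x (n_kv_heads * head_dim).
attn = 2 * d_model * d_model + 2 * d_model * (n_kv_heads * head_dim)

# SwiGLU MLP uses three matrices: gate and up (d_model x mlp_hidden) plus down (mlp_hidden x d_model).
mlp = 3 * d_model * mlp_hidden

total = n_layers * (attn + mlp) + vocab_size * d_model   # + token embeddings
print(f"{total / 1e9:.2f}B parameters")                  # roughly the 8.1B reported above
```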