andersch.dev

<2024-10-26 Sat>
[ ai ]

LLM (Large Language Model)

Large Language Models (LLMs) are neural network models designed to understand, generate, and process natural language. They are trained on vast amounts of text with unsupervised or semi-supervised learning to predict the next word in a sequence, which teaches them patterns, grammar, context, and semantics.

A model consists of its parameters (or weights) and an executable that runs on those parameters.
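
A minimal sketch of this split (a made-up toy, not any real model format): the parameters live in one plain file on disk, and a few lines of separate code load and run them.

  import numpy as np

  # Toy illustration of "model = parameters + code that runs them".
  # The weights file and the one-layer "architecture" are made up for this sketch.

  # 1) The parameters: plain arrays, saved to disk once.
  rng = np.random.default_rng(0)
  params = {
      "embed": rng.normal(size=(1000, 64)),   # token embeddings (vocab x dim)
      "w_out": rng.normal(size=(64, 1000)),   # projection back to vocab logits
  }
  np.savez("weights.npz", **params)

  # 2) The executable part: code that loads the parameters and runs them
  #    to produce a next-token distribution.
  def run(token_ids):
      w = np.load("weights.npz")
      h = w["embed"][token_ids].mean(axis=0)  # trivial stand-in "architecture"
      logits = h @ w["w_out"]
      probs = np.exp(logits - logits.max())
      return probs / probs.sum()

  print(run([1, 2, 3]).argmax())              # id of the most likely "next token"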

Notable examples:

Model Training

The neural network of an LLM tries to predict the next word in a sequence. Because prediction and compression are closely related, training a model can be thought of as a lossy compression of the training dataset. The training stages are:

Pretraining: Train model on a giant dataset in an expensive and lengthy process

  1. Download and preprocess the internet.
  2. Tokenization: Represent chunks of text as unique token IDs
    • Trade-off between sequence length and vocabulary size
    • Run byte-pair encoding to mint new symbols for common byte sequences
    • See e.g. tiktokenizer (and the tokenization sketch after this list)
  3. Training
    • Take windows/sequences of tokens from the training set
    • Try to predict the next token for this context (see the training sketch after this list)
  4. Result: Base model. LLM is now a "document generator".
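
The tokenization sketch referenced in step 2, using the tiktoken library as one possible BPE tokenizer (assumes pip install tiktoken; the encoding name "cl100k_base" is just an example):

  import tiktoken

  # Byte-pair-encoding tokenizers map text to integer token IDs.
  enc = tiktoken.get_encoding("cl100k_base")

  text = "Large Language Models compress the internet."
  ids = enc.encode(text)

  print(ids)                              # a short list of integer token IDs
  print([enc.decode([i]) for i in ids])   # the text chunk behind each ID
  print("vocabulary size:", enc.n_vocab)  # larger vocabulary -> shorter sequences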
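
And the training sketch referenced in step 3: sample windows of tokens and train the network to predict the token that follows each position with a cross-entropy loss. The tiny embedding-plus-linear "model" and the random token stream are stand-ins; a real LLM uses a transformer over real data.

  import torch
  import torch.nn.functional as F

  vocab_size, dim, window = 256, 64, 32
  data = torch.randint(0, vocab_size, (10_000,))   # stand-in token stream

  embed = torch.nn.Embedding(vocab_size, dim)      # toy model: embedding + linear head
  head = torch.nn.Linear(dim, vocab_size)
  opt = torch.optim.AdamW(list(embed.parameters()) + list(head.parameters()), lr=1e-3)

  for step in range(100):
      # take a random window/sequence of tokens from the training set
      i = torch.randint(0, len(data) - window - 1, (1,)).item()
      context = data[i : i + window]               # tokens the model sees
      targets = data[i + 1 : i + window + 1]       # the "next token" at each position
      logits = head(embed(context))                # (window, vocab_size)
      loss = F.cross_entropy(logits, targets)      # next-token prediction loss
      opt.zero_grad(); loss.backward(); opt.step()

  print("final loss:", loss.item())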

Fine-tuning: LLM becomes an assistant that can answer questions

  • Trained on manually collected datasets written by human labelers
  • People write Q&A responses based on labeling instructions (see the example data after this list)
  • Alternatively: humans compare candidate answers and select the best one
  • Focus is on quality over quantity; the goal is alignment, i.e. changing the output format from internet documents to helpful assistant answers
  • Cheap and quick process compared to pretraining; can be repeated every week or so
  • Result: assistant model
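
The example data referenced above, with made-up contents: one written Q&A demonstration (supervised fine-tuning) and one comparison label (preference data); the exact field names vary between datasets.

  # A written demonstration: a labeler answers the question following the
  # labeling instructions; many such pairs form the fine-tuning set.
  demonstration = {
      "messages": [
          {"role": "user", "content": "Explain what a tokenizer does in one sentence."},
          {"role": "assistant", "content": "A tokenizer splits text into small units "
                                           "(tokens) and maps each one to an integer ID."},
      ]
  }

  # A comparison label: instead of writing an answer, the labeler picks the
  # better of two candidate answers; such preferences can train a reward model.
  comparison = {
      "prompt": "Explain what a tokenizer does in one sentence.",
      "chosen": "A tokenizer splits text into tokens and maps them to integer IDs.",
      "rejected": "Tokenizers are a part of NLP.",
  }

  print(demonstration["messages"][1]["content"])
  print("preferred answer:", comparison["chosen"])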

Reinforcement Learning

  • Let the model practice on problems and reinforce the solutions that work best for it (see the sketch after this list)
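
The sketch referenced above is a deliberately tiny REINFORCE-style toy, not any specific lab's pipeline: the "policy" chooses among a few canned solutions, a verifiable reward scores them, and the update shifts probability toward solutions that worked.

  import numpy as np

  prompt = "What is 17 + 25?"
  candidates = ["17 + 25 = 32", "17 + 25 = 42", "17 + 25 = 52"]  # made-up solution attempts

  def reward(solution):
      return 1.0 if solution.endswith("= 42") else 0.0  # verifiable correctness check

  rng = np.random.default_rng(0)
  logits = np.zeros(len(candidates))                    # toy policy parameters
  lr = 0.5

  for step in range(200):
      probs = np.exp(logits - logits.max()); probs /= probs.sum()
      i = rng.choice(len(candidates), p=probs)          # sample ("practice") a solution
      r = reward(candidates[i])
      grad = -probs * r                                 # REINFORCE: r * (one_hot(i) - probs)
      grad[i] += r
      logits += lr * grad                               # reinforce what worked

  probs = np.exp(logits - logits.max()); probs /= probs.sum()
  print({c: round(p, 3) for c, p in zip(candidates, probs)})  # mass moves to the correct answer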

Reading a model card

Example: ibm-granite/granite-3.1-8b-instruct

Metric               | 8B Dense | Explanation
---------------------+----------+-----------------------------------------------------------
Embedding size       | 4096     | embedding vector dimension for input tokens
# layers             | 40       | 40 transformer blocks
# attention heads    | 32       | 32 heads in attention
Attention head size  | 128      | 128 dimensions per head, 4096 = 32 x 128
# KV heads           | 8        | key-value projection pairs
MLP hidden size      | 12800    | hidden layer size of the Multi-Layer Perceptron (FFN)
MLP activation       | SwiGLU   | activation function for the MLP (Swish-Gated Linear Unit)
# experts            | -        | specialized sub-networks in Mixture-of-Experts (MoE)
MoE TopK             | -        | number of experts activated per token in MoE
Initialization std   |          | standard deviation used for parameter initialization
Sequence length      | 128k     | context window (max tokens the model can process at once)
Position embedding   | RoPE     | Rotary Position Embedding for token position encoding
# Parameters         | 8.1B     | total parameters/weights
# Active parameters  | 8.1B     | parameters used during a single forward pass
# Training tokens    | 12T      | 12 trillion training tokens
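
One way to cross-check these numbers is to load the published config (a sketch assuming pip install transformers with a version recent enough to know the Granite architecture, plus access to huggingface.co; the field names follow the Llama-style config and can differ for other architectures):

  from transformers import AutoConfig

  cfg = AutoConfig.from_pretrained("ibm-granite/granite-3.1-8b-instruct")

  print("embedding size   :", cfg.hidden_size)             # should match the table above
  print("# layers         :", cfg.num_hidden_layers)
  print("# attention heads:", cfg.num_attention_heads)
  print("head size        :", cfg.hidden_size // cfg.num_attention_heads)
  print("# KV heads       :", cfg.num_key_value_heads)
  print("MLP hidden size  :", cfg.intermediate_size)
  print("sequence length  :", cfg.max_position_embeddings)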

Resources