andersch.dev

<2024-10-26 Sat>
[ ai ]

LLM (Large Language Model)

Large Language Models (LLMs) are a form of advanced neural network models designed to understand, generate, and process natural language. They are trained on vast amounts of text using unsupervised or semi-supervised learning techniques to learn to predict the next word in a sentence, which helps them understand patterns, grammar, context, and semantics.

A model consists of its parameters (or weights) and an executable that runs on those parameters.

Notable LLMs

  • ChatGPT (OpenAI)
  • Claude (Anthropic)
  • LLaMA (Meta)
  • DeepSeek (DeepSeek)
  • Grok (xAI)
  • Gemini (Google)
  • Mistral (Mistral AI)

Model Training

The Neural Network of an LLM tries to predict the next word in a sequence. Due to the relation of prediction and compression, training a model can be thought of as a lossy compression of the given dataset. The training stages are:

Pretraining: Train model on a giant dataset in an expensive and lengthy process

  1. Download and preprocess the internet.
  2. Tokenization: Represent tokens of text as unique IDs
    • Trade-off between sequence length and symbol size
    • Run byte-pair encoding to mint new symbols for common byte sequences
    • See e.g. tiktokenizer
  3. Training
    • Take windows/sequences of tokens from the training set
    • Try to predict next token for this context
  4. Result: Base model. LLM is now a "document generator".

Fine-tuning: LLM becomes an assistant that can answer questions

  • Trained on manually collected datasets from people
  • People write Q&A responses based on labeling instructions
  • Alternatively: Humans can compare answers and select the best one
  • Focus on quality over quantity and alignment, i.e. changing the format
  • Cheap and quick process, happens every week or so
  • Result: assistant model

Reinforcement Learning

  • Practice on problems and find solutions that work best for the LLM

Reading a model card

Example: ibm-granite/granite-3.1-8b-instruct

metric 8b Dense explanation
Embedding Size 4096 embedding vector dimension for input tokens
# layers 40 40 Transformer blocks
# attention heads 32 32 heads in Attention
Attention head size 128 128 dimensions per head, 4096 = 32x128
# KV heads 8 Key-Value projection pairs
MLP hidden size 12800 hidden layer size of Multi-Layer-Perceptron or FNN
MLP activation SwiGLU Activation function for MLP (Swish-Gated Linear Unit)
# experts - specialized NNs in Mixture-of-Experts (MoE)
MoE TopK - Number of experts activated per token in MoE
Initialization std   Standard deviation used for parameter initialization
Sequence length 128k Context window (max tokens model can process at once)
Position embedding RoPE Rotary Position Embedding for token position encoding
# Parameters 8.1B total parameters/weights
# Active parameters 8.1B Parameters used during a single forward pass
# Training tokens 12T 12 trillion training tokens

AI/LLM Agents

AI agents are autonomous systems where LLMs direct themselves independently over extended periods, in order to accomplish complex tasks.

Or: Agents run tools in a loop to achieve a goal

Examples

  • Browser Use - Browser controlling tool for AI agents
  • Manus - General AI agent
  • OpenManus - Open-source implementation of Manus
  • gptme - AI agent in the terminal

API Price Comparison

See LLM pricing calculator

model provider 1m/in 1m/out month context output rpm tpm rpd
Claude 3.7 Sonnet Anthropic $3 $15 $0 200k 64k 50 20k  
Claude 3.5 Sonnet Anthropic $3 $15 $0 200k 8k 50 40k  
DeepSeek-R1 DeepSeek $0.55 $2.19 $0 64k 8k  
GPT-4.5 OpenAI $75 $150 $0 128k 16k      
Gemini 2.5 Flash Google       1m   10 250k 500
Gemini 2.5 Pro Google       1m   N/A N/A N/A
Gemini 2.5 Flash Google $0.15 $3.5   1m   1000 1m 10k
Gemini 2.5 Pro Google $2.5 $15   1m   1500 2m 1000
  OpenRouter                
  groq                
  glhf.chat                
  Cerebras                
  DeepInfra                
  together.ai                
  Replicate                
  Nebius                

Benchmarks

Benchmarks aims to evaluate the performance of LLMs to measure their capabilities in areas such as reasoning, understanding, and generation.

Prominent Examples:

  • Chatbot Arena: users rate LLM responses in conversations (crowdsourced)
  • MMLU (Massive Multitask Language Understanding): Broad knowledge tests
  • HumanEval: Coding ability benchmark, checked against test cases

Resources:

Security

Types of attacks:

  • Jailbreaks: Circumventing alignment, e.g. sending a harmful prompt in base64
  • Prompt injection: Embedding harmful prompt in unexpected places
  • Data poisoning/Backdoor: Poisoning (finetuning) dataset, e.g. with a code word
  • Adversarial inputs
  • Insecure output handling
  • Data extraction & privacy
  • Data reconstruction
  • Denial of service
  • Escalation
  • Watermarking & evasion
  • Model theft

Model Context Protocol (MCP)

Model Context Protocol (MCP) is an open protocol that specifies how an LLM can retrieve data from different sources into its context window and can be used to provide custom tool usage to a model. It was introduced by Anthropic in 2024.

Similar to Retrieval Augmented Generation (RAG), MCP is concerned with giving LLMs access to data that was not part of their original training data set.

Components

  • MCP Host: Program that wants to access resources (e.g. Claude Desktop)
  • MCP Client: On the host; maintains connections with servers
  • MCP Servers: Provides resources, prompts and tools
  • Resources: Local (filesystem, databases), or remote (behind API)

Resources

LLM Inference Software

  • llama.cpp - LLM inference in C/C++
  • vLLM - Inference and serving engine for LLMs
  • ExLlamaV2 - Library for local inference on modern consumer-class GPUs

Claude Prompting Guide

General tips

  • Be clear, specific, provide context and break complex tasks down
  • Provide examples of the kind of output you're looking for
  • Encourage thinking (ask to "think step-by-step" or "explain your reasoning.)
  • Iterative refinement (ask for clarifications/modifications)
  • Leverage Claude's knowledge (ask for explanations or background information)
  • Use role-playing (ask to adopt a specific role or perspective)

Content Creation

  • Specify your audience
  • Define the tone and style
  • Define output structure (Provide basic outline of points you want covered)
  • Specify the desired output format

Document summary and Q&A

  • Refer to attached documents by name.
  • Ask for citations from the document in its answers

Brainstorming

  • Use Claude to generate ideas by asking for a list of possibilities
  • Request responses in specific formats (bullet points, lists, or tables)

Troubleshooting, minimizing hallucinations, and maximizing performance

  • Allow Claude to acknowledge uncertainty ("If unsure, say you don't know")
  • Break down complex tasks into smaller steps
  • Include all contextual information for new conversations

Resources