LLM (Large Language Model)
Large Language Models (LLMs) are a form of advanced neural network models designed to understand, generate, and process natural language. They are trained on vast amounts of text using unsupervised or semi-supervised learning techniques to learn to predict the next word in a sentence, which helps them understand patterns, grammar, context, and semantics.
A model consists of its parameters (or weights) and an executable that runs on those parameters.
Notable LLMs
- ChatGPT (OpenAI)
- Claude (Anthropic)
- LLaMA (Meta)
- DeepSeek (DeepSeek)
- Grok (xAI)
- Gemini (Google)
- Mistral (Mistral AI)
Model Training
The Neural Network of an LLM tries to predict the next word in a sequence. Due to the relation of prediction and compression, training a model can be thought of as a lossy compression of the given dataset. The training stages are:
Pretraining: Train model on a giant dataset in an expensive and lengthy process
- Download and preprocess the internet.
- Involves crawling, extracting, filtering, etc.
- E.g. FineWeb Dataset
- Tokenization: Represent tokens of text as unique IDs
- Trade-off between sequence length and symbol size
- Run byte-pair encoding to mint new symbols for common byte sequences
- See e.g. tiktokenizer
- Training
- Take windows/sequences of tokens from the training set
- Try to predict next token for this context
- Result: Base model. LLM is now a "document generator".
Fine-tuning: LLM becomes an assistant that can answer questions
- Trained on manually collected datasets from people
- People write Q&A responses based on labeling instructions
- Alternatively: Humans can compare answers and select the best one
- Focus on quality over quantity and alignment, i.e. changing the format
- Cheap and quick process, happens every week or so
- Result: assistant model
Reinforcement Learning
- Practice on problems and find solutions that work best for the LLM
Reading a model card
Example: ibm-granite/granite-3.1-8b-instruct
| metric | 8b Dense | explanation |
|---|---|---|
| Embedding Size | 4096 | embedding vector dimension for input tokens |
| # layers | 40 | 40 Transformer blocks |
| # attention heads | 32 | 32 heads in Attention |
| Attention head size | 128 | 128 dimensions per head, 4096 = 32x128 |
| # KV heads | 8 | Key-Value projection pairs |
| MLP hidden size | 12800 | hidden layer size of Multi-Layer-Perceptron or FNN |
| MLP activation | SwiGLU | Activation function for MLP (Swish-Gated Linear Unit) |
| # experts | - | specialized NNs in Mixture-of-Experts (MoE) |
| MoE TopK | - | Number of experts activated per token in MoE |
| Initialization std | Standard deviation used for parameter initialization | |
| Sequence length | 128k | Context window (max tokens model can process at once) |
| Position embedding | RoPE | Rotary Position Embedding for token position encoding |
| # Parameters | 8.1B | total parameters/weights |
| # Active parameters | 8.1B | Parameters used during a single forward pass |
| # Training tokens | 12T | 12 trillion training tokens |
AI/LLM Agents
AI agents are autonomous systems where LLMs direct themselves independently over extended periods, in order to accomplish complex tasks.
Or: Agents run tools in a loop to achieve a goal
Examples
- Browser Use - Browser controlling tool for AI agents
- Manus - General AI agent
- OpenManus - Open-source implementation of Manus
- gptme - AI agent in the terminal
API Price Comparison
| model | provider | 1m/in | 1m/out | month | context | output | rpm | tpm | rpd |
|---|---|---|---|---|---|---|---|---|---|
| Claude 3.7 Sonnet | Anthropic | $3 | $15 | $0 | 200k | 64k | 50 | 20k | |
| Claude 3.5 Sonnet | Anthropic | $3 | $15 | $0 | 200k | 8k | 50 | 40k | |
| DeepSeek-R1 | DeepSeek | $0.55 | $2.19 | $0 | 64k | 8k | ∞ | ∞ | |
| GPT-4.5 | OpenAI | $75 | $150 | $0 | 128k | 16k | |||
| Gemini 2.5 Flash | 1m | 10 | 250k | 500 | |||||
| Gemini 2.5 Pro | 1m | N/A | N/A | N/A | |||||
| Gemini 2.5 Flash | $0.15 | $3.5 | 1m | 1000 | 1m | 10k | |||
| Gemini 2.5 Pro | $2.5 | $15 | 1m | 1500 | 2m | 1000 | |||
| OpenRouter | |||||||||
| groq | |||||||||
| glhf.chat | |||||||||
| Cerebras | |||||||||
| DeepInfra | |||||||||
| together.ai | |||||||||
| Replicate | |||||||||
| Nebius |
Benchmarks
Benchmarks aims to evaluate the performance of LLMs to measure their capabilities in areas such as reasoning, understanding, and generation.
Prominent Examples:
- Chatbot Arena: users rate LLM responses in conversations (crowdsourced)
- MMLU (Massive Multitask Language Understanding): Broad knowledge tests
- HumanEval: Coding ability benchmark, checked against test cases
Resources:
- Aider LLM Leaderboards: LLM Coding Benchmark
Security
Types of attacks:
- Jailbreaks: Circumventing alignment, e.g. sending a harmful prompt in base64
- Prompt injection: Embedding harmful prompt in unexpected places
- Data poisoning/Backdoor: Poisoning (finetuning) dataset, e.g. with a code word
- Adversarial inputs
- Insecure output handling
- Data extraction & privacy
- Data reconstruction
- Denial of service
- Escalation
- Watermarking & evasion
- Model theft
Model Context Protocol (MCP)
Model Context Protocol (MCP) is an open protocol that specifies how an LLM can retrieve data from different sources into its context window and can be used to provide custom tool usage to a model. It was introduced by Anthropic in 2024.
Similar to Retrieval Augmented Generation (RAG), MCP is concerned with giving LLMs access to data that was not part of their original training data set.
Components
- MCP Host: Program that wants to access resources (e.g. Claude Desktop)
- MCP Client: On the host; maintains connections with servers
- MCP Servers: Provides resources, prompts and tools
- Resources: Local (filesystem, databases), or remote (behind API)
Resources
Claude Prompting Guide
General tips
- Be clear, specific, provide context and break complex tasks down
- Provide examples of the kind of output you're looking for
- Encourage thinking (ask to "think step-by-step" or "explain your reasoning.)
- Iterative refinement (ask for clarifications/modifications)
- Leverage Claude's knowledge (ask for explanations or background information)
- Use role-playing (ask to adopt a specific role or perspective)
Content Creation
- Specify your audience
- Define the tone and style
- Define output structure (Provide basic outline of points you want covered)
- Specify the desired output format
Document summary and Q&A
- Refer to attached documents by name.
- Ask for citations from the document in its answers
Brainstorming
- Use Claude to generate ideas by asking for a list of possibilities
- Request responses in specific formats (bullet points, lists, or tables)
Troubleshooting, minimizing hallucinations, and maximizing performance
- Allow Claude to acknowledge uncertainty ("If unsure, say you don't know")
- Break down complex tasks into smaller steps
- Include all contextual information for new conversations
Resources
- LLM Resources Hub
- LLM Visualizer
- A ChatGPT clone, in 3000 bytes of C, backed by GPT-2
- Andrej Karpathy - LLM training in simple, raw C/CUDA
- Andrej Karpathy - Deep Dive into LLMs Like ChatGPT
- The Illustrated DeepSeek-R1
- DeepSeek's open-source week and why it's a big deal
- Everything I've learned so far about running local LLMs
- llama.cpp guide - Running LLMs locally, on any hardware, from scratch
- Llama 3.1 Nuts and Bolts: Understanding how Llama works (code + documentation)
- How I Use Every Claude Code Feature
- https://www.gilesthomas.com/2024/12/llm-from-scratch-1][Writing an LLM from scratch, part 1">Giles' blog]]
- llm.c by hand ✍️ C meets Transformer
- Antrophic - Building effective agents
- Vibe engineering
- Agentic Coding Questions and Current Workflow - YouTube