LLaMA
LLaMA (Large Language Model Meta AI) is an open-weights LLM developed by Meta. It is available in a range of sizes, from 7 billion to 65 billion parameters.
llama.cpp
llama.cpp provides CPU/GPU inference of the LLaMA model (and others) in C/C++. CPU inference can work reasonably well up to ~10B-parameter models, but should only be used if GPU inference is impractical. GPU inference is not worth it below 8GB of VRAM.
The provided llama-server is an HTTP server with a chat UI and APIs. Using the executable and a .gguf (the model), you can invoke it like this:
$ llama-server --flash-attn --ctx-size 0 --model MODEL.gguf
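If llama.cpp was built with GPU support (e.g. CUDA), model layers can be offloaded to VRAM with the -ngl/--gpu-layers option. A sketch, assuming such a build and that 99 layers is enough to cover the whole model:

$ llama-server --flash-attn --ctx-size 0 -ngl 99 --model MODEL.gguf

With the server running, its APIs can be exercised directly. A minimal sketch, assuming the default port (8080) and the OpenAI-compatible chat completions endpoint:

$ curl -s http://localhost:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"messages": [{"role": "user", "content": "Briefly explain GGUF."}]}'

Because the endpoint is OpenAI-compatible, existing OpenAI-style clients can also be pointed at the local server.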
The context size is the largest number of tokens the LLM can handle at once, input plus output. Contexts typically range from 8K to 128K tokens, and depending on the model's tokenizer, normal English text is ~1.6 tokens per word as counted by wc -w.
If the model supports a large context, you may run out of memory. If so, set a smaller context size, like --ctx-size $((1<<13)) (i.e. 8K tokens).
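As a back-of-the-envelope check of whether a document fits, the ~1.6 tokens-per-word rule can be applied to a word count (essay.txt here is just a stand-in for your input):

$ echo $(( $(wc -w < essay.txt) * 16 / 10 ))

By the same arithmetic, an 8K context holds roughly 5,000 words of English prose, input and output combined.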
Resources
- llama.cpp guide - Running LLMs locally, on any hardware, from scratch
- Llama 3.1 Nuts and Bolts: Understanding how Llama works (code + documentation)