LLaMA
LLaMA (Large Language Model Meta AI) is an open-weights LLM developed by Meta. It is available in a range of sizes, from 7 billion to 65 billion parameters.
llama.cpp
llama.cpp provides CPU/GPU inference of the LLaMA model (and others) in C/C++. CPU inference can work reasonably well up to ~10B-parameter models, but should only be used if GPU inference is impractical. GPU inference is not worth it below 8GB of VRAM.
The provided llama-server is an HTTP server with a chat UI and APIs. Using the executable and a .gguf (the model), you can invoke it like this:
$ llama-server --flash-attn --ctx-size 0 --model MODEL.gguf
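If llama.cpp was built with GPU support (e.g. CUDA), model layers can be offloaded to VRAM with the -ngl/--gpu-layers option. A sketch, assuming such a build and that 99 layers is enough to cover the whole model:

$ llama-server --flash-attn --ctx-size 0 -ngl 99 --model MODEL.gguf

With the server running, its APIs can be exercised directly. A minimal sketch, assuming the default port (8080) and the OpenAI-compatible chat completions endpoint:

$ curl -s http://localhost:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"messages": [{"role": "user", "content": "Briefly explain GGUF."}]}'

Because the endpoint is OpenAI-compatible, existing OpenAI-style clients can also be pointed at the local server.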
The context size is the largest number of tokens the LLM can handle at once, input plus output. Contexts typically range from 8K to 128K tokens, and depending on the model's tokenizer, normal English text is ~1.6 tokens per word as counted by wc -w.
If the model supports a large context, you may run out of memory. If so, set a smaller context size, like --ctx-size $((1<<13)) (i.e. 8K tokens).
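As a back-of-the-envelope check of whether a document fits, the ~1.6 tokens-per-word rule can be applied to a word count (essay.txt here is just a stand-in for your input):

$ echo $(( $(wc -w < essay.txt) * 16 / 10 ))

By the same arithmetic, an 8K context holds roughly 5,000 words of English prose, input and output combined.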
Resources
- llama.cpp guide - Running LLMs locally, on any hardware, from scratch
- Llama 3.1 Nuts and Bolts: Understanding how Llama works (code + documentation)