Jacek Strzalkowski authored
- Rewrite for Llama architecture: RMSNorm, GQA (32Q/8KV heads), RoPE, SwiGLU
- Separate Q/K/V/output/gate/up/down projections (7 per block, was 4)
- No biases on linear layers, no position embeddings
- Add tokenizer_gguf.py: BPE tokenizer extracted from GGUF metadata
- Fix 64-bit offset in llama_set_ptr (8GB+ weights file)
- Fix _ftelli64 portability (MSVC vs GCC/Clang)
- KV cache at N_KV_DIM=1024 (4x memory savings vs full embed)
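
The 4x figure in the last item follows from the head counts: with 8 KV heads and N_KV_DIM=1024, each head is 128 wide, so 32 query heads give a full embedding width of 4096. The sketch below checks that arithmetic. Only the head counts and N_KV_DIM come from this commit; the layer count, sequence length, element size, and the helper `kv_cache_bytes` are assumptions chosen for illustration and are not taken from the code.

```python
# Values stated in the commit message.
N_Q_HEADS = 32
N_KV_HEADS = 8
N_KV_DIM = 1024

# Derived from the values above.
HEAD_DIM = N_KV_DIM // N_KV_HEADS   # per-head width
N_EMBED = N_Q_HEADS * HEAD_DIM      # full embedding width

# Assumptions for illustration only (not stated in the commit).
N_LAYERS = 32
SEQ_LEN = 2048
BYTES_PER_ELEM = 4                  # fp32


def kv_cache_bytes(kv_dim, n_layers, seq_len, bytes_per_elem):
    """Hypothetical helper: bytes needed to cache K and V for every layer and token."""
    return 2 * kv_dim * n_layers * seq_len * bytes_per_elem


gqa_bytes = kv_cache_bytes(N_KV_DIM, N_LAYERS, SEQ_LEN, BYTES_PER_ELEM)
full_bytes = kv_cache_bytes(N_EMBED, N_LAYERS, SEQ_LEN, BYTES_PER_ELEM)

print("head dim:", HEAD_DIM)
print("full embed:", N_EMBED)
print("GQA cache MiB:", gqa_bytes // (1024 * 1024))
print("full cache MiB:", full_bytes // (1024 * 1024))
print("savings factor:", full_bytes // gqa_bytes)
```

The ratio depends only on N_EMBED / N_KV_DIM, so it stays at 4 regardless of the assumed layer count, sequence length, or element size.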
ae1e33f2