README.md · 9715c974d6eef107c1b5ac39a102fd70774f423e · Everyone / DeepseekR1Compiler

Port pipeline from GPT-2 to DeepSeek-R1-Distill-Llama-8B · ae1e33f2

Jacek Strzalkowski authored May 06, 2026

- Rewrite for Llama architecture: RMSNorm, GQA (32Q/8KV heads), RoPE, SwiGLU
- Separate Q/K/V/output/gate/up/down projections (7 per block, was 4)
- No biases on linear layers, no position embeddings
- Add tokenizer_gguf.py: BPE tokenizer extracted from GGUF metadata
- Fix 64-bit offset in llama_set_ptr (8GB+ weights file)
- Fix _ftelli64 portability (MSVC vs GCC/Clang)
- Size KV cache at N_KV_DIM=1024 (8 KV heads x 128 head dim): 4x memory savings vs caching at the full 4096 embedding width
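
The GQA and file-offset items above can be sketched in a few lines of C. This is a minimal sketch and not the code from this commit: only `N_KV_DIM` and `_ftelli64` come from the commit message, while `N_EMBD`, `N_HEAD`, `N_KV_HEAD`, `HEAD_DIM`, `kv_cache_bytes`, `kv_head_for_query`, and `tell64` are hypothetical names chosen for illustration. The dimensions assume the standard Llama 8B shape (4096-wide embedding, 32 query heads, 8 KV heads), and the `_MSC_VER` switch is one plausible way to handle the MSVC vs GCC/Clang split.

```c
/* Ask POSIX libcs for a 64-bit off_t and for the ftello() declaration. */
#define _FILE_OFFSET_BITS 64
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed model shape; only N_KV_DIM is named in the commit message. */
#define N_EMBD     4096                     /* embedding width            */
#define N_HEAD     32                       /* query heads                */
#define N_KV_HEAD  8                        /* key/value heads (GQA)      */
#define HEAD_DIM   (N_EMBD / N_HEAD)        /* 128                        */
#define N_KV_DIM   (N_KV_HEAD * HEAD_DIM)   /* 1024                       */

/* Bytes needed to cache K and V rows of width row_dim
 * for n_ctx positions in each of n_layer layers. */
static inline uint64_t kv_cache_bytes(uint64_t n_layer, uint64_t n_ctx,
                                      uint64_t row_dim, uint64_t elem_size)
{
    return 2u * n_layer * n_ctx * row_dim * elem_size;
}

/* With GQA, consecutive groups of N_HEAD / N_KV_HEAD query heads
 * share a single KV head. */
static inline int kv_head_for_query(int q_head)
{
    assert(q_head >= 0 && q_head < N_HEAD);
    return q_head / (N_HEAD / N_KV_HEAD);
}

/* 64-bit file position. A plain ftell() returns long, which is 32 bits
 * on Windows and cannot represent offsets in an 8GB+ file. */
static inline int64_t tell64(FILE *f)
{
#if defined(_MSC_VER)
    return _ftelli64(f);          /* MSVC runtime */
#else
    return (int64_t)ftello(f);    /* POSIX: GCC/Clang on Linux, macOS */
#endif
}
```

For example, with 32 layers, a 1024-token context, and 4-byte floats, `kv_cache_bytes(32, 1024, N_KV_DIM, 4)` comes to 256 MiB, against 1 GiB if rows were cached at `N_EMBD` width. That ratio is the 4x saving noted above.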
