Trying to run AI models locally
Run inference directly in the terminal (the second command is the same setup with thinking mode disabled):
llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 16384 --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00
llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 16384 --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00 --chat-template-kwargs "{\"enable_thinking\": false}"
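Once the server is up you don't have to stay in the terminal: `llama-server` also exposes an OpenAI-compatible HTTP API. A minimal sketch, assuming the default port `8080`; the `chat` helper and `build_chat_payload` are my own names, not part of llama.cpp:

```python
import json
import urllib.request

# Assumes llama-server is running locally on its default port 8080
# and serving the OpenAI-compatible /v1/chat/completions route.
BASE_URL = "http://127.0.0.1:8080"

def build_chat_payload(prompt: str, temperature: float = 1.0,
                       top_p: float = 0.95) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
    }

def chat(prompt: str, **sampling) -> str:
    """POST the prompt to the local server and return the reply text."""
    body = json.dumps(build_chat_payload(prompt, **sampling)).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

Handy when you want to script batches of prompts against whichever preset the server was launched with.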
Lol, I'm now thinking of buying an RTX Pro 5000 Blackwell, the 72GB VRAM one, to run big models locally lololol.
Base Command
llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE
🧠 Thinking Mode (Deep Reasoning)
llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 32768 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00
- No `--chat-template-kwargs` = thinking mode stays on
- Lower temp `0.6` for more deterministic reasoning
- Larger ctx for long chain-of-thought traces
💻 Coding Tasks (No Thinking)
llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 16384 --temp 0.2 --top-p 0.90 --top-k 20 --min-p 0.01 --chat-template-kwargs "{\"enable_thinking\": false}"
- Low temp `0.2` = precise, deterministic code output
- `min-p 0.01` filters low-probability garbage tokens
- 16k ctx is enough for most code files + context
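The awkward `\"` escaping in `--chat-template-kwargs` is just shell-quoted JSON, so you can generate it instead of hand-escaping the quotes. A quick sketch:

```python
import json
import shlex

# The flag's value is plain JSON; json.dumps handles the quoting
# and lowercases Python's False to JSON's false.
kwargs = json.dumps({"enable_thinking": False})
print(kwargs)  # {"enable_thinking": false}

# shlex.quote wraps it safely for pasting into a shell command line.
flag = f"--chat-template-kwargs {shlex.quote(kwargs)}"
print(flag)  # --chat-template-kwargs '{"enable_thinking": false}'
```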
⚡ Fast Spew / Chat / Brainstorm (No Thinking)
llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 4096 --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00 --chat-template-kwargs "{\"enable_thinking\": false}"
- Small ctx `4096` = less KV cache memory = faster tokens/sec
- High temp = creative, loose responses
- Good for: quick Q&A, drafting, ideation
📄 Long Document / RAG / Summarization
llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 65536 --temp 0.3 --top-p 0.90 --top-k 20 --min-p 0.01 --chat-template-kwargs "{\"enable_thinking\": false}"
- Max ctx for stuffing large documents
- Low temp for faithful summarization
- ⚠️ Requires significant VRAM/RAM for large KV cache
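That KV cache warning is easy to quantify: cache size grows linearly with context length, at two tensors (K and V) per layer per token. A back-of-the-envelope sketch; the layer/head numbers below are illustrative placeholders, not the real Qwen config, so check the model card before trusting the total:

```python
def kv_cache_bytes(ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size: 2 tensors (K and V) per layer, each
    holding n_kv_heads * head_dim values per token, fp16 by default."""
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem

# Illustrative dims only -- substitute the real model's values.
mib = kv_cache_bytes(ctx=65536, n_layers=48, n_kv_heads=4, head_dim=128) / 2**20
print(f"{mib:.0f} MiB")  # 6144 MiB at fp16 for a 64k context
```

Dropping to the 4k "fast chat" ctx cuts that same estimate by 16x, which is where the tokens/sec win comes from.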
🎨 Creative Writing (No Thinking)
llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 8192 --temp 1.1 --top-p 0.98 --top-k 40 --min-p 0.00 --chat-template-kwargs "{\"enable_thinking\": false}"
- Higher temp + `top-k 40` = more varied, expressive output
- Good for: stories, dialogue, roleplay
📊 Quick Reference Table
| Mode | --ctx-size | --temp | --top-k | Thinking |
|---|---|---|---|---|
| Deep Reasoning | 32768 | 0.6 | 20 | ✅ On |
| Coding | 16384 | 0.2 | 20 | ❌ Off |
| Fast Chat | 4096 | 1.0 | 20 | ❌ Off |
| Long Doc / RAG | 65536 | 0.3 | 20 | ❌ Off |
| Creative Writing | 8192 | 1.1 | 40 | ❌ Off |
💡 Tip: Smaller `--ctx-size` directly improves speed. Only increase it when you actually need the context window.
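The quick-reference table maps cleanly onto a small preset dict. A sketch that assembles the full command line per mode (preset values copied from the table; the dict keys are my own naming):

```python
import shlex

MODEL = "unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE"

# Values copied from the quick-reference table above.
PRESETS = {
    "reasoning": {"ctx-size": 32768, "temp": 0.6, "top-k": 20, "thinking": True},
    "coding":    {"ctx-size": 16384, "temp": 0.2, "top-k": 20, "thinking": False},
    "chat":      {"ctx-size": 4096,  "temp": 1.0, "top-k": 20, "thinking": False},
    "rag":       {"ctx-size": 65536, "temp": 0.3, "top-k": 20, "thinking": False},
    "creative":  {"ctx-size": 8192,  "temp": 1.1, "top-k": 40, "thinking": False},
}

def build_command(mode: str) -> str:
    """Return a shell-safe llama-server invocation for the given preset."""
    p = PRESETS[mode]
    args = ["llama-server", "-hf", MODEL,
            "--ctx-size", str(p["ctx-size"]),
            "--temp", str(p["temp"]),
            "--top-k", str(p["top-k"])]
    if not p["thinking"]:
        args += ["--chat-template-kwargs", '{"enable_thinking": false}']
    return shlex.join(args)

print(build_command("coding"))
```

`shlex.join` quotes the JSON argument for you, so the printed line can be pasted straight into a terminal.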
🌐 Translation Tasks (No Thinking)
llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 8192 --temp 0.3 --top-p 0.90 --top-k 20 --min-p 0.01 --chat-template-kwargs "{\"enable_thinking\": false}"
- Low temp `0.3` = faithful, accurate translations (not creative)
- `min-p 0.01` reduces hallucinated words/phrases
- 8k ctx handles most documents comfortably
Variant: Large Document Translation
llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 32768 --temp 0.3 --top-p 0.90 --top-k 20 --min-p 0.01 --chat-template-kwargs "{\"enable_thinking\": false}"
- Bumped ctx to `32768` for translating long files or books in one pass
Variant: Literary / Nuanced Translation (some creativity)
llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 8192 --temp 0.7 --top-p 0.95 --top-k 30 --min-p 0.01 --chat-template-kwargs "{\"enable_thinking\": false}"
- Slightly higher temp `0.7` allows stylistic flexibility
- Good for: poetry, literature, idiomatic expressions where word-for-word is wrong
Hocky's canto benchmark
Prompt:
can you translate this in natural spoken hongkong cantonese, where you'd find in chats and normal conversations
- give me 4 pairs of chopsticks
- can you hold (book) this item for me, i'll return in monday and ill buy it later
- can you chase that butterfly flying near that bush?
- please stop near that orange building in front
- dont eat to fast or youll choke yourself
Models used:
- Qwen 3.5 397B
- Grok 4.1 Expert
- Opus 4.6
- Gemini 3.1 Pro
Give me 4 pairs of chopsticks
- 唔該畀四對筷子
- 畀我4對筷子啦
- 畀四雙筷子我
- 唔該畀四對筷子我呀
Can you hold this item for me, I'll return on Monday and buy it later
- 可唔可以幫我留低呢樣嘢?我禮拜一返嚟買
- 可唔可以幫我book住呢樣嘢呀?我星期一返嚟先買
- 可唔可以幫我留住呢樣嘢先?我星期一返嚟再買
- 可唔可以幫我留起呢件嘢呀?我禮拜一會過嚟買
Can you chase that butterfly flying near that bush?
- 可唔可以追下隻喺個叢度飛緊嘅蝴蝶?
- 你可唔可以追嗰隻飛緊喺嗰個bush附近嘅蝴蝶呀?
- 你去追吓嗰隻喺嗰個樹叢度飛緊嘅蝴蝶啦
- 你去追下草叢飛緊嗰隻蝴蝶吖
Please stop near that orange building in front
- 唔該前面近嗰棟橙色樓度停
- 唔該喺前面嗰棟橙色大廈附近停低啦
- 唔該前面嗰棟橙色嘅大廈停一停
- 唔該前面橙色嗰棟樓停吖
Don't eat too fast or you'll choke yourself
- 唔好食得咁快呀,費事噎親
- 唔好食咁快呀,會噎親自己㗎
- 唔好食咁快啦,會噎親㗎
- 唔好食咁快啦,費事哽親 / 食慢啲啦,一陣哽親呀