
Trying to run AI models locally


Run inference directly in the terminal (the first command keeps thinking mode on; the second disables it):

llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 16384 --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00
llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 16384 --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00 --chat-template-kwargs "{\"enable_thinking\": false}"

Lol, I'm now thinking of buying the 72GB RTX Pro 5000 Blackwell to run big models locally lololol.

Base Command

llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE
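Once the server is up, it exposes an OpenAI-compatible HTTP API (port 8080 by default). A minimal Python sketch of talking to it — `build_chat_request` is a helper name of my own, not part of llama.cpp; uncomment the last lines to actually send the request:

```python
import json
import urllib.request

# llama-server serves an OpenAI-compatible API on http://localhost:8080.
# build_chat_request is my own helper name, not a llama.cpp function.
def build_chat_request(prompt, host="http://localhost:8080"):
    """Prepare a POST request for llama-server's /v1/chat/completions."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("Say hi")
print(req.full_url)  # http://localhost:8080/v1/chat/completions
# With the server running, send it like this:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```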

🧠 Thinking Mode (Deep Reasoning)

llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 32768 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00
  • No --chat-template-kwargs = thinking mode on
  • Lower temp 0.6 for more deterministic reasoning
  • Larger ctx for long chain-of-thought traces
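A quick way to see why the thinking preset drops `--temp` to 0.6: temperature divides the logits before the softmax, so values below 1 sharpen the distribution toward the top token. A toy sketch with made-up logits:

```python
import math

# Temperature scaling: logits are divided by temp before softmax,
# so temp < 1 concentrates probability on the most likely token.
# The logits below are made up for illustration.
def softmax(logits, temp):
    scaled = [l / temp for l in logits]
    m = max(scaled)                      # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.1]
for t in (1.0, 0.6):
    probs = softmax(logits, t)
    print(f"temp={t}: top token prob = {probs[0]:.2f}")
# temp=1.0 gives 0.57; temp=0.6 gives 0.76 — the same logits, sharper choice.
```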

💻 Coding Tasks (No Thinking)

llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 16384 --temp 0.2 --top-p 0.90 --top-k 20 --min-p 0.01 --chat-template-kwargs "{\"enable_thinking\": false}"
  • Low temp 0.2 = precise, deterministic code output
  • min-p 0.01 filters low-probability garbage tokens
  • 16k ctx is enough for most code files + context
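The min-p rule can be sketched directly: a token survives only if its probability is at least `min_p` times the top token's probability. A minimal sketch with a made-up distribution, not real model output:

```python
# --min-p 0.01: after softmax, drop any token whose probability is
# below min_p * (top token's probability), then renormalize.
# The probabilities below are invented for illustration.
def min_p_filter(probs, min_p):
    """Keep tokens with prob >= min_p * max(probs); renormalize the rest."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

probs = [0.60, 0.30, 0.095, 0.004, 0.001]   # long tail of junk tokens
filtered = min_p_filter(probs, 0.01)
print(filtered)  # the last two tokens (below 0.60 * 0.01 = 0.006) are zeroed
```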

⚡ Fast Spew / Chat / Brainstorm (No Thinking)

llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 4096 --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00 --chat-template-kwargs "{\"enable_thinking\": false}"
  • Small ctx 4096 = less KV cache memory = faster tokens/sec
  • High temp = creative, loose responses
  • Good for: quick Q&A, drafting, ideation
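For reference, `--top-p` (nucleus sampling) keeps the smallest set of tokens whose cumulative probability reaches p and discards the rest. A minimal sketch with made-up probabilities:

```python
# --top-p 0.95 (nucleus sampling): sort tokens by probability, keep the
# smallest prefix whose cumulative mass reaches p, renormalize.
# The probabilities below are invented for illustration.
def top_p_filter(probs, p):
    """Zero out tokens outside the cumulative-mass-p nucleus; renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= p:
            break
    out = [probs[i] if i in kept else 0.0 for i in range(len(probs))]
    total = sum(out)
    return [x / total for x in out]

probs = [0.50, 0.30, 0.15, 0.04, 0.01]
print(top_p_filter(probs, 0.95))  # the last two tokens fall outside the nucleus
```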

📄 Long Document / RAG / Summarization

llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 65536 --temp 0.3 --top-p 0.90 --top-k 20 --min-p 0.01 --chat-template-kwargs "{\"enable_thinking\": false}"
  • Max ctx for stuffing large documents
  • Low temp for faithful summarization
  • ⚠️ Requires significant VRAM/RAM for large KV cache
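To see roughly how `--ctx-size` turns into memory: the KV cache grows linearly with context, about 2 (K and V) × layers × KV heads × head dim × context length × bytes per element. A back-of-envelope sketch — the layer/head numbers below are hypothetical, not the real Qwen config; plug in your model's values:

```python
# Rough KV-cache size: 2 (K and V) * n_layers * n_kv_heads * head_dim
# * ctx * bytes_per_elem. Grows linearly with --ctx-size.
def kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

# Hypothetical 48-layer model with 8 KV heads of dim 128, fp16 cache
# (made-up dimensions for illustration only):
for ctx in (4096, 16384, 65536):
    gib = kv_cache_bytes(ctx, 48, 8, 128) / 2**30
    print(f"ctx={ctx:>6}: ~{gib:.1f} GiB KV cache")
# For this toy config, 65536 ctx needs ~12 GiB for the cache alone.
```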

🎨 Creative Writing (No Thinking)

llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 8192 --temp 1.1 --top-p 0.98 --top-k 40 --min-p 0.00 --chat-template-kwargs "{\"enable_thinking\": false}"
  • Higher temp + top-k 40 = more varied, expressive output
  • Good for: stories, dialogue, roleplay
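The jump from `--top-k` 20 to 40 widens the candidate pool at each step, which is what lets less likely (more surprising) words through. A minimal top-k sketch over a made-up five-token distribution:

```python
# --top-k: sample only from the k most probable tokens; raising k widens
# the pool. The probabilities below are invented for illustration.
def top_k_filter(probs, k):
    """Zero out everything but the k most probable tokens; renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(order[:k])
    out = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(out)
    return [x / total for x in out]

probs = [0.40, 0.25, 0.15, 0.12, 0.08]
print(top_k_filter(probs, 2))  # only the two most likely tokens survive
```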

📊 Quick Reference Table

Mode              --ctx-size  --temp  --top-k  Thinking
Deep Reasoning         32768     0.6       20  ✅ On
Coding                 16384     0.2       20  ❌ Off
Fast Chat               4096     1.0       20  ❌ Off
Long Doc / RAG         65536     0.3       20  ❌ Off
Creative Writing        8192     1.1       40  ❌ Off

💡 Tip: Smaller --ctx-size directly improves speed. Only increase it when you actually need the context window.


🌐 Translation Tasks (No Thinking)

llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 8192 --temp 0.3 --top-p 0.90 --top-k 20 --min-p 0.01 --chat-template-kwargs "{\"enable_thinking\": false}"
  • Low temp 0.3 = faithful, accurate translations (not creative)
  • min-p 0.01 reduces hallucinated words/phrases
  • 8k ctx handles most documents comfortably

Variant: Large Document Translation

llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 32768 --temp 0.3 --top-p 0.90 --top-k 20 --min-p 0.01 --chat-template-kwargs "{\"enable_thinking\": false}"
  • Bumped ctx to 32768 for translating long files or books in one pass

Variant: Literary / Nuanced Translation (some creativity)

llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 8192 --temp 0.7 --top-p 0.95 --top-k 30 --min-p 0.01 --chat-template-kwargs "{\"enable_thinking\": false}"
  • Slightly higher temp 0.7 allows stylistic flexibility
  • Good for: poetry, literature, idiomatic expressions where word-for-word is wrong

Hocky's Canto benchmark

Prompt:

can you translate this in natural spoken hongkong cantonese, where you'd find in chats and normal conversations
- give me 4 pairs of chopsticks
- can you hold (book) this item for me, i'll return in monday and ill buy it later
- can you chase that butterfly flying near that bush?
- please stop near that orange building in front
- dont eat to fast or youll choke yourself

Models used:

  1. Qwen 3.5 397B
  2. Grok 4.1 Expert
  3. Opus 4.6
  4. Gemini 3.1 Pro

Give me 4 pairs of chopsticks

  1. 唔該畀四對筷子
  2. 畀我4對筷子啦
  3. 畀四雙筷子我
  4. 唔該畀四對筷子我呀

Can you hold this item for me, I'll return on Monday and buy it later

  1. 可唔可以幫我留低呢樣嘢?我禮拜一返嚟買
  2. 可唔可以幫我book住呢樣嘢呀?我星期一返嚟先買
  3. 可唔可以幫我留住呢樣嘢先?我星期一返嚟再買
  4. 可唔可以幫我留起呢件嘢呀?我禮拜一會過嚟買

Can you chase that butterfly flying near that bush?

  1. 可唔可以追下隻喺個叢度飛緊嘅蝴蝶?
  2. 你可唔可以追嗰隻飛緊喺嗰個bush附近嘅蝴蝶呀?
  3. 你去追吓嗰隻喺嗰個樹叢度飛緊嘅蝴蝶啦
  4. 你去追下草叢飛緊嗰隻蝴蝶吖

Please stop near that orange building in front

  1. 唔該前面近嗰棟橙色樓度停
  2. 唔該喺前面嗰棟橙色大廈附近停低啦
  3. 唔該前面嗰棟橙色嘅大廈停一停
  4. 唔該前面橙色嗰棟樓停吖

Don't eat too fast or you'll choke yourself

  1. 唔好食得咁快呀,費事噎親
  2. 唔好食咁快呀,會噎親自己㗎
  3. 唔好食咁快啦,會噎親㗎
  4. 唔好食咁快啦,費事哽親 / 食慢啲啦,一陣哽親呀
