
Trying to run AI models locally


Run inference directly in the terminal (the first command keeps thinking mode on; the second disables it):

llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 16384 --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00
llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 16384 --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00 --chat-template-kwargs "{\"enable_thinking\": false}"

Lol, I'm now thinking of buying the 72GB RTX Pro 5000 Blackwell to run big models locally lololol.

Base Command

llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE
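Once the server is up, it exposes an OpenAI-compatible HTTP API (port 8080 by default). A minimal Python sketch of talking to it — `build_chat_request` is a helper name of my own, not part of llama.cpp; uncomment the last lines to actually send the request:

```python
import json
import urllib.request

# llama-server serves an OpenAI-compatible API on http://localhost:8080.
# build_chat_request is my own helper name, not a llama.cpp function.
def build_chat_request(prompt, host="http://localhost:8080"):
    """Prepare a POST request for llama-server's /v1/chat/completions."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("Say hi")
print(req.full_url)  # http://localhost:8080/v1/chat/completions
# With the server running, send it like this:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```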

🧠 Thinking Mode (Deep Reasoning)

llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 32768 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00
  • No --chat-template-kwargs = thinking mode on
  • Lower temp 0.6 for more deterministic reasoning
  • Larger ctx for long chain-of-thought traces
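A quick way to see why the thinking preset drops `--temp` to 0.6: temperature divides the logits before the softmax, so values below 1 sharpen the distribution toward the top token. A toy sketch with made-up logits:

```python
import math

# Temperature scaling: logits are divided by temp before softmax,
# so temp < 1 concentrates probability on the most likely token.
# The logits below are made up for illustration.
def softmax(logits, temp):
    scaled = [l / temp for l in logits]
    m = max(scaled)                      # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.1]
for t in (1.0, 0.6):
    probs = softmax(logits, t)
    print(f"temp={t}: top token prob = {probs[0]:.2f}")
# temp=1.0 gives 0.57; temp=0.6 gives 0.76 — the same logits, sharper choice.
```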

💻 Coding Tasks (No Thinking)

llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 16384 --temp 0.2 --top-p 0.90 --top-k 20 --min-p 0.01 --chat-template-kwargs "{\"enable_thinking\": false}"
  • Low temp 0.2 = precise, deterministic code output
  • min-p 0.01 filters low-probability garbage tokens
  • 16k ctx is enough for most code files + context
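The min-p rule can be sketched directly: a token survives only if its probability is at least `min_p` times the top token's probability. A minimal sketch with a made-up distribution, not real model output:

```python
# --min-p 0.01: after softmax, drop any token whose probability is
# below min_p * (top token's probability), then renormalize.
# The probabilities below are invented for illustration.
def min_p_filter(probs, min_p):
    """Keep tokens with prob >= min_p * max(probs); renormalize the rest."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

probs = [0.60, 0.30, 0.095, 0.004, 0.001]   # long tail of junk tokens
filtered = min_p_filter(probs, 0.01)
print(filtered)  # the last two tokens (below 0.60 * 0.01 = 0.006) are zeroed
```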

⚡ Fast Spew / Chat / Brainstorm (No Thinking)

llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 4096 --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00 --chat-template-kwargs "{\"enable_thinking\": false}"
  • Small ctx 4096 = less KV cache memory = faster tokens/sec
  • High temp = creative, loose responses
  • Good for: quick Q&A, drafting, ideation
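For reference, `--top-p` (nucleus sampling) keeps the smallest set of tokens whose cumulative probability reaches p and discards the rest. A minimal sketch with made-up probabilities:

```python
# --top-p 0.95 (nucleus sampling): sort tokens by probability, keep the
# smallest prefix whose cumulative mass reaches p, renormalize.
# The probabilities below are invented for illustration.
def top_p_filter(probs, p):
    """Zero out tokens outside the cumulative-mass-p nucleus; renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= p:
            break
    out = [probs[i] if i in kept else 0.0 for i in range(len(probs))]
    total = sum(out)
    return [x / total for x in out]

probs = [0.50, 0.30, 0.15, 0.04, 0.01]
print(top_p_filter(probs, 0.95))  # the last two tokens fall outside the nucleus
```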

📄 Long Document / RAG / Summarization

llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 65536 --temp 0.3 --top-p 0.90 --top-k 20 --min-p 0.01 --chat-template-kwargs "{\"enable_thinking\": false}"
  • Max ctx for stuffing large documents
  • Low temp for faithful summarization
  • ⚠️ Requires significant VRAM/RAM for large KV cache
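To see roughly how `--ctx-size` turns into memory: the KV cache grows linearly with context, about 2 (K and V) × layers × KV heads × head dim × context length × bytes per element. A back-of-envelope sketch — the layer/head numbers below are hypothetical, not the real Qwen config; plug in your model's values:

```python
# Rough KV-cache size: 2 (K and V) * n_layers * n_kv_heads * head_dim
# * ctx * bytes_per_elem. Grows linearly with --ctx-size.
def kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

# Hypothetical 48-layer model with 8 KV heads of dim 128, fp16 cache
# (made-up dimensions for illustration only):
for ctx in (4096, 16384, 65536):
    gib = kv_cache_bytes(ctx, 48, 8, 128) / 2**30
    print(f"ctx={ctx:>6}: ~{gib:.1f} GiB KV cache")
# For this toy config, 65536 ctx needs ~12 GiB for the cache alone.
```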

🎨 Creative Writing (No Thinking)

llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 8192 --temp 1.1 --top-p 0.98 --top-k 40 --min-p 0.00 --chat-template-kwargs "{\"enable_thinking\": false}"
  • Higher temp + top-k 40 = more varied, expressive output
  • Good for: stories, dialogue, roleplay
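The jump from `--top-k` 20 to 40 widens the candidate pool at each step, which is what lets less likely (more surprising) words through. A minimal top-k sketch over a made-up five-token distribution:

```python
# --top-k: sample only from the k most probable tokens; raising k widens
# the pool. The probabilities below are invented for illustration.
def top_k_filter(probs, k):
    """Zero out everything but the k most probable tokens; renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(order[:k])
    out = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(out)
    return [x / total for x in out]

probs = [0.40, 0.25, 0.15, 0.12, 0.08]
print(top_k_filter(probs, 2))  # only the two most likely tokens survive
```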

📊 Quick Reference Table

Mode              --ctx-size  --temp  --top-k  Thinking
Deep Reasoning         32768     0.6       20  ✅ On
Coding                 16384     0.2       20  ❌ Off
Fast Chat               4096     1.0       20  ❌ Off
Long Doc / RAG         65536     0.3       20  ❌ Off
Creative Writing        8192     1.1       40  ❌ Off

💡 Tip: Smaller --ctx-size directly improves speed. Only increase it when you actually need the context window.


🌐 Translation Tasks (No Thinking)

llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 8192 --temp 0.3 --top-p 0.90 --top-k 20 --min-p 0.01 --chat-template-kwargs "{\"enable_thinking\": false}"
  • Low temp 0.3 = faithful, accurate translations (not creative)
  • min-p 0.01 reduces hallucinated words/phrases
  • 8k ctx handles most documents comfortably

Variant: Large Document Translation

llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 32768 --temp 0.3 --top-p 0.90 --top-k 20 --min-p 0.01 --chat-template-kwargs "{\"enable_thinking\": false}"
  • Bumped ctx to 32768 for translating long files or books in one pass

Variant: Literary / Nuanced Translation (some creativity)

llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 8192 --temp 0.7 --top-p 0.95 --top-k 30 --min-p 0.01 --chat-template-kwargs "{\"enable_thinking\": false}"
  • Slightly higher temp 0.7 allows stylistic flexibility
  • Good for: poetry, literature, idiomatic expressions where word-for-word is wrong

Hocky's Canto benchmark

Prompt:

can you translate this in natural spoken hongkong cantonese, where you'd find in chats and normal conversations
- give me 4 pairs of chopsticks
- can you hold (book) this item for me, i'll return in monday and ill buy it later
- can you chase that butterfly flying near that bush?
- please stop near that orange building in front
- dont eat to fast or youll choke yourself

Models used:

  1. Qwen 3.5 397B
  2. Grok 4.1 Expert
  3. Opus 4.6
  4. Gemini 3.1 Pro

Give me 4 pairs of chopsticks

  1. 唔該畀四對筷子
  2. 畀我4對筷子啦
  3. 畀四雙筷子我
  4. 唔該畀四對筷子我呀

Can you hold this item for me, I'll return on Monday and buy it later

  1. 可唔可以幫我留低呢樣嘢?我禮拜一返嚟買
  2. 可唔可以幫我book住呢樣嘢呀?我星期一返嚟先買
  3. 可唔可以幫我留住呢樣嘢先?我星期一返嚟再買
  4. 可唔可以幫我留起呢件嘢呀?我禮拜一會過嚟買

Can you chase that butterfly flying near that bush?

  1. 可唔可以追下隻喺個叢度飛緊嘅蝴蝶?
  2. 你可唔可以追嗰隻飛緊喺嗰個bush附近嘅蝴蝶呀?
  3. 你去追吓嗰隻喺嗰個樹叢度飛緊嘅蝴蝶啦
  4. 你去追下草叢飛緊嗰隻蝴蝶吖

Please stop near that orange building in front

  1. 唔該前面近嗰棟橙色樓度停
  2. 唔該喺前面嗰棟橙色大廈附近停低啦
  3. 唔該前面嗰棟橙色嘅大廈停一停
  4. 唔該前面橙色嗰棟樓停吖

Don't eat too fast or you'll choke yourself

  1. 唔好食得咁快呀,費事噎親
  2. 唔好食咁快呀,會噎親自己㗎
  3. 唔好食咁快啦,會噎親㗎
  4. 唔好食咁快啦,費事哽親 / 食慢啲啦,一陣哽親呀
