fix(llama): Metal crash on exit, KV cache reuse, thread tuning, date context optimization
- Fix ggml_metal_device_free crash on quit by using _exit(0) after saving
conversations, skipping C++ static destructors (known llama.cpp issue)
- Fix FolderManager() being created in the SwiftUI Menu body, causing 800+ disk reads
- Move date/time from the system prompt to a userContext block for a stable system
prompt (enables KV cache prefix reuse for pure Transformer models; sketched below)
- Add KV cache prefix reuse for non-hybrid/non-recurrent models
- Disable KV cache prefix reuse for hybrid models (Qwen3.5 Mamba+Transformer),
whose recurrent state buffer cannot be partially cleared
- Tune thread counts for Apple Silicon Metal offload (fewer threads, less
contention when the GPU does the heavy lifting)
- Enable flash attention, q8_0 KV cache quantization, and ubatch sizing
- Fix thread assignment: n_threads for generation (fewer), n_threads_batch
for prompt processing (more)
- Add EOG text detection for MLX streaming (Gemma's <end_of_turn>, etc.; sketched below)
- Add a strong reference to EndpointManager in AppDelegate
- Handle unknown model endpoint types gracefully
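Two of the bullets above are easier to see in miniature. First, the date/time move: the point is to keep the system prompt byte-identical across requests so its KV-cache prefix can be reused. A minimal sketch, assuming a <userContext> block format (the block name comes from the commit message; everything else here is illustrative):

import Foundation

let systemPrompt = "You are a helpful assistant."   // identical on every request
let userMessage = "What's on my calendar today?"

// The volatile date lives in the user turn, not the system prompt, so the
// system-prompt token prefix stays stable and its KV entries can be reused.
let userContext = """
<userContext>
Current date/time: \(Date().formatted(date: .abbreviated, time: .shortened))
</userContext>
"""
let messages: [(role: String, content: String)] = [
    ("system", systemPrompt),
    ("user", userContext + "\n\n" + userMessage),
]

Second, the EOG text check for MLX streaming: some model families emit their end-of-generation marker as literal text in the stream, so the accumulated output has to be scanned for it. A sketch with an illustrative marker list (an assumption, not the commit's exact set):

/// End-of-generation markers that can appear as literal text (illustrative).
let eogMarkers = ["<end_of_turn>", "<|im_end|>", "<|eot_id|>", "</s>"]

/// Returns the visible text and whether an EOG marker was hit, trimming the
/// marker and anything after it from the stream.
func detectEOG(in accumulated: String) -> (visible: String, isDone: Bool) {
    for marker in eogMarkers {
        if let range = accumulated.range(of: marker) {
            return (String(accumulated[..<range.lowerBound]), true)
        }
    }
    return (accumulated, false)
}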
ctx_params.n_threads_batch = promptThreads // Batch/prompt processing (compute-bound, more threads)
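For context on the thread-tuning bullets, a sketch of the kind of heuristic the assignment above implies (the concrete numbers are assumptions; the diff only shows the final assignments). With layers fully offloaded to Metal, per-token generation is memory-bandwidth-bound, so extra CPU threads mostly add contention, while prompt processing is compute-bound and scales with threads:

import Foundation

let coreCount = ProcessInfo.processInfo.activeProcessorCount
let generationThreads = Int32(max(2, min(4, coreCount / 4)))  // fewer: generation is memory-bound
let promptThreads = Int32(max(4, coreCount / 2))              // more: prompt decode is compute-bound

ctx_params.n_threads = generationThreads        // per-token generation
ctx_params.n_threads_batch = promptThreads      // batched prompt processing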
/// CRITICAL FIX: Explicitly initialize samplers to NULL to prevent segfault.
/// Recent llama.cpp versions validate sampler chains during context init.
@@ -389,10 +392,25 @@ actor LlamaContext {
ctx_params.samplers = nil
ctx_params.n_samplers = 0

/// PERFORMANCE: Offload KV cache to GPU (LMStudio "Offload KV Cache to GPU").
/// This stores the key/value cache on the GPU instead of in CPU RAM and
/// dramatically improves performance for long contexts.
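The commit-message bullet about flash attention, q8_0 KV cache quantization, and ubatch sizing corresponds to context parameters set near this comment. A sketch of those settings (the field names are the llama.cpp C API in versions where flash_attn is a boolean flag; the specific values are assumptions):

ctx_params.offload_kqv = true         // keep the KV cache in GPU memory
ctx_params.flash_attn = true          // flash attention; required for a quantized V cache
ctx_params.type_k = GGML_TYPE_Q8_0    // quantize the K cache to q8_0
ctx_params.type_v = GGML_TYPE_Q8_0    // quantize the V cache to q8_0
ctx_params.n_ubatch = 512             // physical micro-batch size (assumed value)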
/// Clear KV cache before each new completion. Without this, llama.cpp fails
/// with a "sequence positions remain consecutive" error because it thinks
/// sequence 0 already has tokens from the previous request. Use
/// llama_memory_seq_rm with positions -1 to -1 to clear the entire sequence.
- let memory = llama_get_memory(context)
- _ = llama_memory_seq_rm(memory, 0, -1, -1)
/// Reset completion state.
is_done = false
n_decode = 0
- n_cur = 0
accumulatedText = ""

/// Start performance tracking.
generationStartTime = Date()
tokensGenerated = 0

+ /// KV cache prefix reuse: find how many tokens match the previous prompt.
+ /// This avoids reprocessing the system prompt + tools on every message.
+ /// NOTE: Prefix reuse is disabled for hybrid Mamba/Transformer models (qwen35, etc.)
+ /// because their recurrent state (RS buffer) cannot be partially cleared.
+ /// The RS buffer tracks state across all positions and becomes inconsistent
+ /// after llama_memory_seq_rm, causing decode failures.
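The excerpt shows these comments and the downstream uses of startIndex, but not the matching step itself. A minimal sketch of how it plausibly works, assuming the names prefixReuseEnabled and previousPromptTokens (llama_get_memory and llama_memory_seq_rm are the real llama.cpp calls already visible in this diff):

var startIndex = 0
if prefixReuseEnabled {  // assumed flag; false for hybrid Mamba/Transformer models
    while startIndex < previousPromptTokens.count,
          startIndex < tokens_list.count - 1,  // always re-decode at least one token
          previousPromptTokens[startIndex] == tokens_list[startIndex] {
        startIndex += 1
    }
    // Keep cached entries for positions [0, startIndex); drop everything after.
    let memory = llama_get_memory(context)
    _ = llama_memory_seq_rm(memory, 0, Int32(startIndex), -1)
} else {
    // Hybrid models: the RS buffer cannot be partially cleared, so wipe the
    // whole sequence and reprocess the prompt from position 0.
    let memory = llama_get_memory(context)
    _ = llama_memory_seq_rm(memory, 0, -1, -1)
}
previousPromptTokens = tokens_list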
/// Process prompt in batches if it exceeds the batch size. The batch was
/// initialized with batchSize (2048), but the prompt might be larger, so we
/// need to decode in chunks to avoid overflow.
- llamaLogger.info("Processing prompt with \(tokens_list.count) tokens, batch_size=\(batchSize)")
+ llamaLogger.info("Processing prompt with \(tokens_list.count - startIndex) new tokens (of \(tokens_list.count) total), batch_size=\(batchSize)")

/// Adaptive batch sizing based on prompt length and context capacity. The
/// crash occurs when the KV cache runs out of memory slots. Root cause:
/// processing large prompts (8611 tokens) exhausts the KV cache after 2
/// batches. Solution: reduce batch size dynamically when approaching
/// context limits.
let maxContextTokens = Int32(contextSize)
- var tokenIndex = 0
+ var tokenIndex = startIndex
var consecutiveDecodeFailures = 0
let maxConsecutiveFailures = 3
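These variables feed a chunked decode loop that the excerpt elides. A sketch of its assumed shape, filling the llama.cpp batch fields directly (the clamping policy is an assumption based on the comment about approaching context limits):

while tokenIndex < tokens_list.count {
    // Shrink the chunk near the context limit so a batch never overruns the
    // remaining KV-cache slots.
    let remainingSlots = max(1, Int(maxContextTokens) - tokenIndex)
    let chunkSize = min(Int(batchSize), tokens_list.count - tokenIndex, remainingSlots)

    batch.n_tokens = 0
    for i in 0..<chunkSize {
        let pos = tokenIndex + i
        let slot = Int(batch.n_tokens)
        batch.token[slot] = tokens_list[pos]
        batch.pos[slot] = Int32(pos)
        batch.n_seq_id[slot] = 1
        batch.seq_id[slot]![0] = 0
        // Only the final prompt token needs logits, to sample the first
        // generated token from.
        batch.logits[slot] = (pos == tokens_list.count - 1) ? 1 : 0
        batch.n_tokens += 1
    }

    if llama_decode(context, batch) != 0 {
        consecutiveDecodeFailures += 1
        if consecutiveDecodeFailures >= maxConsecutiveFailures { break }
    } else {
        consecutiveDecodeFailures = 0
        tokenIndex += chunkSize
    }
}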
@@ -689,9 +755,26 @@ actor LlamaContext {
maxTokensLimit = limit
}

+ /// Explicitly free all Metal/llama resources. Call before app exit to avoid
+ /// static destructor crashes in ggml_metal_device_free.
+ func destroy() {
+     llamaLogger.info("LlamaContext destroy: Explicitly freeing all resources")
+     previousPromptTokens = []
+     llama_sampler_free(sampling)
+     llama_batch_free(batch)
+     llama_free(context)
+     llama_model_free(model)
+     /// Nil out the pointers so deinit doesn't double-free.
+     /// (These are let properties in the actor, so we use a flag instead.)
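For context, here is how the quit path described in the commit message plausibly fits together. A sketch, not the app's actual AppDelegate: applicationShouldTerminate is real AppKit API, but the property names and ordering are assumptions:

import AppKit
import Darwin

final class AppDelegate: NSObject, NSApplicationDelegate {
    var llamaContext: LlamaContext?        // the actor shown in the diff above
    var endpointManager: EndpointManager?  // strong reference, per the commit message

    func applicationShouldTerminate(_ sender: NSApplication) -> NSApplicationTerminateReply {
        Task {
            // Save conversations first (per the commit message), then free the
            // Metal/llama resources explicitly...
            await llamaContext?.destroy()
            // ...and skip C++ static destructors entirely: a normal process
            // teardown crashes in ggml_metal_device_free (known llama.cpp
            // issue), so end the process immediately with _exit(0).
            _exit(0)
        }
        return .terminateLater
    }
}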