Description
Hi,
Issue:
I tried to run llava-v1.5-7b-q4.llamafile and TinyLlama-1.1B-Chat-v1.0.F16.llamafile on my system:
Linux Ubuntu 6.5.0-28-generic #29~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 4 14:39:20 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Both crash with the same error at the same step:
stdout:
$ ./TinyLlama-1.1B-Chat-v1.0.F16.llamafile
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
{"build":1500,"commit":"a30b324","function":"server_cli","level":"INFO","line":2856,"msg":"build info","tid":"11165056","timestamp":1715465433}
{"function":"server_cli","level":"INFO","line":2859,"msg":"system info","n_threads":4,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"11165056","timestamp":1715465433,"total_threads":4}
llama_model_loader: loaded meta data with 23 key-value pairs and 201 tensors from TinyLlama-1.1B-Chat-v1.0.F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: llama.block_count u32 = 22
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 2048
llama_model_loader: - kv 4: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 5: llama.attention.head_count u32 = 32
llama_model_loader: - kv 6: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 7: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 8: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 9: general.file_type u32 = 1
llama_model_loader: - kv 10: llama.vocab_size u32 = 32000
llama_model_loader: - kv 11: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.pre str = default
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.chat_template str = {% for message in messages %}\n{% if m...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 45 tensors
llama_model_loader: - type f16: 156 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 22
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 1.10 B
llm_load_print_meta: model size = 2.05 GiB (16.00 BPW)
llm_load_print_meta: general.name = n/a
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 2 '</s>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.10 MiB
llm_load_tensors: CPU buffer size = 2098.35 MiB
..........................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 11.00 MiB
llama_new_context_with_model: KV self size = 11.00 MiB, K (f16): 5.50 MiB, V (f16): 5.50 MiB
llama_new_context_with_model: CPU output buffer size = 0.13 MiB
llama_new_context_with_model: CPU compute buffer size = 66.50 MiB
llama_new_context_with_model: graph nodes = 710
llama_new_context_with_model: graph splits = 1
Illegal instruction (core dumped)
llama.log content:
$ cat llama.log
warming up the model with an empty run
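The log ends right after "warming up the model with an empty run", so the crash seems to happen during the first inference pass. To pinpoint the instruction that raises SIGILL, one could open the core dump in gdb and disassemble at the faulting address; a rough sketch, assuming core dumps are collected by systemd-coredump as on a default Ubuntu 22.04 install:

$ coredumpctl gdb
(gdb) x/i $pc

If the instruction at $pc turns out to be an AVX2/FMA/F16C one, that would confirm the binary took a code path this CPU cannot execute.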
lscpu:
It seems to be CPU-related, so here is my lscpu output:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 36 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
CPU family: 6
Model: 42
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: 7
CPU max MHz: 3700.0000
CPU min MHz: 1600.0000
BogoMIPS: 6619.18
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid xsaveopt dtherm ida arat pln pts vnmi md_clear flush_l1d
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 1 MiB (4 instances)
L3: 6 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: KVM: Mitigation: VMX disabled
L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Mds: Mitigation; Clear CPU buffers; SMT disabled
Meltdown: Mitigation; PTI
Mmio stale data: Unknown: No mitigations
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
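The Flags line matches the system_info reported by llamafile above (AVX = 1, but AVX2 = 0, FMA = 0, F16C = 0). As a quick cross-check, one can grep the kernel's view of the relevant extensions directly; a minimal sketch, nothing llamafile-specific:

$ grep -m1 '^flags' /proc/cpuinfo | grep -ow -e avx -e avx2 -e fma -e f16c
avx

Only avx comes back on this i5-2500K, which lines up with Sandy Bridge lacking AVX2, FMA, and F16C.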
I saw a similar issue with a similar CPU: Support broken on old Intel/Amd CPUs #25. However, since that one does not crash at the same step, I am wondering whether the two are related.