Description
Hi,
Issue:
I tried to run llava-v1.5-7b-q4.llamafile and TinyLlama-1.1B-Chat-v1.0.F16.llamafile on my system:
Linux Ubuntu 6.5.0-28-generic #29~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 4 14:39:20 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Both crash with the same error at the same step:
stdout:
$ ./TinyLlama-1.1B-Chat-v1.0.F16.llamafile
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
{"build":1500,"commit":"a30b324","function":"server_cli","level":"INFO","line":2856,"msg":"build info","tid":"11165056","timestamp":1715465433}
{"function":"server_cli","level":"INFO","line":2859,"msg":"system info","n_threads":4,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"11165056","timestamp":1715465433,"total_threads":4}
llama_model_loader: loaded meta data with 23 key-value pairs and 201 tensors from TinyLlama-1.1B-Chat-v1.0.F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: llama.block_count u32 = 22
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 2048
llama_model_loader: - kv 4: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 5: llama.attention.head_count u32 = 32
llama_model_loader: - kv 6: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 7: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 8: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 9: general.file_type u32 = 1
llama_model_loader: - kv 10: llama.vocab_size u32 = 32000
llama_model_loader: - kv 11: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.pre str = default
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.chat_template str = {% for message in messages %}\n{% if m...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 45 tensors
llama_model_loader: - type f16: 156 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 22
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 1.10 B
llm_load_print_meta: model size = 2.05 GiB (16.00 BPW)
llm_load_print_meta: general.name = n/a
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 2 '</s>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.10 MiB
llm_load_tensors: CPU buffer size = 2098.35 MiB
..........................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 11.00 MiB
llama_new_context_with_model: KV self size = 11.00 MiB, K (f16): 5.50 MiB, V (f16): 5.50 MiB
llama_new_context_with_model: CPU output buffer size = 0.13 MiB
llama_new_context_with_model: CPU compute buffer size = 66.50 MiB
llama_new_context_with_model: graph nodes = 710
llama_new_context_with_model: graph splits = 1
Illegal instruction (core dumped)
llama.log content:
$ cat llama.log
warming up the model with an empty run
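The log ends right after "warming up the model with an empty run", so the crash seems to happen during the first inference pass. To pinpoint the instruction that raises SIGILL, one could open the core dump in gdb and disassemble at the faulting address; a rough sketch, assuming core dumps are collected by systemd-coredump as on a default Ubuntu 22.04 install:

$ coredumpctl gdb
(gdb) x/i $pc

If the instruction at $pc turns out to be an AVX2/FMA/F16C one, that would confirm the binary took a code path this CPU cannot execute.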
lscpu:
It seems to be CPU-related, so here is my lscpu output:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 36 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
CPU family: 6
Model: 42
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: 7
CPU max MHz: 3700.0000
CPU min MHz: 1600.0000
BogoMIPS: 6619.18
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid xsaveopt dtherm ida arat pln pts vnmi md_clear flush_l1d
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 1 MiB (4 instances)
L3: 6 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: KVM: Mitigation: VMX disabled
L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Mds: Mitigation; Clear CPU buffers; SMT disabled
Meltdown: Mitigation; PTI
Mmio stale data: Unknown: No mitigations
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
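The Flags line matches the system_info reported by llamafile above (AVX = 1, but AVX2 = 0, FMA = 0, F16C = 0). As a quick cross-check, one can grep the kernel's view of the relevant extensions directly; a minimal sketch, nothing llamafile-specific:

$ grep -m1 '^flags' /proc/cpuinfo | grep -ow -e avx -e avx2 -e fma -e f16c
avx

Only avx comes back on this i5-2500K, which lines up with Sandy Bridge lacking AVX2, FMA, and F16C.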
I saw a similar issue with a similar CPU: Support broken on old Intel/Amd CPUs #25. However, since that one does not crash at the same step, I am wondering whether the two are related.