cd ~/JetStream/benchmarks
pip install -r requirements.in
cd ~/data
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
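Before benchmarking, it can help to sanity-check the download. The sketch below simply loads the file and prints a few fields; the schema (a JSON list of records, each with an id and a conversations list of from/value turns) is an assumption based on the public ShareGPT_V3 dump, not something the benchmark requires you to inspect.

```python
# Quick sanity check of the downloaded ShareGPT file.
# Assumption: the file is a JSON list of records, each with an "id" and a
# "conversations" list of {"from": ..., "value": ...} turns.
import json
import os

path = os.path.expanduser("~/data/ShareGPT_V3_unfiltered_cleaned_split.json")
with open(path) as f:
    data = json.load(f)

print(f"{len(data)} conversations loaded")
first = data[0]
print(f"id={first['id']}, turns={len(first['conversations'])}")
print(first["conversations"][0]["from"], ":", first["conversations"][0]["value"][:80])
```

With the dataset in place, run the benchmark, pointing --dataset-path at the downloaded file: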
python benchmark_serving.py \
--tokenizer /home/{username}/maxtext/assets/tokenizer \
--num-prompts 10 \
--dataset sharegpt \
--dataset-path ~/data/ShareGPT_V3_unfiltered_cleaned_split.json \
--max-output-length 1024
For a Llama 3 model, pass its tokenizer path via --tokenizer and set --model llama-3:
python benchmark_serving.py \
--tokenizer <llama3 tokenizer path> \
--num-prompts 10 \
--dataset sharegpt \
--dataset-path ~/data/ShareGPT_V3_unfiltered_cleaned_split.json \
--max-output-length 1024 \
--model llama-3
Use the --save-request-outputs flag to save predictions to a file:
python benchmark_serving.py \
--tokenizer /home/{username}/maxtext/assets/tokenizer \
--num-prompts 10 \
--dataset sharegpt \
--dataset-path ~/data/ShareGPT_V3_unfiltered_cleaned_split.json \
--max-output-length 1024 \
--save-request-outputs
To automatically evaluate the outputs with the ROUGE metric, add the --run-eval true flag (if --save-result is used, the evaluation scores are saved as well):
python benchmark_serving.py \
--tokenizer /home/{username}/maxtext/assets/tokenizer \
--num-prompts 10 \
--dataset sharegpt \
--dataset-path ~/data/ShareGPT_V3_unfiltered_cleaned_split.json \
--max-output-length 1024 \
--save-request-outputs \
--run-eval true
For example, to run a larger benchmark against the openorca dataset with server warmup, saving both the benchmark results and the per-request outputs:
python JetStream/benchmarks/benchmark_serving.py \
--tokenizer ~/maxtext/assets/tokenizer.llama2 \
--warmup-first true \
--save-result \
--save-request-outputs \
--request-outputs-file-path outputs.json \
--num-prompts 1000 \
--max-output-length 1024 \
--dataset openorca
If you used --save-request-outputs, you can evaluate the saved outputs separately:
python eval_accuracy.py outputs.json
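As a rough sketch of this kind of scoring (not the actual eval_accuracy.py implementation), the snippet below computes per-sample ROUGE F-measures with the rouge_score package and averages them. The record field names (original_output for the reference, generated_text for the prediction) are assumptions; adjust them to match the structure of your outputs.json.

```python
# Minimal ROUGE scoring sketch over saved request outputs.
# Assumption: outputs.json is a JSON list of records with "original_output"
# (reference text) and "generated_text" (model prediction) fields; adjust the
# field names to match your file.
import json

from rouge_score import rouge_scorer  # pip install rouge-score

with open("outputs.json") as f:
    records = json.load(f)

metrics = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
scorer = rouge_scorer.RougeScorer(metrics, use_stemmer=True)

totals = {m: 0.0 for m in metrics}
for rec in records:
    scores = scorer.score(rec["original_output"], rec["generated_text"])
    for m in metrics:
        totals[m] += scores[m].fmeasure

n = len(records)
print({m: round(100.0 * totals[m] / n, 4) for m in metrics})
```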
With the openorca dataset and llama2-chat models (as used by MLPerf), the reference accuracy numbers are:
llama2-7b-chat {'rouge1': 42.0706, 'rouge2': 19.8021, 'rougeL': 26.8474, 'rougeLsum': 39.5952, 'gen_len': 1146679, 'gen_num': 998}
llama2-70b-chat {'rouge1': 44.4312, 'rouge2': 22.0352, 'rougeL': 28.6162}