JetStream Benchmark And Eval

Install Dependencies

cd ~/JetStream/benchmarks
pip install -r requirements.in

Benchmark with ShareGPT

Prepare Dataset

cd ~/data
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
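
You can optionally sanity-check the downloaded file before benchmarking. A minimal sketch, assuming the file is a JSON list of records that each carry a "conversations" list of turns (the layout of the public ShareGPT dump); adjust the path if you saved it elsewhere:

import json
import os

# Peek at the ShareGPT dump: confirm it parses and look at its shape.
# Assumption: a JSON list of records, each with a "conversations" list of
# {"from": ..., "value": ...} turns.
path = os.path.expanduser("~/data/ShareGPT_V3_unfiltered_cleaned_split.json")
with open(path) as f:
    records = json.load(f)

print(f"{len(records)} records")
for turn in records[0].get("conversations", [])[:2]:
    print(turn.get("from"), ":", str(turn.get("value"))[:80])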

Run Benchmark with the MaxText tokenizer

python benchmark_serving.py \
--tokenizer /home/{username}/maxtext/assets/tokenizer \
--num-prompts 10  \
--dataset sharegpt \
--dataset-path ~/data/ShareGPT_V3_unfiltered_cleaned_split.json \
--max-output-length 1024

Run Benchmark for Llama 3

python benchmark_serving.py \
--tokenizer <llama3 tokenizer path> \
--num-prompts 10  \
--dataset sharegpt \
--dataset-path ~/data/ShareGPT_V3_unfiltered_cleaned_split.json \
--max-output-length 1024 \
--model llama-3

Save request outputs in Benchmark

Use the --save-request-outputs flag to save the predictions to a file.

python benchmark_serving.py \
--tokenizer /home/{username}/maxtext/assets/tokenizer \
--num-prompts 10  \
--dataset sharegpt \
--dataset-path ~/data/ShareGPT_V3_unfiltered_cleaned_split.json \
--max-output-length 1024  \
--save-request-outputs
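
To take a quick look at the saved predictions, a small sketch like the following works. It assumes the outputs were written to outputs.json (pass --request-outputs-file-path outputs.json to choose the location) and makes no assumption about the per-request fields, which are whatever benchmark_serving.py emits:

import json

# Load the file written by --save-request-outputs and list its fields.
with open("outputs.json") as f:
    outputs = json.load(f)

print(f"{len(outputs)} saved requests")
if outputs and isinstance(outputs[0], dict):
    print("fields per request:", sorted(outputs[0].keys()))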

Automatically run evaluation after Benchmark

To automatically evaluate the outputs with the ROUGE metric, add the --run-eval true flag. Note: if --save-result is used, the evaluation scores are saved as well.

python benchmark_serving.py \
--tokenizer /home/{username}/maxtext/assets/tokenizer \
--num-prompts 10  \
--dataset sharegpt \
--dataset-path ~/data/ShareGPT_V3_unfiltered_cleaned_split.json \
--max-output-length 1024  \
--save-request-outputs \
--run-eval true

Benchmark with the OpenOrca dataset (OpenOrca is the dataset MLPerf Inference uses for Llama2 models)

python JetStream/benchmarks/benchmark_serving.py   \
--tokenizer ~/maxtext/assets/tokenizer.llama2  \
--warmup-first true   \
--save-result   \
--save-request-outputs   \
--request-outputs-file-path outputs.json   \
--num-prompts 1000   \
--max-output-length 1024   \
--dataset openorca

Standalone Evaluation Run

If you ran the benchmark with --save-request-outputs, you can evaluate the saved outputs separately:

python eval_accuracy.py outputs.json
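
eval_accuracy.py reports ROUGE scores. As an illustration of the metric itself (not of the script's internals), the rouge_score package computes the same family of scores; its fmeasure values correspond to the 0-100 scale used in the reference numbers below once multiplied by 100:

from rouge_score import rouge_scorer

# ROUGE compares n-gram (rouge1, rouge2) and longest-common-subsequence
# (rougeL) overlap between a generated answer and the reference answer.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "The capital of France is Paris."
prediction = "Paris is the capital of France."
for name, score in scorer.score(reference, prediction).items():
    print(name, round(100 * score.fmeasure, 4))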

With the OpenOrca dataset and Llama2-chat models (as used by MLPerf), the reference accuracy numbers are:

llama2-7b-chat {'rouge1': 42.0706, 'rouge2': 19.8021, 'rougeL': 26.8474, 'rougeLsum': 39.5952, 'gen_len': 1146679, 'gen_num': 998}
llama2-70b-chat {'rouge1': 44.4312, 'rouge2': 22.0352, 'rougeL': 28.6162}