
Allow Different Compute Layout for Attention #709

Merged. 1 commit merged into main on Jun 18, 2024.

Conversation

@morgandu (Collaborator) commented Jun 17, 2024

Checklist

This PR introduces compute layout control, allowing a different compute layout for attention.

  • Attention Unit Tests for different compute layout
  • Microbenchmark - Performance
  • E2E Serving - Accuracy and Performance

Setup

Results and Analysis

The goal of introducing the new compute layout is to potentially avoid cache layout tuning, though we can still tune the cache layout to seek out and verify the best performance.

Annotation

  • b: batch
  • t: query_length
  • h: query_heads
  • d: kv_dimension
  • s: kv_length
  • k: kv_heads

Layout

  • 0123 | bthd | bskd
  • 0213 | bhtd | bksd
  • 1203 | thbd | skbd
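For concreteness, here is a small JAX sketch (illustration only, not from the PR) showing how an axis-order permutation maps the default bthd query layout to the layouts listed above:

```python
import jax.numpy as jnp

b, t, h, d = 2, 4, 8, 16
query = jnp.zeros((b, t, h, d))  # default 0123 layout: bthd

# An axis order is a permutation of the default axes (0=b, 1=t, 2=h, 3=d).
print(jnp.transpose(query, (0, 2, 1, 3)).shape)  # 0213 -> bhtd: (2, 8, 4, 16)
print(jnp.transpose(query, (1, 2, 0, 3)).shape)  # 1203 -> thbd: (4, 8, 2, 16)
```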

Summary

The existing attention compute layout is 0123. We introduced a different compute layout, 0213, which is TPU-friendly.

We introduced the 0213 compute layout to verify:
- if and how much 0213 directly impacts performance with the default cache layout, i.e. the cache layout matching the compute layout
- if and how much a different compute layout combined with different cache layouts has a composite impact on performance

Performance

Existing compute layout 0123 and its history

Cache layout 1203-1203

With the existing cache layout 1203-1203, throughput was 2591.642232 tokens/s, about a 3x improvement over the default cache layout 0123-0123.

Cache layout 2013-2013

After layout tuning, we got the optimal prefill-ar cache layout, 2013-2013, with throughput 3347.180221 tokens/s, a 29% improvement.

New compute layout 0213

Cache layout 0213-0213

With the two caches in the same layout as the compute, i.e. 0213-0213 (xprof: https://xprof.corp.google.com/overview_page/morgandu-12159058496322304249), we got 3273.96 tokens/s, which is close to the top performance we verified with layout tuning.

Cache layout 0213-0132

The tuned cache layout that gives us the best throughput, 3329.45 tokens/s, is 0213-0132 (xprof: https://xprof.corp.google.com/overview_page/morgandu-5743582688063478644).

Accuracy

No regression on ROUGE scores between 0123 and 0213:

{'rouge1': 42.1738, 'rouge2': 19.6973, 'rougeL': 26.9088, 'rougeLsum': 39.6794, 'gen_len': 1144204, 'gen_num': 995}
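The PR does not say which tool produced these scores; below is a minimal sketch of such a comparison, assuming the Hugging Face `evaluate` package (an assumption, not confirmed by the PR):

```python
# Illustration only: score generated outputs against references with ROUGE.
import evaluate  # pip install evaluate rouge_score

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["the cat sat on the mat"],       # e.g. decoded outputs from the 0213 run
    references=["a cat was sitting on the mat"],  # e.g. reference summaries
)
# Returns the same metric keys as above: rouge1, rouge2, rougeL, rougeLsum
# (as fractions in recent versions; the PR reports them scaled by 100).
print(scores)
```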

@morgandu force-pushed the mor--compute-axis-order branch 2 times, most recently from 2e5e8ac to c5ee451, on June 17, 2024 20:29

@morgandu changed the title from "Allow Different Compute Layout" to "Allow Different Compute Layout for Attention" on Jun 17, 2024
@vipannalla (Collaborator) left a comment:

Looks good to me

Comment on lines +36 to +68
"name": "Debug MaxText Inference Microbenchmark",
"type": "python",
"request": "launch",
"console": "integratedTerminal",
"justMyCode": false,
"python": "python3",
"program": "${workspaceFolder}/MaxText/inference_microbenchmark.py",
"args": [
"MaxText/configs/base.yml",
"model_name=llama2-7b",
"tokenizer_path=assets/tokenizer.llama2",
"weight_dtype=bfloat16",
"scan_layers=false",
"attention=dot_product",
"max_prefill_predict_length=1024",
"max_target_length=2048",
"ici_fsdp_parallelism=1",
"ici_tensor_parallelism=-1",
"ici_autoregressive_parallelism=1",
"inference_microbenchmark_prefill_lengths=32,64,128,256,512,1024",
"inference_microbenchmark_stages=generate",
"inference_microbenchmark_loop_iters=1",
"run_name=runner_$(date +%Y-%m-%d-%H-%M)",
"base_output_directory=gs://test-maxtext-output",
"prefill_cache_axis_order=0,2,1,3",
"ar_cache_axis_order=0,2,1,3",
"compute_axis_order=0,2,1,3",
"reshape_q=true",
"per_device_batch_size=24",
"quantization=int8",
"quantize_kvcache=True",
]
},
Collaborator:

n00b question -- Is this set to auto-run for every local commit/amend in VS Code, or is it just for convenience?

@morgandu (Collaborator, Author) replied Jun 17, 2024:

No, I added this while I was debugging in my local VS Code; I think it will be helpful for other engineers too. You may need to change the flags depending on your run, though.
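As an aside (not part of the PR), the launch config above maps one-to-one onto a CLI invocation of the same entry point, so a cache-layout tuning sweep like the one described in the PR description could be scripted, e.g.:

```python
# Hypothetical tuning sweep (illustration only, not from this PR): run the
# microbenchmark entry point once per cache-layout permutation.
import itertools
import subprocess

BASE_CMD = [
    "python3", "MaxText/inference_microbenchmark.py", "MaxText/configs/base.yml",
    "model_name=llama2-7b",
    "compute_axis_order=0,2,1,3",
    # ...remaining flags exactly as in the launch config above...
]

for perm in itertools.permutations("0123"):
    layout = ",".join(perm)  # e.g. "2,0,1,3"
    subprocess.run(
        BASE_CMD + [f"prefill_cache_axis_order={layout}", f"ar_cache_axis_order={layout}"],
        check=True,
    )
```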

```python
@@ -52,6 +52,11 @@ def string_to_bool(s: str) -> bool:
_yaml_types_to_parser = {str: str, int: int, float: float, bool: string_to_bool}


def validate_compute_axis_order(s: str) -> None:
  valid_compute_axis_order = ("0,1,2,3", "0,2,1,3")
```
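The hunk above is truncated at the tuple of valid values; a plausible completion is sketched below (the exact message is an assumption, though the review comment that follows confirms an exception is raised for other layouts):

```python
def validate_compute_axis_order(s: str) -> None:
  valid_compute_axis_order = ("0,1,2,3", "0,2,1,3")
  if s not in valid_compute_axis_order:
    # Assumed wording; the diff confirms only the two valid orders and,
    # per the review below, that anything else raises an exception.
    raise ValueError(f"Invalid compute_axis_order was passed. Valid options: {valid_compute_axis_order}")
```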
Collaborator:

Why not allow other layouts? Does the code break with others, or just run slowly? I'd allow other layouts, similar to prefill_cache_axis_order/ar_cache_axis_order. Maybe remove the exception and just print a warning that others are untested?

Collaborator:

nvm, I noticed in attention.py you specifically look for those two.

```python
    prefill_cache_axis_order=(1,2,0,3),
    ar_cache_axis_order=(1,2,0,3)
)

def test_dot_product_cache_axis_order(self):
```
Collaborator:

Thanks for adding unit tests!

@morgandu (Collaborator, Author) replied:

of course!

```python
@@ -68,6 +68,12 @@
# pytype: disable=attribute-error


def validate_compute_axis_order(s: str) -> None:
```
Collaborator:

Duplicate, remove. I see this method in pyconfig.py as well, which is the right place to validate at the beginning of the run.

@morgandu (Collaborator, Author) replied Jun 17, 2024:

Actually, this is a different validation. base.yml only allows strings, so pyconfig validates the flag's string value. In the attention module, the axis orders are actually used as AxisIdxes (tuples), and they can be hard-coded / overwritten in different model class initializations, e.g. https://github.com/google/maxtext/blob/main/MaxText/layers/gpt3.py#L229

I don't want to risk possibly hard-coded axis orders being passed in from somewhere other than the YAML flags.
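To make the distinction concrete, here is a minimal sketch (assumed, not from the diff) of the conversion implied by this reply: pyconfig validates the raw YAML string, while the attention module works with the parsed tuple:

```python
# Hypothetical converter illustrating the two representations discussed above:
# base.yml carries "0,2,1,3" as a string; attention consumes an AxisIdxes tuple.
def axis_order_from_flag(s: str) -> tuple:
  return tuple(int(i) for i in s.split(","))

assert axis_order_from_flag("0,2,1,3") == (0, 2, 1, 3)
```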

Comment on lines +369 to +383

```python
if model_mode == common_types.MODEL_MODE_TRAIN or self.compute_axis_order == (0,1,2,3):
  query = jnp.reshape(query, (b, t, n_kv, n // n_kv, d))
  if self.reshape_q and q_seq_len == 1:
    query = jnp.broadcast_to(query, (b, 2, n_kv, n // n_kv, d))
  result = jnp.einsum("btkgd,bskd->bkgts", query, key)
elif self.compute_axis_order == (0,2,1,3):
  query = jnp.transpose(query, axes=self.compute_axis_order)
  key = jnp.transpose(key, axes=self.compute_axis_order)
  query = jnp.reshape(query, (b, n_kv, n // n_kv, t, d))
  if self.reshape_q and q_seq_len == 1:
    query = jnp.broadcast_to(query, (b, n_kv, n // n_kv, 2, d))
  result = jnp.einsum("bkgtd,bksd->bkgts", query, key)
```
Collaborator:

This lgtm for now. Since we already know 0 = b, 1 = t, 2 = n, etc., can this code be made generic to support all layouts?

@morgandu (Collaborator, Author) replied:

A lot of other places would need to change to support more layouts; I ran into both loud and silent bugs when I tried to add one more layout, until I gave up.

Also, I don't think it's necessary to support all layouts, since bh** was recommended as one of the friendly layouts. If we end up really needing new layouts, let's revisit it!

MaxText/configs/base.yml (review thread resolved)
@copybara-service bot merged commit fe4bfdc into main on Jun 18, 2024. 13 checks passed.

@copybara-service bot deleted the mor--compute-axis-order branch on June 18, 2024 14:59.