feat: add rate limiting support for model providers #318

Open

austinmw wants to merge 2 commits into main
Conversation


@austinmw commented Jun 30, 2025

Description

This PR adds rate limiting capabilities to Strands model providers, using a token bucket algorithm to keep agents within an API provider's requests-per-minute (RPM) limits.

Motivation:

ReAct agents create unpredictable API call patterns: a simple task might need 2 calls, while complex reasoning could chain 20+ calls. Here are a few scenarios where rate limiting helps:

  1. Thundering Herd Prevention - When parallel workers hit limits simultaneously, Strands' retry mechanism causes synchronized backoff cycles where all workers pause, retry, and fail together, leading to significant slowdowns.
  2. Multi-Tenant Isolation - Prevent "noisy neighbor" problems where one customer's heavy usage triggers rate limit retries for all other customers sharing the same model.
  3. Cost Control - Set different rate limits for dev/staging/prod environments to prevent accidental runaway usage. For example, limiting development to 10 RPM ensures you can't accidentally burn through your budget while testing.

Basic Usage:

  1. Function Wrapper
from strands import Agent
from strands.models import BedrockModel
from strands.models.rate_limiter import rate_limit_model

# Wrap any model with rate limiting
model = BedrockModel(model_id="...")
limited_model = rate_limit_model(model, rpm=60)

# Use it with an agent
agent = Agent(model=limited_model)
response = agent("Hello!")
  2. Class Wrapper
# Create a rate-limited class
LimitedBedrockModel = rate_limit_model(BedrockModel, rpm=60)

# Then instantiate multiple times
model1 = LimitedBedrockModel(model_id="...")
model2 = LimitedBedrockModel(model_id="...")
# Both instances share the same rate limit
  3. Configs
# Share rate limits across different model providers
config = {"rpm": 60, "bucket_key": "shared-api-limit"}

# Both models share the same 60 RPM bucket
bedrock_model = rate_limit_model(BedrockModel(model_id="claude-4-opus..."), **config)
litellm_model = rate_limit_model(LiteLLMModel(model_id="gpt-4o"), **config)

Key Features:

  • Token bucket algorithm: Allows burst capacity while maintaining the average rate limit (see the sketch after this list)
  • Shared rate limits: Multiple agents can coordinate through shared buckets
  • Transparent wrapper: Zero breaking changes, completely opt-in (sketched after Implementation Details)
  • Thread-safe: Handles concurrent agent execution correctly
  • Flexible modes: timeout (fail fast) or wait (block until ready)
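
For context, here is a minimal sketch of a token bucket along these lines. It is illustrative only: the class and method names are assumptions for this example, not necessarily the PR's actual code.

import threading
import time
from typing import Optional

class TokenBucket:
    """Illustrative bucket: holds up to `capacity` tokens, refilled at `rate`/sec."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate                  # tokens added per second (e.g. rpm / 60)
        self.capacity = capacity          # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()      # guards state across concurrent agents

    def _refill(self) -> None:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def acquire(self, timeout: Optional[float] = None) -> bool:
        """Take one token. timeout=None blocks until ready; otherwise fail fast."""
        deadline = None if timeout is None else time.monotonic() + timeout
        while True:
            with self.lock:
                self._refill()
                if self.tokens >= 1:
                    self.tokens -= 1
                    return True
            delay = 1 / self.rate         # roughly one token's worth of waiting
            if deadline is not None:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False          # timeout mode: caller can fail fast
                delay = min(delay, remaining)
            time.sleep(delay)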

Implementation Details:

  • Zero additional dependencies (uses only Python standard library)
  • Minimal overhead (<1ms per request)
  • Structured for future async support when Strands adds it
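
And a rough illustration of the transparent-wrapper point: the wrapper acquires a token, then delegates everything to the wrapped model. The method name used here (`converse`) is an assumption about the Strands model interface for illustration, not confirmed by this PR.

class RateLimitedModel:
    """Illustrative wrapper: acquire a token, then delegate to the wrapped model."""

    def __init__(self, model, bucket):
        self._model = model
        self._bucket = bucket

    def __getattr__(self, name):
        # Anything not defined on the wrapper falls through to the wrapped
        # model, so existing call sites keep working unchanged.
        return getattr(self._model, name)

    def converse(self, *args, **kwargs):
        # `converse` is an assumed entry point for illustration only; the
        # actual model method(s) intercepted by the PR may differ.
        self._bucket.acquire()  # blocks (wait mode) or fails fast (timeout mode)
        return self._model.converse(*args, **kwargs)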

I'd appreciate any feedback, whether concerns about the feature or suggestions for improvement!

Related Issues

N/A

Documentation PR

Will add docs if the feature is approved

Type of Change

New feature

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Commits

  • Implement token bucket algorithm for rate limiting
  • Add RateLimitedModel wrapper class
  • Support shared rate limit buckets across instances
  • Add comprehensive unit and integration tests
  • Support both timeout and wait modes
  • Make RateLimitedModel generic to preserve wrapped model type
  • Add overloads for better type inference
  • Use WeakValueDictionary for automatic bucket cleanup
  • Fix race condition in get_or_create_bucket with atomic operations
  • Remove RateLimitedModel from public API exports
  • Make rate limiter integration test more reliable by using relative timing
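
For the WeakValueDictionary and race-condition items above, a registry along these lines would behave as described. This is a sketch with assumed names (the PR's get_or_create_bucket may differ), reusing the illustrative TokenBucket from the earlier sketch.

import threading
import weakref

_registry_lock = threading.Lock()
_buckets: "weakref.WeakValueDictionary[str, TokenBucket]" = weakref.WeakValueDictionary()

def get_or_create_bucket(key: str, rpm: float) -> "TokenBucket":
    # One lock around the check-and-insert keeps the operation atomic, so two
    # threads asking for the same key can never end up with separate buckets.
    with _registry_lock:
        bucket = _buckets.get(key)
        if bucket is None:
            bucket = TokenBucket(rate=rpm / 60, capacity=rpm)
            _buckets[key] = bucket
        return bucket

# Because the dictionary holds weak references, a bucket is collected
# automatically once the last wrapped model referencing it goes away.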