feat: add rate limiting support for model providers #318

Open

austinmw wants to merge 2 commits into main
Conversation


@austinmw commented Jun 30, 2025

Description

This PR adds rate limiting capabilities to Strands model providers, using a token bucket algorithm to keep agents within an API provider's requests-per-minute (RPM) limits.

Motivation:

ReAct agents create unpredictable API call patterns: a simple task might need 2 calls, while complex reasoning could chain 20+ calls. Here are a few scenarios where rate limiting helps:

  1. Thundering Herd Prevention - When parallel workers hit limits simultaneously, Strands' retry mechanism causes synchronized backoff cycles where all workers pause, retry, and fail together, leading to significant slowdowns.
  2. Multi-Tenant Isolation - Prevent "noisy neighbor" problems where one customer's heavy usage triggers rate limit retries for all other customers sharing the same model.
  3. Cost Control - Set different rate limits for dev/staging/prod environments to prevent accidental runaway usage. For example, limiting development to 10 RPM ensures you can't accidentally burn through your budget while testing.

Basic Usage:

  1. Function Wrapper
from strands import Agent
from strands.models import BedrockModel
from strands.models.rate_limiter import rate_limit_model

# Wrap any model with rate limiting
model = BedrockModel(model_id="...")
limited_model = rate_limit_model(model, rpm=60)

# Use it with an agent
agent = Agent(model=limited_model)
response = agent("Hello!")
  2. Class Wrapper
# Create a rate-limited class
LimitedBedrockModel = rate_limit_model(BedrockModel, rpm=60)

# Then instantiate multiple times
model1 = LimitedBedrockModel(model_id="...")
model2 = LimitedBedrockModel(model_id="...")
# Both instances share the same rate limit
  3. Configs
# Share rate limits across different model providers
config = {"rpm": 60, "bucket_key": "shared-api-limit"}

# Both models share the same 60 RPM bucket
bedrock_model = rate_limit_model(BedrockModel(model_id="claude-4-opus..."), **config)
litellm_model = rate_limit_model(LiteLLMModel(model_id="gpt-4o"), **config)

Key Features:

  • Token bucket algorithm: Allows burst capacity while maintaining the average rate limit (see the sketch after this list)
  • Shared rate limits: Multiple agents can coordinate through shared buckets
  • Transparent wrapper: Zero breaking changes, completely opt-in (sketched after Implementation Details)
  • Thread-safe: Handles concurrent agent execution correctly
  • Flexible modes: timeout (fail fast) or wait (block until ready)
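
For context, here is a minimal sketch of a token bucket along these lines. It is illustrative only: the class and method names are assumptions for this example, not necessarily the PR's actual code.

import threading
import time
from typing import Optional

class TokenBucket:
    """Illustrative bucket: holds up to `capacity` tokens, refilled at `rate`/sec."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate                  # tokens added per second (e.g. rpm / 60)
        self.capacity = capacity          # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()      # guards state across concurrent agents

    def _refill(self) -> None:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def acquire(self, timeout: Optional[float] = None) -> bool:
        """Take one token. timeout=None blocks until ready; otherwise fail fast."""
        deadline = None if timeout is None else time.monotonic() + timeout
        while True:
            with self.lock:
                self._refill()
                if self.tokens >= 1:
                    self.tokens -= 1
                    return True
            delay = 1 / self.rate         # roughly one token's worth of waiting
            if deadline is not None:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False          # timeout mode: caller can fail fast
                delay = min(delay, remaining)
            time.sleep(delay)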

Implementation Details:

  • Zero additional dependencies (uses only Python standard library)
  • Minimal overhead (<1ms per request)
  • Structured for future async support when Strands adds it
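
And a rough illustration of the transparent-wrapper point: the wrapper acquires a token, then delegates everything to the wrapped model. The method name used here (`converse`) is an assumption about the Strands model interface for illustration, not confirmed by this PR.

class RateLimitedModel:
    """Illustrative wrapper: acquire a token, then delegate to the wrapped model."""

    def __init__(self, model, bucket):
        self._model = model
        self._bucket = bucket

    def __getattr__(self, name):
        # Anything not defined on the wrapper falls through to the wrapped
        # model, so existing call sites keep working unchanged.
        return getattr(self._model, name)

    def converse(self, *args, **kwargs):
        # `converse` is an assumed entry point for illustration only; the
        # actual model method(s) intercepted by the PR may differ.
        self._bucket.acquire()  # blocks (wait mode) or fails fast (timeout mode)
        return self._model.converse(*args, **kwargs)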

I'd appreciate any feedback, whether concerns about the feature or suggestions for improvement!

Related Issues

N/A

Documentation PR

Will add docs if the feature is approved

Type of Change

New feature

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Commits

  • Implement token bucket algorithm for rate limiting
  • Add RateLimitedModel wrapper class
  • Support shared rate limit buckets across instances
  • Add comprehensive unit and integration tests
  • Support both timeout and wait modes
  • Make RateLimitedModel generic to preserve wrapped model type
  • Add overloads for better type inference
  • Use WeakValueDictionary for automatic bucket cleanup
  • Fix race condition in get_or_create_bucket with atomic operations
  • Remove RateLimitedModel from public API exports
  • Make rate limiter integration test more reliable by using relative timing
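
For the WeakValueDictionary and race-condition items above, a registry along these lines would behave as described. This is a sketch with assumed names (the PR's get_or_create_bucket may differ), reusing the illustrative TokenBucket from the earlier sketch.

import threading
import weakref

_registry_lock = threading.Lock()
_buckets: "weakref.WeakValueDictionary[str, TokenBucket]" = weakref.WeakValueDictionary()

def get_or_create_bucket(key: str, rpm: float) -> "TokenBucket":
    # One lock around the check-and-insert keeps the operation atomic, so two
    # threads asking for the same key can never end up with separate buckets.
    with _registry_lock:
        bucket = _buckets.get(key)
        if bucket is None:
            bucket = TokenBucket(rate=rpm / 60, capacity=rpm)
            _buckets[key] = bucket
        return bucket

# Because the dictionary holds weak references, a bucket is collected
# automatically once the last wrapped model referencing it goes away.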