A Simple Proxy for Azure and OpenAI raised our GPT-4 TPM limit by 24x
William Zeng - October 26th, 2023
OpenAI API rate limits are restrictive, with a top rate limit of 20K TPM (tokens per minute) for GPT-4.

Even the most basic AI agents make 3+ calls per request, and to plan effectively they'll need to use RAG (retrieval-augmented generation). A minimally useful request requires an input of at least 500 tokens (~375 words, roughly 2 paragraphs) of text, and generates roughly 125 tokens (half a paragraph) of output. A single request therefore consumes 1,875 tokens (3 × 625), hitting the token limit at 10.67 requests/min.
For reference, a Sweep ticket makes 20 GPT-4 calls consuming an average of 2,000 tokens each, for a total usage of 40,000 tokens. This leads to a "Sweep" request limit of 0.5 requests/min, meaning 2 minutes to finish a single request.
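Spelling out that arithmetic (all numbers are the ones quoted above):

```python
# Request-budget math from the paragraphs above.
TPM_LIMIT = 20_000  # OpenAI GPT-4 rate limit (tokens/min)

# A basic agent request: 3 calls of ~500 input + ~125 output tokens each.
agent_tokens = 3 * (500 + 125)                      # 1,875 tokens/request
agent_requests_per_min = TPM_LIMIT / agent_tokens   # ~10.67 requests/min

# A Sweep ticket: 20 GPT-4 calls averaging ~2,000 tokens each.
sweep_tokens = 20 * 2_000                           # 40,000 tokens/ticket
sweep_requests_per_min = TPM_LIMIT / sweep_tokens   # 0.5 -> 2 minutes/ticket
```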
There's a solution though.
Microsoft Azure East US has a 20K TPM limit for us.
We can use a proxy like azure-openai-proxy (https://github.com/diemus/azure-openai-proxy) to load-balance between OpenAI and Azure, bringing our total TPM limit from 20K → 40K.
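The idea behind the balancing is simply splitting traffic across endpoints. Here is a minimal sketch of that split, with placeholder endpoint/key pairs (this is an illustration, not the proxy's actual implementation):

```python
import random

# Placeholder (api_base, api_key, api_type) triples for the two backends.
BACKENDS = [
    ("https://api.openai.com/v1", "<openai-api-key>", "open_ai"),
    ("https://<your-resource>.openai.azure.com", "<azure-api-key>", "azure"),
]

def pick_backend():
    # A uniform random choice splits tokens roughly 50/50 across the two
    # backends, which is what doubles the effective TPM budget.
    return random.choice(BACKENDS)
```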
This helps, but it only brings us up to 40k total TPM. Even better, we can balance between multiple Azure regions to further increase our rate limit.
Here's a table of the rate limits we found for each region:
| Azure Region | GPT-4 Rate Limit (TPM) |
| --- | --- |
| East Canada | 100K |
| Japan East | 100K |
| East US 2 | 100K |
| UK South | 100K |
| Australia East | 40K |
| Switzerland North | 40K (dynamic) |
| East US | 20K |
| France Central | 20K |
| Sweden Central | N/A |
| West Europe | N/A |
| North Central US | N/A |
| Total | 480K / 520K (dynamic) |
Using all of the Azure regions, we were able to increase our TPM rate limit from 20K → 40K (a 2x increase!) → 480K (a 24x increase!).
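As a quick check of the totals (limits copied from the table; Switzerland North is kept separate since its 40K is dynamic):

```python
# Per-region GPT-4 TPM limits from the table above (N/A regions excluded).
region_tpm = {
    "East Canada": 100_000, "Japan East": 100_000, "East US 2": 100_000,
    "UK South": 100_000, "Australia East": 40_000, "East US": 20_000,
    "France Central": 20_000,
}
switzerland_north_dynamic = 40_000

base_total = sum(region_tpm.values())                    # 480,000 TPM
dynamic_total = base_total + switzerland_north_dynamic   # 520,000 TPM
print(base_total // 20_000)                              # 24x the original 20K limit
```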
Here's a graph of our OpenAI usage before/after this change:
Here's the code:
Sweep's OpenAI Proxy
```python
import random

import openai
from loguru import logger

from sweepai.config.server import (
    AZURE_API_KEY,
    MULTI_REGION_CONFIG,
    OPENAI_API_BASE,
    OPENAI_API_ENGINE_GPT4,
    OPENAI_API_ENGINE_GPT4_32K,
    OPENAI_API_ENGINE_GPT35,
    OPENAI_API_KEY,
    OPENAI_API_TYPE,
    OPENAI_API_VERSION,
)


class OpenAIProxy:
    def __init__(self):
        pass

    def call_openai(self, model, messages, max_tokens, temperature) -> str:
        try:
            # Map the requested model to the matching Azure deployment (engine), if configured.
            engine = None
            if (
                model == "gpt-3.5-turbo-16k"
                or model == "gpt-3.5-turbo-16k-0613"
            ) and OPENAI_API_ENGINE_GPT35 is not None:
                engine = OPENAI_API_ENGINE_GPT35
            elif (
                model == "gpt-4"
                or model == "gpt-4-0613"
            ) and OPENAI_API_ENGINE_GPT4 is not None:
                engine = OPENAI_API_ENGINE_GPT4
            elif (
                model == "gpt-4-32k"
                or model == "gpt-4-32k-0613"
            ) and OPENAI_API_ENGINE_GPT4_32K is not None:
                engine = OPENAI_API_ENGINE_GPT4_32K

            # No Azure deployment configured for this model -> call OpenAI directly.
            if OPENAI_API_TYPE is None or engine is None:
                openai.api_key = OPENAI_API_KEY
                openai.api_base = "https://api.openai.com/v1"
                openai.api_version = None
                openai.api_type = "open_ai"
                logger.info(f"Calling {model} on OpenAI.")
                response = openai.ChatCompletion.create(
                    model=model,
                    messages=messages,
                    max_tokens=max_tokens,
                    temperature=temperature,
                )
                return response["choices"][0].message.content

            # Validity checks for MULTI_REGION_CONFIG; fall back to the single
            # Azure endpoint if multi-region isn't set up.
            if (
                MULTI_REGION_CONFIG is None
                or not isinstance(MULTI_REGION_CONFIG, list)
                or len(MULTI_REGION_CONFIG) == 0
                or not isinstance(MULTI_REGION_CONFIG[0], list)
            ):
                logger.info(
                    f"Calling {model} with engine {engine} on Azure url {OPENAI_API_BASE}."
                )
                openai.api_type = OPENAI_API_TYPE
                openai.api_base = OPENAI_API_BASE
                openai.api_version = OPENAI_API_VERSION
                openai.api_key = AZURE_API_KEY
                response = openai.ChatCompletion.create(
                    engine=engine,
                    model=model,
                    messages=messages,
                    max_tokens=max_tokens,
                    temperature=temperature,
                )
                return response["choices"][0].message.content

            # MULTI_REGION_CONFIG is a list of (region_url, api_key) pairs.
            # Randomize the order, then try each region until one succeeds.
            SHUFFLED_MULTI_REGION_CONFIG = random.sample(
                MULTI_REGION_CONFIG, len(MULTI_REGION_CONFIG)
            )
            for region_url, api_key in SHUFFLED_MULTI_REGION_CONFIG:
                try:
                    logger.info(
                        f"Calling {model} with engine {engine} on Azure url {region_url}."
                    )
                    openai.api_key = api_key
                    openai.api_base = region_url
                    openai.api_version = OPENAI_API_VERSION
                    openai.api_type = OPENAI_API_TYPE
                    response = openai.ChatCompletion.create(
                        engine=engine,
                        model=model,
                        messages=messages,
                        max_tokens=max_tokens,
                        temperature=temperature,
                    )
                    return response["choices"][0].message.content
                except SystemExit:
                    raise SystemExit
                except Exception as e:
                    logger.exception(f"Error calling {region_url}: {e}")
            raise Exception("No Azure regions available")
        except SystemExit:
            raise SystemExit
        except Exception as e:
            # Azure failed entirely -- fall back to OpenAI if we have a key.
            if OPENAI_API_KEY:
                try:
                    openai.api_key = OPENAI_API_KEY
                    openai.api_base = "https://api.openai.com/v1"
                    openai.api_version = None
                    openai.api_type = "open_ai"
                    logger.info(f"Calling {model} with OpenAI.")
                    response = openai.ChatCompletion.create(
                        model=model,
                        messages=messages,
                        max_tokens=max_tokens,
                        temperature=temperature,
                    )
                    return response["choices"][0].message.content
                except SystemExit:
                    raise SystemExit
                except Exception as _e:
                    logger.error(f"OpenAI API Key found but error: {_e}")
            logger.error(f"OpenAI API Key not found and Azure Error: {e}")
            raise e
```
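A rough usage sketch (the message content and parameter values here are made up; configuration comes from `sweepai.config.server` as above):

```python
# Hypothetical usage -- assumes the env-driven config above is populated.
proxy = OpenAIProxy()
reply = proxy.call_openai(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarize this diff..."}],
    max_tokens=125,
    temperature=0.1,
)
print(reply)
```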
MULTI_REGION_CONFIG is a list of (region_url, api_key) tuples that we randomize over, and the other environment variables are standard strings from OpenAI/Azure. One other requirement of our proxy is that each of the models (GPT-3.5, GPT-4, GPT-4-32K) must have consistent deployment naming across regions.
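For illustration, MULTI_REGION_CONFIG might look something like this (URLs and keys are placeholders; the validity check above expects a list of lists):

```python
# Placeholder values -- each entry is a [region_url, api_key] pair.
MULTI_REGION_CONFIG = [
    ["https://my-eastus2.openai.azure.com", "<eastus2-api-key>"],
    ["https://my-uksouth.openai.azure.com", "<uksouth-api-key>"],
    ["https://my-japaneast.openai.azure.com", "<japaneast-api-key>"],
]
```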
We have some potential improvements to our proxy:

- We randomize between the different Azure regions uniformly, so this could be improved with weighted randomization (see the sketch after this list).
- We could decrease our expected latency further by keeping track of which regions have capacity.
- We could optimize further by trying to estimate the load on each region. We can perform a smart optimization by computing a simple function like `estimated_latency = mean_over_regions((input_tokens + output_tokens) / latency_per_token_estimate)` and `estimated_load = estimated_latency - true_latency`, and then optimizing the region-level randomization by negatively weighting the `estimated_load`.
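A rough sketch of that weighted randomization (the `pick_region` helper and the `estimated_load` mapping are hypothetical; maintaining the load estimates themselves is left out):

```python
import random

def pick_region(regions, estimated_load):
    """Pick a region, favoring those with lower estimated load.

    `regions` is a list of (region_url, api_key) pairs and `estimated_load`
    maps region_url -> a load score (higher means more loaded). Both are
    assumed to be tracked elsewhere; this only shows the weighting step.
    """
    # Negatively weight load: higher load -> smaller weight, floored at a
    # small epsilon so every region keeps a nonzero chance of being picked.
    weights = [
        max(1.0 / (1.0 + max(estimated_load.get(url, 0.0), 0.0)), 1e-3)
        for url, _ in regions
    ]
    return random.choices(regions, weights=weights, k=1)[0]
```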
Here's a link to our production code: https://github.com/sweepai/sweep/blob/main/sweepai/utils/openai_proxy.py
If you're looking for a more advanced version of this, check out LiteLLM's load balancer.
If you're interested in using Sweep, or have questions about how we use LLMs in production, reach out at https://community.sweep.dev/!