Handling Multi-Session Load in FastAPI
Tuesday, 10 February 2026

In backend systems that serve many users simultaneously, one of the biggest challenges is keeping the server stable and responsive as the number of active sessions grows. During high-traffic periods, the backend must not only process requests quickly, but also ensure that critical resources such as CPU, memory, and worker threads are not exhausted.

If session requests are allowed to run without any clear limitations, the system can become overloaded very quickly, because every active session continuously consumes server resources. In this kind of situation, the impact may include dramatically increased response times, uncontrolled request buildup, and even the possibility of the service going down entirely.

The requirements in this case are quite specific:

  • The system must only allow a maximum of 50 active sessions running concurrently at any given time.
  • If the number of active sessions exceeds 50, additional session requests should not be rejected immediately, but instead placed into a queue.
  • Sessions waiting inside the queue must respect an RTO (request time-out) constraint, meaning they may wait for a maximum of 60 seconds.
  • If after 60 seconds the request still has not obtained an available slot, the session is considered expired.

Why Concurrency Limit Matters

Concurrency limiting is extremely important because without proper control, the backend will attempt to run all incoming requests at the same time, which eventually leads to resource exhaustion. When FastAPI workers become saturated, new requests will not receive execution slots, latency increases sharply, and the overall user experience deteriorates.

Some common negative effects when concurrency is not limited include:

  • FastAPI workers filling up quickly, preventing requests from being processed efficiently.
  • Response times increasing drastically due to excessive parallel workloads.
  • Memory usage growing uncontrollably because too many sessions are active.
  • The service becoming unstable and at higher risk of downtime.
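The value of a cap is easy to see in isolation. In the framework-free sketch below (illustrative, not the article's endpoint), 100 simulated requests arrive at once, but a semaphore of 10 never lets more than 10 of them run at the same time:

```python
import asyncio

async def main() -> int:
    sem = asyncio.Semaphore(10)   # cap: at most 10 sessions in flight
    active = 0
    peak = 0

    async def session():
        nonlocal active, peak
        async with sem:           # blocks while all 10 slots are taken
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)   # simulated workload
            active -= 1

    # 100 "requests" arrive at once, but only 10 ever run together
    await asyncio.gather(*(session() for _ in range(100)))
    return peak

peak = asyncio.run(main())
print(peak)  # 10
```

Without the semaphore, all 100 coroutines would hold their resources simultaneously; with it, memory and CPU pressure stay bounded by the cap.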

Therefore, the solution implemented here includes:

  • A concurrency cap of 50 active sessions
  • A queue mechanism for overflow requests
  • A timeout guard using a 60-second RTO limit

High-Level Flow

At a high level, the intended system flow works as follows:

  1. The client sends a request to create a new session.
  2. The server checks the number of currently active sessions.
  3. If a slot is still available (< 50), the session starts immediately.
  4. If the session limit has been reached (>= 50), the request enters the queue.
  5. The queue is processed as soon as active session slots become available.
  6. If a request waits longer than 60 seconds, the session expires.
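The flow above can be rehearsed without FastAPI at all. The sketch below scales the numbers down (2 slots, a 0.05-second stand-in for the 60-second RTO) so the timeout is quick to observe; names and timings are illustrative only:

```python
import asyncio

MAX_ACTIVE = 2    # stand-in for the 50-session cap
RTO = 0.05        # stand-in for the 60-second RTO limit

async def request(sem: asyncio.Semaphore, hold: float) -> str:
    try:
        # Steps 2-4: take a slot if one is free, otherwise wait in line
        await asyncio.wait_for(sem.acquire(), timeout=RTO)
    except asyncio.TimeoutError:
        return "expired"           # step 6: waited past the RTO limit
    try:
        await asyncio.sleep(hold)  # the session workload itself
        return "served"
    finally:
        sem.release()              # step 5: free the slot for the queue

async def main() -> list:
    sem = asyncio.Semaphore(MAX_ACTIVE)
    # Two long sessions occupy both slots; the third can only queue up
    return await asyncio.gather(
        request(sem, 0.2), request(sem, 0.2), request(sem, 0.01)
    )

results = asyncio.run(main())
print(results)  # ['served', 'served', 'expired']
```

The third request never gets a slot within its RTO window, so it expires exactly as step 6 describes.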

Core Configuration

The first step is defining the concurrency limit and RTO timeout values:

MAX_ACTIVE_SESSIONS = 50
RTO_LIMIT_SECONDS = 60

Session Manager (Semaphore + FIFO Queue)

For a simple implementation, asyncio.Semaphore covers both requirements at once: it caps the number of concurrent sessions at 50, and any coroutine that calls acquire() while all slots are taken is parked in the semaphore's internal FIFO wait queue. That internal queue is exactly the overflow queue we need, so no separate asyncio.Queue has to be managed.

import asyncio

class SessionManager:
    def __init__(self):
        # Counting semaphore initialised to 50: acquire() takes a slot,
        # release() returns one. Coroutines that call acquire() while
        # the counter is zero wait in FIFO order, which serves as the
        # overflow queue.
        self.semaphore = asyncio.Semaphore(MAX_ACTIVE_SESSIONS)

    async def acquire_slot(self):
        await self.semaphore.acquire()

    def release_slot(self):
        self.semaphore.release()

FastAPI Endpoint Implementation

The endpoint below starts a session immediately when a slot is free. When the system is at full capacity, the request waits in the semaphore's FIFO queue, and asyncio.wait_for enforces the RTO limit: if no slot opens within 60 seconds, the pending acquire() is cancelled and the request expires.

from fastapi import FastAPI, HTTPException
import asyncio

app = FastAPI()
session_manager = SessionManager()

@app.post("/session")
async def create_session(data: dict):
    try:
        # If a slot is free, acquire() returns immediately; otherwise
        # this request waits in FIFO order until a slot opens or the
        # 60-second RTO window closes.
        await asyncio.wait_for(
            session_manager.acquire_slot(),
            timeout=RTO_LIMIT_SECONDS,
        )
    except asyncio.TimeoutError:
        # No slot became available within the RTO window
        raise HTTPException(
            status_code=408,
            detail="Session expired: RTO limit (60s) exceeded",
        )

    return await run_session(data)

Running & Releasing Session Slot

It is critical that session slots are always released after execution so that queued requests can continue processing.

Using try/finally guarantees slot release even if errors occur.

async def run_session(data: dict):
    try:
        # Simulate session workload
        await asyncio.sleep(5)

        return {
            "status": "running",
            "data": data
        }

    finally:
        # Release the slot acquired in create_session so the next
        # queued request can proceed
        session_manager.release_slot()
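A quick way to confirm the finally branch does its job: in the sketch below a session crashes mid-workload, yet its semaphore slot comes back afterwards (the names are illustrative, not part of the endpoint above):

```python
import asyncio

async def demo() -> bool:
    sem = asyncio.Semaphore(1)

    async def failing_session():
        await sem.acquire()
        try:
            raise RuntimeError("workload crashed")  # simulated failure
        finally:
            sem.release()   # runs despite the exception

    try:
        await failing_session()
    except RuntimeError:
        pass

    # locked() is True only when no permit is available, so False here
    # proves the crashed session gave its slot back
    return sem.locked()

print(asyncio.run(demo()))  # False
```

If release() lived in the try body instead, one crashed session would permanently consume a slot and the cap would shrink over time.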

Notes for Production Use

This approach is suitable for a single FastAPI instance in development or testing environments. However, for production systems running multiple workers or multiple instances, the solution should be upgraded with distributed infrastructure such as:

  • Redis-based queue for global request coordination
  • Distributed semaphore so concurrency limits remain consistent across instances
  • Background worker pools for processing sessions asynchronously
  • Monitoring queue depth and wait time to prevent overload
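The first two points lean on Redis's atomic counter commands. The snippet below is a sketch only: FakeRedis is an in-memory stand-in for a real Redis client (e.g. redis-py), and try_acquire_global_slot / release_global_slot are hypothetical helper names. A production version would also need TTLs or heartbeats so that slots held by crashed instances are eventually reclaimed.

```python
class FakeRedis:
    """In-memory stand-in for the Redis INCR/DECR commands used below."""
    def __init__(self):
        self.store = {}

    def incr(self, key):
        self.store[key] = self.store.get(key, 0) + 1
        return self.store[key]

    def decr(self, key):
        self.store[key] = self.store.get(key, 0) - 1
        return self.store[key]

def try_acquire_global_slot(r, key="active_sessions", limit=50):
    # INCR is atomic in Redis, so every instance sees one shared counter
    if r.incr(key) <= limit:
        return True
    r.decr(key)  # over the limit: undo our increment
    return False

def release_global_slot(r, key="active_sessions"):
    r.decr(key)

r = FakeRedis()
# With a global limit of 3, only 3 of 5 competing requests get a slot
granted = sum(try_acquire_global_slot(r, limit=3) for _ in range(5))
print(granted)  # 3
```

Because the counter lives in one shared store, the 50-session cap holds across every worker and instance, unlike a per-process asyncio.Semaphore.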

Conclusion

By implementing a concurrency cap, queue overflow handling, and a strict RTO timeout, FastAPI can safely manage multi-session workloads in a stable and predictable way.

The key requirements achieved are:

  • A maximum of 50 concurrent active sessions
  • Overflow requests are placed into a queue
  • Requests cannot wait longer than 60 seconds (RTO limit)

This design ensures the system remains stable even under heavy traffic and prevents uncontrolled overload conditions.