# Scaling Out on Modal
## Automatic Autoscaling
Every Modal Function corresponds to an autoscaling pool of containers. Modal's autoscaler:
- Spins up containers when there is no capacity available for new inputs
- Spins down containers when resources are idle
- Scales to zero by default when there are no inputs to process
Autoscaling decisions are made quickly and frequently.
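The snippets in this document assume a `modal.App` defined at module scope; a minimal sketch (the app name is a placeholder):
```python
import modal

# Hypothetical app; the `@app.function()` decorators below attach to an app like this
app = modal.App("scaling-example")
```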
## Parallel Execution with `.map()`
Run a function repeatedly with different inputs in parallel:
```python
@app.function()
def evaluate_model(x):
    return x ** 2

@app.local_entrypoint()
def main():
    inputs = list(range(100))
    # Runs 100 inputs in parallel across containers
    for result in evaluate_model.map(inputs):
        print(result)
```
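When result order doesn't matter, `.map()` also accepts `order_outputs=False` to yield results as they complete; a minimal sketch reusing `evaluate_model` from above:
```python
@app.local_entrypoint()
def main():
    # Results arrive in completion order, not input order
    for result in evaluate_model.map(range(100), order_outputs=False):
        print(result)
```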
### Multiple Arguments with `.starmap()`
For functions with multiple arguments:
```python
@app.function()
def add(a, b):
    return a + b

@app.local_entrypoint()
def main():
    results = list(add.starmap([(1, 2), (3, 4)]))
    # [3, 7]
```
### Exception Handling
```python
@app.function()
def may_fail(a):
    if a == 2:
        raise Exception("error")
    return a ** 2

@app.local_entrypoint()
def main():
    results = list(may_fail.map(
        range(3),
        return_exceptions=True,
        wrap_returned_exceptions=False,
    ))
    # [0, 1, Exception('error')]
```
## Autoscaling Configuration
Configure autoscaler behavior with parameters:
```python
@app.function(
    max_containers=100,   # Upper limit on containers
    min_containers=2,     # Keep warm even when inactive
    buffer_containers=5,  # Maintain buffer while active
    scaledown_window=60,  # Max idle time before scaling down (seconds)
)
def my_function():
    ...
```
Parameters:
- **max_containers**: Upper limit on total containers
- **min_containers**: Minimum kept warm even when inactive
- **buffer_containers**: Extra containers kept while the Function is active, so additional inputs don't need to queue
- **scaledown_window**: Maximum idle duration before a container scales down (seconds)
Trade-offs:
- Larger warm pool/buffer → Higher cost, lower latency
- Longer scaledown window → Less churn for infrequent requests
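For example, a latency-sensitive service and a cost-sensitive batch job might sit at opposite ends of these trade-offs (illustrative values):
```python
# Latency-sensitive: pay for warm capacity to avoid cold starts
@app.function(min_containers=4, buffer_containers=2, scaledown_window=300)
def serve_request():
    ...

# Cost-sensitive: scale to zero aggressively
@app.function(max_containers=50, scaledown_window=30)
def batch_task():
    ...
```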
## Dynamic Autoscaler Updates
Update autoscaler settings without redeployment:
```python
f = modal.Function.from_name("my-app", "f")
f.update_autoscaler(max_containers=100)
```
Settings revert to the decorator configuration on the next deploy; until then, further updates layer on top of earlier ones:
```python
f.update_autoscaler(min_containers=2, max_containers=10)
f.update_autoscaler(min_containers=4) # max_containers=10 still in effect
```
### Time-Based Scaling
Adjust warm pool based on time of day:
```python
@app.function()
def inference_server():
    ...

@app.function(schedule=modal.Cron("0 6 * * *", timezone="America/New_York"))
def increase_warm_pool():
    inference_server.update_autoscaler(min_containers=4)

@app.function(schedule=modal.Cron("0 22 * * *", timezone="America/New_York"))
def decrease_warm_pool():
    inference_server.update_autoscaler(min_containers=0)
```
### For Classes
Update the autoscaler for a specific parameterized instance:
```python
MyClass = modal.Cls.from_name("my-app", "MyClass")
obj = MyClass(model_version="3.5")
obj.update_autoscaler(buffer_containers=2) # type: ignore
```
## Input Concurrency
Process multiple inputs per container with `@modal.concurrent`:
```python
@app.function()
@modal.concurrent(max_inputs=100)
def my_function(input: str):
    # Container can handle up to 100 concurrent inputs
    ...
```
Ideal for I/O-bound workloads (see the sketch after this list):
- Database queries
- External API requests
- Remote Modal Function calls
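A minimal sketch of an I/O-bound Function that benefits from input concurrency (the URL argument is a placeholder):
```python
import urllib.request

@app.function()
@modal.concurrent(max_inputs=50)
def fetch_status(url: str) -> int:
    # Each input spends most of its time blocked on network I/O,
    # so one container can overlap many requests
    with urllib.request.urlopen(url) as resp:  # placeholder external call
        return resp.status
```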
### Concurrency Mechanisms
**Synchronous Functions**: Separate threads (must be thread-safe)
```python
import time

@app.function()
@modal.concurrent(max_inputs=10)
def sync_function():
    time.sleep(1)  # Runs in a separate thread; code must be thread-safe
```
**Async Functions**: Separate asyncio tasks (must not block event loop)
```python
import asyncio

@app.function()
@modal.concurrent(max_inputs=10)
async def async_function():
    await asyncio.sleep(1)  # Must not block the event loop
```
### Target vs Max Inputs
```python
@app.function()
@modal.concurrent(
    max_inputs=120,     # Hard limit
    target_inputs=100,  # Autoscaler target
)
def my_function(input: str):
    # Allows a 20% burst above target
    ...
```
Autoscaler aims for `target_inputs`, but containers can burst to `max_inputs` during scale-up.
## Scaling Limits
Modal enforces limits per function:
- 2,000 pending inputs (not yet assigned to containers)
- 25,000 total inputs (running + pending)
For `.spawn()` async jobs: up to 1 million pending inputs.
Exceeding limits returns `Resource Exhausted` error - retry later.
Each `.map()` invocation: max 1,000 concurrent inputs.
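For workloads beyond the synchronous limits, `.spawn()` enqueues calls without blocking and returns handles to poll later; a minimal sketch reusing `evaluate_model` from above:
```python
@app.local_entrypoint()
def main():
    # Enqueue calls without waiting; each returns a FunctionCall handle
    calls = [evaluate_model.spawn(i) for i in range(10)]
    # Collect results later; .get() blocks until each call completes
    print([c.get() for c in calls])
```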
## Async Usage
Use async APIs for arbitrary parallel execution patterns:
```python
import asyncio

@app.function()
async def async_task(x):
    await asyncio.sleep(1)
    return x * 2

@app.local_entrypoint()
async def main():
    # Spawn 100 remote calls concurrently and gather the results
    tasks = [async_task.remote.aio(i) for i in range(100)]
    results = await asyncio.gather(*tasks)
```
## Common Gotchas
**Incorrect**: Using Python's builtin `map` (runs locally and sequentially)
```python
# DON'T DO THIS
results = map(evaluate_model, inputs)
```
**Incorrect**: Calling the function first, then `.map()` on the result
```python
# DON'T DO THIS
results = evaluate_model(inputs).map()
```
**Correct**: Call `.map()` on the Modal Function object
```python
# DO THIS
results = evaluate_model.map(inputs)
```