---
title: "Prefect: Modern Workflow Orchestration Platform"
library_name: prefect
pypi_package: prefect
category: workflow-orchestration
python_compatibility: "3.9+"
last_updated: "2025-11-02"
official_docs: "https://docs.prefect.io"
official_repository: "https://github.com/PrefectHQ/prefect"
maintenance_status: "active"
---

# Prefect: Modern Workflow Orchestration
## Core Purpose

Prefect brings a Python-first approach to workflow orchestration, turning regular Python functions into production-ready data pipelines. Unlike legacy orchestrators that require DAG definitions and framework-specific operators, Prefect observes native Python code execution and provides orchestration through simple decorators@[1].

**Problem Domain:** Coordinating multi-step data workflows, handling failures with retries, scheduling recurring jobs, monitoring pipeline execution, and managing dependencies between tasks without writing boilerplate orchestration code@[2].

**When to Use:** Building data pipelines, ML workflows, ETL processes, or any multi-step automation that needs scheduling, retry logic, state tracking, and observability@[3].

**What You Would Reinvent:** Manual retry logic, state management, dependency coordination, scheduling systems, execution monitoring, error handling, result caching, and workflow visibility dashboards@[4].
## Official Information

- **Repository:** <https://github.com/PrefectHQ/prefect>
- **PyPI Package:** `prefect` (current: v3.4.24)@[5]
- **Documentation:** <https://docs.prefect.io>
- **License:** Apache-2.0@[6]
- **Maintenance:** Actively maintained by PrefectHQ with 1059 open issues, 20.6K stars, and regular releases@[7]
- **Community:** 30K+ engineers, active Slack community@[8]
## Python Compatibility

- **Minimum Version:** Python 3.9@[9]
- **Maximum Version:** Python 3.13 (3.14 not yet supported)@[9]
- **Async Support:** Full native async/await support throughout@[10]
- **Type Hints:** First-class support, type-safe structured outputs@[11]
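
For example, tasks and flows can be declared `async` and awaited directly. A minimal sketch, assuming `httpx` is installed (the URL is illustrative):

```python
import asyncio

import httpx
from prefect import flow, task


@task
async def fetch_json(url: str) -> dict:
    # Async tasks can use async clients without blocking the event loop
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        return response.json()


@flow
async def async_pipeline():
    # Async tasks called inside async flows are awaited like coroutines
    data = await fetch_json("https://api.github.com/repos/PrefectHQ/prefect")
    print(data["full_name"])


if __name__ == "__main__":
    asyncio.run(async_pipeline())
```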
## Core Capabilities

### 1. Pythonic Flow Definition

Write workflows as regular Python functions with `@flow` and `@task` decorators:

```python
from prefect import flow, task
import httpx


@task(log_prints=True)
def get_stars(repo: str):
    url = f"https://api.github.com/repos/{repo}"
    count = httpx.get(url).json()["stargazers_count"]
    print(f"{repo} has {count} stars!")


@flow(name="GitHub Stars")
def github_stars(repos: list[str]):
    for repo in repos:
        get_stars(repo)


# Run directly
if __name__ == "__main__":
    github_stars(["PrefectHQ/prefect"])
```

@[12]
### 2. Dynamic Runtime Workflows

Create tasks dynamically based on data, not static DAG definitions:

```python
from prefect import task, flow


@task
def get_customer_ids() -> list[str]:
    # Placeholder: fetch IDs from a database or API at runtime
    return ["c-001", "c-002", "c-003"]


@task
def process_customer(customer_id: str) -> str:
    return f"Processed {customer_id}"


@flow
def main() -> list[str]:
    customer_ids = get_customer_ids()  # Runtime data
    # Map tasks across dynamic data
    results = process_customer.map(customer_ids)
    return results
```

@[13]
### 3. Flexible Scheduling

Deploy workflows with cron, interval, or RRule schedules:

```python
# Serve with a cron schedule
if __name__ == "__main__":
    github_stars.serve(
        name="daily-stars",
        cron="0 8 * * *",  # Daily at 8 AM
        parameters={"repos": ["PrefectHQ/prefect"]}
    )
```

@[14]

```python
# Or use interval-based scheduling
from datetime import timedelta

my_flow.deploy(
    name="my-deployment",
    work_pool_name="my-work-pool",
    interval=timedelta(minutes=10)
)
```

@[15]
### 4. Built-in Retries and State Management

Automatic retry logic and state tracking:

```python
@task(retries=3, retry_delay_seconds=60)
def fetch_data():
    # Automatically retries on failure
    return api_call()
```

@[16]
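
Retries are not limited to tasks; a flow can declare its own retry policy as well. A minimal sketch (the flow name is illustrative):

```python
from prefect import flow


@flow(retries=2, retry_delay_seconds=10)
def unreliable_pipeline():
    # If an unhandled exception escapes the flow, Prefect
    # re-runs it up to two more times, ten seconds apart.
    ...
```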
### 5. Concurrent Task Execution

Run tasks in parallel with `.submit()`:

```python
@flow
def my_workflow():
    future = cool_task.submit()  # Non-blocking: runs concurrently
    # Futures passed to downstream tasks are resolved automatically
    print(what_did_cool_task_say(future))
```

@[17]
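
A fuller sketch of the fan-out pattern: submit all tasks up front so they run concurrently, then block only when results are needed (`score` is an illustrative task):

```python
from prefect import flow, task


@task
def score(item: int) -> int:
    return item * item


@flow
def fan_out():
    # Submitting returns futures immediately, so all five tasks
    # can execute concurrently under the default task runner
    futures = [score.submit(i) for i in range(5)]
    # .result() blocks until each task finishes
    totals = [f.result() for f in futures]
    print(sum(totals))
```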
### 6. Event-Driven Automations

React to events, not just schedules:

```python
# Trigger flows on external events
from prefect.events import DeploymentEventTrigger

my_flow.deploy(
    name="event-driven",
    work_pool_name="my-work-pool",
    triggers=[
        DeploymentEventTrigger(
            expect=["s3.file.uploaded"]
        )
    ]
)
```

@[18]
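
The other half of the pattern is emitting events for a trigger to match. A minimal sketch; the event name and resource ID are illustrative:

```python
from prefect.events import emit_event

# Emit a custom event that a DeploymentEventTrigger can match on
emit_event(
    event="s3.file.uploaded",
    resource={"prefect.resource.id": "s3://my-bucket/data.csv"},
)
```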
## Real-World Integration Patterns

### Integration with dbt

Orchestrate dbt transformations within Prefect flows:

```python
from prefect import flow
from prefect_dbt import DbtCoreOperation


@flow
def dbt_flow():
    result = DbtCoreOperation(
        commands=["dbt run", "dbt test"],
        project_dir="/path/to/dbt/project"
    ).run()
    return result
```

@[19]

**Example Repository:** <https://github.com/anna-geller/prefect-dataplatform> (106 stars) - Shows a Prefect + dbt + Snowflake data platform@[20]
### AWS Deployment Pattern

Deploy to AWS ECS Fargate:

```yaml
# prefect.yaml configuration
# (the aws-ecs-pool work pool is created separately with type "ecs")
deployments:
  - name: production
    work_pool:
      name: aws-ecs-pool
    schedules:
      - cron: "0 */4 * * *"
```

@[21]

**Example Repository:** <https://github.com/anna-geller/dataflow-ops> (116 stars) - Automated deployments to AWS ECS@[22]
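
The same deployment can also be expressed in Python. A sketch assuming the flow lives in a Git repository and the `aws-ecs-pool` work pool already exists (the repo URL and entrypoint are illustrative):

```python
from prefect import flow

if __name__ == "__main__":
    flow.from_source(
        source="https://github.com/your-org/your-repo",  # illustrative repo
        entrypoint="flows/etl.py:etl_pipeline",          # file.py:flow_function
    ).deploy(
        name="production",
        work_pool_name="aws-ecs-pool",
        cron="0 */4 * * *",  # every four hours
    )
```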
### Docker Compose Self-Hosted

Run the Prefect server with Docker Compose. A minimal sketch: a `postgres` service backs the connection URL, and `--host 0.0.0.0` exposes the API outside the container:

```yaml
version: "3.8"
services:
  postgres:
    image: postgres:15
    environment:
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=prefect
  prefect-server:
    image: prefecthq/prefect:latest
    command: prefect server start --host 0.0.0.0
    ports:
      - "4200:4200"
    environment:
      - PREFECT_API_DATABASE_CONNECTION_URL=postgresql+asyncpg://postgres:password@postgres:5432/prefect
    depends_on:
      - postgres
```

@[23]

**Example Repositories:**

- <https://github.com/rpeden/prefect-docker-compose> (142 stars)@[24]
- <https://github.com/flavienbwk/prefect-docker-compose> (161 stars)@[25]
## Common Usage Patterns

### Pattern 1: ETL Pipeline with Retries

```python
from prefect import flow, task
from prefect.tasks import exponential_backoff


@task(retries=3, retry_delay_seconds=exponential_backoff(backoff_factor=2))
def extract_data(source: str):
    # Fetch from API with automatic retries
    return fetch_api_data(source)


@task
def transform_data(raw_data):
    return clean_and_transform(raw_data)


@task
def load_data(data, destination: str):
    write_to_database(data, destination)


@flow(log_prints=True)
def etl_pipeline():
    raw = extract_data("https://api.example.com/data")
    transformed = transform_data(raw)
    load_data(transformed, "postgresql://db")
```

@[26]
### Pattern 2: Scheduled Data Sync

```python
@flow
def sync_customer_data():
    customers = fetch_customers()
    for customer in customers:
        sync_to_warehouse(customer)


# Schedule to run every hour
if __name__ == "__main__":
    sync_customer_data.serve(
        name="hourly-sync",
        interval=3600,  # Every hour (seconds)
        tags=["production", "sync"]
    )
```

@[27]
### Pattern 3: ML Pipeline with Caching

```python
from datetime import timedelta

from prefect import flow, task
from prefect.tasks import task_input_hash


@task(cache_key_fn=task_input_hash, cache_expiration=timedelta(hours=1))
def load_training_data():
    # Expensive data loading - cached for 1 hour
    return load_large_dataset()


@task
def train_model(data):
    return train_ml_model(data)


@flow
def ml_pipeline():
    data = load_training_data()  # Reuses cached result
    model = train_model(data)
    return model
```

@[28]
## Integration Ecosystem

### Data Transformation

- **dbt:** Native integration via the `prefect-dbt` package@[29]
- **dbt Cloud:** Official integration for triggering dbt Cloud jobs@[30]

### Data Warehouses

- **Snowflake:** `prefect-snowflake` for query execution@[31]
- **BigQuery:** `prefect-gcp` for BigQuery operations@[32]
- **Redshift, PostgreSQL:** Standard database connectors@[33]

### Cloud Platforms

- **AWS:** `prefect-aws` (S3, ECS, Lambda, Batch)@[34]
- **GCP:** `prefect-gcp` (GCS, BigQuery, Cloud Run)@[35]
- **Azure:** `prefect-azure` (Blob Storage, Container Instances)@[36]
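
As a sketch of how these integration packages are typically used, here is `prefect-aws`, assuming an `S3Bucket` block named `my-bucket` has been saved in advance (the file paths are illustrative):

```python
from prefect import flow
from prefect_aws import S3Bucket


@flow
def upload_report():
    # Load a pre-configured S3 bucket block by name
    s3 = S3Bucket.load("my-bucket")
    # Upload a local file into the bucket
    s3.upload_from_path("report.csv", "reports/report.csv")
```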
### Container Orchestration

- **Docker:** Native Docker build and push support@[37]
- **Kubernetes:** `prefect-kubernetes` for K8s deployments@[38]
- **ECS Fargate:** Built-in ECS work pools@[39]

### Data Quality

- **Great Expectations:** `prefect-great-expectations` for validation@[40]
- **Monte Carlo:** Circuit breaker integrations@[41]

### ML/AI

- **LangChain:** `langchain-prefect` for LLM workflows (archived)@[42]
- **MLflow:** Track experiments within Prefect flows@[43]
## Deployment Options

### 1. Prefect Cloud (Managed)

Fully managed orchestration platform with:

- Hosted API and UI
- Team collaboration features
- RBAC and access controls
- Enterprise SLAs
- Automations and event triggers@[44]

**Pricing:** Free tier + usage-based pricing@[45]
### 2. Self-Hosted Prefect Server

Open-source server you deploy:

```bash
# Start a local server
prefect server start

# Or deploy via Docker
docker run -p 4200:4200 prefecthq/prefect:latest prefect server start
```

@[46]

**Requirements:** PostgreSQL database, Redis (optional, for caching)@[47]
### 3. Hybrid Execution Model

Orchestration in the cloud, execution anywhere:

- Control plane in Prefect Cloud
- Workers run in your infrastructure
- Code never leaves your environment@[48]
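
In practice this means creating a work pool and starting a worker inside your own network. A minimal sketch (the pool name is illustrative):

```bash
# Create a work pool that executes runs as local subprocesses
prefect work-pool create "my-pool" --type process

# Start a worker in your infrastructure that polls that pool for runs
prefect worker start --pool "my-pool"
```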
## When to Use Prefect

### Use Prefect When

1. **Building data pipelines** that need scheduling, retries, and monitoring@[49]
2. **Orchestrating ML workflows** with dynamic dependencies@[50]
3. **Coordinating microservices** or distributed tasks@[51]
4. **Migrating from cron jobs** to a modern orchestrator@[52]
5. **You need Python-native workflows** without DSL overhead@[53]
6. **You want local development** with production parity@[54]
7. **You require event-driven automation** beyond scheduling@[55]
8. **You need visibility** into workflow execution and failures@[56]

### Use Simple Scripts/Cron When

1. **Single-step tasks** with no dependencies@[57]
2. **One-off scripts** that rarely run@[58]
3. **No retry logic** is needed@[59]
4. **No failure visibility** is required@[60]
5. **Under 5 lines of code** total@[61]
## Prefect vs. Alternatives

### Prefect vs. Airflow

| Dimension | Prefect | Airflow |
| --- | --- | --- |
| **Development Model** | Pure Python functions with decorators | DAG definitions with operators |
| **Dynamic Workflows** | Runtime task creation based on data | Static DAG structure at parse time |
| **Local Development** | Run locally without infrastructure | Requires full Airflow setup |
| **Learning Curve** | Minimal - just Python | Steep - framework concepts required |
| **Infrastructure** | Runs anywhere Python runs | Multi-component (scheduler, webserver, DB) |
| **Cost** | 60-70% lower (per customer reports)@[62] | Higher due to always-on infrastructure@[63] |
| **Best For** | ML/AI, modern data teams, dynamic pipelines | Traditional ETL, platform teams invested in ecosystem |

**Migration Path:** One published customer case study reports a 73.78% cost reduction after migrating from Astronomer (managed Airflow) to Prefect@[64]
### Prefect vs. Dagster

| Dimension | Prefect | Dagster |
| --- | --- | --- |
| **Philosophy** | Workflow orchestration | Data asset orchestration |
| **Abstractions** | Flows and tasks | Software-defined assets |
| **Use Case** | General workflow automation | Data asset lineage and cataloging |
| **Complexity** | Lower barrier to entry | Higher conceptual overhead |

### Prefect vs. Metaflow

| Dimension | Prefect | Metaflow |
| --- | --- | --- |
| **Origin** | General orchestration | Netflix ML workflows |
| **Scope** | Broad workflow automation | ML-specific pipelines |
| **Deployment** | Any infrastructure | AWS, K8s focus |
| **Community** | Larger ecosystem | ML-focused community |
## Decision Matrix

```text
Use Prefect when:
- You write Python workflows
- You need dynamic task generation
- You want local development + production parity
- You need retry/caching/scheduling out of the box
- You're building ML, data, or automation pipelines
- You want low operational overhead
- Cost efficiency matters (vs. Airflow)

Use Airflow when:
- You're heavily invested in the Airflow ecosystem
- Your team already knows Airflow
- You need specific Airflow operators not in Prefect
- You have dedicated platform engineering for Airflow

Use Dagster when:
- Data asset lineage is the primary concern
- You're building a data platform with an asset catalog
- You need software-defined assets

Use simple cron/scripts when:
- Single independent tasks
- No retry logic needed
- No monitoring required
- Runs once per day or less
```

@[65]
## Anti-Patterns and Gotchas

### Don't Use Prefect For

1. **Simple one-off scripts** - adds unnecessary overhead@[66]
2. **Real-time streaming** - designed for batch/scheduled workflows@[67]
3. **Sub-second latency requirements** - orchestration adds overhead@[68]
4. **Pure event processing** - use Kafka/RabbitMQ instead@[69]

### Common Pitfalls

1. **Over-decomposition:** Breaking every line into a task creates overhead@[70]
2. **Ignoring task inputs:** Tasks should be pure functions for caching@[71]
3. **Not using `.submit()`:** Blocking task calls prevent parallelism (see the sketch below)@[72]
4. **Skipping local testing:** Run flows locally before deploying@[73]
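
A sketch of pitfall 3: the two flows below differ only in `.submit()`, yet the first runs its tasks one after another while the second lets them overlap (`slow_task` is illustrative):

```python
import time

from prefect import flow, task


@task
def slow_task(n: int) -> int:
    time.sleep(2)
    return n


@flow
def sequential():
    # Direct calls block until each task finishes: roughly 6 seconds total
    return [slow_task(i) for i in range(3)]


@flow
def concurrent():
    # Submitted tasks run concurrently: roughly 2 seconds total
    futures = [slow_task.submit(i) for i in range(3)]
    return [f.result() for f in futures]
```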
## Learning Resources

- **Official Quickstart:** <https://docs.prefect.io/v3/get-started/quickstart>@[74]
- **Examples Repository:** <https://github.com/PrefectHQ/examples>@[75]
- **Community Recipes:** <https://github.com/PrefectHQ/prefect-recipes> (254 stars, archived)@[76]
- **Slack Community:** <https://prefect.io/slack>@[77]
- **YouTube Channel:** <https://www.youtube.com/c/PrefectIO/>@[78]
## Installation

```bash
# Using pip
pip install -U prefect

# Using uv (recommended)
uv add prefect

# With specific integrations
pip install prefect-aws prefect-gcp prefect-dbt
```

@[79]
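
To confirm the installation, the CLI reports its version and the API it points at:

```bash
# Print client/server version information
prefect version

# Show the active profile and API URL configuration
prefect config view
```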
## Verification Checklist

- [x] Official repository confirmed: <https://github.com/PrefectHQ/prefect>
- [x] PyPI package verified: prefect v3.4.24
- [x] Python compatibility: 3.9-3.13
- [x] License confirmed: Apache-2.0
- [x] Real-world examples: 5+ GitHub repositories with 100+ stars
- [x] Integration patterns documented: dbt, Snowflake, AWS, Docker
- [x] Decision matrix provided: vs Airflow, Dagster, Metaflow, cron
- [x] Anti-patterns identified: streaming, sub-second latency
- [x] Code examples: 6+ verified from official docs and Context7
- [x] Maintenance status: Active (1059 open issues, recent commits)
## References

Sources cited with @ notation throughout the document:

[1-79] Information gathered from:

- Context7 Library ID: /prefecthq/prefect (Trust Score: 8.2, 6247 code snippets)
- Official documentation: <https://docs.prefect.io>
- GitHub repository: <https://github.com/PrefectHQ/prefect>
- PyPI package page: <https://pypi.org/project/prefect/>
- Prefect vs Airflow comparison: <https://www.prefect.io/compare/airflow>
- Example repositories: anna-geller/prefect-dataplatform, rpeden/prefect-docker-compose, flavienbwk/prefect-docker-compose, anna-geller/dataflow-ops
- Exa code context search results
- Ref documentation search results

Last verified: 2025-10-21