---
title: "Prefect: Modern Workflow Orchestration Platform"
library_name: prefect
pypi_package: prefect
category: workflow-orchestration
python_compatibility: "3.9+"
last_updated: "2025-11-02"
official_docs: "https://docs.prefect.io"
official_repository: "https://github.com/PrefectHQ/prefect"
maintenance_status: "active"
---

# Prefect: Modern Workflow Orchestration

## Core Purpose

Prefect solves workflow orchestration with a Python-first approach that turns regular Python functions into production-ready data pipelines. Unlike legacy orchestrators that require DAG definitions and framework-specific operators, Prefect observes native Python code execution and provides orchestration through simple decorators@[1].

**Problem Domain:** Coordinating multi-step data workflows, handling failures with retries, scheduling recurring jobs, monitoring pipeline execution, and managing dependencies between tasks without writing boilerplate orchestration code@[2].

**When to Use:** Building data pipelines, ML workflows, ETL processes, or any multi-step automation that needs scheduling, retry logic, state tracking, and observability@[3].

**What You Would Reinvent:** Manual retry logic, state management, dependency coordination, scheduling systems, execution monitoring, error handling, result caching, and workflow visibility dashboards@[4].

## Official Information

- **Repository:** <https://github.com/PrefectHQ/prefect>
- **PyPI Package:** `prefect` (current: v3.4.24)@[5]
- **Documentation:** <https://docs.prefect.io>
- **License:** Apache-2.0@[6]
- **Maintenance:** Actively maintained by PrefectHQ with 1059 open issues, 20.6K stars, and regular releases@[7]
- **Community:** 30K+ engineers, active Slack community@[8]

## Python Compatibility

- **Minimum Version:** Python 3.9@[9]
- **Maximum Version:** Python 3.13 (3.14 not yet supported)@[9]
- **Async Support:** Full native async/await support throughout@[10]
- **Type Hints:** First-class support, type-safe structured outputs@[11]

## Core Capabilities

### 1. Pythonic Flow Definition

Write workflows as regular Python functions with `@flow` and `@task` decorators:

```python
from prefect import flow, task
import httpx


@task(log_prints=True)
def get_stars(repo: str):
    url = f"https://api.github.com/repos/{repo}"
    count = httpx.get(url).json()["stargazers_count"]
    print(f"{repo} has {count} stars!")


@flow(name="GitHub Stars")
def github_stars(repos: list[str]):
    for repo in repos:
        get_stars(repo)


# Run directly
if __name__ == "__main__":
    github_stars(["PrefectHQ/prefect"])
```

@[12]

### 2. Dynamic Runtime Workflows

Create tasks dynamically based on data, not static DAG definitions:

```python
from prefect import task, flow


@task
def get_customer_ids() -> list[str]:
    # Placeholder: in practice, fetch IDs at runtime (database, API, etc.)
    return ["c-001", "c-002", "c-003"]


@task
def process_customer(customer_id: str) -> str:
    return f"Processed {customer_id}"


@flow
def main() -> list[str]:
    customer_ids = get_customer_ids()  # Runtime data
    # Map the task across dynamic data (one task run per customer ID)
    results = process_customer.map(customer_ids)
    return results
```

@[13]

### 3. Flexible Scheduling

Deploy workflows with cron, interval, or RRule schedules:

```python
# Serve with cron schedule
if __name__ == "__main__":
    github_stars.serve(
        name="daily-stars",
        cron="0 8 * * *",  # Daily at 8 AM
        parameters={"repos": ["PrefectHQ/prefect"]}
    )
```

@[14]

```python
from datetime import timedelta

# Or use interval-based scheduling
# (my_flow is any @flow-decorated function)
my_flow.deploy(
    name="my-deployment",
    work_pool_name="my-work-pool",
    interval=timedelta(minutes=10)
)
```

@[15]

### 4. Built-in Retries and State Management

Automatic retry logic and state tracking:

```python
@task(retries=3, retry_delay_seconds=60)
def fetch_data():
    # Automatically retries on failure
    # (api_call is a placeholder for your own client code)
    return api_call()
```

@[16]

### 5. Concurrent Task Execution

Run tasks in parallel with `.submit()`:

```python
@flow
def my_workflow():
    future = cool_task.submit()  # Non-blocking; returns a PrefectFuture
    print(future.result())       # Block only when the result is needed
```

@[17]

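Fanning out all submissions before collecting any results is what actually buys concurrency. A minimal runnable sketch (the `square` task and the input range are illustrative, not from the original docs):

```python
from prefect import flow, task


@task
def square(n: int) -> int:
    return n * n


@flow
def fan_out() -> list[int]:
    # Submit everything first so the task runs can overlap...
    futures = [square.submit(n) for n in range(5)]
    # ...then block on results only after all work is in flight
    return [f.result() for f in futures]


if __name__ == "__main__":
    print(fan_out())  # [0, 1, 4, 9, 16]
```
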
### 6. Event-Driven Automations

React to events, not just schedules:

```python
from prefect.events import DeploymentEventTrigger

# Trigger flows on external events
my_flow.deploy(
    # ... name, work pool, and other deploy arguments omitted
    triggers=[
        DeploymentEventTrigger(
            expect=["s3.file.uploaded"]
        )
    ]
)
```

@[18]

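The producing side can be any code that emits a matching event. A hedged sketch using `emit_event` (the event name mirrors the trigger above; the `notify_upload` helper and resource-ID format are illustrative):

```python
from prefect.events import emit_event


def notify_upload(key: str) -> None:
    # Emit a custom event; any deployment whose trigger expects
    # "s3.file.uploaded" can react to it
    emit_event(
        event="s3.file.uploaded",
        resource={"prefect.resource.id": f"s3.object.{key}"},
    )
```
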
## Real-World Integration Patterns

### Integration with dbt

Orchestrate dbt transformations within Prefect flows:

```python
from prefect import flow
from prefect_dbt import DbtCoreOperation


@flow
def dbt_flow():
    result = DbtCoreOperation(
        commands=["dbt run", "dbt test"],
        project_dir="/path/to/dbt/project"
    ).run()
    return result
```

@[19]

**Example Repository:** <https://github.com/anna-geller/prefect-dataplatform> (106 stars) - Shows Prefect + dbt + Snowflake data platform@[20]

### AWS Deployment Pattern

Deploy to AWS ECS Fargate:

```yaml
# prefect.yaml configuration
work_pool:
  name: aws-ecs-pool
  type: ecs

deployments:
  - name: production
    work_pool_name: aws-ecs-pool
    schedules:
      - cron: "0 */4 * * *"
```

@[21]

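Assuming the ECS work pool already exists (the pool name here is illustrative), a configuration like this is typically applied from the project root with the Prefect CLI:

```bash
# Create the ECS work pool once
prefect work-pool create aws-ecs-pool --type ecs

# Register every deployment defined in prefect.yaml
prefect deploy --all
```
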
**Example Repository:** <https://github.com/anna-geller/dataflow-ops> (116 stars) - Automated deployments to AWS ECS@[22]

### Docker Compose Self-Hosted

Run Prefect server with Docker Compose (the connection URL references a `postgres` service, so one is defined alongside the server):

```yaml
version: "3.8"
services:
  prefect-server:
    image: prefecthq/prefect:latest
    command: prefect server start
    ports:
      - "4200:4200"
    environment:
      # Listen on all interfaces so the published port is reachable
      - PREFECT_SERVER_API_HOST=0.0.0.0
      - PREFECT_API_DATABASE_CONNECTION_URL=postgresql+asyncpg://postgres:password@postgres:5432/prefect
    depends_on:
      - postgres
  postgres:
    # The connection URL above points at this service
    image: postgres:16
    environment:
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=prefect
```

@[23]

**Example Repositories:**

- <https://github.com/rpeden/prefect-docker-compose> (142 stars)@[24]
- <https://github.com/flavienbwk/prefect-docker-compose> (161 stars)@[25]

## Common Usage Patterns

### Pattern 1: ETL Pipeline with Retries

```python
from prefect import flow, task
from prefect.tasks import exponential_backoff


@task(retries=3, retry_delay_seconds=exponential_backoff(backoff_factor=2))
def extract_data(source: str):
    # Fetch from API with automatic retries
    # (fetch_api_data is a placeholder for your own client code)
    return fetch_api_data(source)


@task
def transform_data(raw_data):
    return clean_and_transform(raw_data)


@task
def load_data(data, destination: str):
    write_to_database(data, destination)


@flow(log_prints=True)
def etl_pipeline():
    raw = extract_data("https://api.example.com/data")
    transformed = transform_data(raw)
    load_data(transformed, "postgresql://db")
```

@[26]

### Pattern 2: Scheduled Data Sync

```python
from prefect import flow


@flow
def sync_customer_data():
    # fetch_customers / sync_to_warehouse are placeholders
    customers = fetch_customers()
    for customer in customers:
        sync_to_warehouse(customer)


# Schedule to run every hour
if __name__ == "__main__":
    sync_customer_data.serve(
        name="hourly-sync",
        interval=3600,  # Every hour, in seconds
        tags=["production", "sync"]
    )
```

@[27]

### Pattern 3: ML Pipeline with Caching

```python
from datetime import timedelta

from prefect import flow, task
from prefect.tasks import task_input_hash


@task(cache_key_fn=task_input_hash, cache_expiration=timedelta(hours=1))
def load_training_data():
    # Expensive data loading - cached for 1 hour
    return load_large_dataset()


@task
def train_model(data):
    return train_ml_model(data)


@flow
def ml_pipeline():
    data = load_training_data()  # Reuses cached result
    model = train_model(data)
    return model
```

@[28]

## Integration Ecosystem

### Data Transformation

- **dbt:** Native integration via `prefect-dbt` package (archived, use dbt Cloud API)@[29]
- **dbt Cloud:** Official integration for triggering dbt Cloud jobs@[30]

### Data Warehouses

- **Snowflake:** `prefect-snowflake` for query execution@[31]
- **BigQuery:** `prefect-gcp` for BigQuery operations@[32]
- **Redshift, PostgreSQL:** Standard database connectors@[33]

### Cloud Platforms

- **AWS:** `prefect-aws` (S3, ECS, Lambda, Batch) - see the sketch after this list@[34]
- **GCP:** `prefect-gcp` (GCS, BigQuery, Cloud Run)@[35]
- **Azure:** `prefect-azure` (Blob Storage, Container Instances)@[36]

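A hedged sketch of what using one of these collections looks like in practice, here `prefect-aws` (assumes an `S3Bucket` block named `my-bucket` was registered beforehand; the flow and block names are illustrative):

```python
from prefect import flow
from prefect_aws import S3Bucket


@flow
def backup_report(local_path: str) -> str:
    # Load credentials + bucket config stored as a Prefect block
    bucket = S3Bucket.load("my-bucket")
    # Upload a local file and return the key it was written to
    return bucket.upload_from_path(local_path)
```
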
### Container Orchestration

- **Docker:** Native Docker build and push support@[37]
- **Kubernetes:** `prefect-kubernetes` for K8s deployments@[38]
- **ECS Fargate:** Built-in ECS work pools@[39]

### Data Quality

- **Great Expectations:** `prefect-great-expectations` for validation@[40]
- **Monte Carlo:** Circuit breaker integrations@[41]

### ML/AI

- **LangChain:** `langchain-prefect` for LLM workflows (archived)@[42]
- **MLflow:** Track experiments within Prefect flows@[43]

## Deployment Options

### 1. Prefect Cloud (Managed)

Fully managed orchestration platform with:

- Hosted API and UI
- Team collaboration features
- RBAC and access controls
- Enterprise SLAs
- Automations and event triggers@[44]

**Pricing:** Free tier + usage-based pricing@[45]

### 2. Self-Hosted Prefect Server

Open-source server you deploy:

```bash
# Start local server
prefect server start

# Or deploy via Docker
docker run -p 4200:4200 prefecthq/prefect:latest prefect server start
```

@[46]

**Requirements:** PostgreSQL database recommended for production (SQLite works for local use); Redis optional for caching@[47]

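Once a server is running, clients and workers are pointed at it through the API URL setting (the localhost address assumes the default port above):

```bash
# Tell the local Prefect client/worker where the self-hosted API lives
prefect config set PREFECT_API_URL="http://127.0.0.1:4200/api"
```
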
### 3. Hybrid Execution Model

Orchestration in cloud, execution anywhere:

- Control plane in Prefect Cloud
- Workers run in your infrastructure (see the sketch after this list)
- Code never leaves your environment@[48]

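A minimal sketch of running such a worker (the pool name is illustrative; assumes the machine can reach Prefect Cloud):

```bash
# Authenticate this environment against Prefect Cloud
prefect cloud login

# Start a worker that polls the work pool for scheduled runs;
# it makes outbound calls only, so no inbound access is needed
prefect worker start --pool "my-work-pool"
```
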
## When to Use Prefect

### Use Prefect When

1. **Building data pipelines** that need scheduling, retries, and monitoring@[49]
2. **Orchestrating ML workflows** with dynamic dependencies@[50]
3. **Coordinating microservices** or distributed tasks@[51]
4. **Migrating from cron jobs** to a modern orchestrator@[52]
5. **You need Python-native workflows** without DSL overhead@[53]
6. **You want local development** with production parity@[54]
7. **You require event-driven automation** beyond scheduling@[55]
8. **You need visibility** into workflow execution and failures@[56]

### Use Simple Scripts/Cron When

1. **Single-step tasks** with no dependencies@[57]
2. **One-off scripts** that rarely run@[58]
3. **No retry logic** is needed@[59]
4. **No failure visibility** is required@[60]
5. **The job is under 5 lines of code** total@[61]

## Prefect vs. Alternatives

### Prefect vs. Airflow

| Dimension | Prefect | Airflow |
| --- | --- | --- |
| **Development Model** | Pure Python functions with decorators | DAG definitions with operators |
| **Dynamic Workflows** | Runtime task creation based on data | Static DAG structure at parse time |
| **Local Development** | Run locally without infrastructure | Requires full Airflow setup |
| **Learning Curve** | Minimal - just Python | Steep - framework concepts required |
| **Infrastructure** | Runs anywhere Python runs | Multi-component (scheduler, webserver, DB) |
| **Cost** | 60-70% lower (per customer reports)@[62] | Higher due to always-on infrastructure@[63] |
| **Best For** | ML/AI, modern data teams, dynamic pipelines | Traditional ETL, platform teams invested in ecosystem |

**Migration Path:** Prefect cites a 73.78% cost reduction for one customer's migration off Astronomer (managed Airflow)@[64]

### Prefect vs. Dagster

| Dimension | Prefect | Dagster |
| --- | --- | --- |
| **Philosophy** | Workflow orchestration | Data asset orchestration |
| **Abstractions** | Flows and tasks | Software-defined assets |
| **Use Case** | General workflow automation | Data asset lineage and cataloging |
| **Complexity** | Lower barrier to entry | Higher conceptual overhead |

### Prefect vs. Metaflow

| Dimension | Prefect | Metaflow |
| --- | --- | --- |
| **Origin** | General orchestration | Netflix ML workflows |
| **Scope** | Broad workflow automation | ML-specific pipelines |
| **Deployment** | Any infrastructure | AWS, K8s focus |
| **Community** | Larger ecosystem | ML-focused community |

## Decision Matrix

```text
Use Prefect when:
- You write Python workflows
- You need dynamic task generation
- You want local development + production parity
- You need retry/caching/scheduling out of the box
- You're building ML, data, or automation pipelines
- You want low operational overhead
- Cost efficiency matters (vs. Airflow)

Use Airflow when:
- You're heavily invested in the Airflow ecosystem
- Your team already knows Airflow
- You need specific Airflow operators not in Prefect
- You have dedicated platform engineering for Airflow

Use Dagster when:
- Data asset lineage is the primary concern
- You're building a data platform with an asset catalog
- You need software-defined assets

Use simple cron/scripts when:
- Single independent tasks
- No retry logic needed
- No monitoring required
- Runs once per day or less
```

@[65]

## Anti-Patterns and Gotchas

### Don't Use Prefect For

1. **Simple one-off scripts** - adds unnecessary overhead@[66]
2. **Real-time streaming** - designed for batch/scheduled workflows@[67]
3. **Sub-second latency requirements** - orchestration adds overhead@[68]
4. **Pure event processing** - use Kafka/RabbitMQ instead@[69]

### Common Pitfalls

1. **Over-decomposition:** Breaking every line into a task creates overhead@[70]
2. **Ignoring task inputs:** Tasks should be pure functions for caching@[71]
3. **Not using .submit():** Blocking task calls prevent parallelism - see the sketch after this list@[72]
4. **Skipping local testing:** Run flows locally before deploying@[73]

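To make pitfall 3 concrete, a minimal sketch contrasting blocking calls with `.submit()` (the `slow_step` task and timings are illustrative):

```python
import time

from prefect import flow, task


@task
def slow_step(n: int) -> int:
    time.sleep(1)
    return n


@flow
def sequential():
    # Each direct call blocks until the task run finishes: ~3s total
    return [slow_step(n) for n in range(3)]


@flow
def concurrent():
    # .submit() returns immediately, so the three runs overlap: ~1s total
    futures = [slow_step.submit(n) for n in range(3)]
    return [f.result() for f in futures]
```
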
## Learning Resources

- **Official Quickstart:** <https://docs.prefect.io/v3/get-started/quickstart>@[74]
- **Examples Repository:** <https://github.com/PrefectHQ/examples>@[75]
- **Community Recipes:** <https://github.com/PrefectHQ/prefect-recipes> (254 stars, archived)@[76]
- **Slack Community:** <https://prefect.io/slack>@[77]
- **YouTube Channel:** <https://www.youtube.com/c/PrefectIO/>@[78]

## Installation

```bash
# Using pip
pip install -U prefect

# Using uv (recommended)
uv add prefect

# With specific integrations
pip install prefect-aws prefect-gcp prefect-dbt
```

@[79]

## Verification Checklist

- [x] Official repository confirmed: <https://github.com/PrefectHQ/prefect>
- [x] PyPI package verified: prefect v3.4.24
- [x] Python compatibility: 3.9-3.13
- [x] License confirmed: Apache-2.0
- [x] Real-world examples: 5+ GitHub repositories with 100+ stars
- [x] Integration patterns documented: dbt, Snowflake, AWS, Docker
- [x] Decision matrix provided: vs Airflow, Dagster, Metaflow, cron
- [x] Anti-patterns identified: streaming, sub-second latency
- [x] Code examples: 6+ verified from official docs and Context7
- [x] Maintenance status: Active (1059 open issues, recent commits)

## References

Sources cited with @ notation throughout the document:

[1-79] Information gathered from:

- Context7 Library ID: /prefecthq/prefect (Trust Score: 8.2, 6247 code snippets)
- Official documentation: <https://docs.prefect.io>
- GitHub repository: <https://github.com/PrefectHQ/prefect>
- PyPI package page: <https://pypi.org/project/prefect/>
- Prefect vs Airflow comparison: <https://www.prefect.io/compare/airflow>
- Example repositories: anna-geller/prefect-dataplatform, rpeden/prefect-docker-compose, flavienbwk/prefect-docker-compose, anna-geller/dataflow-ops
- Exa code context search results
- Ref documentation search results

Last verified: 2025-10-21