Initial commit

2025-11-29 18:24:07 +08:00
commit 330645cc39
19 changed files with 4991 additions and 0 deletions
--- a/commands/system-design.md
+++ b/commands/system-design.md
@@ -0,0 +1,301 @@
+---
+model: claude-opus-4-1
+allowed-tools: Task, Read, Bash, Grep, Glob, Write
+argument-hint: <system-name> [--depth=standard|deep] [--focus=architecture|scalability|trade-offs] [--generate-diagram=true|false]
+description: Design complete systems with WHY, WHAT, HOW, CONSIDERATIONS, and DEEP-DIVE framework
+---
+
+# System Design Interview Coach
+
+Complete framework for designing systems from problem to implementation. Includes WHY/WHAT/HOW structure, trade-off analysis, mermaid diagrams, and deep-dive optimizations.
+
+## Interview Flow (60 minutes)
+
+### Phase 1: Requirements & Context (5 minutes)
+**Your goal**: Understand the problem deeply before designing
+
+Ask clarifying questions:
+- Scale: users, requests per second, data volume?
+- Availability: SLA requirements (99.9%, 99.99%)?
+- Latency: response time targets?
+- Consistency: strong or eventual?
+- Features: read-heavy, write-heavy, or balanced?
+- Growth: expected growth rate?
+
+**Interviewer's watching**:
+- Do you ask the right questions?
+- Do you understand the constraints?
+- Can you estimate numbers?
+
+### Phase 2: High-Level Architecture (10 minutes)
+**Your goal**: Outline the system at 30,000 feet
+
+Cover:
+- Major components (load balancer, services, databases, caches)
+- Communication patterns (sync/async, protocols)
+- Data flow from user request to response
+- Rough scalability approach
+
+Draw simple diagram showing component interactions.
+
+**Interviewer's watching**:
+- Do you think in systems?
+- Can you structure complexity?
+- Do you know when to keep it simple?
+
+### Phase 3: Detailed Component Design (20 minutes)
+**Your goal**: Explain key components with confidence
+
+Pick 2-3 components to discuss:
+- How does this component work?
+- Why this technology choice?
+- What are the constraints it handles?
+- How does it scale?
+
+**Interviewer's watching**:
+- Do you have technical depth?
+- Can you justify decisions?
+- Do you know trade-offs?
+
+### Phase 4: Scalability & Trade-Offs (15 minutes)
+**Your goal**: Show senior-level thinking
+
+Discuss:
+- Bottlenecks: What breaks first at 10x growth?
+- Consistency: Strong vs eventual? Why?
+- Reliability: Failure modes and recovery?
+- Cost: What drives operational expense?
+- Complexity: Is this operationally feasible?
+
+**Interviewer's watching**:
+- Do you think like a Staff engineer?
+- Can you make principled trade-offs?
+- Do you understand operational reality?
+
+### Phase 5: Extensions & Deep-Dives (8 minutes)
+**Your goal**: Demonstrate mastery
+
+Address follow-up questions:
+- "How would you handle [new requirement]?"
+- "What's the hardest part of operating this?"
+- "What would you optimize for [metric]?"
+- "How would you debug this in production?"
+
+**Interviewer's watching**:
+- Are you thinking ahead?
+- Can you handle surprises?
+- Do you know what you don't know?
+
+## System Design Framework
+
+### WHY: Problem & Context
+**What to cover**:
+- Problem statement (1-2 sentences)
+- Primary use cases (top 3-5)
+- User base and growth expectations
+- Non-functional requirements (scale, latency, availability)
+- Business context (why does this matter?)
+
+**For interviewer's benefit**:
+- Shows you understand the problem before solving it
+- Demonstrates customer empathy
+- Proves you can estimate and scope
+
+### WHAT: Components & Data Model
+**What to cover**:
+- Core entities (Users, Posts, Comments, etc.)
+- Entity relationships
+- Storage requirements (how much data?)
+- Major services (Authentication, Feed, Search, etc.)
+- API contracts (what endpoints do we need?)
+
+**For interviewer's benefit**:
+- Shows you think about data structure
+- Demonstrates you can decompose systems
+- Proves you understand component boundaries
+
+### HOW: Architecture & Patterns
+**What to cover**:
+- Request flow (from user → response)
+- Service architecture (monolith vs microservices decision)
+- Communication patterns (synchronous, asynchronous, pub-sub)
+- Storage topology (where does data live?)
+- Caching strategy (where, what, how long?)
+- Replication and failover
+
+**For interviewer's benefit**:
+- Shows you know architectural patterns
+- Demonstrates systems thinking
+- Proves you can make principled decisions
+
+### CONSIDERATIONS: Trade-Offs & Reality
+**What to analyze**:
+
+**Consistency**
+- Strong: Always get latest data (high latency, low availability)
+- Eventual: Might get stale data (low latency, high availability)
+- Your choice: "For [reason], we accept [consistency model]"
+
+**Scalability**
+- Vertical: Big machines (simpler, limited)
+- Horizontal: More machines (complex, unlimited)
+- Your choice: "We scale [direction] because [reason]"
+
+**Reliability**
+- Single point of failure? (bad)
+- Replication strategy? (multiple copies)
+- Disaster recovery? (backup and restore procedure)
+- Your choice: "We replicate [this way] to handle [failure]"
+
+**Cost**
+- Storage: What's the cost per GB?
+- Compute: What's the cost of this many servers?
+- Bandwidth: What's the egress cost?
+- Your choice: "This costs [X] but solves [Y]"
+
+**Operational Complexity**
+- How many different technologies?
+- How hard is debugging?
+- What's the on-call pain?
+- Your choice: "We keep it simple: [reason]"
+
+### DEEP-DIVE: Component Optimization
+**For each major component**, be prepared to discuss:
+
+1. **Bottleneck Analysis**
+   - What's the scaling limit?
+   - Where would we hit the wall first?
+   - How do we know?
+
+2. **Optimization Opportunities**
+   - What could we do to handle more load?
+   - What are the trade-offs?
+   - When is this optimization worth doing?
+
+3. **Failure Modes**
+   - What if [component] fails?
+   - How do we detect it?
+   - How do we recover?
+
+4. **Operational Concerns**
+   - How do we monitor this?
+   - What metrics matter?
+   - How do we debug issues?
+
+5. **Alternative Approaches**
+   - What's another way to design this?
+   - When would you choose it?
+   - What problems does it have?
+
+## Mermaid Diagram Strategy
+
+Create diagrams that show:
+1. **Architecture Diagram**: Components and communication
+2. **Data Flow**: Request path through the system
+3. **Database Schema**: Key entities and relationships
+
+**Tips**:
+- Keep diagrams simple initially
+- Add detail when asked
+- Label important decisions
+- Annotate bottlenecks
+
+## Example: Design Facebook Feed
+
+### WHY
+- **Problem**: Show users their friends' posts in a personalized, real-time feed
+- **Use Cases**:
+  1. User opens app → see recent posts from friends
+  2. Friend posts → appears in followers' feeds quickly
+  3. Massive scale: billions of posts, minutes of latency acceptable
+- **Requirements**:
+  - Read-heavy (100:1 read to write ratio)
+  - Latency: Feed load < 200ms
+  - Availability: 99.99%
+  - Consistency: Eventual OK (a few minutes lag acceptable)
+
+### WHAT
+- **Entities**: User, Post, Friendship, Like, Comment
+- **Relationships**: User → Post (1:many), User → Friend (many:many)
+- **Storage**: Posts: 100s of billions, User data: billions
+- **Services**: Auth, Post Creation, Feed Service, Search
+- **APIs**:
+  - POST /posts (create)
+  - GET /feed (get user's feed)
+  - POST /posts/{id}/like (like post)
+
+### HOW
+- Load balancers distribute requests
+- Stateless web servers handle auth and routing
+- Post service writes posts to database
+- Feed service reads from cache first, database second
+- Cache layer (Redis) stores hot posts
+- Fanout on write: When user posts, push to all followers' feeds
+- Asynchronous: Queue for fanout, workers process
+
+### CONSIDERATIONS
+- **Consistency**: Eventual consistency (a few second lag OK)
+- **Scalability**: Horizontal—more servers as needed
+- **Reliability**: Multi-region replication for availability
+- **Cost**: Balance storage vs computation
+- **Complexity**: Fanout-on-write is complex but enables fast reads
+
+### DEEP-DIVE
+1. **Fanout Bottleneck**: Celebrity posts with 100M followers?
+   - Solution: Hybrid fanout—fanout for normal users, cache for celebrities
+2. **Feed Personalization**: How do we rank posts?
+   - Solution: ML model, but start with recency + engagement
+3. **Real-time Updates**: How do we push new posts?
+   - Solution: Long-polling, WebSockets, or event stream
+
+## Talking Points During Interview
+
+**When introducing your design**:
+- "Let me outline the system at a high level..."
+- "The key insight here is [insight]"
+- "This design makes [requirement] easy"
+
+**When defending a choice**:
+- "We chose [option] because [constraint] → [option] is better"
+- "The trade-off is [cost] for [benefit]"
+- "This would change if [different constraint]"
+
+**When asked about scaling**:
+- "Currently [component] is the bottleneck"
+- "We'd scale [direction] because [reason]"
+- "This approach works until [limit], then we'd [next evolution]"
+
+**When asked about failure**:
+- "If [component] fails, [other component] takes over"
+- "We'd detect it via [monitoring], then [recovery action]"
+- "This is why we replicate [data/component]"
+
+## Red Flags to Avoid
+
+❌ Diving into implementation details too early
+❌ Not asking clarifying questions
+❌ Designing for scale you don't need
+❌ Making technology choices without justification
+❌ Ignoring operational reality
+❌ Treating consistency/availability as separate concerns
+❌ Not discussing trade-offs
+
+✓ Start broad, add detail on request
+✓ Ask clarifying questions upfront
+✓ Design for the specified scale
+✓ Justify technology choices
+✓ Consider how humans operate it
+✓ Explicitly discuss trade-offs
+✓ Show you understand what you don't know
+
+## Success Criteria
+
+You're ready when you can:
+- ✓ Clarify ambiguous requirements with good questions
+- ✓ Outline architecture clearly on a whiteboard
+- ✓ Explain each component's role
+- ✓ Justify your technology choices
+- ✓ Discuss trade-offs explicitly
+- ✓ Handle "what if" questions with confidence
+- ✓ Show understanding of operational reality
+- ✓ Demonstrate Staff+ systems thinking