---
model: claude-opus-4-1
allowed-tools: Task, Read, Bash, Grep, Glob, Write
argument-hint: <system-name> [--depth=standard|deep] [--focus=architecture|scalability|trade-offs] [--generate-diagram=true|false]
description: Design complete systems with WHY, WHAT, HOW, CONSIDERATIONS, and DEEP-DIVE framework
---

# System Design Interview Coach

A complete framework for taking a system design from problem statement to implementation. It covers the WHY/WHAT/HOW structure, trade-off analysis, Mermaid diagrams, and deep-dive optimizations.

## Interview Flow (60 minutes)

The five phases below add up to 58 minutes, which leaves a couple of minutes of buffer.

### Phase 1: Requirements & Context (5 minutes)
**Your goal**: Understand the problem deeply before designing

Ask clarifying questions:
- Scale: users, requests per second, data volume?
- Availability: SLA requirements (99.9%, 99.99%)?
- Latency: response time targets?
- Consistency: strong or eventual?
- Workload: read-heavy, write-heavy, or balanced?
- Growth: expected growth rate?

**Interviewer's watching**:
- Do you ask the right questions?
- Do you understand the constraints?
- Can you estimate numbers?

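The "can you estimate numbers" check usually comes down to a few lines of arithmetic. The sketch below is one way to turn the answers to the scale questions into rough throughput and storage figures. The `Workload` fields, the `estimate` helper, and the 2x peak factor are illustrative assumptions, not part of the framework.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Workload:
    """Answers to the Phase 1 scale questions (all fields are assumptions)."""
    daily_active_users: int
    requests_per_user_per_day: float
    read_write_ratio: float      # reads per write, e.g. 100 means 100:1
    bytes_per_write: int
    peak_factor: float = 2.0     # assumed peak-to-average multiplier


def estimate(w: Workload) -> dict:
    """Back-of-envelope throughput and storage numbers."""
    total_per_day = w.daily_active_users * w.requests_per_user_per_day
    avg_qps = total_per_day / SECONDS_PER_DAY
    # With R reads per write, writes are 1 / (R + 1) of all requests.
    writes_per_day = total_per_day / (w.read_write_ratio + 1)
    return {
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * w.peak_factor,
        "write_qps": writes_per_day / SECONDS_PER_DAY,
        "storage_gb_per_year": writes_per_day * w.bytes_per_write * 365 / 1e9,
    }
```

For example, 8.64 million daily users making 10 requests each works out to an average of 1,000 requests per second. Saying the formula out loud matters more than the exact inputs.
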
### Phase 2: High-Level Architecture (10 minutes)
**Your goal**: Outline the system at 30,000 feet

Cover:
- Major components (load balancer, services, databases, caches)
- Communication patterns (sync/async, protocols)
- Data flow from user request to response
- Rough scalability approach

Draw a simple diagram showing component interactions.

**Interviewer's watching**:
- Do you think in systems?
- Can you structure complexity?
- Do you know when to keep it simple?

### Phase 3: Detailed Component Design (20 minutes)
**Your goal**: Explain key components with confidence

Pick 2-3 components to discuss:
- How does this component work?
- Why this technology choice?
- What constraints does it handle?
- How does it scale?

**Interviewer's watching**:
- Do you have technical depth?
- Can you justify decisions?
- Do you know trade-offs?

### Phase 4: Scalability & Trade-Offs (15 minutes)
**Your goal**: Show senior-level thinking

Discuss:
- Bottlenecks: What breaks first at 10x growth?
- Consistency: Strong vs eventual? Why?
- Reliability: Failure modes and recovery?
- Cost: What drives operational expense?
- Complexity: Is this operationally feasible?

**Interviewer's watching**:
- Do you think like a Staff engineer?
- Can you make principled trade-offs?
- Do you understand operational reality?

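The "what breaks first at 10x" question can be answered with a simple utilization check. This is a minimal sketch, assuming you already have rough per-component load and capacity numbers. The function name and the component names in the usage note are hypothetical.

```python
def first_bottlenecks(load: dict, capacity: dict, growth: float = 10.0) -> list:
    """Return (component, utilization) pairs at or above 100% after growth.

    load and capacity use the same unit per component (e.g. requests/sec).
    Results are sorted worst first, so the first entry breaks first.
    """
    utilization = {
        name: load[name] * growth / capacity[name] for name in load
    }
    saturated = [(name, u) for name, u in utilization.items() if u >= 1.0]
    return sorted(saturated, key=lambda pair: pair[1], reverse=True)
```

With a load of 800 writes per second against a database that handles 5,000, 10x growth puts the database at 160% utilization, while a web tier at 2,000 of 50,000 stays at 40%. The database is the component to discuss first.
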
### Phase 5: Extensions & Deep-Dives (8 minutes)
**Your goal**: Demonstrate mastery

Address follow-up questions:
- "How would you handle [new requirement]?"
- "What's the hardest part of operating this?"
- "How would you optimize for [metric]?"
- "How would you debug this in production?"

**Interviewer's watching**:
- Are you thinking ahead?
- Can you handle surprises?
- Do you know what you don't know?

## System Design Framework

### WHY: Problem & Context
**What to cover**:
- Problem statement (1-2 sentences)
- Primary use cases (top 3-5)
- User base and growth expectations
- Non-functional requirements (scale, latency, availability)
- Business context (why does this matter?)

**For the interviewer's benefit**:
- Shows you understand the problem before solving it
- Demonstrates customer empathy
- Proves you can estimate and scope

### WHAT: Components & Data Model
**What to cover**:
- Core entities (Users, Posts, Comments, etc.)
- Entity relationships
- Storage requirements (how much data?)
- Major services (Authentication, Feed, Search, etc.)
- API contracts (what endpoints do we need?)

**For the interviewer's benefit**:
- Shows you think about data structure
- Demonstrates you can decompose systems
- Proves you understand component boundaries

### HOW: Architecture & Patterns
**What to cover**:
- Request flow (user request → response)
- Service architecture (monolith vs microservices decision)
- Communication patterns (synchronous, asynchronous, pub-sub)
- Storage topology (where does data live?)
- Caching strategy (where, what, how long?)
- Replication and failover

**For the interviewer's benefit**:
- Shows you know architectural patterns
- Demonstrates systems thinking
- Proves you can make principled decisions

### CONSIDERATIONS: Trade-Offs & Reality
**What to analyze**:

**Consistency**
- Strong: Reads always return the latest write (higher latency, lower availability during partitions)
- Eventual: Reads may return stale data for a while (lower latency, higher availability)
- Your choice: "For [reason], we accept [consistency model]"

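One common way to make the consistency choice concrete is quorum replication, where N replicas hold each item, a write waits for W acknowledgements, and a read consults R replicas. The framework does not prescribe this scheme. The sketch below is an assumption-laden illustration of the overlap rule (R + W > N), and `quorum_properties` is a hypothetical helper.

```python
def quorum_properties(n: int, w: int, r: int) -> dict:
    """Summarize what an (N, W, R) quorum configuration gives you.

    Simplified model: ignores sloppy quorums, clock skew, and
    concurrent-write conflict resolution.
    """
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("W and R must be between 1 and N")
    return {
        # Every read set overlaps every write set, so a read reaches at
        # least one replica that holds the latest acknowledged write.
        "read_sees_latest_write": r + w > n,
        # Replicas that can be down while writes/reads still succeed.
        "write_tolerates_failures": n - w,
        "read_tolerates_failures": n - r,
    }
```

With N=3, W=2, R=2 the read and write sets overlap and each operation survives one replica failure. With N=3, W=1, R=1 each operation survives two failures, but reads may be stale. That is the latency/availability trade-off from the list above in numeric form.
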
**Scalability**
- Vertical: Bigger machines (simpler, but capped by the largest machine available)
- Horizontal: More machines (more complex, but a much higher ceiling)
- Your choice: "We scale [direction] because [reason]"

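Part of the complexity of horizontal scaling is deciding which machine owns which key. Consistent hashing is one widely used answer, shown here as a minimal sketch. The `HashRing` class, the choice of SHA-256, and the 100 virtual nodes per machine are assumptions for illustration, not something the framework requires.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes: int = 100):
        # Each node is placed at `vnodes` positions to even out the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Return the node owning `key`: the next ring position clockwise."""
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The property worth mentioning in an interview: when a node is added, the only keys that change owner are the ones that move to the new node, so most cached or stored data stays where it is.
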
**Reliability**
- Single point of failure? (bad)
- Replication strategy? (multiple copies)
- Disaster recovery? (backup and restore procedure)
- Your choice: "We replicate [this way] to handle [failure]"

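Availability targets and replication can be tied together with a little arithmetic. The helpers below are a sketch under a strong assumption: failures are independent, which real outages often are not. The function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability: float) -> float:
    """Downtime budget implied by an availability target (0.999 = 99.9%)."""
    return (1 - availability) * MINUTES_PER_YEAR


def serial_availability(*components: float) -> float:
    """Availability of a chain where every component must be up."""
    result = 1.0
    for a in components:
        result *= a
    return result


def redundant_availability(single: float, replicas: int) -> float:
    """Availability when any one of `replicas` independent copies suffices."""
    return 1 - (1 - single) ** replicas
```

Under these assumptions, 99.9% allows about 526 minutes of downtime per year and 99.99% about 53. Two independent replicas at 99% each reach roughly 99.99%, while chaining two 99.9% components in series drops the total to about 99.8%.
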
**Cost**
- Storage: What's the cost per GB?
- Compute: What's the cost of this many servers?
- Bandwidth: What's the egress cost?
- Your choice: "This costs [X] but solves [Y]"

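The three cost questions combine into a simple monthly estimate. This is a rough sketch: `UnitPrices` and `monthly_cost` are hypothetical names, and any prices you plug in are placeholders to be replaced with real quotes from your provider.

```python
from dataclasses import dataclass


@dataclass
class UnitPrices:
    """Placeholder unit prices; supply real numbers from your provider."""
    storage_per_gb_month: float
    server_per_month: float
    egress_per_gb: float


def monthly_cost(prices: UnitPrices, stored_gb: float,
                 servers: int, egress_gb: float) -> dict:
    """Break a monthly bill into storage, compute, and bandwidth."""
    breakdown = {
        "storage": stored_gb * prices.storage_per_gb_month,
        "compute": servers * prices.server_per_month,
        "bandwidth": egress_gb * prices.egress_per_gb,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown
```

The breakdown is more useful than the total: it shows which line item dominates, which is the one worth optimizing and the one to name in "This costs [X] but solves [Y]".
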
**Operational Complexity**
- How many different technologies?
- How hard is debugging?
- What's the on-call pain?
- Your choice: "We keep it simple: [reason]"

### DEEP-DIVE: Component Optimization
**For each major component**, be prepared to discuss:

1. **Bottleneck Analysis**
   - What's the scaling limit?
   - Where would we hit the wall first?
   - How do we know?

2. **Optimization Opportunities**
   - What could we do to handle more load?
   - What are the trade-offs?
   - When is this optimization worth doing?

3. **Failure Modes**
   - What if [component] fails?
   - How do we detect it?
   - How do we recover?

4. **Operational Concerns**
   - How do we monitor this?
   - What metrics matter?
   - How do we debug issues?

5. **Alternative Approaches**
   - What's another way to design this?
   - When would you choose it?
   - What problems does it have?

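For the "how do we detect it?" question under Failure Modes, a heartbeat with a timeout is the simplest mechanism to describe. The sketch below is illustrative only: `HeartbeatMonitor` is a hypothetical class, and timestamps are passed in explicitly so the logic is easy to reason about.

```python
class HeartbeatMonitor:
    """Flags nodes whose last heartbeat is older than a timeout."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node name -> timestamp of last heartbeat

    def beat(self, node: str, now: float) -> None:
        """Record a heartbeat from `node` at time `now` (seconds)."""
        self.last_seen[node] = now

    def suspected_down(self, now: float) -> list:
        """Return nodes that have been silent for longer than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

The timeout is the trade-off to call out: a short one detects failures quickly but raises false alarms during brief network hiccups, while a long one delays recovery.
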
## Mermaid Diagram Strategy

Create diagrams that show:
1. **Architecture Diagram**: Components and communication
2. **Data Flow**: Request path through the system
3. **Database Schema**: Key entities and relationships

**Tips**:
- Keep diagrams simple initially
- Add detail when asked
- Label important decisions
- Annotate bottlenecks

## Example: Design Facebook Feed

### WHY
- **Problem**: Show users their friends' posts in a personalized, real-time feed
- **Use Cases**:
  1. User opens app → see recent posts from friends
  2. Friend posts → appears in followers' feeds quickly
  3. Massive scale: billions of posts; a few minutes of propagation delay is acceptable
- **Requirements**:
  - Read-heavy (100:1 read-to-write ratio)
  - Latency: Feed load < 200ms
  - Availability: 99.99%
  - Consistency: Eventual OK (a few minutes of lag is acceptable)

### WHAT
- **Entities**: User, Post, Friendship, Like, Comment
- **Relationships**: User → Post (1:many), User ↔ User via Friendship (many:many)
- **Storage**: Hundreds of billions of posts, billions of user records
- **Services**: Auth, Post Creation, Feed Service, Search
- **APIs**:
  - POST /posts (create)
  - GET /feed (get user's feed)
  - POST /posts/{id}/like (like post)

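The entities above can be written down as plain record types to make the relationships explicit. This is a minimal sketch. Every field name here is an assumption chosen for illustration, since the example only names the entities.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass(frozen=True)
class User:
    id: int
    name: str


@dataclass(frozen=True)
class Post:
    id: int
    author_id: int            # User -> Post is 1:many
    body: str
    created_at: datetime = field(default_factory=_now)


@dataclass(frozen=True)
class Friendship:
    user_a: int               # many:many between users, stored as a pair
    user_b: int

    @classmethod
    def between(cls, x: int, y: int) -> "Friendship":
        """Normalize ordering so (x, y) and (y, x) are the same friendship."""
        return cls(min(x, y), max(x, y))


@dataclass(frozen=True)
class Like:
    user_id: int
    post_id: int


@dataclass(frozen=True)
class Comment:
    id: int
    post_id: int
    author_id: int
    body: str
```

Storing a friendship as a normalized pair means one row per relationship. A follower model would use a directed (follower, followee) pair instead, which is worth clarifying with the interviewer.
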
### HOW
- Load balancers distribute requests
- Stateless web servers handle auth and routing
- Post service writes posts to database
- Feed service reads from cache first and falls back to the database on a miss
- Cache layer (Redis) stores hot posts
- Fanout on write: When user posts, push to all followers' feeds
- Asynchronous: Fanout jobs go onto a queue and workers process them

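The write and read paths described above can be sketched in memory. This is a toy model under stated assumptions: a dict stands in for the database, per-user deques stand in for the Redis feed cache, and a deque stands in for the fanout queue. `FeedSystem` and its methods are hypothetical names.

```python
from collections import defaultdict, deque


class FeedSystem:
    """In-memory sketch of fanout on write with a cache-first read path."""

    def __init__(self, followers: dict, feed_limit: int = 100):
        self.followers = followers      # author_id -> set of follower ids
        self.posts_db = {}              # post_id -> (author_id, body)
        # Per-user feed of post ids, newest first, capped at feed_limit.
        self.feed_cache = defaultdict(lambda: deque(maxlen=feed_limit))
        self.fanout_queue = deque()     # post ids waiting for fanout
        self._next_id = 1

    def create_post(self, author_id: int, body: str) -> int:
        """Write path: persist the post, then enqueue fanout and return."""
        post_id = self._next_id
        self._next_id += 1
        self.posts_db[post_id] = (author_id, body)
        self.fanout_queue.append(post_id)
        return post_id

    def run_fanout_worker(self) -> int:
        """Worker: push each queued post id into every follower's feed."""
        processed = 0
        while self.fanout_queue:
            post_id = self.fanout_queue.popleft()
            author_id, _ = self.posts_db[post_id]
            for follower in self.followers.get(author_id, ()):
                self.feed_cache[follower].appendleft(post_id)
            processed += 1
        return processed

    def get_feed(self, user_id: int, limit: int = 20) -> list:
        """Read path: cache first, database scan on a miss."""
        if user_id in self.feed_cache:
            ids = list(self.feed_cache[user_id])[:limit]
        else:
            following = {a for a, f in self.followers.items() if user_id in f}
            ids = sorted(
                (pid for pid, (a, _) in self.posts_db.items() if a in following),
                reverse=True,
            )[:limit]
        return [self.posts_db[pid][1] for pid in ids]
```

The point to make in the interview is that `create_post` returns before fanout happens, so the write stays fast and followers see the post once a worker drains the queue. That delay is the eventual consistency the requirements allow.
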
### CONSIDERATIONS
- **Consistency**: Eventual consistency (a few minutes of lag is OK, per the requirements)
- **Scalability**: Horizontal: add servers as needed
- **Reliability**: Multi-region replication for availability
- **Cost**: Fanout on write trades extra storage (a precomputed feed per user) for less computation at read time
- **Complexity**: Fanout-on-write is complex but enables fast reads

### DEEP-DIVE
1. **Fanout Bottleneck**: Celebrity posts with 100M followers?
   - Solution: Hybrid fanout. Fan out on write for normal users; for celebrities, skip the fanout and pull their recent posts at read time, then merge
2. **Feed Personalization**: How do we rank posts?
   - Solution: ML model, but start with recency + engagement
3. **Real-time Updates**: How do we push new posts?
   - Solution: Long-polling, WebSockets, or event stream

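The hybrid fanout and the "recency + engagement" starting point for ranking can be sketched together. Everything specific here is an assumption for illustration: the 10,000-follower cutoff, the 6-hour half-life, the scoring formula, and the dict-shaped posts with `id`, `created_at` (seconds), `likes`, and `comments` keys.

```python
import math

CELEBRITY_THRESHOLD = 10_000  # assumed follower cutoff for push vs pull


def should_push(follower_count: int,
                threshold: int = CELEBRITY_THRESHOLD) -> bool:
    """Fan out on write only for authors below the celebrity threshold."""
    return follower_count < threshold


def score(post: dict, now: float, half_life_hours: float = 6.0) -> float:
    """Recency decays exponentially; engagement adds a log-scaled boost."""
    age_hours = (now - post["created_at"]) / 3600
    recency = 0.5 ** (age_hours / half_life_hours)
    engagement = math.log1p(post["likes"] + post["comments"])
    return recency * (1 + engagement)


def read_feed(pushed: list, celebrity_posts: list,
              now: float, limit: int = 20) -> list:
    """Merge the precomputed feed with celebrity posts pulled at read time."""
    merged = {p["id"]: p for p in (*pushed, *celebrity_posts)}  # dedupe by id
    ranked = sorted(merged.values(), key=lambda p: score(p, now), reverse=True)
    return ranked[:limit]
```

The read path does a little more work for users who follow celebrities, in exchange for avoiding a 100M-entry fanout on every celebrity post. The scoring function is the piece an ML model would later replace.
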
## Talking Points During Interview

**When introducing your design**:
- "Let me outline the system at a high level..."
- "The key insight here is [insight]"
- "This design makes [requirement] easy"

**When defending a choice**:
- "We chose [option] because, given [constraint], it is the better fit"
- "The trade-off is [cost] for [benefit]"
- "This would change if [different constraint]"

**When asked about scaling**:
- "Currently [component] is the bottleneck"
- "We'd scale [direction] because [reason]"
- "This approach works until [limit], then we'd [next evolution]"

**When asked about failure**:
- "If [component] fails, [other component] takes over"
- "We'd detect it via [monitoring], then [recovery action]"
- "This is why we replicate [data/component]"

## Red Flags to Avoid

- ❌ Diving into implementation details too early
- ❌ Not asking clarifying questions
- ❌ Designing for scale you don't need
- ❌ Making technology choices without justification
- ❌ Ignoring operational reality
- ❌ Treating consistency/availability as separate concerns
- ❌ Not discussing trade-offs

- ✓ Start broad, add detail on request
- ✓ Ask clarifying questions upfront
- ✓ Design for the specified scale
- ✓ Justify technology choices
- ✓ Consider how humans operate it
- ✓ Explicitly discuss trade-offs
- ✓ Show you understand what you don't know

## Success Criteria

You're ready when you can:
- ✓ Clarify ambiguous requirements with good questions
- ✓ Outline architecture clearly on a whiteboard
- ✓ Explain each component's role
- ✓ Justify your technology choices
- ✓ Discuss trade-offs explicitly
- ✓ Handle "what if" questions with confidence
- ✓ Show understanding of operational reality
- ✓ Demonstrate Staff+ systems thinking