zhongwei/gh-tachyon-beep-skillpacks-plugins-yzmir-llm-specialist

Fork 0

Files

Zhongwei Li 725c187d17 Initial commit

2025-11-30 08:59:54 +08:00

20 KiB

Raw Blame History

Prompt Engineering Patterns

Context

You're writing prompts for an LLM and getting inconsistent or incorrect outputs. Common issues:

Vague instructions: Model guesses intent (inconsistent results)
No examples: Model infers task from description alone (ambiguous)
No output format: Model defaults to prose (unparsable)
No reasoning scaffolding: Model jumps to answer (errors in complex tasks)
System message misuse: Task instructions in system message (inflexible)

This skill provides effective prompt engineering patterns: specificity, few-shot examples, format specification, chain-of-thought, and proper message structure.

Core Principle: Be Specific

Vague prompts → Inconsistent outputs

Bad:

Analyze this review: "Product was okay."

Why bad:

"Analyze" is ambiguous (sentiment? quality? topics?)
No scale specified (1-5? positive/negative?)
No output format (text? JSON? number?)

Good:

Rate this review's sentiment on a scale of 1-5:
1 = Very negative
2 = Negative
3 = Neutral
4 = Positive
5 = Very positive

Review: "Product was okay."

Output ONLY the number (1-5):

Result: Consistent "3" every time

Specificity Checklist:

☐ Define the task clearly (classify, extract, generate, summarize) ☐ Specify the scale (1-5, 1-10, percentage, positive/negative/neutral) ☐ Define edge cases (null values, ambiguous inputs, relative dates) ☐ Specify output format (JSON, CSV, number only, yes/no) ☐ Set constraints (max length, required fields, allowed values)

Prompt Structure

Message Roles:

1. System Message:

system = """
You are an expert Python programmer with 10 years of experience.
You write clean, efficient, well-documented code.
You always follow PEP 8 style guidelines.
"""

Purpose:

Sets role/persona (expert, assistant, teacher)
Defines global behavior (concise, detailed, technical)
Applies to entire conversation

Best practices:

Keep it short (< 200 words)
Define WHO the model is, not WHAT to do
Set tone and constraints

2. User Message:

user = """
Write a Python function that calculates the Fibonacci sequence up to n terms.

Requirements:
- Use recursion with memoization
- Include docstring
- Handle edge cases (n <= 0)
- Return list of integers

Output only the code, no explanations.
"""

Purpose:

Specific task instructions (per-request)
Input data
Output format requirements

Best practices:

Be specific about requirements
Include examples if ambiguous
Specify output format explicitly

3. Assistant Message (in conversation):

messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": "Calculate 2+2"},
    {"role": "assistant", "content": "4"},
    {"role": "user", "content": "Now multiply that by 3"},
]

Purpose:

Conversation history
Shows model previous responses
Enables multi-turn conversations

Few-Shot Learning

Show, don't tell. Examples teach better than instructions.

0-Shot (No Examples):

Extract the person, company, and location from this text:

Text: "Tim Cook presented the new iPhone at Apple's Cupertino campus."

Issues:

Model guesses format (JSON? Key-value? List?)
Edge cases unclear (What if no person? Multiple companies?)

1-Shot (One Example):

Extract entities as JSON.

Example:
Text: "Satya Nadella spoke at Microsoft in Seattle."
Output: {"person": "Satya Nadella", "company": "Microsoft", "location": "Seattle"}

Now extract from:
Text: "Tim Cook presented the new iPhone at Apple's Cupertino campus."
Output:

Better! Model sees format and structure.

Few-Shot (3-5 Examples - BEST):

Extract entities as JSON.

Example 1:
Text: "Satya Nadella spoke at Microsoft in Seattle."
Output: {"person": "Satya Nadella", "company": "Microsoft", "location": "Seattle"}

Example 2:
Text: "Google announced Gemini in Mountain View."
Output: {"person": null, "company": "Google", "location": "Mountain View"}

Example 3:
Text: "The event took place online with no speakers."
Output: {"person": null, "company": null, "location": "online"}

Now extract from:
Text: "Tim Cook presented the new iPhone at Apple's Cupertino campus."
Output:

Why 3-5 examples?

1 example: Shows format
2-3 examples: Shows variation and edge cases
4-5 examples: Shows complex patterns
5 examples: Diminishing returns (uses more tokens)

Few-Shot Best Practices:

Cover edge cases:
- Null values (missing entities)
- Multiple values (list of people)
- Ambiguous cases (nickname vs full name)
Show desired format consistently:
- All examples use same structure
- Same field names
- Same data types
Order matters:
- Put most representative example first
- Put edge cases later
- Model learns from all examples
Balance examples:
- Show positive and negative cases
- Show simple and complex cases
- Avoid bias (don't show only easy examples)

Chain-of-Thought (CoT) Prompting

For reasoning tasks, request step-by-step thinking.

Without CoT (Direct):

Q: A farmer has 17 sheep. All but 9 die. How many sheep are left?
A:

Output: "8 sheep" (WRONG! Misread "all but 9")

With CoT:

Q: A farmer has 17 sheep. All but 9 die. How many sheep are left?

Think step-by-step:
1. Start with how many sheep
2. Understand what "all but 9 die" means
3. Calculate remaining sheep
4. State the answer

A:

Output:

1. The farmer starts with 17 sheep
2. "All but 9 die" means all sheep except 9 die
3. So 9 sheep remain alive
4. Answer: 9 sheep

Correct! CoT catches the trick.

When to Use CoT:

✅ Math word problems
✅ Logic puzzles
✅ Multi-step reasoning
✅ Complex decision-making
✅ Ambiguous questions

Not needed for:

❌ Simple classification (sentiment)
❌ Direct lookups (capital of France)
❌ Pattern matching (regex, entity extraction)

CoT Variants:

1. Explicit steps:

Solve step-by-step:
1. Identify what we know
2. Identify what we need to find
3. Set up the equation
4. Solve
5. Verify the answer

2. "Let's think step by step":

Q: [question]
A: Let's think step by step.

3. "Explain your reasoning":

Q: [question]
A: I'll explain my reasoning:

All three work! Pick what fits your use case.

Output Formatting

Specify format explicitly. Don't assume model knows what you want.

JSON Output:

Bad (no format specified):

Extract the name, age, and occupation from: "John is 30 years old and works as an engineer."

Output: "The person's name is John, who is 30 years old and works as an engineer."

Good (format specified):

Extract information as JSON:

Text: "John is 30 years old and works as an engineer."

Output in this format:
{
  "name": "<string>",
  "age": <number>,
  "occupation": "<string>"
}

JSON:

Output:

{
  "name": "John",
  "age": 30,
  "occupation": "engineer"
}

CSV Output:

Convert this data to CSV format with columns: name, age, city.

Data: John is 30 and lives in NYC. Mary is 25 and lives in LA.

CSV (with header):

Output:

name,age,city
John,30,NYC
Mary,25,LA

Structured Text:

Summarize this article in bullet points (max 5 points):

Article: [text]

Summary:
-

Output:

- Point 1
- Point 2
- Point 3
- Point 4
- Point 5

XML/HTML:

Format this data as HTML table:

Data: [data]

HTML:

Format Best Practices:

Show the schema:

{
  "field1": "<type>",
  "field2": <type>,
  ...
}

Specify data types: <string>, <number>, <boolean>, <array>
Show example output: Full example of expected output
Request validation: "Output valid JSON" or "Ensure CSV is parsable"

Temperature and Sampling

Temperature controls randomness. Adjust based on task.

Temperature = 0 (Deterministic):

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[...],
    temperature=0  # Deterministic, always same output
)

Use for:

✅ Classification (sentiment, category)
✅ Extraction (entities, data fields)
✅ Structured output (JSON, CSV)
✅ Factual queries (capital of X, date of Y)

Why: Need consistency and correctness, not creativity

Temperature = 0.7-1.0 (Creative):

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[...],
    temperature=0.8  # Creative, varied outputs
)

Use for:

✅ Creative writing (stories, poems)
✅ Brainstorming (ideas, alternatives)
✅ Conversational chat (natural dialogue)
✅ Content generation (marketing copy)

Why: Want variety and creativity, not determinism

Temperature = 1.5-2.0 (Very Random):

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[...],
    temperature=1.8  # Very random, surprising outputs
)

Use for:

✅ Experimental generation
✅ Highly creative tasks

Warning: May produce nonsensical outputs (use carefully)

Top-p (Nucleus Sampling):

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[...],
    temperature=0.7,
    top_p=0.9  # Consider top 90% probability mass
)

Alternative to temperature:

top_p = 1.0: Consider all tokens (default)
top_p = 0.9: Consider top 90% (filters low-probability tokens)
top_p = 0.5: Consider top 50% (more focused)

Best practice: Use temperature OR top_p, not both

Common Task Patterns

1. Classification:

Classify the sentiment of this review as 'positive', 'negative', or 'neutral'.
Output ONLY the label.

Review: "The product works great but shipping was slow."

Sentiment:

Key elements:

Clear categories ('positive', 'negative', 'neutral')
Output constraint ("ONLY the label")
Prompt ends with field name ("Sentiment:")

2. Extraction:

Extract all dates from this text. Output as JSON array.

Text: "Meeting on March 15, 2024. Follow-up on March 22."

Format:
["YYYY-MM-DD", "YYYY-MM-DD"]

Output:

Key elements:

Specific format (JSON array)
Date format specified (YYYY-MM-DD)
Shows example structure

3. Summarization:

Summarize this article in 50 words or less. Focus on the main conclusion and key findings.

Article: [long text]

Summary (max 50 words):

Key elements:

Length constraint (50 words)
Focus instruction (main conclusion, key findings)
Clear output label

4. Generation:

Write a product description for a wireless mouse with these features:
- Ergonomic design
- 1600 DPI sensor
- 6-month battery life
- Bluetooth 5.0

Style: Professional, concise (50-100 words)

Product Description:

Key elements:

Input data (features list)
Style guide (professional, concise)
Length constraint (50-100 words)

5. Transformation:

Convert this SQL query to Python (using pandas):

SQL:
SELECT name, age FROM users WHERE age > 30 ORDER BY age DESC

Python (pandas):

Key elements:

Clear source and target formats
Shows example input
Labels expected output

6. Question Answering:

Answer this question based ONLY on the provided context. If the answer is not in the context, say "I don't know."

Context: [document]

Question: What is the return policy?

Answer:

Key elements:

Constraint ("based ONLY on context")
Fallback instruction ("I don't know")
Prevents hallucination

Advanced Techniques

1. Self-Consistency:

Generate multiple outputs, take majority vote.

answers = []
for _ in range(5):
    response = llm.generate(prompt, temperature=0.7)
    answers.append(response)

# Take majority vote
final_answer = Counter(answers).most_common(1)[0][0]

Use for:

Complex reasoning (math, logic)
When single answer might be wrong
Accuracy > cost

Trade-off: 5× cost for 10-20% accuracy improvement

2. Tree-of-Thoughts:

Explore multiple reasoning paths, pick best.

Problem: [complex problem]

Let's consider 3 different approaches:

Approach 1: [reasoning path 1]
Approach 2: [reasoning path 2]
Approach 3: [reasoning path 3]

Which approach is best? Evaluate each:
[evaluation]

Best approach: [selection]

Now solve using the best approach:
[solution]

Use for:

Complex planning
Strategic decision-making
Multiple valid solutions

3. ReAct (Reasoning + Acting):

Interleave reasoning with actions (tool use).

Task: What's the weather in the city where the Eiffel Tower is located?

Thought: I need to find where the Eiffel Tower is located.
Action: Search "Eiffel Tower location"
Observation: The Eiffel Tower is in Paris, France.

Thought: Now I need the weather in Paris.
Action: Weather API call for Paris
Observation: 15°C, partly cloudy

Answer: It's 15°C and partly cloudy in Paris.

Use for:

Multi-step tasks with tool use
Search + reasoning
API interactions

4. Instruction Following:

Separate instructions from data.

Instructions:
- Extract all email addresses
- Validate format (user@domain.com)
- Remove duplicates
- Sort alphabetically

Data:
[text with emails]

Output (JSON array):

Best practice: Clearly separate "Instructions" from "Data"

Debugging Prompts

If output is wrong, diagnose systematically.

Problem 1: Inconsistent outputs

Diagnosis:

Instructions too vague?
No examples?
Temperature too high?

Fix:

Add specificity
Add 3-5 examples
Set temperature=0

Problem 2: Wrong format

Diagnosis:

Format not specified?
Example format missing?

Fix:

Specify format explicitly
Show example output structure
End prompt with format label ("JSON:", "CSV:")

Problem 3: Factual errors

Diagnosis:

Hallucination (model making up facts)?
No chain-of-thought?

Fix:

Add "based only on provided context"
Request "cite your sources"
Add "if unsure, say 'I don't know'"

Problem 4: Too verbose

Diagnosis:

No length constraint?
No "output only" instruction?

Fix:

Add word/character limit
Add "output ONLY the [X], no explanations"
Show concise examples

Problem 5: Misses edge cases

Diagnosis:

Edge cases not in examples?
Instructions don't cover edge cases?

Fix:

Add edge case examples (null, empty, ambiguous)
Explicitly mention edge case handling

Prompt Testing

Test prompts systematically before production.

1. Create test cases:

test_cases = [
    # Normal cases
    {"input": "...", "expected": "..."},
    {"input": "...", "expected": "..."},

    # Edge cases
    {"input": "", "expected": "null"},  # Empty input
    {"input": "...", "expected": "null"},  # Missing data

    # Ambiguous cases
    {"input": "...", "expected": "..."},
]

2. Run tests:

for case in test_cases:
    output = llm.generate(prompt.format(input=case["input"]))
    assert output == case["expected"], f"Failed on {case['input']}"

3. Measure metrics:

# Accuracy
correct = sum(1 for case in test_cases if output == case["expected"])
accuracy = correct / len(test_cases)

# Consistency (run same input 10 times)
outputs = [llm.generate(prompt) for _ in range(10)]
consistency = len(set(outputs)) == 1  # All outputs identical?

# Latency
import time
start = time.time()
output = llm.generate(prompt)
latency = time.time() - start

Prompt Optimization Workflow

Iterative improvement process:

Step 1: Baseline prompt (simple)

Classify sentiment: [text]

Step 2: Test and measure

accuracy = 65%  # Too low!
consistency = 40%  # Very inconsistent

Step 3: Add specificity

Classify sentiment as 'positive', 'negative', or 'neutral'.
Output ONLY the label.

Text: [text]
Sentiment:

Result: accuracy = 75%, consistency = 80%

Step 4: Add few-shot examples

Classify sentiment as 'positive', 'negative', or 'neutral'.

Examples:
[3 examples]

Text: [text]
Sentiment:

Result: accuracy = 88%, consistency = 95%

Step 5: Add edge case handling

[Include edge case examples in few-shot]

Result: accuracy = 92%, consistency = 98%

Step 6: Optimize for cost/latency

# Reduce examples from 5 to 3 (latency 400ms → 300ms)
# Accuracy still 92%

Final: accuracy = 92%, consistency = 98%, latency = 300ms

Prompt Libraries and Templates

Reusable templates for common tasks.

Template 1: Classification

Classify {item} as one of: {categories}.

{optional: 3-5 examples}

Output ONLY the category label.

{item}: {input}

Category:

Template 2: Extraction

Extract {fields} from the text. Output as JSON.

{optional: 3-5 examples showing format and edge cases}

Text: {input}

JSON:

Template 3: Summarization

Summarize this {content_type} in {length} words or less.
Focus on {aspects}.

{content_type}: {input}

Summary ({length} words max):

Template 4: Generation

Write {output_type} with these characteristics:
{characteristics}

Style: {style}
Length: {length}

{output_type}:

Template 5: Chain-of-Thought

{question}

Think step-by-step:
1. {step_1_prompt}
2. {step_2_prompt}
3. {step_3_prompt}

Answer:

Usage:

prompt = CLASSIFICATION_TEMPLATE.format(
    item="review",
    categories="'positive', 'negative', 'neutral'",
    input=review_text
)

Anti-Patterns

Anti-pattern 1: "The model is stupid"

Wrong: "The model doesn't understand. I need a better model."

Right: "My prompt is ambiguous. Let me add examples and specificity."

Principle: 90% of issues are prompt issues, not model issues.

Anti-pattern 2: "Just run it multiple times"

Wrong: "Run 10 times and take the average/majority."

Right: "Fix the prompt so it's consistent (temperature=0, specific instructions)."

Principle: Consistency should come from the prompt, not multiple runs.

Anti-pattern 3: "Parse the prose output"

Wrong: "I'll extract JSON from the prose with regex."

Right: "I'll request JSON output explicitly in the prompt."

Principle: Specify format in prompt, don't parse after the fact.

Anti-pattern 4: "System message for everything"

Wrong: Put task instructions in system message.

Right: System = role/behavior, User = task/instructions.

Principle: System message is global (all requests), user message is per-request.

Anti-pattern 5: "More tokens = better"

Wrong: "I'll write a 1000-word prompt with every detail."

Right: "I'll write a concise prompt with 3-5 examples."

Principle: Concise + examples > verbose instructions.

Summary

Core principles:

Be specific: Define scale, edge cases, constraints, output format
Use few-shot: 3-5 examples teach better than instructions
Specify format: JSON, CSV, structured text (explicit schema)
Request reasoning: Chain-of-thought for complex tasks
Correct message structure: System = role, User = task

Temperature:

0: Classification, extraction, structured output (deterministic)
0.7-1.0: Creative writing, brainstorming (varied)

Common patterns:

Classification: Specify categories, output constraint
Extraction: Format + examples + edge cases
Summarization: Length + focus areas
Generation: Features + style + length

Advanced:

Self-consistency: Multiple runs + majority vote
Tree-of-thoughts: Multiple reasoning paths
ReAct: Reasoning + action (tool use)

Debugging:

Inconsistent → Add specificity, examples, temperature=0
Wrong format → Specify format explicitly with examples
Factual errors → Add context constraints, chain-of-thought
Too verbose → Add length limits, "output only"

Key insight: Prompts are code. Treat them like code: test, iterate, optimize, version control.

20 KiB Raw Blame History Unescape Escape

Prompt Engineering Patterns

Context

Core Principle: Be Specific

Specificity Checklist:

Prompt Structure

Message Roles:

Few-Shot Learning

0-Shot (No Examples):

1-Shot (One Example):

Few-Shot (3-5 Examples - BEST):

Few-Shot Best Practices:

Chain-of-Thought (CoT) Prompting

Without CoT (Direct):

With CoT:

When to Use CoT:

CoT Variants:

Output Formatting

JSON Output:

CSV Output:

Structured Text:

XML/HTML:

Format Best Practices:

Temperature and Sampling

Temperature = 0 (Deterministic):

Temperature = 0.7-1.0 (Creative):

Temperature = 1.5-2.0 (Very Random):

Top-p (Nucleus Sampling):

Common Task Patterns

1. Classification:

2. Extraction:

3. Summarization:

4. Generation:

5. Transformation:

6. Question Answering:

Advanced Techniques

1. Self-Consistency:

2. Tree-of-Thoughts:

3. ReAct (Reasoning + Acting):

4. Instruction Following:

Debugging Prompts

Problem 1: Inconsistent outputs

Problem 2: Wrong format

Problem 3: Factual errors

Problem 4: Too verbose

Problem 5: Misses edge cases

Prompt Testing

1. Create test cases:

2. Run tests:

3. Measure metrics:

Prompt Optimization Workflow

Step 1: Baseline prompt (simple)

Step 2: Test and measure

Step 3: Add specificity

Step 4: Add few-shot examples

Step 5: Add edge case handling

Step 6: Optimize for cost/latency

Prompt Libraries and Templates

Template 1: Classification

Template 2: Extraction

Template 3: Summarization

Template 4: Generation

Template 5: Chain-of-Thought

Anti-Patterns

Anti-pattern 1: "The model is stupid"

Anti-pattern 2: "Just run it multiple times"

Anti-pattern 3: "Parse the prose output"

Anti-pattern 4: "System message for everything"

Anti-pattern 5: "More tokens = better"

Summary

20 KiB

Raw Blame History