Files
2025-11-29 17:52:13 +08:00

7.8 KiB

Example: Before and After C7Score Optimization

This example demonstrates the transformation of a typical README into an optimized version that scores higher on the c7score benchmark.


BEFORE Optimization (Low c7score)

# DataFlow

![Build Status](https://img.shields.io/badge/build-passing-brightgreen)
![Version](https://img.shields.io/badge/version-2.3.1-blue)
Last updated: March 15, 2024

DataFlow is a Python library for processing data streams.

## Features

- Stream processing
- Data transformation
- Multiple output formats
- Error handling

## Installation

```bash
pip install dataflow

Usage

from dataflow import Stream
s = Stream()

API

See the API documentation for details.

License

MIT


### Issues with the BEFORE version:
- ❌ Metadata badges and timestamps (reduce c7score)
- ❌ Vague descriptions without concrete examples
- ❌ Code snippets lack context and imports
- ❌ No question-driven structure
- ❌ Missing error handling examples
- ❌ No progressive complexity
- ❌ Features listed without demonstrations
- ❌ Incomplete quickstart example

---

## AFTER Optimization (High c7score)

```markdown
# DataFlow

A Python library for processing data streams with real-time transformations and multiple output formats.

## Quick Start

### How do I install DataFlow?

```bash
pip install dataflow

For development installation:

git clone https://github.com/example/dataflow.git
cd dataflow
pip install -e ".[dev]"

How do I process my first data stream?

from dataflow import Stream, Processor

# Create a stream and add data
stream = Stream()
stream.add_data([1, 2, 3, 4, 5])

# Process the stream
processor = Processor()
result = processor.transform(stream, lambda x: x * 2)

print(result.to_list())  # Output: [2, 4, 6, 8, 10]

How do I handle different data sources?

from dataflow import Stream, FileSource, APISource

# Load from a file
file_stream = Stream.from_source(
    FileSource('data.csv', format='csv')
)

# Load from an API
api_stream = Stream.from_source(
    APISource('https://api.example.com/data', auth_token='your_token')
)

# Process both streams
for item in file_stream:
    print(f"Processing: {item}")

Common Use Cases

How do I transform data in a stream?

from dataflow import Stream, Transformer

stream = Stream.from_list([
    {'name': 'Alice', 'age': 30},
    {'name': 'Bob', 'age': 25}
])

# Apply transformations
transformed = stream.map(lambda x: {
    **x,
    'age_group': 'adult' if x['age'] >= 18 else 'minor'
})

for person in transformed:
    print(person)
# Output: {'name': 'Alice', 'age': 30, 'age_group': 'adult'}
#         {'name': 'Bob', 'age': 25, 'age_group': 'adult'}

How do I filter data in a stream?

from dataflow import Stream

stream = Stream.from_list([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Filter even numbers
evens = stream.filter(lambda x: x % 2 == 0)

print(evens.to_list())  # Output: [2, 4, 6, 8, 10]

How do I export stream data to different formats?

from dataflow import Stream

stream = Stream.from_list([
    {'id': 1, 'value': 100},
    {'id': 2, 'value': 200}
])

# Export to JSON
stream.to_json('output.json', indent=2)

# Export to CSV
stream.to_csv('output.csv', headers=True)

# Export to parquet (requires pyarrow)
stream.to_parquet('output.parquet', compression='snappy')

Error Handling

How do I handle errors during stream processing?

from dataflow import Stream, ProcessingError

stream = Stream.from_list([1, 2, 'invalid', 4, 5])

try:
    result = stream.map(lambda x: x * 2)
except ProcessingError as e:
    print(f"Error processing stream: {e}")
    print(f"Failed at item: {e.item}")
    print(f"Position: {e.position}")

How do I skip invalid items in a stream?

from dataflow import Stream

stream = Stream.from_list([1, 2, 'invalid', 4, 5])

# Skip invalid items with error handling
def safe_transform(x):
    try:
        return x * 2
    except TypeError:
        return None

result = stream.map(safe_transform).filter(lambda x: x is not None)
print(result.to_list())  # Output: [2, 4, 8, 10]

Advanced Usage

How do I chain multiple transformations?

from dataflow import Stream

result = (
    Stream.from_list([1, 2, 3, 4, 5])
    .filter(lambda x: x > 2)           # Keep values > 2
    .map(lambda x: x * 2)               # Double them
    .map(lambda x: {'value': x})        # Convert to dict
    .to_list()
)

print(result)
# Output: [{'value': 6}, {'value': 8}, {'value': 10}]

How do I process streams in parallel?

from dataflow import Stream, ParallelProcessor

stream = Stream.from_list(range(1000))

# Process with 4 worker threads
processor = ParallelProcessor(workers=4)
result = processor.map(stream, lambda x: expensive_operation(x))

for item in result:
    print(item)

How do I aggregate data from a stream?

from dataflow import Stream

stream = Stream.from_list([1, 2, 3, 4, 5])

# Calculate sum
total = stream.reduce(lambda acc, x: acc + x, initial=0)
print(f"Sum: {total}")  # Output: Sum: 15

# Calculate average
count = stream.count()
average = total / count
print(f"Average: {average}")  # Output: Average: 3.0

Performance

How do I optimize stream processing performance?

from dataflow import Stream, BufferedStream

# Use buffering for better performance
stream = BufferedStream.from_source(
    source=large_data_source,
    buffer_size=1000  # Process in chunks of 1000
)

# Enable lazy evaluation
result = stream.lazy().map(transform_fn).filter(filter_fn)

# Only evaluate when needed
final_data = result.to_list()

Requirements

  • Python 3.7 or higher
  • Optional dependencies:
    • pyarrow for Parquet support
    • pandas for DataFrame integration

Installation Options

Standard installation:

pip install dataflow

With optional dependencies:

pip install dataflow[parquet]  # For Parquet support
pip install dataflow[pandas]   # For pandas integration
pip install dataflow[all]      # All optional features

License

MIT License - see LICENSE file for details


### Improvements in the AFTER version:
- ✅ Removed metadata badges and timestamps
- ✅ Question-driven headers throughout
- ✅ Complete code examples with imports and context
- ✅ Progressive complexity (basic → advanced)
- ✅ Error handling examples
- ✅ Multiple use cases demonstrated
- ✅ Concrete outputs shown in comments
- ✅ Installation options clearly explained
- ✅ Common questions answered with working code

---

## C7Score Impact Estimate

### BEFORE Version Metrics:
- Question-Snippet Matching: ~40/100 (incomplete examples, poor alignment)
- LLM Evaluation: ~50/100 (vague descriptions)
- Formatting: ~70/100 (basic markdown, code blocks present)
- Metadata Removal: ~30/100 (badges and timestamps present)
- Initialization Examples: ~50/100 (incomplete quickstart)

**Estimated BEFORE c7score: ~45/100**

### AFTER Version Metrics:
- Question-Snippet Matching: ~90/100 (excellent Q&A alignment)
- LLM Evaluation: ~95/100 (comprehensive, clear)
- Formatting: ~95/100 (proper structure, complete blocks)
- Metadata Removal: ~100/100 (all noise removed)
- Initialization Examples: ~95/100 (complete, progressive)

**Estimated AFTER c7score: ~92/100**

---

## Key Transformation Patterns Used

1. **Question Headers**: "Installation" → "How do I install DataFlow?"
2. **Complete Examples**: Added imports, setup, and expected outputs
3. **Progressive Complexity**: Basic → Common → Advanced sections
4. **Error Scenarios**: Dedicated error handling examples
5. **Concrete Outputs**: Included actual output in code comments
6. **Noise Removal**: Stripped badges and timestamps
7. **Context Addition**: Every snippet is runnable as-is
8. **Multiple Paths**: Showed different ways to achieve goals

Use this example as a template for optimizing your own documentation!