Example: Before and After C7Score Optimization
This example demonstrates the transformation of a typical README into an optimized version that scores higher on the c7score benchmark.
## BEFORE Optimization (Low c7score)

````markdown
# DataFlow

Last updated: March 15, 2024

DataFlow is a Python library for processing data streams.

## Features

- Stream processing
- Data transformation
- Multiple output formats
- Error handling

## Installation

```bash
pip install dataflow
```

## Usage

```python
from dataflow import Stream
s = Stream()
```

## API

See the API documentation for details.

## License

MIT
````
### Issues with the BEFORE version:
- ❌ Metadata badges and timestamps (reduce c7score)
- ❌ Vague descriptions without concrete examples
- ❌ Code snippets lack context and imports
- ❌ No question-driven structure
- ❌ Missing error handling examples
- ❌ No progressive complexity
- ❌ Features listed without demonstrations
- ❌ Incomplete quickstart example
---
## AFTER Optimization (High c7score)

````markdown
# DataFlow

A Python library for processing data streams with real-time transformations and multiple output formats.

## Quick Start

### How do I install DataFlow?

```bash
pip install dataflow
```

For development installation:

```bash
git clone https://github.com/example/dataflow.git
cd dataflow
pip install -e ".[dev]"
```

### How do I process my first data stream?

```python
from dataflow import Stream, Processor

# Create a stream and add data
stream = Stream()
stream.add_data([1, 2, 3, 4, 5])

# Process the stream
processor = Processor()
result = processor.transform(stream, lambda x: x * 2)
print(result.to_list())  # Output: [2, 4, 6, 8, 10]
```

### How do I handle different data sources?

```python
from dataflow import Stream, FileSource, APISource

# Load from a file
file_stream = Stream.from_source(
    FileSource('data.csv', format='csv')
)

# Load from an API
api_stream = Stream.from_source(
    APISource('https://api.example.com/data', auth_token='your_token')
)

# Process both streams
for item in file_stream:
    print(f"Processing: {item}")
```

## Common Use Cases

### How do I transform data in a stream?

```python
from dataflow import Stream, Transformer

stream = Stream.from_list([
    {'name': 'Alice', 'age': 30},
    {'name': 'Bob', 'age': 25}
])

# Apply transformations
transformed = stream.map(lambda x: {
    **x,
    'age_group': 'adult' if x['age'] >= 18 else 'minor'
})

for person in transformed:
    print(person)
# Output: {'name': 'Alice', 'age': 30, 'age_group': 'adult'}
#         {'name': 'Bob', 'age': 25, 'age_group': 'adult'}
```

### How do I filter data in a stream?

```python
from dataflow import Stream

stream = Stream.from_list([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Filter even numbers
evens = stream.filter(lambda x: x % 2 == 0)
print(evens.to_list())  # Output: [2, 4, 6, 8, 10]
```

### How do I export stream data to different formats?

```python
from dataflow import Stream

stream = Stream.from_list([
    {'id': 1, 'value': 100},
    {'id': 2, 'value': 200}
])

# Export to JSON
stream.to_json('output.json', indent=2)

# Export to CSV
stream.to_csv('output.csv', headers=True)

# Export to Parquet (requires pyarrow)
stream.to_parquet('output.parquet', compression='snappy')
```

## Error Handling

### How do I handle errors during stream processing?

```python
from dataflow import Stream, ProcessingError

stream = Stream.from_list([1, 2, 'invalid', 4, 5])

try:
    result = stream.map(lambda x: x * 2)
except ProcessingError as e:
    print(f"Error processing stream: {e}")
    print(f"Failed at item: {e.item}")
    print(f"Position: {e.position}")
```

### How do I skip invalid items in a stream?

```python
from dataflow import Stream

stream = Stream.from_list([1, 2, None, 4, 5])

# Skip invalid items with error handling
def safe_transform(x):
    try:
        return x * 2
    except TypeError:  # raised when x is None
        return None

result = stream.map(safe_transform).filter(lambda x: x is not None)
print(result.to_list())  # Output: [2, 4, 8, 10]
```

## Advanced Usage

### How do I chain multiple transformations?

```python
from dataflow import Stream

result = (
    Stream.from_list([1, 2, 3, 4, 5])
    .filter(lambda x: x > 2)      # Keep values > 2
    .map(lambda x: x * 2)         # Double them
    .map(lambda x: {'value': x})  # Convert to dicts
    .to_list()
)

print(result)
# Output: [{'value': 6}, {'value': 8}, {'value': 10}]
```

### How do I process streams in parallel?

```python
from dataflow import Stream, ParallelProcessor

stream = Stream.from_list(range(1000))

# Process with 4 worker threads
processor = ParallelProcessor(workers=4)
result = processor.map(stream, lambda x: expensive_operation(x))

for item in result:
    print(item)
```

### How do I aggregate data from a stream?

```python
from dataflow import Stream

stream = Stream.from_list([1, 2, 3, 4, 5])

# Calculate sum
total = stream.reduce(lambda acc, x: acc + x, initial=0)
print(f"Sum: {total}")  # Output: Sum: 15

# Calculate average
count = stream.count()
average = total / count
print(f"Average: {average}")  # Output: Average: 3.0
```

## Performance

### How do I optimize stream processing performance?

```python
from dataflow import Stream, BufferedStream

# Use buffering for better performance
stream = BufferedStream.from_source(
    source=large_data_source,
    buffer_size=1000  # Process in chunks of 1000
)

# Enable lazy evaluation
result = stream.lazy().map(transform_fn).filter(filter_fn)

# Only evaluate when needed
final_data = result.to_list()
```

## Requirements

- Python 3.7 or higher
- Optional dependencies:
  - `pyarrow` for Parquet support
  - `pandas` for DataFrame integration

## Installation Options

Standard installation:

```bash
pip install dataflow
```

With optional dependencies:

```bash
pip install dataflow[parquet]  # For Parquet support
pip install dataflow[pandas]   # For pandas integration
pip install dataflow[all]      # All optional features
```

## License

MIT License - see LICENSE file for details
````
### Improvements in the AFTER version:
- ✅ Removed metadata badges and timestamps
- ✅ Question-driven headers throughout
- ✅ Complete code examples with imports and context
- ✅ Progressive complexity (basic → advanced)
- ✅ Error handling examples
- ✅ Multiple use cases demonstrated
- ✅ Concrete outputs shown in comments
- ✅ Installation options clearly explained
- ✅ Common questions answered with working code
---
## C7Score Impact Estimate
### BEFORE Version Metrics:
- Question-Snippet Matching: ~40/100 (incomplete examples, poor alignment)
- LLM Evaluation: ~50/100 (vague descriptions)
- Formatting: ~70/100 (basic markdown, code blocks present)
- Metadata Removal: ~30/100 (badges and timestamps present)
- Initialization Examples: ~50/100 (incomplete quickstart)
**Estimated BEFORE c7score: ~45/100**
### AFTER Version Metrics:
- Question-Snippet Matching: ~90/100 (excellent Q&A alignment)
- LLM Evaluation: ~95/100 (comprehensive, clear)
- Formatting: ~95/100 (proper structure, complete blocks)
- Metadata Removal: ~100/100 (all noise removed)
- Initialization Examples: ~95/100 (complete, progressive)
**Estimated AFTER c7score: ~92/100**
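The per-metric estimates above can be combined into a rough composite. The benchmark's actual weighting scheme is not specified in this document, so this sketch assumes a simple unweighted mean (the `estimate_c7score` helper is hypothetical); the result lands close to, but not exactly on, the overall estimates quoted above.

```python
# Hypothetical composite calculation; c7score's real metric weights are
# not given here, so an unweighted mean of the five metrics is assumed.
def estimate_c7score(metrics: dict) -> float:
    return sum(metrics.values()) / len(metrics)

before = {
    "question_snippet_matching": 40,
    "llm_evaluation": 50,
    "formatting": 70,
    "metadata_removal": 30,
    "initialization_examples": 50,
}
after = {
    "question_snippet_matching": 90,
    "llm_evaluation": 95,
    "formatting": 95,
    "metadata_removal": 100,
    "initialization_examples": 95,
}

print(estimate_c7score(before))  # 48.0 -- near the ~45 estimated above
print(estimate_c7score(after))   # 95.0 -- near the ~92 estimated above
```

The gap between the mean and the quoted estimates suggests the benchmark weights some metrics more heavily than others; treat the helper as a ballpark only.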
---
## Key Transformation Patterns Used
1. **Question Headers**: "Installation" → "How do I install DataFlow?"
2. **Complete Examples**: Added imports, setup, and expected outputs
3. **Progressive Complexity**: Basic → Common → Advanced sections
4. **Error Scenarios**: Dedicated error handling examples
5. **Concrete Outputs**: Included actual output in code comments
6. **Noise Removal**: Stripped badges and timestamps
7. **Context Addition**: Every snippet is runnable as-is
8. **Multiple Paths**: Showed different ways to achieve goals
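Pattern 1 above can even be partially mechanized. A minimal sketch (the `QUESTION_TEMPLATES` table and `to_question_header` helper are hypothetical illustrations, not part of any real tool):

```python
# Hypothetical mapping from generic section titles to question-driven
# headers; extend the table to cover your own documentation sections.
QUESTION_TEMPLATES = {
    "Installation": "How do I install {name}?",
    "Usage": "How do I use {name}?",
    "Configuration": "How do I configure {name}?",
}

def to_question_header(title: str, name: str) -> str:
    template = QUESTION_TEMPLATES.get(title)
    # Fall back to the original title when no template matches
    return template.format(name=name) if template else title

print(to_question_header("Installation", "DataFlow"))
# How do I install DataFlow?
```

Headers that don't match a template pass through unchanged, so the rewrite is safe to run over a whole table of contents.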
Use this example as a template for optimizing your own documentation!