Files
gh-ryanbreen-breenix-breeni…/skills/breenix-memory-debugging/SKILL.md
2025-11-30 08:53:27 +08:00

551 lines
14 KiB
Markdown

---
name: memory-debugging
description: This skill should be used when debugging memory-related issues in the Breenix kernel including page faults, double faults, frame allocation problems, page table issues, heap allocation failures, stack overflows, and virtual memory mapping errors.
---
# Memory Debugging for Breenix
Debug kernel memory issues including page faults, allocator problems, and page table errors.
## Purpose
Memory bugs in kernel development are among the most difficult to debug. This skill provides systematic approaches for diagnosing page faults, double faults, allocator issues, and page table problems specific to Breenix.
## When to Use
- **Page faults**: Accessing unmapped or incorrectly mapped memory
- **Double faults**: Stack issues or cascading exceptions
- **Frame allocation failures**: Out of memory or allocator bugs
- **Page table problems**: Wrong mappings, missing entries, incorrect flags
- **Heap allocation issues**: OOM, corruption, leaks
- **Stack overflows**: Exceeding stack size, missing guard pages
- **Virtual address conflicts**: Multiple processes mapping same address
## Memory Subsystems in Breenix
### 1. Physical Memory (Frame Allocator)
**Location**: `kernel/src/memory/frame_allocator.rs`
**What it does**: Manages physical memory frames (4KB pages)
**Common issues**:
- Running out of frames
- Double allocation of same frame
- Frames not freed properly
- Initialization failures
**Debug approach**:
```rust
// Add logging to allocation/deallocation
log::debug!("Allocating frame at {:?}", frame);
log::debug!("Frame allocator: {} frames used", count);
// Check allocator state
log::info!("Physical memory: {} MiB usable", memory_mb);
```
### 2. Virtual Memory (Page Tables)
**Location**: `kernel/src/memory/` (process_memory.rs, kernel_page_table.rs)
**What it does**: Maps virtual addresses to physical frames
**Common issues**:
- Missing page table entries
- Wrong flags (PRESENT, WRITABLE, USER_ACCESSIBLE)
- Shared page table entries causing conflicts
- Kernel mappings not copied to process page tables
**Debug approach**:
```rust
// Log page table operations
log::debug!("Mapping page {:?} to frame {:?} with flags {:?}",
page, frame, flags);
// Verify mappings
let result = page_table.translate_addr(addr);
log::debug!("Address {:?} translates to {:?}", addr, result);
```
### 3. Heap Allocator
**Location**: Uses Rust's `#[global_allocator]`
**Size**: 1024 KiB
**Common issues**:
- Out of heap memory
- Heap corruption
- Allocations during early boot (before heap init)
**Debug approach**:
```rust
// Check heap size
log::info!("Heap: 1024 KiB");
// Log allocations if needed
// (Note: Can't use allocations in alloc functions!)
```
### 4. Kernel Stacks
**Location**: `kernel/src/memory/kernel_stack.rs`
**Layout**: 8KB stacks with 4KB guard pages at `0xffffc900_0000_0000`
**Common issues**:
- Stack overflow into guard page
- Kernel stack not mapped in process page table
- IST stack issues for double faults
**Debug approach**:
```rust
// Log stack allocation
log::debug!("Allocated kernel stack {} at {:?}", id, addr);
// Check stack bounds
log::debug!("Stack bottom: {:?}, top: {:?}", bottom, top);
// Verify stack is mapped
```
## Common Memory Errors
### Error 1: Page Fault
**Symptoms**:
```
PAGE FAULT at 0x... Error Code: 0x...
```
**Error Code Decoding**:
```
Bit 0 (P): 0 = Page not present
1 = Protection violation
Bit 1 (W): 0 = Read access
1 = Write access
Bit 2 (U): 0 = Kernel mode
1 = User mode
Bit 3 (R): 1 = Reserved bit set
Bit 4 (I): 1 = Instruction fetch
```
**Common Causes**:
**1. Accessing unmapped memory**
```rust
// Problem: Address not mapped in page table
let ptr = 0x12345000 as *const u64;
unsafe { *ptr } // PAGE FAULT - not mapped
```
**Diagnosis**:
- Check if address should be mapped
- Verify page table has entry for this address
- Confirm physical frame was allocated
**Fix**: Map the page before accessing
**2. Writing to read-only page**
```rust
// Problem: Page mapped without WRITABLE flag
let ptr = read_only_page as *mut u64;
unsafe { *ptr = 42; } // PAGE FAULT - write to read-only
```
**Diagnosis**:
- Check page table flags
- Verify WRITABLE flag is set
- Confirm not writing to kernel code/data
**Fix**: Add WRITABLE flag or don't write to read-only pages
**3. User accessing kernel page**
```rust
// Problem: Userspace trying to access kernel memory
// (from Ring 3)
let ptr = 0xFFFF_8000_0000_0000 as *const u64; // Kernel address
unsafe { *ptr } // PAGE FAULT - user accessing kernel
```
**Diagnosis**:
- Check if address is in kernel space (upper half)
- Verify page doesn't have USER_ACCESSIBLE flag
- Confirm userspace should not access this
**Fix**: Don't allow userspace to access kernel memory
**4. Accessing kernel stack not mapped in process page table**
```rust
// Problem: Kernel stack mapped in kernel PT but not process PT
// Ring 3 -> Ring 0 transition tries to use unmapped kernel stack
// This was the DIRECT_EXECUTION_FIX issue!
```
**Diagnosis**:
- Check if kernel stack is mapped in process page table
- Verify TSS RSP0 points to valid kernel stack
- Look for double fault during syscalls (int 0x80)
**Fix**: Copy kernel stack mappings to process page table
### Error 2: Double Fault
**Symptoms**:
```
DOUBLE FAULT - Error Code: 0x...
Instruction Pointer: 0x...
Stack Pointer: 0x...
```
**What it means**: Exception occurred while handling another exception
**Common Causes**:
**1. Kernel stack not mapped during exception**
```
Sequence:
1. Exception occurs (page fault, etc.)
2. CPU tries to switch to kernel stack
3. Kernel stack not mapped in current page table
4. Page fault accessing kernel stack
5. DOUBLE FAULT
```
**Diagnosis**:
- Check which exception triggered the double fault
- Verify kernel stack is mapped
- Check TSS RSP0 value
- Look at instruction pointer (where was CPU when it faulted?)
**Fix**: Ensure kernel stack mapped in all page tables
**2. Stack overflow**
```
Sequence:
1. Recursive function or large stack allocation
2. Stack exceeds allocated size
3. Writes into guard page
4. Page fault (guard page not mapped)
5. Page fault handler needs stack
6. DOUBLE FAULT
```
**Diagnosis**:
- Check stack pointer value
- Compare against stack bounds
- Look for recursive calls
- Check for large stack allocations
**Fix**: Increase stack size or fix code causing overflow
**3. Exception handler itself faults**
```
Sequence:
1. Exception occurs
2. Handler tries to access unmapped memory
3. Page fault inside handler
4. DOUBLE FAULT
```
**Diagnosis**:
- Review exception handler code
- Check what handler was executing
- Verify handler doesn't access invalid addresses
**Fix**: Fix bug in exception handler
### Error 3: Page Already Mapped
**Symptoms**:
```
Error: Attempted to map already-mapped page
```
**Common Causes**:
**1. Shared page table levels**
```rust
// Problem: Multiple processes share L3 table
// Second process tries to map page in shared table
// This was the PAGE_TABLE_FIX issue!
```
**Diagnosis**:
- Check if page table levels are shared between processes
- Verify each process has independent L3/L2/L1 tables
- Look at PML4 entry copying code
**Fix**: Deep copy page table levels, don't share
**2. Mapping same address twice**
```rust
// Problem: Code tries to map a page that's already mapped
page_table.map_to(page, frame, flags, allocator)?;
page_table.map_to(page, frame, flags, allocator)?; // Error!
```
**Diagnosis**:
- Check if page is already mapped before mapping
- Look for duplicate mapping calls
- Verify cleanup properly unmaps pages
**Fix**: Check before mapping or unmap first
### Error 4: Out of Memory
**Symptoms**:
```
Error: Frame allocator out of memory
```
**Common Causes**:
**1. Too many allocations**
```rust
// Problem: Allocating too many frames
loop {
allocator.allocate_frame(); // Eventually runs out
}
```
**Diagnosis**:
- Log total memory available
- Count allocations vs deallocations
- Check for memory leaks
**Fix**: Free frames when done, or increase memory
**2. Memory leaks**
```rust
// Problem: Frames allocated but never freed
let frame = allocator.allocate_frame()?;
// ... use frame ...
// Forget to deallocate - LEAK!
```
**Diagnosis**:
- Track allocation/deallocation counts
- Look for allocations without corresponding frees
- Use systematic allocation patterns
**Fix**: Properly free all allocated frames
## Debugging Techniques
### Technique 1: Add Checkpoint Logging
Add logging at critical memory operations:
```rust
log::debug!("CHECKPOINT: Before page table operation");
page_table.map_to(page, frame, flags, allocator)?;
log::debug!("CHECKPOINT: After page table operation");
log::debug!("CHECKPOINT: Before memory access");
unsafe { *(addr as *const u64) };
log::debug!("CHECKPOINT: After memory access");
```
If crash happens between checkpoints, you know where to focus.
### Technique 2: Verify Assumptions
Check assumptions about memory state:
```rust
// Verify address is mapped before accessing
match page_table.translate_addr(addr) {
Some(phys) => log::debug!("Address {:?} mapped to {:?}", addr, phys),
None => log::warn!("Address {:?} NOT MAPPED", addr),
}
// Verify frame was allocated
log::debug!("Allocated frame: {:?}", frame);
assert!(frame.start_address().as_u64() > 0);
// Verify flags are correct
log::debug!("Page mapped with flags: {:?}", flags);
assert!(flags.contains(PageTableFlags::PRESENT));
```
### Technique 3: Dump Page Table State
Add functions to dump page table state:
```rust
pub fn dump_page_table_entry(page_table: &PageTable, addr: VirtAddr) {
let result = page_table.translate_addr(addr);
log::debug!("Address: {:?}", addr);
log::debug!(" Translation: {:?}", result);
// Walk page table levels
// Log each level's entry
}
```
### Technique 4: Use kernel-debug-loop
Fast iteration for memory issues:
```bash
# Test fix quickly
kernel-debug-loop/scripts/quick_debug.py \
--signal "MEMORY OPERATION COMPLETE" \
--timeout 10
```
### Technique 5: Compare Working vs Broken State
If something used to work:
```bash
# Run working version
git checkout working_commit
kernel-debug-loop/scripts/quick_debug.py ... > working.log
# Run broken version
git checkout broken_commit
kernel-debug-loop/scripts/quick_debug.py ... > broken.log
# Compare
diff -u working.log broken.log
```
## Memory Issue Patterns
### Pattern: Syscall Page Fault
**Scenario**: Page fault when userspace calls `int 0x80`
**Diagnosis**:
1. Kernel stack not mapped in process page table
2. Ring 3 → Ring 0 transition fails
**Fix**: `copy_kernel_stack_to_process()`
**Reference**: DIRECT_EXECUTION_FIX.md
### Pattern: Context Switch Double Fault
**Scenario**: Double fault during context switch between processes
**Diagnosis**:
1. Wrong page table activated
2. Current stack not mapped in new page table
**Fix**: Ensure kernel stacks globally mapped
**Reference**: Global kernel page table architecture
### Pattern: Process Creation "Already Mapped"
**Scenario**: Second process creation fails with "page already mapped"
**Diagnosis**:
1. Processes sharing page table levels
2. First process's mappings conflict with second
**Fix**: Deep copy page table levels
**Reference**: PAGE_TABLE_FIX.md
### Pattern: Heap Allocation Panic
**Scenario**: Panic during heap allocation
**Diagnosis**:
1. Out of heap memory
2. Heap not initialized yet
3. Heap corruption
**Fix**: Increase heap size, defer allocation, or fix corruption
## Integration with Other Skills
### With kernel-debug-loop
```bash
# Fast iteration on memory fixes
kernel-debug-loop/scripts/quick_debug.py \
--signal "MEMORY TEST COMPLETE" \
--timeout 15
```
### With systematic-debugging
Document complex memory bugs:
```markdown
# Problem
Page fault at 0x10001082 during sys_write
# Root Cause
User buffer not mapped in process page table
# Solution
Verify user addresses before access
# Evidence
Before: PAGE FAULT
After: Successful syscall
```
### With log-analysis
```bash
# Find memory-related errors
echo '"PAGE FAULT"' > /tmp/log-query.txt
./scripts/find-in-logs
# Find allocation patterns
echo '"Allocated\|Deallocated"' > /tmp/log-query.txt
./scripts/find-in-logs
```
## Best Practices
1. **Log extensively**: Memory operations should be well-logged
2. **Verify before access**: Check addresses are mapped
3. **Use checkpoints**: Narrow down failure location
4. **Test incrementally**: Add small changes, test frequently
5. **Understand architecture**: Know how page tables work
6. **Reference working code**: Look at similar working code
7. **Document patterns**: Save solutions for future reference
## Quick Reference
### Page Fault Error Codes
```
0x0: Read from unmapped page (kernel mode)
0x1: Read from non-present page (kernel mode)
0x2: Write to unmapped page (kernel mode)
0x3: Write to non-present page (kernel mode)
0x4: Read from unmapped page (user mode)
0x6: Write to unmapped page (user mode)
```
### Memory Regions
```
0x0000_0000_0000_0000 - 0x0000_7FFF_FFFF_FFFF: User space
0xFFFF_8000_0000_0000 - 0xFFFF_FFFF_FFFF_FFFF: Kernel space
0xFFFF_C900_0000_0000: Kernel stacks
```
### Key Files
```
kernel/src/memory/frame_allocator.rs - Physical memory
kernel/src/memory/process_memory.rs - Process page tables
kernel/src/memory/kernel_page_table.rs - Kernel mappings
kernel/src/memory/kernel_stack.rs - Kernel stack allocator
```
## Summary
Memory debugging requires:
- Understanding memory subsystems (frame allocator, page tables, heap, stacks)
- Systematic diagnosis of page faults and double faults
- Checkpoint logging to isolate failures
- Verification of page table state and mappings
- Reference to past fixes (DIRECT_EXECUTION_FIX, PAGE_TABLE_FIX)
- Integration with fast iteration tools
Memory bugs are complex but systematic debugging always works.