Files
gh-bradleyboehmke-brads-mar…/skills/courses/intro-to-data-mining/course-profile.md
2025-11-29 18:01:52 +08:00

239 lines
8.2 KiB
Markdown

# Intro to Data Mining - Course Profile
**Course:** UC BANA 4080: Introduction to Data Mining with Python
**Instructor:** Brad Boehmke
## Audience
**Student Level:** Undergraduate (juniors/seniors)
**Background:**
- Business students with foundation in calculus, statistics, and possibly regression
- May have Excel experience and basic VBA exposure
- Variable programming backgrounds - many are complete coding beginners
- Understand business operations, customer behavior, and quantitative thinking
- Know how to think critically but lack experience turning theory into practice with code
**Prerequisites:**
- Quantitative methods and statistical inference courses
- No prior programming experience required or expected
- Course explicitly designed for beginners
**Key Challenge:** Bridging the gap between classroom theory and real-world data analysis
## Learning Philosophy
**Core Approach:** Hands-on, immersive learning through doing
**Key Principles:**
1. **Practice over theory** - Students learn by working with real, messy datasets
2. **Build confidence through action** - Focus on getting students comfortable with tools before perfection
3. **Close the theory-practice gap** - Move from "knowing concepts" to "applying skills"
4. **AI as assistant, not autopilot** - Use GenAI tools (ChatGPT, Claude, Copilot) to help learn, but emphasize understanding
5. **Collaborative learning** - Build community; encourage students to help each other
6. **Growth mindset** - Normalize struggle; coding is a new language that takes time
**Teaching Style:**
- Conversational, relatable tone (see intro chapter example)
- Use storytelling and scenarios (e.g., "Taylor the intern")
- Address student concerns directly (e.g., "Why learn this when AI exists?")
- Set realistic expectations about difficulty
- Encourage persistence and resilience
**Unique Aspects:**
- Explicitly addresses role of GenAI in learning process
- Balances AI assistance with foundational skill building
- Uses real-world business contexts students can relate to
## Technical Stack
**Core Environment:**
- **Python** (chosen for beginner-friendliness + professional power)
- **Jupyter Lab/Notebooks** (primary development environment)
- **Google Colab** (cloud-based option for students)
- **Quarto** (for textbook and slides)
- **Virtual environment** (venv for package management)
**Primary Libraries (Weeks 1-6):**
- **pandas** - data manipulation and DataFrames
- **numpy** - numerical computation
- **matplotlib** - basic visualization
- **seaborn** - statistical visualization
**Machine Learning Libraries (Weeks 8-13):**
- **scikit-learn** - all ML models and evaluation
**Additional Tools:**
- CSV/Excel file handling
- Basic SQL concepts (joins in pandas)
- Git/GitHub for assignment submission
**File Formats:**
- Quarto markdown (.qmd) for book chapters
- Jupyter notebooks (.ipynb) for examples, labs, homework
- Real datasets (CSV, Excel) in `/data/` directory
## Content Style
**Writing Style:**
- **Conversational and approachable** - Not dry or overly academic
- **Student-focused** - Addresses "you" directly
- **Motivational** - Builds confidence, normalizes struggle
- **Practical** - Always tied to real-world application
- **Honest** - Acknowledges difficulties, doesn't sugar-coat challenges
**Explanation Approach:**
1. **Start with WHY** - Motivate the topic before diving in
2. **Use analogies and stories** - Make abstract concepts concrete
3. **Show, don't just tell** - Working code examples over theory
4. **Progressive complexity** - Start simple, build gradually
5. **Address common questions** - Anticipate student concerns
**Examples:**
- Use **relatable business scenarios** (customer data, marketing analytics, retail transactions)
- Work with **messy, real-world datasets** (not clean, perfect examples)
- Include **visual aids** heavily (plots, diagrams, screenshots)
- Provide **executable code** that students can run and modify
**Pedagogical Elements:**
- **Callout boxes** for tips, warnings, reflections, and examples
- **Student reflection prompts** to encourage metacognition
- **Exercises** that build on chapter concepts
- **Code comments** that explain what's happening
- **Error messages and debugging guidance**
**Depth:**
- Prioritize **intuition over mathematical rigor**
- Show code implementation before heavy theory
- Balance "just enough math" with practical application
- Focus on **interpretation and application** over derivations
## Key Topics
**Module 1: Python Fundamentals (Week 1)**
- Course intro + motivation
- Variables, data types, basic operators
- Why Python? Why not just use AI?
- Setting up environment
**Module 2: Jupyter & Data Structures (Week 2)**
- Jupyter notebooks and reproducible workflows
- Lists, dictionaries, tuples
- Pandas introduction
- Importing CSV data
**Module 3: Data Wrangling (Week 3)**
- DataFrame manipulation
- Filtering and subsetting
- Aggregating data
- GroupBy operations
**Module 4: Advanced Data Manipulation (Week 4)**
- Working with dates and times
- String operations
- Relational data and joins (SQL-style in pandas)
- Merging DataFrames
**Module 5: Data Visualization (Week 5)**
- Matplotlib basics
- Seaborn for statistical plots
- Exploratory data analysis with visuals
- Best practices for effective visualization
**Module 6: Writing Efficient Code (Week 6)**
- Control flow (if/else, loops)
- Functions and modularity
- List comprehensions
- Code efficiency and readability
**Week 7: Midterm Project**
- Application of Modules 1-6
- Work with messy, real datasets
- Open-ended analysis problem
**Module 7: Machine Learning Intro (Week 8)**
- What is ML and when to use it?
- Train/test split
- Features and labels
- Model building process
**Module 8: Regression (Week 9)**
- Correlation analysis
- Linear regression with scikit-learn
- Model evaluation (R², RMSE)
- Interpretation
**Module 9: Classification (Week 10)**
- Logistic regression
- Classification metrics (accuracy, precision, recall, F1)
- Confusion matrices
- When to use classification vs regression
**Module 10: Tree-Based Models (Week 11)**
- Decision trees
- Random forests
- Feature importance
- Model interpretation
**Module 11: Model Optimization (Week 12)**
- Feature engineering
- Cross-validation
- Hyperparameter tuning (GridSearchCV)
- Model selection
**Module 12: Advanced Topics (Week 13)**
- Unsupervised learning (clustering, PCA)
- Deep learning overview
- Introduction to LLMs and GenAI concepts
**Week 14: Final Project**
- Comprehensive data science project
- Full pipeline from data cleaning to modeling
## Assessment Approach
**Grading Components:**
- **Labs** - Weekly hands-on activities (Thursdays)
- **Homework** - Applied assignments (with answer keys for instructor)
- **Midterm Project** - Comprehensive application of Modules 1-6
- **Final Project** - End-to-end data science project
- **Quizzes** - Knowledge checks (materials in `/planning/quizzes/`)
**Student Support:**
- Canvas discussion boards for peer collaboration
- Office hours
- Answer keys provided for labs and homework (instructor use)
- Multiple formats (notebook, HTML, PDF) for accessibility
**GenAI Policy:**
- **Encouraged** to use ChatGPT, Claude, Copilot as learning aids
- **Required** to understand code, not just copy it
- Emphasis on using AI to learn, not to avoid learning
- Students asked to reflect on AI tool use and limitations
**Project Structure:**
- Templates provided for major assignments
- Rubrics included in `/planning/projects/`
- Real-world datasets required
- Open-ended problems that require creative problem-solving
## Content Format
**Textbook:** Quarto book with modules 1-6 + appendices
**Slides:** Weekly presentations using Quarto + Reveal.js
**Examples:** Numbered sequence of Jupyter notebooks (01-17)
**Labs:** Weekly hands-on activities with answer keys
**Homework:** Individual assignments with solutions in multiple formats
**Datasets:** Real-world data in `/data/` directory (retail, airlines, housing, etc.)
## Course Materials Repository
All materials maintained in Git repository with structure:
- `/book/` - Textbook chapters
- `/slides/` - Weekly presentations
- `/example-notebooks/` - Companion code examples
- `/labs/` - Hands-on activities
- `/homework/` - Assignments
- `/data/` - Datasets
- `/planning/` - Instructor resources (Canvas docs, rubrics, quizzes)