# Intro to Data Mining - Course Profile

**Course:** UC BANA 4080: Introduction to Data Mining with Python
**Instructor:** Brad Boehmke

## Audience

**Student Level:** Undergraduate (juniors/seniors)

**Background:**
- Business students with a foundation in calculus, statistics, and possibly regression
- May have Excel experience and basic VBA exposure
- Programming backgrounds vary - many are complete coding beginners
- Understand business operations, customer behavior, and quantitative thinking
- Know how to think critically but lack experience turning theory into practice with code

**Prerequisites:**
- Quantitative methods and statistical inference courses
- No prior programming experience required or expected
- Course explicitly designed for beginners

**Key Challenge:** Bridging the gap between classroom theory and real-world data analysis

## Learning Philosophy

**Core Approach:** Hands-on, immersive learning through doing

**Key Principles:**
1. **Practice over theory** - Students learn by working with real, messy datasets
2. **Build confidence through action** - Get students comfortable with the tools before aiming for perfection
3. **Close the theory-practice gap** - Move from "knowing concepts" to "applying skills"
4. **AI as assistant, not autopilot** - Use GenAI tools (ChatGPT, Claude, Copilot) to help learn, but emphasize understanding
5. **Collaborative learning** - Build community; encourage students to help each other
6. **Growth mindset** - Normalize struggle; coding is a new language that takes time

**Teaching Style:**
- Conversational, relatable tone (see intro chapter example)
- Use storytelling and scenarios (e.g., "Taylor the intern")
- Address student concerns directly (e.g., "Why learn this when AI exists?")
- Set realistic expectations about difficulty
- Encourage persistence and resilience

**Unique Aspects:**
- Explicitly addresses the role of GenAI in the learning process
- Balances AI assistance with foundational skill building
- Uses real-world business contexts students can relate to

## Technical Stack

**Core Environment:**
- **Python** (chosen for beginner-friendliness + professional power)
- **JupyterLab/Notebooks** (primary development environment)
- **Google Colab** (cloud-based option for students)
- **Quarto** (for textbook and slides)
- **Virtual environment** (venv to keep course packages isolated)

**Primary Libraries (Weeks 1-6):**
- **pandas** - data manipulation and DataFrames
- **numpy** - numerical computation
- **matplotlib** - basic visualization
- **seaborn** - statistical visualization

**Machine Learning Libraries (Weeks 8-13):**
- **scikit-learn** - all ML models and evaluation

**Additional Tools:**
- CSV/Excel file handling
- Basic SQL concepts (joins in pandas)
- Git/GitHub for assignment submission

**File Formats:**
- Quarto markdown (.qmd) for book chapters
- Jupyter notebooks (.ipynb) for examples, labs, homework
- Real datasets (CSV, Excel) in `/data/` directory

## Content Style

**Writing Style:**
- **Conversational and approachable** - Not dry or overly academic
- **Student-focused** - Addresses "you" directly
- **Motivational** - Builds confidence, normalizes struggle
- **Practical** - Always tied to real-world application
- **Honest** - Acknowledges difficulties, doesn't sugar-coat challenges

**Explanation Approach:**
1. **Start with WHY** - Motivate the topic before diving in
2. **Use analogies and stories** - Make abstract concepts concrete
3. **Show, don't just tell** - Working code examples over theory
4. **Progressive complexity** - Start simple, build gradually
5. **Address common questions** - Anticipate student concerns

**Examples:**
- Use **relatable business scenarios** (customer data, marketing analytics, retail transactions)
- Work with **messy, real-world datasets** (not clean, perfect examples)
- Include **visual aids** heavily (plots, diagrams, screenshots)
- Provide **executable code** that students can run and modify

**Pedagogical Elements:**
- **Callout boxes** for tips, warnings, reflections, and examples
- **Student reflection prompts** to encourage metacognition
- **Exercises** that build on chapter concepts
- **Code comments** that explain what's happening
- **Error messages and debugging guidance**

**Depth:**
- Prioritize **intuition over mathematical rigor**
- Show code implementation before heavy theory
- Balance "just enough math" with practical application
- Focus on **interpretation and application** over derivations

## Key Topics

**Module 1: Python Fundamentals (Week 1)**
- Course intro + motivation
- Variables, data types, basic operators
- Why Python? Why not just use AI?
- Setting up environment

**Module 2: Jupyter & Data Structures (Week 2)**
- Jupyter notebooks and reproducible workflows
- Lists, dictionaries, tuples
- Pandas introduction
- Importing CSV data

**Module 3: Data Wrangling (Week 3)**
- DataFrame manipulation
- Filtering and subsetting
- Aggregating data
- GroupBy operations
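
As a rough illustration of the Module 3 topics, the sketch below filters, aggregates, and groups a tiny table with pandas. The `transactions` DataFrame, its column names, and its values are hypothetical and are not drawn from the course datasets.

```python
import pandas as pd

# Hypothetical retail transactions; names and values are made up for illustration.
transactions = pd.DataFrame({
    "store": ["A", "A", "B", "B", "C"],
    "category": ["snacks", "dairy", "snacks", "dairy", "snacks"],
    "sales": [120.0, 80.0, 200.0, 50.0, 150.0],
})

# Filtering and subsetting: keep only the snack rows and two of the columns.
snacks = transactions.loc[transactions["category"] == "snacks", ["store", "sales"]]

# Aggregating: one summary number for the whole table.
total_sales = transactions["sales"].sum()

# GroupBy: total sales per store, returned as a Series indexed by store.
sales_by_store = transactions.groupby("store")["sales"].sum()

print(snacks)
print(total_sales)
print(sales_by_store)
```

An example this small lets beginners check each result by hand before moving to the messy datasets the course emphasizes.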

**Module 4: Advanced Data Manipulation (Week 4)**
- Working with dates and times
- String operations
- Relational data and joins (SQL-style in pandas)
- Merging DataFrames
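
One possible way to show the Module 4 topics together is the sketch below, which cleans a text column, parses dates, and performs a SQL-style left join. The `customers` and `orders` tables and all of their values are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical tables; every name and value here is made up for illustration.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["  alice ", "Bob", "CARLA"],
})
orders = pd.DataFrame({
    "order_id": [101, 102, 103],
    "customer_id": [1, 1, 4],
    "order_date": ["2024-01-15", "2024-02-03", "2024-02-20"],
})

# String operations: trim whitespace and standardize capitalization.
customers["name"] = customers["name"].str.strip().str.title()

# Dates and times: parse text into datetimes, then extract the month number.
orders["order_date"] = pd.to_datetime(orders["order_date"])
orders["order_month"] = orders["order_date"].dt.month

# Join: a left join keeps every order, even when no customer matches (customer_id 4).
merged = orders.merge(customers, on="customer_id", how="left")

print(merged)
```

The unmatched order ends up with a missing `name`, which is a natural place to discuss how the choice of join type changes the result.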

**Module 5: Data Visualization (Week 5)**
- Matplotlib basics
- Seaborn for statistical plots
- Exploratory data analysis with visuals
- Best practices for effective visualization

**Module 6: Writing Efficient Code (Week 6)**
- Control flow (if/else, loops)
- Functions and modularity
- List comprehensions
- Code efficiency and readability
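
A minimal sketch of how the Module 6 ideas fit together: a small function with an if/else branch, applied first with a loop and then with a list comprehension. The `label_order` helper, its threshold, and the order totals are hypothetical.

```python
# Hypothetical order totals in dollars; the values are made up for illustration.
order_totals = [12.50, 48.00, 105.25, 7.99, 230.00]


def label_order(total, threshold=100):
    """Return 'large' for totals at or above the threshold, otherwise 'small'."""
    if total >= threshold:
        return "large"
    return "small"


# Loop version: build the list one step at a time.
labels_loop = []
for total in order_totals:
    labels_loop.append(label_order(total))

# List comprehension: the same result in a single readable line.
labels_comp = [label_order(total) for total in order_totals]

print(labels_comp)
```

Both versions produce the same list, so students can compare them directly and decide which reads more clearly.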

**Week 7: Midterm Project**
- Application of Modules 1-6
- Work with messy, real datasets
- Open-ended analysis problem

**Module 7: Machine Learning Intro (Week 8)**
- What is ML and when to use it?
- Train/test split
- Features and labels
- Model building process

**Module 8: Regression (Week 9)**
- Correlation analysis
- Linear regression with scikit-learn
- Model evaluation (R², RMSE)
- Interpretation

**Module 9: Classification (Week 10)**
- Logistic regression
- Classification metrics (accuracy, precision, recall, F1)
- Confusion matrices
- When to use classification vs regression
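
To make the Module 9 metrics concrete, the sketch below computes a confusion matrix and the four listed metrics with scikit-learn. The labels and predictions are hypothetical and stand in for the output of a fitted classifier.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: 1 = customer churned, 0 = customer stayed.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
# Hypothetical predictions, as if produced by some already-fitted classifier.
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[true negatives, false positives], [false negatives, true positives]]
cm = confusion_matrix(y_true, y_pred)

accuracy = accuracy_score(y_true, y_pred)    # share of all predictions that are correct
precision = precision_score(y_true, y_pred)  # of predicted churners, share who churned
recall = recall_score(y_true, y_pred)        # of actual churners, share that were caught
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(cm)
print(accuracy, precision, recall, f1)
```

With one false positive and one false negative among eight cases, all four metrics come out to 0.75 here; changing a single prediction shows how they diverge.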

**Module 10: Tree-Based Models (Week 11)**
- Decision trees
- Random forests
- Feature importance
- Model interpretation

**Module 11: Model Optimization (Week 12)**
- Feature engineering
- Cross-validation
- Hyperparameter tuning (GridSearchCV)
- Model selection
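
A hedged sketch of the Module 11 workflow: cross-validate a baseline model, then tune it with `GridSearchCV`. The synthetic data, the decision tree, and the parameter grid are assumptions chosen for brevity, not the course's actual assignment setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real business dataset (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Cross-validation: accuracy on each of 5 folds of the training data.
baseline = DecisionTreeClassifier(random_state=42)
cv_scores = cross_val_score(baseline, X_train, y_train, cv=5)

# Hyperparameter tuning: try every combination in a small, hypothetical grid.
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

# Model selection: inspect the winning settings, then score once on held-out data.
test_accuracy = search.score(X_test, y_test)

print(cv_scores.mean())
print(search.best_params_)
print(test_accuracy)
```

Holding the test set back until the end keeps the final score an honest estimate, since tuning only ever sees the training folds.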

**Module 12: Advanced Topics (Week 13)**
- Unsupervised learning (clustering, PCA)
- Deep learning overview
- Introduction to LLMs and GenAI concepts

**Week 14: Final Project**
- Comprehensive data science project
- Full pipeline from data cleaning to modeling

## Assessment Approach

**Grading Components:**
- **Labs** - Weekly hands-on activities (Thursdays)
- **Homework** - Applied assignments (with answer keys for instructor)
- **Midterm Project** - Comprehensive application of Modules 1-6
- **Final Project** - End-to-end data science project
- **Quizzes** - Knowledge checks (materials in `/planning/quizzes/`)

**Student Support:**
- Canvas discussion boards for peer collaboration
- Office hours
- Answer keys provided for labs and homework (instructor use)
- Multiple formats (notebook, HTML, PDF) for accessibility

**GenAI Policy:**
- **Encouraged** to use ChatGPT, Claude, Copilot as learning aids
- **Required** to understand code, not just copy it
- Emphasis on using AI to learn, not to avoid learning
- Students asked to reflect on AI tool use and limitations

**Project Structure:**
- Templates provided for major assignments
- Rubrics included in `/planning/projects/`
- Real-world datasets required
- Open-ended problems that require creative problem-solving

## Content Format

**Textbook:** Quarto book with modules 1-6 + appendices
**Slides:** Weekly presentations using Quarto + Reveal.js
**Examples:** Numbered sequence of Jupyter notebooks (01-17)
**Labs:** Weekly hands-on activities with answer keys
**Homework:** Individual assignments with solutions in multiple formats
**Datasets:** Real-world data in `/data/` directory (retail, airlines, housing, etc.)

## Course Materials Repository

All materials are maintained in a Git repository with the following structure:
- `/book/` - Textbook chapters
- `/slides/` - Weekly presentations
- `/example-notebooks/` - Companion code examples
- `/labs/` - Hands-on activities
- `/homework/` - Assignments
- `/data/` - Datasets
- `/planning/` - Instructor resources (Canvas docs, rubrics, quizzes)