8.2 KiB
Intro to Data Mining - Course Profile
Course: UC BANA 4080: Introduction to Data Mining with Python Instructor: Brad Boehmke
Audience
Student Level: Undergraduate (juniors/seniors)
Background:
- Business students with foundation in calculus, statistics, and possibly regression
- May have Excel experience and basic VBA exposure
- Variable programming backgrounds - many are complete coding beginners
- Understand business operations, customer behavior, and quantitative thinking
- Know how to think critically but lack experience turning theory into practice with code
Prerequisites:
- Quantitative methods and statistical inference courses
- No prior programming experience required or expected
- Course explicitly designed for beginners
Key Challenge: Bridging the gap between classroom theory and real-world data analysis
Learning Philosophy
Core Approach: Hands-on, immersive learning through doing
Key Principles:
- Practice over theory - Students learn by working with real, messy datasets
- Build confidence through action - Focus on getting students comfortable with tools before perfection
- Close the theory-practice gap - Move from "knowing concepts" to "applying skills"
- AI as assistant, not autopilot - Use GenAI tools (ChatGPT, Claude, Copilot) to help learn, but emphasize understanding
- Collaborative learning - Build community; encourage students to help each other
- Growth mindset - Normalize struggle; coding is a new language that takes time
Teaching Style:
- Conversational, relatable tone (see intro chapter example)
- Use storytelling and scenarios (e.g., "Taylor the intern")
- Address student concerns directly (e.g., "Why learn this when AI exists?")
- Set realistic expectations about difficulty
- Encourage persistence and resilience
Unique Aspects:
- Explicitly addresses role of GenAI in learning process
- Balances AI assistance with foundational skill building
- Uses real-world business contexts students can relate to
Technical Stack
Core Environment:
- Python (chosen for beginner-friendliness + professional power)
- Jupyter Lab/Notebooks (primary development environment)
- Google Colab (cloud-based option for students)
- Quarto (for textbook and slides)
- Virtual environment (venv for package management)
Primary Libraries (Weeks 1-6):
- pandas - data manipulation and DataFrames
- numpy - numerical computation
- matplotlib - basic visualization
- seaborn - statistical visualization
Machine Learning Libraries (Weeks 8-13):
- scikit-learn - all ML models and evaluation
Additional Tools:
- CSV/Excel file handling
- Basic SQL concepts (joins in pandas)
- Git/GitHub for assignment submission
File Formats:
- Quarto markdown (.qmd) for book chapters
- Jupyter notebooks (.ipynb) for examples, labs, homework
- Real datasets (CSV, Excel) in
/data/directory
Content Style
Writing Style:
- Conversational and approachable - Not dry or overly academic
- Student-focused - Addresses "you" directly
- Motivational - Builds confidence, normalizes struggle
- Practical - Always tied to real-world application
- Honest - Acknowledges difficulties, doesn't sugar-coat challenges
Explanation Approach:
- Start with WHY - Motivate the topic before diving in
- Use analogies and stories - Make abstract concepts concrete
- Show, don't just tell - Working code examples over theory
- Progressive complexity - Start simple, build gradually
- Address common questions - Anticipate student concerns
Examples:
- Use relatable business scenarios (customer data, marketing analytics, retail transactions)
- Work with messy, real-world datasets (not clean, perfect examples)
- Include visual aids heavily (plots, diagrams, screenshots)
- Provide executable code that students can run and modify
Pedagogical Elements:
- Callout boxes for tips, warnings, reflections, and examples
- Student reflection prompts to encourage metacognition
- Exercises that build on chapter concepts
- Code comments that explain what's happening
- Error messages and debugging guidance
Depth:
- Prioritize intuition over mathematical rigor
- Show code implementation before heavy theory
- Balance "just enough math" with practical application
- Focus on interpretation and application over derivations
Key Topics
Module 1: Python Fundamentals (Week 1)
- Course intro + motivation
- Variables, data types, basic operators
- Why Python? Why not just use AI?
- Setting up environment
Module 2: Jupyter & Data Structures (Week 2)
- Jupyter notebooks and reproducible workflows
- Lists, dictionaries, tuples
- Pandas introduction
- Importing CSV data
Module 3: Data Wrangling (Week 3)
- DataFrame manipulation
- Filtering and subsetting
- Aggregating data
- GroupBy operations
Module 4: Advanced Data Manipulation (Week 4)
- Working with dates and times
- String operations
- Relational data and joins (SQL-style in pandas)
- Merging DataFrames
Module 5: Data Visualization (Week 5)
- Matplotlib basics
- Seaborn for statistical plots
- Exploratory data analysis with visuals
- Best practices for effective visualization
Module 6: Writing Efficient Code (Week 6)
- Control flow (if/else, loops)
- Functions and modularity
- List comprehensions
- Code efficiency and readability
Week 7: Midterm Project
- Application of Modules 1-6
- Work with messy, real datasets
- Open-ended analysis problem
Module 7: Machine Learning Intro (Week 8)
- What is ML and when to use it?
- Train/test split
- Features and labels
- Model building process
Module 8: Regression (Week 9)
- Correlation analysis
- Linear regression with scikit-learn
- Model evaluation (R², RMSE)
- Interpretation
Module 9: Classification (Week 10)
- Logistic regression
- Classification metrics (accuracy, precision, recall, F1)
- Confusion matrices
- When to use classification vs regression
Module 10: Tree-Based Models (Week 11)
- Decision trees
- Random forests
- Feature importance
- Model interpretation
Module 11: Model Optimization (Week 12)
- Feature engineering
- Cross-validation
- Hyperparameter tuning (GridSearchCV)
- Model selection
Module 12: Advanced Topics (Week 13)
- Unsupervised learning (clustering, PCA)
- Deep learning overview
- Introduction to LLMs and GenAI concepts
Week 14: Final Project
- Comprehensive data science project
- Full pipeline from data cleaning to modeling
Assessment Approach
Grading Components:
- Labs - Weekly hands-on activities (Thursdays)
- Homework - Applied assignments (with answer keys for instructor)
- Midterm Project - Comprehensive application of Modules 1-6
- Final Project - End-to-end data science project
- Quizzes - Knowledge checks (materials in
/planning/quizzes/)
Student Support:
- Canvas discussion boards for peer collaboration
- Office hours
- Answer keys provided for labs and homework (instructor use)
- Multiple formats (notebook, HTML, PDF) for accessibility
GenAI Policy:
- Encouraged to use ChatGPT, Claude, Copilot as learning aids
- Required to understand code, not just copy it
- Emphasis on using AI to learn, not to avoid learning
- Students asked to reflect on AI tool use and limitations
Project Structure:
- Templates provided for major assignments
- Rubrics included in
/planning/projects/ - Real-world datasets required
- Open-ended problems that require creative problem-solving
Content Format
Textbook: Quarto book with modules 1-6 + appendices
Slides: Weekly presentations using Quarto + Reveal.js
Examples: Numbered sequence of Jupyter notebooks (01-17)
Labs: Weekly hands-on activities with answer keys
Homework: Individual assignments with solutions in multiple formats
Datasets: Real-world data in /data/ directory (retail, airlines, housing, etc.)
Course Materials Repository
All materials maintained in Git repository with structure:
/book/- Textbook chapters/slides/- Weekly presentations/example-notebooks/- Companion code examples/labs/- Hands-on activities/homework/- Assignments/data/- Datasets/planning/- Instructor resources (Canvas docs, rubrics, quizzes)