Files
gh-bradleyboehmke-brads-mar…/skills/courses/intro-to-data-mining/course-profile.md
2025-11-29 18:01:52 +08:00

8.2 KiB

Intro to Data Mining - Course Profile

Course: UC BANA 4080: Introduction to Data Mining with Python Instructor: Brad Boehmke

Audience

Student Level: Undergraduate (juniors/seniors)

Background:

  • Business students with foundation in calculus, statistics, and possibly regression
  • May have Excel experience and basic VBA exposure
  • Variable programming backgrounds - many are complete coding beginners
  • Understand business operations, customer behavior, and quantitative thinking
  • Know how to think critically but lack experience turning theory into practice with code

Prerequisites:

  • Quantitative methods and statistical inference courses
  • No prior programming experience required or expected
  • Course explicitly designed for beginners

Key Challenge: Bridging the gap between classroom theory and real-world data analysis

Learning Philosophy

Core Approach: Hands-on, immersive learning through doing

Key Principles:

  1. Practice over theory - Students learn by working with real, messy datasets
  2. Build confidence through action - Focus on getting students comfortable with tools before perfection
  3. Close the theory-practice gap - Move from "knowing concepts" to "applying skills"
  4. AI as assistant, not autopilot - Use GenAI tools (ChatGPT, Claude, Copilot) to help learn, but emphasize understanding
  5. Collaborative learning - Build community; encourage students to help each other
  6. Growth mindset - Normalize struggle; coding is a new language that takes time

Teaching Style:

  • Conversational, relatable tone (see intro chapter example)
  • Use storytelling and scenarios (e.g., "Taylor the intern")
  • Address student concerns directly (e.g., "Why learn this when AI exists?")
  • Set realistic expectations about difficulty
  • Encourage persistence and resilience

Unique Aspects:

  • Explicitly addresses role of GenAI in learning process
  • Balances AI assistance with foundational skill building
  • Uses real-world business contexts students can relate to

Technical Stack

Core Environment:

  • Python (chosen for beginner-friendliness + professional power)
  • Jupyter Lab/Notebooks (primary development environment)
  • Google Colab (cloud-based option for students)
  • Quarto (for textbook and slides)
  • Virtual environment (venv for package management)

Primary Libraries (Weeks 1-6):

  • pandas - data manipulation and DataFrames
  • numpy - numerical computation
  • matplotlib - basic visualization
  • seaborn - statistical visualization

Machine Learning Libraries (Weeks 8-13):

  • scikit-learn - all ML models and evaluation

Additional Tools:

  • CSV/Excel file handling
  • Basic SQL concepts (joins in pandas)
  • Git/GitHub for assignment submission

File Formats:

  • Quarto markdown (.qmd) for book chapters
  • Jupyter notebooks (.ipynb) for examples, labs, homework
  • Real datasets (CSV, Excel) in /data/ directory

Content Style

Writing Style:

  • Conversational and approachable - Not dry or overly academic
  • Student-focused - Addresses "you" directly
  • Motivational - Builds confidence, normalizes struggle
  • Practical - Always tied to real-world application
  • Honest - Acknowledges difficulties, doesn't sugar-coat challenges

Explanation Approach:

  1. Start with WHY - Motivate the topic before diving in
  2. Use analogies and stories - Make abstract concepts concrete
  3. Show, don't just tell - Working code examples over theory
  4. Progressive complexity - Start simple, build gradually
  5. Address common questions - Anticipate student concerns

Examples:

  • Use relatable business scenarios (customer data, marketing analytics, retail transactions)
  • Work with messy, real-world datasets (not clean, perfect examples)
  • Include visual aids heavily (plots, diagrams, screenshots)
  • Provide executable code that students can run and modify

Pedagogical Elements:

  • Callout boxes for tips, warnings, reflections, and examples
  • Student reflection prompts to encourage metacognition
  • Exercises that build on chapter concepts
  • Code comments that explain what's happening
  • Error messages and debugging guidance

Depth:

  • Prioritize intuition over mathematical rigor
  • Show code implementation before heavy theory
  • Balance "just enough math" with practical application
  • Focus on interpretation and application over derivations

Key Topics

Module 1: Python Fundamentals (Week 1)

  • Course intro + motivation
  • Variables, data types, basic operators
  • Why Python? Why not just use AI?
  • Setting up environment

Module 2: Jupyter & Data Structures (Week 2)

  • Jupyter notebooks and reproducible workflows
  • Lists, dictionaries, tuples
  • Pandas introduction
  • Importing CSV data

Module 3: Data Wrangling (Week 3)

  • DataFrame manipulation
  • Filtering and subsetting
  • Aggregating data
  • GroupBy operations

Module 4: Advanced Data Manipulation (Week 4)

  • Working with dates and times
  • String operations
  • Relational data and joins (SQL-style in pandas)
  • Merging DataFrames

Module 5: Data Visualization (Week 5)

  • Matplotlib basics
  • Seaborn for statistical plots
  • Exploratory data analysis with visuals
  • Best practices for effective visualization

Module 6: Writing Efficient Code (Week 6)

  • Control flow (if/else, loops)
  • Functions and modularity
  • List comprehensions
  • Code efficiency and readability

Week 7: Midterm Project

  • Application of Modules 1-6
  • Work with messy, real datasets
  • Open-ended analysis problem

Module 7: Machine Learning Intro (Week 8)

  • What is ML and when to use it?
  • Train/test split
  • Features and labels
  • Model building process

Module 8: Regression (Week 9)

  • Correlation analysis
  • Linear regression with scikit-learn
  • Model evaluation (R², RMSE)
  • Interpretation

Module 9: Classification (Week 10)

  • Logistic regression
  • Classification metrics (accuracy, precision, recall, F1)
  • Confusion matrices
  • When to use classification vs regression

Module 10: Tree-Based Models (Week 11)

  • Decision trees
  • Random forests
  • Feature importance
  • Model interpretation

Module 11: Model Optimization (Week 12)

  • Feature engineering
  • Cross-validation
  • Hyperparameter tuning (GridSearchCV)
  • Model selection

Module 12: Advanced Topics (Week 13)

  • Unsupervised learning (clustering, PCA)
  • Deep learning overview
  • Introduction to LLMs and GenAI concepts

Week 14: Final Project

  • Comprehensive data science project
  • Full pipeline from data cleaning to modeling

Assessment Approach

Grading Components:

  • Labs - Weekly hands-on activities (Thursdays)
  • Homework - Applied assignments (with answer keys for instructor)
  • Midterm Project - Comprehensive application of Modules 1-6
  • Final Project - End-to-end data science project
  • Quizzes - Knowledge checks (materials in /planning/quizzes/)

Student Support:

  • Canvas discussion boards for peer collaboration
  • Office hours
  • Answer keys provided for labs and homework (instructor use)
  • Multiple formats (notebook, HTML, PDF) for accessibility

GenAI Policy:

  • Encouraged to use ChatGPT, Claude, Copilot as learning aids
  • Required to understand code, not just copy it
  • Emphasis on using AI to learn, not to avoid learning
  • Students asked to reflect on AI tool use and limitations

Project Structure:

  • Templates provided for major assignments
  • Rubrics included in /planning/projects/
  • Real-world datasets required
  • Open-ended problems that require creative problem-solving

Content Format

Textbook: Quarto book with modules 1-6 + appendices Slides: Weekly presentations using Quarto + Reveal.js Examples: Numbered sequence of Jupyter notebooks (01-17) Labs: Weekly hands-on activities with answer keys Homework: Individual assignments with solutions in multiple formats Datasets: Real-world data in /data/ directory (retail, airlines, housing, etc.)

Course Materials Repository

All materials maintained in Git repository with structure:

  • /book/ - Textbook chapters
  • /slides/ - Weekly presentations
  • /example-notebooks/ - Companion code examples
  • /labs/ - Hands-on activities
  • /homework/ - Assignments
  • /data/ - Datasets
  • /planning/ - Instructor resources (Canvas docs, rubrics, quizzes)