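There are four assignments for the project. Their due dates are:

  • Group Selection: Friday, November 8, 2019, 11:59 PM
  • Analysis Proposal: Monday, November 18, 2019, 11:59 PM
  • Final Report: Monday, December 16, 2019, 11:59 PM
  • Peer Evaluation: Monday, December 16, 2019, 11:59 PM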

Project Goal

The overall goal of the project is to work in groups to apply (mostly supervised) statistical learning methods to a dataset of your choice.

Data Selection

You may use any dataset of your choice, so long as it contains at minimum 500 observations and was not previously used in class. This dataset might be relevant to research outside of this course, another field, or some other interest of yours. If you have any questions about whether your data is appropriate, do not hesitate to ask. If you plan to use data from another endeavor of yours, such as a research project, be sure to gain permission from the controlling authority first.
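A quick way to confirm the size requirement is to count rows once the data is loaded. The sketch below is only an illustration: `candidate_data` is a simulated, hypothetical stand-in, and in practice you would replace it with your own data, for example via `read.csv()`.

```r
# Hypothetical stand-in for your own dataset (simulated for illustration only).
# In practice: candidate_data <- read.csv("path/to/your-data.csv")
candidate_data <- data.frame(
  id    = seq_len(750),
  value = rnorm(750)
)

# Each row is treated as one observation
n_obs <- nrow(candidate_data)

# TRUE if the dataset meets the 500-observation minimum
meets_minimum <- n_obs >= 500
meets_minimum
```

Note that this counts all rows. If data cleaning later removes rows, it may be worth re-checking the count afterwards.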

The two most common sources of data used by students:

If you find an interesting data source (not necessarily a dataset) feel free to share it here:


The final product of this project will be a written report in the IMRAD style used throughout the course.

Your goal is not to use as many methods as possible. Your task is to use appropriate methods to find a good model that can perform the desired statistical learning task. Most importantly, you should motivate and discuss why that task is being completed, and how well it is being completed.

Task Specifics

Group Selection

An email requesting a group formation must be received by Friday, November 8, 2019, 11:59 PM if you would like to select your own group. (If you skip this task, you will be assigned a group.) Groups may be as small as three members, and as large as four members. You may submit a group of two, but the instructor will either merge it with another group or add students who do not yet have a group.

For a group to be considered, an email must be sent to dalpiaz2@illinois.edu by the deadline. It must do the following:

  • Originate from your University email.
  • Have all proposed group members CCed on the email.
  • Follow the course email policy.
    • The subject line should be [STAT 432] Project Group - "Some Name Here"
      • Replace "Some Name Here" with some clever name for your group. (This is just to prevent the emails from being grouped by subject.)
  • Contain a bulleted list of names and NetIDs of group members.

Failure to follow any of the above steps will result in a reply that simply says “try again.” Absolutely no late requests will be considered, for any reason. A non-conforming email before the deadline does not count. Please send one email per group. (This is easy to accomplish via communication and CCing group members on the request email.)

  • Due to immediate complaints, I will allow groups to form across sections. Be warned: I do not advise this, but you can do whatever you like.

Analysis Proposal

A proposal of your intended project is due by Monday, November 18, 2019, 11:59 PM. It should be submitted online via Compass by a single group member.

After review, the proposal will be given one of two statuses:

  • Approved - Your group may proceed with your plans for the data and project.
  • Pending - We will provide suggestions, concerns, or needed information that must be addressed before the proposal will be approved.

A proposal of your intended project should include the following:

  • The names and NetIDs of the students who will be contributing to the group project.
  • A tentative title for the project.
  • Description of the dataset that is sufficient for a reader to understand your motivation for using the dataset.
  • Background information on the dataset, including specific citation of its source.
  • The statistical learning task (regression or classification) that will be performed with the dataset to accomplish the goal of the analysis.
  • Evidence that the data can be loaded into R.
    • Load the data, and print the first few values of the response variable as evidence.
    • Create at least one plot that helps the reader understand the data.
  • Evidence that the data can be modeled in R.
    • Use either lm() (regression) or glm() (classification), then call predict() on the results and return the first few values.
    • You may need to perform some data cleaning before this step.
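
The two "evidence" items above can be sketched in a few lines of R. This is a minimal illustration under stated assumptions, not a template: the data frame `project_data` and its columns (`x1`, `x2`, `y_num`, `y_bin`) are simulated and hypothetical, standing in for whatever dataset your group chooses. A real proposal would need only the branch (regression or classification) that matches its task.

```r
# Simulated stand-in data (an assumption for illustration only).
# In a real proposal: project_data <- read.csv("path/to/your-data.csv")
set.seed(432)
n <- 600
project_data <- data.frame(
  x1 = rnorm(n),
  x2 = runif(n)
)
# Hypothetical numeric response (regression) and binary response (classification)
project_data$y_num <- 1 + 2 * project_data$x1 - project_data$x2 + rnorm(n)
project_data$y_bin <- rbinom(n, size = 1, prob = plogis(0.5 * project_data$x1))

# Evidence the data loads: first few values of the response
head(project_data$y_num)

# At least one plot that helps the reader understand the data
plot(y_num ~ x1, data = project_data, main = "Response versus x1")

# Regression: fit with lm(), then return the first few predictions
reg_fit <- lm(y_num ~ x1 + x2, data = project_data)
head(predict(reg_fit))

# Classification: fit with glm(), then return the first few
# predicted probabilities (type = "response" gives probabilities)
clf_fit <- glm(y_bin ~ x1 + x2, data = project_data, family = binomial)
head(predict(clf_fit, type = "response"))
```

With no `newdata` argument, `predict()` returns fitted values for the data used to fit the model, which is enough to show that the data can be modeled.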

As a group, you will submit a .zip file as you would for an analysis that contains an .html and .Rmd file, as well as the data if it cannot be linked online. If your data is too large to submit, and cannot be linked, please let us know and we will find an alternative. There is no required format or template, but you should follow reasonable R Markdown practices discussed in class.

Final Report

The final report of your analysis is due by Monday, December 16, 2019, 11:59 PM. It should be submitted online via Compass by a single group member.

As a group, you will submit a .zip file, as you would for an analysis, that contains an .html and .Rmd file, as well as the data if it cannot be linked online.

Peer Evaluation

A peer evaluation of the group members is due by Monday, December 16, 2019, 11:59 PM. It should be submitted online via Compass by each group member.

Individually, you will write a short review of each of your group members, including yourself. For each member, comment on:

  • Which parts of the project were worked on by that member
  • How well that member communicated with the team (Provide a score from 0 to 100 as well as written comments.)
  • How well that member understood the course concepts (Provide a score from 0 to 100 as well as written comments.)
  • Proportion of the project completed by that member (Provide a proportion from 0% to 100% as well as written comments.)

Individually, you will submit a single file (.pdf preferred) that contains your reviews. (A template will likely be released for this task.)

Project Grading

Analysis Proposal

  • Points Possible: 20

You will be graded on formatting, clarity, appropriateness of data, and motivation of task.

Final Report

  • Points Possible: 70

  • Introduction
    • [5] Analysis is clearly motivated and has a clear goal.
      • The why of the analysis is made clear to the reader.
      • Reader should understand why statistical models for prediction will be useful.
    • [5] Data domain is clearly explained to the reader
      • Reader should roughly understand what the data is, and how it can be used to achieve the goal.
      • Only the most relevant information should be placed in the introduction.
      • A full data dictionary should be included in an appendix. Further detail on the data may appear in the methods section.
  • Methods
    • [5] Appropriate methods from class are used.
    • [10] Methods are used correctly.
  • Results
    • [5] Results are clearly organized either visually or as a table.
    • [5] Correct and useful metrics are used.
  • Discussion
    • [5] Correct conclusions are drawn from the results.
    • [10] How the results relate to the goal is discussed. Results are connected to the motivation of the analysis.
  • Abstract
    • [5] Abstract appropriately summarizes the analysis performed.
  • Code
    • [5] R is used appropriately.
      • Does your code perform the desired tasks?
      • Is your code readable?
      • Is your style consistent?
    • [5] rmarkdown is used appropriately.
      • Are you properly utilizing rmarkdown? (Headers, chunks, etc.)
      • Are warnings and messages suppressed when appropriate?
      • Is irrelevant code hidden? (Plots, tables, etc.)
  • General
    • [5] Narrative text is well written.
      • Text is free of spelling errors.
      • Text is written with clarity. (You will not be held to a strict grammar standard.)
      • Text is written in a manner such that a reader does not already need to be familiar with the data. (Minimal familiarity with statistical learning is assumed.)
    • [0] Directions are followed.
      • Report is submitted using correct filetypes and structure.
      • Report has a title.
      • Name and NetID are included in the report.
      • While there are no points assigned to this section, you can still lose points here.
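
For the rmarkdown items in the rubric above, one common approach is to set chunk defaults once in a setup chunk near the top of the .Rmd file. The sketch below assumes the knitr package is installed (it is what renders .Rmd files). The particular defaults are a suggestion, not a course requirement.

```r
# Place inside a setup chunk near the top of the .Rmd file.
# Assumes the knitr package is installed.
knitr::opts_chunk$set(
  warning   = FALSE,    # suppress warnings in the rendered report
  message   = FALSE,    # suppress package start-up and other messages
  fig.align = "center"  # center figures
)

# Code that only produces a plot or table can be hidden per chunk
# by adding echo = FALSE to that chunk's header.
```

Suppressing warnings globally can hide real problems, so it may be safer to render once with warnings visible before turning them off.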

Peer Evaluation

  • Points Possible: 5 (Evaluation of Peers) + 5 (Evaluations from Peers)

It is more important that you honestly review your team than give each member good remarks. You will be graded on how well you review your group members. If you simply give each of your team members good marks, you will likely receive far fewer points for the portion of the grade dedicated to evaluating your peers. (For example, if you give everyone 90%+ in each category, your grade will be reduced.)

Formatting and clarity will also account for a portion of the grade for your evaluations.

Results of the peer analysis will not be shared with your group members.

The instructor reserves the right to further reduce a student's overall project grade if their team reports that they did not attempt to make a significant contribution to the project.


This section will likely be updated as we progress through the remainder of the semester.

How long should the report be?

Isn’t this a lot to do at the end of the course while we have other things going on in the course? And it’s due during finals week?

How should we split up this analysis?

How do we deal with an unresponsive group member?