Exploratory Data Analysis

Data Science Institute
Vanderbilt University


View the Project on GitHub dsi-explore/eda-course-website

EDA Final Project Guidelines

Purpose of project: demonstrate well-developed exploratory data analysis (EDA) skills and an ability to professionally report this information to a broader audience.

EDA projects should show their audience what we can learn from the data about X (a given question or phenomenon). EDA is about charting a path forward for solving problems (which ones can we solve with this data?) and gaining a deeper understanding of the processes that generate data.

Do not underestimate the importance of investigating the basics of your data: where did it come from? How was it collected? How are the variables/features measured? Investigate distributions of your variables, summary statistics, univariate visualizations, bivariate relationships, etc. EDA is an iterative process and you will need to walk your audience through your group’s process.
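
As a minimal sketch of the basic checks described above, the following R code uses the built-in `mtcars` dataset as a stand-in for your own data. The dataset and the variables chosen (`mpg`, `wt`) are illustrative assumptions only; your project will need checks suited to its own data and questions.

```r
# Illustrative only: mtcars (shipped with base R) stands in for your project data.
dat <- mtcars

# Structure of the data: dimensions, variable types, and the unit of analysis
# (here, one row per car model)
str(dat)
n_rows <- nrow(dat)

# Missing values per variable
n_missing <- colSums(is.na(dat))
print(n_missing)

# Summary statistics for every variable
print(summary(dat))

# Univariate view: the distribution of a single variable
hist(dat$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")

# Bivariate view: the relationship between two variables
plot(dat$wt, dat$mpg, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
mpg_wt_cor <- cor(dat$wt, dat$mpg)
print(mpg_wt_cor)
```

Checks like these are only a starting point. Because EDA is iterative, what you find here should shape which variables and relationships you examine next.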

Summary

Your final project will contain, at a minimum, the following components:

  1. Data cleaning (organized code, completed in R)
  2. Exploratory Data Analysis (code and writing)
  3. Data Visualization (code, writing, and visual design)
  4. Machine learning (optional; code, writing)
  5. High quality communication (visual design, presentation, writing)
  6. A consideration of ethics (writing)

Note: your project’s focus on each of these will be dictated in part by your data and motivating questions. If it makes no sense to explore your data using clustering, that is fine; simply write a paragraph justifying your choice not to use that approach. Similarly, if your data requires pre-processing methods from outside of class, that is fine, but make sure you have cleared it with me (the professor, not the TAs). Such methods cannot be the main part of your work for the class: you must demonstrate the EDA skills this course focuses on.

Deliverables

Your final project will consist of three key elements for grading purposes (all to be submitted using your group project’s GitHub repo):

  1. Final presentation
    • slides due 11:59 PM Monday 12/9/19 as a PDF or PowerPoint file via email to cassy.dorff@vanderbilt.edu; in-class presentations take place Tuesday 12/10/19 from 9 AM to 12 PM, business attire. The order of presentations is posted on Slack.
    • everyone in your group speaks for at least 2 minutes (some can speak more than others)
    • presentation must include a polished slide deck (your choice of style and format)
    • presentations will be 18-20 minutes each followed by a very brief Q & A.
  2. Final report and repository with data/analysis
    • due 11:59 PM Friday 12/13/2019, electronic submission via GitHub
    • an easy-to-read PDF that contains a detailed report of your work and follows the guidelines laid out below.
    • the report will be submitted via a single group GitHub repo that will also include well-organized R files (you must complete your code using R), a README.md file, and supporting data.
    • the final report should be well written and professional in style. Citations are required, but no specific format is mandated (choose MLA, APA, etc.). You will be graded on grammar, organization, visualization, writing style, creativity, and clarity.
    • all code and analysis must be well organized, described in the README or elsewhere, and easily reproducible on another machine
  3. Peer review
    • due at the start of class, printed hard copies, Tuesday December 10th
    • a detailed and confidential assessment of your peers’ work during the project to be submitted as a hard copy to the professor the day of the presentations.

Guidelines

  1. Statement of goals. What puzzle are you trying to address? What is your motivating question for the project? What problem are you interested in learning more about? Why is this important and useful? For whom might your analysis be helpful? What group, person, or company might be an ideal audience for this report?

  2. Descriptions of data. Both in your written report and in your verbal presentation you must give the audience (and me, the professor) at minimum: a sense of the data generating process; the way key variables are measured in the data; the unit of analysis; and the structure of the data.

  3. Begin to answer big questions. State clear answers to the questions you laid out at the beginning of your project and show them, when you can, graphically. Importantly, explore the implications of your ‘answers.’ What is left to explore further? Do you need more data? If so, why? Which relationships are clear, and which are unclear? What variables do you really wish you could collect data on to support a more in-depth analysis or an additional round of study? Where useful, note which R files correspond to which step of your project so that it is easy to reproduce your results and follow along with your report.

  4. Presentation. From start to finish, all of your work should be readable, professional, and compelling. There should be no grammatical mistakes or spelling errors. You should spend ample time practicing the verbal part of your presentation and checking your slides for errors.

  5. Reproducibility. If we can reproduce your work easily from your code, you will get full points for this.
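
One possible way to make a repo easy to reproduce is a single entry-point script that runs every analysis step in a fixed order. The sketch below is an assumption about how a group might organize its work, not a required layout: the `code/` folder, the numbered file names such as `01_clean.R`, the function name `run_pipeline`, and the seed value are all hypothetical.

```r
# Hypothetical layout: analysis scripts named 01_clean.R, 02_explore.R, ...
# stored in a "code/" folder at the top of the repo.
run_pipeline <- function(script_dir = "code") {
  # Fix the random seed so any steps that use randomness repeat exactly
  set.seed(2019)

  # Numbered file names sort into the order the steps should run
  scripts <- sort(list.files(script_dir, pattern = "\\.R$", full.names = TRUE))

  for (s in scripts) {
    message("Running ", s)
    # Run each script in its own environment so steps communicate only
    # through files they write, not through leftover objects in memory
    source(s, local = new.env())
  }

  # Return the scripts that were run, in order
  invisible(scripts)
}
```

Calling `run_pipeline()` from the top of the repo would then rerun the whole analysis. Using relative paths inside each script, and saving the output of `sessionInfo()` alongside the results, can further help someone rerun the work on another machine.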

Other requirements