Data Analysis Exam Thoughts and Materials
This Spring, at the end of my first year, I took the Data Analysis (DA) exam, which is the “qualifying” exam for CMU’s Statistics Ph.D. program. This notebook includes the materials I made in preparation for the DA exam. The majority of these materials are based on the Fall 2024 Regression Analysis (36-707) lecture notes, which can be found here.1
A little background on CMU’s DA Exam from the Statistics Graduate Student Handbook: “At the conclusion of each Spring Semester the Department administers the ‘Data Analysis Exam,’ which is designed to test students’ ability to apply statistical methods to address a substantive, real problem. Students are given eight hours to complete the exam, during which time they analyze the data and write a [ten-page] report to present their analysis and conclusions. The faculty are realistic as to what can be accomplished during the eight-hour period. In grading the exam, the faculty are looking for clear presentation of an appropriate analysis of the data. Emphasis is not placed on technical or mathematical sophistication,” (p. 13).
How I Prepared:
To be fully transparent, beyond taking 36-707 last Fall, I did not do very much to prepare for the DA exam until the week before, when I spent roughly 4-6 hours each day prepping for the exam. That said, I found that a week was sufficient to review important concepts and make templates as well as other notes to use during the exam without being so long that I began to over-complicate concepts and analysis strategies.
In the week leading up to the exam, I did the following to prepare:
- Read through the 36-707 notes (primarily chapters 4-15) and wrote up notes with full code which could be easily copied and pasted into my own DA exam (my completed notes are available here)
- Read through completed data analysis exams from some students in upper years of the Ph.D. program
- Read through my three revised data analysis reports from 36-707
- Made a “recipes” sheet with the technical conditions for different models as well as the steps I would take for each model type (the recipes document I made with one of my cohort-mates is available here)
- Made a DA report template that I could fill in during the exam, which included all the rubric items from 36-707 (see Table 1), copied into the relevant report sections.2
Report Section | Rubric Items |
---|---|
Executive Summary | |
Introduction | |
The Data (Data Summary and Exploratory Data Analysis) | |
Methods | |
Results | |
Discussion | |
Other | |
What Went Well:
I found that the process of making the recipes document, in particular, was an extremely helpful study tool for recalling when to use the different techniques we covered in 36-707. This especially paid off since my year’s DA exam used a Poisson model with an offset, a model I had only implemented and interpreted a few times.
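For readers who have not seen one, a Poisson model with an offset can be written as below. This is a generic sketch rather than the model from my exam: the symbols are my own choices for illustration, with $Y_i$ a count, $t_i$ the exposure (for example, population or time at risk), and $x_i$ a single covariate.

$$
\log \mathbb{E}[Y_i \mid x_i] = \log t_i + \beta_0 + \beta_1 x_i
\quad\Longleftrightarrow\quad
\frac{\mathbb{E}[Y_i \mid x_i]}{t_i} = \exp(\beta_0 + \beta_1 x_i)
$$

The offset $\log t_i$ enters with its coefficient fixed at 1, so the model describes a rate per unit of exposure, and $\exp(\beta_1)$ is interpreted as the multiplicative change in that rate for a one-unit increase in $x_i$.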
It was also beneficial to spend time reviewing APA formatting and practicing interpretations for more complicated scenarios (e.g., when there is an interaction or a spline term).
I also really appreciated having code that could be easily copied into my DA report during the eight-hour exam. This ended up saving me a ton of time, especially when I was running diagnostics and reporting results. By scaffolding my code ahead of time, I could focus on more substantive parts of my analysis and spend more time actually writing the report, which was much needed since I am a relatively slow writer.
What Could’ve Gone Better:
Of course, there were still hiccups during the exam. The biggest was that the partial residuals function I planned to use did not work for Poisson models with offsets (the function has since been fixed, luckily!). So, I had to pivot to using randomized quantile residuals, which I am less familiar with as a diagnostic for non-linearity. Fortunately for me, none of the covariates that I included in my model ended up having a non-linear relationship with my outcome variable, meaning that I didn’t have to figure out a transformation for any covariates using my randomized quantile residual plots.
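To make the randomized quantile residual idea concrete, here is a minimal sketch in Python using only the standard library. It is not the code I used on the exam, and everything in it is an assumption made for illustration: the function names (`poisson_cdf`, `sample_poisson`, `randomized_quantile_residuals`) are hypothetical, the data are simulated, and the true means stand in for fitted means so that the example stays self-contained. For a count `y` with fitted mean `mu`, the residual is the standard normal quantile of a uniform draw between P(Y ≤ y − 1) and P(Y ≤ y).

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def poisson_cdf(k, mu):
    """P(Y <= k) for Y ~ Poisson(mu); returns 0 for k < 0."""
    if k < 0:
        return 0.0
    term = math.exp(-mu)  # P(Y = 0)
    total = term
    for j in range(1, k + 1):
        term *= mu / j  # P(Y = j) from P(Y = j - 1)
        total += term
    return min(total, 1.0)


def sample_poisson(mu, rng):
    """Draw one Poisson(mu) count with Knuth's multiplication method."""
    threshold = math.exp(-mu)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def randomized_quantile_residuals(y, mu, rng):
    """One randomized quantile residual per (count, mean) pair."""
    std_normal = NormalDist()
    residuals = []
    for y_i, mu_i in zip(y, mu):
        lower = poisson_cdf(y_i - 1, mu_i)
        upper = poisson_cdf(y_i, mu_i)
        u = rng.uniform(lower, upper)
        # Keep u strictly inside (0, 1) so inv_cdf is defined.
        u = min(max(u, 1e-12), 1 - 1e-12)
        residuals.append(std_normal.inv_cdf(u))
    return residuals


# Simulated data (hypothetical): counts with an exposure offset.
rng = random.Random(707)
n = 2000
exposure = [rng.uniform(1.0, 10.0) for _ in range(n)]
x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
beta0, beta1 = 0.5, 0.8  # assumed "true" coefficients
mu = [t * math.exp(beta0 + beta1 * x_i) for t, x_i in zip(exposure, x)]
y = [sample_poisson(m, rng) for m in mu]

residuals = randomized_quantile_residuals(y, mu, rng)
print(round(fmean(residuals), 3), round(stdev(residuals), 3))
```

When the model is correctly specified, these residuals behave like draws from a standard normal distribution, so a plot of the residuals against each covariate should show no trend. A curved pattern in that plot is the signal that a covariate may need a transformation or a spline term.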
If I could go back, I also would have spent less time on my EDA plots during the DA exam. I spent a while trying to make the axis and strip names (for faceted plots) clean and also adjusted the width and height of my figures a lot in an attempt to make them nicely spaced in my DA report. While this probably helped me pass a bit (since it did make my report look nicer), I wish I had used that time to narrow down which limitations or caveats of my model and the data I wanted to discuss in the report. Some sections of my report read like a laundry list of issues with the data and my analysis rather than a focused discussion of which limitations would affect my conclusions and their generalizability.