Data Analysis Exam Thoughts and Materials

PhD Materials
Author

Sara Colando

Published

June 9, 2025

This spring, at the end of my first year, I took the Data Analysis (DA) exam, which is the “qualifying” exam for CMU’s Statistics Ph.D. program. This notebook includes the materials I made in preparation for the DA exam. The majority of these materials are based on the Fall 2024 Regression Analysis (36-707) lecture notes, which can be found here.1

A little background on CMU’s DA Exam from the Statistics Graduate Student Handbook: “At the conclusion of each Spring Semester the Department administers the ‘Data Analysis Exam,’ which is designed to test students’ ability to apply statistical methods to address a substantive, real problem. Students are given eight hours to complete the exam, during which time they analyze the data and write a [ten-page] report to present their analysis and conclusions. The faculty are realistic as to what can be accomplished during the eight-hour period. In grading the exam, the faculty are looking for clear presentation of an appropriate analysis of the data. Emphasis is not placed on technical or mathematical sophistication,” (p. 13).

How I Prepared:

To be fully transparent, beyond taking 36-707 last Fall, I did not do very much to prepare for the DA exam until the week before, when I spent roughly 4-6 hours each day prepping for the exam. That said, I found that a week was sufficient to review important concepts and make templates as well as other notes to use during the exam without being so long that I began to over-complicate concepts and analysis strategies.

In the week leading up to the exam, I did the following to prepare:

  • Read through the 36-707 notes (primarily chapters 4-15) and wrote up notes with full code which could be easily copied and pasted into my own DA exam (my completed notes are available here)

  • Read through completed data analysis exams from some students in upper years of the Ph.D. program

  • Read through my three revised data analysis reports from 36-707

  • Made a “recipes” sheet with the technical conditions for different models as well as the steps I would take for each model type (the recipes document I made with one of my cohort-mates is available here)

  • Made a DA report template that I could fill in during the exam, which included all the rubric items from 36-707 (see Table 1), copied into the relevant report sections.2

Table 1: Rubric items partitioned into their relevant IMRAD report sections (provided in 36-707, Fall 2024).
Report Section Rubric Items
Executive Summary
  • Addresses all substantive questions in substantive terms

  • Summarizes methods used and their limitations

  • Written in terms understandable to a subject-area expert, rather than being written for statisticians

  • Can be read and understood without the rest of the report

Introduction
  • Introduction summarizes questions to be answered and their importance

  • Introduction states theories or alternative explanations to be tested using the data, or outcome to be predicted

  • Introduction describes the source of the data and summarizes its relevance to the problem

  • Briefly states what the analysis achieved and what conclusions can be drawn about the research questions

  • Introduction is readable to an expert on the data’s subject, rather than being written for statisticians

The Data (Data Summary and Exploratory Data Analysis)
  • Size of dataset is given

  • Meaning of all relevant variables given (with units, when known)

  • Distributions of variables checked; notable outliers explained

  • EDA supports the modeling by exploring relationships and variables that will be useful for the model

  • Missing data noted and strategy to account for it explained

  • Important limitations of the data for the research questions noted, including problems with generalizability or omitted variables

Methods
  • Chosen analysis is clearly connected to substantive question

  • Chosen analysis is explained in terms of the substantive question, not purely statistically

  • Report explains choices made in model building, such as how variables are coded or transformed, or which variables are included

  • Analysis is supported by diagnostics or metrics that assess the model’s appropriateness for the data

  • Report explains why particular hypothesis tests are used, including their conditions and why they are reasonably met

  • Text clearly explains why this analysis was chosen over alternatives

  • Caveats and problems are noted and their potential effect on results explained

Results
  • Statistical results answer the questions asked

  • Tests and estimates are presented clearly and accurately

  • Statistical methods correctly implemented

  • When hypothesis tests are used, the null hypothesis being tested is clearly stated, as is the null distribution of the statistic and its degrees of freedom (or, if the bootstrap is used, the method used is stated), ideally in APA format.

  • Whenever possible, sizes of appropriate effects are given, not just their significance, and interpreted in substantive terms.

Discussion
  • Summarizes conclusions presented in report

  • Conclusion explains, in practical terms, how the statistical results answer the practical or research questions

  • Conclusion notes any limitations of results and what could be done to address these

  • Conclusion explains how the noted limitations may affect the practical conclusions presented, if applicable

Other
  • Wherever results appear in report, they are presented with clear measures of uncertainty, such as confidence intervals or standard errors, in APA format

  • References are clearly given to information from outside sources; quotation marks clearly mark any verbatim text from sources

  • Citations formatted in a common style used in statistics journals [I specified the apa.csl file in my YAML header to have my citations done in APA format]

What Went Well:

I found that the process of making the recipes document, in particular, was an extremely helpful study tool for recalling when to use the different techniques we covered in 36-707. This especially paid off since my year’s DA exam used a Poisson model with an offset, which is a model I had only implemented and interpreted the results of a few times.
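
For readers who have not seen this model before, here is a minimal sketch of a Poisson regression with an offset, written in Python with only the standard library. It is not the code from the exam or from the 36-707 notes: the function name `fit_poisson_offset`, the single-covariate setup, and the data are all made up for illustration. The model is log E[y] = log(exposure) + b0 + b1 * x, where the offset log(exposure) has its coefficient fixed at 1, so the coefficients describe rates rather than raw counts.

```python
import math

def fit_poisson_offset(y, x, exposure, n_iter=50, tol=1e-10):
    """Fit log E[y] = log(exposure) + b0 + b1 * x by Newton-Raphson.

    Hypothetical helper for illustration: one covariate plus an intercept.
    """
    offset = [math.log(t) for t in exposure]
    # Start the intercept at the log of the overall rate and the slope at zero.
    b0 = math.log(sum(y) / sum(exposure))
    b1 = 0.0
    for _ in range(n_iter):
        mu = [math.exp(o + b0 + b1 * xi) for o, xi in zip(offset, x)]
        # Score vector: gradient of the Poisson log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        # 2x2 Fisher information (the negative Hessian for the log link).
        h00 = sum(mu)
        h01 = sum(mi * xi for mi, xi in zip(mu, x))
        h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system information * step = score.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Made-up data: event counts in areas with different populations.
events = [2, 4, 6, 6]
population = [100, 200, 100, 200]
treated = [0, 0, 1, 1]

b0, b1 = fit_poisson_offset(events, treated, population)
print(f"baseline rate per person: {math.exp(b0):.4f}")
print(f"rate ratio (treated vs. untreated): {math.exp(b1):.3f}")
```

In this toy data the untreated areas have 6 events per 300 people and the treated areas have 12 per 300, so exp(b1) should come out at about 2. The interpretation to practice is that exp(b1) is a rate ratio: it multiplies the expected count per unit of exposure.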

It was also beneficial to spend time reviewing APA formatting and practicing interpretations for more complicated scenarios (e.g., when there is an interaction or a spline term).

I also really appreciated having code that could be easily copied into my DA report during the eight-hour exam. This ended up saving me a ton of time, especially when I was running diagnostics and reporting results. By scaffolding my code ahead of time, I could focus on more substantive parts of my analysis and spend more time actually writing the report, which was much needed since I am a relatively slow writer.

What Could’ve Gone Better:

Of course, there were still hiccups that came up during the exam – the biggest of which was that the partial residuals function I planned to use did not work for Poisson models with offsets (the function has since been fixed, luckily!). So, I had to pivot to using randomized quantile residuals, which I am less familiar with as a diagnostic for non-linearity. Fortunately for me, none of the covariates that I included in my model ended up having a non-linear relationship with my outcome variable, meaning that I didn’t have to figure out a transformation for any covariates using my randomized quantile residual plots.
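
As a rough illustration of what randomized quantile residuals are, here is a sketch of the usual construction for a Poisson model, again in standard-library Python. It is not the function used on the exam. The names `poisson_cdf` and `randomized_quantile_residuals` and the numbers below are hypothetical. For each observation, draw u uniformly between F(y - 1) and F(y), where F is the fitted Poisson CDF, then map u through the standard normal quantile function.

```python
import math
import random
from statistics import NormalDist

def poisson_cdf(k, mu):
    """P(Y <= k) for Y ~ Poisson(mu); returns 0 when k < 0."""
    if k < 0:
        return 0.0
    term = math.exp(-mu)  # P(Y = 0)
    total = term
    for j in range(1, k + 1):
        term *= mu / j  # P(Y = j) computed from P(Y = j - 1)
        total += term
    return min(total, 1.0)

def randomized_quantile_residuals(y, mu, rng=None):
    """Randomized quantile residuals for counts y with fitted means mu."""
    rng = rng or random.Random()
    std_normal = NormalDist()
    residuals = []
    for yi, mi in zip(y, mu):
        lower = poisson_cdf(yi - 1, mi)
        upper = poisson_cdf(yi, mi)
        # Spread the discrete probability mass uniformly over (lower, upper).
        u = rng.uniform(lower, upper)
        # Keep u strictly inside (0, 1), where inv_cdf is defined.
        u = min(max(u, 1e-12), 1 - 1e-12)
        residuals.append(std_normal.inv_cdf(u))
    return residuals

# Made-up observed counts and fitted means, for illustration only.
observed = [0, 3, 1, 5]
fitted = [1.0, 2.0, 1.5, 4.0]
print(randomized_quantile_residuals(observed, fitted, random.Random(1)))
```

If the model is correctly specified, these residuals are approximately standard normal. Plotting them against each covariate and looking for a trend or curve is one way to use them as a check for non-linearity. With an offset, the fitted means passed in should already include the exposure.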

If I could go back, I also would have spent less time on my EDA plots during the DA exam. I spent a while trying to make the axis and strip names (for faceted plots) clean and also adjusted the width and height of my figures a lot in an attempt to make them nicely spaced in my DA report. While this probably helped me pass a bit (since it did make my report look nicer), I wish I had used that time to better narrow down which limitations or caveats of my model and the data I wanted to talk about in the report. Some sections of my report read like a laundry list of issues with the data and my analysis rather than a focused discussion of which limitations would impact my conclusions and their generalizability.

Footnotes

  1. Thanks to Alex Reinhart for compiling these!

  2. I did an Executive Summary + IMRAD (Introduction, Methods, Results, and Discussion) report for the exam.