Data Analysis Exam Thoughts and Materials

Published

June 9, 2025

This Spring, at the end of my first year, I took the Data Analysis (DA) exam, which is the “qualifying” exam for CMU’s Statistics PhD program. Most of the materials in this notebook are based on the Fall 2024 Regression Analysis (36-707) lecture notes, which can be found here.¹

A little background on CMU’s DA Exam from the Statistics Graduate Student Handbook: “At the conclusion of each Spring Semester the Department administers the ‘Data Analysis Exam,’ which is designed to test students’ ability to apply statistical methods to address a substantive, real problem. Students are given eight hours to complete the exam, during which time they analyze the data and write a [ten page] report to present their analysis and conclusions. The faculty are realistic as to what can be accomplished during the eight-hour period. In grading the exam, the faculty are looking for clear presentation of an appropriate analysis of the data. Emphasis is not placed on technical or mathematical sophistication,” (p. 13).

How I Prepared:

In full transparency, beyond taking 36-707 last Fall, I did not do much to prepare for the DA exam until the week before, when I spent roughly 4-6 hours each day prepping. That said, I found that a week was enough time to review important concepts and make templates and other notes to use during the exam, without being so long that I began to over-complicate concepts and my analysis strategies.

In the week leading up to the exam, I did the following to prepare:

Read through the 36-707 notes (primarily chapters 4-15) and wrote up notes with full code that could be easily copied and pasted into my own DA exam (my completed notes are available here)
Read through completed data analysis exams from some students in upper years of the PhD program
Read through my three revised data analysis reports from 36-707, paying special attention to things I thought were effective and any comments I received on revisions
Made a “recipes” sheet with the technical conditions for different models as well as the steps I would take for each model type (the recipes document I made with one of my cohort-mates is available here)
Made a DA report template that I could fill in during the exam, which included rubric items from 36-707 (see Table 1), copied into the relevant report sections.²

Table 1: Rubric items partitioned into their relevant IMRaD report sections (provided in 36-707, Fall 2024).

Report Section	Rubric Items
Executive Summary	Addresses all substantive questions in substantive terms. Summarizes methods used and their limitations. Written in terms understandable to a subject-area expert, rather than being written for statisticians. Can be read and understood without the rest of the report.
Introduction	Introduction summarizes questions to be answered and their importance. Introduction states theories or alternative explanations to be tested using the data, or outcome to be predicted. Introduction describes the source of the data and summarizes its relevance to the problem. Briefly states what the analysis achieved and what conclusions can be drawn about the research questions. Introduction is readable to an expert on the data’s subject, rather than being written for statisticians.
The Data (Data Summary and Exploratory Data Analysis)	Size of dataset is given. Meaning of all relevant variables given (with units, when known). Distributions of variables checked; notable outliers explained. EDA supports the modeling by exploring relationships and variables that will be useful for model. Missing data noted and strategy to account for it explained. Important limitations of the data for the research questions noted, including problems with generalizability or omitted variables.
Methods	Chosen analysis is clearly connected to substantive question. Chosen analysis is explained in terms of the substantive question, not purely statistically. Report explains choices made in model building, such as how variables are coded or transformed, or which variables are included. Analysis is supported by diagnostics or metrics that assess the model’s appropriateness for the data. Report explains why particular hypothesis tests are used, including their conditions and why they are reasonably met. Text clearly explains why this analysis was chosen over alternatives. Caveats and problems are noted and their potential effect on results explained.
Results	Statistical results answer the questions asked. Tests and estimates are presented clearly and accurately. Statistical methods correctly implemented. When hypothesis tests are used, the null hypothesis being tested is clearly stated, as is the null distribution of the statistic and its degrees of freedom (or, if the bootstrap is used, the method used is stated), ideally in APA format. Whenever possible, sizes of appropriate effects are given, not just their significance, and interpreted in substantive terms.
Discussion	Summarizes conclusions presented in report. Conclusion explains in practical terms how statistical results answer the practical or research questions. Conclusion notes any limitations of results and what could be done to address these. Conclusion explains how limitations may affect the practical conclusions presented, if applicable.
Other	Wherever results appear in report, they are presented with clear measures of uncertainty, such as confidence intervals or standard errors, in APA format. References are clearly given to information from outside sources; quotation marks clearly mark any verbatim text from sources. Citations formatted in a common style used in statistics journals. [I specified the `apa.csl` file in my YAML header to have my citations done in APA format]
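Several rubric items ask for results "in APA format". As a small illustration of what that means for a test statistic, here is a Python sketch of a formatting helper. The function names `apa_p` and `apa_test` are hypothetical, and the numbers in the example are made up. It follows the usual APA conventions of dropping the leading zero on p-values and reporting very small ones as "p < .001"; plain text cannot show the italics APA also asks for.

```python
def apa_p(p):
    """Format a p-value APA style: no leading zero, floor at '< .001'."""
    if p < 0.001:
        return "p < .001"
    return "p = " + f"{p:.3f}".lstrip("0")


def apa_test(stat_name, df, stat, p):
    """Format a test result as 'name(df) = value, p = ...'."""
    return f"{stat_name}({df}) = {stat:.2f}, {apa_p(p)}"


# Made-up numbers, purely for illustration
print(apa_test("t", 48, 2.3141, 0.0249))  # prints: t(48) = 2.31, p = .025
```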

Especially Helpful Preparations:

Making the recipes document!!!! On the DA exam, I used a Poisson model with an offset, which is a model I had only implemented and interpreted a few times and might not have remembered if I had not made the recipes document ahead of time.
Reviewing APA formatting and practicing interpretations for more complicated scenarios (e.g., when there is an interaction or a spline term).
Putting all my code into one document that could be easily copied into my DA report during the eight-hour exam. This ended up saving me a ton of time, especially when I was running diagnostics and reporting results. By scaffolding my code ahead of time, I could focus on the more substantive parts of my analysis and spend more time actually writing the report, which was much needed since I am a relatively slow writer.
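For readers who have not met a Poisson model with an offset, here is a minimal sketch of the idea in Python. It is not the code from the exam or the recipes document. The model is log E[y] = log(exposure) + b0 + b1·x, so exp(b1) is a rate ratio per unit of exposure. The data and the function name `fit_poisson_offset` are hypothetical, and the fit is a hand-rolled Newton-Raphson for a single covariate; in practice a standard GLM routine would do this.

```python
import math


def fit_poisson_offset(x, y, exposure, max_iter=50, tol=1e-10):
    """Fit log E[y] = log(exposure) + b0 + b1 * x by Newton-Raphson.

    Returns (b0, b1, se_b1), where se_b1 is the Wald standard error of b1.
    """
    offset = [math.log(e) for e in exposure]
    b0 = math.log(sum(y) / sum(exposure))  # start at the overall log rate
    b1 = 0.0

    def information(b0, b1):
        # Fitted means and the entries of X^T diag(mu) X
        mu = [math.exp(o + b0 + b1 * xi) for o, xi in zip(offset, x)]
        i00 = sum(mu)
        i01 = sum(xi * m for xi, m in zip(x, mu))
        i11 = sum(xi * xi * m for xi, m in zip(x, mu))
        return mu, i00, i01, i11

    for _ in range(max_iter):
        mu, i00, i01, i11 = information(b0, b1)
        # Score vector X^T (y - mu)
        u0 = sum(yi - m for yi, m in zip(y, mu))
        u1 = sum(xi * (yi - m) for xi, yi, m in zip(x, y, mu))
        det = i00 * i11 - i01 * i01
        # Newton step: solve the 2x2 system (information) * delta = score
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break

    # Variance of b1 is the matching entry of the inverse information
    _, i00, i01, i11 = information(b0, b1)
    se_b1 = math.sqrt(i00 / (i00 * i11 - i01 * i01))
    return b0, b1, se_b1


# Hypothetical data: event counts, a binary covariate, and time at risk
x = [0, 0, 0, 1, 1, 1]
y = [2, 5, 3, 9, 4, 12]
exposure = [100, 250, 150, 200, 100, 300]

b0, b1, se_b1 = fit_poisson_offset(x, y, exposure)
rate_ratio = math.exp(b1)
ci = (math.exp(b1 - 1.96 * se_b1), math.exp(b1 + 1.96 * se_b1))
print(f"rate ratio = {rate_ratio:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}]")
```

With a binary covariate, the fitted rate ratio is just the ratio of the two groups' events per unit of exposure, which makes a handy sanity check on the interpretation.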

What Could’ve Gone Better:

Better flexibility in how to approach model diagnostics. The partial residuals function I planned to use did not work for Poisson models with offsets (the function has since been fixed). So, I had to pivot to randomized quantile residuals, which I had not prepared to use and was less familiar with. Fortunately for me, none of the covariates that I included in my model ended up having a non-linear relationship with my outcome variable, but if they had, I think it would have taken me more time to think through transformations and similar fixes.
Spending less time on EDA plots. I spent a while trying to make the axis and strip names (for faceted plots) clean and also adjusted the width and height of my figures a lot in an attempt to make them nicely spaced in my DA report. While this probably helped me pass a bit (since it did make my report look nicer), I wish I had used that time to better narrow down which limitations or caveats of my model and the data I talked about in my report. Some sections of my report read like a laundry list of issues with the data and my analysis rather than a focused discussion of which limitations would impact my conclusions and their generalizability.
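For anyone else caught off guard by randomized quantile residuals, here is a rough Python sketch of how they are computed for a Poisson model. It is not the code from the exam. For each observation, draw u uniformly between F(y - 1) and F(y), where F is the fitted Poisson CDF, then map u through the standard normal quantile function. The function names and the fitted means in the example are hypothetical.

```python
import math
import random
from statistics import NormalDist


def poisson_cdf(k, mu):
    """P(Y <= k) for Y ~ Poisson(mu); zero when k < 0."""
    if k < 0:
        return 0.0
    term = math.exp(-mu)  # P(Y = 0)
    total = term
    for j in range(1, k + 1):
        term *= mu / j  # P(Y = j) from P(Y = j - 1)
        total += term
    return min(total, 1.0)


def randomized_quantile_residuals(y, mu, seed=0):
    """Randomized quantile residuals for counts y with fitted means mu."""
    rng = random.Random(seed)  # seeded so the residuals are reproducible
    std_normal = NormalDist()
    residuals = []
    for yi, mi in zip(y, mu):
        lower = poisson_cdf(yi - 1, mi)
        upper = poisson_cdf(yi, mi)
        u = rng.uniform(lower, upper)
        # Keep u strictly inside (0, 1) so the normal quantile is defined
        u = min(max(u, 1e-12), 1 - 1e-12)
        residuals.append(std_normal.inv_cdf(u))
    return residuals


# Hypothetical counts and fitted means (the means already include any offset)
y = [2, 5, 3, 9, 4, 12]
mu = [2.0, 5.0, 3.0, 8.33, 4.17, 12.5]
for yi, ri in zip(y, randomized_quantile_residuals(y, mu)):
    print(f"y = {yi:2d}, residual = {ri:+.2f}")
```

If the model is well specified, these residuals are approximately standard normal, so they can be plotted against fitted values or covariates and read much like residuals from a linear model. Because of the random draw, the values change with the seed.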

Footnotes

¹ Thanks to Alex Reinhart for compiling these!
² I did an Executive Summary + IMRaD (Introduction, Methods, Results, and Discussion) report for the exam.