Holly Cui
by Holly Cui

Tags

  • Causal Inference
  • Estimation
  • Statistical Analysis

These two projects were completed with Elyse McFalls for the Fall 2023 section of STA 522: Study Design and Causal Inference at Duke University.

🚩 Keywords: Stratified Sampling Design, Horvitz-Thompson Estimator, Demographic Data Analysis, Statistical Inference, Census Data Interpretation GitHub

The project focuses on using county-level data to understand various aspects of the United States, such as population density, demographics, and political preferences. The data was sourced from the U.S. Census Bureau and MIT’s Election Data and Science Labs, covering the period from 2010 to 2020. A stratified sampling design was chosen, with each of the 50 states serving as individual strata, and the District of Columbia treated as a singleton stratum. The sample size was adjusted to ensure representation from every state, and the sampling frame excluded U.S. territories and remote islands due to logistical challenges and their limited impact on national survey results.

The methodology involved a detailed data collection process for county-level political preferences and demographic compositions, focusing on the Hispanic or Latino population and the aging U.S. population. The project employed the Horvitz-Thompson estimator for statistical analysis, ensuring that each state in the U.S. was accounted for and reducing the variance in population estimates. The data analysis involved estimating the average population density per county, the total number of Hispanic or Latino individuals, the change in this population between 2010 and 2020, voting preferences in the 2020 election, and the proportion of the population aged 65 or above in both 2020 and 2010.

The results revealed insights into various aspects of the U.S. population and political landscape. Key findings included estimates of the average population density per county, the total number of Hispanic or Latino individuals in 2020, and the change in this demographic over a decade. The study also provided estimates of voting preferences in the 2020 election and highlighted the increasing proportion of the U.S. population aged 65 or above, indicating a trend toward an aging population. The overlapping confidence intervals in certain estimates suggested the need for further research to understand these demographic shifts fully. Future studies might explore the factors driving these changes and their implications for health care policies, social security benefits, and other societal aspects.

Project 2 - Exploring Paper Helicopter Flight Duration: A Full Factorial Experimental Analysis

🚩 Keywords: Full Factorial Design, Causal Inference, Experimental Data Analysis GitHub

The project uses a full factorial design to investigate the effects of various design features on paper helicopters’ flight duration. Specifically, we focuses on four major factors in helicopter designs that are commonly assumed to effective – the rotor length, the leg length, the leg width, and the paperclip-on-leg maneuver. By altering the combinations of these factors, we aim to answer the following research questions:

  1. Which factors seem to be the most important for making helicopters that fly longer (in terms of time)?
  2. Is there any evidence that the effect of rotor length differs by leg width?
  3. What would be recommended as the ideal combination to make the helicopter fly long (in terms of time)?

In this study, we performed a full factorial randomized experiment on all 2^4 combinations of the main factors. Each combination is sampled 5 flights, leading to a total of 80 observations as our sample size. During the sampling process, the treatment values are randomly permuted under the guideline of a randomized experiment, and the flight duration is sequentially collected by dropping the assigned combination of paper helicopter from a fixed height of 6’6’’.

Upon designing the experiment, we believe that the full factorial methodology presents the best balance between reliability and convenience. Since our research questions are interested in determining the best combination of all four factors, we wanted to make sure that the main effects of each factor as well as the interaction effects can be estimated without further assumptions. Thus, the alternative of a fractional factorial design would either not allow us to calculate all effects due to aliasing, or require tedious planning on how to optimally craft the design. Considering 2^4 combinations is an acceptable level of time commitment on data collection, we decided to pursue on the full factorial design.

Through detailed data collection and analysis, including linear regression models and F-tests, the report identifies the most significant factors influencing flight duration. The findings suggest an ideal combination of features for maximum flight time, providing valuable insights into the practical application of experimental design and causal inference in aerodynamics. For a comprehensive understanding, please refer to the detailed report and our experimental data.