1 Abstract

When researchers want to test theories about causal relationships, an exogenous manipulation of the theorized cause, as commonly implemented in Randomized Controlled Trials (RCTs), is often considered the scientific 'gold standard' for making reliable causal inferences (Meldrum 2000). However, since RCTs require the cause to be manipulable, they cannot be implemented by researchers whose theorized causes are non-manipulable, leaving them with possibly endogenous theorized causes that may be subject to confounding, which makes causal inference unreliable (cf. Angrist and Pischke 2015: 1-46; Angrist and Pischke 2009: 15-22; Stock and Watson 2019: 48-49; Wooldridge 2019: 54-56). While causal frameworks and methods have been devised to provide reliable causal inference (e.g., Abadie et al. 2010; Dunning 2008; Rosenbaum and Rubin 1983), these generally rely on strict assumptions that are dubious, untestable, and, unlike in RCTs, not ensured by design (cf. Rohrer 2018; Samii 2016; Wan et al. 2024). If these assumptions are violated, results are put at risk of confounding and bias (McElreath 2019; Pearl 2000; Rohrer 2018; Samii 2016; Stuart 2010; Wan et al. 2024), reducing the reliability of causal inferences derived from such research. A more reliable alternative to these frameworks would arguably remedy the inability to manipulate the theorized cause, enabling researchers to exploit the strengths of the exogenous manipulation underlying the logic of RCTs, which, ceteris paribus, ensures reliable causal inference by design (cf. Angrist and Pischke 2015: 1-46; Angrist and Pischke 2009: 15-22; Stock and Watson 2019: 48-49; Wooldridge 2019: 54-56). Since many instances of non-manipulable theorized causes are found in human subjects due to ethical and/or practical concerns (e.g., Buunk and van Vugt 2013: 17), the development of an alternative causal framework would prove especially useful for the social and behavioral sciences (see Wright 2015). Such a framework could exploit the fact that many social studies measure human behavior through expressions of natural language, creating a potential avenue for making non-manipulable theorized causes manipulable with the use of Large Language Models (LLMs, cf. Shanahan 2024), which are specifically trained to mimic such expressions. This prospect is investigated in this paper, which introduces the Synthetic Causal Framework (SCF), a novel solution for making non-manipulable theorized causes manipulable. Formalized as an extension of the Neyman-Rubin Potential Outcomes Framework (cf. Imbens and Rubin 2015; Neyman 1923; Rubin 2005), the SCF offers a synthetic solution to the Fundamental Problem of Causal Inference (cf. Holland 1986), where simulated data and real-world data are synthesized using LLMs to produce synthetic units. To make the behavior of these units realistic and generalizable to human populations of interest, each unit has a unique life-history (cf. Bertaux 1981) generated by the LLM based on demographic characteristics that are relevant to the topic being studied and representative of the population of interest. What would generally be considered a non-manipulable theorized cause can then be manipulated among these LLM-driven synthetic units, with multiple simultaneous and independent manipulations being possible for each unit. By simultaneously and independently measuring the counterfactual outcome for each manipulation condition using parallel processing (cf. Ozdoganlar and Endres 2021[1999]), a procedure introduced here as 'parallel worlds'-estimation, unbiased estimates of causal effects can then be derived using standard statistical models (cf. Westreich et al. 2015). By making a non-manipulable theorized cause manipulable, the SCF thus allows eliminating most forms of confounding by design (cf. Angrist and Pischke 2015: 1-46; Angrist and Pischke 2009: 15-22; Stock and Watson 2019: 48-49; Wooldridge 2019: 54-56), giving the framework an a priori expectedly high internal validity (cf. Stock and Watson 2019: 478-481). A relevant concern is the external validity (cf. Stock and Watson 2019: 481; Gelman et al. 2021: 354-355) of the framework, that is, the generalizability of its results to human populations. To test this generalizability, and to provide a proof-of-concept, the SCF is implemented in a Synthetically Controlled Trial (SCT), a similarly novel SCF-based extension of an RCT, in an effort to replicate the established human-based relationship between deservingness and welfare support (cf. Aarøe and Petersen 2014; Kane et al. 2023; Mummolo and Peterson 2019) among synthetic units. Using the relatively cost-efficient and high-performing GPT-4o-mini LLM by OpenAI (2024a), a nationally representative sample of 1,009 synthetic respondents, and Bayesian inference (cf. Bürkner 2017, 2018), supplemented by Frequentist inference for robustness (cf. Bates et al. 2015), it is demonstrated that causal results derived from the SCF closely replicate results derived from human respondents. By merit of its high internal validity, and evidence in favor of a high external validity, this paper tentatively concludes that, ceteris paribus, the Synthetic Causal Framework enables social and behavioral science researchers to draw valid and reliable causal inference when the theorized cause is non-manipulable. Practical implications of the SCF are discussed, including its substantially reduced costs compared to human studies, as well as its assumptions, limitations, future areas of application, and the need for further validation of the generalizability of SCF-based results to human populations.

Keywords: Synthetic Causal Framework, SCF, Synthetically Controlled Trial, SCT, 'Parallel-worlds' estimation, Synthetic units, Simulated units, Causal inference, Causal framework, Causal methodology, Causal design, Experimental design, Potential Outcomes Framework, Fundamental Problem of Causal Inference, Natural Language Processing, NLP, Large Language Model, LLM, Generative Pretrained Transformer, GPT, Artificial intelligence, AI, Quantitative social science, Political methodology, Political science, Deservingness heuristic, Welfare support, Welfare attitudes, Evolutionary psychology, Conceptual replication


2 Introduction

In their effort to understand the physical world, researchers often produce theories about causal relationships between phenomena, that is, whether one phenomenon causes another phenomenon (Aronson et al. 2018: 54-61; Angrist and Pischke 2015; Angrist and Pischke 2009; Samii 2016; Sniderman and Grob 1996; Toshkov 2016: 35-37, 145-165). Consistent with various conceptualizations of the scientific method (Eysenck and Keane 2020: 667; Aronson et al. 2018: 45-46; Toshkov 2016: 4-6, 37-41; Farrell 2014: 10-23; Lakatos 1970; Nola and Sankey 2007; Popper 2002[1959]), researchers should seek to falsify proposed causal theories by deriving testable hypotheses from the claims made by the theory. While no theory can ever be proven correct (Lakatos 1970; Miller 1974; Popper 2002[1959]; see also Dellsén 2024)2, a causal theory whose hypotheses pass repeated attempts at falsification can, for instrumental purposes, be scientifically treated as tentatively correct and added to the cumulative pile of human knowledge. That is, at least until the theory is eventually falsified and/or superseded by a better theory (Dellsén 2024; Lakatos 1970; Miller 1974; Popper 2002[1959]). While this view of the scientific method can be considered ideal, if not 'utopic' (Feyerabend 1993; Kuhn 1970; Nosek et al. 2012; Nosek and Bar-Anan 2012; Uhlmann et al. 2019), in part due to its flawed implementations across disciplines (Anderson 2020: 11; Bishop 2006; Chopra et al. 2023; Christensen et al. 2019; Eysenck and Keane 2020: 34-36; Gigerenzer 2004; Lenz and Sahn 2021; Open Science Collaboration 2015; Ritchie 2021; Simmons et al. 2011), it helps emphasize that tests of causality need to be reliable, so that true hypotheses are not wrongly rejected and false hypotheses are not wrongly retained (Lakatos 1970; Neyman and Pearson 1933).

Causal inference involves estimating a theorized effect of a potential cause on an outcome in a theoretically-outlined context (Angrist and Pischke 2015; Angrist and Pischke 2009; Samii 2016); for insights to be scientifically reliable, it is therefore critical that the estimated effect is valid and can be generalized to the intended context. In line with the aforementioned views on the scientific method, this is because in the commonly applied Null-Hypothesis Significance Testing framework (NHST, for criticisms, see, e.g., Clayton 2021; Cohen 1994; Gigerenzer 2004, 2018), a nil hypothesis of no effect (Cohen 1994: 999-1000) in that particular context is either rejected or fails to be rejected based on this estimate.3 Accordingly, the estimate must not reflect a falsely identified causal relationship, which would constitute a false positive result (i.e., a Type I error, Neyman and Pearson 1933), nor must it reflect a falsely unidentified causal relationship, and thus be a false negative (i.e., a Type II error, Neyman and Pearson 1933). Reliability is further threatened if the estimate fails to reflect the true direction (i.e., a Type S error, Gelman et al. 2021: 59; Gelman and Carlin 2014) or magnitude (i.e., a Type M error, Gelman et al. 2021: 59; Gelman and Carlin 2014) of the effect in the theoretically-outlined context.

These errors of estimation and generalization can arise from confounding (Fisher 1935; Greenland et al. 1999; VanderWeele and Shpitser 2013), where a third phenomenon impacts both the theorized cause and the outcome, or from deriving the estimate from a sample unrepresentative of the theoretically-outlined context (Aronow and Samii 2016; Deaton and Cartwright 2018; Samii 2016). These can be serious errors that produce invalid and unreliable scientific insights (Aronow and Samii 2016; Gelman and Carlin 2014; Samii 2016). Such errors are preferably eliminated by design (cf. Rohrer 2018: 34; Stuart 2010: 3; Samii 2016), where (1) the theorized cause is manipulated in an exogenous manner (Angrist and Pischke 2015: 1-46; Angrist and Pischke 2009: 15-22; Stock and Watson 2019: 48-49), the exogeneity of the manipulation ensuring that the result cannot be confounded, since confounders cannot affect the theorized cause; and (2) the units from which the causal estimate is derived are representative of the context to which the theory supposedly applies (Stock and Watson 2019: 108; Thompson 2012). While a representative sample can generally be achieved with random sampling (Stock and Watson 2019: 108; Thompson 2012), an exogenous manipulation of the theorized cause is most often achieved in a Randomized Controlled Trial (RCT, Coover and Angell 1907; Peirce and Jastrow 1885; Woodworth and Thorndike 1901). This is an experimental design where researchers employ a random assignment mechanism (RAM) to manipulate the theorized cause, its randomness serving to ensure that the manipulation occurs exogenously of the influence of confounders (Angrist and Pischke 2015: 1-46; Angrist and Pischke 2009: 15-22; Stock and Watson 2019: 48-49; Wooldridge 2019: 54-56). With the relatively high internal validity of results provided by an exogenous manipulation (Stock and Watson 2019: 478-481), coupled with a relatively high external validity provided by representative units (Stock and Watson 2019: 481; Gelman et al. 2021: 354-355), RCTs are generally considered the 'gold standard' for making reliable causal inference (Meldrum 2000; Samii 2016; though for general criticisms see, e.g., Deaton and Cartwright 2018). A limitation of RCTs, however, is that randomization requires the theorized cause to be manipulable by the RAM. This means that for any instance where the theorized cause is non-manipulable in this way (e.g., Buunk and van Vugt 2013: 17), researchers cannot employ the design of RCTs to achieve reliable causal inference, leaving their results open to confounding and wrong scientific conclusions.

Methodologists recognize that reliable causal inference is paramount to the scientific endeavor (Aronson et al. 2018: 54-61; Angrist and Pischke 2015: 1-46; Angrist and Pischke 2009: 15-22; Samii 2016; Stock and Watson 2019: 48-49), and for instances with non-manipulable theorized causes, they have formalized frameworks to help researchers identify premises for achieving reliable causal inference (e.g., Imbens and Rubin 2015; Pearl 1993, 1995), as well as methods whose reliability depends on those frameworks (Abadie et al. 2010, 2015; Abadie and L'Hour 2021; e.g., Angrist et al. 1996; Dunning 2008). However, these methods generally require making strict assumptions that are a priori dubious, untestable, and not ensured by design (e.g., Rohrer 2018; Samii 2016; Stuart 2010). For example, the methodology of matching requires making the dubious and untestable assumption of no unobserved confounders to achieve reliable causal inference (Ho et al. 2007: 206), which, if violated, makes causal inference unreliable by re-introducing the issue of confounding (e.g., Wan et al. 2024; see also Pearl 2000). Since researchers cannot assume a position of agnosticism regarding the viability of their assumptions, the burden of proof being upon the researchers making a causal claim (cf. Russo 2019), concerns are justified regarding the ability of such methods to provide reliable causal inference. The current state of causal inference is thus one where researchers with non-manipulable theorized causes often have to rely on methods whose ability to provide reliable causal inference is justifiably dubious.

The current shortcomings, however, may prove solvable given the accelerating advancements in Natural Language Processing (NLP, Pillai and Tedesco 2023). Here, Large Language Models (LLMs, Shanahan 2024) are proving increasingly capable of simulating human behavior as expressed through text. Such models, which include the popular ChatGPT by OpenAI (2022), have demonstrated human-comparable cognitive skills in the fields of logic, mathematics, and communication (OpenAI 2024a; 2024b; 2024c; Yenduri et al. 2024). This potential has been recognized by researchers across disciplines (e.g., Bail 2024), and LLMs have, for example, already seen success in simulating patients that help train medical doctors (Holderried et al. 2024; Pang et al. 2024). A theoretical foundation for simulating human behavior with LLMs is that these models have been trained on huge datasets containing text-based expressions of real-life human behavior (OpenAI 2024a; 2024b; 2024c; 2022; Ornstein et al. 2024), expectedly allowing them to infer how humans generally respond to different inputs. Researchers recognize that natural language is a rich medium for capturing psychological and social behavior (Tadesse et al. 2018), and text-based expressions have been demonstrated to predict behavioral characteristics such as personality (Ramezani et al. 2022; Tadesse et al. 2018) and political affiliation (Khan and Khan 2024). Developing a causal framework grounded in LLMs is further advantageous when considering that LLMs can be easily controlled through prompts (OpenAI 2024a; 2024b; 2024c; Ornstein et al. 2024), which could be used to create a simulated research environment (cf. Holderried et al. 2024) consisting of LLM-driven subjects, prompted with information that produces realistic behavior. In such an environment, subjects could be manipulated in many more ways than humans can, possibly allowing the manipulation of previously non-manipulable theorized causes. Should such a framework be feasible, it would enable numerous researchers to conduct studies using LLMs with an internal validity comparable to RCTs (cf. Angrist and Pischke 2015: 1-46; Angrist and Pischke 2009: 15-22; Stock and Watson 2019: 48-49; Wooldridge 2019: 54-56). While the external validity of such a framework could reasonably be questioned, since LLM-based results may not generalize to humans, this is an empirical question that, similar to the validation of existing frameworks and methods (e.g., Hainmueller et al. 2015; Sheagley and Clifford 2025), could be answered by comparing LLM-based results with human-based results on phenomena where the theorized cause is manipulable. Should this causal framework be found to provide both a high internal and external validity, it could allow researchers to more generally draw reliable causal inference in instances where the theorized cause would normally be non-manipulable. The prospect of developing such a causal framework is examined in this paper, with the aim formalized as the following research question:

RQ: Can an LLM-based causal framework be developed that provides reliable causal inference for instances where the theorized cause is otherwise non-manipulable?

To answer this research question, the paper reviews causal inference as conceptualized by the Neyman-Rubin Potential Outcomes Framework (POF, Imbens and Rubin 2015; Neyman 1923; Rubin 2005), the detrimental issues of confounding, and the limitations of existing methodologies in enabling reliable causal inference. Using the POF, a novel Synthetic Causal Framework (SCF) is then proposed to solve these limitations by offering an alternative route to reliable causal inference. The SCF is designed for researchers who are interested in assessing the potential effect of some theorized cause on an outcome in a human population but are unable to manipulate this theorized cause using the randomization of RCTs or similar exogenous manipulation procedures. The theoretical basis of the SCF builds on the established link between language and behavior (cf. Bandura 1999: 195; Holtgraves 2013; Lucy 1992; Pennebaker et al. 2003; Whorf 2011[1956]), with a critical assumption being that the theorized effect is meaningfully expressible through natural language, which would arguably be necessary for Large Language Models (LLMs, Shanahan 2024) to emulate it. The SCF specifically involves simulating units whose characteristics are sampled in proportion to their prevalence in a human population of interest. Such characteristics are provided to an LLM together with a prompt instructing the model to generate unit life-histories (cf. Bertaux 1981). These life-histories are to be based on the assigned characteristics and the real-life data that the LLMs have been trained on, thus infusing simulated characteristics with real-world nuance. Units produced through this synthesis of simulated characteristics and learned real-world behavior are conceptualized as synthetic units, a term based on Hegelian dialectics (cf. Fichte 1970), introduced to distinguish these units from purely simulated units not infused with information from real-world data. Using prompts, a synthetic environment is created where the LLM is instructed to behave as the synthetic units, basing its behavior on the generated life-histories. In this environment, which is completely under the control of researchers, the theorized cause can then be manipulated, with the SCF specifically enabling each synthetic unit to be simultaneously and independently exposed to varying levels of manipulation. The simultaneity of the manipulations serves to eliminate time-based confounding (Mansournia et al. 2017), while the independence of the manipulations ensures that the LLM is unaware of previous and future exposures to treatment, which serves to eliminate spill-over (cf. Imbens and Rubin 2015: 12-13; Stuart 2010: 3) as well as pre- and post-manipulation biases (cf. Linos and Twist 2018; Montgomery et al. 2018). With the counterfactual outcomes of each level of manipulation having been measured, intra-unit comparisons can be made to eliminate all unit- and time-invariant confounders (cf. Westreich et al. 2015), enabling the causal effect for each synthetic unit to be reliably identified. Based on these stipulations, the SCF is defined as a theoretical framework for drawing reliable causal inference when the theorized cause is non-manipulable in human populations, achieved through the use of LLM-generated synthetic units that enable independent and simultaneous intra-unit comparisons of counterfactual outcomes.
While the SCF developed in this paper would a priori possess a high internal validity, criticisms of similar methods (e.g., Deaton and Cartwright 2018) motivate concerns about the generalizability of synthetically-derived results to human subjects, which require empirical validation.

To assess the generalizability of the SCF, and provide a proof-of-concept, the paper seeks to replicate, using synthetic units, the robustly established causal relationship between deservingness and welfare support in human populations (cf. Aarøe and Petersen 2014; Kane et al. 2023; Mummolo and Peterson 2019). With the manipulation being a text-based vignette using natural language, the 'original' RCT-based study (Aarøe and Petersen 2014) is replicated (n = 1,009) in the form of a novel Synthetically Controlled Trial (SCT), an SCF-based extension of the RCT. Unlike an RCT, which relies on randomization to assign each subject to a single condition, an SCT is capable of independently and simultaneously assigning all subjects to every manipulation condition. With adult respondents from the United States (US) being used in the original and replication studies, population estimates derived from IPUMS (Ruggles et al. 2024) and ANES (2021) are used in the SCT to jointly simulate common demographic characteristics proportional to their prevalence in the US. Employing the GPT-4o-mini LLM by OpenAI (2024a), life-histories are generated from these characteristics to produce synthetic respondents. With GPT-4o-mini controlling the behavior of each synthetic respondent, the SCT then conceptually replicates the survey environment and vignettes used in Aarøe and Petersen (2014). In this synthetic environment, each synthetic respondent is then independently and simultaneously exposed to vignettes detailing a 'lazy' welfare recipient and an 'unlucky' welfare recipient, measuring the synthetic respondents' counterfactual levels of support for stricter requirements for the welfare eligibility of each recipient. The analysis then employs a multilevel Bayesian model (cf. Bürkner 2017, 2018), with multilevel Frequentist models for robustness (cf. Bates et al. 2015), producing results that closely replicate the results from the human-based populations. Having demonstrated an implementation of the SCF, and produced evidence that causal results derived using the SCF generalize to human populations, its negligible implementation costs, assumptions, limitations, and potential improvements are discussed, followed by a conclusion about the future use of the SCF.
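
To make the SCT procedure concrete, the following minimal Python sketch illustrates one 'parallel worlds' step: generating a synthetic respondent's life-history from sampled characteristics and exposing that same respondent, in separate and stateless calls, to both vignette conditions. The sketch assumes the `openai` Python package and an API key in the environment; the prompts, the `ask` helper, the characteristic values, and the vignette texts are hypothetical placeholders, not the study's actual materials.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable
MODEL = "gpt-4o-mini"

def ask(system: str, user: str) -> str:
    """Send one independent, stateless request to the LLM."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return response.choices[0].message.content

# (1) Synthesize a life-history from sampled demographic characteristics
# (hypothetical values; the paper samples these from IPUMS/ANES estimates).
characteristics = {"age": 47, "gender": "female", "state": "Ohio"}
life_history = ask(
    "You write realistic first-person life-histories.",
    f"Write a brief life-history for a person with: {characteristics}",
)

# (2) Expose the same synthetic respondent to every condition in separate,
# stateless calls, so no condition can 'see' the others (no spill-over).
# Placeholder vignettes, not the study's actual materials; in practice the
# calls can be issued in parallel rather than sequentially.
vignettes = {
    "lazy": "A man on welfare has never held a job...",
    "unlucky": "A man on welfare lost his job to a work injury...",
}
outcomes = {
    condition: ask(
        f"You are this survey respondent: {life_history}",
        f"{text}\nOn a scale from 1 to 7, how much do you support "
        "stricter eligibility requirements for this recipient? "
        "Answer with a single number.",
    )
    for condition, text in vignettes.items()
}
```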

3 Theory

To develop a causal framework with the explicit goal of providing reliable causal inference when the theorized cause is non-manipulable, it is first necessary to understand what makes causal inference reliable. This requires defining a causal effect and the issues that inherently make drawing causal inference with non-manipulable phenomena unreliable. The definition of a causal effect will draw on the Neyman-Rubin Potential Outcomes Framework (POF, Imbens and Rubin 2015; Neyman 1923; Rubin 2005), while the concepts of spuriousness and masking will be used to illustrate how the mere possibility of confounding produces unreliable causal inference. This is followed by a review of some of the fundamental issues inherent to existing alternative frameworks and methods for drawing causal inference with non-manipulable phenomena. From these insights, the Synthetic Causal Framework (SCF) is introduced to solve these issues and thereby enable reliable causal inference. The theory underlying the SCF is founded on behavioral and linguistic studies, simulations, and advances in natural language processing, which are used to outline fundamental assumptions and limitations of the SCF. This includes the novel conceptualizations of synthetic units and 'parallel worlds'-estimation. The section concludes with a formalization of the SCF and a concern for the generalizability of SCF-based results to human populations, which serves as the basis for an implementation and test of the SCF in the remainder of this paper.

3.1 Causal Inference

Causal inference is a topic of broad relevance across the scientific disciplines. As of writing, more than 505,000 scientific articles are dedicated to the topic4, and in 2024 alone, more than 10,278 projects were created to continue contributing to it.5 Research dedicated to causal inference spans numerous fields in the social and behavioral sciences (Imbens and Rubin 2015; Wright 2015), such as economics (e.g., Angrist and Pischke 2015; Angrist and Pischke 2009; Gilchrist et al. 2024; Powell 2022), medicine (e.g., Bilgel and Galle 2015; Strübing et al. 2023), public management (e.g., Birdsall 2011), social psychology (e.g., Aronson et al. 2018: 54-61; Bettinsoli et al. 2015), political science (e.g., Abadie et al. 2010; Aronow and Samii 2016; Druckman 2015; Hartman 2021; Mutz and Pemantle 2012; Samii 2016; Slothuus 2016), and sociology (e.g., Aronow and Samii 2017). To provide theoretical conceptualizations of causal inference, several frameworks have been developed (e.g., Greenland et al. 1999; Imbens and Rubin 2015; Mazaheri et al. 2024; Pearl 1988, 1993, 1995; Rubin 2005), and these have in turn produced methods that rely on the stipulations made by these frameworks to identify conditions that enable researchers to draw reliable causal inferences (e.g., Abadie et al. 2010, 2015; Abadie and L'Hour 2021; Arkhangelsky et al. 2021; Dunning 2008; Pang et al. 2022; Rosenbaum and Rubin 1983; Strübing et al. 2023). As such, causal inference is an important and thriving scientific topic, and the usefulness of contributions to this literature spans numerous disciplines.

The general utility of causal inference as a field is founded in the fact that theorists often propose causal relationships to explain physical phenomena6. Using standard notation and set theory (Burgess 2022[1948]; Cantor 1874), a causal relationship can be conceptualized to involve at least two physically distinct phenomena (e.g., \(X\) and \(Y\), with \(X \ne Y\))7, where one phenomenon (\(X\)) temporally precedes the second phenomenon (\(X \prec Y\)), with the former exacting an effect upon the latter (\(X \rightarrow Y\)).8 Such relationships are unique to causal inference, since the theoretical claim is not merely that one phenomenon correlates with, or predicts, another phenomenon, but that the phenomenon causes the other, which is especially vital for researchers interested in ultimate causes (e.g., Boyer and Petersen 2018) and the causal mechanisms underlying phenomena (e.g., Bullock et al. 2010). This distinction can be further illustrated by emphasizing that the goal of causal inference is the valid and reliable identification of a theorized causal effect (e.g., \(D\)). The concern for validity and reliability means that an estimated effect (e.g., \(\hat{D}\)) should ideally be equivalent to the true effect (i.e., \(\hat{D} = D\)). This distinct focus can be contrasted with predictive (Cranmer and Desmarais 2017; James et al. 2021), correlational (Pearson 1895a), and imputational inference (Kalton and Kasprzyk 1982; King et al. 2001; Rubin 1988). While these are similarly concerned with relationships between phenomena, such relationships are not necessarily causal, and their goal is to predict a phenomenon (e.g., \(Y\)) using other phenomena (e.g., \(X\)), with the central concern being to make predictions (i.e., \(\tilde{Y}\)) that ideally are perfectly accurate (i.e., \(\tilde{Y} = Y\)). Since these approaches serve different purposes, they can have different concerns for what constitutes valid and reliable inference. For example, while the inclusion of causes in a model may yield good predictive inference (Schölkopf et al. 2021), the inclusion of non-causes in a model cannot yield good causal inference, or as alternatively stated by Nancy Cartwright (1994): 'No causes in, no causes out'. To illustrate the concerns specific to causal inference, a more formal conceptualization of an effect is useful, which can be provided by the Potential Outcomes Framework.

3.1.1 Potential Outcomes Framework

One popular conceptualization of a causal effect, helpful in establishing reliable causal inference, is provided by the Neyman-Rubin Potential Outcomes Framework (POF, Holland 1986; Imbens 2015: 374-377; Imbens and Rubin 2015; Neyman 1923; Rubin 2005), which defines a causal effect as a difference in potential outcomes (Imbens and Rubin 2015: 6; Angrist and Pischke 2015: 6-8; Angrist and Pischke 2009: 13-15; Wooldridge 2019: 100-102). Consistent with the modern conceptualization by Guido W. Imbens and Donald B. Rubin (2015), with the use of set theory (Burgess 2022[1948]; Cantor 1874), potential outcomes can be understood as the set of all possible outcomes, irrespective of whether or not any of these outcomes are realized. To help distinguish realized from potential outcomes, one can draw a parallel to the particle/wave duality in physics (Thomson and Reid 1927; Zeilinger 1999; see also Monroe et al. 1996): the set of potential outcomes can be understood as a superposition of possible, yet contradicting, outcomes that collapses when any one outcome is realized. As such, only one realized potential outcome is ever observed, while all other potential outcomes become counterfactual outcomes. This conceptualization helps emphasize that comparisons between potential outcomes are what defines a causal effect, and once a realized outcome has been observed, causal inference involves comparing this realized outcome with at least one other counterfactual outcome (Imbens and Rubin 2015: 4; Wooldridge 2019: 10-14, 53-56). If the theorized cause truly affects the outcome, different variations of the theorized cause would expectedly produce different potential outcomes, which in turn imply different counterfactual outcomes. This can be synthesized with the above criteria for a causal effect so that any physically distinct and temporally preceding phenomenon that can be held attributable to a difference in potential outcomes can be considered a cause of that outcome.9

A strength of the POF is the ability to formalize this difference in potential outcomes mathematically. This is shown here using discrete mathematics (Gallier 2011), with the underlying assumption being that the phenomena of interest can be meaningfully reduced to mathematical constructs (Mazur 2006; for discussion, see, e.g., Eronen and Romeijn 2020). For the purposes of illustration, no measurement error (Bandalos 2018) will be assumed, and consistent with the POF (Angrist et al. 1996; Imbens and Rubin 2015: 9-12), the illustration will invoke the Stable Unit Treatment Value Assumption (SUTVA, Cox 1958; Gelman et al. 2021: 353-354; Rosenbaum and Rubin 1983; Rubin 1974; Stuart 2010: 3), which states that the 'potential outcomes for any unit do not vary with the treatments assigned to other units, and, for each unit, there are no different forms or versions of each treatment level, which lead to different potential outcomes.' (Imbens and Rubin 2015: 10). This means that units characterized by a certain manifestation or level of the theorized cause do not affect units with different manifestations in any relevant way, which would hinder the identification of an effect due to spill-over (Imbens and Rubin 2015: 12-13; Stuart 2010: 3; e.g., Aronow and Samii 2017). Again using set theory (Burgess 2022[1948]; Cantor 1874), first suppose that the outcome (i.e., \(Y\)) is a real-valued vector (\(Y \subseteq \mathbb{R}\)) of an arbitrary length (\(|Y|\)), so that \(Y = \{Y_1, Y_2, Y_3, ..., Y_{|Y|}\}\), with any arbitrary index-value being denoted by \(y\). Second, suppose a distinct and temporally-precedent theorized cause (i.e., \(X\), with \(X \prec Y\), \(X \ne Y\)), which is similarly a real-valued vector (\(X \subseteq \mathbb{R}\)), its length (\(|X|\)) being arbitrary, so that \(X = \{X_1, ..., X_{|X|}\}\), with \(x\) indicating an arbitrary index-value. Suppose then an \(N_C \times K_C\) matrix of all precedent and distinct causes of the outcome (denoted \(C\)), which, in the event that \(X\) truly is a cause of \(Y\), does not contain \(X\) (i.e., \(X \not \subseteq C\)). Note that \(C\) is not indexed for ease of illustration, and that all causes in \(C\) precede and are distinct from \(X\). The potential outcome, \(Y(\cdot)\), can then be expressed conditionally on \(C\) and, as theorized, possibly conditional on \(X\) as: \(Y(X, C)\). A causal effect of \(X\) would then entail a difference between two potential outcomes that vary solely with respect to \(X\), that is, \(Y(X_x, C)\) and \(Y(X_{x'}, C)\), with \(C\) held constant and \(x'\) being an index-value of \(X\) that is not \(x\). By keeping \(C\) constant and only varying \(X\), any difference in potential outcomes will be solely attributable to \(X\), allowing its theorized effect to be reliably assessed. In the instance that the difference varies across \(X\), the causal effect is heterogeneous by being conditional on the particular potential outcomes being compared (i.e., \(D_{\underset{x, x'}{\Delta}}\)).10 This formalization of causal inference, based on Imbens and Rubin (2015), results in equation 1.1 provided below.

\(D_{\underset{x, x'}{\Delta}} = Y(X_x, C) \: - \: Y(X_{x'}, C)\) 1.1

NOTE: Mathematical equation. \(D\) = Difference in potential outcomes. \(Y(\cdot)\) = Potential outcome. \(X\) = Theorized cause, with \(x\) being the index-value of an arbitrary manifestation of the theorized cause, while \(x'\) is the index-value of an arbitrary manifestation of the theorized cause that is not \(x\). \(C\) = All causes that impact \(Y(\cdot)\), excluding \(X\) in the event that \(X\) impacts the potential outcomes.
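
To make equation 1.1 concrete, the following minimal Python sketch simulates the comparison it formalizes. The linear data-generating function, the effect size of 2.0, and all values are assumptions chosen purely for illustration; with \(C\) held constant across the two potential outcomes, the computed difference is attributable to \(X\) alone.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data-generating process, assumed purely for illustration:
# the potential outcome Y(X, C) is a simple function of the theorized
# cause X and of all other causes C.
def potential_outcome(x, c):
    return 2.0 * x + c.sum()

c = rng.normal(size=5)   # all other causes C, held constant throughout
x, x_prime = 1.0, 0.0    # two manifestations of the theorized cause X

# Equation 1.1: the difference between two potential outcomes that vary
# solely with respect to X; with C constant, d is attributable to X alone.
d = potential_outcome(x, c) - potential_outcome(x_prime, c)
print(d)  # 2.0
```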

While this formalization helps conceptualize a causal effect, equation 1.1 is arguably too simplistic for most practical purposes. This becomes evident when adding notation to reflect that causal inference typically involves computing causal effects from a number (\(n\)) of measured units (i.e., \(n \in \mathbb{N}\), indexed by \(i\)), which, consistent with SUTVA, are here assumed to be independent of each other. Since the theorized cause, potential outcomes, and causal effects can vary across units, equation 1.1 can be modified, following Imbens and Rubin (2015), into equation 1.2.

\(D_{\underset{x, x'}{\Delta}, i} = Y_i(X_{x, i}, C_i) \: - \: Y_i(X_{x', i}, C_i)\) 1.2

NOTE: Mathematical equation. \(D\) = Difference in potential outcomes. \(Y(\cdot)\) = Potential outcome. \(X\) = Theorized cause, with \(x\) being the index-value of an arbitrary manifestation of the theorized cause, while \(x'\) is the index-value of an arbitrary manifestation of the theorized cause that is not \(x\). \(C\) = All causes that impact \(Y(\cdot)\), excluding \(X\) in the event that \(X\) impacts the potential outcomes. \(i\) is the index-value of an arbitrary unit.

As shown in equation 1.2, the computation of a causal effect requires a comparison of the same unit across at least two contradicting potential outcomes. This may seem achievable by manipulating the same individual with respect to \(X\) at two different points in time (e.g., a repeated manipulations design, Shaughnessy et al. 2015: 219-242; Sullivan 2008). However, consistent with equations 1.1 and 1.2, this would require the assumption that time (\(T\)) is not a cause of \(Y\) (\(T \not \subseteq C\)), since any phenomena contained in \(C\) must be constant when comparing potential outcomes to avoid conflating the effect of \(X\) with the effect of \(T\). This can be exemplified by formalizing time as \(T\), a real-valued time vector (\(T \subseteq \mathbb{R}\)) of arbitrary length (\(|T|\)), so that \(T = \{T_1, ..., T_{|T|}\}\), indexed by \(t\). In the event that \(T\) impacts the potential outcomes, \(T\) is here not contained in \(C\), is physically distinct from \(X\), and occurs simultaneously with it; the requirement for equivalence in this term can then be made explicit, as demonstrated in equation 1.3.

\(D_{\underset{x, x'}{\Delta}, t, i} = Y_i(X_{x, i}, T_{t, i}, C_i) \: - \: Y_i(X_{x', i}, T_{t, i}, C_i)\) 1.3

NOTE: Mathematical equation. \(D\) = Difference in potential outcomes. \(Y(\cdot)\) = Potential outcome. \(X\) = Theorized cause, with \(x\) being the index-value of an arbitrary manifestation of the theorized cause, while \(x'\) is the index-value of an arbitrary manifestation of the theorized cause that is not \(x\). \(T\) = Time, with \(t\) being the index-value of an arbitrary point in time. \(C\) = All causes that impact \(Y(\cdot)\), excluding \(X\) in the event that \(X\) impacts the potential outcomes, as well as \(T\) in the event that \(T\) impacts the potential outcomes. \(i\) is the index-value of an arbitrary unit.

Equation 1.3 implies that differences between potential outcomes (\(D\)) need to be computed by varying the theorized cause (\(X\)) for the same individual (\(i\)), at the same point in time (\(T\)), with all other causes (\(C\)) held constant. A repeated measures design logically violates this equality by varying time across the potential outcomes, and any differences in potential outcomes caused by time will be able to confound results, producing estimates that conflate the effect of \(X\) on the potential outcomes with the effect of \(T\). This issue is further emphasized by considering that a repeated measures design must assume that multiple manipulations are independent of each other, which is easily violated if the unit is impacted by the manipulation in a way that affects future manipulations, causing spill-over and violating the fundamental SUTVA assumption. Examples of such violations include expectancy (cf. McDermott 1895: 34) and learning effects (cf. Shaughnessy et al. 2015: 222-224), which interact with the effect of the manipulation. This independent and simultaneous comparison of individuals at counterfactual states has been recognized as impossible, and is denoted the Fundamental Problem of Causal Inference (Holland 1986; Stuart 2010: 3).
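
The Fundamental Problem of Causal Inference can be illustrated with a short, hypothetical simulation. Unlike reality, a simulation can generate both potential outcomes for every unit; the point of the sketch is that once only one outcome per unit is realized, the individual effects of equation 1.2 become uncomputable. All values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# In a simulation, both potential outcomes can be generated for each unit
# (equation 1.2); hypothetical values, for illustration only.
y_treated = rng.normal(loc=1.0, size=n)   # Y_i(X_x, C_i)
y_control = rng.normal(loc=0.0, size=n)   # Y_i(X_{x'}, C_i)
individual_effects = y_treated - y_control

# In any real study, however, only one outcome per unit is ever realized;
# the other collapses into an unobservable counterfactual.
assigned = rng.integers(0, 2, size=n)
observed = np.where(assigned == 1, y_treated, y_control)
print(observed)
# `individual_effects` cannot be recovered from `observed` alone: this is
# the Fundamental Problem of Causal Inference.
```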

Much of the literature on causal inference has been dedicated to solving the Fundamental Problem of Causal Inference (e.g., Holland 1986: 947). Proposed solutions typically involve shifting the focus from individual causal effects (i.e., \(D_i\)) to average causal effects (i.e., \(\bar{D}\), Angrist and Pischke 2015: 8-10; Angrist and Pischke 2009: 3-4, 13-22, 54-57; Stock and Watson 2019: 475-476), also referred to as expected causal effects (i.e., \(\mathbb{E}(D)\)). Consistent with SUTVA, assuming that the manifestation of the theorized cause cannot vary within each group condition (Imbens and Rubin 2015: 9-12), this focus on averages enables one to drop the individual unit subscript \(i\), since estimates are now computed based on the potential outcomes of groups of individuals. This is demonstrated by modifying equation 1.2, producing equation 1.4, where the \(T\) introduced for equation 1.3 will (again) be considered simultaneous with, and a subset of, \(C\) if it impacts the potential outcomes.

\(\mathbb{E}(D_{\underset{x, x'}{\Delta}}) = \mathbb{E}(Y(X_x, C)) \: - \: \mathbb{E}(Y(X_{x'}, C))\) 1.4

NOTE: Mathematical equation. \(\mathbb{E}(\cdot)\) = Expectation. \(D\) = Difference in potential outcomes. \(Y(\cdot)\) = Potential outcome. \(X\) = Theorized cause, with \(x\) being the index-value of an arbitrary manifestation of the theorized cause, while \(x'\) is the index-value of an arbitrary manifestation of the theorized cause that is not \(x\). \(C\) = All causes that impact \(Y(\cdot)\), excluding \(X\) in the event that \(X\) impacts the potential outcomes.

Equation 1.4 reveals that contradicting unit comparisons of potential outcomes are no longer a strict requirement for reliable causal inference. Instead, this only requires group comparisons where all confounders are held constant. Time is a less obtrusive confounder here, since it is theoretically possible to simultaneously compare two groups at contradicting levels of the theorized cause, though the time of making these comparisons may still matter if the effect is heterogeneous across time. Such a group comparison is theoretically achieved through an exogenous group assignment, which is typically implemented in a Randomized Controlled Trial (RCT, Coover and Angell 1907; Peirce and Jastrow 1885; Woodworth and Thorndike 1901) using a Random Assignment Mechanism (RAM). The randomness of this RAM serves to ensure that there are no expected group differences prior to the manipulation of \(X\), enabling a constant \(C\) across the group potential outcomes (Angrist and Pischke 2015: 1-46; Angrist and Pischke 2009: 15-22; Stock and Watson 2019: 48-49; Holland 1986: 948; McDermott 1895: 33; Wooldridge 2019: 54-56). For example, while units may not be assigned to the manipulation at the same point in time (\(T\)), at the group level, the times of manipulation are expectedly equivalent, since units are assigned to a group independently of the time of their participation, producing estimated group differences unconfounded by time.
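
As a minimal illustration of this logic, the following sketch assumes a simple linear data-generating process (an assumption chosen purely for illustration, not a claim about any real study) and shows that, because the RAM assigns groups independently of \(C\), the difference in group means recovers the average effect of equation 1.4.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
true_effect = 0.5   # assumed effect of X on Y, for illustration

c = rng.normal(size=n)           # all other causes C
g = rng.integers(0, 2, size=n)   # RAM: assignment independent of C
y = true_effect * g + c + rng.normal(size=n)

# With Pr(G = g_x) independent of C, C is balanced across groups in
# expectation, and the difference in group means estimates E(D) (eq. 1.4).
estimate = y[g == 1].mean() - y[g == 0].mean()
print(round(estimate, 3))  # close to 0.5
```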

Further notation can be introduced to illustrate the RAM, which involves assuming varying group manipulation conditions (\(G\)) of an arbitrary length (\(|G|\)), so that \(G = \{G_1, G_2, G_3, ..., G_{|G|}\}\), where \(g_x\) indexes an arbitrary manipulation condition (e.g., a treatment or control condition), with the \(x\)-subscript indicating the manifestation of \(X\) that the manipulation is theorized to induce. This helps conceptualize an exogenous manipulation as practically equivalent to an exogenous group assignment, where a different exposure condition in each group serves to manipulate the units in different ways. Here, the exogeneity of the RAM can be formalized as a group assignment probability (\(Pr\)) that is independent of confounders (i.e., \(Pr(G = g_x) = Pr(G = g_x \ | \ C)\)). The distinction between the manipulated value of the theorized cause and the manipulation level intended by the assigned group condition serves to indicate that they are not necessarily equivalent and that the estimated effect may be conditional on both. For example, suppose that the theorized cause (\(X\)) truly is a cause of the outcome (\(Y\)) and that we employ an RCT, where for this particular study the use of the RAM ensures perfect group balance (i.e., constant \(C\)) across two different group manipulation conditions (e.g., \(g_x\) and \(g_{x'}\)). However, suppose then that the actual level of the theorized cause achieved by the group manipulation conditions results in the same level (i.e., \(x\)) across groups. While \(X\) does cause \(Y\) in this example, the group manipulation conditions fail to manipulate \(X\), resulting in an estimated causal effect of zero, which is exactly what would happen if an RCT were conducted on a theorized cause that cannot be manipulated. Besides highlighting the issue that this paper serves to solve, this consideration helps distinguish between the manipulation level theoretically induced by a group manipulation condition (\(g_x\)) and the actual level of manipulation (\(x\)) that occurs, which is why an inadequate exogenous manipulation can fail to identify a causal effect and thus be one cause of a false negative result. This is reflected in the distinct notions of a causal effect versus a treatment effect, the latter being an often-used term to indicate that an effect can be conditional on the treatment used in the manipulation conditions. While this paper will generally assume that the intended and actual manipulation levels are equivalent, allowing a reliable identification of a causal effect, the ability to account for the distinction is useful and further warranted by the notion that interactions between the theorized cause and the manipulation condition can occur that result in varying identified effects. Such interactions can occur because the manipulation condition changes the effect of the theorized cause on the outcome (Linos and Twist 2018), for example, if the unit has already been exposed to that manipulation condition prior to the study (i.e., pretreatment effects, cf. Sniderman 2011; e.g., Slothuus 2016), with an identified effect in this instance constituting a special case of a treatment effect denoted the marginal effect of additional treatment (Slothuus 2016: 303; Gaines and Kuklinski 2011: 450). The fact that the potential outcomes, and thus the identified causal effect, can be conditional on the theorized cause, confounders, and the manipulation condition alike is incorporated into equation 1.5:

\(\mathbb{E}(D_{\underset{x, x'}{\Delta}, \underset{g_x, g_{x'}}{\Delta}}) = \mathbb{E}(Y(X_{x}, C) \: | \: G = g_x) \: - \: \mathbb{E}(Y(X_{x'}, C) \: | \: G = g_{x'})\) 1.5

NOTE: Mathematical equation. \(\mathbb{E}(\cdot)\) = Expectation. \(D\) = Difference in potential outcomes. \(Y(\cdot)\) = Potential outcome. \(X\) = Theorized cause, with \(x\) being the index-value of an arbitrary manifestation of the theorized cause, while \(x'\) is the index-value of an arbitrary manifestation of the theorized cause that is not \(x\). \(C\) = All causes that impact \(Y(\cdot)\), excluding \(X\) in the event that \(X\) impacts the potential outcomes. \(G\) = Group, \(g_x\) is an arbitrary group condition, subscripted by the manifestation of the theorized cause it serves to induce.

As illustrated in equation 1.5, consistent with Stuart (2010: 2), a causal effect can be defined as the expected difference in potential outcomes across counterfactual group manipulation conditions. This helps further the understanding that the exogenous manipulation conditions here serve to produce two groups, which a priori have the same potential outcomes, since the RAM makes independent group assignments, but which, by merit of being differently manipulated across the manipulation conditions, will expectedly differ with respect to the theorized cause, resulting in two counterfactual group outcomes being observed, one for each group. For this illustration, reliable causal inference reduces to simply computing the difference in group means (Deaton and Cartwright 2018: 3; Stock and Watson 2019: 121-123). In line with this, it should be noted that the principle of no group differences except with respect to \(X\) in equation 1.5 realistically only applies to groups of infinite size, or to estimates derived from an infinite sequence of RCTs. This is because for any study with finite group size, the RAM used to implement the exogenous manipulation does not guarantee no group difference with respect to \(C\) (i.e., no confounding), only the expectation of no group differences (Deaton and Cartwright 2018: 6; Imai et al. 2008: 488). This principle of no expected confounding is also referred to as unbiasedness (Angrist and Pischke 2015: 12; Hartman 2021: 386; cf. Wooldridge 2019: 717-718). Stated similarly, group balance will only be guaranteed asymptotically (Wooldridge 2019: 721-724), meaning that for any comparison of groups of finite size, or estimates from a finite sequence of RCTs, \(C\) may be inequivalent across group potential outcomes (Deaton and Cartwright 2018: 6; Imai et al. 2008: 488), allowing confounders to influence the results of any particular study. Confounding is reduced the larger the groups and the greater the number of RCTs that causal estimates are averaged over, with the specific bias for any study being unknowable unless one knows the correct functional form of the confounders (Deaton and Cartwright 2018: 4; Rubin 1974).
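
This distinction between unbiasedness and single-study balance can be illustrated with a small Monte Carlo sketch, again under an assumed linear data-generating process chosen purely for illustration: individual small trials scatter widely due to chance imbalance in \(C\), while the average across trials approaches the true effect.

```python
import numpy as np

rng = np.random.default_rng(2)
true_effect, n_units, n_trials = 0.5, 20, 10_000

estimates = []
for _ in range(n_trials):
    c = rng.normal(size=n_units)           # confounders C
    g = rng.integers(0, 2, size=n_units)   # RAM with finite groups
    if g.min() == g.max():                 # skip degenerate assignments
        continue
    y = true_effect * g + c
    estimates.append(y[g == 1].mean() - y[g == 0].mean())

estimates = np.array(estimates)
# Any single small trial can be badly off due to chance imbalance in C,
# yet the estimator is unbiased: the mean over many trials approaches 0.5.
print(round(estimates.mean(), 3), round(estimates.std(), 3))
```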

Taken together, the POF provides useful insights on how to conceptualize causal effects and the premises for making reliable causal inferences. This primarily emphasizes equivalency with respect to confounders when comparing potential outcomes that differ with respect to the theorized cause. While the POF helps illustrate concerns for confounding and the elimination of confounders through exogenous manipulation, it is necessary to further emphasize the issues of confounding and their inescapability when the theorized cause is non-manipulable.

3.1.2 Confounding

Confounding is a serious threat to reliable causal inference (Fisher 1935; Greenland et al. 1999; VanderWeele and Shpitser 2013). Defined as 'a variable that (1) temporally precedes treatment and is both (2) associated with the treatment and (3) associated with the outcome' (Elwert 2013: 262), a confounder can make a causal interpretation of results entirely invalid, leading to the production of wrong scientific insights, which, if used for policy recommendations, can mean a waste of resources and the implementation of policies that could potentially harm individuals. Confounding can make causal inference unreliable through numerous avenues. Some of the most important issues considered here include reverse-causality, spuriousness, and masking. To provide an overview of these types of relationships, directed acyclic graphs (DAGs, Pearl 1988, 1993, 1995) of (1) no causal relationship, (2) a causal relationship, (3) a reverse-causal relationship, (4) a spurious (non-causal) relationship, and (5) a masked causal relationship are provided in figure 3.1.

Reverse-causality refers to instances where the theorized cause of some outcome is actually itself caused by that outcome (e.g., \(X \leftarrow Y\), Shadish et al. 2002). Causal inference may be made unreliable by this type of confounding whenever the theorized cause does not precede the outcome (\(X \nprec Y\)), but instead succeeds the outcome (\(X \succ Y\)). This violates the criterion of temporal precedence (Hume 2011: 72-73) outlined above (i.e., \(X \prec Y\)). Temporal precedence can be guaranteed by employing an exogenous manipulation of the theorized cause, for example, using the RAM of RCTs, since an exogenous theorized cause by definition cannot be caused by other phenomena, such as the theorized outcome; it can only itself cause other phenomena. When the theorized cause is non-manipulable, researchers often employ alternative strategies, such as relying on longitudinal data (Wooldridge 2019: 7-10, 334-335), that is, they measure the theorized cause and outcome at one point in time (e.g., \(T = t_1\), with \(X_{t_1}\) and \(Y_{t_1}\)), followed by measuring the outcome again at a later point in time (e.g., \(T = t_2\), with \(Y_{t_2}\)). They can then subtract the measures of the outcome across this time period (\(\Delta Y = Y_{t_2} - Y_{t_1}\)) and model a relationship between the theorized cause measured at the first time point (\(X_{t_1}\)) and the difference in the outcome (\(\Delta Y\)), ensuring that the outcome being studied temporally succeeds the theorized cause. However, longitudinal studies can be considerably expensive, and they can often enhance other types of biases (e.g., attrition bias, Hausman and Wise 1979; Rohrer 2018: 36), so researchers may alternatively use a cross-sectional design (Wooldridge 2019: 5-7). Here, they can try to ensure temporal precedence by measuring the theorized cause prior to measuring the outcome; in response to the reliability of this solution, it is necessary to emphasize that the time of measuring a phenomenon does not equal its time of origin. This means that just because the theorized cause is measured prior to the outcome, the theorized cause did not necessarily originate at a time prior to the outcome, and temporal precedence thus remains violable.11 As such, while a longitudinal study helps ensure temporal precedence, which is necessary for reliable causal inference, it is not a sufficient condition (de Leeuw 1985: 372; with claims to the contrary being a post hoc ergo propter hoc logical fallacy, Manninen 2019a), and as will be made evident below, an exogenous manipulation generally offers a more reliable approach.
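
The longitudinal strategy described above, and its limitation, can be sketched as follows under an assumed data-generating process (all values are illustrative): the change score \(\Delta Y\) guarantees that the modeled outcome succeeds the theorized cause in measurement time, but not that the cause itself originated first.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000

x_t1 = rng.normal(size=n)                        # theorized cause at t1
y_t1 = rng.normal(size=n)                        # outcome at t1
y_t2 = y_t1 + 0.3 * x_t1 + rng.normal(size=n)    # assumed DGP at t2

delta_y = y_t2 - y_t1   # change in the outcome across the period
# Modeling the change score on X measured at t1 ensures the modeled
# outcome succeeds the theorized cause in *measurement* time...
slope = np.polyfit(x_t1, delta_y, 1)[0]
print(round(slope, 2))  # ~0.3
# ...but the time of measuring X is not its time of origin, so temporal
# precedence of the cause itself remains violable.
```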

Figure 3.1: Examples of Relationships

NOTE: Directed Acyclic Graphs (DAGs, Pearl 1988, 1993, 1995). \(x\) = Theorized cause, \(y\) = outcome, \(s\) = spurious confounder, \(m\) = masking confounder. Non-exhaustive list of relationship types.

Two other types of confounders considered here are (1) spuriousness, which refers to an instance where there is no relationship between the theorized cause and outcome, but where a third variable (e.g., \(s\)) impacts both of them, creating a spurious relationship that researchers may wrongly attribute to the theorized cause (Gumbel 1933; McElreath 2019: 123-144; Stock and Watson 2019: 584-586); and (2) masking, which involves a situation with a relationship between the theorized cause and outcome, but where a third variable (e.g., \(m\)) impacts both of them in a way that masks their relationship, leading the researcher to wrongly conclude no effect of the theorized cause12 (McElreath 2019: 144-153). These related confounders are both reliably solved with an exogenous manipulation, since an exogenous theorized cause by definition cannot be affected by a third variable, preventing any spurious or masking effects from occurring. Stated in the terminology of the aforementioned DAGs, this is because an exogenous manipulation closes any and all 'back-doors', that is, any relationships between the theorized cause and confounders (i.e., the back-door criterion, Elwert 2013: 259-260; McElreath 2019: 184-189; Rohrer 2018: 30-31). However, when the theorized cause is non-manipulable, researchers may alternatively attempt to identify all possible confounders to avoid falling victim to spuriousness or masking, for example, by drawing encompassing DAGs. Yet what constitutes the true DAG of a causal relationship is inherently unknowable (Elwert 2013: 270), and it is the researchers' responsibility to justify having accounted for all possible confounders (Elwert 2013: 248). Failing to account for any confounder can produce spurious or masked relationships, while accounting for any non-confounder can induce other types of biases (e.g., collider bias or overcontrol bias, Elwert 2013: 250; Elwert and Winship 2014; McElreath 2019: 176-183; Rohrer 2018: 34-38). With non-manipulable theorized causes, researchers can reasonably suspect that their results may be confounded and attempt to estimate how large a confounding effect would be necessary to make an identified relationship spurious (e.g., Cinelli and Hazlett 2020). A shortcoming of such procedures, however, is that researchers may erroneously conclude that, because the confounding effect would have to be larger than any known confounder could produce, the identified relationship must be causal. There are several issues with this argument. Specifically, it applies numerous logical fallacies (Arp et al. 2019), including an appeal to ignorance (Bebbington 2011; McCraw 2019), or alternatively an appeal to personal incredulity (Manninen 2019b), if not both, since researchers' inability to know or imagine such a relevant confounder does not mean that this confounder does not exist. Further, assuming that the large confounding effect would have to originate from a single confounder commits the oversimplification fallacy (Burkett 2019), since the large confounding effect could be caused jointly through the aggregation of smaller effects by numerous confounders. Ultimately, unless researchers know the functional form of the confounder, which they rarely do, they cannot know the magnitude by which it may confound results (Deaton and Cartwright 2018: 4).
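
A short simulation can make the danger of spuriousness concrete. Under an assumed data-generating process (chosen purely for illustration), a common cause \(s\) of both \(x\) and \(y\) produces a sizeable naive association despite a true effect of exactly zero, while an exogenous manipulation of \(x\) closes the back-door and removes it.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# Hypothetical spurious relationship: s causes both x and y, while x has
# no effect on y whatsoever.
s = rng.normal(size=n)
x = s + rng.normal(size=n)
y = s + rng.normal(size=n)

naive_slope = np.polyfit(x, y, 1)[0]
print(round(naive_slope, 2))  # ~0.5, despite a true effect of zero

# An exogenous manipulation of x closes the back-door s -> x:
x_exogenous = rng.normal(size=n)   # x no longer depends on s
y_exogenous = s + rng.normal(size=n)
print(round(np.polyfit(x_exogenous, y_exogenous, 1)[0], 2))  # ~0.0
```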

In line with these insights, the issues of reverse-causality, spuriousness, and masking would seem most reliably solved through an exogenous manipulation of the theorized cause. This implies that any phenomenon that is merely observed, or manipulated in a non-exogenous manner, may be endogenous and affected by confounders (Mutz and Pemantle 2012: 3), such as through reverse-causality, spuriousness, and masking. Accordingly, researchers interested in drawing causal inference from non-manipulable phenomena face numerous issues of confounding that a priori justify serious concerns regarding the reliability of causal inference drawn from possibly endogenous phenomena. While numerous methodologies exist to mitigate these concerns, as will be made clear in the subsequent section, unless they enable an exogenous manipulation of the theorized cause, these justified concerns generally render them unable to provide reliable causal inference.

3.1.3 Issues with Existing Solutions

The issues of drawing reliable causal inference with non-manipulable theorized causes are widespread and affect numerous phenomena of substantial scientific interest. Non-manipulable phenomena include ‘gender, personality, and other deeply rooted traits and values […]’ (Buunk and van Vugt 2013: 17), all of which have been the focus of numerous studies (e.g., Block and Block 2006; Graham and Haidt 2012; Mac Giolla and Kajonius 2019; Mondak 2010). These are generally considered non-manipulable (e.g., Holland 1986: 946) either for ethical reasons or due to being biologically rigid and thus practically impossible to manipulate (e.g., personality traits, Buss 2009; Larsen et al. 2021; Mitchell 2018). For the aforementioned reasons, RCTs cannot be employed for reliable causal inference for any causal relationship involving non-manipulable phenomena, and observational designs that fail to account for the possibility of confounding thus risk producing false positive or false negative conclusions regarding the hypothesized causal relationships. A solution to this issue could involve changing the scope of the research from causal inference to correlational or predictive inference, since this avoids the issue of making false causal conclusions. However, researchers may find it difficult to change scope if their central interest is specifically that of a causal relationship. This is evidenced by researchers often making inappropriate causal interpretations (for such criticisms, see, e.g., Chiu et al. 2023; Keele 2015; Lee 2012; Rohrer 2024; Samii 2016) despite the lack of manipulability of the phenomena essentially restricting them to correlational/predictive/imputational inference. While they often justify a causal interpretation through theory (e.g., Xu and Peterson 2017), given the multitude of ways for confounding outlined above, a theoretical argument alone seldom provides any reliable guarantee for the absence of confounding, a consideration that is further expanded upon below.

A strategy for researchers to justify a causal interpretation when unable to exogenously manipulate the theorized cause is through theory and argumentation. While this arguably helps justify a causal interpretation (Elwert 2013: 247), such arguments often make implicit, if not explicit, use of logical fallacies (Arp et al. 2019). Examples of fallacious logic that researchers can employ to justify a causal interpretation include shifting the burden of proof (Russo 2019) of the causal claim to the skeptic (e.g., ‘If you are so skeptical of my claim that \(x\) causes \(y\), prove to me that \(x\) does not cause \(y\).’). This is a logical fallacy because the burden of proof (onus probandi) ultimately falls upon the one making the causal claim (i.e., the researcher) and not the one skeptical of the claim (Cargile 1988; Hume 2010[1748]; Russo 2019; Sagan 1979). Another fallacy can involve a researcher justifying a causal interpretation because (whether guided by theory or intuition) they cannot imagine any confounders, or have accounted for the select number of confounders of which they are aware (e.g., ‘I have controlled for \(z\), so therefore \(x\) must cause \(y\).’). This would be the fallacy of an appeal to ignorance (Bebbington 2011; McCraw 2019), or alternatively an appeal to personal incredulity (Manninen 2019b), if not both, since the ignorance of (or inability to imagine) confounders does not equate with there being no confounders. Probably the least sound argument simply requires researchers to be able to imagine a causal relationship (Morgan and Winship 2007: 279; as cited by Bollen and Pearl 2013: 313), which obviously fails to lift the burden of proof and is highly susceptible to confirmation bias on the part of the causal claimant (Johnson 2019). Another, perhaps more commonly known, logical fallacy that researchers can employ is to suggest that a correlation between phenomena implies causation. This would be a cum hoc ergo propter hoc fallacy (Manninen 2019c), since the co-occurrence of phenomena, no matter how consistent, does not necessitate that one phenomenon be the cause of the other13. This is why researchers generally recognize that correlation does not imply causation (Pearson 2015[1900]) and that causation does not imply correlation (Kennaway 2015).14

Researchers can also employ methodological designs to help justify a causal interpretation of results. Among the more justifiable approaches is the natural experiment (Dunning 2008; Stock and Watson 2019: 490), which attempts to exploit a manipulation of the theorized cause to help identify a causal effect. The natural experiment, however, differs from an RCT, since this manipulation is not controlled by the researchers, but is instead under the control of ‘naturally occurring’ phenomena. In line with the logic underlying the RAM in RCTs, it remains key that this natural manipulation occurs exogenously of all possible confounders, which in this context is referred to as the manipulation being as-if-random (Stock and Watson 2019: 497-498). Should the naturally-occurring manipulation fail this criterion, it will be endogenous, resulting in possible confounding and unreliable causal inference. Numerous researchers have employed natural experiments to test theorized causal relationships with theorized causes not manipulable by researchers (e.g., Erikson and Stoker 2011; Jones and Olken 2005; Kokkonen and Sundell 2020). Here, the as-if-randomness of the manipulation is often sought justified through theory and knowledge about the manipulation mechanism. While some manipulation mechanisms are undoubtedly more credibly exogenous than others (e.g., a lottery, Erikson and Stoker 2011), many are open to the aforementioned logical fallacies (e.g., ‘I can think of no confounders, ergo there are no confounders.’). The ability of natural experiments to provide reliable causal inference therefore rests on the exogeneity of the manipulation mechanism, which, unlike the RAM of RCTs, is generally not as testable with respect to its randomness (Mutz and Pemantle 2012). Another limitation is the identification of natural experiments, which can be difficult and subject to luck. As such, while natural experiments could, in principle, enable reliable causal inference through an as-if-random manipulation of a phenomenon non-manipulable by researchers, they are difficult to identify, and distinguishing between truly exogenous and endogenous manipulation mechanisms is difficult if not impossible, making them generally unable to deliver truly reliable causal inference.

Designs related to the natural experiment (i.e., quasi-experiments) include regression discontinuity designs and instrumental variables. The regression discontinuity design (RDD, Skovron and Titiunik 2015; Thistlethwaite and Campbell 1960) involves identifying a causal effect by studying a theorized cause across a time period. During this period, some change is induced in the theorized cause, which is expected to cause a change in the outcome (Stock and Watson 2019: 495-496; Angrist and Pischke 2015: 147-177; Angrist and Pischke 2009: 251-267; Gelman et al. 2021: 432-452). To justify a causal interpretation, it is necessary that the change is induced exogenously in an as-if-random manner (Stock and Watson 2019: 496). As such, the procedure is highly similar to a natural experiment, though with the added requirement that the theorized cause is observed across time. While this helps researchers understand variations in the outcome prior to the induced variation in the theorized cause, which undoubtedly can help separate random noise from a possible effect of the theorized cause (e.g., regression to the mean effects, cf. Gelman et al. 2021: 87-90; McElreath 2019: 92), this design again relies on the same strong assumptions of an exogenous manipulation that make natural experiments generally unreliable for causal inference. By contrast, instrumental variables (IVs, Stock and Watson 2019: 427-472, 494; Angrist and Pischke 2015: 98-146; Angrist and Pischke 2009: 113-218, 259-267; Gelman et al. 2021: 421-454; Imbens and Rubin 2015: 513-584) are phenomena that are manipulable, either by researchers or by naturally-occurring phenomena, and which are expected to impact the theorized cause. The idea is to isolate variation in the theorized cause exogenously induced by the IV and identify a relationship between this variation in the theorized cause and the outcome (e.g., Angrist and Krueger 1991; Hariri 2012; Stock and Trebbi 2003; Wright 1928). While naturally-occurring IVs, whose exogeneity relies on the same issues of as-if-randomness that generally make natural experiments unreliable, are open to the criticisms above, IVs exogenously manipulated by researchers can arguably provide a better justification for a causal interpretation of results. While this may sound promising, the reliability of IVs rests on numerous assumptions (Gelman et al. 2021: 422-423), and a central criticism considered here targets the exclusion restriction assumption, which states that the IV cannot impact the outcome except through the theorized cause (Angrist and Pischke 2009: 116-117, 153-154; Angrist and Pischke 2015: 101-107, 130-131; Samii 2016: 948; Gelman et al. 2021: 423). This is a strong assumption that is impossible to test, since this would require investigating every possible phenomenon with a temporal occurrence between the theorized cause and the outcome. Consistent with the above, the reliability of this strategy for causal inference cannot rest on researchers’ ability to adequately account for every possible confounder, nor for every possible way that an IV could impact an outcome other than through the theorized cause. Accordingly, neither RDDs nor IVs are taken to generally provide reliable causal inference.
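For intuition only, the IV logic can be illustrated in its simplest form with a binary instrument, where the instrument’s effect on the outcome is scaled by its effect on the theorized cause (the Wald estimator; the notation \(Z\) for the instrument is introduced here purely for illustration):

\(\hat{\beta}_{IV} = \dfrac{\mathbb{E}(Y \: | \: Z = 1) \: - \: \mathbb{E}(Y \: | \: Z = 0)}{\mathbb{E}(X \: | \: Z = 1) \: - \: \mathbb{E}(X \: | \: Z = 0)}\)

If the exclusion restriction fails, the numerator contains effects of \(Z\) on \(Y\) that do not run through \(X\), biasing the estimate.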

A third set of methods that researchers can employ for causal inference involves the construction of counterfactual comparison groups. Such designs include difference-in-differences, synthetic control, matching, and functional adjustment. Through different means, these all attempt to correct for imbalances between units varying with respect to the theorized cause, which, if inadequately accounted for, can introduce confounding through selection bias (Angrist and Pischke 2015: 8-11; Angrist and Pischke 2009: 23-24; Deaton and Cartwright 2018: 9; Heckman 1979; Rohrer 2018: 36-38). Perhaps the most sophisticated of these is the difference-in-differences (DiD) design (Stock and Watson 2019: 492-494; Angrist and Pischke 2009: 227-243; Angrist and Pischke 2015: 178-208). This involves observing at least two groups across time, where one group experiences a (naturally) occurring change in the theorized cause. To assess the counterfactual scenario of the outcome had this group not experienced this change during the observed time period, the difference in the outcome of the non-changed group for the same time period is used as an estimate of the counterfactual difference, producing a difference-in-differences estimate (e.g., Allers and Hoeben 2010; Bechtel and Hainmueller 2011; Card and Krueger 1994). For this method to be reliable, the changed and the non-changed group must be balanced in terms of all possible confounders prior to the change in the theorized cause, an assumption that has also been referred to as parallel trends (Angrist and Pischke 2015: 184-187). This is commonly ascertained by assessing balance between the groups prior to the change. If data contain numerous measurements of the groups pre- and post-manipulation, this method can be combined with RDD to produce the possibly more reliable difference-in-discontinuities design (Grembi et al. 2012). However, it is clear that the assessment of parallel trends requires assuming a version of no unmeasured confounders to avoid results being plagued by omitted variable bias (OVB, Angrist and Pischke 2015: 69-93, 132-174, 212-213; Ho et al. 2007: 206), which again is dubious since researchers may not know of all possible confounders. Given these and other issues (e.g., Baker et al. 2022; Bertrand et al. 2004; Ferman 2023; Kahn-Lang and Lang 2020; Ryan et al. 2015), in accordance with the above, this method can similarly be found too unreliable for causal inference.
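For intuition only, in its simplest two-group, two-period form (with \(T\) and \(C\) denoting the changed and non-changed groups, notation introduced here purely for illustration), the difference-in-differences estimate amounts to:

\(\widehat{DiD} = (\bar{Y}_{T}^{post} - \bar{Y}_{T}^{pre}) \: - \: (\bar{Y}_{C}^{post} - \bar{Y}_{C}^{pre})\)

where the second difference serves as the estimate of the counterfactual change for the changed group, which is valid only under parallel trends.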

The synthetic control method similarly requires data across time, with the same underlying logic of a group experiencing a change in the theorized cause (Abadie et al. 2010, 2015; Abadie and L’Hour 2021). Here, multiple groups not experiencing this change are used to construct a counterfactual estimate of the outcome of the changed group had it not experienced the change. This involves a re-weighting of data similar to matching (see below), and for a causal estimate to be reliable, the counterfactual comparison group cannot differ from the treated group with respect to unobserved confounders. As such, it is subject to the same criticisms of plausibly failing to account for omitted variables as the DiD, making it similarly unreliable for causal inference.

By comparison, matching (Angrist and Pischke 2015: 50-56; Angrist and Pischke 2009: 69-91; Ho et al. 2007; Imbens 2015; Imbens and Rubin 2015: 337-358; Stuart 2010; e.g., Kam and Palmer 2008) does not require longitudinal data, but it similarly involves constructing a counterfactual outcome using units that did not experience a change in the theorized cause. While it does not rely on ‘parallel trends’ like the DiD, it does make the same innate assumption of no unobserved confounders (Samii 2016: 948; Stuart 2010: 5; Rohrer 2018: 32-33; Ho et al. 2007: 200, 206-207), and failing to account for any confounder can re-introduce confounding through selection bias (Imai et al. 2008: 489; Pearl 2000; Wan et al. 2024). These and other issues (Imai et al. 2008; King and Nielsen 2019) mean that it can generally be considered to fail in providing reliable causal inference, in particular in any scenario where researchers cannot reliably demonstrate the absence of possible confounders.

Lastly, functional or parametric adjustment (cf. Stuart 2010: 2; Imai et al. 2008: 489), commonly known as controlling, involves the inclusion of covariates in a statistical model in an attempt to account for confounding phenomena. This is closely related to matching, and while it can reduce imbalances between two groups varying with respect to the theorized cause, as well as reduce variation in the outcome to help identify an effect of the theorized cause, the reliability of this procedure rests on numerous assumptions. This includes the assumption of correct functional form (Ho et al. 2007: 218), which means that the relationship between the confounder and the theorized cause and outcome must have been modeled correctly. For example, a linear adjustment is only appropriate if the impact of the confounder truly is linear, which can be difficult to ascertain. Another assumption shared with matching (Ho et al. 2007: 202) is that the phenomena being adjusted for must temporally precede the theorized cause, since the adjustment of any phenomenon that occurs after the theorized cause risks inducing post-treatment bias (Montgomery et al. 2018). A more critical assumption is again that of no unmeasured confounders. Since both matching and functional adjustment amount to nothing more than selection on observables (Imbens 2015: 373), their ability to provide reliable causal inference relies on the correct adjustment for all confounders (Imai et al. 2008; Stuart 2010: 3), including confounders of which the researchers doing the adjustment may not be aware. For these reasons, functional adjustment also generally fails to provide reliable causal inference (cf. Aronow and Samii 2016; Imai et al. 2008; Samii 2016). With all these designs relying on the assumption of no unobserved confounders, it is worth stressing that their attempts to increase balance, whether by choosing groups similar to the group experiencing a change in the theorized cause, re-weighting groups, matching units, or making functional adjustments, run the risk of increasing other imbalances. That is, by decreasing some imbalances, they can introduce new imbalances through selection bias, specifically by conditioning on observables (Ho et al. 2007: 202; Rohrer 2018; Wan et al. 2024).

Given the above review of existing methodologies for causal inference with non-manipulable phenomena, while it may seem strict to deem these methods unreliable in terms of causal inference, the take-away is that viewing these methods as being able to provide reliable causal inference generally increases the risk of making false causal conclusions in a manner that can be considered unjustified. Instead of putting researchers in a position where they have to measure all possible confounders, since this is the burden of proof of causal inference, a reliable alternative would, instead of relying on numerous dubious and untestable assumptions, involve a design that guarantees, at least in expectation, simply by its design, that results cannot be systematically confounded (cf. Rohrer 2018: 34; Stuart 2010: 3; Samii 2016). Researchers must listen to the multiple criticisms targeting attempts at causal inference with non-manipulable phenomena (e.g., Baumrind 1983; Berk 1994; Cartwright 1994; Cliff 1983; de Leeuw 1985; Elwert 2013; Freedman 2004; Guttman 1977; Imai et al. 2008; Rohrer 2018; Samii 2016; Wan et al. 2024), recognize that there simply can be ‘no causation without manipulation’15 (Holland 1986: 959), and understand that, similar to the RAM in RCTs, a reliable causal framework generally necessitates an exogenous manipulation.

3.2 Synthetic Causal Framework

The previous sections have established (1) that a causal effect is the difference in potential outcomes, (2) that confounding is an ever-present threat when the theorized cause is non-manipulable, and (3) that existing solutions generally fail to provide reliable causal inference, at least by comparison to RCTs. The consequence of the latter is that substantial parts of scientific disciplines interested in non-manipulable theorized causes are barred from making reliable tests of causal hypotheses, making a solution to this problem relevant for scientific progress. These insights will be used to lay the foundation of a framework aimed at solving the issue of non-manipulability and providing a methodology that enables causal inference as reliable as that of RCTs.

The novel framework devised here to enable researchers to draw reliable causal inference when the theorized cause is non-manipulable will be denoted the Synthetic Causal Framework (SCF), which will be outlined and formalized in this section. The development of this framework will draw on the established logic and notation of the Neyman-Rubin Potential Outcomes Framework (POF, Holland 1986; Imbens and Rubin 2015; Neyman 1923; Rubin 2005), with the aim of maximizing reliability through a high internal and external validity by design (cf. Stuart 2010: 3; Rohrer 2018: 34; Samii 2016). Following the POF, an a priori high internal validity will be achieved by making the non-manipulable theorized cause exogenously manipulable, allowing causal inference to be at least as internally valid as that of RCTs (cf. Angrist and Pischke 2009; Imai et al. 2008: 488; McDermott 2011; Stock and Watson 2019: 478-481). Consistent with the theory of statistical inference, random sampling (Angrist and Pischke 2015: 14-16, 34-37; Thompson 2012) will be employed by the framework to provide a high external validity (cf. Gelman et al. 2021: 354-355; Stock and Watson 2019: 481; Imai et al. 2008: 488; McDermott 2011).

The first step in developing the SCF involves defining its scope. Given that the theorized cause is often non-manipulable in the social and behavioral sciences (Buunk and van Vugt 2013: 17), due to the manipulation of humans sometimes being infeasible (e.g., due to ethical, practical, or cost concerns), the SCF will be developed with the intended aim of application within these disciplines, specifically for research topics where human individuals are the unit of interest, and where the theorized cause would otherwise not be manipulable. This means that the concern for external validity in relation to the SCF involves inference to human populations. The successful development of the framework will thus require an ability to rely on random sampling from an arbitrary human population, at least on a theoretical basis, and the manipulation of an arbitrary theorized cause, with results derived from an analysis of this manipulation being generalizable to the intended human population. This limitation in scope is unlikely to substantially reduce the usefulness of the framework, since the social and behavioral sciences encompass numerous disciplines and subdisciplines (Wright 2015). This is not to say that the SCF cannot be applied to other areas of research, but the scope helps clarify its intended area of application, as well as the area-specific assumptions and limitations that the development and application of the SCF needs to consider and be evaluated against.

3.2.1 Large Language Models

With the intended application of the SCF being that of human-based studies, considerations can be made about random sampling and how to achieve an exogenous manipulation of an otherwise non-manipulable theorized cause. Given that numerous social scientific studies are conducted using online surveys (Batinic et al. 2002; Gideon 2012), where human respondents provide text-based responses, the solution that will underlie the SCF involves simulating human behavior using Natural Language Processing (Pillai and Tedesco 2023). In this field, Large Language Models (LLMs, Shanahan 2024) are developed with the specific aim of imitating human language and intelligence (Ornstein et al. 2024; e.g., Kjell et al. 2022). There have been numerous advances in this field, with LLMs becoming increasingly complex and being trained on ever larger sets of text-based human language (OpenAI 2024b; 2024c; e.g., OpenAI 2022). The most advanced models include the Generative Pretrained Transformers (GPT) by OpenAI, such as the popular ChatGPT (OpenAI 2022). These GPT-models have proven capable of closely mimicking not just human language (Jansen et al. 2023; e.g., Colette et al. 2024; OpenAI 2022), but also human cognition, performing comparably in terms of logic, mathematics, and communication (e.g., OpenAI 2024a; 2024b; 2024c). For example, GPT-4o (OpenAI 2024b) can process text, audio, and image input, making it easy for researchers to use it in various ways, while o1 (OpenAI 2024c) can engage in complex reasoning that has enabled it to outperform physics PhDs. The prospect of simulating human behavior using such models has been recognized by researchers (e.g., Bail 2024). For example, LLMs have previously seen success in simulating patients to help train medical doctors (e.g., Holderried et al. 2024; Pang et al. 2024). A reliance on LLMs, however, restricts the application of the SCF to language-based, mostly online virtual environments, since the use of these models generally requires a stimulus expressible through natural language in the form of a prompt, and efficient use of these models generally involves making online requests to the LLM using an application programming interface (API, De 2023), which requires a stable internet connection. However, since text-as-data approaches are becoming more popular (e.g., Alrababa’h and Blaydes 2021; Barberá and Rivero 2014; Chadefaux 2014; Daigneault et al. 2018; Hollibaugh 2019; Van Atteveldt et al. 2017), this limitation to a text-based virtual environment is nonetheless unlikely to substantially reduce the SCF’s general applicability to researchers. At the same time, while their performance can be considered impressive, such models still have limitations (e.g., Edyko et al. 2023; Koga and Du 2024; Lecler et al. 2023; Obaid et al. 2023; Park and Choi 2024; Saka et al. 2024; Shea and Ma 2023; Zhang et al. 2024), and one might question whether models trained on natural language can be expected to sufficiently capture the nuances of human behavior.

To build a case for LLMs being capable of emulating human behavior, or at least behavior as it is expressed in natural language, and why such research would even be meaningful in the first place, it should be emphasized that the recognized functionalist view in psychology makes no innate distinction between the mental states of humans and the mechanical operations of computers (Farrell 2014: 88, 398-403; for a general introduction and criticisms, see, e.g., Farrell 2014: 340-359; Hempel 1959; Rosenberg 2016: 195-210; for an example, see, e.g., Elster 1983). In this view (Farrell 2014: 111), it would be considered entirely meaningful to investigate human behavior through simulated modelling, since what is essential for understanding such behavior is not the substance of the unit being investigated but the function of its behavior. For example, the question of whether an LLM needs to possess consciousness (Eysenck and Keane 2020: 767-800) to emulate human behavior is entirely dependent on whether the distinct functions of consciousness are relevant for the specific behavior being studied. That is, if the functions of consciousness for the studied behavior can be adequately modeled and emulated, research using these models is just as meaningful and valid. Viewed alternatively through behaviorism (Farrell 2014: 362-389), intrapsychic differences between an LLM and humans, such as the presence of consciousness, would be entirely irrelevant, and any scientific inquiry will yield meaningful insights as long as their behavior is indistinguishable.

As to whether LLMs trained on natural language could be expected to sufficiently capture the nuances of human behavior, social psychologists and linguists have long argued that language, thoughts, and behavior, while distinct phenomena, are closely linked (Holtgraves 2013; Lucy 1992; Thibodeau and Boroditsky 2011; Whorf 2011[1956]; e.g., Carpenter and Drabick 2011; Chen 2013; Fausey and Boroditsky 2010; Gao et al. 2015). Natural language is a medium for communicating information (Lyons 1991), and the types of data transferable in this manner are numerous. For example, information about the characteristics of objects (e.g., their color and shape), groups (e.g., their shared identity and goals), and individuals (e.g., their values and beliefs) can be communicated through natural language. Communication does not necessitate the transfer of information between individuals, since individuals sometimes engage in subvocal talking (i.e., thinking, Farrell 2014: 377-378), which can also occur vocally (i.e., by talking to oneself), and these types of communication are similarly expressible through natural language, showcasing the nuance with which human behavior can be captured through language. This is mostly achieved by language providing a formalized syntax, grammar, and vocabulary for expressing past and present thoughts, relations, experiences, feelings, values, and intended behavior in a structured, consistent, and predictable manner (Anderson 2020: 375-448; Eysenck and Keane 2020: 461-568; Lyons 1991). At the same time, language undergoes constant adaptation to human requirements for communication (e.g., the term ‘Synthetic Causal Framework’ was invented to denote the concepts introduced in this paper), making it relatively reliable in terms of capturing the nuances of old and new behavior. Especially relevant is the consideration that language can be used to communicate one’s intended behavior, which can be considered an imperfect proxy for actual behavior (Ajzen and Fishbein 1977; Back et al. 2009; Zanna et al. 1980), since stated behavior is not necessarily equivalent to actual behavior.

The ability of language to capture the nuances of human behavior is further supported by personality psychology (Larsen et al. 2021), where trait-based approaches have relied on the lexical hypothesis (Galton 1884) to derive distinct personality traits that capture large parts of human behavior. This lexical hypothesis states that as language has evolved, the most relevant characteristics for distinguishing individuals from each other have been incorporated into our language as distinct words. This ties into the use of language as being revelatory of the intrapsychic state of individuals, with communication patterns and choice of words being predictive of the psychological states and social behavior of an individual (Chung et al. 2007; Pennebaker et al. 2003). But language not only expresses behavior in a nuanced manner, it also ‘shapes thought patterns on which actions are based.’ (Bandura 1999: 195). This means it also affects behavior by structuring our thoughts (Farrell 2014: 261), with some arguing that language even structures our perceptions of reality (i.e., the hypothesis of linguistic relativity, Lucy 1992; Whorf 2011[1956]; e.g., Boroditsky 2001; Fausey and Boroditsky 2011). That language shapes our thoughts and behavior is, for example, evidenced by framing (Tversky and Kahneman 1981) and priming (Bargh and Chartrand 2000), which often consist of text-based information, communicated through natural language, that affects behavior by structuring our thoughts, either by presenting fragmented information or by making some information salient (Weingarten et al. 2016). Such research demonstrates that human preferences and behavior are neither fixed nor entirely determined by our genetic predisposition and the entirety of our lived experiences, but, consistent with the view of human reason as bounded (Elster 1989/1990; March 1994; Simon 1955), instead vary in relation to the salience of information communicated through natural language. Accordingly, it could be expected that an LLM, despite lacking genes, consciousness, and lived experiences, if provided with enough relevant information about a particular context, could emulate human preferences and behavior, at least as they are expressed using natural language. Given these considerations, language would seem to capture many relevant nuances of human behavior, making human behavior somewhat predictable from natural language, and since LLMs have been trained on natural language (Ornstein et al. 2024), they can arguably be expected to, albeit imperfectly, predict human behavior.

3.2.2 Simulated Data

The choice of the SCF to mimic human behavior using LLMs to make non-manipulable theorized causes manipulable invites concerns regarding the appropriateness of essentially simulating data for research. Simulated data is often used by researchers in statistics to investigate the properties of estimators, since simulations allow them to control every underlying aspect of the data (e.g., Green and Yang 2009; Hanmer and Kalkan 2013; Homburg and Dobartz 1992; Hu and Bentler 1999; Jackson 2003; MacCallum 1990; Sheng and Sheng 2012). While concerns regarding the inference of simulated results to real-world applications could plausibly be justified, such inferences underlie numerous approaches to research and are, as mentioned, consistent with functionalist and behaviorist psychology (Farrell 2014: 88, 111, 362-389, 398-403). For example, the statistical method of bootstrapping (Efron 1979, 2003; Efron and Tibshirani 1994) is used to make inferences to the real world, and this involves simulating counterfactual samples of real-world data to obtain a sampling distribution of the estimates, with such simulation-based inference generally outcompeting analytical approaches in terms of robustness (Angrist and Pischke 2009: 293-308; King and Roberts 2014). For the non-parametric bootstrap (Efron and Tibshirani 1994; e.g., Barber and Thompson 2000), this is done by permuting data (i.e., re-sampling data with replacement), assuming that the unit characteristics in the data are proportional to their prevalence in the population, while the parametric bootstrap (Efron and Tibshirani 1994; e.g., Studer 2021) involves assuming a specific probability distribution for the estimates and computing sampling distributions using Monte Carlo simulation (Robert and Casella 2013; Rubinstein and Kroese 2016). This is similarly done in Bayesian statistics (Bayes and Price 1763; Gelman et al. 2014; Kruschke 2014; Laplace 2009[1814]; McElreath 2019) with the use of Markov-Chain Monte Carlo methods (Brooks et al. 2011; Duane et al. 1987; Gelman and Rubin 1992; Metropolis et al. 1953), which, like the parametric bootstrap, involve parametric simulations. From a statistical-theoretical standpoint, these methods have numerous advantages, and they are generally considered valid when making inferences to the real world (e.g., Gelman et al. 2021; Imai and Tingley 2012: 223-224; King et al. 2000: 352; Pawitan 2001: 122). Besides underlying statistical inference, simulations are also used when encountering real-world limitations of the phenomena being studied. For example, in the field of social dilemmas (van Lange et al. 2013), simulations underlie agent-based modeling, where models are used to estimate the actions of individuals in relation to social dilemmas across a number of trials too large to feasibly be conducted with human participants (van Lange et al. 2013: 31; Macy and Willer 2002), and this is again entirely consistent with a functionalist view of social research (Farrell 2014: 88, 398-403). This means that the reliance on simulated data is not necessarily inappropriate for research, and it can be emphasized that ‘as with any computer simulation, it becomes important to collect actual data, contrast the results against the simulation results, and then address the question of why there are deviations between the modeled and actual results.’ (van Lange et al. 2013: 31) As such, similar to the bootstrap, Bayesian statistics, and agent-based modeling, the employment of simulations in the SCF does not automatically make results invalid or meaningless. Instead, it poses an empirical question regarding the quality of the simulated data and the ability to infer results to human populations, something that the SCF must be validated against.
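For concreteness, the resampling logic of the non-parametric bootstrap can be sketched in a few lines of Python; the data and all numbers below are purely illustrative placeholders, not drawn from any study:

```python
import numpy as np

rng = np.random.default_rng(seed=2024)
y = rng.normal(loc=0.5, scale=1.0, size=200)  # placeholder 'real-world' sample

# Non-parametric bootstrap: re-sample with replacement and recompute the
# estimate, yielding a simulated sampling distribution of the mean.
boot_means = np.array([
    rng.choice(y, size=y.size, replace=True).mean()
    for _ in range(10_000)
])

# Percentile-based 95% interval of the simulated sampling distribution.
lower, upper = np.quantile(boot_means, [0.025, 0.975])
```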

In terms of data quality, consistent with the mantra of garbage-in-garbage-out (GIGO, Yoder-Wise 2023), the ability of the SCF to provide reliable causal inferences hinges in part on its external validity, since its fundamental aim is to generalize results to human populations. To take advantage of the aforementioned fact that LLMs have been trained on a plethora of data containing natural language expressions of human thoughts, values, and behavior (cf. OpenAI 2024a; 2024b; 2024c; 2022; Ornstein et al. 2024), which would expectedly allow them to closely emulate human behavior as expressed through natural language, the SCF will simulate human units by using an LLM to behave as each unit. This means that the SCF produces results based on LLM-driven inferences of human behavior. Given that the external validity of this behavior is critical, the SCF involves several procedures that serve to make the LLM-based behavior as realistic as possible. The first step involves supplying the LLM with a prompt detailing relevant characteristics of each human unit. Since LLMs are inherently text-processing neural networks that generate a probability-based output based on the input (cf. OpenAI 2024a; 2024b; 2024c; 2022; Ornstein et al. 2024), supplying the model with the characteristics of each simulated unit will likely enable it to more accurately infer the types of language-based behavior of an individual with those specific characteristics (cf. Chung et al. 2007; Pennebaker et al. 2003). While the relevancy of characteristics will likely differ between research topics, human characteristics that are generally considered relevant include those related to sex, age, income, education, and geography (Memon et al. 2024). Researchers applying the SCF are encouraged to consider characteristics beyond those mentioned here, since, consistent with concerns of latent measurement precision (cf. Bakker and Lelkes 2018; Credé et al. 2012), the greater the number of simulated characteristics, the more precisely the LLM can plausibly be expected to infer realistic and generalizable human behavior when controlling simulated units. At the same time, researchers should note that, just like humans (cf. Anderson 2020: 171-244; Eysenck and Keane 2020: 239-392), LLMs have a finite limit on how much information they can process and retain (cf. OpenAI 2025a), so the most relevant characteristics in relation to the research topic should likely take precedence. As such, by explicitly specifying these characteristics and supplying them to the LLM, its ability to realistically emulate humans is expected to increase.

The second consideration is that, as with any human data (Aronow and Samii 2016; Deaton and Cartwright 2018), the external validity of the simulations is closely related to how representative the simulated units are of the population of interest. If data on the entire population of interest is available, external validity is generally not a concern (Imai et al. 2008), but since the simulation of every human in a population would likely be unfeasible for a practical implementation of the SCF, the framework will be developed with a reliance on samples of simulated human units in mind, making external validity a concern. When generalizing from any sample to a population, it is ideal that the unit characteristics in the sample are proportional to their prevalence in the population, since sample-population imbalances can produce errors (e.g., Type M and S) if effects vary across characteristics (Gelman et al. 2021: 59; Gelman and Carlin 2014; Imai et al. 2008). A sample with characteristics representative of the population is typically achieved through random sampling (Stock and Watson 2019: 108; Thompson 2012), where each unit has the same probability of being included in the sample. However, since the SCF will rely on LLM-based simulated units, a random sample from a population cannot be achieved in the conventional manner, for example, by distributing an online survey. Instead, the SCF will involve a simulated random sample, which will specifically be done by considering representativeness solely with respect to the aforementioned relevant characteristics. To apply random sampling to the simulation of these characteristics, it is necessary to obtain the prevalence of these characteristics in the population of interest. This may be unfeasible for some applications, but census data often contain such micro-level data, or at the very least, estimates of them (e.g., the IPUMS for the US population, Ruggles et al. 2024). To then mimic random sampling, it is ideal if the simulated random sampling relies on the joint probability of characteristics (i.e., the probability of each combination of characteristics), since different combinations of characteristics may not have equal probability of occurring in the population. For example, suppose that the population of interest is the voter population of the United States and that the characteristics to simulate are race (e.g., Caucasian and Afro-American) and party affiliation (e.g., Democrat and Republican). Here, it is known that Afro-Americans, compared to Caucasians, have a greater probability of being affiliated with the Democratic Party than the Republican Party (Wallace et al. 2009), so it would be incorrect for the simulation to assume that the probability of being a Republican is the same for Caucasians and Afro-Americans (i.e., \(\Pr(\text{Republican} \mid \text{Caucasian}) \ne \Pr(\text{Republican} \mid \text{Afro-American})\)). Accordingly, a simulated random sampling that simulates characteristics with independent probabilities would fail to account for this interdependence of characteristics, which is why the SCF will simulate random sampling using joint probabilities. This procedure for simulated random sampling can be implemented by transforming the census data into a stratification matrix (cf. Gelman and Little 1997), which contains the proportion of each combination of the characteristics in the population, which in Frequentist statistics is equivalent to the joint probability. Analogous to the aforementioned bootstrapping (cf. Efron 1979, 2003; Efron and Tibshirani 1994), joint probability random sampling is implemented by randomly sampling with replacement from the stratification matrix, with the sampling probability being the (estimated) population proportion of each combination of characteristics. As evidenced by the equivalent logic of the bootstrap (cf. Efron 1979, 2003; Efron and Tibshirani 1994), this (asymptotically) guarantees a sample representative of the population with respect to the characteristics. A limitation of this procedure is the reliance on census data, which can contain measurement error and may not be current, easily available, nor contain measures of all relevant characteristics. In these instances, it may be necessary to simulate characteristics with outdated population estimates and/or without accounting for interdependencies. Nevertheless, the use of joint-probability-based random sampling helps ensure that the characteristics supplied as the basis of the LLM-driven simulated units are sampled in proportion to their prevalence in the population of interest, increasing the external validity of the SCF.
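As an illustration of this joint-probability sampling, consider a minimal Python sketch using the hypothetical race-and-party example above; the proportions are placeholders, not real census figures:

```python
import numpy as np
import pandas as pd

# Hypothetical stratification matrix: one row per combination of
# characteristics, with its (estimated) population proportion.
strata = pd.DataFrame({
    "race": ["Caucasian", "Caucasian", "Afro-American", "Afro-American"],
    "party": ["Democrat", "Republican", "Democrat", "Republican"],
    "proportion": [0.30, 0.35, 0.25, 0.10],  # placeholder joint probabilities
})

rng = np.random.default_rng(seed=2024)
n = 1_000  # number of simulated units to sample

# Joint-probability random sampling: draw combinations with replacement,
# weighted by their (estimated) prevalence in the population.
rows = rng.choice(strata.index, size=n, replace=True, p=strata["proportion"])
sample = strata.loc[rows, ["race", "party"]].reset_index(drop=True)
```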

3.2.3 Life-Histories

To further improve the ability of the SCF to provide results that generalize to human populations, the literature on life-histories (Bertaux 1981; Rosenthal 2018; Schütze 1997, 2008) is consulted. To understand why the concept of life-histories can be used within the SCF to make the behavior of the simulated units more realistic, a brief review is presented here. A life-history can be considered ‘a narrative gestalt that must be envisioned as an ordered sequence of personal experiences’ (Schütze 2008: 168; cited in Wygnańska 2019: 116, emphasis in original), whose orderliness ‘implies the inner identity development of the biography incumbent’ (Schütze 2008: 168; cited in Wygnańska 2019: 116). This means that a life-history is a structured autobiographical narrative, and the fact that it expresses a person’s psychological construction of self-identity based in intuition (Schütze 2008: 164; Wygnańska 2019: 115) is supported by evidence of people forming ‘spontaneous narratives of self-lived experiences’ (Riemann 2014: 27; e.g., Alea 2008; Bellana et al. 2022). In sociology, this concept is the foundation for the life history method, a style of interview where the interviewer lets the interviewee tell their life history (Riemann 2014: 23; Wygnańska 2019: 113). The method has seen extensive use (for a review, see, e.g., Goodson 2001; for criticisms, see, e.g., Hamel 2013), including case studies of people having undergone traumatic episodes (Gaulejac 1997; e.g., Riemann 2014; Wygnańska 2019). Being the storytellers of their own lives, the emphasis of this method is on the interviewees’ subjective experiences, and they are generally free to choose how to order and frame their life and which experiences to talk about. A life-history can include information about the origin of the interviewee’s parents, their childhood, family dynamics, ethnicity, moral and cultural values, education, employment, social interactions, and experiences as an adult (e.g., Riemann 2014; Williams and Khan 2020; Wygnańska 2019). The main advantage of this method is ‘the possibility for the narrator to recreate in his/her story, the significant biographical processes that shaped his/her life in subsequent phases of biography, experienced situations, and the social framework in which his/her biographical processes took place.’ (Kaźmierska and Schütze 2013: 131; cited in Wygnańska 2019: 115). This means that the interview does not have to extensively cover all aspects of their life, letting the interviewee choose to highlight what they perceive as the most defining events of their life. Based on the life-history, the researcher then attempts to ‘explain motivations, attitudes, and values that shape human behavior and the flow of human affairs’ (Stanfield 1987) by basing the analysis on how the interviewee constructs meaning around the events of their life (Schütze 2008: 131; Wygnańska 2019: 116). Since this narration may cover traumatic events and socially undesirable behavior of the interviewee, trust between interviewee and interviewer is critical for the quality of the information divulged in this manner (Riemann 2014: 23; Wygnańska 2019: 115), and if this is achieved, even socially undesirable behavior can be divulged.

This concept of life-histories can be used by the SCF to further characterize simulated units. Compared to manually specifying every relevant characteristic, by letting the LLM serve as the autobiographical narrator of the simulated unit, the relevant characteristics assigned to each unit (e.g., gender, age, income), based on the aforementioned joint probability sampling procedure, can be used by the LLM to further ‘flesh out’ the unit through the generation of a life-history containing more characteristics and information about the unit. Stated differently, prompting the LLM to generate a life-history allows the generation of greater nuance about the individual, their identity, and behavior. Similar to the rationale for supplying the LLM with explicitly assigned relevant characteristics, basing the life-history on the assigned characteristics serves to ensure that the contents of the life-history are probabilistically determined conditional on the assigned characteristics. Compared to merely supplying the assigned characteristics to the LLM as a list, this also serves to embed the assigned characteristics in a realistic manner within the narrative structure of the life-history, allowing them to contribute to the creation of a more ecologically valid (cf. Schmuckler 2001) autobiographical identity for each simulated unit. This, of course, requires that the LLM generates a life-history that includes and is consistent with the assigned characteristics. While these are relevant concerns, they are partly empirical and can be evaluated by the researcher by comparing the assigned characteristics with the generated life-history. As such, prompting the LLM to generate life-histories based on the assigned characteristics is likely to provide more information on the type of individual that the LLM is to emulate, thus increasing the realism and external validity of the LLM-based behavior relative to human populations. Similar to the specification of relevant characteristics for the simulated units, note again the practical limitations on the length of life-histories, which need to be short enough for the LLM to process and retain when controlling the behavior of the synthetic units.
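A minimal sketch of how such life-histories might be generated is given below, assuming OpenAI’s Python client and the GPT-4o-mini model; the helper name, prompt wording, and unit characteristics are illustrative only, not a prescribed implementation:

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key in the OPENAI_API_KEY environment variable

def generate_life_history(characteristics: dict) -> str:
    """Let the LLM narrate a life-history consistent with the
    characteristics assigned to one simulated unit."""
    traits = "; ".join(f"{k}: {v}" for k, v in characteristics.items())
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Write a short first-person life-history (at most 300 words) "
                f"of a person with the following characteristics: {traits}. "
                "Embed the characteristics naturally in the narrative."
            ),
        }],
    )
    return response.choices[0].message.content

# Hypothetical unit drawn by the sampling procedure outlined above.
unit = {"gender": "female", "age": 47, "income": "middle", "region": "Midwest"}
life_history = generate_life_history(unit)
```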

3.2.4 Synthetic Units

The synthesis of simulated characteristics and the LLM-generated life-history is expected to produce units that, when controlled by the LLM, produce substantially more realistic and generalizable behavior than simulated units lacking life-histories. This is because the LLM, by merit of being trained on real-world data (OpenAI 2024a; 2024b; 2024c; Ornstein et al. 2024), expectedly infuses the researcher-assigned characteristics with real-life characteristics, making the product neither strictly simulated nor strictly real. As such, units produced by the synthesis of simulated characteristics and LLM-determined life-histories will, borrowing Hegelian dialectical terminology (Fichte 1970; Hegel 2014; though see Kaufmann 1959: 154), be denoted synthetic units to clarify a conceptual distinction from simulated units. The concept of ‘synthetic units’ borrows from the existing concept of synthetic data (e.g., Abadie et al. 2010; Agarwal et al. 2023; Emam et al. 2020; Strübing et al. 2023). Such data similarly involves a synthesis between real-world and simulated data, and it has been used to circumvent privacy laws, where data containing personal information is used to generate realistic but simulated data that can be shared without violating such laws (Abowd et al. 2004; Emam et al. 2020). On a similar note, synthetic data is also used to train deep learning models aimed at real-world application (Nikolenko 2021; Kennedy et al. 2024; e.g., Zivenko et al. 2025). Consistent with the SCF’s reliance on LLMs to generate synthetic units, previous studies have generated synthetic data using statistical models (Christensen et al. 2019: 187; e.g., Kinney et al. 2011), and the synthetic units produced by an LLM within the SCF can thus be considered a special case of synthetic data that is entirely valid as a basis for research. This also means that a key step in applying the SCF is the transformation of simulated units into synthetic units by generating life-histories based on relevant characteristics that are sampled in proportion to the population.

With synthetic units having been produced, it is then possible to conduct a study using these units in place of human respondents. This has the benefit that synthetic respondents can be manipulated in many more ways than human units, since there are fewer practical, ethical, and cost concerns. This means that a previously non-manipulable theorized cause could in principle be manipulated with the SCF, though a key assumption here is that the manipulation is meaningfully expressible through natural language and can be processed as input by the model (e.g., text, images, etc.). For example, if the purpose of the study is to evaluate the effects of a new pharmaceutical drug on human welfare, the LLM-driven synthetic respondents conceptualized by the SCF will likely yield invalid results, because the LLM requires that the phenomenon under investigation can be meaningfully captured in existing language-based expressions. The effects of a new drug fail to meet this criterion, because the effects of pharmaceutics mostly involve biochemical processes that may not be distinctly expressed in natural language. Accordingly, the use of the SCF in medical research is likely invalid and should only be considered with the utmost precaution due to the direct ramifications of such research on human health. With these considerations in mind, the specific procedure used in the SCF to achieve an exogenous manipulation can now be outlined.

3.2.5 Exogenous Manipulation

For the SCF to provide an internal validity comparable to that of RCTs, the manipulation of the theorized cause in synthetic units must be exogenous to reliably avoid confounding (cf. Angrist and Pischke 2015: 1-46; Angrist and Pischke 2009: 15-22; Stock and Watson 2019: 48-49; Wooldridge 2019: 54-56). In line with the aforementioned considerations, Guido W. Imbens and Donald B. Rubin (2015: 31) have outlined theoretical restrictions on what constitutes an exogenous assignment mechanism, namely that it must be individualistic, probabilistic, and unconfounded. These restrictions apply to the aforementioned RAM of RCTs, where each unit is individually assigned to a manipulation condition (e.g., treatment or control), with equal probability (e.g., 50%), and independently of all other factors. This means that the RAM exogenously assigns any individual to any one manipulation condition. However, recall from the Potential Outcomes Framework and equations 1.4-1.5 how this procedure can be viewed as the result of the inability to assign the same unit to multiple manipulation conditions due to a lack of independence and simultaneity (equations 1.2-1.3), and that the RAM as a result only prevents confounding in an unbiased and asymptotic manner (Deaton and Cartwright 2018: 6). By instead relying on synthetic units, the SCF offers a synthetic solution to the Fundamental Problem of Causal Inference (cf. Holland 1986), since each unit can be independently and simultaneously assigned to every manipulation condition. This is because each synthetic respondent can in principle be exposed to one level of a manipulation theorized to impact the potential outcome, \(Y_i(X_{x, i})\), while independently and simultaneously being exposed to another level of manipulation, \(Y_i(X_{x', i})\). The independence of assignment occurs because the LLM determining the behavior of the synthetic units in each manipulation condition is unaware of previous, concurrent, and future manipulations of each unit, making the potential outcomes independent and thus avoiding pre- and post-manipulation biases (cf. Linos and Twist 2018; McElreath 2019: 170-176; Montgomery et al. 2018). The simultaneity of assignment occurs because multiple requests to the LLM can be made at the same time, which eliminates time as a confounder, because it does not vary between the manipulation conditions (equation 1.3). The result of this is that a counterfactual outcome, one for each manipulation condition, can be measured in an unbiased manner. Causal effects are then computed as the difference between each synthetic unit in one manipulation condition and the same unit in another manipulation condition (equation 1.2), meaning that all unit- and time-invariant factors are subtracted away, producing a causal effect arguably more reliable than one produced by an RCT. Under these conditions, not just average causal effects, but individual causal effects can be estimated without confounding (cf. Westreich et al. 2015). As such, in the SCF, all synthetic respondents are assigned to every manipulation condition using a simultaneous and independent exogenous assignment mechanism (SIXAM). The SIXAM of the SCF meets the aforementioned criteria by being individualistic (cf. Imbens and Rubin 2015: 34), since the probability of being assigned to a manipulation condition is independent of the manipulation conditions of other units; probabilistic (cf. Imbens and Rubin 2015: 38), since it meets the requirement of every unit having the possibility of being assigned to any manipulation condition (with that probability being 1); and unconfounded (cf. Imbens and Rubin 2015: 38), since the assigned manipulation condition does not depend on the potential outcomes. As such, the SIXAM of the SCF differs from the RAM of RCTs, and causal effects are derived within the SCF through an independent and simultaneous generation of counterfactual outcomes that are compared within synthetic respondents (see equation 1.2), enabling unbiased estimates of causal effects through the elimination of confounders by design.
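Continuing the hypothetical sketch above, the SIXAM might be implemented by submitting one API call per manipulation condition in parallel; the function name, condition labels, and vignette texts below are placeholders, and thread-level parallelism only approximates strict simultaneity:

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()

def measure_outcome(life_history: str, manipulation: str) -> str:
    """One 'world': the LLM, role-playing one synthetic unit, answers the
    outcome measure under one manipulation condition, unaware of the others."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": f"You are the following person: {life_history}"},
            {"role": "user", "content": manipulation},
        ],
    )
    return response.choices[0].message.content

# Placeholder manipulation prompts, one per condition.
conditions = {
    "g_x": "Read the following vignette [...] How do you respond?",
    "g_x_prime": "Read this alternative vignette [...] How do you respond?",
}

# Submit every condition at (approximately) the same time: each call runs in
# its own thread, so no outcome generation depends on or follows another.
with ThreadPoolExecutor(max_workers=len(conditions)) as pool:
    futures = {g: pool.submit(measure_outcome, life_history, prompt)
               for g, prompt in conditions.items()}
    outcomes = {g: future.result() for g, future in futures.items()}
```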

The simultaneity of the simultaneous and independent generation of counterfactual outcomes outlined above serves to eliminate time as a confounder when employing equation 1.2, for reasons previously outlined in relation to equation 1.3. While this paper mostly serves to outline the theoretical framework of the SCF (though for a proof-of-concept, see below), with specific implementations of the SCF likely varying across research topics, a theoretical basis for how to implement a simultaneous generation is provided here for clarity. The ability to simultaneously generate counterfactual outcomes for the same synthetic unit is made possible by the reliance on LLMs, with multiple LLMs being available for use at the same time. Hence, a synthetic unit can be operated by one LLM under one manipulation condition, while the same synthetic unit is operated by a different yet identical LLM under another manipulation condition. This is perhaps most easily achieved by relying on an API for the LLM (e.g., OpenAI’s API, 2025a). By specifying the different manipulation conditions for the synthetic unit, a parallel process (cf. Gottlieb and Almasi 1989; Ozdoganlar and Endres 2021[1999]) can be used to simultaneously call the API, one time for each manipulation condition, allowing the LLM to generate an outcome for each manipulation condition with, expectedly, no time difference and no dependency on the generation of other outcomes under different manipulation conditions. To conceptually denote this parallel-processing procedure in relation to the SCF, this simultaneous and independent exogenous manipulation and generation of counterfactual outcomes will, with inspiration from Sara Steegen, Francis Tuerlinckx, Andrew Gelman, and Wolf Vanpaemel (2016), be denoted ‘parallel worlds’-estimation. This helps emphasize the notion that the employment of equation 1.2 to compute causal effects makes a simultaneity assumption to ensure its equivalency with equation 1.3. This simultaneity assumption states that the point in time for measuring the counterfactual outcomes for any unit does not vary with the manipulation assigned to each unit. Note that this assumption mostly matters for the computation of unconfounded individual causal effects, since the average causal effect, just as for RCTs (Ho et al. 2007: 207-208), is unconfounded if the points in time for measuring the counterfactual outcomes do not vary across the manipulation groups. As such, the SCF solves the Fundamental Problem of Causal Inference (cf. Holland 1986) by implementing a simultaneous and independent exogenous manipulation and generation of multiple counterfactual outcomes for each synthetic respondent, which are used to reliably isolate a causal effect through intra-unit comparisons of counterfactual outcomes. The identification strategy outlined for the SCF can account for the aforementioned fact that the mere assignment to any manipulation does not ensure the intended level of manipulation, and that the potential outcome could in fact depend on an interaction between the theorized cause and the manipulation levels (e.g., if the LLM infers a prior manipulation of the synthetic unit which interacts with an effect on the inferred outcome). This is achieved by modifying equation 1.2 using equation 1.5, resulting in the following:

\(D_{\underset{x, x'}{\Delta}, \underset{g_x, g_{x'}}{\Delta}, i} = (Y_i(X_{x, i}, C_i) \: | \: G = g_x) \: - \: (Y_i(X_{x', i}, C_i) \: | \: G = g_{x'})\) 1.6

As already stated, equation 1.6 involves the computation of individual treatment effects using intra-unit comparisons of counterfactual outcomes that have been generated using the ‘parallel worlds’-estimation procedure, which employs the aforementioned SIXAM. Unlike equation 1.2, equation 1.6 accounts for an interacting effect between the manipulation condition and the theorized cause by explicitly denoting the effect (i.e., \(D\)) as possibly conditional on the realized (i.e., \(x\) and \(x'\)) and intended levels of manipulation (i.e., \(g_x\) and \(g_{x'}\)). Aggregated to the group level, the identification strategy is equivalent to equation 1.5 and can be re-formalized for clarity as the following equation.

\(\mathbb{E}(D_{\underset{x, x'}{\Delta}, \underset{g_x, g_{x'}}{\Delta}}) = \mathbb{E}(Y(X_{x}, C) \: | \: G = g_x) \: - \: \mathbb{E}(Y(X_{x'}, C) \: | \: G = g_{x'})\) 1.7 (1.5)

To statistically estimate the causal effect, equations 1.6 and 1.7 specify it as the difference in potential outcomes across counterfactual manipulation conditions. This is easily estimable using the simple but powerful difference-in-means estimator, which is unbiased and requires few assumptions (Deaton and Cartwright 2018: 3; Gelman et al. 2021: 350-351). For designs with more than two manipulation conditions, the identification strategy of the SCF can easily be generalized to an arbitrary number of manipulation conditions, though estimating a causal effect with more than two groups would likely necessitate introducing more assumptions about the functional form of the relationship between the theorized cause and outcome (e.g., non-linearity). Note also that since the estimation involves intra-unit comparisons, the identification strategy of the SCF as outlined here guarantees perfect balance between groups in terms of unit-invariant confounders (and time-varying confounders will also be balanced if the aforementioned simultaneity assumption holds), which differs from the merely unbiased and asymptotic balance properties of RCTs. This procedure is similar to, and at least as efficient as, blocking (Gelman et al. 2021: 347-353), which involves a stratified randomization based on measured confounders that guarantees balance on those confounders (Imai et al. 2008: 489; Duflo et al. 2007: 3925). Perhaps most importantly in terms of efficiency, by merit of making such intra-unit comparisons, the SCF obtains perfect balance on both measured and unmeasured confounders.
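
As a minimal illustration of this estimator, consider the following R sketch on hypothetical SCF-style data, where y_a and y_b denote each synthetic unit’s counterfactual outcomes under two manipulation conditions (all names and values are illustrative):

```r
# A minimal sketch of the intra-unit difference-in-means estimator on
# illustrative SCF-style data (all values here are hypothetical).
set.seed(1)
scf <- data.frame(
  id  = 1:1000,
  y_a = rnorm(1000, mean = 3.0),  # counterfactual outcome under condition A
  y_b = rnorm(1000, mean = 3.5)   # counterfactual outcome under condition B
)

d_i <- scf$y_b - scf$y_a  # individual causal effects (equation 1.6)
mean(d_i)                 # average causal effect (equation 1.7)
t.test(d_i)               # inference on the (paired) difference in means
```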

Inferring a causal estimate from the sample to the population using the procedure outlined for the SCF can then be done using standard statistical inference, for example, with parametric or non-parametric statistical models. However, for inferential statistics to be reliable, they need to account for each synthetic unit having been measured multiple times. This is because data derived from the same unit can be autocorrelated (Stock and Watson 2019: 375; Gelman et al. 2021: 459), which produces downward-biased estimates of inferential uncertainty. Autocorrelation can be mitigated by accounting for the clustered structure of the data, with a common approach being multilevel models (Gelman and Hill 2007). These can be easily implemented using standard statistical software packages16, meaning that causal inference can easily be derived from data generated by the SCF using standard statistical inference.
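
As a minimal sketch of such an adjustment, reusing the illustrative scf data from the sketch above, one could reshape the data to long format and fit a varying-intercepts model (lme4 is used here for illustration; brms would be the Bayesian analogue):

```r
# A minimal sketch of a multilevel adjustment for the clustered structure
# of SCF data, reusing the illustrative scf data frame from above.
library(lme4)
library(tidyr)

# One row per unit-condition measurement (long format).
scf_long <- pivot_longer(scf, cols = c(y_a, y_b),
                         names_to = "condition", values_to = "y")

# Varying intercepts per synthetic unit account for each unit being
# measured under every manipulation condition.
fit <- lmer(y ~ condition + (1 | id), data = scf_long)
summary(fit)
```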

3.2.6 Formalization

With the theory underlying the framework outlined above, the Synthetic Causal Framework (SCF) is here defined as a theoretical framework for drawing reliable causal inference when the theorized cause is non-manipulable in human populations, achieved through the use of LLM-generated synthetic units that enable independent and simultaneous intra-unit comparisons of counterfactual outcomes. As outlined here, the SCF can primarily be considered a theoretical contribution to the literature on causal inference, with its scope of application being the social and behavioral sciences (cf. Wright 2015), specifically studies that rely on human populations where the theorized cause is non-manipulable. Its central assumption is that the theorized cause and outcome must be meaningfully expressible through natural language, and researchers applying the SCF must be able to justify why this assumption holds for their particular research topic. Since it is typically unclear what specific data an LLM has been trained on, researchers should also justify why they use a specific LLM and why they expect the theorized cause and outcome to have been expressed in data that this LLM could have been trained on. These are prerequisites for establishing why the LLM can be expected to infer human behavior in relation to an effect of the theorized cause on the outcome.

This formalization of the SCF can be considered an extension of the Neyman-Rubin Potential Outcomes Framework (cf. Holland 1986; Imbens and Rubin 2015; Neyman 1923; Rubin 2005), with its fundamental definition of a causal effect being the difference in potential outcomes varying solely with respect to the theorized cause. The causal identification strategy of the SCF involves making simultaneous and independent intra-unit comparisons of the counterfactual outcomes, with the equation used for estimating this at the individual level being equation 1.6, a modified form of the aforementioned equation 1.2. At the group level, the equation used for estimating the expected causal effect is equation 1.7, which is equivalent to the aforementioned equation 1.5. By using the ‘parallel worlds’-estimation procedure, the SCF applies a simultaneous and independent exogenous assignment mechanism (SIXAM) to assign each synthetic unit to every manipulation condition, with a counterfactual outcome being measured for each, allowing intra-unit comparisons of the counterfactual outcomes to eliminate all unit- and time-invariant confounders. Should the ‘parallel worlds’-estimation procedure fail or be unavailable to researchers (e.g., because parallel processing requires a computational core for each manipulation condition), researchers may alternatively rely on the simultaneity assumption to reliably estimate individual causal effects. However, researchers invoking this assumption should note that if it is violated, time may confound causal estimates at the individual level, though, similar to RCTs, not at the group level.

While these considerations together ensure a high internal validity of the SCF, its external validity is less certain. Such concerns are mitigated by procedures embedded within the SCF, foremost through the generation of synthetic units from characteristics simulated for their relevance to the research topic. This simulation involves census data, where a joint probability random sampling procedure is used to ensure that the simulated characteristics are proportional to their prevalence in the population of interest. These characteristics are then used to generate further characteristics as part of a life-history (cf. Bertaux 1981), which, while expectedly less important than the simulated characteristics, nonetheless serve to add human-like nuance to the synthetic units, so that the LLM-driven behavior of each synthetic unit becomes more realistic and generalizable to humans. To provide an ease of overview of the formalization of the SCF and the workflow involved in its use, a flowchart is provided in figure 3.2.

Figure 3.2: The Synthetic Causal Framework


NOTE: Flowchart. \({}^{[1]}\!\)Joint probability sampling is used to simulate relevant unit characteristics in proportion to their prevalence in the population. \({}^{[2]}\!\)Simulated unit characteristics and instructions to generate a life-history are provided in a prompt to the LLM. SIXAM = Simultaneous and independent exogenous assignment mechanism. \({}^{[3]}\!\)Parallel processing is used to simultaneously, independently, and exogenously assign each synthetic unit to both manipulation conditions. With the LLM controlling the behavior of the synthetic units based on the generated life-histories, a counterfactual outcome for each manipulation condition is measured. Note that the dashed circle indicates the area of the design where the ‘parallel worlds’-estimation is implemented. \({}^{[4]}\!\)Data is reshaped into long format to account for the multiple outcome measurements of each synthetic unit.
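
To make the SIXAM and ‘parallel worlds’-estimation steps of figure 3.2 concrete, the following is a minimal R sketch for a single synthetic unit, assuming a hypothetical ask_llm() helper that sends one prompt to an LLM API and returns the generated outcome, and a unit_life_history string generated beforehand:

```r
# A minimal sketch of 'parallel worlds'-estimation for one synthetic unit:
# one simultaneous, independent API call per manipulation condition, each
# in its own process, so neither outcome can depend on, or lag behind,
# the other. ask_llm() and unit_life_history are hypothetical stand-ins.
# Note: mclapply() requires a Unix-like system for mc.cores > 1.
library(parallel)

conditions <- c(A = "vignette for manipulation condition A",
                B = "vignette for manipulation condition B")

outcomes <- mclapply(conditions, function(cond) {
  ask_llm(prompt = paste(unit_life_history, cond))  # hypothetical helper
}, mc.cores = length(conditions))
```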

Having outlined the SCF in this manner, there are several considerations worthy of a brief discussion. It should be noted that the reliance on generating synthetic data for causal inference is not new (e.g., Mazaheri et al. 2024). For instance, the previously mentioned synthetic control methodology (Abadie et al. 2010, 2015; Abadie and L’Hour 2021) serves to enable causal inference through the generation of counterfactual outcomes. This method is often used in case studies of countries and the effects of policy implementations, with the counterfactual outcome being constructed from the outcomes of other units (e.g., countries that did not implement the policy, Bilgel and Galle 2015; Birdsall 2011; Gilchrist et al. 2024; Powell 2022). While this method does provide justification for a causal interpretation and is relatively easily implemented in conventional software (e.g., Abadie et al. 2011), it possesses the same previously mentioned issues of adjusting for observables, since the synthetic counterfactual outcome is constructed from measured characteristics across groups in the data, leaving results possibly confounded by the failure to match on unobservables. Another limitation of that approach is the necessity for researchers to possess time-series data, reducing its applicability (though applicability is expanding, e.g., Abadie and L’Hour 2021; Arkhangelsky et al. 2021; Pang et al. 2022). While both methods essentially involve the inference of counterfactual outcomes, unlike the synthetic control method, the Synthetic Causal Framework (SCF) proposed here does not solely match on observables and does not require time-series data, making it more broadly applicable.

On a related note, the internal and external validity of the SCF’s reliance on LLMs may be criticized on the theoretical grounds that the generation of counterfactual outcomes essentially involves the LLM making stochastic predictions conditional on the life-history of each synthetic respondent. Implicit in such criticism is the premise that such behavior is fundamentally different from how humans behave. However, for this criticism to hold, the generation of outcomes in humans must substantively differ from that of synthetic respondents. That is, humans must be neither stochastic nor conditioning their responses on their life-history. But researchers consistently assume, and find evidence consistent with, human behavior being stochastic and conditional on life-history (e.g., gender, age, education, income, political affiliations, Memon et al. 2024), where not every aspect of one’s identity or experiences matters equally, nor is equally salient across different moments. Analogous to an LLM making a stochastic prediction of counterfactual outcomes conditional on a life-history, human respondents similarly make a guess about the outcome (e.g., their preference for a political leader) based on salient components of their life-history, which may change at a later time, for example, due to the acquisition of new experiences and information. Similarly in terms of stochasticity, when making statistical inference from human self-report data, a common assumption is that of random measurement error (Gelman et al. 2021: 49-50; Levy and Mislevy 2020; Stock and Watson 2019), where any human response is characterized by some idiosyncratic and stochastic error component, and the concepts of inter- and intra-rater reliability specifically serve to account for the stochasticity of human rating (Saal et al. 1980). Accordingly, an LLM predicting a preference based on the life-history and characteristics of a synthetic respondent is not inherently different from a human respondent forming a preference based on their characteristics, life-history, and the context of the situation.

While these considerations arguably do lend some credence to the external validity of the SCF, it should be noted that LLMs currently possess numerous limitations (e.g., Edyko et al. 2023; Koga and Du 2024; Lecler et al. 2023; Obaid et al. 2023; Park and Choi 2024; Saka et al. 2024; Shea and Ma 2023; Zhang et al. 2024). While future models may solve some of these issues (e.g., OpenAI 2024c), a central concern for this paper is whether results produced by current LLMs allow researchers to implement the SCF and expect results to generalize to human populations. To test the external validity of the SCF, and provide a proof-of-concept for how researchers can implement it, the remainder of this paper is therefore dedicated to validating the SCF against human data and assessing whether its results are generalizable.

4 Methodology

To justify the Synthetic Causal Framework (SCF) empirically, as well as help researchers implement it, a conceptual replication study (cf. Isager et al. 2023: 2; Nosek et al. 2012: 619) is undertaken that simultaneously serves as a demonstrative proof-of-concept. This section first details how the external validity of the SCF will be tested through a large-scale replication (n = 1,009) of the established relationship between deservingness and welfare support (Aarøe and Petersen 2014; Kane et al. 2023; Mummolo and Peterson 2019). This involves a theoretical justification of the relationship, with the exact causal estimate to replicate being determined by conducting a meta-analysis (Eysenck 1995; Glass 1976; Sharpe and Poets 2020) of studies previously investigating it within the United States. The study then follows the illustrated workflow of the SCF (see figure 4.1) by implementing a joint probability random sampling procedure to simulate random sampling from the adult population of the United States, making results expectedly more representative of that population. This is followed by specifying prompts for generating life-histories (cf. Bertaux 1981) and detailing code for making requests to the gpt-4o-mini LLM (OpenAI 2024a) using OpenAI’s API (2025a), which serve to transform the simulated respondents into synthetic respondents, making their expected behavior more nuanced and realistic. The novel concept of a Synthetically Controlled Trial (SCT) is then introduced, where the ‘parallel worlds’-estimation procedure is implemented for a simultaneous and independent exogenous manipulation of deservingness. Group balance and the simultaneity assumption are then assessed, followed by specifying the Multilevel Bayesian Generalized Robust Linear Model (cf. Bürkner 2017, 2018) used in the analysis to identify the causal effect of deservingness on welfare attitudes.

Figure 4.1: SCF-based Replication Design


NOTE: Synthetic Causal Framework-flowchart. n = Sample size. \({}^{[1]}\!\)Based on population estimates from IPUMS (Ruggles et al. 2024) and ANES (2021), joint probability sampling is used to simulate relevant unit characteristics in proportion to their prevalence in the population. These unit characteristics are US state of residence, area type of residence, number of children in household, sex, age, marital status, ethnoracial group, level of education, employment status, annual household gross income in 2022 US dollars, preferred political party, and political ideology. \({}^{[2]}\!\)Simulated unit characteristics and instructions to generate a life-history are provided in a prompt to the API of the gpt-4o-mini LLM (OpenAI 2024a). SIXAM = Simultaneous and independent exogenous assignment mechanism. \({}^{[3]}\!\)Parallel processing is used to simultaneously, independently, and exogenously assign each synthetic unit to both manipulation conditions, where they are exposed to a ‘lazy’ (Manipulation A) and an ‘unlucky’ welfare recipient (Manipulation B). With the LLM controlling the behavior of the synthetic units based on the generated life-histories, a counterfactual outcome for each manipulation condition is measured, which is the support for restricting the welfare eligibility of the corresponding welfare recipient. Note that the dashed circle indicates the area of the design where the ‘parallel worlds’-estimation is implemented. \({}^{[4]}\!\)Data is reshaped into long format to account for the two outcome measurements of each synthetic unit, doubling the effective sample size.

Following recent recommendations for reproducible and transparent research (e.g., Alister et al. 2021; Lenz and Sahn 2021; Open Science Collaboration 2015; Peng 2011; Ritchie 2021), this study subscribes to the open science standard (Christensen et al. 2019; Nosek et al. 2012, 2015). This means that the study (1) is an open project, whose development can be followed by the public; (2) uses open data, with raw data being made freely available; and (3) relies on open software by using the free and open-source R programming language (R Core Team 2024) and the RStudio integrated development environment (IDE, Posit Team 2024), which is especially suited for data science (Welbers et al. 2017: 245) and reproducible research (Krotov and Tennyson 2018: 171), with code being provided as R syntax in the markdown (Gruber 2014) version of this file. This is all made available using the Open Science Framework (OSF), with materials provided on the project hub: DOI: 10.17605/OSF.IO/5P3M2 (note that data and code are not made available in this preview version). Similarly, for transparency and reproducibility, software specifications, used R packages, descriptive statistics of the data, and details not covered in the main paper are provided in the appendix.

4.1 Deservingness Heuristic

To assess the external validity of the SCF, that is, the generalizability of results derived from synthetic respondents to human populations, this study seeks to replicate the established relationship between deservingness and welfare attitudes (Aarøe and Petersen 2014; Kane et al. 2023; Mummolo and Peterson 2019). To understand the concepts involved in this relationship, and its usefulness for validating the SCF, a brief review of existing theory and results is provided.

The relationship between deservingness and welfare attitudes is founded in evolutionary psychology (Buss 2019), where substantial parts of human behavior are considered the product of evolution by natural selection (Darwin 1998; Futuyma and Kirkpatrick 2023). In this view, the three essential components of evolution - genetic variation, inheritance, and differential reproductive success (Buss 2019: 7) - have led to the human genome adapting physically, psychologically, and socially to the conditions of the environment (Buss 2009, 2019; Dawkins 1976; Elster 2007). With evolution by natural selection operating mostly at a slow pace, the human genome is generally considered adapted to an environment characteristic of a hunter-gatherer society (Buss 2019; e.g., Boyer and Petersen 2018; Van Prooijen and van Vugt 2018), meaning that it is mostly not adapted specifically to modern society. In the long run, evolution produces adaptations useful for a particular environment, but these adaptations are not necessarily optimal and can prove disadvantageous across different environments (Boyer and Petersen 2018; Buss 2019; Li et al. 2018; Sperber and Hirschfeld 2004). While some of these adaptations have led to predispositions for different types of behavior that vary across individuals (e.g., Claessens et al. 2020; Hibbing et al. 2019; Oxley et al. 2008; though see Bakker et al. 2020; Osmundsen et al. 2022), others are posited to produce mostly invariant predispositions in the form of cognitive modules in the brain (Kurzban and Aktipis 2007; Sperber and Hirschfeld 2004) that operate subconsciously, automatically, and in a mandatory fashion (Cosmides et al. 2003: 173). Being specialized for making cost-efficient computational operations that affect our behavior by serving as decision heuristics (Kurzban and Aktipis 2007; Sperber and Hirschfeld 2004; e.g., Boyer and Petersen 2018; Pietraszewski et al. 2015), such modules are closely related to System I intuition-based thinking (Kahneman 2024; e.g., Haidt 2013; Haidt and Joseph 2004). It should be noted that while cognitive heuristics are generally theorized as universal and thus invariant across populations, they are not necessarily utilized to the same extent across individuals (e.g., Aarøe et al. 2017), and like other predispositions (Hibbing et al. 2019: 26), they do not determine behavior in a monocausal manner, since biological and social phenomena likely interact (e.g., Hinsz et al. 2013). With the functioning of cognitive modules involving processes that span different brain regions (Kurzban and Aktipis 2007; Sperber and Hirschfeld 2004), they are physical constructs, and similar to how other cognitive modules can become defunct due to a neurological disorder (e.g., people with prosopagnosia cannot recognize faces, Buss 2019: 253), some individuals may not possess nor be affected by a given cognitive heuristic.

Since humans are social creatures and depend on cooperation with others to improve their reproductive success, this has led to the development of social groups (Buss 2019), and cognitive modules for assessing ingroup and outgroup members likely evolved to solve issues of whom to cooperate with and how to redistribute goods (Fong et al. 2006). One such cognitive module has been denoted the deservingness heuristic (Aarøe and Petersen 2014; Petersen et al. 2011, 2012; Petersen 2012), and researchers state that it ‘influences everyday judgments about interpersonal help giving’ and ‘constitutes a key psychological basis for judgments about political help giving; that is, social welfare’ (Aarøe and Petersen 2014: 686). Consistent with related research (Feather 1999; Gilens 2000; Larsen 2006; Skitka and Tetlock 1993; van Oorschot 2000; Weiner 1995), this means that the deservingness heuristic assesses whether members are ‘deserving’ or ‘undeserving’ of being redistributed goods (Aarøe and Petersen 2014: 686). An individual may, for example, be assessed as ‘low in deservingness’ because they enjoy the benefits of the goods while being ‘unwilling to invest effort to improve their circumstances’ (Aarøe and Petersen 2014: 687). By comparison, an individual may be assessed as ‘high in deservingness’ because they are ‘making an effort and trying but failing due to forces beyond their control’ (Aarøe and Petersen 2014: 687), which indicates that the perceived effort of a person is central in assessing their deservingness. The deservingness heuristic, similar to other heuristics, likely operates in an automatic and subconscious manner (Petersen et al. 2011) by regulating emotions (cf. Al-Shawaf et al. 2015), producing a behavioral response (e.g., a political attitude) by evoking a positive emotion for individuals deemed high in deservingness and a negative emotion for individuals deemed low in deservingness (Petersen et al. 2012), which in turn affects political support for welfare policies. Based on numerous studies across cultures (e.g., Aarøe and Petersen 2014; Appelbaum 2002; Meyer and Mulherin 1980; Petersen et al. 2012; Weiner 1995), the deservingness heuristic is likely a cognitive module universally shared across the human population due to the common issues of group cooperation (e.g., free-riding, Delton et al. 2012). Since cognitive heuristics are cost-efficient, they can make suboptimal evaluations due to relying on sparse information, and consistent with numerous studies (Allport 1954; Bodenhausen and Lichtenstein 1987; Macrae et al. 1994), researchers argue that the heuristic can operate simply through stereotypes (Aarøe and Petersen 2014: 687), which are defined as ‘beliefs about the characteristics, attributes, and behaviors of members of certain groups’ (Hilton and von Hippel 1996: 240; cited in Aarøe and Petersen 2014: 687). This ties directly into the perceived effort of a person and means that the deservingness of a welfare recipient can be determined simply based on how much they resemble a stereotypical depiction of a ‘lazy’ or an ‘unlucky’ welfare recipient (cf. Aarøe and Petersen 2014; Petersen et al. 2012). Taken together, the deservingness heuristic operates by taking in sparse information about individuals (e.g., ‘lazy’ and ‘unlucky’ stereotypes) and computing an affective response that exerts a causal effect on political support for social welfare policies. For an illustrative view, the operations of the deservingness heuristic underlying the relationship between deservingness and welfare attitudes are provided as a directed acyclic graph (DAG, Pearl 1988, 1993, 1995) in figure 4.2. Note that much of the aforementioned research is also somewhat consistent with evolution by cultural selection (cf. Dawkins 1976; Sperber 1976), since the deservingness heuristic could in principle be a learned heuristic universally shared across cultures, similar to numerous other cognitive heuristics stipulated to be social constructions (e.g., Kuklinski and Quirk 2000; Lau and Redlawsk 2001).

Figure 4.2: Relationship Between Deservingness & Welfare Support


NOTE: Directed Acyclic Graph (DAG, Pearl 1988, 1993, 1995). Theorized causal relationship based on Aarøe and Petersen (2014).

The relationship between deservingness and welfare attitudes has been directly replicated across numerous studies (Aarøe and Petersen 2014; Kane et al. 2023; Mummolo and Peterson 2019). Since these are direct replications (cf. Isager et al. 2023: 2; Nosek et al. 2012: 619), they rely on highly similar methods, and for brevity, this review will only present the design of the original (i.e., Aarøe and Petersen 2014). In the original study, Lene Aarøe and Michael Bang Petersen (2014: 688) rely on cross-national, representative samples from the United States (US, n = 1,009) and Denmark (n = 1,006). Here, they employ an experimental survey design (cf. Gaines et al. 2006) containing three manipulation conditions, with the manipulations being validated text-based vignettes containing deservingness-cues relating to a ‘lazy’, an ‘unlucky’, and a ‘control’ welfare recipient17 (Aarøe and Petersen 2014: 689; also used in Petersen et al. 2012). Survey participants are assigned to view one of these vignettes using the aforementioned random assignment mechanism (RAM) characteristic of randomized controlled trials (RCTs, Coover and Angell 1907; Peirce and Jastrow 1885; Woodworth and Thorndike 1901), which, together with its reliance on stratified random and representative samples (Aarøe and Petersen 2014: 688), means that their study is here considered capable of providing reliable causal inference. The theorized welfare support outcome is operationalized as political support for tightening the requirements for welfare eligibility of the shown welfare recipient: ‘To what extent do you disagree or agree that the eligibility requirements for social welfare should be tightened for persons like him?’ (Aarøe and Petersen 2014: 689, emphasis in original), with the response being measured using a 7-point Likert (Likert 1932) response scale, ranging from ‘Strongly disagree’ to ‘Strongly agree’. They consider relevant demographics to include gender, age, and education (Aarøe and Petersen 2014: 689), and they use Frequentist multiple linear models, estimated using ordinary least squares (OLS, Aarøe and Petersen 2014: 691), with all ‘Don’t know’-responses being excluded from the analysis (Aarøe and Petersen 2014: 689). By comparison, the replications do not include the ‘control’-vignette and solely rely on US-based samples, with Kane et al. (2023) relying on a sample size of 805, while Mummolo and Peterson (2019) conduct two separate replications, with sample sizes of 1,874 and 5,550. To leverage these direct replications, focus will be on the ‘lazy’ and ‘unlucky’ conditions and the results specific to the US. In all three studies (Aarøe and Petersen 2014; Kane et al. 2023; Mummolo and Peterson 2019), the authors identify the effect of deservingness on welfare support as positive and statistically discernible18 from zero with 95% confidence (see the respective estimates in table 10.1 in the appendix).

To arrive at the specific causal estimate to replicate, a meta-analysis (Eysenck 1995; Glass 1976; Sharpe and Poets 2020) is conducted using the estimates provided by these studies19 (i.e., Aarøe and Petersen 2014; Kane et al. 2023; Mummolo and Peterson 2019). Such an analysis can be considered the ‘platinum standard’ of science (Pedersen and Stritch 2018: 606), since it combines the results from each study to produce a more robust uncertainty-weighted average estimate. This exploited the numerous advantages of Bayesian inference (Gelman et al. 2014; Heck 2019; Kruschke 2010; Levy and Mislevy 2020; McElreath 2019; Wagenmakers et al. 2010) by employing a multilevel Bayesian model (cf. Bürkner 2017, 2018; e.g., Maier et al. 2022). This used the brms R package (Bürkner 2017, 2018) to implement the No-U-Turn Sampler (NUTS, Hoffman and Gelman 2014) for the Hamiltonian Markov chain Monte Carlo procedure (Duane et al. 1987; Neal 1993, 2011), implemented in the probabilistic programming language Stan (Stan Development Team 2024), to efficiently draw 40,005 samples from the posterior using 7 independent sampling chains, with 2,000 warmup samples being drawn in advance. This produced an estimated uncertainty-weighted average effect of .44 (SD = .12; \(BF_{1,0}\) = 22.37; 95% HDI[.12; .61]; \(\hat{R}\) = 1.00; ESS = 5,553; MCSE = .00; Prior = \(\mathcal{T}_3\)(.00, .08); for an explanation of these metrics, see the ‘Bayesian inference’ and ‘Analysis’-sections), which indicates a positive effect of deservingness on welfare support that can be considered discernible from zero with 95% credibility (see the appendix for model formalization, diagnostics, and posterior distributions). As generally recommended by King et al. (2000), these results can be more meaningfully interpreted as follows: Ceteris paribus, compared to being exposed to a ‘lazy’ welfare recipient (i.e., someone low in ‘deservingness’), individuals exposed to an ‘unlucky’ welfare recipient (i.e., someone high in ‘deservingness’) are on average 43.92 percentage points (pp) more supportive of allowing the recipient to receive welfare support at the current requirements for welfare eligibility, plus or minus 11.88 pp, and with 95% credibility between 12.24 pp and 61.41 pp.
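
As a minimal sketch of this meta-analytic model, assuming a data frame with one row per study estimate (the effect sizes and standard errors below are hypothetical placeholders, not the values reported in table 10.1 in the appendix):

```r
# A minimal sketch of the multilevel Bayesian meta-analysis. Settings
# mirror the text: 7 chains and 40,005 posterior draws in total, with a
# Student-t(3, 0, .08) prior on the average effect (the 2,000 reported
# warmup draws are assumed to be per chain).
library(brms)

meta_data <- data.frame(
  study = c("Aarøe & Petersen 2014", "Kane et al. 2023",
            "Mummolo & Peterson 2019 (1)", "Mummolo & Peterson 2019 (2)"),
  est   = c(0.45, 0.40, 0.48, 0.44),  # hypothetical placeholders
  se    = c(0.10, 0.12, 0.08, 0.05)   # hypothetical placeholders
)

fit_meta <- brm(
  est | se(se) ~ 1 + (1 | study),     # SE-weighted multilevel model
  data         = meta_data,
  prior        = prior(student_t(3, 0, 0.08), class = Intercept),
  chains       = 7, warmup = 2000,
  iter         = 2000 + 5715,         # 7 x 5,715 = 40,005 posterior draws
  sample_prior = "yes"                # retains prior draws (used below)
)
```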

An advantage of this Bayesian approach (Kass and Raftery 1995; Raftery 1995) is the ability to compute posterior odds (i.e., \(\dfrac{Pr(H_\alpha \ne 0 \ | \ D)}{Pr(H_\alpha = 0 \ | \ D)}\)) from the likelihood odds (i.e., \(\dfrac{Pr(D \ | \ H_\alpha \ne 0 )}{Pr(D \ | \ H_\alpha = 0)}\)) and the prior odds (i.e., \(\dfrac{Pr(\ H_\alpha \ne 0)}{Pr(H_\alpha = 0)}\)) for the estimated parameters (e.g., \(\alpha\)) using Bayes’ theorem (cf. Bayes and Price 1763; Laplace 2009[1814]; McElreath 2019: 36-39) as shown in equation 1.8:

\(\dfrac{Pr(\ H_\alpha \ne 0 \ | \ D)}{Pr(H_\alpha = 0 \ | \ D)} = \dfrac{Pr(D \ | \ H_\alpha \ne 0 )}{Pr(D \ | \ H_\alpha = 0)} \times \dfrac{Pr(H_\alpha \ne 0)}{Pr(H_\alpha = 0)}\) 1.8

With the likelihood odds also known as the Bayes factor (BF, Kass and Raftery 1995; Raftery 1995), this statistic can be computed from Bayesian models by exploiting the Savage-Dickey density ratio (see Heck 2019; Makowski et al. 2019a; 2019b; Wagenmakers et al. 2010). Then, by assuming prior odds of 1, implying that a non-zero hypothesized parameter is a priori as likely as a hypothesized zero parameter20 (i.e., \(\dfrac{Pr(\ H_\alpha \ne 0)}{Pr(H_\alpha = 0)} = 1\)), the posterior odds simplify to the BF (i.e., \(\dfrac{Pr(\ H_\alpha \ne 0 \ | \ D)}{Pr(H_\alpha = 0 \ | \ D)} = \dfrac{Pr(D \ | \ H_\alpha \ne 0)}{Pr(D \ | \ H_\alpha = 0)} \times 1 = BF\), Wagenmakers et al. 2010: 165). This metric is particularly useful because it addresses the limitations and criticisms of the p-value (e.g., American Statistical Association 2016; Cohen 1994; Johnson 1999; Kruschke 2010; Wagenmakers et al. 2010). Taking the exponent of the \(ln(BF)\) provided in the results and assuming prior odds of 1, one can state that the hypothesized effect of deservingness on welfare support is 22.37 times more probable than the hypothesis of no effect, given the data. Following guidelines by Adrian E. Raftery (1995), the relative evidence provided by the BF can then be interpreted as indicating ‘strong evidence in favour of’ the hypothesized relationship between deservingness and welfare support given the data.
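
As a minimal sketch of this computation, reusing the illustrative fit_meta model from the sketch above (fitted with sample_prior = "yes", which the Savage-Dickey method requires):

```r
# Savage-Dickey density ratio for the average effect (the model
# intercept), testing the point null of zero; bayestestR corresponds
# to Makowski et al. (2019a; 2019b).
library(bayestestR)

bf <- bayesfactor_parameters(fit_meta, null = 0)
bf  # the returned object stores ln(BF); exp() recovers the odds scale
```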

Since a Bayesian model relies on the specification of prior distributions for each parameter, which may inadvertently bias results if they are ‘strongly informative’ relative to the data (Gelman et al. 2021: 123-127; Gelman et al. 2014: 55-56), a multilevel Frequentist meta-analysis (Viechtbauer 2010) is also conducted to assess robustness, employing a model estimated with restricted maximum likelihood (REML, Patterson and Thompson 1971). For inference, this exploited a Student t-parametric bootstrap (Efron and Tibshirani 1994) with an equivalent 40,005 resamples to increase comparability with the Bayesian estimates. To compute summary statistics, contrary to the Bayesian results (cf. McElreath 2019: 197), this relied on Frequentist estimators (cf. Fisher 1970; Stock and Watson 2019; Wooldridge 2019), such as bias-reduced estimators of the standard error (SE, Park et al. 2022; Park and Wang 2020, 2022) and bias-corrected and accelerated (BCa) confidence intervals (Efron 1987; Makowski et al. 2019a). This yielded a similar estimated SE-weighted average effect of .50 (SE = .06; 95% BCa[.38; .62]), which for practical purposes can be interpreted similarly to the Bayesian estimates (though for the interpretive differences, see, e.g., Clayton 2021). While the overlap (Makowski et al. 2019a) between the Bayesian posterior distribution and the Frequentist parametrically-bootstrapped (maximum) likelihood distribution is moderate (67.92%), the exact difference is -.06 (SD = .13; 95% HDI[-.41; .16]), which indicates that, while the Bayesian approach produces a lower, and thus more conservative, estimated average effect, the difference is indiscernible from zero with 95% credibility. As such, the priors used in the Bayesian model can be considered ‘weakly informative’ in their influence on the estimated parameters, and the specific causal estimate to replicate will be the Bayesian estimates.
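
A minimal sketch of this robustness check, reusing the illustrative meta_data from above (the parametric bootstrap step is omitted for brevity):

```r
# Frequentist random-effects meta-analysis via REML; rma() is from the
# metafor package (Viechtbauer 2010).
library(metafor)

fit_freq <- rma(yi = est, sei = se, data = meta_data, method = "REML")
summary(fit_freq)
```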

Having outlined the theory, design, and effect pertaining to the relationship between deservingness and welfare attitudes, the reason for assessing the external validity of the SCF by replicating specifically these results can be made clear. Since the relationship is founded in the concept of the deservingness heuristic (Aarøe and Petersen 2014; Petersen et al. 2011, 2012; Petersen 2012), which is essentially a computational algorithm for making decisions on support for welfare policies, consistent with functionalist psychology (Farrell 2014: 88, 111, 398-403), it is likely that LLM-based simulated behavior would be capable of mimicking the functionality of this cognitive module, given that the neural networks underlying these models are loosely inspired by the neural structure of the human brain (cf. OpenAI 2024a; 2024b; 2024c; 2022; Ornstein et al. 2024). Since the input for the deservingness heuristic has been demonstrated to be expressible entirely as text-based vignettes (Aarøe and Petersen 2014; Petersen et al. 2011, 2012), the manipulation conditions meaningfully fit the input-format of LLMs. Similarly, all LLMs trained on a large enough repertoire of text-based expressions of everyday human behavior (e.g., most of OpenAI’s GPT-models, cf. OpenAI 2024a; 2024b; 2024c; 2022; Ornstein et al. 2024) can be expected to have been trained on data reflecting stereotypes, specifically those relating to welfare recipients, which would likely occur in relation to expressions of support for welfare policies targeting these stereotypes. With the gpt-4o-mini LLM by OpenAI (2024a) being one such model that simultaneously manages to be comparatively cost-efficient (costing on average 0.375 US dollars per 1 million tokens at the time of writing) and capable of producing high-quality outputs, the validation of the SCF will rely on this model. Another consideration in using this phenomenon to validate the SCF is that the human-based studies rely on random sampling and experimental designs (Aarøe and Petersen 2014; Kane et al. 2023; Mummolo and Peterson 2019), meaning the causal estimates can be considered reliable; and with multiple direct replications confirming the initial results (i.e., Kane et al. 2023; Mummolo and Peterson 2019), these estimates can be combined in a meta-analysis without concern about pooling overly heterogeneous studies (cf. Ioannidis 2016; Sharpe 1997; see also general criticisms by Data Colada 2022a; 2022b; 2022c; 2022d), providing a rather robust estimate of the effect against which to compare the SCF. The specific choice of assessing the external validity of the SCF against the relationship between deservingness and welfare attitudes can thus be considered meaningful and reliable, though it should be noted that any single assessment is inadequate to robustly conclude that the SCF is externally valid.

With these considerations in mind, the current conceptual replication study will draw on the theory and existing results pertaining to the deservingness heuristic (Aarøe and Petersen 2014; Kane et al. 2023; Mummolo and Peterson 2019) to specify the expectation that support for the social welfare eligibility of recipients is higher when the recipient is perceived as high in deservingness than when the recipient is perceived as low in deservingness, which is formalized as the following hypothesis:

\(H_1\): Support for welfare eligibility is higher for deserving welfare recipients compared to undeserving welfare recipients.

Given the previously outlined theoretical foundation of the SCF, and the consideration that an investigation of the deservingness heuristic in relation to synthetic units could be meaningful and reliable, it is expected that the estimated causal effect of deservingness on welfare eligibility does not differ between human-based and synthetic-based studies, formalized as:

\(H_2\): The effect of deservingness on welfare eligibility is the same whether measured among synthetic respondents or human respondents.

Considering these two hypotheses, the subsequent sections will demonstrate how to simulate representative data of the US population through the use of joint probability random sampling, followed by the introduction of the Synthetically Controlled Trial (SCT), which will emulate the survey conditions of the original study and rely on vignettes to operationalize a manipulation of deservingness through text-based descriptions of a ‘lazy’ and an ‘unlucky’ welfare recipient.

4.2 Simulating Data

As previously mentioned, simulating data allows the researcher to decide which characteristics the LLM should base its life-history and responses on. While the selected gpt-4o-mini LLM by OpenAI (2024a) is expectedly capable of inferring characteristics proportional to a population by basing them on their prevalence in its training data, explicitly determining the characteristics gives the researcher more control over the characteristics generated, which could impact the behavior of the synthetic respondents and how generalizable results are to the population of interest. Similarly, while it can be expected that the greater the number of simulated characteristics, the more realistic the behavior of the LLM, exactly which and how many characteristics to simulate will likely vary depending on the characteristics found relevant for the particular research topic and on the ability of the selected LLM to retain and process information. For this demonstration, the common demographics (e.g., ANES 2021; Ruggles et al. 2024) of US state of residence, area residence type, number of children in household, sex, age, marital status, ethnoracial group21, highest level of education, employment status, and annual household gross income ($US) will be simulated, which include the demographic characteristics deemed relevant by Aarøe and Petersen (2014) and will likely provide gpt-4o-mini with plenty of realistic information from which to generate the life-histories. Furthermore, since the results from Aarøe and Petersen (2014), Mummolo and Peterson (2019), and Kane et al. (2023) relate specifically to political attitudes toward welfare requirements, the characteristics of preferred political party and political ideology will also be simulated.

Simulating characteristics can be done so that their prevalence in the data is proportional to a national population of interest. Since this demonstration focuses on replicating the results from Aarøe and Petersen (2014), Mummolo and Peterson (2019), and Kane et al. (2023), the population of interest is determined to be the adult population of the United States of America (US). A population-proportional simulation of characteristics can be achieved by constructing a stratification matrix using large samples such as those that form the basis of census data (cf. Gelman and Little 1997). A stratification matrix consists of every combination of the characteristics of interest and their frequency in the population, and a US-based version can be constructed using US Census Bureau data provided by IPUMS (Ruggles et al. 2024). Here, the 2022-wave is used because it is relatively contemporary, contains measures of most characteristics to simulate (i.e., US state of residence, area residence type, number of children in household, sex, age, marital status, ethnoracial group, highest level of education, employment status, and annual household gross income in 2022 US dollars), and has a large sample size (\(n\)) of 3,373,378 respondents, which after data cleaning (e.g., removing missing values and respondents younger than 18) results in 2,403,591 usable observations. However, the data lacks measures of political party affiliation and political ideology, meaning that these measures will have to be imputed (Kalton and Kasprzyk 1982; King et al. 2001; Rubin 1988).

To impute the measures of political party affiliation and political ideology, data from the similarly recent 2020-wave of the American National Election Studies (ANES 2021) can be used. While the ANES relies on samples substantially smaller than the IPUMS (\(n\) = 8,280), it is also contemporary and contains similar measures of all characteristics to simulate. The subset of the ANES containing all these characteristics can be used to train statistical models to impute political party and ideology, which necessitates recoding all variables to have the same categories and scale across the IPUMS and ANES. To leverage as much data from the ANES as possible, missing values were imputed (range: 12 - 3,700) with the multiple imputation by chained equations procedure (MICE, van Buuren and Groothuis-Oudshoorn 2011), using the random forest algorithm (Breiman 2001; Ho 1995; Liaw and Wiener 2002) due to its generally high accuracy, non-parametric nature, and ability to handle both categorical and continuous data. This involved 121 imputation iterations, estimated as the number of iterations necessary for chain convergence (cf. von Hippel 2020). Having undertaken data cleaning similar to that of the IPUMS data prior to imputation (e.g., removing respondents younger than 18), this results in one imputed ANES sample of 6,568 usable observations.
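
A minimal sketch of this imputation step, assuming a cleaned ANES data frame anes with the recoded characteristics as columns (the column layout is illustrative):

```r
# Multiple imputation by chained equations with random forests
# (van Buuren and Groothuis-Oudshoorn 2011), mirroring the settings
# above: one imputed dataset and 121 chain iterations.
library(mice)

imp          <- mice(anes, method = "rf", m = 1, maxit = 121, seed = 1)
anes_imputed <- complete(imp)
```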

Following the methodology of cross-validation (Stone 1974), an unbiased measure of the imputation accuracy of statistical models can be derived by splitting the ANES data into a training and a test dataset using the conventional 80/20 split ratio (e.g., França et al. 2024), training the models on the training data, and testing their imputation accuracy on the test data. To assess variability in imputation accuracy, bootstrapped cross-validation (Xu and Liang 2001) is used, which here involves iteratively splitting, training, and testing models a total of 10 times. To impute respondents’ dichotomously-coded preferred political party, a varying-intercepts Bernoulli model (Gelman and Hill 2007) and a Random Forest model (Breiman 2001; Ho 1995; Liaw and Wiener 2002) are compared. From their predictions, a confusion matrix (Tharwat 2018) is generated, with imputation accuracy being computed as the equally-weighted arithmetic mean of total accuracy (ACC), sensitivity (i.e., recall) and specificity (Yerushalmy 1947), and positive predictive value (PPV; i.e., precision) and negative predictive value (NPV) (Altman and Bland 1994), which are popular, complementary metrics of model accuracy. The average imputation accuracy of the varying-intercepts Bernoulli model is 66%, with a standard deviation (SD) of 1 percentage point (pp), while the Random Forest model has an imputation accuracy of 65% (SD = 1pp), meaning that the varying-intercepts Bernoulli model is chosen to impute political party in the IPUMS data due to its (marginally) superior accuracy. Imputing political ideology, which due to its Likert (1932) scale could be treated either as an ordered or (less appropriately) as a continuous variable (cf. Lantz 2013; Wu and Leung 2017), involves comparing a varying-intercepts Gaussian model (Gelman and Hill 2007), a Generalized Additive Model (GAM, Hastie and Tibshirani 1990; Wood 2004, 2011, 2017), and a Random Forest model (Breiman 2001; Ho 1995; Liaw and Wiener 2002). These are similarly evaluated, though given the non-dichotomous coding, only ACC is used to measure imputation accuracy. These models produce an ACC of 30% (1pp), 30% (1pp), and 43% (1pp), respectively, resulting in the Random Forest model being chosen to impute political ideology for its superior accuracy. While the imputation accuracies of both political party affiliation and political ideology exceed chance levels, they are far from 100%, meaning that the imperfectly imputed values may make the resulting stratification matrix less reliably representative of the adult US population.
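
As a minimal sketch of one such cross-validation round for the political party model, reusing anes_imputed from the sketch above and reducing the multi-metric accuracy computation to total accuracy for brevity (column names are illustrative):

```r
# One 80/20 split-train-test round; repeating this 10 times with fresh
# random splits yields the bootstrapped accuracy distribution. 'party'
# is assumed to be a dichotomous factor, with all other columns used
# as predictors.
library(randomForest)

idx   <- sample(nrow(anes_imputed), size = round(0.8 * nrow(anes_imputed)))
train <- anes_imputed[idx, ]
test  <- anes_imputed[-idx, ]

rf    <- randomForest(party ~ ., data = train)
preds <- predict(rf, newdata = test)
mean(preds == test$party)  # total accuracy (ACC)
```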

Having imputed preferred political party and political ideology in the IPUMS data, all chosen characteristics are available, and a stratification matrix is generated by counting the frequency of every combination of the characteristics (n = 2,366,924). Then, analogous to a non-parametric bootstrap (Efron 1979, 2003), data can be simulated by sampling with replacement from the stratification matrix, with the sampling probability being the frequency of each combination of characteristics (see the sketch below). This joint probability procedure for simulated random sampling ensures that the value of each characteristic is simulated conditionally on the values of all other characteristics (e.g., P(sex = Male | State = Tennessee, Age = 65, Political Ideology = Conservative)), making the resulting data more closely reflect the proportionality of characteristics in the US population. Asymptotically, data simulated in this manner is representative of the national population of interest, and for smaller datasets that nonetheless are unrepresentative, the multilevel regression and poststratification (MRP, Gelman and Little 1997) method may alternatively be used to adjust for such sample-population imbalances. With this US-based stratification matrix available, a sample can then easily be generated. For the purposes of this demonstrative replication, the sample size of the original study is used (Aarøe and Petersen 2014: 688), which entails a relatively large sample of 1,009 simulated respondents being generated. Given that the aforementioned ‘parallel worlds’-estimation is here used to produce two counterfactual outcomes for each respondent (see below), this results in an effective sample size of 2,018. With regards to successfully replicating the aforementioned effect size of .44 (SD = .12; 95% HDI[.12; .61]), a false-positive error rate (\(\epsilon_{\alpha}\), i.e., Type I error) of 5% is specified a priori, which, together with 4 fixed model parameters (i.e., intercept mean, intercept standard deviation, coefficient, and the model degrees of freedom, see model in the ‘Bayesian inference’-section), yields an estimated false-negative error rate (\(\epsilon_{\beta}\), i.e., Type II error) of 4.14%. This means that the study has an estimated statistical power (Gelman and Carlin 2014) of 95.86%, which can be considered more than ‘adequate’ for replicating the existing results (cf. Cohen 1988, 1991). Similarly, the type M error rate (\(\epsilon_{M}\), Gelman and Carlin 2014) is 1.03, meaning that the effect is unlikely to be substantially under- or overestimated (i.e., the exaggeration rate is only 3%), while the type S error rate (\(\epsilon_{S}\), Gelman and Carlin 2014) is .00%, indicating almost no probability of the estimate having the wrong sign. As such, the probability of a successful replication is high, at least as indicated by these metrics.
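
A minimal sketch of this sampling step, assuming a stratification matrix strat whose rows are the unique combinations of characteristics and whose column n holds each combination’s population frequency (names are illustrative):

```r
# Rows are sampled with replacement, weighted by population frequency,
# so each characteristic is simulated conditionally on all the others.
set.seed(2024)
rows      <- sample(nrow(strat), size = 1009, replace = TRUE,
                    prob = strat$n / sum(strat$n))
simulated <- strat[rows, setdiff(names(strat), "n")]
simulated$id <- seq_len(nrow(simulated))  # unit identification number
```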

Table 4.1: Examples of the Simulated Dataset
| State | Area | Children | Sex | Age | Marital Status | Ethnoracial Group | Education | Employment Status | Household Income | Political Party | Political Ideology |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Kentucky | Small Town Or Surburb | 0 | Female | 62 | Not Married | Black/African American | Some Post-High School, No Bachelor’s Degree | Not In Labor Force | 10,980 | Democratic Party | Moderate; Middle Of The Road |
| Kansas | Small Town Or Surburb | 0 | Female | 59 | Divorced | White | Graduate Degree | Employed | 42,700 | Democratic Party | Slightly Liberal |
| North Carolina | Small Town Or Surburb | 0 | Female | 53 | Widowed | White | Some Post-High School, No Bachelor’s Degree | Employed | 25,200 | Republican Party | Conservative |
| California | Small Town Or Surburb | 1 | Female | 49 | Married | Mexican | Less Than High School Credential | Not In Labor Force | 60,600 | Democratic Party | Moderate; Middle Of The Road |
| Virginia | Small Town Or Surburb | 0 | Female | 52 | Married | White | Bachelor’s Degree | Employed | 100,200 | Democratic Party | Liberal |
| Minnesota | Small Town Or Surburb | 0 | Female | 65 | Married | White | Graduate Degree | Not In Labor Force | 141,400 | Democratic Party | Liberal |

NOTE: Randomly-sampled rows from the simulated data. n = 1,009. ‘Ethnoracial group’ is a term used to denote the overlapping racial/ethnic/national terminology employed in the US (e.g., ‘White’, ‘African-American’, ‘Chinese’, etc.). Household income is provided as annual household gross income in 2022 US dollars ($). Based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Simulating the data in this manner produces a 1,009 \(\times\) 13 matrix, with randomly-determined examples of simulated units being provided in table 4.1. The simulated data contains an identification number (id, omitted from table 4.1) and the 12 measures of the selected characteristics of each simulated respondent: Their US state of residence (e.g., ‘Maine’, ‘Vermont’, ‘Florida’), area residence type (e.g., ‘City’, ‘Rural Area’, ‘Small Town Or Surburb’), number of children (range: ‘0’ - ‘4 Or More’), sex (i.e., ‘Female’, ‘Male’; 52.82% females), age (mean: 51.72; SD = 18.53; range: ‘18’ - ‘95’), marital status (e.g., ‘Not Married’, ‘Widowed’, ‘Divorced’), ethnoracial group (e.g., ‘Chinese’, ‘Bangladeshi’, ‘Ecuadorian’), level of education (range: ‘Less Than High School Credential’ - ‘Graduate Degree’), employment status (e.g., ‘Not In Labor Force’, ‘Employed’, ‘Unemployed’), annual household gross income in 2022 US dollars ($) (median: 82,000; IQR = 99,000; range: ‘-6,910’ - ‘1,360,000’), preferred political party (i.e., ‘Democratic Party’, ‘Republican Party’; 53.72% democrats), and political ideology (range: ‘Extremely Liberal’ - ‘Extremely Conservative’). For more details and a greater overview of the simulated data, descriptive statistics are provided both metrically and graphically in the appendix.

On average, the sample is estimated to capture 97.69% (SD = 6.29pp) of the information in the US population in relation to the specified characteristics22, and comparisons of the sample distributions with US population estimates reveal that the sample can generally be considered representative on all measured demographics. Compared to the nationally representative sample used by Aarøe and Petersen (2014: 688), which employed stratified sampling, the simulated data includes an age range up to 95, whereas their age range extends only to 70. Possibly as a result of this, the mean age of their sample is 41.1 years, while it is 51.7 here. Similarly, the share of females is 51% in their sample and 52.8% in this simulated sample. As such, while the sample of Aarøe and Petersen (2014) is relatively younger, based on the characteristics described in their materials, there are no substantial differences between the samples, meaning that this sample can be considered comparatively representative. This representativeness is likely attributable to the relatively large nominal sample size (n = 1,009) and the joint probability procedure used to generate it. With the simulated data now generated, OpenAI’s (2024a) gpt-4o-mini can be used to generate the life-histories.

4.3 Generating Life-Histories

With simulated data available, a prompt can be engineered, which serves as an input instructing the LLM to generate life-histories for the respondents. As mentioned, a life-history (Bertaux 1981) can be considered an autobiographical narrative by which individuals frame their life, experiences, identity, relationships, existential purpose, and place in society (e.g., Znaniecki and Thomas 2020[1974]), and their purpose here is to provide more character and nuance to the simulated respondents in a way that goes beyond their assigned characteristics. This will expectedly help the LLM respond in a manner consistent with a human possessing the assigned characteristics. Writing a prompt that instructs the LLM to generate the life-history based on the assigned characteristics helps ensure that other characteristics generated in the life-history are functions of the explicitly determined characteristics. The prompt engineered for this specific purpose, filled with the characteristics of a randomly-sampled simulated unit, is provided in table 4.2.

Table 4.2: Life-History Prompt
id Prompt
495

You are to act as a character and must remain in character for the purposes of a roleplay scenario. ‘You’ and ‘your’ thus refer to your character from now on. Your character has the following characteristics:

Country: United States of America (USA)

State: North Carolina

Area: Rural Area

Number of Children in Household: 0

Sex: Male

Age: 25

Marital Status: Not Married

Ethnoracial Group: Black/African American

Annual Household Gross Income (In 2022 $US): 13200

Education: Some Post-High School, No Bachelor’s Degree

Employment Status: Not In Labor Force

Preferred Political Party: Democratic Party

Political Ideology: Moderate; Middle Of The Road

Be creative but realistic and compose a brief life-history for your character based on the provided characteristics:

NOTE: Example of a prompt engineered to generate a life-history. The characteristics in the prompt are based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

The prompt specified here for generating life-histories instructs the gpt-4o-mini LLM (OpenAI 2024a) to act as a character and to persistently remain in character. This serves to ensure that all output generated by the LLM reflects behavior consistent with that character. It also contains a list of all the assigned characteristics, followed by a request to generate a (brief) life-history based on these characteristics. While an increase in the number of provided characteristics and the request of a longer life-history will expectedly result in behavior that more realistically reflects a human possessing these traits, this likely comes at the cost of greater financial expense and more time needed by the LLM to process the input and generate an output. An advantage of relying on the API of GPT-models by OpenAI is that researchers can control numerous parameters of the model. For example, the max length (i.e., max_tokens) parameter can be restricted to reduce costs (OpenAI 2025a), though this will likely reduce the model’s ability to act in a manner consistent with its assigned characteristics. What constitutes an ‘optimal’ max length will expectedly vary across research purposes, depending in part on the number of relevant characteristics from which the LLM is to generate life-histories. For this demonstration, max_tokens for generating life-histories is set to 500, which is found to be adequate within this context, since the LLM can generate life-histories within this limit without its output being truncated. Another relevant parameter is the temperature of the LLM’s output (i.e., temperature), which determines the randomness of the output (OpenAI 2025a). Higher values make the output adhere less to language syntax, rendering it less cohesive and meaningful. The default value outlined by OpenAI is 1, and while an optimal value will likely vary across contexts, a text analysis of the generated life-histories (see below) found this value adequate for the context of generating realistic life-histories.

To generate life-histories from the simulated data, it is recommended to use the gpt-4o-mini LLM through OpenAI’s API (OpenAI 2025a). As mentioned, gpt-4o-mini is among the best-performing and cheapest models (OpenAI 2024a; 2025b), with a hallucination rate as low as 1.7% and a factual consistency rate of 98.3% (Vectara 2025), indicating that the model is generally reliable. Making requests through OpenAI’s API is preferred, since data generated in this way are not used to train their models, and in terms of privacy, such data are generally deleted after 30 days (OpenAI 2024d). This helps reduce concerns that studies such as this could inadvertently train the model, leading to results more consistent with researcher preferences rather than human behavior, which would make the results tautological. Compared to manually generating life-histories, calling the API using software dramatically speeds up the process and is easily done through R in RStudio. To this end, this demonstration relies on the openai R package (Rudnytskyi 2023) because of its ease in making such API requests. Due to the speed of such software, it is furthermore recommended to enforce pauses between calls to the API to reduce traffic, which can stress OpenAI's systems. OpenAI already enforces rate limits (OpenAI 2025a), and to avoid unintentionally hitting one of their numerous request limitations, the researcher can a priori enforce a pause between requests. This demonstration uses a relatively substantial pause of 3.1 seconds to avoid hitting rate limits, though suitable values could vary depending on the number and length of requests.
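To make the loops below self-contained, the following sketch collects these specifications into the objects referenced by the code. The values follow the text above, while the max_tokens limit for the outcomes and the error threshold (max_attempt_thresholds) are illustrative assumptions:

# Specifications assumed by the loops below; values follow the text, while
# max_tokens$outcome and max_attempt_thresholds are illustrative assumptions
library(magrittr)  # provides the %>% pipe used below

selected_gpt_model <- "gpt-4o-mini"
max_tokens <- list(life_history = 500, outcome = 500)  # outcome limit is an assumption
temperature <- list(life_history = 1, outcome = 1)     # OpenAI's default temperature
pause_duration <- 3.1           # seconds between API requests
max_attempts <- 3               # tries per request
max_attempt_thresholds <- 10    # assumption: stop after this many failed respondents
OPENAI_API_KEY <- Sys.getenv("OPENAI_API_KEY")  # keep the key out of the script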

This demonstration implements and recommends that LLM-output be generated through separate requests, one for each simulated respondent, to ensure that observations are independent. An alternative to this procedure is batching, which involves grouping multiple requests together into a single request (OpenAI 2025a), for example, by providing the characteristics of several simulated respondents in the same prompt and asking for multiple life-histories. While this can reduce both financial and time costs, a serious methodological concern is that generating life-histories (and most other types of outcome) in batches produces output that may be interdependent, which typically is not desirable. For example, outputs generated within the same batch may be more strongly correlated with each other than with outputs generated in other batches or individually. This would violate the common statistical assumption of observations being independent and identically distributed (iid, Gaenssler and Stute 1979; McElreath 2019: 81), which would likely need to be accounted for using statistical modeling (e.g., multilevel models, Gelman and Hill 2007). Furthermore, when generating the outcome across different group conditions using batching, there may be information spill-over, for example, if the LLM in generating a counterfactual output mistakes a unit in one manipulation condition for a unit in another manipulation condition, or if the mere presence of information relating to both group conditions affects the generation of results. This spill-over would violate the aforementioned stable unit treatment value assumption (SUTVA, Angrist et al. 1996; Cox 1958; Gelman et al. 2021: 353-354; Imbens and Rubin 2015: 9-12; Rosenbaum and Rubin 1983; Rubin 1974), on which the SCF was formalized to depend for reliable causal inference. While general methods could be implemented to account for this information spill-over (Imbens and Rubin 2015: 12-13; e.g., Aronow and Samii 2017), these can be complex (Rosenbaum 2007), and as similarly argued by Elizabeth Stuart (2010: 3), the most effective solution is simply to prevent interdependency and spill-over by design. As such, to prevent violations of iid and SUTVA in this manner, it is recommended to generate life-histories and most other types of outcome separately, unless the purpose specifically is to study interdependency.

With these specifications and considerations in mind, a life-history for each simulated respondent can be generated. In this demonstration, this involves a for-loop, which iteratively goes through each respondent, makes a call to OpenAI’s API for the selected gpt-4o-mini LLM, and saves the output to the data. The loop accounts for possible errors in calling the API by waiting 10 seconds before trying again, up to a total of 3 times. Information about the output generation provided by OpenAI is saved, such as the time of receiving a response from the model, which helps document the data. The number of tokens used and the time it took to generate the output are also stored, which will be used to ascertain financial and time costs. Note that before it is possible to call OpenAI’s API, researchers must create an OpenAI account, deposit adequate funds, and generate a unique API key (OpenAI 2025a). The key connects your requests with your funds and should thus be kept private. Following these specifications, the below R code provides an overview of some of the code used to generate life-histories with the gpt-4o-mini LLM (OpenAI 2024a):

# Initialize columns for storing the GPT-output
simulated_data$life_history_gpt <- NA
simulated_data$life_history_id_gpt <- NA
simulated_data$life_history_object_gpt <- NA
simulated_data$life_history_created_gpt <- NA
simulated_data$life_history_model_gpt <- NA
simulated_data$life_history_prompt_tokens <- NA
simulated_data$life_history_completion_tokens <- NA
simulated_data$life_history_logprobs <- NA
simulated_data$life_history_finish_reason <- NA
simulated_data$life_history_message.refusal <- NA
simulated_data$life_history_message.role <- NA
simulated_data$life_history_time_cost <- NA


# Construct prompt
messages_life_history <- list()
messages_life_history[[1]] <- list(
  "role" = "system",
  "content" = "You are a human simulator."
)
messages_life_history[[2]] <- list(
  "role" = "user",
  "content" = NA
)


# Progress bar to keep track of the for-loop
pb <- txtProgressBar(min = 0, max = nrow(simulated_data), style = 3)


# Indices in the data for which to generate life-histories
indices <- simulated_data$id[is.na(simulated_data$life_history_gpt)] %>% as.numeric()


# Reset attempts
attempt_thresholds <- 0


# For-loop for generating life-histories
for(i in c(indices)){
  
  # Check if threshold has been met
  if(attempt_thresholds >= max_attempt_thresholds){
    print("Threshold for maximum number of errors reached. Stopping...")
    break
  }
  
  
  # Start time recording
  t0 <- Sys.time()

  
  # Enforce a pause if this is the second iteration or higher
  if(i > 1){
    Sys.sleep(pause_duration)
  }
  
  
  # Adjust prompt to reflect respondent i's characteristics
  messages_life_history[[2]]$content <- paste0(
    "You are to act as a character and must remain in character for the purposes of a roleplay scenario. 'You' and 'your' thus refer to your character from now on. Your character has the following characteristics:
    Country: United States of America (USA)
    State: ", simulated_data$state[i], "
    Area: ", simulated_data$urban[i], "
    Number of Children in Household: ", simulated_data$children[i], "
    Sex: ", simulated_data$sex[i], "
    Age: ", simulated_data$age[i], "
    Marital Status: ", simulated_data$married[i], "
    Ethnoracial Group: ", simulated_data$race_d[i], "
    Annual Household Gross Income (In 2022 $US): ", simulated_data$household_income_d[i], "
    Education: ", simulated_data$education[i], "
    Employment Status: ", simulated_data$employment[i], "
    Preferred Political Party: ", simulated_data$political_party[i], "
    Political Ideology: ", simulated_data$political_ideology[i], "
    Be creative but realistic and compose a brief life-history for your character based on the provided characteristics:"
  )
  
  
  # Set initial attempt and success flag
  attempt <- 1
  success <- FALSE
  
  
  # Call GPT model (and use tryCatch to make the loop continue retrying if encountering an error)
  while(attempt <= max_attempts){
    print(paste0("Attempt: ", attempt, "/", max_attempts))
    attempt <- attempt + 1

    tryCatch(
      expr = {
        gpt_output <- openai::create_chat_completion(
          model = selected_gpt_model,
          messages = messages_life_history,
          max_tokens = max_tokens$life_history,
          temperature = temperature$life_history,
          openai_api_key = OPENAI_API_KEY
        )
        success <- TRUE
        break
      },
      error = function(e){
        print(paste("Error encountered. Error message:", e))
        print(" Going to sleep for 10 seconds ...")
        Sys.sleep(time = 10)
      }
    )
  }
  
  
  # Check if all attempts were used without success
  if(!success){
    attempt_thresholds <- attempt_thresholds + 1
    print(paste0("Maximum number of errors reached threshold. Thresholds met: ", attempt_thresholds, "/", max_attempt_thresholds))
    
    # Update progress bar
    setTxtProgressBar(pb, i)
    
    
    # End and store time recording
    t1 <- Sys.time()
    simulated_data$life_history_time_cost[i] <- t1 - t0
    
    next
  }

  
  # Store life history
  simulated_data$life_history_gpt[i] <- gpt_output$choices$message.content


  # Store utility info
  simulated_data$life_history_id_gpt[i] <- gpt_output$id
  simulated_data$life_history_object_gpt[i] <- gpt_output$object
  simulated_data$life_history_created_gpt[i] <- gpt_output$created
  simulated_data$life_history_model_gpt[i] <- gpt_output$model
  simulated_data$life_history_prompt_tokens[i] <- gpt_output$usage$prompt_tokens
  simulated_data$life_history_completion_tokens[i] <- gpt_output$usage$completion_tokens
  simulated_data$life_history_logprobs[i] <- gpt_output$choices$logprobs
  simulated_data$life_history_finish_reason[i] <- gpt_output$choices$finish_reason
  simulated_data$life_history_message.refusal[i] <- gpt_output$choices$message.refusal
  simulated_data$life_history_message.role[i] <- gpt_output$choices$message.role
  

  # Update progress bar
  setTxtProgressBar(pb, i)
  
  
  # End and store time recording
  t1 <- Sys.time()
  simulated_data$life_history_time_cost[i] <- t1 - t0
}


# Convert Unix Timestamp to Datetime
simulated_data$life_history_datetime <- lubridate::as_datetime(simulated_data$life_history_created_gpt)

Using the aforementioned specifications, generating 1,009 life-histories using this for-loop takes 192.76 minutes, or 3.21 hours, with the expected time-cost per simulated respondent thus being 11.46 (SD = .00) seconds. Output from the gpt-4o-mini model can be examined to ascertain whether it ran into any issues during the generation (OpenAI 2025a). This includes its finish_reason, where a stop-response is consistent with no issues, while a length-response means that the life-history was cut short due to the model reaching the specified max_tokens limitation. Another output parameter is message.refusal, which indicates whether the model refused to provide an output, with an NA-response generally being consistent with no issues. A message refusal can happen for several reasons, and retries or changing the input may fix the issue. For an example of the contents of a life-history, a randomly-sampled life-history is provided in table 4.3.

Table 4.3: Example of an LLM-Generated Life-History
id Life-History
495 My name is Marcus Lee, and I grew up in the small, tight-knit community of rural North Carolina. My folks are from here—a mix of tradition and a yearning for the future. They instilled in me the values of hard work and community support, but also the importance of pursuing education and finding my own path.

After finishing high school, I dabbled in a few local community college courses—mostly focused on automotive repair and computer technology—because I’ve always had a knack for figuring things out. Unfortunately, life took a few unexpected turns, and I had to step back from my education. Financial struggles hit hard, and I’ve found myself working odd jobs and helping out family and neighbors rather than sticking to a career path.

At 25, I’m not married and don’t have kids, but I keep close ties with my family and friends. I often help my younger cousins with their homework or mentor them through issues they face in school. Being part of this community means everything to me; we rely on each other, especially in tough times.

Politically, I lean towards the Democratic Party and believe in finding balanced solutions that cater to everyone’s needs. I value social justice but also want to ensure that economic opportunities are accessible to all, including the folks in my rural area who often get left behind.

With an annual income of around $13,200, I’ve managed to keep my head above water, but it’s definitely a struggle. I find small joys—like cooking on the weekends, playing pick-up basketball with friends, or attending local church events. I dream of going back to school full-time one day, perhaps to become a community organizer or work in public service, aiming to uplift my community and ensure that future generations have more resources and opportunities than I did.

NOTE: Synthetic data generated by OpenAI’s (2024a) gpt-4o-mini model on 2025-02-03 10:50:14 using simulated characteristics based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).
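Before moving on, the generation diagnostics discussed above can be checked with simple frequency tables over the stored fields; a quick sketch:

# A quick sketch of inspecting the stored generation diagnostics
table(simulated_data$life_history_finish_reason, useNA = "ifany")   # 'stop' = no issues
table(simulated_data$life_history_message.refusal, useNA = "ifany") # NA = no refusal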

With life-histories now generated from the simulated characteristics, this synthesis of simulated data and LLM-based data can be considered to have produced what was previously termed synthetic units. When later tasked with acting as a survey respondent (see below), the LLM bases its behavior on the combined information in the life-history specific to each synthetic unit, rather than merely on the information reflected in the assigned characteristics specific to each simulated unit, which will likely make its behavior more realistic and reflect actual human behavior in a more nuanced way.

In terms of quality assurance, it is here recommended to validate that the life-histories reflect and are consistent with the assigned characteristics. This serves to ensure that subsequent output by the model (i.e., the counterfactual outcomes) correctly draws on the assigned characteristics. Analogous to procedures used in qualitative methodology (O’Connor and Joffe 2020) or computational text analysis (Theocharis and Jungherr 2021), validations can be performed by raters, who read the generated life-history and compare it to the assigned characteristics to assess consistency. Raters can provide their evaluations of consistency in numerous formats, for example, dichotomously (e.g., ‘Consistent’, ‘Inconsistent’), continuously (e.g., 7-point Likert (1932) scale), or in open-text format. The number of raters and evaluated life-histories can vary, for example, it can involve two independent raters, who evaluate all life-histories or a random sample, with the scope of the evaluation likely depending on the resources available for the given project. Rater evaluations can then be assessed to ascertain their inter-rater reliability (Saal et al. 1980), which expresses whether raters are generally in agreement on the (in)consistency of the output. If rater evaluations are provided in a quantifiable format, this can be estimated as a reliability coefficient, for example, as percent agreement, Sklar’s \(\omega\) (Hughes 2022), Krippendorff’s \(\alpha\) (2018), Fleiss’ \(\kappa\) (1971), Cohen’s \(\kappa\) (1960), Scott’s \(\pi\) (1955), Spearman’s \(\rho\) (1904), Kendall’s τ (1938), or the product-moment correlation coefficient (r, Bravais 1844; Pearson 1895a; Stigler 1989), depending on the number of raters and the scale used to measure (in)consistency. If the same rater has evaluated the same output multiple times, intra-rater reliability (Saal et al. 1980) can also be similarly estimated. An alternative procedure to human raters is to have LLMs act as raters, which can reduce financial and time costs related to evaluating the consistency of the model output. If doing this, the randomness inherent to the LLM can be maximally reduced (i.e., temperature = 0), which increases the determinism of the LLM-raters (cf. OpenAI 2025a), expectedly producing a perfect inter- and intra-rater reliability. However, this may under-/overestimate consistency, and until the rater reliability of LLMs has proven comparable to human raters in assessing consistency between life-histories and characteristics, caution in choosing to use LLM-raters is advised, as it reduces human supervision and quality control over the output.
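As a minimal sketch, assuming the dichotomous consistency codes of two hypothetical raters are stored in the vectors rating_1 and rating_2, such reliability coefficients could be estimated with the irr package:

# A minimal sketch of estimating inter-rater reliability, assuming two
# hypothetical raters' dichotomous consistency codes (1 = 'Consistent',
# 0 = 'Inconsistent') are stored in rating_1 and rating_2
library(irr)

ratings <- cbind(rating_1, rating_2)

# Percent agreement (subjects in rows, raters in columns)
irr::agree(ratings)

# Krippendorff's alpha (expects raters in rows, subjects in columns)
irr::kripp.alpha(t(ratings), method = "nominal")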

For the present purposes of evaluating the consistency between the assigned characteristics and the characteristics mentioned in the generated life-history, a single-rater assessment was conducted by the author using a random sample of 100 synthetic units. Here, following standard practice for response options (e.g., ANES (2021)), the rater could choose among ‘Agree’, ‘Disagree’, ‘Didn’t know’, and ‘Prefer not to answer’ in relation to whether the information contained in the life-history matched each assigned characteristic. To estimate the intra-rater reliability of this human rater, after a 7-day window to reduce recall effects (Agnihotri et al. 2024; Ernst et al. 2019; Sørensen et al. 2024), a random sample of 50 of the previously-coded synthetic units was recoded by the same rater. Consistent with a consensus-based approach (Syed and Nelson 2015: 5), rater-discrepancies were resolved by comparing ratings across the two intervals, with the proportion of responses for each category being provided in table 4.4. Intra-rater reliability was estimated as percent agreement and Sklar’s \(\omega\) (Hughes 2022) prior to resolving discrepancies, and both estimates are similarly provided in the table.

Table 4.4: Rater Assessment of Consistency between the Assigned Characteristics & the Content of Life-Histories
Characteristics Consistent (%) Inconsistent (%) Ambiguous (%) Rater-Agreement (%) Rating Reliability (\(\hat{\omega}\))
State 100 0 0 100 0.999
Area 95 4 1 96 0.932
Children 98 0 2 100 0.999
Sex 100 0 0 100 0.999
Age 77 0 23 100 0.999
Marital Status 99 1 0 100 0.999
Ethnoracial Group 20 0 80 96 0.987
Education 98 2 0 100 0.999
Employment Status 97 0 3 100 0.999
Household Income 30 0 70 98 0.931
Political Party 90 0 10 100 0.999
Political Ideology 98 1 1 100 0.999

NOTE: Percentages may not sum to 100 due to rounding. \(\hat{\omega}\) = Sklar’s \(\omega\) reliability coefficient (Hughes 2022). Human assessment of consistency in synthetic data generated by OpenAI’s (2024a) gpt-4o-mini model using simulated characteristics based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

The results from the content analysis of the generated life-histories reveal a generally high overlap between the assigned characteristics and the LLM-generated life-histories. Few instances were found where the LLM had generated information in direct contradiction to the assigned characteristics, with most identified cases being inconsistencies in the unit’s area of residence (4% of the time). The relatively large number of ambiguous cases (i.e., ‘Didn’t know’-ratings) is the result of the LLM not explicitly stating the assigned characteristic in the life-history. This was most evident for the ‘Ethnoracial group’-category (80%), where the only instances of explicit mention were for individuals of ‘black’ or immigrant background. Similarly, the specific annual household gross income was rarely mentioned, with the LLM often making vague and implicit references to the unit’s income (70%, e.g., a ‘modest income’). Age was ambiguous in 23% of the life-histories, where it was either implied by stating the synthetic unit’s year of birth or, more typically, omitted entirely. While the LLM generally succeeded in explicitly stating the political ideology of the synthetic respondent, their preferred political party was often not stated (10%), perhaps because the LLM communicated it implicitly by stating the political values of the unit. While the prompt could seemingly be improved by stating that the LLM must explicitly mention each assigned characteristic in the life-history, the results nonetheless indicate a generally high consistency between the assigned characteristics and life-histories in this sample, with few direct contradictions, indicating that the life-histories can be considered to reliably reflect the assigned characteristics. As such, this use of life-histories can be expected to enable a more realistic investigation of the effect of deservingness on welfare support, achieved by conducting a Synthetically Controlled Trial (SCT).

4.4 Conducting an SCT

With synthetic units having been produced through the synthesis of simulated data and LLM-based data, a Synthetically Controlled Trial (SCT) can then be conducted. The SCT implements the ‘parallel worlds’-estimation procedure whilst emulating the survey design of Aarøe and Petersen (2014), enabling an exogenous manipulation of deservingness by simultaneously and independently exposing each synthetic respondent to both the ‘lazy’ and ‘unlucky’ welfare recipient vignettes and measuring their support for restricting the welfare eligibility of each recipient. The following subsections cover (1) the simultaneous and independent generation and measurement of the counterfactual outcomes across a ‘lazy’ and ‘unlucky’ manipulation condition; (2) an assessment of (im)balance between the measured characteristics of the synthetic respondents across these two manipulation conditions; (3) a statistical test of the simultaneity assumption to ascertain whether ‘parallel worlds’-estimation expectedly eliminates time as a confounder; and (4) a formalization of the statistical model, which in line with Bayesian inference (Gelman et al. 2014, 2021; Heck 2019; Kruschke 2014; McElreath 2019; Wagenmakers et al. 2010), will be used to estimate the causal effect of deservingness on welfare support.

4.4.1 Parallel Worlds

The procedure for generating the counterfactual outcomes is similar to the procedure used for generating life-histories. This will involve prompting the LLM to assume the character of the synthetic respondent by relying on the characteristics described in the previously generated life-histories. As mentioned, the novel concept of a Synthetically Controlled Trial (SCT) can be considered an SCF-based version of the Randomized Controlled Trial (RCT, Coover and Angell 1907; Peirce and Jastrow 1885; Woodworth and Thorndike 1901), since it borrows heavily from the RCT's design and ability to provide reliable causal inference (cf. Angrist and Pischke 2015: 1-46; Angrist and Pischke 2009: 15-22; Stock and Watson 2019: 48-49). Just like an RCT, the SCT involves exogenously assigning the synthetic respondents to group conditions with varying manipulations, followed by measuring the outcome. However, an SCT does not rely on the randomized assignment mechanism (RAM) of an RCT, since every respondent can be assigned to every manipulation condition and have all their counterfactual outcomes measured. For this, it employs the aforementioned simultaneous and independent exogenous assignment mechanism (SIXAM), where each synthetic respondent is assigned to every manipulation condition, and with the measurement of simultaneous and independent counterfactual outcomes, intra-unit comparisons can be made that eliminate confounders by merit of the exogeneity of the SIXAM, providing reliable causal inference. While intra-unit comparisons in this manner will eliminate all unit-invariant confounders when estimating individual treatment effects (ITEs)23 (cf. Westreich et al. 2015), to also eliminate time as a possible confounder of the ITEs, it is important that this process, at least with respect to each synthetic respondent, occurs simultaneously across manipulation conditions, which was previously introduced as the simultaneity assumption. Compliance with this assumption is implemented with the aforementioned concept of ‘parallel worlds’-estimation, which exploits parallel computation (cf. Gottlieb and Almasi 1989; Ozdoganlar and Endres 2021[1999]) to simultaneously generate and measure the counterfactual outcomes of each respondent by handling these processes in separate computational cores. If the simultaneity assumption is not violated, the intra-unit comparisons will also remove all time-invariant confounders in relation to the ITEs. A limitation of ‘parallel worlds’-estimation is that it requires a machine with at least as many computational cores as there are manipulation conditions, as sketched below. With 2 manipulation conditions considered here and 8 cores available on the system used to implement parallel processing, ‘parallel worlds’-estimation is feasible for this replication.
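A minimal sketch of this requirement check, defining the cores_parallel_worlds value used in the code further below:

# A minimal sketch of verifying that the machine has at least as many cores
# as manipulation conditions; 'cores_parallel_worlds' is used further below
n_conditions <- 2
cores_parallel_worlds <- n_conditions
stopifnot(parallel::detectCores() >= cores_parallel_worlds)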

With the above considerations in mind, the current demonstration involves replicating the survey design and two manipulation conditions invariably employed by Aarøe and Petersen (2014), Mummolo and Peterson (2019), and Kane et al. (2023). The replication has so far managed to simulate a sample of similar size (n = 1,009) and representativeness to Aarøe and Petersen (2014) from the (theoretically) same population of adults from the US. Since details on the design of the original study are sparse, which is contrary to best practice (cf. Christensen et al. 2019), replicating the survey conditions is limited to the vignettes and outcome measurement. The vignettes that make up the manipulation conditions involve the description of a ‘lazy’ welfare recipient: ‘Imagine a man who is currently on social welfare. He has never had a regular job, but he is fit and healthy. He is not motivated to get a job’ (Aarøe and Petersen 2014: 689; Petersen et al. 2012) and an ‘unlucky’ welfare recipient: ‘Imagine a man who is currently on social welfare. He has always had a regular job, but has now been the victim of a work-related injury. He is very motivated to get back to work again’ (Aarøe and Petersen 2014: 689; Petersen et al. 2012). While these phrasings are evidently useful (cf. Aarøe and Petersen 2014; Kane et al. 2023; Mummolo and Peterson 2019; Petersen et al. 2012), to implement the ‘parallel worlds’-estimation procedure, the word length and character length of the manipulations would expectedly matter in relation to the simultaneity assumption. Since the vignettes differ in both word length (i.e., 30 and 36 words, respectively) and character length (i.e., 142 and 184 characters, respectively), it is unlikely that the parallel API-requests to gpt-4o-mini and its resulting responses will occur simultaneously due to differences in the time needed to process these differing inputs, and a violation of the simultaneity assumption is thus expected.24 This is because the GPT-model uses time to process the input (e.g., it tokenizes a sentence by breaking it down into individual words), which can vary depending on the length of the input, though the exact impact of word and character length on processing time may differ. To address this issue, without changing the substantive meaning of the vignettes, they are rephrased to have the same word length (i.e., 33 words) and character length (i.e., 171 characters).25 These can then be embedded into prompts that present the vignette to the gpt-4o-mini LLM, provided in table 4.5.

Table 4.5: Manipulation Conditions
‘Lazy’ Welfare Recipient
One day you decide to participate in a survey. You are informed that the responsible researchers are professional and impartial and care about your honest opinions.

On the first few pages, you answer questions regarding your background. You turn a page and are then greeted with the following scenario:
‘Imagine a man who is currently on social welfare. He has never had a regular job, but he is fit and healthy. He lacks motivation to find employment and does not seek work.’

You take a few seconds to properly imagine this man before clicking to the next page. This page greets you with the following question about the man you just read about:
‘To what extent do you disagree or agree that the eligibility requirements for social welfare should be tightened for persons like him?’

Your response options are:
1 Strongly disagree
2 Disagree
3 Somewhat disagree
4 Neither agree or disagree
5 Somewhat agree
6 Agree
7 Strongly agree
99 Don’t know

You answer:
‘Unlucky’ Welfare Recipient

One day you decide to participate in a survey. You are informed that the responsible researchers are professional and impartial and care about your honest opinions.

On the first few pages, you answer questions regarding your background. You turn a page and are then greeted with the following scenario:
‘Imagine a man who is currently on social welfare. He has always had a regular job, but he has now suffered a work-related injury. He is highly motivated to return to work.’

You take a few seconds to properly imagine this man before clicking to the next page. This page greets you with the following question about the man you just read about:
‘To what extent do you disagree or agree that the eligibility requirements for social welfare should be tightened for persons like him?’

Your response options are:
1 Strongly disagree
2 Disagree
3 Somewhat disagree
4 Neither agree or disagree
5 Somewhat agree
6 Agree
7 Strongly agree
99 Don’t know

You answer:

NOTE: Example of prompt engineered with the purpose of using vignettes to generate different manipulation conditions for synthetic respondents using LLMs. Rephrased vignettes based on Aarøe and Petersen (2014) and Petersen et al. (2012).
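As a quick check, the equality of the rephrased vignettes' word and character lengths can be verified directly in R (the strings are copied from table 4.5; hyphenated compounds are counted as two words, as in the text):

# Rephrased vignettes copied from table 4.5
vignette_lazy <- "Imagine a man who is currently on social welfare. He has never had a regular job, but he is fit and healthy. He lacks motivation to find employment and does not seek work."
vignette_unlucky <- "Imagine a man who is currently on social welfare. He has always had a regular job, but he has now suffered a work-related injury. He is highly motivated to return to work."

# Character lengths (both should be 171)
nchar(vignette_lazy); nchar(vignette_unlucky)

# Word lengths (both should be 33, counting hyphenated compounds as two words)
length(strsplit(vignette_lazy, "[\\s-]+", perl = TRUE)[[1]])
length(strsplit(vignette_unlucky, "[\\s-]+", perl = TRUE)[[1]])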

To further increase the external validity of the results, the prompts are formulated to create a realistic survey environment, for example, by stating that the respondent has chosen to participate in a survey and has already filled out questions regarding their background characteristics. This includes standard survey introductory text informing the synthetic respondent that the researchers are impartial and professional and care about the respondent’s honest answers. This flavor text serves to reduce the impact of social desirability (Edwards 1953; Krumpal 2013) and other demand characteristics (Masling 1966; Nichols and Maner 2008), such as expectancy effects (McDermott 1895: 34) and experimenter demand bias (Druckman 2015: 57-58), on respondent answers. To inform the LLM of the intended response format, the prompts include the response scale, which in line with the replication is a 7-point Likert (1932) scale26 (Aarøe and Petersen 2014: 689), though it should be noted that other types of scales may be of greater use to researchers when studying other phenomena. Lastly, the prompt explicitly states that the respondent takes their time to imagine the welfare recipient, which serves to ensure that the response reflects a contemplated opinion. While this may increase the reliability of the response, it could make results less generalizable, since not all human respondents take their time to read and comply with vignettes (Kane et al. 2023), and given how agreeable the GPT models generally are (cf. OpenAI 2024a; 2024b; 2024c; Ornstein et al. 2024), it is likely that responses generated by these current models will fail to reflect non-compliant and inattentive human respondents.

With one vignette being used in each manipulation condition, the synthetic respondents can then be simultaneously and independently assigned to both conditions to provide their counterfactual responses. Similar to generating the life-histories, this again involves a for-loop iterating over each synthetic respondent, thus abiding by the iid and SUTVA assumptions. Unlike before, however, the loop contains a nested foreach-loop that implements the ‘parallel worlds’-estimation using parallel processing (Microsoft and Weston 2022b). This involves two independent computational cores, with each one making a separate call to the gpt-4o-mini API (OpenAI 2024a). As mentioned, this serves to reduce any time difference between observing each outcome across the manipulation conditions. API calls are again made continuously with a 3.1-second pause between each synthetic respondent, and a 10-second pause in case of an error, up to a total of 3 tries. It should further be noted that the code also implements randomization, with a Bernoulli-process (cf. Bernoulli 1713[2006]; Kachitvichyanukul and Schmeiser 1988) determining the initial index-value of the parallel process-loop (i.e., whether the index-sequence is [1,2] or [2,1]), which serves to ensure that any idiosyncratic waits attributable to the initial index-value are unbiased.

The below R-code specifies the for-loop that employs the simultaneous and independent exogenous assignment mechanism (SIXAM) to produce two counterfactual outcomes for each respondent, one for the ‘lazy’ welfare recipient and one for the ‘unlucky’ welfare recipient. The implementation of the ‘parallel worlds’-estimation procedure uses two independent cores, with one core being used to process each manipulation condition.

# Initialize columns for storing the 'lazy' outcome
simulated_data$outcome_gpt <- NA
simulated_data$outcome_id_gpt <- NA
simulated_data$outcome_object_gpt <- NA
simulated_data$outcome_created_gpt <- NA
simulated_data$outcome_model_gpt <- NA
simulated_data$outcome_prompt_tokens <- NA
simulated_data$outcome_completion_tokens <- NA
simulated_data$outcome_logprobs <- NA
simulated_data$outcome_finish_reason <- NA
simulated_data$outcome_message.refusal <- NA
simulated_data$outcome_message.role <- NA


# Initialize columns for storing the 'unlucky' counterfactual outcome
simulated_data$outcome_counterfactual_gpt <- NA
simulated_data$outcome_counterfactual_id_gpt <- NA
simulated_data$outcome_counterfactual_object_gpt <- NA
simulated_data$outcome_counterfactual_created_gpt <- NA
simulated_data$outcome_counterfactual_model_gpt <- NA
simulated_data$outcome_counterfactual_prompt_tokens <- NA
simulated_data$outcome_counterfactual_completion_tokens <- NA
simulated_data$outcome_counterfactual_logprobs <- NA
simulated_data$outcome_counterfactual_finish_reason <- NA
simulated_data$outcome_counterfactual_message.refusal <- NA
simulated_data$outcome_counterfactual_message.role <- NA


# Initialize column for recording time cost
simulated_data$outcome_time_cost <- NA


# Assign one computational core to each manipulation condition
library(foreach)  # provides the %dopar% operator used below
doParallel::registerDoParallel(cores_parallel_worlds)


# Progress bar to keep track of the for-loop
pb <- txtProgressBar(min = 0, max = nrow(simulated_data), style = 3)


# Indices in the data for which to generate the counterfactual outcomes
indices <- simulated_data$id[is.na(simulated_data$outcome_gpt) | is.na(simulated_data$outcome_counterfactual_gpt)] %>% as.numeric()


# Reset attempts
attempt_thresholds <- 0


# For-loop for generating the counterfactual outcomes
for(i in c(indices)){
  
  
  # Check if threshold has been met
  if(attempt_thresholds >= max_attempt_thresholds){
    print("Threshold for maximum number of errors reached. Stopping...")
    break
  }
  
  
  # Start time recording
  t0 <- Sys.time()

  
  # Enforce a pause if this is the second iteration or higher
  if(i > 1){
    Sys.sleep(pause_duration)
  }
  

  # Adjust prompt to reflect respondent i's life-history
  message_history <- list(
    list(
      "role" = "system",
      "content" = "You are a human survey respondent."
    ),
    list(
      "role" = "assistant",
      "content" = simulated_data$life_history_gpt[i]
    ),
    list(
      "role" = "user",
      "content" = NA
    )
  )
  
  
  # Set initial attempt
  attempt <- 1
  
  
  # Use parallel processing to 'create parallel worlds', where each respondent is simultaneously and independently assigned to both conditions and has their outcomes measured. A Bernoulli draw decides the index sequence, ensuring that any idiosyncratic waits attributable to the initial index-value are unbiased
  index_sequence <- if(rbinom(1, 1, .5) == 1) 1:2 else 2:1
  gpt_results <- foreach::foreach(j = index_sequence, .combine = "rbind") %dopar% {
    
    # Adjust prompt to reflect the manipulation condition
    message_history[[3]]$content <- group_condition[[j]]$content
    
    while(attempt <= max_attempts){
      
      # Record attempts
      attempt <- attempt + 1
  
      # Call GPT model (and retry if encountering an error)
      tryCatch(
        expr = {
          gpt_output <- openai::create_chat_completion(
            model = selected_gpt_model,
            messages = message_history,
            max_tokens = max_tokens$outcome,
            temperature = temperature$outcome,
            openai_api_key = OPENAI_API_KEY
          )
          
          # Return the labeled output (this also exits the parallel worker)
          return(data.frame(
            condition = group_condition[[j]]$condition,
            gpt_output = gpt_output
          ))
        },
        error = function(e){
          Sys.sleep(time = 10)
        }
      )
    }
    
    # Return NULL if all attempts failed, so rbind skips this condition
    NULL
  }
  
  
  # Check whether any condition failed to generate an output (note that 'attempt' is modified inside the parallel workers, so failures are detected from the rows returned by the workers)
  if(is.null(gpt_results) || nrow(gpt_results) < 2){
    attempt_thresholds <- attempt_thresholds + 1
    print(paste0("Maximum number of errors reached threshold. Thresholds met: ", attempt_thresholds, "/", max_attempt_thresholds))
    
    
    # Update progress bar
    setTxtProgressBar(pb, i)
    
    
    # End and store time recording
    t1 <- Sys.time()
    simulated_data$outcome_time_cost[i] <- t1 - t0
    
    next
  }
  
  
  # Store 'lazy' outcome
  simulated_data$outcome_gpt[i] <- gpt_results$gpt_output.choices.message.content[gpt_results$condition == "lazy"]


  # Store 'lazy' utility info
  simulated_data$outcome_id_gpt[i] <- gpt_results$gpt_output.id[gpt_results$condition == "lazy"]
  simulated_data$outcome_object_gpt[i] <- gpt_results$gpt_output.object[gpt_results$condition == "lazy"]
  simulated_data$outcome_created_gpt[i] <- gpt_results$gpt_output.created[gpt_results$condition == "lazy"]
  simulated_data$outcome_model_gpt[i] <- gpt_results$gpt_output.model[gpt_results$condition == "lazy"]
  simulated_data$outcome_prompt_tokens[i] <- gpt_results$gpt_output.usage.prompt_tokens[gpt_results$condition == "lazy"]
  simulated_data$outcome_completion_tokens[i] <- gpt_results$gpt_output.usage.completion_tokens[gpt_results$condition == "lazy"]
  simulated_data$outcome_logprobs[i] <- gpt_results$gpt_output.choices.logprobs[gpt_results$condition == "lazy"]
  simulated_data$outcome_finish_reason[i] <- gpt_results$gpt_output.choices.finish_reason[gpt_results$condition == "lazy"]
  simulated_data$outcome_message.refusal[i] <- gpt_results$gpt_output.choices.message.refusal[gpt_results$condition == "lazy"]
  simulated_data$outcome_message.role[i] <- gpt_results$gpt_output.choices.message.role[gpt_results$condition == "lazy"]
  
  
  # Store 'unlucky' counterfactual outcome
  simulated_data$outcome_counterfactual_gpt[i] <- gpt_results$gpt_output.choices.message.content[gpt_results$condition == "unlucky"]


  # Store 'unlucky' utility info
  simulated_data$outcome_counterfactual_id_gpt[i] <- gpt_results$gpt_output.id[gpt_results$condition == "unlucky"]
  simulated_data$outcome_counterfactual_object_gpt[i] <- gpt_results$gpt_output.object[gpt_results$condition == "unlucky"]
  simulated_data$outcome_counterfactual_created_gpt[i] <- gpt_results$gpt_output.created[gpt_results$condition == "unlucky"]
  simulated_data$outcome_counterfactual_model_gpt[i] <- gpt_results$gpt_output.model[gpt_results$condition == "unlucky"]
  simulated_data$outcome_counterfactual_prompt_tokens[i] <- gpt_results$gpt_output.usage.prompt_tokens[gpt_results$condition == "unlucky"]
  simulated_data$outcome_counterfactual_completion_tokens[i] <- gpt_results$gpt_output.usage.completion_tokens[gpt_results$condition == "unlucky"]
  simulated_data$outcome_counterfactual_logprobs[i] <- gpt_results$gpt_output.choices.logprobs[gpt_results$condition == "unlucky"]
  simulated_data$outcome_counterfactual_finish_reason[i] <- gpt_results$gpt_output.choices.finish_reason[gpt_results$condition == "unlucky"]
  simulated_data$outcome_counterfactual_message.refusal[i] <- gpt_results$gpt_output.choices.message.refusal[gpt_results$condition == "unlucky"]
  simulated_data$outcome_counterfactual_message.role[i] <- gpt_results$gpt_output.choices.message.role[gpt_results$condition == "unlucky"]
  
  
  # Update progress bar
  setTxtProgressBar(pb, i)
  
  
  # End and store time recording
  t1 <- Sys.time()
  simulated_data$outcome_time_cost[i] <- t1 - t0
}


# Convert Unix Timestamp to Datetime
simulated_data$outcome_datetime <- lubridate::as_datetime(simulated_data$outcome_created_gpt)
simulated_data$outcome_counterfactual_datetime <- lubridate::as_datetime(simulated_data$outcome_counterfactual_created_gpt)

Executing the above for-loop, generating a counterfactual outcome for the ‘lazy’ and ‘unlucky’ welfare recipient manipulation conditions for each of the 1,009 synthetic respondents, takes a total of 182.00 minutes, or 3.03 hours. This means that for these specified settings, the expected time-cost per simulated respondent is 10.82 (SD = .00) seconds. Following this, the generated outcomes can then be examined to ascertain whether the intended output has been generated and whether the model ran into any issues generating it. Similar to the generated life-histories, this could involve using coders to check for consistency between the responses and the life-histories/characteristics, but this is beyond the scope of this paper. To help illustrate these LLM-generated counterfactual outcomes, an example of a synthetic respondent with both counterfactual outcomes observed is provided in table 4.6.

Table 4.6: Example of Counterfactual Outcomes
id 495

Outcome (‘Lazy’ Welfare Recipient):

I would answer: 3 Somewhat disagree.

While I understand the concerns about individuals who may not be motivated to seek employment, I believe that social welfare should provide support for those in need, regardless of their employment history. It’s important to consider the broader context of why someone may lack motivation, which could include mental health issues, lack of access to education or job training, or other personal circumstances. Instead of tightening eligibility requirements, I think we should focus on providing resources and support to help individuals become self-sufficient.

Outcome (‘Unlucky’ Welfare Recipient):

1 Strongly disagree

I believe that individuals who are on social welfare, especially those who have suffered work-related injuries and are motivated to return to work, should have access to support without additional barriers. Tightening eligibility requirements could make it harder for those who genuinely need assistance to get back on their feet. It’s important to provide help and resources to facilitate their recovery and reintegration into the workforce.

NOTE: Randomly-sampled counterfactual responses to the question: ‘To what extent do you disagree or agree that the eligibility requirements for social welfare should be tightened for persons like him?’ (Aarøe and Petersen 2014: 689), with responses using the 7-point Likert (Likert 1932) scale (‘Strongly disagree’ to ‘Strongly agree’). Synthetic data generated by OpenAI’s (2024a) gpt-4o-mini model on 2025-03-05 09:58:24 in response to vignettes and questions based on Aarøe and Petersen (2014) and Petersen et al. (2012), using simulated characteristics based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

The provided example gives an overview of the expected output produced by the gpt-4o-mini model, which can consist of responses of varying length and concision that may not be as easily processed as data generated by human respondents in a survey. This may not be an issue if the researcher wants to analyze the output in a qualitative manner; otherwise, the data can be converted to a format more easily processed quantitatively by using regular expressions (regex, Thompson 1968; Wickham 2023) to extract the part of each response that fits the intended response scale. Since Aarøe and Petersen (2014) treat the 7-point Likert scale (somewhat inappropriately, cf. Göb et al. 2007; Lantz 2013; Wu and Leung 2017) as interval-scaled (cf. Stevens 1932), this replication similarly converts the outcome to a numeric format for linear analysis. For an overview of the response distributions across these two manipulation conditions, descriptive statistics are provided metrically and graphically in the appendix. Following this, the group balance in synthetic respondent characteristics can then be assessed.
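A minimal sketch of such an extraction follows; the pattern (the first standalone scale value in each response) and the stringr dependency are illustrative choices:

# A minimal sketch of extracting the numeric Likert response via regex;
# the pattern (first standalone 1-7, or 99 for 'Don't know') is an
# illustrative assumption about the response format
library(stringr)

extract_likert <- function(response){
  as.numeric(stringr::str_extract(response, "\\b(99|[1-7])\\b"))
}

simulated_data$outcome_numeric <- extract_likert(simulated_data$outcome_gpt)
simulated_data$outcome_counterfactual_numeric <- extract_likert(simulated_data$outcome_counterfactual_gpt)

# Treat 'Don't know' (99) as missing for the linear analysis
simulated_data$outcome_numeric[simulated_data$outcome_numeric %in% 99] <- NA
simulated_data$outcome_counterfactual_numeric[simulated_data$outcome_counterfactual_numeric %in% 99] <- NA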

4.4.2 Group Balance

As outlined in the theoretical basis and identification strategy of the SCF, this paper has so far argued that the intra-unit comparisons achieved with ‘parallel worlds’-estimation, entirely equivalent to blocking (cf. Duflo et al. 2007: 3925; Gelman et al. 2021: 347-353; Imai et al. 2008: 489), ensure perfect balance in the respondent characteristics across manipulation conditions. Following recommended procedures (Imai et al. 2008; Mutz and Pemantle 2012: 3), to empirically assess whether the SIXAM of ‘parallel worlds’-estimation is unbiased, this subsection examines balance in characteristics across the manipulation conditions. This is done across all simulated characteristics, comparing their averages or frequencies across the ‘lazy’ and ‘unlucky’ welfare recipient manipulation conditions. Consistent with similar recommendations (Imai et al. 2008: 498), unstandardized estimates of group differences are provided. The results are reported in table 4.7.

Table 4.7: Balance Table
Characteristic ‘Unlucky’ Recipient Condition ‘Lazy’ Recipient Condition Difference
Females (%) 0.528 0.528 0
Age (M) 51.724 51.724 0
Household Income (M) 114,228.121 114,228.121 0
Republicans (%) 0.463 0.463 0
Conservatism (M) 0.572 0.572 0
Less Than High School Credential (%) 8.722 8.722 0
High School Credential (%) 26.065 26.065 0
Some Post-High School, No Bachelor’s Degree (%) 30.624 30.624 0
Bachelor’s Degree (%) 19.623 19.623 0
Graduate Degree (%) 14.965 14.965 0
0 Children (%) 64.618 64.618 0
1 Child (%) 17.542 17.542 0
2 Children (%) 11.992 11.992 0
3 Children (%) 4.361 4.361 0
4 Or More Children (%) 0.000 0.000 0
Married (%) 56.690 56.690 0
Divorced (%) 11.695 11.695 0
Not Married (%) 22.101 22.101 0
Separated (%) 0.991 0.991 0
Widowed (%) 8.523 8.523 0
White (%) 75.124 75.124 0
Bangladeshi (%) 0.198 0.198 0
Black/African American (%) 8.028 8.028 0
Cambodian (%) 0.099 0.099 0
Chinese (%) 1.883 1.883 0
Colombian (%) 0.793 0.793 0
Cuban (%) 0.496 0.496 0
Dominican (%) 0.396 0.396 0
Ecuadorian (%) 0.198 0.198 0
Filipino (%) 0.694 0.694 0
Honduran (%) 0.198 0.198 0
Indian (%) 1.784 1.784 0
Indonesian (%) 0.099 0.099 0
Japanese (%) 0.099 0.099 0
Korean (%) 0.396 0.396 0
Mexican (%) 5.649 5.649 0
Native American (%) 0.496 0.496 0
Nepalese (%) 0.099 0.099 0
Nicaraguan (%) 0.198 0.198 0
Pakistani (%) 0.396 0.396 0
Paraguayan (%) 0.099 0.099 0
Peruvian (%) 0.198 0.198 0
Puerto Rican (%) 0.793 0.793 0
Salvadoran (%) 0.297 0.297 0
Spaniard (%) 0.198 0.198 0
Thai (%) 0.099 0.099 0
Venezuelan (%) 0.297 0.297 0
Vietnamese (%) 0.694 0.694 0
Employed (%) 60.357 60.357 0
Not In Labor Force (%) 37.265 37.265 0
Unemployed (%) 2.379 2.379 0
California (%) 10.803 10.803 0
Alabama (%) 1.982 1.982 0
Alaska (%) 0.099 0.099 0
Arizona (%) 2.577 2.577 0
Arkansas (%) 1.189 1.189 0
Colorado (%) 1.586 1.586 0
Connecticut (%) 1.388 1.388 0
Delaware (%) 0.396 0.396 0
District Of Columbia (%) 0.198 0.198 0
Florida (%) 5.946 5.946 0
Georgia (%) 3.568 3.568 0
Hawaii (%) 0.099 0.099 0
Idaho (%) 0.595 0.595 0
Illinois (%) 4.361 4.361 0
Indiana (%) 1.586 1.586 0
Iowa (%) 0.892 0.892 0
Kansas (%) 1.388 1.388 0
Kentucky (%) 1.090 1.090 0
Louisiana (%) 1.982 1.982 0
Maine (%) 0.694 0.694 0
Maryland (%) 1.586 1.586 0
Massachusetts (%) 1.288 1.288 0
Michigan (%) 2.478 2.478 0
Minnesota (%) 1.586 1.586 0
Mississippi (%) 0.694 0.694 0
Missouri (%) 2.180 2.180 0
Montana (%) 0.198 0.198 0
Nebraska (%) 0.991 0.991 0
Nevada (%) 0.892 0.892 0
New Hampshire (%) 0.396 0.396 0
New Jersey (%) 3.271 3.271 0
New Mexico (%) 0.297 0.297 0
New York (%) 5.748 5.748 0
North Carolina (%) 2.577 2.577 0
Ohio (%) 4.559 4.559 0
Oklahoma (%) 1.784 1.784 0
Oregon (%) 1.288 1.288 0
Pennsylvania (%) 4.856 4.856 0
South Carolina (%) 2.478 2.478 0
South Dakota (%) 0.694 0.694 0
Tennessee (%) 1.685 1.685 0
Texas (%) 7.631 7.631 0
Utah (%) 1.189 1.189 0
Vermont (%) 0.099 0.099 0
Virginia (%) 1.883 1.883 0
Washington (%) 2.577 2.577 0
West Virginia (%) 0.297 0.297 0
Wisconsin (%) 2.081 2.081 0
Wyoming (%) 0.297 0.297 0
Sample Size 1,009.000 1,009.000 0

NOTE: Value comparisons of unstandardized differences across manipulation conditions. % = Percent; M = Mean. Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini model based on vignettes and questions by Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).
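For researchers wishing to compute such a balance table themselves, a minimal sketch follows, assuming a hypothetical long-format data frame sct_data with one row per synthetic respondent per manipulation condition (the column names are illustrative):

# A minimal sketch of computing unstandardized balance differences, assuming
# a hypothetical long-format data frame 'sct_data' with one row per synthetic
# respondent per manipulation condition
library(dplyr)

sct_data %>%
  dplyr::group_by(condition) %>%
  dplyr::summarise(
    females = mean(sex == "Female"),
    age = mean(age),
    household_income = mean(household_income)
  ) %>%
  # Difference between the two conditions for each characteristic
  dplyr::summarise(dplyr::across(-condition, diff))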

The balance table reveals that there are invariably no differences in any of the synthetic respondent characteristics across the manipulation conditions, with even the sum of differences across characteristics being exactly zero. With balance being a property of the sample (Stuart 2010: 11; Imai et al. 2008: 497), this means that the results in this study cannot be confounded by these characteristics. While tests using inferential statistics could help ascertain whether the SIXAM is unbiased outside of this particular sample (Mutz and Pemantle 2012: 3), the invariable differences of zero, which would yield degenerate (e.g., p = 1), if not undefined, test statistics, make it credible to infer that the SIXAM guarantees perfect balance on the assigned characteristics in other samples as well.

While the balance table only demonstrates perfect balance on the measured characteristics, every synthetic respondent is assigned to both manipulation conditions, providing every respondent with a perfect unit for comparison across conditions. It is here argued that there is therefore perfect balance not only on the measured characteristics but on every unmeasured characteristic as well, since comparing the same unit across two counterfactual conditions implies that all unit-invariant characteristics are balanced. This demonstrates the high internal validity of the SCF and the equivalent efficiency of the SIXAM and blocking, especially when compared to other methods (e.g., the RCT) that only guarantee perfect balance on average, or asymptotically as a function of the sample size.

4.4.3 Testing Simultaneity

With group balance having been demonstrated to be perfect, a more relevant concern for researchers using the SCT is testing the simultaneity assumption. As mentioned, this assumption serves to eliminate the ability of time to confound results, which would be possible if the timing of the generation and measurement of outcomes for the same individual varied across manipulation conditions. This was why the ‘parallel worlds’-estimation procedure was used when generating the outcome data, since parallel processing expectedly helps ensure that the generation occurs simultaneously by using separate computational cores. However, while the use of parallel processing offers a reasonable argument for abiding by the simultaneity assumption, recall that a reasonable argument is seldom a guarantee for the validity of an assumption, and it can be considered best practice to empirically test assumptions when possible to see whether they hold (cf. Hartman and Hidalgo 2018). The simultaneity assumption can be tested since the time of receiving the response from the gpt-4o-mini LLM was saved and stored in the data. This allows assessing whether the time difference between generating the outcomes across the two manipulation conditions is zero and whether the generation is biased towards any manipulation condition. If the difference is zero across all synthetic respondents, this suggests that the individual treatment effects (ITEs) are unconfounded by time. Should this be false, another possibility is that the difference across all individuals is zero in expectation, which means that the average treatment effect (ATE) can be considered unconfounded by time. Both conditions can be interesting to examine, though if the first condition holds, the second condition would necessarily hold as well.
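A minimal sketch of how this time difference could be computed from the stored timestamps (the standardization to 'time second standard deviations' matches the model specified below):

# A minimal sketch of computing the time difference between the counterfactual
# measurements from the stored 'created' Unix timestamps (positive = 'lazy' slower)
simulated_data$time_difference <- simulated_data$outcome_created_gpt - simulated_data$outcome_counterfactual_created_gpt

# Standardize to 'time second standard deviations'
simulated_data$time_difference_z <- simulated_data$time_difference / sd(simulated_data$time_difference, na.rm = TRUE)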

Testing that the difference in outcome generation across manipulation conditions is zero can be done by specifying a statistical model and performing a Bayesian hypothesis test (cf. Heck 2019; Wagenmakers et al. 2010), with an a priori null hypothesis (\(H_0\)) that the difference (\(\delta\)) in measurement time across group conditions is zero (\(H_0: \delta = 0\)), and an alternative hypothesis (\(H_a\)) that it is non-zero (\(H_a: \delta \ne 0\)).27 As was already made evident in the Bayesian meta-analysis, this study primarily subscribes to abductivist scientific reasoning (Dellsén 2024), in part by relying on Bayes’ theorem (Bayes and Price 1763; Laplace 2009[1814]) for statistical inference (Gelman et al. 2014, 2021; Heck 2019; Kruschke 2010; McElreath 2019; Wagenmakers et al. 2010), though it complements this approach with deductive reasoning (Popper 2002[1959]) and Frequentist inference (Fisher 1970; Stock and Watson 2019; Wooldridge 2019); this hybrid approach can be considered part of an emerging Bayesian-Frequentist framework (e.g., Han et al. 2023; Wulff and Taylor 2024; Yuan 2009). With this in mind, for the purposes of testing the simultaneity assumption, using notation based on McElreath (2019), a Bayesian model is specified in table 4.8. Note that a Frequentist version of the model can easily be derived by changing the priors to uniform distributions, or equivalently, omitting them entirely (Hastie et al. 2017: 272).

Table 4.8: Simultaneity Test Model
\(D_{z, i} \: \overset{\mathrm{iid}}{\sim} \: \mathcal{N}(\delta, \varsigma)\) 1.1
\(\delta \: \overset{\mathrm{iid}}{\sim} \: \mathcal{T}_3(0, \: 1)\) 1.2
\(\varsigma \: \overset{\mathrm{iid}}{\sim} \: \mathcal{T}^{(0; \infty]}_3(1, \: .077)\) 1.3

NOTE: Bayesian Generalized Linear Model. 1.1 is the likelihood-function, which specifies that the observations of the outcome variable \(D_{z, i}\) (i.e., the time difference between measurements across the manipulation conditions in time second standard deviations) are assumed to be independent and identically distributed (i.e., iid), following a normal (Gaussian) distribution, with the mean (\(\delta\)) and standard deviation (\(\varsigma\)) parameters being the desired estimands. 1.2 is the mean (\(\delta\)) prior, which is specified as a Student t distribution (Gosset 1908; Helmert 1876a; 1876b; 1875; Lüroth 1876; Pearson 1895b), with a degrees of freedom (df) parameter of 3, a mean of 0, and a standard deviation of 1. 1.3 is the prior for the (residual) dispersion (\(\varsigma\)), specified as a non-negative Student t distribution, with a df of 3, a mean of 1, and a standard deviation of .077.

In the model, the time difference in time second standard deviations is treated as a continuous outcome that is assumed to be independent and identically distributed (iid), following a normal (Gaussian) distribution (Gauss 2012[1809]).28 Following recent recommendations (Lundberg et al. 2021), the quantities of interest (i.e., the estimands) are explicitly stated as the mean (i.e., \(\delta\)) and dispersion (\(\varsigma\)) of this time difference distribution. The prior distribution of the mean is specified to a priori assume no difference, while its standard deviation is specified as 1, which serves to make the prior ‘weakly informative’ relative to the data (Gelman et al. 2014: 55-56; Gelman et al. 2021: 123-127). Consistent with practices recommended by Paul-Christian Bürkner (2017, 2018), the priors are specified to follow a Student t distribution (Gosset 1908; Helmert 1876a; 1876b; 1875; Lüroth 1876; Pearson 1895b), which, following Aki Vehtari, are specified with 3 degrees of freedom (df) to ensure a finite mean and variance of the prior distribution. The dispersion (i.e., residual standard deviation) prior is specified as a non-negative Student t with a df of 3; only non-negative values are allowed, since the dispersion by definition cannot be negative. Its mean is specified as 1 standard deviation of the time difference, with a standard deviation of .077, which effectively means that the model is a priori expected to account for little to no variation in the time difference, with the prior being wide enough to let the data prove it wrong. This can be considered more conservative than other approaches (e.g., McElreath 2019: 120-121), which typically specify the prior for the dispersion so that the model is a priori expected to explain all or most variation in the outcome, which is arguably unreasonable for most models, at least in the social sciences.

The model is then fit to the data in R using the brms R package (Bürkner 2017, 2018) to exploit STAN (Stan Development Team 2024) and the NUTS algorithm (Hoffman and Gelman 2014) to efficiently draw samples from the time difference posterior distribution. To assess whether the posterior has been reliably identified, 7 independent sampling chains are used, and to increase the reliability of the drawn samples, each chain first draws 2,000 warmup samples. Following this, approximately 40,000 post-warmup samples are drawn in total, which serves to reliably identify the posterior and produce precise Bayes factors (BF, Makowski et al. 2019a) by invoking the law of large numbers (Angrist and Pischke 2015: 13-16; Sen and Singer 1993). With model diagnostics and posteriors being provided in the appendix, inspections of the fitted model using traceplots and the Gelman-Rubin Convergence metric (\(\hat{R}\), Gelman and Rubin 1992; Vehtari et al. 2021) show that the sampling chains reliably converge to the same posterior distribution (\(\hat{R}\) < 1.01), with an approximately zero sampling error as indicated by the negligible Monte-Carlo Standard Error (MCSE ≤ .001, Kruschke 2014). Accounting for the autocorrelation within chains, the effective sample size (ESS) can also be considered ‘sufficiently’ large (ESS ≥ 1,000, Bürkner 2017).
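To make this specification concrete, the model in table 4.8 could be fit along the following lines, a minimal sketch assuming a data frame `d` with a hypothetical column `time_diff_z` holding the standardized measurement-time differences (\(D_{z,i}\)); the chain settings mirror those described above (7 chains × 5,715 post-warmup draws = 40,005).

```r
# Sketch of the table 4.8 simultaneity test model in brms (Bürkner 2017).
# 'd' and 'time_diff_z' are hypothetical names; the seed is arbitrary.
library(brms)

fit_sim <- brm(
  time_diff_z ~ 1,  # intercept-only model: the intercept is delta
  data   = d,
  family = gaussian(),
  prior  = c(
    prior(student_t(3, 0, 1), class = "Intercept"),  # delta prior, eq. 1.2
    prior(student_t(3, 1, 0.077), class = "sigma")   # varsigma prior, eq. 1.3
    # (brms truncates the sigma prior at zero automatically)
  ),
  chains = 7, warmup = 2000, iter = 7715,  # 7 x (7715 - 2000) = 40,005 draws
  cores  = 7, seed = 2024
)
```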

As previously mentioned, the specified priors may affect the results in an unwanted manner, and to validate the priors as ‘noninformative’ relative to the data, a Frequentist version of the model is estimated using ordinary least squares (OLS, Legendre 1805). Inferential statistics for this model were derived using the non-parametric bootstrap (Efron 1979, 2003; Efron and Tibshirani 1994) with the number of resamples equivalent to the number of posterior samples (i.e., 40,005 resamples), exploiting the fact that the Frequentist bootstrap approximates a Bayesian model with uniform priors (Hastie et al. 2017: 272). Comparing the posterior distribution with its (maximum) likelihood counterpart, while the overlap is a mere 15.32%, the exact difference is nonetheless zero with 95% credibility. Visual comparisons of the distributions (provided in the appendix) suggest that the low overlap is due to the multimodality of the (maximum) likelihood distribution, possibly a result of too few resamples to invoke the law of large numbers for this irregular data. Given the credible lack of difference between the distributions, the priors were nevertheless concluded to be ‘noninformative’ relative to the data. The posterior distribution for the estimated mean time difference (\(\hat{\delta}\)) parameter is displayed in figure 4.3, while summary statistics are provided in table 4.9.
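For the intercept-only model, the OLS estimate reduces to the sample mean, so the Frequentist counterpart could be sketched as follows, assuming the same hypothetical `time_diff_z` column.

```r
# Sketch of the Frequentist counterpart: a non-parametric bootstrap of the
# OLS (intercept-only) estimate, i.e., the sample mean (Efron 1979).
library(boot)

boot_mean <- boot(
  data      = d$time_diff_z,
  statistic = function(x, idx) mean(x[idx]),  # intercept-only OLS = mean
  R         = 40005                           # matches the posterior draws
)
boot.ci(boot_mean, type = "bca")  # BCa interval (Efron 1987)
```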

Figure 4.3: Time Difference Between Measured Group Conditions

NOTE: Posterior distribution of the time difference (in time second standard deviations) between measurements across manipulation conditions (i.e., \(\delta\)). Posterior samples = 40,005. The dashed black line indicates a time difference of zero. The geometric shapes below the distribution indicate the posterior mode (circle), the 95% HDCI (thick bar), 99% HDCI (thin bar), and 99.9% HDCI (line). Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions based on Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

The results reveal that the a priori expectation of no time difference in generating and measuring the counterfactual outcomes across the ‘lazy’ and ‘unlucky’ welfare recipient manipulation conditions has been updated by the data, resulting in an estimated average time difference in time second standard deviations of .033 (SD = .031; 95% HDI[-.027; .097]; \(BF_{1,0}\) = .03). Following King et al. (2000), this can be interpreted as follows: ceteris paribus, compared to the ‘unlucky’ welfare recipient manipulation condition, the time it takes OpenAI’s gpt-4o-mini model to receive and respond to the prompt containing the life-history and manipulation condition (see table 4.5) is on average .033 standard deviations (SD) slower for the ‘lazy’ welfare recipient manipulation condition, plus or minus .031SD, and with 95% credibility between -.027SD and .097SD. This difference is positive yet relatively small, and converted into raw seconds for ease of interpretation, the unstandardized time difference is merely .740 seconds. This means that despite the attempts to ensure perfect time equivalence across the manipulation conditions, some difference persists in the sample. Since balance is a property of the sample (Stuart 2010: 11; Imai et al. 2008: 497), estimates of individual treatment effects could be biased by being confounded by time. To assess how detrimental this is for the analysis, a summary of the sample, however, reveals that the time difference is exactly zero for 93.76% of the synthetic respondents, meaning that possibly biased estimates constitute only a small minority of the sample. While it may be argued that the confounding effect of time would have to be large to substantively bias estimates derived from the data (cf. Cinelli and Hazlett 2020), a more reliable approach is to complement the main analysis with a sensitivity analysis that omits the possibly time-confounded synthetic respondents to gauge the impact on the results.
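The descriptive checks just reported could be computed along these lines, assuming hypothetical columns `time_diff` (raw seconds) and `measurement_time` in the sample data, and assuming the standardizing SD is that of the measurement times.

```r
# Sketch of the descriptive checks above (hypothetical column names):
# the share of respondents with exactly zero time difference, and the
# standardized estimate converted back into raw seconds.
share_zero <- mean(d$time_diff == 0)          # approx. .9376 in this sample
raw_diff   <- 0.033 * sd(d$measurement_time)  # approx. .740 seconds
```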

Table 4.9: Measurement Time Difference Between Manipulation Conditions
\(\hat{\theta}\) \(Parameter\) \(Estimate\) \(p\) \(ln(BF_{1,0})\) \(\hat{R}\) \(ESS\) \(MCSE\) \(Prior\)
\(\hat{\delta}\) Mean .033 (.031) [-.027; .097] 0.295 -3 1 21,127 0 \(\mathcal{T}_3(.000, 1.000)\)

NOTE: Bayesian Generalized Linear Model, estimated using Bayes MCMC with NUTS (Bürkner 2017, 2018). Posterior means with standard deviations in parentheses and 95% HDI in brackets. n = 1,009. p = Two-tailed probability of at least as extreme a sample estimate if assuming no true difference: \({}^{\#}\!\)p ≤ .1, \({}^{*}\!\)p ≤ .05, \({}^{**}\!\)p ≤ .01, \({}^{***}\!\)p ≤ .001. \(ln(BF_{1,0})\) = Natural-logarithm-scaled Bayes factor with the null-hypothesis (\(H_0: \delta = 0\)) in the denominator. \(\hat{R}\) = Gelman-Rubin Convergence metric, ‘Excellent’ if < 1.01 (Vehtari et al. 2021), ‘Acceptable’ if < 1.1 (Gelman and Rubin 1992), otherwise ‘Unacceptable’. ESS = Effective sample size, ‘Sufficient’ if ≥ 1,000, otherwise ‘Insufficient’ (Bürkner 2017). MCSE = Monte-Carlo Standard Error (Kruschke 2014). \(\mathcal{T}_3\) = Student t distribution (Gosset 1908; Helmert 1876a; 1876b; 1875; Lüroth 1876; Pearson 1895b) with 3 degrees of freedom, validated as ‘noninformative’ relative to the data. \(PP_P\) (Gelman et al. 1996) = 0.999. Bayesian \(R^2\) (Gelman et al. 2019) = .000 (SD = .000; 95% HDI[.000; .000]). RMSE (\(\hat{\varsigma}\)) = 1.001 (SD = .021; 95% HDI[.959; 1.042]; \(\hat{R}\) = 1; ESS = 21,918; MCSE = 0; Prior = \(\mathcal{T}^{(0; \infty]}_3(1.000, .077)\)). Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions based on Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Turning to the inferential statistics provided by the model, following Mutz and Pemantle (2012: 3) and Imai et al. (2008), these can be used to test whether the current implementation of the ‘parallel worlds’-estimation procedure can be expected to violate the simultaneity assumption. Consulting the 95% highest density (posterior) intervals (HD(P)I, McElreath 2019: 56-58), these reveal that the difference is expectedly zero with 95% credibility. This can be similarly expressed by considering the two-tailed p-value (cf. Makowski et al. 2019a; 2019b), indicating that the probability of at least as extreme a difference, assuming that there is no difference, is 29.47%, which would be considered statistically indiscernible from zero at conventional levels (\(p > .05\)). In line with Bayesian inference (Heck 2019; Makowski et al. 2019b; Wagenmakers et al. 2010), an arguably more relevant conclusion can be derived by considering the computed Bayes factor (Makowski et al. 2019a). Assuming a priori that a hypothesized time difference is as probable as no hypothesized time difference, inverting and exponentiating the posterior log-odds (i.e., \(BF_{0,1} = e^{-\ln(BF_{1,0})} \approx e^{3.0} \approx 20.20\); Euler 1862) reveals that the hypothesis that the ‘parallel worlds’-estimation procedure is unbiased is 20.20 times more probable than the hypothesis that it is biased. Following the guidelines by Raftery (1995), this can be interpreted as ‘strong evidence against’ the hypothesis that the ‘parallel worlds’-estimation procedure violates the simultaneity assumption by generating non-zero time differences across manipulation conditions given the data. As such, while some time differences do exist in the sample, they constitute a small portion, and with regards to the implementation of the ‘parallel worlds’-estimation procedure used here, the simultaneity assumption has been shown to be testable, with results indicating that the procedure is unbiased and can be considered to produce practically no time differences.

4.5 Bayesian Inference

To replicate the meta-analytic estimate derived from Aarøe and Petersen (2014), Mummolo and Peterson (2019), and Kane et al. (2023) as closely as possible, the model used for statistical inference solely includes an intercept and a coefficient for the difference between the ‘lazy’ and ‘unlucky’ welfare recipient manipulation conditions. Contrary to these existing studies, Bayesian inference is used, which can generally be considered superior in terms of interpretation and inference (Cohen 1994: 998-999; Pawitan 2001: 13; Clayton 2021; Gelman et al. 2014, 2021; Kruschke 2010; Kruschke and Liddell 2018; McElreath 2019), though for increased comparability with their results, Frequentist models are used for robustness. As previously alluded to, a corollary of the SCF’s ability to generate multiple counterfactual outcomes for each synthetic respondent is that the data can be transformed into a ‘long format’ to take advantage of these multiple observations for each unit. While this creates a respondent-clustered structure in the data, which would violate the iid assumption because observations in this format are not independent, it is easily accounted for with a hierarchical/multilevel model (Gelman and Hill 2007; Yarkoni 2020) that specifies respondent-varying intercepts. To ease comparability with the meta-analytic estimates, as well as the estimates by Aarøe and Petersen (2014), Mummolo and Peterson (2019), and Kane et al. (2023), the responses to the outcome ‘To what extent do you disagree or agree that the eligibility requirements for social welfare should be tightened for persons like him?’ (Aarøe and Petersen 2014: 689) are first recoded so that higher values indicate less agreement with tightening the eligibility requirements for the social welfare recipient, after which the outcome is normalized (i.e., min-maxed: rescaled 0 - 1); this data preparation is sketched below. For similar reasons, the categorical variable indicating the manipulation condition is specified so that the reference category is the ‘lazy’ welfare recipient manipulation. Using notation similar to McElreath (2019) and Nalborczyk et al. (2019), the specified multilevel model is then formalized and provided in table 4.10.
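Before turning to the model itself, the reshaping, recoding, and normalization just described could proceed along the following lines, a sketch assuming hypothetical wide-format columns `y_lazy` and `y_unlucky` holding the two counterfactual outcomes per respondent.

```r
# Sketch of the data preparation: pivot to 'long format', reverse-code and
# min-max normalize the outcome, and set 'lazy' as the reference category.
library(dplyr)
library(tidyr)

d_long <- d |>
  pivot_longer(
    cols      = c(y_lazy, y_unlucky),  # one column per manipulation condition
    names_to  = "condition",
    values_to = "welfare_support"
  ) |>
  mutate(
    # min-max to 0-1, then flip so higher = less agreement with tightening
    welfare_support = 1 - (welfare_support - min(welfare_support)) /
                          (max(welfare_support) - min(welfare_support)),
    condition = relevel(factor(condition), ref = "y_lazy")
  )
```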

Table 4.10: Modeled Relationship between Deservingness & Welfare Support
\(Y_i \: \overset{\mathrm{iid}}{\sim} \: \mathcal{T}_\nu(\mu_i, \sigma)\) 1.1
\(\mu_i = \alpha_{id[i]} \: + \: \beta_G \: \cdot \: G_i\) 1.2
\(\beta_G \: \overset{\mathrm{iid}}{\sim} \: \mathcal{T}_3(0, \: .077)\) 1.3
\(\alpha_{id} \: \overset{\mathrm{iid}}{\sim} \: \mathcal{T}_3(\mu_\alpha, \: \sigma_\alpha)\) 1.4
\(\mu_\alpha \: \overset{\mathrm{iid}}{\sim} \: \mathcal{T}_3(.5, \: .039)\) 1.5
\(\sigma_\alpha \: \overset{\mathrm{iid}}{\sim} \: \mathcal{T}^{(0; \infty]}_3(0, \: .039)\) 1.6
\(\sigma \: \overset{\mathrm{iid}}{\sim} \: \mathcal{T}^{(0; \infty]}_3(.5, \: .039)\) 1.7
\(\nu \: \overset{\mathrm{iid}}{\sim} \: \mathcal{T}^{(1; 2018)}_3(2013, \: 155.757)\) 1.8

NOTE: Bayesian Generalized Linear Model with respondent ID-varying intercepts. 1.1 is the likelihood-function, which specifies that the observations of the outcome variable \(Y\) (i.e., welfare support) are assumed to be independent and identically (i.e., iid) distributed as a Student t distribution (Gosset 1908; Helmert 1876a; 1876b; 1875; Lüroth 1876; Pearson 1895b), with mean, dispersion (i.e., residual standard deviation), and degrees of freedom (df) parameters. 1.2 is the affine function (Boyd and Vandenberghe 2018) of the mean parameter, which consists of random (i.e., respondent-level) varying intercepts and a fixed coefficient indicating the difference across the groups (i.e., manipulation conditions), with the ‘lazy’ welfare recipient vignette being the reference category. 1.3 is the prior of the fixed coefficient, specified as a Student t distribution with a degrees of freedom (df) parameter of 3, a mean of 0, and a dispersion of .077. 1.4 is the prior for the respondent-level varying intercepts, Student t distributed, with a df of 3, an unknown mean, which is the fixed (i.e., US-level) intercept, and an unknown dispersion parameter. 1.5 is the hyperprior for the fixed/mean parameter of the respondent-varying intercepts, being Student t distributed, with a df of 3, a mean of .5, and a dispersion of .039. 1.6 is the hyperprior for the dispersion parameter of the respondent-varying intercepts, being distributed as a non-negative Student t, with a df of 3, a mean of 0, and a dispersion of .039. 1.7 is the prior for the dispersion of the outcome (RMSE), specified as a non-negative Student t distribution, with a df of 3, a mean of .5, and a dispersion of .039. 1.8 is the prior for the df parameter of the model, being distributed as a truncated Student t, with a df of 3, a mean of 2,013, and a dispersion of 155.757, with the truncation including values between 1 and 2,018.

The model assumes that the outcome distribution (i.e., welfare support) is independent and identically (iid) Student t distributed (Gosset 1908; Helmert 1876a; 1876b; 1875; Lüroth 1876; Pearson 1895b), with its mean, dispersion (i.e., residual standard deviation), and degrees of freedom (df) being the estimands (cf. Lundberg et al. 2021). Using a Student t distribution for the likelihood allows estimating the df of the model and makes the model more robust (Gelman et al. 2021: 264-286; McElreath 2019: 233-234; O’Hagan 1979). The mean is modeled as an affine function (Boyd and Vandenberghe 2018), consisting of random (respondent-varying) intercepts and a fixed (i.e., US-level) coefficient indicating the difference in outcome of seeing an ‘unlucky’ compared to a ‘lazy’ welfare recipient. Since the purpose is to compare the results to the meta-analytic estimates derived from Aarøe and Petersen (2014), Mummolo and Peterson (2019), and Kane et al. (2023), to validate the SCF’s generalizability, the priors do not incorporate information from their results, but are instead specified as a priori ‘weakly informative’ relative to the data (cf. Gelman et al. 2014: 55-56; Gelman et al. 2021: 123-127), accounting for the 0 - 1 scaling of the welfare support outcome variable. For example, this is achieved by the prior for the mean of the intercepts (i.e., \(\mu_\alpha\)) exploiting the fact that a normalized variable (range: 0 - 1) with no information could expectedly be centered at .5, while its dispersion (i.e., .039) is specified to permit most values to be within the 0 - 1 bounds. In line with this, the prior for the coefficient (i.e., \(\beta_G\)) a priori assumes no direction, though with a dispersion of .077 it accounts for the coefficient likely being bounded between 0 and 1 due to the normalized scale. Similarly, the prior for the degrees of freedom (df) of the model (i.e., \(\nu\)) expects it to be the number of observations (i.e., n = 2,018) minus the 5 estimated ‘fixed’ parameters (i.e., \(\beta_G\), \(\mu_\alpha\), \(\sigma_\alpha\), \(\sigma\), and \(\nu\)), with the bounds permitting the lowest possible df of 1 and the highest possible df of 2,018 (i.e., the total number of observations in the ‘long’ data-format). The upper bound of 2,018 for the df prior is specified to account for the possibility of parameters being mutually informative, allowing for a df higher than 2,013, while also reflecting that the model cannot be more robust than permitted by the total number of observations in the data. All priors are specified using a Student t to similarly exploit its robust properties, with a df of 3 to ensure that they have a finite mean and variance. While these prior specifications may not be optimal and could be criticized if they impact the results in an unwanted manner, a property of Bayesian inference is that the likelihood (i.e., the estimates when omitting the priors) relatively quickly ‘overwhelms’ the priors as the sample size increases (Wagenmakers et al. 2010: 167). To validate this, in a manner similar to the meta-analysis and simultaneity test, the priors will be checked to ascertain whether they are ‘noninformative’ relative to the data by comparing Bayesian estimates with Frequentist estimates, which would indicate that their impact on the results is inconsequential.
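In brms syntax, the priors in table 4.10 could be declared roughly as follows. This is a sketch rather than an exact translation: brms fixes the lower bound of \(\nu\) at 1 by itself, and the upper truncation at 2,018 is not directly expressible through `set_prior`, so the df prior shown here is an approximation.

```r
# Sketch of the table 4.10 priors in brms syntax (Bürkner 2017).
library(brms)

priors <- c(
  prior(student_t(3, 0, 0.077),      class = "b"),          # beta_G, eq. 1.3
  prior(student_t(3, 0.5, 0.039),    class = "Intercept"),  # mu_alpha, eq. 1.5
  prior(student_t(3, 0, 0.039),      class = "sd"),         # sigma_alpha, eq. 1.6
  prior(student_t(3, 0.5, 0.039),    class = "sigma"),      # sigma, eq. 1.7
  prior(student_t(3, 2013, 155.757), class = "nu")          # nu, eq. 1.8 (lb = 1)
)
```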

With this model specified, a few a priori considerations for testing the two aforementioned hypotheses can be made. In relation to the expectation that increasing deservingness increases welfare support (\(H_1\)), this is supported by the coefficient being (1) credibly different from zero as ascertained by a 95% highest density (contiguous) interval (HD(C)I, Kay 2024a; 2024b; Makowski et al. 2019a; McElreath 2019: 56-58), or similarly with a 95% bias-corrected and accelerated (BCa, Efron 1987) or a 95% equal-tailed confidence interval (CI, Stock and Watson 2019: 117-118) for Frequentist models, which, following recommendations (Brandstätter 1999), are here preferred over p-values; and (2) the relative evidence in the form of the posterior odds (Makowski et al. 2019a; i.e., BF with prior odds of 1, Wagenmakers et al. 2010) indicating that the hypothesis of a relationship between deservingness and welfare support is more probable than the (null) hypothesis of no such relationship (i.e., \(ln(BF) > 0\)) given the data. It should also be noted that the sample size is listed as 2,018 in the analysis, due to the long format of the data accounting for each synthetic respondent in the nominal sample (n = 1,009) having had an outcome measured two times, once for each counterfactual outcome. With these considerations in mind, an analysis of the results can be conducted.

5 Analyses

With the overall aim of replicating the meta-analytic estimate of the effect of deservingness on welfare support, this section first covers (1) a confirmatory analysis that tests the hypothesized causal relationship between deservingness and welfare support (i.e., \(H_1\)). The robustness of the results is then assessed in (2) a sensitivity analysis, which includes numerous alternative model specifications, including a test of heterogeneous effects across the specified characteristics, and a test of whether the partial violation of the simultaneity assumption results in time confounding the results. To test the hypothesis that results are the same whether derived from synthetic or human respondents (i.e., \(H_2\)), the section then (3) compares the results from the main analysis to the meta-analytic estimate derived from the human-based studies (i.e., Aarøe and Petersen 2014; Kane et al. 2023; Mummolo and Peterson 2019). Lastly, to provide researchers with an overview of the costs induced by using the Synthetic Causal Framework (SCF), (4) a cost analysis assesses the financial and time costs of its implementation as a Synthetically Controlled Trial (SCT).

5.1 Confirmatory Analysis

To test the theorized relationship between deservingness and welfare support (\(H_1\)), the specified multilevel Bayesian statistical model (see table 4.10) is fitted to the synthetic data. Similar to estimating the Bayesian models for the meta-analysis and the test of the simultaneity assumption, this procedure uses the NUTS Hamiltonian MCMC (Hoffman and Gelman 2014) of STAN (Stan Development Team 2024), implemented by the brms R package (Bürkner 2017, 2018), using 7 independent sampling chains, with 2,000 warmup samples per chain, followed by a total of approximately 40,000 post-warmup samples. For ease of overview, results are provided graphically and metrically. To give a sense of the difference in welfare support induced by the manipulations, predicted welfare support across manipulation conditions is illustrated in figure 5.1. For the exact difference, the posterior of the estimated difference across manipulation conditions is provided in figure 5.2. For summary estimates and model diagnostics, summary statistics in a metric format are provided in table 5.1. Model diagnostics and results not covered here are provided in the appendix.
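Under the same assumptions as the earlier sketches (the hypothetical `d_long` data and the `priors` object from section 4.5), the fit could look as follows.

```r
# Sketch of fitting the table 4.10 model with brms (Bürkner 2017).
library(brms)

fit_main <- brm(
  welfare_support ~ condition + (1 | id),  # respondent-varying intercepts
  data   = d_long,
  family = student(),       # robust Student t likelihood (eq. 1.1)
  prior  = priors,          # as sketched in section 4.5
  chains = 7, warmup = 2000, iter = 7715,  # 7 x 5,715 = 40,005 draws
  cores  = 7, seed = 2024
)
```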

Figure 5.1: Differences in Welfare Support across Manipulation Conditions

NOTE: Posterior distributions of the normalized (0 - 1) welfare support across group conditions. Posterior samples = 40,005. The geometric shapes below the distributions indicate the posterior mode (circle) and the 95% HDCI. Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions based on Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

To first assess whether the results of the model are reliable, model diagnostics are inspected. These reveal that the posterior has been reliably identified, with chain convergence being ‘excellent’ (\(\hat{R}\) < 1.01, Gelman and Rubin 1992; Vehtari et al. 2021), the effective sample size being ‘sufficient’ (ESS > 1,000, Bürkner 2017), and almost no sampling uncertainty (MCSE ≈ .000, Kruschke 2014). By comparing the results to a Frequentist bootstrapped version of the same model (see the ‘Sensitivity Analysis’-section), which found an overlap of 97.17% and no difference with 95% credibility, the priors were validated as ‘noninformative’ relative to the data.29 In terms of the model’s goodness-of-fit, the posterior predictive p-value (\(PP_P\), Gelman et al. 1996) of 0.942 indicates that the model tends to make exaggerated estimates of welfare support; the Bayesian \(R^2\) (Gelman et al. 2019) of .721 (SD = .011; 95% HDI[.701; .742]), by contrast, indicates that the model’s explanatory power can with 99.99% credibility be interpreted as ‘substantial’ (Cohen 1988), suggesting a relatively high fit. Since the results of the model are thus relatively reliable, with the priors being ‘noninformative’ relative to the data, focus can be directed to the model parameters, which can be used to assess whether deservingness increases welfare support (i.e., \(H_1\)).
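The diagnostics reported above could be inspected with standard brms tooling, for instance as follows.

```r
# Sketch of the model diagnostics: convergence, traceplots, posterior
# predictive fit, and Bayesian R-squared (all functions from brms).
summary(fit_main)                    # Rhat and effective sample sizes
mcmc_plot(fit_main, type = "trace")  # traceplots for chain mixing
pp_check(fit_main, ndraws = 100)     # posterior predictive check
bayes_R2(fit_main)                   # Bayesian R2 (Gelman et al. 2019)
```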

Figure 5.2: Difference in Welfare Support for an ‘Unlucky’ Welfare Recipient compared to a ‘Lazy’ Welfare Recipient

NOTE: Posterior distribution of the normalized (0 - 1) coefficient of deservingness on welfare support. Posterior samples = 40,005. The geometric shapes below the distribution indicate the posterior mode (circle), the 95% HDCI (thick bar), 99% HDCI (thin bar), and 99.9% HDCI (line). Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions based on Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Does deservingness increase welfare support? Since the difference across the manipulation conditions is positive with 95% credibility, and the relative evidence (i.e., \(BF_{1,0}\)) overwhelmingly supports \(H_1\), deservingness would seem to increase welfare support among synthetic respondents. This is indicated by the normalized (0 - 1) ‘deservingness’-coefficient of .607 (SD = .009; 95% HDI[.590; .625]; \(ln(BF_{1,0})\) = 228.24), where the 95% highest density interval (HDI, cf. McElreath 2019: 56-58; Makowski et al. 2019a) indicates that the coefficient is positive and discernible from zero with 95% credibility. The estimated effect of deservingness on welfare support can, following King et al. (2000), specifically be interpreted as follows: Ceteris paribus, compared to seeing a ‘lazy’ welfare recipient (i.e., someone low in deservingness), an individual’s support for less strict requirements for welfare eligibility is on average 60.68 percentage points (pp) higher, plus or minus .91pp, and with 95% credibility between 58.96pp and 62.51pp, when seeing an ‘unlucky’ welfare recipient (i.e., someone high in deservingness).

Table 5.1: Relationship Between Deservingness & Welfare Support
\(\hat{\theta}\) \(Parameter\) \(Summary\) \(p\) \(ln(BF_{1,0})\) \(\hat{R}\) \(ESS\) \(MCSE\) \(Prior\)
\(\hat{\beta_G}\) Deservingness .607*** (.009) [.590; .625] 0 228 1 81,461 0 \(\mathcal{T}_3(.000, .077)\)
\(\hat{\mu_\alpha}\) Intercept .341*** (.008) [.326; .356] 0 152 1 49,026 0 \(\mathcal{T}_3(.500, .039)\)

NOTE: Bayesian Generalized Linear Model, estimated using Bayes MCMC with NUTS, with respondent ID-varying intercepts. Normalized coefficient posterior means (0 - 1) with standard deviations in parentheses and 95% HDI in brackets. Posterior samples = 40,005. \(n\) = 2,018; \(n_{ID}\) = 1,009. p = Two-tailed probability of at least as extreme a sample estimate if assuming no difference in the population: \({}^{\#}\!\)p ≤ .1, \({}^{*}\!\)p ≤ .05, \({}^{**}\!\)p ≤ .01, \({}^{***}\!\)p ≤ .001. \(ln(BF_{1,0})\) = Natural-logarithm-scaled Bayes factor with the null-hypothesis (\(H_0: \beta_G = 0\)) in the denominator. \(\hat{R}\) = Gelman-Rubin Convergence metric: ‘Excellent’ if < 1.01 (Vehtari et al. 2021), ‘Acceptable’ if < 1.1 (Gelman and Rubin 1992), otherwise ‘Unacceptable’. ESS = Effective sample size, ‘Sufficient’ if ≥ 1,000, otherwise ‘Insufficient’ (Bürkner 2017). MCSE = Monte-Carlo Standard Error (Kruschke 2014). \(\mathcal{T}_3\) = Student t distribution with 3 degrees of freedom, both validated as ‘noninformative’ relative to the data. \(PP_P\) (Gelman et al. 1996) = 0.942. Bayesian \(R^2\) (Gelman et al. 2019) = .721 (SD = .011; 95% HDI[.701; .742]). RMSE (\(\hat{\sigma}\)) = .204 (SD = .005; 95% HDI[.195; .213]; \(\hat{R}\) = 1; ESS = 15,715; MCSE = 0; Prior = \(\mathcal{T}^{(0; \infty]}_3(.500, .039)\)). Intercept (residual) dispersion (\(\hat{\sigma_\alpha}\)) = .126 (SD = .008; 95% HDI[.111; .141]; \(\hat{R}\) = 1; ESS = 11,326; MCSE = 0; Prior = \(\mathcal{T}^{(0; \infty]}_3(.000, .039)\)). Model degrees of freedom (\(\hat{\nu}\)) = 1849.486 (SD = 172.901; 95% HDI[1531.889; 2017.997]; \(\hat{R}\) = 1; ESS = 59,833; MCSE = 0.707; Prior = \(\mathcal{T}^{(1; 2018)}_3(2013.000, 155.757)\)). Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions based on Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

These results are consistent with the theorized relationship and are further corroborated by the computed Bayes factor (i.e., \(BF_{1,0}\), cf. Makowski et al. 2019a; Wagenmakers et al. 2010), which is reported on the natural-logarithmic scale due to its size. Assuming a priori that the hypothesized effect is as probable as the hypothesis of no effect, the \(BF_{1,0}\) indicates that the hypothesized effect (i.e., \(H_1\)) is much more probable than the hypothesis of no effect given the data. Following the guidelines by Raftery (1995), the relative evidence provided by the \(BF_{1,0}\) can specifically be interpreted to provide ‘very strong evidence in favour of’ the hypothesized relationship between deservingness and welfare support given the data. Having thus found substantial support for the hypothesized causal relationship between deservingness and welfare support among synthetic respondents (\(H_1\)), an assessment can be made regarding the robustness of these results.

5.2 Sensitivity Analysis

Are the results robust? With the estimates seeing no substantial difference as a result of (1) changing the Bayesian model (table 4.10) to a Frequentist bootstrapped version, (2) changing from random effects to fixed effects, (3) including interactions with the assigned characteristics, or (4) omitting synthetic respondents for whom the measurement time difference was non-zero, this insensitivity to varying model specifications suggests that the results from the main analysis are generally robust. This is based on a sensitivity analysis (cf. Athey and Imbens 2017: 19-21), which involved 4 separate assessments, each focused on a different aspect by which the results could vary as a function of the model specification. The first assessment addressed the reliance on Bayesian inference and the possible impact that the specification of priors could have on the results. The second assessment examined whether a change from the model’s random effects (i.e., ID-varying intercepts) to fixed effects impacted inference. The third assessment investigated the possibility of heterogeneous treatment effects, which involved re-specifying the main model to include interactions with the assigned characteristics. The fourth assessment sought to address the possibility that the violation of the simultaneity assumption for some of the synthetic respondents (see the ‘Testing Simultaneity’-section) could confound the results.

Considering first the possibility that Bayesian inference involves specifying priors which may inadvertently impact results by being ‘strongly informative’ relative to the data (Gelman et al. 2014: 55-56; Gelman et al. 2021: 123-127), the main model was changed to its Frequentist counterpart by omitting the priors. The Frequentist model was then estimated using restricted maximum likelihood (REML, Bates et al. 2015; Patterson and Thompson 1971), and for comparability with the Bayesian results, inferential statistics were estimated using the relatively computationally intensive non-parametric bootstrap (Efron 1979, 2003), with the number of resamples being equivalent to the number of Bayesian posterior samples (i.e., 40,005 resamples). Similar to the Frequentist counterpart-model estimated for the meta-analysis and simultaneity test, inferential statistics apply Frequentist estimators to produce bias-reduced standard errors (SE, Park et al. 2022; Park and Wang 2020, 2022) and bias-corrected and accelerated (BCa) confidence intervals (Efron 1987; Makowski et al. 2019a). The resulting estimated average effect of deservingness in this Frequentist model is .607 (SE = .009; 95% BCa[.590; .625]), indicating a 95% confident difference in welfare support of 60.74pp on average when exposed to an ‘unlucky’ welfare recipient compared to a ‘lazy’ welfare recipient, all else equal. This closely resembles the Bayesian estimate, and the exact difference is -.001 (SD = .013; 95% HDI[-.025; .025]), which indicates a negligible difference that is zero with 95% credibility. Accordingly, the impact of the priors is negligible, and they can be considered ‘noninformative’ relative to the data, showing that results are robust across the Bayesian and Frequentist approaches.
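A sketch of this Frequentist counterpart is given below, fitting the model with lme4 (Bates et al. 2015) and resampling whole respondents in a non-parametric cluster bootstrap; all names are hypothetical, only 200 resamples are shown in place of the 40,005 used here, and a simple percentile interval stands in for the BCa interval and bias-reduced standard errors.

```r
# Sketch of the Frequentist counterpart: REML fit plus a non-parametric
# cluster bootstrap that resamples respondents with replacement.
library(lme4)

fit_freq <- lmer(welfare_support ~ condition + (1 | id),
                 data = d_long, REML = TRUE)

ids <- unique(d_long$id)
boot_est <- replicate(200, {
  # resample respondent clusters (duplicates are pooled, a simplification)
  resampled <- do.call(rbind, lapply(sample(ids, replace = TRUE),
                                     function(i) d_long[d_long$id == i, ]))
  fixef(lmer(welfare_support ~ condition + (1 | id),
             data = resampled, REML = TRUE))["conditiony_unlucky"]
})
quantile(boot_est, c(.025, .975))  # percentile interval for the effect
```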

The second assessment involved changing the random effects of the Bayesian model to fixed effects (Wooldridge 2019: 463-472; Stock and Watson 2019: 367-376), with both random and fixed effects having their own advantages in different circumstances (Wooldridge 2019: 473-474; Rabe-Hesketh and Skrondal 2012: 95-97; McElreath 2019: 399-431). While fixed effects can remove all unit-invariant confounding (Wooldridge 2019: 463-472; Stock and Watson 2019: 367-376), since intra-unit comparisons are part of the identification strategy of the SCF, this property is already incorporated into the design of the SCF, and random effects were thus preferred for the main analysis due to partial pooling, which makes results more generalizable (Rabe-Hesketh and Skrondal 2012: 96; Yarkoni 2020) and less prone to overfitting (McElreath 2019: 399-431). Since a non-parametric bootstrap becomes more computationally demanding with fixed effects, inference for the Frequentist fixed-effects model solely relied on asymptotics (Stock and Watson 2019: 690-695; Wooldridge 2019: 117-180), employing 95% equal-tailed confidence intervals (Stock and Watson 2019: 117-118) and cluster-robust standard errors (Stock and Watson 2019: 376) to account for the multiple observations. Accordingly, the estimates are reported as provided by the statistical software used for this model (i.e., Bergé 2018) without applying extra bias-corrections. The results of this model are .607 (SE = .009; 95% CI[.590; .625]), and the fact that these estimates do not differ in any substantial way from the results of the main analysis indicates that the results are insensitive to the specification of random or fixed effects.
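Since the cited software (Bergé 2018) is the fixest R package, the fixed-effects specification could be sketched as follows, again with hypothetical names.

```r
# Sketch of the fixed-effects model: respondent fixed effects with
# cluster-robust standard errors (fixest; Bergé 2018).
library(fixest)

fit_fe <- feols(welfare_support ~ condition | id,  # '| id' = respondent FE
                data = d_long, cluster = ~id)
summary(fit_fe)
```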

Having argued that assigning and generating life-histories based on nationally representative characteristics could make the behavior of the synthetic respondents more nuanced and realistic, a third assessment served to analyze whether the estimates varied across the assigned characteristics, which could imply that the assignment of nationally representative characteristics mattered for the results. This involved modifying the main model to include linear interactions between the manipulation condition and each characteristic. Similar to the Frequentist fixed-effects model, for computational ease, inference for the model relied on asymptotics. Due to the relatively large number of estimated parameters in this model, consistent with the Frequentist approach, the p-values are adjusted to account for the inflated false-discovery rate (FDR, Benjamini and Yekutieli 2001) associated with the multiple comparisons (Shaffer 1995; Wright 1992) made by the model. While a run-down of every parameter is beyond the scope of this paper, it should be briefly mentioned that statistically discernible heterogeneous deservingness-effects on welfare support were identified for political ideology and preferred political party (\(p_{adj} \le .05\)), and prior to the adjustments, interactions with sex, 2 levels of education, and 3 US states (i.e., Arkansas, Alabama, and Florida) were also confidently non-zero. Given the substantial statistical power necessary to identify interactions (Gelman et al. 2021), and the model’s assumption that the interactions for the continuous variables are linear, these results regarding the presence or absence of interactions should be taken with caution. The main takeaway of this assessment is that the interactions for some of the assigned characteristics are consistent with heterogeneous effects, suggesting that the assignment of relevant characteristics from which to generate life-histories could indeed impact results and possibly make results more generalizable to the human population of interest.
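The adjustment itself is a one-liner in base R; a sketch, assuming a hypothetical vector `p_interactions` of unadjusted interaction p-values.

```r
# Sketch of the Benjamini-Yekutieli false-discovery-rate adjustment.
p_adjusted <- p.adjust(p_interactions, method = "BY")
which(p_adjusted <= .05)  # interactions still discernible after adjustment
```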

The last assessment of robustness sought to address the finding that the simultaneity assumption appeared to have been violated for a relatively small subset of synthetic respondents. This was an issue, since it allowed time to possibly confound the results, because time-invariant confounders were no longer eliminated by design. An analysis of this possibility involved omitting all synthetic respondents for whom the time difference in measuring the counterfactual outcomes was non-zero, resulting in a nominal sample size of 946. A Frequentist non-parametric bootstrapped version of the main model was fitted to the long format of this data, with inference again relying on the aforementioned bias-reduced estimators, resulting in an estimated effect of deservingness of .607 (SE = .009; 95% BCa[.588; .625]). This estimate is with 95% confidence non-zero and of similar direction and magnitude as the main results. In fact, the exact difference is .000 (SD = .013; 95% HDI[-.025; .026]), which again is negligible and zero with 95% credibility. As such, when completely eliminating the possibility of time confounding the results, there is no substantial impact on the main results, and the main results can be considered relatively robust to time confounding. With this sensitivity analysis in mind, the results can thus be considered robust. This leaves the question of whether this SCF-based approach has managed to adequately replicate the meta-analytic results derived from the human-based studies (i.e., \(H_2\)), which is addressed in the subsequent section.

5.3 Replication Analysis

Do the results replicate the meta-analysis? Comparing the SCF-based estimate derived from a synthetic sample to the meta-analytic estimate derived from human samples reveals varying magnitudes, though both are positive and discernible from zero with 95% credibility. Assessing the exact difference reveals support for \(H_2\), with there being no difference with 95% credibility, and the results can thus be considered to replicate the meta-analysis. This is made evident by first comparing the results from the SCF-based main analysis to the human-based results from the Bayesian meta-analysis. For robustness, this is followed by comparing the main results with the results from the aforementioned Frequentist meta-analysis.

To compare the Bayesian estimates from the main analysis with the meta-analytic estimate, recall that the results of the Bayesian meta-analysis of the human-based studies (i.e., Aarøe and Petersen 2014; Kane et al. 2023; Mummolo and Peterson 2019) revealed an uncertainty-weighted average estimated effect of deservingness on welfare support of .44 (SD = .12; 95% HDI[.12; .61]). Visually comparing the estimates reveals both to be positive and different from zero with 95% credibility, which can be taken as evidence that the SCF can produce estimates of similar direction and certainty as human-based studies. However, the SCF-based estimate does appear to be of a larger magnitude than the meta-analytic estimate, though it remains of a similar magnitude to the individual human-based studies by Mummolo and Peterson (2019, their MTurk study) and Kane et al. (2023) of approximately .550 (SE = .023; 95% CI[.505; .595]) and .583 (SE = .026; 95% CI[.532; .634]), respectively. The fact that the human-based estimates are also of similarly larger magnitudes than the meta-analytic estimate is consistent with the SCF-based estimates deviating from the meta-analytic mean due to the known properties of sampling variability and heterogeneous effects across studies. This can be further examined by computing the exact difference between the SCF-based and human-based results, revealing a difference (in pp) of 16.77 (SD = 11.92; 95% HDI[-1.14; 48.30]), which provides support for \(H_2\) by indicating no systematic difference between the SCF-based and human-based estimates with 95% credibility.

To further ensure the robustness of this finding, the SCF-based results are also compared to the Frequentist version of the meta-analysis, which yielded the aforementioned SE-weighted estimated average of .50 (SE = .06; 95% BCa[.38; .62]). Visually comparing these estimates to the SCF-based estimates leads to the same conclusions of similar direction and credibility of the results, with the exact difference (in pp) between the Frequentist meta-analytic estimate and the SCF-based estimate being 10.40 (SD = 6.43; 95% HDI[-1.37; 22.43]). While the SCF-based estimate again is somewhat exaggerated compared to the meta-analytic estimate, now to a lesser extent and still of similar magnitude to the results by Mummolo and Peterson (2019) and Kane et al. (2023), it can again be stated that there is no systematic difference between the estimates with 95% credibility, thus further supporting \(H_2\).
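These exact differences are straightforward to compute from the draws; a sketch, assuming hypothetical vectors `draws_scf` and `draws_meta` holding the coefficient draws from the SCF model and the meta-analysis on the normalized (0 - 1) scale.

```r
# Sketch of the replication comparison: the difference between SCF-based
# and meta-analytic draws, summarized with a 95% HDI (bayestestR;
# Makowski et al. 2019a).
library(bayestestR)

diff_draws <- (draws_scf - draws_meta) * 100  # difference in pp
hdi(diff_draws, ci = 0.95)
```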

Considering that the SCF manages to replicate the direction and credibility of human-based studies, with some exaggeration subject to sampling variability and study-based heterogeneity, this replication analysis does find relatively robust evidence that this SCF-based replication of the relationship between deservingness and welfare support among synthetic respondents succeeds in replicating the human-based results. With support for \(H_2\) by merit of the similarity in the results, this demonstrates not just the high internal validity (cf. Stock and Watson 2019: 478-481) of the SCF, corroborating the theoretical justifications outlined in its conceptualization, but also its relatively high external validity (cf. Stock and Watson 2019: 481; Gelman et al. 2021: 354-355), as indicated by its ability to replicate the robust human-based causal relationship between deservingness and welfare support. As such, based on the analysis of this replication study, it is concluded that the LLM-driven SCF, at least as implemented in the form of an SCT, is capable of generating causal estimates using an adequately-powered and representative sample of synthetic respondents that, if investigated in human populations, can be expected to replicate. At the same time, despite the relatively large sample used here (n = 1,009), no single replication can ultimately validate a theoretical framework, and it thus remains a possibility that SCF-based results may fail to generalize beyond the topic of deservingness and welfare support investigated in this study. Before discussing the implications of the formalization of the SCF, the novel concepts of synthetic units, ‘parallel worlds’-estimation, the SCT, its external validity, and this replication, a cost analysis will ascertain the financial and time costs of implementing the SCF as an SCT.

5.4 Cost Analysis

As the final part of the analysis of this demonstrative replication study and proof-of-concept of the SCF, this section ascertains the costs associated with the SCF. This serves to help researchers estimate the expected costs of relying on the SCF compared to using human samples. For this, it should be emphasized that all costs reported in this analysis are based on the application of the SCF specific to the replication study, such as the choice of the gpt-4o-mini LLM by OpenAI (2024a), the sample size, the number of characteristics from which to generate life-histories, the length and wording of the life-history prompts, the token limitations on the output by the LLM, the length and wording of the prompt containing the manipulation conditions, and the number of manipulation conditions from which to generate counterfactual outcomes. Note also that the only costs considered for this analysis involve the financial and time costs incurred from making requests to the LLM API, specifically the gpt-4o-mini LLM API by OpenAI (2024a), with usage costs being specific to the chosen LLM. Such costs are incurred because the LLM uses resources to process the prompt and generate a response output, with financial costs being tied to the length of the prompt in terms of tokens, where each token in a text-based prompt corresponds to approximately one word. At the same time, costs vary for input and output tokens. Researchers should keep in mind that while newer models may show better performance, the selected gpt-4o-mini is among the cheapest and most cost-efficient at the time of writing (cf. OpenAI 2025b), charging on average 0.375 $US per 1,000,000 input/output tokens used. Besides financial costs, time costs are also considered, which involve the time that the LLM takes to process requests and the pauses enforced between making calls to the LLM to respect the rate limits by OpenAI. With these considerations in mind, the financial and time costs incurred by relying on the SCF to generate the life-histories and the counterfactual outcomes can be estimated.
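The financial arithmetic reduces to simple token accounting; a sketch using the average rate stated above, with a hypothetical placeholder for the total token count.

```r
# Sketch of the token-based cost arithmetic at the stated average rate of
# 0.375 $US per 1,000,000 input/output tokens; the token count is a
# hypothetical placeholder.
tokens_total        <- 1.5e6                       # summed input + output tokens
cost_total          <- tokens_total / 1e6 * 0.375  # total $US
cost_per_respondent <- cost_total / 1009           # per synthetic respondent
```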

Financial Costs. In terms of financial costs, generating 1,009 life-histories and two counterfactual outcomes with the provided prompts and specified settings, the total cost is estimated to be .56 $US. This means that the expected financial cost per respondent is merely .001 (SD = .000) US dollars. Compared to studies that rely on human participants, this can be considered remarkably cheap, especially considering that human-based studies may cost thousands of dollars to achieve a sample size comparable to the 1,009 used in this replication study. At the same time, with an implementation of the SCF as demonstrated here, the resources it would otherwise cost to ensure that a human-based sample is representative could be re-allocated to where they are needed more. Similarly, while laboratory experiments and field experiments arguably do provide a high internal validity (Gerber and Green 2012), the SCF, as demonstrated here, provides a comparably high internal validity, and researchers may consider how many resources could be saved by working within the framework of the SCF.

Time Costs. Another relevant cost for researchers to consider is time. Generating the life-histories and the counterfactual outcomes for the ‘lazy’ and ‘unlucky’ welfare recipient manipulation conditions took a total of 374.76 minutes (i.e., 6.25 hours). With 1,009 synthetic respondents, the expected time cost per respondent is 22.29 (SD = .00) seconds. Compared to human studies that may take months to obtain such a sample size, this can generally be considered a substantial reduction in the time taken to obtain this many survey responses. This means that researchers can save a substantial amount of time by applying the SCF to conduct an SCT compared to having to rely on human respondents, which can be considered advantageous, since their work effort can be dedicated to doing research rather than waiting for data collection to finish.

Taken together, this cost analysis thus reveals that the SCF, through its implementation as an SCT, can be considered substantially less costly, both in terms of the financial and time resources needed to conduct a study. The demonstration has revealed that the SCF is capable of generating large and representative samples of data, where synthetic respondents can be subjected to numerous manipulation conditions to have their counterfactual outcomes generated in simultaneous and independent parallel processes, producing results that can generalize to human-based studies. The SCT can thus be expected to save considerable resources for researchers, and the financial bar needed to conduct a study with a high internal validity and external validity is substantially lowered, thus increasing the ability of researchers to make reliable causal inferences across disciplines, particularly when the theorized cause is non-manipulable and where both the theorized cause and outcome are meaningfully expressible through natural language.

6 Discussion

With this paper having introduced the conceptualization of a novel causal framework, the Synthetic Causal Framework (SCF), theoretically outlined to enable reliable causal inference when the theorized cause is non-manipulable in human populations, it solves the existing issue of researchers being unable to manipulate relevant phenomena of interest (e.g., Buunk and van Vugt 2013: 17). Solving this issue was found to be especially pertinent given that practically any result derived from non-manipulated (i.e., endogenous) phenomena could, in principle, be subject to confounding (Fisher 1935; Greenland et al. 1999; VanderWeele and Shpitser 2013). As mentioned, while the existing concept of the longitudinal design (Wooldridge 2019: 7-10, 334-335) does help address the issue of reverse causality (cf. Hume 2011), for any endogenous phenomenon, neither positive results nor negative results imply causality (cf. Kennaway 2015; Manninen 2019c; Pearson 2015[1900]) due to the respective possibilities of spuriousness and masking (Gumbel 1933; McElreath 2019: 123-153; Stock and Watson 2019: 584-586). By recognizing that the burden of proof rests upon the researcher making a causal claim (cf. Cargile 1988; Hume 2010[1748]; Russo 2019; Sagan 1979), existing methods for making causal inference proved generally unreliable (e.g., Baker et al. 2022; Bertrand et al. 2004; Ferman 2023; Kahn-Lang and Lang 2020; Ryan et al. 2015; Samii 2016; Wan et al. 2024) due to their reliance on untestable and dubious assumptions that, contrary to best practice (cf. Rohrer 2018: 34; Stuart 2010: 3; Samii 2016), were not guaranteed by design. By contrast, the SCF does not suffer these limitations due to eliminating confounding by design. To gather the insights derived from its development, a summary and discussion of its theoretical formalization and empirical validation are provided in this section.

With a definition of causality based on the Neyman-Rubin Potential Outcomes Framework (POF, Holland 1986; Imbens 2015: 374-377; Imbens and Rubin 2015; Neyman 1923; Rubin 2005) as the difference in potential outcomes between two physically and temporally distinct phenomena, with solely the temporally-precedent cause varying across the potential outcomes, the SCF is formalized using the POF. As such, it can be considered an extension of it, with the causal identification strategy of the SCF being directly derived from POF-based equations, which involve intra-unit comparisons of simultaneous and independent counterfactual outcomes. To enable this counterfactual comparison, the SCF is based on the theoretical foundation that language and behavior are closely related and mutually informative: The behavior of individuals shapes their language, and their language shapes their behavior (cf. Bandura 1999; Chung et al. 2007; Galton 1884; Holtgraves 2013; Lucy 1992; Pennebaker et al. 2003; Thibodeau and Boroditsky 2011; Whorf 2011[1956]). With Large Language Models (LLMs, Shanahan 2024) having proved increasingly capable of emulating human natural language and showing human-comparable performance in numerous disciplines (cf. Colette et al. 2024; Holderried et al. 2024; Jansen et al. 2023; Ornstein et al. 2024; Pang et al. 2024), the use of LLMs to emulate human behavior is paramount to the SCF. This also means that the SCF is formalized to enable reliable causal inference with non-manipulable phenomena in human populations, limiting its intended scope of application to topics belonging to the social and behavioral sciences (cf. Wright 2015). Accordingly, through its reliance on LLMs to achieve this goal, a key assumption of the SCF is that the theorized cause and outcome must be meaningfully expressible through natural language, and for the LLM to be able to emulate behavior associated with the theorized cause and outcome, it must have been trained on data containing expressions of these phenomena. While these restrictions can limit the usefulness of the SCF, they nonetheless reflect relevant considerations about the areas of application for which the SCF can be considered meaningful. Similarly, explicitly specifying the intended scope and expected limitations serves to make researchers aware that the SCF may not be equally valid nor as reliable across all topics of interest, and that applications outside of its intended area of application warrant further consideration, justification, and validation.

The reliance of the SCF on LLMs in emulating human behavior to enable the manipulation of non-manipulable phenomena specifically involves the generation of simulated units. These units are defined here to possess human characteristics, whose inclusion serves as a basis for guiding the behavior of the LLM controlling each unit in a more nuanced and human-like direction. It is here important that an adequate number of relevant characteristics are simulated so that the resulting behavior becomes as realistic as possible. At the same time, since applications of the SCF will likely involve simulated samples of human units drawn from a particular population of interest, to enable better inference, it is important that the characteristics assigned to these units are simulated in proportion to their prevalence in that particular population. To further make the behavior of the LLM-controlled simulated units more realistic and generalizable, life-histories (cf. Schütze 2008) are generated by the LLM based on the assigned characteristics, enabling a more fleshed-out and realistic characterization of each unit. This step is taken to infuse the simulated units with human nuance not captured by the assigned characteristics, resulting in these units being conceptually distinguished as synthetic units. With the intended use of the SCF being the inference of causality to human populations, the use of simulated data may be seen as invalid or meaningless because human units are not directly used in an SCF-based analysis. However, this criticism would fail to account for the indirect way that human text-based behavioral data are used to train the LLMs, not to mention the existing use of simulated and synthetic data for research into real-world phenomena (e.g., Abowd et al. 2004; Emam et al. 2020; Kinney et al. 2011; Macy and Willer 2002; Nikolenko 2021; van Lange et al. 2013: 31; Zivenko et al. 2025). Similarly, from the recognized disciplines of functionalist and behaviorist psychology (Farrell 2014: 88, 111, 362-403), the study of human behavior does not necessitate that the studied unit is human, with the key requirement instead being the respective functional or behavioral equivalence between humans and the studied unit. As long as researchers can plausibly justify why the function or behavior is expectedly similar, if not equivalent, between synthetic units and human units, SCF-based research can be considered valid and meaningfully inferable to human populations.

Once synthetic units have been generated, each of these can be manipulated with respect to the theorized cause, enabling reliable causal inference through the elimination of all unit- and time-invariant confounders. Similar to a random assignment mechanism (RAM) employed in Randomized Controlled Trials (RCTs, cf. Angrist and Pischke 2015: 1-46; Angrist and Pischke 2009: 15-22; Stock and Watson 2019: 48-49), the SCF employs a simultaneous and independent exogenous assignment mechanism (SIXAM), which simultaneously and independently assigns each synthetic unit to every manipulation condition through a parallel process, jointly referred to as ‘parallel worlds’-estimation. The simultaneous and independent nature of this assignment helps eliminate most concerns for pre-treatment and post-treatment biases (cf. Linos and Twist 2018; Montgomery et al. 2018). Once a counterfactual outcome for each manipulation condition has been generated, intra-unit comparisons thus provide an unbiased and efficient strategy for causal identification (cf. Westreich et al. 2015). While this procedure essentially enables the researcher to multiply the initial sample of synthetic units by a factor equivalent to the number of manipulation conditions, this does result in a sample with autocorrelated observations (cf. Gelman et al. 2021: 459; Stock and Watson 2019: 375), since the multiple counterfactual observations are clustered around each synthetic unit. This, however, is not a critical concern, since the multiple observations of each unit are easily addressed through the use of multilevel statistical models (cf. Gelman and Hill 2007) or, alternatively, through the use of cluster-robust standard errors (cf. Stock and Watson 2019: 376). One might criticize that the ability to eliminate unit-invariant and time-invariant confounders with these intra-unit comparisons of counterfactual outcomes may be invalid and/or unreliable due to relying on dubious assumptions. For example, the ability to measure counterfactual outcomes for the same synthetic respondent assumes (1) that the LLM treats the synthetic unit as the exact same unit across the multiple manipulation conditions (e.g., prior to each manipulation condition, unit i in manipulation condition k is exactly the same as unit i in manipulation condition k’), and (2) that the responses are generated simultaneously (i.e., the simultaneity assumption). As a response to these criticisms, the former assumption can be justified by noting that the units across the manipulation conditions are the same conditional on their life-histories; should the LLM infer different characteristics not specified in the life-histories conditional on the manipulation condition, this is not innately different from how varying manipulation conditions, such as framing or priming vignettes (cf. Bargh and Chartrand 2000; Tversky and Kahneman 1981), could affect the self-perceptions of human participants, with certain memories, characteristics, thoughts, and behaviors varying in saliency (see, e.g., Fausey and Boroditsky 2010; Thibodeau and Boroditsky 2011). Regarding the latter assumption, simultaneity can be promoted by implementing the outlined ‘parallel worlds’-estimation procedure and ensuring that the manipulation conditions, if implemented as text-based prompts, are of the same character length and word length.
Similarly, as demonstrated in the proof-of-concept, the simultaneity assumption can be tested, and the analysis can account for possible violations. Accordingly, the assumptions made by the SCF can be justified, and they do not invoke premises that innately contradict existing research.
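As an illustration of how such a procedure could be implemented, the sketch below combines a hedged version of ‘parallel worlds’-estimation with the multilevel analysis of the resulting clustered observations. Here, query_llm() is a hypothetical wrapper around an LLM API call that returns a numeric outcome, life_history is assumed to be a pre-generated list of life-history strings (one per unit in units), and the vignette texts are equal-length placeholders; none of these names reflect the study’s actual code.

```r
# Hedged sketch of 'parallel worlds'-estimation. query_llm() is a
# hypothetical wrapper around an LLM API call returning a numeric outcome;
# life_history is assumed to be a pre-generated list of life-history strings.
library(parallel)

conditions <- c(lazy = "<lazy vignette...>", unlucky = "<unlucky vignette>")

# Promote simultaneity: text-based manipulation conditions should have the
# same character length (the placeholders above are padded to 18 characters)
stopifnot(length(unique(nchar(conditions))) == 1)

# Every unit is assigned to every condition; all unit-by-condition requests
# are dispatched as parallel tasks so that counterfactual outcomes are
# generated (approximately) simultaneously.
grid <- expand.grid(unit_id = units$id, condition = names(conditions),
                    stringsAsFactors = FALSE)
grid$support <- unlist(mclapply(seq_len(nrow(grid)), function(r) {
  query_llm(prompt = paste(life_history[[grid$unit_id[r]]],
                           conditions[[grid$condition[r]]]))
}, mc.cores = 8))

# Intra-unit comparisons: unit-level intercepts absorb the clustering of the
# multiple counterfactual observations within each synthetic unit.
library(brms)
fit <- brm(support ~ condition + (1 | unit_id), data = grid)
summary(fit)
```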

While the formalization of the SCF as reiterated above arguably provides a strong justification for the high internal validity of the framework (cf. Stock and Watson 2019: 478-481), there are justified concerns, especially regarding its external validity (cf. Stock and Watson 2019: 481; Gelman et al. 2021: 354-355). To address such concerns and provide a demonstrative proof-of-concept, the SCF was implemented with the purpose of comparing its results to human-based studies. Specifically, the implementation sought to replicate the robustly established causal relationship between deservingness and welfare support (cf. Aarøe and Petersen 2014; Kane et al. 2023; Mummolo and Peterson 2019). This relationship was theoretically grounded in evolutionary psychology (cf. Buss 2019), with the causal mechanism involving a distinct cognitive module (cf. Kurzban and Aktipis 2007) dubbed the deservingness heuristic (cf. Petersen 2012), an automatic and compulsory mechanism that evaluates the deservingness of individuals with respect to (welfare) support through information shortcuts, such as stereotypes (Aarøe and Petersen 2014; Petersen et al. 2011, 2012). A meta-analysis was used to aggregate results from existing studies (cf. Sharpe and Poets 2020), leading to the formalization of the hypotheses that deservingness increases welfare support and, based on the theoretical foundation of the SCF, that this relationship does not differ between human and SCF-based studies. Conceptually replicating the ‘original study’ on deservingness and welfare support (i.e., Aarøe and Petersen 2014), a random sample of 1,009 simulated humans from the adult population of the United States (US) was drawn. Each simulated unit was assigned the relevant characteristics of US state of residence, area of residence, sex, age, number of children, ‘ethnoracial’ group, marital status, level of education, employment status, annual household gross income in 2022 US dollars, preferred political party, and political ideology, with the characteristics being jointly simulated using population-based estimates derived from the IPUMS (Ruggles et al. 2024) and ANES (2021), resulting in a sample representative on all the relevant characteristics. Based on these characteristics, life-histories were then generated using the cost-efficient and highly capable gpt-4o-mini LLM (OpenAI 2024a), with requests being made through its API for reasons of efficiency and privacy. These life-histories were validated to consistently reflect the assigned characteristics, though not every characteristic was necessarily explicitly mentioned. To serve as the manipulation conditions in a Synthetically Controlled Trial (SCT), the SCF’s version of an RCT, vignettes detailing a ‘lazy’ and an ‘unlucky’ welfare recipient were adapted to have the same word and character length without changing their substantive meaning, which served to ensure that the counterfactual outcomes could be generated simultaneously. Using an implementation of the ‘parallel worlds’-estimation procedure, the life-histories and vignettes were supplied as prompts to the gpt-4o-mini LLM, and a counterfactual outcome of welfare support was generated for each synthetic respondent in response to each manipulation condition.
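To give a sense of what the life-history generation step could look like, below is a minimal sketch in R using the httr and jsonlite packages to call the OpenAI chat completions endpoint. The prompt wording and the generate_life_history() helper are illustrative assumptions, not the prompts or code used in the study.

```r
# Hedged sketch of generating one life-history per synthetic unit via the
# OpenAI chat completions API. The prompt wording is a placeholder.
library(httr)
library(jsonlite)

generate_life_history <- function(unit) {
  prompt <- paste0(
    "Write a brief first-person life-history for a person with these ",
    "characteristics: ",
    paste(names(unit), unit, sep = ": ", collapse = "; ")
  )
  resp <- POST(
    url = "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", Sys.getenv("OPENAI_API_KEY"))),
    content_type_json(),
    body = toJSON(list(
      model = "gpt-4o-mini",
      messages = list(list(role = "user", content = prompt))
    ), auto_unbox = TRUE)
  )
  # Extract the generated text from the parsed JSON response
  content(resp)$choices[[1]]$message$content
}

life_history <- lapply(seq_len(nrow(units)), function(i)
  generate_life_history(units[i, setdiff(names(units), "id")]))
```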
Assessments then unsurprisingly revealed perfect balance on the assigned characteristics across manipulation conditions, though a test of the simultaneity of the ‘parallel worlds’-estimation procedure revealed a non-zero time difference for a small subset of synthetic respondents, which could allow time to confound the results. A multilevel Bayesian model was specified for the analysis (cf. Bürkner 2017, 2018), which confirmed the hypothesis that deservingness increases welfare support among synthetic respondents. This was reinforced by a sensitivity analysis (cf. Athey and Imbens 2017: 19-21), and the partially non-zero time differences were also found not to substantively impact the results. A replication analysis then found that the results did not credibly differ from the human-based meta-analytic results, indicating that the results were externally valid and generalizable to a human population. A cost analysis further revealed that, compared to human studies, the SCF was substantially cheaper and more time-efficient.
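The time-based confounding check described above could, under assumed variable names, look as follows; timestamp is a hypothetical per-response request time recorded during ‘parallel worlds’-estimation, and the covariate adjustment shown is one simple way, not necessarily the paper’s, of probing whether residual non-simultaneity alters the estimated effect.

```r
# Hedged sketch of probing the simultaneity assumption. 'timestamp' is a
# hypothetical per-response request time recorded for each observation.
grid$time_diff <- ave(as.numeric(grid$timestamp), grid$unit_id,
                      FUN = function(x) x - min(x))

# If time differences are non-zero for some units, include them as a
# covariate to check whether the condition effect substantively changes.
fit_time <- brm(support ~ condition + time_diff + (1 | unit_id), data = grid)

# Frequentist robustness check (cf. Bates et al. 2015)
library(lme4)
fit_freq <- lmer(support ~ condition + time_diff + (1 | unit_id), data = grid)
```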

While this conceptual replication study provides evidence for both the internal and external validity of the SCF, corroborating its central aim of enabling causal effects to be identified from synthetic samples and generalized to human populations, no single study can sufficiently validate a framework. Despite the strong theoretical foundation for its internal validity, in part due to the SCT mimicking the strengths of the RCT, it is impossible to directly validate the ability of the SCF to derive the ‘correct’ causal estimate for a non-manipulable phenomenon and infer it to humans, because such a phenomenon is by definition not manipulable in human populations. This means that justifying a reliance on the SCF will, in part, have to rest on continued validation against results from theorized causes that are manipulable in human populations. Another concern is that for the SCF to fulfill its intended goal of enabling researchers to draw reliable causal inference for non-manipulable phenomena, it needs to be relatively easy to implement. This study has therefore not only described the theoretical foundation and rationale of the SCF, explaining its scope, area of application, assumptions, and limitations, but also provided a proof-of-concept and validation of the SCF, including making the data and code to replicate its entire implementation freely available. By doing this, the hope is that researchers can relatively easily implement the SCF for their particular research topic. However, since implementing the SCF requires modifying the provided code to suit a particular research context, researchers may nonetheless find it difficult to implement. It is the author’s hope that future work will address these practical limitations by making the SCF more easily implementable. At the same time, future studies could expand the implementation of the SCF to designs besides the SCT, for example an SCF-based version of a field experiment (cf. Gerber and Green 2012), which, by simulating a more realistic study environment, may provide more ecologically valid results than the SCT (cf. Schmuckler 2001). Finally, following the mantra of garbage-in-garbage-out (GIGO, Yoder-Wise 2023), researchers implementing the SCF need to be aware that a central premise for its reliability is that the quality of the results is contingent on the quality of the input, and that not all LLMs may be equally capable of emulating human behavior (cf. Edyko et al. 2023; Koga and Du 2024; Lecler et al. 2023; Obaid et al. 2023; Saka et al. 2024; Shea and Ma 2023). For these reasons, it is important that future research also targets the continued development of LLMs, including validations of their use within the SCF, though given recent advances, the ability of LLMs to emulate human behavior will likely only increase with time.

7 Conclusion

This paper sought to answer the research question of whether an LLM-based causal framework could be developed to provide reliable causal inference for instances where the theorized cause is otherwise non-manipulable. This endeavor was motivated by the substantial issues that researchers face when the theorized cause is non-manipulable (e.g., Buunk and van Vugt 2013: 17), which prevent them from using the scientific ‘golden standard’ of Randomized Controlled Trials (RCTs, Meldrum 2000) and effectively make causal inference unreliable due to the possibility of confounding (Fisher 1970; McElreath 2019). While alternative methods have been developed to help mitigate these issues, a review found them to rely on dubious and untestable assumptions, leaving them less reliable than RCTs (Baker et al. 2022; Bertrand et al. 2004; cf. Rohrer 2024; Ryan et al. 2015; Samii 2016; Wan et al. 2024). To mitigate these shortcomings within the social and behavioral sciences, this paper drew on theory from linguistics, psychology, and natural language processing to develop the novel Synthetic Causal Framework (SCF). The SCF specifically exploits Large Language Models (LLMs, Shanahan 2024) to emulate human behavior in the form of synthetic units, produced through a synthesis of simulated characteristics randomly sampled in proportion to a population of interest. These synthetic units can be manipulated where humans cannot, thus making an otherwise non-manipulable phenomenon manipulable. Recognizing the limitations of LLMs, an application of the SCF within its intended area of application requires the theorized cause and outcome to be meaningfully expressible through natural language. The concept of ‘parallel worlds’-estimation was further introduced to exploit the fact that synthetic units can be independently, simultaneously, and exogenously assigned to an arbitrary number of manipulation conditions, enabling researchers to measure a counterfactual outcome for each type of manipulation. With every synthetic unit being in all manipulation conditions, the SCF ensures perfect balance between the manipulation groups. Formalized using the Neyman-Rubin Potential Outcomes Framework (Imbens and Rubin 2015; Neyman 1923; Rubin 2005), the causal identification strategy enabled by the SCF specifically involves intra-unit comparisons of independently and simultaneously generated counterfactual outcomes, eliminating all unit- and time-invariant confounding by design.

A proof-of-concept and validation of the generalizability of SCF-based results to human populations followed in the form of a Synthetically Controlled Trial (SCT) that sought to replicate the established relationship between deservingness and welfare support (cf. Aarøe and Petersen 2014; Kane et al. 2023; Mummolo and Peterson 2019). This demonstration showed how to simulate respondent characteristics jointly proportional to the population of the United States using estimates derived from the IPUMS (Ruggles et al. 2024) and ANES (ANES 2021), making the results more representative of that population. It was then shown how to create prompts to generate life-histories (cf. Bertaux 1981) for the simulated respondents, adding extra nuance to the simulated characteristics. This served to make the selected LLM-model, OpenAI’s (2024a) gpt-4o-mini, more capable of realistically imitating the behavior of a person with those traits. An implementation of ‘parallel worlds’-estimation was provided, which utilized parallel processing to generate and measure two counterfactual levels of welfare support from exposure to a ‘lazy’ and an ‘unlucky’ welfare recipient vignette. The replication then demonstrated perfect group balance in terms of the simulated characteristics and tested the simultaneity assumption, finding a partial violation among a small subset of the synthetic respondents, though a subsequent sensitivity analysis (cf. Athey and Imbens 2017: 19-21) revealed that time nonetheless did not confound the results. A multilevel Bayesian statistical model was specified (cf. Bürkner 2017, 2018), with a multilevel Frequentist model for robustness (cf. Bates et al. 2015), producing results that closely replicated the human-based finding of an effect of deservingness on welfare support. A cost analysis found that the implementation of the SCF was substantially cheaper compared to studies on human samples. As such, this paper successfully developed the SCF, a novel causal framework that enables researchers to manipulate an otherwise non-manipulable phenomenon, thus making the causal inference derived from this procedure reliable. While the SCF currently shows promising generalizability to human populations, further validations are needed. Lastly, it should be noted that the SCF does not constitute a substitute for RCTs when the theorized cause is manipulable in human populations, and caution is advised when applying the SCF outside of its intended area of application. Development of the SCF is planned in future studies, which are expected to further increase and demonstrate its scientific utility.

8 Declarations

Author Contributions

  • Emil Meyer-Hansen independently conceived, conceptualized, and formalized the Synthetic Causal Framework (SCF), Synthetically Controlled Trials (SCTs), and ‘Parallel Worlds’-estimation; implemented the demonstration; simulated and synthesized the data; conducted the analyses; and wrote this manuscript.

Funding

  • Emil Meyer-Hansen received no funding to assist with this project.

Conflicts of Interest

  • Emil Meyer-Hansen has no competing interests to declare that are relevant to the content of this project.

Availability of Data & Materials

  • Data and materials are not made available in this preview version. They will be made available in the preprint version of this study.

Code availability

  • The computer code for the simulation is not made available in this preview version. It will be made available in the preprint version of this study.

Acknowledgements

  • This project was made possible by building on insights and contributions from numerous researchers across disciplines. While the author is indebted to innumerable researchers, a special gratitude, in no particular order, is extended to Jerzy Neyman, Ronald A. Fisher, Donald B. Rubin, Guido W. Imbens, Joshua Angrist, Jörn-Steffen Pischke, Judea Pearl, Susan Athey, Alberto Abadie, Alexis Diamond, Jens Hainmueller, Peter M. Aronow, Cyrus Samii, Francis Galton, Karl Pearson, Egon Pearson, Bradley Efron, David R. Cox, Andrew Gelman, Aki Vehtari, Gary King, Jennifer Hill, the R Core Team, the Posit Team, Hadley Wickham, Thomas Bayes, Pierre-Simon Laplace, Carl Friedrich Gauss, Iegor Rudnytskyi, Steve Weston, Paul-Christian Bürkner, Dominique Makowski, Mattan S. Ben-Shachar, Daniel Lüdecke, Benjamin Goodrich, Jonah Gabry, Jacob Cohen, Gerd Gigerenzer, Paul Meehl, Uri Simonsohn, Leif D. Nelson, Joseph Simmons, Richard McElreath, Tal Yarkoni, and Elizabeth A. Stuart. For the development of the GPT models, which were vital to the implementation of the Synthetic Causal Framework in this project, a special gratitude is extended to OpenAI and the numerous researchers contributing to the development of natural language processing.

9 References

Aarøe, L., and Petersen, M. B. (2014), “Crowding out culture: Scandinavians and Americans agree on social welfare in the face of deservingness cues,” The Journal of Politics, 76, 684–697. https://doi.org/10.1017/S002238161400019X.
Aarøe, L., Petersen, M. B., and Arceneaux, K. (2017), “The behavioral immune system shapes political intuitions: Why and how individual differences in disgust sensitivity underlie opposition to immigration,” American Political Science Review, 111, 277–294. https://doi.org/10.1017/S0003055416000770.
Abadie, A., Diamond, A., and Hainmueller, J. (2010), “Synthetic control methods for comparative case studies: Estimating the effect of California’s tobacco control program,” Journal of the American Statistical Association, 105, 493–505. https://doi.org/10.1198/jasa.2009.ap08746.
Abadie, A., Diamond, A., and Hainmueller, J. (2011), “Synth: An R package for synthetic control methods in comparative case studies,” Journal of Statistical Software, 42. https://doi.org/10.18637/jss.v042.i13.
Abadie, A., Diamond, A., and Hainmueller, J. (2015), “Comparative politics and the synthetic control method,” American Journal of Political Science, 59, 495–510. https://doi.org/10.1111/ajps.12116.
Abadie, A., and L’Hour, J. (2021), “A penalized synthetic control estimator for disaggregated data,” Journal of the American Statistical Association, 116, 1817–1834. https://doi.org/10.1080/01621459.2021.1971535.
Abowd, J. M., Lane, J., Domingo-Ferrer, J., and Torra, V. (2004), “New approaches to confidentiality protection: Synthetic data, remote access and research data centers,” Lecture Notes in Computer Science, 3050, 282–289. https://doi.org/10.1007/978-3-540-25955-8_22.
Agarwal, A., Agarwal, A., and Vijaykumar, S. (2023), “Synthetic combinations: A causal inference framework for combinatorial interventions,” arXiv, 2303.14226v2. https://doi.org/10.48550/arXiv.2303.14226.
Agnihotri, D. S., Bhise, A. R., Mahishale, A. V., Saini, V. K., Amersheda, S. B., Mange, Y. N., Gupta, S., Bhojani, U., and Biswas, G. (2024), “Inter-rater and intra-rater reliability of het’s MMT in pelvic floor assessment,” Journal of the Scientific Society (Belgaum), 51, 571–575. https://doi.org/10.4103/jss.jss_103_24.
Ajzen, I., and Fishbein, M. (1977), “Attitude-behavior relations: A theoretical analysis and review of empirical research,” Psychological Bulletin, 84, 888–918. https://doi.org/10.1037/0033-2909.84.5.888.
Alea, N. (2008), “Does the life story interview make us make sense? Spontaneous and cued redemption and contamination in life story scenes,” Imagination, Cognition and Personality, 37, 271–292. https://doi.org/10.1177/0276236617733837.
Alister, M., Vickers-Jones, R., Sewell, D. K., and Ballard, T. (2021), “How do we choose our giants? Perceptions of replicability in psychological science,” Advances in Methods and Practices in Psychological Science, 4, 251524592110181. https://doi.org/10.1177/25152459211018199.
Allaire, J. J., Xie, Y., Dervieux, C., McPherson, J., Luraschi, J., Ushey, K., Atkins, A., Wickham, H., Cheng, J., Chang, W., and Iannone, R. (2024), rmarkdown: Dynamic documents for R.
Allers, M. A., and Hoeben, C. (2010), “Effects of unit-based garbage pricing: A differences-in-differences approach,” Environmental & Resource Economics, 45, 405–428. https://doi.org/10.1007/s10640-009-9320-6.
Allport, G. W. (1954), The nature of prejudice, Addison-Wesley.
Alrababa’h, A., and Blaydes, L. (2021), “Authoritarian media and diversionary threats: Lessons from 30 years of Syrian state discourse,” Political Science Research and Methods, 9, 693–708. https://doi.org/10.1017/psrm.2020.28.
Al-Shawaf, L., Conroy-Beam, D., Asao, K., and Buss, D. M. (2015), “Human emotions: An evolutionary psychological perspective,” Emotion Review, 8, 173–186. https://doi.org/10.1177/1754073914565518.
Altman, D. G., and Bland, J. M. (1994), “Diagnostic tests 2: Predictive values,” BMJ, 309, 102. https://doi.org/10.1136/bmj.309.6947.102.
American Statistical Association (2016), “American Statistical Association releases statement on statistical significance and p-values: Provides principles to improve the conduct and interpretation of quantitative science,” American Statistical Association; ASA News.
Anderson, J. R. (2020), Cognitive psychology and its implications, Worth Publishers.
ANES (2021), “ANES 2020 time series study full release [dataset and documentation],” American National Election Studies; ANES.
Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996), “Identification of causal effects using instrumental variables,” Journal of the American Statistical Association, 91, 444–455. https://doi.org/10.1080/01621459.1996.10476902.
Angrist, J. D., and Krueger, A. B. (1991), “Does compulsory school attendance affect schooling and earnings?” The Quarterly Journal of Economics, 106, 979–1014. https://doi.org/10.2307/2937954.
Angrist, J. D., and Pischke, J.-S. (2015), Mastering ’metrics: The path from cause to effect, Princeton: Princeton University Press.
Angrist, J., and Pischke, J.-S. (2009), Mostly harmless econometrics, Princeton: Princeton University Press.
Appelbaum, L. D. (2002), “Who deserves help? Students’ opinions about the deservingness of different groups living in Germany to receive aid,” Social Justice Research, 15, 201–225. https://doi.org/10.1023/A:1021025528408.
Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W., and Wager, S. (2021), “Synthetic difference-in-differences,” The American Economic Review, 111, 4088–4118. https://doi.org/10.1257/aer.20190159.
Aronow, P. M., and Samii, C. (2016), “Does regression produce representative estimates of causal effects?” American Journal of Political Science, 60, 250–267. https://doi.org/10.1111/ajps.12185.
Aronow, P. M., and Samii, C. (2017), “Estimating average causal effects under general interference, with application to a social network experiment,” The Annals Of Applied Statistics, 11, 1912–1947. https://doi.org/10.1214/16-AOAS1005.
Aronson, E., Wilson, T. D., Akert, R. M., and Sommers, S. R. (2018), Social psychology, Pearson Education, Ltd.
Arp, R., Barbone, S., and Bruce, M. (2019), Bad arguments: 100 of the most important fallacies in Western philosophy, John Wiley & Sons, Ltd.
Athey, S., and Imbens, G. W. (2017), “The state of applied econometrics: Causality and policy evaluation,” Journal of Economic Perspectives, 31, 3–32. https://doi.org/10.1257/jep.31.2.3.
Back, M. D., Schmukle, S. C., and Egloff, B. (2009), “Predicting actual behavior from the explicit and implicit self-concept of personality,” Journal of Personality and Social Psychology, 97, 533–548. https://doi.org/10.1037/a0016229.
Bail, C. A. (2024), “Can generative AI improve social science?” Proceedings of the National Academy of Sciences, 121, e2314021121. https://doi.org/10.1073/pnas.2314021121.
Baker, A. C., Larcker, D. F., and Wang, C. C. Y. (2022), “How much should we trust staggered difference-in-differences estimates?” Journal of Financial Economics, 144, 370–395. https://doi.org/10.1016/j.jfineco.2022.01.004.
Bakker, B. N., and Lelkes, Y. (2018), “Selling ourselves short? How abbreviated measures of personality change the way we think about personality and politics,” The Journal of Politics, 80, 1311–1325. https://doi.org/10.1086/698928.
Bakker, B. N., Schumacher, G., Gothreau, C. G., and Arceneaux, K. (2020), “Conservatives and liberals have similar physiological responses to threats,” Nature Human Behaviour, 4, 613–621. https://doi.org/10.1038/s41562-020-0823-z.
Bandalos, D. L. (2018), Measurement theory and applications for the social sciences, Guilford Press.
Bandura, A. (1999), “Moral disengagement in the perpetration of inhumanities,” Personality and Social Psychology Review, 3, 193–209. https://doi.org/10.1207/s15327957pspr0303_3.
Barber, J. A., and Thompson, S. G. (2000), “Analysis of cost data in randomized trials: An application of the non-parametric bootstrap,” Statistics in Medicine, 19, 3219–3236. https://doi.org/10.1002/1097-0258(20001215)19:23<3219::AID-SIM623>3.0.CO;2-P.
Barberá, P., and Rivero, G. (2014), “Understanding the political representativeness of Twitter users,” Social Science Computer Review, 33, 712–729. https://doi.org/10.1177/0894439314558836.
Bargh, J. A., and Chartrand, T. L. (2000), “The mind in the middle: A practical guide to priming and automaticity research,” in Handbook of research methods in social psychology, eds. H. T. Reis and C. M. Judd, Cambridge University Press, pp. 1–39. https://doi.org/10.1017/CBO9780511996481.017.
Barrett, M. (2024), ggdag: Analyze and create elegant directed acyclic graphs.
Bates, D., Mächler, M., Bolker, B., and Walker, S. (2015), “Fitting linear mixed-effects models using lme4,” Journal of Statistical Software, 67, 1–48. https://doi.org/10.18637/jss.v067.i01.
Batinic, B., Reips, U.-D., and Bosnjak, M. (2002), Online social sciences, Hogrefe & Huber.
Baumeister, R. F., Vohs, K. D., DeWall, C. N., and Zhang, L. (2007), “How emotion shapes behavior: Feedback, anticipation, and reflection, rather than direct causation,” Personality and Social Psychology Review, 11, 167–203. https://doi.org/10.1177/1088868307301033.
Baumrind, D. (1983), “Specious causal attributions in the social sciences: The reformulated stepping-stone theory of heroin use as exemplar,” Journal of Personality and Social Psychology, 45, 1289–1298. https://doi.org/10.1037/0022-3514.45.6.1289.
Bayes, T., and Price, R. (1763), “An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, communicated by Mr. Price, in a letter to John Canton, A.M.F.R.S.,” Philosophical Transactions of the Royal Society of London, 53, 370–417. https://doi.org/10.1098/rstl.1763.0053.
Bebbington, D. (2011), “Argument from personal incredulity,” Think: Philosophy for Everyone, 10, 27–28. https://doi.org/10.1017/S1477175611000030.
Bechtel, M. M., and Hainmueller, J. (2011), “How lasting is voter gratitude? An analysis of the short-and long-term electoral returns to beneficial policy,” American Journal of Political Science, 55, 852–868. https://doi.org/10.1111/j.1540-5907.2011.00533.x.
Bellana, B., Mahabal, A., and Honey, C. J. (2022), “Narrative thinking lingers in spontaneous thought,” Nature Communications, 13, 4585. https://doi.org/10.1038/s41467-022-32113-6.
Benjamini, Y., and Yekutieli, D. (2001), “The control of the false discovery rate in multiple testing under dependency,” Annals of Statistics, 29, 1165–1188. https://doi.org/10.1214/aos/1013699998.
Ben-Shachar, M. S., Lüdecke, D., and Makowski, D. (2020), “effectsize: Estimation of effect size indices and standardized parameters,” Journal of Open Source Software, 5, 2815. https://doi.org/10.21105/joss.02815.
Bergé, L. (2018), “Efficient estimation of maximum likelihood models with multiple fixed-effects: The R package FENmlm,” CREA Discussion Papers.
Berk, R. A. (1994), Regression analysis: A constructive critique, Sage.
Bernoulli, J. (1713[2006]), The art of conjecturing, together with letter to a friend on sets in court tennis, Johns Hopkins University Press.
Bertaux, D. (1981), Biography and society: The life history approach in the social sciences, SAGE Publications.
Bertrand, M., Duflo, E., and Mullainathan, S. (2004), “How much should we trust differences-in-differences estimates?” The Quarterly Journal of Economics, 119, 249–275. https://doi.org/10.1162/003355304772839588.
Bettinsoli, M. L., Maass, A., Kashima, Y., and Suitner, C. (2015), “Word-order and causal inference: The temporal attribution bias,” Journal of Experimental Social Psychology, 60, 144–149. https://doi.org/10.1016/j.jesp.2015.05.011.
Bhrolcháin, M. N., and Dyson, T. (2007), “On causation in demography: Issues and illustrations,” Population and Development Review, 33, 1–36. https://doi.org/10.1111/j.1728-4457.2007.00157.x.
Bilgel, F., and Galle, B. (2015), “Financial incentives for kidney donation: A comparative case study using synthetic controls,” Journal of Health Economics, 43, 103–117. https://doi.org/10.1016/j.jhealeco.2015.06.007.
Birdsall, C. (2011), “The synthetic control method for comparative case studies: An application estimating the effect of managerial discretion under performance management,” International Public Management Journal, 20, 49–77. https://doi.org/10.1080/10967494.2015.1121178.
Bishop, D. V. M. (2020), “The psychology of experimental psychologists: Overcoming cognitive constraints to improve research: The 47th Sir Frederic Bartlett Lecture,” Quarterly Journal of Experimental Psychology, 73, 1–19. https://doi.org/10.1177/1747021819886519.
Block, J., and Block, J. H. (2006), “Nursery school personality and political orientations two decades later,” Journal of Research in Personality, 40, 734–749. https://doi.org/10.1016/j.jrp.2005.09.005.
Bodenhausen, G. V., and Lichtenstein, M. (1987), “Social stereotypes and information-processing strategies: The impact of task complexity,” Journal of Personality and Social Psychology, 52, 871–880. https://doi.org/10.1037/0022-3514.52.5.871.
Bolker, B., and Robinson, D. (2024), broom.mixed: Tidying methods for mixed models.
Bollen, K. A. (1989), Structural equations with latent variables, Wiley.
Bollen, K. A., and Pearl, J. (2013), “Eight myths about causality and structural equation models,” in Handbook of causal analysis for social research, ed. S. L. Morgan, Springer Netherlands, pp. 301–328. https://doi.org/10.1007/978-94-007-6094-3_15.
Boroditsky, L. (2001), “Does language shape thought?: Mandarin and English speakers’ conceptions of time,” Cognitive Psychology, 43, 1–22. https://doi.org/10.1006/cogp.2001.0748.
Bowley, A. L. (1910), An elementary manual of statistics, Legare Street Press.
Boyd, S. P., and Vandenberghe, L. (2018), Introduction to applied linear algebra: Vectors, matrices, and least squares, Cambridge University Press.
Boyer, P., and Petersen, M. B. (2018), “Folk-economic beliefs: An evolutionary cognitive model,” The Behavioral and Brain Sciences, 41, e158. https://doi.org/10.1017/S0140525X17001960.
Brandstätter, E. (1999), “Confidence intervals as an alternative to significance testing,” Methods of Psychological Research, 4, 33–46.
Bravais, A. (1844), Analyse mathematique sur les probabilités des erreurs de situation d’un point, Impr. royale.
Breiman, L. (2001), “Random forests,” Machine Learning, 45, 5–32. https://doi.org/10.1023/A:1010933404324.
Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. (2011), Handbook of Markov chain Monte Carlo, Chapman; Hall/CRC. https://doi.org/10.1201/b10905.
Bullock, J. G., Green, D. P., and Ha, S. E. (2010), “Yes, but what’s the mechanism? (Don’t expect an easy answer),” Journal of Personality and Social Psychology, 98, 550–558. https://doi.org/10.1037/a0018933.
Burgess, J. P. (2022[1948]), Set theory, Cambridge University Press.
Burkett, D. (2019), “Oversimplification,” in Bad arguments: 100 of the most important fallacies in Western philosophy, eds. R. Arp, S. Barbone, and M. Bruce, Chichester, UK: John Wiley & Sons, Ltd, pp. 286–288.
Bürkner, P.-C. (2017), “brms: An R package for Bayesian multilevel models using Stan,” Journal of Statistical Software, 80, 1–28. https://doi.org/10.18637/jss.v080.i01.
Bürkner, P.-C. (2018), “Advanced Bayesian multilevel modeling with the R package brms,” The R Journal, 10, 395–411. https://doi.org/10.32614/RJ-2018-017.
Buss, D. M. (2009), “How can evolutionary psychology successfully explain personality and individual differences?” Perspectives on Psychological Science, 4, 359–366. https://doi.org/10.1111/j.1745-6924.2009.01138.x.
Buss, D. M. (2019), Evolutionary psychology: The new science of the mind, Routledge.
Buunk, A. P., and van Vugt, M. (2013), Applying social psychology. From problems to solutions, Sage.
Cantor, G. (1874), “Über eine eigenschaft des inbegriffs aller reellen algebraischen zahlen,” Journal für die Reine und Angewandte Mathematik, 77, 258–262. https://doi.org/10.1515/crll.1874.77.258.
Card, D., and Krueger, A. B. (1994), “Minimum wages and employment: A case study of the fast-food industry in New Jersey and Pennsylvania,” The American Economic Review, 84, 772–793.
Cargile, J. (1988), “On the burden of proof,” Philosophy, 72, 59–83. https://doi.org/10.1017/s0031819100056655.
Carpenter, J. L., and Drabick, D. A. G. (2011), “Co-occurrence of linguistic and behavioural difficulties in early childhood: A developmental psychopathology perspective,” Early Child Development and Care, 181, 1021–1045. https://doi.org/10.1080/03004430.2010.509795.
Cartwright, N. (1994), “No causes in, no causes out,” in Nature’s capacities and their measurement, ed. N. Cartwright, Clarendon. https://doi.org/10.1093/0198235070.003.0003.
Chadefaux, T. (2014), “Early warning signals for war in the news,” Journal of Peace Research, 51, 5–18. https://doi.org/10.1177/0022343313507302.
Chen, M. K. (2013), “The effect of language on economic behavior: Evidence from savings rates, health behaviors, and retirement assets,” The American Economic Review, 103, 690–731. https://doi.org/10.1257/aer.103.2.690.
Chiu, A., Lan, X., Liu, Z., and Xu, Y. (2023), “Causal panel analysis under parallel trends: Lessons from a large reanalysis study,” arXiv, 2309.15983v4. https://doi.org/10.48550/arxiv.2309.15983.
Chopra, F., Haaland, I., Roth, C., and Stegmann, A. (2023), “The null result penalty,” The Economic Journal, 134, 193–219. https://doi.org/10.1093/ej/uead060.
Christensen, G., Freese, J., and Miguel, E. (2019), Transparent and reproducible social science research: How to do open science, Oakland, California: University of California Press.
Chung, C., Pennebaker, J., and Fiedler, K. (2007), “The psychological functions of function words,” in Social communication, ed. K. Fiedler, Psychology Press, pp. 343–359. https://doi.org/10.4324/9780203837702-15.
Cinelli, C., and Hazlett, C. (2020), “Making sense of sensitivity: Extending omitted variable bias,” Journal of the Royal Statistical Society. Series B, Statistical Methodology, 82, 39–67. https://doi.org/10.1111/rssb.12348.
Claessens, S., Fischer, K., Chaudhuri, A., Sibley, C. G., and Atkinson, Q. D. (2020), “The dual evolutionary foundations of political ideology,” Nature Human Behaviour, 4, 336–345. https://doi.org/10.1038/s41562-020-0850-9.
Clayton, A. (2021), Bernoulli’s fallacy: Statistical illogic and the crisis of modern science, Columbia University Press.
Cliff, N. (1983), “Some cautions concerning the application of causal modeling methods,” Multivariate Behavioral Research, 18, 115–126. https://doi.org/10.1207/s15327906mbr1801_7.
Cohen, J. (1960), “A coefficient of agreement for nominal scales,” Educational and Psychological Measurement, 20, 37–46. https://doi.org/10.1177/001316446002000104.
Cohen, J. (1988), Statistical power analysis for the behavioral sciences, Lawrence Erlbaum Associates.
Cohen, J. (1991), “A power primer,” Psychological Bulletin, 112, 155–159. https://doi.org/10.1037/0033-2909.112.1.155.
Cohen, J. (1994), “The earth is round (p < .05),” The American Psychologist, 49, 997–1003. https://doi.org/10.1037/0003-066X.49.12.997.
Colette, A., Dwarakanath, K., Liu, P., Vyetrenko, S., and Balch, T. (2024), “LLM-driven imitation of subrational behavior: Illusion or reality?” arXiv, 2402.08755v1.
Comtois, D. (2022), summarytools: Tools to quickly and neatly summarize data.
Coover, J. E., and Angell, F. (1907), “General practice effect of special exercise,” The American Journal of Psychology, 18, 328–340. https://doi.org/10.2307/1412596.
Cosmides, L., Tooby, J., and Kurzban, R. (2003), “Perceptions of race,” Trends in Cognitive Sciences, 7, 173–179. https://doi.org/10.1016/S1364-6613(03)00057-3.
Cox, D. R. (1958), “The regression analysis of binary sequences (with discussion),” The Journal of the Royal Statistical Society, Series B (Statistical Methodology), 20, 215–242. https://doi.org/10.1111/j.2517-6161.1958.tb00292.x.
Cranmer, S. J., and Desmarais, B. A. (2017), “What can we learn from predictive modeling,” Political Analysis, 25, 145–166. https://doi.org/10.1017/pan.2017.3.
Credé, M., Harms, P., Niehorster, S., Gaye-Valentine, A., and King, L. (2012), “An evaluation of the consequences of using short measures of the big five personality traits,” Journal of Personality and Social Psychology, 102, 874–888. https://doi.org/10.1037/a0027403.
Daigneault, P.-M., Duval, D., and Imbeau, L. M. (2018), “Supervised scaling of semi-structured interview transcripts to characterize the ideology of a social policy reform,” Quality & Quantity, 52, 2151–2162. https://doi.org/10.1007/s11135-017-0650-0.
Darwin, C. (1998), The origin of species, Wordsworth.
Data Colada (2022a), “[104] Meaningless means: Some fundamental problems with meta-analytic averages,” Data Colada.
Data Colada (2022b), “[105] Meaningless means #1: The average effect of nudging is d = .43,” Data Colada.
Data Colada (2022c), “[106] Meaningless means #2: The average effect of nudging in academic publications is 8.7,” Data Colada.
Data Colada (2022d), “[107] Meaningless means #3: The truth about lies,” Data Colada.
Dawkins, R. (1976), The selfish gene, Oxford University Press.
De, B. (2023), “API management,” in API management: An architect’s guide to developing and managing APIs for your organization, ed. B. De, Apress L. P., pp. 1–26. https://doi.org/10.1007/979-8-8688-0054-2_1.
de Leeuw, J. (1985), “Reviews,” Psychometrika, 50, 371–375.
Deaton, A., and Cartwright, N. (2018), “Understanding and misunderstanding randomized controlled trials,” Social Science & Medicine, 210, 2–21. https://doi.org/10.1016/j.socscimed.2017.12.005.
Dellsén, F. (2024), Abductive reasoning in science, Cambridge University Press.
Delton, A. W., Cosmides, L., Guemo, M., Robertson, T. E., and Tooby, J. (2012), “The psychosemantics of free riding: Dissecting the architecture of a moral concept,” Journal of Personality and Social Psychology, 102, 1252–1270. https://doi.org/10.1037/a0027026.
Druckman, J. N. (2015), Experimental thinking: A primer on social science experiments, Cambridge: Cambridge University Press. https://doi.org/10.1017/9781108991353.
Drummond, P. D., and Reid, M. D. (2021), “Objective quantum fields, retrocausality and ontology,” Entropy (Basel, Switzerland), 23, 749. https://doi.org/10.3390/e23060749.
Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987), “Hybrid Monte Carlo,” Physics Letters B, 195, 216–222. https://doi.org/10.1016/0370-2693(87)91197-X.
Duflo, E., Glennerster, R., and Kremer, M. (2007), “Using randomization in development economics research: A toolkit,” in Handbook of development economics, eds. T. P. Schultz and J. A. Strauss, Elsevier, pp. 3895–3962. https://doi.org/10.1016/S1573-4471(07)04061-2.
Dunning, T. (2008), “Improving causal inference: Strengths and limitations of natural experiments,” Political Research Quarterly, 61, 282–293. https://doi.org/10.1177/1065912907306470.
Eddelbuettel, D. (2017), random: True random numbers using RANDOM.ORG.
Edwards, A. (1953), “The relationship between the judged desirability of a trait and the probability that the trait will be endorsed,” Journal of Applied Psychology, 37, 90–93. https://doi.org/10.1037/h0058073.
Edyko, K., Petryla, P., Ostafin, K., Minkner, M., Bieńkowski, B., Feja, K., Suwała, Z., Rektor, N., Łuczak, E., and Marchewka, U. (2023), “Utilizing artificial intelligence tools using the GPT chatbot in medicine - a review of flaws, advantages, and limitations,” Journal of Education, Health and Sport, 46, 122–133. https://doi.org/10.12775/JEHS.2023.46.01.008.
Efron, B. (1979), “Bootstrap methods: Another look at the jackknife,” The Annals of Statistics, 7, 1–26. https://doi.org/10.1214/aos/1176344552.
Efron, B. (1987), “Better bootstrap confidence intervals,” Journal of the American Statistical Association, 82, 171–185. https://doi.org/10.1080/01621459.1987.10478410.
Efron, B. (2003), “Second thoughts on the bootstrap,” Statistical Science, 18, 135–140. https://doi.org/10.1214/ss/1063994968.
Efron, B., and Tibshirani, R. (1994), An introduction to the bootstrap, Chapman & Hall.
Elster, J. (1983), Explaining technical change: A case study in the philosophy of science, Cambridge University Press.
Elster, J. (1989/1990), Nuts and bolts for the social sciences, Cambridge University Press.
Elster, J. (2007), Explaining social behavior: More nuts and bolts for the social sciences, Cambridge University Press.
Elwert, F. (2013), “Graphical causal models,” in Handbook of causal analysis for social research, ed. S. L. Morgan, Springer, pp. 245–273.
Elwert, F., and Winship, C. (2014), “Endogenous selection bias: The problem of conditioning on a collider variable,” Annual Review of Sociology, 40, 31–53. https://doi.org/10.1146/annurev-soc-071913-043455.
Emam, K. E., Mosquera, L., and Hoptroff, R. (2020), Practical synthetic data generation: Balancing privacy and the broad availability of data, O’Reilly Media.
Erikson, R. S., and Stoker, L. (2011), “Caught in the draft: The effects of Vietnam draft lottery status on political attitudes,” American Political Science Review, 105, 221–237. https://doi.org/10.1017/S0003055411000141.
Ernst, M. J., Sommer, B. B., Meichtry, A., and Bauer, C. M. (2019), “Intra-rater reliability of determining positions of cervical spinous processes and measuring their relative distances : An update to define rigid bodies of the cervical spine in a movement laboratory settings,” BMC Research Notes, 12, 265. https://doi.org/10.1186/s13104-019-4299-8.
Eronen, M. I., and Romeijn, J.-W. (2020), “Philosophy of science and the formalization of psychological theory,” Theory & Psychology, 30, 786–799. https://doi.org/10.1177/0959354320969876.
Euler, L. (1862), “Meditatio in experimenta explosione tormentorum nuper instituta,” Opera Postuma, 2, 800–804.
Eysenck, H. J. (1995), “Meta-analysis or best-evidence synthesis?” Journal of Evaluation in Clinical Practice, 1, 29–36. https://doi.org/10.1111/j.1365-2753.1995.tb00005.x.
Eysenck, M. W., and Keane, M. T. (2020), Cognitive psychology. A student’s handbook, Routledge.
Farrell, M. (2014), Historical and philosophical foundations of psychology, Cambridge University Press.
Fausey, C. M., and Boroditsky, L. (2010), “Subtle linguistic cues influence perceived blame and financial liability,” Psychonomic Bulletin & Review, 17, 644–650. https://doi.org/10.3758/PBR.17.5.644.
Fausey, C. M., and Boroditsky, L. (2011), “Who dunnit? Cross-linguistic differences in eye-witness memory,” Psychonomic Bulletin & Review, 18, 150–157. https://doi.org/10.3758/s13423-010-0021-5.
Feather, N. T. (1999), Values, achievement, and justice: Studies in the psychology of deservingness, Kluwer Academic/Plenum Publishers.
Ferman, B. (2023), “Inference in difference‐in‐differences: How much should we trust in independent clusters?” Journal of Applied Econometrics (Chichester, England), 38, 358–369. https://doi.org/10.1002/jae.2955.
Feyerabend, P. (1993), Against method, Verso.
Fichte, J. G. (1970), The science of knowledge, Cambridge University Press.
Fisher, R. A. (1935), The design of experiments, Edinburgh: Oliver & Boyd.
Fisher, R. A. (1970), Statistical methods for research workers, Hafner Publishing Company.
Fitzsimons, J. F., Jones, J. A., and Vedral, V. (2015), “Quantum correlations which imply causation,” Scientific Reports, 5, 18281. https://doi.org/10.1038/srep18281.
Fleiss, J. L. (1971), “Measuring nominal scale agreement among many raters,” Psychological Bulletin, 76, 378–382. https://doi.org/10.1037/h0031619.
Fong, C. M., Bowles, S., and Gintis, H. (2006), “Strong reciprocity and the welfare state,” in Handbook on the economics of giving, reciprocity and altruism, eds. S.-C. Kolm and J. M. Ythier, Elsevier, pp. 1439–1464. https://doi.org/10.1016/S1574-0714(06)02023-9.
França, A. R. M., Rocha, E., Bastos, L. S. L., Bozza, F. A., Kurtz, P., Maccariello, E., Silva, L. E., Roberto, J., and Salluh, J. I. F. (2024), “Development and validation of a machine learning model to predict the use of renal replacement therapy in 14,374 patients with COVID-19,” Journal of Critical Care, 80, 154480. https://doi.org/10.1016/j.jcrc.2023.154480.
Freedman, D. A. (2004), “Graphical models for causation, and the identification problem,” Evaluation Review, 28, 267–293. https://doi.org/10.1177/0193841X04266432.
Futuyma, D. J., and Kirkpatrick, M. (2023), Evolution, Oxford University Press.
Gabry, J., Simpson, D., Vehtari, A., Betancourt, M., and Gelman, A. (2019), “Visualization in Bayesian workflow,” The Journal of the Royal Statistical Society, Series A (Statistics in Society), 182, 389–402. https://doi.org/10.1111/rssa.12378.
Gaenssler, P., and Stute, W. (1979), “Empirical processes: A survey of results for independent and identically distributed random variables,” The Annals of Probability, 7, 193–243. https://doi.org/10.1214/aop/1176995085.
Gaines, B. J., and Kuklinski, J. H. (2011), “Treatment effects,” in Cambridge handbook of experimental political science, eds. J. N. Druckman, D. P. Green, J. H. Kuklinski, and A. Lupia, Cambridge University Press, pp. 445–458.
Gaines, B. J., Kuklinski, J. H., and Quirk, P. J. (2006), “The logic of the survey experiment reexamined,” Political Analysis, 15, 1–20. https://doi.org/10.1093/pan/mpl008.
Gallier, J. (2011), Discrete mathematics, Springer.
Galton, F. (1884), “Measurement of character,” Fortnightly Review, 36, 179–185.
Gao, S., Zika, O., Rogers, R. D., and Thierry, G. (2015), “Second language feedback abolishes the "hot hand" effect during even-probability gambling,” The Journal of Neuroscience, 35, 5983–5989. https://doi.org/10.1523/JNEUROSCI.3622-14.2015.
Gaulejac, V. de (1997), “Clinical sociology and life histories,” International Sociology, 12, 175–190. https://doi.org/10.1177/026858097012002005.
Gauss, C. F. (2012[1809]), Theoria motus corporum coelestium in sectionibus conicis solem ambientium, Cambridge University Press.
Gelman, A., and Carlin, J. B. (2014), “Beyond power calculations: Assessing type s (sign) and type m (magnitude) errors,” Perspectives on Psychological Science, 9, 641–651. https://doi.org/10.1177/1745691614551642.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2014), Bayesian data analysis, Chapman & Hall/CRC Press.
Gelman, A., Goodrich, B., Gabry, J., and Vehtari, A. (2019), “R-squared for Bayesian regression models,” The American Statistician, 73, 307–309. https://doi.org/10.1080/00031305.2018.1549100.
Gelman, A., and Hill, J. (2007), Data analysis using regression and multilevel/hierarchical models, Cambridge: Cambridge University Press.
Gelman, A., Hill, J., and Vehtari, A. (2021), Regression and other stories, Cambridge University Press. https://doi.org/10.1017/9781139161879.
Gelman, A., and Little, T. C. (1997), “Poststratification into many categories using hierarchical logistic regression,” Survey Methodology, 23, 127–135.
Gelman, A., Meng, X.-L., and Stern, H. (1996), “Posterior predictive assessment of model fitness via realized discrepancies,” Statistica Sinica, 6, 733–760.
Gelman, A., and Rubin, D. B. (1992), “Inference from iterative simulation using multiple sequences,” Statistical Science, 7, 457–472. https://doi.org/10.1214/ss/1177011136.
Gerber, A. S., and Green, D. P. (2012), Field experiments: Design, analysis, and interpretation, W. W. Norton.
Gideon, L. (2012), Handbook of survey methodology in social sciences, Springer.
Gigerenzer, G. (2004), “Mindless statistics,” The Journal of Socio-Economics, 33, 587–606. https://doi.org/10.1016/j.socec.2004.09.033.
Gigerenzer, G. (2018), “Statistical rituals: The replication delusion and how we got there,” Advances in Methods and Practices in Psychological Science, 1, 198–218. https://doi.org/10.1177/251524591877132.
Gilchrist, D., Emery, T., Garoupa, N., and Spruk, R. (2024), “Synthetic control method: A tool for comparative case studies in economic history,” Journal of Economic Surveys, 37, 409–445. https://doi.org/10.1111/joes.12493.
Gilens, M. (2000), Why Americans hate welfare: Race, media, and the politics of antipoverty policy, University of Chicago Press.
Glass, G. V. (1976), “Primary, secondary, and meta-analysis of research,” Educational Researcher, 5, 3–8. https://doi.org/10.3102/0013189X005010003.
Göb, R., McCollin, C., and Ramalhoto, M. F. (2007), “Ordinal methodology in the analysis of likert scales,” Quality & Quantity, 41, 601–626. https://doi.org/10.1007/s11135-007-9089-z.
Goldthorpe, J. H. (2001), “Causation, statistics, sociology,” European Sociological Review, 17, 1–20.
Goodrich, B., Gabry, J., Ali, I., and Brilleman, S. (2024), rstanarm: Bayesian applied regression modeling via Stan.
Goodson, I. (2001), “The story of life history: Origins of the life history method in sociology,” Identity (Mahwah, N.J.), 1, 129–142. https://doi.org/10.1207/S1532706XID0102_02.
Gosset, W. S. (1908), “The probable error of a mean,” Biometrika, 6, 1–25. https://doi.org/10.2307/2331554.
Gottlieb, A., and Almasi, G. S. (1989), Highly parallel computing, Redwood City, California: Benjamin/Cummings.
Graham, J., and Haidt, J. (2012), “Sacred values and evil adversaries: A moral foundations approach,” in The social psychology of morality: Exploring the causes of good and evil, eds. M. Mikulincer and P. R. Shaver, American Psychological Association, pp. 11–31. https://doi.org/10.1037/13091-001.
Green, S. B., and Yang, Y. (2009), “Reliability of summed item scores using structural equation modeling: An alternative to coefficient alpha,” Psychometrika, 74, 155–167. https://doi.org/10.1007/s11336-008-9099-3.
Greenland, S., Robins, J. M., and Pearl, J. (1999), “Confounding and collapsibility in causal inference,” Statistical Science, 14, 29–46. https://doi.org/10.1214/ss/1009211805.
Grembi, V., Nannicini, T., and Troiano, U. (2012), “Policy responses to fiscal restraints: A difference-in-discontinuities design,” CESifo Working Paper NO. 3999.
Grolemund, G., and Wickham, H. (2011), “Dates and times made easy with lubridate,” Journal of Statistical Software, 40, 1–25. https://doi.org/10.18637/jss.v040.i03.
Gross, A., Schirm, S., and Scholz, M. (2014), “Ycasd - a tool for capturing and scaling data from graphical representations,” BMC Bioinformatics, 15. https://doi.org/10.1186/1471-2105-15-219.
Gruber, J. (2014), The markdown file extension, The Daring Fireball Company, LLC.
Gumbel, E. J. (1933), “Spurious correlation and its significance to physiology,” Mathematical Proceedings of the Cambridge Philosophical Society, 21, 179–194. https://doi.org/10.1080/01621459.1926.10502169.
Guttman, L. (1977), “What is not what in statistics,” Journal of the Royal Statistical Society: Series D (The Statistician), 26, 81–107.
Hacking, I. (1999), The social construction of what?, Harvard University Press.
Haidt, J. (2013), The righteous mind: Why good people are divided by politics and religion, Penguin Books.
Haidt, J., and Joseph, C. (2004), “Intuitive ethics: How innately prepared intuitions generate culturally variable virtues,” Daedalus (Cambridge, Mass.), 133, 55–66. https://doi.org/10.1162/0011526042365555.
Hainmueller, J., Hangartner, D., and Yamamoto, T. (2015), “Validating vignette and conjoint survey experiments against real-world behavior,” Proceedings of the National Academy of Sciences, 112, 2395–2400. https://doi.org/10.1073/pnas.1416587112.
Hamel, J. (2013), “Brief review of three methodological problems of life history in sociology,” Bulletin de méthodologie sociologique, 117, 61–75. https://doi.org/10.1177/0759106312465552.
Han, G., Santner, T. J., Lin, H., and Yuan, A. (2023), “Bayesian-frequentist hybrid inference in applications with small sample sizes,” The American Statistician, 77, 143–150. https://doi.org/10.1080/00031305.2022.2127897.
Hanmer, M. J., and Kalkan, K. O. (2013), “Behind the curve: Clarifying the best approach to calculating predicted probabilities and marginal effects from limited dependent variable models,” American Journal of Political Science, 57, 263–277. https://doi.org/10.1111/j.1540-5907.2012.00602.x.
Hariri, J. G. (2012), “The autocratic legacy of early statehood,” American Political Science Review, 106, 471–494. https://doi.org/10.1017/S0003055412000238.
Hartman, E. (2021), “Generalizing experimental results,” in Advances in experimental political science, eds. J. N. Druckman and D. P. Green, Cambridge: Cambridge University Press, pp. 385–410. https://doi.org/10.1017/9781108777919.027.
Hartman, E., and Hidalgo, F. D. (2018), “An equivalence approach to balance and placebo tests,” American Journal of Political Science, 62, 1000–1013. https://doi.org/10.1111/ajps.12387.
Hastie, T. J., and Tibshirani, R. J. (1990), Generalized additive models, Chapman & Hall/CRC.
Hastie, T. J., Tibshirani, R. J., and Friedman, J. (2017), The elements of statistical learning: Data mining, inference, and prediction, Springer.
Hausman, J. A., and Wise, D. A. (1979), “Attrition bias in experimental and panel data: The Gary income maintenance experiment,” Econometrica, 47, 455–473. https://doi.org/10.2307/1914193.
Heck, D. W. (2019), “A caveat on the savage–dickey density ratio: The case of computing bayes factors for regression parameters,” British Journal of Mathematical and Statistical Psychology, 72, 316–333. https://doi.org/10.1111/bmsp.12150.
Heckman, J. J. (1979), “Sample selection bias as a specification error,” Econometrica, 47, 153–161. https://doi.org/10.2307/1912352.
Hegel, G. W. F. (2014), The science of logic, Cambridge University Press.
Helmert, F. R. (1875), “Über die berechnung des wahrscheinlichen fehlers aus einer endlichen anzahl wahrer beobachtungsfehler,” Zeitschrift für Angewandte Mathematik und Physik, 20, 300–303.
Helmert, F. R. (1876a), “Über die wahrscheinlichkeit der potenzsummen der beobachtungsfehler und über einige damit in zusammenhang stehende fragen,” Zeitschrift für Angewandte Mathematik und Physik, 21, 192–218.
Helmert, F. R. (1876b), “Die genauigkeit der formel von peters zur berechnung des wahrscheinlichen beobachtungsfehlers directer beobachtungen gleicher genauigkeit,” Astronomische Nachrichten, 88, 113–132.
Hempel, C. G. (1959), “The logic of functional analysis,” in Symposium on sociological theory, ed. L. Gross, Harper & Row, pp. 271–307.
Hewitt, L., Ashokkumar, A., Ghezae, I., and Willer, R. (2024), “Predicting results of social science experiments using large language models.”
Hibbing, J. R., Smith, K. B., and Alford, J. R. (2019), Predisposed: Liberals, conservatives, and the biology of political differences, Routledge.
Hilton, J. L., and von Hippel, W. (1996), “Stereotypes,” Annual Review of Psychology, 47, 237–271. https://doi.org/10.1146/annurev.psych.47.1.237.
Hinsz, V. B., Stoesser, C. J., and Matz, D. C. (2013), “The intermingling of social and evolutionary psychology influences on hair color preferences,” Current Psychology, 32, 136–149. https://doi.org/10.1007/s12144-013-9169-2.
Ho, D. E., Imai, K., King, G., and Stuart, E. A. (2007), “Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference,” Political Analysis, 15, 199–236. https://doi.org/10.1093/pan/mpl013.
Ho, T. K. (1995), “Random decision forests,” Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC: Institute of Electrical; Electronics Engineers, 278–282. https://doi.org/10.1109/ICDAR.1995.598994.
Hoffman, M. D., and Gelman, A. (2014), “The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo,” Journal of Machine Learning Research, 15, 1593–1623. https://doi.org/10.48550/arxiv.1111.4246.
Holderried, F., Stegemann-Philips, C., Herschbach, L., Moldt, J.-A., Nevins, A., Griewatz, J., Holderried, M., Herrmann-Werner, A., Festl-Wietek, T., and Mahling, M. (2024), “A generative pretrained transformer (GPT)—powered chatbot as a simulated patient to practice history taking: Prospective, mixed methods study,” JMIR Medical Education, 10, e53961. https://doi.org/10.2196/53961.
Holland, P. W. (1986), “Statistics and causal inference,” Journal of the American Statistical Association, 81, 945–960. https://doi.org/10.1080/01621459.1986.10478354.
Hollibaugh, G. E. (2019), “The use of text as data methods in public administration: A review and an application to agency priorities,” Journal of Public Administration Research and Theory, 29, 474–490. https://doi.org/10.1093/jopart/muy045.
Holtgraves, T. M. (2013), Language as social action: Social psychology and language use, Taylor & Francis.
Homburg, C., and Dobartz, A. (1992), “Covariance structure analysis via specification searches,” Statistical papers (Berlin, Germany), 33, 119–142. https://doi.org/10.1007/BF02925318.
Hu, L., and Bentler, P. M. (1999), “Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives,” Structural Equation Modeling, 6, 1–55. https://doi.org/10.1080/10705519909540118.
Hughes, J. (2022), “Sklar’s omega: A Gaussian copula-based framework for assessing agreement,” Statistics and Computing, 32, 46. https://doi.org/10.1007/s11222-022-10105-2.
Hughes, J. (2023), sklarsomega: Measuring agreement using Sklar’s omega coefficient.
Hume, D. (2010[1748]), “Of miracles,” in An enquiry concerning human understanding and concerning the principles of morals, eds. D. Hume, L. A. Selby-Bigge, and P. H. Nidditch, Oxford: Clarendon Press.
Hume, D. (1975[1777]), Enquiries concerning human understanding and concerning the principles of morals, Clarendon Press.
Hume, D. (2011), “Of knowledge and probability,” in The essential philosophical works, ed. T. Griffith, Wordsworth.
Hurault, G. (2021), HuraultMisc: Guillem Hurault functions’ library.
Imai, K., King, G., and Stuart, E. A. (2008), “Misunderstandings between experimentalists and observationalists about causal inference,” Journal of the Royal Statistical Society, Series A, 171, 481–502. https://doi.org/10.1111/j.1467-985X.2007.00527.x.
Imai, K., and Tingley, D. (2012), “A statistical method for empirical testing of competing theories,” American Journal of Political Science, 56, 218–236. https://doi.org/10.1111/j.1540-5907.2011.00555.x.
Imbens, G. W. (2015), “Matching methods in practice: Three examples,” The Journal of Human Resources, 50, 373–419. https://doi.org/10.3368/jhr.50.2.373.
Imbens, G. W., and Rubin, D. B. (2015), Causal inference for statistics, social, and biomedical sciences: An introduction, Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781139025751.
Ioannidis, J. P. A. (2016), “The mass production of redundant, misleading, and conflicted systematic reviews and meta-analyses,” The Milbank Quarterly, 94, 485–514. https://doi.org/10.1111/1468-0009.12210.
Isager, P. M., Aert, R. C. M. van, Bahník, Š., Brandt, M. J., DeSoto, K. A., Giner-Sorolla, R., Krueger, J. I., Perugini, M., Ropovik, I., Veer, A. E. van ’t, Vranka, M., and Lakens, D. (2023), “Deciding what to replicate: A decision model for replication study selection under resource and knowledge constraints,” Psychological Methods, 28, 438–451. https://doi.org/10.1037/met0000438.
Jackson, D. L. (2003), “Revisiting sample size and number of parameter estimates: Some support for the n:q hypothesis,” Structural Equation Modeling, 10, 128–141. https://doi.org/10.1207/S15328007SEM1001_6.
Jacquette, D. (1976), Ontology, Routledge.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2021), An introduction to statistical learning with applications in R, New York, NY: Springer.
Jansen, B. J., Sung, S., and Salminen, J. (2023), “Employing large language models in survey research,” Natural Language Processing Journal, 4, 100020. https://doi.org/10.1016/j.nlp.2023.100020.
Johnson, D. H. (1999), “The insignificance of statistical significance testing,” Journal of Wildlife Management, 63, 763–772. https://doi.org/10.2307/3802789.
Johnson, D. K. (2019), “Confirmation bias,” in Bad arguments: 100 of the most important fallacies in western philosophy, eds. R. Arp, S. Barbone, and M. Bruce, John Wiley & Sons, Ltd, pp. 317–320.
Jones, B. F., and Olken, B. A. (2005), “Does compulsory school attendance affect schooling and earnings?” The Quarterly Journal of Economics, 120, 835–864. https://doi.org/10.1093/qje/120.3.835.
Jung, L. (2024), scrutiny: Error detection in science.
Kachitvichyanukul, V., and Schmeiser, B. W. (1988), “Binomial random variate generation,” Communications of the ACM, 31, 216–222. https://doi.org/10.1145/42372.42381.
Kahneman, D. (2024), Thinking, fast and slow, Penguin.
Kahn-Lang, A., and Lang, K. (2020), “The promise and pitfalls of differences-in-differences: Reflections on 16 and pregnant and other applications,” Journal of Business & Economic Statistics, 38, 613–620. https://doi.org/10.1080/07350015.2018.1546591.
Kalton, G., and Kasprzyk, D. (1982), “Imputing for missing survey responses,” Proceedings of the Section on Survey Research Methods, 22, 22–31.
Kam, C. D., and Palmer, C. L. (2008), “Reconsidering the effects of education on political participation,” The Journal of Politics, 70, 612–631. https://doi.org/10.1017/S0022381608080651.
Kane, J. V., Velez, Y. R., and Barabas, J. (2023), “Analyze the attentive and bypass bias: Mock vignette checks in survey experiments,” Political Science Research and Methods, 11, 293–310. https://doi.org/10.1017/psrm.2023.3.
Kass, R. E., and Raftery, A. E. (1995), “Bayes factors,” Journal of the American Statistical Association, 90, 773–795. https://doi.org/10.1080/01621459.1995.10476572.
Kassambara, A. (2023), ggpubr: ’ggplot2’ based publication ready plots.
Kaufmann, W. (1959), Hegel: A reinterpretation, Doubleday.
Kay, M. (2024a), “ggdist: Visualizations of distributions and uncertainty in the grammar of graphics,” IEEE Transactions on Visualization and Computer Graphics, 30, 414–424. https://doi.org/10.1109/TVCG.2023.3327195.
Kay, M. (2024b), ggdist: Visualizations of distributions and uncertainty. https://doi.org/10.5281/zenodo.3879620.
Kay, M. (2023), tidybayes: Tidy data and geoms for Bayesian models. https://doi.org/10.5281/zenodo.1308151.
Kaźmierska, K., and Schütze, F. (2013), “Wykorzystanie autobiograficznego wywiadu narracyjnego w badaniach nad konstruowaniem obrazu przeszłości w biografii. Na przykładzie socjologicznego porównania narracji na temat życia w PRL-u i NRD” [“Using the autobiographical narrative interview in research on the construction of images of the past in biography: A sociological comparison of narratives about life in the Polish People’s Republic and the GDR”], Przegląd Socjologii Jakościowej, 9, 122–139. https://doi.org/10.18778/1733-8069.9.4.08.
Keele, L. (2015), “The statistics of causal inference: A view from political methodology,” Political Analysis, 23, 313–335. https://doi.org/10.1093/pan/mpv007.
Kendall, M. G. (1938), “A new measure of rank correlation,” Biometrika, 30, 81–89. https://doi.org/10.1093/biomet/30.1-2.81.
Kennaway, R. (2015), “When causation does not imply correlation: Robust violations of the faithfulness axiom,” ArXiv, 1505.03118v1. https://doi.org/10.48550/arXiv.1505.03118.
Kennedy, R. K. L., Villanustre, F., Khoshgoftaar, T. M., and Salekshahrezaee, Z. (2024), “Synthesizing class labels for highly imbalanced credit card fraud detection data,” Journal of Big Data, 11, 38. https://doi.org/10.1186/s40537-024-00897-7.
Khan, I. U., and Khan, M. U. S. (2024), “Social media profiling for political affiliation detection,” Human-Centric Intelligent Systems, 4, 437–446. https://doi.org/10.1007/s44230-024-00078-y.
King, G., Honaker, J., Joseph, A., and Scheve, K. (2001), “Analyzing incomplete political science data: An alternative algorithm for multiple imputation,” American Political Science Review, 95, 49–69. https://doi.org/10.1017/S0003055401000235.
King, G., and Nielsen, R. (2019), “Why propensity scores should not be used for matching,” Political Analysis, 27, 435–454. https://doi.org/10.1017/pan.2019.11.
King, G., and Roberts, M. E. (2014), “How robust standard errors expose methodological problems they do not fix, and what to do about it,” Political Analysis, 23, 159–179. https://doi.org/10.1093/pan/mpu015.
King, G., Tomz, M., and Wittenberg, J. (2000), “Making the most of statistical analyses: Improving interpretation and presentation,” American Journal of Political Science, 44, 347–361. https://doi.org/10.2307/2669316.
Kinney, S. K., Reiter, J. P., Reznek, A. P., Miranda, J., Jarmin, R. S., and Abowd, J. M. (2011), “Towards unrestricted public use business microdata: The synthetic longitudinal business database,” International Statistical Review, 79, 362–384. https://doi.org/10.1111/j.1751-5823.2011.00153.x.
Kjell, O. N. E., Sikström, S., Kjell, K., and Schwartz, H. A. (2022), “Natural language analyzed with AI-based transformers predict traditional subjective well-being measures approaching the theoretical upper limits in accuracy,” Scientific Reports, 12, 3918. https://doi.org/10.1038/s41598-022-07520-w.
Koga, S., and Du, W. (2024), “Integrating AI in medicine: Lessons from chat-GPT’s limitations in medical imaging,” Digestive and Liver Disease, 56, 1114–1115. https://doi.org/10.1016/j.dld.2024.02.014.
Kokkonen, A., and Sundell, A. (2020), “Leader succession and civil war,” Comparative Political Studies, 53, 434–468. https://doi.org/10.1177/0010414019852712.
Kozlowski, A. C., Taddy, M., and Evans, J. A. (2019), “The geometry of culture: Analyzing the meanings of class through word embeddings,” American Sociological Review, 84, 905–949. https://doi.org/10.1177/0003122419877135.
Krippendorff, K. (2018), Content analysis: An introduction to its methodology, Los Angeles: Sage.
Krotov, V., and Tennyson, M. (2018), “Research note: Scraping financial data from the web using the r language,” Journal of Emerging Technologies in Accounting, 15, 169–181. https://doi.org/10.2308/jeta-52063.
Krumpal, I. (2013), “Determinants of social desirability bias in sensitive surveys: A literature review,” Quality & Quantity, 47, 2025–2047. https://doi.org/10.1007/s11135-011-9640-9.
Kruschke, J. K. (2010), “Bayesian data analysis,” Cognitive Science, 1, 658–676. https://doi.org/10.1002/wcs.72.
Kruschke, J. K. (2014), Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan, Academic Press.
Kruschke, J. K., and Liddell, T. M. (2018), “The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective,” Psychonomic Bulletin and Review, 25, 178–206. https://doi.org/10.3758/s13423-016-1221-4.
Kuhn, M. (2008), “Building predictive models in R using the caret package,” Journal of Statistical Software, 28, 1–26. https://doi.org/10.18637/jss.v028.i05.
Kuhn, T. S. (1970), The structure of scientific revolutions, University of Chicago Press.
Kuklinski, J. H., and Quirk, P. J. (2000), “Reconsidering the rational public: Cognition, heuristics and mass opinion,” in Elements of reason: Cognition, choice and the bounds of rationality, eds. A. Lupia, M. D. McCubbins, and S. L. Popkin, Cambridge University Press, pp. 153–182. https://doi.org/10.1017/CBO9780511805813.008.
Kurzban, R., and Aktipis, C. A. (2007), “Modularity and the social mind: Are psychologists too self-ish?” Personality and Social Psychology Review, 11, 131–149. https://doi.org/10.1177/1088868306294906.
Lakatos, I. (1970), “Falsification and the methodology of scientific research programmes,” in Criticism and the growth of knowledge, eds. I. Lakatos and A. Musgrave, Cambridge University Press, pp. 91–196.
Lamport, L. (1986), LaTeX: A document preparation system, Addison-Wesley Pub. Co.
Lantz, B. (2013), “Equidistance of likert-type scales and validation of inferential methods using experiments and simulations,” The Electronic Journal of Business Research Methods, 11, 16–28.
Laplace, P. S. (2009[1814]), Essai philosophique sur les probabilités, Cambridge: Cambridge University Press.
Larmarange, J. (2024), labelled: Manipulating labelled data.
Larsen, C. A. (2006), The institutional logic of welfare attitudes: How welfare regimes influence public support, Ashgate.
Larsen, R., Buss, D., Wismeijer, A., Song, J., van den Berg, S., and Jeronimus, B. (2021), Personality psychology: Domains of knowledge about human nature, McGraw Hill.
Lau, R. R., and Redlawsk, D. P. (2001), “Advantages and disadvantages of cognitive heuristics in political decision making,” American Journal of Political Science, 45, 951–971. https://doi.org/10.2307/2669334.
Lecler, A., Duron, L., and Soyer, P. (2023), “Revolutionizing radiology with GPT-based models: Current applications, future possibilities and limitations of ChatGPT,” Diagnostic and Interventional Imaging, 104, 269–274. https://doi.org/10.1016/j.diii.2023.02.003.
Lee, J. J. (2012), “Correlation and causation in the study of personality,” European Journal of Personality, 26, 372–390. https://doi.org/10.1002/per.1863.
Legendre, A. M. (1805), Nouvelles méthodes pour la détermination des orbites des comètes, F. Didot.
Leifer, M. S., and Pusey, M. F. (2017), “Is a time symmetric interpretation of quantum theory possible without retrocausality?” Proceedings of the Royal Society. A, Mathematical, physical, and engineering sciences, 473, 20160607. https://doi.org/10.1098/rspa.2016.0607.
Lenz, G. S., and Sahn, A. (2021), “Achieving statistical significance with covariates and without transparency,” Political Analysis, 29, 356–369. https://doi.org/10.1017/pan.2020.31.
Levy, R., and Mislevy, R. J. (2020), Bayesian psychometric modeling, Chapman & Hall/CRC.
Li, N. P., van Vugt, M., and Colarelli, S. M. (2018), “The evolutionary mismatch hypothesis: Implications for psychological science,” Current Directions in Psychological Science, 27, 38–44. https://doi.org/10.1177/0963721417731378.
Liaw, A., and Wiener, M. (2002), “Classification and regression by randomForest,” R News, 2, 18–22.
Likert, R. (1932), “A technique for the measurement of attitudes,” Archives of Psychology, 22, 55.
Linos, K., and Twist, K. (2018), “Diverse pre-treatment effects in survey experiments,” Journal of Experimental Political Science, 5, 148–158. https://doi.org/10.1017/XPS.2017.29.
Lucy, J. A. (1992), Language diversity and thought: A reformulation of the linguistic relativity hypothesis, Cambridge University Press.
Lüdecke, D. (2022), sjlabelled: Labelled data utility functions (version 1.2.0). https://doi.org/10.5281/zenodo.1249215.
Lundberg, I., Johnson, R., and Stewart, B. M. (2021), “What is your estimand? Defining the target quantity connects statistical evidence to theory,” American Sociological Review, 86, 532–565. https://doi.org/10.1177/00031224211004187.
Lüroth, J. (1876), “Vergleichung von zwei werten des wahrscheinlichen fehlers,” Astronomische Nachrichten, 87, 209–220. https://doi.org/10.1002/asna.18760871402.
Lyons, J. (1991), Natural language and universal grammar, Cambridge University Press.
Mac Giolla, E., and Kajonius, P. J. (2019), “Sex differences in personality are larger in gender equal countries: Replicating and extending a surprising finding,” International Journal of Psychology, 54, 705–711. https://doi.org/10.1002/ijop.12529.
MacCallum, R. C. (1990), “The need for alternative measures of fit in covariance structure modeling,” Multivariate Behavioral Research, 25, 157–162. https://doi.org/10.1207/s15327906mbr2502_2.
Mackie, J. L. (1980), The cement of the universe: A study of causation, Clarendon Press.
Macrae, C. N., Milne, A. B., and Bodenhausen, G. V. (1994), “Stereotypes as energy-saving devices: A peek inside the cognitive toolbox,” Journal of Personality and Social Psychology, 66, 37–47. https://doi.org/10.1037/0022-3514.66.1.37.
Macy, M. W., and Willer, R. (2002), “From factors to actors: Computational sociology and agent-based modeling,” Annual Review of Sociology, 28, 143–166. https://doi.org/10.1146/annurev.soc.28.110601.141117.
Maier, M., Bartos, F., Stanley, T. D., Shanks, D. R., Harris, A. J. L., and Wagenmakers, E.-J. (2022), “No evidence for nudging after adjusting for publication bias,” Proceedings of the National Academy of Sciences, 119, e2200300119. https://doi.org/10.1073/pnas.2200300119.
Makowski, D., Ben-Shachar, M. S., Chen, S. H. A., and Lüdecke, D. (2019b), “Indices of effect existence and significance in the Bayesian framework,” Frontiers in Psychology, 10, 2767. https://doi.org/10.3389/fpsyg.2019.02767.
Makowski, D., Ben-Shachar, M. S., and Lüdecke, D. (2019a), “bayestestR: Describing effects and their uncertainty, existence and significance within the Bayesian framework,” Journal of Open Source Software, 4, 1541. https://doi.org/10.21105/joss.01541.
Manninen, B. A. (2019c), “False cause: Cum hoc ergo propter hoc,” in Bad arguments: 100 of the most important fallacies in western philosophy, eds. R. Arp, S. Barbone, and M. Bruce, John Wiley & Sons, Ltd, pp. 335–337.
Manninen, T. W. (2019a), “False cause: Post hoc ergo propter hoc,” in Bad arguments: 100 of the most important fallacies in western philosophy, eds. R. Arp, S. Barbone, and M. Bruce, John Wiley & Sons, Ltd, pp. 342–345.
Manninen, T. W. (2019b), “Appeal to personal incredulity,” in Bad arguments: 100 of the most important fallacies in western philosophy, eds. R. Arp, S. Barbone, and M. Bruce, John Wiley & Sons, Ltd, pp. 115–117.
Mansournia, M. A., Etminan, M., Danaei, G., Kaufman, J. S., and Collins, G. (2017), “Handling time varying confounding in observational research,” BMJ, 359, j4587. https://doi.org/10.1136/bmj.j4587.
March, J. G. (1994), A primer on decision making: How decisions happen, The Free Press.
Masi, M. (2023), “Quantum indeterminism, free will, and self-causation,” Journal of Consciousness Studies, 30, 32–56. https://doi.org/10.53765/20512201.30.5.032.
Masling, J. (1966), “Role-related behavior of the subject and psychologist and its effect upon psychological data,” in The nebraska symposium on motivation, ed. D. Levine, Lincoln, NE: University of Nebraska Press, pp. 67–103.
Mazaheri, B., Squires, C., and Uhler, C. (2024), “Synthetic potential outcomes and causal mixture identifiability,” arXiv.
Mazur, J. E. (2006), “Mathematical models and the experimental analysis of behavior,” Journal of the Experimental Analysis of Behavior, 85, 275–291. https://doi.org/10.1901/jeab.2006.65-05.
McCraw, B. W. (2019), “Appeal to ignorance,” in Bad arguments: 100 of the most important fallacies in western philosophy, eds. R. Arp, S. Barbone, and M. Bruce, John Wiley & Sons, Ltd, pp. 106–111.
McDermott, R. (2002), “Experimental methods in political science,” Annual Review of Political Science, 5, 31–61. https://doi.org/10.1146/annurev.polisci.5.091001.170657.
McDermott, R. (2011), “Internal and external validity,” in Cambridge handbook of experimental political science, eds. J. N. Druckman, D. P. Green, J. H. Kuklinski, and A. Lupia, Cambridge University Press, pp. 27–40.
McElreath, R. (2019), Statistical rethinking: A Bayesian course with examples in R, Chapman & Hall/CRC. https://doi.org/10.1201/9780429029608.
McShane, B. B., Gal, D., Gelman, A., Robert, C., and Tackett, J. L. (2019), “Abandon statistical significance,” The American Statistician, 73, 235–245. https://doi.org/10.1080/00031305.2018.1527253.
Meehl, P. E. (1978), “Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology,” Journal of Consulting and Clinical Psychology, 46, 806–834. https://doi.org/10.1037/0022-006X.46.4.806.
Meldrum, M. L. (2000), “A brief history of the randomized controlled trial. From oranges and lemons to the gold standard,” Hematology/Oncology Clinics of North America, 14, 745–760. https://doi.org/10.1016/S0889-8588(05)70309-9.
Memon, M. A., Thurasamy, R., Ting, H., Cheah, J.-H., and Chuah, F. (2024), “Control variables: A review and proposed guidelines,” Journal of Applied Structural Equation Modelling, 8, 1–14. https://doi.org/10.47263/JASEM.8(2)01.
Meschiari, S. (2022), latex2exp: Use LaTeX expressions in plots.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953), “Equation of state calculations by fast computing machines,” Journal of Chemical Physics, 21, 1087–1092. https://doi.org/10.1063/1.1699114.
Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., and Leisch, F. (2023), e1071: Misc functions of the department of statistics, probability theory group (formerly: E1071), TU wien.
Meyer, J. P., and Mulherin, A. (1980), “From attribution to helping: An analysis of the mediating effects of affect and expectancy,” Journal of Personality and Social Psychology, 39, 201–209. https://doi.org/10.1037/0022-3514.39.2.201.
Microsoft, and Weston, S. (2022a), doParallel: Foreach parallel adaptor for the ’parallel’ package.
Microsoft, and Weston, S. (2022b), foreach: Provides foreach looping construct.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013), “Distributed representations of words and phrases and their compositionality,” arXiv, arXiv:1310.4546v1.
Miller, D. (1974), “Popper’s qualitative theory of verisimilitude,” The British Journal for the Philosophy of Science, 25, 166–177. https://doi.org/10.1093/bjps/25.2.166.
Mitchell, K. J. (2018), Innate: How the wiring of our brains shapes who we are, Princeton University Press. https://doi.org/10.1515/9780691184999.
Mondak, J. M. (2010), Personality and the foundations of political behavior, Cambridge University Press. https://doi.org/10.1017/CBO9780511761515.
Monroe, C., Meekhof, D. M., King, B. E., and Wineland, D. J. (1996), “A "Schrödinger cat" superposition state of an atom,” Science, 272, 1131–1136. https://doi.org/10.1126/science.272.5265.1131.
Montgomery, J. M., Nyhan, B., and Torres, M. (2018), “How conditioning on posttreatment variables can ruin your experiment and what to do about it,” American Journal of Political Science, 62, 760–775. https://doi.org/10.1111/ajps.12357.
Morgan, S. L., and Winship, C. (2007), Counterfactuals and causal inference: Methods and principles for social research, Cambridge University Press.
Mummolo, J., and Peterson, E. (2019), “Demand effects in survey experiments: An empirical assessment,” American Political Science Review, 113, 517–529. https://doi.org/10.1017/S0003055418000837.
Mutz, D. C., and Pemantle, R. (2012), “The perils of randomization checks in the analysis of experiments,” University of Pennsylvania.
Nalborczyk, L., Batailler, C., Lœvenbruck, H., Vilain, A., and Bürkner, P.-C. (2019), “An introduction to Bayesian multilevel models using brms: A case study of gender effects on vowel variability in Standard Indonesian,” Journal of Speech, Language, and Hearing Research, 62, 1225–1242. https://doi.org/10.1044/2018_JSLHR-S-18-0006.
Neal, R. M. (1993), “Probabilistic inference using Markov chain Monte Carlo methods,” Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto.
Neal, R. M. (2011), “MCMC using Hamiltonian dynamics,” in Handbook of Markov chain Monte Carlo, eds. S. Brooks, A. Gelman, G. L. Jones, and X.-L. Meng, Chapman & Hall/CRC.
Neyman, J. (1923), “Sur les applications de la théorie des probabilités aux expériences agricoles: Essai des principes,” Statistical Science, 5, 463–472.
Neyman, J., and Pearson, E. S. (1933), “The testing of statistical hypotheses in relation to probabilities a priori,” Mathematical Proceedings of the Cambridge Philosophical Society, 29, 492–510. https://doi.org/10.1017/S030500410001152X.
Nichols, A. L., and Maner, J. K. (2008), “The good-subject effect: Investigating participant demand characteristics,” The Journal of General Psychology, 135, 151–166. https://doi.org/10.3200/GENP.135.2.151-166.
Nikolenko, S. I. (2021), Synthetic data for deep learning, Springer. https://doi.org/10.1007/978-3-030-75178-4.
Nola, R., and Sankey, H. (2007), Theories of scientific method: An introduction, Stocksfield: Acumen. https://doi.org/10.4324/9781315711959.
Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., Buck, S., Chambers, C. D., Chin, G., Christensen, G., Contestabile, M., Dafoe, A., Eich, E., Freese, J., Glennerster, R., Goroff, D., Green, D. P., Hesse, B., Humphreys, M., Ishiyama, J., Karlan, D., Kraut, A., Lupia, A., Mabry, P., Madon, T., Malhotra, N., Mayo-Wilson, E., McNutt, M., Miguel, E., Paluck, E. L., Simonsohn, U., Soderberg, C., Spellman, B. A., Turitto, J., VandenBos, G., Vazire, S., Wagenmakers, E. J., Wilson, R., and Yarkoni, T. (2015), “Promoting an open research culture: Author guidelines for journals could help to promote transparency, openness, and reproducibility,” Science, 348, 1422–1425. https://doi.org/10.1126/science.aab2374.
Nosek, B. A., and Bar-Anan, Y. (2012), “Scientific utopia: I. Opening scientific communication,” Psychological Inquiry, 23, 217–243. https://doi.org/10.1080/1047840X.2012.692215.
Nosek, B. A., Spies, J. R., and Motyl, M. (2012), “Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability,” Perspectives on Psychological Science, 7, 615–631. https://doi.org/10.1177/1745691612459058.
O’Connor, C., and Joffe, H. (2020), “Intercoder reliability in qualitative research: Debates and practical guidelines,” International Journal of Qualitative Methods, 19. https://doi.org/10.1177/1609406919899220.
O’Hagan, A. (1979), “On outlier rejection phenomena in Bayes inference,” Journal of the Royal Statistical Society. Series B (Methodological), 41, 358–367. https://doi.org/10.1111/j.2517-6161.1979.tb01090.x.
Obaid, O. I., Ali, A. H., and Yaseen, M. G. (2023), “Impact of Chat GPT on scientific research: Opportunities, risks, limitations, and ethical issues,” Iraqi Journal for Computer Science and Mathematics, 4. https://doi.org/10.52866/ijcsm.2023.04.04.002.
Ooms, J. (2024a), magick: Advanced graphics and image-processing in R.
Ooms, J. (2024b), rsvg: Render SVG images into PDF, PNG, (encapsulated) PostScript, or bitmap arrays.
Open Science Collaboration (2015), “Estimating the reproducibility of psychological science,” Science, 349, 943. https://doi.org/10.1126/science.aac4716.
OpenAI (2024a), “GPT-4o mini,” OpenAI.
OpenAI (2024b), “Hello GPT-4o,” OpenAI.
OpenAI (2024c), “Introducing OpenAI o1,” OpenAI.
OpenAI (2025a), API reference manual, OpenAI Platform.
OpenAI (2024d), Enterprise privacy, OpenAI.
OpenAI (2025b), Pricing, OpenAI.
OpenAI (2022), “Introducing ChatGPT,” OpenAI.
Ornstein, J. T., Blasingame, E. N., and Truscott, J. S. (2024), “How to train your stochastic parrot: Large language models for political texts,” GitHub.
Osmundsen, M., Hendry, D. J., Laustsen, L., Smith, K. B., and Petersen, M. B. (2022), “The psychophysiology of political ideology: Replications, reanalyses, and recommendations,” The Journal of Politics, 84, 50–66. https://doi.org/10.1086/714780.
Oxley, D. R., Smith, K. B., Alford, J. R., Hibbing, M. V., Miller, J. L., Scalora, M., Hatemi, P. K., and Hibbing, J. R. (2008), “Political attitudes vary with physiological traits,” Science, 321, 1667–1670. https://doi.org/10.1126/science.1157627.
Ozdoganlar, O. B., and Endres, W. J. (2021[1999]), “Parallel-process (simultaneous) machining and its stability,” ASME 1999 International Mechanical Engineering Congress and Exposition, 14–19, 361–368. https://doi.org/10.1115/IMECE1999-0693.
Pang, C., Jiang, X., Pavinkurve, N. P., Kalluri, K. S., Minto, E. L., Patterson, J., Zhang, L., Hripcsak, G., Gürsoy, G., Elhadad, N., and Natarajan, K. (2024), “CEHR-GPT: Generating electronic health records with chronological patient timelines,” arXiv, 2402.04400v2. https://doi.org/10.48550/arxiv.2402.04400.
Pang, X., Liu, L., and Xu, Y. (2022), “A bayesian alternative to synthetic control for comparative case studies,” Political Analysis, 30, 269–288. https://doi.org/10.1017/pan.2021.22.
Park, C., Kim, H., and Wang, M. (2022), “Investigation of finite-sample properties of robust location and scale estimators,” Communications in Statistics - Simulation and Computation, 51, 2619–2645. https://doi.org/10.1080/03610918.2019.1699114.
Park, C., and Wang, M. (2020), “A study on the x-bar and s control charts with unequal sample sizes,” Mathematics, 8, 698. https://doi.org/10.3390/math8050698.
Park, C., and Wang, M. (2022), rQCC: Robust quality control chart.
Park, K.-S., and Choi, H. (2024), “How to harness the power of GPT for scientific research: A comprehensive review of methodologies, applications, and ethical considerations,” Nuclear Medicine and Molecular Imaging, 58, 323–331. https://doi.org/10.1007/s13139-024-00876-z.
Pasek, J., Tahk, A., Culter, G., and Schwemmle, M. (2021), weights: Weighting and weighted statistics.
Patterson, H. D., and Thompson, R. (1971), “Recovery of inter-block information when block sizes are unequal,” Biometrika, 58, 545–554. https://doi.org/10.1093/biomet/58.3.545.
Pawitan, Y. (2001), In all likelihood: Statistical modelling and inference using likelihood, Clarendon Press.
Pearl, J. (1988), Probabilistic reasoning in intelligent systems: Networks of plausible inference, San Mateo, Calif.: Morgan Kaufmann Publishers, Inc.
Pearl, J. (1993), “Graphical models, causality and intervention,” Statistical Science, 8, 266–269. https://doi.org/10.1214/ss/1177010894.
Pearl, J. (1995), “Causal diagrams for empirical research,” Biometrika, 82, 669–710. https://doi.org/10.1093/biomet/82.4.669.
Pearl, J. (2000), Causality: Models, reasoning, and inference, Cambridge University Press.
Pearson, K. (1895a), “Notes on regression and inheritance in the case of two parents,” Proceedings of the Royal Society of London, 58, 240–242.
Pearson, K. (2015[1900]), The grammar of science, Cambridge: Cambridge University Press.
Pearson, K. (1895b), “Contributions to the mathematical theory of evolution. II. Skew variation in homogeneous material,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 186, 343–414. https://doi.org/10.1098/rsta.1895.0010.
Pedersen, M. J., and Stritch, J. M. (2018), “RNICE model: Evaluating the contribution of replication studies in public administration and management research,” Public Administration Review, 78, 606–612. https://doi.org/10.1111/puar.12910.
Peirce, C. S., and Jastrow, J. (1885), “On small differences in sensation,” Memoirs of the National Academy of Sciences, 3, 73–83.
Peng, R. D. (2011), “Reproducible research in computational science,” Science, 334, 1226–1227. https://doi.org/10.1126/science.1213847.
Pennebaker, J. W., Mehl, M. R., and Niederhoffer, K. G. (2003), “Psychological aspects of natural language use: Our words, our selves,” Annual Review of Psychology, 54, 547–577. https://doi.org/10.1146/annurev.psych.54.101601.145041.
Petersen, M. B. (2012), “Social welfare as small-scale help: Evolutionary psychology and the deservingness heuristic,” American Journal of Political Science, 56, 1–16. https://doi.org/10.1111/j.1540-5907.2011.00545.x.
Petersen, M. B., Slothuus, R., Stubager, R., and Togeby, L. (2011), “Deservingness versus values in public opinion on welfare: The automaticity of the deservingness heuristic,” European Journal of Political Research, 50, 25–52. https://doi.org/10.1111/j.1475-6765.2010.01923.x.
Petersen, M. B., Sznycer, D., Cosmides, L., and Tooby, J. (2012), “Who deserves help? Evolutionary psychology, social emotions, and public opinion about welfare,” Political Psychology, 33, 395–418. https://doi.org/10.1111/j.1467-9221.2012.00883.x.
Pietraszewski, D., Curry, O. S., Petersen, M. B., Cosmides, L., and Tooby, J. (2015), “Constituents of political cognition: Race, party politics, and the alliance detection system,” Cognition, 140, 24–39. https://doi.org/10.1016/j.cognition.2015.03.007.
Pillai, A. S., and Tedesco, R. (2023), Machine learning and deep learning in natural language processing, CRC Press.
Popper, K. (2002[1959]), The logic of scientific discovery, New York: Routledge.
Posit Team (2024), RStudio: Integrated development environment for r, Boston, MA: Posit Software, PBC.
Powell, D. (2022), “Synthetic control estimation beyond comparative case studies: Does the minimum wage reduce employment?” Journal of Business & Economic Statistics, 40, 1302–1314. https://doi.org/10.1080/07350015.2021.1927743.
R Core Team (2024), R: A language and environment for statistical computing, Vienna, Austria: R Foundation for Statistical Computing.
Rabe-Hesketh, S., and Skrondal, A. (2012), Multilevel and longitudinal modeling using Stata, Stata Press.
Raftery, A. E. (1995), “Bayesian model selection in social research,” Sociological Methodology, 25, 111–164. https://doi.org/10.2307/271063.
Ramezani, M., Feizi-Derakhshi, M.-R., and Balafar, M.-A. (2022), “Text-based automatic personality prediction using KGrAt-net: A knowledge graph attention network classifier,” Scientific Reports, 12, 21453. https://doi.org/10.1038/s41598-022-25955-z.
Rheault, L., and Cochrane, C. (2020), “Word embeddings for the analysis of ideological placement in parliamentary corpora,” Political Analysis, 28, 112–133. https://doi.org/10.1017/pan.2019.26.
Riemann, G. (2014), “A situation, a narrative, and a life history. The case of Natalia,” Qualitative Sociology Review, 10, 20–36. https://doi.org/10.18778/1733-8077.10.1.02.
Ritchie, S. (2021), Science fictions: Exposing fraud, bias, negligence and hype in science, Vintage.
Rivera, H. (2019), “Begging the question,” in Bad arguments: 100 of the most important fallacies in western philosophy, eds. R. Arp, S. Barbone, and M. Bruce, John Wiley & Sons, Ltd, pp. 308–310.
Robert, C., and Casella, G. (2013), Monte Carlo statistical methods, Springer Nature.
Rodriguez, P. L., Spirling, A., and Stewart, B. M. (2023a), “Embedding regression: Models for context-specific description and inference,” American Political Science Review, 117, 1255–1274. https://doi.org/10.1017/S0003055422001228.
Rodriguez, P. L., Spirling, A., and Stewart, B. M. (2023b), “Multilanguage word embeddings for social scientists,” ALC Embeddings.
Rohrer, J. M. (2018), “Thinking clearly about correlations and causation: Graphical causal models for observational data,” Advances in Methods and Practices in Psychological Science, 1, 27–42. https://doi.org/10.1177/2515245917745629.
Rohrer, J. M. (2024), “Causal inference for psychologists who think that causal inference is not for them,” Social and Personality Psychology Compass, 18. https://doi.org/10.1111/spc3.12948.
Rosenbaum, P. R. (2007), “Interference between units in randomized experiments,” Journal of the American Statistical Association, 102, 191–200. https://doi.org/10.1198/016214506000001112.
Rosenbaum, P. R., and Rubin, D. B. (1983), “The central role of the propensity score in observational studies for causal effects,” Biometrika, 70, 41–55. https://doi.org/10.1093/biomet/70.1.41.
Rosenberg, A. (2016), Philosophy of social science, Routledge.
Rosenthal, G. (2018), Interpretive social research: An introduction, Universitätsverlag Göttingen. https://doi.org/10.17875/gup2018-1103.
Rubin, D. B. (1974), “Estimating causal effects of treatments in randomized and nonrandomized studies,” Journal of Educational Psychology, 66, 688–701. https://doi.org/10.1037/h0037350.
Rubin, D. B. (1988), “An overview of multiple imputation,” Proceedings of the Survey Research Methods Section, American Statistical Association, 79–84.
Rubin, D. B. (2005), “Causal inference using potential outcomes,” Journal of the American Statistical Association, 100, 322–331. https://doi.org/10.1198/016214504000001880.
Rubinstein, R. Y., and Kroese, D. P. (2016), Simulation and the Monte Carlo method, Wiley.
Rudnytskyi, I. (2023), openai: R wrapper for OpenAI API.
Ruggles, S., Flood, S., Sobek, M., Backman, D., Chen, A., Cooper, G., Richards, S., Rodgers, R., and Schouweiler, M. (2024), “IPUMS USA,” Minneapolis, MN: IPUMS. https://doi.org/10.18128/D010.V15.0.
Russo, A. (2019), “Burden of proof,” in Bad arguments: 100 of the most important fallacies in western philosophy, eds. R. Arp, S. Barbone, and M. Bruce, John Wiley & Sons, Ltd, pp. 137–139.
Ryan, A. M., Burgess Jr, J. F., and Dimick, J. B. (2015), “Why we should not be indifferent to specification choices for difference-in-differences,” Health Services Research, 50, 1211–1235. https://doi.org/10.1111/1475-6773.12270.
Saal, F. E., Downey, R. G., and Lahey, M. A. (1980), “Rating the ratings: Assessing the psychometric quality of rating data,” Psychological Bulletin, 88, 413. https://doi.org/10.1037/0033-2909.88.2.413.
Sagan, C. (1979), Broca’s brain: Reflections on the romance of science, New York: Random House.
Saka, A., Taiwo, R., Saka, N., Salami, B. A., Ajayi, S., Akande, K., and Kazemi, H. (2024), “GPT models in construction industry: Opportunities, limitations, and a use case validation,” Developments in the built environment, 17, 100300. https://doi.org/10.1016/j.dibe.2023.100300.
Samii, C. (2016), “Causal empiricism in quantitative research,” The Journal of Politics, 78, 941–955. https://doi.org/10.1086/686690.
Schmuckler, M. A. (2001), “What is ecological validity? A dimensional analysis,” Infancy, 2, 419–436. https://doi.org/10.1207/S15327078IN0204_02.
Schölkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., and Bengio, Y. (2021), “Towards causal representation learning,” arXiv. https://doi.org/10.48550/arxiv.2102.11107.
Schütze, F. (1997), “Autobiographical accounts of war experiences. An outline for the analysis of topically focused autobiographical texts—using the example of the ‘Robert Rasmus’ account in Studs Terkel’s book The Good War,” Qualitative Sociology Review, 10, 224–283. https://doi.org/10.18778/1733-8077.10.1.10.
Schütze, F. (2008), “Biography analysis on the empirical base of autobiographical narratives: How to analyze autobiographical narrative interviews—part one and two,” European Studies on Inequalities and Social Cohesion, 153–297.
Scott, W. A. (1955), “Reliability of content analysis: The case of nominal scale coding,” Public Opinion Quarterly, 19, 321–325. https://doi.org/10.1086/266577.
Seawright, J. (2002), “Testing for necessary and/or sufficient causation: Which cases are relevant?” Political Analysis, 10, 178–193. https://doi.org/10.1093/pan/10.2.178.
Sen, P. K., and Singer, J. M. (1993), Large sample methods in statistics: An introduction with applications, Chapman & Hall/CRC.
Shadish, W. R., Cook, T. D., and Campbell, D. T. (2002), Experimental & quasi-experimental designs for generalized causal inference, Boston: Houghton Mifflin.
Shaffer, J. P. (1995), “Multiple hypothesis testing,” Annual Review of Psychology, 46, 561–584. https://doi.org/10.1146/annurev.ps.46.020195.003021.
Shanahan, M. (2024), “Talking about large language models,” Communications of the ACM, 67, 68–79. https://doi.org/10.1145/3624724.
Sharpe, D. (1997), “Of apples and oranges, file drawers and garbage: Why validity issues in meta-analysis will not go away,” Clinical Psychology Review, 17, 881–901. https://doi.org/10.1016/S0272-7358(97)00056-1.
Sharpe, D., and Poets, S. (2020), “Meta-analysis as a response to the replication crisis,” Canadian Psychology / Psychologie Canadienne, 61, 377–387. https://doi.org/10.1037/cap0000215.
Shaughnessy, J. J., Zechmeister, E. B., and Zechmeister, J. S. (2015), Research methods in psychology, McGraw-Hill.
Shea, Y.-F., and Ma, N.-Y. C. (2023), “Limitations of GPT‐4 in analyzing real‐life medical notes related to cognitive impairment,” Psychogeriatrics, 23, 885–887. https://doi.org/10.1111/psyg.13002.
Sheagley, G., and Clifford, S. (2025), “No evidence that measuring moderators alters treatment effects,” American Journal of Political Science, 69, 49–63. https://doi.org/10.1111/ajps.12814.
Sheng, Y., and Sheng, Z. (2012), “Is coefficient alpha robust to non-normal data?” Frontiers in Psychology, 3, 34. https://doi.org/10.3389/fpsyg.2012.00034.
Silge, J., and Robinson, D. (2016), “tidytext: Text mining and analysis using tidy data principles in R,” Journal of Open Source Software, 1. https://doi.org/10.21105/joss.00037.
Simmons, J. P., Nelson, L. D., and Simonsohn, U. (2011), “False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant,” Psychological Science, 22, 1359–1366. https://doi.org/10.1177/0956797611417632.
Simon, H. A. (1955), “A behavioral model of rational choice,” The Quarterly Journal of Economics, 69, 99–118. https://doi.org/10.2307/1884852.
Skitka, L. J., and Tetlock, P. E. (1993), “Providing public assistance: Cognitive and motivational processes underlying liberal and conservative policy preferences,” Journal of Personality and Social Psychology, 65, 1205–1223. https://doi.org/10.1037/0022-3514.65.6.1205.
Skovron, C., and Titiunik, R. (2015), “A practical guide to regression discontinuity designs in political science.”
Slothuus, R. (2016), “Assessing the influence of political parties on public opinion: The challenge from pretreatment effects,” Political Communication, 33, 302–327. https://doi.org/10.1080/10584609.2015.1052892.
Sniderman, P. M. (2011), “The logic and design of the survey experiment: An autobiography of a methodological innovation,” in Cambridge handbook of experimental political science, eds. J. N. Druckman, D. P. Green, J. H. Kuklinski, and A. Lupia, Cambridge University Press, pp. 102–114.
Sniderman, P. M., and Grob, D. B. (1996), “Innovations in experimental design in attitude surveys,” Annual Review of Sociology, 22, 377–399. https://doi.org/10.1146/annurev.soc.22.1.377.
Sørensen, B., Magnusson, S. P., Couppé, C., Suetta, C., Johannsen, F. E., Kristensen, M. T., and Aagaard, P. (2024), “Intra‐rater reliability of panoramic ultrasound imaging for determining quadriceps muscle cross‐sectional area in middle‐aged and elderly adults,” Scandinavian Journal of Medicine & Science in Sports, 34, e14749. https://doi.org/10.1111/sms.14749.
Spearman, C. (1904), “The proof and measurement of association between two things,” The American Journal of Psychology, 15, 72–101. https://doi.org/10.2307/1412159.
Sperber, D. (1976), Explaining culture: A naturalistic approach, Blackwell Publishing.
Sperber, D., and Hirschfeld, L. A. (2004), “The cognitive foundations of cultural stability and diversity,” Trends in Cognitive Science, 8, 40–46. https://doi.org/10.1016/j.tics.2003.11.002.
Stan Development Team (2024), Stan reference manual.
Stanfield, J. H. (1987), “Life history analysis and racial stratification research,” The American Behavioral Scientist, 30, 429.
StataCorp (2023), “Stata statistical software,” College Station, TX: StataCorp LLC.
Steegen, S., Tuerlinckx, F., Gelman, A., and Vanpaemel, W. (2016), “Increasing transparency through a multiverse analysis,” Perspectives on Psychological Science, 11, 702–712. https://doi.org/10.1177/1745691616658637.
Stevens, S. S. (1946), “On the theory of scales of measurement,” Science, 103, 677–680. https://doi.org/10.1126/science.103.2684.677.
Stigler, S. M. (1989), “Francis Galton’s account of the invention of correlation,” Statistical Science, 4, 73–79. https://doi.org/10.1214/ss/1177012580.
Stock, J. H., and Trebbi, F. (2003), “Retrospectives: Who invented instrumental variable regression?” The Journal of Economic Perspectives, 17, 177–194. https://doi.org/10.1257/089533003769204416.
Stock, J., and Watson, M. W. (2019), Introduction to econometrics, Pearson Education Limited.
Stone, M. (1974), “Cross-validatory choice and assessment of statistical predictions,” Journal of the Royal Statistical Society, Series B (Methodological), 36, 111–147. https://doi.org/10.1111/j.2517-6161.1974.tb00994.x.
Strübing, A., McKibbon, C., Ruan, H., Mackay, E., Dennis, N., Velummailum, R., He, P., Tanaka, Y., Xiong, Y., Springford, A., and Rosenlund, M. (2023), “CO90 augmenting synthetic control arms using bayesian borrowing: A case study in first-line non-small cell lung cancer,” Value in health, 26, S30. https://doi.org/10.1016/j.jval.2023.09.162.
Stuart, E. A. (2010), “Matching methods for causal inference: A review and a look forward,” Statistical Science, 25, 1–21. https://doi.org/10.1214/09-sts313.
Studer, M. (2021), “Validating sequence analysis typologies using parametric bootstrap,” Sociological Methodology, 51, 290–318. https://doi.org/10.1177/00811750211014232.
Sullivan, L. M. (2008), “Repeated measures,” Circulation, 117, 1238–1243. https://doi.org/10.1161/CIRCULATIONAHA.107.654350.
Syed, M., and Nelson, S. C. (2015), “Guidelines for establishing reliability when coding narrative data,” Emerging Adulthood, 3, 375–387. https://doi.org/10.1177/2167696815587648.
Tadesse, M. M., Lin, H., Xu, B., and Yang, L. (2018), “Personality predictions based on user behavior on the facebook social media platform,” IEEE Access, 6, 61969. https://doi.org/10.1109/ACCESS.2018.2876502.
Textor, J., van der Zander, B., Gilthorpe, M. S., Liśkiewicz, M., and Ellison, G. T. H. (2016), “Robust causal inference using directed acyclic graphs: The r package ’dagitty’,” International Journal of Epidemiology, 45, 1887–1894. https://doi.org/10.1093/ije/dyw341.
Tharwat, A. (2018), “Classification assessment methods,” Applied Computing and Informatics, 17, 168–192. https://doi.org/10.1016/j.aci.2018.08.003.
Theocharis, Y., and Jungherr, A. (2021), “Computational social science and the study of political communication,” Political Communication, 38, 1–22. https://doi.org/10.1080/10584609.2020.1833121.
Thibodeau, P. H., and Boroditsky, L. (2011), “Metaphors we think with: The role of metaphor in reasoning,” PloS One, 6, e16782. https://doi.org/10.1371/journal.pone.0016782.
Thistlethwaite, D. L., and Campbell, D. T. (1960), “Regression-discontinuity analysis: An alternative to the ex post facto experiment,” Journal of Educational Psychology, 51, 309–317. https://doi.org/10.1037/h0044319.
Thompson, K. (1968), “Programming techniques: Regular expression search algorithm,” Communications of the ACM, 11, 419–422. https://doi.org/10.1145/363347.363387.
Thompson, S. K. (2012), Sampling, John Wiley & Sons.
Thomson, G. P., and Reid, A. (1927), “Diffraction of cathode rays by a thin film,” Nature, 119, 890. https://doi.org/10.1038/119890a0.
Toshkov, D. (2016), Research design in political science, Palgrave.
Tukey, J. W. (1977), Exploratory data analysis, Addison-Wesley Publishing Company.
Tversky, A., and Kahneman, D. (1981), “The framing of decisions and the psychology of choice,” Science, 211, 453–458. https://doi.org/10.1126/science.7455683.
Uhlmann, E. L., Ebersole, C. R., Chartier, C. R., Errington, T. M., Kidwell, M. C., Lai, C. K., McCarthy, R. J., Riegelman, A., Silberzahn, R., and Nosek, B. A. (2019), “Scientific utopia III: Crowdsourcing science,” Perspectives on Psychological Science, 14, 711–733. https://doi.org/10.1177/1745691619850561.
Van Atteveldt, W., Sheafer, T., Shenhav, S. R., and Fogel-Dror, Y. (2017), “Clause analysis: Using syntactic information to automatically extract source, subject, and predicate from texts with an application to the 2008-2009 gaza war,” Political Analysis, 25, 207–222. https://doi.org/10.1017/pan.2016.12.
van Buuren, S., and Groothuis-Oudshoorn, K. (2011), “mice: Multivariate imputation by chained equations in R,” Journal of Statistical Software, 45, 1–67. https://doi.org/10.18637/jss.v045.i03.
Van Fraassen, B. C. (2002), The empirical stance, Yale University Press.
van Lange, P. A. M., Balliet, D., Parks, C. D., and van Vugt, M. (2013), Social dilemmas: The psychology of human cooperation, Oxford University Press.
van Oorschot, W. (2000), “Who should get what, and why? On deservingness criteria and the conditionality of solidarity among the public,” Policy and Politics, 28, 33–48. https://doi.org/10.1332/0305573002500811.
Van Prooijen, J.-W., and van Vugt, M. (2018), “Conspiracy theories: Evolved functions and psychological mechanisms,” Perspectives on Psychological Science, 13, 770–788. https://doi.org/10.1177/1745691618774270.
VanderWeele, T. J., and Shpitser, I. (2013), “On the definition of a confounder,” Annals of Statistics, 41, 196–220. https://doi.org/10.1214/12-aos1058.
Vectara (2025), Hallucination evaluation leaderboard, Huggingface.
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., and Bürkner, P.-C. (2021), “Rank-normalization, folding, and localization: An improved \(\hat{R}\) for assessing convergence of MCMC (with discussion),” Bayesian Analysis, 16, 667–718. https://doi.org/10.1214/20-BA1221.
Viechtbauer, W. (2010), “Conducting meta-analyses in R with the metafor package,” Journal of Statistical Software, 36, 1–48. https://doi.org/10.18637/jss.v036.i03.
von Hippel, P. T. (2020), “How many imputations do you need? A two-stage calculation using a quadratic rule,” Sociological Methods & Research, 49, 699–718. https://doi.org/10.1177/0049124117747303.
Wagenmakers, E.-J., Lodewyckx, T., Kuriyal, H., and Grasman, R. (2010), “Bayesian hypothesis testing for psychologists: A tutorial on the savage-dickey method,” Cognitive Psychology, 60, 158–189. https://doi.org/10.1016/j.cogpsych.2009.12.001.
Wallace, D. S., Abduk-Khaliq, A., Czuchry, M., and Sia, T. L. (2009), “African Americans’ political attitudes, party affiliation, and voting behavior,” Journal of African American Studies, 13, 139–146. https://doi.org/10.1007/s12111-008-9040-y.
Wan, F., Sutcliffe, S., Zhang, J., and Small, D. (2024), “Does matching introduce confounding or selection bias into the matched case-control design?” Observational Studies, 10, 1–9. https://doi.org/10.1353/obs.2024.a929114.
Weiner, B. (1995), Judgments of responsibility: A foundation for a theory of social conduct, Guilford Press.
Weingarten, E., Chen, Q., McAdams, M., Yi, J., Hepler, J., and Albarracín, D. (2016), “From primed concepts to action: A meta-analysis of the behavioral effects of incidentally presented words,” Psychological Bulletin, 142, 472–497. https://doi.org/10.1037/bul0000030.
Welbers, K., van Atteveldt, W., and Benoit, K. (2017), “Text analysis in R,” Communication Methods and Measures, 11, 245–265. https://doi.org/10.1080/19312458.2017.1387238.
Westreich, D., Edwards, J. K., Cole, S. R., Platt, R. W., Mumford, S. L., and Schisterman, E. F. (2015), “Imputation approaches for potential outcomes in causal inference,” International Journal of Epidemiology, 44, 1731–1737. https://doi.org/10.1093/ije/dyv135.
Whorf, B. L. (2011[1956]), Language, thought, and reality: Selected writings of Benjamin Lee Whorf, Martino Fine Books.
Wickham, H. (2016), ggplot2: Elegant graphics for data analysis, Springer-Verlag New York.
Wickham, H. (2023), stringr: Simple, consistent wrappers for common string operations.
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., Takahashi, K., Vaughan, D., Wilke, C., Woo, K., and Yutani, H. (2019), “Welcome to the tidyverse,” Journal of Open Source Software, 4, 1686. https://doi.org/10.21105/joss.01686.
Wickham, H., François, R., Henry, L., Müller, K., and Vaughan, D. (2023a), dplyr: A grammar of data manipulation.
Wickham, H., Hester, J., and Bryan, J. (2024b), readr: Read rectangular text data.
Wickham, H., Miller, E., and Smith, D. (2023c), haven: Import and export ’SPSS’, ’Stata’ and ’SAS’ files.
Wickham, H., Pedersen, T. L., and Seidel, D. (2023b), scales: Scale functions for visualization.
Wickham, H., Vaughan, D., and Girlich, M. (2024a), tidyr: Tidy messy data.
Wild, F. (2022), lsa: Latent semantic analysis.
Wilkinson, L., and The Task Force on Statistical Inference (1999), “Statistical methods in psychology journals: Guidelines and explanations,” American Psychologist, 54, 594–604. https://doi.org/10.1037/0003-066X.54.8.594.
Williams, T., and Khan, R. M. (2020), The vanishing indian upper class: Life history of raza mohammed khan, Anthem Press.
Wood, S. N. (2004), “Stable and efficient multiple smoothing parameter estimation for generalized additive models,” Journal of the American Statistical Association, 99, 673–686. https://doi.org/10.1198/016214504000000980.
Wood, S. N. (2011), “Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models,” Journal of the Royal Statistical Society Series B: Statistical Methodology, 73, 3–36. https://doi.org/10.1111/j.1467-9868.2010.00749.x.
Wood, S. N. (2017), Generalized additive models: An introduction with R, New York, NY: Chapman & Hall/CRC. https://doi.org/10.1201/9781315370279.
Woodworth, R. S., and Thorndike, E. L. (1901), “The influence of improvement in one mental function upon the efficiency of other functions (i),” Psychological Review, 8, 247. https://doi.org/10.1037/h0074898.
Wooldridge, J. M. (2019), Introductory econometrics, Cengage.
Wright, J. D. (2015), International encyclopedia of the social & behavioral sciences, Elsevier.
Wright, P. G. (1928), The tariff on animal and vegetable oils, The MacMillan Company.
Wright, S. P. (1992), “Adjusted p-values for simultaneous inference,” Biometrics, 48, 1005–1013. https://doi.org/10.2307/2532694.
Wu, H., and Leung, S.-O. (2017), “Can likert scales be treated as interval scales?-a simulation study,” Journal of Social Service Research, 43, 527–532. https://doi.org/10.1080/01488376.2017.1329775.
Wulff, J. N., and Taylor, L. (2024), “How and why alpha should depend on sample size: A bayesian-frequentist compromise for significance testing,” Strategic Organization, 22, 550–581. https://doi.org/10.1177/14761270231214429.
Wygnańska, J. (2019), “Migration as a source of suffering in the context of the biographical experience of the process of transformation in Poland. Case study of Weronika’s life history,” Qualitative Sociology Review, 15, 112–139. https://doi.org/10.18778/1733-8077.15.4.06.
Xie, Y. (2024b), knitr: A general-purpose package for dynamic report generation in R.
Xie, Y. (2024a), bookdown: Authoring books and technical documents with R markdown.
Xie, Y. (2014), “knitr: A comprehensive tool for reproducible research in R,” in Implementing reproducible computational research, eds. V. Stodden, F. Leisch, and R. D. Peng, Chapman & Hall/CRC.
Xie, Y. (2015), Dynamic documents with R and knitr, Boca Raton, Florida: Chapman & Hall/CRC.
Xie, Y. (2016), bookdown: Authoring books and technical documents with R markdown, Boca Raton, Florida: Chapman & Hall/CRC.
Xie, Y., Allaire, J. J., and Grolemund, G. (2018), R markdown: The definitive guide, Boca Raton, Florida: Chapman & Hall/CRC. https://doi.org/10.1201/9781138359444.
Xie, Y., Allaire, J., and Horner, J. (2024), markdown: Render markdown with ’commonmark’.
Xie, Y., Dervieux, C., and Riederer, E. (2020), R markdown cookbook, Boca Raton, Florida: Chapman & Hall/CRC.
Xu, Q.-S., and Liang, Y.-Z. (2001), “Monte Carlo cross validation,” Chemometrics and Intelligent Laboratory Systems, 56, 1–11. https://doi.org/10.1016/S0169-7439(00)00122-2.
Xu, X., and Peterson, J. B. (2017), “Differences in media preference mediate the link between personality and political orientation,” Political Psychology, 38, 55–72. https://doi.org/10.1111/pops.12307.
Yang, A. C., Peng, C.-K., and Huang, N. E. (2018), “Causal decomposition in the mutual causation system,” Nature Communications, 9, 3378. https://doi.org/10.1038/s41467-018-05845-7.
Yarkoni, T. (2020), “The generalizability crisis,” The Behavioral and Brain Sciences, 45, e1. https://doi.org/10.1017/S0140525X20001685.
Yenduri, G., Ramalingam, M., Selvi, G. C., Supriya, Y., Srivastava, G., Maddikunta, P. K. R., Raj, G. D., Jhaveri, R. J., Prabadevi, B., Wang, W., Vasilakos, A. V., and Gadekallu, T. R. (2024), “GPT (generative pre-trained transformer)—a comprehensive review on enabling technologies, potential applications, emerging challenges, and future direction,” IEEE Access, 12, 54608–54649. https://doi.org/10.1109/ACCESS.2024.3389497.
Yerushalmy, J. (1947), “Statistical problems in assessing methods of medical diagnosis with special reference to x-ray techniques,” Public Health Reports, 62, 1432–1439. https://doi.org/10.2307/4586294.
Yoder-Wise, P. S. (2023), “From GIGO to BIBO,” The Journal of Continuing Education in Nursing, 54, 543. https://doi.org/10.3928/00220124-20231113-01.
Yuan, A. (2009), “Bayesian frequentist hybrid inference,” The Annals of Statistics, 37, 2458–2501. https://doi.org/10.1214/08-AOS649.
Zanna, M. P., Olson, J. M., and Fazio, R. H. (1980), “Attitude-behavior consistency: An individual difference perspective,” Journal of Personality and Social Psychology, 38, 432–440. https://doi.org/10.1037/0022-3514.38.3.432.
Zeilinger, A. (1999), “Experiment and the foundations of quantum physics,” Reviews of Modern Physics, 71, S288–S297. https://doi.org/10.1103/revmodphys.71.s288.
Zhang, C., Lu, J., and Zhao, Y. (2024), “Generative pre-trained transformers (GPT)-based automated data mining for building energy management: Advantages, limitations and the future,” Energy and built environment, 5, 143–169. https://doi.org/10.1016/j.enbenv.2023.06.005.
Zhu, H. (2024), kableExtra: Construct complex table with ’kable’ and pipe syntax.
Zivenko, O., Walton, N. A. W., Fritsch, W., Forbes, J., Lewis, A. M., Clark, A., Brown, J. M., and Sobes, V. (2025), “Validating automated resonance evaluation with synthetic data,” Annals of Nuclear Energy, 212, 111081. https://doi.org/10.1016/j.anucene.2024.111081.
Znaniecki, F., and Thomas, W. I. (2020[1974]), The Polish peasant in Europe and America, Alpha Editions.

10 Appendix

10.1 Changelog

  • 2025-03-18 18:51 CET
    • [Version 2025-03-18-18-51] - Preview Version (Large-N Release).
  • 2025-01-20 14:42 CET
    • [Version 2025-01-20-14-42] - Preview Version (Introduction & Theory Release).
  • 2024-10-16 15:16 CET
    • [Version 2024-10-16-15-16] - Preview Version (Initial Release).
  • 2024-09-15 15:31:41 CET
    • [Version 2024-09-15-15-31-41] - First Version.

10.2 Software Specifications

This document was created on a Windows 10 x64 (build 19045) operating system (OS) on 2025-03-18 at 18:51 CET (timezone: Europe/Copenhagen). The machine has 8 cores available, and the range of doubles is \(2.23 \times 10^{-308}\) to \(1.8 \times 10^{+308}\).

All code is written in R version 4.4.1 (2024-06-14 ucrt), while the document is written with markdown (Gruber 2014) and LaTeX (Lamport 1986) in RStudio version 2024.4.2.764, using the R packages knitr (Xie 2024b; v1.47, Xie 2014, 2015), markdown (v1.13, Xie et al. 2024), rmarkdown (Allaire et al. 2024; v2.27, Xie et al. 2018, 2020), and bookdown (Xie 2024a; v0.40, Xie 2016). The code makes extensive use of R packages that are part of the tidyverse (v2.0.0, Wickham et al. 2019), including dplyr (v1.1.4, Wickham et al. 2023a), tidyr (v1.3.1, Wickham et al. 2024a), readr (v2.1.5, Wickham et al. 2024b), stringr (v1.5.1, Wickham 2023), lubridate (v1.9.3, Grolemund and Wickham 2011), and tidytext (v0.4.2, Silge and Robinson 2016). Graphs were produced with ggplot2 (v3.5.1, Wickham 2016), bayesplot (v1.11.1, Gabry et al. 2019), tidybayes (v3.0.6, Kay 2023), dagitty (v0.3.4, Textor et al. 2016), ggdag (v0.2.13, Barrett 2024), latex2exp (v0.9.6, Meschiari 2022), magick (v2.8.3, Ooms 2024a), rsvg (v2.6.1, Ooms 2024b), ggdist (v3.3.2, Kay 2024a; 2024b), and ggpubr (v0.6.0, Kassambara 2023). Calls to OpenAI’s API were made with the openai R package (v0.4.1, Rudnytskyi 2023), while the parallel processing necessary for the ‘parallel worlds’-estimation procedure was implemented using the parallel (v4.4.1, R Core Team 2024), doParallel (v1.0.17, Microsoft and Weston 2022a), and foreach (v1.5.2, Microsoft and Weston 2022b) R packages. The seed for random number generation (RNG), SEED, was exogenously determined using the random R package (v0.2.6, Eddelbuettel 2017), which draws its randomness from random.org. Rater reliability was assessed using caret (v6.0.94, Kuhn 2008) and sklarsomega (v3.0.2, Hughes 2023). Statistical estimation and inference were conducted using the R packages lme4 (v1.1.35.4, Bates et al. 2015), broom.mixed (v0.2.9.6, Bolker and Robinson 2024), fixest (v0.12.1, Bergé 2018), mice (v3.16.0, van Buuren and Groothuis-Oudshoorn 2011), howManyImputations (v0.2.5, von Hippel 2020), randomForest (v4.7.1.1, Liaw and Wiener 2002), mgcv (v1.9.1, Wood 2004, 2011, 2017), metafor (v4.6.0, Viechtbauer 2010), brms (v2.21.0, Bürkner 2017, 2018), bayestestR (v0.13.2, Makowski et al. 2019a), effectsize (v0.8.9, Ben-Shachar et al. 2020), e1071 (v1.7.14, Meyer et al. 2023), rQCC (Park et al. 2022; v2.22.12, Park and Wang 2020, 2022), HuraultMisc (v1.1.1, Hurault 2021), lsa (v0.73.3, Wild 2022), and summarytools (v1.0.1, Comtois 2022). Data were formatted using the scrutiny (v0.4.0, Jung 2024), weights (v1.0.4, Pasek et al. 2021), scales (v1.3.0, Wickham et al. 2023b), and kableExtra (v1.4.0, Zhu 2024) R packages. Data from the IPUMS (Ruggles et al. 2024) and ANES (2021) were read and processed using the haven (v2.5.4, Wickham et al. 2023c), labelled (v2.13.0, Larmarange 2024), and sjlabelled (v1.2.0, Lüdecke 2022) R packages.
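
To illustrate this setup, a minimal R sketch of such a parallel backend is given below. It is an illustrative reconstruction, not the study’s own code: the number of workers and iterations, the worker body, and the seed value (a stand-in for SEED, which was drawn from random.org) are assumptions.

## Minimal sketch of the parallel backend described above; all specifics
## are illustrative placeholders, not the study's own code.
library(parallel)
library(doParallel)
library(foreach)

n_cores <- parallel::detectCores() - 1        # leave one core for the OS
cl <- parallel::makeCluster(n_cores)
parallel::clusterSetRNGStream(cl, iseed = 42) # placeholder for SEED
doParallel::registerDoParallel(cl)

## Each iteration could, e.g., elicit one synthetic unit's outcome under
## one manipulation condition; here it merely returns a random draw.
results <- foreach(i = 1:8, .combine = c) %dopar% {
  rnorm(1)
}

parallel::stopCluster(cl)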

10.3 Meta-Analysis

This section details the meta-analysis conducted on the studies of the effects of deservingness on welfare attitudes by Aarøe and Petersen (2014), Mummolo and Peterson (2019), and Kane et al. (2023). The specific estimates used for the meta-analysis are provided in table 10.1. Based on McElreath (2019) and Nalborczyk et al. (2019), a formalization of the Bayesian Generalized Linear Model with varying intercepts used for the meta-analysis is provided in table 10.2. Traceplots for assessing model convergence are provided in figure 10.1. The posterior predictive distribution of the model is provided visually in figure 10.2. The explanatory power of the Bayesian Meta-Analysis model (i.e., Bayesian \(R^2\), Gelman et al. 2019) is provided metrically in table 10.3 and visually in figure 10.3. For the estimated parameters, following Bowley (1910) and Tukey (1910), posterior descriptions are provided as 13-number summaries. The estimated fixed/mean intercept (\(\hat{\mu_\alpha}\)) is provided metrically in table 10.4 and visually in figure 10.4. Similarly, the intercept dispersion (\(\hat{\sigma_\alpha}\)) is provided metrically in table 10.5 and visually in figure 10.5. The model’s estimated degrees of freedom (\(\hat{\nu}\)) is provided metrically in table 10.6 and visually in figure 10.6. When possible and meaningful, the mode, median, mean, interquartile range (IQR), skewness, kurtosis, posterior predictive p-value (\(PP_P\), Gelman et al. 1996), chain convergence (\(\hat{R}\), Gelman and Rubin 1992; Vehtari et al. 2021), effective sample size (ESS, Bürkner 2017), and Monte-Carlo Standard Error (MCSE, Kruschke 2014) are listed in the table notes.
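As used throughout this appendix, a 13-number summary is simply the set of quantiles of the posterior draws at 13 fixed percentiles. A minimal R sketch, assuming a hypothetical numeric vector `draws` of posterior samples for one parameter:

```r
# Percentiles used in the 13-number summaries of this appendix
probs <- c(0, .001, .01, .05, .10, .25, .50, .75, .90, .95, .99, .999, 1)

# `draws` is a hypothetical vector of posterior draws for one parameter
summary_13 <- quantile(draws, probs = probs)
```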

10.3.1 Studies

Table 10.1: Studies used in the Meta-Analysis
Study Estimate Sample Size Collection Method
Aarøe & Petersen (2014) .422 (.030) [.363; .481] 1,009 YouGov
Mummolo & Peterson (2019) .550 (.023) [.505; .595] 1,874 MTurk
Mummolo & Peterson (2019) .457 (.016) [.426; .488] 5,550 Qualtrics
Kane et al. (2023) .583 (.026) [.532; .634] 804 MTurk

NOTE: Normalized coefficients (0 - 1) with robust standard errors in parentheses and 95% confidence intervals in brackets. The estimates listed under Aarøe and Petersen (2014) were derived from recomputed values by Mummolo and Peterson (2019), provided visually in their Online Appendix (p. 21, Figure B6, under ‘Welfare Support Study’). Estimates from Mummolo and Peterson (2019) and some by Kane et al. (2023) were only provided visually and were imputed using visual tools by Gross et al. (2014).

10.3.2 Bayesian Meta-Analysis Model

Table 10.2: Bayesian Meta-Analysis Model
\(Y_i \: \overset{\mathrm{iid}}{\sim} \: \mathcal{T}_\nu(\mu_i, \sigma)\) 1.1
\(\mu_i = \alpha_{study[i]}\) 1.2
\(\alpha_{study} \: \overset{\mathrm{iid}}{\sim} \: \mathcal{T}_3(\mu_\alpha, \: \sigma_\alpha)\) 1.3
\(\mu_\alpha \: \overset{\mathrm{iid}}{\sim} \: \mathcal{T}_3(0, \: .077)\) 1.4
\(\sigma_\alpha \: \overset{\mathrm{iid}}{\sim} \: \mathcal{T}^{[0; \infty)}_3(0, \: .077)\) 1.5
\(\nu \: \overset{\mathrm{iid}}{\sim} \: \mathcal{T}^{(1; 4)}_3(1, \: .077)\) 1.6

NOTE: Bayesian Generalized Linear Model with study-varying intercepts. 1.1 is the likelihood function, which specifies that the observations of the outcome variable \(Y\) (i.e., welfare support) are assumed to be independent and identically (i.e., iid) distributed as a Student t distribution (Gosset 1908; Helmert 1876a; 1876b; 1875; Lüroth 1876; Pearson 1895b), with mean, dispersion (i.e., residual standard deviation), and degrees of freedom (df) parameters. 1.2 is the affine function (Boyd and Vandenberghe 2018) of the mean parameter, which consists of random (i.e., study-varying) intercepts. 1.3 is the prior for the study-varying intercepts, which are Student t distributed with a df of 3, an unknown fixed (i.e., US-level) mean intercept, and an unknown dispersion parameter. 1.4 is the hyperprior for the fixed/mean intercept of the study-varying intercepts, which is Student t distributed with a df of 3, a mean of 0, and a dispersion of .077. 1.5 is the hyperprior for the dispersion parameter of the study-varying intercepts, which is distributed as a non-negative Student t with a df of 3, a mean of 0, and a dispersion of .077. 1.6 is the prior for the df parameter of the model, which is distributed as a truncated Student t with a df of 3, a mean of 1, and a dispersion of .077, where the truncation includes values between 1 and 4. Note that no prior for the dispersion of the outcome (i.e., \(\sigma\)) is specified, because the model weights the estimates from each study by their estimated standard errors.
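For concreteness, the model in table 10.2 can be expressed in brms roughly as follows; this is a minimal sketch, assuming a hypothetical data frame `meta` with one row per estimate and columns `estimate`, `se`, and `study`:

```r
library(brms)

priors <- c(
  set_prior("student_t(3, 0, 0.077)", class = "Intercept"),  # eq. 1.4
  set_prior("student_t(3, 0, 0.077)", class = "sd"),         # eq. 1.5
  # eq. 1.6; brms bounds nu at 1 by default, while the upper truncation at 4
  # would additionally require adjusting the parameter's bounds
  set_prior("student_t(3, 1, 0.077)", class = "nu")
)

fit_meta <- brm(
  estimate | se(se) ~ 1 + (1 | study),  # eqs. 1.1-1.3; se() fixes sigma
  data   = meta,
  family = student(),
  prior  = priors,
  chains = 7, iter = 7715, warmup = 2000  # 7 x 5,715 = 40,005 draws
)

bayes_R2(fit_meta, summary = FALSE)  # posterior draws of the Bayesian R^2
```

The `se(se)` term is what makes a separate prior for \(\sigma\) unnecessary: each study’s residual dispersion is fixed to its reported standard error.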

10.3.3 Model Diagnostics

Figure 10.1: Chain Convergence of the Bayesian Meta-Analysis Model

Chain Convergence of the Bayesian Meta-Analysis Model

NOTE: Traceplot. Total chains = 7, total samples per chain = 7,715, warmup samples per chain = 2,000. Total posterior samples = 40,005. The gray-shaded area indicates the warmup samples. Effective Sample Size (ESS, Bürkner 2017) = 5,553, 4,949, and 17,731, respectively. Chain convergence (\(\hat{R}\), Gelman and Rubin 1992; Vehtari et al. 2021) = 1.001, 1.000, and 1.000, respectively. Monte Carlo Standard Error (MCSE, Kruschke 2014) = .002, .001 and .001, respectively. Data for meta-analysis derived from Aarøe and Petersen (2014), Mummolo and Peterson (2019), and Kane et al. (2023).

Figure 10.2: Posterior Predictive Distribution of the Bayesian Meta-Analysis Model

Posterior Predictive Distribution of the Bayesian Meta-Analysis Model

NOTE: Posterior predictive plot with 100 random draws shown. \(PP_P\) (cf. Gelman et al. 1996) = 0.763. Posterior samples = 40,005. Data for meta-analysis derived from Aarøe and Petersen (2014), Mummolo and Peterson (2019), and Kane et al. (2023).

Table 10.3: 13-Number Summary of the Explanatory Power of the Bayesian Meta-Analysis Model
Percentile Estimate
0% 0.000
0.1% 0.000
1% 0.000
5% 0.004
10% 0.016
25% 0.108
50% 0.399
75% 0.660
90% 0.746
95% 0.773
99% 0.795
99.9% 0.799
100% 0.800

NOTE: Posterior distribution of the estimated explained variance (\(R^2\), cf. Gelman et al. 2019). Posterior samples = 40,005. Mean = .390. IQR = .552. Skewness = -.045. Kurtosis = -1.510. Data for meta-analysis derived from Aarøe and Petersen (2014), Mummolo and Peterson (2019), and Kane et al. (2023).

Figure 10.3: Explanatory Power of the Bayesian Meta-Analysis Model

Explanatory Power of the Bayesian Meta-Analysis Model

NOTE: Posterior distribution of the estimated explained variance (\(R^2\), cf. Gelman et al. 2019). Posterior samples = 40,005. The dashed black line indicates an explanatory power of zero. The geometry below the distribution indicates the posterior mode (circle) and the 95% HDCI (thick bar). Data for meta-analysis derived from Aarøe and Petersen (2014), Mummolo and Peterson (2019), and Kane et al. (2023).

10.3.4 Model Parameters

Table 10.4: 13-Number Summary of the Meta-Analytical Effect of Deservingness on Welfare Support
Percentile Posterior Likelihood
0% -0.168 -1.069
0.1% -0.090 0.134
1% -0.007 0.334
5% 0.153 0.415
10% 0.308 0.443
25% 0.418 0.475
50% 0.464 0.503
75% 0.506 0.531
90% 0.545 0.563
95% 0.566 0.590
99% 0.610 0.669
99.9% 0.688 0.904
100% 0.808 1.801

NOTE: Posterior and (maximum) likelihood distributions of the estimated intercept mean parameter (\(\hat{\mu_\alpha}\)), estimated with NUTS MCMC (Hoffman and Gelman 2014) and REML-based Student t-parametric bootstrap (Efron and Tibshirani 1994), respectively. Posterior/likelihood samples = 40,005. Mean = .439 and .503, respectively. IQR = .088 and .057, respectively. Skewness = -1.965 and .136, respectively. Kurtosis = 42.177 and 4.588, respectively. Data for meta-analysis derived from Aarøe and Petersen (2014), Mummolo and Peterson (2019), and Kane et al. (2023).

Figure 10.4: Meta-Analytical Effect of Deservingness on Welfare Support

Meta-Analytical Effect of Deservingness on Welfare Support

NOTE: Posterior, (maximum) likelihood, and prior distributions of the estimated intercept mean parameter (\(\hat{\mu_\alpha}\)). Posterior/likelihood/prior samples = 40,005; the prior may not be visualized exactly due to the finite number of samples. The dashed black line indicates an effect of zero. The geometries below the distributions indicate the posterior mode (circle) and the 95% HDCI (thick bar). Data for meta-analysis derived from Aarøe and Petersen (2014), Mummolo and Peterson (2019), and Kane et al. (2023).

Table 10.5: 13-Number Summary of Meta-Analytical Variability in Effect of Deservingness on Welfare Support
Percentile Posterior Likelihood
0% 0.000 0.000
0.1% 0.000 0.000
1% 0.002 0.000
5% 0.008 0.000
10% 0.017 0.000
25% 0.038 0.019
50% 0.069 0.071
75% 0.115 0.122
90% 0.197 0.182
95% 0.286 0.230
99% 0.488 0.383
99.9% 0.801 0.808
100% 1.224 5.461

NOTE: Posterior and (maximum) likelihood distributions of the estimated intercept dispersion parameter (\(\hat{\sigma_\alpha}\)), estimated with NUTS MCMC (Hoffman and Gelman 2014) and REML-based Student t-parametric bootstrap (Efron and Tibshirani 1994), respectively. Posterior/likelihood samples = 40,005. Mean = .095 and .085, respectively. IQR = .077 and .103, respectively. Skewness = 2.874 and 7.713, respectively. Kurtosis = 317.679 and 12.464, respectively. Data for meta-analysis derived from Aarøe and Petersen (2014), Mummolo and Peterson (2019), and Kane et al. (2023).

Figure 10.5: Meta-Analytical Variability in Effect of Deservingness on Welfare Support

Meta-Analytical Variability in Effect of Deservingness on Welfare Support

NOTE: Posterior, (maximum) likelihood, and prior distributions of the estimated intercept dispersion parameter (\(\hat{\sigma_\alpha}\)). Posterior/likelihood/prior samples = 40,005; the prior may not be visualized exactly due to the finite number of samples. The dashed black line indicates an effect variability of zero across studies. The geometries below the distributions indicate the posterior mode (circle) and the 95% HDCI (thick bar). Data for meta-analysis derived from Aarøe and Petersen (2014), Mummolo and Peterson (2019), and Kane et al. (2023).

Table 10.6: 13-Number Summary of the Degrees of Freedom of the Bayesian Meta-Analysis Model
Percentile Posterior
0% 1.00
0.1% 1.00
1% 1.00
5% 1.00
10% 1.01
25% 1.03
50% 1.06
75% 1.11
90% 1.18
95% 1.24
99% 1.44
99.9% 1.93
100% 3.06

NOTE: Posterior distribution of the estimated model degrees of freedom parameter (\(\hat{\nu}\)). Posterior samples = 40,005. Mean = 1.084. IQR = .082. Skewness = 4.437. Kurtosis = 40.748. Data for meta-analysis derived from Aarøe and Petersen (2014), Mummolo and Peterson (2019), and Kane et al. (2023).

Figure 10.6: Degrees of Freedom of the Bayesian Meta-Analysis Model

Degrees of Freedom of the Bayesian Meta-Analysis Model

NOTE: Posterior and prior distributions of the estimated model degrees of freedom parameter (\(\hat{\nu}\)). Posterior/prior samples = 40,005; the prior may not be visualized exactly due to the finite number of samples. The geometries below the distributions indicate the posterior mode (circle) and the 95% HDCI (thick bar). Data for meta-analysis derived from Aarøe and Petersen (2014), Mummolo and Peterson (2019), and Kane et al. (2023).

10.4 Simulated Data

10.4.1 US State Residence

This section gives an overview of the distribution of US State residencies in the simulated data, as well as census-based estimates of the US population for comparison. This is provided metrically in table 10.7 and graphically in figure 10.7. The mode of this discrete distribution of the sample is ‘California’ (10.80%). With a sample size of 1,009, the simulated sample captures 98.82% of the information about the US State residencies in the population.
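The comparison columns of table 10.7, and of the analogous tables in the rest of this section, can be computed along the following lines; this is a minimal sketch, assuming hypothetical data frames `sim` and `census` that each contain a `state` column:

```r
library(dplyr)

comparison <- full_join(
  sim    |> count(state, name = "freq") |>
    mutate(sample_pct = 100 * freq / sum(freq)),
  census |> count(state) |>
    transmute(state, pop_pct = 100 * n / sum(n)),
  by = "state"
) |>
  mutate(diff_pp = sample_pct - pop_pct)  # difference in percentage points
```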

Table 10.7: Distribution of US States
State Sample Frequency Sample Percentage (%) Population Percentage (%) Difference (pp)
Alabama 20 1.982 1.601 0.381
Alaska 1 0.099 0.192 -0.092
Arizona 26 2.577 2.170 0.407
Arkansas 12 1.189 0.950 0.239
California 109 10.803 10.185 0.618
Colorado 16 1.586 1.812 -0.226
Connecticut 14 1.388 1.115 0.273
Delaware 4 0.396 0.318 0.078
District Of Columbia 2 0.198 0.214 -0.016
Florida 60 5.946 6.671 -0.724
Georgia 36 3.568 3.292 0.276
Hawaii 1 0.099 0.455 -0.355
Idaho 6 0.595 0.572 0.023
Illinois 44 4.361 3.852 0.509
Indiana 16 1.586 2.117 -0.531
Iowa 9 0.892 1.013 -0.121
Kansas 14 1.388 0.890 0.497
Kentucky 11 1.090 1.452 -0.362
Louisiana 20 1.982 1.400 0.582
Maine 7 0.694 0.466 0.227
Maryland 16 1.586 1.896 -0.311
Massachusetts 13 1.288 2.217 -0.929
Michigan 25 2.478 3.190 -0.712
Minnesota 16 1.586 1.780 -0.194
Mississippi 7 0.694 0.916 -0.223
Missouri 22 2.180 1.966 0.215
Montana 2 0.198 0.350 -0.152
Nebraska 10 0.991 0.604 0.387
Nevada 9 0.892 0.922 -0.030
New Hampshire 4 0.396 0.460 -0.063
New Jersey 33 3.271 2.775 0.496
New Mexico 3 0.297 0.570 -0.273
New York 58 5.748 5.959 -0.211
North Carolina 26 2.577 3.301 -0.724
North Dakota 0 0.000 0.245 -0.245
Ohio 46 4.559 3.741 0.818
Oklahoma 18 1.784 1.138 0.646
Oregon 13 1.288 1.351 -0.063
Pennsylvania 49 4.856 4.106 0.750
Rhode Island 0 0.000 0.320 -0.320
South Carolina 25 2.478 1.698 0.780
South Dakota 7 0.694 0.276 0.418
Tennessee 17 1.685 2.267 -0.582
Texas 77 7.631 8.222 -0.590
Utah 12 1.189 0.975 0.214
Vermont 1 0.099 0.221 -0.122
Virginia 19 1.883 2.691 -0.808
Washington 26 2.577 2.431 0.146
West Virginia 3 0.297 0.585 -0.287
Wisconsin 21 2.081 1.907 0.175
Wyoming 3 0.297 0.186 0.111
<NA> 0 0.000 0.000 0.000
Total 1,009 100.000 100.000 0.000

NOTE: Comparison of simulated versus ‘population’ data. The population percentage is an estimate based on census data. \(n_{simulation}\) = 1,009; \(n_{census}\) = 2,366,924. pp = Percentage points. NA = Missing data. Both the simulated and ‘population’ data is based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Figure 10.7: Distribution of US States

Distribution of US States

NOTE: Comparison of simulated versus ‘population’ data. The population percentage is an estimate based on census data. \(n_{simulation}\) = 1,009; \(n_{census}\) = 2,366,924. Both the simulated and ‘population’ data is based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

10.4.2 Ethnoracial Group

This section gives an overview of the distribution of ‘ethnoracial groups’ (i.e., racial/ethnic/national labels employed in the US) in the simulated data, as well as census-based estimates of the US population for comparison. This is provided metrically in table 10.8 and graphically in figure 10.8. The mode of this discrete distribution of the sample is ‘White’ (75.12%). With a sample size of 1,009, the simulated sample captures 99.98% of the information about the ‘ethnoracial groups’ in the population.

Table 10.8: Distribution of Race/Ethnicity/Nationality
Race/Ethnicity/Nationality Sample Frequency Sample Percentage (%) Population Percentage (%) Difference (pp)
Argentinean 0 0.000 0.085 -0.085
Bangladeshi 2 0.198 0.073 0.125
Bhutanese 0 0.000 0.004 -0.004
Black/African American 81 8.028 8.831 -0.803
Bolivian 0 0.000 0.025 -0.025
Burmese 0 0.000 0.049 -0.049
Cambodian 1 0.099 0.088 0.012
Chilean 0 0.000 0.046 -0.046
Chinese 19 1.883 1.709 0.174
Colombian 8 0.793 0.282 0.511
Costa Rican 0 0.000 0.038 -0.038
Cuban 5 0.496 0.643 -0.148
Dominican 4 0.396 0.245 0.151
Ecuadorian 2 0.198 0.105 0.093
Fijian 0 0.000 0.011 -0.011
Filipino 7 0.694 1.155 -0.461
Guatemalan 0 0.000 0.186 -0.186
Hawaiian 0 0.000 0.050 -0.050
Honduran 2 0.198 0.125 0.073
Indian 18 1.784 1.405 0.379
Indonesian 1 0.099 0.030 0.069
Japanese 1 0.099 0.314 -0.215
Korean 4 0.396 0.541 -0.145
Laotian 0 0.000 0.057 -0.057
Malaysian 0 0.000 0.009 -0.009
Mexican 57 5.649 5.417 0.233
Mongolian 0 0.000 0.007 -0.007
Native American 5 0.496 0.856 -0.361
Nepalese 1 0.099 0.060 0.039
Nicaraguan 2 0.198 0.081 0.117
Pakistani 4 0.396 0.172 0.224
Panamanian 0 0.000 0.051 -0.051
Paraguayan 1 0.099 0.006 0.093
Peruvian 2 0.198 0.129 0.069
Puerto Rican 8 0.793 0.941 -0.148
Salvadoran 3 0.297 0.250 0.047
Spaniard 2 0.198 0.251 -0.053
Sri Lankan 0 0.000 0.020 -0.020
Taiwanese 0 0.000 0.100 -0.100
Thai 1 0.099 0.078 0.021
Uruguayan 0 0.000 0.018 -0.018
Venezuelan 3 0.297 0.144 0.153
Vietnamese 7 0.694 0.663 0.030
White 758 75.124 74.647 0.477
<NA> 0 0.000 0.000 0.000
Total 1,009 100.000 100.000 0.000

NOTE: Comparison of simulated versus ‘population’ data. The population percentage is an estimate based on census data. \(n_{simulation}\) = 1,009; \(n_{census}\) = 2,366,924. pp = Percentage points. NA = Missing data. Both the simulated and ‘population’ data is based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Figure 10.8: Distribution of Race/Ethnicity/Nationality

Distribution of Race/Ethnicity/Nationality

NOTE: Comparison of simulated versus ‘population’ data. The population percentage is an estimate based on census data. \(n_{simulation}\) = 1,009; \(n_{census}\) = 2,366,924. Both the simulated and ‘population’ data is based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

10.4.3 Education

This section gives an overview of the distribution of levels of education in the simulated data, as well as census-based estimates of the US population for comparison. This is provided metrically in table 10.9 and graphically in figure 10.9. The mode of this discrete distribution of the sample is ‘Some Post-High School, No Bachelor’s Degree’ (30.62%). With a sample size of 1,009, the simulated sample captures 99.80% of the information about the levels of education in the population.

Table 10.9: Distribution of Education
Education Sample Frequency Sample Percentage (%) Population Percentage (%) Difference (pp)
Bachelor’s Degree 198 19.62 21.93 -2.308
Graduate Degree 151 14.96 14.57 0.395
High School Credential 263 26.07 26.45 -0.380
Less Than High School Credential 88 8.72 8.31 0.410
Some Post-High School, No Bachelor’s Degree 309 30.62 28.74 1.883
<NA> 0 0.00 0.00 0.000
Total 1,009 100.00 100.00 0.000

NOTE: Comparison of simulated versus ‘population’ data. The population percentage is an estimate based on census data. \(n_{simulation}\) = 1,009; \(n_{census}\) = 2,366,924. pp = Percentage points. NA = Missing data. Both the simulated and ‘population’ data is based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Figure 10.9: Distribution of Education

Distribution of Education

NOTE: Comparison of simulated versus ‘population’ data. The population percentage is an estimate based on census data. \(n_{simulation}\) = 1,009; \(n_{census}\) = 2,366,924. Both the simulated and ‘population’ data is based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

10.4.4 Age

This section gives an overview of the distribution of age in the simulated data, as well as census-based estimates of the US population for comparison. This is provided metrically in table 10.10 and graphically in figure 10.10. While age is arguably continuous, the mode of this discretized distribution of the sample is ‘53’ (2.48%); the mean, standard deviation (SD), median, and interquartile range (IQR) of the sample and population are listed under the table and figure. With a sample size of 1,009, the simulated sample captures 97.00% of the information about age in the population.

Table 10.10: Distribution of Age
Age Sample Frequency Sample Percentage (%) Population Percentage (%) Difference (pp)
18 15 1.487 1.206 0.281
19 4 0.396 1.049 -0.653
20 12 1.189 1.074 0.116
21 12 1.189 1.142 0.047
22 10 0.991 1.208 -0.217
23 10 0.991 1.237 -0.246
24 16 1.586 1.253 0.333
25 20 1.982 1.290 0.693
26 14 1.388 1.311 0.077
27 13 1.288 1.349 -0.060
28 16 1.586 1.383 0.203
29 13 1.288 1.413 -0.124
30 17 1.685 1.511 0.174
31 12 1.189 1.504 -0.315
32 15 1.487 1.550 -0.064
33 12 1.189 1.520 -0.330
34 7 0.694 1.496 -0.802
35 12 1.189 1.513 -0.324
36 20 1.982 1.519 0.463
37 18 1.784 1.520 0.264
38 19 1.883 1.507 0.376
39 9 0.892 1.500 -0.608
40 15 1.487 1.558 -0.072
41 18 1.784 1.509 0.275
42 13 1.288 1.515 -0.226
43 19 1.883 1.444 0.439
44 13 1.288 1.429 -0.140
45 20 1.982 1.432 0.550
46 19 1.883 1.385 0.498
47 9 0.892 1.410 -0.518
48 13 1.288 1.413 -0.124
49 19 1.883 1.427 0.456
50 15 1.487 1.565 -0.079
51 15 1.487 1.642 -0.155
52 19 1.883 1.638 0.245
53 25 2.478 1.597 0.881
54 18 1.784 1.570 0.214
55 15 1.487 1.613 -0.126
56 14 1.388 1.649 -0.262
57 18 1.784 1.764 0.020
58 18 1.784 1.825 -0.041
59 11 1.090 1.851 -0.761
60 20 1.982 1.910 0.072
61 23 2.279 1.924 0.356
62 21 2.081 1.973 0.108
63 20 1.982 1.938 0.044
64 25 2.478 1.967 0.511
65 18 1.784 1.974 -0.190
66 17 1.685 1.904 -0.219
67 18 1.784 1.862 -0.078
68 19 1.883 1.831 0.052
69 18 1.784 1.770 0.014
70 9 0.892 1.709 -0.817
71 15 1.487 1.618 -0.131
72 21 2.081 1.562 0.519
73 13 1.288 1.516 -0.228
74 15 1.487 1.487 -0.001
75 14 1.388 1.499 -0.112
76 9 0.892 1.125 -0.233
77 15 1.487 1.029 0.458
78 11 1.090 0.999 0.092
79 2 0.198 0.976 -0.778
80 6 0.595 0.838 -0.244
81 14 1.388 0.736 0.652
82 7 0.694 0.656 0.038
83 4 0.396 0.600 -0.203
84 3 0.297 0.531 -0.233
85 6 0.595 0.462 0.133
86 2 0.198 0.408 -0.210
87 1 0.099 0.361 -0.262
88 1 0.099 0.294 -0.195
89 3 0.297 0.231 0.067
90 1 0.099 0.162 -0.063
91 1 0.099 0.071 0.029
92 3 0.297 0.151 0.147
93 4 0.396 0.145 0.251
94 3 0.297 0.278 0.019
95 5 0.496 0.206 0.289
<NA> 0 0.000 0.000 0.000
Total 1,009 100.000 100.000 0.000

NOTE: Comparison of simulated versus ‘population’ data. Mean age = 51.724 (SD = 18.535) and 51.845 (SD = 18.594), respectively. Median age = 53.000 (IQR = 29.000) and 53.000 (IQR = 31.000), respectively. The population percentage is an estimate based on census data. \(n_{simulation}\) = 1,009; \(n_{census}\) = 2,366,924. pp = Percentage points. NA = Missing data. Both the simulated and ‘population’ data is based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Figure 10.10: Distribution of Age

Distribution of Age

NOTE: Comparison of simulated versus ‘population’ data. Mean age = 51.724 (SD = 18.535) and 51.845 (SD = 18.594), respectively. Median age = 53.000 (IQR = 29.000) and 53.000 (IQR = 31.000), respectively. The population percentage is an estimate based on census data. \(n_{simulation}\) = 1,009; \(n_{census}\) = 2,366,924. Both the simulated and ‘population’ data is based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

10.4.5 Household Income

This section gives an overview of the distribution of annual household gross incomes (in 2022 $US) in the simulated data, as well as census-based estimates of the US population for comparison. This is provided metrically in table 10.11 and graphically in figure 10.11. While household income is arguably continuous, the mode of this discretized distribution of the sample is ‘100,000’ (1.09%); the mean, standard deviation (SD), median, and interquartile range (IQR) of the sample and population are listed under the table and figure. With a sample size of 1,009, the simulated sample captures 77.01% of the information about household incomes in the population.

Table 10.11: Distribution of Household Incomes
Household Income ($US) Sample Frequency Sample Percentage (%) Population Percentage (%) Difference (pp)
-1996 1 0.099 0.000 0.099
-6910 1 0.099 0.000 0.099
0 10 0.991 1.030 -0.039
100 1 0.099 0.012 0.087
10000 2 0.198 0.243 -0.045
100000 11 1.090 0.718 0.373
100200 1 0.099 0.029 0.070
100900 1 0.099 0.016 0.083
10100 1 0.099 0.102 -0.003
101000 2 0.198 0.156 0.042
101300 1 0.099 0.020 0.079
101500 1 0.099 0.030 0.069
101600 1 0.099 0.029 0.070
102000 2 0.198 0.209 -0.011
102800 1 0.099 0.024 0.075
103000 1 0.099 0.182 -0.083
103640 1 0.099 0.001 0.098
103800 2 0.198 0.023 0.175
104000 2 0.198 0.177 0.021
105000 7 0.694 0.340 0.354
106000 2 0.198 0.163 0.035
106800 1 0.099 0.023 0.076
107000 2 0.198 0.164 0.034
107200 1 0.099 0.024 0.075
107500 1 0.099 0.029 0.070
10800 1 0.099 0.062 0.037
108000 1 0.099 0.194 -0.095
108600 1 0.099 0.021 0.078
108700 2 0.198 0.017 0.181
109600 1 0.099 0.027 0.072
109700 1 0.099 0.012 0.087
10980 1 0.099 0.001 0.099
110000 2 0.198 0.460 -0.262
110100 1 0.099 0.020 0.079
110300 1 0.099 0.018 0.081
110400 1 0.099 0.025 0.075
11100 1 0.099 0.027 0.072
111000 2 0.198 0.144 0.055
111100 2 0.198 0.016 0.183
112000 3 0.297 0.194 0.103
11210 1 0.099 0.001 0.098
11300 2 0.198 0.027 0.171
113000 2 0.198 0.166 0.033
113700 1 0.099 0.017 0.082
113830 1 0.099 0.001 0.099
114000 2 0.198 0.158 0.041
114100 1 0.099 0.016 0.083
114550 1 0.099 0.000 0.099
114700 2 0.198 0.015 0.183
115200 1 0.099 0.021 0.078
115600 1 0.099 0.023 0.077
116000 1 0.099 0.150 -0.051
116510 1 0.099 0.000 0.099
117000 2 0.198 0.153 0.045
117100 1 0.099 0.016 0.083
117400 1 0.099 0.018 0.081
11800 1 0.099 0.030 0.069
118000 2 0.198 0.166 0.032
119000 1 0.099 0.124 -0.025
119600 1 0.099 0.020 0.079
119800 1 0.099 0.017 0.082
12000 1 0.099 0.301 -0.201
120000 6 0.595 0.516 0.078
120200 1 0.099 0.023 0.076
120300 1 0.099 0.017 0.082
120600 2 0.198 0.020 0.178
120700 1 0.099 0.012 0.087
12100 1 0.099 0.026 0.073
121000 1 0.099 0.133 -0.034
12200 3 0.297 0.031 0.266
122000 2 0.198 0.164 0.034
122800 1 0.099 0.018 0.081
122900 1 0.099 0.014 0.085
12300 2 0.198 0.030 0.168
123000 2 0.198 0.142 0.056
123400 1 0.099 0.016 0.083
123710 1 0.099 0.000 0.099
124000 1 0.099 0.141 -0.042
124560 1 0.099 0.000 0.099
12500 1 0.099 0.051 0.048
125000 2 0.198 0.331 -0.133
125020 1 0.099 0.002 0.097
125200 1 0.099 0.020 0.079
125500 2 0.198 0.023 0.176
125600 1 0.099 0.020 0.079
126000 5 0.496 0.143 0.352
126300 1 0.099 0.016 0.083
126500 1 0.099 0.022 0.077
127000 2 0.198 0.137 0.062
127230 1 0.099 0.000 0.099
128000 1 0.099 0.155 -0.055
12900 3 0.297 0.026 0.272
129000 1 0.099 0.113 -0.014
129500 1 0.099 0.020 0.079
129900 1 0.099 0.013 0.087
1300 1 0.099 0.036 0.063
13000 1 0.099 0.122 -0.022
130330 1 0.099 0.000 0.099
130530 1 0.099 0.000 0.099
13100 1 0.099 0.026 0.073
131000 1 0.099 0.120 -0.021
131500 1 0.099 0.021 0.078
131700 1 0.099 0.013 0.086
13200 1 0.099 0.061 0.038
132000 3 0.297 0.154 0.144
132500 1 0.099 0.019 0.080
13300 1 0.099 0.027 0.072
133000 1 0.099 0.131 -0.032
133630 1 0.099 0.000 0.099
134000 1 0.099 0.120 -0.021
134500 1 0.099 0.021 0.078
135000 3 0.297 0.264 0.034
135200 1 0.099 0.015 0.084
136000 1 0.099 0.127 -0.027
1360000 1 0.099 0.000 0.099
136100 1 0.099 0.012 0.087
13650 1 0.099 0.001 0.098
137000 2 0.198 0.128 0.070
137210 1 0.099 0.000 0.099
138000 1 0.099 0.131 -0.032
138190 1 0.099 0.000 0.099
139000 2 0.198 0.097 0.101
139100 1 0.099 0.011 0.088
139600 1 0.099 0.015 0.084
1400 1 0.099 0.042 0.057
14000 1 0.099 0.136 -0.037
140000 9 0.892 0.355 0.537
140400 1 0.099 0.014 0.085
14100 1 0.099 0.025 0.074
141000 2 0.198 0.105 0.093
141200 1 0.099 0.017 0.082
141400 1 0.099 0.015 0.084
14200 1 0.099 0.030 0.069
142000 1 0.099 0.134 -0.034
142300 1 0.099 0.012 0.087
142400 1 0.099 0.013 0.086
14300 1 0.099 0.027 0.072
143000 3 0.297 0.115 0.183
143510 1 0.099 0.000 0.099
144600 1 0.099 0.013 0.086
14500 1 0.099 0.039 0.060
145500 1 0.099 0.018 0.081
14600 1 0.099 0.030 0.069
146000 2 0.198 0.111 0.088
146200 1 0.099 0.014 0.085
147000 2 0.198 0.110 0.088
147200 1 0.099 0.013 0.086
148000 2 0.198 0.118 0.081
149000 2 0.198 0.094 0.104
1500 2 0.198 0.056 0.142
15000 5 0.496 0.236 0.259
150000 7 0.694 0.416 0.278
15030 1 0.099 0.001 0.098
151400 2 0.198 0.012 0.187
1520 1 0.099 0.001 0.098
152000 2 0.198 0.125 0.073
152400 1 0.099 0.013 0.086
153000 1 0.099 0.109 -0.010
154400 1 0.099 0.010 0.089
154500 1 0.099 0.016 0.083
154800 1 0.099 0.013 0.087
155000 4 0.396 0.217 0.179
155800 1 0.099 0.013 0.087
156000 1 0.099 0.101 -0.002
156600 1 0.099 0.013 0.086
15700 1 0.099 0.029 0.071
157900 1 0.099 0.007 0.092
158000 2 0.198 0.104 0.094
15900 1 0.099 0.027 0.072
159000 1 0.099 0.077 0.022
160000 1 0.099 0.298 -0.199
160040 1 0.099 0.001 0.098
160400 1 0.099 0.011 0.088
161000 1 0.099 0.083 0.016
161900 1 0.099 0.005 0.094
162000 1 0.099 0.114 -0.014
163200 1 0.099 0.012 0.087
163210 1 0.099 0.000 0.099
16400 1 0.099 0.033 0.066
164000 1 0.099 0.087 0.012
164100 1 0.099 0.009 0.090
164200 1 0.099 0.011 0.089
165000 2 0.198 0.196 0.002
165200 1 0.099 0.012 0.087
165500 1 0.099 0.012 0.087
166000 1 0.099 0.086 0.013
16700 1 0.099 0.031 0.068
16800 1 0.099 0.074 0.025
168000 2 0.198 0.100 0.098
168300 1 0.099 0.009 0.090
168600 2 0.198 0.010 0.188
168900 1 0.099 0.007 0.092
17000 1 0.099 0.118 -0.019
170000 1 0.099 0.249 -0.150
170200 2 0.198 0.011 0.187
170600 1 0.099 0.011 0.088
17100 1 0.099 0.031 0.068
171000 2 0.198 0.069 0.129
171960 1 0.099 0.000 0.099
172000 2 0.198 0.091 0.107
172030 1 0.099 0.001 0.098
172800 1 0.099 0.008 0.091
173250 1 0.099 0.000 0.099
173400 1 0.099 0.008 0.091
173500 1 0.099 0.012 0.087
173800 1 0.099 0.008 0.091
175000 2 0.198 0.193 0.006
17550 1 0.099 0.001 0.098
175820 1 0.099 0.000 0.099
177790 1 0.099 0.000 0.099
178000 1 0.099 0.078 0.021
178100 1 0.099 0.006 0.094
179000 1 0.099 0.065 0.034
18000 4 0.396 0.215 0.181
180000 1 0.099 0.253 -0.154
181000 1 0.099 0.061 0.038
181200 1 0.099 0.010 0.089
182500 1 0.099 0.012 0.087
183800 1 0.099 0.006 0.093
184000 1 0.099 0.062 0.037
185000 1 0.099 0.150 -0.051
185500 1 0.099 0.012 0.087
18600 1 0.099 0.040 0.059
186000 1 0.099 0.071 0.028
186700 1 0.099 0.007 0.092
187000 2 0.198 0.066 0.132
187100 1 0.099 0.006 0.094
189000 1 0.099 0.057 0.042
189500 1 0.099 0.009 0.090
189700 1 0.099 0.005 0.094
19000 2 0.198 0.108 0.090
190000 1 0.099 0.189 -0.090
191500 2 0.198 0.011 0.188
191800 1 0.099 0.011 0.088
193000 1 0.099 0.058 0.041
19400 1 0.099 0.034 0.065
194000 1 0.099 0.056 0.043
19500 1 0.099 0.038 0.061
19600 1 0.099 0.032 0.067
197600 1 0.099 0.009 0.091
20000 3 0.297 0.390 -0.093
200000 4 0.396 0.303 0.093
20250 1 0.099 0.001 0.098
202500 2 0.198 0.011 0.187
203100 1 0.099 0.005 0.094
20340 1 0.099 0.001 0.099
204200 1 0.099 0.006 0.093
204400 1 0.099 0.006 0.094
205000 2 0.198 0.122 0.076
205100 1 0.099 0.006 0.093
20600 1 0.099 0.034 0.065
206800 1 0.099 0.006 0.093
20700 1 0.099 0.029 0.071
208000 3 0.297 0.053 0.244
21000 2 0.198 0.130 0.069
210000 3 0.297 0.147 0.150
211600 1 0.099 0.004 0.095
211700 1 0.099 0.004 0.095
212100 1 0.099 0.004 0.095
213000 1 0.099 0.050 0.049
21400 1 0.099 0.037 0.062
215000 3 0.297 0.103 0.194
21600 2 0.198 0.072 0.126
217000 2 0.198 0.046 0.153
21730 1 0.099 0.001 0.099
21900 2 0.198 0.028 0.170
219300 1 0.099 0.004 0.095
22000 1 0.099 0.178 -0.078
220000 2 0.198 0.153 0.045
220600 1 0.099 0.004 0.095
221000 1 0.099 0.040 0.059
222000 1 0.099 0.049 0.051
223900 1 0.099 0.003 0.097
224000 1 0.099 0.038 0.061
225000 1 0.099 0.108 -0.009
226000 1 0.099 0.040 0.059
226300 1 0.099 0.004 0.096
226900 1 0.099 0.002 0.097
23000 2 0.198 0.139 0.059
230000 2 0.198 0.129 0.069
231000 2 0.198 0.033 0.165
231300 1 0.099 0.003 0.096
233200 1 0.099 0.004 0.095
234400 1 0.099 0.004 0.096
235000 1 0.099 0.084 0.015
235400 1 0.099 0.004 0.095
235700 1 0.099 0.004 0.095
23600 1 0.099 0.034 0.065
238300 1 0.099 0.003 0.096
239500 1 0.099 0.006 0.093
239900 1 0.099 0.003 0.096
24000 3 0.297 0.309 -0.012
240500 1 0.099 0.007 0.093
24060 1 0.099 0.001 0.099
24100 1 0.099 0.027 0.072
241000 2 0.198 0.029 0.169
24200 1 0.099 0.035 0.064
242000 1 0.099 0.034 0.065
242100 1 0.099 0.002 0.097
242550 1 0.099 0.000 0.099
24400 1 0.099 0.037 0.062
245000 1 0.099 0.066 0.033
246500 1 0.099 0.005 0.095
248000 1 0.099 0.032 0.067
24950 1 0.099 0.001 0.099
2500 1 0.099 0.034 0.065
25000 4 0.396 0.339 0.058
250000 3 0.297 0.159 0.138
25200 2 0.198 0.052 0.147
255000 1 0.099 0.062 0.037
257000 2 0.198 0.025 0.173
25800 2 0.198 0.039 0.160
2600 1 0.099 0.018 0.081
26000 4 0.396 0.170 0.226
260000 1 0.099 0.090 0.009
26100 1 0.099 0.029 0.070
262000 1 0.099 0.030 0.069
263000 1 0.099 0.023 0.076
263400 1 0.099 0.003 0.096
26380 1 0.099 0.001 0.099
26500 1 0.099 0.043 0.056
265200 1 0.099 0.002 0.097
266000 1 0.099 0.022 0.077
26700 1 0.099 0.033 0.066
268000 1 0.099 0.022 0.077
269000 1 0.099 0.016 0.083
27000 2 0.198 0.154 0.044
270000 1 0.099 0.079 0.020
270800 1 0.099 0.003 0.096
271100 1 0.099 0.002 0.097
27200 1 0.099 0.035 0.064
27300 1 0.099 0.032 0.067
273000 1 0.099 0.018 0.081
27400 1 0.099 0.033 0.066
274000 1 0.099 0.018 0.081
275000 1 0.099 0.062 0.037
277300 1 0.099 0.002 0.097
28000 1 0.099 0.205 -0.106
282000 1 0.099 0.020 0.079
2840 1 0.099 0.000 0.099
28400 1 0.099 0.041 0.058
284100 1 0.099 0.002 0.097
28500 1 0.099 0.038 0.061
287900 1 0.099 0.002 0.097
29000 2 0.198 0.131 0.067
293400 1 0.099 0.001 0.098
29380 1 0.099 0.001 0.098
295000 1 0.099 0.036 0.063
295400 1 0.099 0.001 0.098
295450 1 0.099 0.000 0.099
29640 1 0.099 0.001 0.098
297400 1 0.099 0.002 0.097
3000 1 0.099 0.080 0.019
30000 9 0.892 0.576 0.316
30400 1 0.099 0.036 0.063
306200 1 0.099 0.002 0.097
307600 1 0.099 0.002 0.097
310000 2 0.198 0.048 0.150
31100 1 0.099 0.030 0.069
31300 2 0.198 0.035 0.164
31400 2 0.198 0.038 0.160
31600 1 0.099 0.040 0.059
32000 3 0.297 0.250 0.047
320000 1 0.099 0.055 0.044
32100 1 0.099 0.035 0.064
321350 1 0.099 0.000 0.099
32200 1 0.099 0.041 0.058
32500 3 0.297 0.046 0.251
325100 1 0.099 0.002 0.097
32800 1 0.099 0.037 0.062
32900 1 0.099 0.032 0.067
3300 1 0.099 0.020 0.079
33000 2 0.198 0.165 0.033
33100 1 0.099 0.031 0.068
332100 1 0.099 0.001 0.098
33500 1 0.099 0.040 0.059
336000 1 0.099 0.009 0.090
33700 1 0.099 0.031 0.068
338300 1 0.099 0.001 0.098
34000 1 0.099 0.172 -0.073
340000 1 0.099 0.037 0.062
340200 1 0.099 0.002 0.098
34100 2 0.198 0.029 0.170
34200 1 0.099 0.035 0.064
342000 2 0.198 0.013 0.186
34300 1 0.099 0.033 0.066
346800 1 0.099 0.001 0.098
34700 1 0.099 0.029 0.070
347500 1 0.099 0.001 0.098
348000 1 0.099 0.007 0.092
34900 2 0.198 0.028 0.170
34930 1 0.099 0.001 0.099
3500 1 0.099 0.029 0.071
35000 2 0.198 0.383 -0.185
35100 1 0.099 0.033 0.066
351900 1 0.099 0.001 0.099
36000 3 0.297 0.270 0.027
36300 1 0.099 0.030 0.069
36600 2 0.198 0.039 0.159
36700 1 0.099 0.035 0.065
36800 1 0.099 0.039 0.060
37000 3 0.297 0.167 0.130
373100 1 0.099 0.001 0.098
37500 2 0.198 0.051 0.148
375000 1 0.099 0.021 0.078
37600 1 0.099 0.040 0.059
37800 2 0.198 0.040 0.158
3800 1 0.099 0.017 0.083
38000 1 0.099 0.223 -0.124
383000 1 0.099 0.007 0.092
38400 2 0.198 0.056 0.142
38500 1 0.099 0.048 0.051
38700 1 0.099 0.034 0.065
39000 2 0.198 0.159 0.039
39300 1 0.099 0.032 0.067
39400 2 0.198 0.044 0.154
39600 2 0.198 0.054 0.144
39700 1 0.099 0.033 0.066
4000 1 0.099 0.066 0.033
40000 6 0.595 0.616 -0.021
403000 1 0.099 0.007 0.093
40400 2 0.198 0.040 0.158
404000 1 0.099 0.005 0.094
40440 1 0.099 0.001 0.099
404500 1 0.099 0.001 0.098
40500 2 0.198 0.045 0.153
405000 1 0.099 0.012 0.087
40700 1 0.099 0.034 0.065
41000 4 0.396 0.148 0.249
41500 1 0.099 0.048 0.051
418200 1 0.099 0.000 0.099
42000 5 0.496 0.262 0.234
42120 1 0.099 0.000 0.099
42300 2 0.198 0.033 0.165
42600 1 0.099 0.044 0.055
42700 1 0.099 0.031 0.068
43000 1 0.099 0.177 -0.078
430000 1 0.099 0.015 0.084
43200 1 0.099 0.053 0.046
433700 1 0.099 0.000 0.099
43590 1 0.099 0.000 0.099
44000 3 0.297 0.169 0.128
44250 1 0.099 0.001 0.098
44400 1 0.099 0.051 0.048
44600 1 0.099 0.040 0.059
44800 2 0.198 0.041 0.157
45000 4 0.396 0.426 -0.029
45200 1 0.099 0.040 0.059
453000 1 0.099 0.004 0.095
45400 1 0.099 0.037 0.062
45800 1 0.099 0.042 0.057
46000 1 0.099 0.184 -0.085
461000 1 0.099 0.005 0.094
46500 1 0.099 0.046 0.053
46700 2 0.198 0.037 0.161
47000 1 0.099 0.177 -0.077
470000 1 0.099 0.010 0.089
473000 1 0.099 0.003 0.096
47900 1 0.099 0.028 0.072
48000 5 0.496 0.290 0.205
48500 1 0.099 0.044 0.055
48600 1 0.099 0.041 0.058
48700 1 0.099 0.032 0.068
49000 2 0.198 0.157 0.041
49100 2 0.198 0.030 0.168
49200 2 0.198 0.045 0.153
492000 2 0.198 0.005 0.193
493400 1 0.099 0.001 0.098
49400 1 0.099 0.037 0.062
49600 3 0.297 0.042 0.256
496000 1 0.099 0.007 0.092
497400 1 0.099 0.000 0.099
50000 5 0.496 0.695 -0.199
50040 1 0.099 0.001 0.098
50200 1 0.099 0.037 0.062
50300 1 0.099 0.031 0.068
503000 1 0.099 0.006 0.093
506000 1 0.099 0.004 0.095
50900 2 0.198 0.028 0.171
51000 4 0.396 0.179 0.217
51400 1 0.099 0.041 0.058
514000 1 0.099 0.004 0.095
51900 1 0.099 0.026 0.073
52000 4 0.396 0.291 0.106
52050 1 0.099 0.002 0.098
52100 1 0.099 0.027 0.072
52200 1 0.099 0.037 0.062
52500 2 0.198 0.046 0.153
527000 1 0.099 0.003 0.096
52800 2 0.198 0.047 0.151
53000 2 0.198 0.198 0.001
53400 1 0.099 0.036 0.063
53500 1 0.099 0.043 0.057
53600 1 0.099 0.039 0.060
537000 1 0.099 0.022 0.078
5400 1 0.099 0.020 0.080
54000 2 0.198 0.230 -0.032
54100 2 0.198 0.028 0.170
54200 2 0.198 0.037 0.161
54300 1 0.099 0.030 0.069
54500 1 0.099 0.043 0.056
545000 1 0.099 0.011 0.089
54600 1 0.099 0.036 0.063
55000 5 0.496 0.396 0.100
55070 1 0.099 0.001 0.098
55100 1 0.099 0.033 0.066
55300 2 0.198 0.032 0.166
55700 1 0.099 0.030 0.069
56000 1 0.099 0.226 -0.127
56500 1 0.099 0.039 0.060
56700 1 0.099 0.032 0.067
57000 1 0.099 0.201 -0.102
573000 1 0.099 0.004 0.096
57500 1 0.099 0.041 0.058
58000 1 0.099 0.240 -0.141
58100 1 0.099 0.030 0.069
58200 1 0.099 0.038 0.061
58500 1 0.099 0.038 0.061
58800 1 0.099 0.043 0.056
59000 3 0.297 0.172 0.125
59500 1 0.099 0.042 0.058
59700 2 0.198 0.028 0.171
59800 2 0.198 0.036 0.162
59900 1 0.099 0.025 0.074
6000 1 0.099 0.111 -0.012
60000 10 0.991 0.708 0.283
60600 2 0.198 0.039 0.160
60700 1 0.099 0.027 0.072
60780 1 0.099 0.000 0.099
60850 1 0.099 0.001 0.098
61000 2 0.198 0.174 0.024
61300 1 0.099 0.030 0.069
61770 1 0.099 0.000 0.099
61800 1 0.099 0.036 0.063
6200 1 0.099 0.018 0.081
62000 3 0.297 0.250 0.047
62200 1 0.099 0.035 0.064
62400 1 0.099 0.042 0.057
62500 1 0.099 0.043 0.057
62900 1 0.099 0.027 0.073
62940 1 0.099 0.001 0.098
63000 2 0.198 0.200 -0.002
63100 1 0.099 0.026 0.073
63200 2 0.198 0.036 0.162
63400 1 0.099 0.038 0.061
63500 1 0.099 0.038 0.061
63790 1 0.099 0.000 0.099
64000 2 0.198 0.206 -0.007
64150 1 0.099 0.001 0.098
64300 1 0.099 0.026 0.073
64600 1 0.099 0.034 0.065
6500 1 0.099 0.026 0.073
65000 6 0.595 0.448 0.147
65100 1 0.099 0.029 0.070
65150 1 0.099 0.002 0.098
65250 1 0.099 0.001 0.098
657000 1 0.099 0.003 0.096
65800 1 0.099 0.034 0.066
66000 2 0.198 0.207 -0.009
66200 1 0.099 0.039 0.061
66300 1 0.099 0.028 0.071
66500 2 0.198 0.043 0.155
66700 1 0.099 0.029 0.070
67000 1 0.099 0.189 -0.090
67050 1 0.099 0.001 0.098
67170 1 0.099 0.001 0.098
67300 1 0.099 0.029 0.070
674000 2 0.198 0.003 0.196
67500 1 0.099 0.042 0.057
67620 1 0.099 0.000 0.099
67800 1 0.099 0.035 0.064
68000 2 0.198 0.248 -0.050
68300 2 0.198 0.029 0.169
6900 2 0.198 0.014 0.184
69000 1 0.099 0.175 -0.076
69100 2 0.198 0.025 0.173
698000 1 0.099 0.002 0.097
7000 2 0.198 0.067 0.131
70000 7 0.694 0.586 0.108
71000 1 0.099 0.163 -0.064
71140 1 0.099 0.000 0.099
71400 1 0.099 0.033 0.066
718900 1 0.099 0.000 0.099
72000 4 0.396 0.285 0.111
72700 1 0.099 0.029 0.071
73000 1 0.099 0.190 -0.091
73100 1 0.099 0.027 0.072
74000 4 0.396 0.203 0.193
74300 1 0.099 0.024 0.075
74400 1 0.099 0.035 0.064
74500 1 0.099 0.037 0.062
74900 1 0.099 0.022 0.077
75000 8 0.793 0.479 0.314
75020 1 0.099 0.002 0.097
75100 2 0.198 0.029 0.169
75200 1 0.099 0.036 0.063
75500 1 0.099 0.036 0.063
75600 1 0.099 0.039 0.060
75800 1 0.099 0.032 0.067
76000 2 0.198 0.204 -0.006
76100 1 0.099 0.025 0.075
761300 1 0.099 0.000 0.099
76500 1 0.099 0.037 0.062
76900 1 0.099 0.022 0.077
7700 1 0.099 0.016 0.083
77000 1 0.099 0.200 -0.100
77200 1 0.099 0.032 0.067
77900 1 0.099 0.022 0.077
7800 1 0.099 0.021 0.078
78000 2 0.198 0.249 -0.051
78500 1 0.099 0.036 0.063
78760 1 0.099 0.000 0.099
79000 5 0.496 0.164 0.331
79200 1 0.099 0.037 0.062
79400 1 0.099 0.032 0.067
79500 1 0.099 0.037 0.062
79700 1 0.099 0.024 0.075
79900 1 0.099 0.022 0.077
80000 1 0.099 0.624 -0.525
80010 1 0.099 0.003 0.096
80200 1 0.099 0.035 0.065
80400 1 0.099 0.033 0.066
804200 1 0.099 0.000 0.099
80600 1 0.099 0.032 0.067
80900 1 0.099 0.026 0.073
81000 3 0.297 0.175 0.123
81400 1 0.099 0.031 0.068
81900 1 0.099 0.022 0.077
82000 2 0.198 0.226 -0.028
82200 1 0.099 0.031 0.068
82300 1 0.099 0.027 0.072
823100 1 0.099 0.000 0.099
82400 1 0.099 0.032 0.067
83000 3 0.297 0.197 0.101
83800 1 0.099 0.029 0.070
84600 2 0.198 0.031 0.167
84700 1 0.099 0.021 0.078
8500 1 0.099 0.030 0.069
85000 1 0.099 0.402 -0.303
85500 1 0.099 0.034 0.065
86000 4 0.396 0.198 0.198
861000 1 0.099 0.002 0.097
86400 1 0.099 0.033 0.066
87000 3 0.297 0.195 0.102
87100 1 0.099 0.023 0.076
87700 2 0.198 0.022 0.177
88000 4 0.396 0.210 0.186
88200 1 0.099 0.029 0.070
88320 1 0.099 0.001 0.099
89000 2 0.198 0.161 0.037
89100 1 0.099 0.021 0.078
89400 1 0.099 0.028 0.071
89700 1 0.099 0.022 0.077
90000 5 0.496 0.552 -0.056
90080 1 0.099 0.001 0.098
90200 2 0.198 0.029 0.169
90600 1 0.099 0.029 0.070
90700 1 0.099 0.022 0.078
90780 1 0.099 0.001 0.099
91000 2 0.198 0.154 0.044
91500 2 0.198 0.031 0.167
91600 2 0.198 0.030 0.168
92000 3 0.297 0.224 0.074
92600 1 0.099 0.027 0.072
93000 1 0.099 0.184 -0.085
94100 2 0.198 0.020 0.178
95000 5 0.496 0.359 0.137
95500 1 0.099 0.030 0.069
95800 1 0.099 0.024 0.075
96000 2 0.198 0.196 0.002
96500 1 0.099 0.032 0.067
97000 1 0.099 0.170 -0.071
97400 1 0.099 0.026 0.073
97500 1 0.099 0.033 0.066
97800 1 0.099 0.027 0.072
98000 1 0.099 0.204 -0.105
98400 1 0.099 0.028 0.071
99000 1 0.099 0.156 -0.057
99500 1 0.099 0.031 0.068
99800 1 0.099 0.026 0.073
99950 1 0.099 0.001 0.098
<NA> 0 0.000 0.000 0.000
Total 1,009 100.000 100.000 0.000

NOTE: Comparison of simulated versus ‘population’ data. Mean income = 114,228 (SD = 117,339) and 116,894 (SD = 123,866), respectively. Median income = 82,000 (IQR = 99,000) and 85,000 (IQR = 100,600), respectively. The population percentage is an estimate based on census data. \(n_{simulation}\) = 1,009; \(n_{census}\) = 2,366,924. pp = Percentage points. NA = Missing data. Both the simulated and ‘population’ data is based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Figure 10.11: Distribution of Household Incomes

Distribution of Household Incomes

NOTE: Comparison of simulated versus ‘population’ data. Mean income = 114,228 (SD = 117,339) and 116,894 (SD = 123,866), respectively. Median income = 82,000 (IQR = 99,000) and 85,000 (IQR = 100,600), respectively. The population percentage is an estimate based on census data. \(n_{simulation}\) = 1,009; \(n_{census}\) = 2,366,924. Both the simulated and ‘population’ data is based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

10.4.6 Number of Children

This section gives an overview of the distribution of the number of children in the simulated data, as well as census-based estimates of the US population for comparison. This is provided metrically in table 10.12 and graphically in figure 10.12. The mode of this discrete distribution of the sample is ‘0’ (64.62%). With a sample size of 1,009, the simulated sample captures 99.99% of the information about the number of children in the population.

Table 10.12: Distribution of Children
Number of Children Sample Frequency Sample Percentage (%) Population Percentage (%) Difference (pp)
0 652 64.62 65.39 -0.768
1 177 17.54 17.07 0.475
2 121 11.99 11.52 0.472
3 44 4.36 4.23 0.133
4 Or More 15 1.49 1.80 -0.311
<NA> 0 0.00 0.00 0.000
Total 1,009 100.00 100.00 0.000

NOTE: Comparison of simulated versus ‘population’ data. The population percentage is an estimate based on census data. \(n_{simulation}\) = 1,009; \(n_{census}\) = 2,366,924. pp = Percentage points. NA = Missing data. Both the simulated and ‘population’ data is based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Figure 10.12: Distribution of Children

Distribution of Children

NOTE: Comparison of simulated versus ‘population’ data. The population percentage is an estimate based on census data. \(n_{simulation}\) = 1,009; \(n_{census}\) = 2,366,924. Both the simulated and ‘population’ data is based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

10.4.7 Area of Residence

This section gives an overview of the distribution of area residencies in the simulated data, as well as census-based estimates of the US population for comparison. This is provided metrically in table 10.13 and graphically in figure 10.13. The mode of this discrete distribution of the sample is ‘Small Town Or Suburb’ (78.20%). With a sample size of 1,009, the simulated sample captures 100.00% of the information about area residencies in the population.

Table 10.13: Distribution of Area Residencies
Area Residence Sample Frequency Sample Percentage (%) Population Percentage (%) Difference (pp)
City 105 10.4 10.9 -0.465
Rural Area 115 11.4 10.9 0.453
Small Town Or Suburb 789 78.2 78.2 0.012
<NA> 0 0.0 0.0 0.000
Total 1,009 100.0 100.0 0.000

NOTE: Comparison of simulated versus ‘population’ data. The population percentage is an estimate based on census data. \(n_{simulation}\) = 1,009; \(n_{census}\) = 2,366,924. pp = Percentage points. NA = Missing data. Both the simulated and ‘population’ data is based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Figure 10.13: Distribution of Area Residencies

Distribution of Area Residencies

NOTE: Comparison of simulated versus ‘population’ data. The population percentage is an estimate based on census data. \(n_{simulation}\) = 1,009; \(n_{census}\) = 2,366,924. Both the simulated and ‘population’ data are based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

10.4.8 Sex

This section gives an overview of the distribution of sex in the simulated data, as well as census-based estimates of the US population for comparison. This is provided metrically in table 10.14 and graphically in figure 10.14. The mode of this discrete distribution of the sample is ‘Female’ (52.82%). With a sample size of 1,009, the simulated sample captures 99.99% of the information about sex in the population.

Table 10.14: Distribution of Sexes
Sex Sample Frequency Sample Percentage (%) Population Percentage (%) Difference (pp)
Female 533 52.8 52.2 0.585
Male 476 47.2 47.8 -0.585
<NA> 0 0.0 0.0 0.000
Total 1,009 100.0 100.0 0.000

NOTE: Comparison of simulated versus ‘population’ data. The population percentage is an estimate based on census data. \(n_{simulation}\) = 1,009; \(n_{census}\) = 2,366,924. pp = Percentage points. NA = Missing data. Both the simulated and ‘population’ data is based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Figure 10.14: Distribution of Sexes

Distribution of Sexes

NOTE: Comparison of simulated versus ‘population’ data. The population percentage is an estimate based on census data. \(n_{simulation}\) = 1,009; \(n_{census}\) = 2,366,924. Both the simulated and ‘population’ data are based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

10.4.9 Marital Status

This section gives an overview of the distribution of marital status in the simulated data, as well as census-based estimates of the US population for comparison. This is provided metrically in table 10.15 and graphically in figure 10.15. The mode of this discrete distribution of the sample is ‘Married’ (56.69%). With a sample size of 1,009, the simulated sample captures 99.92% of the information about marital status in the population.

Table 10.15: Distribution of Marital Status
Marital Status Sample Frequency Sample Percentage (%) Population Percentage (%) Difference (pp)
Divorced 118 11.695 11.35 0.341
Married 572 56.690 56.67 0.017
Not Married 223 22.101 23.80 -1.702
Separated 10 0.991 1.42 -0.433
Widowed 86 8.523 6.75 1.776
<NA> 0 0.000 0.00 0.000
Total 1,009 100.000 100.00 0.000

NOTE: Comparison of simulated versus ‘population’ data. The population percentage is an estimate based on census data. \(n_{simulation}\) = 1,009; \(n_{census}\) = 2,366,924. pp = Percentage points. NA = Missing data. Both the simulated and ‘population’ data is based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Figure 10.15: Distribution of Marital Status

Distribution of Marital Status

NOTE: Comparison of simulated versus ‘population’ data. The population percentage is an estimate based on census data. \(n_{simulation}\) = 1,009; \(n_{census}\) = 2,366,924. Both the simulated and ‘population’ data are based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

10.4.10 Employment Status

This section gives an overview of the distribution of employment status in the simulated data, as well as census-based estimates of the US population for comparison. This is provided metrically in table 10.16 and graphically in figure 10.16. The mode of this discrete distribution of the sample is ‘Employed’ (60.36%). With a sample size of 1,009, the simulated sample captures 99.96% of the information about employment status in the population.

Table 10.16: Distribution of Employment Status
Employment Status Sample Frequency Sample Percentage (%) Population Percentage (%) Difference (pp)
Employed 609 60.36 59.05 1.303
Not In Labor Force 376 37.27 38.67 -1.408
Unemployed 24 2.38 2.27 0.105
<NA> 0 0.00 0.00 0.000
Total 1,009 100.00 100.00 0.000

NOTE: Comparison of simulated versus ‘population’ data. The population percentage is an estimate based on census data. \(n_{simulation}\) = 1,009; \(n_{census}\) = 2,366,924. pp = Percentage points. NA = Missing data. Both the simulated and ‘population’ data is based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Figure 10.16: Distribution of Employment Status

Distribution of Employment Status

NOTE: Comparison of simulated versus ‘population’ data. The population percentage is an estimate based on census data. \(n_{simulation}\) = 1,009; \(n_{census}\) = 2,366,924. Both the simulated and ‘population’ data is based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

10.4.11 Political Party

This section gives an overview of the distribution of political parties in the simulated data, as well as census-based estimates of the US population for comparison. This is provided metrically in table 10.17 and graphically in figure 10.17. The mode of this discrete distribution of the sample is ‘Democratic Party’ (53.72%). With a sample size of 1,009, the simulated sample captures 99.95% of the information about political parties in the population.

Table 10.17: Distribution of Preferred Political Parties
Political Party Sample Frequency Sample Percentage (%) Population Percentage (%) Difference (pp)
Democratic Party 542 53.7 55.3 -1.58
Republican Party 467 46.3 44.7 1.58
<NA> 0 0.0 0.0 0.00
Total 1,009 100.0 100.0 0.00

NOTE: Comparison of simulated versus ‘population’ data. The population percentage is an estimate based on census data. \(n_{simulation}\) = 1,009; \(n_{census}\) = 2,366,924. pp = Percentage points. NA = Missing data. Both the simulated and ‘population’ data is based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Figure 10.17: Distribution of Preferred Political Parties

Distribution of Preferred Political Parties

NOTE: Comparison of simulated versus ‘population’ data. The population percentage is an estimate based on census data. \(n_{simulation}\) = 1,009; \(n_{census}\) = 2,366,924. Both the simulated and ‘population’ data are based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

10.4.12 Political Ideology

This section gives an overview of the distribution of political ideologies in the simulated data, as well as census-based estimates of the US population for comparison. This is provided metrically in table 10.18 and graphically in figure 10.18. The mode of this discrete distribution of the sample is ‘Conservative’ (43.01%). With a sample size of 1,009, the simulated sample captures 99.83% of the information about political ideologies in the population.

Table 10.18: Distribution of Political Ideologies
Political Ideology Sample Frequency Sample Percentage (%) Population Percentage (%) Difference (pp)
Extremely Liberal 6 0.595 0.503 0.092
Liberal 192 19.029 19.982 -0.953
Slightly Liberal 70 6.938 6.233 0.705
Moderate; Middle Of The Road 282 27.948 29.519 -1.571
Slightly Conservative 17 1.685 2.313 -0.628
Conservative 434 43.013 40.395 2.618
Extremely Conservative 8 0.793 1.056 -0.263
<NA> 0 0.000 0.000 0.000
Total 1,009 100.000 100.000 0.000

NOTE: Comparison of simulated versus ‘population’ data. The population percentage is an estimate based on census data. \(n_{simulation}\) = 1,009; \(n_{census}\) = 2,366,924. pp = Percentage points. NA = Missing data. Both the simulated and ‘population’ data is based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Figure 10.18: Distribution of Political Ideologies

Distribution of Political Ideologies

NOTE: Comparison of simulated versus ‘population’ data. The population percentage is an estimate based on census data. \(n_{simulation}\) = 1,009; \(n_{census}\) = 2,366,924. Both the simulated and ‘population’ data are based on data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

10.5 Synthetic Data

10.5.1 Welfare Support

This section gives an overview of the distribution of welfare support as measured among the synthetic respondents across the ‘lazy’ and ‘unlucky’ welfare recipient manipulation conditions. The distributions of welfare support in the synthetic data across the two conditions are provided metrically in table 10.19 and graphically in figure 10.19. The modes of these discrete distributions are ‘6 Agree’ (35.58%) and ‘1 Strongly disagree’ (86.22%), respectively.
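The cross-tabulation in table 10.19 can be produced along the following lines; this is a minimal sketch, assuming a hypothetical data frame `synthetic` with one row per respondent-condition pair and columns `condition` and `welfare_support` (the latter a factor, so empty response categories are retained):

```r
library(dplyr)
library(tidyr)

welfare_tab <- synthetic |>
  count(condition, welfare_support, name = "freq", .drop = FALSE) |>
  group_by(condition) |>
  mutate(pct = 100 * freq / sum(freq)) |>
  ungroup() |>
  pivot_wider(names_from = condition, values_from = c(freq, pct),
              values_fill = 0)
```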

Table 10.19: Distribution of Support for Restricting Welfare Eligibility
Support for Restricting Welfare Eligibility Lazy Frequency Lazy Percentage (%) Unlucky Frequency Unlucky Percentage (%) Difference (pp)
1 Strongly disagree 10 0.991 870 86.22 -85.23
2 Disagree 108 10.704 80 7.93 2.77
3 Somewhat disagree 335 33.201 0 0.00 33.20
4 Neither agree nor disagree 40 3.964 59 5.85 -1.88
5 Somewhat agree 157 15.560 0 0.00 15.56
6 Agree 359 35.580 0 0.00 35.58
7 Strongly agree 0 0.000 0 0.00 0.00
<NA> 0 0.000 0 0.00 0.00
Total 1,009 100.000 1,009 100.00 0.00

NOTE: Comparison of support for restricting welfare eligibility across group conditions. \(n\) = 1,009. Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions created by Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Figure 10.19: Distribution of Support for Restricting Welfare Eligibility

Distribution of Support for Restricting Welfare Eligibility

NOTE: Comparison of support for restricting welfare eligibility across group conditions. \(n\) = 1,009. Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions created by Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

10.5.2 Simultaneity Test

This section gives an overview of the diagnostics and estimates from the Bayesian Generalized Linear Model (BGLM) used to test the simultaneity assumption (see table 4.8). A traceplot of the chain convergence of this model is provided graphically in figure 10.20. The model’s goodness-of-fit is provided as a posterior predictive plot in figure 10.21; the explained variance (\(R^2\), cf. Gelman et al. 2019) is not reported, as it is entirely invariant. Building on Bowley (1910) and Tukey (1910), a 13-number summary of the estimated mean time difference (\(\hat{\delta}\)) is provided in table 10.20 and visually in figure 10.22, while the estimated dispersion (\(\hat{\varsigma}\)) is provided in table 10.21 and figure 10.23. When possible and meaningful, the mode, median, mean, interquartile range (IQR), skewness, kurtosis, posterior predictive p-value (\(PP_P\), Gelman et al. 1996), chain convergence (\(\hat{R}\), Gelman and Rubin 1992; Vehtari et al. 2021), effective sample size (ESS, Bürkner 2017), and Monte-Carlo Standard Error (MCSE, Kruschke 2014) are listed in the table notes.
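The Bayesian side of this test amounts to an intercept-only BGLM on the standardized measurement-time differences. A minimal sketch, assuming a hypothetical data frame `d` with a `time_diff_z` column holding the standardized difference in measurement time between the two conditions for each synthetic respondent:

```r
library(brms)

fit_sim <- brm(
  time_diff_z ~ 1,      # the intercept estimates the mean difference (delta)
  data   = d,
  family = gaussian(),  # its residual dispersion corresponds to varsigma
  chains = 7, iter = 7715, warmup = 2000  # 7 x 5,715 = 40,005 draws
)
```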

10.5.2.1 Model Diagnostics

Figure 10.20: Chain Convergence of the Simultaneity Test BGLM

Chain Convergence of the Simultaneity Test BGLM

NOTE: Traceplot. Total chains = 7, total samples per chain = 7,715, warmup samples per chain = 2,000. Total posterior samples = 40,005. The gray-shaded area indicates the warmup samples. Effective Sample Size (ESS, Bürkner 2017) = 21,108 and 21,876, respectively. Chain convergence (\(\hat{R}\), Gelman and Rubin 1992; Vehtari et al. 2021) = 1.000 and 1.000, respectively. Monte Carlo Standard Error (MCSE, Kruschke 2014) = .000 and .000, respectively. Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions created by Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Figure 10.21: Posterior Predictive Distribution of the Simultaneity Test BGLM

Posterior Predictive Distribution of the Simultaneity Test BGLM

NOTE: Posterior predictive plot with 100 random draws shown. \(PP_P\) (cf. Gelman et al. 1996) = 0.999. Posterior samples = 40,005. Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions created by Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

10.5.2.2 Model Parameters

Table 10.20: 13-Number Summary of the Measurement Time Difference Between Manipulation Conditions
Percentile Posterior Likelihood
0% -0.093 -0.002
0.1% -0.065 -0.001
1% -0.041 0.000
5% -0.019 0.001
10% -0.007 0.001
25% 0.012 0.002
50% 0.033 0.033
75% 0.054 0.063
90% 0.073 0.066
95% 0.085 0.096
99% 0.107 0.128
99.9% 0.130 0.160
100% 0.180 0.222

NOTE: Posterior and (maximum) likelihood distributions of the estimated mean parameter (\(\hat{\delta}\)), estimated with NUTS MCMC (Hoffman and Gelman 2014) and OLS-based non-parametric bootstrap (Efron and Tibshirani 1994), respectively. Posterior/likelihood samples = 40,005. Mean = .033 and .033, respectively. IQR = .042 and .061, respectively. Skewness = .004 and .994, respectively. Kurtosis = 1.001 and .041, respectively. Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions created by Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).
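
The ‘Likelihood’ column of such a summary could, for instance, be produced with an OLS-based non-parametric bootstrap along the following lines (a minimal sketch assuming the boot package and the same hypothetical data frame as above; R is set to match the number of samples reported in the note):

library(boot)  # non-parametric bootstrap (Efron and Tibshirani 1994)

# Statistic: the OLS intercept (i.e., the mean) of the standardized
# time difference, recomputed on each resample.
ols_mean <- function(d, idx) coef(lm(time_diff_std ~ 1, data = d[idx, ]))[1]

boot_out <- boot(synthetic_data, statistic = ols_mean, R = 40005)

# 13-number summary of the bootstrap (likelihood) distribution
probs <- c(0, .001, .01, .05, .10, .25, .50, .75, .90, .95, .99, .999, 1)
quantile(boot_out$t, probs = probs)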

Figure 10.22: Measurement Time Difference Between Manipulation Conditions

NOTE: Posterior, (maximum) likelihood, and prior distributions of the estimated intercept mean parameter (\(\hat{\delta}\)) in time difference standard deviations. Posterior/likelihood/prior samples = 40,005; the prior may not be accurately visualized due to the finite number of samples. Because the prior distribution is substantially wider than the posterior and likelihood distributions, it is not visible at the scale required to display them. The dashed black line indicates a time difference of zero. The geometry below the distributions indicates the posterior mode (circle) and the 95% HDCI (thick bar). Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions created by Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Table 10.21: 13-Number Summary of the Explanatory Error of the Simultaneity Test BGLM
Percentile Posterior Likelihood
0% 0.926 0.010
0.1% 0.938 0.015
1% 0.954 0.019
5% 0.968 0.023
10% 0.975 0.026
25% 0.987 0.031
50% 1.001 1.000
75% 1.015 1.414
90% 1.029 1.414
95% 1.036 1.731
99% 1.051 1.997
99.9% 1.068 2.232
100% 1.109 2.638

NOTE: Posterior and (maximum) likelihood distributions of the estimated dispersion parameter (\(\hat{\varsigma}\)), estimated with NUTS MCMC (Hoffman and Gelman 2014) and OLS-based non-parametric bootstrap (Efron and Tibshirani 1994), respectively. Posterior/likelihood samples = 40,005. Mean = 1.001 and .785, respectively. IQR = .028 and 1.382, respectively. Skewness = .086 and -.079, respectively. Kurtosis = -1.312 and .002, respectively. Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions created by Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Figure 10.23: Explanatory Error of the Simultaneity Test BGLM

NOTE: Posterior, (maximum) likelihood, and prior distributions of the estimated dispersion parameter (\(\hat{\varsigma}\)). Posterior/likelihood/prior samples = 40,005; the prior may not be accurately visualized due to the finite number of samples. The geometry below the distributions indicates the posterior mode (circle) and the 95% HDCI (thick bar). Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions created by Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

10.5.3 Deservingness & Welfare Support

This section gives an overview of the diagnostics and estimates of the Multilevel Bayesian Generalized Linear Model (ML-BGLM) for testing the relationship between deservingness and welfare support (see table 4.10). Building on Bowley (1910) and Tukey (1977), this includes 13-number summaries. A traceplot of the chain convergence of this model is provided in figure 10.24, while the model’s goodness-of-fit is provided as a posterior predictive distribution in figure 10.25 and as explained variance (\(R^2\), cf. Gelman et al. 2019), metrically in table 10.22 and visually in figure 10.26. The estimated coefficient (\(\hat{\beta_G}\)) is provided metrically in table 10.23 and visually in figure 10.27, while the estimated intercept mean (\(\hat{\mu_\alpha}\)) and dispersion (\(\hat{\sigma_\alpha}\)) are provided in tables 10.24 and 10.25, and in figures 10.28 and 10.29, respectively. The model’s estimated dispersion parameter (\(\hat{\sigma}\)) is provided in table 10.26 and figure 10.30, and its estimated degrees of freedom (df) parameter (\(\hat{\nu}\)) in table 10.27 and figure 10.31. When possible and meaningful, the mode, median, mean, interquartile range (IQR), skewness, kurtosis, posterior predictive p-value (\(PP_P\), Gelman et al. 1996), chain convergence (\(\hat{R}\), Gelman and Rubin 1992; Vehtari et al. 2021), effective sample size (ESS, Bürkner 2017), and Monte Carlo Standard Error (MCSE, Kruschke 2014) are listed in the table notes.
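
As a minimal sketch of how such a model could be specified, assuming brms and hypothetical data and variable names (synthetic_long, support, group, unit_id); the priors and preprocessing described in the text are omitted:

library(brms)

# Multilevel BGLM: varying intercepts per synthetic unit yield the intercept
# mean (mu_alpha-hat) and dispersion (sigma_alpha-hat); 'group' yields the
# manipulation coefficient (beta_G-hat); the Student t likelihood yields the
# dispersion (sigma-hat) and degrees of freedom (nu-hat).
fit_ml <- brm(
  support ~ group + (1 | unit_id),
  data   = synthetic_long,                 # hypothetical long-format data
  family = student(),
  chains = 7, iter = 7715, warmup = 2000,  # sampler settings from the notes
  seed   = 42
)

summary(fit_ml)                 # R-hat, ESS, and parameter estimates
pp_check(fit_ml, ndraws = 100)  # posterior predictive check (cf. figure 10.25)
bayes_R2(fit_ml)                # explained variance (cf. table 10.22)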

10.5.3.1 Model Diagnostics

Figure 10.24: Chain Convergence of the Deservingness & Welfare Support ML-BGLM

NOTE: Traceplot. Total chains = 7, total samples per chain = 7,715, warmup samples per chain = 2,000. Total posterior samples = 40,005. The gray-shaded area indicates the warmup samples. Effective Sample Size (ESS, Bürkner 2017) = 80,562, 48,813, 11,326, 15,747, 59,833, respectively. Chain convergence (\(\hat{R}\), Gelman and Rubin 1992; Vehtari et al. 2021) = 1.000, 1.000, 1.000, 1.000, 1.000, respectively. Monte Carlo Standard Error (MCSE, Kruschke 2014) = .000, .000, .000, .000, .707, respectively. Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions created by Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Figure 10.25: Posterior Predictive Distribution of the Deservingness & Welfare Support ML-BGLM

NOTE: Posterior predictive plot with 100 random draws shown. \(PP_P\) (cf. Gelman et al. 1996) = 0.942. Posterior samples = 40,005. Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions created by Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Table 10.22: 13-Number Summary of the Explanatory Power of the Deservingness & Welfare Support ML-BGLM
Percentile Estimate
0% 0.674
0.1% 0.686
1% 0.695
5% 0.704
10% 0.708
25% 0.714
50% 0.722
75% 0.729
90% 0.735
95% 0.738
99% 0.744
99.9% 0.751
100% 0.760

NOTE: Posterior distribution of the explained variance (\(R^2\), cf. Gelman et al. 2019). Posterior samples = 40,005. Mean = .721. IQR = .014. Skewness = -.193. Kurtosis = .061. Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions created by Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Figure 10.26: Explanatory Power of the Deservingness & Welfare Support ML-BGLM

NOTE: Posterior distribution of the explained variance (\(R^2\), cf. Gelman et al. 2019). Posterior samples = 40,005. The geometry below the distribution indicates the posterior mode (circle), the 95% HDCI (thick bar), the 99% HDCI (thin bar), and the 99.9% HDCI (line). Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions created by Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

10.5.3.2 Model Parameters

Table 10.23: 13-Number Summary of Difference in Welfare Support across Manipulation Conditions
Percentile Posterior Likelihood
0% 0.572 0.571
0.1% 0.579 0.580
1% 0.586 0.587
5% 0.592 0.592
10% 0.595 0.596
25% 0.601 0.601
50% 0.607 0.608
75% 0.613 0.613
90% 0.619 0.619
95% 0.622 0.622
99% 0.628 0.628
99.9% 0.635 0.635
100% 0.643 0.648

NOTE: Posterior and (maximum) likelihood distributions of the estimated normalized coefficient parameter (\(\hat{\beta_G}\)), estimated with NUTS MCMC (Hoffman and Gelman 2014) and REML-based non-parametric bootstrap (Efron and Tibshirani 1994), respectively. Posterior/likelihood samples = 40,005. Mean = .607 and .607, respectively. IQR = .012 and .012, respectively. Skewness = .009 and -.008, respectively. Kurtosis = -.017 and .008, respectively. Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions created by Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).
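
The ‘Likelihood’ column could, for instance, be produced with a REML-based non-parametric (cluster) bootstrap along the following lines (a minimal sketch assuming lme4 and the hypothetical objects from above, with group coded numerically; fewer replicates are shown for brevity):

library(lme4)

# One replicate: resample synthetic units with replacement, refit the REML
# model, and record the manipulation coefficient. Resampled clusters get a
# fresh id so that duplicated units remain distinct.
boot_once <- function() {
  ids <- sample(unique(synthetic_long$unit_id), replace = TRUE)
  resampled <- do.call(rbind, lapply(seq_along(ids), function(j) {
    block <- synthetic_long[synthetic_long$unit_id == ids[j], ]
    block$unit_id <- j
    block
  }))
  fixef(lmer(support ~ group + (1 | unit_id), data = resampled, REML = TRUE))["group"]
}

boot_coefs <- replicate(1000, boot_once())
probs <- c(0, .001, .01, .05, .10, .25, .50, .75, .90, .95, .99, .999, 1)
quantile(boot_coefs, probs = probs)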

Figure 10.27: Difference in Welfare Support across Manipulation Conditions

NOTE: Posterior, (maximum) likelihood, and prior distributions of the estimated normalized coefficient parameter (\(\hat{\beta_G}\)). Posterior/likelihood/prior samples = 40,005; the prior may not be accurately visualized due to the finite number of samples. The posterior distribution is not easily visible due to being obscured by the likelihood distribution. The dashed black line indicates a difference in welfare support of zero. The geometry below the distributions indicates the posterior mode (circle) and the 95% HDCI (thick bar). Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions created by Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Table 10.24: 13-Number Summary of Welfare Support when presented with a ‘Lazy’ Welfare Recipient
Percentile Posterior Likelihood
0% 0.310 0.308
0.1% 0.317 0.317
1% 0.324 0.323
5% 0.329 0.328
10% 0.331 0.331
25% 0.336 0.336
50% 0.341 0.342
75% 0.346 0.347
90% 0.351 0.352
95% 0.354 0.355
99% 0.359 0.361
99.9% 0.364 0.367
100% 0.375 0.378

NOTE: Posterior and (maximum) likelihood distributions of the estimated normalized intercept parameter (\(\hat{\mu_\alpha}\)), estimated with NUTS MCMC (Hoffman and Gelman 2014) and REML-based non-parametric bootstrap (Efron and Tibshirani 1994), respectively. Posterior/likelihood samples = 40,005. Mean = .341 and .342, respectively. IQR = .010 and .011, respectively. Skewness = .004 and .024, respectively. Kurtosis = -.028 and -.010, respectively. Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions created by Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Figure 10.28: Welfare Support when presented with a ‘Lazy’ Welfare Recipient

NOTE: Posterior, (maximum) likelihood, and prior distributions of the estimated normalized intercept mean parameter (\(\hat{\mu_\alpha}\)). Posterior/likelihood/prior samples = 40,005; the prior may not be accurately visualized due to the finite number of samples. The geometry below the distributions indicates the posterior mode (circle) and the 95% HDCI (thick bar). Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions created by Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Table 10.25: 13-Number Summary of Welfare Support Variability when presented with a ‘Lazy’ Welfare Recipient
Percentile Posterior Likelihood
0% 0.094 0.147
0.1% 0.102 0.152
1% 0.108 0.154
5% 0.113 0.157
10% 0.116 0.158
25% 0.121 0.160
50% 0.126 0.163
75% 0.131 0.165
90% 0.136 0.167
95% 0.138 0.169
99% 0.143 0.171
99.9% 0.148 0.174
100% 0.156 0.176

NOTE: Posterior and (maximum) likelihood distributions of the estimated normalized intercept dispersion parameter (\(\hat{\sigma_\alpha}\)), estimated with NUTS MCMC (Hoffman and Gelman 2014) and REML-based non-parametric bootstrap (Efron and Tibshirani 1994), respectively. Posterior/likelihood samples = 40,005. Mean = .126 and .163, respectively. IQR = .010 and .005, respectively. Skewness = -.075 and -.075, respectively. Kurtosis = .023 and -.013, respectively. Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions created by Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Figure 10.29: Welfare Support Variability when presented with a ‘Lazy’ Welfare Recipient

NOTE: Posterior, (maximum) likelihood, and prior distributions of the estimated normalized intercept dispersion parameter (\(\hat{\sigma_\alpha}\)). Posterior/likelihood/prior samples = 40,005; the prior may not be accurately visualized due to the finite number of samples. The dashed black line indicates an intercept dispersion of zero. The geometry below the distributions indicates the posterior mode (circle) and the 95% HDCI (thick bar). Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions created by Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Table 10.26: 13-Number Summary of the Dispersion of the Deservingness & Welfare Support ML-BGLM
Percentile Posterior Likelihood
0% 0.187 0.163
0.1% 0.191 0.166
1% 0.194 0.168
5% 0.197 0.170
10% 0.199 0.170
25% 0.201 0.172
50% 0.204 0.173
75% 0.207 0.175
90% 0.210 0.176
95% 0.212 0.177
99% 0.216 0.178
99.9% 0.219 0.180
100% 0.224 0.183

NOTE: Posterior and (maximum) likelihood distributions of the estimated normalized dispersion parameter (\(\hat{\sigma}\)), estimated with NUTS MCMC (Hoffman and Gelman 2014) and REML-based non-parametric bootstrap (Efron and Tibshirani 1994), respectively. Posterior/likelihood samples = 40,005. Mean = .204 and .173, respectively. IQR = .006 and .003, respectively. Skewness = .125 and -.037, respectively. Kurtosis = .071 and .030, respectively. Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions created by Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Figure 10.30: Dispersion of the Deservingness & Welfare Support ML-BGLM

NOTE: Posterior, (maximum) likelihood, and prior distributions of the estimated normalized dispersion parameter (\(\hat{\sigma}\)). Posterior/likelihood/prior samples = 40,005; the prior may not be accurately visualized due to the finite number of samples. The geometry below the distributions indicates the posterior mode (circle) and the 95% HDCI (thick bar). Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions created by Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Table 10.27: 13-Number Summary of the Degrees of Freedom of the Deservingness & Welfare Support ML-BGLM
Percentile Posterior
0% 134
0.1% 574
1% 1,170
5% 1,532
10% 1,650
25% 1,795
50% 1,899
75% 1,963
90% 1,997
95% 2,007
99% 2,016
99.9% 2,018
100% 2,018

NOTE: Posterior distribution of the estimated model degrees of freedom (\(\hat{\nu}\)), estimated with NUTS MCMC (Hoffman and Gelman 2014). Posterior samples = 40,005. Mean = 1849.486. IQR = 168.036. Skewness = -2.703. Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions created by Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

Figure 10.31: Degrees of Freedom of the Deservingness & Welfare Support ML-BGLM

NOTE: Posterior and prior distributions of the estimated model degrees of freedom (\(\hat{\nu}\)). Posterior/prior samples = 40,005; the prior may not be accurately visualized due to the finite number of samples. The geometry below the distributions indicates the posterior mode (circle) and the 95% HDCI (thick bar). Results derived from synthetic data generated by OpenAI’s (2024a) gpt-4o-mini using vignettes and questions created by Aarøe and Petersen (2014) and Petersen et al. (2012), as well as data by the IPUMS (Ruggles et al. 2024) and ANES (2021).

10.6 Citation

The Synthetic Causal Framework (SCF) and its related concepts of ‘parallel worlds’-estimation and the Synthetically Controlled Trial (SCT) are original concepts by Emil Niclas Meyer-Hansen, conceived as part of the author’s PhD project proposal for Aarhus University (submitted 2024-9-5). This work serves as the founding paper of these concepts. At present, Emil Niclas Meyer-Hansen is the sole author of and contributor to their ongoing development, conceptualization, formalization, and implementation. For correspondence, contact the author via email: .

If you use, refer to, and/or continue the development of the SCF and/or SCT, please provide proper reference and citation to their founding author. An example of a proper citation is provided below:

  • Meyer-Hansen, E. N. (2025): The Synthetic Causal Framework: An LLM-based Solution to the Fundamental Problem of Causal Inference. Preview version (Large-N Release): 2025-03-18-18-53, on Open Science Framework. DOI: 10.17605/OSF.IO/5P3M2

For LaTeX users, a BibTeX entry is provided below (the citation key shown is illustrative and may be replaced):

@Manual{meyerhansen2025scf,
  title = {The Synthetic Causal Framework: An LLM-based Solution to the Fundamental Problem of Causal Inference},
  author = {Emil Niclas Meyer-Hansen},
  organization = {Open Science Framework},
  year = {2025},
  doi = {10.17605/OSF.IO/5P3M2},
  note = {Preview version (Large-N Release)}
}

  1. Independent researcher. Holds a Master of Science (MSc.) and Bachelor of Science (BSc.) in Political Science from Aarhus University, Denmark. Email: ↩︎

  2. Regarding the notion that no theory can ever be proven correct (Lakatos 1970; Miller 1974; Popper 2002[1959]; see also Dellsén 2024), consistent with Thomas Kuhn (1970; as cited by Farrell 2014: 43), while also emulating Richard McElreath (2019: xiii), it may be boldly argued that all theories are false, but some are useful.↩︎

  3. Note that within the Frequentist NHST, a null hypothesis (\(H_0\)) can never be disproven nor proven (i.e., ‘accepted,’ Gigerenzer 2004; Wagenmakers et al. 2010: 159; Wilkinson and The Task Force on Statistical Inference 1999: 599).↩︎

  4. As demonstrated by Google Scholar currently listing 505,000 results for the search query: ‘causal inference’.↩︎

  5. The 10,278 projects created in 2024 are those currently registered on the Open Science Framework (OSF) that mention ‘causal inference’; many more such projects likely exist that are not registered on the OSF.↩︎

  6. The criterion that causal relationships necessitate physical phenomena serves to exclude metaphysical phenomena, which are generally considered inherently beyond the scope of science (Farrell 2014: 8-10; Hume 1975[1777]: 165; Van Fraassen 2002). Note that the term ‘physical phenomena’ here includes psychological, sociological, and related socially constructed phenomena (Hacking 1999), at least as long as they are, in principle, reducible to physical phenomena.↩︎

  7. The criterion that causal relationships involve physically distinct phenomena (e.g., \(X \ne Y\)) serves to a priori exclude considerations of the nonetheless interesting concept of self-causing phenomena (i.e., causa sui, Masi 2023).↩︎

  8. The notion that causal relationships involve temporally distinct phenomena (e.g., \(X \prec Y\)) serves to a priori exclude concerns for retrocausality (Leifer and Pusey 2017) and mutual causation (Yang et al. 2018), which under some circumstances may have some validity to them (for discussion, see, e.g., Drummond and Reid 2021). Note that this does not exclude feedback causation (Baumeister et al. 2007). For example, suppose two temporally-variable vector phenomena (e.g., \(X_t\) and \(Y_t\), with all \(X_t \ne Y_t\) and all \(X_t \prec Y_t\)), where \(t\) indexes increases in time. At some initial point in time, suppose that \(X_1\) causes \(Y_1\), for example, through a stochastic process (e.g., \(Y_1 = X_1 + \epsilon_1\)). Then, at a later time point, \(Y_1\) causes \(X_2\) in a similar process (e.g., \(X_2 = Y_1 + X_1 + \epsilon_2\)). Such feedback causation is not excluded from consideration here.↩︎

  9. Note that this definition of a cause is intentionally simplified to mostly include sufficient, probabilistic, and contrary-to-fact causes (Seawright 2002). As such, it deliberately ignores valid considerations of more complex causes, such as INUS causes (Mackie 1980).↩︎

  10. As an extended consideration of causal effects possibly being heterogeneous conditional on \(X\), it should be noted that heterogeneity can extend to the specific level of \(C\) held constant, for example, if subsets of \(C\) interact with \(D\) conditional on \(X\). However, with a sample representative of the population of interest as achieved with random sampling (see below), this does not bias causal inference, and notation to account for this scenario is thus not considered here.↩︎

  11. For example, suppose that researchers posit a causal relationship between height and age. They conduct a study where they first measure the height of subjects, followed by asking them their age. The researchers run a model and find that an increase in height is associated with an increase in age. Instead of considering the possibility of reverse causality, i.e., that increases in age cause increases in height, they argue that this is impossible because they measured height prior to age. Such an argument would be nonsensical, precisely because both height and age exist prior to their measurement.↩︎

  12. Note that a masking effect may alternatively be denoted a suppression effect (cf. Lenz and Sahn 2021).↩︎

  13. This is an emphasized criticism specifically targeting flawed definitions of causality, such as one currently made by Merriam-Webster, which includes a definition of causality as ‘the relation […] between regularly correlated events or phenomena’. Since correlated events or phenomena can, in principle, be just as regularly confounded, this is an example of a cum hoc ergo propter hoc logical fallacy (Manninen 2019c), and the definition is erroneous and promotes an unreliable conceptualization of causality.↩︎

  14. As an extension of the statement that correlation does not imply causation, Professor of Statistics at Columbia University, Andrew Gelman, has stated that ‘correlation does not even imply correlation’. For an interesting and alternative view specific to quantum mechanics, see, e.g., Fitzsimons et al. (2015).↩︎

  15. As a response to criticisms of this statement by Kenneth A. Bollen and Judea Pearl (2013: 312-313; see also Bollen 1989: 41; Goldthorpe 2001: 15; Bhrolcháin and Dyson 2007: 3), it should be clarified that the statement ‘no causation without manipulation’ by Paul W. Holland and Donald Rubin (Holland 1986: 959) does not make the ontological claim (cf. Jacquette 1976) that causal effects cannot occur without manipulation, since causal effects undoubtedly occur all the time without human intervention, of which both Paul W. Holland and Donald Rubin are likely aware. Instead, it makes the epistemological claim (cf. Rosenberg 2016: 2) that we cannot reliably identify a causal effect without a manipulation. Furthermore, this manipulation does not necessarily require human intervention, for example, a natural experiment (Dunning 2008) could qualify, and if the assumptions for alternative strategies do hold, reliable causal inference can indeed be drawn in quasi-experiments (Angrist and Pischke 2009: 113). But for non-manipulated settings this merely begs the question (Rivera 2019) of how researchers can credibly assess whether assumptions outside of their control have been violated. As such, the statement ‘no causation without manipulation’ (Holland 1986: 959) serves to outline the premise for the scientific method on how to draw epistemologically reliable causal inference.↩︎

  16. Multilevel models can be implemented in standard statistical software. For R (R Core Team 2024), Bayesian multilevel models can be implemented using, for example, brms (Bürkner 2017, 2018) and rstanarm (Goodrich et al. 2024), while Frequentist versions are provided in, e.g., the lme4 package (Bates et al. 2015). For Stata (StataCorp 2023), Bayesian and Frequentist multilevel models are provided as standard features.↩︎
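
As a minimal sketch of the equivalent specifications mentioned in the footnote above (assuming hypothetical data and variable names):

library(brms)
library(rstanarm)
library(lme4)

f <- support ~ group + (1 | unit_id)  # hypothetical multilevel formula

fit_brms     <- brm(f, data = synthetic_long)                # Bayesian (Stan, NUTS)
fit_rstanarm <- stan_lmer(f, data = synthetic_long)          # Bayesian (Stan)
fit_lme4     <- lmer(f, data = synthetic_long, REML = TRUE)  # Frequentist (REML)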

  17. In Aarøe and Petersen (2014), the wording of the ‘lazy’ welfare recipient vignette is: ‘Imagine a man who is currently on social welfare. He has always had a regular job, but has now been the victim of a work-related injury. He is very motivated to get back to work again’ (Aarøe and Petersen 2014: 689). Wording of the ‘unlucky’ welfare recipient is ‘Imagine a man who is currently on social welfare. He has never had a regular job, but he is fit and healthy. He is not motivated to get a job’ (Aarøe and Petersen 2014: 689). Wording of the ‘control’ welfare recipient is ‘Imagine a man who is currently on social welfare’ (Aarøe and Petersen 2014: 689).↩︎

  18. The term ‘statistically discernible’ is explicitly used as a more appropriate term compared to the conventional term of ‘statistical significance’ (cf. McShane et al. 2019).↩︎

  19. In relation to the results of the effect of deservingness on welfare support derived from Aarøe and Petersen (2014), Mummolo and Peterson (2019), and Kane et al. (2023), it should be noted that the reported statistical models from which the estimates were derived were not identically specified. For example, Kane et al. (2023) only report results from models including attentiveness-interaction terms, where the results derived from the most attentive respondents were used due to being arguably most reliable. Note also that some estimates (e.g., means, standard errors, and confidence intervals) are less reliable because they had to be derived using informed approximations (e.g., recovering the standard error from the reported mean and 95% confidence interval) or tool-assisted ‘eyeballing’ (Gross et al. 2014), because the study (e.g., Mummolo and Peterson 2019) only reported the relevant results visually. Note that the meta-analysis of a previous version of this paper used results derived from the incorrect graphs, leading to underestimated estimates.↩︎

  20. While the assumption of prior odds of 1 fits with the Frequentist notion of unbiasedness by being entirely noninformative, it arguably assigns too much relative weight to the hypothesis of no effect, since researchers often find a null hypothesis of exactly no effect to be highly unlikely a priori (e.g., Gelman et al. 2021: 59-60). This is similarly expressed by Meehl’s (1978) conjecture that in some settings the null hypothesis is always false (Gigerenzer 2004: 600-602). Assuming prior odds of 1 is thus arguably too conservative, because it results in an underestimation of the actual posterior odds, but this merely produces a bias against the researcher’s (alternative) hypothesis by reducing the statistical power of their study.↩︎
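
To make the arithmetic underlying this point explicit, recall the standard identity relating prior odds, the Bayes factor, and posterior odds:

\[
\underbrace{\frac{P(H_1 \mid D)}{P(H_0 \mid D)}}_{\text{posterior odds}} = \mathrm{BF}_{10} \times \underbrace{\frac{P(H_1)}{P(H_0)}}_{\text{prior odds}}
\]

With prior odds of 1, the posterior odds equal \(\mathrm{BF}_{10}\); if the true prior odds exceed 1, as argued above, assuming prior odds of 1 understates the posterior odds.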

  21. ‘Ethnoracial group’ is a term used to denote the overlapping racial/ethnic/national terminology employed in the US (e.g., ‘White’, ‘African-American’, ‘Chinese’, etc.).↩︎

  22. The level of shared information between the simulated sample and the population was estimated by a novel application of cosine similarity (Kozlowski et al. 2019: 910-911; Mikolov et al. 2013: 5). Analogous to computational text analysis (Theocharis and Jungherr 2021), this exploits the fact that cosine similarity is derivable from Token Frequency Documents (TFD, e.g., Rheault and Cochrane 2020; Rodriguez et al. 2023a: 6; Rodriguez et al. 2023b: 5), which can be reconceptualized as an Information Frequency Matrix (IFM) that reflects the frequency of values (i.e., information) occurring in the sample and population. With cosine similarity being scaled 0 - 1 in such instances, cosine similarity here reflects the shared information between the sample and population.↩︎
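
A minimal sketch of this computation in base R, with hypothetical relative frequencies over the same set of values in the sample and population:

# Cosine similarity: scaled 0 - 1 for non-negative frequency vectors
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

sample_freq     <- c(0.58, 0.14, 0.20, 0.08)  # hypothetical sample frequencies
population_freq <- c(0.60, 0.13, 0.19, 0.08)  # hypothetical population frequencies

cosine_similarity(sample_freq, population_freq)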

  23. With respect to the statement that intra-unit comparisons eliminates all unit-invariant confounders, it should be added that this does not guarantee the elimination of numerous types of errors (e.g., systematic measurement error, Bandalos 2018) and biases (e.g., social desirability bias, Edwards 1953; Krumpal 2013) that may also make results invalid and unreliable.↩︎

  24. The expectation that differences in character length or word length of the manipulations could violate the simultaneity assumption is partly based on simultaneity tests suggesting this, as ascertained in a pilot study used in a previous version of this study.↩︎

  25. The rephrasing of the welfare recipient vignettes used by Aarøe and Petersen (2014) and Petersen et al. (2012) so that they were of equal word length and character length relied on ChatGPT (OpenAI 2022). To ensure that phrasings were meaningful and did not change the substantive meaning of the original vignettes, the phrasings were compared and validated by the author.↩︎

  26. Note that the prompts for generating counterfactual outcomes in a previous version of this paper incorrectly specified a 5-point Likert (1932) scale instead of the correct 7-point Likert scale.↩︎

  27. A previous version of this paper followed a procedure by Hartman and Hidalgo (2018) that, while arguably an improvement on the Null-Hypothesis Significance Testing (NHST) framework, was unfortunately difficult to implement in a Bayesian framework given current software limitations: the author had to rely on transformations of the data that made it improper for a linear model specification, resulting in convergence issues. Once Bayesian software packages have been updated to allow bimodal priors, the suggestions by Hartman and Hidalgo (2018) will be more reliably implementable.↩︎

  28. While modeling the likelihood as a Student t distribution (Gosset 1908; Helmert 1876a; 1876b; 1875; Lüroth 1876; Pearson 1895b) would have been preferable compared to a Gaussian (Gauss 2012[1809]) for robustness (Gelman et al. 2021: 264-286; McElreath 2019: 233-234; O’Hagan 1979), inspections of model convergence using convergence metrics and traceplots found that this choice resulted in failed convergence (\(\hat{R} \ge 1.1\), Gelman and Rubin 1992; Vehtari et al. 2021).↩︎

  29. Inconsequential exceptions to this are the dispersion parameter (\(\sigma\)) and intercept dispersion parameter (\(\sigma_\alpha\)), where Bayesian estimates compared to likelihood-estimates were found to be different with 95% credibility, indicating that priors are ‘informative’ relative to the data. For the dispersion parameter (\(\sigma\)), the posterior was specifically credibly higher than the likelihood, though higher estimates are a desirable property for the dispersion parameter, because it makes the Bayesian estimates more conservative. For the intercept dispersion parameter (\(\sigma_\alpha\)), the posterior was credibly lower than the likelihood, and the Frequentist estimates are here more conservative than the Bayesian estimates.↩︎

  30. The fundamental concepts underlying the SCF, SCT, and ‘Parallel Worlds’-estimation were conceptualized by Emil Meyer-Hansen no later than March 3rd, 2024, prior to the author knowing of similar concepts involving the use of LLMs for experimental research purposes, for example, Hewitt et al. (2024), of which the author was first made aware on January 6th, 2025.↩︎