Data Analytics Colloquium

Upcoming webinars

Empirical studies in social sciences often involve observational data with many controls or technical terms. As a way to seek robustness, a common practice is to compute an estimate using different subsets of controls. However, many conventional estimators ignore the additional mean squared error (MSE) incurred due to the presence of many controls or technical terms, which can cause the empirical results to be misinterpreted. Given a set of controls, how can we come up with a more robust estimator? In this talk, we will first review the balancing method as an effective way to estimate some

How does one identify and find policy and regime changes over time? Here we review data-driven techniques for the statistical detection and identification of data changes over time and provide a taxonomy for how to diagnose and think about these methods. Slides | Sample code and other materials

This workshop aims to introduce inferential statistical models for network data. The workshop will integrate theoretical discussions with technical breakdowns, practical examples, and software code to perform analyses.

Just like any other area of statistics, network analytic procedures can be divided into two categories – descriptive and inferential. We will spend a short amount of time covering some descriptive basics (e.g. measures of centrality), but the emphasis of the workshop is on inferential network analysis. Methods of descriptive network analysis are suitable for many worthwhile


Tension has long existed in the political and social sciences between quantitative and qualitative approaches on one hand, and theory-minded and empirical techniques on the other. The latter divide has grown sharper in the wake of new behavioral and experimental perspectives that draw on both sides of these modeling schemes. We propose to address this disconnect by establishing a framework for methodological unification: empirical implications of theoretical models (EITM). This framework connects behavioral and applied statistical concepts, develops analogues of these concepts, and links and

Researchers using nonlinear probability models, such as logit and probit, often study how the original coefficients change when additional covariates are added to or removed from the model. Such comparisons are illegitimate. The problem occurs because the estimated parameters in such models are only identified “up to a scale,” which means that the estimated coefficients are scaled by the standard deviation of the unobserved disturbance term. Hence, adding covariates decreases the residual variance, which then inflates all the estimated coefficients even if their true values have

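The scale problem described in this abstract can be illustrated with a small simulation. The sketch below is not taken from the talk: the data-generating process, the coefficient values (1.0 and 2.0), the sample size, and the helper names `fit_logit` and `simulate` are all assumptions chosen for illustration. It draws a covariate `z` that is independent of `x`, so omitting `z` creates no confounding, and yet the logit coefficient on `x` is noticeably smaller when `z` is left out.

```python
import numpy as np


def fit_logit(X, y, iters=25):
    """Fit a logit model by Newton-Raphson. X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))          # fitted probabilities
        grad = X.T @ (y - p)                         # score vector
        hess = (X * (p * (1.0 - p))[:, None]).T @ X  # negative Hessian of the log-likelihood
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def simulate(n=50_000, seed=0):
    """Return the logit coefficient on x without and with the covariate z."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    z = rng.normal(size=n)            # independent of x by construction (assumption)
    e = rng.logistic(size=n)          # standard logistic disturbance
    # Hypothetical latent-variable model: true coefficient on x is 1.0.
    y = (1.0 * x + 2.0 * z + e > 0).astype(float)
    ones = np.ones(n)
    short = fit_logit(np.column_stack([ones, x]), y)     # z omitted
    full = fit_logit(np.column_stack([ones, x, z]), y)   # z included
    return short[1], full[1]


b_short, b_full = simulate()
print(f"coefficient on x, z omitted:  {b_short:.3f}")
print(f"coefficient on x, z included: {b_full:.3f}")
```

In the short model, the omitted `2.0 * z` term is absorbed into the disturbance, which raises its standard deviation and shrinks the estimated coefficient on `x`. Adding `z` reduces the residual variance and the coefficient grows, even though the underlying effect of `x` is the same in both fits.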

A fundamental challenge facing applied time-series analysts is how to draw inferences about long-run relationships (LRR) when we are uncertain whether the data contain unit roots. Unit root tests are notoriously unreliable and often leave analysts uncertain, but popular extant methods hinge on correct classification. Webb, Linn, and Lebo (WLL; 2019) develop a framework for inference based on critical value bounds for hypothesis tests on the long-run multiplier (LRM) that eschews unit root tests and incorporates the uncertainty inherent in identifying the dynamic properties of the data into

Many political phenomena are hard to quantify. Take electoral fraud. In the 21st century, both democracies and autocracies held regular elections. How do we know if the majority of voters indeed supported the winner? To answer this question, scholars and experts should be able to separate “real” votes from fakes. This ability is equally crucial for theory-testing and policy-making. In this talk, Dr. Sobolev will discuss the evolution of measurement approaches in social sciences, using an example of yet another paramount phenomenon: mass protest behavior. The ability of citizens to

To study the evolution of electoral preferences, Wlezien and Erikson (2002) propose assessing the relationship between pre-election vote intentions and the final vote for a set of elections. That is, they model poll data not as a set of different time series, which are difficult to analyze in most election years in most countries because of missing data and survey error, but as a series of cross-sections, across elections, for each day of the election ‘timeline.’ Although the method does not provide information about preference dynamics in particular election years, it does reveal

The biggest challenge in empirical work is to get our statistical models to correctly represent the politics of what we are studying. For example, Donald Trump raised voter turnout. So did Franklin Roosevelt and Adolf Hitler. Strong preferences motivate voters to go to the polls. Yet studies of elections nearly always analyze vote choices and turnout separately, missing the politics that mobilizes voters. Researchers have long understood the theoretical limitation of doing so, but issues of parameter identification, computing power, unavailability of survey weighting, and complexity of

Big data problems tax computational resources in many ways, including RAM, storage, swapping, the ability to parallelize, and the limitations of specific software packages in even performing the operations. There is no conventional way to estimate spatial models of this size. As a result, we have been required to creatively reform matrix objects, use relatively obscure linear algebra relationships, break operations up into multiple discrete tasks, and consider hardware issues in new ways. The current solution is written in C++ code to run on AWS, which is labor intensive and ultimately expensive. Human