Statistical Modeling & Predictive Analytics


Survey Analysis

Survey-based instruments (e.g., questionnaires and interviews) are the most frequently used data collection tools for gathering information about respondents. The information is collected from one or more random samples drawn from a population, and inferences are then made about particular population characteristics. Surveys are used in screening, progress monitoring, assessment, and evaluation. Survey-based instruments are inexpensive, can target specific groups of interest for scientific purposes, and can be administered across wide geographical areas. Once survey data are collected from participants, the next step is to perform appropriate statistical analyses, interpret the results, and make recommendations pertaining to the research objectives. Usually, participants' responses are measured on a Likert-type scale with two (dichotomous) or more (polytomous) categories that captures the respondents' level of agreement or disagreement.

Binary and ordinal logistic regression are used extensively to establish the relationship between categorical response variables and a set of explanatory variables. When a complex survey sampling design is used, such as stratified or clustered sampling, ordinal logistic regression cannot be conducted without taking the sampling design into account, usually through multilevel modeling.
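
As a minimal sketch of the binary case, the example below fits a logistic regression with the statsmodels package on a hypothetical survey data set; the file name and the column names (response, age, score) are assumptions chosen for illustration, not part of any particular study.

    # Binary logistic regression sketch on hypothetical survey data.
    import pandas as pd
    import statsmodels.api as sm

    # Assumed file and columns: 'response' is a 0/1 agreement indicator,
    # 'age' and 'score' are illustrative explanatory variables.
    data = pd.read_csv("survey.csv")
    X = sm.add_constant(data[["age", "score"]])   # add an intercept term
    y = data["response"]

    model = sm.Logit(y, X).fit()                  # maximum likelihood fit
    print(model.summary())                        # coefficients, std. errors, p-values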

When a large number of possibly correlated response variables exists, Factor Analysis (EFA and CFA) can be used both for dimensionality reduction and to extract a reduced set of latent explanatory structures.
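
For illustration, here is a dimensionality-reduction sketch using scikit-learn's FactorAnalysis on a matrix of Likert-type item responses; the simulated response matrix and the choice of two factors are assumptions made only for the example.

    # Exploratory-style factor extraction sketch with scikit-learn.
    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(0)
    # Placeholder for an (n_respondents x n_items) matrix of item responses (1-5 scale).
    responses = rng.integers(1, 6, size=(200, 10)).astype(float)

    fa = FactorAnalysis(n_components=2)           # assume two latent factors
    scores = fa.fit_transform(responses)          # factor scores per respondent
    loadings = fa.components_.T                   # (n_items x n_factors) loadings
    print(loadings.round(2))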


Survival Analysis & Hazard Models

Survival Analysis represents a set of statistical methods for analyzing data where the outcome variable is the time until the occurrence of an event of interest (e.g., equipment failure, earthquake, promotion, stock market crash, death, divorce, arrest). This topic is called reliability theory or reliability analysis in engineering, duration analysis or duration modeling in economics, and event history analysis in sociology. Survival data are modeled in terms of two related functions: survival and hazard. The survival function is the probability that an individual survives from the time origin (e.g., diagnosis) to a specified future time. The hazard function assesses the risk, at a particular moment, that an individual who is under observation will experience the target event.
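
In standard notation, with T the event time, f its density, S the survival function, and h the hazard, these quantities are linked as follows:

    S(t) = \Pr(T > t), \qquad
    h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
         = \frac{f(t)}{S(t)}, \qquad
    S(t) = \exp\!\left(-\int_0^t h(u)\, du\right).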

A number of models are available to analyze the relationship between survival time and a set of predictor variables. Methods include parametric, nonparametric, and semi-parametric approaches. Parametric methods assume that the underlying distribution of the survival times follows a known probability distribution (e.g., exponential, Weibull, gamma, Gompertz, or lognormal). They focus on describing the distribution of the survival times and the way this distribution changes as a function of the predictors.
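
As a hedged sketch of the parametric route, the example below fits a Weibull model with the third-party lifelines package; the duration and censoring arrays are made-up values used only to show the workflow.

    # Parametric (Weibull) survival fit sketch using the lifelines package.
    from lifelines import WeibullFitter

    durations = [5, 8, 12, 14, 20, 25, 31, 36, 44, 50]   # illustrative follow-up times
    observed  = [1, 1, 0, 1, 1, 0, 1, 0, 1, 1]           # 1 = event occurred, 0 = censored

    wf = WeibullFitter()
    wf.fit(durations, event_observed=observed)
    print(wf.lambda_, wf.rho_)            # estimated scale and shape parameters
    print(wf.survival_function_.head())   # fitted survival probabilities over time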

The Kaplan-Meier estimator, a popular nonparametric statistic, is widely used to estimate survival probabilities as a function of time. It is often used to obtain univariate descriptive statistics for survival data and to compare the survival experience of two or more groups of participants.
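
A minimal Kaplan-Meier sketch, again assuming the lifelines package and toy data:

    # Kaplan-Meier estimate of the survival function (toy data, lifelines package).
    from lifelines import KaplanMeierFitter

    durations = [6, 7, 10, 15, 19, 25, 30, 34, 40, 46]   # illustrative times to event/censoring
    observed  = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]           # 1 = event, 0 = right-censored

    kmf = KaplanMeierFitter()
    kmf.fit(durations, event_observed=observed)
    print(kmf.survival_function_)     # step-function estimate of S(t)
    print(kmf.median_survival_time_)  # median survival time, if reached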

A popular semi-parametric model for the analysis of survival data is the Cox proportional hazards regression model, the most common tool for studying the dependence of survival time on predictor variables. In contrast with parametric models, this regression model makes no assumptions about the shape of the baseline hazard function. When two or more end events are possible (for example, the onset of a specific disease or death), a competing risks model must be employed. When the transition to the target state takes place through a set of consecutive intermediate states, a multi-state model is recommended. Finally, when one or more covariates change in a relatively predictable manner over time, a joint longitudinal-survival model is used to make inferences both about the covariate evolution and about the hazard rate of the target event.
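
The sketch below covers only the basic Cox model (not the competing-risk, multi-state, or joint variants), again assuming the lifelines package; the data frame columns (duration, event, age, treated) are invented for illustration.

    # Cox proportional hazards sketch (lifelines package, made-up covariates).
    import pandas as pd
    from lifelines import CoxPHFitter

    df = pd.DataFrame({
        "duration": [5, 8, 12, 14, 20, 25, 31, 36, 44, 50],
        "event":    [1, 1, 0, 1, 1, 0, 1, 0, 1, 1],
        "age":      [54, 61, 47, 70, 58, 49, 66, 52, 73, 60],
        "treated":  [0, 1, 0, 1, 1, 0, 1, 0, 1, 0],
    })

    cph = CoxPHFitter()
    cph.fit(df, duration_col="duration", event_col="event")
    cph.print_summary()               # hazard ratios and confidence intervals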


Time Series Analysis

When a variable is measured sequentially in time, the resulting data form a time series. Time series analysis covers techniques for modeling, estimation, filtering, and forecasting. The main features of many time series are the trend (e.g., a long-run increase or decrease) and the seasonal variation (i.e., patterns that recur at regular intervals), but an important goal is also to understand and model the correlational structure of the data. Once a good model is found, it can be used to forecast future values or to generate simulations that guide planning decisions. An efficient forecasting method is to find a suitable leading variable, that is, a variable associated with the variable we need to forecast. When this is not possible, an effective strategy is to make extrapolations based on present trends and to implement adaptive estimates of these trends.
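
To illustrate the trend and seasonal components, here is a decomposition sketch using statsmodels on a simulated monthly series; the series itself and the monthly period are assumptions made for the example.

    # Trend/seasonal decomposition sketch (statsmodels, simulated monthly series).
    import numpy as np
    import pandas as pd
    from statsmodels.tsa.seasonal import seasonal_decompose

    rng = np.random.default_rng(1)
    t = np.arange(120)
    # Simulated series: linear trend + annual seasonality + noise.
    y = 0.05 * t + 2.0 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.5, size=120)
    series = pd.Series(y, index=pd.date_range("2010-01", periods=120, freq="MS"))

    result = seasonal_decompose(series, model="additive", period=12)
    print(result.trend.dropna().head())      # estimated trend component
    print(result.seasonal.head(12))          # estimated seasonal pattern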

An important concept in time series analysis is stationarity (i.e., the mean, variance, and autocorrelation structure do not change over time), generally formulated in the weak (second-order) sense. When a series is non-stationary, forecasting is much more difficult. In many cases, a differencing operation can transform a non-stationary series into one that satisfies the stationarity assumptions; in this case, a model such as the autoregressive integrated moving average (ARIMA) is an appropriate choice. Time series modeling methods include parametric techniques (in which the process is assumed to have a particular structure that can be described with a small number of parameters) and non-parametric, linear and nonlinear, or univariate and multivariate techniques.
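
A minimal ARIMA sketch with statsmodels, fitted to a simulated non-stationary (random-walk-with-drift) series; the (1, 1, 1) order is an assumption chosen only for illustration.

    # ARIMA sketch on a series made stationary by one differencing step (statsmodels).
    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(2)
    # Random walk with drift: non-stationary, stationary after one difference.
    y = np.cumsum(0.3 + rng.normal(0, 1, size=200))
    series = pd.Series(y)

    model = ARIMA(series, order=(1, 1, 1))    # assumed (p, d, q) order
    fitted = model.fit()
    print(fitted.summary())
    print(fitted.forecast(steps=10))          # ten-step-ahead forecast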


Latent Variable Models in SEM and HMM

Latent variables (LVs) play a very important part in statistical modeling. In simple terms, they are variables that are not directly measurable, but whose presence can explain the relationships among the measured quantities. Two of the most common approaches to modeling with latent variables are Structural Equation Models (SEM) and Hidden Markov Models (HMM).

In SEM, various relationships between the LVs and the manifest variables (indicators) are probed. The validity of a proposed model is established by evaluating goodness-of-fit indices. Among the most widely used models, exploratory (EFA) and confirmatory (CFA) factor analysis probe the hidden relationships between the measured indicators; the number of latent 'factors' uncovered is usually much smaller than the number of indicators they directly influence. In path analysis, another common form of SEM, a relatively complex dependence structure between the postulated latent variables and the observed ones is tested. Model parameters are usually estimated by requiring that the model-implied covariance matrix be as close as possible to the sample covariance matrix. Alternatively, Bayesian SEM can be used when a priori knowledge about the model parameters is available; the a posteriori distributions are then sampled using MCMC methods.
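
The toy sketch below illustrates only the covariance-matching idea for a single-factor model, using a plain least-squares discrepancy with numpy/scipy rather than the maximum-likelihood fit function that dedicated SEM software typically uses; the simulated data and the parameterization are assumptions for the example.

    # Toy covariance fitting: single-factor model Sigma = lambda*lambda' + diag(psi^2),
    # estimated by minimizing the squared distance to the sample covariance matrix.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(3)
    n_obs, n_ind = 500, 4
    factor = rng.normal(size=(n_obs, 1))
    true_loadings = np.array([[0.9, 0.8, 0.7, 0.6]])
    X = factor @ true_loadings + rng.normal(0, 0.5, size=(n_obs, n_ind))
    S = np.cov(X, rowvar=False)                       # sample covariance matrix

    def discrepancy(theta):
        lam, psi = theta[:n_ind], theta[n_ind:]
        sigma = np.outer(lam, lam) + np.diag(psi**2)  # model-implied covariance
        return np.sum((S - sigma) ** 2)

    theta0 = np.concatenate([np.full(n_ind, 0.5), np.full(n_ind, 0.5)])
    fit = minimize(discrepancy, theta0, method="L-BFGS-B")
    print(fit.x[:n_ind].round(2))                     # estimated factor loadings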

In a Hidden Markov Model, the states are not directly observable; they act as latent variables that govern the distribution of the observable output tokens.
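
A brief sketch with the third-party hmmlearn package, assuming Gaussian emissions and a simulated two-regime series purely for illustration:

    # Hidden Markov Model sketch (hmmlearn package, Gaussian emissions).
    import numpy as np
    from hmmlearn import hmm

    rng = np.random.default_rng(4)
    # Simulated observations from two regimes with different means.
    X = np.concatenate([rng.normal(0, 1, size=(100, 1)),
                        rng.normal(5, 1, size=(100, 1))])

    model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=100)
    model.fit(X)                          # Baum-Welch (EM) estimation
    states = model.predict(X)             # Viterbi decoding of the latent states
    print(model.means_.ravel().round(2))  # estimated state means
    print(states[:10], states[-10:])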


Bayesian Estimation & MCMC Sampling

Bayesian analysis is an important statistical technique with broad applicability in scientific, financial, engineering, and medical fields where detection, estimation, classification, or prediction methods are considered.

Given a statistical model for the data, the Bayesian approach requires an additional probability model for all unknown parameters. This a priori information is obtained independently of the data being analyzed and represents our beliefs about the process. The a posteriori density, proportional to the product of the data sampling distribution (likelihood) and the a priori probability density function (pdf), is then used for model parameter estimation or forecasting. However, because of the complexity of the a posteriori distributions, parameter estimates based on the maximum a posteriori (MAP) approach, or probabilities of future events obtained by integrating functions of interest over these distributions, usually cannot be computed analytically. Markov chain Monte Carlo (MCMC) simulation therefore provides an important set of techniques that can generate a large body of samples from a complex distribution by drawing from a much simpler candidate distribution.
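
In symbols, with y the observed data and theta the unknown parameters:

    p(\theta \mid y) \;\propto\; p(y \mid \theta)\, p(\theta),
    \qquad
    \hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta}\, p(y \mid \theta)\, p(\theta).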

A Markov chain whose stationary distribution is the a posteriori distribution of interest can be constructed, for example, with a Metropolis-Hastings type algorithm. Another popular choice is Gibbs sampling, which requires explicit knowledge, up to a multiplicative factor, of all full conditional distributions of the a posteriori pdf. A good Markov chain exhibits rapid mixing of the states, so that the stationary distribution is reached quickly. The choice of the candidate (proposal) distribution plays a critical role in this construction.
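
As a minimal sketch, the random-walk Metropolis sampler below draws from an unnormalized target density; the standard-normal log-target, the step size, and the burn-in length are illustrative assumptions.

    # Random-walk Metropolis sketch: sampling from an unnormalized target density.
    import numpy as np

    def log_target(theta):
        # Illustrative unnormalized log-posterior: a standard normal.
        return -0.5 * theta**2

    def metropolis(n_samples=10000, step=1.0, seed=5):
        rng = np.random.default_rng(seed)
        samples = np.empty(n_samples)
        theta = 0.0                                   # starting state
        for i in range(n_samples):
            proposal = theta + rng.normal(0, step)    # symmetric candidate distribution
            log_ratio = log_target(proposal) - log_target(theta)
            if np.log(rng.uniform()) < log_ratio:     # accept/reject step
                theta = proposal
            samples[i] = theta
        return samples

    draws = metropolis()
    print(draws[1000:].mean(), draws[1000:].std())    # approx. 0 and 1 after burn-in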