Supervised & Unsupervised Learning

*Statistical learning* refers to a vast set of tools for *understanding data*. These tools can be classified as *supervised* or *unsupervised*.

The goal of *supervised learning* is to fit a model that relates an output (the response) to one or more inputs (the predictors), with the aim of either predicting the response for future observations (prediction) or understanding the relationship between the response and the predictors (inference). Classical statistical learning methods such as linear or logistic regression, generalized additive models (GAM), boosting, and support vector machines operate in the supervised learning domain.
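As a minimal sketch of the prediction and inference goals described above, the following fits a one-predictor linear regression by ordinary least squares using only the Python standard library. The data values and the names `fit_line` and `predict` are hypothetical and chosen purely for illustration.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical training data: inputs with known responses.
x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x_train, y_train)


def predict(x_new):
    """Prediction: apply the fitted model to a future observation."""
    return intercept + slope * x_new


# Inference: the slope estimates the change in response per unit of input.
print(f"slope={slope:.2f}, intercept={intercept:.2f}, predict(6)={predict(6.0):.2f}")
```

The same fitted parameters serve both purposes: `predict` addresses prediction, while the sign and size of `slope` address inference.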

In contrast, *unsupervised learning* addresses a more challenging problem: finding patterns in unlabeled data that go beyond pure unstructured noise. The analysis is more subjective, and it is often performed as part of an *exploratory data analysis*. Furthermore, the results of unsupervised learning methods are hard to assess, since the true answers are unknown. Classic examples of unsupervised learning are cluster analysis and dimensionality reduction.

Feature Extraction & Anomaly Detection

Two components are included in the field of feature extraction: *feature construction* and *feature selection*.

*Feature construction* is one of the key steps in the data analysis process: it converts the “raw” data into a set of useful features for modeling. Usually, *feature construction* is integrated into the modeling process, but in some approaches it is a separate preprocessing transformation. These techniques include standardization (e.g., centering, scaling), normalization, signal enhancement (e.g., Fourier or wavelet transforms), linear and non-linear space embedding methods (e.g., principal component analysis, multidimensional scaling), and feature discretization (i.e., mapping continuous values to a finite discrete set), among others.
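Two of the preprocessing transformations listed above, standardization and discretization, can be sketched in a few lines. This is an illustrative, dependency-free version; the function names, the equal-width binning rule, and the sample values are assumptions rather than a prescribed method.

```python
from statistics import mean, pstdev


def standardize(column):
    """Center to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # a constant feature carries no spread
    return [(v - mu) / sigma for v in column]


def discretize(column, n_bins):
    """Map continuous values to equal-width bin indices 0 .. n_bins - 1."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in column]
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in column]


# Hypothetical raw feature values.
raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

z = standardize(raw)
bins = discretize(raw, n_bins=3)
print(z)
print(bins)
```

In practice the mean and standard deviation would be estimated on the training data only and then reused to transform new observations.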

*Feature selection* is primarily performed to select relevant, informative features, but it can serve other purposes as well: (1) general data reduction (to limit storage requirements and increase algorithm speed), (2) feature set reduction (to save resources in the next round of data collection or during utilization), (3) performance improvement (to gain in predictive accuracy), and (4) data understanding (to gain knowledge about the process that generated the data or simply to visualize the data). Filters, wrappers, and embedded methods are the three principal approaches to feature selection.
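Of the three approaches, a filter is the simplest to illustrate: each feature is scored independently of any model and only the best-scoring ones are kept. The sketch below ranks features by the absolute Pearson correlation with the target. The scoring rule, the feature names, and the data are assumptions made for this example; other filter criteria are equally valid.

```python
from math import sqrt
from statistics import mean


def pearson(a, b):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    a_bar, b_bar = mean(a), mean(b)
    cov = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    var_a = sum((x - a_bar) ** 2 for x in a)
    var_b = sum((y - b_bar) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def select_top_k(features, target, k):
    """Filter method: keep the k features most correlated with the target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical feature columns and target.
features = {
    "signal": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "inverse": [6.0, 5.0, 4.5, 3.0, 2.0, 1.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
    "constant": [7.0] * 6,
}
target = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

selected, scores = select_top_k(features, target, k=2)
print(selected)
```

Using the absolute value keeps strongly negatively correlated features such as `inverse`, which are as informative for a linear model as positively correlated ones.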

*Anomaly detection*, also known as outlier detection, is the identification of data points, items, observations, or events that do not conform to expected behavior. It is applicable in domains such as fraud detection, intrusion detection, fault detection, system health monitoring, and event detection systems in sensor networks. Techniques for anomaly detection include statistical methods (e.g., flagging values that deviate from the mean by more than a chosen number of standard deviations) and machine learning approaches (e.g., the k-nearest neighbors algorithm, local outlier factor, clustering, support vector machines, moving average using discrete linear convolution), among others.
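The statistical rule mentioned above, flagging values that lie too many standard deviations from the mean, can be sketched as follows. The function name, the threshold, and the sensor readings are hypothetical.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return the indices of values whose |z-score| exceeds the threshold."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings with one obvious spike at index 6.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0, 10.1, 9.7, 10.0]

outliers = zscore_outliers(readings, threshold=2.5)
print(outliers)
```

The threshold is set to 2.5 here because, in a sample this small, a single extreme value inflates the standard deviation and therefore limits how large its own z-score can become. For this reason, robust variants based on the median are often preferred in practice.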

Linear & Nonlinear Classifiers

Statistical classification procedures are employed to determine the class to which a new observation belongs, based on a specific set of features associated with it. In the preliminary (training) stage of supervised learning, a set of training data together with its known targets is used to estimate the classifier parameters. These parameters are then applied to incoming data in order to assign each observation to the corresponding class.

*Linear classifiers* assign a probability or a score to each possible class based on a linear prediction function whose parameters are to be determined. In the simplest case, these parameters operate as weights on the features of the observation under consideration. Frequently used linear classifiers are logistic and multinomial logistic regression, linear discriminant analysis, Naive Bayes classifiers, and Support Vector Machines with a linear kernel. Quadratic discriminant analysis is closely related to its linear counterpart but produces quadratic rather than linear decision boundaries.
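To make the idea of a linear prediction function concrete, the sketch below computes one score per class as a weighted sum of the features plus a bias, and optionally converts the scores into probabilities with a softmax, as multinomial logistic regression does. The weights, biases, and class names are hypothetical; in a real classifier they would be estimated from training data.

```python
from math import exp


def linear_scores(x, weights, biases):
    """One score per class: dot product of class weights and features, plus bias."""
    return {
        c: sum(w * xi for w, xi in zip(weights[c], x)) + biases[c]
        for c in weights
    }


def softmax(scores):
    """Turn scores into probabilities (shifted by the max for numerical stability)."""
    m = max(scores.values())
    exps = {c: exp(s - m) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def classify(x, weights, biases):
    """Assign the observation to the class with the highest linear score."""
    scores = linear_scores(x, weights, biases)
    return max(scores, key=scores.get)


# Hypothetical parameters for a two-class, two-feature problem.
weights = {"a": [1.0, -1.0], "b": [-1.0, 1.0]}
biases = {"a": 0.0, "b": 0.5}

x = [2.0, 0.5]
scores = linear_scores(x, weights, biases)
probs = softmax(scores)
label = classify(x, weights, biases)
print(scores, label)
```

Because the softmax is monotone in the scores, the predicted class is the same whether the decision is taken on scores or on probabilities.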

In contrast, *nonlinear classifiers*, such as decision trees, random forests, or neural networks, usually cannot be formulated as a simple closed-form expression.

Clustering Algorithms

Together with Principal Component Analysis, *clustering* represents an important set of unsupervised learning methods. Its main objective is to find a set of homogeneous subgroups in a given dataset, based on some similarity measure or a set of specific features. The partition can be either flat, in the sense that the subsets are disjoint, or hierarchical, in which case the number of clusters is not specified in advance and the data separation is presented in the form of a dendrogram.

Among the most used flat *clustering* procedures, the K-Means method attempts to minimize the within-cluster variance, usually measured as the sum of squared Euclidean distances between the points and their cluster center. The number of clusters must be determined in advance, and each data point is assigned to the nearest cluster center. The standard K-Means algorithm is iterative and is only guaranteed to find a local minimum. Because of this, it is usually run multiple times with different random starting centers, and the run that reaches the lowest minimum is taken as the final solution.
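The procedure described above (assign each point to the nearest center, recompute the centers, repeat, and restart from several random initializations) can be sketched without external libraries. The function names, the number of restarts, the seed, and the sample points are assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def kmeans_once(points, k, rng, max_iter=100):
    """One K-Means run from random starting centers; returns (inertia, centers, labels)."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged to a local minimum
        centers = new_centers
    labels = assign(points, centers)
    inertia = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return inertia, centers, labels


def kmeans(points, k, n_restarts=10, seed=0):
    """Run K-Means several times and keep the run with the lowest inertia."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(n_restarts))
    return min(runs, key=lambda run: run[0])


# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]

inertia, centers, labels = kmeans(points, k=2)
print(labels, round(inertia, 4))
```

Here the inertia, the sum of squared distances to the assigned centers, is the quantity compared across restarts.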

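Distribution-based *clustering*, usually formulated as a Gaussian mixture model, can be used with very large data sets. The model is fitted iteratively, typically with the expectation-maximization algorithm, and the number of clusters can be selected from the data, for example by comparing fits with different numbers of components using a model selection criterion such as the Bayesian information criterion (BIC).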

Hierarchical *clustering* is also used when the number of clusters is not known in advance. In its common agglomerative form, it builds the hierarchy beginning from the individual elements (leaves) and progressively merges clusters into a tree by estimating the inter-cluster dissimilarities. The final number of clusters is decided at the end, usually by visual inspection of the resulting dendrogram.
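A bottom-up merge sequence of this kind can be sketched as below. The example assumes single linkage (the dissimilarity between two clusters is the smallest distance between any pair of their members), which is only one of several possible linkage rules. The function names and data are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
from math import dist


def single_linkage(points):
    """Agglomerative clustering; returns the merges as (id_a, id_b, height) tuples."""
    clusters = {i: [i] for i in range(len(points))}  # each leaf starts alone
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = list(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Single linkage: closest pair of members across the two clusters.
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges


def n_clusters_at(merges, n_points, height):
    """Number of clusters left when the dendrogram is cut at the given height."""
    return n_points - sum(1 for _, _, d in merges if d <= height)


# Hypothetical one-dimensional data.
points = [(0.0,), (1.0,), (5.0,), (6.5,), (20.0,)]

merges = single_linkage(points)
print(merges)
print(n_clusters_at(merges, len(points), height=5.0))
```

The list of merge heights is the information a dendrogram displays. A large jump between consecutive heights (here from 4.0 to 13.5) suggests a natural place to cut the tree.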

When the data dimensionality is large, traditional *clustering* procedures are used in combination with data reduction techniques, leading to either subspace or correlation clustering methods.