• Dimensionality reduction and selection of the most informative variables. Evaluation of data dimensionality reduction methods used to transform a video stream for personal identification. Benefits of Dimensionality Reduction

    Purpose of the study:

    Assessing the effectiveness of data dimensionality reduction techniques to optimize their use in recognition (identification) practice.

    Research objectives:

    1. Review of literature sources on existing methods for reducing data dimensionality.

    2. Conducting research (experiments) to compare the effectiveness of algorithms used in practice to reduce the dimensionality of data in classification problems.

    Research methods (software):

    C++ programming language, OpenCV library

    Perceiving high-dimensional data is difficult and sometimes impossible for humans. It is therefore natural to want to move from a multidimensional sample to low-dimensional data that can be inspected, evaluated and used, including for recognition tasks. Besides clarity, dimensionality reduction removes factors (information) that interfere with statistical analysis: they lengthen the time needed to collect information and increase the variance of parameter estimates and of the characteristics of distributions.

    Dimensionality reduction is the transformation of high-dimensional original data into a new, lower-dimensional representation that preserves the underlying information. Ideally, the dimension of the transformed representation matches the intrinsic dimension of the data, that is, the minimum number of variables needed to express all of its essential properties. An analytical model built from a reduced data set should be easier to process, implement, and understand than a model built from the original set.

    The decision to choose a dimensionality reduction method is based on knowledge of the characteristics of the problem being solved and the expected results, as well as on the available time and computing resources. According to literature reviews, the most commonly used dimensionality reduction methods include Principal Component Analysis (PCA), Independent Component Analysis (ICA), and Singular Value Decomposition (SVD).

    Principal Component Analysis (PCA) is the simplest method for reducing data dimensionality. It is widely used for feature transformation with simultaneous dimensionality reduction in classification problems. The method is based on projecting the data onto a new coordinate system of lower dimension, which is determined by the eigenvectors and eigenvalues of the covariance matrix. From a mathematical point of view, the principal component method is an orthogonal linear transformation.

    The main idea of the method is to calculate the eigenvalues and eigenvectors of the covariance matrix of the data; these give the directions along which the variance of the data is greatest. The covariance matrix describes how the variables spread around their means relative to one another. The covariance of two random variables (dimensions) is a measure of their linear dependence:

    cov(X, Y) = E[(X − E[X])(Y − E[Y])],   (1)

    where E[X] is the mathematical expectation of the random variable X and E[Y] is the mathematical expectation of the random variable Y. Formula (1) can also be written in the form

    cov(X, Y) = (1/N) Σ_{i=1}^{N} (x_i − x̄)(y_i − ȳ),   (2)

    where x̄ is the mean of X, ȳ is the mean of Y, and N is the number of observations.

    After the eigenvectors and eigenvalues are calculated, they are sorted in descending order of the eigenvalues. The components are thus obtained in order of decreasing importance, and the eigenvector with the largest eigenvalue corresponds to the principal component of the data set. The principal components themselves are obtained by projecting the data onto the sorted eigenvectors. To find the optimal space of lower dimension, criterion (3) is used: the smallest P is chosen for which the error between the original data set and its reduced representation stays within a threshold, i.e. the retained share of variance satisfies

    (λ_1 + λ_2 + ... + λ_P) / (λ_1 + λ_2 + ... + λ_N) ≥ ε,   (3)

    where P is the dimension of the new space, N is the dimension of the original sample, λ_i are the eigenvalues sorted in descending order, and ε is the threshold value. As a result, the algorithm produces a data matrix of dimension P, linearly transformed from the original matrix of dimension N. Equivalently, PCA finds a linear mapping M that minimizes the evaluation function

    φ(Y) = Σ_{i,j} ( d_ij(X)² − d_ij(Y)² ),

    where d_ij(X) is the Euclidean distance between points x_i and x_j in the original space and d_ij(Y) is the Euclidean distance between the corresponding points y_i and y_j in the reduced space. The minimum of this evaluation function can be found by performing a spectral decomposition of the Gram matrix and multiplying its eigenvectors by the square roots of the corresponding eigenvalues.
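    To illustrate how these steps fit together, below is a minimal sketch using OpenCV's cv::PCA class (the library named in the research methods). The row-per-sample layout, the CV_32F type and the retained-variance threshold are assumptions of the example, not details fixed by the text.

```cpp
#include <opencv2/core.hpp>

// Minimal sketch: project row-vector samples onto the first P principal
// components that together retain at least `threshold` of the total
// variance, as in criterion (3). `samples` is assumed to be an N x d
// CV_32F matrix with one observation per row.
cv::Mat pcaReduce(const cv::Mat& samples, double threshold)
{
    // Full decomposition of the covariance matrix of the data.
    cv::PCA full(samples, cv::Mat(), cv::PCA::DATA_AS_ROW);

    // Eigenvalues come out sorted in descending order; accumulate them
    // until the retained share of the total variance reaches the threshold.
    double total = cv::sum(full.eigenvalues)[0];
    double retained = 0.0;
    int p = 0;
    while (p < full.eigenvalues.rows && retained / total < threshold)
        retained += full.eigenvalues.at<float>(p++);

    // Repeat the decomposition keeping only the first p components
    // and project the data into the reduced space.
    cv::PCA reduced(samples, cv::Mat(), cv::PCA::DATA_AS_ROW, p);
    return reduced.project(samples);   // N x p matrix of principal components
}
```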

    Independent Component Analysis (ICA), in contrast to PCA, is a fairly new method, but it is quickly gaining popularity. It is based on the idea of linearly transforming the data into new components that are as statistically independent as possible and are not necessarily orthogonal to each other. For the research in this work, the FastICA algorithm was chosen; it is described in detail in [2]. The main preprocessing steps of the method are centering (subtracting the mean from the data) and whitening (linearly transforming the vector x into a vector whose coordinates are uncorrelated and have unit variance).

    The criterion of independence in FastICA is non-Gaussianity, which is measured using the kurtosis coefficient

    kurt(y) = E[y^4] − 3 (E[y^2])^2.

    For Gaussian random variables this value is zero, so FastICA maximizes its absolute value. If z denotes the “whitened” data, then the covariance matrix of the whitened data is the identity matrix, E[z z^T] = I.

    Such a transformation is always possible. A popular whitening method uses the spectral decomposition of the covariance matrix E[x x^T] = E D E^T, where E is the orthogonal matrix of eigenvectors and D = diag(d_1, ..., d_n) is the diagonal matrix of eigenvalues. Whitening can then be represented as

    z = E D^(−1/2) E^T x,

    where the matrix D^(−1/2) is calculated by the componentwise operation D^(−1/2) = diag(d_1^(−1/2), ..., d_n^(−1/2)).
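    A sketch of this centering and whitening step, and of the kurtosis estimate used as the non-Gaussianity measure, using OpenCV's covariance and eigendecomposition routines. The row-per-sample CV_64F layout and strictly positive eigenvalues are assumptions of the example.

```cpp
#include <opencv2/core.hpp>
#include <cmath>

// Centering and whitening: make the coordinates uncorrelated with unit
// variance via the spectral decomposition of the covariance matrix
// (Sigma = E D E^T, z = E D^{-1/2} E^T (x - mean)).
// `data` is assumed to be an N x d CV_64F matrix, one observation per row.
cv::Mat whiten(const cv::Mat& data)
{
    cv::Mat covar, mean;
    cv::calcCovarMatrix(data, covar, mean,
                        cv::COVAR_NORMAL | cv::COVAR_ROWS | cv::COVAR_SCALE);

    // Eigendecomposition of the symmetric covariance matrix:
    // eigenvectors are returned as rows of E, eigenvalues in descending order.
    cv::Mat eigenvalues, E;
    cv::eigen(covar, eigenvalues, E);

    // D^{-1/2} is obtained componentwise from the eigenvalues
    // (assumed strictly positive in this sketch).
    cv::Mat Dinvsqrt = cv::Mat::zeros(covar.rows, covar.cols, CV_64F);
    for (int i = 0; i < eigenvalues.rows; ++i)
        Dinvsqrt.at<double>(i, i) = 1.0 / std::sqrt(eigenvalues.at<double>(i));

    // Whitening matrix W = E^T D^{-1/2} E (E stores eigenvectors as rows).
    cv::Mat W = E.t() * Dinvsqrt * E;

    // Center the data and apply the whitening transform to each row.
    cv::Mat centered;
    cv::subtract(data, cv::repeat(mean, data.rows, 1), centered);
    return centered * W;   // rows are now uncorrelated with unit variance
}

// Sample kurtosis of one whitened coordinate, kurt(y) = E[y^4] - 3(E[y^2])^2;
// it is zero for a Gaussian variable, so FastICA maximizes its absolute value.
double kurtosis(const cv::Mat& y)   // y: N x 1 CV_64F column
{
    cv::Mat y2 = y.mul(y);
    double m2 = cv::mean(y2)[0];
    double m4 = cv::mean(y2.mul(y2))[0];
    return m4 - 3.0 * m2 * m2;
}
```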

    Experiments

    For the experimental study of the described methods, video sequences from the CASIA Gait Database, split into individual frames, were used. The database contains sequences of binary images corresponding to individual frames of the video, in which the moving objects have already been segmented.

    From the entire set of video sequences, 15 classes were randomly selected, in which the shooting angle is 90 degrees, people are depicted in ordinary non-winter clothes and without bags. There were 6 sequences in each class. The length of each sequence was at least 60 frames. The classes were divided into training and test sets of 3 sequences each.

    The features obtained as a result of the PCA and ICA methods were used to train a classifier, which in this work was a support vector machine (SVM).

    To determine the quality of the method, the classification accuracy was assessed, defined as the proportion of correctly classified objects. During the experiment, the time spent in training and testing mode was also recorded.
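    A sketch of this classification stage with OpenCV's cv::ml::SVM, including the accuracy and timing measurements described above. The linear kernel, the matrix layout (one transformed sample per row, CV_32F) and the integer class labels are assumptions of the example.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/ml.hpp>
#include <cstdint>
#include <cstdio>

// Train an SVM on the transformed features and report classification
// accuracy (share of correctly classified test samples) and timings.
// trainFeatures/testFeatures: CV_32F, one sample per row;
// trainLabels/testLabels: CV_32S column vectors of class numbers.
double evaluateSvm(const cv::Mat& trainFeatures, const cv::Mat& trainLabels,
                   const cv::Mat& testFeatures,  const cv::Mat& testLabels)
{
    cv::Ptr<cv::ml::SVM> svm = cv::ml::SVM::create();
    svm->setType(cv::ml::SVM::C_SVC);
    svm->setKernel(cv::ml::SVM::LINEAR);   // kernel choice is an assumption

    int64_t t0 = cv::getTickCount();
    svm->train(trainFeatures, cv::ml::ROW_SAMPLE, trainLabels);
    double trainTime = (cv::getTickCount() - t0) / cv::getTickFrequency();

    t0 = cv::getTickCount();
    cv::Mat predictions;
    svm->predict(testFeatures, predictions);   // results come back as CV_32F
    double testTime = (cv::getTickCount() - t0) / cv::getTickFrequency();

    int correct = 0;
    for (int i = 0; i < testLabels.rows; ++i)
        if (static_cast<int>(predictions.at<float>(i)) == testLabels.at<int>(i))
            ++correct;

    std::printf("train %.3f s, test %.3f s\n", trainTime, testTime);
    return static_cast<double>(correct) / testLabels.rows;   // accuracy
}
```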

    Figure 1. Dependence of classification accuracy on the number of components: a) principal component analysis (PCA); b) independent component analysis (ICA)

    Figure 1 (a, b) shows the dependence of classification accuracy on the dimension of the output data after the transformation. It can be seen that with PCA the classification accuracy changes only slightly as the number of components increases, whereas with ICA the accuracy begins to fall once a certain value is exceeded.

    Figure 2. Dependence of classification time on the number of components: a) PCA; b) ICA

    Figure 2(a,b) shows the dependence of classification time on the number of PCA and ICA components. The increase in dimensionality in both cases was accompanied by a linear increase in processing time. It can be seen from the graphs that the SVM classifier performed faster after dimensionality reduction using Principal Component Analysis (PCA).

    The Principal Component Analysis (PCA) and Independent Component Analysis (ICA) methods worked quite quickly and, with suitable parameters, gave good results in the classification task. However, for data with a complex structure these methods do not always achieve the desired result. Therefore, more and more attention has recently been paid to local nonlinear methods that project the data onto a manifold, which makes it possible to preserve the data structure.

    In the future, it is planned to expand both the list of algorithms used to generate the feature description and the list of classification methods used. Another important area of research appears to be reducing processing time.

    References:

    1. Jolliffe I. T. Principal Component Analysis. Springer, 2002.
    2. Hyvärinen A., Oja E. Independent Component Analysis: Algorithms and Applications. Neural Networks, Vol. 13, 2000.
    3. Josiński H. Feature Extraction and HMM-Based Classification of Gait Video Sequences for the Purpose of Human Identification. Springer, 2013, Vol. 481.

    Data reduction

    In analytical technologies, data dimensionality reduction refers to the process of converting data into the form most convenient for analysis and interpretation. This is usually achieved by reducing the data volume, the number of features used, and the diversity of their values.

    Often, the analyzed data is incomplete in the sense that it poorly reflects the dependencies and patterns of the business processes being studied. The reasons may be an insufficient number of observations or the absence of features that reflect the essential properties of the objects. In this case, data enrichment is applied.

    Dimensionality reduction is applied in the opposite case, when the data is redundant. Redundancy occurs when an analysis problem can be solved with the same level of efficiency and accuracy, but using a smaller data dimension. This allows you to reduce the time and computational costs of solving the problem, making the data and the results of their analysis more interpretable and understandable for the user.

    Reducing the number of data observations is used if a solution of comparable quality can be obtained from a smaller sample size, thereby reducing computational and time costs. This is especially true for algorithms that are not scalable, when even a small reduction in the number of records leads to a significant gain in computational time.

    It makes sense to reduce the number of features when the information necessary for a high-quality solution of the problem is contained in a certain subset of features and it is not necessary to use all of them. This is especially true for correlated features. For example, the characteristics “Age” and “Work Experience” essentially carry the same information, so one of them can be excluded.
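    A small sketch of how such redundancy can be detected: the Pearson correlation between two features is computed, and one of them is dropped when the correlation is close to 1. The 0.9 cutoff and the plain-vector representation are illustrative assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation between two features of equal length,
// e.g. hypothetical "age" and "work experience" columns.
double pearson(const std::vector<double>& x, const std::vector<double>& y)
{
    const std::size_t n = x.size();
    double mx = 0.0, my = 0.0;
    for (std::size_t i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
    mx /= n; my /= n;

    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sxy += (x[i] - mx) * (y[i] - my);
        sxx += (x[i] - mx) * (x[i] - mx);
        syy += (y[i] - my) * (y[i] - my);
    }
    return sxy / std::sqrt(sxx * syy);
}

// If |r| exceeds an illustrative cutoff of 0.9, one of the two
// features can be excluded as redundant.
bool isRedundant(const std::vector<double>& x, const std::vector<double>& y)
{
    return std::fabs(pearson(x, y)) > 0.9;
}
```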

    The most effective means of reducing the number of features are factor analysis and principal component analysis.

    Reducing the diversity of feature values makes sense, for example, if the accuracy of the data representation is excessive and integer values can be used instead of real values without compromising the quality of the model. This reduces the amount of memory occupied by the data and the computational costs.
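    A minimal sketch of such value coarsening: a real-valued feature is stored as 16-bit integers, cutting the memory per value from 8 bytes to 2. The simple rounding scheme (no scale factor) is an illustrative assumption.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Coarsen the value set: store a real-valued feature as 16-bit integers,
// here simply by rounding (a scale factor could be applied first).
// Memory per value drops from 8 bytes (double) to 2 bytes (int16_t).
std::vector<int16_t> quantize(const std::vector<double>& feature)
{
    std::vector<int16_t> out;
    out.reserve(feature.size());
    for (double v : feature)
        out.push_back(static_cast<int16_t>(std::lround(v)));
    return out;
}
```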

    The subset of data obtained as a result of dimensionality reduction should inherit from the original set as much information as is necessary to solve the problem with a given accuracy, and the computational and time costs of data reduction should not devalue the benefits obtained from it.

    An analytical model built from a reduced data set should be easier to process, implement, and understand than a model built from the original set.

    The decision to choose a dimensionality reduction method is based on a priori knowledge about the characteristics of the problem being solved and the expected results, as well as the limited time and computing resources.

    As a result of studying the material in Chapter 5, the student should:

    know

    • basic concepts and tasks of dimensionality reduction;
    • approaches to solving the problem of transforming the feature space;

    be able to

    • use the principal component method to move to standardized orthogonal features;
    • evaluate the decrease in information content of data when reducing the dimension of the feature space;
    • solve the problem of constructing optimal multidimensional scales for studying objects;

    master

    • methods of dimensionality reduction for solving applied problems of statistical analysis;
    • skills to interpret variables in the transformed feature space.

    Basic concepts and problems of dimensionality reduction

    At first glance, it may seem that the more information about the objects of study, in the form of a set of characterizing features, is used to create a model, the better. However, too much information can reduce the effectiveness of data analysis. There is even a term, the "curse of dimensionality", describing the problems of working with high-dimensional data. The need to reduce dimensionality, in one form or another, arises in solving a variety of statistical problems.

    Uninformative features are a source of additional noise and affect the accuracy of estimation of the model parameters. In addition, data sets with a large number of features may contain groups of correlated variables. The presence of such groups of features means duplication of information, which can distort the model specification and degrade the quality of estimation of its parameters. The higher the dimension of the data, the greater the amount of computation required for its algorithmic processing.

    Two directions can be distinguished in reducing the dimension of the feature space, depending on the variables used for it: selection of features from the existing initial set, and formation of new features by transforming the original data. Ideally, the reduced representation of the data should have a dimension that matches the intrinsic dimensionality of the data.

    The search for the most informative features characterizing the phenomenon under study is an obvious direction for reducing the dimension of the problem, which does not require transformation of the original variables. This allows you to make the model more compact and avoid losses associated with the interfering effect of uninformative features. The selection of informative features consists of finding the best subset from the set of all original variables. The criteria for the concept of “best” can be either the highest quality of modeling for a given dimension of the feature space, or the smallest dimension of data at which it is possible to build a model of a given quality.

    A direct solution to the problem of creating the best model involves searching through all possible combinations of features, which is usually prohibitively labor-intensive. Therefore, as a rule, one resorts to forward or backward selection of features. In forward selection procedures, variables from the initial set are added sequentially until the required quality of the model is achieved. In algorithms for sequential reduction of the original feature space (backward selection), the least informative variables are removed one by one until an acceptable reduction in the informativeness of the model is reached.

    It should be taken into account that the information content of features is relative. Selection should ensure high information content of a set of features, and not the total information content of its constituent variables. Thus, the presence of a correlation between features reduces their overall information content due to duplication of information common to them. Therefore, adding a new feature to the already selected ones provides an increase in information content to the extent that it contains useful information that is missing in the previously selected variables. The simplest situation is the selection of mutually orthogonal features, in which the selection algorithm is implemented extremely simply: the variables are ranked according to their information content, and a composition of the first features in this ranking is used that ensures the specified information content.
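    A sketch of the ranking-based selection described above for mutually orthogonal features. The use of a per-feature informativeness score supplied by the caller (for example, the feature variance) and the required share of total informativeness are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Rank mutually orthogonal features by their informativeness and keep
// the smallest prefix of the ranking whose share of the total reaches
// `required` (e.g. 0.95). Returns the indices of the retained features.
std::vector<std::size_t> selectByRanking(const std::vector<double>& informativeness,
                                         double required)
{
    // Indices of the features sorted by decreasing informativeness.
    std::vector<std::size_t> order(informativeness.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) {
                  return informativeness[a] > informativeness[b];
              });

    double total = std::accumulate(informativeness.begin(),
                                   informativeness.end(), 0.0);
    std::vector<std::size_t> selected;
    double covered = 0.0;
    for (std::size_t idx : order) {
        if (covered / total >= required) break;   // required share reached
        selected.push_back(idx);
        covered += informativeness[idx];
    }
    return selected;
}
```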

    The limitations of feature selection methods for reducing the dimension of the space are associated with the assumption that the necessary features are directly present in the source data, which usually turns out to be incorrect. An alternative approach to dimensionality reduction involves transforming the features into a reduced set of new variables. In contrast to the selection of initial features, the formation of a new feature space involves creating new variables, which are usually functions of the original features. These variables, not directly observable, are often called hidden, or latent. During their creation, these variables can be given various useful properties, such as orthogonality. In practice, the original features are usually interrelated, so transforming their space into an orthogonal one produces new coordinate features that are free of the effect of duplicated information about the objects under study.

    Mapping objects into a new orthogonal feature space makes it possible to see the usefulness of each feature in terms of the differences between these objects. If the coordinates of the new basis are ordered by variance, which characterizes the spread of their values over the observations under consideration, then the practical uselessness of some features with small variance becomes obvious, since objects are practically indistinguishable by these features in comparison with their differences along the more informative variables. In such a situation, one can speak of degeneracy of the original space of k features, whose real dimension m may be smaller than the original one (m < k).

    Reduction of the feature space is accompanied by a certain reduction in the information content of the data, but the level of acceptable reduction can be determined in advance. Feature extraction projects a set of original variables into a lower dimensional space. Compressing the feature space to two or three dimensions can be useful for data visualization. Thus, the process of forming a new feature space usually leads to a smaller set of truly informative variables. On their basis, a higher quality model can be built as one based on a smaller number of the most informative features.

    The formation of new variables based on the original ones is used for latent semantic analysis, data compression, classification and pattern recognition, increasing the speed and efficiency of learning processes. The compressed data is usually used for further analysis and modeling.

    One important application of feature space transformation and dimensionality reduction is the construction of synthetic latent categories based on measured feature values. These latent features can characterize certain general features of the phenomenon being studied, integrating the particular properties of the observed objects, which makes it possible to construct integral indicators of various levels of generalization of information.

    Feature space reduction methods also play a significant role in studying the problem of duplicated information in the original features, which leads to inflation of the variance of the estimates of regression model coefficients. The transition to new, ideally orthogonal and meaningfully interpretable variables is an effective means of modeling under multicollinearity of the source data.

    Transforming the original feature space into an orthogonal one is convenient for solving classification problems, since it allows you to reasonably apply certain measures of proximity or differences between objects, such as the Euclidean distance or the square of the Euclidean distance. In regression analysis, constructing a regression equation using principal components allows us to solve the problem of multicollinearity.

    Chapter 13. PRINCIPAL COMPONENT METHOD

    13.1. The essence of the problem of dimensionality reduction and various methods for solving it

    In research and practical statistical work, one has to deal with situations where the total number of indicators recorded for each of the many objects surveyed (countries, cities, enterprises, families, patients, technical or environmental systems) is very large, on the order of a hundred or more. However, the available multivariate observations

    x_i = (x_i^(1), x_i^(2), ..., x_i^(p))^T, i = 1, 2, ..., n,   (13.1)

    must be statistically processed, comprehended or entered into a database in order to be used at the right time.

    The desire of the statistician to represent each of the observations (13.1) in the form of a vector Z of auxiliary indicators with a significantly smaller (than p) number of components is due primarily to the following reasons:

    the need for a visual representation (visualization) of the initial data (13.1), which is achieved by projecting them onto a specially selected three-dimensional space, a plane or a number line (Section IV is devoted to problems of this type);

    the desire for conciseness of the models under study, driven by the need to simplify the computation and interpretation of the resulting statistical conclusions;

    the need to significantly compress the volume of stored statistical information (without visible losses in its information content), if we are talking about recording and storing arrays of type (13.1) in a special database.

    In this case, new (auxiliary) features can be selected from among the original ones or determined according to some rule based on the set of initial features, for example, as their linear combinations. When forming the new system of features, various requirements are imposed on them, such as the greatest informativeness (in a certain sense), mutual uncorrelatedness, the least distortion of the geometric structure of the set of initial data, and so on. Depending on the variant of formal specification of these requirements (see below, as well as Section IV), we arrive at one or another dimensionality reduction algorithm. There are at least three main types of fundamental prerequisites that make it possible to pass from a large number of initial indicators of the state (behavior, operating efficiency) of the analyzed system to a significantly smaller number of the most informative variables: firstly, duplication of the information delivered by highly interrelated features; secondly, the lack of informativeness of features that change little when moving from one object to another (low "variability" of the features); thirdly, the possibility of aggregation, i.e. simple or "weighted" summation, according to certain criteria.

    Formally, the problem of passing (with minimal losses of informativeness) to a new set of features can be described as follows. Let Z(X) = (z^(1)(X), z^(2)(X), ..., z^(p')(X))^T be some p'-dimensional vector function of the original variables x^(1), x^(2), ..., x^(p), and let I_{p'}(Z(X)) be a certain specified measure of the informativeness of the p'-dimensional system of features. The specific choice of the functional I_{p'} depends on the specifics of the real problem being solved and is based on one of two possible criteria: the criterion of auto-informativeness, aimed at the maximum preservation of the information contained in the original array relative to the original features themselves, and the criterion of external informativeness, aimed at the maximum "squeezing out" of the information contained in this array relative to some other (external) indicators.

    The task is to determine a set of features Z*, found in the class F of admissible transformations of the initial indicators x^(1), x^(2), ..., x^(p), such that

    I_{p'}(Z*(X)) = max_{Z ∈ F} I_{p'}(Z(X)).   (13.2)

    One or another version of the specification of this statement (determining the specific choice of the informativeness measure I_{p'}(Z) and the class of admissible transformations F) leads to a specific method of dimensionality reduction: the principal component method, factor analysis, extreme grouping of parameters, etc.

    Let's explain this with examples.

    13.1.1. Principal component method (see § 13.2-§ 13.6).

    It is the first p' principal components that the researcher arrives at if, as the class F of admissible transformations, he takes all possible linear orthogonal normalized combinations of the initial indicators, i.e.

    z^(j)(X) = c_{j1}(x^(1) − E x^(1)) + c_{j2}(x^(2) − E x^(2)) + ... + c_{jp}(x^(p) − E x^(p)), j = 1, 2, ..., p',

    where c_{j1}^2 + c_{j2}^2 + ... + c_{jp}^2 = 1 and the combinations are mutually orthogonal (here E x^(i) is the mathematical expectation of x^(i)), and as the measure of informativeness of the p'-dimensional system of indicators takes the expression

    I_{p'}(Z) = (D z^(1) + D z^(2) + ... + D z^(p')) / (D x^(1) + D x^(2) + ... + D x^(p))

    (here D, as before, denotes the operation of calculating the variance of the corresponding random variable).

    13.1.2. Factor analysis (see Chapter 14).

    As is known (see § 14.1), the factor analysis model explains the structure of connections between the initial indicators by the fact that the behavior of each of them statistically depends on the same set of so-called common factors, i.e.

    x^(i) = q_{i1} f^(1) + q_{i2} f^(2) + ... + q_{ip'} f^(p') + ε^(i), i = 1, 2, ..., p,

    where q_{ij} is the "loading" of the j-th common factor f^(j) on the i-th initial indicator, ε^(i) is the residual "specific" random component, and the components ε^(1), ..., ε^(p) are pairwise uncorrelated.

    It turns out that if F is defined as the class of all possible linear combinations of the common factors, taking into account the mentioned restrictions, and the measure of informativeness of the p'-dimensional system of indicators is chosen as a quantity characterizing how closely the correlation matrix reproduced by the indicators Z approximates, in the Euclidean matrix norm, the correlation matrix of the initial indicators, then the solution of the optimization problem (13.2) coincides with the vector of common factors of the factor analysis model.

    13.1.3. Method of extreme grouping of features (see clause 14.2.1).

    In this method, we are talking about dividing the set of initial indicators into a given number of groups such that the features belonging to one group are relatively strongly intercorrelated, while the features belonging to different groups are weakly correlated. At the same time, the problem of replacing each group of strongly intercorrelated initial indicators with one auxiliary "resultant" indicator is solved, which, naturally, should be closely correlated with the features of its own group. Having defined as the class F of admissible transformations of the initial indicators all normalized linear combinations, we look for the solution by maximizing (with respect to the partition S = {S_1, ..., S_{p'}} and the resultant indicators z^(1), ..., z^(p')) the functional

    J = Σ_{i ∈ S_1} r^2(x^(i), z^(1)) + Σ_{i ∈ S_2} r^2(x^(i), z^(2)) + ... + Σ_{i ∈ S_{p'}} r^2(x^(i), z^(p')),

    where r(x, z) is the correlation coefficient between the variables x and z.

    13.1.4. Multidimensional scaling (see Chapter 16).

    In a number of situations, and primarily in situations where the initial statistical data are obtained by means of special surveys, questionnaires and expert assessments, it may happen that the element of primary observation is not the state of an object described by the vector x_i, but the characteristic of pairwise proximity (remoteness) of two objects (or features) with numbers i and j.

    In this case, the researcher has, as the array of initial statistical data, a matrix of size n × n (if the characteristics of pairwise proximity of objects are considered) or p × p (if the characteristics of pairwise proximity of features are considered) of the form

    D = (d_ij),   (13.5)

    where the quantities d_ij are interpreted either as distances between the objects (features) i and j, or as ranks specifying the ordering of these distances. The task of multidimensional scaling is to "immerse" our objects (features) into a p'-dimensional space, i.e. to choose the coordinate axes, in such a way that the initial geometric configuration of the set of analyzed object points (or feature points), specified by (13.1) or (13.5), is distorted least of all in the sense of some criterion of the average "degree of distortion" of the mutual pairwise distances.

    One of the fairly general multidimensional scaling schemes is determined by the criterion

    where d_ij is the distance between objects i and j in the original space, δ_ij is the distance between the same objects in the desired space of lower dimension, and the remaining quantities entering the criterion are free parameters, the choice of whose specific values is left to the discretion of the researcher.

    Having defined the measure of informativeness of the desired set of features Z, for example, as the reciprocal of the above-mentioned degree of distortion of the geometric structure of the original set of points, we reduce this problem to the general formulation (13.2) with a correspondingly specified class F of admissible transformations.

    13.1.5. Selection of the most informative indicators in discriminant analysis models (see § 1.4; 2.5).

    The functionals above are measures of the auto-informativeness of the corresponding system of features. Let us now give examples of criteria of external informativeness. In particular, we will be interested in the informativeness of a system of indicators from the point of view of correct classification of objects by these indicators in the discriminant analysis scheme. In this case, the class of admissible transformations F is defined by the requirement that only representatives of the set of initial indicators may be considered, i.e. z^(j) ∈ {x^(1), x^(2), ..., x^(p)}, j = 1, 2, ..., p'.

    A common starting point in solving the problem of identifying the most informative indicators from the original set is the statement that a vector of indicators of a given dimension is the more informative, the greater the differences between the laws of its probability distribution defined in the different classes of the classification problem under consideration. If we introduce a measure of the pairwise difference between the laws describing the probability distribution of the feature vector in the classes with the corresponding numbers, then the above principle of selecting the most informative indicators can be formalized by determining them from the condition of maximizing (over the admissible sets of indicators) the resulting measure of interclass difference.

    The most commonly used measures of the difference between probability distribution laws are distances of the information type (the Kullback distance, the Mahalanobis distance), as well as the "variation distance" (for more details, see ...).
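    As a small illustration of one such measure, the sketch below computes the Mahalanobis distance between two class mean vectors with OpenCV. The availability of a common (pooled) covariance matrix of the indicators is an assumption of the example.

```cpp
#include <opencv2/core.hpp>

// Mahalanobis distance between the mean vectors of two classes, given a
// shared covariance matrix of the features (an assumption of this sketch);
// larger values indicate better separated classes.
double classDistance(const cv::Mat& mean1, const cv::Mat& mean2,
                     const cv::Mat& covar)   // all CV_64F, means are 1 x d
{
    cv::Mat icovar;
    cv::invert(covar, icovar, cv::DECOMP_SVD);   // inverse covariance matrix
    return cv::Mahalanobis(mean1, mean2, icovar);
}
```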

    13.1.6. Selection of the most informative variables in regression models (see).

    When constructing regression-type dependencies, one of the central questions is the identification of a relatively small number of variables (from the a priori set of factor-arguments) that most significantly influence the behavior of the resulting characteristic under study.

    Thus, as in the previous paragraph, the class F consists of all possible sets of variables selected from the initial set of factor-arguments, and we are dealing with a criterion of external informativeness of such sets. Its form is usually specified using the multiple coefficient of determination, a characteristic of the closeness of the relationship between the indicator y and the set of variables. In this case, for a fixed dimension, the set of variables will obviously be considered the most informative (from the point of view of the accuracy of describing the behavior of the indicator y) if the value of this measure of informativeness reaches its maximum on this set.
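    A sketch of how a candidate subset of regressors can be scored by its coefficient of determination. The least-squares fit via cv::solve and the matrix layout are assumptions of the example.

```cpp
#include <opencv2/core.hpp>

// Coefficient of determination R^2 of a least-squares fit of y on the
// chosen subset of regressors. X: N x k CV_64F (selected columns only;
// a column of ones can be appended for an intercept), y: N x 1 CV_64F.
double rSquared(const cv::Mat& X, const cv::Mat& y)
{
    cv::Mat beta;
    cv::solve(X, y, beta, cv::DECOMP_SVD);   // least-squares solution

    cv::Mat residuals = y - X * beta;
    double ssRes = residuals.dot(residuals);   // residual sum of squares

    cv::Mat centered = y - cv::mean(y)[0];
    double ssTot = centered.dot(centered);     // total sum of squares

    return 1.0 - ssRes / ssTot;   // closer to 1 = more informative subset
}
```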

    • In statistics, machine learning and information theory, dimensionality reduction is a transformation of data that consists of reducing the number of variables by obtaining the principal variables. The transformation can be divided into feature selection and feature extraction.

    Related concepts



    Spectral clustering techniques use the spectrum (eigenvalues) of the data similarity matrix to perform dimensionality reduction before clustering in lower-dimensional spaces. The similarity matrix is provided as input and consists of quantitative estimates of the relative similarity of each pair of points in the data.

    Spectral methods are a class of techniques used in applied mathematics for the numerical solution of certain differential equations, possibly involving the Fast Fourier Transform. The idea is to rewrite the solution of differential equations as the sum of some "basis functions" (like how Fourier series are the sum of sinusoids), and then choose the coefficients in the sum to satisfy the differential equation as best as possible.
