Basics of linear regression. Methods of mathematical statistics. Regression analysis

    SUMMARY OUTPUT

    Table 8.3a. Regression statistics

    Regression statistics
    Multiple R           0.998364
    R-squared            0.99673
    Adjusted R-squared   0.996321
    Standard error       0.42405
    Observations         10

    Let us first consider the top part of the calculations, presented in Table 8.3a: the regression statistics.

    The R-squared value, also called the measure of certainty, characterizes the quality of the resulting regression line, that is, the degree of correspondence between the original data and the regression model (the calculated data).

    The measure of certainty always lies within the interval [0, 1]; in most cases the R-squared value falls strictly between these extreme values, zero and one.

    If the R-squared value is close to one, the constructed model explains almost all of the variability in the relevant variables. Conversely, an R-squared value close to zero means the constructed model is of poor quality.

    In our example, the measure of certainty is 0.99673, which indicates a very good fit of the regression line to the original data.

    Multiple R, the multiple correlation coefficient R, expresses the degree of dependence between the independent variables (X) and the dependent variable (Y).

    Multiple R is equal to the square root of the coefficient of determination; this quantity takes values in the range from zero to one.

    In simple linear regression analysis, multiple R is equal to the Pearson correlation coefficient. Indeed, the multiple R in our case is equal to the Pearson correlation coefficient from the previous example (0.998364).

    Table 8.3b. Regression coefficients*

                   Coefficients    Standard error   t-statistic
    Y-intercept    2.694545455     0.33176878       8.121757129
    Variable X1    2.305454545     0.04668634       49.38177965

    * A truncated version of the calculations is provided.

    Now consider the middle part of the calculations, presented in Table 8.3b. Here the regression coefficient b (2.305454545) and the intercept on the ordinate axis, i.e. the constant a (2.694545455), are given.

    Based on the calculations, we can write the regression equation as follows:

    Y = 2.305454545·X + 2.694545455
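    As a cross-check, here is a minimal Python sketch (NumPy assumed; the document itself works in Excel) that refits this line by least squares. The x values are reconstructed from the Predicted Y column of Table 8.3c via x = (ŷ − a)/b, and the y values are Predicted Y plus the corresponding residual, so these arrays are a reconstruction rather than the original worksheet data.

    import numpy as np

    # x reconstructed from the "Predicted Y" column of Table 8.3c;
    # y is Predicted Y plus the corresponding residual.
    x = np.array([3, 2, 4, 5, 6, 7, 8, 9, 10, 11], dtype=float)
    y = np.array([9.0, 7.0, 12.0, 15.0, 17.0, 19.0, 21.0, 23.4, 25.6, 27.8])

    b, a = np.polyfit(x, y, 1)  # least-squares slope and intercept
    print(b, a)                 # approx. 2.305454545 and 2.694545455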

    The direction of the relationship between the variables is determined by the sign (negative or positive) of the regression coefficient (the coefficient b).

    If the sign of the regression coefficient is positive, the relationship between the dependent variable and the independent variable is positive. In our case the sign of the regression coefficient is positive; therefore, the relationship is also positive.

    If the sign of the regression coefficient is negative, the relationship between the dependent variable and the independent variable is negative (inverse).

    Table 8.3c presents the residual output. For these results to appear in the report, the “Residuals” checkbox must be activated when running the “Regression” tool.

    RESIDUAL OUTPUT

    Table 8.3c. Residuals

    Observation   Predicted Y    Residuals      Standard residuals
    1             9.610909091    -0.610909091   -1.528044662
    2             7.305454545    -0.305454545   -0.764022331
    3             11.91636364     0.083636364    0.209196591
    4             14.22181818     0.778181818    1.946437843
    5             16.52727273     0.472727273    1.182415512
    6             18.83272727     0.167272727    0.418393181
    7             21.13818182    -0.138181818   -0.34562915
    8             23.44363636    -0.043636364   -0.109146047
    9             25.74909091    -0.149090909   -0.372915662
    10            28.05454545    -0.254545455   -0.636685276

    Using this part of the report, we can see the deviation of each point from the constructed regression line. The residual with the largest absolute value (0.778181818) corresponds to observation 4.
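    The standard residuals in Table 8.3c match the convention of dividing each residual by √(SSE/(n − 1)). A short sketch (NumPy assumed) that reproduces them:

    import numpy as np

    # Residuals from Table 8.3c; the "standard residuals" divide each
    # residual by sqrt(SSE / (n - 1)).
    resid = np.array([-0.610909091, -0.305454545, 0.083636364, 0.778181818,
                      0.472727273, 0.167272727, -0.138181818, -0.043636364,
                      -0.149090909, -0.254545455])
    scale = np.sqrt(np.sum(resid**2) / (len(resid) - 1))
    print(resid / scale)  # first value approx. -1.528, matching the table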

    Assessing the quality of a regression equation using the coefficient of determination. Testing the null hypothesis about the significance of the equation and of the relationship-strength indicators using Fisher's F test.

    Standard errors of coefficients.

    The regression equation is:

    Y = 3378.41 − 494.59·X1 − 35.00·X2 + 75.74·X3 − 15.81·X4 + 80.10·X5 + 59.84·X6 +
        (1304.48)  (226.77)   (10.31)   (277.57)   (287.54)   (35.31)   (150.93)
      + 127.98·X7 − 78.10·X8 − 437.57·X9 + 451.26·X10 − 299.91·X11 − 14.93·X12 − 369.65·X13     (9)
        (22.35)    (31.19)    (97.68)     (331.79)     (127.84)     (86.06)     (105.08)

    To fill out the table “Regression statistics” (Table 9) we find:

    1. Multiple R – the correlation coefficient r between y and ŷ.

    To do this, use the CORREL function by entering the arrays y and ŷ.

    The resulting number 0.99 is close to 1, which shows a very strong relationship between the experimental data and the calculated data.
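    The same correlation can be computed outside Excel; a minimal sketch (NumPy assumed, with illustrative placeholder arrays rather than the worksheet data):

    import numpy as np

    # Equivalent of Excel's CORREL(y, y_hat): multiple R is the Pearson
    # correlation between the observed y and the fitted values y_hat.
    y = np.array([10.0, 12.0, 15.0, 19.0, 26.0])       # illustrative values
    y_hat = np.array([10.4, 11.7, 15.3, 18.9, 25.7])   # illustrative values
    r = np.corrcoef(y, y_hat)[0, 1]
    print(r)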

    2. To calculate R-squared, we find:

    the explained error: 17455259.48,

    and the unexplained (residual) error, computed analogously from the residuals.

    Therefore, R-squared, the ratio of the explained error to the total error, is approximately 0.97.

    Accordingly, 97% of the experimental data can be explained by the resulting regression equation.

    3. The adjusted (normalized) R-squared is found by the formula Adjusted R² = 1 − (1 − R²)·(n − 1)/(n − k).

    This indicator serves to compare different regression models when the set of explanatory variables changes.

    4. Standard error – the square root of the sample residual variance, S = √(SS_residual/(n − p − 1)).

    As a result, we obtain the following table.

    Table 9.

    Filling out the “Analysis of Variance” table

    Most of the data has already been obtained above (the explained and unexplained error).

    Let us calculate the regression mean square: MS_regression = 17455259.48/13 = 1342712.27.



    We will assess the statistical significance of the regression equation as a whole using Fisher's F test. The multiple regression equation is significant (that is, the hypothesis H0 that the parameters of the regression model equal zero is rejected) if

    F > F_table,    (10)

    where F_table is the tabulated value of Fisher's F test.

    The actual value of the F test, computed as F = MS_regression/MS_residual, is 16.88.

    To calculate the table value of Fisher's test, the FINV function is used (Figure 4).

    Degrees of freedom 1: p = 13

    Degrees of freedom 2: n − p − 1 = 20 − 13 − 1 = 6

    Figure 4. Using the FINV function in Excel.

    F_table = 3.976 < 16.88; therefore, the model is adequate to the experimental data.

    Significance F is calculated using the FDIST function. This function returns the F probability distribution (Fisher distribution) and allows you to determine whether two data sets have different degrees of dispersion in their results.

    Figure 5. Using the FDIST function in Excel.

    Significance F = 0.001.
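    Both Excel calls can be checked outside Excel; a minimal sketch using SciPy (an assumption of this note):

    from scipy.stats import f

    df1, df2 = 13, 6                 # p and n - p - 1 from the example
    f_table = f.ppf(0.95, df1, df2)  # counterpart of FINV(0.05, 13, 6)
    p_value = f.sf(16.88, df1, df2)  # counterpart of FDIST(16.88, 13, 6)
    print(f_table, p_value)          # approx. 3.976 and 0.001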

    REPORT

    Assignment: consider a regression analysis procedure based on data (sale price and living space) for 23 real estate properties.

    The "Regression" operating mode is used to calculate the parameters of the equation linear regression and checking its adequacy to the process under study.

    To solve the regression analysis problem in MS Excel, select Tools → Data Analysis from the menu and choose the "Regression" analysis tool.

    In the dialog box that appears, set the following parameters:

    1. Input Range Y is the range of data for the resulting attribute. It must consist of one column.

    2. Input Range X is a range of cells containing the values of the factors (independent variables). The number of input ranges (columns) must not exceed 16.

    3. The Labels checkbox is set if the first row of the range contains a title.

    4. The Confidence Level checkbox is activated if a confidence level different from the default needs to be entered in the field next to it. It is used to test the significance of the coefficient of determination R² and of the regression coefficients.

    5. Constant is Zero. This checkbox must be checked if the regression line must pass through the origin (a0 = 0).

    6. Output Range / New Worksheet / New Workbook – specify the address of the upper-left cell of the output range.

    7. The checkboxes in the Residuals group are set if the corresponding columns or charts need to be included in the output range.

    8. The Normal Probability Plot checkbox must be activated if you want to display a scatter plot of the observed Y values against the automatically generated percentile intervals.

    After clicking the OK button, we get the report in the output range.

    Using a set of data analysis tools, we will perform regression analysis of the source data.

    The Regression analysis tool is used to fit the parameters of a regression equation by the least squares method. Regression is used to analyze the impact of one or more independent variables on an individual dependent variable.

    THE REGRESSION STATISTICS TABLE

    The multiple R value is the square root of the coefficient of determination (R-squared). It is also called the correlation index or the multiple correlation coefficient. It expresses the degree of dependence between the independent variables (X1, X2) and the dependent variable (Y) and takes values in the range from zero to one. In our case it equals 0.7, which indicates a significant relationship between the variables.

    The R-squared value (the coefficient of determination), also called the measure of certainty, characterizes the quality of the resulting regression line. This quality is expressed by the degree of correspondence between the source data and the regression model (the calculated data). The measure of certainty always lies within the interval [0, 1].

    In our case, the R-squared value is 0.48, i.e. almost 50%, which indicates a poor fit of the regression line to the original data. Since the found value R-squared = 48% < 75%, we can also conclude that forecasting with the found regression relationship is not possible. Thus, the model explains only 48% of the variation in price, which indicates that the selected factors are insufficient or that the sample size is too small.

    Adjusted (normalized) R-squared is the same coefficient of determination, but adjusted for the sample size:

    Adjusted R² = 1 − (1 − R²)·(n − 1)/(n − k),


    where n is the number of observations and k is the number of parameters. The adjusted R-squared is preferable when adding new regressors (factors), because R-squared itself will grow as they are added, which does not necessarily indicate an improvement in the model. Since in our case the resulting value is 0.43 (which differs from R-squared by only 0.05), we can speak of high confidence in the R-squared coefficient.
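    The reported value is easy to reproduce; a short worked check in plain Python, assuming n = 23 and k = 3 parameters (two regressors plus the intercept):

    # Adjusted R-squared for the example: n = 23, k = 3.
    r2 = 0.48
    n, k = 23, 3
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
    print(adj_r2)  # approx. 0.43, as in the report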

    Standard error shows the quality of the approximation of the observation results. In our case the error is 5.1. As a percentage: 5.1/(57.4 − 40.1) = 0.294 ≈ 29% (the model is considered better when the standard error is below 30%).

    Observations – the number of observed values (23).

    THE ANALYSIS OF VARIANCE TABLE

    To obtain the regression equation, an F-statistic is determined – a characteristic of the accuracy of the regression equation, which is the ratio of the part of the variance of the dependent variable explained by the regression equation to the unexplained (residual) part of the variance.

    The df column gives the number of degrees of freedom k.

    For regression, this is the number of regressors (factors) - X1 (area) and X2 (score), i.e. k=2.

    For the residual, this value equals n − (k + 1), i.e. the number of initial points (23) minus the number of coefficients (2) and minus the free term (1).

    The SS column contains the sums of squared deviations from the mean value of the resulting characteristic. It presents:

    the regression sum of squared deviations of the theoretical values, calculated from the regression equation, from the mean value of the resulting characteristic;

    the residual sum of squared deviations of the original values from the theoretical values;

    the total sum of squared deviations of the original values from the mean of the resulting characteristic.

    The larger the regression sum of squared deviations (or the smaller the residual sum), the better the regression equation approximates the cloud of original points. In our case the residual sum is about 50% of the total; consequently, the regression equation approximates the cloud of initial points very poorly.

    The MS column contains the unbiased sample variances, regression and residual.

    The F column contains the value of the test statistic, calculated to test the significance of the regression equation.

    To carry out a statistical test of the significance of the regression equation, a null hypothesis about the absence of a relationship between the variables (all coefficients of the variables equal zero) is formulated, and a significance level is selected.

    The significance level is the acceptable probability of making a type I error – rejecting a correct null hypothesis as a result of the test. In this case, making a type I error means concluding from the sample that there is a relationship between the variables in the population when in fact there is none. The significance level is typically taken to be 5%. Comparing the obtained value F = 9.4 with the table value Fcr = 3.5 (with 2 and 20 degrees of freedom, respectively), we can say that the regression equation is significant (F > Fcr).

    The Significance F column gives the probability of obtaining such a value of the test statistic. Since in our case this value = 0.00123, which is less than 0.05, we can say that the regression equation (dependence) is significant with a probability of 95%.
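    Both numbers can be verified outside Excel; a minimal sketch using SciPy (an assumption of this note):

    from scipy.stats import f

    df1, df2 = 2, 20
    f_crit = f.ppf(0.95, df1, df2)  # approx. 3.49, the table value Fcr
    p_val = f.sf(9.4, df1, df2)     # approx. 0.0013, close to the reported
                                    # Significance F of 0.00123
    print(f_crit, p_val)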

    The two columns described above show the reliability of the model as a whole.

    The following table contains the coefficients for the regressors and their estimates.

    The Y-intercept row is not associated with any regressor; it is the free coefficient.

    The Coefficients column records the values of the regression equation coefficients. Thus, the following equation was obtained:

    Y = 25.6 + 0.009·X1 + 0.346·X2

    The regression equation must pass through the center of the cloud of initial points: 13.02 ≤ M(b) ≤ 38.26.

    Next, compare the values in the Coefficients and Standard Error columns pairwise. In our case, all coefficients exceed their standard errors in absolute value. This may indicate that the regressors are significant; however, this is only a rough analysis. The t-statistic column contains a more accurate assessment of the significance of the coefficients.

    The t-statistic column contains the t-test values, calculated using the formula:

    t=(Coefficient)/(Standard error)

    This test has a Student distribution with the number of degrees of freedom

    n − (k + 1) = 23 − (2 + 1) = 20

    Using the Student's table we find t_table = 2.086. Comparing t with t_table, we find that the coefficient of regressor X2 is insignificant.

    The p-value column gives the probability that the critical value of the test statistic (Student's t-statistic) exceeds the value computed from the sample. In this case we compare the p-values with the selected significance level (0.05). It can be seen that only the coefficient of regressor X2, with p = 0.08 > 0.05, can be considered insignificant.

    The Lower 95% and Upper 95% columns give the confidence interval limits at the 95% confidence level. Each coefficient has its own limits: Coefficient ± t_table × Standard error.
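    A hedged sketch of these three columns (SciPy assumed; the arguments b and se stand for any coefficient and its standard error taken from the report):

    from scipy.stats import t

    n, k = 23, 2
    df = n - (k + 1)            # 20 degrees of freedom
    t_table = t.ppf(0.975, df)  # approx. 2.086, as found from the table

    def coef_tests(b: float, se: float):
        """t-statistic, two-sided p-value, and 95% confidence interval."""
        t_stat = b / se
        p_value = 2 * t.sf(abs(t_stat), df)
        ci = (b - t_table * se, b + t_table * se)
        return t_stat, p_value, ci

    print(t_table)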

    Confidence intervals are constructed only for statistically significant values.

    In works dating back to 1908, the method was described using the example of the work of an agent selling real estate. In his records, the house-sales specialist kept track of a wide range of input data for each specific building. Based on the results of the auctions, it was determined which factor had the greatest influence on the transaction price.

    Analysis of a large number of transactions yielded interesting results. The final price was influenced by many factors, sometimes leading to paradoxical conclusions and even to obvious "outliers," when a house with high initial potential was sold at a reduced price.

    The second example of the application of such an analysis is the task of determining employee remuneration. The complexity of the task lay in the fact that it required not the distribution of a fixed amount to everyone, but its strict correspondence to the specific work performed. The appearance of many problems with practically similar solutions required a more detailed study of them at the mathematical level.

    A significant place was allocated to the section on regression analysis, which combined the practical methods used to study dependencies that fall under the concept of regression. These relationships are observed between data obtained from statistical studies.

    Among the many problems to be solved, regression analysis sets three main goals: determining the regression equation in general form; constructing estimates of the unknown parameters that enter the regression equation; and testing statistical regression hypotheses. In the course of studying the relationship between a pair of quantities obtained from experimental observations and constituting a series (set) of the form (x1, y1), ..., (xn, yn), one relies on the provisions of regression theory and assumes that the quantity Y has a certain probability distribution while the other quantity X remains fixed.

    The result Y depends on the value of the variable X; this dependence can be determined by various patterns, while the accuracy of the results obtained is influenced by the nature of the observations and the purpose of the analysis. The experimental model is based on certain assumptions that are simplified but plausible. The main condition is that the parameter X is a controlled quantity: its values are set before the start of the experiment.

    If the experiment uses a pair of uncontrolled variables X and Y, then regression analysis is carried out in the same way, but to interpret the results, in which the relationship between the studied random variables is examined, the methods of mathematical statistics are applied. Methods of mathematical statistics are not an abstract topic; they find application in various spheres of human activity.

    In the scientific literature, the term linear regression analysis is widely used for the method described above. The variable X is called the regressor or predictor, and the dependent variable Y is also called the criterion variable. This terminology reflects only the mathematical dependence between the variables, not a cause-and-effect relationship.

    Regression analysis is the most common method used in processing the results of a wide variety of observations. Physical and biological dependencies are studied by means of this method; it is applied in both economics and technology. Many other fields use regression analysis models. Analysis of variance and multivariate statistical analysis work closely with this method of study.

    Lecture 4

    1. Elements of statistical analysis of the model
    2. Checking the statistical significance of regression equation parameters
    3. Analysis of Variance
    4. Checking the overall quality of the regression equation
    5. F-statistics. Fisher distribution in regression analysis.

    When assessing the relationship between the endogenous and exogenous variables (y and x) using sample data, it is not always possible to obtain a successful regression model at the first stage. In this case, the quality of the resulting model should be assessed each time. The quality of the model is assessed in two areas:

    · Statistical evaluation of the model's quality

    Statistical analysis of the model includes the following elements:

    • Checking the statistical significance of regression equation parameters
    • Checking the overall quality of the regression equation
    • Checking the properties of the data that were assumed to be true when estimating the equation

    The statistical significance of the parameters of the regression equation is determined using the t-statistic (Student statistic): t_b = b / m_b, where:

    t_b – the t-statistic for the regression coefficient b;

    m_b – the standard error of the regression coefficient.

    The t-statistic for the correlation coefficient R is also calculated: t_r = r·√(n − 2)/√(1 − r²).

    Thus t_b² = t_r² = F. That is, testing the statistical significance of the regression coefficient b is equivalent to testing the statistical significance of the correlation coefficient.

    The correlation coefficient shows the closeness of the correlation relationship (between x and y).

    For linear regression, the correlation coefficient is r = b·(S_x/S_y), where S_x and S_y are the standard deviations of x and y.

    To determine the closeness of the relationship, the Chaddock scale is usually used:

    R = 0.1 – 0.3: weak;

    R = 0.3 – 0.5: moderate;

    R = 0.5 – 0.7: noticeable;

    R = 0.7 – 0.9: high;

    R = 0.9 – 0.99: very high relationship between x and y.

    The correlation coefficient takes values from −1 to 1.

    For practical purposes, the elasticity coefficient and the beta coefficient are often calculated.

    The elasticity of the function y = f(x) is the limit of the ratio of the relative changes in y and x: E = f′(x)·(x/y).

    Elasticity shows how much % y will change when x changes by 1%.

    For paired linear regression, the elasticity coefficient is calculated using the formula E = b·(x̄/ȳ).

    It shows by how many percent, on average, y changes when x changes by 1% on average.

    The beta coefficient is β = b·(S_x/S_y), where

    S_x – the standard deviation of x;

    S_y – the standard deviation of y.

    The beta coefficient shows by what part of its standard deviation y will change when x changes by the value of its standard deviation.
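    A minimal sketch of both coefficients (NumPy assumed, with illustrative placeholder data):

    import numpy as np

    # Elasticity E = b * mean(x)/mean(y); beta = b * std(x)/std(y),
    # following the formulas above.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative values
    y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])   # illustrative values
    b, a = np.polyfit(x, y, 1)

    elasticity = b * x.mean() / y.mean()       # % change in y per 1% in x
    beta = b * x.std(ddof=1) / y.std(ddof=1)   # shift in standard deviations
    print(elasticity, beta)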


    Analysis of Variance

    In analysis of variance, a special place is occupied by the decomposition of the total sum of squared deviations of the variable y from its mean into two parts: the part explained by the regression and the part not explained by it.

    The total sum of squared deviations equals the sum of squared deviations explained by the regression plus the residual sum of squared deviations: Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)².

    These sums are associated with numbers of degrees of freedom df – the number of degrees of freedom of independent variation of the characteristic.

    Thus, the total sum of squared deviations has n − 1 degrees of freedom.

    The sum of squared deviations explained by the regression has one degree of freedom, since it depends on a single quantity – the regression coefficient b.

    There is an equality between the numbers of degrees of freedom, from which the residual sum has n − 2 degrees of freedom:

    n − 1 = 1 + (n − 2)

    Dividing each sum by its corresponding number of degrees of freedom, we obtain the mean squares of deviations, or variances: D_total, D_factor, and D_residual.
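    A quick numerical check of the decomposition of the sums of squares (NumPy assumed; the data are the pairs reconstructed from Table 8.3c above):

    import numpy as np

    # Verify SS_total = SS_regression + SS_residual for a fitted line.
    x = np.array([3, 2, 4, 5, 6, 7, 8, 9, 10, 11], dtype=float)
    y = np.array([9.0, 7.0, 12.0, 15.0, 17.0, 19.0, 21.0, 23.4, 25.6, 27.8])
    b, a = np.polyfit(x, y, 1)
    y_hat = b * x + a

    ss_total = np.sum((y - y.mean())**2)
    ss_regr = np.sum((y_hat - y.mean())**2)
    ss_resid = np.sum((y - y_hat)**2)
    print(np.isclose(ss_total, ss_regr + ss_resid))  # True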

    Assessing the overall quality of a regression equation means determining whether the mathematical model expressing the relationship between the variables corresponds to the experimental data, and whether the variables included in the model are sufficient to explain y.

    Assessing the overall quality of the model = assessing the reliability of the model = assessing the reliability of the regression equation.

    The overall quality of the regression model is assessed on the basis of analysis of variance. To assess the quality of the model, the coefficient of determination is calculated: R² = 1 − SS_residual/SS_total.

    In this formula, the numerator contains a sample estimate of the residual variance and the denominator a sample estimate of the total variance.

    The coefficient of determination characterizes the proportion of variation in the dependent variable explained by the regression equation.

    So, if R squared is 0.97, this means that 97% of the changes in y are due to changes in x.

    The closer R squared is to one, the stronger the statistically significant linear relationship between x and y.

    To obtain unbiased estimates of the variances, both the numerator and the denominator in the formula are divided by the appropriate numbers of degrees of freedom, which yields the adjusted coefficient of determination: Adjusted R² = 1 − (SS_residual/(n − m − 1))/(SS_total/(n − 1)).

    To determine the statistical significance of the coefficient of determination R², the null hypothesis is tested using the F-statistic, calculated by the formula: F = (R²/(1 − R²))·((n − m − 1)/m).

    For paired linear regression: F = (r²/(1 − r²))·(n − 2).

    The calculated F is compared with the tabulated value of the statistic. The tabulated F is taken with m and n − m − 1 degrees of freedom, at significance level alpha.

    If F_calculated > F_table, then the null hypothesis is rejected and the hypothesis about the statistical significance of the coefficient of determination R² is accepted.

    Fisher's F test = factor variance / residual variance: F = D_factor/D_residual.
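    Applying the paired-regression formula to the first example (R² = 0.99673, n = 10) reproduces the square of the earlier t-statistic, consistent with t_b² = F; a sketch (SciPy assumed):

    from scipy.stats import f

    # For paired linear regression: F = r^2 / (1 - r^2) * (n - 2).
    r2, n = 0.99673, 10
    F = r2 / (1 - r2) * (n - 2)
    f_table = f.ppf(0.95, 1, n - 2)
    print(F, F > f_table)  # approx. 2438 (= 49.38^2), far above the critical value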

    Lecture No. 5

    Checking data properties that were assumed to be true when estimating the regression equation

    1. Autocorrelation in residuals

    2. Durbin-Watson statistics

    3. Examples

    When estimating the parameters of the regression model, it is assumed that the deviations (residuals) are random and independent. In practice, autocorrelation in the residuals can arise in the following cases:

    1. When the relationship between x and y is not linear.

    2. The relationship between the variables x and y is linear, but the indicator under study is influenced by a factor not included in the model. The magnitude of such a factor may change its dynamics over the period under review. This is especially true for lagged variables.

    Both reasons indicate that the resulting regression equation can be improved by estimating a nonlinear relationship or adding an additional factor to the original model.

    The fourth premise of the least squares method states that the deviations are independent of each other; however, when studying and analyzing source data in practice, there are situations when these deviations contain a trend or cyclical fluctuations.
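    The Durbin-Watson statistic listed in the lecture plan detects exactly this first-order autocorrelation; a minimal sketch (NumPy assumed, residuals taken from Table 8.3c for illustration):

    import numpy as np

    # Durbin-Watson: DW = sum((e_t - e_{t-1})^2) / sum(e_t^2);
    # values near 2 indicate no first-order autocorrelation.
    def durbin_watson(e: np.ndarray) -> float:
        return np.sum(np.diff(e)**2) / np.sum(e**2)

    e = np.array([-0.611, -0.305, 0.084, 0.778, 0.473, 0.167,
                  -0.138, -0.044, -0.149, -0.255])
    print(durbin_watson(e))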