• Excel regression model analysis. Konrad Carlberg. Regression analysis in Microsoft Excel

Microsoft Excel does most of the computational work of constructing a linear regression equation for you; what matters is interpreting the results correctly. To build a regression model, select Tools / Data Analysis / Regression (in Excel 2007 this tool is under Data / Data Analysis / Regression). Then copy the results into a block for analysis.

    Initial data:

    Analysis results

Included in the report:
    Calculation of regression equation parameters
    Theoretical material
    Regression equation on standard scale
    Multiple Correlation Coefficient (Multiple Correlation Index)
    Partial elasticity coefficients
    Comparative assessment of the influence of the analyzed factors on the resulting characteristic (d - coefficients of separate determination)

    Checking the quality of the constructed regression equation
Significance of the regression coefficients b_i (t-statistic, Student's test)
Significance of the equation as a whole (F-statistic, Fisher's test). Coefficient of determination
    Partial F-tests

Significance levels: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.4

Topic: CORRELATION AND REGRESSION ANALYSIS IN EXCEL

    LABORATORY WORK No. 1

1. DETERMINING THE PAIRWISE CORRELATION COEFFICIENT IN EXCEL

Correlation is an incomplete, probabilistic relationship between indicators that manifests itself only across a mass of observations.

Pairwise correlation is the relationship between two indicators, one of which is a factor and the other the result.

Multiple correlation arises from the interaction of several factors with a resulting indicator.

    Necessary conditions for using correlation analysis:

1. A sufficiently large number of observations of the factor under study and of the performance indicators must be available.

    2. The factors under study must be quantitatively measured and reflected in certain sources of information.

    The use of correlation analysis allows us to solve the following problems:

    1. Determine the change in the performance indicator under the influence of one or more factors.

    2. Establish the relative degree of dependence of the performance indicator on each factor.

    Task 1.

Data are available for 20 agricultural holdings. Find the correlation coefficient between grain-crop yield and land quality and assess its significance. The data are shown in the table.

    Table. Dependence of grain yield on land quality

Farm number | Land quality, score (x) | Productivity, c/ha (y)

To find the correlation coefficient, use the CORREL function.

The significance of the correlation coefficient is checked using Student's t-test.

For the example under consideration, r = 0.99 and the number of degrees of freedom is 18 (n − 2).

To find the quantile of the Student distribution, use the TINV function (T.INV.2T in newer versions) with the following arguments: Probability = 0.05, Degrees of freedom = 18.

Compare the value of the t-statistic with the quantile of the Student distribution and draw a conclusion about the significance of the pairwise correlation coefficient: if the calculated t-statistic exceeds the quantile, the correlation coefficient is significant.
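Outside Excel, the same check can be sketched in Python. This is a minimal sketch with invented sample data rather than the lab's actual table; NumPy and SciPy are assumed to be available:

```python
import numpy as np
from scipy import stats

# Hypothetical sample of n = 20 farms: land-quality score x and yield y
x = np.array([20, 25, 31, 34, 40, 46, 50, 53, 58, 62,
              66, 70, 73, 77, 80, 84, 88, 91, 95, 99], dtype=float)
noise = np.array([0.4, -0.3, 0.2, -0.5, 0.1, 0.3, -0.2, 0.4, -0.1, 0.0,
                  0.2, -0.4, 0.3, -0.2, 0.1, 0.5, -0.3, 0.2, -0.1, 0.0])
y = 2.5 + 0.5 * x + noise

n = len(x)
r = np.corrcoef(x, y)[0, 1]                    # the value Excel's CORREL returns
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
t_crit = stats.t.ppf(1 - 0.05 / 2, n - 2)      # two-sided 5% quantile, df = 18

significant = abs(t_stat) > t_crit             # significant correlation if True
```

With df = 18 the quantile is about 2.10, matching the tabulated Student value used in the lab.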

    BUILDING A REGRESSION MODEL OF THE RELATIONSHIP BETWEEN TWO QUANTITIES

    Task 2.

    According to task 1:

    1) build a regression equation (linear model), which characterizes the linear relationship between land quality and productivity;

2) check the adequacy of the resulting model.

Method 1.

    1. On an Excel sheet, select an array of free cells consisting of five rows and two columns.

    2. Call the function LINEST.

3. Specify the following arguments for the function: Known_y's: the column of values of the indicator Productivity, c/ha; Known_x's: the column of values of the indicator Land quality, score; Const = 1; Stats = 1 (this computes the indicators used to check the adequacy of the model; if Stats = 0, these indicators are not calculated).

4. Press the key combination Ctrl+Shift+Enter.

    The selected cells display the model coefficients, as well as indicators that allow you to check the model for adequacy (Table 2).

    Table 2

a1   | a0
S_e1 | S_e0
R^2  | S_e
F    | n − k − 1
Q_R  | Q_e

a1, a0 – model coefficients;

S_e1, S_e0 – standard errors of the coefficients; the more accurate the model, the smaller these values;

R^2 – coefficient of determination; the larger it is, the more accurate the model;

F – statistic for testing the significance of the model;

n − k − 1 – number of degrees of freedom (n is the sample size, k the number of input variables; in this example n = 20, k = 1);

Q_R – sum of squares due to regression;

Q_e – sum of squared errors.

5. To check the adequacy of the model, find the quantile F_f of the Fisher distribution using the FINV function (F.INV.RT in newer versions). Enter the function in any free cell with the following arguments: Probability = 0.05, Degrees_of_freedom_1 = 1, Degrees_of_freedom_2 = 18. If F > F_f, the model is adequate to the original data.

6. Check the adequacy of the constructed model using the calculated significance level (P). Enter the FDIST function with the following arguments: X = the value of the F statistic, Degrees_of_freedom_1 = 1, Degrees_of_freedom_2 = 18. If the calculated significance level P < α = 0.05, the model is adequate to the original data.
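The five rows of LINEST output can be reproduced outside Excel. The sketch below uses hypothetical data (not the lab's table) following y ≈ 2.53 + 0.5x, and assumes NumPy and SciPy:

```python
import numpy as np
from scipy import stats

# Hypothetical data following y ≈ 2.53 + 0.5x, as in the worked example
x = np.array([35., 40., 45., 50., 55., 60., 65., 70., 75., 80.])
y = 2.53 + 0.5 * x + np.array([0.3, -0.2, 0.4, -0.5, 0.1,
                               0.2, -0.3, 0.4, -0.1, -0.3])

n, k = len(x), 1
fit = stats.linregress(x, y)
a1, a0 = fit.slope, fit.intercept            # model coefficients (LINEST row 1)

y_hat = a0 + a1 * x
Q_e = np.sum((y - y_hat) ** 2)               # sum of squared errors
Q_R = np.sum((y_hat - y.mean()) ** 2)        # sum of squares due to regression
R2 = Q_R / (Q_R + Q_e)                       # coefficient of determination
F = (Q_R / k) / (Q_e / (n - k - 1))          # F-statistic (LINEST row 4)

F_crit = stats.f.ppf(1 - 0.05, k, n - k - 1) # like FINV(0.05, 1, n - k - 1)
adequate = F > F_crit                        # model fits the data if True
```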

Method 2.

    Determination of model coefficients with obtaining indicators to check its adequacy and the significance of the coefficients.

Select Tools / Data Analysis / Regression. In the dialog box, set: Input Y Range – the values of the indicator Productivity, c/ha; Input X Range – the values of the indicator Land quality, score.

Check the Labels box. In the Output Options area, select the Output Range radio button and indicate the cell from which the results should begin. Click OK to obtain the results.

    Interpretation of results.

The required model coefficients are in the Coefficients column:

    For this example, the model equation is:

    Y=2.53+0.5X

    In this example, with an increase in soil quality by one point, the yield of grain crops increases by an average of 0.5 c/ha.

The adequacy of the model is checked using the calculated significance level P shown in the Significance F column. If the calculated significance level is less than the specified level α = 0.05, the model is adequate.

The statistical significance of the model coefficients is tested using the calculated significance levels P shown in the P-value column. If the calculated significance level is less than the specified level α = 0.05, the corresponding model coefficient is statistically significant.

Multiple R is the correlation coefficient. The closer its value is to 1, the closer the relationship between the indicators under study. In this example, R = 0.99, which allows us to conclude that land quality is one of the main factors determining grain-crop yield.

R Square is the coefficient of determination, obtained by squaring the correlation coefficient: R^2 = 0.98. It shows that 98% of the variation in grain yield is explained by soil quality, with other factors accounting for the remaining 2%.

Method 3. GRAPHICAL CONSTRUCTION OF THE MODEL.

    Construct a scatter plot yourself that shows the relationship between yield and land quality.

    Obtain a linear model of the dependence of grain yield on land quality.

    In previous posts, the analysis often focused on a single numerical variable, such as mutual fund returns, Web page loading times, or soft drink consumption. In this and subsequent notes, we will look at methods for predicting the values ​​of a numeric variable depending on the values ​​of one or more other numeric variables.

The material will be illustrated with a cross-cutting example: forecasting sales volume in a clothing store. The Sunflowers chain of discount clothing stores has been expanding steadily for 25 years, but the company currently has no systematic approach to selecting new outlets. The location in which the company intends to open a new store is chosen on subjective grounds: favorable rental terms, or the manager's idea of the ideal store location. Imagine that you are the head of the special projects and planning department and have been tasked with developing a strategic plan for opening new stores, including a forecast of annual sales for newly opened stores. You believe that retail space is directly related to revenue and want to factor this into the decision-making process. How do you develop a statistical model that predicts annual sales based on the size of a new store?

Typically, regression analysis is used to predict the values of a variable. Its goal is to develop a statistical model that can predict the values of a dependent variable, or response, from the values of at least one independent, or explanatory, variable. In this note we consider simple linear regression, a statistical method for predicting the values of a dependent variable Y from the values of an independent variable X. Subsequent notes will describe the multiple regression model, designed to predict the values of a dependent variable Y from the values of several independent variables (X1, X2, …, Xk).


    Types of regression models

where ρ1 is the autocorrelation coefficient. If ρ1 = 0 (no autocorrelation), D ≈ 2; if ρ1 ≈ 1 (positive autocorrelation), D ≈ 0; if ρ1 = −1 (negative autocorrelation), D ≈ 4.

In practice, the Durbin-Watson criterion is applied by comparing the value D with critical theoretical values d_L and d_U for a given number of observations n, number of independent variables k (for simple linear regression k = 1), and significance level α. If D < d_L, the hypothesis of independence of the random deviations is rejected (hence, there is positive autocorrelation); if D > d_U, the hypothesis is not rejected (that is, there is no autocorrelation); if d_L < D < d_U, there are insufficient grounds for a decision. When the calculated value D exceeds 2, it is the expression (4 − D), rather than D itself, that is compared with d_L and d_U.

To calculate the Durbin-Watson statistic in Excel, turn to the bottom table in Fig. 14, Residual Output. The numerator in expression (10) is calculated with the function =SUMXMY2(array1;array2), and the denominator with =SUMSQ(array) (Fig. 16).

Fig. 16. Formulas for calculating the Durbin-Watson statistic

In our example, D = 0.883. The main question is: what value of the Durbin-Watson statistic should be considered small enough to conclude that positive autocorrelation exists? It is necessary to compare the value of D with the critical values (d_L and d_U), which depend on the number of observations n and the significance level α (Fig. 17).

Fig. 17. Critical values of the Durbin-Watson statistic (table fragment)

Thus, in the problem of sales volume for the home-delivery store, there is one independent variable (k = 1), 15 observations (n = 15), and significance level α = 0.05. Hence, d_L = 1.08 and d_U = 1.36. Since D = 0.883 < d_L = 1.08, there is positive autocorrelation between the residuals, and the least squares method cannot be applied.
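The Durbin-Watson statistic itself is easy to compute outside Excel. Here is a minimal Python sketch over an invented residual series (NumPy assumed):

```python
import numpy as np

# Hypothetical residuals that drift slowly, i.e. positively autocorrelated
e = np.array([1.2, 0.9, 0.7, 0.4, 0.1, -0.2, -0.5, -0.6, -0.4, -0.1,
              0.2, 0.5, 0.6, 0.3, -0.3])

# D = sum of squared successive differences over the sum of squared residuals,
# the same ratio Excel computes as =SUMXMY2(e2:e15, e1:e14) / SUMSQ(e1:e15)
D = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# A value of D well below 2 signals positive autocorrelation in the residuals
```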

    Testing hypotheses about slope and correlation coefficient

Above, regression was used solely for forecasting. The least squares method was used to determine the regression coefficients and to predict the value of the variable Y for a given value of the variable X. In addition, we examined the root mean square error of the estimate and the coefficient of determination. If the analysis of the residuals confirms that the conditions of applicability of the least squares method are not violated and the simple linear regression model is adequate, then, based on the sample data, it can be argued that there is a linear relationship between the variables in the population.

Applying the t-test for the slope. By testing whether the population slope β1 is equal to zero, you can determine whether there is a statistically significant relationship between the variables X and Y. If this hypothesis is rejected, it can be argued that there is a linear relationship between X and Y. The null and alternative hypotheses are formulated as follows: H0: β1 = 0 (no linear dependence), H1: β1 ≠ 0 (there is a linear dependence). By definition, the t-statistic equals the difference between the sample slope and the hypothesized population slope, divided by the standard error of the slope estimate:

(11) t = (b1 − β1) / S_b1

where b1 is the slope of the regression line on the sample data, β1 is the hypothesized slope of the population regression line, S_b1 is the standard error of the slope estimate, and the test statistic t has a t-distribution with n − 2 degrees of freedom.

Let's check whether there is a statistically significant relationship between store size and annual sales at α = 0.05. The t-test results are displayed along with other parameters when the Analysis ToolPak (Regression option) is used. The complete results of the Analysis ToolPak are shown in Fig. 4; the fragment related to the t-statistic is in Fig. 18.

Fig. 18. t-test results obtained using the Excel Analysis ToolPak

Since the number of stores n = 14 (see Fig. 3), the critical values of the t-statistic at significance level α = 0.05 can be found with the formulas: t_L = T.INV(0.025, 12) = –2.1788, where 0.025 is half the significance level and 12 = n – 2; t_U = T.INV(0.975, 12) = +2.1788.

Since t-statistic = 10.64 > t_U = 2.1788 (Fig. 19), the null hypothesis H0 is rejected. On the other hand, the p-value for t = 10.6411, calculated by the formula =1-T.DIST(D3,12,TRUE), is approximately zero, so the hypothesis H0 is again rejected. The fact that the p-value is almost zero means that if there were no true linear relationship between store size and annual sales, it would be virtually impossible to detect it using linear regression. Therefore, there is a statistically significant linear relationship between average annual store sales and store size.
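The t-test for the slope can be replicated in Python. The sketch below uses 14 hypothetical (area, sales) pairs chosen to be consistent with the figures quoted in the text (b1 ≈ 1.67, t ≈ 10.64); SciPy is assumed:

```python
import numpy as np
from scipy import stats

# Hypothetical store data: area in thousands of sq. ft, annual sales in $M
area  = np.array([1.7, 1.6, 2.8, 5.6, 1.3, 2.2, 1.3,
                  1.1, 3.2, 1.5, 5.2, 4.6, 5.8, 3.0])
sales = np.array([3.7, 3.9, 6.7, 9.5, 3.4, 5.6, 3.7,
                  2.7, 5.5, 2.9, 10.7, 7.6, 11.8, 4.1])

fit = stats.linregress(area, sales)
t_stat = fit.slope / fit.stderr        # t = (b1 - 0) / S_b1
df = len(area) - 2
t_crit = stats.t.ppf(0.975, df)        # like T.INV(0.975, 12) = 2.1788

reject_H0 = abs(t_stat) > t_crit       # True: the slope is significant
```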

Fig. 19. Testing the hypothesis about the population slope at a significance level of 0.05 and 12 degrees of freedom

Applying the F-test for the slope. An alternative approach to testing hypotheses about the slope of a simple linear regression is to use the F-test. Recall that the F-test is used to test the ratio of two variances. When testing the slope hypothesis, the measure of random errors is the error variance (the sum of squared errors divided by the number of degrees of freedom), so the F-test uses the ratio of the variance explained by the regression (the value SSR divided by the number of independent variables k) to the error variance (MSE = S_YX^2).

By definition, the F-statistic equals the mean square of regression (MSR) divided by the error variance (MSE): F = MSR / MSE, where MSR = SSR / k, MSE = SSE / (n − k − 1), and k is the number of independent variables in the regression model. The test statistic F has an F-distribution with k and n − k − 1 degrees of freedom.

For a given significance level α, the decision rule is: if F > F_U, the null hypothesis is rejected; otherwise it is not rejected. The results, presented as an analysis-of-variance summary table, are shown in Fig. 20.
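This decision rule can be sketched numerically. The sums of squares below are hypothetical values chosen to be close to those in the running example (SciPy assumed):

```python
from scipy import stats

# Hypothetical sums of squares, consistent with r² ≈ 0.904 in the example
SSR, SSE = 105.75, 11.21
n, k = 14, 1

MSR = SSR / k                  # mean square of regression
MSE = SSE / (n - k - 1)        # error variance
F = MSR / MSE                  # about 113.2

F_crit = stats.f.ppf(0.95, k, n - k - 1)  # like F.INV(0.95, 1, 12) = 4.7472
reject_H0 = F > F_crit
```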

Fig. 20. Analysis-of-variance table for testing the hypothesis about the statistical significance of the regression coefficient

Like the t-test, the F-test is displayed in the table when the Analysis ToolPak (Regression option) is used. The full results of the Analysis ToolPak are shown in Fig. 4; the fragment related to the F-statistic is in Fig. 21.

Fig. 21. F-test results obtained using the Excel Analysis ToolPak

The F-statistic is 113.23, and the p-value is close to zero (the Significance F cell). If the significance level α is 0.05, the critical value of the F-distribution with one and 12 degrees of freedom can be obtained with the formula F_U = F.INV(1-0.05;1;12) = 4.7472 (Fig. 22). Since F = 113.23 > F_U = 4.7472 and the p-value close to 0 is less than 0.05, the null hypothesis H0 is rejected, i.e., the size of a store is closely related to its annual sales volume.

Fig. 22. Testing the population slope hypothesis at a significance level of 0.05 with one and 12 degrees of freedom

Confidence interval for the slope β1. To test the hypothesis of a linear relationship between the variables, you can construct a confidence interval containing the slope β1 and check whether the hypothesized value β1 = 0 falls within it. The center of the confidence interval is the sample slope b1, and its boundaries are b1 ± t_{n−2} · S_b1.

As shown in Fig. 18, b1 = +1.670, n = 14, S_b1 = 0.157, and t12 = T.INV(0.975, 12) = 2.1788. Hence, b1 ± t_{n−2} · S_b1 = +1.670 ± 2.1788 × 0.157 = +1.670 ± 0.342, or +1.328 ≤ β1 ≤ +2.012. Thus, with probability 0.95 the population slope lies between +1.328 and +2.012 (i.e., $1,328,000 to $2,012,000). Since these values are greater than zero, there is a statistically significant linear relationship between annual sales and store area; if the confidence interval contained zero, there would be no relationship between the variables. The confidence interval also means that each 1,000 sq. ft increase in store area leads to an increase in average sales volume of between $1,328,000 and $2,012,000.
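The interval arithmetic is short enough to verify directly; a sketch using the values quoted above (SciPy assumed for the t quantile):

```python
from scipy import stats

b1, S_b1, n = 1.670, 0.157, 14       # sample slope, its standard error, sample size
t_crit = stats.t.ppf(0.975, n - 2)   # 2.1788 for df = 12

lo = b1 - t_crit * S_b1              # lower bound of the 95% interval
hi = b1 + t_crit * S_b1              # upper bound
# Zero is not inside [lo, hi], so the slope is statistically significant
```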

Using the t-test for the correlation coefficient. Earlier, the correlation coefficient r was introduced as a measure of the relationship between two numeric variables. It can be used to determine whether there is a statistically significant relationship between two variables. Denote the population correlation coefficient by ρ. The null and alternative hypotheses are formulated as follows: H0: ρ = 0 (no correlation), H1: ρ ≠ 0 (there is a correlation). The existence of a correlation is tested by:

(12) t = r √(n − 2) / √(1 − r²),

where r = +√r² if b1 > 0, and r = −√r² if b1 < 0. The test statistic t has a t-distribution with n − 2 degrees of freedom.

In the problem of the Sunflowers store chain, r² = 0.904 and b1 = +1.670 (see Fig. 4). Since b1 > 0, the correlation coefficient between annual sales and store size is r = +√0.904 = +0.951. Let's test the null hypothesis of no correlation between these variables using the t-statistic:

    At a significance level of α = 0.05, the null hypothesis should be rejected because t= 10.64 > 2.1788. Thus, it can be argued that there is a statistically significant relationship between annual sales and store size.
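The same t-statistic can be recomputed from r² alone; a minimal Python sketch using the figures quoted in the text (SciPy assumed):

```python
import math
from scipy import stats

r2, n = 0.904, 14
r = math.sqrt(r2)                          # b1 > 0, so r = +sqrt(r²) ≈ 0.951
t = r * math.sqrt(n - 2) / math.sqrt(1 - r2)
t_crit = stats.t.ppf(0.975, n - 2)         # 2.1788 for df = 12

reject_H0 = abs(t) > t_crit                # True: the correlation is significant
```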

    When discussing inferences regarding population slope, confidence intervals and hypothesis tests are used interchangeably. However, calculating the confidence interval containing the correlation coefficient turns out to be more difficult, since the type of sampling distribution of the statistic r depends on the true correlation coefficient.

    Estimation of mathematical expectation and prediction of individual values

    This section discusses methods for estimating the mathematical expectation of a response Y and predictions of individual values Y for given values ​​of the variable X.

Constructing a confidence interval. In Example 2 (see the section Least squares method above), the regression equation made it possible to predict the value of the variable Y for a given value of the variable X. In the problem of choosing a location for a retail outlet, the average annual sales volume in a store with an area of 4,000 sq. ft was 7.644 million dollars. However, this estimate of the mathematical expectation of the general population is a point estimate. To estimate the population mathematical expectation, the concept of a confidence interval was proposed. Similarly, one can introduce a confidence interval for the mathematical expectation of the response for a given value of the variable X:

(13) Ŷi ± t_{n−2} · S_YX · √h_i, where h_i = 1/n + (X_i − X̄)² / SSX,

where Ŷi = b0 + b1·X_i is the predicted value of the variable Y at X = X_i; S_YX is the root mean square error; n is the sample size; X_i is the specified value of the variable X; µ_Y|X=Xi is the mathematical expectation of the variable Y at X = X_i; and SSX = Σ(X_i − X̄)².

Analysis of formula (13) shows that the width of the confidence interval depends on several factors. For a given significance level, increasing the amplitude of fluctuations around the regression line, measured by the root mean square error, widens the interval. As one would expect, increasing the sample size narrows it. In addition, the width changes depending on the value of X_i: if the value of Y is predicted for values of X close to the mean X̄, the confidence interval is narrower than when predicting the response for values far from the mean.

Suppose that when choosing a store location we want to construct a 95% confidence interval for the average annual sales of all stores with an area of 4,000 sq. ft:

Therefore, the average annual sales volume of all stores with an area of 4,000 sq. ft lies, with 95% confidence, in the range from 6.971 to 8.317 million dollars.
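Formula (13) can be checked end to end in Python. The sketch below uses hypothetical store data consistent with the figures quoted in the text and reproduces the quoted interval to within rounding (NumPy and SciPy assumed):

```python
import numpy as np
from scipy import stats

# Hypothetical store data: area in thousands of sq. ft, annual sales in $M
area  = np.array([1.7, 1.6, 2.8, 5.6, 1.3, 2.2, 1.3,
                  1.1, 3.2, 1.5, 5.2, 4.6, 5.8, 3.0])
sales = np.array([3.7, 3.9, 6.7, 9.5, 3.4, 5.6, 3.7,
                  2.7, 5.5, 2.9, 10.7, 7.6, 11.8, 4.1])

n = len(area)
fit = stats.linregress(area, sales)
x_i = 4.0                                        # a 4,000 sq. ft store
y_hat = fit.intercept + fit.slope * x_i          # point estimate, ≈ 7.644

resid = sales - (fit.intercept + fit.slope * area)
S_YX = np.sqrt(np.sum(resid ** 2) / (n - 2))     # root mean square error
SSX = np.sum((area - area.mean()) ** 2)
h_i = 1 / n + (x_i - area.mean()) ** 2 / SSX
t_crit = stats.t.ppf(0.975, n - 2)

half = t_crit * S_YX * np.sqrt(h_i)              # half-width of the CI for the mean
lo, hi = y_hat - half, y_hat + half              # ≈ (6.971, 8.317)
```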

Computing the confidence interval for a predicted value. In addition to the confidence interval for the mathematical expectation of the response at a given value of the variable X, it is often necessary to know the confidence interval for a predicted value. Although the formula for such an interval is very similar to formula (13), it contains the predicted individual value rather than the parameter estimate. The interval for the predicted response Y at X = X_i for a specific value X_i is determined by the formula:

(14) Ŷi ± t_{n−2} · S_YX · √(1 + h_i)

Suppose that when choosing a location for a retail outlet we want to construct a 95% confidence interval for the predicted annual sales of a store with an area of 4,000 sq. ft:

Therefore, the predicted annual sales volume of a store with an area of 4,000 sq. ft lies, with 95% probability, in the range from 5.433 to 9.854 million dollars. As we can see, the confidence interval for a predicted individual response is much wider than the confidence interval for its mathematical expectation, because the variability in predicting individual values is much greater than in estimating a mathematical expectation.
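The prediction interval differs from formula (13) only by the extra "1 +" under the square root. A Python sketch with the same kind of hypothetical store data reproduces the quoted bounds (NumPy and SciPy assumed):

```python
import numpy as np
from scipy import stats

# Hypothetical store data: area in thousands of sq. ft, annual sales in $M
area  = np.array([1.7, 1.6, 2.8, 5.6, 1.3, 2.2, 1.3,
                  1.1, 3.2, 1.5, 5.2, 4.6, 5.8, 3.0])
sales = np.array([3.7, 3.9, 6.7, 9.5, 3.4, 5.6, 3.7,
                  2.7, 5.5, 2.9, 10.7, 7.6, 11.8, 4.1])

n = len(area)
fit = stats.linregress(area, sales)
x_i = 4.0
y_hat = fit.intercept + fit.slope * x_i

resid = sales - (fit.intercept + fit.slope * area)
S_YX = np.sqrt(np.sum(resid ** 2) / (n - 2))
SSX = np.sum((area - area.mean()) ** 2)
h_i = 1 / n + (x_i - area.mean()) ** 2 / SSX
t_crit = stats.t.ppf(0.975, n - 2)

# The "1 +" term adds the scatter of an individual observation around the line
half = t_crit * S_YX * np.sqrt(1 + h_i)
lo, hi = y_hat - half, y_hat + half              # ≈ (5.433, 9.854)
```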

    Pitfalls and ethical issues associated with using regression

    Difficulties associated with regression analysis:

    • Ignoring the conditions of applicability of the least squares method.
    • Erroneous assessment of the conditions for the applicability of the least squares method.
    • Incorrect choice of alternative methods when the conditions of applicability of the least squares method are violated.
    • Application of regression analysis without deep knowledge of the subject of research.
    • Extrapolating a regression beyond the range of the explanatory variable.
    • Confusion between statistical and causal relationships.

    The widespread use of spreadsheets and statistical software has eliminated the computational problems that had hampered the use of regression analysis. However, this led to the fact that regression analysis was used by users who did not have sufficient qualifications and knowledge. How can users know about alternative methods if many of them have no idea at all about the conditions of applicability of the least squares method and do not know how to check their implementation?

The researcher should not get carried away with number crunching, that is, with calculating the intercept, the slope, and the coefficient of determination. Deeper knowledge is needed. Let's illustrate this with a classic textbook example. Anscombe showed that all four data sets shown in Fig. 23 have the same regression parameters (Fig. 24).

Fig. 23. Four artificial data sets

Fig. 24. Regression analysis of the four artificial data sets, performed with the Analysis ToolPak

    So, from the point of view of regression analysis, all these data sets are completely identical. If the analysis ended there, we would lose a lot of useful information. This is evidenced by the scatter plots (Figure 25) and residual plots (Figure 26) constructed for these data sets.

Fig. 25. Scatter plots for the four data sets

Scatter plots and residual plots indicate that these data sets differ from each other. The only set distributed along a straight line is set A, and the plot of the residuals calculated from set A shows no pattern. The same cannot be said of sets B, C, and D. The scatter plot for set B shows a pronounced quadratic pattern, a conclusion confirmed by the parabolic shape of the residual plot. The scatter plot and residual plot for set C show that it contains an outlier; in this situation the outlier should be excluded from the data set and the analysis repeated. The method for detecting and eliminating outliers in observations is called influence analysis; after eliminating the outlier, the result of re-estimating the model may be completely different. The scatter plot for set D illustrates an unusual situation in which the empirical model depends heavily on a single response (X8 = 19, Y8 = 12.5). Such regression models must be computed especially carefully. Scatter and residual plots are thus an essential tool of regression analysis and should be an integral part of it; without them, regression analysis is not credible.

Fig. 26. Residual plots for the four data sets

    How to avoid pitfalls in regression analysis:

• Always start the analysis of a possible relationship between variables X and Y by drawing a scatter plot.
    • Before interpreting the results of regression analysis, check the conditions for its applicability.
    • Plot the residuals versus the independent variable. This will make it possible to determine how well the empirical model matches the observational results and to detect a violation of the variance constancy.
    • Use histograms, stem-and-leaf plots, boxplots, and normal distribution plots to test the assumption of a normal error distribution.
    • If the conditions for applicability of the least squares method are not met, use alternative methods (for example, quadratic or multiple regression models).
    • If the conditions for the applicability of the least squares method are met, it is necessary to test the hypothesis about the statistical significance of the regression coefficients and construct confidence intervals containing the mathematical expectation and the predicted response value.
    • Avoid predicting values ​​of the dependent variable outside the range of the independent variable.
    • Keep in mind that statistical relationships are not always cause-and-effect. Remember that correlation between variables does not mean there is a cause-and-effect relationship between them.

Summary. As shown in the flow chart (Figure 27), this note has described the simple linear regression model, the conditions for its applicability, and ways to test these conditions. The t-test for the statistical significance of the regression slope was considered, and a regression model was used to predict the values of the dependent variable. The running example concerned choosing a location for a retail outlet and examined the dependence of annual sales volume on store area; this information allows a location to be selected more accurately and its annual sales volume to be predicted. The following notes continue the discussion of regression analysis and also consider multiple regression models.

Fig. 27. Structure diagram of this note

Based on materials from the book: Levin et al. Statistics for Managers. Moscow: Williams, 2004, pp. 792–872.

    If the dependent variable is categorical, logistic regression must be used.

The linear regression method describes the straight line that best fits a series of ordered pairs (x, y). The equation of this line, known as the linear equation, is ŷ = a + bx, where:

    ŷ is the expected value of y for a given value of x,

    x is an independent variable,

a is the y-intercept of the line,

    b is the slope of the straight line.

    The figure below illustrates this concept graphically:

The figure above shows the line described by the equation ŷ = 2 + 0.5x. The y-intercept is the point at which the line crosses the y-axis; in our case, a = 2. The slope of the line, b, the ratio of the line's rise to its run, has a value of 0.5. A positive slope means the line rises from left to right. If b = 0, the line is horizontal, which means there is no relationship between the dependent and independent variables; in other words, changing the value of x does not affect the value of y.

ŷ and y are often confused. The graph shows six ordered pairs of points and the line described by the given equation.

This figure shows the point corresponding to the ordered pair x = 2 and y = 4. Note that the expected value of y according to the line at x = 2 is ŷ. We can confirm this with the following equation:

ŷ = 2 + 0.5x = 2 + 0.5(2) = 3.

    The y value represents the actual point and the ŷ value is the expected value of y using a linear equation for a given value of x.
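The distinction is easy to see in a couple of lines of Python; the values follow the figure's equation ŷ = 2 + 0.5x:

```python
# The line from the figure: intercept a = 2, slope b = 0.5
def y_hat(x, a=2.0, b=0.5):
    """Expected y on the line for a given x."""
    return a + b * x

observed_y = 4.0          # the actual point plotted at x = 2
expected_y = y_hat(2)     # what the line predicts at x = 2
residual = observed_y - expected_y   # the gap between y and ŷ
```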

The next step is to determine the linear equation that best fits the set of ordered pairs; we discussed this in the previous article, where we determined the type of the equation.

    Using Excel to Define Linear Regression

To use the regression analysis tool built into Excel, you must activate the Analysis ToolPak add-in. Click File -> Options (2007+); in the Excel Options dialog box that appears, go to the Add-Ins tab. In the Manage field, choose Excel Add-ins and click Go. In the window that appears, check the box next to Analysis ToolPak and click OK.

A new button, Data Analysis, will appear on the Data tab in the Analysis group.

To demonstrate how the add-in works, we will use the data from our bathtub example, in which a guy and a girl share a bathroom. Enter the data from that example in columns A and B of a blank sheet.

Go to the Data tab and, in the Analysis group, click Data Analysis. In the Data Analysis window that appears, select Regression, as shown in the figure, and click OK.

Set the necessary regression parameters in the Regression window, as shown in the figure:

    Click OK. The figure below shows the results obtained:

These results are consistent with those we obtained earlier by doing our own calculations.

    Regression analysis is one of the most popular methods of statistical research. It can be used to establish the degree of influence of independent variables on the dependent variable. Microsoft Excel has tools designed to perform this type of analysis. Let's look at what they are and how to use them.

However, before you can use the function that performs regression analysis, you must first activate the Analysis ToolPak. Only then will the tools needed for this procedure appear on the Excel ribbon.


Now, when we go to the "Data" tab, we will see a new button, "Data Analysis", in the "Analysis" toolbox on the ribbon.

    Types of Regression Analysis

There are several types of regression:

    • parabolic;
    • power;
    • logarithmic;
    • exponential (base e);
    • exponential (arbitrary base);
    • hyperbolic;
    • linear.

    We will talk in more detail about performing the last type of regression analysis in Excel later.

    Linear Regression in Excel

    Below, as an example, is a table showing the average daily air temperature outside and the number of store customers for the corresponding working day. Let's find out using regression analysis exactly how weather conditions in the form of air temperature can affect the attendance of a retail establishment.

The general linear regression equation is as follows: Y = a0 + a1x1 + … + akxk. In this formula, Y is the variable whose dependence on the factors we are trying to study; in our case, the number of buyers. The x values are the various factors influencing this variable, and the parameters a are the regression coefficients that determine the weight of each factor. The index k denotes the total number of factors.
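In matrix form, this equation is exactly what least squares solves. The Python sketch below (NumPy assumed, with invented temperature/customer data rather than the article's table) recovers coefficients of the same order as the article's worked example:

```python
import numpy as np

# Hypothetical data: daily temperature x1 and number of customers Y
temp = np.array([5., 7., 9., 11., 13., 15., 17., 19., 21., 23.])
customers = 58.0 + 1.31 * temp + np.array([3., -2., 1., -4., 2.,
                                           0., -1., 3., -2., 0.])

# Design matrix [1, x1]; extra columns x2..xk would extend this
# to the general form Y = a0 + a1*x1 + ... + ak*xk
X = np.column_stack([np.ones_like(temp), temp])
coef, *_ = np.linalg.lstsq(X, customers, rcond=None)
a0, a1 = coef              # intercept and temperature coefficient
```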


Analyzing the results

    The results of the regression analysis are displayed in the form of a table in the place specified in the settings.

One of the main indicators is R-squared, which indicates the quality of the model. In our case, this coefficient is 0.705, or about 70.5%, an acceptable level of quality; a value below 0.5 indicates a poor fit.

Another important indicator is in the cell at the intersection of the "Y-intercept" row and the "Coefficients" column. It indicates the value Y will have, in our case the number of buyers, when all other factors are equal to zero. In this table, that value is 58.04.

The value at the intersection of the "X Variable 1" row and the "Coefficients" column shows the level of dependence of Y on X; in our case, the dependence of the number of store customers on temperature. A coefficient of 1.31 is considered a fairly strong influence.

    As you can see, using Microsoft Excel it is quite easy to create a regression analysis table. But only a trained person can work with the output data and understand its essence.