Regression





(Curve fitting by the method of least squares, fitting the lines y = a + bx and x = a + by, multiple regression, standard error of regression)



Regression analysis

• Regression analysis is a predictive modelling technique that investigates the relationship between a dependent (target) variable and one or more independent variables (predictors). The concept was introduced by Francis Galton.

• If two variables are involved, the variable that is the basis of the estimation is conventionally called the independent variable, and the variable whose value is to be estimated is called the dependent variable.

• In simple words, regression is a technique concerned with predicting some variables by knowing others.


• This technique is used for forecasting, time series modelling, and finding cause-and-effect relationships between variables.

• The dependent variable is variously known as the explained variable, predictand, response variable, or endogenous variable.

• The independent variable is known as the explanatory variable, regressor, or exogenous variable.





 Typically, a regression analysis is used for these purposes:


(1) Prediction of the target variable (forecasting).


(2) Modelling the relationships between the dependent variable and the explanatory variable.


(3) Testing of hypotheses.


 Benefits


1. It indicates the strength of the impact of multiple independent variables on a dependent variable.


2. It indicates the significant relationships between the dependent variable and the independent variables.


These benefits help market researchers, data analysts, and data scientists to evaluate candidate variables, eliminate the weaker ones, and arrive at the best set of variables for building predictive models.





Types of Regression Analysis


Regression analysis is generally classified into two kinds: simple and multiple.







• Simple regression

• Multiple regression





Regression analysis is a statistical method for formulating a mathematical model of the relationship among variables, which can then be used to predict the values of the dependent variable given the values of the independent variable.

Example: The relationship between estriol level and birthweight is best studied through regression; it can be quantified by fitting a regression line to the data.














[Figure: plot with an axis labelled "Independent variable"]





• x = estriol level


• y = birthweight


• The line y = α + βx is the regression line, where α is the intercept and β is the slope of the line.


• The relationship y = α + βx is not expected to hold exactly for every woman. For example, not all women with a given estriol level have babies with identical birthweights. Thus an error term e, which represents the variability of birthweight among all babies of women with a given estriol level x, is introduced into the model. Let’s assume e follows a normal distribution with mean 0 and variance σ². The full linear regression model then takes the following form: y = α + βx + e

• One interpretation of the regression line is that for a woman with estriol level x, the corresponding birthweight will be normally distributed with mean α + βx and variance σ². If σ² were 0, then every point would fall exactly on the regression line, whereas the larger σ² is, the more scatter occurs about the regression line.
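The role of the error term can be illustrated with a small simulation. This is only a sketch: the values of α and β below are hypothetical, chosen for illustration, and the function names are not from the slides. It draws y = α + βx + e for several values of σ and reports how far the simulated points scatter about the true line.

```python
import random
import statistics

def simulate_birthweights(alpha, beta, sigma, x_values, seed=0):
    """Draw y = alpha + beta*x + e with e ~ Normal(0, sigma^2) for each x."""
    rng = random.Random(seed)
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in x_values]

def residual_spread(alpha, beta, x_values, y_values):
    """Standard deviation of the vertical deviations from the true line."""
    residuals = [y - (alpha + beta * x) for x, y in zip(x_values, y_values)]
    return statistics.pstdev(residuals)

# Hypothetical parameter values, used only for illustration.
ALPHA, BETA = 20.0, 0.5
x_values = [float(x) for x in range(5, 25)] * 50  # 1000 simulated observations

for sigma in (0.0, 1.0, 3.0):
    y_values = simulate_birthweights(ALPHA, BETA, sigma, x_values)
    spread = residual_spread(ALPHA, BETA, x_values, y_values)
    print(f"sigma = {sigma}: spread about the line = {spread:.3f}")
```

With σ = 0 the spread is exactly zero, so every point lies on the line; as σ grows, the spread grows with it.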



[Figure: The effect of σ² on the goodness of fit of a regression line]







Interpretation of regression line for different values of β (slope of the line)

• Case 1: β > 0. If β is greater than 0, then as x increases, the expected value of y will increase.

• Case 2: β < 0. If β is less than 0, then as x increases, the expected value of y will decrease.

• Case 3: β = 0. If β is equal to 0, then there is no linear relationship between x and y.







Methods of finding regression lines

1. Scatter diagram method

• In this method we plot each pair of observations on graph paper and obtain a diagram called a scatter diagram.

• Then we find a straight line passing through the points of the scatter diagram such that the errors in the estimation of Y are minimized. This is the line of regression of Y on X.

• Similarly, we find a straight line passing through the points of the scatter diagram such that the errors in the estimation of X are minimized. This is the line of regression of X on Y.



2. Method of Least squares.




                           Fitting the lines y = a + bx (Straight Line)



• Curve fitting is a method of finding a specific relation connecting the dependent and independent variables for given data so as to satisfy the data as accurately as possible.

• The method of least squares is the most systematic procedure for fitting a unique curve through the given points.

• Using the ‘x’ and ‘y’ points, we need to find a curve to fit the given data.

• For each point there is an observed value and an expected (fitted) value; their difference is known as the ‘error’. The method of least squares chooses the curve that minimizes the sum of the squared errors.


1. y = a + bx or x = a + by (straight line)

2. y = ax^b
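For the straight line y = a + bx, the least-squares criterion can be written out as follows. This is a sketch of the standard derivation; the symbol S for the sum of squared errors, the count n, and the index i are notation assumed here for illustration.

```latex
S(a, b) = \sum_{i=1}^{n} \bigl(y_i - a - b x_i\bigr)^2

\frac{\partial S}{\partial a} = 0 \;\Rightarrow\; \sum y_i = n a + b \sum x_i

\frac{\partial S}{\partial b} = 0 \;\Rightarrow\; \sum x_i y_i = a \sum x_i + b \sum x_i^2

b = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
a = \frac{\sum y_i - b \sum x_i}{n}
```

The two middle lines are the normal equations; solving them simultaneously gives the values of a and b in the last line.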





Find the best values of ‘a’ and ‘b’ so that y = a + bx fits the data on estriol concentration and birthweight given in the table, and also provide the equation of the line.

Estriol (mg/24 hr), x      10    9    9   12   14   16   16   14

Birthweight (g/100), y     25   25   25   27   27   27   24   30
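One way to work this problem is to solve the normal equations of the least-squares method directly. The sketch below does that for the eight pairs in the table; the function name `fit_line` is a hypothetical helper, not something from the slides.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for y = a + b*x via the normal equations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

estriol = [10, 9, 9, 12, 14, 16, 16, 14]        # x, mg/24 hr
birthweight = [25, 25, 25, 27, 27, 27, 24, 30]  # y, g/100

a, b = fit_line(estriol, birthweight)
print(f"y = {a:.3f} + {b:.3f}x")  # prints "y = 23.542 + 0.217x"
```

For these data the sums are Σx = 100, Σy = 210, Σxy = 2638 and Σx² = 1310, which gives b = 104/480 ≈ 0.217 and a ≈ 23.542.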








Once the values of a and b are obtained and have been put in the equation Y = a + bX, we say that we have fitted the regression equation of Y on X to the given data. In a similar fashion, we can develop the regression equation of X on Y, viz., X = a + bY, presuming Y as the independent variable and X as the dependent variable.
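The regression of X on Y can be sketched with the same normal equations by swapping the roles of the two variables. The example below reuses the estriol and birthweight data from the curve-fitting problem; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for ys = a + b*xs via the normal equations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

estriol = [10, 9, 9, 12, 14, 16, 16, 14]        # X, mg/24 hr
birthweight = [25, 25, 25, 27, 27, 27, 24, 30]  # Y, g/100

# Y is treated as the independent variable, X as the dependent variable.
a_xy, b_xy = fit_line(birthweight, estriol)
print(f"x = {a_xy:.3f} + {b_xy:.3f}y")  # prints "x = -0.882 + 0.510y"
```

Note that this is not simply the Y on X line solved for X: the two regression lines minimize errors in different directions and coincide only when the points lie exactly on one straight line.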



Multiple Regression

• Multiple regression is the extension of simple linear regression.



• Simple linear regression (one to one): one independent variable (IV) → one dependent variable (DV).

• Multiple regression (many to one): several independent variables (IV) → one dependent variable (DV).






• Two or more independent variables are used to predict/explain the variance in one

   dependent variable.

• Two problems may arise:

   1. Overfitting: It is caused by adding too many independent variables; they account for more variance but add nothing to the model.

   2. Multicollinearity: It happens when some or all of the independent variables are correlated with each other.
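A simple first check for multicollinearity is the correlation between pairs of independent variables. The sketch below computes it for the birthweight and age values used in the worked example of this section; `pearson_r` is a hypothetical helper name, and no particular cutoff is implied.

```python
import math

def pearson_r(us, vs):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(us)
    mean_u, mean_v = sum(us) / n, sum(vs) / n
    s_uv = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    s_uu = sum((u - mean_u) ** 2 for u in us)
    s_vv = sum((v - mean_v) ** 2 for v in vs)
    return s_uv / math.sqrt(s_uu * s_vv)

birthweight_oz = [135, 120, 100, 105, 130]  # X1
age_days = [3, 4, 3, 2, 4]                  # X2

r12 = pearson_r(birthweight_oz, age_days)
print(f"correlation between X1 and X2 = {r12:.3f}")
```

A value close to +1 or -1 would indicate that the two independent variables carry nearly the same information.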

Remedies for these problems:

1. Increase the sample size. This is a common first step, since the standard error decreases as the sample size increases (all other things being equal).

2. Remove the most intercorrelated variable(s) from the analysis. This method is misguided if the variables were included because of the theory behind the model, which they should have been.

3. Transform the variables so that the model fits the data better.

In multiple regression, each coefficient is interpreted as the estimated change in ‘y’ corresponding to a one-unit change in a variable when all other variables are held constant.





Find the regression equation with two independent variables (birthweight and age of child) and systolic blood pressure (SBP) as the dependent variable, and also calculate the following:

1. The predicted average SBP of a baby with birthweight 8 lb. (128 oz.) measured at 3 days of life.

2. The predicted average SBP of a baby with birthweight 2 lb. (32 oz.) measured at 5 days of life.


Birthweight (oz.) (X1)    135   120   100   105   130

Age (days) (X2)             3     4     3     2     4

Systolic BP (mm Hg) (Y)    89    90    83    77    92
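The fit for these five observations can be sketched by solving the least-squares equations for two predictors using sums of deviations from the means. The function and variable names are hypothetical; the data are the values in the table.

```python
def fit_two_predictors(x1, x2, y):
    """Least-squares (a, b1, b2) for y = a + b1*x1 + b2*x2 using deviation sums."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
    s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))
    det = s11 * s22 - s12 ** 2  # zero if x1 and x2 are perfectly collinear
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    a = my - b1 * m1 - b2 * m2
    return a, b1, b2

birthweight_oz = [135, 120, 100, 105, 130]  # X1
age_days = [3, 4, 3, 2, 4]                  # X2
sbp = [89, 90, 83, 77, 92]                  # Y, mm Hg

a, b1, b2 = fit_two_predictors(birthweight_oz, age_days, sbp)
print(f"SBP = {a:.3f} + {b1:.3f}*X1 + {b2:.3f}*X2")  # SBP = 49.005 + 0.180*X1 + 4.976*X2

pred_1 = a + b1 * 128 + b2 * 3  # 8 lb. baby at 3 days
pred_2 = a + b1 * 32 + b2 * 5   # 2 lb. baby at 5 days
print(f"{pred_1:.2f} {pred_2:.2f}")  # 87.01 79.65
```

The second prediction lies well outside the observed range of birthweights (100 to 135 oz.) and ages (2 to 4 days), so it is an extrapolation and should be read with caution.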





The regression equation tells us that, for a newborn, the average blood pressure increases by an estimated 0.180 mm Hg per ounce of birthweight and 4.976 mm Hg per day of age.




Important Properties of Regression Coefficient

1. The regression coefficient is denoted by b.

2. It is expressed in the original units of the data.

3. The regression coefficient of y on x is denoted by byx. The regression coefficient of x on y is denoted by bxy.

4. If one regression coefficient is greater than 1, then the other will be less than 1.

5. They are not independent of a change of scale: the regression coefficients generally change if x or y is multiplied by a constant.

6. The AM (arithmetic mean) of the two regression coefficients is greater than or equal to the coefficient of correlation.

7. The GM (geometric mean) of the two regression coefficients is equal to the correlation coefficient in magnitude, r = ±√(byx · bxy), with r taking the common sign of the two coefficients.

8. If bxy is positive, then byx is also positive, and vice versa.
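Several of these properties can be checked numerically. The sketch below uses the estriol and birthweight data from the curve-fitting problem; the function name `regression_coefficients` is hypothetical.

```python
import math

def regression_coefficients(xs, ys):
    """Return (byx, bxy, r) computed from the deviation sums of paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, sxy / syy, sxy / math.sqrt(sxx * syy)

estriol = [10, 9, 9, 12, 14, 16, 16, 14]        # x
birthweight = [25, 25, 25, 27, 27, 27, 24, 30]  # y

byx, bxy, r = regression_coefficients(estriol, birthweight)
arithmetic_mean = (byx + bxy) / 2
geometric_mean = math.sqrt(byx * bxy)

print(f"byx = {byx:.3f}, bxy = {bxy:.3f}, r = {r:.3f}")
print(f"AM = {arithmetic_mean:.3f}, GM = {geometric_mean:.3f}")
```

For these data both coefficients are positive, the GM matches r, and the AM is no smaller than r, in line with properties 6 to 8.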


Standard Error of Regression (S)

• The standard error of the regression, or standard error of estimate, is a measure of the accuracy of predictions.

• Note: The regression line is the line that minimizes the sum of squared deviations of

  prediction (also called the sum of squares error).

• Not all of the observed values of (Y, X1, X2) fall on the regression line; they scatter about it.

• The degree of scatter of the observed values about the regression line is measured by the standard deviation of regression. It measures the variation of the observations about the true regression line Y = α + β1X1 + β2X2 and is denoted by σY.12.



• The standard error of regression (S) represents the average distance that the observed

  values fall from the regression line.

• Conveniently, it tells you how wrong the regression model is on average using the units

  of the response variable.

• Smaller values are better because they indicate that the observations are closer to the fitted line.


• Unlike R-squared, one can use the standard error of the regression to assess the precision

  of the predictions.

• The standard error of the regression provides the absolute measure of the typical distance that the data points fall from the regression line. S is in the units of the dependent variable.






Standard Error of estimate for Simple Regression
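A commonly used form for simple regression is S = √( Σ(y − ŷ)² / (n − 2) ), where ŷ = a + bx is the fitted value and n − 2 reflects the two estimated quantities a and b. The sketch below assumes that form and applies it to the estriol and birthweight data from the curve-fitting problem; the function name is hypothetical.

```python
import math

def standard_error_simple(xs, ys):
    """S = sqrt(SSE / (n - 2)) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Sum of squared differences between observed and fitted values.
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - 2))

estriol = [10, 9, 9, 12, 14, 16, 16, 14]        # x, mg/24 hr
birthweight = [25, 25, 25, 27, 27, 27, 24, 30]  # y, g/100

s_simple = standard_error_simple(estriol, birthweight)
print(f"S = {s_simple:.3f}")
```

For these data S comes out at about 1.94, in the units of birthweight (g/100).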








Standard Error of estimate for Multiple Regression
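For multiple regression with k independent variables, a commonly used form is S = √( Σ(y − ŷ)² / (n − k − 1) ). The sketch below assumes that form with k = 2 and applies it to the birthweight, age and systolic blood pressure data from the multiple regression example; the function name is hypothetical.

```python
import math

def standard_error_multiple(x1, x2, y):
    """S = sqrt(SSE / (n - k - 1)) for y = a + b1*x1 + b2*x2, with k = 2."""
    n, k = len(y), 2
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
    s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    a = my - b1 * m1 - b2 * m2
    # Sum of squared differences between observed and fitted values.
    sse = sum((w - (a + b1 * u + b2 * v)) ** 2 for u, v, w in zip(x1, x2, y))
    return math.sqrt(sse / (n - k - 1))

birthweight_oz = [135, 120, 100, 105, 130]  # X1
age_days = [3, 4, 3, 2, 4]                  # X2
sbp = [89, 90, 83, 77, 92]                  # Y, mm Hg

s_multiple = standard_error_multiple(birthweight_oz, age_days, sbp)
print(f"S = {s_multiple:.3f}")
```

With only five observations and two predictors there are just two degrees of freedom left, so the resulting S of about 1.19 mm Hg is a rough estimate.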