REGRESSION
(Curve fitting by the method of least squares, Fitting the lines y=a + bx and x= a + by,
Multiple regression, Standard error of regression)
www.DuloMix.com
Regression analysis
• Regression analysis is a predictive modelling technique that investigates the
relationship between a dependent (target) variable and one or more independent (predictor)
variables. The technique is attributed to Francis Galton.
• If two variables are involved, the variable that is the basis of the estimation is conventionally
called the independent variable, and the variable whose value is to be estimated is called the
dependent variable.
• In simple words, regression is a technique concerned with predicting some variables by knowing
others.
• This technique is used for forecasting, time series modelling, and finding the cause-and-effect
relationship between variables.
• The dependent variable is variously known as the explained variable, predictand, response, or
endogenous variable.
• The independent variable is also known as the explanatory variable, regressor, or exogenous variable.
WHY DO WE NEED REGRESSION ANALYSIS?
Typically, a regression analysis is used for these purposes:
(1) Prediction of the target variable (forecasting).
(2) Modelling the relationships between the dependent variable and the explanatory variable.
(3) Testing of hypotheses.
Benefits
1. It indicates the strength of impact of multiple independent variables on a dependent variable.
2. It indicates the significant relationships between dependent variable and independent variable.
These benefits help market researchers, data analysts, and data scientists evaluate candidate
variables and select the best set for building predictive models.
Types of Regression Analysis
Regression analysis is generally classified into two kinds: simple and multiple.
[Diagram: regression analysis branches into simple regression and multiple regression.]
Regression analysis is a statistical method for formulating a mathematical model of the
relationship among variables. The model can then be used to predict the value of the dependent
variable from given values of the independent variable.
Example: The relationship between estriol level and birthweight can be quantified by fitting a
regression line to the data.
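[Figure: regression line plotted with the independent variable on the horizontal axis and the
dependent variable on the vertical axis, marking an observed value and its expected value.]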
If
• x = estriol level
• y = birthweight
• The line y = α + βx is the regression line, where α is the intercept and β is the slope of the line.
• The relationship y = α + βx is not expected to hold exactly for every woman. For example, not all
women with a given estriol level have babies with identical birthweights. Thus an error term e,
which represents the deviation of an individual birthweight from the mean birthweight of all
babies of women with a given estriol level x, is introduced into the model. Let’s assume e follows
a normal distribution with mean 0 and variance σ². The full linear regression model then takes
the following form: y = α + βx + e
• One interpretation of the regression line is that for a woman with estriol level x, the corresponding
birthweight will be normally distributed with mean α + βx and variance σ². If σ² were 0, then
every point would fall exactly on the regression line, whereas the larger σ² is, the more scatter
occurs about the regression line.
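The role of σ² in the model can be illustrated with a small simulation. This is only a sketch: the values of α and β, the range of x, and the function name `simulate_rms_error` are hypothetical choices for illustration, not values from the slides.

```python
import math
import random

def simulate_rms_error(alpha, beta, sigma, n=500, seed=0):
    """Draw n points from y = alpha + beta*x + e with e ~ Normal(0, sigma^2)
    and return the root-mean-square distance of y from the true line."""
    rng = random.Random(seed)
    squared = 0.0
    for _ in range(n):
        x = rng.uniform(5, 25)         # hypothetical range of estriol levels
        e = rng.gauss(0.0, sigma)      # error term with mean 0, variance sigma^2
        y = alpha + beta * x + e
        squared += (y - (alpha + beta * x)) ** 2
    return math.sqrt(squared / n)

# alpha = 21.5 and beta = 0.6 are made-up illustration values.
rms_by_sigma = {s: simulate_rms_error(21.5, 0.6, s) for s in (0.0, 1.0, 4.0)}
for s, rms in rms_by_sigma.items():
    print(f"sigma = {s}: RMS scatter about the line = {rms:.3f}")
```

With σ = 0 every simulated point lies exactly on the line, and the scatter grows roughly in proportion to σ.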
[Figure: the effect of σ² on the goodness of fit of a regression line.]
Interpretation of regression line for different values of β (Slope of the line)
Case 1: β > 0
If β is greater than 0, then as x
increases, the expected value of y will
increase.
Case 2: β < 0
If β is less than 0, then as x
increases, the expected value of y
will decrease.
Case 3: β = 0
If β is equal to 0, then there is no
linear relationship between x and y.
Methods of finding regression lines
1. Scatter diagram method
In this method we plot each pair of observations on graph paper and obtain a diagram
called a scatter diagram.
We then find a straight line passing through the points of the scatter diagram such that the
errors in the estimation of Y are minimized. This is the line of regression of Y on X.
Similarly, we can find a straight line passing through the points of the scatter diagram such
that the errors in the estimation of X are minimized. This is the line of regression of X on
Y.
2. Method of least squares
CURVE FITTING BY THE METHOD OF LEAST SQUARES
Fitting the line y = a + bx (Straight Line)
Definition:
• Curve fitting is a method of finding a specific relation
connecting the DEPENDENT and INDEPENDENT
VARIABLES for given data so as to satisfy the data as
accurately as possible.
• The method of least squares is the most systematic procedure for
fitting a unique curve through the given points.
• Using the ‘x’ and ‘y’ points, we need to find a curve that fits
the given data.
• Each observed point has a corresponding expected point on the
curve; their difference is known as the ‘ERROR’.
Types:
1. y = a + bx or x = a + by (Straight Line)
2. y = ax^b (power curve)
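For the straight-line case, the least-squares values of a and b come from minimizing the sum of squared errors. The derivation below is the standard textbook one, written out as a sketch because the slide's own working is not present in the text. The last line assumes the second type is the power curve y = ax^b, which becomes a straight line after taking logarithms.

```latex
% Sum of squared errors for the line y = a + bx
S(a, b) = \sum_{i=1}^{n} \left( y_i - a - b x_i \right)^2

% Setting the partial derivatives to zero gives the normal equations
\frac{\partial S}{\partial a} = 0 \;\Rightarrow\; \sum y_i = n a + b \sum x_i
\frac{\partial S}{\partial b} = 0 \;\Rightarrow\; \sum x_i y_i = a \sum x_i + b \sum x_i^2

% Solving the two equations
b = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2},
\qquad
a = \bar{y} - b \bar{x}

% Power curve, linearized (assumes y = a x^b with x, y > 0)
\log y = \log a + b \log x
```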
TO DO:
Find the best values of ‘a’ and ‘b’ so that y = a + bx fits the estriol concentration and
birthweight data given in the table, and give the equation of the line.
Estriol (mg/24 hr), x:   10   9   9  12  14  16  16  14
Birthweight (g/100), y:  25  25  25  27  27  27  24  30
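The exercise can be checked numerically. The sketch below applies the usual least-squares formulas for a straight line to the table data. The function name `fit_line` is a hypothetical helper, and this is one way of doing the arithmetic, not necessarily the layout used on the original slide.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    # Solution of the two normal equations
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

estriol = [10, 9, 9, 12, 14, 16, 16, 14]          # x, mg/24 hr
birthweight = [25, 25, 25, 27, 27, 27, 24, 30]    # y, g/100

a, b = fit_line(estriol, birthweight)
print(f"y = {a:.3f} + {b:.3f} x")   # prints: y = 23.542 + 0.217 x
```

Here Σx = 100, Σy = 210, Σxy = 2638 and Σx² = 1310, so b = 104/480 ≈ 0.217 and a = 26.25 − 0.217 × 12.5 ≈ 23.54.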
Once these values are obtained and substituted into the equation Y = a + bX, we say that
we have fitted the regression equation of Y on X to the given data. In a similar fashion, we
can develop the regression equation of X on Y, viz. X = a + bY, treating Y as the
independent variable and X as the dependent variable.
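As an illustration of the second line, the sketch below fits x = a + by to the estriol and birthweight data from the exercise by swapping the roles of the two variables. The helper name `fit_line` and the variable names are hypothetical.

```python
def fit_line(predictor, response):
    """Least-squares intercept and slope for response = a + b*predictor."""
    n = len(predictor)
    sp, sr = sum(predictor), sum(response)
    spr = sum(p * r for p, r in zip(predictor, response))
    spp = sum(p * p for p in predictor)
    b = (n * spr - sp * sr) / (n * spp - sp ** 2)
    a = (sr - b * sp) / n
    return a, b

estriol = [10, 9, 9, 12, 14, 16, 16, 14]          # x
birthweight = [25, 25, 25, 27, 27, 27, 24, 30]    # y

# Regression of x on y: birthweight is now treated as the independent variable.
a_xy, b_xy = fit_line(birthweight, estriol)
print(f"x = {a_xy:.3f} + {b_xy:.3f} y")
```

This is not the line of y on x solved for x: the two regression lines coincide only when the correlation is perfect, although both pass through the point of means (x̄, ȳ).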
Multiple Regression
• Multiple regression is the extension of simple linear regression.
[Diagram: in simple linear regression, one independent variable (IV) predicts one dependent
variable (DV), one to one. In multiple regression, several IVs predict one DV, many to one.]
• Two or more independent variables are used to predict/explain the variance in one
dependent variable.
• Two problems may arise:
1. Overfitting: It is caused by adding too many independent variables; they account for
more variance but add nothing to the model.
2. Multicollinearity: It happens when some or all of the independent variables are
correlated with each other.
Remedies for these problems:
1. Increasing the sample size is a common first step, since when the sample size is increased,
the standard error decreases (all other things being equal).
2. Remove the most intercorrelated variable(s) from the analysis. This remedy is misguided if
the variables were included because of the theory of the model, which they should have been.
3. Apply the transformation of the variables that gives the best fit for the model.
In multiple regression, each coefficient is interpreted as the estimated change in ‘y’
corresponding to a one-unit change in a variable, when all other variables are held
constant.
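[Figure: multiple regression equation with labels giving alternative notation for its terms
(one term is marked as also denoted as “a”).]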
Find the line of regression with two independent variables (birthweight and age of the child)
and systolic blood pressure (SBP) as the dependent variable, and also calculate the following:
1. Calculate the predicted average SBP of a baby with birthweight 8 lb. (128 oz.)
measured at 3 days of life.
2. Calculate the predicted average SBP of a baby with birthweight 2 lb. (32 oz.) measured
at 5 days of life.
Birthweight (oz.) (X1):    135  120  100  105  130
Age (days) (X2):             3    4    3    2    4
Systolic BP (mm Hg) (Y):    89   90   83   77   92
The regression equation tells us that for a newborn the average blood pressure increases by
an estimated 0.180 mm Hg per ounce of birthweight (age held constant) and 4.976 mm Hg per
day of age (birthweight held constant).
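The two coefficients quoted above can be reproduced from the exercise data. The sketch below solves the least-squares normal equations for two predictors in deviation form. The function name `fit_two_predictors` and the variable names are hypothetical, and the slide's own working may have been laid out differently.

```python
def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y = a + b1*x1 + b2*x2 using sums of deviations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
    s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))
    det = s11 * s22 - s12 ** 2            # zero if x1 and x2 are perfectly collinear
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    a = my - b1 * m1 - b2 * m2
    return a, b1, b2

birthweight_oz = [135, 120, 100, 105, 130]   # X1
age_days = [3, 4, 3, 2, 4]                   # X2
sbp = [89, 90, 83, 77, 92]                   # Y, mm Hg

a, b1, b2 = fit_two_predictors(birthweight_oz, age_days, sbp)
pred_1 = a + b1 * 128 + b2 * 3   # 8 lb baby measured at 3 days
pred_2 = a + b1 * 32 + b2 * 5    # 2 lb baby measured at 5 days
print(f"Y = {a:.3f} + {b1:.3f} X1 + {b2:.3f} X2")
print(f"Predicted SBP: {pred_1:.1f} and {pred_2:.1f} mm Hg")
```

The fit gives an intercept of about 49.0, so the predictions come out near 87.0 and 79.7 mm Hg. The second prediction is an extrapolation, since 32 oz lies well outside the birthweights in the table (100 to 135 oz).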
Important Properties of Regression Coefficient
1. The regression coefficient is denoted by b.
2. It is expressed in the original units of the data.
3. The regression coefficient of y on x is denoted by byx. The regression coefficient of x on y
is denoted by bxy.
4. If one regression coefficient is greater than 1, then the other will be less than 1.
5. They are not independent of change of scale: the regression coefficients change if x or y
is multiplied by a constant.
6. The arithmetic mean (AM) of the two regression coefficients is greater than or equal to the
coefficient of correlation.
7. The geometric mean (GM) of the two regression coefficients is equal to the correlation
coefficient, which takes the common sign of the two coefficients.
8. If bxy is positive, then byx is also positive, and vice versa.
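Several of these properties can be verified numerically. The sketch below uses the estriol and birthweight data from the earlier exercise as a convenient example; the variable names are hypothetical.

```python
import math

x = [10, 9, 9, 12, 14, 16, 16, 14]          # estriol
y = [25, 25, 25, 27, 27, 27, 24, 30]        # birthweight

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

b_yx = sxy / sxx                   # regression coefficient of y on x
b_xy = sxy / syy                   # regression coefficient of x on y
r = sxy / math.sqrt(sxx * syy)     # correlation coefficient

print(f"byx = {b_yx:.4f}, bxy = {b_xy:.4f}, r = {r:.4f}")
print("GM equals |r|:", math.isclose(math.sqrt(b_yx * b_xy), abs(r)))
print("AM >= r:", (b_yx + b_xy) / 2 >= r)
print("Same sign:", (b_yx > 0) == (b_xy > 0))
```

For this data both coefficients are positive and less than 1, and their product equals r².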
Standard Error of Regression (S)
• The standard error of the regression or Standard error of estimate is a measure of the
accuracy of predictions.
• Note: The regression line is the line that minimizes the sum of squared deviations of
prediction (also called the sum of squares error).
• All of the observed values of (Y,X1,X2) do not fall on the regression line but they scatter
away from it.
• The degree of scatter of the observed values about the regression line is measured by the
standard deviation of regression. It measures the variation of the observations about the
true regression line Y = α + β1X1 + β2X2 and is denoted by σY.12.
• The standard error of regression (S) represents the average distance that the observed
values fall from the regression line.
• Conveniently, it tells you how wrong the regression model is on average using the units
of the response variable.
• Smaller values are better because they indicate that the observations are closer to the fitted
line.
• Unlike R-squared, one can use the standard error of the regression to assess the precision
of the predictions.
• The standard error of the regression provides the absolute measure of the typical distance
that the data points fall from the regression line. S is in the units of the dependent
variable.
Standard Error of estimate for Simple Regression
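The formula for this slide is not present in the text. A common definition, assumed here, is S = sqrt(Σ(y − ŷ)² / (n − 2)), where ŷ is the fitted value and n − 2 is the number of degrees of freedom left after estimating a and b; some textbooks divide by n instead. The sketch below applies that assumed definition to the estriol and birthweight data. The function name is hypothetical.

```python
import math

def standard_error_simple(x, y):
    """Standard error of estimate for y = a + b*x, using n - 2 degrees of freedom."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    # Sum of squared differences between observed and fitted values
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (n - 2))

estriol = [10, 9, 9, 12, 14, 16, 16, 14]
birthweight = [25, 25, 25, 27, 27, 27, 24, 30]

s_simple = standard_error_simple(estriol, birthweight)
print(f"S = {s_simple:.3f} (in units of birthweight, g/100)")
```

Under this definition the observed birthweights fall about 1.94 units (g/100) from the fitted line on average.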
Standard Error of estimate for Multiple Regression
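The formula for this slide is also missing from the text. A common definition, assumed here, generalizes the simple case to S = sqrt(Σ(y − ŷ)² / (n − k − 1)), where k is the number of independent variables. The sketch below applies it to the birthweight, age, and SBP data with k = 2. The function names are hypothetical, and the n − k − 1 divisor is an assumption about the convention intended.

```python
import math

def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y = a + b1*x1 + b2*x2 using sums of deviations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
    s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2

def standard_error_multiple(x1, x2, y):
    """Standard error of estimate with k = 2 predictors (n - 3 degrees of freedom)."""
    a, b1, b2 = fit_two_predictors(x1, x2, y)
    sse = sum((w - (a + b1 * u + b2 * v)) ** 2 for u, v, w in zip(x1, x2, y))
    return math.sqrt(sse / (len(y) - 2 - 1))

birthweight_oz = [135, 120, 100, 105, 130]
age_days = [3, 4, 3, 2, 4]
sbp = [89, 90, 83, 77, 92]

s_multiple = standard_error_multiple(birthweight_oz, age_days, sbp)
print(f"S = {s_multiple:.3f} mm Hg")
```

With only five observations and two predictors there are just two degrees of freedom, so this estimate of about 1.19 mm Hg is itself quite uncertain.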