REGRESSION

www.DuloMix.com

Introduction

• Regression is the study of formulating and determining an algebraic expression for the relationship between variables.

• It also predicts the value of one variable from that of another.

• Regression is used to determine the statistical relationship between two or more variables.

• For example, temperature and the amount of drug dissolved in a solvent are correlated. Regression can help to find the amount of drug dissolved at a particular temperature: regression analysis estimates or predicts the value of one variable if the value of the other variable is known.

• Thus, regression analysis is a statistical method used in various disciplines that

attempts to determine the strength and character of the relationship between one

dependent variable (y) and a number of other independent variables (X1, X2, X3

etc.).

• According to Blair “regression is the measure of the average relationship between

two or more variables in terms of the original units of the data”.


Types of Regression

• There are two basic types of regression, namely:

1. Simple linear regression

– uses one independent variable to explain or predict the outcome of the dependent variable.

2. Multiple linear regression

– uses two or more independent variables to predict the outcome.

• Regression may be linear or non-linear.

• Linear regression is represented graphically by a straight line.

• A non-linear relationship between variables forms a curve and hence is called curvilinear regression.


Properties of Regression:

1. Regression expresses the response variable as a function of the predictor variable.

2. It is estimated when there is a significant correlation between the response and predictor variables.

3. It predicts only a probable value of the response for a known value of the predictor.

4. It can be worked out in two ways: regression of x on y and regression of y on x.

5. The regression coefficient is used to work out the regression equation.

• There are two methods used to study regression, namely:

Graphical method:

• In the graphical method, values of the variables are plotted as points on a graph to give a scatter plot. Usually, the independent variable is placed on the x-axis and the dependent variable on the y-axis. The regression line is drawn through the points so that the points are balanced on either side of the line.

Mathematical method:

• In the mathematical method, the regression equations of x on y and of y on x are computed from the data; the two regression coefficients can then be used to obtain the correlation coefficient.


Linear Regression

• The relationship between two or more variables is estimated by plotting the individual observations as points on a scatter plot.

• The significance of the relationship between the variables is judged from the nature and direction of the points on the plot.

• To estimate the relationship between the variables, a straight line is drawn that passes as close as possible to all the points on the plot.


(a) Regression lines:

• The relationship between two variables is presented by a regression line.

• It gives the average value of one variable (x) for any given value of the other variable (y).

• There are always two regression lines to demonstrate the relationship between the x and y variables.

• One line demonstrates the regression of y on x and the other line demonstrates the regression of x on y.

• In the case of perfect correlation (+1 or –1), the two lines coincide and become a single line. If the two lines are close to each other, this indicates a strong correlation between the variables x and y.

• If the lines are farther from each other, this indicates a weaker correlation between the variables.

• A correlation value of 0 indicates that the variables are linearly uncorrelated; the two lines are then at right angles to each other, each parallel to one axis, and intersect at the point (x̄, ȳ).


(b) Regression Equation:

• Suppose we are establishing a relationship between two variables x and y and plot them on a scatter plot. We get two lines of best fit that pass among the points, called ‘regression lines’.

• Each line gives an equation called a ‘regression equation’.

• The statistical analysis used to determine the exact location of the straight line(s) is called ‘linear regression analysis’.

• The two variables are said to have a linear relationship if a change in one variable produces a proportional change in the other variable.

• The usefulness of regression analysis is to predict the value of one variable from the known value(s) of the other variable(s).


• The regression equation is the mathematical expression of a regression line. Since there are two lines, each line has its own equation:

• The linear regression equation of x on y is:

x = a + by

• The linear regression equation of y on x is:

y = a + bx

where x and y are the two variables, and a and b are unknown constants (their values differ between the two equations).

• For the regression of y on x, the value of a is the intercept of the regression line on the y-axis. The value of b, also called the regression coefficient, is the slope of the regression line. The regression coefficient gives the change in y that corresponds to every unit change in x.


(c) Significance of Regression:

A significance test of the regression is used to determine whether the estimated value of the regression coefficient deviates significantly from zero, the value expected when there is no linear relationship.
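One common way to carry out such a test for simple linear regression is a t statistic with n – 2 degrees of freedom, which can be written in terms of the correlation coefficient as t = r√(n – 2)/√(1 – r²). The sketch below is an illustration under that assumption, not a procedure given in these slides; the function name is hypothetical and the data are those of the six-point simple regression example later in this section.

```python
from math import sqrt

def slope_t_statistic(xs, ys):
    """Return (b, r, t) for the regression of y on x.

    t tests whether the slope differs from zero and has n - 2 degrees
    of freedom. It is undefined for a perfect fit (r = +1 or -1).
    """
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                      # regression coefficient of y on x
    r = sxy / sqrt(sxx * syy)          # correlation coefficient
    t = r * sqrt(n - 2) / sqrt(1 - r * r)
    return b, r, t

xs = [0, 3, 6, 9, 12, 18]
ys = [51.7, 51.0, 50.0, 50.3, 48.0, 47.0]
b, r, t = slope_t_statistic(xs, ys)
print(f"b = {b:.4f}, r = {r:.4f}, t = {t:.2f} with {len(xs) - 2} degrees of freedom")
```

The computed |t| is then compared with the tabulated t value for n – 2 degrees of freedom at the chosen significance level; a larger |t| suggests that the slope differs significantly from zero.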


Application of Regression in Pharmacy

• Regression is often used in pharmaceutical research to determine how specific factors are related to each other and to know their level of interaction for the desired outcomes.

• It is important in predicting target properties and desired features and in solving problems related to them.

• Following are some of the applications of regression in pharmaceutical field.

1. Finding best fit to linear physicochemical relationships.

2. Measurement of the median particle size of drugs, excipients and finished forms.

3. Studying enzyme kinetics and stability predictions.

4. Study of drug dose and response relationship over time.

5. In clinical trials for fitting linear portions of pharmacokinetics data.

6. In development of biochemical and chemical assays.

7. Plot of analyte recovery versus known amount in assays.

8. Calibration of analytical data for the quantitative analysis.


Curve fitting – Method of Least Squares

• The least squares method is a mathematical procedure for finding the best-fit curve to a given set of data points by minimizing the sum of the squares of the errors (the residuals) of the points from the curve.

• The sum of the squares of the errors (offsets) is used, rather than their absolute values, because this allows the residuals to be treated as a continuous differentiable quantity.

• However, because squares of the offsets are used, outlying points can have a disproportionate effect on the fit. This property may or may not be desirable, depending on the problem under consideration.

• This technique is the simplest and most commonly applied form of linear regression because it provides an answer to the problem of finding the best-fit straight line through a set of data points. If the form of the relationship between the two plotted parameters is known, the data can be transformed into a suitable form so that the resulting line is a straight line.


• Let (xi, yi), i = 1, 2, 3, …, n be n pairs of values of the variables X and Y. Let the equation of the line of regression of Y on X be

Y = a + bx (1)

so that Yi = a + bxi, i = 1, 2, …, n

• Here yi is the actual value of the variable Y and Yi is the value of Y estimated by equation (1).

• So the error for the i-th pair is yi – Yi

• We minimize the sum of squares of the errors, given by

E = Ʃ(yi – Yi)² = Ʃ(yi – a – bxi)² (2)
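As a minimal sketch of this idea (Python, standard library only), the line that minimizes the sum of squared errors can be computed directly from the sums Ʃx, Ʃy, Ʃxy and Ʃx². The function names and the five data points are illustrative assumptions, not part of the source material.

```python
def fit_line(xs, ys):
    """Least-squares line Y = a + b*x; returns (a, b)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # slope
    a = (sy - b * sx) / n                           # intercept
    return a, b

def sum_squared_errors(xs, ys, a, b):
    """Sum of (actual - estimated)^2 for the line Y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
best = sum_squared_errors(xs, ys, a, b)

# Moving either constant away from the fitted value increases the error sum.
worse_a = sum_squared_errors(xs, ys, a + 0.1, b)
worse_b = sum_squared_errors(xs, ys, a, b + 0.1)
print(a, b, best, worse_a, worse_b)
```

Perturbing a or b in either direction gives a larger sum of squares, which is what "least squares" means in practice.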


Properties of regression coefficients

(1) The correlation coefficient r is the geometric mean of the two regression coefficients, i.e.

r = ±√(byx × bxy)

∴ bxy ∙ byx = r²

(2) Both the regression coefficients have the same sign, i.e. either they are both positive or both negative; r takes the same sign.

(3) If one of the regression coefficients is numerically greater than unity (1), then the other is numerically less than unity.

(4) The arithmetic mean of the two regression coefficients is numerically greater than or equal to the correlation coefficient.

(5) Regression coefficients are independent of the change of origin but not

of scale.
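These properties can be checked numerically. The sketch below is a hedged illustration in Python: the function name is hypothetical, and the data are the advertising and sales figures from the first worked example of this section.

```python
from math import sqrt

def regression_coefficients(xs, ys):
    """Return (byx, bxy, r) computed from raw sums."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    num = n * sxy - sx * sy
    var_x = n * sxx - sx ** 2
    var_y = n * syy - sy ** 2
    return num / var_x, num / var_y, num / sqrt(var_x * var_y)

X = [1, 2, 3, 4, 5, 6]
Y = [3, 15, 6, 20, 9, 25]
byx, bxy, r = regression_coefficients(X, Y)

# (1) product of the coefficients equals r squared
product_minus_r2 = byx * bxy - r * r
# (5) change of origin: subtracting constants leaves the coefficients unchanged
byx_shift, bxy_shift, _ = regression_coefficients([x - 3 for x in X], [y - 10 for y in Y])
# (5) change of scale: multiplying X by 10 divides byx by 10
byx_scaled, _, _ = regression_coefficients([10 * x for x in X], Y)

print(byx, bxy, r, product_minus_r2, byx_shift, byx_scaled)
```

For this data set byx is greater than 1 and bxy is less than 1, in line with property (3).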


Example: The following data show monthly advertising expenditure and sales over a

period of six months. Estimate the relationship between sales (Y) and advertising

expenditure (X).

Sales (Y)                      3    15    6    20    9    25

Advertising expenditure (X)    1    2     3    4     5    6

Solution: We have to find the equation of the line of regression of Y on X.

X        Y        XY       X²

1        3        3        1

2        15       30       4

3        6        18       9

4        20       80       16

5        9        45       25

6        25       150      36

Total:
21       78       326      91

X̄ = ƩX/n = 21/6 = 3.5

Ȳ = ƩY/n = 78/6 = 13

byx = (nƩXY – ƩX·ƩY) / (nƩX² – (ƩX)²)

= (6 × 326 – 21 × 78) / (6 × 91 – 21²)

= 318/105 = 3.03

The equation of the line of regression of Y on X is

Y – Ȳ = byx (X – X̄)

Y – 13 = 3.03 (X – 3.5)

∴ Y = 3.03X – 10.61 + 13

∴ Y = 3.03X + 2.39

This equation gives the best estimate of Y for a given value of X.

Example: The following are the results of five assays of different but known potency.

Drug potency (X) 60 80 90 100 120

Assay(Y) 61 79 91 102 119

Find the equation of the line of regression of Y on X and estimate Y when X = 95.

Solution: We construct the following table, taking dx = X – A = X – 90 and dy = Y – B = Y – 90.

X        Y        dx       dy       dxdy     dx²

60       61       -30      -29      870      900

80       79       -10      -11      110      100

90       91       0        1        0        0

100      102      10       12       120      100

120      119      30       29       870      900

Total             0        2        1970     2000

X̄ = A + Ʃdx/n = 90 + 0 = 90

Ȳ = B + Ʃdy/n = 90 + 2/5 = 90.4

byx = (nƩdxdy – Ʃdx·Ʃdy) / (nƩdx² – (Ʃdx)²)

= (5 × 1970 – 0 × 2) / (5 × 2000 – 0²)

= 0.985

The equation of the line of regression of Y on X is

Y – Ȳ = byx (X – X̄)

Y – 90.4 = 0.985 (X – 90)

∴ Y = 0.985X – 88.65 + 90.4

∴ Y = 0.985X + 1.75

When X = 95, the estimated value of Y is given by

Y = 0.985(95) + 1.75 = 95.33
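The assumed-mean shortcut used in this solution (dx = X – A, dy = Y – B) works because the regression coefficient does not depend on the choice of origin. The following Python sketch, with a hypothetical function name, repeats the calculation for the assay data and confirms that a different choice of A and B gives the same line.

```python
def fit_with_assumed_means(xs, ys, A, B):
    """Regression of Y on X using deviations from assumed means A and B.

    Returns (a, byx) for the line Y = a + byx * X.
    """
    n = len(xs)
    dx = [x - A for x in xs]
    dy = [y - B for y in ys]
    s_dx, s_dy = sum(dx), sum(dy)
    s_dxdy = sum(u * v for u, v in zip(dx, dy))
    s_dx2 = sum(u * u for u in dx)
    byx = (n * s_dxdy - s_dx * s_dy) / (n * s_dx2 - s_dx ** 2)
    x_bar = A + s_dx / n          # true mean recovered from the assumed mean
    y_bar = B + s_dy / n
    a = y_bar - byx * x_bar       # intercept from Y - y_bar = byx (X - x_bar)
    return a, byx

X = [60, 80, 90, 100, 120]     # drug potency
Y = [61, 79, 91, 102, 119]     # assay result

a, b = fit_with_assumed_means(X, Y, A=90, B=90)
a_raw, b_raw = fit_with_assumed_means(X, Y, A=0, B=0)   # no shift at all
y_at_95 = a + b * 95
print(a, b, y_at_95)
```

Choosing A and B close to the means only keeps the hand arithmetic small; it has no effect on the fitted slope or intercept.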

Example : Using the data given below, find the equation of the two lines of regression.

Variable     Mean     S.D.     Coeff. of correlation

X            40       5        r = 0.8

Y            30       4

Solution: We are given that X̄ = 40, Ȳ = 30, σx = 5, σy = 4 and r = 0.8.

byx = r (σy/σx) = 0.8 × 4/5 = 0.8 × 0.8 = 0.64

bxy = r (σx/σy) = 0.8 × 5/4 = 0.2 × 5 = 1

The equation of the line of regression of Y on X is

Y – Ȳ = byx (X – X̄)

Y – 30 = 0.64 (X – 40)

∴ Y = 0.64X – 25.6 + 30

∴ Y = 0.64X + 4.4

The equation of the line of regression of X on Y is

X – X̄ = bxy (Y – Ȳ)

X – 40 = 1 (Y – 30)

∴ X = Y – 30 + 40

∴ X = Y + 10
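When only the means, standard deviations and r are available, both regression lines follow from byx = r·σy/σx and bxy = r·σx/σy. A small Python sketch of that calculation, using the figures of this example (the function name is an assumption made for illustration):

```python
def lines_from_summary(x_bar, y_bar, sd_x, sd_y, r):
    """Return ((a_yx, byx), (a_xy, bxy)) for the two regression lines.

    Y on X:  Y = a_yx + byx * X
    X on Y:  X = a_xy + bxy * Y
    """
    byx = r * sd_y / sd_x
    bxy = r * sd_x / sd_y
    a_yx = y_bar - byx * x_bar
    a_xy = x_bar - bxy * y_bar
    return (a_yx, byx), (a_xy, bxy)

(a_yx, byx), (a_xy, bxy) = lines_from_summary(x_bar=40, y_bar=30, sd_x=5, sd_y=4, r=0.8)
print(f"Y = {byx:.2f}X + {a_yx:.2f}")
print(f"X = {bxy:.2f}Y + {a_xy:.2f}")
```

Both lines pass through the point of means (40, 30), which gives a quick check on the arithmetic.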


Example: Perform simple regression analysis for the following data set.

(X) 0 3 6 9 12 18

(Y) 51.7 51 50 50.3 48 47

Report the values of slope, intercept and correlation coefficient..

Solution: We construct the following table, taking A = 9 and B = 50, so that dx = X – A = X – 9 and dy = Y – B = Y – 50.

X        Y        dx       dy       dxdy     dx²      dy²

0        51.7     -9       1.7      -15.3    81       2.89

3        51       -6       1        -6       36       1

6        50       -3       0        0        9        0

9        50.3     0        0.3      0        0        0.09

12       48       3        -2       -6       9        4

18       47       9        -3       -27      81       9

Total             -6       -2       -54.3    216      16.98

X̄ = A + Ʃdx/n = 9 + (-6)/6 = 8

Ȳ = B + Ʃdy/n = 50 + (-2)/6 = 49.67

byx = (nƩdxdy – Ʃdx·Ʃdy) / (nƩdx² – (Ʃdx)²)

= (6(–54.3) – (–6)(–2)) / (6(216) – (–6)²)

= –337.8/1260 = –0.2681

The equation of the line of regression of Y on X is

Y – Ȳ = byx (X – X̄)

Y – 49.67 = –0.2681 (X – 8)

∴ Y = –0.2681X + 2.1448 + 49.67

∴ Y = –0.2681X + 51.81

Comparing this with Y = a + bX, we get

intercept a = 51.81 and slope b = –0.2681

To find the correlation coefficient, we first find bxy from the totals of the same table (n = 6, Ʃdx = -6, Ʃdy = -2, Ʃdxdy = -54.3, Ʃdy² = 16.98):

bxy = (nƩdxdy – Ʃdx·Ʃdy) / (nƩdy² – (Ʃdy)²)

= (6(–54.3) – (–6)(–2)) / (6(16.98) – (–2)²)

= –337.8/97.88 = –3.4512

The correlation coefficient r is given by

r = ±√(byx × bxy)

= ±√((–0.2681)(–3.4512))

= ±√0.9253

= ±0.9619

∴ r = –0.9619 (∵ byx and bxy are negative)


Example: Find the regression equation showing the capacity utilization on production from the following data.

                                Mean     S.D.     Coeff. of correlation

Production (in lakh units)      35.6     10.5     r = 0.62

Capacity utilization (in %)     84.8     8.5

Estimate the production when the capacity utilization is 70%.

Solution: Let X denote the production and Y the capacity utilization.

We are given that X̄ = 35.6, Ȳ = 84.8, σx = 10.5, σy = 8.5 and r = 0.62.

byx = r (σy/σx) = 0.62 × 8.5/10.5 = 0.5019

bxy = r (σx/σy) = 0.62 × 10.5/8.5 = 0.7659

The equation of the line of regression of Y on X is

Y – Ȳ = byx (X – X̄)

Y – 84.8 = 0.5019 (X – 35.6)

∴ Y = 0.5019X – 17.87 + 84.8

∴ Y = 0.5019X + 66.93

The equation of the line of regression of X on Y is

X – X̄ = bxy (Y – Ȳ)

X – 35.6 = 0.7659 (Y – 84.8)

∴ X = 0.7659Y – 64.95 + 35.6

∴ X = 0.7659Y – 29.35

When Y = 70%, X = (0.7659 × 70) – 29.35 = 24.26

When the capacity utilization is 70%, the estimated production is 24.26 lakh units.

Example: Two regression lines involving variables x and y are y = 5.6 + 1.2 x and x = 12.5 + 0.6 y. Find the means of x and y and

the correlation coefficient between x and y.

Solution: The two lines of regression intersect at the point (x̄, ȳ). Hence, to find the means x̄ and ȳ, we solve the given equations simultaneously.

Substituting the value of x from the second equation into the first, we get

y = 5.6 + 1.2 (12.5 + 0.6y)

∴ y = 5.6 + 15 + 0.72y

∴ 0.28y = 20.6

∴ y = 20.6/0.28 = 73.57

∴ x = 12.5 + 0.6y = 12.5 + 0.6(73.57) = 56.64

Thus the point of intersection of the two lines is (56.64, 73.57).

∴ x̄ = 56.64 and ȳ = 73.57

Now, the equation of the line of regression of y on x is

y = 5.6 + 1.2x

∴ byx = coefficient of x = 1.2

Also, the equation of the line of regression of x on y is

x = 12.5 + 0.6y

∴ bxy = coefficient of y = 0.6

The coefficient of correlation r is given by

r = ±√(byx × bxy)

= ±√(1.2 × 0.6)

= ±√0.72

= 0.8485 (∵ byx and bxy are positive)
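The same steps (solve the two equations for their intersection, then take the signed square root of the product of the coefficients) can be sketched in Python. The function name and argument order are assumptions made for illustration.

```python
from math import copysign, sqrt

def means_and_r(a_yx, b_yx, a_xy, b_xy):
    """Given y = a_yx + b_yx*x and x = a_xy + b_xy*y, return (x_bar, y_bar, r)."""
    r_squared = b_yx * b_xy
    if not 0 <= r_squared < 1:
        # A negative product means the signs disagree; a product of 1 means
        # the lines coincide, so there is no single intersection point.
        raise ValueError("coefficients are not a valid, distinct pair of regression lines")
    # Substitute x = a_xy + b_xy*y into the first equation and solve for y.
    y_bar = (a_yx + b_yx * a_xy) / (1 - r_squared)
    x_bar = a_xy + b_xy * y_bar
    r = copysign(sqrt(r_squared), b_yx)   # r takes the sign of the coefficients
    return x_bar, y_bar, r

x_bar, y_bar, r = means_and_r(a_yx=5.6, b_yx=1.2, a_xy=12.5, b_xy=0.6)
print(round(x_bar, 2), round(y_bar, 2), round(r, 4))
```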


Example : Find the regression line of Y on X for the following data.

x    1    3    4    6    8    9    11    14

y    1    2    4    4    5    7    8     9

and estimate y when x = 10.

Solution: Let y = a + bx (1)

be the required line of regression of Y on X.

X        Y        XY       X²

1        1        1        1

3        2        6        9

4        4        16       16

6        4        24       36

8        5        40       64

9        7        63       81

11       8        88       121

14       9        126      196

Total:
56       40       364      524

The normal equations are

Ʃy = na + bƩx

and Ʃxy = aƩx + bƩx²

Substituting the respective values from the table above,

40 = 8a + 56b

364 = 56a + 524b

Solving the simultaneous equations, we have

a = 6/11 and b = 7/11

From eq. (1), the required line of regression of Y on X is

y = 6/11 + (7/11)x

or 7x – 11y + 6 = 0 (2)

If x = 10, from eq. (2)

7 × 10 – 11y + 6 = 0

11y = 76

y = 76/11 = 6 10/11 ≈ 6.91 = required estimated value
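The two normal equations form a 2 × 2 linear system, so they can be solved exactly with rational arithmetic. The sketch below uses Python's `fractions` module and Cramer's rule; the function name is hypothetical, and the data are those of this example.

```python
from fractions import Fraction

def solve_normal_equations(xs, ys):
    """Solve  sum(y) = n*a + b*sum(x)  and  sum(xy) = a*sum(x) + b*sum(x^2)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    det = n * sxx - sx * sx                    # determinant of the system
    a = Fraction(sy * sxx - sx * sxy, det)     # Cramer's rule for a
    b = Fraction(n * sxy - sx * sy, det)       # Cramer's rule for b
    return a, b

x = [1, 3, 4, 6, 8, 9, 11, 14]
y = [1, 2, 4, 4, 5, 7, 8, 9]

a, b = solve_normal_equations(x, y)
estimate = a + b * 10
print(a, b, estimate)
```

Working in fractions avoids rounding and reproduces the hand-calculated values a = 6/11 and b = 7/11 exactly.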

Multiple regression

• In studies where there are two or more independent variables, the analysis describing the relationship between them and the dependent variable is called ‘multiple correlation’, and the equation that describes such a relationship is known as the ‘multiple regression equation’.

• Generally, multiple regression explains the relationship between two or more (multiple) independent variables and one dependent variable.

• Since this regression uses two or more independent variables, it is called ‘multiple regression’.

• Multiple regression has two types, namely multiple linear regression and multiple non-linear regression.

• Multiple linear regression analysis is a set of techniques used to study the straight-line relationships between two or more variables. It is used to predict values of the dependent variable and to indicate which independent variables have a major effect on the dependent variable.

• It can also be used when there are three or more independent variables.

(a) Assumptions of Multiple Regression:

1. All the variables are continuous measurement variables.

2. Data values of variables are unimodal and have fairly symmetrical

distribution.

3. There is a linear relationship between predictor and response (criterion)

variables.


(b) Computation of Multiple Linear Regression:

The purpose of a multiple regression is to find an equation that best predicts the y

variable as a linear function of x variables.

Multiple regression estimates the β’s in the equation.

The multiple linear regression equation is presented in the following general form:

yj = β0 + β1x1j + β2x2j + … + βnxnj + εj

Where, x’s are the independent variables,

y is the dependent variable and

the subscript j represents the observation number.

βi’s (i = 1, 2, …, n) are the unknown regression coefficients, each of which represents the amount by which the criterion variable changes when the corresponding predictor variable changes by one unit, the other predictors being held constant. Their estimates are represented by b’s.

Each β represents an unknown parameter, while b is an estimate of this β.

εj is the error of observation j.

For example, the tablet hardness of a batch of product depends on various factors such as the type of binder, the amount of moisture and the compression force applied. Using hardness test data, one can estimate the relationship between hardness and these factors.
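To make the computation concrete, here is a hedged Python sketch that estimates b0, b1, …, bn by forming and solving the normal equations with Gaussian elimination (standard library only). The function name is an assumption, and the moisture, force and hardness figures are synthetic: hardness is generated from known coefficients so that the fit can be checked. A categorical factor such as binder type would need to be coded numerically first, which is not shown.

```python
def fit_multiple(X, y):
    """Least-squares estimates [b0, b1, ..., bn] for y = b0 + b1*x1 + ... + bn*xn.

    X is a list of rows [x1, x2, ...]; y is the list of responses.
    """
    rows = [[1.0] + list(r) for r in X]            # column of 1s for the intercept
    k = len(rows[0])
    # Normal equations (Z'Z) b = Z'y, stored as an augmented matrix.
    M = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Forward elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda i: abs(M[i][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(col + 1, k):
            f = M[i][col] / M[col][col]
            for j in range(col, k + 1):
                M[i][j] -= f * M[col][j]
    # Back substitution.
    b = [0.0] * k
    for i in reversed(range(k)):
        s = sum(M[i][j] * b[j] for j in range(i + 1, k))
        b[i] = (M[i][k] - s) / M[i][i]
    return b

# Synthetic data: (moisture %, compression force) -> hardness, no noise added.
X = [(1, 10), (2, 12), (3, 11), (4, 15), (5, 14), (6, 18)]
y = [2.0 - 0.5 * m + 0.4 * f for m, f in X]

b0, b1, b2 = fit_multiple(X, y)
print(b0, b1, b2)
```

Because the synthetic responses contain no error term, the estimates recover the generating coefficients (2.0, –0.5, 0.4) up to floating-point rounding; real data would leave non-zero residuals.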


• Although the regression problem may be solved by a number of

techniques, the most commonly used method is least squares.

• In least squares regression analysis, the b’s are selected so as to minimize

the sum of the squared residuals.

• This set of b’s is not necessarily the set that is desired, since they may be

distorted by outliers, the points that are not representative of the data.

• The sample multiple regression equation in such a case is presented as

ŷj = b0 + b1x1j + b2x2j + … + bnxnj

• If n = 1, the model is called simple linear regression.

• It should be noted that the number of normal equations depends upon the number of independent variables.

• In the case of 2 independent variables there are 3 equations, for 3 independent variables there are 4 equations, and so on.

• The intercept β0 is the point at which the regression plane intersects the y-axis.


• In multiple regression analysis, the regression coefficients β1, β2 become less reliable as the degree of correlation between the independent variables x1, x2 increases.

• A high degree of correlation between the independent variables is known as the problem of multicollinearity.

• In such a condition, it is advisable to use only one of the correlated independent variables to make an estimate.

• In fact, adding a second variable (x2) that is correlated with the first variable (x1) alters the values of the regression coefficients.

• However, predictions for the dependent variable can still be made when multicollinearity is present.

• In such a situation, care must be exercised in selecting the independent variables for estimating a dependent variable, so that multicollinearity is kept to a minimum.

• When there is more than one independent variable, a distinction is made between the collective effect of the independent variables and the individual effect of each variable taken separately.

• The collective effect is given by the coefficient of multiple correlation (Ry.x1x2).
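For two independent variables, the coefficient of multiple correlation can be obtained from the three pairwise correlation coefficients using the standard relation R² = (ry1² + ry2² – 2·ry1·ry2·r12) / (1 – r12²). This formula is not stated in the slides; the sketch below applies it in Python with a hypothetical function name and hypothetical correlation values, to show how correlation between x1 and x2 affects the collective effect.

```python
from math import sqrt

def multiple_correlation(r_y1, r_y2, r_12):
    """Coefficient of multiple correlation of y on x1 and x2.

    r_y1, r_y2: correlations of y with x1 and with x2
    r_12:       correlation between x1 and x2
    """
    if abs(r_12) >= 1:
        raise ValueError("x1 and x2 are perfectly collinear; R is undefined")
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return sqrt(r_squared)

# Hypothetical correlations, chosen only for illustration.
R_uncorrelated = multiple_correlation(0.6, 0.5, 0.0)   # x1 and x2 unrelated
R_correlated = multiple_correlation(0.6, 0.5, 0.3)     # x1 and x2 overlap
print(R_uncorrelated, R_correlated)
```

With these values the correlated predictors explain less jointly than uncorrelated ones would, because part of what x2 explains is already explained by x1. The divisor 1 – r12² also approaches zero as r12 approaches ±1, which is one way to see why estimates become unreliable under multicollinearity.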


Standard Error of Regression

• The standard error of the regression (S), also called the standard error of the estimate, is the standard deviation of the observed values about the regression line. It is used in hypothesis testing and in setting confidence limits. S represents the average distance that the observed values fall from the regression line.

• S indicates how wrong the regression model is on average, in the units of the response variable. S becomes smaller as the data points lie closer to the line. Smaller values are better because they indicate that the observations are closer to the fitted line (Fig. 2.3). The S of the regression can be used to assess the precision of the predictions. Approximately 95% of the observations are expected to fall within ±2S of the regression line, which is a quick approximation of a 95% prediction interval. If the regression model is used to make predictions, assessing S might be more important than assessing R².


• As an example, in the regression output of Minitab statistical software, S is found in the “Summary of Model” section, next to R².

• Both S and R² provide an overall measure of how well the model fits the data.

• The letter S denotes the standard error of the regression, also known as the standard error of the estimate.

• In fact, S represents the average distance that the observed values fall from the regression line.


• For example, the fitted line plot shown in Fig. 2.4 uses body mass index (BMI) to predict body fat percentage. The value of S is 3.53399, which indicates that the average distance of the data points from the fitted line is about 3.5% body fat.

• The S of the regression is used to assess the

precision of the predictions.

• Thus, approximately 95% of the observations

should fall within ±2 S of the regression from

the regression line, which is also a quick

approximation of a 95% prediction interval. In

this example, about 95% of the observations

should fall within ±7% of the fitted line, which

is a close match for the prediction interval.
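For a simple linear regression, S is usually computed as the square root of the residual sum of squares divided by n – 2. The Python sketch below assumes that definition (the slides do not give the formula) and uses a hypothetical function name. It is applied to the five-point assay data from the worked examples, and then reproduces the ±2S half-width for the S value quoted for the BMI plot.

```python
from math import sqrt

def standard_error_of_regression(xs, ys):
    """Return (S, residuals) for the least-squares line of y on x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    s = sqrt(sum(e * e for e in residuals) / (n - 2))   # n - 2: two constants estimated
    return s, residuals

X = [60, 80, 90, 100, 120]     # drug potency
Y = [61, 79, 91, 102, 119]     # assay result

s, residuals = standard_error_of_regression(X, Y)
inside = sum(abs(e) <= 2 * s for e in residuals)   # points within +/- 2S of the line
print(s, inside, len(X))

# Quick prediction band for the BMI example, using the S value quoted above.
half_width = 2 * 3.53399
print(half_width)
```

The half-width of about 7.07 matches the "±7%" figure quoted for the body fat example.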