Unit 1 Biostatistics and Research Methodology Notes





Statistics is the technique of analysing numerical data. These data represent observations obtained through statistical or non-statistical techniques. They may describe a universe (an entire population) or a sample drawn from it by some sampling procedure. Statistics also includes techniques such as tabulation, averages (central tendency), dispersion etc., which help in describing and summarising the characteristics of sample data in medical research.

The word ‘statistics’ is derived from the Latin word ‘status’. In the plural sense it means a set of numerical figures, called ‘data’, obtained by counting. In the singular sense it means the collection, classification, analysis, comparison and meaningful interpretation of ‘raw data’. Croxton and Cowden define statistics as “the science which deals with the collection, analysis and interpretation of numerical data”.

Data in their original, as-collected form are known as raw data.

Array : An array can be considered as a multiply subscripted collection of data entries.

Statistics means a measured or counted fact or piece of information stated as a figure, such as the height of a person or the birth of a baby. Statistics, though apparently plural, when used in the singular sense is a “science of figures”. It is a field of study concerned with the techniques or methods of collection, classification, summarising and interpretation of data, drawing inferences, testing hypotheses and making recommendations.


The word ‘statistics’ is defined as :

i) A plural noun, meaning numerical data, such as the annual output of a machine, yearly changes in income etc.

ii) A singular noun, referring to the techniques and procedures for collecting, describing, analysing and interpreting numerical data.

iii) Statistics may be defined as the aggregate of facts affected to a marked extent by a multiplicity of causes, numerically expressed or estimated according to a reasonable standard of accuracy, collected in a systematic manner for a predetermined purpose and placed in relation to each other. (Horace Secrist)

Figure (1) : Components of Statistics

Statistics is divided into two broad categories : descriptive statistics and inferential statistics. It is further classified as shown below :

Figure (2) : Classification of Statistics


Bio-Statistics is the term used when the tools of statistics are applied to data derived from the biological sciences, such as medicine. The tools and theories of statistics are very important in the biological and medical sciences.

“Mathematical Biology” is a fast-growing, well-recognised subject and the most exciting modern application of applied mathematics. Biostatistics is a science of ...

Application and Uses of Bio-Statistics :

i) To define what is normal or healthy in a population.
ii) To find the relative potency of a new drug with respect to a standard drug.
iii) To compare the efficacy of a particular drug.
iv) To find an association between two attributes, such as disease and smoking, or filariasis and social class.
v) To identify the signs and symptoms of a disease or syndrome.


Suppose a value of a variable occurs twice or more in a given series of observations; the number of occurrences of that value is known as the frequency of that value. Tabulating the values of a variable side by side with their respective frequencies gives a frequency distribution of the data.

A frequency distribution is usually the first method used to organise data for investigation. A systematic presentation of the different values taken by a variable, together with their corresponding frequencies, is called a frequency distribution; when presented in tabular form it is called a frequency table.
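The tabulation described above can be sketched in a few lines of Python. The observations below are hypothetical values chosen only for illustration; they do not come from these notes.

```python
from collections import Counter

# Hypothetical raw observations (illustrative only, not from the notes).
observations = [2, 3, 3, 5, 2, 3, 4, 5, 3, 2]

# Count how often each distinct value occurs, then sort by value.
frequency = Counter(observations)
table = sorted(frequency.items())

print("Value  Frequency")
for value, count in table:
    print(f"{value:>5}  {count:>9}")
print(f"Total  {sum(frequency.values()):>9}")
```

Each row of the printed table pairs a value with its frequency, and the total equals the number of observations.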
If class intervals are not given, the distribution is called a discrete frequency distribution. For example :

S.No. | Number of items | Number of packings
Total |                 | 63

If class intervals are given, the distribution is called a continuous frequency distribution. For example :

Further, class intervals are divided into two categories : exclusive and inclusive.

i) A class interval that does not include the upper class limit is called an exclusive class interval, e.g.

S.No. | Marks | Number of students

ii) A class interval that includes the upper class limit is called an inclusive class interval, e.g.

Note : In the above tables, the class ‘0-20’ does not include the value 20 (i.e. the upper limit), whereas the class ‘1-20’ includes the value 20 (i.e. the upper limit).

Frequency distributions are of two types :

i) Grouped frequency distribution
ii) Ungrouped frequency distribution

i) Grouped frequency distribution :
Grouped frequency distributions are used when continuous variables, such as salary etc., are examined. Many measures taken during data collection, including body temperature, weight, score and time, are measured on a continuous scale. In this way the data may be divided into a number of groups or classes, usually called class intervals.

For example : Grouped frequency distribution :



Family income (variable) | Number of persons (N) | Percentage

Note : The above example is a good example of open-end class intervals.

ii) Ungrouped frequency distribution :

Categorical data are mostly presented in the form of an “ungrouped frequency distribution”, in which a table is developed to display every value obtained for a particular variable. This approach is used for discrete rather than continuous data. Examples of data commonly organised in this manner include gender, ethnicity, marital status, diagnostic category of study subjects and values obtained from the measurement of variables. The following table is an example of an ungrouped frequency distribution of subjects' characteristics.

A percentage distribution indicates the percentage of the sample whose scores fall into a specific group, along with the number of scores in that group. Percentage distributions are particularly useful for comparing the present data with findings from other studies that have different sample sizes. A cumulative frequency distribution is a type of percentage distribution in which the percentages and frequencies of scores are summed as one moves from the top of the table to the bottom. Thus the bottom category has a cumulative frequency equal to the sample size and a cumulative percentage of 100.
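As a minimal sketch of how the cumulative columns are built, the Python snippet below accumulates frequencies and percentages from top to bottom. The scores and frequencies are hypothetical and serve only to illustrate the running totals.

```python
from itertools import accumulate

# Hypothetical scores and frequencies (illustrative only).
scores = [1, 2, 3, 4, 5]
frequencies = [2, 5, 8, 4, 1]

n = sum(frequencies)  # sample size

# Running totals from the top of the table to the bottom.
cumulative_frequency = list(accumulate(frequencies))
percentage = [100 * f / n for f in frequencies]
cumulative_percentage = [100 * cf / n for cf in cumulative_frequency]

for row in zip(scores, frequencies, percentage,
               cumulative_frequency, cumulative_percentage):
    print(row)
```

The last row carries a cumulative frequency equal to the sample size and a cumulative percentage of 100, as stated above.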


Example of a cumulative frequency table :

Score | Frequency | Percentage | Cumulative frequency | Cumulative percentage

Bivariate frequency distribution : If the number of variables in a frequency distribution is only two, it is called a bivariate frequency distribution.

Multivariate frequency distribution : A frequency distribution of more than two variables is known as a multivariate frequency distribution.

Note : A bivariate frequency distribution has two marginal distributions, i.e. the last row and the last column of the frequency table. It also has ‘m+n’ conditional distributions, i.e. all rows and columns of the frequency table except the marginal distributions.

Table 3 : Conditional distribution of salary for a particular age

Salary        | Age (30-40)
10,000-15,000 | 15
15,000-20,000 | 8
20,000-25,000 | 5
Total         | 28

Note : In the above example the number of rows is 4 and the number of columns is 3, so we have m+n = 4+3 = 7 conditional distributions.


The important forms of frequency distribution graphs are as follows :
i) Line frequency graph
ii) Histogram
iii) Frequency polygon
iv) Frequency curve
v) Cumulative frequency curve or ogive

i) Line frequency graph :

A line frequency graph is used to depict discrete data. In this graph the size of the item is shown on the x-axis and the frequency on the y-axis. Lines are drawn according to the given frequencies of the different sizes.

Example 1. The following table shows the distribution of marks of students. Depict the data on a line frequency graph.

Figure : Line frequency graph
ii) Histogram :

a) Histogram for equal class intervals :

A histogram is a representation of a frequency distribution by a set of rectangular bars with area proportional to the class frequency. In the graph, the classes are shown on the x-axis and the frequencies on the y-axis.

Example 3. Prepare a histogram from the following table :

Marks              | 0-10 | 10-20 | 20-30 | 30-40 | 40-50 | 50-60
Number of students | 4    | 7     | 10    | 15    | 8     | 6

b) Histogram for unequal class intervals :

In the case of unequal class intervals, use the following rules to prepare a histogram :

i) Divide each wider class interval into equal class intervals of the smallest class width.

ii) Calculate the adjusted frequency by dividing the frequency of that class interval by the number of equal intervals it was split into (for example by 2 when the class is twice as wide as the smallest class).

Similarly, follow the procedure for the other unequal class intervals. The class intervals are shown on the x-axis and the frequencies on the y-axis.

Example 4. Prepare a histogram from the following data :

Class interval | 0-10 | 10-20 | 20-40 | 40-70
Frequency      | 5    | 8     | 20    | 24

Solution. Here the smallest class width is 10. In the above example we have two unequal class intervals; one has width 20 (20-40) and the other has width 30 (40-70).

Since the class interval 20-40 is twice the size of the smallest class interval, divide it into two equal class intervals. The frequency of that class, 20, is also divided by 2, i.e. 20/2 = 10.

Similarly, the width of the class 40-70 is thrice the size of the smallest class interval, so its frequency (the height of the rectangle) is divided by 3, i.e. 24/3 = 8. Finally, the adjusted frequencies and class intervals are given below :

Class intervals | 0-10 | 10-20 | 20-30 | 30-40 | 40-50 | 50-60 | 60-70
Frequencies     | 5    | 8     | 10    | 10    | 8     | 8     | 8

Now draw the graph of the above adjusted table.
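The adjustment in Example 4 can be reproduced with a short Python sketch. It assumes, as in the example, that every class width is a whole multiple of the smallest width; the variable names are illustrative.

```python
# Class limits and frequencies from Example 4.
classes = [(0, 10), (10, 20), (20, 40), (40, 70)]
frequencies = [5, 8, 20, 24]

smallest_width = min(upper - lower for lower, upper in classes)

adjusted = []
for (lower, upper), f in zip(classes, frequencies):
    # Number of equal sub-classes this class is split into.
    parts = (upper - lower) // smallest_width
    for k in range(parts):
        start = lower + k * smallest_width
        # Each sub-class receives an equal share of the frequency.
        adjusted.append(((start, start + smallest_width), f / parts))

for interval, height in adjusted:
    print(interval, height)
```

The printed heights are the adjusted frequencies used as rectangle heights in the histogram.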

Figure : Histogram for unequal class intervals

c) Histogram for inclusive series :

To prepare a histogram for an inclusive series, it first has to be converted into an exclusive series.

Example 5. Draw a histogram for the following data :

Marks    | 10-19 | 20-29 | 30-39 | 40-49 | 50-59 | 60-69 | 70-79
Students | 1     | 5     | 7     | 10    | 12    | 6     | ...

Solution. First convert the given inclusive data into exclusive data.

Marks | 9.5-19.5 | 19.5-29.5 | 29.5-39.5 | 39.5-49.5 | 49.5-59.5 | 59.5-69.5 | 69.5-79.5
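The conversion from inclusive to exclusive limits can be sketched as follows. The correction of half the gap between consecutive classes is the usual convention and matches the boundaries 9.5-19.5 etc. shown in Example 5; the variable names are illustrative.

```python
# Inclusive class limits from Example 5.
inclusive = [(10, 19), (20, 29), (30, 39), (40, 49),
             (50, 59), (60, 69), (70, 79)]

# Half the gap between one class's upper limit and the next class's lower limit.
correction = (inclusive[1][0] - inclusive[0][1]) / 2

# Subtract the correction from each lower limit and add it to each upper limit.
exclusive = [(lower - correction, upper + correction)
             for lower, upper in inclusive]

print(exclusive)
```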

d) Histogram for mid-value series

To prepare a histogram for a mid-value series, it first has to be converted into a continuous series.

Example 6. Construct a histogram with the help of the following data :

Mid values (salary) | 50 150_}_ 250 _ | 350 | 450

No. of workers 8 Z 10 14 12
Solution. Here, we convert the given mid-value series into a continuous series with class intervals 0-100, 100-200, 200-300, 300-400 and 400-500.

(iii) Frequency Polygon :
Any figure which has more than four sides is called a polygon. A frequency polygon is a graph in which the values of the variable are taken on the x-axis and the frequencies on the y-axis. It is obtained by joining the mid-points of the tops of the rectangles in a histogram by straight lines.

Example 7. Draw a frequency polygon from the following data :

Age in years    | 10-20 | 20-30 | 30-40 | 40-50 | 50-60
No. of patients | 5     | 20    | 25    | 35    | 10

(iv) Frequency curve :
A frequency curve is a smooth curve. To obtain it, draw a histogram for the given distribution and join the points of the frequency polygon by a smoothed free-hand curve.

Example 8. Draw a frequency curve from the following table :

Interval | 10-20 | 20-30 | 30-40 | 40-50 | 50-60

(v) Cumulative Frequency Curve or Ogive :

A graph can be used to depict a cumulative frequency distribution. For drawing an ogive, an ordinary frequency distribution table is converted into a cumulative frequency table. The cumulative frequencies are then plotted against the upper limits of the classes. The points corresponding to the cumulative frequency at each upper class limit are joined by a free-hand curve. The graph obtained is called an ogive.

The ogive is further classified as “less than ogive” and “more than ogive”.

a) Less than Ogive : For drawing the less than ogive, the frequencies are added cumulatively in increasing order.

b) More than Ogive : For drawing the more than ogive, the cumulative frequencies of the different classes are estimated in diminishing order. The upper limit of a class interval ...


Example 9. Construct the less than ogive and more than ogive from the following table :

Solution. To draw the ogives, we first construct the cumulative frequency table :

Less than ogive          | More than ogive
Marks        | c.f.      | Marks        | c.f.
Less than 10 | 4         | More than 0  | 87
Less than 20 | 4+7=11    | More than 10 | 87-4=83
Less than 30 | 11+9=20   | More than 20 | 83-7=76
Less than 40 | 20+10=30  | More than 30 | 76-9=67
Less than 50 | 30+15=45  | More than 40 | 67-10=57
Less than 60 | 45+22=67  | More than 50 | 57-15=42
Less than 70 | 67+14=81  | More than 60 | 42-22=20
Less than 80 | 81+6=87   | More than 70 | 20-14=6
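Both cumulative columns can be checked with a short Python sketch. The class frequencies 4, 7, 9, 10, 15, 22, 14, 6 are read off the additions shown in the table of Example 9; the variable names are illustrative.

```python
from itertools import accumulate

# Class frequencies implied by the cumulative table of Example 9
# (classes 0-10, 10-20, ..., 70-80).
lower_limits = [0, 10, 20, 30, 40, 50, 60, 70]
upper_limits = [10, 20, 30, 40, 50, 60, 70, 80]
frequencies = [4, 7, 9, 10, 15, 22, 14, 6]

total = sum(frequencies)

# "Less than" c.f.: running total, plotted against the upper limits.
less_than = list(accumulate(frequencies))

# "More than" c.f.: observations at or above each lower limit.
more_than = [total - cf + f for cf, f in zip(less_than, frequencies)]

for u, lt, l, mt in zip(upper_limits, less_than, lower_limits, more_than):
    print(f"Less than {u}: {lt:>3}   More than {l}: {mt:>3}")
```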

A measure of central tendency, or an average, is a value which is used to represent an entire series. The property of concentration of the values around a central value is known as central tendency, and the central value around which the values concentrate is called the measure of central tendency.

By calculating a measure of central tendency we can find a single value to represent the whole data. It also helps us to compare the values of two or more groups.

There are some important definitions of ‘average’ as follows :

i) Clark defines ‘average’ as “an attempt to find one single figure to describe whole of figure”.

ii) A.E. Waugh observed that “an average is a single value selected from a group of values to represent them in some way- a value which is supposed to stand for whole group, of which it is a part, as typical of all the values in the group”.

iii) Leabo defines ‘average’ as a number which is typical of the whole group.

iv) According to Ya-Lun-Chou, “An average is a typical value which is employed to represent all the individual values in a series or of a variable”.

v) Lawrence J. Kaplan has defined the average as one of the most widely used sets of summary figures.

vi) Bowley describes the average as a statistical constant which helps to understand the significance of the whole in a single effort.

i) To obtain a single value that describes the characteristics of the whole group of data.
ii) To help in comparison.
iii) To help establish quantitative relationships between different group averages.
iv) To help in decision making.


Important types of measures of central tendency are given below :

Measures of Central Tendency

Mathematical : (i) Arithmetic mean [(A) Simple arithmetic mean, (B) Weighted arithmetic mean], (ii) Geometric mean, (iii) Harmonic mean
Positional : Median, Mode

Figure : Measures of Central Tendency


(i) Arithmetic Mean
Arithmetic mean is the most commonly used measure of central tendency. Its value is obtained by dividing the sum of the values of the various items in a series by the total number of items. Arithmetic mean is further divided into two types :

A) Simple arithmetic mean

B) Weighted arithmetic mean

A) Simple arithmetic mean : The simple arithmetic mean of a set of values is obtained by dividing the sum of the values by the number of values in the set. It is denoted by $\bar{x}$ or A.M.

Calculation of Mean in an individual series :

There are two methods : Direct method and Short cut method

a) Direct Method :

Let $x_1, x_2, \ldots, x_n$ be the ‘n’ values of the variable x. Then the arithmetic mean is

$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x}{n}$

Example 2. The heights of 10 students in cm are : 160, 162, 175, 158, 156, 169, 173, 192, 165, 167. Find the mean height of the students.

Solution. Here n = 10, then

$\bar{x} = \frac{160 + 162 + 175 + 158 + 156 + 169 + 173 + 192 + 165 + 167}{10} = \frac{1677}{10} = 167.7$ cm

Example 3. Find the mean of x : 10, 8, 15, 12, 2, 9

Solution. Here n = 6 and $\sum x = 10+8+15+12+2+9 = 56$, so $\bar{x} = \frac{56}{6} = 9.33$


b) Short Cut Method :

$\bar{x} = A + \frac{\sum d}{n}$

where
$\bar{x}$ = Arithmetic mean
d = Deviation = x - A (also written dx)
A = Assumed mean
n = Number of data
$\sum d$ = Sum of deviations

Example 4. Find the arithmetic mean, if

x : 101, 106, 125, 150, 110

Solution. Given n = 5, A = 110 (assumed mean)

x       | d = x - A
101     | 101 - 110 = -9
106     | 106 - 110 = -4
125     | 125 - 110 = 15
150     | 150 - 110 = 40
110 (A) | 110 - 110 = 0
Total   | $\sum d$ = 42

$\bar{x} = A + \frac{\sum d}{n} = 110 + \frac{42}{5} = 118.4$
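As a quick check of Example 4, the sketch below computes the mean by the short cut (assumed mean) method and by the direct method; the two must agree. Note that 106 - 110 is -4, so the deviations sum to 42.

```python
# Data and assumed mean from Example 4.
values = [101, 106, 125, 150, 110]
assumed_mean = 110

# Deviation of each value from the assumed mean: d = x - A.
deviations = [x - assumed_mean for x in values]

# Short cut method: mean = A + (sum of d) / n.
mean_shortcut = assumed_mean + sum(deviations) / len(values)

# Direct method for comparison: mean = (sum of x) / n.
mean_direct = sum(values) / len(values)

print(deviations, sum(deviations))
print(mean_shortcut, mean_direct)
```

The choice of assumed mean does not affect the result; a value near the centre of the data simply keeps the deviations small.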


(ii) Step Deviation Method :

We have

$\bar{x} = A + \frac{\sum dx'}{n} \times i$

where
$\bar{x}$ = Arithmetic mean
A = Assumed mean
n = Number of data
i = Common factor (step) of the deviations
$\sum dx'$ = Sum of step deviations, with dx = x - A and dx' = dx/i

Example 5 : Find the mean using the step deviation method from the following data :

Marks (x) : 20 40 60 80 100

x | Deviation (dx) | Step deviation (dx/i) = dx'

Calculation of Mean in a Discrete Series/Continuous Series :

There are three methods to calculate the mean :

(i) Direct method (ii) Short-cut method and (iii) Step deviation method

(i) Direct method : To calculate the mean, follow this procedure :

Step 1 : Multiply the value of each item (x) by its frequency (f) and take the total, say $\sum fx$

Step 2 : Sum all the frequencies, i.e. $\sum f$

Step 3 : Use the formula $\bar{x} = \frac{\sum fx}{\sum f}$ to find the mean

Example 6. From the following data, calculate the mean by the direct method :

Class (x)     | 20 | 30 | 40 | 50 | 60 | 70
Frequency (f) | 5  | 7  | 8  | 10 | 11 | 8

x  | f  | fx
20 | 5  | 100
30 | 7  | 210
40 | 8  | 320
50 | 10 | 500
60 | 11 | 660
70 | 8  | 560
   | $\sum f$ = 49 | $\sum fx$ = 2350

$\bar{x} = \frac{\sum fx}{\sum f} = \frac{2350}{49} = 47.96$
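The three steps of the direct method map directly onto code. The sketch below uses the data of Example 6; the variable names are illustrative.

```python
# Values and frequencies from Example 6.
x = [20, 30, 40, 50, 60, 70]
f = [5, 7, 8, 10, 11, 8]

# Step 1: multiply each value by its frequency and total the products.
sum_fx = sum(fi * xi for fi, xi in zip(f, x))

# Step 2: total the frequencies.
sum_f = sum(f)

# Step 3: mean = sum(fx) / sum(f).
mean = sum_fx / sum_f

print(sum_f, sum_fx, round(mean, 2))
```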

Example 7. Calculate the mean from the following data :

Age | 18-20 | 21-23 | 24-26 | 27-29 | 30-32 | 33-35

(ii) Short Cut Method :

Follow these steps to calculate the mean :

Step 1 : Choose an assumed mean (A)

Step 2 : Calculate the deviation, dx = x - A

Step 3 : Multiply each deviation by its frequency and obtain the sum total, i.e. $\sum f\,dx$

Step 4 : Use the formula to calculate the mean

$\bar{x} = A + \frac{\sum f\,dx}{\sum f}$

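Example 8. From the following distribution of data, calculate the mean by using the short-cut method.

Class (x)       | 10 | 20 | 30 | 40 | 50 | 60
Frequencies (f) | 3  | 2  | 5  | 10 | 11 | 8

Solution. Let the assumed mean A = 40. Then the deviations dx = x - A are -30, -20, -10, 0, 10, 20, so $\sum f = 39$ and $\sum f\,dx = -90 - 40 - 50 + 0 + 110 + 160 = 90$. Hence $\bar{x} = 40 + \frac{90}{39} = 42.31$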

Example 9. From the following distribution of marks obtained by 50 students in quantitative methods, calculate the arithmetic mean.

Marks More than 10 | 20 30 40 50 | a :
No’s of students 50 46 40 20 10 B)
Solution. Here the given data are in cumulative form. First, we convert them into a simple frequency distribution.

We have i = 10, n = 50

Marks | Students | Mid value | Deviation | Step deviation | f.dx'


(iii) Step Deviation Method :

This is an improved form of the short cut method :

Step 1 : Select an assumed mean (A)

Step 2 : Calculate the deviation dx = x - A

Step 3 : Calculate the step deviation dx' = dx/i

Step 4 : Multiply each step deviation by its frequency, i.e. f dx'

Step 5 : Take the sum of f dx' to get $\sum f\,dx'$

Step 6 : Finally, use the following formula

$\bar{x} = A + \frac{\sum f\,dx'}{\sum f} \times i$

where i is the class interval

B) Weighted Arithmetic Mean
The weighted arithmetic mean is the arithmetic mean calculated after assigning weights to the different items in a series according to their relative importance. The formula for the weighted arithmetic mean is given below :

$\bar{X}_w = \frac{\sum WX}{\sum W}$

where $\bar{X}_w$ = Weighted arithmetic mean
W = Weights assigned to the different items
X = Values of the items

In the case of a frequency distribution, if $f_1, f_2, \ldots, f_n$ are the frequencies of the variable values $x_1, x_2, \ldots, x_n$ respectively, then the weighted arithmetic mean is given by

$\bar{X}_w = \frac{\sum W f x}{\sum W f}$

Example 10. Calculate weighted arithmetic mean from the following distribution :

(ii) Geometric Mean

Geometric mean is defined as the nth root of the product of n values. It is denoted by G.M. and defined as

$G.M. = \sqrt[n]{x_1 \cdot x_2 \cdots x_n}$

where $x_1, x_2, \ldots, x_n$ are the various values of the series and n = number of items.
In the case of a large number of items, logarithms are used :

$G.M. = \text{antilog}\left(\frac{\sum \log x}{n}\right)$

(iii) Harmonic Mean :

The harmonic mean of n values is the reciprocal of the mean of the reciprocals of the values. It is denoted by H.M. and defined as

$H.M. = \frac{n}{\sum \frac{1}{x}}$
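A minimal Python sketch of both means follows. The values 2, 4, 8 are hypothetical and chosen so the results are easy to verify by hand; the geometric mean is computed through logarithms, which is the standard way to handle a large number of items.

```python
import math

# Hypothetical positive values (illustrative only).
values = [2, 4, 8]
n = len(values)

# Geometric mean via logarithms: antilog of the mean of the logs.
geometric_mean = math.exp(sum(math.log(x) for x in values) / n)

# Harmonic mean: n divided by the sum of reciprocals.
harmonic_mean = n / sum(1 / x for x in values)

print(geometric_mean, harmonic_mean)
```

Here the product is 64, whose cube root is 4, and the reciprocals sum to 0.875, giving a harmonic mean of about 3.43. Both formulas require strictly positive values.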


(i) Median and (ii) Mode

(i) Median

The median of a set of values is the middle-most value when the data are arranged in ascending or descending order of magnitude. The middle value divides the whole data into two equal parts. The median is denoted by M. It is also called a positional average.

Method to compute the median :

(a) Individual observations

Step 1 : Arrange the data in ascending or descending order.

Step 2 : Allot a serial number to each item.

Step 3 : Use the following formula to calculate the median.

$M = \text{size of } \left(\frac{N+1}{2}\right)^{\text{th}} \text{ item}$

where M = Median and N = Number of items
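The steps for individual observations can be sketched as below. The heights are hypothetical, and for an even number of items the function averages the two middle items, which is the usual reading of a fractional (N+1)/2 position; that convention is an assumption, not stated in these notes.

```python
def median(values):
    """Median of individual observations via the (N+1)/2-th ordered item."""
    ordered = sorted(values)        # arrange in ascending order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd N: the (N+1)/2-th item (1-based) sits at index N // 2.
        return ordered[middle]
    # Even N: (N+1)/2 falls between two items, so average them.
    return (ordered[middle - 1] + ordered[middle]) / 2


# Hypothetical heights of five students in cm (illustrative only).
heights = [160, 152, 170, 165, 158]
print(median(heights))
```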
Example 13. Calculate the median of the following data, which give the heights of five students.


Example 18. Given median = 50.4, N=60. Find the missing term.
Marks 40-44 | 44-48 | 48-52 | 52-56 | 56-60

(ii) Mode :
Mode is the value which has the highest frequency, that is, the item which occurs the largest number of times in a frequency distribution.

“The value of variable which occurs most frequently in a distribution is called mode”.

-Kenney and Keeping

Mode is denoted by $M_0$ and mathematically defined as

$M_0 = L + \frac{d_1}{d_1 + d_2} \times i$

where



L = Lower limit of the modal class
i = Class interval of the modal class
$d_1$ = Difference between the frequency of the modal class and the pre-modal class
$d_2$ = Difference between the frequency of the modal class and the post-modal class

Or $M_0 = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i$

where
$f_1$ = Frequency of the modal class
$f_0$ = Frequency of the pre-modal class
$f_2$ = Frequency of the post-modal class
L = Lower limit of the modal class
i = Class interval of the modal class

Example 19. Find the mode of the given set of data :
46, 47, 48, 47, 40, 50, 97, 52

Solution. The value 47 occurs the highest number of times.

Hence mode = 47

Example 20. Calculate mode (for discrete data)

Wages 145 | 170 | 180 | 190 | 200 | 210

Employees 3 16 § 20 6 2
Solution. The value 190 has the highest frequency (20 employees), so the mode is 190.

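Example 21. Calculate the mode for the following data :

Data      | 8-9 | 9-10 | 10-11 | 11-12 | 12-13 | 13-14 | 14-15
Frequency | 8   | 14   | 21    | 25    | 15    | 10    | 7

Solution. In the above table, the modal class is 11-12, with the highest frequency 25. Here L = 11, i = 1, $d_1 = 25 - 21 = 4$ and $d_2 = 25 - 15 = 10$, so

$M_0 = L + \frac{d_1}{d_1 + d_2} \times i = 11 + \frac{4}{4 + 10} \times 1 = 11.29$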
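The grouped-data formula can be applied to Example 21 with the sketch below. It assumes the modal class is neither the first nor the last class, so that both neighbouring frequencies exist; the variable names are illustrative.

```python
# Lower class limits, common class width and frequencies from Example 21.
lower_limits = [8, 9, 10, 11, 12, 13, 14]
width = 1
frequencies = [8, 14, 21, 25, 15, 10, 7]

# Modal class: the class with the highest frequency.
m = frequencies.index(max(frequencies))

d1 = frequencies[m] - frequencies[m - 1]  # modal minus pre-modal frequency
d2 = frequencies[m] - frequencies[m + 1]  # modal minus post-modal frequency

# Mode = L + d1 / (d1 + d2) * i
mode = lower_limits[m] + d1 / (d1 + d2) * width

print(lower_limits[m], d1, d2, round(mode, 2))
```

With L = 11, d1 = 4 and d2 = 10 the mode is 11 + 4/14, which lies inside the modal class 11-12 as expected.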

Dispersion of the data is the degree to which numerical data tend to spread about an average value. The variability in the data can be analysed with the help of measures of dispersion.

In sciences like physics and chemistry there is not so much variability as is found in medicine and biology. We can say that the occurrence of variability is a biological phenomenon.


There are three main types of variability :

(i) Biological variability : Individuals in similar environments differ when compared as regards sex, class and other properties, but the difference noted may be small and is said to occur by chance. This type of difference or variability is called biological variability.

(ii) Real variability : When the difference between two readings, observations, or values of classes or samples is more than the defined limits in the universe, it is known as real variability.

(iii) Experimental variability : Error, difference or variation may be due to the materials, methods or procedures employed in the study, or to defects in the techniques involved in the experiment.

These errors are further of three types :

1) Observer error

2) Instrumental error

3) Sampling error

1) Observer error :
Observer error may be subjective or objective.

a) Subjective observer error : Unless trained properly, an interviewer may alter some information while noting human behaviour, thereby adding a number of errors. He may ask embarrassing questions which the person may not like to answer, such as questions on menstrual history, pregnancy, use of family planning methods etc. Some subjects are very keen to respond, while others do not wish to give any information.

b) Objective observer error : This may be added by an untrained observer while recording measurements such as blood pressure, pulse rate etc.

2) Instrumental error : This may be negligible or gross. Defects in weighing machines, height measures, sphygmomanometers and other tools may cause undesirable variability or error in observation, leading to wrong conclusions.

Note : Observer and instrumental errors are sometimes called non-sampling errors.

3) Sampling error : A sample should not be biased or too small to draw conclusions from. It should be representative and sufficiently large for statistical tests. Hospital-based studies are mostly biased because the patients under study are drawn from the poor, influential or nearby sections of society.

Error occurs when the sample of the study is not a true representative of the population, and it may lead to wrong conclusions.

Experimental variability due to observer, instrument and sampling defects is not unusual but a common occurrence, about which one must be careful in any scientific study so that the bias may be minimised.

The main objectives of measuring dispersion are given below :

i) To determine the reliability of an average.

ii) To serve as a basis for the control of variability.

iii) To compare two or more samples with regard to their variability.

iv) To facilitate the use of other statistical measures.


3.2.1 Range

3.2.2 Interquartile Range (or Quartile deviation)

3.2.3 Mean Deviation

3.2.4 Variance and Coefficient of variance
3.2.5 Standard Deviation

3.2.1 RANGE
Range is defined as the difference between the highest and the lowest value in the sample. The relative measure of range is known as the “coefficient of range”.

Mathematically, if H is the highest value and L is the lowest value, then

Range (R) = H - L

and coefficient of range = $\frac{H - L}{H + L}$

Advantages of Range

i) The range is very simple to understand.

ii) It is easy to calculate.

iii) It is very helpful in statistical quality control and weather forecasting.

Disadvantages of Range

i) It is not suitable for thorough analysis.

ii) It is affected by the extreme values in a sample.

Example 1. Calculate the range and the coefficient of range from the following data regarding Hb% of 10 patients.

Solution. Here H = 45 and L = 5, so Range = H - L = 45 - 5 = 40

and coefficient of range = $\frac{H - L}{H + L} = \frac{45 - 5}{45 + 5} = \frac{40}{50} = 0.8$

The inter-quartile range of a group of observations is the interval between the values of the upper quartile and the lower quartile for that group. The upper quartile of a group is the value above which 25% of the observations fall. The lower quartile is the value below which 25% of the observations fall. This measure gives the range which covers the middle 50% of the observations in the group. If the lower quartile is $Q_1$ and the upper quartile is $Q_3$, then

i) Inter-quartile range = $Q_3 - Q_1$, the difference between the third and first quartiles.

ii) Quartile deviation or semi-inter-quartile range is defined as

$Q.D. = \frac{Q_3 - Q_1}{2}$

iii) Coefficient of semi-inter-quartile range (coefficient of quartile deviation) is defined as

$\frac{Q_3 - Q_1}{Q_3 + Q_1}$

Advantages of Quartile Deviation

i) It is easy to understand and to calculate.

ii) It is unaffected by the extreme values.

iii) It is quite satisfactory when only the middle half of the group is dealt with.

Disadvantages of Quartile Deviation

i) It ignores 50% of the observations (the extreme values).

ii) It is not suitable for algebraic treatment.

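Example 3. Calculate the inter-quartile range, quartile deviation and coefficient of quartile deviation from the following table giving the heights of students.

Height in inches              | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66
Frequencies (No. of students) | 21 | 25 | 28 | 18 | 20 | 22 | 24 | 23 | 18

Solution. Here N = 199 and the cumulative frequencies are 21, 46, 74, 92, 112, 134, 158, 181, 199.

$Q_1$ = size of the $\frac{N+1}{4}$th = 50th item = 60

$Q_3$ = size of the $\frac{3(N+1)}{4}$th = 150th item = 64

i) Inter-quartile range = $Q_3 - Q_1$ = 64 - 60 = 4

ii) Quartile deviation = $\frac{Q_3 - Q_1}{2} = \frac{4}{2} = 2$

iii) Coefficient of quartile deviation = $\frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{64 - 60}{64 + 60} = \frac{4}{124} = 0.032$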
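The quartiles of Example 3 can be located from the cumulative frequencies as in the sketch below. It assumes the positional rule (N+1)/4 and 3(N+1)/4 for a discrete series, in line with the (N+1)/2 rule used for the median; the helper name `item_at` is illustrative.

```python
from itertools import accumulate

# Heights (inches) and frequencies from Example 3.
heights = [58, 59, 60, 61, 62, 63, 64, 65, 66]
frequencies = [21, 25, 28, 18, 20, 22, 24, 23, 18]

n = sum(frequencies)
cumulative = list(accumulate(frequencies))


def item_at(position):
    """Value of the position-th item (1-based) in the ordered series."""
    for value, cf in zip(heights, cumulative):
        if position <= cf:
            return value
    raise ValueError("position beyond the last item")


q1 = item_at((n + 1) / 4)       # 50th item
q3 = item_at(3 * (n + 1) / 4)   # 150th item

iqr = q3 - q1
quartile_deviation = iqr / 2
coefficient = (q3 - q1) / (q3 + q1)

print(q1, q3, iqr, quartile_deviation, round(coefficient, 3))
```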


Mean deviation is defined as the average (mean) of the absolute deviations of the values from a measure of central tendency (i.e. mean, median or mode).

Method :

Step 1 : Define the data as x.

Step 2 : Calculate the arithmetic mean as $\bar{x} = \frac{\sum x}{N}$

Step 3 : Find the deviation of each observation from the mean, $dx = x - \bar{x}$

Step 4 : Ignore the negative signs of the deviations, denote them by |dx| and take their sum $\sum |dx|$

Step 5 : Apply the formula

Mean Deviation = $M.D. = \frac{\sum |dx|}{N}$

where N = Total number of values

In the case of a frequency distribution

Mean deviation = $\frac{\sum f\,|x - \bar{x}|}{N}$, where $N = \sum f$

and x is the mid-point of the class interval and f is the frequency.

Advantages of mean deviation :

i) Mean deviation is easy to understand and calculate.
ii) It can be calculated from any measure of central tendency.
iii) It is less affected by the extreme items.
iv) It is based on measurement, not on estimation.

Disadvantages of mean deviation :

i) It ignores the sign of the deviations.
ii) It is not suitable for accurate and further analysis.
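The steps of the mean deviation method can be sketched as follows. The data are hypothetical and were picked so that the arithmetic is easy to follow; the deviations are taken from the arithmetic mean.

```python
# Hypothetical observations (illustrative only).
values = [4, 6, 8, 10, 12]
n = len(values)

# Step 2: arithmetic mean.
mean = sum(values) / n

# Steps 3-4: deviations from the mean with the sign ignored.
absolute_deviations = [abs(x - mean) for x in values]

# Step 5: mean deviation = sum of |dx| divided by N.
mean_deviation = sum(absolute_deviations) / n

print(mean, absolute_deviations, mean_deviation)
```

Here the mean is 8, the absolute deviations are 4, 2, 0, 2, 4, and the mean deviation is 12/5 = 2.4.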


Example 4. Find the mean deviation from the following data :

Example 5. Find the mean deviation and coefficient of mean deviation from the mean for the following table :


The square of the standard deviation is called the variance. It has a significant role in inferential statistics. It is denoted by $\sigma^2$ or $(S.D.)^2$ and defined as

(i) Variance = $\sigma^2 = \frac{\sum (x - \bar{x})^2}{N}$ (for raw data)

(ii) Variance = $\sigma^2 = \frac{\sum f\,(x - \bar{x})^2}{\sum f}$ (for frequency distribution)

COEFFICIENT OF VARIANCE

Coefficient of variance (CV) is used to compare the variability of one character in two different groups having different magnitudes of values, or of two characters in the same group, by expressing it as a percentage.

It is calculated from the standard deviation and mean of the characteristic. The ratio of the standard deviation to the mean is expressed as a percentage :

$CV = \frac{S.D.}{\text{Mean}} \times 100$

USES OF COEFFICIENT OF VARIANCE

(i) The series to be compared are expressed in the same units and have equal or nearly equal means.
(ii) The two series may be expressed in the same units, but their standard deviations and means may be different.
(iii) The two series to be compared are expressed in different units.

Example 8. The means and standard deviations of the numbers of students of two schools A and B are given below :

School | Mean | S.D.
A      | 450  | 52
B      | 470  | 55

Compare the variability of the numbers of students in the two schools.

Solution. We know that the coefficient of variation $CV = \frac{S.D.}{\text{Mean}} \times 100$

CV of school A = $\frac{52}{450} \times 100 = 11.56$

CV of school B = $\frac{55}{470} \times 100 = 11.70$

Here, the variability of both schools is nearly equal.
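The comparison in Example 8 can be reproduced with a few lines of Python; the dictionary layout is only an illustrative way to hold the two schools' figures.

```python
# (mean, standard deviation) for each school, from Example 8.
schools = {"A": (450, 52), "B": (470, 55)}

# Coefficient of variation = S.D. / mean * 100.
cv = {name: 100 * sd / mean for name, (mean, sd) in schools.items()}

for name, value in cv.items():
    print(f"CV of school {name} = {value:.2f}")
```

The two coefficients are about 11.56 and 11.70, confirming that the relative variability of the two schools is nearly equal.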


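Standard deviation is the square root of the arithmetic mean of the squared deviations of the items taken from the arithmetic mean. It is the measure of dispersion most commonly used in statistical analysis.

Method to calculate S.D.

Step 1 : Calculate the mean $\bar{x}$
Step 2 : Find the deviation of each observation from the mean, i.e. $dx = x - \bar{x}$
Step 3 : Take the square of these deviations, i.e. $dx^2$
Step 4 : Take the summation of these squared deviations, i.e. $\sum dx^2$
Step 5 : Apply the formula

$S.D. (\sigma) = \sqrt{\frac{\sum dx^2}{N}}$

or $S.D. (\sigma) = \sqrt{\frac{\sum (x - \bar{x})^2}{N}}$

In the case of a frequency distribution, we have

$S.D. (\sigma) = \sqrt{\frac{\sum f\,(x - \bar{x})^2}{\sum f}}$

Shortcut Method : To calculate S.D.

$S.D. (\sigma) = H \times \sqrt{\frac{\sum f d^2}{N} - \left(\frac{\sum f d}{N}\right)^2}$

where H = Class interval, $d = \frac{x - A}{H}$, A = Assumed mean and $N = \sum f$

USES OF STANDARD DEVIATION :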

(i) It describes the variation (deviation) of a large distribution from the mean, that is, it is used as a unit of variation.

(ii) It indicates whether the difference of an individual from the mean is by chance (i.e. natural) or real, due to some special reason.

(iii) It helps to find the standard error, which determines whether the difference between the means of samples is by chance or real.

(iv) It also helps in finding a suitable sample size for valid conclusions.

Example 6. Find the standard deviation from the following table :

$S.D. = \sqrt{5.84} = 2.416$

Example 7. In a survey of 150 families in a village, the following distribution of ages of children was found :

Ages of children | 0-2 | 2-4 | 4-6 | 6-8 | 8-10
No. of families  | 40  | 32  | 25  | 23  | 30

Find the mean and standard deviation of the given distribution.
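Solution. Let the assumed mean (A) = 5 and class interval H = 2

Class interval | Mid value (x) | Frequency (f) | d = (x - A)/H | d² | fd  | fd²
0-2            | 1             | 40            | -2            | 4  | -80 | 160
2-4            | 3             | 32            | -1            | 1  | -32 | 32
4-6            | 5             | 25            | 0             | 0  | 0   | 0
6-8            | 7             | 23            | 1             | 1  | 23  | 23
8-10           | 9             | 30            | 2             | 4  | 60  | 120
Total          |               | N = 150       |               |    | $\sum fd$ = -29 | $\sum fd^2$ = 335

Now

Mean = $A + H \times \frac{\sum fd}{N}$ (where $N = \sum f$)

= $5 + 2 \times \frac{-29}{150}$

= 5 - 0.387 = 4.613

$S.D. = \sigma = H \times \sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2}$ (Short cut method)

= $2 \times \sqrt{\frac{335}{150} - \left(\frac{-29}{150}\right)^2}$

= $2 \times \sqrt{2.2333 - 0.0374}$

= $2 \times \sqrt{2.1960}$

= 2 × 1.4819 = 2.964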
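The short cut calculation for Example 7 can be verified with the sketch below, which also cross-checks the mean by the direct method. With the frequencies in the table (40, 32, 25, 23, 30) the last row contributes fd = 60 and fd² = 120, so the column totals work out to Σfd = -29 and Σfd² = 335.

```python
import math

# Mid values and frequencies from Example 7; A and H as chosen in the solution.
mid_values = [1, 3, 5, 7, 9]
frequencies = [40, 32, 25, 23, 30]
A, H = 5, 2

n = sum(frequencies)

# Step deviations d = (x - A) / H; exact integers for this data.
d = [(x - A) // H for x in mid_values]

sum_fd = sum(f * di for f, di in zip(frequencies, d))
sum_fd2 = sum(f * di ** 2 for f, di in zip(frequencies, d))

mean = A + H * sum_fd / n
sd = H * math.sqrt(sum_fd2 / n - (sum_fd / n) ** 2)

# Cross-check the mean by the direct method, sum(fx) / sum(f).
direct_mean = sum(f * x for f, x in zip(frequencies, mid_values)) / n

print(sum_fd, sum_fd2, round(mean, 3), round(sd, 3))
```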


(i) Quartile deviation is (2/3) of the standard deviation

$Q.D. = \frac{2}{3}\sigma$ or $3\,Q.D. = 2\sigma$

(ii) Quartile deviation is (5/6) of the mean deviation

$Q.D. = \frac{5}{6} M.D.$ or $6\,Q.D. = 5\,M.D.$

(iii) Mean deviation is (4/5) of the standard deviation

In correlation, we define the relationship between two continuous (or measurement) variables. The main goal of a correlation study is to understand the nature and strength of the linear association between two quantitative parameters.

1) If two variables are inter-related in such a manner that a change in one variable brings about a change in the other variable, then this type of relation between the variables is known as correlation.

2) If a change in the value of one variable makes a corresponding change, on average, in the value of the other variable, then we can say the two variables are correlated. The value of the correlation coefficient varies from -1 to +1.

Correlation can be classified into three categories :
1) Positive, negative and zero correlation.
2) Linear and non-linear correlation
3) Simple, Partial and Multiple Correlation

POSITIVE, NEGATIVE AND ZERO CORRELATION

If the values of two variables move in the same direction, i.e. if the value of one variable increases (or decreases) and the value of the other variable also increases (or decreases) on an average, then the correlation is said to be positive, e.g. Height and Weight (as height increases, weight also increases).
If the value of one variable increases (or decreases) while the value of the other variable decreases (or increases) on an average, or, put simply, if the values of the two variables move in opposite directions, then the correlation is said to be negative.

If a change in the value of one variable does not affect the value of the other variable, then the correlation is zero.

Example of negative correlation

X 1 2 3 4 5 6

Y 70 60 50 40 30 20
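
The direction of correlation in the table above can be read off from the sign of the summed products of deviations from the two means. The following is a minimal sketch; the function name is hypothetical and the data are the X and Y rows shown above.

```python
def correlation_direction(xs, ys):
    """Classify correlation as positive, negative or zero from the sign of
    the sum of products of deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if s > 0:
        return "positive"
    if s < 0:
        return "negative"
    return "zero"

x = [1, 2, 3, 4, 5, 6]
y = [70, 60, 50, 40, 30, 20]
print(correlation_direction(x, y))  # negative: Y falls by 10 for every unit rise in X
```
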
Example of Positive Correlation


If the change in the value of one variable bears a constant ratio to the change in the value of the other variable, then such a relation is known as linear correlation.

Example 1. In a scatter diagram, all the points lie on a straight line (Figure 1).

The correlation is said to be non-linear if the change in the value of one variable does not bear a constant ratio to the change in the value of the other variable.

Example 2. Draw the scatter diagram of the following table.

If we study the relationship between two variables X and Y (say), then it is called simple correlation, e.g. Height and Weight.

If we study the relationship between two variables, keeping all the other variables constant, then it is called partial correlation.

If we study the relationship between more than two variables, then it is said to be multiple correlation. In multiple correlation we measure the degree of relationship between one variable on one side and the combined effect of all the other variables on the other side.

Note : Since the value of correlation lies between -1 and +1 (i.e. $-1 \le r \le +1$), then

(i) If r > 0, we say that there is a positive correlation between the variables.

(ii) If r < 0, we say that there is a negative correlation between the variables.

(iii) If r = 0, we say that there is no correlation between the variables.




Methods of studying correlation :
Graphical Methods : 1. Scatter Diagram 2. Correlation Graph
Algebraic Methods : 1. Karl Pearson’s Coefficient of Correlation 2. Spearman’s Rank Difference Method 3. Concurrent Deviation Method 4. Two-way Frequency Table Method

The correlation between variables can be studied in the following ways :
(i) Scatter diagram
(ii) Correlation graph
(iii) Karl Pearson’s Coefficient of Correlation
(iv) Spearman’s Rank Difference Method
(v) Concurrent Deviation Method
(vi) Two-way frequency table method
(i) Scatter Diagram :
In the graphical study of correlation between two variables, we first draw a scatter diagram, for which we take the values of one variable on the x-axis and the values of the other variable on the y-axis. The resulting graph of scattered points or dots on a graph sheet is known as a scatter diagram.

There are various types of scatter diagram :
(i) Perfect Positive Correlation : If all the points lie on a straight line in the upward direction (left bottom to right top), the correlation is perfect positive (r = +1).

(ii) Highly Positive : If all the points are very near to a straight line in the upward direction, the correlation is highly positive.
Fig. 3 Highly positive

(iii) Positive Correlation : If all the points are near to the straight line (but not very near), the correlation is positive.

(iv) Perfect Negative : If all the points in a scatter diagram lie on a straight line in the downward direction (left top to right bottom), the correlation is perfect negative (r = -1).
(v) High Negative : If the points are very close to a straight line in the downward direction, the correlation is high negative.

(vi) Negative : If the points are close to a straight line (not very close) in the downward direction, the correlation is negative.
(vii) Zero correlation : If the points are widely scattered in the graph, the correlation is said to be zero.

Example 3. Following table shows the values of the variables X and Y

Draw the scatter diagram and find the correlation between the variables

Sol. First we draw the scatter diagram

From the above diagram, we can say that the correlation between the variables X and Y is positive, because all the plotted points are near to the straight line.

(ii) Correlation Graph :
In this method, the individual values of the two variables are plotted on the graph sheet, which gives two different curves. By examining the properties of the plotted points, we conclude whether the variables are correlated or not.

Example 4. Draw the diagram and examine the correlation between variables X and Y.
Data are given in the following table :

Year 1990 1995 2000 2005 2010
X 5 7 6 6 8

Y 1 4 5 4 7
Sol. First we draw the graph between variables

From the graph, we can say that the variables are closely related to each other.
Merits and Demerits of Graphical Method :
Merits :
a) It is a popular method of measuring the relationship between two variables.
b) It is the easiest method and does not involve any mathematical calculation.
c) Everyone can easily understand and examine it.
Demerits :
a) We cannot obtain the degree of correlation.
b) The graphical method is suitable only for a small number of data.
(iii) Karl Pearson’s Coefficient of Correlation :
Karl Pearson’s Coefficient of Correlation is used to measure the degree of linear relationship between two variables. It is also called the product moment correlation coefficient. It is denoted by ‘r’ and defined as

$r = \frac{\sum XY}{N \sigma_x \sigma_y}$

Where $X = x - \bar{x}$ and $Y = y - \bar{y}$

N = number of pairs of values of the variables
σ = Standard deviation

Another form of the correlation coefficient is

$r = \frac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}}$

or $r = \frac{Cov.(XY)}{S.D.(x) \times S.D.(y)}$

Where $Cov.(XY) = \frac{\sum XY}{N}$

S.D.(x) = Standard deviation for x series

S.D.(y) = Standard deviation for y series.

Merits and Demerits of Karl Pearson’s Coefficient of Correlation.

Merits :
i) It is important method to give a precise and quantitative result with a meaningful


ii) It also gives the direction (i.e. positive or negative) as well as the degree of the correlation between the variables.

Demerits :

i) This method is time consuming.

ii) The value of the correlation coefficient is limited to the range $-1 \le r \le +1$.

Example 5. Following data gives the height of father and son in inches. Find the Karl
Pearson’s Coefficient of Correlation

Height of Father X | 65 | 66 | 67 | 67 | 69 71

Height of Son Y | 67 | 68 | 64] 68 | 70 69
Sol. We know that

Karl Pearson’s Coefficient of Correlation (r=


lation between X and Y.
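
As an illustrative check of Example 5, the following Python sketch evaluates Karl Pearson’s coefficient for the six father and son height pairs as they appear in the table above (the printed table may be incomplete, so treat the result as a check of the method rather than of the book’s answer). The function names are hypothetical; both forms of the formula are computed so that they can be compared.

```python
import math

father = [65, 66, 67, 67, 69, 71]   # X series as printed above
son = [67, 68, 64, 68, 70, 69]      # Y series as printed above

def pearson_r(xs, ys):
    """r = sum(XY) / sqrt(sum(X^2) * sum(Y^2)), with X, Y deviations from the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    X = [x - mean_x for x in xs]
    Y = [y - mean_y for y in ys]
    sum_xy = sum(a * b for a, b in zip(X, Y))
    sum_x2 = sum(a * a for a in X)
    sum_y2 = sum(b * b for b in Y)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)

def pearson_r_cov(xs, ys):
    """r = Cov(X, Y) / (S.D.(x) * S.D.(y)), all with divisor N."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)

print(round(pearson_r(father, son), 4))
print(round(pearson_r_cov(father, son), 4))
```

The two functions agree because the factor N cancels between the covariance and the two standard deviations.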
(iv) Spearman’s Rank Coefficient of Correlation
It is a method of finding the correlation between two variables by taking their ranks.

This method is especially useful in dealing with qualitative data. We use it if the relative positions or ranks of the magnitudes are given, but the actual magnitudes of the variables are not given. It is denoted by ρ (rho) and defined as

$\rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \sum d^2}{n^3 - n}$

n = The number of pairs of observations

$\sum d^2$ = Sum of squares of differences of corresponding ranks

There are two cases to calculate the rank correlation:

A) There is no tie

B) There is tie

Case (A) : Rank correlation with no tie.

In this case, none of the values in the x-series or y-series is repeated. So we can use the following steps for finding the rank correlation with no tie.

Step 1 : Give rank one to the highest value, rank two to the next highest and so on.
Step 2 : Rank the x-series values and the y-series values separately.

Step 3 : Calculate the difference of ranks in each pair of values, i.e. $d = R_x - R_y$

Step 4 : Calculate the sum of squares of all d’s, i.e. $\sum d^2$

Step 5 : Use the formula

$\rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$
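
The five steps above can be sketched in Python as follows. This is a minimal sketch for the no-tie case only; the function names and the two score lists are hypothetical and are not data from the text.

```python
def ranks_descending(values):
    """Steps 1-2: rank 1 for the highest value, 2 for the next highest, and so on.
    Assumes no ties; repeated values need the correction factor of Case (B)."""
    if len(set(values)) != len(values):
        raise ValueError("tied values: use the tie-corrected formula instead")
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """Steps 3-5: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    rank_x = ranks_descending(xs)
    rank_y = ranks_descending(ys)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Hypothetical scores of five students in two tests.
test_1 = [10, 20, 30, 40, 50]
test_2 = [12, 25, 21, 48, 39]
print(spearman_rho(test_1, test_2))  # ranks differ by 0, 1, -1, 1, -1, so sum(d^2) = 4
```
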
Example 6. Find the rank correlation of following data of marks in two subjects of

seven students :

Research methodology | 90 82 81 71 63 49 38

Quantitative techniques} 75 7) 72. 70 40 50 43

Sol. Let the subject “Research Methodology” be denoted by X and “Quantitative Techniques” be denoted by Y.

Now, use the rank correlation formula

Since $-1 \le \rho = 0.86 \le 1$, the marks in the two subjects are correlated.

Case (B) : Rank correlation with ties :

In this case, one or more values in the x-series or y-series are repeated, so we have to apply the correction factor (C.F.)

Where R, = Number of repetition of rank

Example 7. Two teachers rank five medical students based on their intelligence and the data are given below :

Students 1 2 3 P ai

Teacher A 5 4 25 i 3 8

Teacher B 5 4 > ; 2.5

Do you agree that the two teachers A and B have the same degree of agreement on judging students based on their intelligence?

Sol. Here the ranks of intelligence are given, so re-ranking the values is not required. The rank given by teacher A is considered as the x-series ($R_x$) and the rank given by teacher B is considered as the y-series ($R_y$).


r = Coefficient of Correlation
n = Numbers of pairs of observations

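Example 9. Find the probable error of the coefficient of correlation for the problem in Example 6.

Sol. We have r = 0.86 and n = 7

$P.E. = 0.6745 \times \frac{1 - r^2}{\sqrt{n}} = 0.6745 \times \frac{1 - 0.7396}{\sqrt{7}} = 0.0664$

Since r = 0.86 is greater than 6 P.E. (= 0.398), the value of r is highly significant.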
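
The significance check in Example 9 can be reproduced with a short Python sketch. It assumes the usual textbook definitions, S.E. = (1 - r²)/√n and P.E. = 0.6745 × S.E.; the function names are hypothetical.

```python
import math

def standard_error(r, n):
    """Standard error of the correlation coefficient: (1 - r^2) / sqrt(n)."""
    return (1 - r ** 2) / math.sqrt(n)

def probable_error(r, n):
    """Probable error: 0.6745 times the standard error."""
    return 0.6745 * standard_error(r, n)

r, n = 0.86, 7                      # values from Example 6
pe = probable_error(r, n)
print(round(pe, 4))                 # probable error of r
print(r > 6 * pe)                   # True means r is taken as significant
```

The same two functions can be used in reverse as a check on exercises that give r and P.E. and ask for n.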


It is calculated by the following formula


In multiple correlation, we study the relationship between three or more variables. Suppose the dependent variable is z and both x and y are independent variables. Then the multiple correlation coefficient is defined as

$R_{z.xy} = \sqrt{\frac{r_{zx}^2 + r_{zy}^2 - 2\, r_{zx}\, r_{zy}\, r_{xy}}{1 - r_{xy}^2}}$

In Example 7, since ρ = 0.65 < 1, we can say that the two teachers differ in their assessment of IQ.

(v) Concurrent Deviation Method:

This method is based on the direction of change in the two paired variables. The coefficient of correlation between two series X and Y, based on the direction of change, is called the coefficient of concurrent deviation. It is denoted by $r_c$ and calculated by the following formula :

$r_c = \pm\sqrt{\pm\frac{2c - n}{n}}$

c = Number of positive signs obtained after multiplying the direction of change of the x series by that of the y series.

n = Number of pairs of deviations compared (one less than the number of pairs of observations)
Advantages :

i) It is very simple to understand and easy to calculate.

ii) It is also suitable for a large number of observations.

Example 8. Calculate the coefficient of concurrent deviation from the following data:
Sol. To calculate $r_c$, construct the table of direction of change.

x | Direction of change (DX) | y | Direction of change (DY) | DX.DY
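
A minimal Python sketch of the concurrent deviation method is given below. It rests on two assumptions that should be checked against the worked example: n is taken as the number of pairs of deviations (one less than the number of observations), and a pair in which either series shows no change is not counted as concurrent. The function names and the data are hypothetical.

```python
import math

def sign(value):
    """Return +1, -1 or 0 according to the direction of change."""
    return (value > 0) - (value < 0)

def concurrent_deviation(xs, ys):
    """r_c = +/- sqrt(+/- (2c - n) / n); the sign of (2c - n) is used both
    inside and outside the square root."""
    dx = [sign(b - a) for a, b in zip(xs, xs[1:])]   # direction of change in x
    dy = [sign(b - a) for a, b in zip(ys, ys[1:])]   # direction of change in y
    n = len(dx)                                      # pairs of deviations compared
    c = sum(1 for a, b in zip(dx, dy) if a * b > 0)  # concurrent (same direction) pairs
    inner = (2 * c - n) / n
    return math.copysign(math.sqrt(abs(inner)), inner)

# Hypothetical series: the directions agree in three of the four changes.
x = [10, 12, 11, 15, 18]
y = [5, 7, 6, 9, 8]
print(round(concurrent_deviation(x, y), 4))
```
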

vi) Two-way Table Method

This method is used to examine the relationship between two categorical variables.
The entries in the cells of a two-way table can be displayed as frequency counts or as relative frequencies, or they can be displayed graphically as a segmented bar chart.


The probable error define the interpreted value of th e coefficient of

The coefficient of multiple correlation lies between 0 and 1. If the value of the multiple correlation is one, i.e. ‘1’, then the correlation of the variables is perfect, while if the value of the multiple correlation is zero, i.e. ‘0’, then there is no correlation between the variables.

Remark : Sometimes the multiple correlation is defined as

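Question. Consider $r_{12} = 0.86$, $r_{13} = 0.71$ and $r_{23} = 0.66$ are the zero order correlation coefficients. Then find the multiple correlation coefficient.

Putting the values of $r_{12} = 0.86$, $r_{13} = 0.71$ and $r_{23} = 0.66$ in the formula

$R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2\, r_{12}\, r_{13}\, r_{23}}{1 - r_{23}^2}} = \sqrt{\frac{0.7396 + 0.5041 - 0.8060}{0.5644}} = \sqrt{0.7755} = 0.8806$

Hence, the multiple correlation is 0.8806.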
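
The arithmetic of this question can be verified with a short Python sketch. It assumes the standard three-variable formula for the multiple correlation of variable 1 on variables 2 and 3, with the three given values read as r12, r13 and r23; the function name is hypothetical.

```python
import math

def multiple_correlation(r12, r13, r23):
    """R_1.23 = sqrt((r12^2 + r13^2 - 2*r12*r13*r23) / (1 - r23^2))."""
    numerator = r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23
    return math.sqrt(numerator / (1 - r23 ** 2))

print(round(multiple_correlation(0.86, 0.71, 0.66), 4))  # 0.8806
```
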

(Solved Problems)

Q.1. What is the impact of Biostatistics on pharmacy practice?
(i) Awareness of the standard of medical practice in the present scenario.
(ii) Finding out barriers in order to improve it.
(iii) What kind of services should pharmacists focus on?
(iv) Is there any significant difference between methods, interventions or procedures?
(v) Identifying determinants for a disease, drug related problem, condition etc.
(vi) Any observed source between drug and disease.
(vii) Identifying more cost-effective medicines.
(viii) Making informed decisions.

Q.2. Find the mean value of the following data :

Size (x) | 2 | 4 | 6 | 8 | 10 | 12 | 14 | 16

Frequency (f) | 1 | 3 | 5 | 2 | 2 | 6 | 4 | 2


Now, coefficient of correlation (r) = 2

Q.5. Using Karl Pearson’s method, to calculate the coefficient of correlation between data
and their frequency.


Q.6. Discuss the advantages and disadvantages of Karl Pearson’s method of studying correlation.


Advantage :
(i) This method represents the presence or absence of correlation between

Disadvantage :
(i) Comparatively, it is a difficult method.

(ii) This method is affected by the values of extreme items.
(iii) It is based on many assumptions.





Q.1 Explain frequency distribution and their classification.

Solution : Hint (See section 1.1.6)


Q.1. Define Statistics and their importance.
Solution. Definition : Statistics is the science of collection, presentation, analysis and inter-
pretation of numerical data for logical analysis.
Importance :

(i)Numerical information is available everywhere.

(ii)The knowledge of statistical methods will help you to understand how decisions are
made and give you a better understanding of how they affect you.

Q.2. Define frequency distribution.

Solution : Suppose the value of a variable occurs twice or more in a given series of observations; then the number of occurrences of the value is known as the frequency of that value. The way of tabulating a pool of data of a variable and their respective frequencies side by side is called a frequency distribution.

Q.3. What do you mean by histogram?
Hint : See section 1.5.

Q.4. Define Ogive.
Hint : See section 1.5.

Q.5. Explain the cumulative frequency curve with a suitable example.
Hint : See section 1.5.

Q.6. Draw the frequency polygon of the following data.

Age of workers 20-30 30-40 40-50 50-60 60-70

No. of workers 15 20 26 30 6

Q.7. Draw Ogive for the following distribution
Marks obtained 10-20 20-30 30-40 40-50 50-6


Q.8. Explain the following terms :
a) Weighted A.M. b) Geometric Mean
c) Harmonic Mean d) Mean, Median and Mode
Hint : See section 2.2.1

Q.9. Define a measure of central tendency and explain its objectives.
Hint : See section 2.0

Q.10. Find the mean, median and mode for the following data :

x 15 20 25 30 35 40 45 50

f 7 10 15 20 16 0 3 2

Q.11. Find the S.D. and variance for the following data :

X 4 8 12 16 20

Y 5 6 10 12 9

Q.12. Obtain median and mode, from the frequency distribution.

Q.13. The following distribution is given of 100 students in 10th class examination. Obtain the mean and S.D.
Marks : 1-10 11-20 21-30 31-40 41-50
No. of students : 16 29 30 17 8

Q.14. Define Range and its advantages.
Range : It is defined as the difference between the highest and lowest value in the sample. Also the relative measure of range is known as the “coefficient of range”.
Mathematically, if H is the highest value and L is the lowest value in any sample, then Range = H - L

Advantages :
i) It is easy to calculate.

ii) It is easy to understand.

iii) It is helpful in statistical quality control and weather forecasting.

Q.15. Define Mean deviation.
Ans. Mean Deviation : It is defined as an average or mean of the deviations of the values from central tendency.

Q.16. What do you mean by standard deviation?
Ans. Standard deviation (S.D.) is the square root of the arithmetic mean of the squared deviations of items taken from the arithmetic mean. It is used most commonly in statistical analysis.

Q.17. Discuss the coefficient of variance.

Q.18. Explain the following terms :
a) Measure of dispersion
Hint : See section 3.0
b) Quartile deviation
Hint : See section 3.1.2
c) Method to obtain mean
Hint : See section 3.1.3.

Q.19. Calculate the range for the following data regarding Hb% of 10 patients-

Q.20. The following data gives the height of 5 students in a class in centimeter: 164, 182, 161, 153, 160. Find the standard deviation.
(Ans. σ = 9.6 cm)

Q.21. Calculate the s
Weight 60-62 62-64
students 8 2
(Ans. σ = 3.22)

Q.22. In case of continuous frequency table of Hb%, calculate the standard deviation by the following
Hb% 8-9 9

Q.24. In two series of adults aged 21 years and children 3 months old, the following values were obtained for the height. Find the ratio, which series shows the greater variation.

Ans. Ratio of greater variation = 1.3 : 1.0

Q.25. In a series of boys, the mean blood pressure was 120 and standard deviation was 10. In the same series mean height and standard deviation are 160 cm and 5 cm respectively. Calculate the character, which shows the greater variation.

Ans. Blood pressure shows greater variation and is 2.7 times higher.

Q.26. Define Correlation and discuss its types.

Q.27. Explain Karl Pearson’s method to find the coefficient of correlation.

Q.28. Find Karl Pearson’s correlation coefficient for the following data :

f 10 15 25 35 26

Q.29. Calculate correlation coefficient from the given table

X 10 15 30 25 40 38 29 45

Y 28 20 38 41 33 27 51 39

Q.30. If $r_{12} = 0.99$, $r_{13} = 0.60$ and $r_{23} = 0.55$ are the zero order correlation coefficients. Then find the multiple correlation coefficient.

EXERCISE 1.2

Q.1. Define a measure of Central Tendency?
Sol. According to A.E. Waugh, “An average is a single value selected from a group of values to represent them in some way- a value which is supposed to stand for whole group, of which it is a part, as typical of all the values in the group”.

Q.2. What are the objectives of the measure of central tendency?

Sol. Objectives :

i) To help in decision making.
ii) To obtain a single value that describes the characteristics of the whole group of data.
iii) To help in comparison.
iv) To help to make a quantitative relationship between different group averages.

Q.3. Define the weighted arithmetic mean.
Sol. Weighted arithmetic mean is defined as the calculation of the arithmetic mean by putting the weights to different items. Weighted arithmetic mean is given below :

$\bar{X}_w = \frac{\sum WX}{\sum W}$

Where

$\bar{X}_w$ = Weighted arithmetic mean
W = Weight assigned to different items

Explain the following terms :
a) Geometric mean
Hint : See section 2.2.1
b) Harmonic mean
Hint : See section 2.2.1
c) Mode
Hint : See section 2.2.2



Find the mean, median and mode for the following :
The following distribution is of 100 students in the 10th class examination. Obtain the mean and median.

Marks | 1-10 | 11-20 | 21-30 | 31-40 | 41-50
No. of students | 16 | 29 | 30 | 17 | 8

r = coefficient of correlation
n = Number of pairs of observations

Q.4. Define Standard Error.
Ans. Standard error is denoted by S.E. and defined by the following formula-

$S.E. = \frac{1 - r^2}{\sqrt{n}}$






Q.1. Discuss the coefficient of correlation and its application.
Hint : See article no. 3.4

Q.2. Explain the Spearman’s rank method.
Hint : See article no. 3.4

Q.3. Explain the following terms :
a) Karl Pearson’s Coefficient of Correlation
Hint : See article no. 3.4
b) Correlation Graph
Hint : See article no. 3.4
c) Scatter Diagram
Hint : See article no. 3.4

1. Probable error is-
a) 0.6475 S.E. b) 0.6745 S.E.
c) 0.6754 S.E. d) 0.6547 S.E.

2. Covariance varies from
a) -1 to +1 b) -1 to 0
c) 0 to +1 d) -∞ to +∞

3. Karl Pearson’s coefficient is defined for-
a) Ungrouped data b) Group data
c) Both (a) & (b) d) None of these

4. Rank correlation coefficient was developed by-
a) Karl Pearson b) R.A. Fisher
c) Spearman d) None of these

5. If r = 0.9 and P.E. = 0.032, then the value of n will be
a) 14 b) 15
c) 16 d) 17

6. If r = 0, then Cov(x, y) is equal to
a) +1 b) -1
c) 0 d) None of these

7. A scatter diagram is-
a) A relation between x and y values b) A statistical test
c) Both (a) and (b) d) None

8. If there exists any relation between the sets of variables, it is called-
a) Skewness b) Correlation
c) Linear d) None of these

9. Which of the following is the highest range of r?
a) 0 and 1 b) 1 and 0
c) -1 and +1 d) None of these

10. Which of the following is a formula of Karl Pearson’s Coefficient of Correlation?

a) $\rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$ b) $r = \frac{\sum XY}{N \sigma_x \sigma_y}$

c) r = d) None of these

11. Coefficient of correlation has maximum value-

a) +1 b) -1

c) 0 d) None of these

12. If N = 10, $\sigma_x = 3$, $\sigma_y = 2$ and $\sum XY = 0.8$, then the value of r will be-

a) 0.133 b) 1.333
c) 0.013 d) None of these

13. The Spearman Correlation is used with-

a) Ordinal data b) Nominal data
c) Interval data d) None of these

14. If two variables are absolutely independent of each other the correlation between them
must be-

a) -1 b) +1
c) 0 d) None of these


1. b) 2. d) 3. b) 4. c) 5. c) 6. c) 7. a)
8. b) 9. c) 10. b) 11. a) 12. c) 13. a) 14. c)
