CHAPTER 1 - INTRODUCTION
Statistics is a group of methods used to collect, analyze, present, and interpret data and to make decisions.
Descriptive statistics consists of methods for organizing, displaying, and describing data by using tables, graphs, and summary measures.
Inferential statistics consists of methods that use sample results to help make decisions or predictions about a population.
Probability, which gives a measurement of the likelihood that a certain outcome will occur, acts as a link between descriptive
and inferential statistics.
Probability is used to make statements about the occurrence or nonoccurrence of an event under uncertain conditions.
Population → the collection of all elements of interest.
A population consists of all elements—individuals, items, or objects—whose characteristics are being studied.
The population that is being studied is also called the target population.
Most of the time, decisions are made based on portions of populations.
Sample → the selection of a few elements from this population.
Survey → the collection of information from the elements of a population or a sample.
Census → a survey that includes every element of the target population.
Often the target population is very large.
Hence, in practice, a census is rarely taken because it is expensive and time-consuming.
In many cases, it is even impossible to identify each element of the target population.
Usually, to conduct a survey, we select a sample and collect the required information from the elements included in that sample.
We then make decisions based on this sample information. Such a survey conducted on a sample is called a sample survey.
Representative sample → a sample that represents the characteristics of the population as closely as possible.
Inferences derived from a representative sample will be more reliable.
A sample may be:
• RANDOM → a sample drawn in such a way that each element of the population has a chance of being selected.
If all samples of the same size selected from a population have the same chance of being selected, we call it
simple random sampling.
Such a sample is called a simple random sample.
• NONRANDOM → a sample in which some elements of the population may have no chance of being selected.
A sample may be selected:
o with replacement: each time we select an element from the population, we put it back in the population before we
select the next element.
Thus, in sampling with replacement, the population contains the same number of items each time a selection is made. As
a result, we may select the same item more than once in such a sample.
o without replacement: the selected element is not put back in the population.
In this case, each time we select an item, the size of the population is reduced by one element.
Thus, we cannot select the same item more than once in this type of sampling.
Most of the time, samples taken in statistics are without replacement.
Consider an opinion poll based on a certain number of voters selected from the population of all eligible voters.
In this case, the same voter is not selected more than once. Therefore, this is an example of sampling without replacement.
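The difference between the two selection schemes can be sketched in a few lines of Python. This is only an illustration: the population of ten voter IDs and the fixed seed are made up, and the standard-library functions `random.choices` (with replacement) and `random.sample` (without replacement) stand in for the physical act of drawing elements.

```python
import random

# Hypothetical population of ten voter IDs (made up for illustration).
population = list(range(1, 11))
rng = random.Random(42)  # fixed seed so the sketch is reproducible

# With replacement: the population keeps its size, so repeats are possible.
with_replacement = rng.choices(population, k=5)

# Without replacement: each selected element leaves the population,
# so no element can appear twice.
without_replacement = rng.sample(population, k=5)

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
```

An opinion poll corresponds to the second call: every selected voter appears at most once.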
An element or member of a sample or population is a specific subject or object (for example, a person, firm, item, state, or country) about which the information is collected.
A variable is a characteristic under study that assumes different values for different elements. In contrast to a variable, the value of a constant is fixed.
For some elements in a data set, however, the values of a variable may be the same.
The value of a variable for an element is called an observation or measurement.
A data set is a collection of observations on one or more variables.
TYPES OF VARIABLES
1. Quantitative Variables → a variable that can be measured numerically.
The data collected on a quantitative variable are called quantitative data.
Such quantitative variables may be classified as either discrete variables or continuous variables.
▪ Discrete Variable → a variable whose values are countable; it can assume only certain values, with no intermediate values.
▪ Continuous Variable → a variable that can assume any numerical value over a certain interval or intervals.
2. Qualitative or Categorical Variables → a variable that cannot assume a numerical value but can be classified into
two or more nonnumeric categories.
The data collected on such a variable are called qualitative data.
TYPES OF DATA
Cross-Section Data → data collected for many elements (𝑛 = 1, 2, 3, … , N) at the same point in time or for
the same period of time (𝑡 = 1);
Time-Series Data → data collected on the same element (𝑛 = 1) for the same variable at different points in
time or for different periods of time (𝑡 = 1, 2, 3, … , T);
Panel Data → data collected for many elements (𝑛 = 1, 2, 3, … , N) at different points in time (𝑡 = 1, 2, 3, … , T).
Statistics
Group of methods used to collect, analyze, present, and interpret data and to make decisions.
Descriptive statistics
Collection of methods for organizing, displaying, and describing data using tables, graphs, and summary measures.
Inferential statistics
Collection of methods that help make decisions about a population based on sample results.
Sample
A portion of the population of interest.
Census
A survey that includes all members of the population.
Survey
Collection of data on the elements of a population or sample.
Sample survey
A survey that includes elements of a sample.
Population or target population
The collection of all elements whose characteristics are being studied.
Element or member
A specific subject or object included in a sample or population.
Variable
A characteristic under study or investigation that assumes different values for different elements.
Discrete variable
A (quantitative) variable whose values are countable.
Continuous variable
A (quantitative) variable that can assume any numerical value over a certain interval or intervals.
Quantitative data
Data generated by a quantitative variable.
Quantitative variable
A variable that can be measured numerically.
Observation or measurement
The value of a variable for an element.
Qualitative or categorical data
Data generated by a qualitative variable.
Qualitative or categorical variable
A variable that cannot assume numerical values but is classified into two or more categories.
Random sample
A sample drawn in such a way that each element of the population has some chance of being included in the sample.
Representative sample
A sample that contains the same characteristics as the corresponding population.
Simple random sampling
If all samples of the same size selected from a population have the same chance of being selected, it is called simple random
sampling. Such a sample is called a simple random sample.
Data or data set
Collection of observations or measurements on a variable.
Cross-section data
Data collected on different elements at the same point in time or for the same period of time.
Time-series data
Data that give the values of the same variable for the same element at different points in time or for different periods of time.
CHAPTER 2 - ORGANIZING AND GRAPHING DATA
RAW DATA → data in their original form, typically such a large set that it looks meaningless.
When data are collected, the information obtained from each member of a population or sample is recorded in the sequence
in which it becomes available.
This sequence of data recording is random and unranked.
Such data, before they are grouped or ranked, are called raw data.
ORGANIZING AND GRAPHING QUALITATIVE DATA
Data sets are organized into tables, and data are displayed using graphs.
FREQUENCY DISTRIBUTION → summary of data presented in the form of class intervals and frequencies.
A frequency distribution for qualitative data lists all categories and the number of elements that belong to each of the categories.
Frequency distribution: 𝐀𝐁𝐒𝐎𝐋𝐔𝐓𝐄 𝒇𝒊 → lists all values/categories (𝑥𝑖) and associated number of elements (𝑛𝑖).
Frequency distribution: 𝐑𝐄𝐋𝐀𝐓𝐈𝐕𝐄 𝒓𝒇𝒊 and 𝐏𝐄𝐑𝐂𝐄𝐍𝐓𝐀𝐆𝐄 𝒑𝒊
The relative frequency shows what fractional part or proportion of the total frequency belongs to the corresponding category.
A relative frequency distribution lists the relative frequencies for all categories.
The percentage for a category is obtained by multiplying the relative frequency of that category by 100.
A percentage distribution lists the percentages for all categories.
Relative frequency: $rf_i = \frac{f_i}{\sum_i f_i}$, with $\sum_i rf_i = 1$
Percentage: $p_i = 100 \cdot rf_i$, with $\sum_i p_i = 100$
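As a rough illustration of the three distributions, the sketch below counts categories with the standard-library `collections.Counter` and derives relative frequencies (each frequency divided by the sum of all frequencies) and percentages (relative frequency times 100). The observations and the variable names are made up for the example.

```python
from collections import Counter

# Hypothetical qualitative observations (made up for illustration).
observations = ["A", "B", "A", "C", "B", "A", "A", "C", "B", "A"]

freq = Counter(observations)                         # absolute frequencies
total = sum(freq.values())                           # sum of all frequencies
rel_freq = {c: f / total for c, f in freq.items()}   # relative frequencies
pct = {c: 100 * rf for c, rf in rel_freq.items()}    # percentages

for category in sorted(freq):
    print(category, freq[category], rel_freq[category], pct[category])
```

Up to floating-point rounding, the relative frequencies add up to 1 and the percentages to 100.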
GRAPHICAL PRESENTATION OF QUALITATIVE DATA
→ to show how categories relate to the whole set
• PIE CHART
→ A circle divided into portions that represent the relative frequencies or percentages of a population or a sample
belonging to different categories.
A pie chart is more commonly used to display percentages, although it can be used to display frequencies or relative frequencies.
The whole pie (or circle) represents the total sample or population.
Then we divide the pie into different portions that represent the different categories.
Quantitative discrete variables (few values)
• BAR GRAPHS
→ A graph made of bars whose heights represent the frequencies (absolute, relative, or percentage) of the respective categories.
To construct a bar graph (bar chart), we mark the various categories on the horizontal axis.
All categories are represented by intervals of the same width.
We mark the frequencies on the vertical axis.
Then we draw one bar for each category such that the height of the bar represents the frequency of the corresponding category.
We leave a small gap between adjacent bars.
Quantitative discrete variables
• PARETO CHART
→ A bar chart where the bars are in descending order.
ORGANIZING AND GRAPHING QUANTITATIVE DATA
For quantitative data, an interval that includes all the values that fall within two numbers—the lower and upper limits—is
called a class.
Note that the classes always represent a variable.
The classes are nonoverlapping; that is, each value belongs to one and only one class.
A frequency distribution for quantitative data lists all the classes and the number of values that belong to each class.
Whereas the data that list individual values are called ungrouped data, the data presented in a frequency distribution table
are called grouped data.
The class boundary is given by the midpoint of the upper limit of one class and the lower limit of the next class.
The difference between the two boundaries of a class gives the class width. The class width is also called the class size.
$\text{Class width} = \text{Upper boundary} - \text{Lower boundary}$
$\text{Class midpoint or mark} = \frac{\text{Lower limit} + \text{Upper limit}}{2}$
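A small numerical sketch may help keep limits, boundaries, widths, and midpoints apart. The three classes below are hypothetical, as are the helper and variable names; the computations follow the definitions above (boundary = midpoint between one class's upper limit and the next class's lower limit; width = difference of the two boundaries; midpoint = half the sum of the limits).

```python
# Hypothetical classes given by (lower limit, upper limit); made up for illustration.
classes = [(10, 19), (20, 29), (30, 39)]

def upper_boundary(i):
    """Midpoint of class i's upper limit and the next class's lower limit."""
    return (classes[i][1] + classes[i + 1][0]) / 2

b01 = upper_boundary(0)    # boundary between the first and second class: 19.5
b12 = upper_boundary(1)    # boundary between the second and third class: 29.5

width_middle = b12 - b01   # upper boundary minus lower boundary of the middle class
midpoints = [(lo + hi) / 2 for lo, hi in classes]

print(b01, b12, width_middle, midpoints)
```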
Constructing frequency distribution tables → when constructing a frequency distribution table, we need to make the
following three major decisions:
1. Number of classes
Usually the number of classes for a frequency distribution table varies from 5 to 20, depending mainly on the number of
observations in the data set.
It is preferable to have more classes as the size of a data set increases.
The decision about the number of classes is arbitrarily made by the data organizer.
2. Class width
Although it is not uncommon to have classes of different sizes, most of the time it is preferable to have the same width for all classes.
To determine the class width when all classes are the same size, first find the difference between the largest and the smallest
values in the data.
Then, the approximate width of a class is obtained by dividing this difference by the number of desired classes.
$\text{Approximate class width} = \frac{\text{Largest value} - \text{Smallest value}}{\text{Number of classes}}$
3. Lower limit of the first class or the starting point
Any convenient number that is equal to or less than the smallest value in the data set can be used as the lower limit of the first class.
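The three decisions can be sketched as follows. The data values, the choice of four classes, and the starting point are all hypothetical; rounding the approximate width up to a whole number and treating each class as a half-open interval [lower, upper) are simplifying assumptions of this sketch rather than rules stated in the text.

```python
import math

# Hypothetical quantitative data (made up for illustration).
data = [12, 25, 7, 31, 18, 22, 9, 27, 15, 34, 20, 11]
num_classes = 4   # decision 1: chosen arbitrarily by the data organizer

# Decision 2: approximate width = (largest - smallest) / number of classes.
approx_width = (max(data) - min(data)) / num_classes   # (34 - 7) / 4 = 6.75
width = math.ceil(approx_width)   # assumption: round up to a convenient width

# Decision 3: any convenient number <= the smallest value.
start = 7

# Count how many values fall in each class [lower, lower + width).
table = []
for k in range(num_classes):
    lower = start + k * width
    upper = lower + width
    count = sum(lower <= x < upper for x in data)
    table.append((lower, upper, count))

for lower, upper, count in table:
    print(f"{lower} to less than {upper}: {count}")
```

Checking that the counts add up to the number of observations is a quick way to confirm that the classes cover the data and do not overlap.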
RELATIVE FREQUENCY AND PERCENTAGE DISTRIBUTIONS
$\text{Relative frequency of a class} = \frac{\text{Frequency of that class}}{\text{Sum of all frequencies}}$
Percentage = (Relative frequency) ⋅ 100
GRAPHING GROUPED DATA
Grouped (quantitative) data can be displayed in a histogram or a polygon.
We can also draw a pie chart to display the percentage distribution for a quantitative data set.
A histogram is a graph in which classes are marked on the horizontal axis and the frequencies, relative frequencies, or
percentages are marked on the vertical axis.
The frequencies, relative frequencies, or percentages are represented by the heights of the bars.
In a histogram, the bars are drawn adjacent to each other.
The procedure to construct a pie chart is similar to the one for qualitative data.
→ A histogram shows how the data are distributed, allowing inspection of features of the underlying distribution such as outliers and skewness.
Each contiguous bar represents a class:
o the width is proportional to the class width,
o the area to the relative frequency, and
o the height to the density
A histogram can be drawn for a frequency distribution, a relative frequency distribution, or a percentage distribution.
Quantitative continuous variables (in class)
1) Compute the relative frequency of each class, $rf_i$
2) Compute the width of each class, $w_i$
3) Derive the density as $h_i = \frac{rf_i}{w_i}$
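The three steps can be sketched for classes of unequal width. The classes and frequencies are hypothetical, and the density is taken here as relative frequency divided by class width, which is what the statement "the area represents the relative frequency" implies for a bar of a given width.

```python
# Hypothetical grouped data with unequal class widths (made up for illustration).
# Each entry: (lower boundary, upper boundary, absolute frequency).
classes = [(0, 10, 5), (10, 20, 10), (20, 50, 5)]

total = sum(f for _, _, f in classes)

rows = []
for lower, upper, f in classes:
    rf = f / total       # step 1: relative frequency
    w = upper - lower    # step 2: class width
    h = rf / w           # step 3: density, used as the bar height
    rows.append((lower, upper, rf, w, h))

# Bar areas (height * width) recover the relative frequencies and sum to 1.
total_area = sum(h * w for _, _, _, w, h in rows)

for row in rows:
    print(row)
print("total area:", total_area)
```

The first and last classes hold the same share of the data, yet the wider class gets a much lower bar, which is the point of plotting density instead of frequency when widths differ.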
SHAPES OF HISTOGRAMS
A histogram can assume any one of a large number of shapes.
The most common of these shapes are:
1. SYMMETRIC → identical on both sides of its central point.
2. SKEWED → the tail on one side is longer than the tail on the other side.
3. UNIFORM or RECTANGULAR → has the same frequency for each class.
CUMULATIVE FREQUENCY: a running total of frequencies through the classes of a frequency distribution.
CUMULATIVE DISTRIBUTION → for each value (category/class) gives the total number of observations taking that value or
lower (or falling below the upper boundary of each class).
$\text{Cumulative relative frequency} = \frac{\text{Cumulative frequency of a class}}{\text{Total observations in the data set}}$
Cumulative percentage = (Cumulative relative frequency) ⋅ 100
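A running total is straightforward to compute with the standard-library `itertools.accumulate`. The class frequencies below are hypothetical and assumed to be listed in class order.

```python
from itertools import accumulate

# Hypothetical class frequencies, listed in class order (made up for illustration).
frequencies = [4, 3, 3, 2]

cum_freq = list(accumulate(frequencies))        # running totals
total = cum_freq[-1]                            # total observations in the data set
cum_rel_freq = [c / total for c in cum_freq]    # cumulative relative frequencies
cum_pct = [100 * r for r in cum_rel_freq]       # cumulative percentages

print(cum_freq)
print(cum_rel_freq)
print(cum_pct)
```

The last cumulative frequency always equals the total number of observations, so the last cumulative percentage is 100.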
An ogive is a curve drawn for the cumulative frequency distribution by joining with straight lines the dots marked above the upper boundaries of classes
at heights equal to the cumulative frequencies of respective classes.
One advantage of an ogive is that it can be used to approximate the cumulative frequency for any interval.
We can draw an ogive for cumulative relative frequency and cumulative percentage distributions in the same way as for the cumulative frequency distribution.
STEM-AND-LEAF DISPLAY
In a stem-and-leaf display of quantitative data, each value is divided into two portions—a stem and a leaf.
The leaves for each stem are shown separately in the display.
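A minimal sketch of such a display for two-digit values is given below. The scores are hypothetical, and splitting each value into a tens-digit stem and a units-digit leaf is an assumption of this sketch; other splits are possible for other kinds of data.

```python
from collections import defaultdict

# Hypothetical two-digit scores (made up for illustration).
data = [75, 52, 80, 96, 65, 79, 71, 87, 93, 95, 69, 72, 81, 61, 76]

# Assumption of this sketch: stem = tens digit, leaf = units digit.
stems = defaultdict(list)
for value in sorted(data):
    stems[value // 10].append(value % 10)

for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
```

Because the values are sorted first, the leaves of each stem come out ranked in increasing order.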
Outliers or Extreme Values
Values that are very small or very large relative to the majority of the values in a data set are called outliers or extreme values.
Percentage
The percentage for a class or category is obtained by multiplying the relative frequency of that class or category by 100.
Raw data
Data recorded in the sequence in which they are collected and before they are processed.
Class
An interval that includes all the values in a (quantitative) data set that fall within two numbers, the lower and upper limits of the class.
Class boundary
The midpoint of the upper limit of one class and the lower limit of the next class.
Class frequency
The number of values in a data set that belong to a certain class.
Class midpoint or mark
The class midpoint or mark is obtained by dividing the sum of the lower and upper limits (or boundaries) of a class by 2.
Class width or size
The difference between the two boundaries of a class.
Frequency distribution
A table that lists all the categories or classes and the number of values that belong to each of these categories or classes.
Relative frequency
The frequency of a class or category divided by the sum of all frequencies.
Grouped data
A data set presented in the form of a frequency distribution.
Ungrouped data
Data containing information on each member of a sample or population individually.
Bar graph
A graph made of bars whose heights represent the frequencies of respective categories.
Pie chart
A circle divided into portions that represent the relative frequencies or percentages of different categories or classes.
Polygon
A graph formed by joining the midpoints of the tops of successive bars in a histogram by straight lines.
Histogram
A graph in which classes are marked on the horizontal axis and frequencies, relative frequencies, or percentages are marked
on the vertical axis. The frequencies, relative frequencies, or percentages of various classes are represented by bars that are
drawn adjacent to each other.
Skewed-to-the-left histogram
A histogram with a longer tail on the left side.
Skewed-to-the-right histogram
A histogram with a longer tail on the right side.
Uniform or rectangular histogram
A histogram with the same frequency for all classes.
Symmetric histogram
A histogram that is identical on both sides of its central point.
Ogive
A curve drawn for a cumulative frequency distribution.
Stem-and-leaf display
A display of data in which each value is divided into two portions: a stem and a leaf.
Cumulative frequency
The frequency of a class that includes all values in a data set that fall below the upper boundary of that class.
Cumulative frequency distribution
A table that lists the total number of values that fall below the upper boundary of each class.
Cumulative percentage
The cumulative relative frequency multiplied by 100.
Cumulative relative frequency
The cumulative frequency of a class divided by the total number of observations.
Outliers or Extreme values
Values that are very small or very large relative to the majority of the values in a data set.
CHAPTER 3 - NUMERICAL DESCRIPTIVE MEASURES
MEASURES OF CENTRAL TENDENCY FOR UNGROUPED DATA
A measure of central tendency gives the center of a histogram or a frequency distribution curve.
The mode is the value that occurs with the highest frequency in a data set.
The mode can be computed for ALL types of variables, but a data set can have one mode, two modes, more than two modes, or even no mode!
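The possible outcomes (one mode, several modes, no mode) can be sketched with a small helper built on `collections.Counter`. The function name `modes` and the sample data are hypothetical; reporting "no mode" when every value occurs exactly once is the convention assumed here, and the input is assumed to be non-empty.

```python
from collections import Counter

def modes(values):
    """Return all values with the highest frequency; empty list if there is no mode."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []   # every value occurs once: no mode (convention assumed here)
    return [v for v, c in counts.items() if c == top]

print(modes([1, 2, 2, 3]))             # one mode
print(modes([1, 1, 2, 2, 3]))          # two modes
print(modes([4, 5, 6]))                # no mode
print(modes(["red", "blue", "red"]))   # works for qualitative data too
```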
The median is the value of the middle term in a data set that has been ranked in increasing order.
The median divides a ranked data set into two equal parts. It is not sensitive to outliers.
The calculation of the median consists of the following two steps:
1. Rank the data set in increasing order;
2. Find the middle term. The value of this term is the median.
o if the number of observations in a data set is odd, then the median is given by the value of the middle term in the ranked data set;
o if the number of observations is even, then the median is given by the average of the values of the two middle terms.
QUARTILES
Quartiles are three summary measures that divide a ranked data set into four equal parts.
Like the median, quartiles are not sensitive to outliers.
Approximately 25% of the values in a ranked data set are less than Q1 and about 75% are greater than Q1.
The second quartile, Q2, divides a ranked data set into two equal parts; hence, the second quartile and the median are the same.
Approximately 75% of the data values are less than Q3 and about 25% are greater than Q3.
The difference between the third quartile and the first quartile for a data set is called the interquartile range (IQR).
𝐈𝐐𝐑 = 𝐈𝐧𝐭𝐞𝐫𝐪𝐮𝐚𝐫𝐭𝐢𝐥𝐞 𝐫𝐚𝐧𝐠𝐞 = 𝐐𝟑 − 𝐐𝟏
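The median rule (middle term for an odd count, average of the two middle terms for an even count) and the quartiles can be sketched as below. The notes do not say how Q1 and Q3 are located, so this sketch assumes the common "median of each half" convention, with the middle term left out of both halves when the count is odd; other conventions give slightly different values. The data and function names are hypothetical.

```python
def median(ranked):
    """Median of a list that is already ranked in increasing order."""
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]                       # odd n: the middle term
    return (ranked[mid - 1] + ranked[mid]) / 2   # even n: average of the two middle terms

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention (an assumption)."""
    ranked = sorted(values)
    n = len(ranked)
    lower = ranked[: n // 2]         # terms before the median position
    upper = ranked[(n + 1) // 2:]    # terms after the median position
    return median(lower), median(ranked), median(upper)

data = [47, 35, 56, 22, 40, 61, 29, 53, 44]   # hypothetical values
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```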
Percentiles are the summary measures that divide a ranked data set into 100 equal parts.
The data should be ranked in increasing order to compute percentiles.
The 𝑘 th percentile is denoted by Pk, where 𝑘 is an integer in the range 1 to 99.
For instance, the 25th percentile is denoted by P25.
Thus, the 𝑘th percentile, Pk, can be defined as a value in a data set such that about 𝑘% of the measurements are smaller than
the value of Pk and about (100 − 𝑘)% of the measurements are greater than the value of Pk.
The (approximate) value of the 𝑘th percentile, denoted by Pk, is
$P_k = \text{value of the } \left(\frac{k \cdot n}{100}\right)\text{th term in a ranked data set}$
where 𝑛 is the sample size and 𝑘 denotes the number of the percentile.
PERCENTILE RANK
$\text{Percentile rank of } x = \frac{\text{Number of values less than } x}{\text{Total number of values in the data set}} \cdot 100$
Indicates the percentile of a given value 𝑥𝑖 in the data-set.
How: divide the number of observations taking values less than or equal to 𝑥𝑖 by 𝑛, the total number of observations in the sample, and multiply by 100:
$\text{Percentile rank of } x_i = \frac{\#\{\text{obs} \le x_i\}}{n} \cdot 100$
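Both computations can be sketched as follows. The function names and the data are hypothetical. Two points are assumptions of this sketch rather than statements from the notes: a fractional term position k·n/100 is rounded up to the next whole term, and the `inclusive` flag switches between the "less than x" and the "less than or equal to x" versions of the percentile rank, both of which appear above.

```python
import math

def percentile(values, k):
    """Approximate k-th percentile: value of the (k*n/100)-th term of the ranked data."""
    ranked = sorted(values)
    n = len(ranked)
    # Assumption: round a fractional position up, and keep it within 1..n.
    position = min(max(math.ceil(k * n / 100), 1), n)
    return ranked[position - 1]   # positions are 1-based, list indices 0-based

def percentile_rank(values, x, inclusive=False):
    """Percentage of values less than x (or less than or equal to x if inclusive)."""
    if inclusive:
        count = sum(v <= x for v in values)
    else:
        count = sum(v < x for v in values)
    return 100 * count / len(values)

data = [7, 12, 15, 18, 21, 24, 27, 30, 33, 36]   # hypothetical values
print(percentile(data, 25), percentile(data, 50))
print(percentile_rank(data, 21), percentile_rank(data, 21, inclusive=True))
```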
The mean (also called the average or arithmetic mean) is the most frequently used measure of central tendency.
For ungrouped data, the mean is obtained by dividing the sum of all values by the number of values in the data set:
$\text{Mean} = \frac{\text{Sum of all values}}{\text{Number of values}}$
Mean for population data: $\mu = \frac{\sum x}{N}$
Mean for sample data: $\bar{x} = \frac{\sum x}{n}$
Properties of the mean:
o most widely used measure of central tendency;
o only for quantitative variables;
o sensitive to outliers → the TRIMMED MEAN addresses this problem.
TRIMMED MEAN → mean computed on a subset of observations, dropping one portion at each end of the ranked data.
The trimmed mean is calculated by dropping a certain percentage of values from each end of a ranked data set.
The trimmed mean is especially useful as a measure of central tendency when a data set contains a few outliers at each end.
Properties of the trimmed mean:
o not sensitive to outliers;
o only for quantitative variables;
o not clear which portion to drop.
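The effect of trimming can be seen on a small made-up data set with one outlier at each end. The function names are hypothetical, and rounding the number of dropped values down to a whole number is an assumption of this sketch (the notes themselves point out that the portion to drop is not clearly defined).

```python
def mean(values):
    """Sum of all values divided by the number of values."""
    return sum(values) / len(values)

def trimmed_mean(values, percent):
    """Mean after dropping `percent`% of the values from each end of the ranked data."""
    ranked = sorted(values)
    # Assumption: the number of values dropped at each end is rounded down.
    drop = int(len(ranked) * percent / 100)
    kept = ranked[drop: len(ranked) - drop]
    return mean(kept)

# Hypothetical data with an outlier at each end (2 and 109).
data = [2, 31, 33, 35, 36, 37, 38, 39, 40, 109]

print(mean(data))               # pulled upward by the large outlier
print(trimmed_mean(data, 10))   # drops one value from each end
```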
WEIGHTED MEAN → mean computed on frequency distributions
For a sequence of 𝑛 data values 𝑥1, 𝑥2, ..., 𝑥𝑛 that are assigned weights 𝑤1, 𝑤2, ..., 𝑤𝑛, respectively:
$\text{Weighted mean} = \frac{\sum xw}{\sum w}$
where $\sum xw$ is obtained by multiplying each data value by its weight and then adding the products, and $\sum w$ is the sum of the weights.
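A short sketch of the weighted mean follows: the sum of the value-times-weight products divided by the sum of the weights. The function name and the small frequency distribution, in which each weight is the number of times a value was observed, are made up for illustration.

```python
def weighted_mean(values, weights):
    """Sum of value * weight products divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_xw = sum(x * w for x, w in zip(values, weights))
    total_w = sum(weights)
    return total_xw / total_w

# Hypothetical frequency distribution: each value is observed `weight` times.
values = [1, 2, 3, 4]
weights = [2, 5, 2, 1]

result = weighted_mean(values, weights)
print(result)
```

With all weights equal, the result reduces to the ordinary mean of the values.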