Introduction

For most of the work you do in this book, you will use a histogram to display the data. One advantage of a histogram is that it can readily display large data sets.

A histogram consists of contiguous (adjoining) boxes. It has both a horizontal axis and a vertical axis. The horizontal axis is more or less a number line, labeled with what the data represents, for example, distance from your home to school. The vertical axis is labeled either frequency or relative frequency (or percent frequency or probability). The graph will have the same shape with either label. The histogram (like the stemplot) can give you the shape of the data, the center, and the spread of the data. The shape of the data refers to the shape of the distribution, whether normal, approximately normal, or skewed in some direction, whereas the center is thought of as the middle of a data set, and the spread indicates how far the values are dispersed about the center. In a skewed distribution, the mean is pulled toward the tail of the distribution.

The relative frequency is equal to the frequency for an observed value of the data divided by the total number of data values in the sample. Remember, frequency is defined as the number of times an answer occurs. If

  • f = frequency,
  • n = total number of data values (or the sum of the individual frequencies), and
  • RF = relative frequency,

then

RF = f/n.

For example, if three students in Mr. Ahab’s English class of 40 students received from 90 to 100 percent, then f = 3, n = 40, and RF = f/n = 3/40 = 0.075. Thus, 7.5 percent of the students received 90 to 100 percent. Ninety to 100 percent is a quantitative measure.
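The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch; the function name `relative_frequency` is an illustrative choice and does not come from the text.

```python
def relative_frequency(f, n):
    """Return the relative frequency RF = f / n."""
    if n <= 0:
        raise ValueError("n must be a positive number of data values")
    return f / n

# Mr. Ahab's class: 3 of 40 students scored from 90 to 100 percent.
rf = relative_frequency(3, 40)
print(rf)           # 0.075
print(f"{rf:.1%}")  # 7.5%
```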

To construct a histogram, first decide how many bars or intervals, also called classes, represent the data. Many histograms consist of five to 15 bars or classes for clarity. The width of each bar is also referred to as the bin size, which may be calculated by dividing the range of the data values by the desired number of bins (or bars). There is not a set procedure for determining the number of bars or bar width/bin size; however, consistency is key when determining which data values to place inside each interval.

Example 2.9

The following data are the heights (in inches to the nearest half inch) of 100 male semiprofessional soccer players; the heights are continuous data since height is measured:

60, 60.5, 61, 61, 61.5,

63.5, 63.5, 63.5,

64, 64, 64, 64, 64, 64, 64, 64.5, 64.5, 64.5, 64.5, 64.5, 64.5, 64.5, 64.5,

66, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67.5, 67.5, 67.5, 67.5, 67.5, 67.5, 67.5,

68, 68, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69.5, 69.5, 69.5, 69.5, 69.5,

70, 70, 70, 70, 70, 70, 70.5, 70.5, 70.5, 71, 71, 71,

72, 72, 72, 72.5, 72.5, 73, 73.5,

74

 

The smallest data value is 60, and the largest data value is 74. To make sure each is included in an interval, we can use 59.95 as the smallest value and 74.05 as the largest value, subtracting and adding 0.05 to these values, respectively. We have a small range here of 14.1 (74.05 – 59.95), so we will want fewer bins; let’s say eight. So, 14.1 divided by eight bins gives a bin size (or interval size) of approximately 1.76.
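The bin-size calculation for this example can be sketched in Python as follows. The variable names are illustrative, and the results are rounded to two decimal places only for display.

```python
smallest, largest = 60, 74   # extreme heights in Example 2.9
pad = 0.05                   # keeps every half-inch value off a boundary
num_bins = 8                 # chosen number of bars

low = smallest - pad
high = largest + pad
data_range = high - low
bin_width = data_range / num_bins

print(round(data_range, 2))  # 14.1
print(round(bin_width, 2))   # 1.76
```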

NOTE

We will round up to two and make each bar or class interval two units wide. Rounding up to two is a way to prevent a value from falling on a boundary. Rounding to the next number is often necessary even if it goes against the standard rules of rounding. For this example, using 1.76 as the width would also work. A guideline that some follow for the number of bars or class intervals is to take the square root of the number of data values and then round to the nearest whole number, if necessary. For example, if there are 150 values of data, take the square root of 150 (about 12.2) and round to 12 bars or intervals.
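The square-root guideline in this note is easy to try out. The sketch below assumes the guideline is applied exactly as stated (square root, then round to the nearest whole number); the function name `sqrt_rule` is hypothetical.

```python
import math

def sqrt_rule(num_values):
    """Suggest a number of bars: the square root of the data count, rounded."""
    return round(math.sqrt(num_values))

print(sqrt_rule(150))  # 12
print(sqrt_rule(100))  # 10
```

For the 100 heights in this example, the guideline suggests 10 bars, while the example uses eight. Both choices are reasonable, since there is no set procedure.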

The boundaries are as follows:

  • 59.95
  • 59.95 + 2 = 61.95
  • 61.95 + 2 = 63.95
  • 63.95 + 2 = 65.95
  • 65.95 + 2 = 67.95
  • 67.95 + 2 = 69.95
  • 69.95 + 2 = 71.95
  • 71.95 + 2 = 73.95
  • 73.95 + 2 = 75.95

The heights 60 through 61.5 inches are in the interval 59.95–61.95. The heights that are 63.5 are in the interval 61.95–63.95. The heights that are 64 through 64.5 are in the interval 63.95–65.95. The heights 66 through 67.5 are in the interval 65.95–67.95. The heights 68 through 69.5 are in the interval 67.95–69.95. The heights 70 through 71 are in the interval 69.95–71.95. The heights 72 through 73.5 are in the interval 71.95–73.95. The height 74 is in the interval 73.95–75.95.
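The assignment of heights to intervals can be reproduced with a short Python sketch. Two choices here are assumptions made for illustration: the data are stored as (value, count) pairs condensed from the list in this example, and the arithmetic is done in whole hundredths of an inch so that the boundary comparisons are not affected by floating-point rounding.

```python
# Heights from Example 2.9 stored as (value, count) pairs to keep the listing short.
height_counts = [
    (60, 1), (60.5, 1), (61, 2), (61.5, 1),
    (63.5, 3),
    (64, 7), (64.5, 8),
    (66, 10), (66.5, 11), (67, 12), (67.5, 7),
    (68, 2), (69, 10), (69.5, 5),
    (70, 6), (70.5, 3), (71, 3),
    (72, 3), (72.5, 2), (73, 1), (73.5, 1),
    (74, 1),
]
heights = [value for value, count in height_counts for _ in range(count)]

start, width, num_bins = 59.95, 2, 8
# Work in hundredths of an inch so the boundaries are exact integers.
edges = [round(start * 100) + i * width * 100 for i in range(num_bins + 1)]

frequencies = [0] * num_bins
for h in heights:
    index = (round(h * 100) - edges[0]) // (width * 100)
    frequencies[index] += 1

n = len(heights)
for i, f in enumerate(frequencies):
    # Interval, frequency, relative frequency
    print(f"{edges[i] / 100:.2f}-{edges[i + 1] / 100:.2f}  {f:3d}  {f / n:.2f}")
```

The printed frequencies are 5, 3, 15, 40, 17, 12, 7, and 1, which sum to 100.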

The following histogram displays the heights on the x-axis and relative frequency on the y-axis:

Histogram consists of 8 bars with the y-axis in increments of 0.05 from 0-0.4 and the x-axis in intervals of 2 from 59.95-75.95.
Figure 2.5
Interval        Frequency    Relative Frequency
59.95–61.95     5            5/100 = 0.05
61.95–63.95     3            3/100 = 0.03
63.95–65.95     15           15/100 = 0.15
65.95–67.95     40           40/100 = 0.40
67.95–69.95     17           17/100 = 0.17
69.95–71.95     12           12/100 = 0.12
71.95–73.95     7            7/100 = 0.07
73.95–75.95     1            1/100 = 0.01
Table 2.15
 
Try It 2.9

Construct a histogram and calculate the width of each bar or class interval. Use six bars on the histogram. The following data are the shoe sizes of 50 male students; the sizes are continuous data since shoe size is measured:

9, 9, 9.5, 9.5, 10, 10, 10, 10, 10, 10, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5,

11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11.5, 11.5, 11.5, 11.5, 11.5, 11.5, 11.5,

12, 12, 12, 12, 12, 12, 12, 12.5, 12.5, 12.5, 12.5, 14

 

Example 2.10

The following data are the number of books bought by 50 part-time college students at ABC College; the number of books is discrete data since books are counted:

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

2, 2, 2, 2, 2, 2, 2, 2, 2, 2,

3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,

4, 4, 4, 4, 4, 4,

5, 5, 5, 5, 5,

6, 6

Eleven students buy one book. Ten students buy two books. Sixteen students buy three books. Six students buy four books. Five students buy five books. Two students buy six books.

Calculate the width of each bar/bin size/interval size.

Solution 2.10

The smallest data value is 1, and the largest data value is 6. To make sure each is included in an interval, we can use 0.5 as the smallest value and 6.5 as the largest value by subtracting and adding 0.5 to these values. We have a small range here of 6 (6.5 – 0.5), so we will want fewer bins; let’s say six this time. So, six divided by six bins gives a bin size (or interval size) of one.

Notice that we may choose different rational numbers to add to, or subtract from, our maximum and minimum values when calculating bin size. In the previous example, we added and subtracted 0.05, while this time, we added and subtracted 0.5. Given a data set, you will be able to determine what is appropriate and reasonable.
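The bin size and the bar heights for this example can be checked with a short Python sketch. The list `books` is rebuilt from the counts stated in the example, and the variable names are illustrative.

```python
from collections import Counter

# Number of books bought, rebuilt from the counts stated in Example 2.10.
books = [1] * 11 + [2] * 10 + [3] * 16 + [4] * 6 + [5] * 5 + [6] * 2

low, high = min(books) - 0.5, max(books) + 0.5   # 0.5 and 6.5
num_bins = 6
bin_width = (high - low) / num_bins

print(bin_width)  # 1.0

# With a width of one, each bin holds exactly one whole-number value.
frequencies = Counter(books)
for value in sorted(frequencies):
    print(f"{value - 0.5}-{value + 0.5}: {frequencies[value]}")
```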

The following histogram displays the number of books on the x-axis and the frequency on the y-axis:

Histogram consists of 6 bars with the y-axis in increments of 2 from 0-16 and the x-axis in intervals of 1 from 0.5-6.5.
Figure 2.6

Using the TI-83, 83+, 84, 84+ Calculator

Go to Appendix G. There are calculator instructions for entering data and for creating a customized histogram. Create the histogram for Example 2.10.

  • Press Y=. Press CLEAR to delete any equations.
  • Press STAT 1:EDIT. If L1 has data in it, arrow up into the name L1, press CLEAR and then arrow down. If necessary, do the same for L2.
  • Into L1, enter 1, 2, 3, 4, 5, 6. Note that these values represent the numbers of books.
  • Into L2, enter 11, 10, 16, 6, 5, 2. Note that these numbers represent the frequencies for the numbers of books.
  • Press WINDOW. Set Xmin = .5, Xmax = 6.5, Xscl = (6.5 – .5)/6, Ymin = –1, Ymax = 20, Yscl = 1, Xres = 1. The window settings are chosen to accurately and completely show the data value range and the frequency range.
  • Press 2nd Y=. Start by pressing 4:PlotsOff ENTER.
  • Press 2nd Y=. Press 1:Plot1. Press ENTER. Arrow down to TYPE. Arrow to the third picture (histogram). Press ENTER.
  • Arrow down to Xlist: Enter L1 (2nd 1). Arrow down to Freq. Enter L2 (2nd 2).
  • Press GRAPH.
  • Use the TRACE key and the arrow keys to examine the histogram.
Try It 2.10

The following data are the number of sports played by 50 student athletes; the number of sports is discrete data since sports are counted:

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,

3, 3, 3, 3, 3, 3, 3, 3

Twenty student athletes play one sport. Twenty-two student athletes play two sports. Eight student athletes play three sports. Calculate a desired bin size for the data. Create a histogram and clearly label the endpoints of the intervals.
 

Example 2.11

Using this data set, construct a histogram.

Number of Hours My Classmates Spent Playing Video Games on Weekends
9.95 10 2.25 16.75 0
19.5 22.5 7.5 15 12.75
5.5 11 10 20.75 17.5
23 21.9 24 23.75 18
20 15 22.9 18.8 20.5
Table 2.16
Solution 2.11
This is a histogram that matches the supplied data. The x-axis consists of 5 bars in intervals of 5 from 0 to 25. The y-axis is marked in increments of 1 from 0 to 10. The x-axis shows the number of hours spent playing video games on the weekends, and the y-axis shows the number of students.
Figure 2.7

Some values in this data set fall on boundaries for the class intervals. A value is counted in a class interval if it falls on the left boundary but not if it falls on the right boundary. Different researchers may set up histograms for the same data in different ways. There is more than one correct way to set up a histogram.
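The boundary rule described here (count a value on the left boundary, not on the right) can be sketched in Python. The class intervals 0–5, 5–10, and so on are taken from the description of Figure 2.7; the counts below are computed from Table 2.16 under that rule rather than read from the figure.

```python
hours = [
    9.95, 10, 2.25, 16.75, 0,
    19.5, 22.5, 7.5, 15, 12.75,
    5.5, 11, 10, 20.75, 17.5,
    23, 21.9, 24, 23.75, 18,
    20, 15, 22.9, 18.8, 20.5,
]

width = 5
edges = [0, 5, 10, 15, 20, 25]

frequencies = [0] * (len(edges) - 1)
for h in hours:
    # Floor division puts a boundary value such as 10 into the bin that starts at 10.
    frequencies[int(h // width)] += 1

for left, right, f in zip(edges, edges[1:], frequencies):
    print(f"[{left}, {right}): {f}")
```

The values 10, 15, and 20 each fall on a boundary, and each is counted in the interval that begins at that value.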

Try It 2.11

The following data represent the number of employees at various restaurants in New York City. Using this data, create a histogram:

22, 35, 15, 26, 40, 28, 18, 20, 25, 34, 39, 42, 24, 22, 19, 27, 22, 34, 40, 20, 38, 28

Collaborative Exercise

Count the money (bills and change) in your pocket or purse. Your instructor will record the amounts. As a class, construct a histogram displaying the data. Discuss how many intervals you think would be appropriate. You may want to experiment with the number of intervals.

 

Frequency Polygons

<section data-depth="1" id="fs-idm4800336">
<p id="fs-idm417872">Frequency polygons are analogous to line graphs, and just as line graphs make continuous data visually easy to interpret, so too do frequency polygons.</p>
<p id="fs-idp64102880">To construct a frequency polygon, first examine the data and decide on the number of intervals and resulting interval size, for both the <em data-effect="italics">x</em>-axis and <em data-effect="italics">y</em>-axis. The <em data-effect="italics">x</em>-axis will show the lower and upper bound for each interval, containing the data values, whereas the <em data-effect="italics">y</em>-axis will represent the frequencies of the values. Each plotted point represents the frequency of one interval. For example, if an interval has three data values in it, the frequency polygon will show a 3 at the midpoint of that interval. After choosing the appropriate intervals, begin plotting the data points. After all the points are plotted, draw line segments to connect them.</p>
<div data-type="example" id="example4">
<h3 class="os-title"><span class="os-title-label">Example </span><span class="os-number">2.12</span><span class="os-divider"> </span></h3>
<p id="fs-idm56511280">A frequency polygon was constructed from the frequency table below.</p>
<div class="os-table">
<table id="fs-idp57619648" summary="Table two shows the Frequency distribution for calculus final test scores. The table is separated into four columns, lower bound, Upper Bound, Frequency, and Cumulative Frequency. Lower 49.5, upper 59.5, frequency 5, and cumulative frequency is 5. Lower 59.5, upper 69.5, frequency 10, and cumulative frequency is 15. Lower 69.5, upper 79.5, frequency 30, and cumulative frequency is 45. Lower 79.5, upper 89.5, frequency 40, and cumulative frequency is 85. Lower 89.5, upper 99.5, frequency 15, and cumulative frequency is 100.">
<thead>
<tr>
<th colspan="4">Frequency Distribution for Calculus Final Test Scores</th>
</tr>
<tr>
<th>Lower Bound</th>
<th>Upper Bound</th>
<th>Frequency</th>
<th>Cumulative Frequency</th>
</tr>
</thead>
<tbody>
<tr>
<td>49.5</td>
<td>59.5</td>
<td>5</td>
<td>5</td>
</tr>
<tr>
<td>59.5</td>
<td>69.5</td>
<td>10</td>
<td>15</td>
</tr>
<tr>
<td>69.5</td>
<td>79.5</td>
<td>30</td>
<td>45</td>
</tr>
<tr>
<td>79.5</td>
<td>89.5</td>
<td>40</td>
<td>85</td>
</tr>
<tr>
<td>89.5</td>
<td>99.5</td>
<td>15</td>
<td>100</td>
</tr>
</tbody>
</table>
<div class="os-caption-container"><span class="os-title-label">Table </span><span class="os-number">2.17</span><span class="os-divider"> </span><span class="os-divider"> </span></div>
</div>
<div class="os-figure">
<figure id="eip-idm19499056"><span data-alt="A frequency polygon was constructed from the frequency table below." data-display="block" data-type="media" id="fs-idm4822176"><img alt="A frequency polygon was constructed from the frequency table below." data-media-type="image/jpg" id="19498" src="https://www.texasgateway.org/sites/default/files/TEAhsstatistics/resourc... width="350" /> </span></figure>
<div class="os-caption-container"><span class="os-title-label">Figure </span><span class="os-number">2.8</span><span class="os-divider"> </span></div>
</div>
<p id="fs-id1164565376977">Notice that each point represents the frequency for a particular interval. These points are located halfway between the lower bound and upper bound. In fact, the horizontal axis, or <em data-effect="italics">x</em>-axis, shows only these midpoint values. For the interval 49.5&ndash;59.5, the value 54.5 is represented by a point, showing the correct frequency of 5. For the interval just before 49.5&ndash;59.5 (that is, 39.5&ndash;49.5), the midpoint, 44.5, is represented by a point showing a frequency of 0, since we do not have any values in that range. The same idea applies to the interval just after the last one, 99.5&ndash;109.5, which has a midpoint of 104.5 and correctly shows a point representing a frequency of 0. Looking at the graph, we say that this distribution is skewed because one side of the graph does not mirror the other side.</p>
</div>
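<p>The points plotted in Example 2.12 can be computed directly from Table 2.17. The following Python sketch is one way to do it, assuming equal-width intervals; the variable names are illustrative. It places one point at each interval midpoint and adds a zero-frequency point one interval before and one interval after the data.</p>

```python
# Rows of Table 2.17: (lower bound, upper bound, frequency).
table = [
    (49.5, 59.5, 5),
    (59.5, 69.5, 10),
    (69.5, 79.5, 30),
    (79.5, 89.5, 40),
    (89.5, 99.5, 15),
]

width = table[0][1] - table[0][0]   # every interval is 10 units wide

# One point per interval, plotted at the interval midpoint.
points = [((lo + hi) / 2, f) for lo, hi, f in table]

# Add a zero-frequency point one interval before and after the data
# so the polygon returns to the horizontal axis.
first_mid, last_mid = points[0][0], points[-1][0]
points = [(first_mid - width, 0)] + points + [(last_mid + width, 0)]

for x, y in points:
    print(x, y)
```

<p>Connecting the printed points in order with line segments gives the frequency polygon.</p>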
<div class="statistics try" data-has-label="true" data-label="" data-type="note" id="fs-idm37683552">
<div class="os-title"><span class="os-title-label">Try It </span><span class="os-number">2.12</span></div>
<div class="unnumbered" data-type="exercise" id="eip-idp12732512">
<div data-type="problem" id="eip-idp12732768">
<p id="fs-idp42868544">Construct a frequency polygon of U.S. presidents&rsquo; ages at inauguration shown in <a class="autogenerated-content" href="#fs-idp36852784">Table 2.18</a>.</p>
<div class="os-table">
<table id="fs-idp36852784" summary="This is a two-column table with the ages of presidents at inauguration and the frequency with which the ages occur. 41.5 – 46.5 with a frequency of 4, 46.5 – 51.5 with a frequency of 11, 51.5 – 56.5 with a frequency of 14, 56.5 – 61.5 with a frequency of 9, 61.5 – 66.5 with a frequency of 4, 66.5 – 71.5 with a frequency of 2.">
<thead>
<tr>
<th>Age at Inauguration</th>
<th>Frequency</th>
</tr>
</thead>
<tbody>
<tr>
<td>41.5&ndash;46.5</td>
<td>4</td>
</tr>
<tr>
<td>46.5&ndash;51.5</td>
<td>11</td>
</tr>
<tr>
<td>51.5&ndash;56.5</td>
<td>14</td>
</tr>
<tr>
<td>56.5&ndash;61.5</td>
<td>9</td>
</tr>
<tr>
<td>61.5&ndash;66.5</td>
<td>4</td>
</tr>
<tr>
<td>66.5&ndash;71.5</td>
<td>2</td>
</tr>
</tbody>
</table>
<div class="os-caption-container"><span class="os-title-label">Table </span><span class="os-number">2.18</span><span class="os-divider"> </span><span class="os-divider"> </span></div>
</div>
</div>
</div>
</div>
<p id="fs-idp48307936">Frequency polygons are useful for comparing distributions. This comparison is achieved by overlaying the frequency polygons drawn for different data sets.</p>
<div data-type="example" id="fs-idp21707856">
<h3 class="os-title"><span class="os-title-label">Example </span><span class="os-number">2.13</span><span class="os-divider"> </span></h3>
<p id="fs-idm34843712">We will construct an overlay frequency polygon comparing the scores from <a class="autogenerated-content" href="#example4">Example 2.12</a> with the students&rsquo; final numeric grades.</p>
<div class="os-table">
<table id="fs-idm10950720" summary="Frequency Distribution for Calculus final test scores is the title of the table. It is split into four columns: Lower Bound, Upper Bound, Frequency, and Cumulative Frequency. From left to right the columns read 49.5, 59.5, 5, 5. The second row – 59.5, 69.5, 10, 15. The third row – 69.5, 79.5, 30, 45. The fourth row – 79.5, 89.5, 40, 85. The fifth and last row 89.5, 99.5, 15, 100. ">
<thead>
<tr>
<th colspan="4">Frequency Distribution for Calculus Final Test Scores</th>
</tr>
<tr>
<th>Lower Bound</th>
<th>Upper Bound</th>
<th>Frequency</th>
<th>Cumulative Frequency</th>
</tr>
</thead>
<tbody>
<tr>
<td>49.5</td>
<td>59.5</td>
<td>5</td>
<td>5</td>
</tr>
<tr>
<td>59.5</td>
<td>69.5</td>
<td>10</td>
<td>15</td>
</tr>
<tr>
<td>69.5</td>
<td>79.5</td>
<td>30</td>
<td>45</td>
</tr>
<tr>
<td>79.5</td>
<td>89.5</td>
<td>40</td>
<td>85</td>
</tr>
<tr>
<td>89.5</td>
<td>99.5</td>
<td>15</td>
<td>100</td>
</tr>
</tbody>
</table>
<div class="os-caption-container"><span class="os-title-label">Table </span><span class="os-number">2.19</span><span class="os-divider"> </span><span class="os-divider"> </span></div>
</div>
<div class="os-table">
<table id="fs-idp39914624" summary="Frequency Distribution for Calculus Final Grades is the title of this table. It is split into four columns with the categories: Lower Bound, Upper Bound, Frequency, and Cumulative Frequency. The table reads from left to right starting with row 1 – 49.5, 59.5, 10, 10. Row 2 – 59.5, 69.5, 10, 20. Row 3 – 69.5, 79.5, 30, 50. Row 4 – 79.5, 89.5, 45, 95. Row 5 – 89.5, 99.5, 5, 100.">
<thead>
<tr>
<th colspan="4">Frequency Distribution for Calculus Final Grades</th>
</tr>
<tr>
<th>Lower Bound</th>
<th>Upper Bound</th>
<th>Frequency</th>
<th>Cumulative Frequency</th>
</tr>
</thead>
<tbody>
<tr>
<td>49.5</td>
<td>59.5</td>
<td>10</td>
<td>10</td>
</tr>
<tr>
<td>59.5</td>
<td>69.5</td>
<td>10</td>
<td>20</td>
</tr>
<tr>
<td>69.5</td>
<td>79.5</td>
<td>30</td>
<td>50</td>
</tr>
<tr>
<td>79.5</td>
<td>89.5</td>
<td>45</td>
<td>95</td>
</tr>
<tr>
<td>89.5</td>
<td>99.5</td>
<td>5</td>
<td>100</td>
</tr>
</tbody>
</table>
<div class="os-caption-container"><span class="os-title-label">Table </span><span class="os-number">2.20</span><span class="os-divider"> </span><span class="os-divider"> </span></div>
</div>
<div class="os-figure">
<figure id="eip-id1165746871888"><span data-alt="This is an overlay frequency polygon that matches the supplied data. The x-axis shows the grades, and the y-axis shows the frequency." data-display="block" data-type="media" id="fs-idm24364960"><img alt="This is an overlay frequency polygon that matches the supplied data. The x-axis shows the grades, and the y-axis shows the frequency." data-media-type="image/jpg" id="34724" src="https://www.texasgateway.org/sites/default/files/TEAhsstatistics/resourc... width="350" /> </span></figure>
<div class="os-caption-container"><span class="os-title-label">Figure </span><span class="os-number">2.9</span><span class="os-divider"> </span></div>
</div>
</div>
<p id="fs-idp59234016">Suppose that we want to study the temperature range of a region for an entire month. Every day at noon, we note the temperature and write this down in a log. A variety of statistical studies could be done with the data. We could find the mean or the median temperature for the month. We could construct a histogram displaying the number of days that temperatures reach a certain range of values. However, all of these methods ignore a portion of the data that we have collected.</p>
<p id="fs-idm25640016">One feature of the data that we may want to consider is that of time. Since each date is paired with the temperature reading for the day, we don&#39;t have to think of the data as being random. We can instead use the times given to impose a chronological order on the data. A graph that recognizes this ordering and displays the changing temperature as the month progresses is called a time series graph.</p>
</section>

Constructing a Time Series Graph

To construct a time series graph, we must look at both pieces of our paired data set. We start with a standard Cartesian coordinate system. The horizontal axis is used to plot the date or time increments, and the vertical axis is used to plot the values of the variable that we are measuring. By using the axes in that way, we make each point on the graph correspond to a date and a measured quantity. The points on the graph are typically connected by straight lines in the order in which they occur.

Example 2.14

The following data show the Consumer Price Index for each month, together with the annual value, for 10 years. Construct a time series graph using only the Annual Consumer Price Index values:

Year Jan Feb Mar Apr May Jun Jul
2003 181.7 183.1 184.2 183.8 183.5 183.7 183.9
2004 185.2 186.2 187.4 188.0 189.1 189.7 189.4
2005 190.7 191.8 193.3 194.6 194.4 194.5 195.4
2006 198.3 198.7 199.8 201.5 202.5 202.9 203.5
2007 202.416 203.499 205.352 206.686 207.949 208.352 208.299
2008 211.080 211.693 213.528 214.823 216.632 218.815 219.964
2009 211.143 212.193 212.709 213.240 213.856 215.693 215.351
2010 216.687 216.741 217.631 218.009 218.178 217.965 218.011
2011 220.223 221.309 223.467 224.906 225.964 225.722 225.922
2012 226.665 227.663 229.392 230.085 229.815 229.478 229.104
Table 2.21
Year Aug Sep Oct Nov Dec Annual
2003 184.6 185.2 185.0 184.5 184.3 184.0
2004 189.5 189.9 190.9 191.0 190.3 188.9
2005 196.4 198.8 199.2 197.6 196.8 195.3
2006 203.9 202.9 201.8 201.5 201.8 201.6
2007 207.917 208.490 208.936 210.177 210.036 207.342
2008 219.086 218.783 216.573 212.425 210.228 215.303
2009 215.834 215.969 216.177 216.330 215.949 214.537
2010 218.312 218.439 218.711 218.803 219.179 218.056
2011 226.545 226.889 226.421 226.230 225.672 224.939
2012 230.379 231.407 231.317 230.221 229.601 229.594
Table 2.22
Solution 2.14
This is a time series graph that matches the supplied data. The x-axis shows years from 2003 to 2012, and the y-axis shows the annual CPI.
Figure 2.10 The annual amounts are plotted for each year. Then, consecutive points are connected with a line.
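
The points for this graph can be prepared with a short Python sketch. It takes the values from the Annual column of Table 2.22, orders them chronologically, and computes the change from each year to the next, which corresponds to the rise or fall of each connecting segment. The variable names are illustrative, and the year-to-year change is an added illustration rather than part of the original solution.

```python
# Annual CPI values from the "Annual" column of Table 2.22.
annual_cpi = {
    2003: 184.0, 2004: 188.9, 2005: 195.3, 2006: 201.6, 2007: 207.342,
    2008: 215.303, 2009: 214.537, 2010: 218.056, 2011: 224.939, 2012: 229.594,
}

# Order the pairs chronologically; these are the points to plot and connect.
points = sorted(annual_cpi.items())

# Change from each year to the next: the rise or fall of each line segment.
changes = [
    (year2, round(value2 - value1, 3))
    for (year1, value1), (year2, value2) in zip(points, points[1:])
]

for year, change in changes:
    direction = "up" if change > 0 else "down"
    print(year, change, direction)
```

Only the segment ending at 2009 slopes downward; every other year shows an increase over the previous year.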
 
Try It 2.14

The following table is a portion of a data set from a banking website; use the table to construct a time series graph for CO2 emissions for the United States:

CO2 Emissions
Year    Ukraine    United Kingdom    United States
2003    352,259    540,640           5,681,664
2004    343,121    540,409           5,790,761
2005    339,029    541,990           5,826,394
2006    327,797    542,045           5,737,615
2007    328,357    528,631           5,828,697
2008    323,657    522,247           5,656,839
2009    272,176    474,579           5,299,563
Table 2.23

Uses of a Time Series Graph

Time series graphs are important tools in various applications of statistics. When a researcher records values of the same variable over an extended period of time, it is sometimes difficult to discern any trend or pattern from the numbers alone. However, once the same data points are displayed graphically, some features jump out. Time series graphs make trends easy to spot.