How Do We Process Continuous Data So That We Can See the Shape of Distribution

Class-based methods

Histograms

    {Fig. 1}

    MIhistA.gif

    A histogram is a graph in which class interval frequencies of continuous variables are represented by the areas of bars centred on the 'class interval' on the horizontal (x) axis. The figure shows a histogram of the cattle weights given above. There is no 'right number' of classes (sometimes called bins) for a histogram, although somewhere between 12 and 20 is commonly recommended. The optimal number depends on the number of observations and (critically) what features you wish to bring out in the distribution. It may be necessary to use several different bin sizes to properly explore the shape of a distribution.

    The area of each block in the histogram is then drawn so that it is proportional to the frequency of its interval. If the class intervals are all of equal size, as is usually the case, the height of each block is equal to the class frequency on the y-axis.
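    As a minimal sketch of this rule, the Python below counts observations per class and divides each count by the class width, so that bar area (not bar height) tracks frequency when classes are unequal. The weights, the class limits and the function name histogram_heights are hypothetical, chosen only for illustration, and the half-open class convention is an assumption.

```python
def histogram_heights(values, edges):
    """Count values per class and convert the counts to bar heights.

    Classes are half-open, [edges[i], edges[i+1]), which is an assumed
    convention. Height = count / class width, so each bar's area equals
    the frequency of its class.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    heights = [c / (edges[i + 1] - edges[i]) for i, c in enumerate(counts)]
    return counts, heights


# Hypothetical weights (kg); these are NOT the cattle data from the text.
weights = [422, 431, 447, 452, 458, 463, 466, 471, 474, 489, 512, 540]
# Unequal class widths: 20, 20, 20 and 80 kg.
edges = [420, 440, 460, 480, 560]
counts, heights = histogram_heights(weights, edges)
for i, (c, h) in enumerate(zip(counts, heights)):
    print(f"{edges[i]}-{edges[i + 1]}: frequency {c}, height {h}")
```

    With equal class widths every count is divided by the same number, which is why height alone can then stand in for frequency.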

    When you plot a histogram, it is assumed that you are dealing with a continuous variable - hence the bars touch each other, indicating that the limits of the classes are contiguous. But make sure that your classes do not overlap. Thus you should not specify classes as 445-470 and 470 to 495, as it is not clear into which category a reading of 470 would fall.

    Warning: some computer software programmes unfortunately do precisely this.

    Most of those that we have looked at put the value that is repeated (in this case 470) in the lower class (i.e. 445 - 470). We strongly advise you always check your software, no matter how simple the analysis!
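    One way to check how a boundary value is treated is to make the rule explicit. The sketch below is an illustration only: the function class_index and its closed argument are hypothetical names, and the two conventions shown are simply the two common choices, not the behaviour of any particular package.

```python
from bisect import bisect_left, bisect_right


def class_index(value, edges, closed="left"):
    """Return the index of the class containing value, or None if outside.

    closed="left":  classes are [lower, upper), so 470 joins 470-495.
    closed="right": classes are (lower, upper], so 470 joins 445-470.
    """
    if closed == "left":
        i = bisect_right(edges, value) - 1
    else:
        i = bisect_left(edges, value) - 1
    return i if 0 <= i < len(edges) - 1 else None


edges = [445, 470, 495]
print(class_index(470, edges, closed="left"))   # index 1, the upper class
print(class_index(470, edges, closed="right"))  # index 0, the lower class
```

    Running a boundary value such as 470 through your own software in the same way shows which convention it follows.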

    {Fig. 2}

    MIhist2g.gif

    We can also use a histogram to display a cumulative frequency distribution. The frequency distribution of cattle weights is more or less symmetrical, so when we plot a cumulative frequency distribution we get an s-shaped (known as sigmoid) curve. This display is useful when we wish to show what percentage of individuals are less than a certain value. For example, 9 of the 30 animals (30%) weighed less than (or equal to) 480 kg. A cumulative distribution histogram is often given in outline form only, as a step plot such as that shown here for the same data on cattle weights (second graph).

    Many graphics packages offer cumulative plots as an alternative to the simple frequency histograms.
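    If your package does not offer a cumulative plot, the cumulative percentages are easy to compute from the class frequencies. The following is a minimal sketch; the class frequencies are hypothetical (they are not the cattle data) and cumulative_percent is an illustrative name.

```python
from itertools import accumulate


def cumulative_percent(counts):
    """Convert class frequencies to cumulative percentages."""
    total = sum(counts)
    return [100 * c / total for c in accumulate(counts)]


# Hypothetical, roughly symmetrical class frequencies for 30 observations.
counts = [1, 3, 5, 6, 6, 5, 3, 1]
cum = cumulative_percent(counts)
for upper_class, pct in enumerate(cum, start=1):
    print(f"up to class {upper_class}: {pct:.1f}%")
```

    For a roughly symmetrical distribution like this one, the percentages rise slowly, then quickly, then slowly again, which gives the sigmoid shape.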

 

Frequency polygons

    {Fig. 3}

    MIpoly.gif

    An alternative method of displaying the frequency distribution of a continuous variable is to use a frequency polygon. The mid-points of each class interval are joined up with straight lines. Remember to also include points having zero observations.

    These two methods are especially useful if you want to plot more than one distribution on the same graph. Overlapping histograms can be confusing even with just two, whilst several polygons can be shown together. Frequency polygons are also more appropriate if you have a large number of classes.
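    The vertices of a frequency polygon can be sketched as follows. The function polygon_points and the data are hypothetical, equal class widths are assumed, and closing the polygon with a zero-frequency point at each end is a common convention that we assume here; it is not stated in the text above.

```python
def polygon_points(edges, counts):
    """Return (x, y) vertices of a frequency polygon for equal-width classes."""
    width = edges[1] - edges[0]
    mids = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]
    # Add one empty class on each side so the polygon returns to zero.
    xs = [mids[0] - width] + mids + [mids[-1] + width]
    ys = [0] + list(counts) + [0]
    return list(zip(xs, ys))


# Hypothetical classes; the second class has zero observations and is kept.
edges = [420, 440, 460, 480, 500]
counts = [2, 0, 4, 3]
points = polygon_points(edges, counts)
print(points)
```

    Joining these points in order with straight lines gives the polygon; several such point lists can be drawn on the same axes.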

Stem-and-leaf plots

    A stem-and-leaf plot is a way of displaying numbers in a visual histogram-like display. Unlike a histogram, it loses no information. Consider the distribution of cattle weights again:

    The procedure is as follows:

    1. Write down the leading digit(s) (every digit other than the final one) to the left of a vertical line - these form the stem. These leading digits are arranged in order, from the lowest (42) to the highest (57).

    2. Then assign the last digit of each number to its correct position depending on its leading digits, so forming the 'leaves'. Normally they can take any value from 0 to 9, but in this case measurements were to the nearest 5 kg so the leaves can only take the values 0 or 5. Thus for the first number (420) write down a 0 to the right of 42, and so on until all 30 numbers have been done.

    We have shown the stem-and-leaf plot beside a histogram of the same data on its side for ready comparison. In this case the only additional information we retain in the stem-and-leaf plot is the numbers of 0s and 5s - but usually a stem-and-leaf plot will have a lot more detail than a histogram. If you want to compare two similar sets of data you can show them in a back-to-back stem-and-leaf plot. They are mostly used to illustrate absolute frequencies of relatively small samples.

{Fig. 4}

MIhist2d.gif
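    The two-step procedure above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the values are non-negative whole numbers, the function name stem_and_leaf is hypothetical, and the six weights are made up (they are not the 30 cattle weights).

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return stem-and-leaf rows for non-negative integers.

    The stem is every digit except the last; the leaf is the last digit.
    Every stem between the lowest and highest is listed, even if empty.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lo, hi = min(leaves), max(leaves)
    rows = []
    for stem in range(lo, hi + 1):
        row = f"{stem} | " + "".join(str(d) for d in leaves[stem])
        rows.append(row.rstrip())
    return rows


# Hypothetical weights recorded to the nearest 5 kg.
weights = [445, 420, 450, 425, 445, 440]
rows = stem_and_leaf(weights)
print("\n".join(rows))
```

    Listing empty stems keeps the vertical scale even, so the outline of the leaves can be read like a histogram on its side.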

Line & bar diagrams

    {Fig. 10}

    Ecoaar06.gif

    A 'line diagram' or a 'bar diagram' can be used to represent the frequencies of discrete measurement variables (as well as ordinal and nominal variables) by the heights of lines centred on the value (or class) on the horizontal (x) axis. These are similar in appearance to histograms, but the bars do not touch. This emphasizes the discrete nature of the variable. The first figure here shows a line diagram of the distribution of the number of female water voles per colony. The second figure shows a bar diagram for the same variable. For relative frequencies the height of the line/bar represents the percentage of observations within that class. The bars can also be drawn horizontally rather than vertically. One form of the bar diagram - the deviation bar diagram - has a horizontal line in the middle with bars above and below, indicating the deviation from zero.
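    The heights needed for a line or bar diagram are just the counts of each discrete value. The sketch below tabulates absolute and relative frequencies; the vole counts are hypothetical and the function name discrete_frequencies is ours. It assumes integer values and lists values with no observations as zero so that the x-axis has no gaps.

```python
from collections import Counter


def discrete_frequencies(values):
    """Return (value, count, percent) for every integer from min to max."""
    counts = Counter(values)
    n = len(values)
    return [(k, counts[k], 100 * counts[k] / n)
            for k in range(min(values), max(values) + 1)]


# Hypothetical numbers of female voles in eight colonies.
voles = [0, 0, 0, 0, 1, 1, 2, 4]
table = discrete_frequencies(voles)
for value, count, percent in table:
    print(f"{value}: {'#' * count} ({percent:.1f}%)")
```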

    {Fig. 11}

    Ecoaar07.gif

    The first figure is a stacked bar diagram showing both the number of female and male voles per colony. The number of males is simply added on to the number of females and shown with a different colour or shading. Another way to show the same information would be to use a multiple bar diagram - two bars side by side would be used for each class, one for the number of females and one for the number of males. In the simplest bar diagrams each bar represents a single count or category. However, sometimes counts are grouped, much the same way as weights are grouped in a histogram. This is especially true in bar diagrams of the numbers of parasites or parasite eggs, where the numbers can be very large.

    Take care -

    1. Many software packages do not discriminate between histograms and bar diagrams. It is up to you to set the bar width so that the bars are touching or separate as appropriate.
    2. Bar diagrams of nominal data can be misleading. Authors sometimes present unranked categories as if they were ranked. They also omit categories that should be included, thus artificially inflating the other categories.
    3. Bar diagrams are also used to display data that are not in the form of a frequency distribution, such as the means or medians of measurement variables.

    {Fig. 12}

    Ecoaar08.gif

    When we come to the cumulative frequency distribution of the number of female voles per colony - we find the shape is quite different to that for the cattle weights. This is the typical shape of the cumulative frequency distribution for a skewed distribution with most of the observations in the first class.

    It is very useful to be able to interpret cumulative frequency distributions as they can be more informative than the simple frequency distribution plot. You will get more practice doing this in Session 1.4.

 

Density plots

Dot Plots

    The commonest form of dot plot is known as a dot histogram. The vole numbers and cattle weight data are shown as traditional (non-jittered) dot histograms below. The green points in the second image of Fig. 7 below are the same plot for the vole numbers but with the axes reversed. For discrete data these plots may be a useful alternative to a histogram or a jittered dot plot, and are relatively popular.

    {Fig. 14}

    2dotps.GIF

    Dot histograms do have the advantage that the points are spaced such that every observation is shown. However, one can argue that this systematic arrangement produces a biased picture, and some prefer a jittered dot plot.

Jittered dot plots

    {Fig. 5}

    MIhist05.gif

    Jittered dot plots are sometimes used to display frequency distributions without grouping observations into class intervals. They are most useful for small to medium samples, where histograms are unduly sensitive to the exact class intervals.

    To produce a jittered dot plot, you plot the value of each observation on one axis against a random number on the other axis. Values used to jitter are not shown as, being randomly chosen, they are of no interest. Random numbers are used because they avoid spurious patterns, which can mislead the eye. Most software packages now contain uniform random number generators. These generate numbers whose values are equally likely to lie anywhere between 0 and 1. Failing that, random numbers can be looked up in printed tables.
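    A minimal sketch of the jittering step, using Python's standard uniform random number generator, is given below. The function name jitter_points, the seed argument and the weights are hypothetical; the seed is there only so that a plot can be reproduced.

```python
import random


def jitter_points(values, seed=None):
    """Pair each observation with a uniform random number in [0, 1)."""
    rng = random.Random(seed)
    return [(v, rng.random()) for v in values]


# Hypothetical weights (kg); tied values get different random positions.
weights = [445, 420, 450, 445, 470, 445]
points = jitter_points(weights, seed=1)
for value, jitter in points:
    print(value, round(jitter, 3))
```

    Plot the first member of each pair on the value axis and the second on the other axis, leaving the jitter axis unlabelled.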

Rank scatterplots

Cumulative rank scatterplots

    A rank scatterplot is probably the most powerful way to examine a frequency distribution.

    {Fig. 6}

    U1ranks3.gif

    The cattle weight data are arranged in ascending order, and the ranks numbered in that order. Sequentially ranking 30 observations in this way means that each observation has a unique rank (r) from 1 to 30.

    Now we can plot a scatterplot of the ranks against their values to give a cumulative rank scatterplot also known as a 'quantile scatterplot'. Sequential ranks are usually preferable for descriptive work.

    Instead of assigning each value a unique sequential rank, we can instead give a mean rank to tied observations. The second figure uses mean ranks rather than sequential ranks, which smoothes the relationship between rank and weight. Although they are less conventional, mean rankings can be very useful for estimation and inference.
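    The difference between the two kinds of rank can be sketched as follows. The function names and the six weights are hypothetical, used only to illustrate how tied observations are treated.

```python
def sequential_ranks(values):
    """Pair each value, in ascending order, with a unique rank 1..n."""
    return [(v, r) for r, v in enumerate(sorted(values), start=1)]


def mean_ranks(values):
    """Pair each value with the mean of the sequential ranks of its ties."""
    ordered = sorted(values)
    n = len(ordered)
    pairs = []
    i = 0
    while i < n:
        j = i
        while j < n and ordered[j] == ordered[i]:
            j += 1
        # Sequential ranks i+1 .. j are tied; they share the mean rank.
        mean = (i + 1 + j) / 2
        pairs.extend((ordered[i], mean) for _ in range(j - i))
        i = j
    return pairs


# Hypothetical weights (kg) containing ties.
weights = [450, 420, 445, 450, 445, 450]
print(sequential_ranks(weights))
print(mean_ranks(weights))
```

    Plotting rank against value for each list gives the two versions of the cumulative rank scatterplot.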

    {Fig. 7}

    Ecoaar4a.gif

    Histograms are apt to be misleading, or problematic, when applied to heavily-tied or highly discrete variables - or to long-tailed or U-shaped distributions.

    We will use the vole number data as an example of a discrete variable. As before we rank observations in ascending order, and plot a scatterplot of the ranks against their values. In this graph we used sequential ranking.

    Alternatively, we can group observations into sets whose values are identical - and rank each set separately. Plotting the rank (rx) of each observation within its set against its value yields the plot shown in the second graph. This is very similar to the bar plots shown below.
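    One reading of this within-set ranking is sketched below: the rank restarts at 1 each time the value changes, so the highest rank in each set equals that value's frequency. The function name within_set_ranks and the vole counts are hypothetical.

```python
def within_set_ranks(values):
    """Pair each sorted value with its rank inside its set of identical values."""
    pairs = []
    previous = None
    rank = 0
    for v in sorted(values):
        rank = rank + 1 if v == previous else 1
        pairs.append((v, rank))
        previous = v
    return pairs


# Hypothetical numbers of voles per colony.
voles = [1, 0, 2, 0, 1, 0]
pairs = within_set_ranks(voles)
print(pairs)
```

    Plotted as points, each value carries a column of dots whose height is its frequency, which is why the result resembles a bar diagram.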

    Note that these graphic displays do not use class intervals - in effect, the class interval is set by the fact that these measurements were recorded to the nearest 1 vole. If every value is different the class interval is effectively zero, and each interval will contain either no values or one value.

    This result can also be achieved on ordinary software packages by dividing data by a very large number of class intervals. The problem with doing this with continuous data, such as cattle weights, is the number of ties depends upon the accuracy of your measurements - and the degree to which they are rounded. If this is not constant, your class intervals will vary accordingly! We provide an example of this problem at the beginning of Unit 3.

Empirical cumulative distribution functions

    An obvious disadvantage of this 'quick and dirty' rank scatterplot is that the y-scale depends upon the number of observations, which makes it harder to compare distributions.

    {Fig. 8}

    MIhist2j.gif

    The simplest way of de-scaling, or 'standardizing', ranks is to divide each rank (r) by the number of observations, n, to give the relative rank of each observation. Using our cattle data this time, we have plotted the relative rank of each observation (r / n) against its value.

    Because it is rather important theoretically, this (cumulative) plot is known, rather grandly, as an empirical cumulative distribution function (ECDF). To emphasize its difference from a theoretical smooth continuous cumulative 'population' distribution function, an ECDF is plotted as a step-function rather than a simple line plot. If every observation is different the plot steps up by a jump of 1 / n at each observation - or by m / n, when m observations are of identical value (tied).
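    The step heights of an ECDF can be computed as in the sketch below, which is a minimal illustration and not a plotting routine. The function name ecdf_steps and the ten weights are hypothetical.

```python
from collections import Counter


def ecdf_steps(values):
    """Return (value, cumulative proportion) at the top of each step.

    The function jumps by m / n at a value shared by m observations.
    """
    n = len(values)
    counts = Counter(values)
    steps = []
    cumulative = 0
    for v in sorted(counts):
        cumulative += counts[v]
        steps.append((v, cumulative / n))
    return steps


# Hypothetical weights (kg), n = 10, with ties at 445, 450 and 480.
weights = [420, 445, 445, 450, 450, 450, 470, 480, 480, 500]
steps = ecdf_steps(weights)
print(steps)
```

    Here the jump at 450 is 3/10 because three observations share that value, while the jump at 470 is 1/10.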

  Regarding proportions & relative ranks

    You may have noticed that, since r is the sequential rank, r/n describes what proportion of observations (p) have ranks which are less than or equal to r. In which case p does not vary from zero to 1 (=n / n), as you might have assumed - because r cannot be less than 1, p cannot be less than 1 / n.

    This may not matter much when n is really large - but it does cause problems if you assume p can be anywhere from zero to 1, or when you are interested in a distribution's more extreme values, or when you are trying to estimate quantiles of some larger 'population' - from which your sample was taken.

    A simple and useful correction to the relative rank (r/n) is to subtract 1/(2n) or 0.5/n, giving p = (r-0.5)/n. This corrected relative rank allows the value of p to vary from 0.5/n to (n-0.5)/n. One advantage of this is that it suggests there is a definite possibility of observing a value greater than your sample's maximum, whereas the relative rank implies there is no possibility - which is generally an underestimate! In other words this correction helps to reduce bias.

    Another such correction, where p equals (r-1)/(n-1), does ensure the median is unbiased but, because it allows p to range from zero to one, it usually underestimates the probability of observing any more extreme values.
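    To see how the three versions of p differ, the sketch below tabulates them for a small sample. The function name relative_ranks is hypothetical and n = 5 is chosen only to keep the table short; n must be at least 2 for the third formula.

```python
def relative_ranks(n):
    """Return (r, r/n, (r-0.5)/n, (r-1)/(n-1)) for each sequential rank r."""
    return [(r, r / n, (r - 0.5) / n, (r - 1) / (n - 1))
            for r in range(1, n + 1)]


table = relative_ranks(5)
print("r   r/n   (r-0.5)/n   (r-1)/(n-1)")
for r, plain, half, ends in table:
    print(f"{r}   {plain:.2f}   {half:.2f}        {ends:.2f}")
```

    For n = 5 the uncorrected p runs from 0.2 to 1, the (r-0.5)/n version from 0.1 to 0.9, and the (r-1)/(n-1) version from 0 to 1.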

P-value plots

    Cumulative rank scatterplots may be simple to produce but they are not an easy way to compare locations, or to decide if samples are distributed symmetrically. Another important reason for examining frequency distributions is to examine their 'tails' or 'outliers' or 'extreme quantiles', that is their more 'divergent' values.

    Albeit seldom found outside statistical journals, there is a simple improvement to displaying cumulative distributions which produces an empirical P-value plot. This (rescaling function) was developed for infinitely large distributions that are continuous (so no two values are the same).
    1. Plot the cumulative distribution, in other words p on y (where, by convention, p is the proportion less than or equal to y). Then:
    2. Plot the inverse cumulative distribution, that is 1 - p on y (where, by convention, 1 - p is the proportion greater than or equal to y).
    Since this results in every observation being plotted twice, to avoid unnecessary duplication, points are usually omitted for which p or 1 - p are greater than 0.5

    {Fig. 9}

    pvp01.GIF

    You can use the same method for sample distributions provided you allow for the fact they are finite. In other words the value of p should be (r-0.5)/n, where n is the sample size and r is the sequential rank of y.
    The P-value plot (right) shows the result of this rescaling upon our cattle weight data. It conveys more information than a histogram or a box and whisker plot. While this graph shows the cumulative and inverse cumulative distributions colour-coded, and has two y-axes, it may be easier to provide that information in the text, and use lineplots instead.
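    The rescaling for a finite sample can be sketched as follows. This is our reading of the steps above and not a published routine: the function name p_value_plot_points, the branch labels and the five weights are hypothetical. Each observation is kept once, on whichever branch gives a proportion of 0.5 or less.

```python
def p_value_plot_points(values):
    """Return (y, proportion, branch) for an empirical P-value plot.

    p = (r - 0.5) / n is used on the cumulative ("lower") branch and
    1 - p = (n - r + 0.5) / n on the inverse cumulative ("upper") branch.
    """
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for r, y in enumerate(ordered, start=1):
        p = (r - 0.5) / n
        if p <= 0.5:
            points.append((y, p, "lower"))
        else:
            points.append((y, (n - r + 0.5) / n, "upper"))
    return points


# Hypothetical weights (kg), n = 5.
weights = [470, 420, 500, 445, 450]
points = p_value_plot_points(weights)
for y, prop, branch in points:
    print(y, prop, branch)
```

    The plotted proportion rises to a peak of at most 0.5 near the median and falls away on either side, so both tails can be compared on one scale.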

Advantages of rank scatterplots

    Rank scatterplots, and their ilk, have several advantages -

    1. Because they do not use class-intervals, they involve no loss of information and can cope with any data that can be ranked;
    2. You can more accurately determine which value demarcates a given proportion of your observations;
    3. Scatterplots allow you to plot the observations in any order (some packages provide ranks, but do not sort the observations into rank order).
    4. Interpolation and function-fitting are easier, more transparent, and less arbitrary.
    5. They provide valuable insight into some important but intractable issues - such as mid-P-values.

    Despite these advantages, and their popularity among statisticians, rank scatterplots and quantile plots are seldom used to explore or present biological data. However, because they are such powerful and transparent tools, we use rank scatterplots and quantile plots extensively in this course.

 

Dot charts & pie charts

Dot charts

    The problem with bar diagrams and line diagrams is that much of the 'ink on the page' is redundant. In other words, if you look at Fig. 11, the vertical lines carry no information - only the location of the top of the line carries any information. Hence in a dot chart the bars or lines are dispensed with, and the top of the line indicated with a dot.

{Fig. 13}

U1dotc.gif

    This type of plot is strongly advocated in some of the texts on data display we have included below - but academia can be very conservative in some respects, and they are still relatively rare in the literature.

Pie charts

    A pie chart can be used to display the relative frequency distribution of a nominal variable. The pie chart has a long history of use in displaying frequency distributions. It is generally credited to William Playfair, who published one in 1801, and Florence Nightingale later made a related circular diagram famous in her reports on army hospitals. We have used one to display the data we gave above on the reasons people gave for hunting wildlife in Tanzania.

{Fig. 15}

Ecocrp05.gif

    The first figure shows a simple 2-dimensional pie chart. If one is going to use this (rather poor) form of display, this is the only type to use since it is unbiased. The other types are all designed to accentuate the size of one or more sectors at the expense of the others.

    For example, the second figure shows one sector partially extracted or 'exploded' to draw attention to that particular sector. The third figure appears 3-dimensional. The central area is still circular, but because the edges of sectors in front are shown, they appear bigger. The fourth figure shows an elliptical-shaped pie chart. These should always be avoided as some sectors have a disproportionately larger area than other sectors.

mcnemarwics1985.blogspot.com

Source: https://influentialpoints.com/Training/display_of_frequency_distributions.htm
