
Introduction to Analytics, Lecture 6

Graphical summary using SAS/GRAPH: introduction to histograms, box plots and scatter diagrams

Histogram

Quantitative Data

Quantitative data consists of numerical observations such as sales and profits. The main techniques for presenting quantitative data are:

  • Histogram
  • Scatter Plot

In this section we will look at histograms in depth and then see how we can create them in SAS.

What is a Histogram?

A histogram is a graphical representation of the distribution of data, usually drawn as a bar graph. It serves as an estimate of the probability distribution of a continuous variable. It was first introduced by Karl Pearson in 1891.

The first step in creating a histogram is to divide the entire range of values into a series of intervals called “bins” and then to “drop” each individual value into the bin it belongs to. Bins are usually of equal width, but they do not have to be. If the bins are of equal width, the height of each bar represents the frequency of occurrence for that bin. If the bins are not of equal width, the area of each bar represents the frequency, and the vertical axis represents the density (the frequency divided by the bin width). In both cases the bars touch one another to indicate that the variable is continuous.
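The relationship between frequency, bin width and density can be checked with a few lines of SAS. This is only a minimal sketch: the data set name bin_density, the three bins and their counts are all made up for illustration.

```sas
/* Hypothetical bin counts: two bins of width 10 and one of width 30 */
data bin_density;
  input lower upper count;
  width   = upper - lower;     /* bin width */
  density = count / width;     /* bar height when bin widths differ */
  area    = density * width;   /* bar area, which recovers the count */
  datalines;
0 10 20
10 20 30
20 50 30
;
run;

proc print data=bin_density noobs;
run;
```

Here the density works out to 2, 3 and 1 for the three bins, and density times width gives back the original counts. That is why the area of a bar, not its height, stands for the frequency when the widths differ.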

A histogram can be used to visualize any phenomenon that combines a continuous measurement with a count of occurrences. For example, a histogram can show the commute times of people going to work: the horizontal axis represents time and is divided into bins, while the vertical axis represents the number of people whose travel time falls in each bin.

In the most common form of histogram, the rectangles show the frequency of data items in successive numerical intervals of equal size. The independent variable (the quantity being measured) is plotted along the horizontal axis and the dependent variable (the frequency) along the vertical axis. The data appear as colored or shaded rectangles of varying area.
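The commute-time example can be tabulated by hand to see what the bars of such a histogram would show. The sketch below assumes fifteen made-up commute times in minutes and a bin width of 10 minutes; the data set name commute and the variable names minutes and bin_start are hypothetical.

```sas
/* Hypothetical commute times, in minutes */
data commute;
  input minutes @@;
  bin_start = floor(minutes / 10) * 10;   /* lower edge of the 10-minute bin */
  datalines;
12 18 25 31 22 45 38 27 15 52 33 29 41 19 24
;
run;

/* One row per bin: the frequency is the height of that bin's bar */
proc freq data=commute;
  tables bin_start / nocum nopercent;
run;
```

For these values the bins starting at 10, 20, 30, 40 and 50 minutes hold 4, 5, 3, 2 and 1 people respectively, so the tallest bar would sit over the 20 to 30 minute interval.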

 

Applications of Histograms

  1. Identifying the most common process outcome: Collecting the data on the final state of a process and organizing it in a histogram quickly shows which outcomes occur most often and makes any unusual patterns apparent.
  2. Identifying data symmetry: A histogram helps us see whether a particular variable is roughly symmetric and bell-shaped (approximately normal) or skewed. This matters in analytics because many techniques assume approximately normal variables; when a variable is clearly not normal, it may need to be transformed or analyzed with a method that does not rely on that assumption.
  3. Spotting deviations: A histogram is one of the most useful tools for spotting oddities and identifying worrying trends. Keeping the histograms produced in the course of your work and referring back to them makes analysis easier, because you can tell whether a deviation stems from an old issue or from a recent change in your operations.
  4. Spotting areas that require little effort: A histogram can also help determine when you are spending too much effort or too many resources on a specific task. Sometimes a part of your process does not require as much attention as you think it does, and a histogram of the current resource allocation can reveal that.

Let’s now turn to how we can create histograms in SAS.

/* Descriptive statistics and a histogram for SALE_AMOUNT */
PROC UNIVARIATE DATA=mylib.CANDY_SALES_SUMMARY;
VAR SALE_AMOUNT;
HISTOGRAM SALE_AMOUNT;
RUN;

This code presents quantitative data graphically. PROC UNIVARIATE generates the key descriptive statistics for a variable; here the VAR statement names SALE_AMOUNT as the variable under consideration. The HISTOGRAM statement requests a histogram of that variable. If no further options are given, the histogram is drawn as a two-dimensional chart.
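The HISTOGRAM statement also accepts options after a slash. The sketch below is one possible variation, not part of the original lecture code: it assumes a small made-up data set, work.candy_sales_demo, standing in for mylib.CANDY_SALES_SUMMARY, overlays a fitted normal curve to help judge symmetry, and shows counts rather than percentages on the vertical axis.

```sas
/* Hypothetical stand-in for mylib.CANDY_SALES_SUMMARY */
data work.candy_sales_demo;
  input SALE_AMOUNT @@;
  datalines;
120 135 150 160 175 180 190 210 230 260 310 420
;
run;

/* NORMAL on the PROC statement adds tests for normality to the output */
proc univariate data=work.candy_sales_demo normal;
  var SALE_AMOUNT;
  histogram SALE_AMOUNT / normal          /* overlay a fitted normal curve */
                          vscale=count;   /* counts on the vertical axis */
run;
```

If the bars follow the overlaid curve closely, the variable is roughly normal; a long tail on one side, as the larger values in this made-up sample suggest, points to skewness.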

/* Separate statistics and histograms for each SUBCATEGORY */
PROC UNIVARIATE DATA=mylib.CANDY_SALES_SUMMARY;
CLASS SUBCATEGORY;
VAR SALE_AMOUNT;
HISTOGRAM SALE_AMOUNT;
RUN;

PROC UNIVARIATE again generates the descriptive statistics for the variable sale_amount in the data set candy_sales_summary, and the HISTOGRAM statement constructs a histogram for the same variable. The CLASS statement names subcategory as a grouping variable, so the statistics and the histogram are produced separately for each subcategory rather than once for the whole data set.

The diagram below would be the output of the above code.
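A self-contained variant of the grouped example can be tried without access to the course library. This is a sketch under stated assumptions: the data set work.candy_by_subcat and its two subcategory values are invented, and the PROC MEANS step is added only to show one way of getting total sales per subcategory, which PROC UNIVARIATE with CLASS does not report as a separate table of totals.

```sas
/* Hypothetical sales records with a grouping variable */
data work.candy_by_subcat;
  input SUBCATEGORY :$12. SALE_AMOUNT;
  datalines;
Chocolate 120
Chocolate 150
Chocolate 180
Chocolate 210
Gummies 90
Gummies 110
Gummies 140
Gummies 160
;
run;

/* One set of statistics and one histogram panel per subcategory */
proc univariate data=work.candy_by_subcat;
  class SUBCATEGORY;
  var SALE_AMOUNT;
  histogram SALE_AMOUNT / nrows=2;   /* stack the two panels in two rows */
run;

/* Total sales per subcategory, if that is what is needed */
proc means data=work.candy_by_subcat sum;
  class SUBCATEGORY;
  var SALE_AMOUNT;
run;
```

With these made-up values the totals come to 660 for Chocolate and 500 for Gummies, while the comparative histogram shows how the individual sale amounts are spread within each group.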
