The word “Relationships” is used in many contexts. We use it when we talk about our family, friends, sports teams, etc., however, how often do you hear people talk about the relationships between data points; one variable to another variable. How can data points have relationships? After all it’s just data, isn’t it?
Data has a story to tell. If we study it closely and learn to speak it’s language we might find data has a lot to say. With the right techniques, we can learn and gain valuable insights from the data. As we spend time understanding it, we might discover that data points are dependent upon each other or that they share no relationship at all. Data has a story to tell, we just need to take a closer look.
Let’s take a closer look at data containing Cereal Nutritional Ratings and the Ingredients for 77 Brands of Cereal a data set used in Week 36 of 2020 for MakeoverMonday; a weekly learning opportunity for data enthusiasts created by Andy Kriebel. There are various methods for exploring data, and I like to start by connecting the data to the Field Summary Tool in Alteryx. The Field Summary Tool will help us analyze our data by supplying descriptive statistics, scatterplots and histograms.
The first output I review is the O Anchor. Its a table that provides descriptive statistics about the fields, min, max, median, standard deviation, percent missing, number of unique values and the mean. All of these statistics can provide value during our analysis.
As I review the table, I noticed a few negative numbers in the grams of complex carbohydrates, grams of sugar and milligrams of potassium. These must be data entry mistakes and should be adjusted depending upon the type of analysis you are doing. For my purposes, I made the assumption the negative sign (-) was the typo and removed it from the data before proceeding to one of my favorite tools, the Pearson Correlation Matrix (PCM) a method for understanding the relationships in the data. Before we talk about the PCM, let’s continue reviewing the information available in the Field Summary Tool which includes histograms.
Its best practice when exploring data to review the distribution of each variable in equal bin sizes with a histogram. A histogram provides a visual way to evaluate the amount of distribution to determine whether the distribution is normal (like a bell shaped curve), skewed to the left or right, or bimodal. It’s also good method for identifying outliers as each bar represents the frequency within each bin.
As an example, the histogram for Ratings of Cereal has equal bin sizes of 10 for each bar. The distribution is slightly skewed to the right where the median rating is 40.4. We can see a large number of cereals have ratings within the range of 30-40 with one outlier scoring 93.7. If we were planning to conduct any predictive modeling, the distribution of the histogram can influence the performance of the model and is a necessary step to evaluate during exploratory data analysis.
If you don’t have Alteryx, you can create a statistical summary and histograms in Tableau, Tableau Prep, or even Excel. For more advanced statistical analysis, R Studio which uses a free open source R programming language is also an option.
Let’s proceed to one of my favorite methods to understanding relationships; the Pearson Correlation Matrix (PCM). The PCM is often used to understand the relationships between the data variables by providing the correlation coefficients which help us understand the strength of the relationship and whether the relationship is positive, negative or has no relationship at all. In order to apply the PCM to the data, it is important to know it can only be applied on data sets with positive numerical values. Therefore, sometimes data clean up is necessary before utilization.
In Alteryx, I used the Pearson Correlation Tool and selected the variables I wanted to review. The output was a table displaying the correlation coefficients for each variable and how they relate to each other. To read the coefficients, we must line the column header with the row header to see the relationship for the two variables. In the table below, the correlation is .904 between Milligrams of Potassium and Grams of Fiber. This is a positive correlation with a correlation of 1 considered a perfect correlation. Foods such as fresh fruit and vegetables, nuts, and whole grains are high in fiber and high in potassium, therefore, they share a relationship in the cereals that contain those ingredients.
This is great! I can start understanding the relationships between the variables but reading from column to row can be difficult. I wanted something with a color heat map or a scatter plot that would provide additional guidance and that’s when I thought of Tableau. I knew that I could build a scatter plot of all the variables in Tableau and surely there must be a way to build a correlation matrix.
First let’s build the Pair Plot to visually understand the relationships between the variables, identify any outliers and view the slope of each simple linear regression. To start, as I mentioned earlier, I removed the negative sign from the data set and reviewed for any missing values.
Building the Pair Plot in Tableau
Drag your first two measures on rows and columns. The measure should be exactly the same. In my example, I was interested in the relationship of the ingredients to the ratings, therefore, I made rating my predictor variable.
Next under Analysis, remove the check next to Aggregate Measures and a scatter plot should display. Continue adding all the measures in sequence until you have the pair plot created.
To make it easier to read, under format remove grids line and zero lines from columns and rows.
Next add a trend line (under Analytics) to understand the slope of the graph and whether or not the model is statistically significant. A p-value less than .05 is considered statistically significant. The lower the p-value the more significant the model. The linear trend line also supplies the R-Square which is the coefficient of determination and describes how well the trend line fits the data. It helps us understand the proportion of variability in the dependent variable that is explained by the independent variable. The range is 0 to 1, and the closer the R-square is to 1 the better the fit to the linear regression model.
In the image on the far left below we see that a simple linear regression equation for Milligrams of Potassium and Grams of Fiber has a R-square of .81. We visually see the plots are fitted closely to the linear trend line with a few outliers. Additionally, the p-value is less than .05 noting that this model is statistically significant. In comparison, Ratings of Cereal to Grams of Complex Carbs on the far right has zero to limited correlation with the plots randomly distributed from the trend line that barely has a slope. This is confirmed by the R-Square of .003 and Correlation of .05. Therefore, there is no linear relationship between these two variables.
The simple linear regression equation for Ratings of Cereal and Grams of Sugar has a negative correlation. We can gather some more insight by right clicking on the line and selecting describe trend line. In the image below, we can see that for each gram of sugar added to the cereal the rating has the potential to drop by 2.43 points. The p-value is statistically significant as well.
If your interest is to predict the nutritional ratings, it is important to review the relationship of all variables due to multicollinearity. This can occur when one variable is highly related to another variable and their relationship is dependent upon one another.
The last item to address in our Pair Plot, is to touch up the size of the plots and color. I selected purple with opacity of 22% and a light border. Now I can start to evaluate the relationship and distribution for all the variables in this data set.
That’s how you build a Pair Plot!
This is what inspired my MakeoverMonday visualization. In my visualization, I manually calculated the correlation coefficients but felt there must be a solution to creating a full correlation matrix in Tableau.
After submission I found a great video by SuperDataScience by Kirill Eremenko on how to create a full correlation matrix in Tableau.
Once I found this gem, I wanted to apply transparent sheets to overlay the Pair Plot on top of the Correlation Matrix to create one complete view on a dashboard. To accomplish this, I needed to format the Pair Plot to remove the worksheet color and make it transparent. Under format shading set worksheet to None for color.
For the Correlation Matrix, I removed row and column dividers from format borders by sliding the level all the way to the left.
Next, format the font size and color of the text to your liking. It should look similar to the image below.
Create a dashboard with the Correlation Matrix as Tiled and the Pair Plot as Floating. The floating dashboard (Pair Plot) with it’s transparent background will sit on top of Correlation Matrix. Sizing the floating dashboard is a bit tricky with lining up the squares perfectly between both sheets on the dashboard.
Now we can review our correlation coefficients at the same time as we analyze our linear regression models to get a better understanding of the relationships between each ingredient and the influence on the nutritional ratings. We can begin to understand this data’s story.
Please feel free to download my Cereal Correlations workbook on Tableau Public.
Thanks for this! I’m attempting to create something similar in Tableau and this is really helpful because my data structure is similar to what you used here. I had also come across the SuperDataScience tutorial but the structure there was different. Did you restructure your data to create the correlation matrix? If so, how?