Scatter plots (04:40 minutes)

Video Transcript

Poor Billy Fakespeare the Ghost - his Medieval Party was a bust. Hardly any ghost guests showed up. But, to celebrate his 400th birthday, he’s determined to have a big luau-themed shindig with lots and lots of guests. To plan the perfect party, he uses scatter plots.

Positive correlations

On a Cartesian plane, scatter plots are used to show the relation between variables to identify trends. Take a look at this scatter plot – it shows the relation of the popularity of a DJ to the number of guests attending a party. For example, a DJ with a 50 percent popularity rating had 200 guests in attendance and a DJ with a popularity rating of 80 percent had 350 guests. The graph indicates a trend: The more popular the DJ, the greater the attendance at the party. Notice the points on the graph are grouped together - this indicates a high correlation.

And since both variables increase together, the correlation is positive. When points are grouped together, you can draw a 'trend line', also known as the 'line of best fit', and by using any two points that lie on or near the line, you can calculate the slope of the line. Then use the slope and one of the known points to write an equation for the trend line. For this line, using a slope equal to 5 and the ordered pair (50, 200), we can figure out the equation of the line. You can also use the trend line to predict unknown values for 'x' and 'y'. For 'x' equal to 20, we can determine that 'y' equal to 50 is a better prediction than 'y' equal to 300.
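Written out, the steps might look like this. This is a sketch that assumes the point-slope form of a line, which the video does not name explicitly:

$$y - 200 = 5(x - 50) \quad\Rightarrow\quad y = 5x - 50$$

$$x = 20: \quad y = 5(20) - 50 = 50$$

So the trend line predicts 'y' equal to 50 at 'x' equal to 20, which is why 50 is the better prediction.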

Negative correlations

Fakespeare thinks he’s got the entertainment for the party all figured out. He invites DJ Mozart to rock the house, but he wonders, is music enough? What about games? He does some research. Take a look at the table. Is there a trend between the number of silly party games and party attendance? Let’s design a scatter plot. For the x-axis, list the number of games, and for the y-axis, list the attendance. Now, plot the ordered pairs. Hmmm, the points are grouped together, so the data is highly correlated, but as the number of games increases, the number of guests decreases, and this indicates a negative correlation.

When there is a negative correlation, as one variable increases, the other decreases. You don’t need to be a genius to figure out that party games are a terrible idea, so Fakespeare decides, there will be no party games. What about refreshments? Will having tropical drink umbrellas make people want to come to the party? Let’s take a look at the scatter plot and see if there's a trend. The points on the graph are very spread out, so there is no correlation and no trend. Tropical drink umbrellas might not increase attendance, but they won’t have an adverse effect either, so Fakespeare orders a case just because he likes them. It seems as though Fakespeare has got everything under control, but do you? Let’s make sure you are good to go with scatter plots.

Correlation Interpretation

When the data is spread out with no pattern, that means there is little to no correlation and no trend. Although this scatter plot shows the points grouped together, there is no trend. If the line of best fit is horizontal, that means that what we measure on the x-axis has no influence on what we're measuring on the y-axis. What if the line of best fit is vertical? Since the slope of a vertical line is undefined, there is no correlation and no trend. One last note: If there is a correlation, don’t automatically jump to the conclusion that there is also a trend. You will need to use common sense because sometimes a correlation is not causation – meaning, one thing does not necessarily cause the other. Take a look at this example. Based on the trend line you might think the house number and party attendance are related, but that’s coincidence, not a trend. When interpreting trends, remember to use common sense. Fakespeare’s party is a huge success! Too bad, though: none of the photos that were snapped lasted very long. Maybe they're on to something?

Scatter plots Exercise

Would you like to practice what you’ve just learned? The practice problems for this video, Scatter plots, help you practice and recap your knowledge.

Hints

We examine if the value of $x$ has an impact on the value of $y$.

A line is given by the equation $y=mx+b$, where $m$ is the slope and $b$ is the $y$-intercept.

A line with a positive slope is increasing. This one has a negative slope and is decreasing.

Solution

On a coordinate plane, scatter plots are used to show relationships between variables in order to recognize trends.

Take the first scatter plot as an example: It shows the impact of a DJ's popularity rating on the number of guests attending a party. A DJ with a 50% popularity rating has 200 guests in attendance, and one with an 80% rating has 350 guests.

So we can assume a trend (correlation): The higher the DJ popularity rating, the higher the number of guests.

The points on the graph are grouped closely together. This indicates a high correlation. In this case the correlation is positive.

So you can draw a trend line, also called the line of best fit.

For the line of best fit you can calculate the slope as well as the $y$-intercept using two given points on the line.

• Interpret the different scatter plots.

Hints

Is there a line that fits the given data? If so, this line is given by the equation $y=mx+b$, where $m$ is the slope.

An increasing line of best fit has a positive slope and thus a positive correlation.

If the data isn't grouped at all, there is no correlation.

If the data doesn't change depending on $x$, meaning the line of best fit is parallel to the $x$-axis, there is no correlation.

Solution

Let's consider the diagrams from left to right:

1. When the data is spread out with no pattern, we can conclude that there is no correlation and no trend.
2. But even if the data is grouped together, we can't automatically conclude that there is a correlation. If the line of best fit is horizontal, then what we measure on the $x$-axis has no influence on what we're measuring on the $y$-axis. Therefore, no correlation exists.
3. If the line of best fit is a vertical line, the slope is undefined. Thus, we have no correlation and no trend.
4. An increase in grouped data from left to right represents a positive correlation.
5. A decrease in grouped data from left to right represents a negative correlation.
Note: correlation does not mean causation.
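The rules for cases 2 to 5 can be sketched in Python. This is only an illustration: it assumes the points are already grouped closely around a line, it uses the least-squares slope as the "line of best fit" (a method the exercise does not specify), and the function names `fitted_slope` and `describe` are made up for this example.

```python
def fitted_slope(points):
    """Least-squares slope of y on x, or None if every x is the same (vertical line)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return None  # vertical line: slope is undefined
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def describe(points, tolerance=1e-9):
    """Name the correlation suggested by the sign of the fitted slope."""
    slope = fitted_slope(points)
    if slope is None:
        return "vertical line: slope undefined, no correlation"
    if abs(slope) <= tolerance:
        return "horizontal line: no correlation"
    return "positive correlation" if slope > 0 else "negative correlation"


# Tiny made-up data sets, one per case (not taken from the exercise).
print(describe([(1, 3), (2, 3), (3, 3)]))  # horizontal
print(describe([(2, 1), (2, 2), (2, 3)]))  # vertical
print(describe([(1, 2), (2, 4), (3, 6)]))  # increasing
print(describe([(1, 6), (2, 4), (3, 2)]))  # decreasing
```

The sketch does not detect case 1 (data spread out with no pattern). Deciding that requires a measure of how tightly the points are grouped, which the slope alone does not give.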

• Draw a scatter plot.

Hints

Pay attention to the labelling of the $x$- as well as $y$-axis.

If you want to draw the point $(220,190)$, draw a line parallel to the $x$-axis passing through $y=190$ and one parallel to the $y$-axis passing through $x=220$. The intersection of those lines is the point you want.

The age is represented by $x$, while the number of friends is represented by $y$.

Solution

Here you see the resulting scatter plot. For each age ($x$), a number of friends ($y$) is given. This gives seven ordered pairs in total, which you can see in the diagram from left to right:

• $(220,190)$
• $(230,170)$
• $(250,160)$
• $(280,140)$
• $(320,140)$
• $(350,130)$
• $(380,120)$
How can you draw a given ordered pair in a coordinate plane?

Let's have a look at $(280,140)$:

• Draw a line parallel to the $x$-axis passing through $y=140$.
• Draw a line parallel to the $y$-axis passing through $x=280$.
• The intersection of those lines is the point you want.
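To check what trend these seven points show, one option is to compute a least-squares line and a correlation coefficient. Neither technique is named in the exercise, so treat the following as an added sketch: the Pearson coefficient `r` and the helper name `summarize` are assumptions made for illustration.

```python
import math

# Ordered pairs (age, number of friends) listed in the solution above.
points = [(220, 190), (230, 170), (250, 160), (280, 140),
          (320, 140), (350, 130), (380, 120)]


def summarize(points):
    """Return (slope, intercept, r) for the least-squares line of y on x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r


slope, intercept, r = summarize(points)
print(f"slope={slope:.3f}, intercept={intercept:.1f}, r={r:.2f}")
```

For this data the slope is negative and `r` comes out near $-0.94$, which matches what the plot suggests: the points are closely grouped and fall from left to right, so the correlation is negative.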

• Interpret the given scatter plot.

Hints

An increasing line of best fit stands for a positive correlation.

The $x$-axis represents the amount of effort needed and the $y$-axis represents the amount of fun had.

Solution

Let's pick some pets:

With turtles, the effort they take isn't so much... however, the resulting amount of fun isn't too high either.

With cats and dogs, perhaps the most beloved pets, the effort for a cat is a little bit less than the effort for a dog. According to this diagram, the fun is also a little bit less for a cat than for a dog. But perhaps cat lovers wouldn't agree.

The pets that take the most effort are horses, and they are also the animals that are the most fun.

We can conclude that the data seems to be grouped, and that the line of best fit is increasing. So we have a positive correlation: the higher the effort, the higher the fun, and vice versa.

• Determine the slope-intercept form of the line of best fit.

Hints

Use this formula to find the slope: $m=\frac{y_2-y_1}{x_2-x_1}$.

Use the slope-intercept form of a line ($y=mx+b$) to find the $b$ term by plugging in either point as $x$ and $y$.

"DJ with a 50% popularity rating has 200 guests in attendance" can be represented by the ordered pair $(50, 200)$.

"DJ with 80% popularity leads to 350 guests" can be represented with the ordered pair $(80, 350)$.

The point $(50, 200)$ gives us $x_1 = 50$ and $y_1 = 200$.

The point $(80, 350)$ gives us $x_2 = 80$ and $y_2 = 350$.

Solution

Any non-vertical line can be expressed in slope-intercept form as $y=mx+b$.

1. We first determine the slope $m$ by the formula:
• $m=\frac{y_2-y_1}{x_2-x_1}$.
• So we need two points. They come from the given information: a 50% popularity rating corresponds to 200 guests, and an 80% rating corresponds to 350 guests.
• So we have two points $(50,200)$ and $(80,350)$. Now we put the coordinates of those points into the formula above to get
• $m=\frac{350-200}{80-50}=\frac{150}{30}=5$.
2. This gives us $y=5x+b$ with an unknown $y$-intercept. Last, we put the coordinates of one point into this equation. We chose the point $(50, 200)$, which gives:
• $200=5(50)+b$.
• Subtracting $250$ results in the $y$-intercept $b=200-250=-50$.
3. So, the linear equation is $y=5x-50$.
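The same two steps can be sketched as a small Python function. The name `line_through` is made up for this example; it simply applies the slope formula and then solves $y=mx+b$ for $b$.

```python
def line_through(p1, p2):
    """Return (m, b) for the line y = m*x + b through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x2 == x1:
        raise ValueError("vertical line: slope is undefined")
    m = (y2 - y1) / (x2 - x1)  # step 1: slope from two points
    b = y1 - m * x1            # step 2: solve y1 = m*x1 + b for b
    return m, b


# The two points from the DJ example: (popularity rating, guests).
m, b = line_through((50, 200), (80, 350))
print(m, b)  # slope and y-intercept of the trend line

# Using the line to predict attendance at a 20% popularity rating.
predicted = m * 20 + b
print(predicted)
```

Using the second point, $(80, 350)$, in step 2 instead would give the same $b$, since both points lie on the line.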

• Explain what kind of data you can represent in a scatter plot.

Hints

Here you see an example of a bar graph.

An ordinal data set is one where each data point is assigned a numerical quantity which establishes an ordering on the entire set of data.

A nominal data set is one where each data point is assigned to a distinct category, which does not provide a measurement or order on the set of data.

The bar graph represents a nominal data set.

Let's have a look at an example: if three people lived in house number $1$, four people lived in house number $2$, five people lived in house number $3$, and so on, then we couldn't conclude that the house number tells us anything about the number of people living in that house.

Solution

Scatter plots are used to show relations between variables to recognize trends. The data used must be ordinal in order to make a scatter plot, as there must be a way to order the points so that they can be compared.

You can use scatter plots to try to find correlations. However, a positive (or negative) correlation doesn't have to imply a trend. For example, if three people lived in house number $1$, four people lived in house number $2$, five people lived in house number $3$, and so on, then we couldn't conclude that the house number tells us anything about the number of people living in that house.

For ordinal data, bar or line graphs can also be used.

Nominal data cannot be represented with a scatter plot, so bar graphs are usually used instead.