Research Article

A Student-Centered Exploration of Influential Points in Linear Regression Using Desmos

Abstract

Linear regression, an introductory statistics topic, typically involves finding and evaluating the line of best fit as well as making predictions based on it. However, the subtopic of influential points is often neglected. We present and report on findings of a series of four interactive student-centered activities in Desmos that facilitate student discovery of potential influential points and determine the conditions that would result in them having a large effect on the position and strength of a linear relationship. Placing these activities in the Desmos environment enables students to test their conjectures about leverage points and points with large residuals as they receive immediate feedback that promotes the development of understanding of influential points. Supplementary materials for this article are available online.

1 Introduction

Scatterplots provide an opportunity for students to explore the relationship between two quantitative variables. Typical early explorations often involve fitting a line, informally or formally assessing the fit, interpreting the parameters (slope and y-intercept) of the line as well as the corresponding value of the correlation coefficient, and then using the line to make predictions. The statistics topic of fitting a line to a bivariate dataset appears at both the secondary school and undergraduate levels. Departures from linearity is formally listed as Topic 2.9 of the AP Statistics Course Framework (College Board 2019) as well as being included in introductory statistics texts at the undergraduate level (De Veaux, Velleman, and Bock 2018, chap. 8; Moore, Notz, and Fligner 2018, chap. 5; Tintle et al. 2020, chap. 10; Lock et al. 2021, chap. 2). Being aware of and understanding the effect of unusual data points on the position of a line of best fit and the strength of a linear relationship is a critical, but too often neglected step for students as they learn about interpreting a potential linear relationship between variables. It is important that students recognize that a line of best fit, the least squares regression line (LSRL), can be mathematically calculated for any set of bivariate quantitative data. The decision to use a line of best fit to make predictions should be based on several factors. An examination of a scatterplot can inform whether a relationship (linear or nonlinear) is relevant to the data. A scatterplot can also reveal the existence of points that may have a great influence on the position of the line and strength of a linear relationship, especially in the case of small datasets. An examination of the context of the data and how the data was gathered can help the statistician decide whether unusual points should remain included in the dataset.
Assuming a scatterplot indicates that a linear relationship is appropriate, the strength of a linear relationship can be measured by the correlation coefficient and inferential statistics can be used to justify a nonzero slope.

The approach to teaching and learning about influential points has evolved and broadened from a comparison of static scatterplots and their LSRLs to a more dynamic, student-centered engagement involving conjectures and exploration. Prior to the influx of dynamic statistical environments (DSE), a term which we use to include applets and web-based software platforms, the teaching of influential points was often examined through the lens of static cases provided by the teacher or text. A scatterplot of the original data and its associated LSRL would be plotted alongside the LSRL resulting after the removal of an indicated leverage point and perhaps separately alongside the LSRL resulting from the removal of a point with a large residual. In addition to observing the effect of the removal of a point on the resulting position of the LSRL, students would also be asked to observe the change in the correlation coefficient r as they determined the influence of a point. The introduction of DSEs provided students with greater opportunities to conjecture and test their predictions, moving learning environments away from those that depended heavily on direct, teacher-led instruction. Students using a DSE in the study of influential points may first surmise and select which point of a given dataset they believe to be influential, and then remove that point to observe the resulting effect on the LSRL and r. In some cases, DSEs allow students to add a point to a dataset and observe its effects. In other words, a novice student using a DSE is often more engaged in an investigative learning environment. Our sequence of activities is embedded in a student-centered, dynamic, and interactive statistics learning environment that encourages even greater engagement in learning than most DSEs. 
Students not only identify influential points but also tease apart the interplay of a point’s x- and y-coordinate positions relative to the overall trend of the data and investigate the effect of sample size on the resulting influence of a point on the position of the LSRL and value of r. Additionally, students and teachers receive feedback on student understanding that is greater in quantity and broader in type than that which they typically have access to in most DSEs.

Active learning as a means to engage students in the learning process and to promote higher-order thinking is well supported in the literature (Braun et al. 2017; Freeman et al. 2014; Kogan and Laursen 2014). “Using active learning methods in class allows students to discover, construct, and understand important statistical ideas as well as to engage in statistical thinking” (Carver et al. 2016, p. 18). In an effort to place ownership of discovery and learning about influential points in linear regression in the hands of the student, we created a series of interactive, inquiry-based activities that use the free online learning platform known as Desmos (2022). Though Desmos may be best known for its online graphing calculator tool (Schwartz 2016), teacher-created activities in Desmos Classroom, such as those described in this article, allow teachers to design, share, and modify unique learning experiences that encourage active student learning. Glaze, Moyer-Packenham, and Longhurst (2021) found that whereas teachers reported using adaptable computer tutoring systems for their students’ procedural practice and calculators for routine calculations and visualizations, teachers noted that Desmos was well designed to engage students in mathematical explorations to promote conceptual understanding. Secondary school teachers reported several factors that influenced their choice to integrate Desmos into their mathematics lessons: the capacity to support student understanding of mathematics, the promotion of student engagement, the ease of use, the collaborative feature of sharing student solutions, the multiple feedback methods, and the availability of Desmos on multiple types of devices (McCulloch et al. 2018). The activities that we created provide opportunities for students to explore statistical thinking while taking advantage of the assets that have made Desmos popular among mathematics teachers.

We advocate that the hands-on exploratory environment of Desmos provides a rich and engaging opportunity for introductory statistics students, at both the high school and undergraduate levels, to explore, conjecture, and receive immediate feedback as they study the topic of influential points in linear regression. In this investigation we present a series of hands-on, student-centered Desmos activities that promote the dynamic discovery that not all points contribute equally to a linear relationship. Through interactive exploration, students identify potential influential points and investigate their effects on the position of the line of best fit and the strength of the correlation.

Like other dynamic statistical environments (DSE) that have a focus on linear regression (Bedford et al. n.d.; Chance and Chance 2022; Loveland 2015; Pearson 2019; Variyath and Nadarajah 2022), our Desmos activities allow users to move points and observe the effect on the position of the LSRL and the magnitude of the correlation. However, the Desmos platform offers even more extensive learning opportunities by providing teachers with far greater capacity to design and sequence explorations that promote higher-order thinking and learning of statistical concepts. Our sequence of activities is pointedly designed to scaffold understanding and confront misconceptions about key concepts of influential points as students progress from comparing the effect that unusual points (a leverage point or a point with a large residual) have on the line of best fit to determining how the location of a leverage point and the sample size affect a leverage point’s influence on the line of best fit and its associated correlation.

Desmos activities can be designed for students to progress with feedback at their own pace. While multiple-choice questions are often the prominent method of feedback and assessment in most DSEs, Desmos activities provide an assortment of prompts for feedback. We use several features of the Desmos platform to provide students with immediate feedback: a comparison of their sketch of a speculated line of best fit to the respective true line of best fit, the use of a moveable point to observe the effect of the point’s position on the line of best fit, and the use of card sorts and multiple-choice questions to check understanding through formative assessment. Additionally, open-ended prompts that request a typed student response further provide the teacher with insight into individual and collective student thinking. Teachers can readily observe and assess in real-time the progress of students by viewing student screens in the Desmos Teacher Dashboard. Student responses are accessible to the teacher either individually or collectively problem by problem, and can be anonymized and shared with the class. By analyzing the distribution of student responses, teachers can plan future discussions and activities accordingly. The flexibility of presentation within Desmos, for example the ability to change color and size of features, enables the teacher to draw attention to important aspects of the activity. We often chose to use a change in both color and size within the activities we created. For most users, the change in color is likely more eye-catching, but we found that the change in size helped highlight key features for users with whom we worked who have some degree of color blindness. Another notable benefit of Desmos, compared to most other DSEs, is its library of teacher-created activities that are readily accessible for other teachers to use and modify for their own classroom needs. 
Our Desmos activity on influential points resides within the Desmos Classroom Activities, ready to be discovered, used, and modified by other teachers.

Our Desmos investigation consists of a sequence of four activities. In the first activity, students make predictions and then confirm which point (a leverage point or a point with a large residual) of a given dataset has the greatest effect on the position of the line of best fit. In a similar vein, in the second activity students discover that adding a point with a fixed large residual has the greatest effect on the position and fit of the regression line when the x-coordinate of the added point is an outlier. In the third activity students explore the effect of an influential point on the correlation coefficient r. Students discover various ways that an influential point can increase or decrease the strength of the linear relationship, and can even change the sign of r and the direction of the slope of the line of best fit. Students then test their understanding of influential points with two card sort activities: the first examines the effect of an additional point on the position of the LSRL and the second targets the effect of an additional point on the strength of a relationship. In the final activity students compare the effect of a leverage point on small versus large datasets. Our set of Desmos activities provides readers and learners with an extensive and scaffolded DSE exploration of influential points. Our analysis of student responses to the various prompts embedded in the Desmos activity enables us to further contribute to the literature by providing insight into students’ misconceptions and understandings regarding influential points.

Our set of four Desmos activities assumes that students have introductory experience with scatterplots and the line of best fit. The activities are primarily geared toward exploratory statistical reasoning, but allow multiple access points for students of various experiences and background knowledge. While students as young as middle schoolers may discover the potential effect of leverage points and points with large residuals on the position of the line of best fit (an extension of the Common Core State Standards 8.SP.A.1 and 8.SP.A.2; National Governors Association 2010), those with greater statistical experience can assess the effect of influential points on the strength of the relationship through an examination of Pearson’s correlation coefficient r. Students with a background in inferential statistics can explore the effect of influential points on the p-value associated with a test on the slope of the line of best fit. While we provide screenshots of selected Desmos slides within figures throughout our discussion, we highly encourage the reader to experience the entirety of the four activities. We provide a link for the teacher version https://teacher.desmos.com/activitybuilder/custom/63e29b5b82f755895d2be08b so that the instructor can access both the activity and the teacher Dashboard as well as modify the activity to their needs. Changes to the original activities were made during the review process to improve their clarity as well as to strengthen connections to their contextual setting. All four activities are conceptually oriented with the first two activities structured in a contextual situation where discussion is encouraged in context about the effect of influential points and the impact this has on carrying out the statistical problem-solving process. 
Within Appendix A (supplementary material) we include a copy of the original slides as presented to participants of our study and within Appendix B (supplementary material) we present the revised slides. Logos, images, and links that refer to Desmos products are used with permission from Desmos.

An important question not addressed in our Desmos activities is what should one do if they encounter an influential point in a small dataset. Students may ask, can a leverage point with a large residual be removed since it doesn’t fit the overall pattern? While it is appropriate to check for (and potentially correct) a data error for a point with a large deviation in either the x- or y-value, one should not simply remove the point without further attention. At the same time, a statistical conclusion should not rest on the existence of one point. While we ground several of our activities in real-life settings and ask students to interpret findings contextually, we have chosen to largely focus on the mechanics and effects of influential points instead of addressing influential points within the larger setting of the statistical process (Franklin and Bargagliotti 2020). To fully negotiate how to manage an influential point, one should return to an examination of the underlying investigative question and the data collection process. This may reveal factors to help make a decision regarding the influential point and aid with the interpretation of results based on data that includes such a point.

2 Participants

In addition to presenting the interactive activities that we created, we provide feedback from their implementation to glean both students’ developing understanding of influential points and their disposition toward the activity itself. The activities were developed over a period of seven months, and were tested and revised based on three pilot iterations each with one user. The sample for the implementation that we present consists of 20 undergraduate students from one section of an introductory calculus-based probability and statistics course, taught by one of the authors, at a public university in the mid-Atlantic region. All but one of the students were majoring in computer science, with the remaining student majoring in mathematics secondary education. Linear regression was the final topic of the course and the Desmos activity on influential points, which served as the students’ introduction to influential points, was assigned as the final homework assignment. Students were told that the homework was optional and that the grade would be based solely on completion. Twenty of the twenty-three students enrolled in the course completed the assignment and thus became the sample for this study. All procedures were approved by the Institutional Review Board of Towson University. Individual and collective responses to embedded prompts, available within the Desmos Teacher Dashboard, serve as the data for our analysis. While we chose to have the participants complete the activity as a homework assignment, the option of having students complete the activity in class would allow the teacher to periodically pause the activity for class discussion or clarification.

3 Discussion of the Four Desmos Activities

Prior to completing the Desmos activities we asked the students questions regarding their experience with Desmos and their preference for learning style. Forty-five percent (9 of 20) of the participants indicated that their exploration with this activity was their first experience with Desmos. Of the 55% (11 of 20) of students who had previous experience with Desmos, all reported use of its graphing calculator. Only one student had ever completed a teacher-created activity such as ours. Students’ self-report on their learning style overwhelmingly supported the use of discovery activities and sharing thoughts with classmates. Eighty percent (16 of 20) agreed or strongly agreed with the statement “I enjoy activities that encourage me to discover mathematical concepts on my own,” while 90% (18 of 20) agreed or strongly agreed with the statement “I enjoy comparing my answers, solutions and thoughts with my classmates.”

3.1 Activity 1–Which is the More Influential Point, a Leverage Point or a Point with Large Residual?

The goal of the first activity (Desmos slides 2–17) is to have students discover that some types of points have greater effect on the line of best fit than others, specifically that a leverage point (a point whose x-coordinate value is much smaller or larger than that of other data values) has the potential to most greatly affect the line’s position. Using Desmos, students make predictions as to which data points in an existing scatterplot they believe might have a disproportionate influence on the position of the line of best fit. Students then adjust the vertical position of these points to discover the effect that the points have on the position of the line. The exploration begins with a description of a hypothetical study that examines a potential relationship between two quantitative variables, the number of steps walked on average per day and weight loss over a 10-week period. In recent years, there has been great interest in personal tracker devices which measure our physical activity, including the number of steps we take each day, with studies (Espinoza et al. 2021; McDonough, Su, and Gao 2021) indicating personal trackers to be an effective way to promote physical activity with the goal of weight loss. Even though ours is a hypothetical study, we present the data (Figure 1) with detail that students might encounter with an authentic study.

Fig. 1 Context for Activity 1 (Slide 3).

The sample size for our hypothetical study is small, numbering 16 overweight adults. In this study, participants were encouraged to walk at least 10,000 steps per day on average. At the end of 10 weeks the average number of steps walked (x) and total number of pounds lost during the trial (y) were determined for each participant. Note that a negative value of y would indicate that the participant gained weight.

Presented with a scatterplot of the data, students are asked to sketch a line that fits the data and to provide a typed response that describes the overall trend of the data. The subsequent Desmos slide presents students with both their predicted line of best fit, shown in blue, and the revealed true line of best fit, shown in black (Figure 2). All 20 students correctly drew a line segment with positive slope reflecting the general trend of the data. Accompanying the graphical representation of the LSRL [black] are Desmos-provided numerical values for the slope, the y-intercept, Pearson’s correlation coefficient r, and r². In Desmos the equation of the line is stated mathematically as y = mx + b. We were unable to alter the display of the equation of the line in the left-side bar of slides with graphs. However, in slides where we included our own text, we did include the appropriate use of ŷ for the line of best fit, though we continued to use m and b for slope and y-intercept, respectively, to be consistent with Desmos notation instead of the typical statistical notation of ŷ = a + bx.

Fig. 2 A comparison of Student 9’s conjectured line of best fit [blue] to the true LSRL [black] (Slide 5).

The students’ responses to the prompt “Describe the overall trend of the data” reflected their sketches. Responses of three students (Student 3, Student 5, and Student 12) are provided as examples. The number of the student indicates the order in which they accessed the Desmos activity.

S3: The trend of the data is positive and is somewhere in the middle of the cluster, I feel.

S5: The data seems to trend positively with increased steps per day indicating greater amount of weight loss.

S12: The data trends upwards with two potential outliers.

If the Desmos activity is being completed during class time, with the opportunity for periodic class discussions, the teacher may choose to pause the activity after students have responded to the prompt to facilitate a discussion about the appropriateness of the linear model. Sample questions that promote discussion include: How well does the line appear to fit the data set? How well does the line appear to follow the overall trend of the data? Is it surprising that the line intersects so few points? What does the value of r appear to indicate about the strength of the relationship? As the focus of this study is the relationship between average steps walked daily and predicted weight loss over 10 weeks, students are prompted in a multiple-choice question (with randomized responses) to interpret the value of the slope (Figure 3). Students are provided with immediate feedback. If they select an incorrect answer, they receive the feedback “Not quite there, try again,” and are given the opportunity to change their response until they receive the response “Correct!”

Fig. 3 Interpretation of slope (Slide 7).

Figure 4 introduces the notion that not all points have equal influence on the position of the regression line. One way to describe this is that each point tries to pull the regression line toward itself, but the influential points do so more successfully than the others. Students are asked to predict the data point they believe has the greatest influence on the position of the regression line. From our discussions with previous classes prior to this study, we had anticipated that most students would identify either the point with the largest residual or the leverage point to be most influential. Both of these points deviate from the overall pattern of the data points. The majority of students (11 of 20) did indeed choose the data point (12,500, 14.33) with the largest residual. However, unexpectedly, five students chose the data point (11,204, 5.67) with the smallest residual. Only one student correctly chose the leverage point (9700, 1).

Fig. 4 Student 11’s prediction (12,500, 14.33) of the most influential point (Slide 8).

If the activity is being completed during class, the teacher may choose to display an anonymized distribution of choices to facilitate a discussion that encourages students to justify their choice of which point is the most influential. The teacher should look for opportunities to build from student descriptions of the points to introduce vocabulary to describe (12,500, 14.33) as the point with the largest residual, having the greatest vertical distance from the line (largest error of prediction). Similarly, the term leverage point should be introduced for (9700, 1) and be described as a point whose x-value is either much smaller or larger than the rest of the data points. Figure 5 formally defines the terms leverage point and point with large residual and provides descriptions that delineate these two points.

Fig. 5 Definitions of leverage point and point with large residual (Slide 9).

Capitalizing on the discovery environment of Desmos, students use moveable points to adjust the y-values of the two data points in question as they make observations and draw conclusions about the effect that an increase in each point’s vertical distance from the original LSRL has on the position of the resulting line of best fit (Figure 6). Students first investigate the influence that a point with a large residual (12,500, 14.33), whose x-coordinate value lies near the median of the x-values of the dataset, has on the position of the line as they vertically move the point further from the original LSRL. Students then explore the influence of the leverage point (9700, 1) by moving it further from the LSRL.
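
The contrast between the two moveable points can also be sketched algebraically. The derivation below relies only on the standard least squares formulas and keeps the Desmos notation of m for slope and b for y-intercept; the symbols n, S_xx, k, and δ are notation assumed here for illustration and do not appear in the Desmos slides.

```latex
% Least squares slope m and intercept b for points (x_i, y_i), i = 1, ..., n
\[
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, y_i}{S_{xx}}, \qquad
b = \bar{y} - m\,\bar{x}, \qquad
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 .
\]
% Move one point vertically, y_k -> y_k + \delta, with every x-value unchanged
\[
\Delta m = \frac{(x_k - \bar{x})\,\delta}{S_{xx}}, \qquad
\Delta \bar{y} = \frac{\delta}{n} .
\]
% Resulting change in the fitted value at any x, using \hat{y}(x) = \bar{y} + m (x - \bar{x})
\[
\Delta \hat{y}(x) = \delta \left[ \frac{1}{n} + \frac{(x_k - \bar{x})(x - \bar{x})}{S_{xx}} \right] .
\]
```

When x_k lies close to x̄, as for the point with the large residual, the slope barely changes and the line mostly shifts vertically by δ/n. When x_k lies far from x̄, as for the leverage point, the same vertical move δ also tilts the line, and the tilt grows in proportion to the horizontal distance x_k − x̄.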

Fig. 6 Examining the effects of moveable points on the position of the LSRL (Slides 10 and 11).

Color-coding and point size are of particular significance in this and subsequent slides. In each slide, the point whose vertical height is being adjusted is marked by a larger size and a distinct color, as is the resulting line of best fit. All other data points are stationary and marked in black, as is the original line of best fit. In addition to observing the change in the position of the LSRL, students may observe the effect on the correlation coefficient r. Teachers are encouraged to acknowledge that the use of moveable points allows for investigation and discovery in a what-if setting, but that, in general, statisticians do not manipulate or adjust data values.

After observing the effects of manipulating both the leverage point and the point with the largest residual, students are asked to determine which of the two points had a greater effect on the position of the LSRL (Figure 7).

Fig. 7 Identifying the most influential point (Slide 12: Student 1).

Seventy-five percent (15 of 20) correctly chose the leverage point. Their explanations ranged from descriptions of their visual observations to more developed reasons as to why the leverage point has the potential for greater influence.

S1: The leverage point appears to have a greater impact (by eye) when it is moved.

S5: As you move further away from the cluster of data points that are toward the center of the graph, a single outlier value in either direction heavily influences the line of best fit since its value will be abnormal in both the x and y directions. Only slight changes occur when there is a residual point within a cluster of similar x-valued points since at least one dimension is staying consistent.

S7: The point with the largest residual did not effect the overall direction of the line of best fit. But the leverage point rotated the line of best fit from its central point. The point with the largest residual did not effect where its endpoints would lie.

S17: When the leverage point slider is moved, it tilts the graph in addition to moving it, whereas the point w/the large residual only shifts the graph.

Fifteen percent (3 of 20) incorrectly responded that the point with the largest residual was the most influential. Student 19 rejected the possibility that the leverage point was most influential, mistakenly interpreting the negative y-values of the dynamic point to indicate a negative number of steps instead of negative weight loss (weight gain).

S19: While the leverage point greatly changed the shape of the line when its residual was increased, this caused its y value to become negative, which is not possible in this scenario (you can’t walk a negative number of steps). Moving the leverage point’s y-value to 0 showed it having a lesser impact on the line of best fit than that of the point with the large initial residual.

The remaining two students chose the response that neither point affects the position of the line of best fit more greatly than the other.

While Desmos is particularly well-suited to allow students to explore the effect of individual points on the line of best fit by modifying the y-coordinate of data points, a more traditional approach to examining the influence of individual points is to compare the position and fit of the line of best fit with and without the point. Figure 8 provides a static comparison of three lines of best fit and their respective values of r, with the original data and LSRL in black, a red dotted LSRL resulting from removing just the point with the largest residual, and a blue dashed LSRL resulting from the removal of just the leverage point.
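
The with-and-without comparison can also be reproduced outside Desmos. The following is a minimal sketch in Python with NumPy, not the computation behind the slides: only the two labeled points, (9700, 1) and (12,500, 14.33), come from the activity, while the other seven points and the helper names `fit_summary` and `fit_without` are invented here for illustration.

```python
import numpy as np

# Hypothetical data: only the leverage point (9700, 1) and the point with the
# large residual (12500, 14.33) come from the activity; the other seven
# (steps, pounds) pairs are invented for this sketch.
steps = np.array([9700, 11000, 11500, 12000, 12500, 12500, 13000, 13500, 14000],
                 dtype=float)
pounds = np.array([1.0, 5.0, 6.0, 6.0, 7.0, 14.33, 8.0, 8.0, 9.0])
LEVERAGE = 0        # index of (9700, 1)
LARGE_RESIDUAL = 5  # index of (12500, 14.33)


def fit_summary(x, y):
    """Return (slope, intercept, r) of the least squares line through (x, y)."""
    dx = x - x.mean()
    dy = y - y.mean()
    s_xy = (dx * dy).sum()
    s_xx = (dx ** 2).sum()
    s_yy = (dy ** 2).sum()
    slope = s_xy / s_xx
    intercept = y.mean() - slope * x.mean()
    r = s_xy / np.sqrt(s_xx * s_yy)
    return slope, intercept, r


def fit_without(index):
    """Refit the line after dropping the single point at the given index."""
    keep = np.arange(len(steps)) != index
    return fit_summary(steps[keep], pounds[keep])


full = fit_summary(steps, pounds)
no_residual = fit_without(LARGE_RESIDUAL)
no_leverage = fit_without(LEVERAGE)

for label, (m, b, r) in [("all points", full),
                         ("without large residual", no_residual),
                         ("without leverage point", no_leverage)]:
    print(f"{label:>24}: slope = {m:.5f}, intercept = {b:7.2f}, r = {r:.3f}")
```

With these invented values, removing the leverage point changes the slope several times more than removing the point with the large residual, and r rises when the point with the large residual is removed but falls when the leverage point is removed. Other datasets can behave differently, which is part of what the later activities explore.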

Fig. 8 A comparison of the original LSRL (black) to those with removal of unusual points (Slide 14 with S8’s response).

Presented with the comparison of the three lines and the question “What happens to the original line of best fit when we REMOVE the point with the largest residual or the leverage point?,” students offered explanations that referenced key contrasts within the visual display.

S5: When the largest residual is removed, it seems like the y-intercept is relatively unchanged, but the slope is slightly decreased. But when the leverage point is removed, the y-intercept changes by a large margin, and the slope decreases by about 2x when the red line does. It appears like the leverage point is about 2x as influential to the line of best fit overall.

[Note that the student appears to have mistaken the lower left corner of the visual display to be the origin of the coordinate system. The values of y when x = 9500 are similar for both the original LSRL and the dotted red line when the point with largest residual is removed.]

S8: When you remove the largest residual point the line angles slightly off the line of best fit but when you move the largest leverage point it moves the line completely

S14: the lines basically flipped because if you imagine a see-saw, if a heavy item is on the other side and just disappears the lighter item on the other side will fly up

The familiar reference to a see-saw, as made by Student 14, can aid student understanding of a leverage point. Just as a child sitting further from the pivot point of a see-saw provides more leverage, so too does a point that is further in the horizontal direction from the fulcrum (x̄, ȳ) of the LSRL. Along with the visual display of the three lines, Figure 8 also presents the equation of each line and its corresponding correlation. Three of the students specifically commented on the effect of the removal of the points on the resulting correlation.

S6: The slope decreases when either are removed, as well as the y-intercept. When the leverage point is removed the correlation value decreases drastically.

[Note that while the magnitude of the y-intercept decreased, the value of the y-intercept increased.]

S7: The correlation of the line without the largest residual became stronger than before. The correlation of the line without the leverage point was much weaker than before. The leverage point created a more significant impact on the line of best fit than the largest residual point.

S10: Removal of the residual increases the correlation coefficient and decreases the slope of the best fit line. Removal of the leverage point decreases the correlation coefficient and decreases the slope of the best fit line.

Summarizing student responses to prompts within Activity 1, there was great growth in student understanding that the leverage point, with an x-coordinate value as an outlier, had a greater effect on the position of the LSRL than the point with the large residual whose x-coordinate value was within the main cluster of the dataset. Notably, whereas only one student (5%) initially chose the leverage point to be most influential, at the completion of Activity 1, 15 students (75%) correctly chose the leverage point.

A comparative discussion of the influence of the leverage point and the point with the largest residual, within the context of the problem, can complement observations based on the graphical and numerical representations of the line of best fit and accompanying value of r to help solidify understanding. The leverage point (9700, 1) represents a person who took far fewer daily steps on average than the rest of the subjects in the study, an outlier in its x-coordinate value. This data point is the only contribution to the linear relationship for values of x less than 10,000 steps. The person’s weight loss (almost zero pounds) is noticeably less than what would be predicted from either an extrapolation of the LSRL based on the dataset that excludes the leverage point or the LSRL based on all data points (see Figure 8). The leverage point has a large influence on the position of the LSRL as its inclusion acts as an anchor tilting the line toward it. The removal of the leverage point results in a prediction of a half-pound less of weight loss for each additional increase of 1000 steps walked. In comparison, the person represented by the point (12,500, 20.83) with the largest residual lost an unusually large amount of weight compared to other people who took a similar number of steps each day on average. [Note that there are six subjects who took between 12,000 and 13,000 steps.] The influence of the point with the largest residual on the position of the LSRL is less than that of the leverage point, as its removal results in far less change in the slope of the line (a change in slope of 0.0005 for the leverage point and a change in slope of 0.0002 for the point with the largest residual). A multiple-choice question prompts students to interpret and compare the slope for the original dataset and the dataset with the leverage point excluded (Figure 9).

Fig. 9 A comparison of the difference of predicted weight loss for each additional 1000 steps walked (Slide 17).
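The "half-pound" interpretation follows directly from the reported changes in slope. Assuming, as the context suggests, that the slope is measured in pounds of weight loss per daily step, each change in slope can be rescaled to a change in predicted weight loss per additional 1000 steps:

```latex
\begin{align*}
\text{leverage point removed:} \quad & 0.0005\ \tfrac{\text{lb}}{\text{step}} \times 1000\ \text{steps} = 0.5\ \text{lb},\\
\text{largest residual removed:} \quad & 0.0002\ \tfrac{\text{lb}}{\text{step}} \times 1000\ \text{steps} = 0.2\ \text{lb}.
\end{align*}
```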


Students are also presented with multiple-choice questions to compare the strength of the relationship with and without each of these unusual points (Figure 10).

Fig. 10 A comparison of the strength of the relationship with and without unusual points in the context of the problem (Slides 15 and 16).


A key take-away is that the leverage point offers the only contribution to the relationship between steps walked and weight loss when relatively few steps are taken and has the potential to have great influence, while the point with the largest residual is one of several data values that contribute to the relationship for mid-range values of steps walked and has less potential for influence. Whereas removing the leverage [anchor] point for this dataset weakens the relationship, removing the point with the largest residual, or greatest error of prediction, strengthens the relationship.

3.2 Activity 2–When Are Points with a Large Residual the Most Influential?

The second activity (Desmos slides 19–22) introduces a new dataset of Southwest Airlines stock price (Macrotrends 2023) versus oil price (Macrotrends.net 2023), and strengthens students’ developing understanding that a point whose x-coordinate is either much smaller or much greater than the rest of the dataset has the greatest potential to influence the position of a line of best fit and the strength of the linear relationship. The initial weak negative trend indicates that as oil price increases, the stock price decreases. Whereas in the first activity students observe the effects of changing the y-coordinate of two points whose x-coordinates are fixed, in this activity (Figure 11) students observe the effect of adjusting the position of a point added to the data, sliding it parallel to the original line of best fit at a fixed vertical distance from the original LSRL. The goal of this activity is to have students observe that, though the vertical distance to the original LSRL remains constant, the effect of the added moveable point on the resulting LSRL is greater when its x-coordinate value is very small or very large compared to the x-coordinate values of other data points. Comparatively, the added point has little effect when its x-coordinate value falls near the center of the x-coordinate values of all data points. Strategically, the added point is placed in a starting position that has no effect on the slope of the original line of best fit. It is interesting to note that while adding a point typically changes the LSRL, Desmos, through its dynamic nature, can enable one to find a point that has no effect.

Fig. 11 Observing the effect of an added point at a fixed distance from the LSRL with variable x-coordinate value (snapshots of dynamic Slide 20).
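The sliding-point idea can also be imitated numerically. The sketch below is an illustration under stated assumptions rather than the Desmos implementation: the `base` pairs are hypothetical (not the Macrotrends data), `OFFSET` is an assumed vertical distance, and `fit_line` and `slope_with_added_point` are helper names introduced here. The added point is kept a fixed vertical distance above the original line while its x-coordinate varies.

```python
def fit_line(points):
    """Return (slope, intercept) of the least squares regression line."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# HYPOTHETICAL (oil price, stock price) pairs; not the Macrotrends data.
base = [(48, 52), (50, 47), (52, 55), (54, 46), (55, 51),
        (56, 44), (58, 50), (60, 45), (62, 49)]
slope0, intercept0 = fit_line(base)
x_bar = sum(x for x, _ in base) / len(base)
OFFSET = 6.0  # assumed fixed vertical distance above the original line


def slope_with_added_point(x_new):
    """Slope after adding a point OFFSET above the original line at x_new."""
    y_new = intercept0 + slope0 * x_new + OFFSET
    return fit_line(base + [(x_new, y_new)])[0]


# Slide the added point across the window and record the change in slope.
slope_change = {x: slope_with_added_point(x) - slope0
                for x in (40, 45, 50, x_bar, 60, 65, 70)}
for x, change in slope_change.items():
    print(f"added point at x={x:5.1f}: change in slope = {change:+.4f}")
```

In this sketch the slope is unchanged when the added point sits at the mean of the x-values, which is one way to locate a starting position with no effect on the slope, and the change grows as the point slides toward either end of the window.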


In response to the prompt: “For what values of x was the point on the parallel line the most influential on the resulting red line of best fit?” nearly 90% of students (17 of 19) correctly reported that the added point most influenced the position of the resulting LSRL when the x-coordinate value was extremely low or extremely high. Some responses provided numerical values for the x-coordinate while others were descriptive.

S17: the extremes of 40 and 70

S3: When x is really big and when x is really small aka the x cannot be very close to the middle of the line or the graph, otherwise it will not change much.

After collecting and analyzing data, we have expanded the prompt for this activity to ask the question in both graphically oriented and contextual manners: “Slide this added point to determine the x-coordinate values of the added point that have the greatest effect on the position of the line of best fit. In other words, what values of the price of a barrel of crude oil have the greatest effect on the line of best fit used to predict the price of Southwest stock?” Framing the question in language about the real-life example may elicit responses and discussion that focus more on the context, which in turn may help students better understand why a leverage point has such great influence. Adding a point with fixed residual has little to no effect when the price of a barrel of crude oil is typical compared to other data values; however, for an unusually low oil price around $40 per barrel the slope of the LSRL becomes much steeper and the value of r (–0.843) indicates a strong negative linear relationship: as oil price increases, SW stock price decreases. If a leverage point for an unusually high oil price, around $70 per barrel, is added, the revised LSRL is nearly horizontal, with the value of r (0.0973) indicating a very weak relationship between oil price and SW stock price.

In an effort to provide multiple entry points in Activity 2, as well as in Activity 3, within Desmos we have calculated, coded, and provided the p-value [associated with the test for a nonzero slope of the LSRL] on the sidebar. The p-value is not automatically provided by Desmos. While the activities are focused on exploratory statistical reasoning without drawing attention to the p-value, teachers whose students have already been exposed to the p-value may take the initiative and choose to discuss the effect that an influential point can have on the p-value.
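For readers who want to reproduce such a p-value outside of Desmos, one standard route is the t-test for a nonzero slope, which depends only on r and the sample size n. The sketch below is an assumption about how this could be done, not the authors' Desmos code: it uses t = r·sqrt((n − 2)/(1 − r²)) with n − 2 degrees of freedom and approximates the t-distribution tail with Simpson's rule, using only the Python standard library. The names `t_pdf` and `two_sided_p_value` are introduced here for illustration.

```python
from math import exp, lgamma, pi, sqrt


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = lgamma((df + 1) / 2) - lgamma(df / 2)
    return exp(log_coef) / sqrt(df * pi) * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_value(r, n, steps=2000):
    """Approximate two-sided p-value for H0: slope = 0 (requires |r| < 1, n > 2).

    steps must be even because Simpson's rule is used for the integral.
    """
    df = n - 2
    t = abs(r) * sqrt(df / (1 - r * r))
    # Simpson's rule for the area under the density between 0 and t.
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return max(0.0, 1 - 2 * area)


# Example: a near-zero correlation in a dataset of 16 points.
print(round(two_sided_p_value(0.001, 16), 3))
```

With r near zero and n = 16 the result is very close to 1, in line with the near-zero correlation example discussed in Activity 3. A statistics package would normally be used instead of hand-rolled integration; the numerical approach here only keeps the sketch dependency-free.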

3.3 Activity 3–Under What Conditions Are Leverage Points Influential?

The focus of the third activity (Desmos slides 23–36) expands from identifying an influential point and observing its effect on the position of the LSRL to clarifying under what conditions a leverage point greatly affects the value of the correlation coefficient. Within the previous two activities the movement of the dynamic point was restricted to either the y-coordinate value (Activity 1) or along a line parallel to the original LSRL (Activity 2). In this activity, students freely manipulate both the x- and y-coordinates of the point as they are challenged to position the point to achieve the largest (and smallest) value of the correlation coefficient r. Students with experience in statistical inference can observe how the p-value associated with the test for a nonzero slope of the LSRL [provided on the sidebar within Desmos] changes in tandem with the value of r. After exploring how the position of a leverage point can dramatically change both the position of the LSRL and its corresponding value of r in a dataset that otherwise indicates no linear relationship, students are then challenged in two card sort activities to predict how the inclusion of a leverage point affects the position of the line of best fit and the correlation coefficient.

In the first part of this activity students discover that a leverage point, added to a dataset with near zero correlation, has the potential to significantly strengthen the linear relationship depending upon its location relative to the overall trend of the data. Students are presented with a dataset of 16 values: the x-value is the average daily steps that the person walks (data from Activity 1) and the y-value is the average miles per gallon corresponding to the car that the person drives. Based on the context of the data, students are initially asked a multiple-choice question to best describe the strength of this relationship as either strong, slight, or none (Figure 12). Fifty-nine percent of responding students (10 of 17) indicated that there is no relationship between the average miles per gallon and the average number of daily steps, while 35% (6 of 17) indicated that there is a slight relationship, and 6% (1 of 17) indicated that there is a strong relationship. Students are subsequently presented with a scatterplot that visually (a nearly horizontal LSRL) and numerically (r = 0.001 and a p-value = 0.996) confirms that there is no relationship between the number of steps that one walks and the average miles per gallon for the car that they drive.

Fig. 12 Activity 3—Data and initial plot with near zero correlation (Slides 23 and 24).


Students then encounter a zoomed-out image of the scatterplot, this time with a dynamic point, designated in blue font and larger size, whose x- and y-coordinate values can be freely adjusted. While dragging the point across the scatterplot, students observe how the position of the LSRL and the corresponding value of r are simultaneously affected (Figure 13). Students are challenged to adjust the location of the moveable point within the window of the scatterplot so that the correlation is as strong as possible. As students move the point from its initial non-leverage location, where r ≈ 0 and the LSRL is horizontal, to extreme leverage positions in the four corners of the window, the following changes can be observed: (a) the slope of the LSRL changes from near zero to either clearly positive or negative, (b) the value of r changes from near 0.0 to greater than 0.5 in magnitude, and [for students with experience in inferential statistics] (c) the p-value decreases in value from 0.996 to less than 0.010.

Fig. 13 Positioning a leverage point to strengthen an x–y relationship (Slides 26 and 27).
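The search that students carry out by dragging the point can be mimicked with a simple grid search. The sketch below rests on stated assumptions: the 15 `fixed` points are hypothetical and arranged so that their correlation is exactly zero (they are not the steps and gas mileage data), x is expressed in thousands of steps, and the window limits are invented for illustration.

```python
from math import sqrt


def correlation(points):
    """Pearson correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    syy = sum((y - y_bar) ** 2 for _, y in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    return sxy / sqrt(sxx * syy)


# HYPOTHETICAL fixed points (thousands of steps, mpg) with r = 0 by symmetry.
fixed = [(x, y) for x in (11, 12, 13, 14, 15) for y in (24, 28, 32)]

# Candidate positions for the movable point inside an assumed viewing window.
candidates = [(x, y) for x in range(5, 22) for y in range(10, 47, 2)]
r_by_position = {p: correlation(fixed + [p]) for p in candidates}

strongest_positive = max(r_by_position, key=r_by_position.get)
strongest_negative = min(r_by_position, key=r_by_position.get)
print("r without the movable point:", correlation(fixed))
print("largest r at", strongest_positive, round(r_by_position[strongest_positive], 3))
print("smallest r at", strongest_negative, round(r_by_position[strongest_negative], 3))
```

In this sketch the largest and smallest values of r both occur with the movable point in a corner of the window, where it is far from the other points both horizontally and vertically.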


In the original Desmos activity, students were asked “Where did you place your leverage point to achieve the largest value of r?” Unfortunately, many of the students (37%; 7 of 19) gave incomplete answers by only identifying the x-coordinate value or using similarly imprecise language such as “furthermost right.” These same students, with one additional student, also gave incomplete answers to the second question, “Where did you place your leverage point to achieve the smallest value of r?” Additionally, 21% (4 of 19) of students appeared to have misread the two questions, as they provided a point that resulted in a large value of r in magnitude for the first question and a point that resulted in a value of r close to zero for the second question. We recognize the confusion that students had with the phrase “the smallest value of r” and have since revised both the directions and associated prompts (), using language about the strongest negative correlation in conjunction with the value of r closest to −1. Of the remaining students who provided either a specific point or a location defined both horizontally and vertically, and who did not apparently misinterpret the question in terms of the magnitude of r, 100% (8 of 8) correctly identified a point that achieved a very large value of r and 100% (6 of 6) correctly identified a point that achieved a very small value of r.

By altering the position of one point, students observe the tremendous power that a leverage point has on the relationship between gas mileage and daily steps walked. Initially the scatterplot of the data indicates no relationship between the variables as intuition would lead one to believe. But, as the position of just one point is freely manipulated, the slope of the line of best fit can become clearly positive (negative) indicating that as steps walked increase, gas mileage increases (decreases) while the relationship appears to strengthen as indicated by the magnitude of r. Later, in Activity 4, students discover the connections between sample size and the effect of an influential point, further clarifying when a leverage point has such great effect.

At the end of Activity 3, students are presented with two card sorts that challenge them to match scatterplots to appropriate descriptions of the x–y relationships. Card sorts are often used in Desmos activities for quick formative assessment and immediate feedback for students. In both card sorts, all scatterplots include a black regression line fit to the original data points, also marked in black, as well as a newly added leverage point marked in blue. In the first card sort (Figure 14) students must consider how the position of the regression line will be affected by the addition of a leverage point or, in one case, a non-leverage point with a large residual. As they place each scatterplot on its appropriate description, students need to recognize that leverage points that follow the overall trend of the data will have little effect on the position of the resulting LSRL, while those that deviate from the overall trend have the potential to greatly affect the position. In short, not all leverage points are influential points.

Fig. 14 Card Sort 1—Match each graph to its correct description. Adding the leverage point (blue) greatly changes the position of the LSRL for Graphs 3, 4, 6, but not for Graphs 1, 2, 5 (Slide 30).


In the second card sort (slides 33–35) students are challenged to match each of three scatterplots to one of two descriptions about the effect that the indicated leverage point will have on the strength of the relationship (Figure 15).

Fig. 15 Card Sort 2 (Slide 34).


Students are not given the value of r for either the original dataset or the dataset including the leverage point. If the initial relationship was relatively strong, as evidenced by the overall closeness of the data points to the original LSRL, then a leverage point that follows the general trend of the data would strengthen the relationship, resulting in an increase in the magnitude of r, while a leverage point that deviated from the general trend of the data would weaken the relationship, resulting in a decrease in the magnitude of r. If the initial relationship was relatively weak, then a leverage point that is extremely far removed from the data points in the x direction can strengthen the relationship, resulting in an increase in the magnitude of r. The original data and LSRLs of Graphs 2 and 3 are identical and indicate relatively strong relationships compared to that of Graph 1. The addition of a leverage point that falls very close to the original LSRL will strengthen the relationship in Graph 2, while an added leverage point that deviates greatly from the overall trend will weaken the relationship in Graph 3. At the conclusion of both card sorts, students are presented with correct answers and explanations. The correct matching for Card Sort 2 (Figure 16) provides students with both the value of r for the original data and the value of r for the dataset that includes the added leverage point, providing them with a numerical lens to evaluate the strength of a relationship.

Fig. 16 Correct answers and explanations for Card Sort 2 (Slide 35).
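The reasoning behind Card Sort 2 for an initially strong relationship can be checked with a small numerical sketch. The data below are hypothetical (scattered around y = 2x + 1) and are not the card sort datasets; `on_trend` and `off_trend` are illustrative leverage points sharing the same x-coordinate.

```python
from math import sqrt


def correlation(points):
    """Pearson correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    syy = sum((y - y_bar) ** 2 for _, y in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    return sxy / sqrt(sxx * syy)


# HYPOTHETICAL data with a fairly strong positive trend (roughly y = 2x + 1).
original = [(1, 5), (2, 3), (3, 9), (4, 7), (5, 13), (6, 11), (7, 17), (8, 15)]

on_trend = (20, 41)   # leverage point that follows the trend y = 2x + 1
off_trend = (20, 5)   # leverage point that deviates from the trend

r_original = correlation(original)
r_on_trend = correlation(original + [on_trend])
r_off_trend = correlation(original + [off_trend])

print(f"original data:        r = {r_original:.3f}")
print(f"with on-trend point:  r = {r_on_trend:.3f}")
print(f"with off-trend point: r = {r_off_trend:.3f}")
```

For this made-up dataset the on-trend leverage point raises r, while the off-trend leverage point pulls r down sharply, matching the qualitative pattern described for Graphs 2 and 3.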


The take-away from Activity 3 is multi-pronged. Given a dataset with no relationship (steps walked vs. mpg of car driven), the movement of one point to a leverage position (very small or very large number of steps walked) with a y-coordinate value (mpg for car that the person drives) that is also an outlier vertically, can make the relationship appear linear as observed by a change of position of the LSRL and increase in the magnitude of r. However, the meaningfulness of this relationship should be questioned as it relies heavily on one point. Note that students can move the point below the x-axis where the gas mileage is negative and nonsensical. The two card sort activities expand on the effect that a leverage point can have. An added leverage point that follows the overall trend of the other data points will not greatly alter the position of the original line of best fit but will likely increase the strength of the relationship and magnitude of r. A leverage point that deviates from the overall trend of the existing data points can greatly alter the position of the line of best fit and result in a decreased (increased) magnitude of r when the original relationship was strong (weak).

3.4 Activity 4–How Does Sample Size Affect the Influence of a Leverage Point?

The discussion of influential points is most important for small datasets. The goal of the final activity, Activity 4 (Desmos slides 37–41), is to have students discover and explore that whereas a leverage point can have great influence on small datasets, it has much less influence on large datasets. This activity focuses on conceptual understanding of the effect of sample size on the influence of leverage points without a contextual setting. Students compare side-by-side snapshots of scatterplots (Figure 17) that have nearly identical lines of best fit and values of r but differ in their sample size (n = 4 and n = 12). Students are asked to consider the effect of adding a common leverage point (20, 15) to each dataset and to sketch their conjectured new line of best fit for each scatterplot that includes the added leverage point.

Fig. 17 A graphical comparison of the effect of a leverage point that does not fit the overall trend of the data on a small and a larger dataset (Slides 38 and 39).


We were curious to see if students would assume that the effect of the leverage point was the same for both small and larger datasets. Similar to Student 5, 90% (18 of 20) of students drew sketches that indicated that the leverage point (20, 15) had a much greater effect on the position of the line of best fit for the data with the smaller sample size (Figure 18).

Fig. 18 A comparison of the line of best fit for the original data with Student 5’s conjectured line and the true line of best fit for data that included the added leverage point (Slide 40).


In Figure 19, students are presented with the graphical and algebraic representations of both the LSRL of the original data and the modified LSRL based on data with the added leverage point for both sample sizes. Corresponding correlation coefficients for both small and large datasets are also given.

Fig. 19 A numerical and graphical comparison of the effect of a leverage point on a small and a larger dataset (Slide 41 with S1’s response).


Students were asked: “For which dataset (small or large) does the leverage point have the greater influence? Explain.” Ninety percent (18 of 20) of students indicated that the leverage point had greater influence on the smaller dataset. Most (16 of 20) students’ explanations specifically compared the contribution or weight of one data point on the LSRL for small versus larger datasets. A subset of those (3 of 20) attended to the effect of the leverage point on the correlation coefficient.

S1: Small–a smaller dataset is more susceptible to single points changing their slope. I think of it as a single point taking up a much larger proportion of the dataset so it has a bigger impact.

S7: For the small dataset, the leverage point is more influential than the large dataset since there is a greater difference from their Pearson’s correlation coefficient of the best line with the leverage point to the best line without the leverage point.

S8: small because there’s less points to keep the line of best fit stable

S16: The dataset to the left has a leverage point with a greater influence because of how different the red line is compared to the line of best fit.

S18: g1 because there are less data points so the red point has more affect on changing the line of best fit

We chose to create datasets with original sizes of n = 4 and n = 12 so we could clearly illustrate the nearly identical initial relationships on the two scatterplots. We acknowledge that both datasets are small, and we encourage teachers to have their students imagine the effect of the leverage point (20,15) on a third dataset of much larger size (perhaps n = 120) with a nearly identical initial relationship between x and y in terms of the equation of the LSRL and value of r. In summary, the students demonstrated understanding that the topic of leverage points is of particular concern for small datasets where they have the potential to have a great effect on the relationship between variables, and that as the size of the dataset increases the effect of a leverage point becomes much less.
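The thought experiment with a third, much larger dataset can be tried numerically. The sketch below is an illustration under stated assumptions: `make_dataset` builds hypothetical data (not the Activity 4 datasets) whose LSRL is y = 2 + 1.5x for each sample size, and the leverage point (20, 15) from the activity is then added to each.

```python
def fit_slope(points):
    """Slope of the least squares regression line."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    return sxy / sxx


def make_dataset(n):
    """HYPOTHETICAL data: n evenly spaced x-values on [1, 10] around y = 2 + 1.5x.

    The repeating residual pattern keeps the LSRL at y = 2 + 1.5x
    whenever n is a multiple of 4.
    """
    pattern = (1.0, -1.0, -1.0, 1.0)
    xs = [1 + 9 * i / (n - 1) for i in range(n)]
    return [(x, 2 + 1.5 * x + pattern[i % 4]) for i, x in enumerate(xs)]


leverage_point = (20, 15)  # the added point used in Activity 4

results = {}
for n in (4, 12, 120):
    data = make_dataset(n)
    before = fit_slope(data)
    after = fit_slope(data + [leverage_point])
    results[n] = (before, after)
    print(f"n = {n:3d}: slope {before:.3f} -> {after:.3f} "
          f"(change {after - before:+.3f})")
```

In this sketch the same added point shifts the slope most for n = 4 and least for n = 120, consistent with the conclusion that the influence of a leverage point diminishes as the dataset grows.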

At the end of the four Desmos activities we included a couple of questions that addressed students’ dispositions toward the learning experience. Ninety percent (18 of 20) agreed or strongly agreed with the statement “I enjoyed the discovery aspect of dragging points to see the resulting change in the line of best fit.” In addition, 85% (17 of 20) agreed or strongly agreed with the statement “I liked being able to discover and learn, largely at my own pace.”

4 Summary

The sequence of four Desmos activities served to introduce students to the topic of influential points in linear regression and helped to scaffold a robust understanding of the potential effect that leverage points can have on both the position of a linear relationship and its correlation. In addition to being able to identify leverage points that have the greatest influence on the line of best fit, students were able to explore how the location of a leverage point relative to the other points could either strengthen or weaken a linear relationship measured by the correlation r. Finally, students successfully identified that leverage points were more influential for smaller datasets.

Complementing the goal of facilitating student understanding of influential points, the Desmos activities were designed to create an interactive and engaging learning environment that promoted student ownership of their learning. Students were able to make and test conjectures and explore the effects of dynamic points on the position of the LSRL. The inquiry-based, constructivist environment was found to be important to students themselves, as both the pre- and post-survey questions reflected overwhelming support for this type of learning.

Supplementary Materials

The four Desmos activities are available at https://teacher.desmos.com/activitybuilder/custom/63e29b5b82f755895d2be08b. The original slides of the Desmos activities used in this article to collect student responses can be found in Appendix A. The revised slides for the Desmos activities can be found in Appendix B.


Disclosure Statement

The authors declare no competing interests.

Data Availability Statement

Data for this paper is available at https://doi.org/10.17605/OSF.IO/EK7HB.


References