Least Median Of Squares And Regression Through The Origin

least median of squares and regression through the origin supporting files online at http://www.wabash.edu/econexcel/lmsorigin by

Least Median of Squares and Regression through the Origin
Supporting files online at
http://www.wabash.edu/econexcel/LMSOrigin
By
Humberto Barreto
Department of Economics
Wabash College
Crawfordsville, IN 47933
[email protected]
and
David Maharry
Department of Mathematics and Computer Science
Wabash College
Crawfordsville, IN 47933
[email protected]
The authors thank Michael Axtell, Frank Howland, and anonymous
referees
for suggestions and criticisms.
Comments welcome
Do not quote without the author’s permission
January 2005
Abstract
An exact algorithm is provided for finding the Least Median of Squares
(LMS) line for a bivariate regression with no intercept term. It is
shown that the popular PROGRESS routine will not, in general, find the
LMS slope when the intercept is suppressed.
A Microsoft Excel workbook that provides the code in Visual Basic is
made available at www.wabash.edu/econexcel/LMSOrigin
Keywords: LMS, Robust Regression, PROGRESS
1. Introduction
Rousseeuw [1984] introduced Least Median of Squares (LMS) as a robust
regression procedure. Instead of minimizing the sum of squared
residuals, coefficients are chosen so as to minimize the median of the
squared residuals. Unlike conventional least squares (LS), there is no
closed-form solution with which to easily calculate the LMS line since
the median is an order or rank statistic. A general non-linear
optimization algorithm performs poorly because the median of squared
residuals surface is so bumpy that merely local minima are often
incorrectly reported as the solution.
Although a closed-form solution does not exist and brute force
optimization is not reliable, several algorithms are available for
fitting the LMS line (or hyperplane). Perhaps the most popular
approach is called PROGRESS (from Program for RObust reGRESSion). The
program itself is explained in Rousseeuw and Leroy [1987] and the most
recent version is available at http://www.agoras.ua.ac.be/. Several
software packages, such as SAS/IML (version 6.12 or greater), have an
LMS routine based on PROGRESS.
This paper focuses on the special problem of finding the LMS fitted
line through the origin in the bivariate case. The next section
presents the model and defines the LMS line. Section 3 shows that the
PROGRESS algorithm gives an incorrect solution, in general, when the
intercept is restricted to zero. Section 4 presents an analytical,
exact method for finding the minimum median squared residual for the
bivariate, zero intercept case. Finally, a simple example is provided
to illustrate the algorithm and show why PROGRESS fails in the
zero-intercept case.
2. The model
============
Suppose that observed values of y are generated according to the model
yi = xi + i. Given a realization of n (xi, yi) points, the problem
is to find the ‘best’ choice for the slope of a straight line that
passes through the origin, .
One choice for the ‘best’ line is the line that minimizes the median
of the individual squares of the deviations, or residuals. Given this
objective, one chooses the value of the slope, m, to minimize the
median value of the squared residuals:

Given a value of m, the n squared deviations can be ordered and the
median found as the ‘middle’ one. If n is odd the median is the
item, while if n is even the median is the mean of the n/2 and
the n/2 + 1 terms. Table 1 shows a simple example.

Table 1. Five Points Example of the LMS Fit
The smallest median squared residual, 0.64, is achieved when m=2.4.
Other choices of m will generate greater median squared residual
values.
3. PROGRESS and LMS with a zero intercept
The PROGRESS algorithm is based on sampling subsets of points from the
data in order to generate candidate LMS estimates. The size of each
subset is determined by the number of coefficients to be estimated.
For the table above, there is one coefficient to be estimated and five
observations so there are five subsets, each containing one data
point. For each of the n subsets, the slope is computed. Using this
slope, the squared deviation of each of the data points is calculated
and the median of these squared deviations is determined. The PROGRESS
estimation of the LMS slope is the subset that produces the minimum
median deviation.

Table 2. Five Points Example with PROGRESS, No Intercept
Using the data from Table 2, PROGRESS determines that the minimum
median squared residual has a value of 1, found when the slope is 1.5.
Thus PROGRESS generates an LMS line . Note that this is not the
true minimum median squared residual, which has a value of 0.64 for
the slope of 2.4.

(a) Scatter Plot with Fits (b) Median SR Objective Function
Figure 1. Five Points Example Fits
It is well known that PROGRESS will yield an exact, correct solution
in the case of simple regression (i.e., bivariate regression with an
intercept and slope) when all two-observation subsets are examined.1
Given the best slope of the subsets, the intercept (or location
parameter) is adjusted and an exact LMS fit is guaranteed. The casual
user may mistakenly suppose that if PROGRESS finds an exact solution
for a simple bivariate regression of the form y = mx + b, then an
exact solution also will be obtained for y = mx. Unfortunately, this
does not follow.
4. An exact algorithm for LMS with a zero intercept in the bivariate
case
The central idea of this algorithm that provides an exact solution to
the LMS problem is the observation that there are a finite number of
slopes, bounded by the number of pairs of points, which could provide
the minimum median deviation. In most cases not all of the O(n2)
points need to be checked to determine the optimal slope.
For a given x,y pair, the square of the deviation of the fitted line
from the actual data point is given by d2(m)=(y-ŷ)2=(y-mx)2. As a
function of the slope, m, this squared deviation forms a parabola with
an upward concavity. The deviation has its minimum value of 0 when the
line passes directly through the (x,y) point where ŷ=y= mx. A given
set of n data points, defines a set of parabolas . For
a given value of the slope m these parabolas can be ordered and the
data point which defines the median deviation can be determined.
The range of slopes to search in order to find the one that produces
the minimum median deviation is bounded by mmin=min(yi/xi) and mmax=max(yi/xi)
over all the n data points for which the x-value is not zero. Each
parabola rises monotonically for all values of the slope that are
greater than the slope that minimizes it, which is m= yi/xi. This
implies that all of the n parabolas are rising when the slope is
greater than mmax=max(yi/xi). The minimum median deviation can not
occur in a region where all of the deviations are rising. A similar
statement may be made about slopes less than mmin since in this case
all n parabolas are falling when the slope is less than mmin=min(yi/xi).
Thus the slope that produces the minimum median deviation must have a
value between these two bounds.
The minimum value of a continuous function such as the median of the
square of the deviations can occur only at (1) one of the end points,
(2) at a point where the derivative of the function is 0, or (3) at a
point where the derivative is not defined. Thus the search for the
slope that produces the global minimum can be restricted to the points
where two parabolas intersect and the derivative of the function is
undefined or at a point where the median parabola attains its minimum
value which is 0.
The second case, in which the global minimum occurs at a slope where
the median parabola reaches its minimum value of 0, occurs only when a
majority of the data points lie on a straight line, causing the
minimum median deviation to occur when the ‘best’ fit straight line
passes directly through these points, giving each of these points a
deviation of 0. For all other sets of data, the first case provides
the set of slopes to check for a minimum of the median of the
deviations.
To show why the first case might produce a global minimum of the
median deviation one must notice that if one determines which data
point produces the median deviation at a given value for the slope,
this data point continues to generate the median deviation as the
slope, m, increases until a value of the slope is reached where the
parabola describing this median data point intersects the parabola due
to some other data point. At this slope this next data point becomes
the median. Since the median parabola is always either rising or
falling between intersections the minimum point must occur at one of
the end-points where an intersection occurs.
An algorithm to determine the exact minimum is based on the idea of
following the median parabola as the slope sweeps from mmin to mmax,
keeping track of the minimum deviation and the corresponding slope.
The problem is to find the value of the slope that causes the median
deviation to achieve its minimum value. The algorithm to solve this
problem begins by determining the data point that causes the median
deviation at the point mmin=min(yi/xi). The value of the slope and the
value of the deviation at that slope are stored.
Given the parabola that produces the median deviation for that slope,
the slope is increased to the next smallest value of the slope where
the median parabola intersects one of the other parabolas. At this
point the deviation is compared with the minimum deviation discovered
so far. If a new minimum has been found, the deviation and the slope
which produced it are stored. This procedure is continued as long as
the value of the slope is less than the maximum possible slope, mmax
=max(yi/xi). At the conclusion the minimum median deviation and the
corresponding slope have been found.
5. An Example
Figure 2 shows an example of this algorithm using the five data points
from Table 1. The slope varies from mmin=1.4 due to observation 5 to mmax=3
due to observation 3.

Figure 2. Individual Observation and Median SR as a function of m
Each parabola is labeled by the data point that relates to that
parabola, obs 1, obs 2, obs 3, obs 4 and obs 5. The median squared
residual for a given slope, m, is the median, or middle, one of the y
values of the 5 parabolas. The thick line follows the median, or 3rd,
deviation in this example of 5 data points. The vertical line at m =
1.85 shows that the second observation, (x2=2,y2=4), has a squared
residual value of 0.09 (=(4-1.85*2)2). At m = 1.85, the median squared
residual has a value of 1.96 and is due to the fourth observation, (x4=4,
y4=6).
For this data set, observation 5 has its minimum at a slope of 1.4
which is the minimum slope of the 5 observations while observation 1
has its minimum at a slope of 3.0 which is the maximum slope. Thus the
minimum median deviation must occur somewhere in the interval between
1.4 and 3.0. The algorithm begins with a slope of 1.4, choosing
observation 2 as the median parabola at that slope. It then determines
that the parabola due to observation 2 intersects the parabola for
observation 5 at a slope of approximately 1.57. At this point the
parabola for observation 5 becomes the median deviation with a value
of 0.74. At the slope of 1.67 this parabola intersects the parabola
due to observation 1. The deviation at this point is 1.77. Parabola 1
intersects with parabola 4 at a slope of 1.8 with a value of 1.44.
Parabola 4 intersects parabola 3 at a slope of 2.0 with a deviation of
4.0. Finally parabola 3 intersects parabola 2 at a slope of 2.4 with a
deviation value of 0.64. From these 5 intersections, described by the
x-y points {(1.57,0.74), (1.67,1.77), (1.8,1.44), (2.0,4.0),
(2.4,0.64)}, the algorithm chooses the point with the smallest
deviation. The 5th one, at a slope of 2.4 produces the smallest
deviation with a value of 0.64. This is the solution to the problem.
The ‘best’ straight line using these 5 data points is the one with a
slope of 2.4. All other slopes produce larger median squared
deviations.
It is possible for more than two parabolas to intersect at a point,
but the parabola that becomes the new median can be determined by
ordering the intersecting parabolas based on their slope and curvature
at the point of intersection. In the case of an even number of data
points it is necessary to follow two parabolas, representing the (n/2)th
deviation and the (n/2+1)th deviation, since the median is the average
of these two values.
When there are n data points the efficiency of this algorithm is O(n2
log n) in the worst case. It requires determining the intersections of
each of the parabolas with the median parabola to choose the next
intersection to use. It is possible that each parabola might be the
median parabola at some point of the algorithm. Thus one may have to
determine as many as intersections and these intersections
have to be ordered.
Figure 2 can also be used to show another view of how the PROGRESS
algorithm works in the zero intercept case. For each value of the
slope that causes the straight line to pass through a data point,
giving a squared residual of zero, the PROGRESS algorithm computes the
median deviation of the 5 data points. It then chooses the slope that
causes the minimum value of this set of deviations. The discussion
presented in the previous paragraphs makes clear why this approach
fails—the global minimum squared residual will not, in general, be
associated with a slope where a squared residual value for an
individual observation is zero. This will provide the correct result
only in the case where a majority of the data points lie on a straight
line through the origin.
6. Conclusion
=============
When applying Least Median of Squares, coefficients are chosen so as
to minimize the median of the squared residuals. Because the median is
not sensitive to extreme values, it can outperform conventional least
squares when data are contaminated. This paper makes two contributions
to the LMS literature:
1.
PROGRESS, the standard algorithm for fitting the LMS estimator,
does not find the true LMS fit when the intercept is suppressed.
Any computations based on the estimated slope (such as regression
diagnostics and estimated standard errors) are also wrong.
2.
For a bivariate regression with a zero-intercept, , an
algorithmic method based on keeping track of the median squared
residual is demonstrated.
References
Barreto, Humberto (2001) “An Introduction to Least Median of Squares,”
unpublished manuscript, http://www.wabash.edu/econexcel/LMSOrigin
(LMSIntro.doc).
Rousseeuw, Peter J. (1984) “Least Median of Squares Regression,”
Journal of the American Statistical Association, 79 (388), 871-880.
Rousseeuw, Peter J. and Annick M. Leroy (1987) Robust Regression and
Outlier Detection, John Wiley & Sons: New York.
1 “In simple regression (p=2), it follows from (Steele and Steiger
1986) that if all 2-subsets are used and their intercept is adjusted
each time, we obtain the exact LQS.” Rousseeuw and Hubert [1997], p.
9.
10
111584.doc

Menu

Least Median Of Squares And Regression Through The Origin