Find a line that is closest to scattered points

algorithmsgeometrypuzzles

I came across the below interview question on Glassdoor:

A scatter graph of points on a page, draw a horizontal line across the page such the the perpendicular y distance to the line from all points in aggregate is minimized. Describe an algorithm for placing this line optimally

My approach:

I think we can compute the mean of y-distances and place the line there.

However, I am not sure if this is correct or if there is a better approach to solving this problem.

Best Answer

First, since we only care about the y-distance and will be drawing a horizontal line, we only need to think about the y-coordinates of the points and the y-coordinate that defines the line. The distance between a point and our line will be the absolute difference between the y-coordinate of the point and the y-coordinate that defines the line.

So, rephrasing the problem, we have a set of numbers, y_1 to y_n, and need a number, z, that minimizes the aggregate of the absolute differences between z and the points y_1 to y_n. Instead of minimizing the aggregate, we can just minimize the sum and get the correct result (aggregate = sum / number_of_points).

It turns out that it is the median that does this, not the mean.

https://math.stackexchange.com/questions/113270/the-median-minimizes-the-sum-of-absolute-deviations

For intuition, have points at y-coordinates 10, 10, 10 and 110. The median is 10, aggregate distance is (0+0+0+100)/4=25. The mean is 140/4=35, aggregate distance is (25+25+25+75)/4=37,5. In fact moving the line any distance, d, away from y-coordinate 10 towards 100 increased the distance to 3 points (with d) while only decreasing the distance to 1 point (with d) and hence increasing the aggregate.

(If we would take squares of distances then the mean would be the correct answer)