Find a line that is closest to scattered points

algorithmsgeometrypuzzles

I came across the below interview question on Glassdoor:

A scatter graph of points on a page, draw a horizontal line across the page such the the perpendicular y distance to the line from all points in aggregate is minimized. Describe an algorithm for placing this line optimally

My approach:

I think we can compute the mean of y-distances and place the line there.

However, I am not sure if this is correct or if there is a better approach to solving this problem.

Best Answer

First, since we only care about the y-distance and will be drawing a horizontal line, we only need to think about the y-coordinates of the points and the y-coordinate that defines the line. The distance between a point and our line will be the absolute difference between the y-coordinate of the point and the y-coordinate that defines the line.

So, rephrasing the problem, we have a set of numbers, y_1 to y_n, and need a number, z, that minimizes the aggregate of the absolute differences between z and the points y_1 to y_n. Instead of minimizing the aggregate, we can just minimize the sum and get the correct result (aggregate = sum / number_of_points).

It turns out that it is the median that does this, not the mean.

https://math.stackexchange.com/questions/113270/the-median-minimizes-the-sum-of-absolute-deviations

For intuition, have points at y-coordinates 10, 10, 10 and 110. The median is 10, aggregate distance is (0+0+0+100)/4=25. The mean is 140/4=35, aggregate distance is (25+25+25+75)/4=37,5. In fact moving the line any distance, d, away from y-coordinate 10 towards 100 increased the distance to 3 points (with d) while only decreasing the distance to 1 point (with d) and hence increasing the aggregate.

(If we would take squares of distances then the mean would be the correct answer)

Related Solutions

Closest Points Between Rectangles – Algorithm Solutions

How you do it depends on what your "business" requirements are.

If simplicity of source code (maintainability) is your primary concern, then brute force calculation of all of the point to point distances will be a simple and effective method. If you have to optimize for processing speed, then you may want to find an optimized algorithm. I suspect that brute force will be nearly optimal in any case since there are only 16 point-to-point distances to calculate.

Algorithms – Finding Points with Pairwise Distances Matching a Distance Matrix

This is the standard problem known as "Multidimensional Scaling".

There are lots of varieties, so I recommend reading the wikipedia page on the subject (it's not so long), but in your case, it looks like plain PCA (Principal component analysis) is the way to go.

In particular, a (squared) distance matrix D is closely related to B, the product of the transposed position matrix X with itself:

B = -0.5 * J D J where J = I - 1/n (i.e. 1-1/n on the diagonal, -1/n elsewhere)

B = X^T X (by construction)

PCA will give you the eigenvalues and eigenvectors such that

B = U^T L U

with the columns of U being the eigenvectors, and the diagonal of L being the eigenvalues.

Then it's just a question of multiplication:

X = L^(1/2) U

If you're not used to math this may sound a little scary, but PCA is almost certainly available as a library whatever platform you're on, and then it's just a question of a multiplying a diagonal matrix with matrix, which is easy to do manually if necessary. PCA will reconstruct a full-dimensional space, so if you want just the 2 or 3 most significant dimensions, simply crop the resultant matrix. More significant dimensions are those with the highest eigenvalue; most PCA algorithms list them in order, so you can just take the top few rows.

There are various documents online that explain this in more detail; alas wikipedia falls short here. See, for instance: Wickelmaier, F. (2003). An introduction to MDS. Reports from the Sound Quality Research Unit (SQRU), No. 7 pages 9-11 for more info.

The above is a general pattern you might apply in many languages. To verify this works, I tested it using Eigen which is the fastest linear algebra library I know of that can express this kind of thing simply.

I'll walk you through it:

#include <iostream>
#include <Eigen/Core>
#include <Eigen/Eigenvalues> 

using namespace Eigen;
using namespace std;

#define DBG(s) (cout<< #s<<":\n"<<(s)<<"\n\n" )


int main(int argc, char* argv[])
{

The above is just boilerplate referring to Eigen and simple console output function.

    auto const n = 4;//4 points.

    MatrixXd points(3, n);
    points <<
        0.0,   4.0,  2.0,   3.0,
        3.0,   1.0,  0.0,  -4.0,
        0.01, -0.1, -0.05, -0.01;


    MatrixXd D(n, n);
    for (int i = 0; i < n; i++)
        D.col(i) = (points.colwise() - points.col(i)).colwise().squaredNorm().transpose();

    DBG(D);
    //      0 20.0121 13.0036 58.0004
    //20.0121       0  5.0025 26.0081
    //13.0036  5.0025       0 17.0016
    //58.0004 26.0081 17.0016       0

I'm initializing the input (squared) distance matrix D with real distances to make sure they sort of make sense. Notice that the third dimensions values are all really small - I intend to make a 2D reconstruction and I'd like data that's not quite perfect :-).

    MatrixXd J = MatrixXd::Identity(n, n) - MatrixXd::Constant(n, n, 1.0 / 4.0);
    DBG(J);
    // 0.75 -0.25 -0.25 -0.25
    //-0.25  0.75 -0.25 -0.25
    //-0.25 -0.25  0.75 -0.25
    //-0.25 -0.25 -0.25  0.75

    MatrixXd B = -0.5 * J * D * J;
    DBG(B);
    //  14.0648 -0.940469  0.561906  -13.6862
    //-0.940469   4.06641 -0.436719  -2.68922
    // 0.561906 -0.436719 0.0626563 -0.187844
    // -13.6862  -2.68922 -0.187844   16.5633

Note that B has a non-zero diagonal (you worried about this in the comments below).

    SelfAdjointEigenSolver<MatrixXd> eigendecomp(B);
    DBG(eigendecomp.eigenvalues());
    //-1.77725e-015
    //  0.000568306
    //      5.61749
    //       29.139

Note that Eigen returns eigenvalues in ascending order; also note that the first eigenvalue is negative. This is a problem, because we're going to take the square root. Of course, this is just a numeric artifact; since our input data was 3D, it makes sense that the least-significant dimention of a 4D representation is just noise. It's tiny too. In general, you need to make sure that you pick the right number of dimensions to use, because too few and you'll introduce error; too many and you'll run into negative eigenvalues. Note that if any of these negative eigenvalues were large, that would indicate your input is bad, such as that it violates the triangle inequality that A to C via B is always longer than A to C directly.

Your input above has such a violation: points 0 and 2 are the same (have 0 distance), yet point 0 is at distance 0.4 from point 1, but point 2 is at distance 0.2. Clearly regardless of dimensionality or where you put the points, this is impossible. That's going to cause negative eigenvalues that aren't just numerical noise. There are various hacks to mitigate that, but let's get back on track and use those eigenvalues:

    MatrixXd X = eigendecomp.eigenvalues().bottomRows(2).cwiseSqrt().asDiagonal()
        * eigendecomp.eigenvectors().transpose().bottomRows(2);

That's just the wordy way of saying X = L^(1/2) U while only considering the most significant 2 dimensions.

    DBG(X.transpose() * X);
    //  14.0647 -0.940508  0.562078  -13.6863
    //-0.940508   4.06638 -0.436623  -2.68925
    // 0.562078 -0.436623 0.0622356 -0.187691
    // -13.6863  -2.68925 -0.187691   16.5632

Right, so we can verify that the decomposition works: this is quite similar to B. It's not exactly the same because we chopped off two dimensions, after all.

    DBG(X);
    //-0.999577   1.99533 -0.232164 -0.763594
    //  3.61463  0.291588 0.0912992  -3.99751

These points turn out not to be very similar to the original points we started from. That's not actually surprising; after all, distances are invariant under rotation and translation, so you'll find that these points are zero-centered, and they're arbitrarily rotated compared to the points we started with. You could actually reconstruct the rotation matrix effectively used with another PCA, but let's not :-).

    MatrixXd D2(n, n);
    for (int i = 0; i < n; i++)
        D2.col(i) = (X.colwise() - X.col(i)).colwise().squaredNorm().transpose();
    DBG(D2);
    //      0 20.0121 13.0028 58.0004
    //20.0121       0 5.00187 26.0081
    //13.0028 5.00187       0 17.0008
    //58.0004 26.0081 17.0008       0

    return 0;
}

As a final verification, note that the distances these new points generate are pretty much identical to the distances we started with. Numerical inaccuracies and more importantly our 3D -> 2D reduction mean it's not completely perfect, but you should be able to see that D is almost the same as D2.

Note for when your data really isn't very clean and/or doesn't actually represent Euclidean distances: MDS (using PCA) solves the optimization problem of minimizing the squared error of the distance. That's all fine and dandy, but notice that errors in large distances are (over-)emphasized here (1² - 0² = 1, but 11² - 10² = 21, so MDS will try 21 times as hard to fix the second error). If your distances aren't perfect, PCA will try to make the "most significant" i.e. largest distance fit best. This is often actually exactly the wrong thing to do; usually, you care more about local structure than global structure. IF your distance matrix is inaccurate and you cannot reconstruct a good overall position matrix, you might want to look at techniques that at least reconstruct local neighbourhoods best. State of the art there (AFAIK) is t-SNE: you won't get an absolutely accurate map from that technique, but you will get the right things close to each other, at least.

Good luck - sounds like a fun problem!

Best Answer

Related Solutions

Closest Points Between Rectangles – Algorithm Solutions

Algorithms – Finding Points with Pairwise Distances Matching a Distance Matrix

Related Topic