Algorithms – How to Find Matching Profiles?

algorithms, data, document-databases, pattern-matching

I'm developing the backend for a dating app, in which each user has:

a profile of his/her characteristics
a profile of his/her ideal match's characteristics

There are dozens of characteristics like gender, height, looks and so on.
Some characteristics are strings, others are numbers or arrays.
Each characteristic is assigned an importance factor ranging from 0 to 4.
0 means not important at all and 4 means absolutely necessary.

So a user's ideal-match object looks like this:

    {
      {
        gender: 'female',
        importance: 4
      },
      {
        eyeColor: ['blue', 'green'],
        importance: 2
      },
      {
        ethnicity: [],
        importance: 0
      },
      heightMin: 150,
      heightMax: 200,
      heightImportance: 3,
      ....
    }

The data is stored in MongoDB and the backend runs on Node.js.

I'm new to data science. I only know that there are formulas for measuring similarity or distance between vectors, such as Euclidean distance or cosine similarity, but I'm not sure which method (if any) is the most relevant in these circumstances.

Appreciate your hints.

Best Answer

Identify the different kinds of characteristics

Your sample data illustrates very well that different kinds of characteristics need to be handled in different ways:

Height is a scalar attribute: a profile has one numeric value, while the ideal specifies a range.
Ethnicity is a unique attribute: a profile has only one value, but the ideal may list several alternatives.
Eye color could be a multiple-value attribute: although most people have only one color in their profile, some have several. The ideal can list several colors with the intent of finding any one of them. For example, the ideal "green,blue" should be understood as "green OR blue": a profile having both should match, but a profile having only blue should match as well.
Hobbies (not in your example) would be an option attribute: a profile could have several, and the ideal could list several. Then, the more hobbies match, the higher the affinity.
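
As a rough sketch of the four kinds above, each one could get its own small matcher that returns a score between 0 and 1. The function names are hypothetical, and treating an empty list in the ideal as "no preference" is an assumption, not something stated in the question.

```javascript
// Hypothetical per-kind matchers; each returns a score between 0 and 1.

// Scalar: the profile holds one number, the ideal holds a range.
function matchScalar(value, min, max) {
  return value >= min && value <= max ? 1 : 0;
}

// Unique: the profile holds one value, the ideal lists acceptable alternatives.
// Assumption: an empty list means "no preference", so everything matches.
function matchUnique(value, accepted) {
  if (accepted.length === 0) return 1;
  return accepted.includes(value) ? 1 : 0;
}

// Multiple-value: the profile may hold several values; one hit is enough (OR).
function matchAny(values, accepted) {
  if (accepted.length === 0) return 1;
  return values.some((v) => accepted.includes(v)) ? 1 : 0;
}

// Option: the share of the ideal's options that the profile also has.
function matchOptions(values, wanted) {
  if (wanted.length === 0) return 1;
  const hits = wanted.filter((w) => values.includes(w)).length;
  return hits / wanted.length;
}
```

For instance, `matchAny(['blue'], ['blue', 'green'])` counts as a match, while `matchOptions` grows with the number of shared hobbies.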

Define a scoring function

Once all the characteristics are properly categorized in this way, you are ready to build a general scoring function that:

Scores each pair of characteristics: this can be as simple as 1 (match) and 0 (no match). It can be more graded to show how strong a match is, with 1.0 (all options are there), 0.8 (4 out of 5 options are there), ..., 0 (no match). It could also be a more elaborate calculation with thresholds, ceilings, etc.
Aggregates the scores into a global score for the profile: here you need to experiment to find a meaningful aggregation. For example, should two matching characteristics of importance 1 outweigh one match of importance 2? Another example: shouldn't a missing match on a characteristic of importance 3 reduce the score?
Eliminates unacceptable results: importance 4 means absolutely necessary, so a no-match on such a criterion must result in a global score of 0, whatever the results on the other criteria.
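
One possible aggregation, shown only as a starting point for experiments, is an importance-weighted average with a veto for must-have criteria. The name `aggregateScore` and the input shape (one `{ score, importance }` entry per characteristic) are assumptions made for this sketch.

```javascript
const REQUIRED = 4; // importance 4 means "absolutely necessary"

// results: [{ score: 0..1, importance: 0..4 }, ...], one entry per characteristic
function aggregateScore(results) {
  let weighted = 0;
  let totalWeight = 0;
  for (const { score, importance } of results) {
    // Veto: a complete miss on a must-have criterion eliminates the profile.
    if (importance === REQUIRED && score === 0) return 0;
    weighted += score * importance; // importance 0 contributes nothing
    totalWeight += importance;
  }
  // Assumption: with no weighted criteria at all there is nothing to rank on.
  return totalWeight === 0 ? 0 : weighted / totalWeight;
}
```

With this linear weighting, two matches of importance 1 count exactly as much as one match of importance 2, and a miss on an importance-3 criterion lowers the score without eliminating the profile. A non-linear weight (for example the square of the importance) would change that balance.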

Improve performance

You then have to complement your scoring with:

a preselection logic that uses at least some of the ideal's criteria to select a subset of relevant records: this avoids calculating the matching score for every profile in your database.
a filter to eliminate scores that are too low, especially if there are many matches.
a final sort to present the best-matching profiles first.
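
The three steps above could look roughly like this. The sketch assumes a normalized `ideal` object and profile documents that store `gender` as a string and `height` as a number; these shapes and the function names are hypothetical. Only the must-have (importance 4) criteria are pushed into the MongoDB query filter, using the standard `$gte` and `$lte` operators.

```javascript
// Build a MongoDB query filter from the must-have criteria of an ideal.
// The `ideal` shape and the profile field names are hypothetical.
function buildPreselectionFilter(ideal) {
  const filter = {};
  if (ideal.gender && ideal.gender.importance === 4) {
    filter.gender = ideal.gender.value;
  }
  if (ideal.height && ideal.height.importance === 4) {
    filter.height = { $gte: ideal.height.min, $lte: ideal.height.max };
  }
  return filter;
}

// Drop candidates whose score is below a threshold, then sort best-first.
// scored: [{ id, score }, ...]
function rankCandidates(scored, minScore) {
  return scored
    .filter((c) => c.score >= minScore)
    .sort((a, b) => b.score - a.score);
}
```

The resulting filter object can be passed to `collection.find(filter)` in the MongoDB Node.js driver, so that the scoring function only runs on the preselected documents.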

Future improvements

You could think about the following, but at a later stage:

Should the score be unidirectional only? Think about it for a moment: the nice young lady's profile will be matched by an awful lot of old men, and after a series of uninteresting solicitations she'll leave the site. What if you combined score(ideal 1, profile 2) with score(ideal 2, profile 1) in some way?
String values compare very inefficiently. So in the end you may want a different encoding scheme that can be processed more quickly (you mentioned vectors). But this is the cherry on the cake. Start simple.
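
One way to combine the two directions, offered only as an example, is the geometric mean of both scores. The function name `mutualScore` is hypothetical, and the choice of the geometric mean is an assumption; the minimum of the two scores would be another reasonable candidate.

```javascript
// Combine score(ideal 1, profile 2) and score(ideal 2, profile 1),
// both expected to be in the range 0..1.
// The geometric mean is 0 as soon as either side is 0, so a veto
// in one direction still eliminates the pair.
function mutualScore(scoreAtoB, scoreBtoA) {
  return Math.sqrt(scoreAtoB * scoreBtoA);
}
```

Compared with a plain average, this penalizes lopsided pairs: a perfect match in one direction cannot compensate for a poor match in the other.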

