Algorithm/data structure to answer “what recipes can I make with this set of ingredients?”

algorithmsdata structures

Formally, let s(U, Q) = {V | VU and VQ} where U, Q, and V all represent sets, and U, more specifically, represents a set of sets. For the sake of example, U might be a set of the (sets of) ingredients required for various recipes in a cookbook with Q representing the set of ingredients I have V representing a recipe I could make with those ingredients. The query s(U, Q) corresponds to the question "What all can I make with these ingredients?"

What I'm looking for is a data representation that indexes U in such a way that it supports efficient queries of s(U, Q) where Q and all members of U will generally be small compared to the union of all the members of U. In addition, I'd like it to be able to efficiently update U (e.g., add or remove a recipe).

I can't help but think that this problem must be well understood, but I haven't been able to find a name or reference for it. Does anyone know of a strategy for solving this efficiently or a place where I can read more about it?

As far as thinking about a solution, one thought I had was to build a decision tree for the set U. At each node in the tree, the question "does your ingredient list contain x?" would be asked with x chosen to maximize the number of members of U that get eliminated by the answer. As U gets updated, this decision tree would need to be re-balanced to minimize the number of questions required to find the correct result. Another thought is to represent U with something like an n-dimensional boolean 'octree' (where n is the number of unique ingredients).

I believe that "What recipes can be made with these ingredients?" can be answered by taking the cartesian product of the (sets of ingredients required for the) recipes in the cookbook with the powerset of the ingredients one has and filtering the resulting ordered pairs for pairs in which both elements are equal, but this is not an efficient solution, and what I'm asking about is how to optimize this kind of operation; how would one compose this in SQL such that it would be efficient and what does SQL do that allows this to be efficient?

Although I use the illustration of a cookbook of recipes and a set of ingredients, I anticipate that the number of 'recipes' and the number of 'ingredients' will be very large (up to hundreds of thousands each), though the number of ingredients in a given recipe and the number of ingredients in a given ingredient set will be relatively small (probably around 10-50 for a typical 'recipe' and around 100 for a typical 'ingredient set'). Additionally, the most common operation will be the query s(U, Q), so it should be most optimal. This also means that a brute force algorithm that requires checking every recipe or operating over every ingredient would be undesirably slow on its own, however. With clever caching, I think this might not perform too badly, though.

Best Answer

For the numbers you gave, just brute force it.

Here's a JavaScript program that brute forces it for 10 ingredients in the DB, 10 recipes in the DB, each recipe needs 2 ingredients, and I have 5 ingredients available:

var i, j;
var numIngredients = 10;
var numRecipes = 10;
var numIngredientsPerRecipe = 2;
var numIngredientsInQuery = 5;

function containsAll(needles, haystack){ 
  var i, len;
  for(i = 0 , len = needles.length; i < len; i++){
      if(haystack.indexOf(needles[i]) == -1) {
          return false;
      }
  }
  return true;
}

// Set up a fake DB of recipes
var ingredients = [];
for (i = 0; i < numIngredients; i++) {
    ingredients.push(i);
}
console.log('Here are the ingredients:', ingredients);

var recipes = [];
for (i = 0; i < numRecipes; i++) {
    var neededIngredients = [];
    for (j = 0; j < numIngredientsPerRecipe; j++) {
        neededIngredients.push(Math.floor(Math.random() * numRecipes));
    }
    recipes.push({ recipeId: i, needed: neededIngredients});
}
console.log('Here are the recipes:', recipes);

// Set up a fake query
var ingredientsAvailable = [];
for (i = 0; i < numIngredientsInQuery; i++) {
    ingredientsAvailable.push(Math.floor(Math.random() * numRecipes));
}

console.log("Here's a query:", ingredientsAvailable);

//Time how long brute force takes
var start = Date.now();
var result = [];
for (i = 0; i < numRecipes; i++) {
    var candidateRecipe = recipes[i];
    if (containsAll(candidateRecipe.needed, ingredientsAvailable)) {
        result.push(candidateRecipe);
    }
}
var end = Date.now();
console.log('Found ' + result.length + ' recipes in ' + (end - start) + ' milliseconds.');
console.log(result);

It runs in 0 milliseconds. I chose these small numbers so you could run it yourself a couple of times and convince yourself it does what you want and is relatively free of bugs.

Now change it so that we have 1'000'000 ingredients in the DB, 1'000'000 recipes in the DB, 50 ingredients per recipe and 100 ingredients available to me. I.e. values that are all equal or greater than the largest use case you gave.

It runs in 125 milliseconds under nodejs, and this is with the dumbest implementation with absolutely no effort to optimize.

Related Topic