I was wondering how TinEye carries out a search. Does it store all the images and then extract EXIF data, which in turn must be stored in a database and queried against? If so, it is probably using some sort of keyword/pattern matching algorithm…
Algorithms – Tineye.com Search Algorithm Explained
algorithms, pattern matching, search
Related Solutions
There are a number of other search algorithms. Smith-Waterman is one of the better ones for human text, while BLAST is (so far) the best for searching DNA sequences. When you are presented with text containing spelling errors, such as "hlep" instead of "help", you are looking for the minimum edit distance.
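As a rough illustration of minimum edit distance, here is a minimal sketch of the classic Levenshtein variant (insertions, deletions and substitutions each cost 1). The function name `edit_distance` is just an illustrative choice, and this is not necessarily the exact metric any particular library uses.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts,
    deletes, or substitutions needed to turn a into b."""
    # prev[j] = distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    for wrong, right in [("hlep", "help"), ("kitten", "sitting")]:
        print(wrong, right, edit_distance(wrong, right))
```

Note that plain Levenshtein counts the swapped letters in "hlep" as two substitutions. A variant that treats an adjacent transposition as a single edit (Damerau-Levenshtein) would score it as 1.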
For a library that implements a number of these functions and can be used through CLR integration in SQL Server 2005 (and later), look at the SourceForge project SimMetrics. Blog post about SimMetrics.
http://staffwww.dcs.shef.ac.uk/people/S.Chapman/simmetrics.html
Soundex was developed because the primary differences between regional speech variations were almost exclusively in the vowels, which is why it tosses vowels out. It isn't good at coping with transposed letters.
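To make the vowel and transposition behaviour concrete, here is a small sketch of the common "American Soundex" rules, assuming plain ASCII input. The `soundex` function and `CODES` table are illustrative names, not taken from SimMetrics or any other library.

```python
# Map each coded consonant to its Soundex digit.
CODES = {}
for digit, letters in enumerate(
        ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word: str) -> str:
    """Return the 4-character Soundex code (assumes ASCII letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    last_code = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code:
            # Adjacent letters with the same code are written once.
            if code != last_code:
                result += code
            last_code = code
        elif ch not in "hw":
            # A vowel (or y) is dropped but separates repeated codes.
            last_code = ""
        # h and w are dropped and do not separate repeated codes.
    return (result + "000")[:4]


if __name__ == "__main__":
    for name in ["Smith", "Smyth", "Barton", "Batron"]:
        print(name, soundex(name))
```

Under these rules "Smith" and "Smyth" get the same code because only a vowel differs, while "Barton" and "Batron" get different codes because two consonants swapped places.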
Although it would be nearly impossible to track % complete accurately, because the number of links and keywords is not known in advance, it is possible to show a rough status by depth. For example, the first depth would be the URLs processed from the top level.
(100 / Total Pages) * Pages Processed = % current status
Total Pages = SELECT COUNT(*) FROM master_links
Pages Processed = SELECT COUNT(*) FROM master_links WHERE processed = true. When you have processed a page, simply set the flag in the db.
(This could similarly be done by populating an array with your db values and using the index value as your pages processed)
Note: You can only get the status for each level. Do not start crawling your sub_links until all the master_links are crawled - this will also allow you to avoid duplicate url crawls and should have a minimal impact on the total time.
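The per-level calculation could be sketched roughly as follows, using an in-memory SQLite database purely for illustration. The `master_links` and `sub_links` table names and the `processed` flag come from the description above; the `url` column, the `level_progress` helper and the example URLs are assumptions made for this sketch.

```python
import sqlite3

# Only these table names may be interpolated into the SQL text.
LEVEL_TABLES = ("master_links", "sub_links")


def level_progress(conn: sqlite3.Connection, table: str) -> float:
    """Percentage of rows in one crawl level flagged as processed."""
    if table not in LEVEL_TABLES:
        raise ValueError(f"unknown level table: {table}")
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if total == 0:
        return 0.0  # nothing queued at this level yet
    done = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE processed = 1"
    ).fetchone()[0]
    return 100.0 * done / total


# Hypothetical data: five top-level pages, two already crawled.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE master_links (url TEXT, processed INTEGER DEFAULT 0)")
conn.executemany(
    "INSERT INTO master_links (url) VALUES (?)",
    [(f"http://example.com/page{i}",) for i in range(5)],
)
conn.execute(
    "UPDATE master_links SET processed = 1 WHERE url IN (?, ?)",
    ("http://example.com/page0", "http://example.com/page1"),
)

pct = level_progress(conn, "master_links")
print(f"Master Links {pct:.0f}% complete")
```

SQLite stores booleans as 0 and 1, so the flag is compared against 1 here. Other databases may accept `processed = true` directly.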
The squares in the diagram below represent the pages which need to be processed. Inside each box is the percentage complete if you were processing them left to right. This is for illustrative purposes; the percentage would be based on this:
Your output would show percentage complete of that level:
e.g. Master Links 40% complete
or
e.g. Master Links 100%
Sub Links 49.8%
This should still give you enough info to indicate the progress; after all, you cannot guess the actual density of keywords and links...
Best Answer
The TinEye FAQ reads:
Chasing this to Idée and running a Google patent search for "idee image search" brings up a number of patents (mostly named "Methods and Systems for Content Processing").
I lack a digital signal processing background, but these patents do appear to be similar to what TinEye implements. If they are not specifically licensed by TinEye, they describe other algorithms that accomplish the same end (many of the results appear to reference things that TinEye does). The patents are much larger than those I have glanced at before, with some running to over 100 pages.
Unfortunately, neither of the founders of Idée, Inc. comes up in the patent search, even though searching by founder name is often a valid way to find the patents a company started with.