This is obviously non-trivial, but there are algorithms that at least attempt to do things like this. I hasten to add, however, that they're statistical, so trying to use only two sentences as a basis is going to be extremely iffy at best.
The usual approach runs something like this:
- filter out stop words
- use a thesaurus to substitute one canonical word for each word
- count occurrences of words in each document/sentence
- compute the cosine distance between the base document(s) and each candidate similar document
- pick the N closest to the base documents
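To make that concrete, here's a rough sketch in Python of the steps above, minus the thesaurus step; the stop-word list and documents are just placeholders:

```python
# Rough sketch of the steps above (without the thesaurus step).
# The stop-word list and the documents are illustrative placeholders only.
import math
from collections import Counter

STOP_WORDS = {"i", "it", "is", "a", "the", "have", "and", "of", "to", "in"}

def to_vector(text):
    """Tokenize, drop stop words, and count the remaining words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return Counter(w for w in words if w and w not in STOP_WORDS)

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

base = to_vector("It is cold and rainy today")
candidates = ["It is cold", "I have a cold", "The weather is warm"]

# Rank the candidates by similarity to the base document and keep the N closest.
ranked = sorted(candidates,
                key=lambda c: cosine_similarity(base, to_vector(c)),
                reverse=True)
print(ranked)
```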
Note that there's room for a lot of variation here, though. For example, the thesaurus step can give considerably better results if it's context sensitive, and to maintain the context you often want to retain the stop words, at least until that step is complete. Consider your base documents about the weather being compared to "I have a cold" and "It is cold". If you follow the steps above, both of these will have been reduced to just "cold" by step 2, and both will seem equally close to the base documents.
With a context-sensitive thesaurus step (an ontology, really), you'd use the extra words to disambiguate the two uses of "cold", so when you compute distances, one would refer to the disease named "the cold", and the other to "cold weather". The base documents would both refer to cold weather, so your result would show "It is cold" as similar, but "I have a cold" as different.
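If you want to experiment with that kind of disambiguation without building a full ontology, NLTK ships a simplified Lesk implementation over WordNet. A rough sketch (assuming the WordNet corpus has been downloaded) -- though on sentences this short the results are hit-or-miss, for exactly the reasons mentioned at the top:

```python
# Rough sketch of word-sense disambiguation with NLTK's simplified Lesk
# algorithm over WordNet (requires nltk.download('wordnet') beforehand).
from nltk.wsd import lesk

for sentence in ("I have a cold", "It is cold"):
    tokens = sentence.lower().split()
    sense = lesk(tokens, "cold")          # returns a WordNet Synset, or None
    if sense is not None:
        print(sentence, "->", sense.name(), "-", sense.definition())
```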
If you're trying to keep things simpler, however, you might skip the thesaurus completely, and just stem the words instead. This turns "rainy" and "raining" both into "rain", so when you do comparisons they'll all show up as synonymous.
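A quick sketch of the stemming route, using NLTK's Porter implementation (the word list is just for illustration, and exact outputs vary a bit between stemmers):

```python
# Rough sketch: collapse inflected forms with the Porter stemmer from NLTK.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ("raining", "rained", "rains"):
    print(word, "->", stemmer.stem(word))   # all three come out as "rain"
```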
As far as details go, there are plenty of easily found stop-word lists. At least in my testing, the exact choice isn't particularly critical.
For a thesaurus, I've used the Moby Thesaurus, with some (substantial) processing to basically invert it -- i.e., rather than finding multiple synonyms for one word, find one canonical word for a given input.
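The inversion itself is straightforward once you have the thesaurus as text. A sketch, assuming the usual one-entry-per-line, comma-separated mthesaur.txt layout with the root word first (adjust the parsing if your copy differs):

```python
# Rough sketch: invert a Moby-style thesaurus file into a synonym -> canonical map.
# Assumes each line is "rootword,synonym1,synonym2,..." (the mthesaur.txt layout);
# adjust the parsing if your copy is formatted differently.
canonical = {}
with open("mthesaur.txt", encoding="utf-8") as f:
    for line in f:
        words = [w.strip() for w in line.split(",") if w.strip()]
        if not words:
            continue
        root, synonyms = words[0], words[1:]
        for synonym in synonyms:
            # Keep the first canonical word we see for each synonym.
            canonical.setdefault(synonym, root)

print(len(canonical), "synonym -> canonical mappings")
```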
There aren't as many papers on context-sensitive synonym/definition searching -- but some are still quite good. A lot of work on the "semantic web" and related ontologies is along this line as well (though little of it is likely to be of much help in your case).
For stemming, the Porter Stemmer is well known. There's a newer, slightly modified version (Porter2) that should be covered somewhere on the same page(s). Another well-known algorithm is the Lancaster Stemmer. There's also the Lovins stemmer, but I wouldn't really recommend it¹ -- though it's still widely known because it was the first (well known) stemming algorithm published. Note that most (all?) of these strip only suffixes, not prefixes.
Quite a few papers discuss cosine distance. It's well enough known that even the Wikipedia entry for it is pretty decent.
Quite a few people have already assembled these pieces into coherent (at least they generally try to be coherent) tool-kits, complete programs, etc. A few reasonably well known examples include WordNet, NLTK, Apache OpenNLP, and Freeling.
¹ In particular, Lovins only ever removes one suffix from a word. If you had, for example, "loverly" and "lovingly", Porter would reduce both to "lov" and they'd show up as synonyms, but Lovins would reduce them to "lover" and "loving", respectively, and they'd show up as different. It is possible to repeat the Lovins algorithm until it removes no more suffixes, but the result isn't very good -- Porter has quite a bit of context sensitivity, so (for example) it removes one suffix only if it did not already remove another; repeated applications of Lovins wouldn't take this into account.
Since you are moving away from a single master node (which is appropriate), you will have to change a few things. You will need to set up a Quorum. Since you already have 9 nodes, you are in good shape. For a Quorum to work you need 2n+1 nodes, where n is the number of nodes that can go down while the system keeps working -- with 9 nodes, up to 4 can fail and the remaining 5 still form a majority. Within the Quorum, a vote takes place on who the leader is and which transactions are successful. This can be used to pass around configuration information and ensure everyone stays synchronized without a database.
There are existing technologies out there that can help you with this. One of those is ZooKeeper, an open source product (Apache License 2.0) for distributed coordination. You will need something along these lines; whether you use ZooKeeper or roll your own, its white papers will be invaluable. It can also be used to maintain the configuration information about each node.
ZooKeeper is written in Java, but I have created a project (ZooKeeperNet) that allows it to be embedded within a .NET application using IKVM. If this isn't acceptable, then you'll want to read about Leader Elections when determining who will be the current Master node. I suggest reading all their Wiki pages and Recipes to get an idea of what you need to account for in a proper distributed system.
Just so you have a good understanding: ZooKeeper is the backing coordination system for Hadoop and HBase, and Hadoop is a distributed Map/Reduce framework.
If you aren't already, you can use WCF ad-hoc or registry discovery when attempting to find the current master node in your system. If only a single Master node is alive, it will be the only one registered as supporting the IMaster features. Your slave nodes can then watch each other's znodes for one going away, so another can pick up being the Master almost immediately.
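For illustration, here's a rough sketch of that watch-the-znode-ahead-of-you leader-election recipe using the Python kazoo client; the host and path are placeholders, and ZooKeeperNet exposes equivalent primitives:

```python
# Rough sketch of ZooKeeper leader election via ephemeral sequential znodes,
# using the Python kazoo client purely for illustration; the /election path
# and host are placeholders.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/election")

# Each node registers an ephemeral, sequential znode; it disappears if the node dies.
me = zk.create("/election/node-", ephemeral=True, sequence=True)

def check_leadership(event=None):
    children = sorted(zk.get_children("/election"))
    my_name = me.rsplit("/", 1)[-1]
    if children and children[0] == my_name:
        print("I am the Master")
    else:
        # Watch the znode just ahead of ours; when it vanishes, re-check.
        ahead = children[children.index(my_name) - 1]
        zk.exists("/election/" + ahead, watch=check_leadership)
        print("I am a slave; watching", ahead)

check_leadership()
```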
Keep in mind that, to be highly efficient, the data each node works with has to be close to that node (i.e. on the node itself). If one node acts as a data intermediary, you won't be as efficient as you would be if the nodes could pull data in a distributed fashion.
Best Answer
One technique that can be used to perform clustering on multi-dimensional numeric data is the Kohonen self-organising feature map. It's a little too involved to describe here, but should be included in any beginner's level text on machine learning.
This just leaves the problem of how to convert your data to numeric form. To do this, I'd first run an analysis to find a reasonable number (say 100) of words that appear in many of your strings, but not in too many. You're looking for words in the middle of the frequency distribution, as these carry the most useful information. You can then use the presence or absence of these words as inputs to your feature map.
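As a rough sketch of both steps -- picking mid-frequency words and feeding presence/absence vectors to a small self-organising map -- here's a toy version in Python with numpy; the strings, grid size, and training schedule are placeholders, and anything real should use a proper SOM implementation and far more data:

```python
# Rough sketch: binary word-presence features + a toy Kohonen SOM in numpy.
# The strings, vocabulary cut-offs, and grid size are placeholders.
import numpy as np
from collections import Counter

strings = ["it is cold and rainy", "i have a cold", "warm and sunny today",
           "rainy and windy", "sunny but cold"]

# Pick "middle of the distribution" words: drop the very common and the very rare.
counts = Counter(w for s in strings for w in set(s.split()))
vocab = sorted(w for w, c in counts.items() if 1 < c < len(strings))

# Binary presence/absence vectors over that vocabulary.
data = np.array([[1.0 if w in s.split() else 0.0 for w in vocab] for s in strings])

# A tiny Kohonen SOM: a 3x3 grid of weight vectors.
rng = np.random.default_rng(0)
side = 3
grid = np.array([(i, j) for i in range(side) for j in range(side)], dtype=float)
weights = rng.random((side * side, len(vocab)))

epochs = 200
for epoch in range(epochs):
    lr = 0.5 * (1 - epoch / epochs)            # decaying learning rate
    radius = 1.5 * (1 - epoch / epochs) + 0.5  # decaying neighbourhood radius
    for x in data:
        winner = np.argmin(np.linalg.norm(weights - x, axis=1))
        # Pull the winner and its grid neighbours toward the input.
        dist = np.linalg.norm(grid - grid[winner], axis=1)
        influence = np.exp(-(dist ** 2) / (2 * radius ** 2))
        weights += lr * influence[:, None] * (x - weights)

# Each string maps to the grid cell it ends up closest to; nearby cells hold similar strings.
for s, x in zip(strings, data):
    print(np.argmin(np.linalg.norm(weights - x, axis=1)), s)
```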