Automatic categorization or indexing of MediaWiki articles

mediawiki

I have a MediaWiki instance with thousands of articles in a certain scientific field. They're in a flat space with no categorization. I'd like to organize these automatically using data mining and language-processing techniques. In theory, I think looking for statistically unlikely phrases in each document would provide a good starting point.
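To make the "statistically unlikely phrases" idea concrete, here is a minimal sketch that scores two-word phrases with a TF-IDF-style weight: a phrase ranks high when it is frequent in one article but rare across the rest. The bigram tokenization, the weighting, and the function names are illustrative assumptions, not an established definition of the technique.

```python
import math
import re
from collections import Counter


def bigrams(text):
    """Split text into lowercase words and return adjacent word pairs."""
    words = re.findall(r"[a-z]+", text.lower())
    return [" ".join(pair) for pair in zip(words, words[1:])]


def unlikely_phrases(docs, top_n=3):
    """Map each title in `docs` (title -> text) to its top-scoring bigrams.

    Score = count in this document * log(N / number of documents
    containing the phrase), i.e. a plain TF-IDF weight.
    """
    counts = {title: Counter(bigrams(text)) for title, text in docs.items()}

    # Number of documents in which each phrase appears at least once.
    doc_freq = Counter()
    for phrase_counts in counts.values():
        doc_freq.update(phrase_counts.keys())

    n_docs = len(docs)
    result = {}
    for title, phrase_counts in counts.items():
        scores = {
            phrase: tf * math.log(n_docs / doc_freq[phrase])
            for phrase, tf in phrase_counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda p: (-scores[p], p))
        result[title] = ranked[:top_n]
    return result
```

With only a handful of articles the scores are noisy; on a few thousand articles in one field, phrases shared by small groups of pages are the natural category candidates.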

Right now I can do something like that through the MediaWiki API — pull down the documents, analyze them, and automatically write back categories or tags.
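For reference, that round trip might look roughly like the sketch below, using only the Python standard library. The endpoint URL and the `choose_categories` callback are hypothetical placeholders, and the sketch assumes the wiki accepts edits from this session (on most wikis you would need to log in first, which is left out here).

```python
import json
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

API_URL = "https://wiki.example.org/w/api.php"  # hypothetical endpoint


def fetch_params(title):
    """Parameters for reading the current wikitext of one page."""
    return {
        "action": "query", "prop": "revisions", "rvprop": "content",
        "rvslots": "main", "titles": title,
        "format": "json", "formatversion": "2",
    }


def edit_params(title, categories, token):
    """Parameters for appending one category tag per name to a page."""
    tags = "".join(f"\n[[Category:{name}]]" for name in categories)
    return {
        "action": "edit", "title": title, "appendtext": tags,
        "summary": "Add automatically derived categories",
        "format": "json", "token": token,
    }


def call_api(opener, params):
    """POST the parameters to the API and decode the JSON reply."""
    data = urllib.parse.urlencode(params).encode("utf-8")
    with opener.open(API_URL, data=data) as response:
        return json.load(response)


def categorize(title, choose_categories):
    """Fetch a page, let choose_categories(text) pick names, write them back."""
    # The cookie jar keeps the session so the CSRF token stays valid.
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar()))
    page = call_api(opener, fetch_params(title))["query"]["pages"][0]
    text = page["revisions"][0]["slots"]["main"]["content"]
    token_reply = call_api(
        opener, {"action": "query", "meta": "tokens", "format": "json"})
    token = token_reply["query"]["tokens"]["csrftoken"]
    return call_api(opener, edit_params(title, choose_categories(text), token))
```

Appending with `appendtext` keeps the category tags at the end of the page instead of rewriting the article body.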

But is there another way to do this? Looking around the web shows that there's been a huge amount of work on this kind of problem in general — but nothing that works specifically with MediaWiki in an automated, integrated solution. Is there such a thing?

Best Answer

This is only a partial solution, but if you use the Replace Text extension you can add categories across the whole wiki based on specific text. The catch is that the category tag then ends up in the wikitext wherever the statistically unlikely phrase occurs:

Find: "statistically unlikely phrase"

Replace with: "statistically unlikely phrase [[Category:Category name]]"
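If you want to preview what such a replacement does to a page before running it on the wiki, the same substitution can be sketched in Python. This is an illustration only, not how the extension is implemented: the function name is made up, and unlike a plain global replace it tags only the first occurrence and skips pages that already carry the category.

```python
import re


def add_category_after_phrase(wikitext, phrase, category):
    """Insert a category tag right after the first occurrence of phrase."""
    tag = f"[[Category:{category}]]"
    if tag in wikitext:
        # Already categorized: leave the page unchanged.
        return wikitext
    pattern = re.compile(re.escape(phrase))
    # count=1 limits the substitution to the first match.
    return pattern.sub(lambda m: f"{m.group(0)} {tag}", wikitext, count=1)
```

For example, calling it with the phrase "quantum annealing" and the category "Quantum annealing" leaves the sentence intact and places the tag directly after the first match.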