C# – Efficient Strategy for Searching Large Text Areas for Multiple Values

asp.net-mvc c# search strings

I have a requirement for a service that does the following.

Take a block of text and identify the servers mentioned in it (by name or IP address). So given:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec libero felis, accumsan in nunc id, lacinia rutrum libero. Server1 Praesent iaculis consequat est quis elementum. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos Server2 himenaeos. Cras aliquet nisl non tortor interdum semper. Nulla commodo dignissim justo, eu accumsan neque eleifend ut. Etiam malesuada volutpat dolor 192.168.0.2 laoreet placerat. Maecenas posuere ipsum mattis egestas elementum.

The service would return:

  • Server1
  • Server2
  • Server3 (which has IP address 192.168.0.2)

There are around 7,000 servers and addresses in my DB. At the moment, the only strategy I have is to take the text block as a string and loop through all the servers twice (once by name, once by IP address), calling string.Contains() each time.
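For reference, the current approach looks roughly like the sketch below. The `Server` class and `BruteForceMatcher` name are placeholders for illustration, not the real schema, and `IndexOf` stands in for `Contains` so that the comparison mode is explicit.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape of a server row; the real table layout is not shown here.
public sealed class Server
{
    public Server(string name, string ipAddress)
    {
        Name = name;
        IpAddress = ipAddress;
    }

    public string Name { get; }
    public string IpAddress { get; }
}

public static class BruteForceMatcher
{
    // Scans the whole text once per name and once per IP address,
    // so 7,000 servers means up to 14,000 substring searches.
    public static List<string> FindServers(string text, IEnumerable<Server> servers)
    {
        var found = new List<string>();
        foreach (var server in servers)
        {
            bool byName = text.IndexOf(server.Name, StringComparison.OrdinalIgnoreCase) >= 0;
            bool byIp = text.IndexOf(server.IpAddress, StringComparison.Ordinal) >= 0;
            if (byName || byIp)
            {
                found.Add(server.Name);
            }
        }
        return found;
    }
}
```

Note that plain substring matching also reports Server1 when the text only contains Server10.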

Issuing 14,000 Contains calls seems a bit "brute force". Is there a more elegant way to achieve the same result?

For context, this is a REST service running on ASP.NET MVC and C#.

Best Answer

If your current code is simple and fast enough for your needs, do nothing. Optimizing just because "it seems a bit brute force" is not a good reason; it will mostly complicate things for no benefit. Do not fall into the trap of premature optimization.

However, if your current code really is too slow for your purposes, first measure where the bottleneck is. Is it really the 14,000 calls to string.Contains, or is it selecting the 14,000 server names and IP addresses from your database? The first issue might be approached by splitting the text into words that could potentially be server names and looking them up in a hash set or a more sophisticated data structure. The second might be approached by splitting the text the same way and using the words as SELECT criteria, assuming your database is properly indexed. That could increase the number of round trips; to avoid that, you could implement a stored procedure in your DB, pass the text over the network once, and let the stored procedure do the work.
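As a rough illustration of the first option, here is a minimal sketch of the split-into-words-and-look-up idea. The `TokenMatcher` class, the token pattern, and the name-to-IP input shape are all assumptions made for the example; in particular, the pattern assumes server names consist only of letters, digits, dots, and hyphens, which may not hold for your data.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenMatcher
{
    // Candidate tokens: runs of letters, digits, dots, and hyphens.
    // Assumption: this covers both server names and IPv4 addresses.
    private static readonly Regex TokenPattern =
        new Regex(@"[A-Za-z0-9.\-]+", RegexOptions.Compiled);

    // Maps every name and every IP address to the server's name,
    // so a single lookup answers "is this token a known server?".
    public static Dictionary<string, string> BuildIndex(
        IEnumerable<KeyValuePair<string, string>> nameToIp)
    {
        var index = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (var pair in nameToIp)
        {
            index[pair.Key] = pair.Key;
            index[pair.Value] = pair.Key;
        }
        return index;
    }

    // Returns each matched server once, in order of first appearance in the text.
    public static List<string> FindServers(string text, Dictionary<string, string> index)
    {
        var found = new List<string>();
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match match in TokenPattern.Matches(text))
        {
            // Drop sentence punctuation stuck to a token, e.g. "Server1." at the end of a sentence.
            string token = match.Value.Trim('.', '-');
            if (token.Length > 0
                && index.TryGetValue(token, out var name)
                && seen.Add(name))
            {
                found.Add(name);
            }
        }
        return found;
    }
}
```

With this layout the index is built once (for example, at startup or on a cache refresh), and each request costs one dictionary lookup per word in the text rather than one scan of the text per server. Unlike string.Contains, it matches whole tokens only, so Server1 is not reported for a text that mentions Server10.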

All of these solutions, however, will result in more complicated code than you have now, so make sure this is worth the hassle, otherwise you are probably sacrificing a maintainable solution for useless overcomplication.
