Design Pattern for Multi-Threaded URL Fetcher in Java

design-patterns · java · multithreading

I'm looking for hints and suggestions on a design for a multi-threaded URL fetcher in Java. The specific requirements are:

  • Fetch each of around 1,000 URLs periodically
  • The interval between fetches will be URL-specific
  • Intervals are likely to range from 2 minutes to 1 hour

I'm imagining I will need a bunch of fetchers, each running in its own thread, that get pushed the next URL to fetch whenever they are in a "ready" state.

I will also need to handle errors, e.g. stop querying a specific URL if it repeatedly times out or returns 404s.
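Roughly, I'm picturing each URL carrying its own state, something like this (just a rough sketch, names made up):

```java
// Rough sketch of the per-URL state I have in mind (illustrative only).
class FetchTask {
    final String url;
    final long intervalMillis;      // URL-specific interval, 2 min .. 1 hour
    long nextDueMillis;             // when this URL should next be fetched
    int consecutiveFailures;        // to stop querying after repeated timeouts/404s

    FetchTask(String url, long intervalMillis) {
        this.url = url;
        this.intervalMillis = intervalMillis;
        this.nextDueMillis = System.currentTimeMillis();
    }
}
```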

Any ideas much appreciated.

Thanks

Best Answer

1,000 threads (some of them active only once an hour) is a big no-no, and so is starting a new thread for each job that may finish a few seconds later. Instead, make one "scheduler" thread that selects URLs for retrieval, and a number of worker threads that report their state back to it. On each pass, the scheduler:

  • performs worker thread-pool management: if no threads are free, it spawns some new ones; if more than X threads (say, 3) are idle, it ends the extra ones,
  • selects the next URL due for retrieval (or skips this step if nothing is due),
  • finds the first free thread and assigns it the job,
  • collects results from threads that have finished (if any).

It then sleeps and repeats the loop. Essentially, you have a semi-realtime parent thread that does all the "fast" work, and worker threads that spend most of their time waiting for their next job.
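Here is a minimal sketch of that loop, assuming a cached thread pool stands in for the hand-rolled pool management (it spawns a worker when none are free and ends workers that have been idle for 60 seconds); `FetchJob`, `fetch()` and the failure cut-off of 5 are illustrative, not prescribed:

```java
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the scheduler/worker design described above.
public class FetchScheduler implements Runnable {

    // Illustrative per-URL job: interval, next due time, failure count.
    static final class FetchJob {
        final String url;
        final long intervalMillis;
        long nextDueMillis = System.currentTimeMillis();
        final AtomicInteger failures = new AtomicInteger();

        FetchJob(String url, long intervalMillis) {
            this.url = url;
            this.intervalMillis = intervalMillis;
        }
    }

    private final List<FetchJob> jobs;
    // Cached pool: spawns a thread when none are free, reaps threads idle for 60 s.
    private final ExecutorService workers = Executors.newCachedThreadPool();
    private final CompletionService<FetchJob> finished =
            new ExecutorCompletionService<>(workers);

    public FetchScheduler(List<FetchJob> jobs) {
        this.jobs = jobs;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long now = System.currentTimeMillis();

            // Select URLs that are due and hand each one to a free worker.
            for (FetchJob job : jobs) {
                if (job.nextDueMillis <= now && job.failures.get() < 5) {
                    job.nextDueMillis = now + job.intervalMillis;   // reschedule
                    finished.submit(() -> {
                        try {
                            fetch(job.url);                 // the actual HTTP call
                            job.failures.set(0);
                        } catch (Exception e) {
                            job.failures.incrementAndGet(); // repeated failures drop the URL
                        }
                        return job;
                    });
                }
            }

            // Collect results from workers that finished (if any).
            Future<FetchJob> done;
            while ((done = finished.poll()) != null) {
                // Process fetched content here; this sketch just drains the queue.
            }

            // Then sleep and repeat the loop.
            try {
                Thread.sleep(1_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        workers.shutdown();
    }

    private void fetch(String url) throws Exception {
        // Placeholder for the HTTP GET (handle timeouts and 404s here).
    }
}
```

If you'd rather not write the "select what's due and sleep" bookkeeping yourself, a ScheduledThreadPoolExecutor with a small pool can schedule each URL's recurring fetch directly on a handful of threads.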

Of course, the URL distribution can be done through the Observer pattern, modified so that the message is "consumed" once a client accepts it (i.e. takes the URL to retrieve). The list of worker threads can be kept as a linked list and traversed recursively.
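A rough sketch of that consuming-Observer variant (the interface and class names here are made up, and the list is walked iteratively rather than recursively):

```java
import java.util.LinkedList;
import java.util.List;

// Illustrative "consuming" Observer: the scheduler offers a URL to each
// registered worker in turn, and the first idle worker accepts it, which
// consumes the message and stops further notification.
interface UrlObserver {
    boolean offer(String url);      // return true to accept (consume) the URL
}

final class UrlDispatcher {
    private final List<UrlObserver> workers = new LinkedList<>();

    void register(UrlObserver worker) {
        workers.add(worker);
    }

    /** Hands the URL to the first worker that accepts it; false if all are busy. */
    boolean dispatch(String url) {
        for (UrlObserver worker : workers) {
            if (worker.offer(url)) {
                return true;        // consumed; do not notify the rest
            }
        }
        return false;               // no free worker right now, try again later
    }
}
```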
