Spring – Integrating Spring Batch with Web Scraping

Tags: application-design, spring

I need to develop a batch process that will be executed every day. These are the steps:

  1. Read each row of a database table (SQLite) that contains a URL.
  2. Extract some data, say Users, from that website by scraping it. Each website may contain 1..n users.
  3. Persist each User in a local NoSQL database.
  4. Send each User (one by one) to a third-party REST API.

I'm going to implement this process using Spring Batch and I have thought this design:

  • Item Reader: Read each URL from the SQLite database using JdbcCursorItemReader.
  • Item Processor: Scrape and deserialize the users from each website (Url -> List<User>).
  • Item Writer: For each User, persist it in the database and send it through the REST API.

Is this approach right? Should I change any step? I have never worked with Spring Batch, so I'm willing to change the technology if needed. I need some advice before starting development, since I need this process to be very robust.

Best Answer

This is generally a good application for Spring Batch, and you seem to understand the logical separation of Reader, Processor and Writer fairly well.

There are certain things you should consider and think about when it comes to an application like this. Spring Batch gives you the concept of chunking: rather than reading, processing, and writing each record one at a time, you read in several items as a chunk, process them, and write them out as a single transaction. What isn't clear from your question is what your domain model will look like to make this possible. It sounds as if there is a one-to-many relationship from URL to Users, so you would likely read in a single URL and build a collection of User objects that are ready to be processed and written out as a single transaction.
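To make the chunk idea concrete, here is a minimal sketch of a job configuration, assuming Spring Batch 4 with Java config. The ScrapedPage wrapper, the bean names, and the chunk size are my assumptions, not anything prescribed; User is your own domain class.

```java
import java.util.List;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Hypothetical wrapper: one URL plus every User scraped from it.
class ScrapedPage {
    private final String url;
    private final List<User> users;

    ScrapedPage(String url, List<User> users) {
        this.url = url;
        this.users = users;
    }

    String getUrl() { return url; }
    List<User> getUsers() { return users; }
}

@Configuration
@EnableBatchProcessing
public class ScrapeJobConfig {

    @Bean
    public Job scrapeUsersJob(JobBuilderFactory jobs, Step scrapeStep) {
        return jobs.get("scrapeUsersJob").start(scrapeStep).build();
    }

    @Bean
    public Step scrapeStep(StepBuilderFactory steps,
                           ItemReader<ScrapedPage> reader,
                           ItemProcessor<ScrapedPage, ScrapedPage> processor,
                           ItemWriter<ScrapedPage> writer) {
        // One item = one URL's worth of users, so chunk(1) commits
        // "all users from a single URL" as one transaction.
        return steps.get("scrapeStep")
                .<ScrapedPage, ScrapedPage>chunk(1)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }
}
```

With chunk(1), one URL's users form one transaction, which matches the one-to-many shape described above.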

The second thing I would consider, and generally a good practice when designing software, is to document what your system constraints are.

  • Are there alternative means to retrieve the required data about a User, apart from screen scraping? If not, document that constraint.
  • Which software system or component requires the User data provided by your software (the REST API)? Does this third party have the ability to take a batch file as input instead of the REST API? Are there other potential interfaces that might be more reliable?

It is also good to document risks:

  • Screen scraping tightly couples your batch job to the design of the websites being scraped

In light of this information, I would design it as follows:

Reader

  • Retrieve the URL from the database
  • Screen scrape the user data from that URL
  • Build a List<User> to hand to the Processor step (see the sketch after this list)
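A rough sketch of such a Reader, assuming jsoup for the scraping and reusing the hypothetical ScrapedPage wrapper from the earlier sketch; the SQL, CSS selectors, and User constructor are placeholders for your real ones.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.database.JdbcCursorItemReader;

public class ScrapingItemReader implements ItemReader<ScrapedPage> {

    // Delegate that streams URLs out of the SQLite table,
    // e.g. "SELECT url FROM sites" mapped to one String per row.
    private final JdbcCursorItemReader<String> urlReader;

    public ScrapingItemReader(JdbcCursorItemReader<String> urlReader) {
        this.urlReader = urlReader;
    }

    @Override
    public ScrapedPage read() throws Exception {
        String url = urlReader.read();
        if (url == null) {
            return null; // no more URLs: tells Spring Batch the input is exhausted
        }
        Document page = Jsoup.connect(url).timeout(10_000).get();
        List<User> users = new ArrayList<>();
        for (Element row : page.select(".user")) { // CSS selectors are placeholders
            users.add(new User(row.select(".name").text(),
                               row.select(".email").text()));
        }
        return new ScrapedPage(url, users);
    }
}
```

Note that the delegate urlReader still needs its ItemStream lifecycle (open/update/close); registering it with .stream(urlReader) on the step builder handles that and also gets you restart state for free.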

Processor

  • Integration of data from multiple Readers, if applicable?
  • Special processing rules or calculation of derived data?
  • Preparation of the User objects for the Writers (a minimal example follows)
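If nothing else, the Processor is a natural place for validation before anything is written. A sketch under the same assumptions as above, with a hypothetical getEmail() accessor standing in for your real rules:

```java
import java.util.List;
import java.util.stream.Collectors;

import org.springframework.batch.item.ItemProcessor;

public class UserPageProcessor implements ItemProcessor<ScrapedPage, ScrapedPage> {

    @Override
    public ScrapedPage process(ScrapedPage page) {
        // Keep only users with the data the downstream API requires;
        // the email check stands in for your real validation rules.
        List<User> valid = page.getUsers().stream()
                .filter(u -> u.getEmail() != null && !u.getEmail().isEmpty())
                .collect(Collectors.toList());
        // Returning null filters the item out of the chunk entirely.
        return valid.isEmpty() ? null : new ScrapedPage(page.getUrl(), valid);
    }
}
```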

Writer

  • One writer for persisting Users to your database
  • A second writer for POSTing Users to the REST API (see the composite writer sketch below)
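Spring Batch supports exactly this split via CompositeItemWriter, which calls its delegates in order for each chunk. The sketch below assumes Spring Batch 4 (where an ItemWriter receives a List and can be written as a lambda); the endpoint URL, RestTemplate client, and the mongoWriter delegate are placeholders for your actual store and API.

```java
import java.util.Arrays;

import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.support.CompositeItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

@Configuration
public class WriterConfig {

    @Bean
    public CompositeItemWriter<ScrapedPage> pageWriter(ItemWriter<ScrapedPage> mongoWriter,
                                                       ItemWriter<ScrapedPage> restWriter) {
        // Delegates run in order for each chunk: persist first, then POST.
        // Whether a failed POST can undo the NoSQL write depends on your
        // store's transaction support (see the rollback question below).
        CompositeItemWriter<ScrapedPage> writer = new CompositeItemWriter<>();
        writer.setDelegates(Arrays.asList(mongoWriter, restWriter));
        return writer;
    }

    @Bean
    public ItemWriter<ScrapedPage> restWriter(RestTemplate restTemplate) {
        return pages -> {
            for (ScrapedPage page : pages) {
                for (User user : page.getUsers()) {
                    // Endpoint and payload shape are assumptions about the third party.
                    restTemplate.postForEntity("https://thirdparty.example/api/users",
                            user, Void.class);
                }
            }
        };
    }
}
```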

Each chunk will be composed of the users from a single URL. Each chunk should be transacted so that, in the event of an exception or failure, any persisted changes can be rolled back. In the event of an exception, is it possible to define custom rollback behavior for the REST API?
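Spring Batch's fault-tolerant step builder is one place to encode those decisions. Which exceptions deserve a retry versus a skip is your call; the ones below are only plausible examples for network scraping plus a REST POST.

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.web.client.HttpStatusCodeException;
import org.springframework.web.client.ResourceAccessException;

public class FaultTolerantStepSketch {

    // Revisits the step definition from the first sketch with fault tolerance.
    public Step scrapeStep(StepBuilderFactory steps,
                           ItemReader<ScrapedPage> reader,
                           ItemProcessor<ScrapedPage, ScrapedPage> processor,
                           ItemWriter<ScrapedPage> writer) {
        return steps.get("scrapeStep")
                .<ScrapedPage, ScrapedPage>chunk(1)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .faultTolerant()
                .retry(ResourceAccessException.class)  // transient network failure: try again
                .retryLimit(3)
                .skip(HttpStatusCodeException.class)   // API rejected the item: skip, move on
                .skipLimit(10)
                .build();
    }
}
```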

Your final considerations should be the supportability and maintainability of the batch job. You might want to consider Spring Batch Admin for this. Any time your business process depends on URL resources on an internal or external network, on screen scraping, and on the availability and proper functioning of a REST API, I would consider it sufficiently high risk. There are many potential points of failure in this job, so not only are transactions and good exception handling a must, you will also want the ability to administer the job easily and with minimal manual intervention.

Spring Batch Admin maintains a database of historical jobs as well as currently running, paused, and failed jobs. You can configure a Spring Batch job managed with Spring Batch Admin to pick up where a failed job left off. Perhaps your job got through 350 of 400 URLs to scan: there is no need to clean up and start over if you can restart the failed job instance; it will pick up at record 351 and try again. You may even be able to have it wait a few minutes and retry several times before sending out notifications.
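The restart behavior comes from the Spring Batch job repository rather than the Admin UI itself: launching a job with the same identifying JobParameters resumes the failed JobInstance. A sketch, with a hypothetical runDate parameter as the identifier:

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class RestartSketch {

    // Launching with the SAME identifying JobParameters resumes a FAILED
    // JobInstance rather than starting a new one; stateful readers resume
    // from the last committed chunk recorded in the job repository.
    public JobExecution runOrRestart(JobLauncher jobLauncher, Job scrapeUsersJob)
            throws Exception {
        JobParameters params = new JobParametersBuilder()
                .addString("runDate", "2016-04-12") // identifying parameter (assumption)
                .toJobParameters();

        // If a previous execution with these parameters FAILED at URL 351,
        // this run restarts that instance at the saved position.
        return jobLauncher.run(scrapeUsersJob, params);
    }
}
```

This relies on the reader saving state (JdbcCursorItemReader does by default) and on the job being restartable, which is also the default.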

I hope this gives you things to consider.