Spring – Integrating Spring Batch with Web Scraping

Tags: application-design, spring

I need to develop a batch process that will be executed every day. These are the steps:

  1. Read each row of a database table (SQLite) that contains a URL.
  2. Extract some data, say Users, from that website by scraping it. Each website may contain 1..n users.
  3. Persist each User in a local NoSQL database.
  4. Send each User (one by one) to a third-party REST API.

I'm going to implement this process using Spring Batch and I have thought this design:

  • Item Reader: Read each URL from the SQLite database using JdbcCursorItemReader.
  • Item Processor: Scrape and deserialize the users from each website (Url -> List<User>).
  • Item Writer: For each User, persist it in the database and send it through the REST API.

Is this approach right? Should I change any step? I have never worked with Spring Batch, so I'm willing to change the technology if needed. I need some advice before starting development, since I need this process to be very robust.

Best Answer

This is generally a good application for Spring Batch, and you seem to understand the logical separation of Reader, Processor and Writer fairly well.

There are certain things you should consider and think about when it comes to an application like this. Spring Batch gives you the concept of chunking: rather than reading, processing, and writing each record one at a time, you read in several items as a chunk, process them, and write them out as a single transaction. What isn't clear from your question is what your domain model will look like to make this possible. It sounds as if there is a one-to-many relationship from URL to Users, so you would likely read in a single URL and build a collection of User objects that are ready to be processed and written out as a single transaction.
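To make the chunk idea concrete, here is a minimal sketch of a job configuration, assuming Spring Batch 4 with Java config. The ScrapedPage wrapper, the bean names, and the chunk size are my assumptions, not anything prescribed; User is your own domain class.

```java
import java.util.List;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Hypothetical wrapper: one URL plus every User scraped from it.
class ScrapedPage {
    private final String url;
    private final List<User> users;

    ScrapedPage(String url, List<User> users) {
        this.url = url;
        this.users = users;
    }

    String getUrl() { return url; }
    List<User> getUsers() { return users; }
}

@Configuration
@EnableBatchProcessing
public class ScrapeJobConfig {

    @Bean
    public Job scrapeUsersJob(JobBuilderFactory jobs, Step scrapeStep) {
        return jobs.get("scrapeUsersJob").start(scrapeStep).build();
    }

    @Bean
    public Step scrapeStep(StepBuilderFactory steps,
                           ItemReader<ScrapedPage> reader,
                           ItemProcessor<ScrapedPage, ScrapedPage> processor,
                           ItemWriter<ScrapedPage> writer) {
        // One item = one URL's worth of users, so chunk(1) commits
        // "all users from a single URL" as one transaction.
        return steps.get("scrapeStep")
                .<ScrapedPage, ScrapedPage>chunk(1)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }
}
```

With chunk(1), one URL's users form one transaction, which matches the one-to-many shape described above.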

The second thing I would consider, and generally a good practice when designing software, is to document what your system constraints are.

  • Are there alternative means to retrieve the required data about a User, apart from screen scraping? If not, document that constraint.
  • Which software system or component requires the User data provided by your software (the REST API)? Does this third party have the ability to take a batch file as input instead of the REST API? Are there other potential interfaces that might be more reliable?

It is also good to document risks:

  • Screen scraping tightly couples your batch job to the design of the websites being scraped

In light of this information, I would design it as follows:

Reader

  • Retrieve the URL from the database
  • Screen scrape the user data from that URL
  • Build a List<User> to hand to the Processor step (see the sketch after this list)
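A rough sketch of such a Reader, assuming jsoup for the scraping and reusing the hypothetical ScrapedPage wrapper from the earlier sketch; the SQL, CSS selectors, and User constructor are placeholders for your real ones.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.database.JdbcCursorItemReader;

public class ScrapingItemReader implements ItemReader<ScrapedPage> {

    // Delegate that streams URLs out of the SQLite table,
    // e.g. "SELECT url FROM sites" mapped to one String per row.
    private final JdbcCursorItemReader<String> urlReader;

    public ScrapingItemReader(JdbcCursorItemReader<String> urlReader) {
        this.urlReader = urlReader;
    }

    @Override
    public ScrapedPage read() throws Exception {
        String url = urlReader.read();
        if (url == null) {
            return null; // no more URLs: tells Spring Batch the input is exhausted
        }
        Document page = Jsoup.connect(url).timeout(10_000).get();
        List<User> users = new ArrayList<>();
        for (Element row : page.select(".user")) { // CSS selectors are placeholders
            users.add(new User(row.select(".name").text(),
                               row.select(".email").text()));
        }
        return new ScrapedPage(url, users);
    }
}
```

Note that the delegate urlReader still needs its ItemStream lifecycle (open/update/close); registering it with .stream(urlReader) on the step builder handles that and also gets you restart state for free.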

Processor

  • Integration of data from multiple Readers, if applicable?
  • Special processing rules or calculation of derived data?
  • Preparation of the User objects for the Writers (a minimal example follows)
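If nothing else, the Processor is a natural place for validation before anything is written. A sketch under the same assumptions as above, with a hypothetical getEmail() accessor standing in for your real rules:

```java
import java.util.List;
import java.util.stream.Collectors;

import org.springframework.batch.item.ItemProcessor;

public class UserPageProcessor implements ItemProcessor<ScrapedPage, ScrapedPage> {

    @Override
    public ScrapedPage process(ScrapedPage page) {
        // Keep only users with the data the downstream API requires;
        // the email check stands in for your real validation rules.
        List<User> valid = page.getUsers().stream()
                .filter(u -> u.getEmail() != null && !u.getEmail().isEmpty())
                .collect(Collectors.toList());
        // Returning null filters the item out of the chunk entirely.
        return valid.isEmpty() ? null : new ScrapedPage(page.getUrl(), valid);
    }
}
```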

Writer

  • One writer for persisting Users to your database
  • A second writer for POSTing Users to the REST API (see the composite writer sketch below)
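Spring Batch supports exactly this split via CompositeItemWriter, which calls its delegates in order for each chunk. The sketch below assumes Spring Batch 4 (where an ItemWriter receives a List and can be written as a lambda); the endpoint URL, RestTemplate client, and the mongoWriter delegate are placeholders for your actual store and API.

```java
import java.util.Arrays;

import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.support.CompositeItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

@Configuration
public class WriterConfig {

    @Bean
    public CompositeItemWriter<ScrapedPage> pageWriter(ItemWriter<ScrapedPage> mongoWriter,
                                                       ItemWriter<ScrapedPage> restWriter) {
        // Delegates run in order for each chunk: persist first, then POST.
        // Whether a failed POST can undo the NoSQL write depends on your
        // store's transaction support (see the rollback question below).
        CompositeItemWriter<ScrapedPage> writer = new CompositeItemWriter<>();
        writer.setDelegates(Arrays.asList(mongoWriter, restWriter));
        return writer;
    }

    @Bean
    public ItemWriter<ScrapedPage> restWriter(RestTemplate restTemplate) {
        return pages -> {
            for (ScrapedPage page : pages) {
                for (User user : page.getUsers()) {
                    // Endpoint and payload shape are assumptions about the third party.
                    restTemplate.postForEntity("https://thirdparty.example/api/users",
                            user, Void.class);
                }
            }
        };
    }
}
```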

Each chunk will be composed of the users from a single URL. Each chunk should be transacted so that, in the event of an exception or failure, any persisted changes can be rolled back. In the event of an exception, is it possible to define custom rollback behavior for the REST API?
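Spring Batch's fault-tolerant step builder is one place to encode those decisions. Which exceptions deserve a retry versus a skip is your call; the ones below are only plausible examples for network scraping plus a REST POST.

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.web.client.HttpStatusCodeException;
import org.springframework.web.client.ResourceAccessException;

public class FaultTolerantStepSketch {

    // Revisits the step definition from the first sketch with fault tolerance.
    public Step scrapeStep(StepBuilderFactory steps,
                           ItemReader<ScrapedPage> reader,
                           ItemProcessor<ScrapedPage, ScrapedPage> processor,
                           ItemWriter<ScrapedPage> writer) {
        return steps.get("scrapeStep")
                .<ScrapedPage, ScrapedPage>chunk(1)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .faultTolerant()
                .retry(ResourceAccessException.class)  // transient network failure: try again
                .retryLimit(3)
                .skip(HttpStatusCodeException.class)   // API rejected the item: skip, move on
                .skipLimit(10)
                .build();
    }
}
```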

Your final considerations should be the supportability and maintainability of the batch job. You might want to consider Spring Batch Admin for this. Any time your business process depends on URL resources on an internal or external network, on screen scraping, and on the availability and proper functioning of a REST API, I would consider it sufficiently high risk. There are many potential points of failure in this job, so not only are transactions and good exception handling a must, you will also want the ability to administer the job easily and with minimal manual intervention.

Spring Batch Admin maintains a database of historical jobs as well as currently running, paused, and failed jobs. You can configure a Spring Batch job managed with Spring Batch Admin to pick up where a failed job left off. Perhaps your job got through 350 of 400 URLs to scan: there is no need to clean up and start over if you can restart the failed job instance; it will pick up at record 351 and try again. You may even be able to have it wait a few minutes and retry several times before sending out notifications.
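The restart behavior comes from the Spring Batch job repository rather than the Admin UI itself: launching a job with the same identifying JobParameters resumes the failed JobInstance. A sketch, with a hypothetical runDate parameter as the identifier:

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class RestartSketch {

    // Launching with the SAME identifying JobParameters resumes a FAILED
    // JobInstance rather than starting a new one; stateful readers resume
    // from the last committed chunk recorded in the job repository.
    public JobExecution runOrRestart(JobLauncher jobLauncher, Job scrapeUsersJob)
            throws Exception {
        JobParameters params = new JobParametersBuilder()
                .addString("runDate", "2016-04-12") // identifying parameter (assumption)
                .toJobParameters();

        // If a previous execution with these parameters FAILED at URL 351,
        // this run restarts that instance at the saved position.
        return jobLauncher.run(scrapeUsersJob, params);
    }
}
```

This relies on the reader saving state (JdbcCursorItemReader does by default) and on the job being restartable, which is also the default.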

I hope this gives you things to consider.