Bulk Operations – Should a Single Failure Cause a Complete Failure?

Tags: application-design, failure, transaction, user-experience

In the API I'm working on there's a bulk delete operation which accepts an array of IDs:

["1000", ..., "2000"]

I was free to implement the delete operation as I saw fit, so I decided to make the whole thing transactional: that is, if a single ID is invalid, the entire request fails. I'll call this the strict mode.

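import java.sql.SQLException;
import java.sql.Savepoint;

// Inside a method declared "throws SQLException"; conn is a java.sql.Connection.
conn.setAutoCommit(false);                  // savepoints require auto-commit to be off
Savepoint savepoint = conn.setSavepoint();
try {
    for (String id : ids) {
        if (!deleteItem(id)) {              // false when the ID is invalid
            conn.rollback(savepoint);       // undo every delete made so far
            sendHttp400AndBeDoneWithIt();
            return;
        }
    }
    conn.commit();                          // every ID was deleted: make it permanent
} catch (SQLException e) {
    conn.rollback(savepoint);               // a database error also undoes the batch
    throw e;
}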

The alternative (implemented elsewhere in our software suite) is to do what we can in the backend and report failures in an array. That part of the software handles fewer requests, so the response doesn't end up being a gigantic array… in theory.
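To make the difference between the two modes concrete, here is a minimal, self-contained Java sketch. It assumes an in-memory Map stands in for the database and that an ID counts as "invalid" simply when it isn't present; the class and method names (BulkDelete, deleteStrict, deletePermissive) are hypothetical, not part of the original code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the table the bulk delete targets.
final class BulkDelete {
    private final Map<String, String> store = new HashMap<>();

    BulkDelete(Map<String, String> initial) {
        store.putAll(initial);
    }

    boolean contains(String id) {
        return store.containsKey(id);
    }

    // Strict mode: all-or-nothing. Returns the first invalid ID,
    // or null if every ID was deleted.
    String deleteStrict(List<String> ids) {
        Map<String, String> snapshot = new HashMap<>(store); // plays the role of the savepoint
        for (String id : ids) {
            if (store.remove(id) == null) {
                store.clear();
                store.putAll(snapshot); // roll back every delete made so far
                return id;
            }
        }
        return null;
    }

    // Permissive mode: delete what we can and collect the IDs that failed.
    List<String> deletePermissive(List<String> ids) {
        List<String> failed = new ArrayList<>();
        for (String id : ids) {
            if (store.remove(id) == null) {
                failed.add(id);
            }
        }
        return failed;
    }
}
```

With a store holding IDs "1", "2" and "3", a request for ["1", "9", "2"] leaves all three rows in place in strict mode (reporting "9"), while permissive mode deletes "1" and "2" and reports ["9"]. Note that in this sketch a duplicated ID fails the second time it is seen, since it was already removed.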

A recent bug on a resource-poor server made me look at the code again, and now I'm questioning my original decision, but this time I'm motivated more by business needs than by best practices. If, for example, I fail the entire request, the user will have to try again, whereas if some of the items get deleted, the user can finish the action and then ask an administrator to do the rest (while I work on fixing the bug!). This would be the permissive mode.

I tried looking online for some guidance on the matter, but I've come up empty-handed. So I come to you: what behavior is generally expected of bulk operations like this? Should I stick with strict mode, or should I be more permissive?

Best Answer

It's okay to implement either a 'strict' or a 'nice' version of a delete endpoint, but you need to tell the user clearly what happened.

We're doing a delete action with this endpoint, likely DELETE /resource/bulk/ or something similar; I'm not picky. What matters here is that whether you decide to be strict or nice, you report back exactly what happened.

For example, an API I worked with had a DELETE /v1/student/ endpoint that accepted bulk IDs. During testing we'd regularly send off the request, get a 200 response, and assume everything was fine, only to find out later that the students on the list were either still IN the database (set to inactive) or not actually deleted due to an error. That messed up later calls to GET /v1/student, because we got back data we weren't expecting.

The fix came in a later update that added a body to the response listing the IDs that weren't deleted. This is, to my knowledge, something of a best practice.

Bottom line: no matter what you do, make sure the end user has a way to know what's going on, and possibly why. For example, if we picked the strict format, the response could be 400 - DELETE failed on ID 1221: not found. If we picked the 'nice' version, it could be 207 (Multi-Status) - {"message": "failed, some ids not deleted", "failedids": [1221, 23432, 1224]}.
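As a rough illustration of the two response shapes described above, here is a small Java sketch that turns the outcome of a bulk delete into a status code and a JSON body. The types (BulkDeleteResponse, BulkDeleteResponses) are hypothetical; a real web framework would supply its own response type. The JSON is built by hand and assumes IDs contain no characters that need escaping; in practice a proper JSON library is the safer choice.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical response holder: an HTTP status plus a JSON body.
final class BulkDeleteResponse {
    final int status;
    final String body;

    BulkDeleteResponse(int status, String body) {
        this.status = status;
        this.body = body;
    }
}

final class BulkDeleteResponses {
    private static final String ALL_OK = "{\"message\": \"all ids deleted\"}";

    // Strict mode: one bad ID fails the whole request, so name that ID.
    static BulkDeleteResponse strict(String failedId) {
        if (failedId == null) {
            return new BulkDeleteResponse(200, ALL_OK);
        }
        return new BulkDeleteResponse(400,
            "{\"message\": \"DELETE failed on ID " + failedId + ": not found\"}");
    }

    // Nice mode: report exactly which IDs were not deleted.
    static BulkDeleteResponse nice(List<String> failedIds) {
        if (failedIds.isEmpty()) {
            return new BulkDeleteResponse(200, ALL_OK);
        }
        String ids = failedIds.stream()
            .map(id -> "\"" + id + "\"")
            .collect(Collectors.joining(", ", "[", "]"));
        return new BulkDeleteResponse(207,
            "{\"message\": \"failed, some ids not deleted\", \"failedids\": " + ids + "}");
    }
}
```

Because the request body sends IDs as strings, this sketch echoes them back as JSON strings too, so the client can match them against what it sent without any type conversion.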

Good luck!
