SQLite on Google Cloud Persistent Disk

google-compute-engine, sqlite

Does Google's Persistent Disk provide the proper reader/writer locking that SQLite requires for concurrent access (many virtual machines accessing data on a single Persistent Disk)?

According to the SQLite FAQ:

SQLite uses reader/writer locks to control access to the database. […] But use caution: this locking mechanism might not work correctly if the database file is kept on an NFS filesystem. This is because fcntl() file locking is broken on many NFS implementations. You should avoid putting SQLite database files on NFS if multiple processes might try to access the file at the same time.

I would really like to know whether Google's Persistent Disk locks properly for writes. I researched AWS's EFS, and it does not have a locking system adequate to support SQLite.
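For anyone wanting to test this themselves, a quick way to probe fcntl() locking on a given mount is something like the sketch below (the file path is a placeholder). Run it simultaneously from two VMs against the same file on the shared mount; the second run should fail to get the lock while the first one holds it.

```python
# lock_probe.py: quick fcntl() lock sanity check (placeholder path below).
# Run this at the same time from two VMs against the same file on the
# shared mount; the second run should be refused while the first holds it.
import fcntl
import sys
import time

LOCK_FILE = "/mnt/shared/sqlite-lock-probe"  # assumed mount point, adjust

with open(LOCK_FILE, "w") as f:
    try:
        # Non-blocking exclusive lock, the same primitive SQLite relies on.
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        print("lock refused: another holder exists (expected on the 2nd VM)")
        sys.exit(1)
    print("lock acquired: holding for 60 seconds")
    time.sleep(60)
    fcntl.lockf(f, fcntl.LOCK_UN)
```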

Best Answer

Update (30 Aug 2019)

Dqlite by Canonical provides a distributed, high-availability SQLite which may suit this use case. It's open-source under Apache 2.0, and written in C, so it could be a drop-in replacement for SQLite. See also discussion on HN for more context.


Earlier answer

It sounds, from your comments, like MySQL and Google Cloud SQL are not options, because your architecture requires the use of a single SQLite file.

Also, per SQLite docs, NFS is not an option due to the locking issues.

Here are some other options to consider.

Alternative distributed filesystem

Beyond NFS, there are a number of other distributed filesystems you may want to evaluate, such as Ceph, GlusterFS, and OrangeFS. In addition to your own research, consider reaching out to SQLite users or developers for guidance and past experience.

Use NFS, but enforce single writer-at-a-time

The NFS issue appears to be about locking, which is only needed for writes: as long as you can guarantee that only one process at a time has the database open for writing, several other processes can open it for reads, so this should be OK (please confirm that this is actually the case for your workload).

Thus, as long as there's an external method of ensuring a single writer, you may be able to use NFS. Consider a distributed lock service such as Apache ZooKeeper, HashiCorp Consul, or CoreOS etcd: have each would-be writer acquire the lock before touching the database, and you can store your SQLite file on NFS (a sketch of this pattern follows below).

This, of course, relies on each process with direct access to the SQLite database releasing the lock and closing the database as soon as it no longer needs to write, so correctness is hard to enforce: it depends on every piece of software being correct and cooperative.
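As a sketch of the distributed-lock approach, using the kazoo client for ZooKeeper (the ensemble address, lock path, database path and table are placeholders, not anything from your setup): each writer takes a cluster-wide lock before opening the database for writing and releases it as soon as the transaction finishes.

```python
# Sketch: serialize SQLite writes across machines with a ZooKeeper lock.
# Assumes kazoo (pip install kazoo) and a reachable ZooKeeper ensemble;
# the host, lock path, database path and table below are placeholders.
import sqlite3
from kazoo.client import KazooClient

ZK_HOSTS = "zk1.example.internal:2181"   # placeholder ensemble address
DB_PATH = "/mnt/nfs/app.db"              # SQLite file on the NFS mount

zk = KazooClient(hosts=ZK_HOSTS)
zk.start()

write_lock = zk.Lock("/locks/sqlite-writer", "worker-1")

def insert_event(payload):
    # Only one process cluster-wide holds this lock at a time, so SQLite's
    # own (possibly broken-on-NFS) write locking is never actually contended.
    with write_lock:
        conn = sqlite3.connect(DB_PATH)
        try:
            with conn:  # commits on success, rolls back on exception
                conn.execute(
                    "INSERT INTO events (payload) VALUES (?)", (payload,)
                )
        finally:
            conn.close()

insert_event("hello")
zk.stop()
```

Readers can keep opening the file directly; only the writes need to funnel through the lock.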

Lightweight RPC server

You mentioned that your architecture (which cannot be changed at this time) relies on SQLite. However, if the client processes can call an RPC service instead of opening the file directly, you can make that service the only process that ever opens the SQLite database, which avoids the locking problem of multiple concurrent users. The trade-off is that you would have to change all the client code to call the RPC service rather than open the SQLite database directly, which is a non-trivial amount of work.
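As a rough illustration (the table, methods, path and port are invented for the example), Python's standard library is enough to stand up such a server: one process owns the SQLite file and exposes a couple of calls over XML-RPC, and every client calls those instead of opening the database itself.

```python
# Sketch of a single-owner RPC front end for the SQLite file.
# This is the only process that ever opens the database.
import sqlite3
import threading
from xmlrpc.server import SimpleXMLRPCServer

DB_PATH = "/var/lib/app/app.db"   # placeholder path
db_lock = threading.Lock()        # serialize access within this process

def get_user(user_id):
    # Example read call; the table/columns are illustrative only.
    with db_lock:
        conn = sqlite3.connect(DB_PATH)
        try:
            row = conn.execute(
                "SELECT id, name FROM users WHERE id = ?", (user_id,)
            ).fetchone()
        finally:
            conn.close()
    return {"id": row[0], "name": row[1]} if row else None

def add_user(name):
    # Example write call; all writes funnel through this one server.
    with db_lock:
        conn = sqlite3.connect(DB_PATH)
        try:
            with conn:
                cur = conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
                return cur.lastrowid
        finally:
            conn.close()

server = SimpleXMLRPCServer(("0.0.0.0", 8000), allow_none=True)
server.register_function(get_user)
server.register_function(add_user)
server.serve_forever()
```

A client would then use xmlrpc.client.ServerProxy("http://server:8000/") in place of sqlite3.connect().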

Conclusion

None of these options is trivial, and all of them will require work. The reason is that:

In contrast to many other database management systems, SQLite is not a client–server database engine. Rather, it is embedded into the end program.

As such, it is not the right solution for multiple concurrent accessors, and hence a number of workarounds are required.

Longer term, if you find yourself having to make significant changes just to keep this system maintainable, you may want to consider migrating to MySQL or Google Cloud SQL instead of investing in workarounds to keep using SQLite.