C++ Design – Reviewing Serialization Design


I am writing a C++ application. Most applications read and write data [citation needed], and this one is no exception. I created a high-level design for the data model and serialization logic, and I am requesting a review of that design with these specific goals in mind:

To have an easy and flexible way to read and write data models in arbitrary formats: raw binary, XML, JSON, etc. The format of the data should be decoupled from the data itself as well as from the code that is requesting serialization.
To ensure that serialization is as error-free as reasonably possible. I/O is inherently risky for a variety of reasons: does my design introduce more ways for it to fail? If so, how could I refactor the design to mitigate those risks?
This project uses C++. Whether you love it or hate it, the language has its own way of doing things and the design aims to work with the language, not against it.
Finally, the project is built on top of wxWidgets. While I am looking for a solution applicable to a more general case, this specific implementation should work nicely with that toolkit.

What follows is a very simple set of classes written in C++ that illustrate the design. These are not the actual classes that I have partially written so far; this code simply illustrates the design I am using.

First, some sample DAOs:

#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

// One widget represents one record in the application.
class Widget {
public:
  using id_type = int;
private:
  id_type id;
};

// Container for widgets. Much more than a dumb container,
// it will also have indexes and other metadata. This represents
// one data file the user may open in the application.
class WidgetDatabase {
  ::std::map<Widget::id_type, ::std::shared_ptr<Widget>> widgets;
};

Next, I define pure virtual classes (interfaces) for reading and writing DAOs. The idea is to abstract the serialization of data from the data itself (SRP).

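class WidgetReader {
public:
  // Virtual destructor so implementations can be deleted through the interface.
  virtual ~WidgetReader() = default;
  virtual Widget read(::std::istream &in) const = 0;
};

class WidgetWriter {
public:
  virtual ~WidgetWriter() = default;
  virtual void write(::std::ostream &out, const Widget &widget) const = 0;
};

class WidgetDatabaseReader {
public:
  virtual ~WidgetDatabaseReader() = default;
  virtual WidgetDatabase read(::std::istream &in) const = 0;
};

class WidgetDatabaseWriter {
public:
  virtual ~WidgetDatabaseWriter() = default;
  virtual void write(::std::ostream &out, const WidgetDatabase &widgetDb) const = 0;
};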

Finally, here is the code that gets the proper reader/writer for the desired I/O type. There would be subclasses of the readers/writers also defined, but these add nothing to the design review:

enum class WidgetIoType {
  BINARY,
  JSON,
  XML
  // Other types TBD.
};

WidgetIoType forFilename(const ::std::string &name) { return ...; }

class WidgetIoFactory {
public:
  static ::std::unique_ptr<WidgetReader> getWidgetReader(const WidgetIoType &type) {
    return ::std::unique_ptr<WidgetReader>(/* TODO */);
  }

  static ::std::unique_ptr<WidgetWriter> getWidgetWriter(const WidgetIoType &type) {
    return ::std::unique_ptr<WidgetWriter>(/* TODO */);
  }

  static ::std::unique_ptr<WidgetDatabaseReader> getWidgetDatabaseReader(const WidgetIoType &type) {
    return ::std::unique_ptr<WidgetDatabaseReader>(/* TODO */);
  }

  static ::std::unique_ptr<WidgetDatabaseWriter> getWidgetDatabaseWriter(const WidgetIoType &type) {
    return ::std::unique_ptr<WidgetDatabaseWriter>(/* TODO */);
  }
};

Per the stated goals of my design, I have one specific concern. C++ streams can be opened in text or binary mode, but there is no way to query the mode of an already-opened stream. Through programmer error, it would be possible to provide e.g. a binary stream to an XML or JSON reader/writer. This could cause subtle (or not so subtle) errors. I would prefer the code to fail fast, but I am not sure this design would do that.

One way around this could be to offload the responsibility of opening the stream to the reader or writer, but I believe that violates SRP and would make the code more complex. When writing a DAO, the writer should not care about where the stream is going: it could be a file, standard out, an HTTP response, a socket, anything. Once that concern is encapsulated in the serialization logic it becomes far more complex: it must know the specific type of stream and which constructor to call.

Aside from that option, I am not sure what would be a better way to model these objects that is simple, flexible, and helps to prevent logic errors in the code that uses it.

The use case with which the solution must be integrated is a simple file selection dialog box. The user selects "Open…" or "Save As…" from the File menu, and the program opens or saves the WidgetDatabase. There will also be "Import…" and "Export…" options for individual Widgets.

When the user selects a file to open or save, wxWidgets will return a file name. The handler that responds to that event must be general purpose code that takes the file name, acquires a serializer, and calls a function to do the heavy lifting. Ideally this design would also work if another piece of code is performing non-file I/O, such as sending a WidgetDatabase to a mobile device over a socket.
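As a minimal sketch of the first step such a handler might take, the open mode can be derived from the format, which in turn is derived from the file name, so the caller never picks text or binary by hand. The names `ioTypeForFilename`, `openModeFor`, and `openForWrite`, as well as the extension mapping, are assumptions made for illustration; they are not part of the actual design, and the enum is repeated only to keep the example self-contained.

```cpp
#include <fstream>
#include <ios>
#include <stdexcept>
#include <string>

// Stand-in for the WidgetIoType enum from the question.
enum class WidgetIoType { BINARY, JSON, XML };

// Assumed mapping from file extension to format (hypothetical extensions).
WidgetIoType ioTypeForFilename(const std::string &name) {
  const auto dot = name.rfind('.');
  const std::string ext =
      (dot == std::string::npos) ? "" : name.substr(dot + 1);
  if (ext == "json") return WidgetIoType::JSON;
  if (ext == "xml") return WidgetIoType::XML;
  if (ext == "bin") return WidgetIoType::BINARY;
  throw std::invalid_argument("unrecognized file extension: " + name);
}

// The open mode follows from the format: binary formats get ios_base::binary,
// text formats get no extra flag.
std::ios_base::openmode openModeFor(WidgetIoType type) {
  return type == WidgetIoType::BINARY ? std::ios_base::binary
                                      : std::ios_base::openmode{};
}

// Opens a file for writing in the mode that matches its format.
std::ofstream openForWrite(const std::string &name) {
  const WidgetIoType type = ioTypeForFilename(name);
  std::ofstream out(name, std::ios_base::out | openModeFor(type));
  if (!out) throw std::runtime_error("cannot open " + name);
  return out;
}
```

The reader or writer still receives a plain stream, so non-file destinations such as sockets remain possible; only the file-based path gets the mode chosen for it.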

Does a widget save to its own format? Does it interoperate with existing formats? Yes! All of the above. Going back to the file dialog, think about Microsoft Word. Microsoft were free to develop the DOCX format however they wanted within certain constraints. At the same time, Word also reads or writes legacy and third-party formats (e.g. PDF). This program is no different: the "binary" format I talk about is a yet-to-be-defined internal format designed for speed. At the same time, it must be able to read and write open standard formats in its domain (irrelevant to the question) so it can work with other software.

Finally, there is only one type of Widget. It will have child objects, but those will be handled by this serialization logic. The program will never load both Widgets and Sprockets. This design only needs to be concerned with Widgets and WidgetDatabases.

Best Answer

I may be wrong, but your design seems to be horribly overengineered. To serialize just one Widget, you want to define WidgetReader, WidgetWriter, WidgetDatabaseReader, WidgetDatabaseWriter interfaces which each have implementations for XML, JSON, and binary encodings, and a factory to tie all those classes together. This is problematic for the following reasons:

If I want to serialize a non-Widget class, let's call it Foo, I have to reimplement this whole zoo of classes, and create FooReader, FooWriter, FooDatabaseReader, FooDatabaseWriter interfaces, times three for each serialization format, plus a factory to make it even remotely usable. Don't tell me there won't be any copy and paste going on there! This combinatorial explosion seems to be fairly unmaintainable, even if each of those classes essentially only contains a single method.
Widget can't be reasonably encapsulated. Either you expose everything that should be serialized to the outside world through getter methods, or you have to friend each and every WidgetWriter (and probably every WidgetReader) implementation. In either case, you introduce considerable coupling between the serialization implementations and the Widget.
The reader/writer zoo invites inconsistencies. Whenever you add a member to Widget, you will have to update all related serialization classes to store/retrieve that member. This is something that can't be statically checked for correctness, so you will also have to write a separate test for each reader and writer. With your current design, that's 4*3=12 tests per class you want to serialize.

In the other direction, adding a new serialization format such as YAML is also problematic. For each class that you want to serialize, you will have to remember to add a YAML reader and writer, and add that case to the enum and to the factory. Again, this is something that can't be statically tested, unless you get (too) clever and draw up a templated interface for factories that's independent of Widget and makes sure an implementation for each serialization type for each in/out operation is provided.
Maybe the Widget now satisfies the SRP since it's not responsible for serialization. But the reader and writer implementations clearly don't under the “SRP = each object has one reason to change” interpretation: they must change when either the serialization format changes or the Widget changes.

If you are able to invest a minimum of time beforehand, please try to draw up a more generic serialization framework than this ad-hoc tangle of classes. For example, you could define a common interchange representation, let's call it SerializationInfo, with a JavaScript-like object model: most objects can be seen as a std::map<std::string, SerializationInfo>, or as a std::vector<SerializationInfo>, or as a primitive such as int.
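To make the idea of an interchange representation more concrete, here is a minimal sketch of such a node type. The class name `Node` and all of its members are hypothetical and are not the cxxtools API. It stores named children in a vector rather than a std::map so that the recursive type stays well-formed (std::vector of an incomplete type is permitted since C++17).

```cpp
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical interchange node: a primitive, a list, or an object.
class Node {
public:
  enum class Kind { Null, Int, String, List, Object };

  Kind kind() const { return kind_; }

  // Primitive setters; they keep the node's name intact.
  Node &set(int v) { kind_ = Kind::Int; int_ = v; return *this; }
  Node &set(std::string v) {
    kind_ = Kind::String;
    string_ = std::move(v);
    return *this;
  }

  int asInt() const { require(Kind::Int); return int_; }
  const std::string &asString() const { require(Kind::String); return string_; }

  // Adds a named child and marks this node as an object. The returned
  // reference is only valid until the next child is added.
  Node &addMember(const std::string &name) {
    kind_ = Kind::Object;
    children_.emplace_back();
    children_.back().name_ = name;
    return children_.back();
  }

  // Throws if no child has that name, so missing data fails fast.
  const Node &getMember(const std::string &name) const {
    for (const Node &child : children_)
      if (child.name_ == name) return child;
    throw std::out_of_range("missing member: " + name);
  }

  // Appends an unnamed child and marks this node as a list.
  Node &addItem() {
    kind_ = Kind::List;
    children_.emplace_back();
    return children_.back();
  }
  const std::vector<Node> &items() const { return children_; }

private:
  void require(Kind k) const {
    if (kind_ != k) throw std::logic_error("wrong node kind");
  }

  Kind kind_ = Kind::Null;
  int int_ = 0;
  std::string string_;
  std::string name_;            // member name when this node sits in an object
  std::vector<Node> children_;  // list items or object members
};
```

A caller would build a tree with calls such as `root.addMember("x").set(42)` and read it back with `root.getMember("x").asInt()`.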

For each serialization format, you would then have one class that manages reading and writing a serialization representation from that stream. And for each class you want to serialize, you would have some mechanism that converts instances from/to the serialization representation.
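The split described above could look roughly like the following sketch. To stay short it uses a deliberately tiny representation (a flat map of integer fields, here called `Record`), and the names `JsonFormat`, `XmlFormat`, `toRecord`, `fromRecord`, and `serialize` are all assumptions made for illustration. Field names are written without escaping, which a real format class would need to handle.

```cpp
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Tiny stand-in for the interchange representation: a flat object
// whose fields are all integers.
using Record = std::map<std::string, int>;

// Format side: knows Record, knows nothing about Point.
struct JsonFormat {
  static void write(std::ostream &out, const Record &r) {
    out << '{';
    bool first = true;
    for (const auto &field : r) {
      if (!first) out << ',';
      out << '"' << field.first << "\":" << field.second;
      first = false;
    }
    out << '}';
  }
};

struct XmlFormat {
  static void write(std::ostream &out, const Record &r) {
    out << "<object>";
    for (const auto &field : r)
      out << '<' << field.first << '>' << field.second
          << "</" << field.first << '>';
    out << "</object>";
  }
};

// Data side: knows Record, knows nothing about JSON or XML.
struct Point {
  int x;
  int y;
};

Record toRecord(const Point &p) { return Record{{"x", p.x}, {"y", p.y}}; }

// Record::at throws std::out_of_range if a field is missing.
void fromRecord(const Record &r, Point &p) {
  p = Point{r.at("x"), r.at("y")};
}

// Glue: any format combined with any type that provides toRecord.
template <typename Format, typename T>
std::string serialize(const T &value) {
  std::ostringstream out;
  Format::write(out, toRecord(value));
  return out.str();
}
```

With this shape, adding a format means writing one class, and adding a serializable type means writing one pair of conversion functions, instead of one class per combination of the two.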

I have experienced such a design with cxxtools (homepage, GitHub, serialization demo), and it is mostly extremely intuitive, broadly applicable, and satisfactory for my use cases – the only problems being the fairly weak object model of the serialization representation that requires you to know during deserialization precisely which kind of object you are expecting, and that deserialization implies default-constructible objects that can be initialized later. Here's a contrived usage example:

class Point {
  int _x;
  int _y;
public:
  Point(int x, int y) : _x(x), _y(y) {}
  int x() const { return _x; }
  int y() const { return _y; }
};

// Converts a Point into the serialization representation.
void operator <<= (cxxtools::SerializationInfo& si, const Point& p) {
  si.addMember("x") <<= p.x();
  si.addMember("y") <<= p.y();
}

// Rebuilds a Point from the serialization representation.
void operator >>= (const cxxtools::SerializationInfo& si, Point& p) {
  int x;
  si.getMember("x") >>= x;  // will throw if x entry not found
  int y;
  si.getMember("y") >>= y;
  p = Point(x, y);
}

int main() {
  // cxxtools::Json<T>(T&) wrapper sets up a SerializationInfo and manages Json I/O
  // wrappers for other formats also exist, e.g. cxxtools::Xml<T>(T&)

  Point a(42, -15);
  std::cout << cxxtools::Json(a);
  // ...
  Point b(0, 0);
  std::cin >> cxxtools::Json(b);
}

I am not saying that you should use cxxtools or exactly copy that design, but in my experience its design makes it trivial to add serialization even for small, one-off classes, provided that you don't care too closely about the serialization format (e.g. the default XML output will use member names as element names, and will never use attributes for your data).

The problem with binary/text mode for streams does not seem solvable, but that isn't so bad. For one thing, it only matters for binary formats, on platforms I don't tend to program for ;-) More seriously, it's a restriction of your serialization infrastructure you'll just have to document and hope everyone uses correctly. Opening the streams within your readers or writers is way too inflexible, and C++ does not have a built-in type-level mechanism to distinguish text from binary data.
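Although the language offers no such mechanism, a project can approximate one by convention. The following is only a sketch under that assumption: the tag types `TextMode` and `BinaryMode` and the wrapper `TaggedOstream` are hypothetical names, and the wrapper cannot verify how the underlying stream was really opened. It merely forces the caller to state the mode explicitly at one point, and lets writers declare which mode they expect.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical tag types; the standard library has no equivalent.
struct TextMode {};
struct BinaryMode {};

// Non-owning handle that records in its type how the caller claims
// the stream was opened.
template <typename Mode>
class TaggedOstream {
public:
  explicit TaggedOstream(std::ostream &out) : out_(&out) {}
  std::ostream &get() const { return *out_; }

private:
  std::ostream *out_;
};

// Writers state the mode they need in their signature.
void writeJson(TaggedOstream<TextMode> out, int value) {
  out.get() << "{\"value\":" << value << '}';
}

void writeRaw(TaggedOstream<BinaryMode> out, int value) {
  // Writes the int's object representation (native size and byte order).
  out.get().write(reinterpret_cast<const char *>(&value), sizeof value);
}

// Usage: the mode promise is made once, where the stream is wrapped.
std::string jsonToString(int value) {
  std::ostringstream buffer;
  writeJson(TaggedOstream<TextMode>(buffer), value);
  return buffer.str();
}
```

Passing a `TaggedOstream<BinaryMode>` to `writeJson` is a compile error, which turns one class of mix-ups into a build failure; a caller who wraps a stream with the wrong tag is still not caught.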
