C++ – Implementing copy-on-write

cdesign

I have a rather large class that contains a number of related member variables. These member variables can be grouped into related sections. We've noticed that the pattern of use of the class is to make a copy of an existing instance, change only a few of the member variables, and then do work with the resulting instance. There are 2 problems we've hit.

Instances of the class are large and often on the stack, so it would be nice to reduce its size
Copying all the member variables is somewhat slow.

So one idea we had was to group related member variables into their own objects. This is just a good idea for making the code easier to understand. But if we did this and had each of these objects be merely a shared pointer to an existing instance, it would reduce the size of the main class as several member variables would be replaced with a single shared pointer. It would also speed up copying, as the copy would just be creating a new shared pointer pointing to the existing instance.

The problem is that now whenever we have to write to anything in one of these shared objects, we need to make a copy of it on first write. (And we need to check on every write whether it needs to be copied or not.) We don't write very often, so I don't think it will be a performance hit. But implementing this seems problematic.

Is there a way to make a class be copy-on-write when one of its member variables needs to be changed? The class itself is written in C++, though I could also use C or Objective-C to implement this. I don't want to have to manually write the check of the shared pointer's reference count in every setter, for example. Is there any way to avoid that?

Best Answer

C++ does not have built-in copy on write semantics. You can implement such semantics with a smart pointer class, or can choose a design-level approach that makes copies unnecessary, e.g. by using the decorator pattern. However, you may find that these optimizations are not worth their effort.

Copy on Write Smart Pointer

The idea is that your members are stored in a smart pointer that manages their ownership. When the smart pointer is copied, it will not copy the managed object. However, once write-access is required, a copy is made.

This is not entirely trivial to get right, especially if there may be other pointers into the managed object. Some CoW implementations trigger only when a copy wants to write to the object, other implementations also perform a copy when the original owner writes to the (shared) object.

You will therefore have to write your own smart pointer that implements your desired semantics. Likely, this can be implemented with moderate effort on top of std::shared_ptr.

A significant drawback of CoW is that these lazy copies can make your code much harder to debug. Copying a C++ object may have observable side effects, but may now happen at unexpected times. Also, every write access must first check the state of the smart pointer and must possibly dereference multiple pointers.

Decorator Pattern

The decorator pattern can be used to overlay parts of an object with a different implementation. For example, you might want to overlay parts of a struct { A a; B b; C c; }. First, we need to define an interface so that we can combine our Decorators:

class Data {
public:
  virtual A& get_a() = 0;
  virtual B& get_b() = 0;
  virtual C& get_c() = 0;
};

Now we can implement this interface with a base storage class:

class BaseStorage : public Data {
  A a; B b; C c;
public:
  BaseStorage(A const& a, B const& b, C const& c) : a(a), b(b), c(c) {}
  A& get_a() override { return a; }
  B& get_b() override { return b; }
  C& get_c() override { return c; }
};

If we want to overlay the value of A, we can define a class that delegates all calls to a base object, except for requests of the A data:

class OverlayA : public Data {
  Data& base;
  A a;
public:
  OverlayA(Data& base, A&& a) : base(base), a(std::move(a)) {}
  A& get_a() override { return a; }
  B& get_b() override { return base.get_b(); }
  C& get_c() override { return base.get_c(); }
};

Instead of performing a copy of some Data, we can now overlay the part of it we are going to change:

Data& orig = ...;
// call a function with a copied A
will_change_a(OverlayA(orig, A(orig.get_a())));

Unfortunately, these decorators make it really really difficult to write const-correct code – you may want the Data& base to be a const reference in order to prevent writes to the base storage, but the interface may also describe overlaid data where writes are necessary. This could possibly be expressed through templates.

If behaviour of any Data interface implementation changes the private fields directly rather than going through the virtual methods, that may not write to the overlaid data. Therefore, the Data interface should not contain extra behaviour. You can implement such behaviour in a separate class that wraps a Data&.

This technique requires you to know in advance which parts of the object must be overlaid with copied data. As such, it is potentially error-prone.

The decorator pattern constructs a linked list of overlays. If the list grows many levels deep, this pointer chasing may hurt performance.

Copies are good

In many cases, creating a copy is preferable to cleverness like CoW. In particular, if the objects in question are not very large and are trivially copyable, then doing a copy each time may turn out to be cheaper than the alternatives. Because of caching, chasing pointers tends to be expensive compared to operations on contiguous objects. Both CoW and Decorators have continuous per-read overhead, whereas copies only have a predictable per-copy overhead.

If your profiling determines that the copies are a performance problem, then it may be sensible to try out one of these approaches or a combination of them (e.g. performing a copy but using CoW for expensive to copy members like vectors might be sensible). Where possible, avoid ownership of expensive to copy data in your object.

Related Solutions

C++ Interfaces – Understanding the Term ‘Interface’ in C++

In C++, the term "interface" has not just one widely accepted definition - so whenever you are going to use it, you should say what you mean exactly - a virtual base class with or without default implementations, a header file, the public members of an arbitrary class and so on.

Concerning your example: in Java (and similar in C#), your code probably would imply a separation of concerns:

interface IFoo {/*  ... */} // here is your interface

class FooBase implements IFoo 
{
     // make default implementations for interface methods
}

class Foo extends FooBase
{
}

In C++, you can do this but you are not forced to. And if you like to call a class an interface if it has no member variables, but contains default implementations for some methods, just do so, but make sure everyone you are talking to knows what you mean.

Python Refactoring – How to Refactor a ‘God Class’

Having written and then refactored a python "God object", I sympathize. What I did is break the original object down into sub sections based upon methods. For instance, the original looked like this pseudo code:

method A():
    self.bla += 1

method B():
    self.bla += 1

do stuff():
    self.bla = 1
    method A()
    method B()
    print self.bla

The stuff method is a self contained "unit" of work. I migrated it out to a new class which the original instantiates. This pulled out the necessary properties as well. Some were used only by the sub class and could move straight across. Others were shared, and got moved into a shared class.

The "God object" creates a new copy of the shared class at start up, and each of the new sub classes accepts a pointer as part of their init method. For example, here's a stripped version of the mailer:

#!/usr/bin/env python
# -*- coding: ascii -*-
'''Functions for emailing with dirMon.'''

from email.MIMEMultipart import MIMEMultipart
from email.MIMEBase import MIMEBase
from email.MIMEText import MIMEText
from email.Utils import COMMASPACE, formatdate
from email import Encoders
import os
import smtplib
import datetime
import logging

class mailer:
    def __init__(self,SERVER="mail.server.com",FROM="support@server.com"):
        self.server = SERVER
        self.send_from = FROM
        self.logger = logging.getLogger('dirMon.mailer')

    def send_mail(self, send_to, subject, text, files=[]):
        assert type(send_to)==list
        assert type(files)==list
        if self.logger.isEnabledFor(logging.DEBUG):
            self.logger.debug(' '.join(("Sending email to:",' '.join(send_to))))
            self.logger.debug(' '.join(("Subject:",subject)))
            self.logger.debug(' '.join(("Text:",text)))
            self.logger.debug(' '.join(("Files:",' '.join(files))))
        msg = MIMEMultipart()
        msg['From'] = self.send_from
        msg['To'] = COMMASPACE.join(send_to)
        msg['Date'] = formatdate(localtime=True)
        msg['Subject'] = subject
        msg.attach( MIMEText(text) )
        for f in files:
            part = MIMEBase('application', "octet-stream")
            part.set_payload( open(f,"rb").read() )
            Encoders.encode_base64(part)
            part.add_header('Content-Disposition', 'attachment; filename="%s"' % os.path.basename(f))
            msg.attach(part)
        smtp = smtplib.SMTP(self.server)
        mydict = smtp.sendmail(self.send_from, send_to, msg.as_string())
        if self.logger.isEnabledFor(logging.DEBUG):
            self.logger.debug("Email Successfully Sent!")
        smtp.close()
        return mydict

It is created once and shared between the different classes that need mailing capabilities.

So for you, create a class larry with the properties and methods you need. Everywhere the client says larry = blah replace it with larryObj.larry = blah. This migrates things to sub projects without breaking the current interface.

The only other thing to do is look for "units of work". If you were going to turn part of the "God Object" into it's own method, do so. But, put the method outside it. This forces you to create an interface between the components.

Laying that groundwork allows everything else to follow it. For example, a piece of the helper object demonstrating how it interfaces with the mailer:

#!/usr/bin/env python
'''This module holds a class to spawn various subprocesses'''
import logging, os, subprocess, time, dateAdditionLib, datetime, re

class spawner:
    def __init__(self, mailer):
        self.logger = logging.getLogger('dirMon.spawner')
        self.myMailer = mailer

Concentrate on the smallest individual unit of work possible, and move it out. This is easier to do, and lets you play with the setup quickly. Don't look at properties for moving stuff, they are ancillary to the tasks being done with them in most cases. Whatever is left over after you have dealt with the methods should probably stay in the original object, since it is part of the shared state.

But, the new objects should now accept the properties they need as init variables, not touching the caller objects property. They then return any necessary values, which the caller can use to update the shared properties as necessary. This helps to decouple the objects and makes for a more robust system.