Continuous Delivery in Practice – How It Works

Tags: continuous-integration, continuous-delivery

Continuous Delivery sounds good, but my years of software development experience suggest that in practice it can't work.

(Edit: To make it clear, I always have lots of tests running automatically. My question is about how to get the confidence to deliver on each check-in, which I understand is the full form of CD. The alternative is not year-long cycles. It is iterations every week (which some might consider still CD if done correctly), two weeks, or a month, each including an old-fashioned QA pass at the end, supplementing the automated tests.)

  • Full test coverage is impossible. You have to put in lots of time — and time is money — for every little thing. This is valuable, but the time could be spent contributing to quality in other ways.
  • Some things are hard to test automatically. E.g. GUI. Even Selenium won't tell you if your GUI is wonky. Database access is hard to test without bulky fixtures, and even that won't cover weird corner cases in your data storage. Likewise security and many other things. Only business-layer code is effectively unit-testable.
  • Even in the business layer, most code out there is not simple functions whose arguments and return values can be easily isolated for test purposes. You can spend lots of time building mock objects, which might not correspond to the real implementations.
  • Integration/functional tests supplement unit tests, but these take a lot of time to run because they usually involve reinitializing the entire system on each test. (If you don't reinitialize, the test environment is inconsistent.)
  • Refactoring or any other changes break lots of tests. You spend lots of time fixing them. If it's a matter of validating meaningful spec changes, that's fine, but often tests break because of meaningless low-level implementation details, not stuff that really provides important information. Often the tweaking is focused on reworking the internals of the test, not on truly checking the functionality that is being tested.
  • Field reports on bugs cannot easily be matched with the precise micro-version of the code.

Best Answer

my years of software development experience suggest that in practice it can't work.

Have you tried it? Dave and I wrote the book based on many collective years of experience, both our own and that of other senior people at ThoughtWorks, actually doing the things we discuss. Nothing in the book is speculative. Everything we discuss has been tried and tested, even on large, distributed projects. But we don't suggest you take it on faith. Of course you should try it yourself, and please write up what you find works and what doesn't, including the relevant context, so that others can learn from your experiences.

Continuous Delivery has a big focus on automated testing. We spend about a third of the book talking about it. We do this because the alternative - manual testing - is expensive and error-prone, and actually not a great way to build high-quality software (as Deming said, "Cease dependence on mass inspection to achieve quality. Improve the process and build quality into the product in the first place").

Full test coverage is impossible. You have to put in lots of time -- and time is money -- for every little thing. This is valuable, but the time could be spent contributing to quality in other ways.

Of course full test coverage is impossible, but what's the alternative: zero test coverage? There is a trade-off. Somewhere in between is the correct answer for your project. We find that in general you should expect to spend about 50% of your time creating or maintaining automated tests. That might sound expensive until you consider the cost of comprehensive manual testing, and of fixing the bugs that get out to users.

Some things are hard to test automatically. E.g. GUI. Even Selenium won't tell you if your GUI is wonky.

Of course. Check out Brian Marick's test quadrants. You still need to perform exploratory testing and usability testing manually. But that's what you should be using your expensive and valuable human beings for - not regression testing. The key is to put a deployment pipeline in place so that you only bother running expensive manual validations against builds that have passed a comprehensive suite of automated tests. That way you reduce both the amount of money you spend on manual testing and the number of bugs that ever make it to manual test or production (by which time they are very expensive to fix). Automated testing done right is much cheaper over the lifecycle of the product, but of course it's a capital expenditure that amortizes itself over time.
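
For illustration, here is a minimal sketch of that stage-gating idea in Python, assuming a pytest-based suite split into separate unit and acceptance directories (the directory names and commands are placeholders, not something from the book):

```python
import subprocess
import sys

# Hypothetical two-stage pipeline driver: the fast commit stage gates the
# slower acceptance stage, and only builds that pass both are offered up
# for manual exploratory/usability testing. Stage commands are placeholders.
STAGES = [
    ("commit stage (unit tests, minutes)", ["pytest", "tests/unit", "-q"]),
    ("acceptance stage (end-to-end tests)", ["pytest", "tests/acceptance", "-q"]),
]

def run_pipeline() -> int:
    for name, cmd in STAGES:
        print(f"--- {name} ---")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # Fail fast: later, more expensive stages never see this build.
            print(f"Build rejected at: {name}")
            return result.returncode
    print("Build passed all automated stages; candidate for manual testing.")
    return 0

if __name__ == "__main__":
    sys.exit(run_pipeline())
```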

Database access is hard to test without bulky fixtures, and even that won't cover weird corner cases in your data storage. Likewise security and many other things. Only business-layer code is effectively unit-testable.

Database access is tested implicitly by your end-to-end scenario based functional acceptance tests. Security will require a combination of automated and manual testing - automated penetration testing and static analysis to find (e.g.) buffer overruns.
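
As a rough illustration of what "tested implicitly through end-to-end acceptance tests" can look like, here is a hypothetical Python acceptance test that drives persistence through the application's public API; the `myapp.create_app` factory and the Flask-style test client are assumptions for the sketch, not part of the original answer:

```python
# Exercise database access through the public API rather than with
# hand-rolled fixtures. Writing an order and reading it back drives the
# real persistence code end to end against a disposable test database.
import uuid
import pytest
from myapp import create_app  # hypothetical application factory

@pytest.fixture
def client():
    app = create_app(config="test")   # points at a throwaway test database
    with app.test_client() as c:
        yield c

def test_order_survives_round_trip(client):
    order_id = str(uuid.uuid4())
    # Writing through the API drives the real storage code...
    created = client.post("/orders", json={"id": order_id, "sku": "ABC-1"})
    assert created.status_code == 201
    # ...and reading it back verifies persistence and retrieval together.
    fetched = client.get(f"/orders/{order_id}")
    assert fetched.status_code == 200
    assert fetched.get_json()["sku"] == "ABC-1"
```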

Even in the business layer, most code out there is not simple functions whose arguments and return values can be easily isolated for test purposes. You can spend lots of time building mock objects, which might not correspond to the real implementations.

Of course automated tests are expensive if you build your software and your tests badly. I highly recommend checking out the book "Growing Object-Oriented Software, Guided by Tests" to understand how to do it right so that your tests and code are maintainable over time.
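
To give a flavour of the style that book advocates, here is a small, hypothetical Python example that mocks the role a collaborator plays (a made-up payment gateway port) rather than its implementation details, so the test keeps working while the internals are refactored:

```python
# Mock the role the collaborator plays, not a concrete implementation.
# The test pins down the protocol between the objects, so refactoring
# Checkout's internals does not break it.
from unittest.mock import Mock

class Checkout:
    def __init__(self, gateway):
        self._gateway = gateway          # collaborator injected via its role

    def place_order(self, amount_cents: int) -> bool:
        return self._gateway.charge(amount_cents)

def test_checkout_charges_the_gateway():
    gateway = Mock()                     # stands in for any real gateway
    gateway.charge.return_value = True

    assert Checkout(gateway).place_order(1999) is True
    gateway.charge.assert_called_once_with(1999)
```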

Integration/functional tests supplement unit tests, but these take a lot of time to run because they usually involve reinitializing the entire system on each test. (If you don't reinitialize, the test environment is inconsistent.)

One of the products I used to work on has a suite of 3,500 end-to-end acceptance tests that takes 18 hours to run. We run it in parallel on a grid of 70 boxes and get feedback in 45 minutes. That's still longer than ideal, which is why we run it as the second stage in the pipeline, after the unit tests have run in a few minutes, so we don't waste resources on a build we don't have some basic level of confidence in.
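
As a rough sketch of how such a grid splits the work, the following Python snippet shards a suite across workers by index; the test names and worker counts are purely illustrative:

```python
# Each worker takes every Nth test, so a large serial suite can be spread
# across a grid of boxes and finish in a fraction of the wall-clock time.
def shard(tests: list[str], worker_index: int, worker_count: int) -> list[str]:
    """Return the slice of the suite this worker should run."""
    return [t for i, t in enumerate(tests) if i % worker_count == worker_index]

all_tests = [f"acceptance_test_{n:04d}" for n in range(3500)]

for worker in range(3):               # pretend grid of 3 boxes for the demo
    batch = shard(all_tests, worker, 3)
    print(f"worker {worker}: {len(batch)} tests, first is {batch[0]}")
```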

Refactoring or any other changes break lots of tests. You spend lots of time fixing them. If it's a matter of validating meaningful spec changes, that's fine, but often tests break because of meaningless low-level implementation details, not stuff that really provides important information. Often the tweaking is focused on reworking the internals of the test, not on truly checking the functionality that is being tested.

If your code and tests are well encapsulated and loosely coupled, refactoring will not break lots of tests. We describe in our book how to do the same thing for functional tests too. If your acceptance tests break, that's a sign that you're missing one or more unit tests, so part of CD involves constantly improving your test coverage to try to find bugs earlier in the delivery process, where the tests are more fine-grained and the bugs are cheaper to fix.
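
Here is a small, hypothetical Python example of the difference: the test below pins down observable behaviour of a made-up `Basket` class rather than its internal representation, so refactoring the internals does not break it:

```python
# The test only exercises the public contract, so swapping the internal
# data structure (dict -> list, say) leaves it green; a test that reached
# into basket._items directly would break on every such refactoring.
class Basket:
    def __init__(self):
        self._items = {}                 # internal detail, free to change

    def add(self, sku: str, qty: int = 1) -> None:
        self._items[sku] = self._items.get(sku, 0) + qty

    def total_quantity(self) -> int:
        return sum(self._items.values())

def test_adding_items_increases_total_quantity():
    basket = Basket()
    basket.add("ABC-1")
    basket.add("ABC-1", 2)
    assert basket.total_quantity() == 3   # behaviour, not representation
```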

Field reports on bugs cannot easily be matched with the precise micro-version of the code.

If you're testing and releasing more frequently (part of the point of CD) then it is relatively straightforward to identify the change that caused the bug. The whole point of CD is to optimize the feedback cycle so you can identify bugs as soon as possible after they are checked in to version control - and indeed, preferably before they're checked in (which is why we run the build and unit tests before check-in).
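
One common way to make that matching trivial, sketched here in Python under the assumption that git is available on the build machine, is to stamp every artifact with the exact revision it was built from and surface it at runtime (the file and module names are illustrative):

```python
# Stamp the build with its version-control revision so any field bug
# report can quote the exact commit the artifact was produced from.
import subprocess

def current_revision() -> str:
    """Ask git for the commit this build was produced from."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()

def write_build_info(path: str = "build_info.py") -> None:
    # The generated module ships inside the artifact, so a crash report
    # or support ticket can include BUILD_REVISION verbatim.
    with open(path, "w") as f:
        f.write(f'BUILD_REVISION = "{current_revision()}"\n')

if __name__ == "__main__":
    write_build_info()
    print("Stamped artifact with", current_revision())
```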