Agile Development in Machine Learning and NLP

agile development-process machine-learning natural-language-processing

I've been developing web apps for a while now, and it is standard practice on our team to use agile development techniques and principles to implement the software.

Recently, I've also become involved in Machine Learning and Natural Language Processing. I've heard that people primarily use Matlab for developing ML and NLP algorithms. Does agile development have a place there, or is that skill completely redundant?

In other words, when you develop ML and NLP algorithms as a job, do you use agile development in the process?

Best Answer

Machine Learning and Natural Language Processing are largely data-driven. Without a continuous supply of high-quality data (which must be re-captured whenever new criteria are added), the software development may miss its intended target.

The customer and product owner may need to devote a larger fraction of their time to test-data collection.

The adaptations will depend on:

  1. The balance of time allocated to "spikes" versus "implementation"
    • spike: open-ended research / exploratory prototyping (where the benefits are possible but not certain, and where priorities change quickly)
    • implementation: where the benefits and costs are somewhat more predictable, but each task unit takes much longer to finish.
  2. Average size of a task unit. How long does a spike take? How long does an implementation take? Unlike typical application development, partial algorithm implementations are usually not runnable.
  3. Whether "canned algorithms", i.e. existing libraries / algorithm packages, are available.
    • When these are available, less time is spent on "implementation" (the code is already written), and thus more time can be spent on "spikes".
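To make the "canned algorithms" point concrete, here is a minimal, stdlib-only sketch (the vocabulary and function name are illustrative, not from the answer): reusing Python's built-in difflib for fuzzy term matching, instead of hand-implementing an edit-distance algorithm, turns a multi-day "implementation" into a few lines and frees that time for spikes.

```python
# Sketch: Python's stdlib difflib stands in for a "canned algorithm".
# Rather than implementing string-similarity matching from scratch
# (an "implementation" task), reuse the library and spend the saved
# time on spikes.
from difflib import get_close_matches

vocabulary = ["tokenize", "lemmatize", "vectorize", "normalize"]

def correct_term(term: str) -> str:
    """Return the closest known term, or the input unchanged."""
    matches = get_close_matches(term, vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else term
```

If a later spike shows the matching quality is insufficient, only then does it become a candidate for a custom implementation.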

There will be two feedback loops:

  1. In each iteration, the customer and product owner collect and update data periodically (with better quality and newer criteria) based on business needs and developer feedback.
  2. In each iteration, developers try to improve algorithm quality based on the available data, and the algorithm is packaged into a usable system and delivered at the end of each iteration. The customer and product owner should be allowed to take the beta system elsewhere, redistribute it, etc.
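The developer-side loop can be sketched in code. This is a minimal, hypothetical harness (the names and the accuracy metric are assumptions, not from the original answer): each iteration, the current algorithm is scored against the latest labelled data, so the data directly drives the measure of progress.

```python
# Sketch of the developer-side feedback loop (all names illustrative):
# each iteration, score the current algorithm against the latest
# labelled data delivered by the customer / product owner.
from typing import Callable, List, Tuple

LabelledData = List[Tuple[str, str]]  # (input text, expected label)

def accuracy(algorithm: Callable[[str], str], data: LabelledData) -> float:
    """Fraction of examples the algorithm labels correctly."""
    if not data:
        return 0.0
    correct = sum(1 for text, label in data if algorithm(text) == label)
    return correct / len(data)

def improved(current: Callable[[str], str],
             baseline: Callable[[str], str],
             latest_data: LabelledData) -> bool:
    """Did this iteration's algorithm beat the previous baseline on the new data?"""
    return accuracy(current, latest_data) > accuracy(baseline, latest_data)
```

When the product owner re-captures data under new criteria, the same harness re-scores both algorithms, which is exactly how "data" rather than "features" defines progress.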

Thus, we see that "data" replaces "features" as the main definition of progress.

Because of the increased importance of "research / spike" work in ML/NLP development, spikes need a more organized approach - something you may already have learned on a graduate research team. Treat spikes as "mini-tasks" taking hours to days. Implementing one algorithm suite takes longer, in some cases weeks. Because of this difference in task size, spikes and implementations should be prioritized separately. Implementations are costly - something to be avoided if possible. This is one reason to use canned algorithms / existing libraries.

The Scrum Master will need to constantly remind everyone to: (1) note down every observation, including "passing thoughts" and hypotheses, and exchange notes often (daily); (2) spend more time on spikes; (3) use existing libraries as much as possible; (4) not worry about execution time - it can be optimized later.

If you do decide to implement something that's missing from existing libraries, do it well.

Daily activities:

  • Reprioritize spikes (short tasks: hours to days) and exchange research notes.
    • De-prioritize spikes aggressively if yesterday's results don't seem promising.
  • Everyone must commit a fraction of their time to implementation (long tasks: weeks); otherwise nobody would work on it, because implementation tasks tend to be more boring.

Sprint activities:

  • Demo, presentation of beta software
  • New data collection
  • Retrospective: data collection criteria / new measures of quality, satisfaction with the algorithms, balance between spikes and implementations

About the note on deferring optimization: the thought-to-code ratio is much higher in ML/NLP than in business software, so once you have a working idea, rewriting an ML/NLP algorithm is easier than rewriting business software. This means it is easier to get rid of inefficiencies inherent in the architecture (in the worst case, simply do a rewrite).
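As an illustration of deferring optimization (the function is a hypothetical example, not from the answer): a clear but slow pure-Python similarity measure is perfectly adequate during spikes, and because the thought-to-code ratio is high, replacing it later with a vectorized version is a small, low-risk rewrite.

```python
# Sketch: a deliberately simple, unoptimized cosine similarity.
# Good enough for spikes; if profiling later shows it matters, the
# whole function can be rewritten (e.g. vectorized) cheaply because
# the idea it encodes is small.
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors; 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```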

(All editors are welcome to rearrange (re-order) my points.)