Machine Learning and Natural Language Processing development is fundamentally data-driven. Without a continuous supply of high-quality data (which must be re-captured whenever new criteria are added), the software development may miss its intended target.
The customer and product owner will likely devote a larger fraction of their time to test-data collection than in a typical project.
The adaptations will depend on:
- The balance of time allocated to "spikes" versus "implementation":
  - spike: open-ended research / exploratory prototyping, where the benefits are possible but not certain, and where priorities change quickly.
  - implementation: where the benefits and costs are somewhat more predictable, but each task unit takes much longer to finish.
- Average size of a task unit. How long does a spike take? How long does an implementation take? Unlike typical application development, partial algorithm implementations are usually not runnable.
- Whether "canned algorithms" (i.e. existing libraries / algorithm packages) are available.
  - When these are available, time spent on "implementation" is reduced (because they're already written), and thus more time will be spent on "spikes".
There will be two feedback loops:
- In each iteration, customer and product owner collect/update data periodically (and with better quality / newer criteria) based on business need and developers' feedback.
- In each iteration, developers try to improve algorithm quality based on the data available, and the algorithm is packaged into a usable system and delivered at the end of each iteration. The customer and product owner should be free to take the beta system elsewhere, redistribute it, etc.
Thus, we see that "data" replaces "features" as the main definition of progress.
Because of the increased importance of "research / spike" in ML/NLP development, there is a need for a more organized approach to spikes - something you may recognize from graduate research teams. Spikes are to be treated as "mini-tasks", taking from hours to days. Implementation of one algorithm suite takes longer, in some cases weeks. Because of this difference in task size, spikes and implementations are to be prioritized separately. Implementations are costly, something to be avoided if possible. This is one reason for using canned algorithms / existing libraries.
The scrummaster will need to constantly remind everyone to: (1) note down every observation, including "passing thoughts" and hypotheses, and exchange notes often (daily); (2) spend more time on spikes; (3) use existing libraries as much as possible; (4) not worry about execution time - it can be optimized later.
If you do decide to implement something that's missing in libraries, do it with good quality.
Daily activities:
- Reprioritize spikes (short tasks / hours - days) and exchange research notes.
- De-prioritize spikes aggressively if yesterday's results don't look promising.
- Everyone must commit a fraction of their time to implementation (long tasks / weeks); otherwise nobody would work on it, because implementations tend to be the more boring tasks.
Sprint activities:
- Demo, presentation of beta software
- New data collection
- Retrospective: data collection criteria / new measures of quality, algorithm satisfaction, balance between spikes and implementations
About the note on deferring optimization: the thought-to-code ratio is much higher in ML/NLP than in business software. Thus, once you have a working idea, rewriting the algorithm for an ML/NLP application is easier than rewriting business software. This means it is easier to get rid of inefficiencies inherent in the architecture (in the worst case, simply do a rewrite).
(All editors are welcome to rearrange (re-order) my points.)
Isn't it enough to split a text by whitespace or other characters that delimit boundaries between words?
In many cases, natural language processing is used to pick out pieces of sentences without necessarily analyzing the entire sentence. However, those pieces may consist of several words, so simply breaking the sentence into words at whitespace isn't very helpful. Imagine that you're building a database of historical facts, and a tiny portion of the input text looks like this:
Tony Orlando was born on April 3, 1944 in New York City and later moved to New Jersey.
In that case it's probably useful to know that this fact involves a person, a date, and two places, all of which consist of multiple words that aren't very useful by themselves. A chunker can break that sentence into phrases that are more useful than individual words, like "Tony Orlando" and "New York City" and even "born on April 3, 1944". Identifying meaningful terms could speed up searching and yield better results.
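A real chunker is usually trained statistically on tagged text, but a toy sketch with hand-written patterns (the patterns and labels below are made up purely for illustration) shows the idea of extracting multi-word phrases rather than individual tokens:

```python
import re

SENTENCE = ("Tony Orlando was born on April 3, 1944 in New York City "
            "and later moved to New Jersey.")

# Hand-written stand-ins for what a trained chunker would learn from data:
DATE = (r"(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December) \d{1,2}, \d{4}")
PROPER = r"(?:[A-Z][a-z]+ )+[A-Z][a-z]+"   # runs of two+ capitalized words

def chunk(text):
    """Return (phrase, label) pairs for multi-word chunks found in text."""
    found = []
    for pattern, label in [(DATE, "DATE"), (PROPER, "NAME")]:
        for m in re.finditer(pattern, text):
            found.append((m.group(0), label))
    return found

print(chunk(SENTENCE))
# [('April 3, 1944', 'DATE'), ('Tony Orlando', 'NAME'),
#  ('New York City', 'NAME'), ('New Jersey', 'NAME')]
```

Even this crude version recovers "Tony Orlando", "April 3, 1944", "New York City", and "New Jersey" as units, which whitespace splitting cannot do.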
Best Answer
I once worked with an NLP toolkit and ran into the problem you described. I think there are (at least) two approaches:
(implicit approach) Use memoization.
This was the approach I used, and it could be implemented quickly. The drawback was that certain larger data structures had to be persisted on disk, and while loading them was orders of magnitude faster than recalculation, it still took time.
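The answer doesn't show its code, but disk-backed memoization can be sketched in a few lines of Python (the cache layout and function names here are my own, not from any particular toolkit):

```python
import functools
import hashlib
import os
import pickle
import tempfile

CACHE_DIR = tempfile.mkdtemp(prefix="nlp_cache_")  # use a fixed dir in practice

def disk_memoize(fn):
    """Memoize fn's results to disk, keyed by a hash of its arguments."""
    @functools.wraps(fn)
    def wrapper(*args):
        key = hashlib.sha1(repr((fn.__name__, args)).encode()).hexdigest()
        path = os.path.join(CACHE_DIR, key + ".pkl")
        if os.path.exists(path):          # cache hit: load instead of recompute
            with open(path, "rb") as f:
                return pickle.load(f)
        result = fn(*args)                # cache miss: compute and persist
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result
    return wrapper

@disk_memoize
def tokenize(text):
    # Stand-in for an expensive preprocessing step (e.g. parsing a corpus).
    return text.lower().split()

tokenize("Hello World")   # computed and written to disk
tokenize("Hello World")   # second call is served from the pickle file
```

This matches the trade-off described: loading a pickle is far faster than recomputing, but for large data structures the load itself is not free.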
(explicit approach) Use a database, relational or document-oriented, to store all the results you might care about in the future. This requires more attention in the beginning, but pays off in the longer run.
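As a minimal sketch of the explicit approach, here is one way to stash per-document pipeline results in SQLite (the schema and stage names are assumptions for illustration; any relational or document store would do):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for real persistence
conn.execute("""CREATE TABLE IF NOT EXISTS results (
                    doc_id  TEXT,
                    stage   TEXT,   -- e.g. 'tokens', 'parse', 'entities'
                    payload TEXT,   -- JSON-encoded result
                    PRIMARY KEY (doc_id, stage))""")

def save_result(doc_id, stage, data):
    """Store (or overwrite) one stage's result for one document."""
    conn.execute("INSERT OR REPLACE INTO results VALUES (?, ?, ?)",
                 (doc_id, stage, json.dumps(data)))
    conn.commit()

def load_result(doc_id, stage):
    """Return a previously stored result, or None if it was never computed."""
    row = conn.execute(
        "SELECT payload FROM results WHERE doc_id = ? AND stage = ?",
        (doc_id, stage)).fetchone()
    return json.loads(row[0]) if row else None

save_result("doc1", "tokens", ["tony", "orlando"])
```

Unlike opaque pickles, the results are queryable (e.g. "which documents have no entities yet?"), which is where the up-front schema work pays off.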
Edit: Another thing I've been using lately for multi-step, long-running computations is a workflow framework, of which there are dozens. It is not really about persistence, but persistence becomes a step in the workflow. I'm trying luigi for that, and it comes with, e.g., Hadoop and Postgres helper classes, which can eliminate a lot of boilerplate code.