How to trim emails for just the body, when using email as input to an external system

algorithms, machine learning

When an application allows emails to be sent to it to either reply to comments or add todos, trimming those emails for just the relevant text becomes a problem, since there are many different standards. Many times you'll end up seeing things like this:

Hey Joe, good to hear from you. Let me know when you'll be back in town.
Posted by Bob, 30 minutes ago


I'll be back on the 13th.


Sincerely,
Joseph R. Roberts
Senior Partner

This communication is confidential and is property of Whatever Law Firm.
Posted by Joe, 10 seconds ago

Signatures are probably the most difficult to get rid of, and quoted text the easiest. I imagine any comprehensive strategy for trimming will be multi-faceted and, ideally, able to learn. I think a good system should:

  1. Remove quoted body
  2. Remove quote headers ("On 15 October, Joe wrote:")
  3. Remove signatures
  4. Preserve anything that was typed manually.

What steps would a system need to take to accomplish this, and what pitfalls should it be aware of?


This answer is a good example of a useful answer to a similar question

Best Answer

Properly formatted signatures are easy to identify by the '-- ' (dash dash space) line that precedes them. Good luck finding many. Although netiquette asks that signatures be kept to a few lines (four is the usual rule of thumb), many organizations have standard signatures and disclaimers that far exceed this.
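
As a rough illustration of the delimiter rule, here is a minimal Python sketch. The function name `strip_signature` and the `lenient` flag are my own choices for this example, not part of any library. It assumes a plain-text body and only handles signatures that actually use the delimiter; it searches from the bottom so that an earlier delimiter-like line in the message is not mistaken for the signature.

```python
def strip_signature(body: str, lenient: bool = False) -> str:
    """Drop everything from the last signature delimiter line onward.

    The delimiter is a line consisting of exactly '-- ' (dash dash space).
    With lenient=True (an assumption for mailers that strip trailing
    whitespace), a bare '--' line is accepted as well.
    """
    def is_delimiter(line: str) -> bool:
        return line == "-- " or (lenient and line.rstrip() == "--")

    lines = body.splitlines()
    # Walk upward from the last line and cut at the first delimiter found.
    for i in range(len(lines) - 1, -1, -1):
        if is_delimiter(lines[i]):
            return "\n".join(lines[:i]).rstrip()
    # No delimiter found: leave the body untouched.
    return body
```

Signatures like the one in the question, which have no delimiter at all, pass through unchanged and would need some other heuristic.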

Properly formatted quoted text will begin with one or more '>' characters. This assumes that you have a plain-text copy of the body to extract data from.
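
A sketch of how '>'-prefixed lines could be dropped from a plain-text body, again in Python. The name `strip_quotes` and the `QUOTE_HEADER` pattern are assumptions made for this example: the pattern only recognises English headers of the form "On ... wrote:" and only when the very next line is quoted, so localized or multi-line headers will slip through.

```python
import re

# Assumed header shape, e.g. "On 15 October, Joe wrote:" (English only).
QUOTE_HEADER = re.compile(r"^On .+ wrote:\s*$")


def strip_quotes(body: str) -> str:
    """Remove '>'-quoted lines and a quote header directly above them."""
    lines = body.splitlines()
    kept = []
    for i, line in enumerate(lines):
        # Quoted text: one or more '>' at the start of the line.
        if line.startswith(">"):
            continue
        # Drop the header only when it immediately precedes quoted text,
        # so a typed sentence that happens to end in "wrote:" survives.
        next_is_quote = i + 1 < len(lines) and lines[i + 1].startswith(">")
        if next_is_quote and QUOTE_HEADER.match(line):
            continue
        kept.append(line)
    return "\n".join(kept).strip()
```

Tying the header check to the following quoted line is a deliberately conservative choice: it favours keeping manually typed text over removing every possible header.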

HTML-formatted messages may carry markup or CSS styling, such as a blockquote element or a client-specific class around quoted text, which will help do what you want.
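
For the HTML case, one possible approach is to skip anything nested inside a `blockquote` element, which many (not all) mail clients use for quoted text. The sketch below uses only Python's standard `html.parser`; the class and function names are made up for this example, and real messages may mark quotes differently, so treat it as a starting point rather than a complete solution.

```python
from html.parser import HTMLParser


class QuoteStripper(HTMLParser):
    """Collect text that is not nested inside any <blockquote>."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # current <blockquote> nesting level
        self.parts = []  # text fragments found outside quotes

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.depth += 1
        elif tag in {"p", "div", "br"} and self.depth == 0:
            # Keep a separator so adjacent blocks do not run together.
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)


def strip_html_quotes(html: str) -> str:
    """Return the unquoted text of an HTML body, whitespace-normalized."""
    parser = QuoteStripper()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Signatures in HTML mail have no comparably common marker, so this only addresses the quoted-text part of the problem.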