Linux – How to find the reason why an email is marked 3.5 BAYES_99 BODY

amavis, email-server, linux, postfix, spamassassin

I am trying to find out why my messages are getting marked at http://isnotspam.com with a 3.5 BAYES_99 BODY score. It is nice that the report shows "SpamAssassin Check : ham (non-spam)". However, it bothers me that I get:

3.5 BAYES_99 BODY: Bayes spam probability is 99 to 100%
0.2 BAYES_999 BODY: Bayes spam probability is 99.9 to 100%

So yes, it's not really marked as spam, but how could it even look like spam to SpamAssassin's Bayes classifier? I thought it might be my HTML signature or the fact that the email was sent as HTML. So I tried a version with only a JPG as a signature, and then finally the pure-text version that can be accessed via the link below. Still the same result. I checked whether the IP or domain is blacklisted anywhere, but all is clear. My domain has SPF, DKIM and DMARC set up and is even listed at dnswl.org as trustworthy.

It is frustrating: the emails would most likely be delivered, since the total score is 2.7, but it is not pleasant to be classified as 99% probable spam, and a score of 2.7 still feels high. You never know what spam threshold other receiving servers have configured in SpamAssassin.

I would appreciate any ideas on what else to check. The sending email server runs CentOS 6.9 with Postfix, amavisd-new and ClamAV. I am not sure what else this could be; see the link below to the report for a text-only email.

http://isnotspam.com/newlatestreport.php?email=ins-10xbd4u5%40isnotspam.com

Best Answer

You cannot inspect the Bayes database of a recipient that does not share this data with you. This is deliberately not public, otherwise it would be even simpler to construct spam messages that defeat any simple Bayesian spam filter.

However, you can see the tokens queried with a Bayes database that you do have access to. When the two databases are similar (as is likely when both are correctly set up and trained on similar mail flows), you can still deduce useful information about which tokens might have been relevant.

Simply pipe the mail in question into spamassassin, instructed to log tokens into headers:

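# Run as the user that owns the Bayes database (debian-spamd here; adjust for your system).
# bayes_auto_learn 0 keeps this test run from training the database.
cat message.eml | sudo -H -u debian-spamd spamassassin \
 --test-mode --local --cf='bayes_auto_learn 0' \
 --cf='add_header all Spam-Tokens-Spammy _SPAMMYTOKENS(20,compact)_' \
 --cf='add_header all Spam-Tokens-Hammy _HAMMYTOKENS(20,compact)_' | less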

It is best to use the message as it was received, not as it was sent: the most interesting tokens may well have been added after submission (such as names and addresses of relays). You can choose the maximum number of tokens printed and a format (here: compact). The syntax is documented in doc/Mail_SpamAssassin_Conf.

The resulting message will contain headers like this, listing each token with its respective signal strength:

Spam-Tokens-Spammy: 0.992-+--investment, 0.988-+--estate, 0.987-+--download, ..
Spam-Tokens-Hammy: 0.000-+--0, 0.002-+--H*RU:192.0.2.1, 0.018-+--utf8, ..

In this example we can tell that mentioning "investment", "estate" and "download" has had an impact towards classifying the message as more likely to be spam.
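With 20 tokens per header, the list is easier to read one token per line. The following is a minimal sketch, assuming the compact layout shown in the sample headers (entries separated by commas, each of the form probability, one flag character, two dashes, token). The function name `split_tokens` is made up for this illustration and is not part of SpamAssassin, and tokens that themselves contain a comma would be split incorrectly.

```shell
#!/bin/sh
# Hypothetical helper: split the value of a compact token header
# (without the "Header-Name:" prefix) into "probability token" lines.
split_tokens() {
    printf '%s\n' "$1" |
        tr ',' '\n' |                      # one entry per line
        sed -n -e 's/^ *//' -e 's/ *$//' \
               -e 's/^\([0-9.]*\)-[^-]*--\(.*\)$/\1 \2/p'
    # Entries that do not match the assumed layout are silently dropped.
}

# Sample value taken from the example header above.
spammy='0.992-+--investment, 0.988-+--estate, 0.987-+--download'
split_tokens "$spammy"
```

Run against the sample value, this should print each token on its own line, prefixed by its probability, which makes it straightforward to pipe the result into `sort` or `grep`.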
