CentOS – Postfix mail lost in active queue

centos email-server postfix

I am facing a strange situation with our relay mail servers that send emails on behalf of our clients.

Our current infrastructure consists of two mail relay servers running Postfix. They receive all emails from our internal apps and are in charge of sending them out to the Internet.

The issue we are seeing is that about 20% of all emails received by those relays are never sent out and seem to disappear from the active queue.

Here is an example from the Postfix log showing an email that does not leave the active queue:

Feb 10 17:12:02 relay02 postfix/smtpd[31701]: EFF07209F6A3: client=coreapps02[10.11.12.202]
Feb 10 17:12:02 relay02 postfix/cleanup[10949]: EFF07209F6A3: message-id=<48327_38759699@example.com>
Feb 10 17:12:02 relay02 postfix/qmgr[23160]: EFF07209F6A3: from=<no-reply@example.com>, size=3581, nrcpt=1 (queue active)

This message appears to be lost, as it is not present in the /var/spool/postfix/active directory.

Here is an example of an email that was sent to the Internet at around the same time:

Feb 10 17:12:02 relay02 postfix/smtpd[31701]: D8F67209F6AF: client=coreapps02[10.11.12.202]
Feb 10 17:12:02 relay02 postfix/cleanup[10949]: D8F67209F6AF: message-id=<48327_38759698@example.com>
Feb 10 17:12:02 relay02 postfix/qmgr[23160]: D8F67209F6AF: from=<no-reply@example.com>, size=3617, nrcpt=1 (queue active)
Feb 10 17:12:03 relay02 postfix/smtp[10738]: D8F67209F6AF: to=<some.one@example.com>, relay=cluster1.us.messagelabs.com[216.82.241.131]:25, conn_use=2, delay=0.18, delays=0/0/0.02/0.16, dsn=2.0.0, status=sent (250 ok 1486746723 qp 65173 server-8.tower-54.messagelabs.com!1486746722!118816510!2)
Feb 10 17:12:03 relay02 postfix/qmgr[23160]: D8F67209F6AF: removed
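
To see how many messages show this pattern, one option is to group the log lines by queue ID and list the IDs that never get a "removed" line. The following is a minimal Python sketch, assuming the syslog line format shown above and short hexadecimal queue IDs; the function name find_unfinished and the SAMPLE list are hypothetical and only serve as illustration.

```python
import re
from collections import defaultdict

# Matches the Postfix part of a syslog line:
#   postfix/<daemon>[<pid>]: <QUEUEID>: <rest>
# Assumption: short (hexadecimal) queue IDs, as in the log excerpts above.
LINE_RE = re.compile(r"postfix/(\w+)\[\d+\]: ([0-9A-F]{6,}): (.*)")


def find_unfinished(lines):
    """Return queue IDs that appear in the log but have no qmgr 'removed' line."""
    events = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            daemon, queue_id, rest = match.groups()
            events[queue_id].append((daemon, rest))
    return sorted(
        queue_id
        for queue_id, seen in events.items()
        if not any(d == "qmgr" and r.startswith("removed") for d, r in seen)
    )


# Sample input taken from the log excerpts above.
SAMPLE = [
    "Feb 10 17:12:02 relay02 postfix/qmgr[23160]: EFF07209F6A3: "
    "from=<no-reply@example.com>, size=3581, nrcpt=1 (queue active)",
    "Feb 10 17:12:02 relay02 postfix/qmgr[23160]: D8F67209F6AF: "
    "from=<no-reply@example.com>, size=3617, nrcpt=1 (queue active)",
    "Feb 10 17:12:03 relay02 postfix/qmgr[23160]: D8F67209F6AF: removed",
]

print(find_unfinished(SAMPLE))  # ['EFF07209F6A3']
```

An ID reported by this sketch only means that the log has no "removed" line for it. The message may still be deferred, or it may have been delivered without the delivery being logged, so it is worth comparing the result with the output of postqueue -p.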

Any ideas why Postfix is dropping some (~20%) of our messages?

Best Answer

Both rsyslog and the systemd journal had a rate limit in place, which caused some Postfix log messages never to be written, even though the emails themselves were being handled properly.

I removed the rsyslog rate limit by following this guide: https://support.asperasoft.com/hc/en-us/articles/216128628-How-to-disable-rsyslog-rate-limiting, and the systemd journal rate limit by following this one: https://bani.com.br/2015/06/systemd-journal-what-does-systemd-journal-suppressed-n-messages-from-system-slice-mean/
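
To check whether rate limiting is dropping log lines on a given host, it can help to search the system log for the notices that journald and rsyslog emit when they start suppressing messages. The Python sketch below is only an illustration: the notice wording in the patterns is an assumption based on commonly seen journald and rsyslog messages and may differ between versions, and both the function name find_rate_limit_notices and the sample lines are hypothetical.

```python
import re

# Assumption: the notice wording below matches what your journald and
# rsyslog versions write; adjust the patterns to your actual logs.
PATTERNS = {
    "journald": re.compile(r"Suppressed (\d+) messages from"),
    "rsyslog": re.compile(r"drop messages.*due to rate-limiting"),
}


def find_rate_limit_notices(lines):
    """Return (source, line) pairs for lines that look like rate-limit notices."""
    hits = []
    for line in lines:
        for source, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((source, line.rstrip("\n")))
                break
    return hits


# Hypothetical sample lines, for illustration only.
SAMPLE_LOG = [
    "Feb 10 17:12:02 relay02 systemd-journal[412]: "
    "Suppressed 1523 messages from /system.slice/postfix.service",
    "Feb 10 17:12:02 relay02 rsyslogd-2177: "
    "imjournal: begin to drop messages due to rate-limiting",
    "Feb 10 17:12:03 relay02 postfix/qmgr[23160]: D8F67209F6AF: removed",
]

for source, line in find_rate_limit_notices(SAMPLE_LOG):
    print(source, "->", line)
```

If such notices show up around the times when delivery lines are missing, the messages were most likely delivered and only their log entries were dropped.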