Postfix “timed out while sending message body” to “Microsoft ESMTP MAIL Service”

emailpostfixsmtp

For some reason we cannot send larger email to one of our customers. Small emails are fine, but even a message with a 1MB attachment fails to deliver for several hours – Postfix logs keep filling with status=deferred (conversation with customer***[IP***] timed out while sending message body), but eventually the messages get sent.

The customer is using "Microsoft ESMTP MAIL Service, Version: 6.0.3790.3959" (checked with telnet to port 25), although I'm not sure if that is really the recipients software fault.
We can send email without problems to pretty much everywhere else. The customer has confirmed that they do not have such problem with other people sending them mail. Only mail from our server seems to get stuck.

I've tried lowering the MTU value of our outgoing interface to 1000 and even 500. Other places like this or this show that there is a rare problem, but are without solutions or confirmations.

Below is the Postfix log of a typical message sending process to that customer. (Email addresses replaced with ***, customer mail server with customer*** and ip***)

Aug 18 15:41:28 gozilla postfix/cleanup[14751]: F0DB414080: message-id=<***>

Aug 18 15:41:28 gozilla postfix/qmgr[21922]: F0DB414080: from=<***>, size=1671905, nrcpt=1 (queue active)
Aug 18 15:44:44 gozilla postfix/qmgr[14985]: F0DB414080: skipped, still being delivered
Aug 18 15:44:45 gozilla postfix/qmgr[15132]: F0DB414080: skipped, still being delivered
Aug 18 15:44:45 gozilla postfix/qmgr[15242]: F0DB414080: skipped, still being delivered
Aug 18 15:44:47 gozilla postfix/smtp[14752]: F0DB414080: to=<***>, relay=customer***[IP***], delay=199, delays=0.64/0/0.68/198, dsn=4.4.2, status=deferred (conversation with customer***[IP***] timed out while sending message body)

Aug 18 15:46:44 gozilla postfix/qmgr[15242]: F0DB414080: from=<***>, size=1671905, nrcpt=1 (queue active)
Aug 18 15:50:01 gozilla postfix/smtp[15555]: F0DB414080: to=<***>, relay=customer***[IP***]:25, delay=514, delays=317/0.01/0.17/196, dsn=4.4.2, status=deferred (conversation with customer***[IP***] timed out while sending message body)

Aug 18 15:52:52 gozilla postfix/qmgr[15242]: F0DB414080: from=<***>, size=1671905, nrcpt=1 (queue active)
Aug 18 15:56:35 gozilla postfix/smtp[15872]: F0DB414080: to=<***>, relay=customer***[IP***]:25, delay=908, delays=685/0.01/0.16/222, dsn=4.4.2, status=deferred (conversation with customer***[IP***] timed out while sending message body)

Aug 18 16:14:45 gozilla postfix/qmgr[15242]: F0DB414080: from=<***>, size=1671905, nrcpt=1 (queue active)
Aug 18 16:18:18 gozilla postfix/smtp[17065]: F0DB414080: to=<***>, relay=customer***[IP***]:25, delay=2210, delays=1997/0/0.17/213, dsn=4.4.2, status=deferred (conversation with customer***[IP***] timed out while sending message body)

Aug 18 16:23:22 gozilla postfix/qmgr[17635]: F0DB414080: from=<***>, size=1671905, nrcpt=1 (queue active)
Aug 18 16:27:23 gozilla postfix/smtp[17646]: F0DB414080: to=<***>, relay=customer***[IP***]:25, delay=2755, delays=2515/0.01/0.21/240, dsn=4.4.2, status=deferred (conversation with customer***[IP***] timed out while sending message body)

 --- cut out more lines like this ---

Aug 18 19:56:17 gozilla postfix/qmgr[18145]: F0DB414080: from=<***>, size=1671905, nrcpt=1 (queue active)
Aug 18 19:59:31 gozilla postfix/smtp[26730]: F0DB414080: to=<***>, relay=customer***[IP***]:25, delay=15484, delays=15289/0.01/0.2/194, dsn=4.4.2, status=deferred (conversation with customer***[IP***] timed out while sending message body)

Aug 18 22:21:17 gozilla postfix/qmgr[18145]: F0DB414080: from=<***>, size=1671905, nrcpt=1 (queue active)
Aug 18 22:26:01 gozilla postfix/smtp[32493]: F0DB414080: to=<***>, relay=customer***[IP***]:25, delay=24274, delays=23989/0.01/0.37/284, dsn=2.6.0, status=sent (250 2.6.0  <***> Queued mail for delivery)
Aug 18 22:26:01 gozilla postfix/qmgr[18145]: F0DB414080: removed

Best Answer

Oh boy, I love these servers. Almost universally, they're doing content-based scanning of the message bodies and then sending non-200 responses to the DATA command if they don't like the contents. This is fine in principle, but when you get some under-resourced crapbox running Exchange and some overweight proprietary waste of disk space doing the scanning, the machine runs like molasses and everyone else times out.

Yes, in theory this could be an MTU problem as you've suggested, but in practice this is pretty unlikely -- if you can get a 2000 byte message through, that disproves that theory (and there aren't many messages less than 2000 bytes, including headers, out there). A tcpdump at your end can confirm this -- if it's a TCP level problem, then you'll see attempts by your end to retransmit the full-size packet; if it's a remote MTA-being-slow problem, then there won't be any retransmissions, and the stall point will be when your end sends the closing ".".

Given that the far end seems clueless ("everybody else doesn't have a problem" my prominent behind), I'd just increase Postfix's timeout to some significantly higher figure and leave things be. The three settings you want to look into are smtp_data_done_timeout, smtp_data_xfer_timeout, and smtp_data_init_timeout (in roughly that order of importance). As you can see, the defaults are mighty generous, so the need to bump them up really does indicate how bodgy the far end is.