Oh boy, I love these servers. Almost universally, they're doing content-based scanning of the message bodies and then sending non-2xx responses to the DATA command if they don't like the contents. This is fine in principle, but when you get some under-resourced crapbox running Exchange and some overweight proprietary waste of disk space doing the scanning, the machine runs like molasses and everyone else times out.
Yes, in theory this could be an MTU problem as you've suggested, but in practice this is pretty unlikely -- if you can get a 2000-byte message through, that disproves the theory (and there aren't many messages under 2000 bytes, including headers, out there). A tcpdump at your end can confirm this -- if it's a TCP-level problem, you'll see your end trying to retransmit the full-size packet; if it's a remote-MTA-being-slow problem, there won't be any retransmissions, and the stall point will be when your end sends the closing "." line that ends the message body.
Given that the far end seems clueless ("everybody else doesn't have a problem" my prominent behind), I'd just increase Postfix's timeouts to some significantly higher figure and leave things be. The three settings you want to look into are smtp_data_done_timeout, smtp_data_xfer_timeout, and smtp_data_init_timeout (in roughly that order of importance). As you can see, the defaults are mighty generous, so the need to bump them up really does indicate how bodgy the far end is.
The status value itself isn't as valuable as the text in parentheses that directly follows it, which gives a better description of what's going on.
"Message queued for delivery"
- This means the transaction between your server and the target server has yet to take place for that particular message. It usually means something has just sent the message and your SMTP server is acknowledging its existence.
"Message Accepted"
- This means the destination server acknowledges that the message has been received on its end. (It doesn't indicate that the message has been read.)
"Bounced"
- This typically means that something went wrong: either the email was rejected by the target email server because the email address didn't exist, or it was rejected because the sender is on an RBL. It also means the email will NOT be delivered, nor handled any further by the server. AKA: the message is dead in the water.
"Deferred"
- This means that something temporary has happened to cause the message to not be delivered, but the server (yours) hasn't given up and will try again later. This is also common to see when the target SMTP server uses an anti-spam technique known as 'greylisting'.
Beyond the status, the rest of the line is worth reading too. Here's an example from my mail.log:
postfix/qmgr[32131]: 3858792A80: from=<foo@domain.com>, size=757, nrcpt=1 (queue active)
postfix/smtp[32135]: 3858792A80: to=<foo@gmail.com>, relay=gmail-smtp-in.l.google.com[74.125.91.27]:25, delay=8, delays=8/0.01/0.4/1.5, dsn=2.0.0, status=sent (250 2.0.0 OK 1307169606 6si4629303qcd.120)
relay=gmail-smtp-in.l.google.com[74.125.91.27]:25 = Target SMTP server for the 'to' email address
delays=8/0.01/0.4/1.5 =
- 8s = time from message arrival to last active queue entry
- 0.01s = time from last active queue entry to connection setup
- 0.4s = time to negotiate connection (EHLO, etc)
- 1.5s = time spent transferring entire message
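If you'd rather pull those fields out with a script than eyeball them, here's a minimal Python sketch. The regex and the names (parse_delivery, DELAY_STAGES) are my own, made up for illustration; nothing here ships with Postfix. It assumes a delivery line with the same field order as the example above (to, relay, delay, delays, dsn, status) and returns None for anything else, such as the qmgr line.

```python
import re

# Assumed layout of a Postfix delivery log line, based on the example above.
DELIVERY_RE = re.compile(
    r"(?P<queue_id>[0-9A-Za-z]+): to=<(?P<to>[^>]*)>,"
    r".*?\bdelay=(?P<delay>[\d.]+),"
    r" delays=(?P<delays>[\d./]+),"
    r" dsn=(?P<dsn>[\d.]+),"
    r" status=(?P<status>\w+) \((?P<detail>.*)\)\s*$"
)

# Hypothetical labels for the four slash-separated "delays" values,
# in the order they appear in the log line.
DELAY_STAGES = (
    "arrival_to_active_queue",
    "active_queue_to_connect",
    "connection_setup",
    "transmission",
)


def parse_delivery(line):
    """Return a dict of fields from a delivery line, or None if it doesn't match."""
    match = DELIVERY_RE.search(line)
    if match is None:
        return None
    record = match.groupdict()
    record["delay"] = float(record["delay"])
    stage_times = [float(part) for part in record["delays"].split("/")]
    record["delays"] = dict(zip(DELAY_STAGES, stage_times))
    return record
```

The "detail" field is the text in parentheses after the status, which is usually the part you actually want to read when a message is deferred or bounced.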
A good way to learn is to simply tail your mail log and send emails in various ways: watch what happens when you send to bad accounts or to a server that uses greylisting, or block the outbound port and send one.
I'll second the above suggestions for Nagios. Like Vim, it has a bit of a learning curve, but it's entirely worth it once you get over the hump.
You'll excuse my lack of knowledge regarding Munin, but the concept of a metric and a website state can't be that hard. Surely there's a data timeout/data read plugin, and you can set a reasonable expectation for when you should get a response from the webserver; if it takes longer than somewhere in the range of 5-10 seconds, you send an e-mail.
A rudimentary version of this can be done for mail servers too. You should expect something to be returned on the POP3/IMAP and SMTP ports within a reasonable amount of time, and if you don't get any content back, you send an e-mail.
Aside from all of that, I really, really suggest Nagios if Munin proves too difficult for implementing this. The nagios-plugins package ships check_http and check_(imap|pop|smtp) scripts. All you have to do is hook up the command-line arguments (time thresholds, predominantly), and you're done.
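If you just want to see the idea before setting up either tool, the "expect a greeting within N seconds" check described above can be sketched in a few lines of Python. This is a rough illustration, not what the Nagios plugins actually do internally; check_banner and its parameters are names I made up. It connects, waits for the server's greeting, and reports whether it arrived in time and starts with the expected prefix ("220" for SMTP, "+OK" for POP3, "* OK" for IMAP).

```python
import socket
import time


def check_banner(host, port, expected_prefix, timeout=10.0):
    """Return (ok, elapsed_seconds, banner_or_error) for one greeting check."""
    start = time.monotonic()
    try:
        # The timeout applies to the connect and, separately, to the read,
        # so the worst case is roughly twice the timeout.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            banner = sock.recv(512).decode("ascii", errors="replace").strip()
    except OSError as exc:
        # Covers refused connections, unreachable hosts and timeouts.
        return False, time.monotonic() - start, str(exc)
    elapsed = time.monotonic() - start
    return banner.startswith(expected_prefix), elapsed, banner
```

Something like check_banner("mail.example.com", 25, "220", timeout=10) run from cron would cover the SMTP case (the hostname is a placeholder). Sending the alert e-mail when ok is False is left out, and you'd want to send it through a different server than the one being checked.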