How to Diagnose a Strange Network Failure

file-transfer networking tcp

This is a strange one. I have 2 remote networks that I transfer files between over the internet. Yesterday, a regular backup job failed, so I started looking into it. I have been transferring files between these two networks without issues for several months.

Several hours of debugging later, I have arrived at this diagram:

[Image: Network Diagram]

Basically, I can't transfer any large file (>50MB-ish) from Network A to any device behind both routers on Network B. It doesn't matter if I initiate the transfer from Network A or from Network B. It will connect and start transferring and then after several seconds (it seems to vary from 5-60 seconds) the transfer fails.

I can transfer from Network A to other networks without issue. I can even transfer to devices on Network B that are behind only NAT Router 1 without issue. Small files work fine (most of the time). Larger files start out ok and then fail.

Errors and Logs

When I initiate an rsync transfer from Network B (sending a file from A to B):

...several more identical lines (depends on how soon it fails)...
debug2: channel 0: window 1966080 sent adjust 131072
debug2: channel 0: window 1966080 sent adjust 131072
debug2: channel 0: window 1966080 sent adjust 131072
ssh_dispatch_run_fatal: Connection to XXX.XXX.XXX.XXX port 22: message authentication code incorrect
Sometimes ---> debug3: mux_client_read_packet: read header failed: Broken pipe

When I initiate the transfer from Network A (still sending from A to B):

...several more identical lines (depends on how soon it fails)...
debug2: channel 0: rcvd adjust 131072
debug2: channel 0: rcvd adjust 131072
debug3: send packet: type 1
packet_write_wait: Connection to XXX.XXX.XXX.XXX port 22: Broken pipe
rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Broken pipe (32)

Transfers also fail when trying to download a large file over HTTPS from Network A to B. When I run curl I get:

curl: (56) OpenSSL SSL_read: error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac, errno 0

I see the same behavior with multiple files and with multiple computers on Network B that are behind both routers.
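Both the SSH error ("message authentication code incorrect") and the curl error ("bad record mac") are integrity-check failures, which is what you would expect to see if even one byte of the encrypted stream were altered in transit. The Python sketch below illustrates the idea with a plain HMAC-SHA256 tag. It is a simplification and not the actual SSH or TLS packet format; the helper names (seal, verify, flip_bit) are made up for illustration.

```python
import hashlib
import hmac
import os

TAG_LEN = 32  # size of an HMAC-SHA256 tag in bytes


def seal(key: bytes, payload: bytes) -> bytes:
    """Append an HMAC-SHA256 tag to the payload (illustrative, not SSH/TLS framing)."""
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return payload + tag


def verify(key: bytes, packet: bytes) -> bool:
    """Recompute the tag over the payload and compare it to the one received."""
    payload, tag = packet[:-TAG_LEN], packet[-TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)


def flip_bit(packet: bytes, byte_index: int, bit: int = 0) -> bytes:
    """Return a copy of the packet with a single bit inverted."""
    damaged = bytearray(packet)
    damaged[byte_index] ^= 1 << bit
    return bytes(damaged)


key = os.urandom(32)
packet = seal(key, os.urandom(1024))

print(verify(key, packet))                 # intact packet
print(verify(key, flip_bit(packet, 100)))  # one flipped bit in the payload
```

The first check passes and the second fails. A real SSH or TLS endpoint treats such a failure as fatal and drops the connection, which would match transfers dying partway through rather than completing with a damaged file.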

What I've Tried

  • Firmware update on Network B, NAT Router 2: no effect
  • Reboot all devices in both networks: no effect
  • Tried transfer over 2 different ISPs at Network B: no effect
  • Baseball bat to every router in sight: still deciding on this…

Update

As a minor update, I have noticed the same issue during an interactive SSH session. If I run a command that produces a lot of output on the screen, sometimes my SSH session disconnects with an invalid MAC error.

Update 2

NAT Router 2 is a Cisco RV320. As an experiment, I temporarily disabled the firewall (see screenshot below). The transfer now works, but this also kind of defeats the point of the router (it's there to create a protected inner layer of my network). Any ideas on how to proceed from here? The firewall setting is kind of opaque to me (it's just a checkbox). I'm not sure what it's actually doing under the hood.

By the way, I tried individually disabling SPI, Block WAN Request, and DoS, but none of those settings had any effect. It was only the main Firewall setting (which automatically disables the others) that did the trick.

[Image: Cisco RV320 firewall settings]

Update 3

I spoke with Cisco tech support and they asked me to connect the router directly to the modem as a test (bypassing NAT Router 1). In that environment, the transfer was successful. So, it's something about the combination of both routers that's causing the problem.

I enabled every available log option for the Cisco router and ran a few failed transfers, but nothing shows up in the logs. At this point, I'm not really sure how to proceed. I might update the firmware on NAT Router 1 just for fun.
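Since the router logs show nothing, one further experiment would be to take encryption out of the picture and check whether the raw TCP payload arrives intact. The Python sketch below is a rough idea of how that could look, not a tested diagnostic for this setup: one side sends a deterministic byte pattern over plain TCP, the other regenerates the same pattern and reports the offset of the first byte that differs. All function names and the block size are assumptions chosen for illustration. Normal in-flight corruption would usually be caught by the TCP checksum, so a mismatch here would suggest that some device on the path is rewriting the payload.

```python
import hashlib
import socket
import threading

CHUNK = 64 * 1024  # bytes per pattern block (arbitrary choice)


def pattern_block(index: int, size: int = CHUNK) -> bytes:
    """Deterministic block derived from its index, so both ends can regenerate it."""
    seed = hashlib.sha256(index.to_bytes(8, "big")).digest()
    return (seed * (size // len(seed) + 1))[:size]


def serve_pattern(listener: socket.socket, total_blocks: int) -> None:
    """Accept one connection on a listening socket and send the pattern."""
    conn, _ = listener.accept()
    with conn:
        for i in range(total_blocks):
            conn.sendall(pattern_block(i))


def verify_pattern(host: str, port: int, total_blocks: int):
    """Return the offset of the first mismatching byte, or None if the stream is clean."""
    with socket.create_connection((host, port)) as sock:
        for i in range(total_blocks):
            expected = pattern_block(i)
            received = bytearray()
            while len(received) < len(expected):
                data = sock.recv(len(expected) - len(received))
                if not data:
                    raise ConnectionError(f"stream ended early in block {i}")
                received.extend(data)
            if bytes(received) != expected:
                first = next(j for j, (a, b) in enumerate(zip(received, expected)) if a != b)
                return i * CHUNK + first
    return None


def loopback_check(total_blocks: int = 16):
    """Sanity-check the two halves against each other on 127.0.0.1."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve_pattern, args=(listener, total_blocks), daemon=True)
    server.start()
    try:
        return verify_pattern("127.0.0.1", port, total_blocks)
    finally:
        server.join(timeout=5)
        listener.close()
```

To use it across the two networks, serve_pattern would run on a host in Network A (with a listening socket bound to a port that is forwarded appropriately) and verify_pattern on a machine behind both routers on Network B, with total_blocks set high enough to exceed the roughly 50MB at which transfers start failing. A returned offset, as opposed to None, would point at payload corruption somewhere on the path.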

Best Answer

Just to close this out. The "solution" for me was to ditch the RV320 in favor of a DrayTek Vigor 2925.

I've only tried a couple transfers since making the switch, but one was quite a large file that took almost 3 hours, and it went through without any problems. So, I'm optimistically going to say the new router has solved this.

I wish I knew exactly why the firewall on the Cisco was interfering with this traffic because, in general, I liked the router, but I don't have any more time to diagnose it. Thanks to everyone that helped. Anyone want to buy a router? :)