Unidirectional Packet Loss

metro-ethernet packet-loss

Recently, after upgrading several MetroE circuits (L2 connectivity) from 100 Mbps to 1 Gbps, I noticed that large file transfers fail between some sites; however, the transfer fails in only one direction. Consider the following example.

From -> To

A -> B = Fail

B -> A = Success

A -> C = Success

C -> A = Success

B -> C = Success

C -> B = Success
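As a small sketch, the results above can be written down as data and filtered for paths that fail in one direction while the reverse direction succeeds. The dictionary layout and the function name `one_way_failures` are only illustrative, not part of any tool.

```python
# Transfer results from the table above: (source, destination) -> success?
results = {
    ("A", "B"): False,
    ("B", "A"): True,
    ("A", "C"): True,
    ("C", "A"): True,
    ("B", "C"): True,
    ("C", "B"): True,
}


def one_way_failures(results):
    """Return paths that fail while the reverse path succeeds."""
    return sorted(
        (src, dst)
        for (src, dst), ok in results.items()
        if not ok and results.get((dst, src), False)
    )


print(one_way_failures(results))
```

With the table above, the only path reported is A -> B.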

Each site is a routed segment behind an L3 switch located at the site. The L3 switch connects to the provider's CPE media converter, which in turn connects to the provider's network via fiber. Static routing is used between the L3 switches.

            *Site A*                      *Site B*
    L3 Switch <-> CPE <--- Provider ---> CPE <-> L3 Switch
                               |
                              CPE
                               |
                           L3 Switch
                            *Site C*

The provider performed end-to-end testing of the circuits from the CPEs and reported no loss. However, I see many duplicate ACKs in a packet capture on the hosts before the transfer fails.
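For anyone who wants to quantify the duplicate ACKs rather than eyeball them, here is a minimal sketch. It assumes the ACK number and payload length of each segment sent by the receiving host have already been exported from the capture. It uses a simplified definition (a payload-free segment repeating the previous ACK number), which is looser than what a full analyzer applies. The `Segment` type and the sample values are made up for illustration.

```python
from typing import Iterable, NamedTuple


class Segment(NamedTuple):
    ack: int          # TCP acknowledgment number
    payload_len: int  # bytes of TCP payload carried by the segment


def count_duplicate_acks(segments: Iterable[Segment]) -> int:
    """Count payload-free segments that repeat the previous ACK number."""
    duplicates = 0
    last_ack = None
    for seg in segments:
        if seg.payload_len == 0 and seg.ack == last_ack:
            duplicates += 1
        last_ack = seg.ack
    return duplicates


# Made-up sample: the receiver acknowledges 2000 three times in a row,
# which counts as two duplicates of the first ACK for 2000.
sample = [
    Segment(1000, 0),
    Segment(2000, 0),
    Segment(2000, 0),
    Segment(2000, 0),
    Segment(3000, 0),
]
print(count_duplicate_acks(sample))
```

A burst of duplicates for the same ACK number suggests the receiver is seeing segments arrive past a gap, which is consistent with loss in the data direction.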

If I remove the L3 switches from the equation, and connect two hosts directly to the CPE device at each site, the file transfer completes successfully.

    Host A <-> CPE <--- Provider ---> CPE <-> Host B

If I place hosts on either side of an L3 switch, inter-VLAN routing works without a hitch and the file transfer completes successfully.

    Host A1 <-> L3 Switch <-> Host A2

The issue only seems to occur when data traverses the provider between two routed segments.

    Host A <-> L3 Switch <-> CPE <--- Provider ---> CPE <-> L3 Switch <-> Host B

I have verified a number of things: interface statistics are clean (no errors), CPU and memory utilization are low, speed and duplex match (client and CPE), the MAC and ARP tables are correct, etc.

What could be the issue?

Update 1

Packet captures from hosts A and B can be found at the following URL:

https://www.dropbox.com/sh/5m2yohgxieelo59/AADed-0EWOkdmFIe0qT45_uQa

The issue originally occurred using Juniper EX3200 switches running 12.3R6.6. I subsequently downgraded the switches to 11.4R6.6, but this did not resolve the issue.

I was able to replicate the issue using Juniper EX2200 switches running 12.3R6.6 and 11.4R6.6. I was also able to replicate the issue using Dell 6224 switches running 3.3.11.2.

Currently, only the CPE (ge-0/0/0) and a single host (ge-0/0/1) are connected to a Juniper EX3200 at each site. While troubleshooting the issue, I stripped the configuration of any extraneous parameters, so the configuration is fairly basic. The configuration is essentially the same at each site, but with different IP addresses. Below is a snippet.

    # show interfaces
    ge-0/0/0 {
        unit 0 {
            family ethernet-switching {
                port-mode access;
                vlan {
                    members WAN;
                }
            }
        }
    }
    ge-0/0/1 {
        unit 0 {
            family ethernet-switching {
                port-mode access;
                vlan {
                    members LAN;
                }
            }
        }
    }
    vlan {
        unit 10 {
            description WAN;
            family inet {
                address 192.168.X.X/27;
            }
        }
        unit 100 {
            description LAN;
            family inet {
                targeted-broadcast;
                address 172.X.X.1/22;
            }
        }
    }

    # show vlans
    WAN {
        vlan-id 10;
        l3-interface vlan.10;
    }
    LAN {
        vlan-id 100;
        l3-interface vlan.100;
    }

Update 2

Today I noticed that if I scp a file from the L3 switch (Juniper EX3200) at site A to the L3 switch (Juniper EX3200) at site B, the scp transfer is also affected by the issue.

I find this especially interesting since the transfer originates from the CPE-facing interface on the WAN VLAN. By contrast, if I trunk a VLAN between the affected sites through the EX3200 switches, switched file transfers between hosts at sites A and B complete successfully.

Best Answer

On the firewall, if you are using an SRX, check what your security flow session limit is set to and whether the session count is reaching it.

    # show security flow session summary