Do routers ever combine multiple small frames into a jumbo frame and then split them up again at the other end? I know that most of the time it wouldn't make sense, because you would add latency by waiting for the second packet to arrive, but with a QoS backlog it might. Or are jumbo frames always end-to-end?
Are Normal Packets Ever Combined into Jumbo Packets?
ethernet
Related Solutions
Responding to individual concerns in the post...
Regarding Path MTU Discovery
Ideally I would be relying on Path MTU Discovery. But since the Ethernet packets being generated are too large for any other machine to receive, there is no opportunity for IP "packet too big" fragmentation messages to be returned.
Based on your diagram, I agree that PMTUD cannot function between two different PCs in the same LAN segment; PCs do not generate ICMP Error messages required by PMTUD.
Jumbo frames
Some vendors (such as Cisco) have switch models which support ethernet payloads larger than 1500 bytes. Officially IEEE does not endorse this configuration, but the industry has valid needs to judiciously deviate from the original 1500 byte MTU. I have storage LAN / backup networks which leverage jumbo frames for good reason; however, I made sure that all MTUs matched inside the same VLAN when I deployed jumbo frames.
Mismatched MTUs within a broadcast domain
The bottom line is that you should never have mismatched ethernet MTUs inside the same ethernet broadcast domain; if you do, it's a bug or configuration error. Regardless of bug or error, you have to solve these problems, sometimes manually.
All that discussion leads to the next question...
Why is there a spec that intentionally creates invalid ethernet frames?
I'm not sure that I agree... I don't see how the IEEE 802.3 series, or RFC 894 create invalid frames. Host implementations or host misconfigurations create invalid frames. To understand whether your implementation is following the spec, we need a lot more evidence...
This diagram is at least prima facie evidence that your MTUs are mismatched inside a broadcast domain...
+------------------+      +----------------+       +------------------+
| Realtek PCIe GBe |      | NetGear 10/100 |       | Realtek 10/100   |
| (on-board)       |      |     Switch     |       | (on-board)       |
|                  |      +----------------+       |                  |
| Windows 7        |           ^      ^            |                  |
|                  |           |      |            |                  |
| 192.168.1.98/24  |-----------+      +------------| 192.168.1.10/24  |
| MTU = 1504 bytes |                               | MTU = 1500 bytes |
+------------------+                               +------------------+
How should an 802.3-compliant implementation respond to MTU mismatches?
What was it they [the writers of 'the spec'] expected people to do with devices that generate these too large packets?
MTU 1504 and MTU 1500 within the same broadcast domain is simply a misconfiguration; it should never be expected to work any more than mismatched IP netmasks or mismatched IP subnets can be expected to work. Your company will have to knuckle down and fix the root cause of the MTU mismatches... at this time it's hard to say whether the root cause is user error, an implementation bug, or some combination of the two.
If the affected Windows machines are successfully logging in to an Active Directory Domain, one could write Windows login scripts to automatically fix MTU issues based on some well-constructed tests inside the domain login scripts (assuming the Domain Controller isn't part of the MTU issues).
If the machines are not logging into a domain, manual labor is another option.
Other possibilities to contain the damage
Use a layer 3 switch (see Note 1) to build a custom VLAN for anything that has broken MTUs, and set the layer 3 switch's ethernet MTU on that VLAN to match the broken machines; this relies on PMTUD to resolve MTU issues at the IP layer. Layer 3 switches generate the ICMP errors required by PMTUD.
This option works best if you can re-address the broken machines with DHCP and can identify them by MAC address.
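To make the PMTUD dependency concrete, here is a minimal sketch of the decision a routing device makes per RFC 1191 when an IPv4 packet is larger than the egress MTU. The names `Decision` and `route_ipv4` are made up for illustration and do not come from any vendor's software; the ICMP type and code values are the standard ones.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

ICMP_DEST_UNREACHABLE = 3   # ICMP type
ICMP_FRAG_NEEDED = 4        # ICMP code: fragmentation needed and DF set


@dataclass
class Decision:
    action: str                                   # "forward", "fragment", or "drop"
    icmp: Optional[Tuple[int, int, int]] = None   # (type, code, next-hop MTU)


def route_ipv4(packet_len: int, df_set: bool, egress_mtu: int) -> Decision:
    """Hypothetical helper: what happens to an IPv4 packet on the egress link."""
    if packet_len <= egress_mtu:
        return Decision("forward")
    if not df_set:
        # Without DF the router may fragment instead of reporting an error.
        return Decision("fragment")
    # Too big and DF set: drop, and tell the sender the next-hop MTU.
    return Decision("drop", (ICMP_DEST_UNREACHABLE, ICMP_FRAG_NEEDED, egress_mtu))


# A 1504-byte packet from the quarantine VLAN routed toward a 1500-byte VLAN.
print(route_ipv4(1504, df_set=True, egress_mtu=1500))
```

The sender is expected to lower its path MTU estimate when it receives that ICMP error, which is why this workaround only helps if the error is actually generated and not filtered along the way.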
... why did they bump it up to 1504 bytes, and create invalid packets, in the first place?
Hard to say at this point
802.1ad vs 802.1q
How is IEEE 802.1ad (aka VLAN Tagging, QinQ) valid, when the packets are too large?
I haven't seen evidence so far that you're using QinQ; from the limited evidence I have seen so far, you're using simple 802.1q encapsulation, which should work correctly in Windows, assuming the NIC driver supports 802.1q encap.
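For reference, a quick sketch of the frame-size arithmetic behind tagging. Each VLAN tag adds 4 bytes to the frame, not to the payload, so a tagged frame is larger on the wire while the MTU stays at 1500. The function name `frame_size` is my own; the constants are the standard Ethernet field sizes. The 4 bytes happen to equal the gap between 1500 and 1504, though nothing in this thread establishes that as the cause.

```python
ETH_HEADER = 14   # destination MAC + source MAC + EtherType
ETH_FCS = 4       # frame check sequence
VLAN_TAG = 4      # one 802.1Q tag (TPID + TCI)


def frame_size(payload: int, vlan_tags: int = 0) -> int:
    """Bytes from destination MAC through FCS for a given payload size."""
    return ETH_HEADER + vlan_tags * VLAN_TAG + payload + ETH_FCS


print(frame_size(1500))      # untagged: 1518
print(frame_size(1500, 1))   # single 802.1Q tag: 1522
print(frame_size(1500, 2))   # double-tagged (802.1ad / QinQ): 1526
```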
End Notes:
Note 1: Any layer 3 switch should do... Cisco, Juniper, and Brocade switches could all perform this kind of function.
It depends on how the interfaces are bonded.
One way to do this is that only one NIC is really active. If one of the links goes down, then the other NIC starts using the MAC address of the first NIC, or the system issues a gratuitous ARP with its MAC address to get everyone to update their ARP tables.
A close second to this method is that both NICs are used to send, but only one is used to receive.
Any other configuration requires the cooperation of the switches or the sending parties.
Note that unless the switch and the end device agree on a configuration, you could get some bad behavior. For example, the switch might not know which port actually has which MAC and will instead flood ALL traffic for that MAC. Or you could get a non-functional link.
Since you are using Adaptive Load Balancing, I will explain this mode.
Outgoing packets are split based on load.
Incoming packets are a bit trickier. When an ARP request is received, the MAC sent back is based on the requester's IP address. For example, if client A sends an ARP request for your IP, it will get the MAC of NIC 1. Later, when client B sends an ARP request, it will get the MAC of NIC 2. That way clients are split among the available NICs.
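A toy sketch of that receive-side idea, under assumptions: the class name `ArpBalancer` and the "fewest clients so far" policy are mine, chosen only to reproduce the client A / client B example. A real bonding driver keeps its own table and uses its own assignment policy.

```python
class ArpBalancer:
    """Hypothetical model: remember which NIC MAC each client was told about."""

    def __init__(self, nic_macs):
        self.nic_macs = list(nic_macs)
        self.assignments = {}   # client IP -> MAC handed out in ARP replies

    def mac_for(self, client_ip: str) -> str:
        if client_ip not in self.assignments:
            # Assumed policy: give a new client the NIC with the fewest clients.
            counts = {mac: 0 for mac in self.nic_macs}
            for mac in self.assignments.values():
                counts[mac] += 1
            self.assignments[client_ip] = min(self.nic_macs, key=lambda m: counts[m])
        # A known client always gets the same answer, so its traffic stays put.
        return self.assignments[client_ip]


bond = ArpBalancer(["00:11:22:33:44:01", "00:11:22:33:44:02"])
print(bond.mac_for("192.168.1.50"))   # client A -> NIC 1
print(bond.mac_for("192.168.1.51"))   # client B -> NIC 2
print(bond.mac_for("192.168.1.50"))   # client A again -> still NIC 1
```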
Best Answer
Let me see if I am understanding your scenario correctly:
Assuming my understanding of your problem statement is correct, let's see what it would take to implement this.
Let's say the queue on the egress interface looks like this at some instant:
(1) Identification: The router would have to examine these packets, check if they are IPv4 or IPv6 (with the fragmentation extension header), then look at the fragmentation fields to identify reassembleable fragments. (Not all packets are IPv4 or IPv6, and the implementation would have to leave these packets alone.)
(2) Transmit order: It is possible that P1, P2 and P5 are fragments of one datagram, and P3, P4 and P6 are fragments of a different datagram. The implementation would therefore have to first reassemble and transmit (P1 + P2 + P5), then (P3 + P4 + P6). Normally queues are first-come-first-served, but now you'd have to "cherry-pick" fragments from across the entire queue.
Also consider what happens if P5 is not the last fragment of its datagram: you have to wait until the last fragment shows up in the queue, but in the meantime (P3 + P4 + P6) is ready to be reassembled and transmitted. Would you transmit it first?
(3) Out-of-order fragments: Note also that it is possible that P2 might in fact be the first fragment, P1 the second and P5 the third. This is because these fragments may have taken different paths on their journey from A to R. Normally, end hosts deal with this situation of out-of-order fragments, but if routers start doing reassembly, this is something that they have to take care of as well.
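To give a feel for the bookkeeping in points (1) to (3), here is a minimal sketch that groups queued IPv4 fragments by datagram and checks whether a group is complete, regardless of arrival order. The `Fragment` class and both function names are hypothetical, and the sketch ignores overlapping or duplicate fragments, timeouts, and IPv6, all of which a real implementation would have to handle.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    src: str
    dst: str
    proto: int
    ident: int     # IPv4 Identification field
    offset: int    # fragment offset in bytes
    length: int    # payload bytes carried by this fragment
    more: bool     # MF (more fragments) flag


def group_fragments(queue):
    """Group fragments by the fields that identify their original datagram."""
    groups = defaultdict(list)
    for frag in queue:
        groups[(frag.src, frag.dst, frag.proto, frag.ident)].append(frag)
    return groups


def is_complete(frags):
    """True if the fragments cover offset 0 through the final fragment with no holes."""
    if not frags:
        return False
    ordered = sorted(frags, key=lambda f: f.offset)   # copes with out-of-order arrival
    expected = 0
    for frag in ordered:
        if frag.offset != expected:
            return False
        expected += frag.length
    return not ordered[-1].more


queue = [
    Fragment("A", "B", 17, 1, 1480, 1480, True),   # second fragment arrives first
    Fragment("A", "B", 17, 2, 0, 1480, True),      # other datagram, still waiting
    Fragment("A", "B", 17, 1, 0, 1480, True),
    Fragment("A", "B", 17, 1, 2960, 100, False),
]
for key, frags in group_fragments(queue).items():
    print(key, "complete" if is_complete(frags) else "waiting")
```

Even this toy version has to scan and sort per datagram, which is exactly the kind of work a first-come-first-served output queue never does.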
(4) Checksum recomputation: Another thing you'd have to take care of after reassembly is checksum recomputation. Note that earlier in the routing pipeline we have already recomputed the checksum once (after decrementing the TTL), and now in the output queueing stage, we'd have to do it again after reassembly.
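For reference, the IPv4 header checksum is the RFC 1071 one's-complement sum over the header. Reassembly changes the Total Length, flags and fragment offset fields, all of which the checksum covers, so it has to be redone. A minimal sketch (the function name is my own, and the sample header is just an illustrative 20-byte header with its checksum field zeroed):

```python
import struct


def ipv4_header_checksum(header: bytes) -> int:
    """One's-complement checksum over a header whose checksum field is zeroed."""
    if len(header) % 2:
        header += b"\x00"
    # Sum the header as big-endian 16-bit words.
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
print(hex(ipv4_header_checksum(header)))   # 0xb861
```

Routers can usually update the checksum incrementally after a TTL decrement, but a reassembled datagram touches several header fields at once.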
(5) Refragmentation: Another thing you'd have to consider is refragmentation. If in the above example P1, P2 and P3 were of sizes 4000, 4000 and 1500 bytes, and the output interface had an MTU of 9000, would you leave the three fragments alone, or would you refragment into two packets of size 9000 and 500?
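The round figures above ignore two details of IPv4 fragmentation: every fragment carries its own header, and every fragment except the last must carry a multiple of 8 payload bytes. Here is a rough sketch of the arithmetic, assuming the three sizes are payload bytes (9500 in total) and a 20-byte header with no options. The function name `refragment` is made up for illustration.

```python
def refragment(payload_len: int, mtu: int, header_len: int = 20):
    """Return (payload offset, total packet size) for each fragment that fits mtu."""
    # Non-final fragments must carry a multiple of 8 payload bytes.
    max_payload = (mtu - header_len) // 8 * 8
    if max_payload <= 0:
        raise ValueError("MTU too small to carry any payload")
    frags = []
    offset = 0
    while offset < payload_len:
        chunk = min(max_payload, payload_len - offset)
        frags.append((offset, chunk + header_len))
        offset += chunk
    return frags


print(refragment(4000 + 4000 + 1500, mtu=9000))   # [(0, 8996), (8976, 544)]
```

So under these assumptions the "9000 and 500" split comes out as 8996 and 544 bytes on the wire, which is the same idea with the header and alignment overhead included.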
(6) Performance: Finally, you'd have to think about performance. All the above processing would have to be done at line rate, i.e. after every enqueue to the queue. For a router that supports even 10 Gbps line rate performance, you can calculate how fast the reassembly-related processing described above has to happen.
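As a back-of-the-envelope version of that calculation: on Ethernet each frame also occupies 8 bytes of preamble and 12 bytes of inter-frame gap on the wire, so the worst case is a stream of minimum-size 64-byte frames. The function name below is my own.

```python
PREAMBLE_SFD = 8      # bytes of preamble + start-of-frame delimiter
INTERFRAME_GAP = 12   # bytes of mandatory idle time between frames


def per_packet_budget_ns(line_rate_bps: float, frame_bytes: int) -> float:
    """Nanoseconds available per frame if the link is kept fully busy."""
    wire_bits = (frame_bytes + PREAMBLE_SFD + INTERFRAME_GAP) * 8
    return wire_bits / line_rate_bps * 1e9


for size in (64, 1518):
    budget = per_packet_budget_ns(10e9, size)
    print(f"{size}-byte frames: {budget:.1f} ns each, {1e3 / budget:.2f} Mpps")
```

At 10 Gbps that works out to roughly 67 ns per minimum-size frame (about 14.88 million packets per second), and all of the identification, lookup and reassembly steps above would have to fit inside that window.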
In summary I'd say that this is possible in principle, but the practical issues are many. And the benefit does not justify the engineering cost involved in implementing this (read: if you were a buyer, how much more money would you be willing to spend on a router that can reassemble versus a router that can't?). Having said that, if some smart end-user application designer can build an application that can demonstrate superior performance (measured in terms of $$ :-)) by using routers that support reassembly, then it's a different story.