Am I right in saying that an Ethernet frame MTU is 1526 while the MTU
at the IP layer is 1500?
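The Ethernet MTU is 1500 bytes, meaning the largest IP packet (or other payload) an Ethernet frame can carry is 1500 bytes. Adding 26 bytes of Ethernet overhead (an 8-byte preamble and start frame delimiter, a 14-byte header, and a 4-byte FCS) gives a maximum frame on the wire (not the same thing as the MTU) of 1526 bytes. Counting only from the header through the FCS, which is how 802.3 defines the frame, the maximum is 1518 bytes.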
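As a quick check of that arithmetic, here is a small Python sketch for an untagged Ethernet II frame. It ignores the minimum-size padding, which only matters for payloads under 46 bytes; the constant names are just illustrative.

```python
# Ethernet II overhead around the payload, in bytes.
PREAMBLE_SFD = 8  # 7-byte preamble + 1-byte start frame delimiter
HEADER = 14       # destination MAC (6) + source MAC (6) + EtherType (2)
FCS = 4           # frame check sequence (CRC-32)

def frame_size(payload: int, include_preamble: bool = False) -> int:
    """Size of an untagged Ethernet frame carrying `payload` bytes (payload >= 46)."""
    size = HEADER + payload + FCS
    return size + PREAMBLE_SFD if include_preamble else size

MTU = 1500
print(frame_size(MTU))                         # 1518: header + payload + FCS
print(frame_size(MTU, include_preamble=True))  # 1526: the figure from the question
```

The 1500 never changes; only the amount of framing you choose to count does, which is why different sources quote 1518 or 1526.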
Does the MTU change at each phase of encapsulation, or is the term
"MTU" only meant to define the maximum size of a packet at layer 3?
The MTU is often considered a property of a network link, and will generally refer to the layer 2 MTU. The limits at layer 3 are far higher (see below) and cause no issues.
The length of an IPv4 packet (layer 3) is limited by the maximum value of the 16-bit Total Length field in the IP header, which counts the header itself. This results in a maximum payload size of 65515 bytes (= 2^16 - 1 - 20-byte header). IPv6 instead uses a 16-bit Payload Length field that excludes the 40-byte fixed header, so it allows payloads (including any extension headers) of up to 65535 bytes. And using the Jumbo Payload option (RFC 2675), IPv6 can carry payloads of up to 4 GB (2^32 - 1 bytes).
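The layer 3 ceilings follow directly from the width of the length fields. A minimal sketch, assuming option-free IPv4 headers; note that IPv6's Payload Length, unlike IPv4's Total Length, does not count the 40-byte fixed header.

```python
MAX_16BIT = 2**16 - 1  # 65535, the largest value a 16-bit length field can hold

IPV4_HEADER_MIN = 20   # IPv4 header without options
# IPv4 Total Length includes the header, so the header is subtracted.
ipv4_max_payload = MAX_16BIT - IPV4_HEADER_MIN

# IPv6 Payload Length excludes the 40-byte fixed header (but includes
# extension headers), so the full 16-bit range is available.
ipv6_max_payload = MAX_16BIT

# The IPv6 Jumbo Payload option carries a 32-bit length instead.
ipv6_jumbo_max_payload = 2**32 - 1

print(ipv4_max_payload)        # 65515
print(ipv6_max_payload)        # 65535
print(ipv6_jumbo_max_payload)  # 4294967295
```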
When setting up a TCP connection, each side announces a Maximum Segment Size (MSS) in its SYN. This could be considered an MTU at layer 4, but it is not fixed. It is usually derived from the MTU of the endpoint's own interface, so that a full segment fits in a single packet without fragmentation; smaller MTUs further along the path are left to Path MTU Discovery. With an Ethernet MTU of 1500, this MSS would be 1460 after subtracting 40 bytes for the IPv4 and TCP headers (20 bytes each, without options).
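The MSS calculation can be sketched as below, assuming no IP or TCP options (options such as TCP timestamps reduce the usable payload further). The `effective_mss` helper is a hypothetical simplification: a sender must not exceed the peer's announced MSS, and since it also will not exceed its own, the segments actually sent are bounded by the smaller value.

```python
IPV4_HEADER = 20  # no options
IPV6_HEADER = 40  # fixed header, no extension headers
TCP_HEADER = 20   # no options

def mss_for_mtu(mtu: int, ipv6: bool = False) -> int:
    """Largest TCP payload that fits in one IP packet of `mtu` bytes."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - TCP_HEADER

def effective_mss(local_mss: int, peer_mss: int) -> int:
    # Hypothetical helper: segments are limited by the smaller announced value.
    return min(local_mss, peer_mss)

print(mss_for_mtu(1500))             # 1460
print(mss_for_mtu(1500, ipv6=True))  # 1440
print(mss_for_mtu(9000))             # 8960 (jumbo frames)
print(effective_mss(mss_for_mtu(9000), mss_for_mtu(1500)))  # 1460
```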
Responding to individual concerns in the post...
Regarding Path MTU Discovery
Ideally I would be relying on Path MTU discovery. But since the Ethernet packets being generated are too large for any other machine to receive, there is no opportunity for IP "Packet Too Big" fragmentation messages to be returned.
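Based on your diagram, I agree that PMTUD cannot function between two different PCs in the same LAN segment. PMTUD depends on a router along the path returning an ICMP "Fragmentation Needed" (IPv4) or "Packet Too Big" (IPv6) message. Within a single LAN segment there is no router, and an oversized frame is simply discarded by the switch or the receiving NIC before any IP stack sees it, so no ICMP error is ever generated.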
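A toy model makes the difference visible. This is a simplified sketch, not how any real stack is structured: each hop is reduced to a single IP-level size limit and a flag saying whether it is a layer 3 device, the packet is assumed to have DF set, and the switch is assumed to enforce the same 1500-byte limit as the receiving PC.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Hop:
    name: str
    mtu: int         # largest IP packet this hop will pass (simplified)
    is_router: bool  # only a layer 3 device originates ICMP "Fragmentation Needed"

def send(packet_len: int, path: List[Hop]) -> Tuple[bool, Optional[str]]:
    """Return (delivered, feedback the sender receives) for a packet with DF set."""
    for hop in path:
        if packet_len > hop.mtu:
            if hop.is_router:
                return False, f"ICMP Fragmentation Needed from {hop.name}, next-hop MTU {hop.mtu}"
            # A switch or the receiving NIC just discards the oversized frame.
            return False, None
    return True, None

lan = [Hop("NetGear switch", 1500, False), Hop("192.168.1.10", 1500, False)]
print(send(1504, lan))  # (False, None): dropped, and the sender learns nothing
print(send(1500, lan))  # (True, None)
```

With only layer 2 devices in the path, the sender never receives the feedback PMTUD needs.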
Jumbo frames
Some vendors (such as Cisco) have switch models that support Ethernet payloads larger than 1500 bytes. IEEE does not officially endorse this configuration, but the industry has valid reasons to judiciously deviate from the original 1500-byte MTU. I have storage LAN / backup networks that use jumbo frames for good reason; however, I made sure that all MTUs matched inside the same VLAN when I deployed jumbo frames.
Mismatched MTUs within a broadcast domain
The bottom line is that you should never have mismatched Ethernet MTUs inside the same Ethernet broadcast domain; if you do, it's a bug or a configuration error. Whichever it is, you have to fix it, sometimes manually.
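Finding such mismatches is easy once you have an inventory of interface MTUs per VLAN. A minimal sketch, assuming you have already collected `(vlan, host, mtu)` tuples somehow; the inventory below is hypothetical apart from the two addresses from the diagram.

```python
from collections import defaultdict

def find_mtu_mismatches(hosts):
    """hosts: iterable of (vlan, host, mtu).
    Returns {vlan: {mtu: [hosts]}} for every VLAN that contains more than one MTU."""
    by_vlan = defaultdict(lambda: defaultdict(list))
    for vlan, host, mtu in hosts:
        by_vlan[vlan][mtu].append(host)
    return {vlan: dict(mtus) for vlan, mtus in by_vlan.items() if len(mtus) > 1}

# Hypothetical inventory: VLAN 20 uses jumbo frames consistently, VLAN 1 does not agree.
inventory = [
    (1, "192.168.1.98", 1504),
    (1, "192.168.1.10", 1500),
    (20, "storage-a", 9000),
    (20, "storage-b", 9000),
]
print(find_mtu_mismatches(inventory))
# {1: {1504: ['192.168.1.98'], 1500: ['192.168.1.10']}}
```

A consistent jumbo-frame VLAN is not flagged; only a broadcast domain that mixes MTUs is.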
All that discussion leads to the next question...
Why is there a spec that intentionally creates invalid ethernet frames?
I'm not sure that I agree... I don't see how the IEEE 802.3 series or RFC 894 creates invalid frames. Host implementations or host misconfigurations create invalid frames. To determine whether your implementation is following the spec, we need a lot more evidence...
This diagram is at least prima facie evidence that your MTUs are mismatched inside a broadcast domain...
+------------------+          +----------------+          +------------------+
| Realtek PCIe GBe |          | NetGear 10/100 |          | Realtek 10/100   |
|    (on-board)    |          |     Switch     |          |    (on-board)    |
|                  |          +----------------+          |                  |
|    Windows 7     |             ^          ^             |                  |
|                  |             |          |             |                  |
| 192.168.1.98/24  |-------------+          +-------------| 192.168.1.10/24  |
| MTU = 1504 bytes |                                      | MTU = 1500 bytes |
+------------------+                                      +------------------+
How should an 802.3-compliant implementation respond to MTU mismatches?
What was it they [the writers of 'the spec'] expected people to do with devices that generate these too large packets?
An MTU of 1504 and an MTU of 1500 within the same broadcast domain is simply a misconfiguration; it should never be expected to work, any more than mismatched IP netmasks or mismatched IP subnets can be expected to work. Your company will have to knuckle down and fix the root cause of the MTU mismatches... at this point it's hard to say whether the root cause is user error, an implementation bug, or some combination of the two.
If the affected Windows machines successfully log in to an Active Directory domain, you could write Windows login scripts that automatically fix MTU issues based on some well-constructed tests (assuming the domain controller isn't part of the MTU issues).
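One possible shape for such a test is a Don't Fragment ping sized to fill an IP packet of a given MTU: on a broken machine, a probe sized for its local 1504-byte MTU towards a known-good 1500-byte host times out while a 1500-byte probe succeeds. The sketch below only builds the command lines; it assumes the standard Windows `ping` flags (`-f`, `-l`, `-n`) and `netsh interface ipv4 set subinterface` syntax, and the interface name is hypothetical. A real login script would run these (for example with `subprocess.run`) and act on the results.

```python
from typing import List

IPV4_HEADER = 20
ICMP_HEADER = 8

def df_ping_command(target: str, mtu: int) -> List[str]:
    """Windows ping sending one unfragmentable echo request that fills an IP packet of `mtu` bytes."""
    payload = mtu - IPV4_HEADER - ICMP_HEADER
    # -f: set Don't Fragment, -l: ICMP payload size, -n: number of echo requests
    return ["ping", "-f", "-l", str(payload), "-n", "1", target]

def set_mtu_command(interface: str, mtu: int = 1500) -> List[str]:
    # Persistent per-interface IPv4 MTU; `interface` is a hypothetical name here.
    return ["netsh", "interface", "ipv4", "set", "subinterface",
            interface, f"mtu={mtu}", "store=persistent"]

print(df_ping_command("192.168.1.10", 1500))  # payload 1472
print(df_ping_command("192.168.1.10", 1504))  # payload 1476: expected to fail on a 1500-byte LAN
print(set_mtu_command("Local Area Connection"))
```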
If the machines are not logging into a domain, manual labor is another option.
Other possibilities to contain the damage
Use a layer 3 switch (see Note 1) to build a custom VLAN for anything that has broken MTUs, and set the layer 3 switch's Ethernet MTU on that VLAN to match the broken machines; this relies on PMTUD to resolve MTU issues at the IP layer. Layer 3 switches generate the ICMP errors required by PMTUD.
This option works best if you can re-address the broken machines with DHCP and identify them by MAC address.
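The reason routing fixes the problem is that each router that cannot forward a DF packet reports its next-hop MTU, and the sender lowers its estimate and retries. A simplified simulation in the spirit of RFC 1191, assuming every router on the path reports its outgoing MTU; the scenario (a 1504-byte quarantine VLAN routed onto the normal 1500-byte VLAN) is illustrative.

```python
from typing import List, Tuple

def discover_path_mtu(first_hop_mtu: int, router_mtus: List[int]) -> Tuple[int, int]:
    """Return (path MTU, attempts) when each router reports its next-hop MTU via ICMP."""
    pmtu = first_hop_mtu
    attempts = 0
    while True:
        attempts += 1
        # The first router whose outgoing link is smaller than the packet answers with ICMP.
        reported = next((m for m in router_mtus if m < pmtu), None)
        if reported is None:
            return pmtu, attempts  # a packet of size pmtu got through
        pmtu = reported  # lower the estimate to the reported MTU and retry

# Broken host (MTU 1504) in its own VLAN; the layer 3 switch routes onto a 1500-byte VLAN.
print(discover_path_mtu(1504, [1500]))  # (1500, 2)
```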
... why did they bump it up to 1504 bytes, and create invalid packets, in the first place?
Hard to say at this point.
802.1ad vs 802.1Q
How is IEEE 802.1ad (aka VLAN Tagging, QinQ) valid, when the packets are too large?
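I haven't seen evidence so far that you're using QinQ; from the limited evidence I have seen, you're using simple 802.1Q encapsulation, which should work correctly in Windows, assuming the NIC driver supports 802.1Q. IEEE 802.3 allows a frame carrying a single 802.1Q tag to be 4 bytes longer than an untagged one (1522 instead of 1518 bytes, excluding the preamble), so the tag does not eat into the 1500-byte payload.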
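Each VLAN tag adds 4 bytes to the frame while leaving the payload (and so the MTU) untouched. A small sketch of the resulting maximum frame sizes, counted from the header through the FCS:

```python
HEADER, FCS, TAG = 14, 4, 4  # bytes; TAG is one 802.1Q or 802.1ad tag

def max_frame(mtu: int = 1500, tags: int = 0) -> int:
    """Largest frame (header through FCS, no preamble) for an `mtu`-byte payload with `tags` VLAN tags."""
    return HEADER + tags * TAG + mtu + FCS

print(max_frame())        # 1518: untagged
print(max_frame(tags=1))  # 1522: single 802.1Q tag
print(max_frame(tags=2))  # 1526: QinQ (802.1ad outer tag + 802.1Q inner tag)
```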
End Notes:
Note 1: Any layer 3 switch should do... Cisco, Juniper, and Brocade switches could all perform this kind of function.
Best Answer
The entire frame has to be at least 64 bytes. This is not just the payload; it includes the headers and the frame check sequence. The FCS takes up 4 bytes at the end. An Ethernet header consists of two 6-byte MAC addresses plus a 2-byte type field, 14 bytes in total. 64 - 4 - 14 = 46, so the Ethernet payload must be at least 46 bytes. An IPv4 packet carries its own header of at least 20 bytes inside that payload, so it needs at least 26 bytes of payload of its own to fill a minimum-size frame without padding. TCP and UDP add more headers on top of that.
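The same arithmetic as a short Python sketch, with a helper that computes how much padding a sender has to add for a given layer 3 packet size (header and payload sizes assume no IP options):

```python
MIN_FRAME = 64
HEADER, FCS = 14, 4
MIN_PAYLOAD = MIN_FRAME - HEADER - FCS  # 46

def padding_needed(l3_packet_len: int) -> int:
    """Pad bytes appended so the frame reaches the 64-byte minimum."""
    return max(0, MIN_PAYLOAD - l3_packet_len)

IPV4_HEADER = 20
print(MIN_PAYLOAD)                   # 46
print(MIN_PAYLOAD - IPV4_HEADER)     # 26: IPv4 payload that exactly fills a minimum frame
print(padding_needed(20 + 20))       # 6: IPv4 + TCP headers, no data
print(padding_needed(20 + 8 + 100))  # 0: UDP datagram with 100 data bytes
```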
Another thing to note is that a minimum-length frame actually occupies more than 64 bytes on the wire: every frame is preceded by an 8-byte preamble/start frame delimiter and followed by a minimum interframe gap of 12 byte times, so a 64-byte frame takes up 64 + 8 + 12 = 84 bytes of wire time.
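This per-frame overhead is what determines the maximum frame rate of a link. A small sketch, treating the interframe gap as 12 byte times of idle line:

```python
PREAMBLE_SFD = 8  # bytes sent before every frame
IFG = 12          # minimum interframe gap, in byte times

def wire_bytes(frame_len: int) -> int:
    """Byte times a frame of `frame_len` bytes (header through FCS) occupies on the wire."""
    return frame_len + PREAMBLE_SFD + IFG

def max_frames_per_second(link_bps: float, frame_len: int) -> float:
    return link_bps / (wire_bytes(frame_len) * 8)

print(wire_bytes(64))                           # 84
print(round(max_frames_per_second(1e9, 64)))    # 1488095 minimum-size frames/s at 1 Gb/s
print(round(max_frames_per_second(1e9, 1518)))  # 81274 maximum-size frames/s at 1 Gb/s
```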
The 41-byte answer on the other question is only considering TCP and IP headers. If you send a TCP packet with 0 data bytes, it will have at least 40 bytes of headers (20 for IPv4 and 20 for TCP, without options); it's not possible to make a valid TCP packet smaller than this. But if you try to send this packet, it will get zero-padded out to 46 bytes before the Ethernet FCS is attached.
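To see the padding in an actual byte layout, here is a sketch that assembles a minimum frame around 40 placeholder bytes standing in for the IPv4 and TCP headers. It assumes, as is common in software, that `zlib.crc32` computes the same CRC-32 as the 802.3 FCS and that the result is appended least significant byte first; the MAC addresses are hypothetical.

```python
import struct
import zlib

MIN_PAYLOAD = 46

def build_frame(dst: bytes, src: bytes, ethertype: int, payload: bytes) -> bytes:
    """Assemble an Ethernet II frame: header, payload zero-padded to 46 bytes, then FCS."""
    header = dst + src + struct.pack("!H", ethertype)
    if len(payload) < MIN_PAYLOAD:
        payload += bytes(MIN_PAYLOAD - len(payload))  # zero padding
    body = header + payload
    fcs = struct.pack("<I", zlib.crc32(body))  # CRC-32 over header + payload
    return body + fcs

tcp_ip_headers = bytes(40)  # placeholder for a 20-byte IPv4 + 20-byte TCP header, no data
frame = build_frame(b"\xff" * 6, b"\x02\x00\x00\x00\x00\x01", 0x0800, tcp_ip_headers)
print(len(frame))  # 64: 14 header + 40 headers + 6 padding + 4 FCS
```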
The reason this was originally done with Ethernet was to ensure a minimum frame length on the wire so that collisions could be reliably detected by all devices over the specified maximum cable length. This is required because early incarnations of 10M Ethernet used a shared coaxial medium and connected devices had to be able to detect when two of them tried to transmit on the shared medium at the same time. Slightly less ancient 10M and 100M Ethernet networks over twisted pair that were built with hubs instead of switches also needed to be able to detect collisions. However, most modern Ethernet networks are switched and do not use a shared medium, so this is no longer strictly necessary, but it's still part of the spec for compatibility reasons. Frames shorter than 64 bytes are called runt frames, and if you see runt frames in a network that usually indicates some sort of configuration or hardware issue.