Responding to individual concerns in the post...
Regarding Path MTU Discovery
Ideally I would be relying on Path MTU Discovery. But since the ethernet packets being generated are too large for any other machine to receive, there is no opportunity for the ICMP "fragmentation needed / packet too big" messages to be returned.
Based on your diagram, I agree that PMTUD cannot function between two different PCs in the same LAN segment; PCs do not generate the ICMP error messages required by PMTUD.
Jumbo frames
Some vendors (such as Cisco) have switch models which support ethernet payloads larger than 1500 bytes. Officially, the IEEE does not endorse this configuration, but the industry has valid needs to judiciously deviate from the original 1500 byte MTU. I have storage LAN / backup networks which leverage jumbo frames for good reason; however, when I deployed jumbo frames I made sure that all MTUs matched inside the same vlan.
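For illustration, on many Catalyst-style switches enabling jumbo frames is a global setting; the exact command, maximum size, and reload requirement vary by platform, and the 9000 below is just an example value:

Switch(config)#system mtu jumbo 9000
Switch(config)#end
Switch#reload

The important part is that every host NIC and every switch port inside that vlan is then configured to the same payload MTU.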
Mismatched MTUs within a broadcast domain
The bottom line is that you should never have mismatched ethernet MTUs inside the same ethernet broadcast domain; if you do, it's a bug or configuration error. Regardless of bug or error, you have to solve these problems, sometimes manually.
All that discussion leads to the next question...
Why is there a spec that intentionally creates invalid ethernet frames?
I'm not sure that I agree... I don't see how the IEEE 802.3 series or RFC 894 creates invalid frames. Host implementations or host misconfigurations create invalid frames. To understand whether your implementation is following the spec, we need a lot more evidence...
This diagram is at least prima facie evidence that your MTUs are mismatched inside a broadcast domain...
+------------------+     +----------------+     +------------------+
| Realtek PCIe GBe |     | NetGear 10/100 |     | Realtek 10/100   |
| (on-board)       |     |     Switch     |     | (on-board)       |
|                  |     +----------------+     |                  |
| Windows 7        |       ^            ^       |                  |
|                  |       |            |       |                  |
| 192.168.1.98/24  |-------+            +-------| 192.168.1.10/24  |
| MTU = 1504 bytes |                            | MTU = 1500 bytes |
+------------------+                            +------------------+
How should an 802.3-compliant implementation respond to MTU mismatches?
What was it they [the writers of 'the spec'] expected people to do with devices that generate these too large packets?
MTU 1504 and MTU 1500 within the same broadcast domain is simply a misconfiguration; it should never be expected to work any more than mismatched IP netmasks or mismatched IP subnets can be expected to work. Your company will have to knuckle down and fix the root cause of the MTU mismatches... at this time it's hard to say whether the root cause is user error, an implementation bug, or some combination of the above.
If the affected Windows machines are successfully logging in to an Active Directory Domain, one could write Windows login scripts to automatically fix MTU issues based on some well-constructed tests inside the domain login scripts (assuming the Domain Controller isn't part of the MTU issues).
If the machines are not logging into a domain, manual labor is another option.
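As a rough sketch of the login-script approach, a couple of netsh commands can report and pin the MTU on Windows 7 machines; the interface name "Local Area Connection" and the target MTU of 1500 are assumptions to adjust for your environment:

rem show the current MTU of every interface
netsh interface ipv4 show subinterfaces
rem force the assumed interface back to MTU 1500 and persist it across reboots
netsh interface ipv4 set subinterface "Local Area Connection" mtu=1500 store=persistent

The same two commands work from an elevated command prompt if you end up doing it by hand.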
Other possibilities to contain the damage
Use a layer3 switch (see Note 1) to build a custom vlan for anything that has broken MTUs, and set the layer3 switch's ethernet MTU to match the broken machines; this relies on PMTUD to resolve MTU issues at the IP layer. Layer3 switches generate the ICMP errors required by PMTUD.
This option works best if you can re-address the broken machines with DHCP; and you can identify the broken machines by mac-address.
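A very rough Catalyst-style sketch of that containment vlan follows; the vlan number, addressing, port, and MTU values are all assumptions, and the exact MTU commands vary considerably by platform (a system MTU change usually requires a reload):

Switch(config)#system mtu 1504
Switch(config)#vlan 99
Switch(config-vlan)#name BROKEN-MTU-HOSTS
Switch(config-vlan)#exit
Switch(config)#interface Vlan99
Switch(config-if)#ip address 192.168.99.1 255.255.255.0
Switch(config-if)#ip mtu 1504
Switch(config-if)#exit
Switch(config)#interface FastEthernet0/10
Switch(config-if)#switchport access vlan 99

With the broken hosts isolated behind that SVI, the layer3 switch can generate the ICMP errors PMTUD needs when those hosts talk to the rest of the network.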
... why did they bump it up to 1504 bytes, and create invalid packets, in the first place?
Hard to say at this point
802.1ad vs 802.1q
How is IEEE 802.1ad (aka VLAN Tagging, QinQ) valid, when the packets are too large?
I haven't seen evidence so far that you're using QinQ; from the limited evidence I have seen so far, you're using simple 802.1q encapsulation, which should work correctly in Windows, assuming the NIC driver supports 802.1q encap. A single 802.1q tag only adds 4 bytes to the frame, which may well be why your NIC is reporting an MTU of 1504.
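For reference, the tag overhead is small and fixed; these are the standard maximum frame sizes, not something measured on your network:

Untagged ethernet:   1500 byte payload + 18 bytes header/FCS           = 1518 byte frame
802.1q (one tag):    1500 byte payload + 18 bytes header/FCS + 4 bytes = 1522 byte frame
802.1ad / QinQ:      1500 byte payload + 18 bytes header/FCS + 8 bytes = 1526 byte frame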
End Notes:
Note 1: Any layer 3 switch should do... Cisco, Juniper, and Brocade could all perform this kind of function.
DISCLAIMER: Don't run debugs on equipment doing even remotely useful stuff unless you have to.
Also, you've specified a console cable, which is good because debug output normally goes to the console session. But if you connect via SSH you won't see any debug output until you type
Router1#terminal monitor
As YLearn notes, by default this will only show packets addressed to the router you're debugging. To show transit packets as well, you'll need to run the following on the interfaces you expect the packets to pass through.
R1(config-if)#no ip route-cache
This forces the packets to be process-switched in software between the interfaces rather than using fast switching / CEF, which is what makes transit packets visible to the debug. As such, you should only do it in a test environment because it slows the process of sending and receiving data.
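When you're done testing, remember to put the interfaces back. On most IOS versions (assuming the interfaces were previously at their defaults) that is simply:

R1(config-if)#ip route-cache
R1(config-if)#ip route-cache cef

The second line restores CEF switching on the interface where CEF is in use globally.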
Using image c2600-adventerprisek9-mz.124-15.T14.bin provides some functions that may help, and Cisco's documentation for ping in version 12.1 (the first hit I got on Google) suggests they have been around in general for some time.
Assuming you want to monitor Router1 for pings coming from Router2 and they are at 10.0.0.1 and 10.0.0.2 respectively, you could run
Router1#debug ip icmp
on Router1, and whenever you send pings over from Router2 you'll see something like
*Mar 1 00:02:30.530: ICMP: echo reply sent, src 10.0.0.1, dst 10.0.0.2
*Mar 1 00:02:30.622: ICMP: echo reply sent, src 10.0.0.1, dst 10.0.0.2
*Mar 1 00:02:30.674: ICMP: echo reply sent, src 10.0.0.1, dst 10.0.0.2
which simply shows that Router1 replied (so obviously received the pings).
Type
Router1#undebug all
to switch off this particular debug.
If you go with
Router1#debug ip packet
and send some pings over from Router2, you'll see more detail:
*Mar 1 00:15:42.961: IP: tableid=0, s=10.0.0.2 (FastEthernet0/0), d=10.0.0.1 (FastEthernet0/0), routed via RIB
*Mar 1 00:15:42.961: IP: s=10.0.0.2 (FastEthernet0/0), d=10.0.0.1 (FastEthernet0/0), len 100, rcvd 3
Which tells you the source address and interface and the destination address and interface.
Finally, if you go with
Router1#debug ip packet detail
Then each ping will show this:
*Mar 1 00:19:15.069: IP: tableid=0, s=10.0.0.2 (FastEthernet0/0), d=10.0.0.1 (FastEthernet0/0), routed via RIB
*Mar 1 00:19:15.069: IP: s=10.0.0.2 (FastEthernet0/0), d=10.0.0.1 (FastEthernet0/0), len 100, rcvd 3
*Mar 1 00:19:15.073: ICMP type=8, code=0
*Mar 1 00:19:15.073: IP: tableid=0, s=10.0.0.1 (local), d=10.0.0.2 (FastEthernet0/0), routed via FIB
*Mar 1 00:19:15.073: IP: s=10.0.0.1 (local), d=10.0.0.2 (FastEthernet0/0), len 100, sending
*Mar 1 00:19:15.077: ICMP type=0, code=0
Which gives you the same details as the previous debug, but also tells you that the packet was 100 bytes in length and was ICMP, along with the ICMP type: 8 is the echo request (the actual ping), 0 is the echo reply.
Best Answer
Sending a good ten minutes of 0-interval MTU-sized DF pings with contents 0x0000, and a second test with contents 0xffff, is an excellent way to apply some stress to simple transmission technologies. Lost packets -- or overly delayed packets after the first few packets -- are a clear indication that further investigation is required. It's also a good moment to check that the reported round-trip time is reasonable (it's very easy for a transmission provider to provision a circuit which crosses the country and back rather than crossing the city).
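On a Cisco router, one way to approximate this test is with the extended ping keywords; the address, size, and repeat count below are placeholders, and IOS sends each packet as soon as the previous reply or timeout returns, which is about as close to a 0-interval as ping gets:

Router1#ping 10.0.0.2 size 1500 df-bit data 0000 repeat 100000 timeout 1
Router1#ping 10.0.0.2 size 1500 df-bit data FFFF repeat 100000 timeout 1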
Ping is great for finding faults. However, ping alone isn't a great acceptance test for being sure there are no faults. The rest of this answer explains why.
As part of the ping test you should connect to each of your network elements on the path (hosts, switches, routers) and record the traffic and error counters before the start and after the end of the test. Rising error counters of any type require further investigation. Don't ignore small rises in error counters: even a low rate of loss will devastate TCP performance.
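On IOS devices, for example, a pair of snapshots around the test is enough; the interface name is a placeholder and the exact counter names differ between platforms:

Router1#show interfaces GigabitEthernet0/1 | include error|drop|CRC

Take one reading before the pings and one after, and investigate anything that incremented; the same idea applies to switch ports and to the hosts' own NIC statistics.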
This still isn't to say that the link is acceptable. Take 1000Base-LX, ethernet over single-mode fiber. Suppose the light level at the receiver is below the specification for that transceiver model, but we happen to have an above-average sample of the transceiver, so all appears well. Then that transceiver fails and we replace it with a below-average-but-within-specification sample: the link cannot be restored to service even though we have fixed the fault. So as part of the acceptance testing we need to check that light levels are within specification at both ends, and we need to check that there is a viable power budget at the extremes of both the transmitting and the receiving transceiver's performance. (To make this easy, manufacturers give their SFPs nominal ranges where they have done the power budget calculations, such as 10Km for 1000Base-LX/LH; but for any link longer than 10Km you should do your own power budget: five minutes of arithmetic can save you hundreds of dollars by allowing you to safely purchase a lower-power SFP.) SFPs often have a feature called DOM which allows you to check the received light level from the device's command line.
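As a worked example of that arithmetic, using illustrative figures from a typical 1000Base-LX datasheet rather than anything measured on a real link:

Worst-case transmit power:       -9.5 dBm
Receiver sensitivity:           -19.0 dBm
Available power budget:           9.5 dB

25 km of fibre at 0.4 dB/km:     10.0 dB
Connectors, splices and margin:   2.0 dB
Required budget:                 12.0 dB  -> does not fit; a longer-reach optic is needed

Run the same arithmetic the other way and it tells you when a cheaper, lower-power SFP is perfectly safe.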
More complex transmission technologies have forward error correction. The link appears to work under a high transmission error rate, but if the errors become more frequent or more sustained then the FEC is overwhelmed and the transmission passes rubbish. So for these links we are very interested in the error-correction counters. Interpreting those FEC counters requires understanding the physical transmission, as we're now low enough in the "stack" that we can no longer pretend the medium is naturally free of errors. But even in these systems a simple ping test applies enough stress to give initial results.
Finally, you should be aware that PCs are a cheap but not perfect test platform, so sometimes packet drops are caused by the end systems rather than the transmission. This can be a simple IP-layer issue (such as an MTU inconsistent with the rest of the subnet, always a possibility when backbone links should be running with an MTU > 9000) or a host performance issue (particularly above 10Gbps). The cost of "real" ethernet test platforms is extraordinarily high because you're paying for those issues to have been fully sorted via hardware or clever software (e.g., running within the NIC).