Ethernet magnetics do vary, in particular in turns ratio between the network and circuit sides, and in other details like center taps and baluns. The ethernet signalling voltages are well defined, but PHYs operate at different voltages. Since ethernet is transformer coupled, that's no problem: the transformer can be built with whatever turns ratio is needed. The PHY then also has to drive the right impedance so that it comes out to the twisted-pair ethernet impedance of 100 Ω differential on the network side. The only way to know what your PHY wants is to look at its datasheet, which should tell you what transformer ratio it requires.
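To make the turns ratio point concrete, here is a minimal sketch assuming an ideal, lossless transformer, where voltage scales with the turns ratio and impedance with its square. The function names and the example ratio are illustrative only and are not taken from any PHY datasheet.

```python
def circuit_side_impedance(line_impedance_ohms, turns_ratio_circuit_to_line):
    """Impedance needed on the circuit (PHY) side of an ideal transformer
    so that the network side sees line_impedance_ohms.

    turns_ratio_circuit_to_line is N_circuit / N_line.
    Impedance is reflected by the square of the turns ratio.
    """
    return line_impedance_ohms * turns_ratio_circuit_to_line ** 2


def line_side_voltage(circuit_voltage, turns_ratio_circuit_to_line):
    """Voltage appearing on the network side for a given circuit-side
    voltage, again assuming an ideal transformer."""
    return circuit_voltage / turns_ratio_circuit_to_line


if __name__ == "__main__":
    z_line = 100.0   # assumed line impedance in ohms; substitute your own figure
    ratio = 2.0      # hypothetical 2:1 (circuit:line) transformer
    print(circuit_side_impedance(z_line, ratio))  # impedance the PHY side must present
    print(line_side_voltage(5.0, ratio))          # 5 V on the circuit side, seen from the line
```

A PHY that swings a larger voltage than the line needs would use a step-down ratio toward the network, and then has to present a correspondingly higher impedance on its own side.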
POE versus not is mostly an issue of providing center taps on the network side. Transformers can also vary in whether they include a balun (common mode choke) on the network side or not. Note that this also gives the transformer a distinct network side and circuit side. Even with a 1:1 ratio you shouldn't flip it around, since the balun is pointless on the circuit side. The purpose of the balun is to keep high frequencies that might be present on the board from leaking out via the cable and therefore emitting too much RF.
The correct answer is that the ethernet specification requires it.
Although you didn't ask, others may wonder why this method of connection was chosen for that type of ethernet. Keep in mind that this applies only to the point-to-point ethernet varieties, like 10base-T and 100base-T, not to the original ethernet or to ThinLan ethernet.
The problem is that ethernet can support fairly long runs such that equipment on different ends can be powered from distant branches of the power distribution network within a building or even different buildings. This means there can be significant ground offset between ethernet nodes. This is a problem with ground-referenced communication schemes, like RS-232.
There are several ways of dealing with ground offsets in communications lines, with the two most common being opto-isolation and transformer coupling. Transformer coupling was the right choice for ethernet given the tradeoffs between the methods and what ethernet was trying to accomplish. Even the earliest version of ethernet that used transformer coupling runs at 10 Mbit/s. This means, at the very least, the overall channel has to support 10 MHz digital signals, although in practice with the encoding scheme used it actually needs twice that. Even a 10 MHz square wave has levels lasting only 50 ns. That is very fast for opto-couplers. There are light transmission means that go much much faster than that, but they are not cheap or simple at each end like the ethernet pulse transformers are.
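The 50 ns figure is just half the period of the square wave. A quick sketch of that arithmetic, with a hypothetical helper name:

```python
def shortest_level_ns(square_wave_hz):
    """Duration of a single high or low level of a square wave, in ns.

    One period contains one high level and one low level, so each
    level lasts half a period.
    """
    period_ns = 1e9 / square_wave_hz
    return period_ns / 2


if __name__ == "__main__":
    print(shortest_level_ns(10e6))  # 10 MHz square wave
```

Any isolation device in the signal path has to switch cleanly well within that time, which is what makes ordinary opto-couplers a poor fit.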
One disadvantage of transformer coupling is that DC is lost. That's actually not that hard to deal with. You make sure all information is carried by modulation fast enough to make it thru the transformers. If you look at the ethernet signalling, you will see how this was considered.
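As one example of such signalling, 10base-T uses Manchester encoding, which puts a transition in the middle of every bit so that each bit spends equal time high and low. The sketch below assumes the IEEE 802.3 polarity convention (a 0 is high then low, a 1 is low then high) and uses +1 and -1 as stand-ins for the actual line levels; the function names are made up for illustration.

```python
def manchester_encode(bits):
    """Return a list of half-bit levels (+1 or -1) for a sequence of 0/1 bits."""
    levels = []
    for bit in bits:
        if bit:
            levels.extend([-1, +1])   # 1: low for the first half, then high
        else:
            levels.extend([+1, -1])   # 0: high for the first half, then low
    return levels


def dc_component(levels):
    """Average level of the waveform. Zero means there is no DC content
    for the transformer to lose."""
    return sum(levels) / len(levels)


if __name__ == "__main__":
    data = [1, 1, 1, 1, 0, 1, 0, 0]
    encoded = manchester_encode(data)
    print(encoded)
    print(dc_component(encoded))
```

Whatever the data pattern, even a long run of identical bits, the average stays at zero and there is always a transition at least once per bit, so nothing needs to get thru the transformer at DC.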
There are nice advantages to transformers too, like very good common mode rejection. A transformer only "sees" the voltage across its windings, not the common voltage both ends of the winding are driven to simultaneously. You get a differential front end without a deliberate circuit, just basic physics.
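A small numeric sketch of that idea, treating the transformer as ideal so that it responds only to the difference between the two ends of its winding. The voltages and function names are illustrative assumptions.

```python
def add_common_mode(v_plus, v_minus, v_common):
    """Shift both wires of a pair by the same common mode voltage,
    as a ground offset between two nodes would."""
    return v_plus + v_common, v_minus + v_common


def winding_voltage(v_plus, v_minus):
    """What an ideal transformer winding 'sees': only the difference."""
    return v_plus - v_minus


if __name__ == "__main__":
    signal = (1.0, -1.0)                       # differential signal, volts
    offset = add_common_mode(*signal, 50.0)    # both wires lifted by 50 V
    print(winding_voltage(*signal))
    print(winding_voltage(*offset))            # same as without the offset
```

A real transformer has some stray capacitance between windings, so rejection is not perfect at high frequencies, but the basic behavior comes for free.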
Once transformer coupling was decided on, it was easy to specify a high isolation voltage without creating much of a burden. Making a transformer that insulates the primary from the secondary to a few hundred volts pretty much happens unless you try not to. Making it good to 1000 V isn't much harder or much more expensive. Given that, ethernet can be used to communicate between two nodes actively driven to significantly different voltages, not just to deal with a few volts of ground offset. For example, it is perfectly fine and within the standard to have one node riding on a power line phase with the other referenced to the neutral.
Best Answer
Yes, it needs to be high voltage, although it does little useful considering where it is connected. Ethernet is transformer-isolated, and specifies a fairly high isolation voltage. I don't remember what the spec says exactly. 2 kV is likely the max plus some margin. I know that the isolation spec is high enough to be able to connect between ground-based equipment and other equipment riding on a 250 V AC power line.
That said, the capacitor in that location doesn't make a lot of sense. About all it does is filter common mode voltage a bit. I suppose the point is to slow down common mode spikes to the point that whatever stray common-mode-to-differential coupling there might be across the transformer won't corrupt the signal.
Whenever I've put capacitors on the network side of ethernet it's been to reduce emissions coming from the board. For that purpose, it is better to use a transformer with common mode chokes built in (I often use the Pulse H2019), then put small caps (like 47 pF or less) to ground on each of the lines. The apparent differential capacitance will be half that, but the full capacitance will attenuate RF that got coupled onto the line from the board.
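To put rough numbers on that, here is a sketch assuming two equal ideal capacitors, one from each line to ground. Differentially the two caps appear in series, while each line individually sees the full value to ground. The 100 MHz frequency is only an assumed example of board noise, and the helper names are hypothetical.

```python
import math


def series_capacitance(c1, c2):
    """Capacitance of two capacitors in series, in farads.
    This is what the differential signal sees across the pair."""
    return (c1 * c2) / (c1 + c2)


def capacitor_impedance_ohms(c_farads, freq_hz):
    """Magnitude of an ideal capacitor's impedance at freq_hz."""
    return 1.0 / (2.0 * math.pi * freq_hz * c_farads)


if __name__ == "__main__":
    c = 47e-12          # 47 pF from each line to ground
    f_noise = 100e6     # assumed RF noise frequency, not from the original answer
    print(series_capacitance(c, c))              # differential loading, farads
    print(capacitor_impedance_ohms(c, f_noise))  # shunt path per line at f_noise, ohms
```

With 47 pF per line, the differential loading is 23.5 pF, while each line has a shunt path to ground of roughly 34 Ω at the assumed 100 MHz.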