At some point in my life, I ran the USB business for a big semiconductor company. The best result I remember was an NEC SATA controller capable of pushing 320 Mbps of actual data throughput for mass storage; current SATA drives are probably capable of this or slightly more. This was using BOT (Bulk-Only Transport, the mass storage protocol that runs over USB).
I could give a detailed technical answer, but I think you can deduce it yourself. What you need to see is that this is an ecosystem play: any significant improvement would require somebody like Microsoft to change and optimize their stack, which is not going to happen. Interoperability is far more important than speed. Existing stacks carefully cover for the mistakes of a slew of devices out there, because when the USB 2.0 spec came out, the initial devices probably didn't conform to it very well: the spec was buggy, the certification system was buggy, and so on. If you build a home-brew system using Linux or custom USB host drivers for Windows, plus a fast device controller, you can probably get close to the theoretical limits.
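To put a number on the "theoretical limits" above, here is a minimal sketch of the USB 2.0 high-speed bulk ceiling, assuming an otherwise idle bus, 512-byte packets, and the spec's limit of 13 bulk transactions per 125 us microframe. It ignores BOT command/status overhead, so real mass storage throughput will sit below it; the 320 Mbps figure is the one quoted earlier.

```python
# Rough ceiling for USB 2.0 high-speed bulk payload throughput on an otherwise idle bus.
# Ignores BOT command/status overhead (CBW/CSW) and host-controller scheduling gaps.

MICROFRAMES_PER_SECOND = 8000     # high-speed microframe = 125 us
BULK_MAX_PACKET_BYTES = 512       # maximum high-speed bulk packet size
BULK_PACKETS_PER_MICROFRAME = 13  # maximum bulk transactions that fit in one microframe


def hs_bulk_max_bytes_per_second() -> int:
    """Upper bound on bulk payload bytes per second in one direction."""
    return BULK_PACKETS_PER_MICROFRAME * BULK_MAX_PACKET_BYTES * MICROFRAMES_PER_SECOND


def fraction_of_ceiling(measured_mbps: float) -> float:
    """Fraction of the bulk ceiling reached by a measured payload rate in Mbit/s."""
    ceiling_mbps = hs_bulk_max_bytes_per_second() * 8 / 1e6
    return measured_mbps / ceiling_mbps


if __name__ == "__main__":
    ceiling = hs_bulk_max_bytes_per_second()
    print(f"bulk ceiling: {ceiling} B/s = {ceiling * 8 / 1e6:.1f} Mbit/s")
    print(f"320 Mbit/s is {fraction_of_ceiling(320):.0%} of that ceiling")
```

On these assumptions the ceiling works out to about 426 Mbit/s of payload, so 320 Mbit/s is roughly three quarters of what the bus itself could carry.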
In terms of streaming, isochronous (ISO) transfers are supposed to be very fast, but controllers do not implement them very well, since 95% of applications use bulk transfers.
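To put the isochronous path in numbers, this sketch computes the bandwidth a single high-speed ISO endpoint can reserve. It assumes the endpoint is serviced every microframe (bInterval = 1) and uses the spec maxima of 1024-byte packets and up to 3 transactions per microframe for "high-bandwidth" endpoints; the function name is illustrative only.

```python
# Bandwidth one high-speed isochronous endpoint can reserve, assuming it is
# serviced every microframe (bInterval = 1).

MICROFRAMES_PER_SECOND = 8000
ISO_MAX_PACKET_BYTES = 1024      # maximum high-speed isochronous packet size
ISO_MAX_TRANSACTIONS = 3         # "high-bandwidth" endpoints: up to 3 per microframe


def hs_iso_bytes_per_second(packet_bytes: int, transactions: int) -> int:
    """Reserved payload bytes per second for a single high-speed ISO endpoint."""
    if not 1 <= packet_bytes <= ISO_MAX_PACKET_BYTES:
        raise ValueError("packet size must be 1..1024 bytes")
    if not 1 <= transactions <= ISO_MAX_TRANSACTIONS:
        raise ValueError("transactions per microframe must be 1..3")
    return packet_bytes * transactions * MICROFRAMES_PER_SECOND


if __name__ == "__main__":
    best = hs_iso_bytes_per_second(1024, 3)
    print(f"max per ISO endpoint: {best} B/s = {best * 8 / 1e6:.1f} Mbit/s")
```

The appeal of ISO is that this bandwidth is reserved and scheduled every microframe, not that a single endpoint can exceed bulk; whether a given host controller actually delivers it cleanly is the implementation question raised above.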
As a bonus insight: if you go and build a hub IC today and follow the spec to the letter, you will sell practically zero chips. If you know all the bugs in the market and make sure your hub IC tolerates them, you can probably get into the market. I am still amazed at how well USB works, given the amount of bad software and chips out there.
My understanding of your design is that the entire device is on a single PCB, within a single enclosure, and connected to the host by a single USB cable. You've integrated a hub onto the PCB to allow both devices to communicate with the PC. The following answer hinges on these assumptions; if the device is made of several separate units connected by detachable cables, that changes things.
In this case, I suggest that you simply configure the hub to enumerate as a high-power device and share the resulting 500 mA among the whole board. Interestingly enough, TI's ganged-port sample schematic shows the downstream devices all connected together, even when using their power management IC.
The incoming 5 V supply line (highlighted in blue, as it's one of the two nets we're interested in on this complicated schematic) is connected to a TPS2041 power management IC (a generous description: it's really just a FET that shuts off when it detects 500 mA of current passing through it). However, all of its inputs are shorted together, as are all of its outputs, which are then distributed to each of the downstream ports (the net shown in red).
Basically, they're doing overcurrent protection for all of the downstream sections in a single IC. They have no way of detecting whether they have three low-power (100 mA) units, a single high-power unit, or two low-power units and one 300 mA unit. All these options are acceptable under this reference design. You wrote:
According to the USB specification, a bus-powered hub can provide only one unit per downstream port while drawing max 5 units...
but, to directly answer your question, this design from Texas Instruments (a USB group member and major implementor) shows that you only have to guarantee that the total current is less than 5 units.
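The "total, not per-port" rule from the TI design can be sketched as a single budget check. This is a minimal model assuming a 100 mA unit load and a 5-unit (500 mA) high-power allowance; the function name is hypothetical, and a real budget must also include the hub controller's own current.

```python
from typing import Sequence

UNIT_LOAD_MA = 100           # one USB 2.0 unit load
UPSTREAM_BUDGET_UNITS = 5    # high-power configuration: 5 unit loads = 500 mA


def fits_shared_budget(loads_ma: Sequence[int]) -> bool:
    """True if the combined draw stays within the single 500 mA upstream allowance.

    Models the ganged arrangement: only the total is limited, not each port's share.
    Add the hub controller's own current to loads_ma for a real budget.
    """
    return sum(loads_ma) <= UPSTREAM_BUDGET_UNITS * UNIT_LOAD_MA


if __name__ == "__main__":
    # The three mixes listed above, plus one that does not fit.
    for loads in ([100, 100, 100], [500], [100, 100, 300], [300, 300]):
        print(loads, fits_shared_budget(loads))
```

Because only the sum is checked, any split between ports is fine as long as the board as a whole stays under the allowance, which is exactly why a single ganged switch is enough.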
To solve your problem, the rules state (taken from the excellent USB in a Nutshell document):
High power bus powered functions will draw all its power from the bus and cannot draw more than one unit load until it has been configured, after which it can then drain 5 unit loads (500 mA Max) provided it asked for this in its descriptor.
If you can guarantee that your driver stage will not begin drawing current until the device has been configured (which might be as simple as a timed delay in your device's controller), you can simply wire everything together. Because your entire circuit is on a single PCB and has no user-accessible downstream ports, you can probably also leave out the TPS2041 and simply design the system never to require more than 500 mA in any state.
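Instead of a timed delay, the firmware could gate the driver stage on the configuration event itself. The sketch below is hypothetical (the class and method names are invented for illustration); it assumes the USB 2.0 convention that bMaxPower is expressed in 2 mA units and that a non-zero SET_CONFIGURATION value puts the device in the Configured state. Suspend current limits are not modeled.

```python
UNIT_LOAD_MA = 100


class BusPowerGate:
    """Hypothetical firmware helper that decides when a power-hungry stage may turn on."""

    def __init__(self, b_max_power: int) -> None:
        # bMaxPower from the configuration descriptor, in 2 mA units (USB 2.0).
        if not 0 <= b_max_power <= 250:
            raise ValueError("bMaxPower must be 0..250 (0..500 mA)")
        self.requested_ma = b_max_power * 2
        self.configured = False

    def on_set_configuration(self, config_value: int) -> None:
        # A non-zero value selects a configuration; 0 returns the device to unconfigured.
        self.configured = config_value != 0

    def allowed_ma(self) -> int:
        # Before configuration: one unit load. After: what the descriptor asked for.
        return self.requested_ma if self.configured else UNIT_LOAD_MA

    def may_enable_stage(self, base_ma: int, stage_ma: int) -> bool:
        """True if the always-on circuitry plus the new stage fits the current allowance."""
        return base_ma + stage_ma <= self.allowed_ma()


if __name__ == "__main__":
    gate = BusPowerGate(b_max_power=250)   # asks for 500 mA
    print(gate.may_enable_stage(base_ma=80, stage_ma=350))  # False: not configured yet
    gate.on_set_configuration(1)
    print(gate.may_enable_stage(base_ma=80, stage_ma=350))  # True: 430 mA <= 500 mA
```

Waiting for the configuration event ties the turn-on to the condition the rule actually names, whereas a fixed delay only works if enumeration always finishes in time.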
Another benefit of enumerating as a high-power device is improved input voltage specifications. When you have enumerated as a low-power device, the host is only required to produce 4.40 V at the upstream port (which will be lower at your device due to the resistance of the cable). When you have enumerated as a high-power device, the specification guarantees that you'll get 4.75 V, which is more likely to be within the operating range of any 5V components you may be using.
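The cable-drop remark can be made concrete with Ohm's law: the supply current flows out on VBUS and back on GND, so both conductors subtract from the port voltage. The 0.2 ohm per conductor used below is a hypothetical round number for cable plus connectors, not a spec value.

```python
def voltage_at_device(v_port: float, current_a: float,
                      r_vbus_ohm: float, r_gnd_ohm: float) -> float:
    """Supply seen at the device: port voltage minus the IR drop in VBUS and GND."""
    return v_port - current_a * (r_vbus_ohm + r_gnd_ohm)


if __name__ == "__main__":
    R_VBUS = R_GND = 0.2   # hypothetical per-conductor resistance (cable + connectors), ohms
    cases = (
        ("low-power, 100 mA", 4.40, 0.1),
        ("high-power, 500 mA", 4.75, 0.5),
    )
    for label, v_port, load_a in cases:
        v_dev = voltage_at_device(v_port, load_a, R_VBUS, R_GND)
        print(f"{label}: {v_port:.2f} V at port -> {v_dev:.2f} V at device")
```

With these assumed resistances, the high-power case still lands higher (about 4.55 V versus 4.36 V) even though it draws five times the current.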
USB signaling is called differential because it is differential. During data transmission, the bus state is either D+ HIGH and D- LOW, or vice versa. The receiver is connected differentially and senses either a positive or a negative DIFFERENCE between the two lines. So it is differential.
The idea that current must be either sourced or sunk is fairly narrow. For example, the very popular LVDS signaling uses two levels on each wire of the signal pair: VH is 1.4 V and VL is 1.0 V. Yet no one questions that this signaling standard is differential.
It's the same in USB: in FS signaling mode, VH is 3.3 V and VL is 0 V on each individual wire, and packets use two complementary states (called J and K) to transmit information. The receiver senses either +3.3 V or -3.3 V.
For HS signaling, VH is 400 mV and VL is 0 mV, so the differential signal swings from +400 mV to -400 mV.
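The point that each wire swings between two non-negative levels while the receiver sees a bipolar difference can be shown with the levels quoted above. This is a minimal sketch; the function names are illustrative, and "D+"/"D-" are used generically for the two wires of any pair, LVDS included.

```python
from typing import Dict, Tuple

# (VH, VL) levels on each individual wire, as given above.
LEVELS: Dict[str, Tuple[float, float]] = {
    "FS": (3.3, 0.0),
    "HS": (0.4, 0.0),
    "LVDS": (1.4, 1.0),
}


def differential(v_dplus: float, v_dminus: float) -> float:
    """What a differential receiver responds to: D+ minus D-."""
    return v_dplus - v_dminus


def swing(mode: str) -> Tuple[float, float]:
    """Differential voltage for the two driven states (D+ high, then D- high)."""
    vh, vl = LEVELS[mode]
    return differential(vh, vl), differential(vl, vh)


if __name__ == "__main__":
    # For full speed, J is the "D+ high" state and K is the "D- high" state.
    for mode in LEVELS:
        pos, neg = swing(mode)
        print(f"{mode}: {pos:+.2f} V / {neg:+.2f} V")
```

Neither wire ever goes negative, yet the receiver sees a symmetric positive/negative signal in all three cases, which is what makes the scheme differential.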
CORRECTION: In both cases the common-mode voltage is half of the nominal single-wire voltage swing. Section 7.1.4.2 of the USB 2.0 Specification explicitly states that the nominal common-mode voltage for HS signaling is 200 mV.
When BOTH USB wires carry an additional offset, for example due to a ground shift caused by power supply current flowing in the ground return wire (which happens with bus-powered devices on long and/or thin cables), the receiver must tolerate it within the limits specified by USB.
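Both the common-mode figure from the correction and the offset tolerance can be seen by splitting a wire pair into its differential and common-mode parts. In this sketch the 50 mV ground shift is a hypothetical value chosen for illustration; real receivers only tolerate offsets within the common-mode range the spec allows.

```python
from typing import Tuple


def decompose(v_dplus: float, v_dminus: float) -> Tuple[float, float]:
    """Split a wire pair into (differential, common-mode) voltages."""
    return v_dplus - v_dminus, (v_dplus + v_dminus) / 2


def shift_both(v_dplus: float, v_dminus: float, offset: float) -> Tuple[float, float]:
    """Apply the same offset to both wires, e.g. a ground shift from supply current."""
    return v_dplus + offset, v_dminus + offset


if __name__ == "__main__":
    hs_state = (0.4, 0.0)                         # HS: D+ driven to VH, D- at VL
    print(decompose(*hs_state))                   # differential ~0.4 V, common mode ~0.2 V
    shifted = shift_both(*hs_state, offset=0.05)  # hypothetical 50 mV ground shift
    print(decompose(*shifted))                    # differential still ~0.4 V, common mode ~0.25 V
```

The shift moves only the common-mode term, which is why a differential receiver can ignore it as long as it stays within the specified range.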