You know that shopping questions are not allowed here? Fortunately, I have never been one to follow the rules...
Sending large amounts of audio over Ethernet is not easy, or cheap. I've been doing this professionally for the past 14 years, and I still have not gotten the price down to what I would consider cheap.
I would not recommend a DIY approach to this. Building the PCBs, writing the software, testing, etc. is difficult for this type of project. That's fine if you want to start a new career in audio over Ethernet, but this is probably too much for someone who just does this as a hobby.
For commercial products, the cheapest that I know of are the boxes by Attero Tech. They follow the CobraNet protocol standard and so will interoperate with other CobraNet devices. But while I said this is the cheapest, it is not cheap! Also, this is pro-audio gear, with pro-audio performance. Other companies that make similar products are QSC Audio, Rane, Whirlwind, Peavey, Biamp, and many others.
There are not many modules that do both the networking interface and the ADC/DAC circuits. In a former life I designed the Cirrus Logic CM-1 and CM-2 modules, which will do up to 32x32 channels of networked audio, but they do not include the ADCs and DACs. Connecting converters to these modules is not difficult, but might still be beyond what you want to do.
There are other modules similar to the CM-1/2 from Audinate, Lab X, and others. But I do not think that these will be any easier or cheaper for your uses.
universal protocol that is used by the likes of USB audio
Sounds like the USB Audio Class specification.
Is this conversion from output of ADC to USB data stream something which we can do, say, with a microcontroller?
Some USB microcontrollers - NXP LPC17xx for example - have example code for USB Audio Class available.
Best Answer
First, I think you use the word "merely" very lightly. ADCs are some of the most complex, challenging mixed-signal systems in use. To meet different performance targets, practical ADCs use many different architectures (the most important right now are sigma-delta, SAR, and pipelined).
ADC design is characterized by very painful tradeoffs between speed, accuracy, noise, and power dissipation. For example, increasing the SNDR of a thermal-noise-limited ADC by one bit increases the power roughly 4X (because kT/C noise voltage is proportional to 1/sqrt(C), so halving the noise takes four times the capacitance).
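As a rough sketch of where the 4X comes from: the noise power sampled onto a capacitor is kT/C, and each extra bit halves the LSB, which quarters the allowed noise power. The function name, the 1 V full-scale range, the 300 K temperature, and the choice to set kT/C equal to the quantization noise (LSB^2/12) are all my assumptions for illustration, not a design rule.

```python
K_BOLTZMANN = 1.380649e-23  # Boltzmann constant, J/K


def min_sampling_cap(bits, v_fullscale=1.0, temp_k=300.0):
    """Capacitance (farads) at which kT/C noise power equals the
    quantization noise power LSB^2 / 12. Illustrative criterion only."""
    lsb = v_fullscale / 2**bits
    quant_noise_power = lsb**2 / 12
    return K_BOLTZMANN * temp_k / quant_noise_power


# Each added bit multiplies the required capacitance by 4; the power needed
# to charge that capacitor at the same speed scales with it, to first order.
for bits in (10, 12, 14, 16):
    cap = min_sampling_cap(bits)
    print(f"{bits} bits: {cap * 1e12:.3g} pF")
```

Real designs add margin and have other noise sources, so treat the absolute values as order-of-magnitude only; the 4X-per-bit scaling is the point.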
To first order (and this is VERY rough, mind) you can think of the speed-accuracy product of an ADC as constant. So, to make an ADC very accurate it must be slow, and, conversely, the fastest ADCs (now at 40 GS/s and beyond) have very low resolutions (4 - 6 bits or so). This is due to a combination of factors, such as oversampling (taking multiple samples and averaging), which reduces the speed, and the ability of sample-and-hold circuits to acquire signals at the needed accuracy. For example, a 10-bit sample-and-hold needs to sample the input signal to an accuracy of about 0.1%. A 16-bit sample-and-hold needs to sample the input to an accuracy of about 0.0015% (!). Accuracy, namely signal settling, takes time.
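To put numbers on the settling argument, here is a minimal sketch. It assumes the sample-and-hold behaves like a single-pole RC circuit and has to settle to within one LSB; both are simplifications of mine (real designs often target half an LSB or better), and the function names are just for illustration.

```python
import math


def required_accuracy_percent(bits):
    """One LSB expressed as a percentage of full scale."""
    return 100.0 / 2**bits


def settling_time_constants(bits):
    """Time constants needed for a single-pole step response to settle
    within one LSB: exp(-t/tau) = 2**-bits  ->  t/tau = bits * ln(2)."""
    return bits * math.log(2)


for bits in (6, 10, 16, 24):
    print(
        f"{bits:2d} bits: {required_accuracy_percent(bits):.6f}% of full scale, "
        f"{settling_time_constants(bits):.2f} time constants"
    )
```

Under these assumptions a 10-bit target needs roughly 7 time constants and a 16-bit target roughly 11, so for the same front-end bandwidth the higher-resolution converter simply has to wait longer per sample.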
So, an audio ADC is typically 16-24 bits, and has an effective sampling rate of 44.1 kHz to 96 kHz or more. (Keep in mind the ADC is sampling MUCH faster than this because of sigma-delta modulation.)
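To give a feel for "MUCH faster": the modulator inside a sigma-delta ADC runs at the output rate times an oversampling ratio. The ratios of 64 and 128 below are assumed example values, not figures for any particular part, and the function name is hypothetical.

```python
def modulator_rate_hz(output_rate_hz, oversampling_ratio):
    """Internal modulator clock for a given output sample rate."""
    return output_rate_hz * oversampling_ratio


# Assumed oversampling ratios; check the datasheet for a real converter.
for output_rate in (44_100, 48_000, 96_000):
    for osr in (64, 128):
        rate_mhz = modulator_rate_hz(output_rate, osr) / 1e6
        print(f"{output_rate} Hz output, OSR {osr}: {rate_mhz:.4f} MHz modulator")
```

So even a "slow" audio converter is typically clocking its front end in the MHz range.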
A video ADC is typically 8-12 bits (sometimes 14b) and samples between 10 - 40 MHz.
An ADC in a Gb Ethernet chip would be more like 6 - 8 bits @ 125 MHz.
An ADC in a 10Gb Ethernet chip would be more like 6 - 8 bits @ 1.25 GHz.
An ADC for a DDR4 transceiver or a radar receiver may be more like 4 bits at 10 GHz.
And so on. The reason there are so many ADCs is that there are so many places in the parameter space. Do you care about noise? It will cost you. Do you care about power? It will cost you.
A general-purpose ADC is a balance of different factors that has use in a variety of applications, but can't perform at extremes of speed or accuracy.