The top signal is Frame Sync (FS). FS indicates whether the audio word currently on the data line belongs to the left or the right channel. Don't think of them as "left" and "right", though; those are just arbitrary names. Think of them as channel 0 (FS clear) and channel 1 (FS set), time-division multiplexed onto a single communications link.
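To make the "channel 0 / channel 1" view concrete, here is a minimal Python sketch of what the receiving side conceptually does. It assumes each received word has already been paired with the FS level that was present while it was shifted in; the `(fs, sample)` tuple format and the function name are made up for illustration and are not any particular MCU's API.

```python
def split_by_frame_sync(words):
    """Route each received word to channel 0 or channel 1 based on FS.

    `words` is an iterable of (fs, sample) pairs, where `fs` is the
    Frame Sync level seen while `sample` was clocked in (hypothetical format).
    """
    channels = {0: [], 1: []}
    for fs, sample in words:
        # FS clear -> channel 0, FS set -> channel 1
        channels[1 if fs else 0].append(sample)
    return channels[0], channels[1]


if __name__ == "__main__":
    stream = [(0, 10), (1, 20), (0, 11), (1, 21)]
    ch0, ch1 = split_by_frame_sync(stream)
    print("channel 0:", ch0)
    print("channel 1:", ch1)
```

Nothing in this is "left" or "right" specific; the two lists are just the two time slots of the link.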
The bottom signal is the serial data that is being clocked into(?) your MCU.
MCLK is not visible in that diagram. It is the clock that the audio codec (in your case, a CS42436) uses to time and/or drive its own internal operation. It runs at a relatively high frequency; a common value is 256*Fs, where Fs is the sample rate (e.g. 256 × 44.1 kHz = 11.2896 MHz). Values in the range of 10-60 MHz are pretty typical.
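As a quick sanity check on those numbers, the arithmetic can be sketched in a few lines of Python. The default ratio of 256 is just the common value mentioned above; the ratios your codec actually accepts are listed in its datasheet.

```python
def mclk_hz(sample_rate_hz, ratio=256):
    """Return the master clock frequency for a given sample rate and MCLK/Fs ratio."""
    return sample_rate_hz * ratio


if __name__ == "__main__":
    for fs in (44_100, 48_000, 96_000):
        print(f"Fs = {fs} Hz -> MCLK = {mclk_hz(fs) / 1e6:.4f} MHz")
```

At a ratio of 256 this gives 11.2896 MHz, 12.288 MHz, and 24.576 MHz for the three rates shown.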
The master clock (MCLK) should be synchronized with LRCK, but the phase is not critical.
What this means is that MCLK and LRCK should be derived from the same clock source, so that there is a constant number of MCLK cycles for each sample. For example, at MCLK = 512 Fs you get exactly 512 MCLK cycles per sample.
"Phase is not critical" means any amount of phase delay between MCLK and the other signals doesn't matter.
If they are derived from different clock sources, say MCLK from a 22.5792 MHz oscillator and BCLK/LRCK from whatever oscillator the ESP has, then the two clocks won't be exactly synchronized. For example, one oscillator could be a few ppm faster than the other, so instead of having 512 MCLK cycles per sample you could have 512.01 or 511.99. In this case it's hard to say what the DAC will do. It could just skip or duplicate a sample once in a while, which shouldn't be audible, but it could also shift the bits in wrong and output garbage, or just decide to shut down.
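To get a feel for the magnitudes, here is a rough Python sketch of the drift arithmetic. It assumes the only error is a constant frequency offset between the two oscillators, expressed in ppm, and that the DAC resolves the drift by slipping one whole sample at a time; real parts may behave differently, as noted above. The function names are made up for this example.

```python
def mclk_cycles_per_sample(ratio, ppm_offset):
    """Average MCLK cycles per LRCK period when MCLK is `ppm_offset` ppm fast."""
    return ratio * (1 + ppm_offset * 1e-6)


def seconds_per_sample_slip(sample_rate_hz, ppm_offset):
    """Time for the two clock domains to drift apart by one full sample."""
    if ppm_offset == 0:
        return float("inf")  # perfectly matched clocks never slip
    samples_per_slip = 1e6 / abs(ppm_offset)
    return samples_per_slip / sample_rate_hz


if __name__ == "__main__":
    ratio, fs, ppm = 512, 44_100, 20
    print(f"{mclk_cycles_per_sample(ratio, ppm):.5f} MCLK cycles per sample")
    print(f"one sample slip every {seconds_per_sample_slip(fs, ppm):.2f} s")
```

With a 20 ppm mismatch at 512 Fs this gives about 512.01 cycles per sample, which matches the figure above, and roughly one skipped or duplicated sample per second at 44.1 kHz if the DAC handles it gracefully.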
Possible solutions:
ESP-07 doesn't have a MCLK output, so "feed MCLK from ESP-07 to the DAC" is out.
One solution is to use a clock oscillator of the correct frequency for the DAC chip and configure the DAC in master mode, so that it outputs BCK/LRCK, which feed into the corresponding inputs on the ESP to synchronize it to the DAC. Presumably the ESP's I2S output will then synchronize to this BCK/LRCK, and you can just feed the ESP's I2S data output to the DAC. However, you would need a different DAC, one that supports master mode. You also need two oscillators if you want to support both the 44.1 kHz and 48 kHz sample rates.
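The reason two oscillators are needed is that no single common frequency is an integer multiple of both 44.1 kHz and 48 kHz at a ratio DACs typically accept. A small Python sketch of the selection logic follows. The 24.576 MHz value (512 × 48 kHz) and the list of allowed MCLK/Fs ratios are assumptions chosen for illustration; take the real ones from your DAC's datasheet.

```python
# Assumed oscillator pair: 512 x 44.1 kHz and 512 x 48 kHz.
OSCILLATORS_HZ = (22_579_200, 24_576_000)

# Assumed set of MCLK/Fs ratios; this is device-specific.
ALLOWED_RATIOS = (128, 192, 256, 384, 512, 768)


def pick_oscillator(sample_rate_hz, oscillators=OSCILLATORS_HZ, ratios=ALLOWED_RATIOS):
    """Return (oscillator_hz, ratio) for the first oscillator that is an
    allowed integer multiple of the sample rate, or None if none fits."""
    for osc in oscillators:
        ratio, remainder = divmod(osc, sample_rate_hz)
        if remainder == 0 and ratio in ratios:
            return osc, ratio
    return None


if __name__ == "__main__":
    for fs in (44_100, 48_000, 88_200, 96_000):
        print(fs, "->", pick_oscillator(fs))
```

Under these assumptions the 44.1 kHz family (44.1k, 88.2k) always lands on the 22.5792 MHz oscillator and the 48 kHz family on the 24.576 MHz one, so the design has to switch between the two depending on the stream's sample rate.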
Another solution is to use an asynchronous sample rate converter chip which will convert the ESP's output to the DAC's clock domain. That's an extra chip though. You could also use a DAC which doesn't need a MCLK signal but instead reconstructs one from BCLK using an internal PLL.
I'd just use an ES9023 DAC with a local 25MHz oscillator. This chip sounds very good, it is simple to use, and it has an internal sample rate converter which will accept whatever you feed it and convert it to the local clock domain. You can probably find ready-made modules with it.
Another simple question is about the type of audio amplifier that is required to connect to the analog output... should it be class A, B, AB, or H?
That depends on your speakers, how many watts you want, etc. I'd just get a used vintage stereo "hi-fi" amp from the pawn shop and stick your WiFi ESP-based device inside.
EDIT:
If you only need 0.5W, then I assume you're going to use a small loudspeaker, so you could use a tiny and cheap Class-D amplifier with I2S input which does not require MCLK...
Best Answer
Generating the other clocks is the whole point of master mode, so that is indeed what an I²S chip is likely to be doing. But if you want to be sure, read the datasheet of your codec. For example, the CS4245 one says:
The I²C bus has its own clock line, and the I²C protocol allows both master and slave to delay clock cycles, so it's impossible to demand that it be synchronous to anything else. But again, read the datasheet, which will say something like this:
Some codecs might require that a valid master clock is present for the I²C to work.