Audio is not that high bandwidth, so it is within the range of what a microcontroller can handle.
The quality level you want makes a large difference in the amount of data you have to handle. If you just need to save and later replay voice, then 8 bit samples at 8 kHz are good enough. If the 8 bit values are not constrained to be linear, then you can get better overall signal to noise ratio with the same amount of data. This is what the phone company does, using companding schemes such as μ-law and A-law.
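To make the non-linear 8 bit idea concrete, here is a minimal sketch that compares a plain linear 8 bit quantizer with one wrapped in μ-law companding. It uses the continuous μ-law formula with μ = 255. Real telephone codecs use a piecewise-linear approximation with a specific bit layout, which this does not reproduce. The function names and the 255-level quantizer are assumptions made for illustration.

```python
import math

MU = 255  # mu value used in mu-law telephony; assumed here for illustration

def mulaw_compress(x):
    """Map a linear sample in [-1, 1] to a companded value in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mulaw_expand(y):
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def quantize_8bit(v):
    """Round a value in [-1, 1] to an integer code in -127..127."""
    return round(v * 127)

def dequantize_8bit(code):
    return code / 127

def roundtrip_linear(x):
    """Store x as a linear 8 bit value and read it back."""
    return dequantize_8bit(quantize_8bit(x))

def roundtrip_mulaw(x):
    """Store x as a companded 8 bit value and read it back."""
    return mulaw_expand(dequantize_8bit(quantize_8bit(mulaw_compress(x))))

# Compare the reconstruction error for quiet and loud samples.
for x in (0.001, 0.01, 0.1, 0.9):
    err_lin = abs(roundtrip_linear(x) - x)
    err_mu = abs(roundtrip_mulaw(x) - x)
    print(f"x={x}: linear error={err_lin:.6f}, mu-law error={err_mu:.6f}")
```

Quiet samples come out much closer to the original with companding, at the cost of coarser steps near full scale. That trade suits speech, where most of the signal sits at low levels.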
At the other end is "Hi-Fi" audio, which is from 20 Hz to 20 kHz, usually at least 16 bits per sample (over 90 dB signal to noise ratio). To digitize such audio, you sample much faster than the Nyquist rate, then apply digital filtering, then decimate. The reason you need digital filtering is that an analog filter can't be made accurate enough to give the very sharp drop-off after 20 kHz that you need in order to sample just a little faster than 40 kHz.
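The oversample, filter, decimate sequence can be sketched in plain Python as follows. This is only an illustration under assumed numbers: a 176 kHz input rate (4 times 44 kHz), a 101-tap Hamming-windowed sinc filter, and a 21 kHz cutoff. None of these values come from a specific converter, and a real design would use a much longer or multi-stage filter to get a sharper transition.

```python
import math

def design_lowpass(num_taps, cutoff):
    """Windowed-sinc low-pass FIR; cutoff is a fraction of the input sample rate."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        t = n - mid
        # Ideal low-pass impulse response, with the t == 0 limit handled separately.
        ideal = 2 * cutoff if t == 0 else math.sin(2 * math.pi * cutoff * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))  # Hamming
        taps.append(ideal * window)
    total = sum(taps)
    return [tap / total for tap in taps]  # normalize to unity gain at DC

def filter_and_decimate(samples, taps, factor):
    """Low-pass filter the input, computing only every factor-th output sample.

    The taps are symmetric, so sliding them over the input without reversal
    gives the same result as convolution.
    """
    out = []
    for start in range(0, len(samples) - len(taps) + 1, factor):
        out.append(sum(tap * samples[start + k] for k, tap in enumerate(taps)))
    return out

def tone(freq_hz, rate_hz, count):
    """Unit-amplitude sine test signal."""
    return [math.sin(2 * math.pi * freq_hz * n / rate_hz) for n in range(count)]

INPUT_RATE = 176_000   # assumed oversampled rate, 4 x 44 kHz
FACTOR = 4             # decimate down to 44 kHz
taps = design_lowpass(101, 21_000 / INPUT_RATE)

in_band = filter_and_decimate(tone(1_000, INPUT_RATE, 2000), taps, FACTOR)
out_of_band = filter_and_decimate(tone(60_000, INPUT_RATE, 2000), taps, FACTOR)
print("1 kHz peak after decimation:", max(abs(v) for v in in_band))
print("60 kHz peak after decimation:", max(abs(v) for v in out_of_band))
```

Without the filter, the 60 kHz tone would alias to 16 kHz at the 44 kHz output rate and land in the audio band. With the filter, it is suppressed before the extra samples are thrown away.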
Let's say you do the worst case and end up with 16 bit samples at 44 kHz rate. That's only 88 kB/s, or 5.3 MB/minute. Any SD card can handle that data rate. 1 GB gives you over 3 hours of this Hi-Fi audio.
Of course if you just want the voice-quality audio, things are much easier, the data rates lower, and the storage requirements lower. At 8 kB/s just 1 MB lasts over 2 minutes. 1 GB would hold nearly 1 1/2 days of audio.
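The arithmetic behind these figures is easy to check with a few lines of Python. The helper names are made up for this sketch, and it assumes mono audio and decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes).

```python
def bytes_per_second(sample_rate_hz, bits_per_sample, channels=1):
    """Raw PCM data rate in bytes per second."""
    return sample_rate_hz * bits_per_sample * channels // 8

def recording_seconds(capacity_bytes, rate_bytes_per_s):
    """How long a given capacity lasts at a given data rate."""
    return capacity_bytes / rate_bytes_per_s

MB = 10**6
GB = 10**9

hifi = bytes_per_second(44_000, 16)   # 88 000 B/s
voice = bytes_per_second(8_000, 8)    # 8 000 B/s

print("Hi-Fi:", hifi, "B/s,", hifi * 60 / MB, "MB/minute")
print("Hi-Fi hours per GB:", recording_seconds(GB, hifi) / 3600)
print("Voice minutes per MB:", recording_seconds(MB, voice) / 60)
print("Voice days per GB:", recording_seconds(GB, voice) / 86400)
```

Stereo doubles the Hi-Fi numbers; pass `channels=2` to see the effect.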
Well, maybe. It might be helpful to look at the circuitry that processes the audio data to try to figure out what it's doing. You mentioned the spec is 600 signs per second, corresponding to 6 k baud. That seems to me like they are using some method of multiplexing 10 bits into each 'sign'. This could well be some sort of multitone modulation where each bit is represented by a different frequency. The trick would be figuring out the specifics of how the symbols are constructed and then how to re-frame that data for transmission over the serial interface. tl;dr - it may be doable, but it will require some reverse-engineering.
Edit: after opening up the file in Audacity, it actually looks like it may be some sort of NRZ code. Looks like a 3 level format of some sort.
I'm not sure what the name is for the encoding, but it seems that it's a series of positive and negative going pulses with gaps inserted between the pulses that represent the data bits. I believe these gaps represent 1s as there are several long segments of pulses with no gaps, and it's far more likely for a binary file to have a long section of 0s than it is to have a long section of 1s. It would not be very difficult to write a script to extract the data. However, I am not sure if that data will be in the correct format to transmit via the serial port.
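As a rough idea of what such an extraction script might look like, here is a sketch under one possible reading of the waveform: each interval between consecutive pulses carries one bit, a normally spaced pulse means 0, and a pulse that arrives after a gap means 1. The threshold, the nominal spacing, and the 1.5x decision point are all guesses that would have to be tuned against the actual recording, and the bit meaning may well be inverted.

```python
def find_pulses(samples, threshold):
    """Return the sample index where each positive or negative pulse starts."""
    starts = []
    prev_state = 0
    for i, s in enumerate(samples):
        # Classify the sample as positive pulse, negative pulse, or silence.
        state = 1 if s > threshold else -1 if s < -threshold else 0
        if state != 0 and state != prev_state:
            starts.append(i)
        prev_state = state
    return starts

def decode_gaps(pulse_starts, nominal_spacing):
    """Assumed rule: an interval well beyond the nominal spacing is a 1, else a 0."""
    bits = []
    for prev, cur in zip(pulse_starts, pulse_starts[1:]):
        bits.append(1 if (cur - prev) > 1.5 * nominal_spacing else 0)
    return bits

# Synthetic example: pulses 4 samples wide, with one 4-sample gap inserted.
samples = [1.0] * 4 + [-1.0] * 4 + [0.0] * 4 + [1.0] * 4 + [-1.0] * 4
starts = find_pulses(samples, threshold=0.5)
bits = decode_gaps(starts, nominal_spacing=4)
print(starts, bits)
```

For a real file, the samples could be read with the standard `wave` module and the nominal spacing measured from one of the long runs of pulses with no gaps.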
Success! This looks like a description of a very similar format: http://www.unige.ch/medecine/nouspikel/ti99/cassette.htm#Cassette%20tape%20format
The timings are a bit different and I think the bit levels might be inverted from what they are on your tape, but it seems like a very similar format.
Best Answer
It's analogue. A high-frequency 'bias' carrier has the incoming audio signal superimposed onto it, and the result is sent to the recording head, which is just an electromagnet.
The purpose of the bias is to drive the magnetic material around and around its hysteresis curve, so as to avoid the non-linear part of the curve close to the zero-crossing.