While internally computing all the answers and then using a mux to select among them will work, it is certainly not a minimal design.
Consider that you can bit-slice the problem; instead of a single block of logic with two 8-bit inputs, you could partition this as two 4-bit sections, as long as you can link them to get a correct overall result. Fortunately, linking the slices requires no more than a single bit, which in the case of addition is the carry bit. So each 4-bit slice has a carry-in bit and a carry-out bit. (Note that logical operations like AND and NOR won't even need this, though if later on you implement left/right shifts, this bit is easily re-purposed.)
Carried to an extreme, you could use 8 slices of 1 bit each. It's useful to think about the 1-bit slices, because it makes it easier to think about an approach that scales back up to larger slices. A 1-bit slice has just 7 inputs: the 4-bit function code, a bit from input A, a bit from input B, and a carry-in bit. It also has just two outputs: function out and carry out. So now you can write the two output functions in terms of just 7 inputs, which is within the realm of human ability to reasonably reduce. You'll end up with a handful of gates that won't necessarily always compute all the functions, but it doesn't matter what happens within the slice, only that it produces the correct result when viewed from outside.
Now you can go a couple of ways. One way is to simply use 8 of these 1-bit slices and you're done. Another way is to make larger slices and then use those. Going from 1-bit to 2-bits, the equations go from 7 inputs to 9, and 4-bits will require functions of 13 inputs. It's not necessarily easy, but will give more compact results than the compute-everything-then-mux approach. Besides, if you look at the internals of a 74181 4-bit ALU slice, you won't see a mux in there.
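The slice-and-ripple idea can be sketched in software. This is a minimal illustration, not a real ALU design: it assumes a hypothetical 2-bit function code (ADD, AND, OR, XOR) rather than the 4-bit code discussed above, but the structure is the same — one slice function, chained 8 times through the carry bit.

```python
def alu_slice_1bit(f, a, b, cin):
    """One 1-bit ALU slice. f is a (hypothetical) 2-bit function code,
    a and b are single input bits, cin is carry-in.
    Returns (result_bit, carry_out)."""
    if f == 0b00:                      # ADD: a full adder
        s = a ^ b ^ cin
        cout = (a & b) | (cin & (a ^ b))
        return s, cout
    elif f == 0b01:                    # AND: the carry path is unused
        return a & b, 0
    elif f == 0b10:                    # OR
        return a | b, 0
    else:                              # XOR
        return a ^ b, 0

def alu_8bit(f, A, B, cin=0):
    """Ripple eight 1-bit slices together: the carry-out of slice i
    feeds the carry-in of slice i+1."""
    result, carry = 0, cin
    for i in range(8):
        a = (A >> i) & 1
        b = (B >> i) & 1
        s, carry = alu_slice_1bit(f, a, b, carry)
        result |= s << i
    return result, carry
```

For example, `alu_8bit(0b00, 100, 55)` returns `(155, 0)`, and adding 200 + 100 overflows the 8-bit result and sets the carry-out. Larger slices would just process more bits per call before passing the carry along.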
In some device technologies, registers are connected to a bus using three-state outputs. Such an approach does have some advantages, but it generally either requires some "dead time" between the moment one register releases the bus and the moment another register starts driving it, or else runs the risk that a device might start driving the bus before the previous device has fully released it.
In other technologies, this approach is avoided in favor of using nested multiplexers. If there are 64 registers that can output to a bus bit, the device might have eight 8-way multiplexers each of which accepts input from one register, and one more 8-way multiplexer which accepts input from one of the first eight. While this may use slightly more circuitry than the bus-based approach, it has the advantage that every signal throughout the system will be driven by exactly one device at all times.
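The two-level mux tree described above can be modeled directly. In this sketch the 6-bit register select is split into two 3-bit fields (which field drives which level is an assumption for illustration): the low 3 bits pick within each group of 8, and the high 3 bits pick which group's output reaches the bus.

```python
def mux8(inputs, sel):
    """An 8-way multiplexer: pick one of eight inputs with a 3-bit select."""
    return inputs[sel & 0b111]

def select_register(registers, sel):
    """Nested-mux selection among 64 register values.
    Eight first-level 8-way muxes each pick one register from a group
    of 8; a final 8-way mux picks among the first-level outputs."""
    first_level = [mux8(registers[g * 8:(g + 1) * 8], sel & 0b111)
                   for g in range(8)]
    return mux8(first_level, (sel >> 3) & 0b111)
```

Every intermediate signal here has exactly one driver at all times, which is the property the answer is pointing out; the cost is the extra layer of muxes and one more gate delay.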
Best Answer
Since you've built a 4-bit ALU, I'm going to assume you're interested in building a 4-bit computer. It will be almost half the work of building an 8-bit one, since the busses will be half as wide.
The following diagram is from a paper titled "A Simple and Affordable TTL Processor for the Classroom". It describes in quite a bit of detail the architecture of a 4-bit computer called CHUMP ("Cheap Homebrew Understandable Minimal Processor").
It is a Harvard architecture, meaning the program and data are in separate memories (program ROM and RAM in the diagram). It uses a single accumulator register (Accum), rather than the register arrays typically found in microcontrollers today.
PC is the program counter, which holds the address of the next instruction in the program ROM. The control ROM is used to decode the instructions. Each CHUMP instruction uses a 4-bit opcode (maximum of 16 instructions) and a 4-bit operand. This means it can load only one nibble of immediate data (0-F), address 16 RAM locations, perhaps 16 I/O ports, and is limited to 16 instructions. But these limitations will make it easier to build. You'll still need 64 FF's for your RAM.
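Instruction decode on a machine like this is trivial — just splitting a byte into nibbles. A quick sketch (the high-nibble-opcode / low-nibble-operand layout is an assumption for illustration; the paper defines the actual encoding):

```python
def decode(instr):
    """Split an 8-bit instruction word into a 4-bit opcode (high
    nibble, assumed) and a 4-bit operand (low nibble, assumed)."""
    opcode = (instr >> 4) & 0xF
    operand = instr & 0xF
    return opcode, operand
```

So `decode(0xA7)` yields opcode `0xA` and operand `0x7`. In hardware this "decode" is just wiring: four ROM data lines go to the control ROM's address inputs, and the other four go to the operand bus.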
You pretty much need to have all of the blocks shown above. If you are going to be making very small programs, you could use some sort of 16 x 4 plugboard instead of using EPROMs. Perhaps you could use a solderless breadboard. This one with 30 rows could store 14 instructions (7x4 on each side). Storing programs in RAM ("Von Neumann architecture") would be problematic since you would have to have a way of loading the program into RAM anyway.
Here is how the busses (all 4-bit) are wired. The output of the PC selects the address in program ROM. The input to the PC comes from the instruction operand in ROM (Sel Mux=0). This same path can be used to load an address from the instruction operand into the Addr latch, which selects a nibble from RAM. The RAM is read into the ALU with Sel Mux=1, or a constant literal comes from the program ROM with Sel Mux=0. The Accum (accumulator) can go to either the other side of the ALU, or be written into RAM. There is also a path to load either the PC or Addr latch from RAM, but I don't know if that's used.
Instructions are executed in several cycles; the cycles advance with the system clock.
If the PC is not loaded from the instruction operand (jump instruction), then it is incremented after every instruction.
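That PC behavior is a simple two-way choice each instruction. A minimal sketch, assuming a 4-bit PC that wraps at 16 (the real decision comes from the control ROM's decode of the opcode):

```python
def next_pc(pc, is_jump, operand):
    """Program-counter update for a 4-bit PC: load the instruction
    operand on a jump, otherwise increment (wrapping modulo 16)."""
    if is_jump:
        return operand & 0xF
    return (pc + 1) & 0xF
```

In hardware this is the mux feeding the PC's load input: one input is the incremented PC, the other is the operand field from the program ROM.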
I suggest running your clock very slow, maybe only 1 MHz or so, so you can easily trace the operation using an oscilloscope. You probably want to add a single-step feature also, which executes one instruction per button push. Lots of blinking LEDs connected to the PC, Addr latch, and accumulator are good too.
Here is a page which has links to six other 4-bit computer pages, many with schematics. And here are two more, which weren't on the list above.