Generally speaking, a cache is a layer that abstracts access to memory. When a piece of information is needed, it is specified by its address. All entries in the cache are tagged with the memory address of the datum that they hold. When the processor requests a datum, the cache control circuitry searches the cache for a matching address.
If the cache is fully associative, then the entire address (except for the least significant bits) is matched against the entire cache. This matching is not a linear search, but an associative lookup. The cache entries somehow compare themselves to the address in parallel, and one of them announces itself as a match.
If the cache is set associative, then some of the address bits are used to directly select a bucket. For instance, if there are 16 buckets, then four bits of the address can be taken as a bucket number from 0 to 15. Then an associative lookup for the address takes place within just that bucket. This means that for any given memory address, we know which cache bucket it maps to, but not which specific cache line within that bucket.
If a cache is direct mapped, then some of the address bits are used to select a single cache line, which either holds data for that address or not. So there is no associative lookup. Each address maps to just a single cache line. (If a program alternately accesses two items at different addresses that map to the same cache line, performance is bad, because each access evicts the other item. This is the worst/cheapest kind of cache.)
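To make the address arithmetic concrete, here is a minimal C sketch of such a lookup. The sizes (16 buckets, 4 lines per bucket, 64-byte lines) are made-up illustration values, not taken from any particular processor. With WAYS set to 1 it degenerates into a direct-mapped cache; with a single set it becomes fully associative.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

#define SETS       16   /* 16 buckets -> 4 index bits taken from the address */
#define WAYS        4   /* lines per bucket; 1 would be direct mapped        */
#define LINE_BYTES 64   /* bytes per line -> 6 offset bits, dropped below    */

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[LINE_BYTES];
} cache_line_t;

static cache_line_t cache[SETS][WAYS];

/* Returns the matching line, or NULL on a miss. */
cache_line_t *lookup(uint32_t addr)
{
    uint32_t index = (addr / LINE_BYTES) % SETS;  /* selects the bucket   */
    uint32_t tag   = (addr / LINE_BYTES) / SETS;  /* identifies the datum */

    /* In hardware the WAYS comparators all work at once;
       this loop merely models that parallel compare.     */
    for (int way = 0; way < WAYS; way++) {
        cache_line_t *line = &cache[index][way];
        if (line->valid && line->tag == tag)
            return line;    /* cache hit  */
    }
    return NULL;            /* cache miss */
}
```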
When there is a cache hit, the item can be quickly supplied to the requesting circuit out of the cache. If there is a miss, then a memory access cycle has to be executed. The data is not only given to the requesting circuit, but also installed into the cache (replacing something else that has not recently been accessed).
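Continuing the sketch above, the miss path might look something like this; fetch_from_memory() and the victim-selection policy are placeholders for whatever the real memory interface and replacement policy (LRU, pseudo-LRU, random) would be:

```c
/* Hypothetical memory interface: fills one line's worth of data. */
extern void fetch_from_memory(uint32_t line_addr, uint8_t *buf);

cache_line_t *handle_miss(uint32_t addr)
{
    uint32_t index = (addr / LINE_BYTES) % SETS;
    uint32_t tag   = (addr / LINE_BYTES) / SETS;

    /* Pick a victim: an invalid line if one exists, otherwise
       whatever the replacement policy chooses (crudely, way 0). */
    cache_line_t *victim = &cache[index][0];
    for (int way = 0; way < WAYS; way++) {
        if (!cache[index][way].valid) {
            victim = &cache[index][way];
            break;
        }
    }

    /* The data is installed into the cache as well as returned. */
    fetch_from_memory(addr - (addr % LINE_BYTES), victim->data);
    victim->valid = true;
    victim->tag   = tag;
    return victim;
}
```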
Instruction caches tend to be specialized, to take advantage of the access patterns and the structure of the data. The cache may work at a higher level, combined with the instruction decoding. The requesting circuit asks not simply for an instruction opcode, but it demands a decoded instruction. The combined caching and decoding circuitry provides it. The idea is the same. Take the address and find a decoded instruction for that address. If it's not found in the cache, then it must be fetched and decoded.
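The same pattern, one level up, might be modeled like this (reusing the C types from the sketches above); decoded_insn_t and decode() are stand-ins for whatever internal form a real pipeline's decoder produces:

```c
typedef struct {
    /* Placeholder for the decoder's internal representation. */
    int opcode_class;
    int operands[2];
} decoded_insn_t;

extern decoded_insn_t decode(uint32_t addr);  /* fetch + decode: the slow path */

typedef struct {
    bool           valid;
    uint32_t       addr;
    decoded_insn_t insn;
} uop_entry_t;

#define UOP_ENTRIES 256
static uop_entry_t uop_cache[UOP_ENTRIES];

decoded_insn_t get_decoded(uint32_t addr)
{
    uop_entry_t *e = &uop_cache[addr % UOP_ENTRIES];  /* direct mapped, for brevity */
    if (!(e->valid && e->addr == addr)) {
        e->insn  = decode(addr);   /* miss: fetch and decode the instruction */
        e->addr  = addr;
        e->valid = true;
    }
    return e->insn;                /* hit: ready-decoded, no decode work needed */
}
```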
So the answer to the question "how does the processor know" is that the processor is divided into logical units, and these units provide services to each other. The units which request data from memory do not have to be aware of the cache. The responsibility is put into the cache control circuitry. I.e. inside the overall processor there is effectively a smaller processor which in fact "does not know" that the data is in a cache.
Items in your hands are quicker to access than items in your pockets, which are quicker to access than items in your cupboard, which are quicker to access than items at Digikey. Each successive type of storage I have listed is larger but slower than the previous.
So, let's have the best of both worlds, let's make your hands as big as a Digikey warehouse! No, it doesn't work, because now they aren't really hands any more. They're a cannonball weighing you down.
The reason larger storage is slower to access is distance. Larger storage is further away from you on average. This is true for physical items, and for RAM.
Computer memory takes up physical space. For that reason, larger memories are physically larger, and some locations in that memory are going to be physically further away. Things that are far away take longer to access, due to whatever speed limits there are. In the case of your pockets, and Digikey, the speed limits are the speed of your arms, and the highway speed limits.
In the case of RAM, the speed limits are the propagation speed of electrical signals, the propagation delay of gates and drivers, and the common use of synchronous clocks. Even if money were no object, and you could buy as much as you want of the fastest RAM technology available today, you wouldn't be able to benefit from all of it. Lay out an A4-sized sheet of L1 cache if you like, and place your CPU right in the centre. When the CPU wants to access some memory right in the corner of the sheet, it'll literally take a nanosecond for the request to get there, and a nanosecond for it to get back. And that's not including all of the propagation delays through the gates and drivers. That's going to seriously slow down your 3GHz CPU.
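As a rough sanity check of those numbers (assuming a signal speed of about half the speed of light, which is a typical rule of thumb for traces):

```c
#include <stdio.h>

int main(void)
{
    double distance   = 0.18;       /* metres, centre of an A4 sheet to a corner */
    double speed      = 1.5e8;      /* m/s, roughly 0.5c signal propagation      */
    double round_trip = 2.0 * distance / speed;
    double cycle      = 1.0 / 3e9;  /* one period of a 3 GHz clock               */

    printf("round trip: %.2f ns\n", round_trip * 1e9);  /* ~2.4 ns  */
    printf("CPU cycle : %.2f ns\n", cycle * 1e9);       /* ~0.33 ns */
    return 0;
}
```

The wire delay alone comes to around seven clock cycles, before counting a single gate.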
Since synchronous logic is a lot easier to design than asynchronous logic, one 'block' of RAM will be clocked with the same clock. If you want to make the whole memory an L1 cache, then you'd have to clock the whole lot with a slow clock to cope with the worst-case timing of the most distant location in memory. This means that distant memory locations are now holding back local ones, which could have been clocked faster. So, the best thing to do would be to zone the memory. The closest and smallest section of the cache would use the fastest clock. The next closest and smallest section would use a slightly slower clock, etc.
And now you have L1 & L2 caches and RAM.
Which brings us to the next reason: power consumption.
The cache actually consumes a significant amount of power. Not only the memory itself, but all the logic surrounding it which handles the mapping between the cache lines and the main memory. Increasing the performance of this extra logic can result in an increase in power consumption. Now, for certain applications (mobile, embedded) you have even more incentive to keep the cache small.
See Cache Design Trade-offs for Power and Performance Optimization: A Case Study (Ching-Long Su and Alvin M. Despain, 1995).
The instruction register generally holds only the bits that constitute the "opcode" part of an instruction (including any bits that affect addressing modes) for the duration of the execution of the instruction, so that the instruction decoding logic has access to it.
Any bytes in the instruction that are only operands are generally not held in the instruction register.
If the bus width of the processor is only 8 bits, then the two bytes of the instruction are fetched in separate cycles. The second cycle is executed after the instruction decode logic has decided that there is a second byte as a result of examining the opcode in the first byte.
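A sketch of that two-cycle fetch, for a made-up 8-bit machine in which bit 7 of the opcode indicates that an operand byte follows (a real instruction set would encode the length differently):

```c
#include <stdint.h>

extern uint8_t bus_read(uint16_t addr);                  /* one 8-bit bus cycle */
extern void    execute(uint8_t opcode, uint8_t operand);

static uint16_t pc;                    /* program counter                    */
static uint8_t  instruction_register;  /* holds the opcode while it executes */

void fetch_and_execute(void)
{
    /* Cycle 1: fetch the opcode into the instruction register. */
    instruction_register = bus_read(pc++);

    /* The decode logic examines the opcode; in this invented
       encoding, bit 7 set means a one-byte operand follows.  */
    uint8_t operand = 0;
    if (instruction_register & 0x80) {
        /* Cycle 2: fetch the operand; it goes to an operand
           latch, not into the instruction register.          */
        operand = bus_read(pc++);
    }

    execute(instruction_register, operand);
}
```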
And no, caches are not necessary to the operation of a processor, but they help eliminate bottlenecks when the processor is faster than the main memory.