(I do not know any HDL, but I hope the following will be helpful anyway.)
One can use a 32-bit-wide interface and still implement atomic 64-bit loads/stores. For loads one can "cheat" by reading from the cache entry even after it has been invalidated, checking the tags only on the first 32-bit access: the two 32-bit accesses are known to be back-to-back and within the same cache block, so a hit on the first access guarantees the second.
For stores, the cache block must be in the modified state (or exclusive, if silent updates are allowed) to accept a store, so an invalidate request (really a read-for-ownership) generates a data response. Since a data response is provided anyway and the whole write would typically take only two processor cycles, the data response can simply be delayed until the store has completed.
LDSTUB (load-and-store-unsigned-byte) and SWAP could be handled somewhat similarly to a 64-bit store by delaying the load until the cache block is in exclusive/modified state; the store part of the operation is known to be immediately after the read portion and a data response is required anyway, so the data response can be delayed slightly.
An alternative implementation of LDSTUB and SWAP could treat an invalidation between the load and the store as a miss for the load, effectively reissuing the load. However, this presents a danger of livelock. While livelock issues can be managed (e.g., various back-off techniques), the implementation mentioned earlier is probably much simpler.
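The delayed-response scheme can be sketched as a tiny state machine. This is a hypothetical illustration, not real controller code: the class and method names are invented, and the coherence transaction is reduced to a single call. The key point it shows is that the LDSTUB load part is held back until the block is in exclusive/modified state, so the load and store execute back-to-back with write permission already secured.

```python
# Hypothetical sketch of the LDSTUB handling described above: delay the
# load until the block can accept the store, then do load+store together.
from enum import Enum

class State(Enum):
    INVALID = 0
    SHARED = 1
    EXCLUSIVE = 2
    MODIFIED = 3

class CacheLine:
    def __init__(self):
        self.state = State.INVALID
        self.data = bytearray(32)  # one 32-byte block, initially zeros

class Controller:
    def __init__(self):
        self.line = CacheLine()

    def read_for_ownership(self):
        # Coherence transaction: fetch the block and gain write permission.
        # (Data would arrive from memory or another cache here.)
        self.line.state = State.MODIFIED

    def ldstub(self, offset):
        # Delay the whole operation until the block can accept the store.
        if self.line.state not in (State.EXCLUSIVE, State.MODIFIED):
            self.read_for_ownership()
        old = self.line.data[offset]   # load part
        self.line.data[offset] = 0xFF  # store part: LDSTUB writes all ones
        self.line.state = State.MODIFIED
        return old                     # data response, slightly delayed
```

SWAP would look the same except that the stored value comes from a register rather than being the constant 0xFF.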
Items in your hands are quicker to access than items in your pockets, which are quicker to access than items in your cupboard, which are quicker to access than items at Digikey. Each successive type of storage I have listed is larger but slower than the previous.
So, let's have the best of both worlds, let's make your hands as big as a Digikey warehouse! No, it doesn't work, because now they aren't really hands any more. They're a cannonball weighing you down.
The reason larger storage is slower to access is distance. Larger storage is further away from you on average. This is true for physical items, and for RAM.
Computer memory takes up physical space. For that reason, larger memories are physically larger, and some locations in that memory are going to be physically further away. Things that are far away take longer to access, due to whatever speed limits apply. In the case of your pockets and Digikey, the speed limits are the speed of your arms and the highway speed limit.
In the case of RAM, the speed limits are the propagation speed of electrical signals, the propagation delay of gates and drivers, and the common use of synchronous clocks. Even if money were no object, and you could buy as much as you want of the fastest RAM technology available today, you wouldn't be able to benefit from all of it. Lay out an A4-sized sheet of L1 cache if you like, and place your CPU right in the centre. When the CPU wants to access some memory in the far corner of that sheet, it'll literally take a nanosecond for the request to get there, and a nanosecond for it to get back. And that's not including all of the propagation delays through the gates and drivers. That's going to seriously slow down your 3GHz CPU.
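The nanosecond claim is easy to check on the back of an envelope. The numbers below are assumptions for illustration: signals on real wires propagate at very roughly half the speed of light, and the centre-to-corner distance of an A4 sheet is about 18 cm.

```python
# Back-of-the-envelope check of the A4-sheet example.
c = 3.0e8            # speed of light, m/s
v = 0.5 * c          # assumed on-wire signal propagation speed
half_diagonal = 0.18 # m, roughly centre-to-corner of an A4 sheet

one_way_ns = half_diagonal / v * 1e9
round_trip_ns = 2 * one_way_ns
cycles_lost = round_trip_ns * 3.0e9 / 1e9  # clock cycles at 3 GHz

print(one_way_ns, round_trip_ns, cycles_lost)  # ~1.2 ns, ~2.4 ns, ~7 cycles
```

So even before counting any gate delays, a single access to the far corner costs a 3GHz CPU on the order of seven clock cycles in pure wire delay.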
Since synchronous logic is a lot easier to design than asynchronous logic, any one 'block' of RAM will be clocked with the same clock. If you want to make the whole memory an L1 cache, then you'd have to clock the whole lot with a slow clock to cope with the worst-case timing of the most distant location in memory. This means that distant memory locations are now holding back local ones, which could have been clocked faster. So, the best thing to do would be to zone the memory. The closest and smallest section of the cache would use the fastest clock. The next closest and smallest section would use a slightly slower clock, etc.
And now you have L1 & L2 caches and RAM.
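The payoff of this zoning is captured by the standard average memory access time (AMAT) formula: each access pays the latency of every level it has to reach. The hit rates and latencies below are illustrative assumptions, not measurements of any real machine.

```python
# Average memory access time for a zoned hierarchy vs. one flat memory
# clocked for its worst-case most-distant location.

def amat(levels):
    """levels: list of (hit_rate, latency_cycles); last level must always hit."""
    total, p_reach = 0.0, 1.0
    for hit_rate, latency in levels:
        total += p_reach * latency    # every access reaching this level pays its latency
        p_reach *= (1.0 - hit_rate)   # fraction of accesses that miss and go deeper
    return total

# Assumed hierarchy: small fast L1, bigger slower L2, then RAM.
zoned = amat([(0.95, 1), (0.90, 10), (1.0, 100)])
# One flat memory, every access at worst-case distance:
flat = amat([(1.0, 100)])

print(zoned, flat)  # the zoned design averages ~2 cycles vs. 100
```

Because most accesses hit the small, fast zone, the average access time lands near the fast clock rather than the slow one.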
Which brings us to the next reason, power consumption.
The cache actually consumes a significant amount of power. Not only the memory itself, but all the logic surrounding it which handles the mapping between the cache lines and the main memory. Increasing the performance of this extra logic can result in an increase in power consumption. Now, for certain applications (mobile, embedded) you have even more incentive to keep the cache small.
See Cache Design Trade-offs for Power and Performance Optimization: A Case Study (Ching-Long Su and Alvin M. Despain, 1995).
Best Answer
By not decoupling the activity of the memory from the activity of the CPU, you're throwing away most of the benefit of having a cache.
But based on how you've described your system so far, it sounds like your analysis is correct: Increasing the line size will in effect prefetch some of the data, but this will not result in any performance gain. In fact, it could result in a slight performance loss on those occasions when the prefetched data is never actually used — the time spent fetching it was simply wasted.
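A toy stall-time model makes the point concrete. The cycle counts below are invented for illustration; the model assumes, as described above, that the CPU stalls for the entire line fill on every miss (no decoupling).

```python
# Toy model: with the CPU stalled for the whole line fill, a bigger line
# just rearranges when the stall happens; it cannot reduce the total.

def stall_cycles(n_misses, line_words, word_cycles=10):
    # CPU stalls for the full fill on each miss (no decoupling).
    return n_misses * line_words * word_cycles

# 64 words read sequentially: misses = 64 / line_words, total identical.
assert stall_cycles(64 // 4, 4) == stall_cycles(64 // 16, 16) == 640

# 16 words read with a 16-word stride: every access misses regardless of
# line size, so the bigger line only fetches words that are never used.
small = stall_cycles(16, 4)   # 640 cycles
big = stall_cycles(16, 16)    # 2560 cycles: the prefetched words are wasted
print(small, big)
```

Sequential access breaks even at best, and any access pattern that skips part of a line turns the extra fetch time into pure loss, which is exactly the slight performance loss described above.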