My Intel CPU changes clock speed depending on the usage, but how does it decide what clock speed to run at? Is the clock speed determined by the OS software using an algorithm, or is it hardware based? Is it dependent on the # of interrupts? The cache turnover? Does the CPU itself set its own clock? Or does a separate controller set it? Or software?
Electronics – How does a CPU dynamically change its clock frequency
clock, clock-speed, computer-architecture, cpu
Related Solutions
You are mixing two independent (orthogonal) ideas from digital circuit theory: asynchronous circuits and multi-core processors.
Asynchronous circuits: circuits which have more than one clock, where the clocks are asynchronous (i.e. have a non-constant, unpredictable phase relationship).
Some circuits may use two clocks (for example), but one is just a division by 2 of the other. These circuits are not asynchronous, because there is a known phase relationship between the two clocks, even though their frequencies differ.
You may have a single-core CPU with a few asynchronous clocks, and a multi-core CPU with all its cores running on the same clock (the latter is purely hypothetical - all real multi-core CPUs have many clocks, forming several mutually asynchronous clock sets).
Asynchronous circuit design is a major topic in digital design; the explanation above is only the basics.
Multi-core CPUs: several processors (cores) operating in parallel, employing sophisticated hardware and software in order to achieve high performance.
The usual practice is to make the cores as independent as possible in terms of clocks, power, execution, etc. This allows dynamic (run-time) adjustment of the CPU's activity (i.e. its power consumption) to the actual needs of the system.
My impression is that what you're looking for is an explanation of multi-core CPUs, not asynchronous circuits.
This topic is much, much bigger than anything one can fit into an answer.
The answers to your questions, though:
- The clocks used by different cores (to the best of my knowledge) have the same sources (there can be more than one: crystal, VCO, ...). Each core usually has a few mutually asynchronous clock sets, and each core has dedicated clock gating and throttling logic that allows its clock to be turned off or slowed down independently. Again, if you're interested only in the algorithmic aspects of core parallelism, forget about clocks (for now).
- You have just identified the main challenge of core parallelism: how to run multiple cores in parallel efficiently. This topic is huge, and involves both HW and SW solutions. From the HW perspective, cores both modify a common memory and exchange control and status signals with sequencing logic and with each other. The picture is complicated considerably by the existence of caches - I'd suggest starting by reading about caches, then cache coherency, and only then about caches in multi-core systems.
Hope this helps.
Intel's Haswell (or at least those products that incorporate the Iris Pro 5200 GPU) and IBM's POWER7 and POWER8 all include embedded DRAM, "eDRAM".
One important issue that has led eDRAM not to be common until recently is that the DRAM fabrication process is not inherently compatible with logic processes, so that extra steps must be included (which increase cost and decrease yield) when eDRAM is desired. So, there must be a compelling reason for wanting to incorporate it in order to offset this economic disadvantage. Alternatively, DRAM can be placed on a separate die that is manufactured independently of, but then integrated onto the same package as, the CPU. This provides most of the benefits of locality without the difficulties of manufacturing the two in a truly integrated way.
Another problem is that DRAM, unlike SRAM, does not store its contents indefinitely while power is applied, and reading it destroys the stored data, which must be written back afterwards. Hence, it has to be refreshed periodically and rewritten after every read. And, because a DRAM cell is based on a capacitor, charging or discharging it sufficiently that leakage will not corrupt its value before the next refresh takes a finite amount of time. This charging time is not required with SRAM, which is just a latch; consequently SRAM can be clocked at the same rate as the CPU, whereas DRAM is limited to about 1 GHz while maintaining reasonable power consumption. This gives DRAM a higher inherent latency than SRAM, which makes it not worthwhile for all but the very largest caches, where the reduced miss rate pays off. (Haswell and POWER8 are roughly contemporaneous and both incorporate up to 128 MB of eDRAM, used as an L4 cache.)
Also, as far as latency is concerned, a large part of the difficulty is the physical distance signals must travel. Light can only travel 10 cm in the clock period of a 3 GHz CPU. Of course, signals do not travel in straight lines across the die, nor do they propagate at anything close to the speed of light, due to the need for buffering and fan-out, which incur propagation delays. So, the maximum distance a memory can be from the CPU while maintaining one clock cycle of latency is a few centimetres at most, limiting the amount of memory that can be accommodated in the available area. Intel's Nehalem processor actually reduced the capacity of the L2 cache versus Penryn partly to improve its latency, which led to higher performance.* If we do not care so much about latency, then there is no reason to put the memory on-package rather than further away, where it is more convenient.
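As a rough sanity check on that 10 cm figure (a back-of-the-envelope calculation, not part of the original answer), the distance light covers in one clock period is

$$ d = c \, T = \frac{c}{f} = \frac{3\times10^{8}\ \text{m/s}}{3\times10^{9}\ \text{Hz}} = 0.1\ \text{m} = 10\ \text{cm}. $$

Since real on-die signals cover only a fraction of that per cycle, the usable radius for single-cycle memory is far smaller still, which is exactly why cache capacity trades directly against latency.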
It should also be noted that the cache hit rate is very high for most workloads: well above 90% in almost all practical cases, and not uncommonly even above 99%. So, the benefit of including larger memories on-die is inherently limited to reducing the impact of those few percent of misses. Processors intended for the enterprise server market (such as POWER) typically have enormous caches and can profitably include eDRAM because it is useful for accommodating the large working sets of many enterprise workloads. Haswell includes it to support the GPU, because textures are large and cannot be accommodated in the on-die caches. These are the use cases for eDRAM today, not typical desktop or HPC workloads, which are very well served by typical cache hierarchies.
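To make that concrete with a simplified average memory access time model and illustrative numbers (not from the answer above):

$$ \mathrm{AMAT} = t_{\text{hit}} + m \cdot t_{\text{miss}} $$

With a 4-cycle hit time and a 200-cycle miss penalty, a 99% hit rate gives an average of \(4 + 0.01 \times 200 = 6\) cycles, while a 95% hit rate gives \(4 + 0.05 \times 200 = 14\) cycles. A few percent of misses can dominate the average access time, which is why shaving them with a huge L4 can pay off for the right workloads.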
To address some issues raised in comments:
These eDRAM caches cannot be used in place of main memory because they are designed as L4 victim caches. This means that they are volatile and effectively content-addressable, so that data stored in them is not treated as residing at any specific location and may be discarded at any time. These properties are difficult to reconcile with the requirement that RAM be directly mapped into the address space and persistent, and changing them would make the caches useless for their intended purpose. It is of course possible to embed memories of a more conventional design, as is done in microcontrollers, but this is not justifiable for systems with large memories, since low latency is not as beneficial in main memory as it is in a cache, so enlarging or adding a cache is the more worthwhile proposition.
As to the possibility of very large caches with capacity on the order of gigabytes: a cache only needs to be about the size of the working set of the application. HPC applications may deal with terabyte datasets, but they have good temporal and spatial locality, so their working sets are typically not very large. Applications with large working sets are, for example, databases and ERP software, but there is only a limited market for processors optimized for this sort of workload. Unless the software truly needs it, adding more cache provides very rapidly diminishing returns. Recently, processors have gained prefetch instructions, so caches can be used more efficiently: one can use these instructions to avoid misses caused by the unpredictability of memory access patterns, rather than by the absolute size of the working set, which in most cases is still relatively small.
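As a minimal illustration of software prefetching, here is a sketch using the GCC/Clang `__builtin_prefetch` builtin. The indexed-sum workload and the prefetch distance are arbitrary assumptions for the example; real code would tune the distance for the target machine and access pattern.

```c
#include <stddef.h>

/* Sum values reached through an index table (a gather-like, hard-to-predict
 * access pattern). Prefetching the element needed a few iterations from now
 * can hide part of the miss latency. */
double indexed_sum(const double *data, const size_t *idx, size_t n)
{
    const size_t prefetch_distance = 16;   /* arbitrary; tune per machine */
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (i + prefetch_distance < n) {
            /* Hints: read-only access (0), moderate temporal locality (1). */
            __builtin_prefetch(&data[idx[i + prefetch_distance]], 0, 1);
        }
        sum += data[idx[i]];
    }
    return sum;
}
```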
*The improvement in latency was not due only to the smaller physical size of the cache, but also because the associativity was reduced. There were significant changes to the entire cache hierarchy in Nehalem for several different reasons, not all of which were focused on improving performance. So, while this suffices as an example, it is not a complete account.
Best Answer
The core clock of the CPU isn't received directly from the motherboard. That clock is usually much slower (often by a factor of 10 or more) than the internal frequency of the CPU. Instead, the clock signal from the motherboard is used as the reference frequency for a higher frequency phase locked loop controlled oscillator inside the CPU. The generated clock runs at some multiple of the reference clock, and that multiple can be changed by setting certain registers in the CPU. The actual generation of the clock is done purely in hardware.
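For example (illustrative numbers; the actual reference frequency and available multipliers depend on the platform): with a 100 MHz reference clock and a multiplier of 35, the PLL output is

$$ f_{\text{core}} = N \times f_{\text{ref}} = 35 \times 100\ \text{MHz} = 3.5\ \text{GHz}, $$

and dropping the multiplier to 8 brings the core down to 800 MHz without touching the reference clock at all.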
To reduce power even further, the CPU also signals to the voltage regulator supplying its core voltage to run at a lower set point. At lower frequencies the CPU can run at a lower voltage without malfunctioning, and because power consumption is proportional to the square of the voltage, even a small reduction in voltage can save a large amount of power.
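To put rough numbers on that, using the classic dynamic-power approximation with illustrative values:

$$ P_{\text{dyn}} \approx C \, V^{2} f $$

Dropping the core voltage from 1.2 V to 1.0 V alone cuts dynamic power to \((1.0/1.2)^{2} \approx 0.69\) of its original value, and since the frequency is lowered at the same time, the combined saving is larger still.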
The voltage and frequency scaling is done by hardware, but the decision to run in a low-power mode is made by software (the OS). How the OS determines the optimal mode to run in is a separate, messier problem, but it largely comes down to what percentage of time the system has been idle lately: mostly idle, lower the frequency; mostly busy, raise it. Once the OS decides what frequency to run at, it's just a matter of setting a register.
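A heavily simplified sketch of that decision loop is below. This is not the algorithm any real OS uses; it assumes a Linux-style /proc/stat and only prints the decision instead of actually programming a P-state (a real governor would write the target through the cpufreq interface or the CPU's performance-control register).

```c
#include <stdio.h>
#include <unistd.h>

/* Read total and idle jiffies from the aggregate "cpu" line of /proc/stat. */
static int read_cpu_times(unsigned long long *total, unsigned long long *idle)
{
    unsigned long long user, nice, sys, idl, iowait, irq, softirq, steal;
    FILE *f = fopen("/proc/stat", "r");
    if (!f)
        return -1;
    if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
               &user, &nice, &sys, &idl, &iowait, &irq, &softirq, &steal) != 8) {
        fclose(f);
        return -1;
    }
    fclose(f);
    *idle = idl + iowait;
    *total = user + nice + sys + idl + iowait + irq + softirq + steal;
    return 0;
}

int main(void)
{
    unsigned long long t0, i0, t1, i1;
    if (read_cpu_times(&t0, &i0) != 0)
        return 1;

    for (;;) {
        sleep(1);
        if (read_cpu_times(&t1, &i1) != 0)
            return 1;

        double idle_frac = (double)(i1 - i0) / (double)(t1 - t0);

        /* Crude policy: mostly idle -> lower the frequency,
         * mostly busy -> raise it. */
        if (idle_frac > 0.8)
            printf("idle %.0f%% -> lower frequency\n", idle_frac * 100);
        else if (idle_frac < 0.2)
            printf("idle %.0f%% -> raise frequency\n", idle_frac * 100);
        else
            printf("idle %.0f%% -> keep frequency\n", idle_frac * 100);

        t0 = t1;
        i0 = i1;
    }
}
```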
Reference: "Enhanced Intel SpeedStep Technology for the Intel Pentium M Processor"