For example, a store of a register into memory, followed by a load of some other piece of memory into the same register, represents two different values and need not use the same physical register. Furthermore, if these different instructions are mapped to different physical registers, they can be executed in parallel, which is the whole point of OOO execution.
So, the processor must keep a mapping of the instructions in flight at any moment and the physical registers they use. This process is called register renaming. As an added bonus, it becomes possible to work with a potentially larger set of real registers in an attempt to extract even more parallelism out of the code. All of this dependency analysis, register renaming and OOO execution adds a lot of complex logic to the processor, making it harder to design, larger in terms of chip area, and more power-hungry.
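The renaming idea can be sketched in a few lines of code. This is a toy model, not any real processor's logic: every architectural-register write simply claims a fresh physical register, which is exactly what removes the false (write-after-write and write-after-read) dependencies.

```python
# Minimal sketch of register renaming (toy model, invented register numbering).
# Each write to an architectural register is given a fresh physical register,
# so false dependencies between unrelated uses of the same name disappear.

def rename(instructions, num_arch_regs=8):
    """instructions: list of (dest, src1, src2) architectural register numbers."""
    free_phys = list(range(100))    # pool of physical registers (far more than 8)
    # Initial mapping: architectural register r lives in physical register r.
    arch_to_phys = {r: free_phys.pop(0) for r in range(num_arch_regs)}
    renamed = []
    for dest, src1, src2 in instructions:
        p1, p2 = arch_to_phys[src1], arch_to_phys[src2]  # read current mappings
        arch_to_phys[dest] = free_phys.pop(0)            # fresh register per write
        renamed.append((arch_to_phys[dest], p1, p2))
    return renamed

# Both instructions write architectural r1, but after renaming they use
# different physical registers (8 and 9), so they can execute in parallel.
prog = [(1, 2, 3), (1, 4, 5)]
print(rename(prog))   # [(8, 2, 3), (9, 4, 5)]
```

Note how the two writes to register 1 end up in distinct physical registers while their source operands keep their existing mappings; that separation is what lets the hardware run them side by side.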
The extra logic is particularly power-hungry because those transistors are always working, unlike the functional units, which spend at least some of their time idle (possibly even powered down). On the other hand, out-of-order execution offers the advantage that software need not be recompiled to get at least some of the benefits of the new processor's design, though typically not all. Another approach to the whole problem is to have the compiler optimize the code by rearranging the instructions.
This is called static, or compile-time, instruction scheduling. The rearranged instruction stream can then be fed to a processor with simpler in-order multiple-issue logic, relying on the compiler to "spoon feed" the processor with the best instruction stream. Avoiding the need for complex OOO logic should make the processor quite a lot easier to design, less power-hungry and smaller, which means more cores, or extra cache, could be placed onto the same amount of chip area (more on this later). The compiler approach also has some other advantages over OOO hardware — it can see further down the program than the hardware, and it can speculate down multiple paths rather than just one, which is a big issue if branches are unpredictable.
On the other hand, a compiler can't be expected to be psychic, so it can't necessarily get everything perfect all the time. Without OOO hardware, the pipeline will stall when the compiler fails to predict something like a cache miss. A question that must be asked is whether the costly out-of-order logic is really warranted, or whether compilers can do the task of instruction scheduling well enough without it. This is historically called the brainiac vs speed-demon debate. Brainiac designs are at the smart-machine end of the spectrum, with lots of OOO hardware trying to squeeze every last drop of instruction-level parallelism out of the code, even if it costs millions of logic transistors and years of design effort to do it.
In contrast, speed-demon designs are simpler and smaller, relying on a smart compiler and willing to sacrifice a little bit of instruction-level parallelism for the other benefits that simplicity brings. Historically, the speed-demon designs tended to run at higher clock speeds, precisely because they were simpler, hence the "speed-demon" name, but today that's no longer the case because clock speed is limited mainly by power and thermal issues.
Clearly, OOO hardware should make it possible for more instruction-level parallelism to be extracted, because things will be known at runtime that cannot be predicted in advance — cache misses, in particular. On the other hand, a simpler in-order design will be smaller and use less power, which means you can place more small in-order cores onto the same chip as fewer, larger out-of-order cores.
Which would you rather have: 4 powerful brainiac cores, or 8 simpler in-order cores? Exactly which is the more important factor is currently open to hot debate. In general, it seems both the benefits and the costs of OOO execution have been somewhat overstated in the past. In terms of cost, appropriate pipelining of the dispatch and register-renaming logic allowed OOO processors to achieve clock speeds competitive with simpler designs by the late 1990s, and clever engineering has reduced the power overhead of OOO execution considerably in recent years, leaving mainly the chip area cost.
This is a testament to some outstanding engineering by processor architects. Out-of-order execution has also been unable to deliver the degree of "recompile independence" originally hoped for, with recompilation still producing large speedups even on aggressive OOO processors.
When it comes to the brainiac debate, many vendors have gone down one path, then changed their mind and switched to the other side. DEC, for example, went primarily speed-demon with the first two generations of Alpha, then changed to brainiac for the third generation. MIPS did similarly. Sun, on the other hand, went brainiac with their first superscalar SPARC, then switched to speed-demon for more recent designs.
ARM processors, in contrast, have shown a consistent move towards more brainiac designs, coming up from the low-power, low-performance embedded world as they have, but still remaining mobile-centric and thus unable to push the clock speed too high. Intel has been the most interesting of all to watch. Modern x86 processors have no choice but to be at least somewhat brainiac due to limitations of the x86 architecture (more on this soon), and the Pentium Pro embraced that sentiment wholeheartedly.
Intel then changed their focus to clock speed at all cost, and made the Pentium 4 about as speed-demon as possible for a decoupled x86 microarchitecture, sacrificing some ILP and using a deep 20-stage pipeline to pass 2 and then 3 GHz, and, with a later revision featuring a staggering 31-stage pipeline, reach as high as 3.8 GHz.
At the same time, with IA-64 (Itanium, not shown above), Intel again bet solidly on the smart-compiler approach, with a simple design relying totally on static, compile-time scheduling. The Pentium 4's severe power and heat issues demonstrated there are limits to clock speed. It gets even worse, because in addition to the normal switching power, there is also a small amount of leakage power: even when a transistor is off, the current flowing through it isn't completely reduced to zero. And just like the good, useful current, this leakage current also goes up as the voltage is increased.
If that wasn't bad enough, leakage generally goes up as the temperature increases as well, due to the increased movement of the hotter, more energetic electrons within the silicon.
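These two components follow the standard first-order CMOS power model (the exact formula is not spelled out in the text, so its form here is a conventional approximation, not a claim from the article):

```latex
P_{\text{total}} \;=\;
\underbrace{\alpha\, C\, V^{2} f}_{\text{switching}}
\;+\;
\underbrace{V\, I_{\text{leak}}(V, T)}_{\text{leakage}}
```

where $\alpha$ is the fraction of transistors switching each cycle, $C$ the switched capacitance, $V$ the supply voltage, $f$ the clock frequency and $T$ the temperature. Since the leakage current $I_{\text{leak}}$ rises with both $V$ and $T$, pushing clock speed (and the higher voltage needed to sustain it) hurts twice: quadratically through the switching term, and again through increased leakage at the resulting higher voltage and temperature.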
Up to a point this increase in power is okay, but at a certain point, currently somewhere around 150-200 watts, the power and heat problems become unmanageable, because it's simply not possible to provide that much power and cooling to a silicon chip in any practical fashion, even if the circuits could, in fact, operate at higher clock speeds. This is called the power wall. Processors which focused too much on clock speed, such as the Pentium 4, IBM's POWER6 and most recently AMD's Bulldozer, quickly hit the power wall and found themselves unable to push the clock speed as high as originally hoped, resulting in them being beaten by slower-clocked but smarter processors which exploited more instruction-level parallelism.
Thus, going purely for clock speed is not the best strategy. And of course, this is even more true for portable, mobile devices, such as laptops, tablets and phones, where the power wall hits much sooner, around 50W for "pro" laptops, 15W for ultralight laptops, 10W for tablets and less than 5W for phones, due to the constraints of battery capacity and limited, often fanless cooling. So, if going primarily for clock speed is a problem, is going purely brainiac the right approach then? Sadly, no. Pursuing more and more ILP also has definite limits, because unfortunately, normal programs just don't have a lot of fine-grained parallelism in them, due to a combination of load latencies, cache misses, branches and dependencies between instructions.
This limit of available instruction-level parallelism is called the ILP wall. Processors which focused too much on ILP, such as the early POWER processors, SuperSPARC and the MIPS R10000, soon found their ability to extract additional instruction-level parallelism was only modest, while the additional complexity seriously hindered their ability to reach fast clock speeds, resulting in those processors being beaten by dumber but higher-clocked processors which weren't so focused on ILP. A 4-issue superscalar processor wants 4 independent instructions to be available, with all their dependencies and latencies met, at every cycle.
In reality this is virtually never possible, especially with load latencies of 3 or 4 cycles.
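A toy scheduler makes this concrete. The model below is an idealized 4-wide machine (made-up instructions and latencies, perfect caches and branch prediction) limited only by true dependencies and issue width, and even then a dependent chain of loads falls far short of 4 instructions per cycle:

```python
# Toy scheduler for an ideal 4-wide machine: only true dependencies, latencies
# and issue width constrain it. Each instruction is (list_of_deps, latency).
# All numbers are invented for illustration.

def schedule(instrs, width=4):
    finish = {}             # instruction index -> cycle its result is ready
    issued_per_cycle = {}   # cycle -> how many issue slots already used
    for i, (deps, lat) in enumerate(instrs):
        earliest = max((finish[d] for d in deps), default=0)
        cycle = earliest
        while issued_per_cycle.get(cycle, 0) >= width:  # only `width` slots/cycle
            cycle += 1
        issued_per_cycle[cycle] = issued_per_cycle.get(cycle, 0) + 1
        finish[i] = cycle + lat
    return len(instrs) / max(finish.values())           # achieved IPC

# A chain of dependent 3-cycle loads: each must wait for the previous one.
chain = [([i - 1] if i else [], 3) for i in range(12)]
print(schedule(chain))        # about 0.33 IPC, nowhere near 4
# Fully independent single-cycle ops can actually use the machine's width.
independent = [([], 1) for _ in range(12)]
print(schedule(independent))  # 4.0 IPC
```

The gap between the two cases is the whole story: the hardware is 4-wide in both runs, but the dependency structure of the code, not the issue width, decides the achieved IPC.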
Currently, real-world instruction-level parallelism for mainstream, single-threaded applications is limited to about 2-3 instructions per cycle at best. In fact, the average ILP of a modern processor running the SPECint benchmarks is less than 2 instructions per cycle, and the SPEC benchmarks are somewhat "easier" than most large, real-world applications. Certain types of applications do exhibit more parallelism, such as scientific code, but these are generally not representative of mainstream applications.
There are also some types of code, such as "pointer chasing", where even sustaining 1 instruction per cycle is extremely difficult. For those programs, the key problem is the memory system, and yet another wall, the memory wall, which we'll get to later.
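Pointer chasing is worth seeing in miniature. In the sketch below (invented data, the "linked list" is just an index array), every load's address is the previous load's result, so the loads form a strictly serial chain with no ILP for any processor, however wide, to extract:

```python
# Pointer chasing: each load's address depends on the previous load's value,
# so the memory accesses form a serial chain the hardware cannot overlap.

def chase(next_index, start, steps):
    """Follow `steps` links through next_index, starting at `start`.
    Each iteration cannot even begin its "memory access" until the
    previous iteration's result is known."""
    i = start
    for _ in range(steps):
        i = next_index[i]   # the address of this load is the last load's value
    return i

# A tiny circular list laid out as an index array: 0 -> 2 -> 3 -> 1 -> 0 -> ...
next_index = [2, 0, 3, 1]
print(chase(next_index, 0, 4))   # 0: back to the start after one full loop
```

Contrast this with summing an array, where every element's address is known up front and the loads can all be issued in parallel; that difference, not the instruction count, is what makes pointer-heavy code so slow.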
So where does x86 fit into all this, and how have Intel and AMD been able to remain competitive through all of these developments in spite of an architecture that's now more than 35 years old? While the original Pentium, a superscalar x86, was an amazing piece of engineering, it was clear the big problem was the complex and messy x86 instruction set. Complex addressing modes and a minimal number of registers meant few instructions could be executed in parallel due to potential dependencies.
For the x86 camp to compete with the RISC architectures, they needed to find a way to "get around" the x86 instruction set. The solution, invented independently at about the same time by engineers at both NexGen and Intel, was to dynamically decode the x86 instructions into simple, RISC-like micro-instructions, which can then be executed by a fast, RISC-style register-renaming OOO superscalar core.
For these decoupled superscalar x86 processors, register renaming is absolutely critical due to the meager 8 registers of the x86 architecture in 32-bit mode (64-bit mode added an additional 8 registers). This differs strongly from the RISC architectures, where providing more registers via renaming only has a modest effect. NexGen's Nx586 and Intel's Pentium Pro (also known as the P6) were the first processors to adopt a decoupled x86 microarchitecture design, and today all modern x86 processors use this technique.
Of course, this width-labelling conundrum is largely academic, since no processor is likely to actually sustain such high levels of ILP when running real-world code anyway. One of the most interesting members of the RISC-style x86 group was the Transmeta Crusoe processor, which translated x86 instructions into an internal VLIW form, rather than internal superscalar, and used software to do the translation at runtime, much like a Java virtual machine.
This approach allowed the processor itself to be a simple VLIW, without the complex x86 decoding and register-renaming hardware of decoupled x86 designs, and without any superscalar dispatch or OOO logic either. The software-based x86 translation did reduce the system's performance compared to hardware translation which occurs as additional pipeline stages and thus is almost free in performance terms , but the result was a very lean chip which ran fast and cool and used very little power.
A 600 MHz Crusoe processor could match a then-current 500 MHz Pentium III running in its low-power mode (300 MHz clock speed), while using only a fraction of the power and generating only a fraction of the heat. This made it ideal for laptops and handheld computers, where battery life is crucial. Today, of course, x86 processor variants designed specifically for low power use, such as the Pentium M and its Core descendants, have made the Transmeta-style software-based approach unnecessary, although a very similar approach is currently being used in NVIDIA's Denver ARM processors, again in the quest for high performance at very low power.
As already mentioned, the approach of exploiting instruction-level parallelism through superscalar execution is seriously weakened by the fact that most normal programs just don't have a lot of fine-grained parallelism in them. Because of this, even the most aggressively brainiac OOO superscalar processor, coupled with a smart and aggressive compiler to spoon feed it, will still almost never exceed an average of about 2-3 instructions per cycle when running most mainstream, real-world software, due to a combination of load latencies, cache misses, branching and dependencies between instructions.
Issuing many instructions in the same cycle only ever happens for short bursts of a few cycles at most, separated by many cycles of executing low-ILP code, so peak performance is not even close to being achieved. If additional independent instructions aren't available within the program being executed, there is another potential source of independent instructions — other running programs, or other threads within the same program.
Simultaneous multi-threading (SMT) is a processor design technique which exploits exactly this type of thread-level parallelism. Once again, the idea is to fill those empty bubbles in the pipelines with useful instructions, but this time rather than using instructions from further down in the same code (which are hard to come by), the instructions come from multiple threads running at the same time, all on the one processor core. So, an SMT processor appears to the rest of the system as if it were multiple independent processors, just like a true multi-processor system.
Of course, a true multi-processor system also executes multiple threads simultaneously — but only one in each processor. This is also true for multi-core processors, which place two or more processor cores onto a single chip, but are otherwise no different from traditional multi-processor systems. In contrast, an SMT processor uses just one physical processor core to present two or more logical processors to the system.
This makes SMT much more efficient than a multi-core processor in terms of chip space, fabrication cost, power usage and heat dissipation. And of course there's nothing preventing a multi-core implementation where each core is an SMT design. From a hardware point of view, implementing SMT requires duplicating all of the parts of the processor which store the "execution state" of each thread — things like the program counter, the architecturally-visible registers (but not the rename registers), the memory mappings held in the TLB, and so on.
Luckily, these parts only constitute a tiny fraction of the overall processor's hardware. The really large and complex parts, such as the decoders and dispatch logic, the functional units, and the caches, are all shared between the threads. Of course, the processor must also keep track of which instructions and which rename registers belong to which threads at any given point in time, but it turns out this only adds a small amount to the complexity of the core logic. This is really great! Now that we can fill those bubbles by running multiple threads, we can justify adding more functional units than would normally be viable in a single-threaded processor, and really go to town with multiple instruction issue.
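The bubble-filling effect can be sketched numerically. The model below is deliberately crude (made-up per-cycle "ready instruction" counts, a simple priority policy rather than any real SMT fetch heuristic), but it shows how a second thread soaks up issue slots that low-ILP code leaves empty:

```python
# Toy SMT issue stage: each cycle has `width` issue slots. Thread A gets first
# pick; thread B's ready instructions fill whatever slots are left over.
# The per-cycle ready counts are invented low-ILP workloads.

def smt_issue(ready_a, ready_b, width=4):
    issued = 0
    for a, b in zip(ready_a, ready_b):
        slots = min(a, width)              # thread A issues first
        slots += min(b, width - slots)     # thread B fills the remaining bubbles
        issued += slots
    return issued / len(ready_a)           # average combined IPC

# Thread A alone averages 2 ready instructions per cycle (typical low-ILP code).
a = [1, 3, 2, 1, 4, 1]
b = [2, 2, 1, 3, 2, 2]
print(smt_issue(a, [0] * 6))  # 2.0 IPC: half the 4-wide machine sits idle
print(smt_issue(a, b))        # 3.5 IPC: thread B fills most of the bubbles
```

Neither thread runs faster than it would alone, and in a real processor each usually runs a little slower due to sharing, but total throughput rises sharply because the expensive, shared issue width finally gets used.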