Tilted Forum Project Discussion Community - View Single Post - What You Need To Know About The Shift to 64-Bit Computing

bendsley · 12-07-2004, 11:51 AM

64-bit Computing and x86-64

AMD's 64-bit alternative: x86-64

When AMD set out to alter the x86 ISA in order to bring it into the world of 64-bit computing, they took the opportunity to do more than just widen the GPRs. x86-64 makes a number of improvements to x86, and in this section we'll look at some of them.

Extended registers

I don't want to get into a historical discussion of the evolution of what eventually became the modern x86 ISA as Intel's hardware went from 4-bit to 8-bit to 16-bit to 32-bit. You can find such discussions elsewhere, if you're interested. I'll only point out that what we now consider to be the "x86 ISA" was first introduced in 1978 with the release of the 8086. The 8086 had four 16-bit GPRs and four 16-bit registers that were intended to hold memory addresses but could be used as GPRs. (Of the four GPRs, though, only BX could be used to hold a memory address in 16-bit addressing mode.)

With the release of the 386, Intel extended the x86 ISA to support 32 bits by doubling the size of the original eight 16-bit registers. In order to access the extended portion of these registers, assembly language programmers used a different set of register mnemonics.

With x86-64, AMD has done pretty much the same thing that Intel did to enable the 16-bit to 32-bit transition--they've doubled the sizes of the eight GPRs and assigned new mnemonics to the extended registers. However, extending the existing eight GPRs isn't the only change AMD made to the x86 register model.
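To make the register-extension idea concrete, here's a minimal Python sketch of how the narrower register names are simply views onto the low bits of the wider register. The mnemonics used (RAX, EAX, AX, AH, AL) are the standard names for the accumulator at each width and aren't spelled out in the text above; the function name `sub_registers` and the sample value are made up for illustration.

```python
MASK64 = (1 << 64) - 1


def sub_registers(rax: int) -> dict:
    """Return the narrower architectural views of a 64-bit register value."""
    rax &= MASK64
    return {
        "rax": rax,                # full 64-bit register (x86-64)
        "eax": rax & 0xFFFFFFFF,   # low 32 bits (386 and later)
        "ax": rax & 0xFFFF,        # low 16 bits (8086)
        "ah": (rax >> 8) & 0xFF,   # bits 8 through 15
        "al": rax & 0xFF,          # low 8 bits
    }


views = sub_registers(0x1122334455667788)
for name, value in views.items():
    print(f"{name:>3} = {value:#x}")
```

This models only how the names overlap. It says nothing about what happens to the upper bits when a narrower view is written, which is a separate set of rules.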

More registers

One of the oldest and longest-running gripes about x86 is that the programming model has only eight GPRs, eight FPRs, and eight SIMD registers. All newer RISC ISAs support many more architectural registers; the PowerPC ISA, for instance, specifies thirty-two of each type of register. Increasing the number of registers allows the processor to cache more data where the execution units can access it immediately; this translates into a reduced number of LOADs and STOREs, which means less memory subsystem traffic, less waiting for data to load, etc. More registers also give the compiler or programmer more flexibility to schedule instructions so that dependencies are reduced and pipeline bubbles are kept to a minimum.

Modern x86 CPUs get around some of these limitations by means of a trick called register renaming. I won't describe this technique in detail here, but it involves putting extra, "hidden," internal registers onto the die and then dynamically mapping the programmer-visible registers to these internal, machine-visible registers. The P4, for instance, has 128 of these microarchitectural rename registers, which allow it to store more data closer to the ALU and reduce dependencies. The take-home point is this: of the P4's 128 GPRs, only the traditional eight are visible to the programmer or compiler; the other 120 are visible only to the P4's internal register rename logic, so it's up to the P4's hardware to try to make the best use of them at runtime.
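The mapping idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of the P4's actual rename hardware: the class `RenameTable`, its methods, and the simple free list are all hypothetical, and a real design also has to reclaim physical registers as instructions retire, which is left out here.

```python
class RenameTable:
    """Toy rename table: maps architectural registers to physical ones."""

    def __init__(self, arch_regs, num_physical):
        if num_physical < len(arch_regs):
            raise ValueError("need at least one physical register per architectural one")
        # Initially each architectural register owns one physical register.
        self.mapping = {name: i for i, name in enumerate(arch_regs)}
        # Everything else is a hidden register available for renaming.
        self.free = list(range(len(arch_regs), num_physical))

    def read(self, arch_reg):
        """Physical register currently holding this architectural register."""
        return self.mapping[arch_reg]

    def write(self, arch_reg):
        """Allocate a fresh physical register for a new value of arch_reg."""
        phys = self.free.pop(0)
        self.mapping[arch_reg] = phys
        return phys


table = RenameTable(
    ["eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"], num_physical=128
)
first = table.write("eax")    # e.g. a load into eax
second = table.write("eax")   # a later, unrelated load into eax
print(first, second, len(table.free))
```

Because the two writes to `eax` land in different physical registers, the second one doesn't have to wait for earlier readers of the first value to finish. That's the sense in which renaming reduces dependencies even though the program only ever names eight registers.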

In spite of the benefits of register renaming, it would still be nicer to have more registers directly accessible to the programmer via the x86 ISA. This would allow a compiler or an assembly language programmer more flexibility and control to statically optimize the code. It would also allow a decrease in the number of memory access instructions (LOADs and STOREs). So in extending x86 to 64 bits, AMD has also taken the opportunity to double the number of GPRs and SIMD registers available via the x86-64 ISA.

When running in 64-bit mode, x86-64 programmers will have access to eight additional GPRs, for a total of 16 GPRs. Furthermore, there are eight new SIMD registers, added for use in SSE/SSE2 code. So the number of GPRs and SIMD registers available to x86-64 programmers will go from eight each to sixteen each.

Switching modes

Full binary compatibility with existing x86 code, both 32-bit and older 16-bit flavors, is one of x86-64's greatest strengths. x86-64 accomplishes this using a nested series of modes. The first and least interesting mode is legacy mode. When in legacy mode, the processor functions exactly like a standard x86 CPU--it runs a 32-bit OS and 32-bit code exclusively, and none of x86-64's added capabilities are turned on.

In short, the Hammer in legacy mode will look like just another Athlon.

It's in the 64-bit long mode that things start to get interesting. To run application software in long mode you need a 64-bit OS. Long mode provides two sub-modes--64-bit mode and compatibility mode--in which the OS can run either x86-64 or vanilla x86 code.

So, legacy x86 code (both 32-bit and 16-bit) runs under a 64-bit OS in compatibility mode, and x86-64 code runs under a 64-bit OS in 64-bit mode. Only code running in long mode's 64-bit sub-mode can take advantage of all the new features of x86-64. Legacy x86 code running in long mode's compatibility sub-mode, for example, cannot see the extended parts of the registers, cannot use the eight extra registers, and is limited to the first 4GB of memory.

These modes are set on a per-segment basis by means of two bits in each code segment's descriptor. The chip examines these two bits so that it knows whether to treat a particular chunk of code as 32-bit or 64-bit.
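As a rough illustration, the following Python sketch decodes those two bits from a raw 8-byte code segment descriptor. The bit names and positions (L at bit 53, D at bit 54) are taken from AMD's published descriptor format rather than from the text above, the function name `code_segment_mode` is hypothetical, the two sample descriptor values are illustrative flat code segments, and the decoding only applies when the processor is already in long mode.

```python
L_BIT = 1 << 53  # "long" flag: segment holds 64-bit code
D_BIT = 1 << 54  # default operand size flag: 1 = 32-bit, 0 = 16-bit


def code_segment_mode(descriptor: int) -> str:
    """Classify a code segment descriptor, assuming long mode is active."""
    long_flag = bool(descriptor & L_BIT)
    default_32 = bool(descriptor & D_BIT)
    if long_flag and not default_32:
        return "64-bit mode"
    if not long_flag and default_32:
        return "compatibility mode (32-bit code)"
    if not long_flag and not default_32:
        return "compatibility mode (16-bit code)"
    return "reserved combination"


# Illustrative descriptor values that differ only in the L and D bits.
print(code_segment_mode(0x00AF9B000000FFFF))
print(code_segment_mode(0x00CF9B000000FFFF))
```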

I've already discussed how only the integer and address operations are really affected by the shift to 64 bits, so it makes sense that only those instructions would be affected by the change. If all the addresses are now 64-bit, then there's no need to change anything about the address instructions apart from their default pointer size. If a LOAD in 32-bit legacy mode takes a 32-bit address pointer, then a LOAD in 64-bit mode takes a 64-bit address pointer.

Integer instructions, on the other hand, are a different matter. You don't always need to use 64-bit integers, and there's no need to take up cache space and memory bandwidth with 64-bit integers if your application needs only smaller 32- or 16-bit ones. So it's not in the programmer's best interest to have the default integer size be 64 bits. Hence the default data size for integer instructions is 32 bits, and if you want to use a larger or smaller integer then you must add an optional prefix to the instruction that overrides the default. For 64-bit integers this is a new prefix, which AMD calls the REX prefix (presumably for "register extension") and which is one byte in length; 16-bit integers use the operand-size prefix that x86 already had. This means that 64-bit instructions are one byte longer, a fact that makes for slightly increased code sizes.
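To see the one-byte cost concretely, here's a small Python sketch that encodes a register-to-register ADD with and without the REX prefix. The REX layout used (the fixed bits 0100 followed by flag bits W, R, X and B) and the ADD opcode 0x01 come from the published instruction encoding rather than from the text above; the helper names `rex_prefix` and `encode_add_reg_reg` are hypothetical, and the sketch handles only this one instruction form.

```python
def rex_prefix(w: int = 0, r: int = 0, x: int = 0, b: int = 0) -> int:
    """Build a REX byte: 0100 followed by the W, R, X and B flag bits."""
    return 0x40 | (w << 3) | (r << 2) | (x << 1) | b


def encode_add_reg_reg(dst: int, src: int, wide: bool) -> bytes:
    """Encode ADD dst, src for register numbers 0-15 (opcode 0x01 /r)."""
    # ModRM: mod=11 (register direct), reg=src, rm=dst (low three bits each).
    modrm = 0xC0 | ((src & 7) << 3) | (dst & 7)
    # W selects 64-bit operands; R and B carry the fourth register-number bit.
    rex = rex_prefix(w=int(wide), r=src >> 3, b=dst >> 3)
    prefix = bytes([rex]) if rex != 0x40 else b""
    return prefix + bytes([0x01, modrm])


add32 = encode_add_reg_reg(dst=0, src=3, wide=False)  # add eax, ebx
add64 = encode_add_reg_reg(dst=0, src=3, wide=True)   # add rax, rbx
print(add32.hex(), add64.hex())
```

The same prefix byte is also what reaches the eight new registers, so a 32-bit operation that touches one of them pays the extra byte too.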

Increased code size is bad, because bigger code takes up more cache and more bandwidth. However, the effect of this prefix scheme on real-world code size will depend on the number of 64-bit integer instructions in a program's instruction mix. AMD estimates that the average increase in code size from x86 code to equivalent x86-64 code is less than 10%, mostly due to the prefixes.

It's essential to AMD's plans for x86-64 that there be no performance penalty for running in legacy or compatibility mode versus long mode. The two backwards compatibility modes don't give you the performance-enhancing benefits of x86-64 (specifically, more registers), but they don't incur any added overhead, either. A legacy 32-bit program simply ignores x86-64's added features, so they don't affect it one way or the other. AMD intends for x86-64 to be a straightforward upgrade from x86, and this means best-in-class 32-bit performance along with 64-bit support.

Old Stuff

In addition to fattening up the x86 ISA by increasing the number and sizes of its registers, x86-64 also slims it down by kicking out some of the older and less frequently used features that have been kept thus far in the name of backward compatibility.

When AMD's engineers started looking for legacy x86 features to jettison, the first thing to go was the segmented memory model. Programs written to the x86-64 ISA will use a flat, 64-bit virtual address space. Furthermore, legacy x86 applications running in long mode's compatibility sub-mode must run in protected mode. Support for real mode and virtual-8086 mode is absent in long mode and available only in legacy mode. This isn't too much of a hassle, though, since, except for a few fairly old legacy applications, modern x86 apps use protected mode.

If you'll recall our previous discussion of the benefits of increased dynamic range, you might've noticed that, with the possible exception of security/encryption, none of the application types that currently make use of 64-bit addresses and/or integers are really consumer-level applications. Rather, they're all "back-end" in some respect. So there really is no market for 64-bit consumer apps, yet. The main reason that there is no 64-bit consumer market is quite obvious: for all practical purposes, the market for consumer application software is the x86 software market, and x86 is 32-bit. Applications that need 64 bits are built for hardware like MIPS, SPARC, Power4, Alpha, etc., and people who want to run 64-bit apps buy this specialized hardware to run them on.

You might say, then, that x86-64 faces a chicken-and-egg problem: there are no consumer-level 64-bit applications because there is no consumer-level 64-bit hardware, and there is no consumer-level 64-bit hardware because there are no consumer-level 64-bit applications. Depending on who you listen to, AMD or their detractors, either the former or latter reason takes precedence. AMD hopes that the former reason is primary, and that if they build an affordable, backward compatible 64-bit x86 processor then the applications will come. Others, however, insist that there just aren't any good practical or theoretical reasons at this point for 64 bits on the desktop, and that making the hardware available won't magically create those reasons.

In some sense, both sides are justified in their thinking. If and when AMD makes consumer-level 64-bit hardware available at an attractive price/performance point, coders may come up with new, as-yet-unconceived-of consumer applications that make use of the hardware's expanded capabilities. People have ways of finding innovative uses for better hardware; witness the completely unexpected rise of the consumer 3D graphics chipset industry, or the entire application spaces made possible by increases in DRAM and mass storage capacities. AMD is no doubt hoping that a similar thing will happen with x86-64.

The "if you build it they will come" plan isn't so far-fetched, since it's what anyone who puts out a high-performance CPU for the consumer market (read: both AMD and Intel) is already counting on. What to do with all the CPU horsepower is by now a serious marketing problem, and Intel pours millions a year into software research in order to stimulate the development of mass-market applications that require something like a P4 2.8GHz. So AMD has good reason, based on solid historical precedent, to expect that by expanding the capabilities of the x86 ISA they'll be able to expand the ISA's reach to new market niches.

By some accounts, x86-64 is already making some inroads into a market segment that matters a great deal to anyone pushing a new PC technology: computer gamers. In a recent post to a Slashdot discussion of Intel's claims that consumers won't need 64 bits before the end of the decade, Epic Games's Unreal engine guru Tim Sweeney made the following comments, worth quoting in full:

Quote:

On a daily basis we're running into the Windows 2GB barrier with our next-generation content development and preprocessing tools.

If cost-effective, backwards-compatible 64-bit CPU's were available today, we'd buy them today. We need them today. It looks like we'll get them in April.

Any claim that "4GB is enough" or that address windowing extensions are a viable solution are just plain nuts. Do people really think programmers will re-adopt early 1990's bank-swapping technology?

Many of these upcoming Opteron motherboards have 16 DIMM slots; you can fill them with 8GB of RAM for $800 at today's pricewatch.com prices. This platform is going to be a godsend for anybody running serious workstation apps. It will beat other 64-bit workstation platforms (SPARC/PA-RISC/Itanium) in price/performance by a factor of 4X or more. The days of $4000 workstation and server CPU's are over, and those of $1000 CPU's are numbered.

Regarding this "far off" application compatibility, we've been running the 64-bit SuSE Linux distribution on Hammer for over 3 months. We're going to ship the 64-bit version of UT2003 at or before the consumer Athlon64 launch. And our next-generation engine won't just support 64-bit, but will basically REQUIRE it on the content-authoring side.

We tell Intel this all the time, begging and pleading for a cost-effective 64-bit desktop solution. Intel should be listening to customers and taking the leadership role on the 64-bit desktop transition, not making these ridiculous "end of the decade" statements to the press.

If the aim of this PR strategy is to protect the non-existant [sic] market for $4000 Itaniums from the soon-to-be massive market for cost-effective desktop 64-bit, it will fail very quickly.

-Tim Sweeney, Epic Games

I reproduced Tim Sweeney's Slashdot post because it points to the very real possibility that the amateur game modification community, which is a surprisingly large community that produces a huge amount of freely available content, will soon want 64-bit machines with which to create content for the next generation of games. I'm not suggesting that the mod community represents a large enough consumer market to make x86-64 a success, but I would certainly say that developments in the hard-core gaming enthusiast/early adopter community have a way of rippling out to affect the rest of the consumer market. Again, witness the rise of consumer 3D graphics, seeded by a single piece of software: GLQuake.

Even more relevant is the release of the new x86-64 port of the Counter-Strike server software. Counter-Strike (or CS, as it's commonly called) is far and away the most successful online shooter in recent memory, and the CS team claims a stunning 30% performance gain from porting it to x86-64 with no optimization. A significant portion of this gain probably comes from the benefits associated with x86-64's increased number of registers. The rest is from the Opteron's on-die DDR controller, large L2 cache and microarchitectural enhancements.

The success of the 3D graphics industry has shown that gamers can and will buy very expensive, dedicated hardware to enhance their gaming experience. If AMD can keep supplies up and prices down for x86-64 parts, then better-performing 64-bit versions of just a few popular games (or even one massively popular game) could go a long way toward driving x86-64's adoption in the marketplace.

A few final words about performance

Note that I attributed the CS performance increase to x86-64's larger number of registers, and not the increased register width. On applications that do not require the extended dynamic range afforded by larger integers (and this covers the vast majority of applications, including games), the only kind of performance increase that you can expect from a straight 64-bit port is whatever additional performance you get from having more memory available. As I said earlier, 64-bitness, by itself, doesn't really improve performance for anything but the rare 64-bit integer application. In the case of x86-64, it's the added registers and other changes that actually account for better performance on normal apps like games.

Should Apple move from 32-bit PPC to 64-bit PPC, Mac users should not expect the same kinds of ISA-related performance gains that x86 software sees when ported to x86-64. 64-bit PPC gives you larger integers and more memory, and that's about it. There are no extra registers, no cleaned up addressing scheme, etc., because the PPC ISA doesn't really need these kinds of revisions to bring it into the modern era.