Sunday, September 17, 2006

K8L and Penryn

It just amazes me to see how AMD fans keep on saying, "K8L will make Conroe look obsolete," or "Intel needs something a lot better than Conroe if it is to survive the K8L onslaught." That's funny, considering these same fanboys were saying that with Conroe, Intel would not be able to close the gap on K8.

Let us just look at some of the facts. In video encoding, at the same clock speed, Conroe beats K8 by about 30%. In games, if the GPU bottleneck is removed, Conroe beats K8 by up to 50%. So it is safe to say that Conroe is, in general, better than K8 by about 30%. Now, first we have to understand that K8L will go up against Penryn--Conroe's 45 nm cousin. Also, some time back, the Inquirer reported that Intel's 45 nm process is a giant leap over its 65 nm process as far as leakage is concerned. If this is indeed true, Conroe's 45 nm sibling, Penryn, would not have any trouble clocking at 4 GHz while operating in the same or a lower power envelope. Recent reviews of Kentsfield also show that the Conroe architecture is not hurting for memory bandwidth.

So what does all this mean? If Penryn clocks at 4 GHz, it will be about 33% faster than X6800. Now, even though Conroe performance scales almost linearly with clock speed, let us give it some headroom and assume that this additional 33% clock speed will give it only a 25% performance improvement. Considering that X6800 beats AMD's best by 30%, this means a top-end Penryn will beat AMD's best (as of today) by about 62% (1.30 x 1.25 = 1.625). Now that is huge!! Beating that is not going to be an easy task, for K8 or K8L.
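The arithmetic above can be checked with a short sketch. It only restates the post's own estimates (30% lead, 33% clock increase, 25% assumed gain); the variable and function names are made up for illustration, and none of the numbers are measurements.

```python
# Estimates quoted in the post; these are the author's guesses, not benchmark data.
conroe_lead = 0.30   # X6800's estimated lead over AMD's best
clock_gain = 0.33    # assumed clock increase of a 4 GHz Penryn over X6800
scaled_gain = 0.25   # discounted performance gain assumed from that clock increase


def combined_lead(lead_a: float, lead_b: float) -> float:
    """Compound two relative gains multiplicatively and return the net gain."""
    return (1 + lead_a) * (1 + lead_b) - 1


# Fraction of the extra clock speed that is assumed to turn into performance.
scaling_efficiency = scaled_gain / clock_gain

penryn_lead = combined_lead(conroe_lead, scaled_gain)

print(f"Assumed scaling efficiency: {scaling_efficiency:.0%}")
print(f"Projected Penryn lead over today's K8: {penryn_lead:.1%}")
```

The two gains multiply rather than add, which is why the result lands near 62% and not at 55%.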

AMD fanboys might say--K8L is a ground-up architecture built for performance, and it should not be too difficult for it to improve over K8 by 62%. Well, let us take a look at what K8L brings to the table.

First, improved SSE performance. Yawn!! It was time. Conroe beats AMD in pure SSE2 operations by 400%. Nothing that AMD does here is going to tilt the balance in the other direction.

Second, it adds load reordering. Another bit of a Yawn! Intel has been doing that for how many years now? Conroe also does load/store reordering. Nothing new here.

Third, it adds support for HT 3.0. Last I checked, K8 was not really hurting for memory bandwidth. So this is not expected to add too much to real performance. Extremely good for bragging rights, but from the end user's perspective, another yawn.

Fourth, it adds dual-ported L1 cache. AMD might be onto something with this. But again, we do not know what it really means. In fact, personally I do not expect it to add a lot. How many workloads today are really hurting for cache bandwidth? I would say, not many. Definitely not the 32-bit ones. At an IPC of about 2, and at most two source operands per instruction, you do not need more than 16 bytes of reads per clock cycle. For 64-bit workloads, it *MAY* mean something. But again, until we see the effect on real benchmarks, I wouldn't start jumping up and down. Add to that the fact that K8L won't be reordering loads and stores. That means, if there are stores in front of loads, the second cache port could be just sitting there, doing nothing. Actually, that is not entirely true: the second cache port will be adding load to the cache, increasing power consumption. In short, this seems like a good feature, but unless there is evidence to the contrary, I wouldn't count too much on it.
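The 16-byte figure is a back-of-the-envelope bound, and a minimal sketch makes the assumptions explicit. The function name is hypothetical, and the model assumes the worst case where every source operand is read from the L1 cache rather than from a register, which real code rarely does.

```python
def load_bytes_per_cycle(ipc: float, operands_per_instr: int, operand_bytes: int) -> float:
    """Worst-case L1 read traffic per clock, assuming every source operand is a cache read."""
    return ipc * operands_per_instr * operand_bytes


# Figures from the post: IPC of about 2, at most two source operands per instruction.
need_32bit = load_bytes_per_cycle(ipc=2, operands_per_instr=2, operand_bytes=4)
need_64bit = load_bytes_per_cycle(ipc=2, operands_per_instr=2, operand_bytes=8)

print(f"32-bit operands: {need_32bit:.0f} bytes per clock")
print(f"64-bit operands: {need_64bit:.0f} bytes per clock")
```

Under the same assumptions the bound doubles for 64-bit operands, which is where the extra cache port might start to matter.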

Fifth: improved floating point performance. Well, AMD has always been strong in this area, and when it comes to editing Word documents, unzipping files, compressing videos, and playing games, it means zilch!! Universities and NASA care about it--implying 99% of their computers will be from AMD instead of 95%. But again, it is a performance improvement tailored for a very niche market.

Finally, they add shared L3 cache. Time they did so, wouldn't you say? Whatever happened to all those claims about cache thrashing, makes you wonder...

You might say, this is a very one-sided view. If Intel managed such a huge performance improvement from P4 to Core 2, why can't AMD do the same from K8 to K8L?

First, Intel did not improve performance going from P4 to Core 2; they improved power. You can take P4, cool it with liquid nitrogen, clock it at 6 GHz, and it will beat the crap out of any CPU on the planet. The problem is, no one wants to do that, and hence Intel has to clock it below 4 GHz. Unfortunately, P4 was designed to be performant at 5+ GHz. Realizing this mistake, what Intel did with Core 2 is come up with an architecture that works great in the sub-5 GHz range. No matter what you do to Core 2, you cannot clock it beyond 5 GHz. If you take a 5 GHz Conroe and a 7 GHz P4, they will probably be pretty close in performance. However, since no average user clocks in that range, P4 seems far inferior to Core 2.

In short, Intel saw the huge performance improvement going from P4 to Core 2 because they changed the philosophy. I wouldn't expect the same type of performance jump when Intel goes to Nehalem, for example. And that is why I do not expect such a huge performance jump going from K8 to K8L--after all, there is no huge change in philosophy. K8L will definitely be better. Will there be a wow factor? Hard to tell at this point. However, considering that Intel's transition to Core 2 has already set expectations very high, I think K8L will be a huge disappointment.

12 comments:

Anonymous said...

So how much does the HyperTransport Link's speed really affect an overclocked system's performance?

After reading your post I went to [H] and found this.

It is just more evidence about the lack of help HT 3.0 will bring K8L.

"Mad Mod" Mike said...

I'm sorry enumae, but that link is stupid. First off, it runs a hard drive benchmark -- please, 50MB/s MAX is not going to affect HyperTransport @ 100MHz, that is not a test.

Then they run a RAM test -- do you realise that doesn't even USE the HT bus?

The next is 3DMark01 -- please, this isn't 2001, 1x HT is plenty for that slow ass thing. SuperPI? Come on man, don't insult our intelligence. CPUMark? Dude, wtf?

Please, if you want to diss HyperTransport, show me better than this -- none of those tests stress HyperTransport, not to mention the RAM DOESN'T EVEN USE HYPERTRANSPORT.

HyperTransport 3.0 isn't to help in CPU to NB enumae, stop thinking it is. HyperTransport 3.0 will rape the 2P+ servers b/c of INTER-COHERENCY, that is what it is for.

Anonymous said...

MMM said...

"HyperTransport 3.0 isn't to help in CPU to NB enumae, stop thinking it is."

MMM, I will admit I do not fully understand HT.

I really need to research it so as to not make this mistake again.

core2dude had made a statement about HT and how AMD has plenty of bandwidth, so I may have misinterpreted the use of HT3.

"HyperTransport 3.0 will rape the 2P+ servers b/c of INTER-COHERENCY, that is what it is for."

If HT3 is for 2P+, how does it benefit the desktop?

And why do you and others keep stating HT3 is going to help when comparing K8L to Conroe which is on the desktop front?

That is what made me believe the article.

"Mad Mod" Mike said...

"And why do you and others keep stating HT3 is going to help when comparing K8L to Conroe which is on the desktop front?"

I've never said that HT 3.0 is going to help on Desktop CPUs with 1 processor. If I did once, it was b/c I was pissed and said shit. HT3.0 is for 4x4 & 2P+

Joshua said...

Core2Dood, first off, your blog is a bit too new and untrusted to blow shit around. Second, C2D is Intel's cream of the crop and Kentsfield will cost SUPER DUPER EXPeNSIvE

core2dude said...

Mike, you are right, those benchmarks are plain stupid. But again, is there *ANY* proof that current HT is hurting for bandwidth or latency? For example, in TPC-C, on a DP system, only 30% of cache-miss references are directed towards the remote memory. Thus I seriously doubt that a souped-up HT will bring anything even to DP. For MP, it might make some difference (again, I stress *might*; we do not know this, it is just speculation).

But then again, today's DP is yesterday's 4P, and tomorrow's DP will be yesterday's 8P. Essentially, all this multi-core is making MP irrelevant.

Will K8L beat Penryn? I do not have a crystal ball, and I cannot answer that. No one can, not even AMD. All that they have at this point is K8L simulators that take 1 night to crunch 20 lines of assembly code--far too inefficient to run any real-world benchmarks. But my guess is, on desktop it would be very difficult to beat Penryn. On 2P, again we will see something similar to Woodcrest and K8, though it *could* be closer (considering Penryn will have higher frequency and most likely a souped-up bus). On MP--well, I don't know. There is no Conroe-based MP product available today, so everything remains to be seen. All we have is Kentsfield, and it is clearly demonstrating that at least on desktop-type benchmarks, it is not hurting for bandwidth (no difference whatsoever due to 1333 FSB vs 1066 FSB).

core2dude said...

Joshua, I have a slight bias towards Intel, but I am not against logical reasoning. As you might have noticed, I do not censor comments on this blog--your comment gets published as soon as you post it. And I won't change that unless this blog starts getting spammed. You can say whatever you want--most people visiting this website are adults. However, I would like to keep the discussions civil, and would really love it if people did not insult each other. The very fact that people are talking about next-generation architectures means that they have more intelligence than 90% of the population. Just keep that in mind.

"Mad Mod" Mike said...

Alright Mr. Core 2 Duo obsessed person, let me lay down some fact-o-la's.

I put together some info that explains why HT 3.0 is so huge for Opteron servers.

http://www.rubyworks.net/images/HT3.jpg

core2dude said...

Thanks Mike for this info. It sure does look impressive. But I repeat, I would like to see what it means when it comes to real benchmarks.

I always thought current AMD 4P systems were fully connected. Why are the latencies so bad?

"Mad Mod" Mike said...

"I always thought current AMD 4P systems were fully connected. Why are the latencies so bad?"

Check out this diagram Link

You should easily see, even without benchmarks, that LOGICALLY HT 3.0 will provide a nice increase in performance.

Unless you're so Anti-AMD that you think they would mess this up (which is wrong and stupid to think) you should be giddy as a school girl as I am over this HUGE improvement.

Anonymous said...

http://www.linuxhardware.org/article.pl?sid=06/08/22/0415251
Yes K8 'can' beat Core 2 Duo. Of course a FX62 could beat an E6300..

"Intel not only has the fastest chip in their top processor.."

"..they even take the performance lead at their !second! tier chip in six out of seven of our benchmarks. As well as being the fastest thing on the market, it also runs neck and neck with AMD in the heat generation and power consumption race."

Scientia from AMDZone said...

It may be that we have something in common. I tend to write lengthy articles. The article on 2002 was originally 6400 words before I cut it down and broke it into two parts. My blog is also about computer technology, the markets, AMD, and Intel.

The actual comparison between K8L and Conroe is a bit complex. K8L as native quad core should do quite well against Kentsfield until Intel releases a native quad design. For the dual core version, I expect K8L to be a bit faster than Conroe in terms of FP/SSE but about halfway between current K8 and Conroe in terms of Integer. K8L doubles the cache bus just as Conroe did. Conroe greatly improved the SSE performance over Yonah. K8L should do more than that. I wouldn't expect this to make C2D look obsolete although I suppose it could make Tulsa look more obsolete.

The Kentsfield tests were not very good in terms of checking for memory bandwidth. There really is no way to get around the common sense fact that each core on Kentsfield has less bandwidth than each core on Conroe. I also haven't seen any good stress tests. The limited tests performed over at Tom's Hardware Guide were worthless in terms of multitasking. You simply cannot use a benchmark that is I/O intensive to stress the second core. Even SuperPi would have been better.

I wouldn't count on there being 4.0 GHz Penryns in 2007. And, if indeed Penryn is just a die shrink of Kentsfield, then its only gain will be in power draw. Considering too that AMD's 65nm process should be a bit better than Intel's, I'm not sure this would be that much of a problem for AMD.

I would say that X6800 beats FX-62 by closer to 25% rather than 33%. I put FX-62 about equal to E6600.

Theoretically, K8 can do one 128 bit SSE computation in one clock or two 64 bit SSE computations. K8L is supposed to have wider computation units that can do two 128 bit computations in one clock. However, it also has a second FP pipeline, so it should have about four times the performance in SSE.

They should be expanding the size of the reorder buffer on K8L. I'm not expecting this to catch Conroe in terms of integer.

HT 3.0 and the Direct Connect Architecture 2.0 in 2008 are for servers. AMD works really well in 4-way but it has much more latency in 8-way because of the extra hop. Adding a 4th HT link will give the 2008 Opteron direct link capability for 8-way. They also include the option of using 16 bit HT links as two 8 bit links, so this increases flexibility again. The HT speed increase is necessary to keep up with the increase in processors and memory controller speed. This should mean that Opterons will be the most cost efficient systems in 4-way, 8-way, and possibly 16-way in 2008. This isn't really competing with Woodcrest; it is really to compete with Itanium.

The dual-ported L1 on K8L is nothing special, really; it mostly just doubles the bandwidth to cache, which is what Intel did with Conroe over Yonah. You need this to keep the SSE pipe full. The prefetch increases to 32 bytes, which will give a small boost in speed due to instruction boundaries. However, AMD won't get any great increase until it does something to the instruction decoders.

I'm not sure why you mention cache thrashing. Cache thrashing is why THG avoids doing real stress tests in its benchmarks. Neither K8 nor K8L has a problem with this because of the independent L2 caches. However sometimes C2D gets an advantage because of sharing. With a shared L3, K8L should get a similar advantage and have a very balanced cache design.

There is no similarity between P4 and C2D. The closest processor to C2D is Yonah. C2D has many improvements over Yonah. I've seen people over at THG try to state the lineage of C2D and K8L but I've never seen one person even get the direct lineage correct. This would be something like:

K6->Thunderbird->Athlon MP->Barton Athlon MP->K8->K8L

PIII->Banias->Dothan->Yonah->C2D

The actual lineage is much more complex.