IBM ~ Performance Technical Report
TPC-W™ xSeries Benchmark – Lessons Learned
Discussion of the impact of configuration changes made between two
successive TPC-W™ publications using DB2 UDB EE and NUMA technology for
the database server. Performance counter data reveals interesting
information influencing the benchmark configuration.
Mary Edie Meredith, Mark Wong, Basker Shanmugam, Russell Clapp
503-578-4273, T/L 775-4273
January, 2002
©2002 International Business Machines Corporation, all rights reserved
Abstract
In May of 2001, IBM® published a leading Transaction Processing Performance Council
Benchmark™ W (TPC-W) result with DB2 Universal Database on the x430 system. A white
paper entitled “TPC-W Benchmark for IBM xSeries Servers - Leadership e-business
Performance with IBM xSeries and DB2 Universal Database” documented the configuration and
results of that effort. This follow-on paper describes how the configuration evolved and what
issues motivated changes from an earlier publication on an E410 system to a publication on an
x430. We present hardware performance counter data collected on the database server for each
publication and describe what the data revealed about the TPC-W workload characteristics in a
NUMA system environment.
Overview
The Transaction Processing Performance Council (TPC) is a standards body whose mission is to
provide transaction processing and database benchmark guidelines in the form of specifications.
The TPC developed the benchmark that is the focus of this paper - the TPC Benchmark W
(TPC-W). In February, 2001, and again in May, 2001, IBM published the first and second
TPC-W results recorded at the 100,000 scale factor. These results were also the first published
TPC-W results for DB2®.
This paper describes the elements of the TPC-W workload and shows how they can be
distributed across multiple physical systems. We demonstrate how this was done for the first
publication and describe some of the issues that determined the configuration. We explain what
we learned after the first publication about the database performance using hardware counters
collected by some internal tools. Finally, we show the second configuration, noting the changes
that occurred and their effects on performance and cost.
Brief Introduction to the TPC-W
The TPC-W specification, approved in February, 2000, represents an e-commerce workload
simulating an Internet environment where a user browses and places orders on a retail (in this
case a book store) web site, which we will call the “TPC-W web site”. Fourteen “interactions”
(complete web page generation and delivery to a browser) are specified. A list of these web
interactions is shown in Table 1.
Table 1. TPC-W Web Interactions

| Web Interaction | Notations |
|---|---|
| New Product | Can be cached |
| Best Seller | |
| Product Detail | Can be cached |
| Search Request | |
| Search Results | Subject, Author and Title can be cached |
| Shopping Cart | |
| Customer Registration | Static Page (all others are dynamic) |
| Buy Request | SSL Encryption |
| Buy Confirmation | |
| Order Inquiry | |
| Order Display | SSL Encryption |
| Admin Request | |
| Admin Confirmation | |

All customer and item information used to dynamically generate web pages is stored in a
database. Benchmark sponsors are required to report many performance, configuration, and
pricing details about the solution tested. The primary metrics are:
- WIPS, the number of Web interactions per second supported by the proposed solution, a
  performance indicator
- the system cost per WIPS, a price-performance indicator
When sponsors run the TPC-W, they must do so at a given “Scale Factor”. The Scale Factor
determines how many product items must be supported by the database, and this (along with the
number of customers) determines the size of the database. Valid Scale Factors are 1,000, 10,000,
100,000, 1,000,000, and 10,000,000. The results of a test run at two different scale factors are
not comparable. The metrics are always given with a scale factor indication, for example,
WIPS@100,000 or Dollars per WIPS@100,000.
The mix of interactions used for measuring WIPS is known as the “shopping mix” and represents
a particular user profile. This profile is characterized by a mix of browsing and ordering
transactions resulting in a mix of read, update, and insert database activity. Two other user
profiles are measured and reported as secondary metrics: the browsing mix, reported as
WIPSb@scalefactor, and the ordering mix, reported as WIPSo@scalefactor. The browsing mix
has a high percentage of read-only interactions, whereas the ordering mix has a high percentage
of database modifications (inserts and updates).
The TPC web site has much more detail on the benchmark, including the benchmark
specification. Other papers, for example “Benchmarking An E-commerce Solution” [1],
provide summary overviews of the benchmark specification.
IBM Results
Table 2 gives an overview of the results from the two IBM publications at scale factor 100,000.
Table 2. IBM Results

| Report Date | 2/2/2001 | 5/1/2001 |
|---|---|---|
| Database System | E410 | x430 |
| WIPS | 6,272.5@100,000 | 7,554.7@100,000 |
| $/WIPS | $195.95@100,000 | $136.80@100,000 |
| WIPSb | 5,755.7@100,000 | 6,104.9@100,000 |
| WIPSo | 3,193.4@100,000 | 2,777.3@100,000 |
| Number of Users | 50,000 | 55,000 |
| TPC-W spec | Version 1.2.1 | Version 1.5 |

From the combination of changes made between the two publications, WIPS increased
approximately 20% (from 6,272.5 to 7,554.7), while the price performance improved roughly 43%
(from $195.95 to $136.80 per WIPS). The number of
browsers (users) increased to generate the increased WIPS. Although the specification changed
slightly between the two publications, the two results are still comparable and specification
changes did not impact the configuration. The article “TPC-W Benchmark for IBM e-server
xSeries Servers” [2] provides many more details regarding the results of the final benchmark.
Executive summaries and FDRs (Full Disclosure Reports) are still available on the TPC web site.
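
The percentages quoted above can be checked directly against Table 2. The short sketch below is only an illustration of that arithmetic; the variable and function names are ours, and the "43%" figure is reproduced under the assumption that price/performance improvement is measured as the ratio of old cost to new cost.

```python
# Figures copied from Table 2 (E410 vs. x430, scale factor 100,000).
e410 = {"wips": 6272.5, "dollars_per_wips": 195.95}
x430 = {"wips": 7554.7, "dollars_per_wips": 136.80}


def percent_gain(old, new):
    """Percentage increase of `new` relative to `old`."""
    return (new / old - 1.0) * 100.0


# Throughput: higher is better, so compare new WIPS to old WIPS.
wips_gain = percent_gain(e410["wips"], x430["wips"])

# Price/performance: lower $/WIPS is better, so compare old cost to new cost
# (assumed interpretation of "improved roughly 43%").
price_perf_gain = percent_gain(x430["dollars_per_wips"], e410["dollars_per_wips"])

# The same change expressed as a reduction in dollars per WIPS.
cost_reduction = (1.0 - x430["dollars_per_wips"] / e410["dollars_per_wips"]) * 100.0

print(f"WIPS gain:              {wips_gain:.1f}%")
print(f"Price/performance gain: {price_perf_gain:.1f}%")
print(f"$/WIPS reduction:       {cost_reduction:.1f}%")
```

Note that the same change reads as a smaller number (about 30%) when expressed as a reduction in dollars per WIPS, so it matters which convention a comparison uses.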
The TPC-W Workload
The TPC-W workload, as defined by the specification, has many elements (refer to Figure 1) but
can be grouped into three parts:
- Web site activities
- Internet Workload Emulation
- Services between Workload Emulation and the web site
Activities in the area marked as the System Under Test (SUT) represent the combination of
hardware and software that is the proposed solution offered by the benchmark sponsors.
Activities performed outside of the SUT are not included in the price/performance calculations,
unless the same resources are used by the SUT.
The Internet Workload Emulation consists of two elements - the Remote Browser Emulation
(RBE), and the Payment Gateway Emulation (PGE), which emulates credit card authorization.
Figure 1. Functional Elements of a TPC-W Workload.
[Figure not reproduced. It shows three groups of elements: the Internet Workload Emulation
(Remote Browser Emulator, RBE, and Payment Gateway Emulator, PGE); the services between the
Emulated Browsers (EB) and the web site (security and load balancing, both optional); and the
web site itself, which is the SUT. The web site elements are caching objects, text search, HTML
text, static pages, GIF images, JPEG images, dynamic pages, cache misses, optional SSL security,
optional load balancing, and database access, each labeled with the server type that provides it
(Web Svr, Web Cache, Img Svr, App Svr, or DB Svr).]
The RBE emulates HTTP network traffic that would be generated by a user browsing a retail
web site. The RBE simulates many Emulated Browsers (EBs). The total number of
browsers required goes up as the metric achieved increases (Number of browsers/14 < WIPS <
Number of browsers/7). The RBE is typically run on several physical machines.
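
The browser-count constraint above is easy to check numerically. The following sketch is an illustration only (the function names are ours, not taken from the specification); it applies the inequality to the user counts and WIPS values reported in Table 2.

```python
def wips_bounds(num_browsers):
    """Return the (lower, upper) WIPS limits implied by a browser count."""
    return num_browsers / 14.0, num_browsers / 7.0


def within_bounds(num_browsers, wips):
    """True if browsers/14 < WIPS < browsers/7."""
    lower, upper = wips_bounds(num_browsers)
    return lower < wips < upper


# User counts and WIPS from the two IBM publications (Table 2).
for browsers, wips in [(50_000, 6272.5), (55_000, 7554.7)]:
    low, high = wips_bounds(browsers)
    ok = within_bounds(browsers, wips)
    print(f"{browsers} browsers: {low:.1f} < {wips} < {high:.1f} -> {ok}")
```

Under this constraint the second result could not have been reported with the first publication's 50,000 browsers, since 50,000/7 is below 7,554.7. This is consistent with the increase to 55,000 users.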
The TPC-W web site has 14 different pages (refer to Table 1). Six are browsing transactions
while eight are order transactions. All pages are dynamically generated except the customer
registration page. The RBE selects traversals through the site as defined by the spec for a given
user profile (shopping, browsing, or ordering). The spec requires there to be one network
connection to the Web Site per EB. The RBE records response time and tracks the number of
interactions completed over time.
The Payment Gateway Emulation (PGE) responds to credit card authorization requests
made by the web application. The spec requires Secure Socket Layer (SSL) v3.0 or higher to be
used for encrypting data over the "Internet" from the web site to the PGE, requiring 1 SSL
handshake per 100 transmitted messages.
Four of the web pages require SSL encryption (see Table 1). Test sponsors can support SSL in
one of three ways:
- As a separate SSL server that handles all secure connections (a.k.a. Security Proxy)
- From within SSL network cards (i.e. embedded in hardware)
- Embedded in the Web Server software
Included in the SUT, or (optionally) outside of the web site, is the functionality needed to load
balance connections from RBE to the SUT. If the SUT could be run on a single physical
machine, there would be no need for this element. In practice, the web site is implemented
across many physical systems (we discuss why in the next section). Since the browsers only know
to connect to one site URL, load balancing multiple EB connections across multiple web servers
in the web site is a necessity. Load balancing can be handled on a separate system, can be
determined within the web site, or can be implemented on a network switch that connects the
“Internet” to the SUT.
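
To make the load balancing requirement concrete, the sketch below spreads emulated browser connections across several web servers in rotating order. It is a minimal illustration, not the mechanism used in the benchmark: the round-robin policy, the host names, and the server and browser counts are all assumptions chosen for the example.

```python
from collections import Counter
from itertools import cycle


def assign_round_robin(browser_ids, web_servers):
    """Map each emulated browser to a web server in rotating order."""
    rotation = cycle(web_servers)
    return {eb: next(rotation) for eb in browser_ids}


# Hypothetical host names; ten servers and 50,000 browsers are example values.
web_servers = [f"websrv{n:02d}" for n in range(1, 11)]
assignment = assign_round_robin(range(50_000), web_servers)

# Count how many browser connections each server ends up holding.
load = Counter(assignment.values())
for server in web_servers:
    print(server, load[server])
```

Because each EB holds one connection to the web site, an even assignment of browsers gives an even number of connections per server, although it does not by itself guarantee an even processing load.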
Within the web site itself, there are many elements familiar to the typical e-commerce site. An
emulated browser maintains a network connection to an HTTP server, which serves up each page
to the EB. The web site is expected to handle text searches, static pages, gif and jpeg images, and
dynamic pages. The web site may need to provide SSL encryption or load balancing as
previously discussed.
Dynamic web pages are generated by an application, and the application server services the
execution of those pages. The application uses a database to query or store information
such as an order request or product information. These requests are serviced by a database
server. If the application server is physically on a separate system from the HTTP server, then
the HTTP server utilizes network connections to that application server to make requests
(indicated by the connecting lines). Similarly, if an application server resides on a separate system
from a database server, it also utilizes network connections to the database server to access the
database.
The TPC-W spec defines what objects can be cached. Web caching can dramatically reduce
overall processing requirements, so the TPC-W web site utilizes web caching services. The
cached objects need to be refreshed to reflect changes in the database. The web cache uses the
services of a web server (denoted as the cache miss web server), which in turn relays the refresh
request to an application server on down to the database server.
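
The refresh path described above can be sketched as a small cache that falls back to a miss handler when an object is absent or stale. This is an illustration under stated assumptions, not the web cache product's behavior: the time-based refresh policy, the `WebCache` class, the `render_from_database` placeholder, and the URL are all hypothetical.

```python
import time


class WebCache:
    """Toy cache that refreshes stale objects through a cache-miss handler."""

    def __init__(self, fetch_on_miss, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch_on_miss   # stands in for the cache miss web server
        self._ttl = ttl_seconds       # assumed refresh interval
        self._clock = clock
        self._store = {}              # url -> (page, time_cached)
        self.misses = 0

    def get(self, url):
        entry = self._store.get(url)
        now = self._clock()
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]           # served from cache, no database work
        self.misses += 1
        page = self._fetch(url)       # web server -> app server -> database
        self._store[url] = (page, now)
        return page


def render_from_database(url):
    # Placeholder for the application server querying the database.
    return f"<html>content for {url}</html>"


cache = WebCache(render_from_database, ttl_seconds=30.0)
cache.get("/new_products?subject=ARTS")
cache.get("/new_products?subject=ARTS")   # second request is a cache hit
print("database round trips:", cache.misses)
```

Only the misses reach the application and database servers, which is why caching can reduce the processing required behind the web cache.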

The cache miss web server, the application server, and the database server are the only services
in the web site not accessed directly by users. Any network activity between these elements is
considered part of the SUT and will be priced as part of the solution. Thus network connections
for these activities are typically isolated for a benchmark configuration.

All the elements pictured within the web site in Figure 1 are activities that themselves represent
different workloads, and therefore they are candidates for distributing across separate physical
systems.
Tuning Issues
There are many sizing and tuning challenges facing test sponsors. Some issues result from
requirements imposed by the specification intended to represent real life challenges. As noted
earlier, the number of browsers must increase with the target WIPS performance. The database
size depends on the scale factor (fixed for a benchmark) but also grows with the number of
browsers (transaction rate). Therefore database configuration and tuning parameters must be
adjusted as performance improves.
Many SUT elements (e.g., the text search engine) must be commercially available software
requiring evaluation, selection, and tuning. Although the spec allows certain objects to be
cached, caching an object does not necessarily improve performance, so much tuning effort is
spent on determining what to cache and how to cache it.
Distributed System Solution
Test sponsors are judged by sustained performance and the cost of that performance. Any
underutilized resource represents unnecessary cost. So if at any point in time the CPU utilization
were to drop to 50%, then roughly half the cost of the system would be wasted. To be competitive,
resource utilization must be optimized by tuning. Rather than consolidate all elements on a
single system, it is more straightforward to isolate the individual pieces (web services, database
services, the web application, caching services, encryption services, and so forth) on separate
systems and tune them individually.
Even if a test sponsor wanted to offer a single-system solution, the software supporting two
workloads often requires different OS releases or system tuning parameters for optimal
performance. Some software, unfortunately, does not scale beyond a few processors, limiting
software choices for single system environments. This is usually not a problem for database
software, but it certainly can be a problem for other software, as we discovered.
All TPC-W solutions to date implement the SUT across multiple physical systems. This leads to
a myriad of tuning options that must be considered:
- how to divide the workload across multiple systems and implement load balancing to
  optimize performance
- how to configure and balance the network traffic in this distributed environment
- how to do this in a cost effective manner to maintain good price/performance
Configurations that on paper might seem superior to others may prove inferior. Products can
yield surprising limitations. The number of elements to be tuned presents a challenge, as a
change in one can impact the performance of many others.
Overall, test sponsors take the same approach: get the most possible out of the database server. To
do this, they
- remove any bottlenecks in the networking, hardware, and software activities that are
  driving the database server
- remove any unnecessary database activities
- tune the database server for the highest possible WIPS load (footnote 1)
- do all of the above in a cost effective manner

Footnote 1. Since one of the other workloads (orders) used for the secondary metric, WIPSo, is more
compute intensive than the primary workload, some headroom must be reserved. The
configured system also had to survive an overload condition of three times the normal load as
part of the audit, a requirement which has been removed in recent TPC-W spec versions.

How the Workload is Distributed
Recall the list of web site activities shown in Figure 1. There are several ways to distribute
the workload for the web site:
1. Spreading one activity across multiple systems, for example, having multiple web caching
servers or dedicated machines to handle jpeg images
2. Dividing the workload by web page type (all four SSL pages directed to an SSL server)
3. Combining workloads that do not utilize a system completely or that perform more
effectively together than apart
4. Any combination of the above
The web page itself is used to direct any activity not performed on the system servicing the
emulated browser. Figure 2 shows an example of application services and web caching running
on separate systems from the web server. The EB holds a connection to the web server, which
delivers a web page. Links on the web page point to the web cache, to an image server, and to
locations within the web server itself. The browser knows to complete the rest of the interaction
by connecting to the other systems.
Figure 2. Splitting activities across physical machines.
[Figure not reproduced. It shows an Emulated Browser on the RBE receiving a web page from the
Web Server. The page carries three kinds of links: (1) find this on the Web Cache server,
(2) find this on the Image Server, (3) find this on this server.]
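
The idea in Figure 2 can be sketched as a page whose links name the system that should serve each part. The example below is illustrative only: the host names, URL paths, and the `build_product_page` function are hypothetical and are not taken from the benchmark implementation.

```python
from html import escape

# Hypothetical host names for the separate systems shown in Figure 2.
WEB_CACHE = "http://webcache.example"
IMAGE_SERVER = "http://imgsrv.example"


def build_product_page(item_id):
    """Return HTML whose links tell the browser where to fetch each part."""
    item = escape(str(item_id))
    return "\n".join([
        "<html><body>",
        # 1: cacheable content is fetched from the web cache server
        f'<a href="{WEB_CACHE}/new_products?item={item}">New products</a>',
        # 2: images are fetched from the image server
        f'<img src="{IMAGE_SERVER}/images/item_{item}.jpg">',
        # 3: a relative link stays on the web server that sent this page
        f'<a href="/shopping_cart?item={item}">Add to cart</a>',
        "</body></html>",
    ])


page = build_product_page(42)
print(page)
```

Because the browser follows each link itself, the web server never has to contact the cache or the image server on the browser's behalf.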
The communication path in Figure 2 always remains between the EB and the individual systems
(and not between the systems in the SUT). So even though the SUT systems may be physically
on the same network, as long as they do not communicate directly with each other, the cost of
that network support is considered to be outside of the SUT. As discussed earlier, connections
between SUT systems do occur between HTTP servers and application servers, application
servers and database server, web caches and the web cache miss server. Any part of the network
used to make those connections is included in the cost of the SUT. Care must be taken such that
workload distribution does not negatively impact the cost due to network utilization. In other
words, the networking overhead incurred in separating these elements cannot exceed the benefit
of separating them.
As an example of a distributed workload, we next examine the configuration used in the first
publication.
Configuration for TPC-W on E410
Figure 3 illustrates the benchmark environment of the first publication using an E410 database
server. Each block represents a separate system and is labeled with its function. Those systems
that qualify as part of the SUT according to the spec appear in the box denoted as SUT.
Benchmarkers utilized a design methodology described in “Designing and Sizing Large
E-Commerce Web Sites” by Burzin Patel and Bhagyam Moses [4]. An in-depth analysis of the
workload led to decisions on how to distribute functionality. Compute intensive functions were
split apart, and others were combined. Given the workload and target performance, initial
estimates of system and network resources were made from purely mathematical calculations.
Mini-configurations designed to maximize the workload for a particular component with the
fewest front-end resources were used to confirm calculations or determine unknowns.
For the RBEs and PGEs (the Internet Workload Emulation described in Figure 1), we used a
mix of Netfinity® 5000 systems (which supported 3,000 EBs each) and 4-processor NUMA-Q®
Intel Centurion based 550 MHz systems (which supported 4,000 EBs each). These devices ran
custom code developed by IBM to comply with the TPC-W specifications described earlier.
Connections from the RBE/PGE to the SUT were made using a Cisco 6509 Gigabit Switch
(1 Gbit/sec TCP/IP Ethernet network) that was capable of handling the high volume of HTTP
and image traffic generated by the interactions at the target WIPS. From the diagram in Figure
3, one can see that most of the systems within the SUT also connect to the switch. Care was
taken that none of these systems use the switch to communicate directly with each other, so the
switch is considered outside of the SUT, and therefore is not priced in the solution.
Figure 3. TPC-W first 100K configuration, 1/2001.
[Figure not reproduced. It shows the RBE and PGE systems outside the SUT and, inside the box
marked SUT, the Web Srv, Img Srv, Web Cache, App Srv, and DB Srv systems together with the
100Mb hubs.]
We chose to locate elements of security and load balancing within the web site, rather than
outside of the Web Site (see Figure 1). Therefore, what is marked as the SUT in Figure 3
corresponds to the elements labeled “Web Site” in Figure 1. Details of equipment, operating
system and software are summarized in Table 3, which should be used as a reference in the
remaining discussion as we review the distribution of the web site workload.
To understand the web site configuration, one must understand the goal of the benchmark team:
demonstrate xSeries DB2 E-Commerce capability for database workloads executing on more than 8
processors. That is why the 100,000 scale factor was chosen as the target. Anything smaller
would not have required such a large database server.
At that time NUMA-Q E410 systems were the xSeries high-end database servers capable of
achieving this goal. The operating system for the E410 was DYNIX/ptx®, a flavor of UNIX®.
No HTTP server was supported on DYNIX/ptx. Since the other xSeries based
web site elements required Windows 2000, this in itself required that the database workload be
split from the other services. The database server system is pictured in Figure 3 as “DB Srv”.
Table 3. Implementation Details for First E410 Benchmark

| Role | Function | Quantity | Name | Type | OS | Software |
|---|---|---|---|---|---|---|
| Web Srv | Base Web Services, static pages (and other as noted) | 10 | Netfinity 4500R | 2 Pentium III 933 MHz/256 K, 512 MB, 1x18.2 GB | W2k | MS IIS (Internet Information Server) 5.0 |
| App Srv | Application Server | 3 | Netfinity 4500R | 512 MB, 1x18.2 GB | W2k | Custom C code |
| Img Srv | Image Server (jpeg) | 5 | Netfinity 4500R | 2 Pentium III 933 MHz/256 K, 2304 MB, 3x18.2 GB | W2k | MS IIS (Internet Information Server) 5.0 |
| Web Cache (aka ISA Server) | Web Cache Service | 2 | Netfinity 6000R | 2 Pentium III Xeon 700 MHz/1 MB, 4 GB, 3x18.2 GB | W2k | MS Internet Security and Acceleration Server (ISA), Cypress Auto Task 2000 |
| Web Server Cache Miss | Web Server used by Web Cache servers | na | na | Run on one of the Web Caching machines | na | Same as Application Server |
| DB Srv | Database Server | 1 | NUMA-Q E410 | 12 Pentium III Xeon 700 MHz/2 MB, 15 GB, 10x18 GB and 20x36 GB on HDS 5800 30-slot, 512 MB cache, 2 CU’s, 2 FC ports | ptx 4.5.1 | DB2 UDB EE V7.1 |
| Hub | Network (SUT) | 3 | NETGEAR RS108 | 8-port Ethernet Switch (100 MBit HUB) | na | na |
| Load Balancing | SUT | na | na | Run on one of the Web Caching machines | na | MS Windows DNS Server |
| Security | Serviced pages requiring SSL | na | na | Supported by each Web Svr | na | Provided by IIS |
| Text Search Engine | SUT | na | na | Run on one of the Web Caching machines (the same one as the Load Balancing DNS machine) | na | MS Windows 2000 Indexing Service |

We used DB2 UDB Enterprise Edition (EE) Version 7.1 for the database server (labeled DB Srv).
The E410 system had 12 Pentium® III Xeon 700 MHz processors with 2 MB cache and 15 GB
total main memory. This was a 3 Quad NUMA based system (footnote 2). Thirty 10K RPM disks were
attached via a Hitachi HDS 5800 Fibre Channel based disk subsystem.

Footnote 2. NUMA is Non-Uniform Memory Access. A NUMA-Q system consists of 4-processor
nodes called “Quads”, each having memory and I/O capabilities. When combined within the
same system, one copy of the operating system is active on all processors and memory, providing
the user with what looks like an SMP system. As the name implies, accesses by a processor to
local Quad memory have lower latency than accesses to memory on a remote Quad.

Since DYNIX/ptx supports Multi-Path I/O (MPIO), we had a choice between DB2 UDB EE
and DB2 UDB EEE (Enterprise-Extended Edition). DB2 UDB EE is a shared everything
architecture that utilized MPIO and the Fibre Channel topology to directly access any drive from