ETC AU1100LCDPERF_30274A

AMD Alchemy™ Solutions
Au1100™ Processor
LCD Performance
Application Note
Revision: 30274A
Issue Date: April 2003
© 2003 Advanced Micro Devices, Inc. All rights reserved.
The contents of this document are provided in connection with Advanced Micro Devices,
Inc. (“AMD”) products. AMD makes no representations or warranties with respect to the
accuracy or completeness of the contents of this publication and reserves the right to make
changes to specifications and product descriptions at any time without notice. No license,
whether express, implied, arising by estoppel or otherwise, to any intellectual property
rights is granted by this publication. Except as set forth in AMD’s Standard Terms and
Conditions of Sale, AMD assumes no liability whatsoever, and disclaims any express or
implied warranty, relating to its products including, but not limited to, the implied warranty
of merchantability, fitness for a particular purpose, or infringement of any intellectual property right.
AMD’s products are not designed, intended, authorized or warranted for use as components in systems intended for surgical implant into the body, or in other applications
intended to support or sustain life, or in any other application in which the failure of
AMD’s product could create a situation where personal injury, death, or severe property or
environmental damage may occur. AMD reserves the right to discontinue or make changes
to its products at any time without notice.
Contacts
www.amd.com [email protected]
Trademarks
AMD, the AMD Arrow logo, Alchemy, and combinations thereof, and Au1100 are trademarks of Advanced Micro Devices, Inc.
MIPS is a registered trademark and MIPS32 is a trademark of MIPS Technologies, Inc.
Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
AMD Alchemy™ Solutions Au1100™ Processor LCD Performance
Rev. 30274A April 2003
1.0 Introduction
This document describes the performance characteristics of an Au1100™ processor based design
using the integrated LCD controller.
This document assumes the reader is familiar with LCD technology and the AMD Alchemy™
Solutions Au1100™ Processor Data Book (see 7.0 “References”).
This document also assumes the reader is familiar with the applications note “Au1x00 SDRAM
Performance” which outlines SDRAM performance for a typical system (see 7.0 “References”). The
remainder of this document uses numbers for a 396MHz system with a 99MHz SDRAM interface, as
outlined in the SDRAM applications note.
2.0 LCD Controller Overview
The Au1100 processor features an integrated LCD controller for connecting to liquid crystal displays
and cathode ray tubes. The LCD controller supports the common industry standard TFT and STN
panel technologies and is able to drive cathode ray tubes via an external digital-to-analog converter
(DAC). In the discussion to follow, the term display is a reference to either a TFT or a cathode ray
tube with an appropriate DAC. The majority of the information presented in this document is
applicable for an STN panel, but the calculations differ.
The Au1100 processor databook contains details and additional information on the operation of the
LCD controller. The general arrangement of the Au1100 processor LCD controller is depicted below.
Au1100™ Processor
Au1
Core
SDRAM
Static
Cntrlr
SBUS
Flash/
PCMCIA/
Other
LCD
Cntrlr
LCD_FCK
LCD_LCK
LCD_BIAS
LCD_LEND
LCD_D[15:0]
LCD_PWM[1:0]
LCD
or
CRT DAC
SDRAM
Cntrlr
Figure 1: Au1100™ Processor LCD controller
Application Note
3
AMD Alchemy™ Solutions Au1100™ Processor LCD Performance
Rev. 30274A April 2003
The performance of any input/output peripheral is usually described in terms of the maximum amount
of data that can be moved through the interface in a given time period. For example, a 100Mbps
Ethernet controller can move a maximum of 12.5MB/s. If the actual performance is less than the
maximum, data movement occurs at a slower pace.
In the case of the output-only LCD controller, the performance is essentially a constant. Unlike many
peripheral I/Os, if the LCD controller fails to satisfy the constant performance requirement, the
display refresh fails, resulting in visual artifacts (and not just slower data movement).
The performance constant for the LCD controller is easily calculated for a given display type.
However, the LCD controller is only one aspect of performance in an Au1100 processor based design.
The remainder of this document identifies influences on the system performance of a design using the
Au1100 processor LCD.
3.0 Unified Memory Architecture Fundamentals
Figure 1: “Au1100™ Processor LCD controller” depicts a unified memory architecture (UMA)
arrangement where the memory used by the LCD controller for the framebuffer is shared with the rest
of the system. In this arrangement, the Au1 core performs all drawing in the framebuffer, which
resides in SDRAM, and the LCD controller continuously refreshes the display by fetching the
framebuffer contents and sending the pixel data to the display.
In a non-unified memory architecture, the LCD (or graphics) controller has a dedicated memory pool
that contains the framebuffer. Furthermore, the LCD (or graphics) controller has priority over
processor-initiated accesses to the framebuffer memory in order to maintain the refresh of the display.
By eliminating the need for a dedicated framebuffer memory pool, a UMA is a more cost-effective
graphics solution than a non-unified memory architecture environment. However, since the Au1 core,
LCD controller and other peripherals share the SDRAM, memory latency and bandwidth can affect
system performance.
3.1 System Bus (SBUS)
The system bus (SBUS) is the main bus within the Au1100 processor. As such, access to the system
bus is necessary in order to access the SDRAM, the Static Bus, or the integrated peripherals.
The SBUS typically operates at one-half the Au1 core frequency, and the SDRAM controller operates
at one-half the frequency of the SBUS.
The Au1100 processor SBUS has four bus master slots for handling six system bus masters:
• Au1 core
• Ethernet MAC controller and DMA controller
• USB Host controller and IrDA controller
• LCD controller
4
Application Note
AMD Alchemy™ Solutions Au1100™ Processor LCD Performance
Rev. 30274A April 2003
The arbitration scheme for the system bus is round-robin; each bus master slot has equal opportunity
to obtain access to the system bus. For a particular system bus master X, if no other system bus
masters request the bus, then bus master X immediately wins the system bus. By contrast, if all other
system bus masters request the bus, then bus master X must wait for three other system bus master
slots’ transfers before it wins the system bus, as depicted in the following figure.
Req A
Req B
Req C
Req X
SBUS
A
B
C
X
Figure 2: System Bus Arbitration
When a system bus master wins arbitration of the system bus, it performs transfers to/from the
integrated peripherals, SDRAM, or the Static bus.
3.2 Latency and Bandwidth
Latency is defined as the amount of time between when a request for a resource is initiated and when
the request for that resource is granted. In the scope of this discussion, latency is the time between
when a system bus master (e.g. LCD controller) requests access to the system bus (e.g. in order to
access framebuffer memory) and when the system bus is granted to that master. Bandwidth is the
amount of data that can be moved across the system bus in a time interval. In the Au1100, latency and
bandwidth are inversely related such that an increase in latency results in a decrease in bandwidth
(since less time is available to move data), and vice versa.
Two factors influence latency and bandwidth: system bus arbitration, and transfer time.
As stated previously, access to SDRAM requires access to the system bus. For all practical purposes,
the latency onto the system bus is the latency to the SDRAM. Figure 3: “System Bus Latency for a
Bus Master” illustrates the round-robin arbitration scheme with all system bus masters requesting the
bus simultaneously, and the corresponding effect on system bus latency for bus master X.
Req A
Req B
Req C
Req X
SBUS
A
B
C
X
Latency for System Bus Master X
Figure 3: System Bus Latency for a Bus Master
Application Note
5
AMD Alchemy™ Solutions Au1100™ Processor LCD Performance
Rev. 30274A April 2003
The illustration also demonstrates the impact of transfer time on latency, and in turn bandwidth.
While the transfer time to integrated peripheral registers is negligible, the SDRAM, and Static Bus
transfer times can add appreciable delay to system bus latency.
Note that from the perspective of system bus master X, increases in system bus latency result in fewer
opportunities for system bus master X to perform transfers to/from SDRAM in a given time interval.
Thus an increase in system bus latency results in a decrease in effective SDRAM bandwidth for
system bus master X (the actual SDRAM bandwidth potential is unchanged, as outlined in the
SDRAM applications note).
3.2.1 SDRAM Interface
For a 396MHz Au1100 processor operating the SDRAM controller at 99MHz, an SDRAM singlebeat access is 60ns (6 cycles at 10.1ns), and an SDRAM burst access is 121ns (12 cycles at 10.1ns).
Accesses to SDRAM can add upwards to 121ns to the system bus latency for other system bus
masters.
A typical SDRAM configuration is capable of approximately 248.9MB/s throughput. The SDRAM
bandwidth is important since it is the main storage for applications, data and the LCD framebuffer.
There must be enough SDRAM bandwidth to satisfy the LCD controller refresh demand as well as
run the applications.
The SDRAM bandwidth needed by the LCD controller is a product of the display resolution size,
pixel depth and refresh rate. The following table lists some common resolutions and the resulting
SDRAM bandwidth requirement.
Table 1: LCD Controller SDRAM Bandwidth
Horizontal
(Pixels)
Vertical
(Pixels)
Depth
(Bits Per
Pixel)
Refresh
Rate
(Hz)
Bandwidth
(MB/s)
QVGA
320
240
8
60
4.6MB/s
QVGA
320
240
16
60
9.2MB/s
VGA
640
480
16
60
36.8MB/s
XGA
800
600
16
60
57.6MB/s
XGA
800
600
16
72
69.1MB/s
The above values represent the SDRAM bandwidth demand of the LCD controller as it continuously
refreshes the display. With a total SDRAM bandwidth of 248.9MB/s, the LCD controller consumes a
relatively small percentage, leaving ample bandwidth for the Au1 core to run applications and
perform graphics operations.
The LCD controller timing values should be configured so as to minimize the SDRAM bandwidth
demand. In particular, the refresh rate should be set to the lowest rate permitted by the display.
6
Application Note
AMD Alchemy™ Solutions Au1100™ Processor LCD Performance
Rev. 30274A April 2003
3.2.2 Static Bus Interface
The Static Bus permits a wide variety of external devices to connect to the Au1100 processor. The
transfer time for these peripherals is in the tens and hundreds of nanoseconds. Flash memories
typically range from 90ns to 120ns, and PCMCIA cards typically are 150ns, 200ns or 250ns.
Furthermore, the Static Bus features the EWAIT# signal, and PWAIT# signal for PCMCIA, which
can be asserted by external devices to insert an arbitrary number of wait states into a transfer. The
assertion of these signals further increases latency for other system bus masters.
The full impact of static bus peripherals that assert EWAIT# or PWAIT# is discussed after outlining
the latency requirements of the LCD controller.
4.0 Latency and Bandwidth with Respect to the LCD Controller
To refresh the display, the LCD controller must fetch all the pixels of a frame, and do so at the refresh
rate of the display. To fetch a frame, the LCD controller generates a series of burst accesses to
SDRAM. Since a single SDRAM burst fetches only 32 bytes, multiple SDRAM accesses are needed
to fetch an entire frame.
The LCD controller implements two 320-word buffers for moving data from SDRAM to the pixel
engine. The two buffers are ping-pong buffers: the pixel engine pulls data from one buffer while the
other buffer is filled from SDRAM. If the pixel engine consumes a buffer, and the next buffer is not
yet filled, the pixel engine incurs an under-flow and repeats the last pixel, resulting in display
artifacts. The time to empty a 320-word buffer determines the maximum time allowed to fill a 320word buffer in order to avoid the buffer under-flow condition.
The pixel engine pulls one pixel from the buffer every pixel clock while rasterizing (for the sake of
simplicity, the horizontal non-display times are ignored). Thus, the pixel clock period multiplied by
the size of the buffer and divided by the number of pixels in the buffer yields the buffer empty/fill
time for a given display configuration.
The LCD pixel clock is derived from values programmed into sys_clksrc, sys_freqctrl and
lcd_clkcontrol. Table 2 provides example pixel clock settings for common display types.
Table 2: LCD Pixel Clock Timing
Horizontal
(Pixels)
Vertical
(Pixels)
FREQn
lcd_clkcontrol[PCD]
Pixel Clock
QVGA
320
240
48MHz
1
12MHz (83.3ns)
VGA
640
480
96MHz
1
24MHz (41.6ns)
XGA
800
600
96MHz
0
48MHz (20.8ns)
Application Note
7
AMD Alchemy™ Solutions Au1100™ Processor LCD Performance
Rev. 30274A April 2003
The number of pixels the buffer contains is determined by the lcd_control[BPP] field. Table 3
summarizes the possible combinations:
Table 3: LCD Buffer Pixels
lcd_control[BPP]
Bits Per Pixel
Number of
Pixels Per
Buffer
000
1
10240
001
2
5120
010
4
2560
011
8
1280
100
12
640
101
16
640
The time needed to empty a 320-word buffer is simply the product of the pixel clock period and the
number of pixels contained in the buffer. Table 4 summarizes the buffer empty time for the example
pixel clocks.
Table 4: LCD 320-Word Buffer Empty Time
QVGA
VGA
8
Horizontal
(Pixels)
Vertical
(Pixels)
Bits Per
Pixel
Pixel
Clock
(ns)
Pixels
Per
Buffer
Buffer
Time
(ns)
320
240
1
83.3
10240
852,992
320
240
2
83.3
5120
426,496
320
240
4
83.3
2560
213,248
320
240
8
83.3
1280
106,624
320
240
12
83.3
640
53,312
320
240
16
83.3
640
53,312
640
480
1
41.6
10240
425,984
640
480
2
41.6
5120
212,992
640
480
4
41.6
2560
106,496
640
480
8
41.6
1280
53,248
640
480
12
41.6
640
26,624
640
480
16
41.6
640
26,624
Application Note
AMD Alchemy™ Solutions Au1100™ Processor LCD Performance
Rev. 30274A April 2003
Table 4: LCD 320-Word Buffer Empty Time
XGA
Horizontal
(Pixels)
Vertical
(Pixels)
Bits Per
Pixel
Pixel
Clock
(ns)
Pixels
Per
Buffer
Buffer
Time
(ns)
800
600
1
20.8
10240
212,992
800
600
2
20.8
5120
106,496
800
600
4
20.8
2560
53,248
800
600
8
20.8
1280
26,624
800
600
12
20.8
640
13,312
800
600
16
20.8
640
13,312
To avoid the buffer under-flow condition, the time needed to fill the other 320-word buffer must not
exceed the time to empty a 320-word buffer. A 320-word buffer permits tens, hundreds, or even
thousands of microseconds of time in which to fill the next buffer.
To fill a 320-word buffer requires 40 SDRAM 8-word bursts, or approximately 4,840ns (121ns * 40
bursts); significantly less than the 320-word buffer empty time. The design and capability of the
Au1100 processor LCD controller permits ample time to fetch LCD buffers as well as perform other
useful work in the system.
4.1 How Latency and Bandwidth Affect the LCD Controller
The two main points of the preceding discussion are that the 320-word ping-pong buffers permit
adequate time to retrieve framebuffer contents from SDRAM as well as establish an upper-bound for
avoiding display artifacts.
This section examines the conditions that can cause the 320-word buffer fill time to exceed the empty
time. The 320-word buffer fill time in effect creates a hard real-time SDRAM bandwidth demand of
40 bursts in one buffer empty/fill time. Failure to complete 40 SDRAM burst in this time interval
causes the LCD pixel engine to under-flow and repeat pixels. It is during this time period that
efficient accesses to SDRAM is extremely important.
Consider the situation where the Au1 core is transferring a block of data to/from a PCMCIA card (e.g.
network or storage card). Only the Au1 core and the LCD controller are actively requesting the
system bus. The system bus arbitration scheme results in the Au1 core and LCD controller alternating
transfers on the system bus. Thus for each LCD controller access, there is an Au1 core access to
PCMCIA. Table 5 summarizes the time required to fill a 320-word buffer when both the Au1 core and
LCD controller are using the system bus.
Application Note
9
AMD Alchemy™ Solutions Au1100™ Processor LCD Performance
Rev. 30274A April 2003
Table 5: PCMCIA and LCD Transfer Times
PCMCIA Transfer Time
40 PCMCIA Accesses
40 SDRAM Accesses
PCMCIA +LCD
Time
150ns
6,000ns
4,840ns
10,840ns
200ns
8,000ns
4,840ns
12,840ns
250ns
10,000ns
4,840ns
14,840ns
300ns
(PWAIT# asserted)
12,000ns
4,840ns
16,840ns
350ns
(PWAIT# asserted)
14,000ns
4,840ns
18,840ns
400ns
(PWAIT# asserted)
16,000ns
4,840ns
20,840ns
500ns
(PWAIT# asserted)
20,000ns
4,840ns
24,840ns
600ns
(PWAIT# asserted)
24,000ns
4,840ns
28,840ns
By comparing the time to fill a buffer from this table with that of the time to empty a buffer in Table
4: “LCD 320-Word Buffer Empty Time”, it is apparent that a number of display configurations,
especially the 12bpp and 16bpp configurations, are susceptible to display artifacts when accessing
slow PCMCIA cards. For example, the 640x480x16bpp display refresh fails if PCMCIA card
accesses consistently need 600ns (buffer fill time of 28,840ns exceeds buffer empty time of
26,624ns).
Also note that this example does not take into consideration peripherals other than the Au1 core and
LCD controller which may request the system bus. System bus requests by other peripherals simply
add more time to the actual time needed to fill a 320-word buffer.
In addition, AMD has observed some PCMCIA cards assert PWAIT# to extend the transfer time to
1,000ns (1 microsecond), and even longer. If the Au1100 processor based product permits using
PCMCIA cards with this type of transfer time, the ability to fill the 320-word buffer in the allotted
time is extremely difficult, and will result in display artifacts.
The number of 320-word buffer fills needed per refresh for common configurations is provided in
Table 6.
10
Application Note
AMD Alchemy™ Solutions Au1100™ Processor LCD Performance
Rev. 30274A April 2003
Table 6: LCD 320-Word Buffer Fills Per Refresh
QVGA
VGA
XGA
Horizontal
(Pixels)
Vertical
(Pixels)
Framebuffer
Size
(Pixels)
Bits Per
Pixel
Pixels
Per
Buffer
Buffer Fills
Per
Refresh
320
240
76,800
1
10240
7.5
320
240
76,800
2
5120
15
320
240
76,800
4
2560
30
320
240
76,800
8
1280
60
320
240
76,800
12
640
120
320
240
76,800
16
640
120
640
480
307,200
1
10240
30
640
480
307,200
2
5120
60
640
480
307,200
4
2560
120
640
480
307,200
8
1280
240
640
480
307,200
12
640
480
640
480
307,200
16
640
480
800
600
480,000
1
10240
46.8
800
600
480,000
2
5120
93.7
800
600
480,000
4
2560
197.5
800
600
480,000
8
1280
375
800
600
480,000
12
640
750
800
600
480,000
16
640
750
The number of 320-word buffer fills per refresh multiplied by the display refresh rate determines the
number of opportunities per second for buffer under-flows to occur. If a 320-word buffer under-flow
does occur, the display artifacts last only until the start of the next refresh.
The LCD controller under-flow problem is the direct result of long latency, and not a bandwidth
short-coming. The SDRAM has adequate bandwidth to supply the LCD controller; however, the
ability of the LCD controller to access the SDRAM in an efficient manner is impacted by the system
bus latency introduced by competing accesses to the static bus.
Application Note
11
AMD Alchemy™ Solutions Au1100™ Processor LCD Performance
Rev. 30274A April 2003
4.1.1 LCD Controller lcd_control[22:21] Setting
As previously noted, an increase in system bus latency results in a decrease of effective SDRAM
bandwidth for the LCD controller. To combat the effects of long latency, the Au1100 processor LCD
controller implements a feature that determines how many SDRAM burst accesses it should perform
per system bus arbitration. By increasing the number of SDRAM bursts per LCD controller access,
the LCD controller effectively increases its bandwidth to the SDRAM and consequently increases the
likelihood of the LCD controller filling its 320-word buffers in time, even with the occurrence of long
latency static bus accesses.
The number of SDRAM bursts per system bus arbitration is selected by lcd_control[22:21].
Table 7: lcd_control[22:21] Settings
lcd_control[22:21]
Number of
SDRAM
Bursts
00
1
01
2
10
3
11
4
By setting lcd_control[22:21]=11, the LCD controller performs 40 SDRAM bursts in 10 system bus
arbitrations. Table 8 expands upon the previous example of the LCD controller alternating system bus
transfers with the Au1 core, presenting the change in actual transfer time for a 320-word buffer fill.
Table 8: PCMCIA and LCD Transfer Times with lcd_control[22:21]=11b
12
PCMCIA Transfer Time
10 PCMCIA Accesses
40 SDRAM Accesses
PCMCIA +LCD Time
150ns
1,500ns
4,840ns
5,340ns
200ns
2,000ns
4,840ns
6,840ns
250ns
2,500ns
4,840ns
7,340ns
300ns
(PWAIT# asserted)
3,000ns
4,840ns
7,840ns
350ns
(PWAIT# asserted)
3,500ns
4,840ns
8,340ns
400ns
(PWAIT# asserted)
4,000ns
4,840ns
8,840ns
500ns
(PWAIT# asserted)
5,000ns
4,840ns
9,840ns
Application Note
AMD Alchemy™ Solutions Au1100™ Processor LCD Performance
Rev. 30274A April 2003
Table 8: PCMCIA and LCD Transfer Times with lcd_control[22:21]=11b
PCMCIA Transfer Time
10 PCMCIA Accesses
40 SDRAM Accesses
PCMCIA +LCD Time
600ns
(PWAIT# asserted)
6,000ns
4,840ns
10,840ns
The lcd_control[22:21]=11b (4 SDRAM burst per arbitration) significantly increases the chances of
the LCD controller filling a 320-word buffer in the allotted time.
While this table indicates that it is possible to avoid under-flow in all situations, keep in mind that this
does not include system bus accesses by other masters, or PCMCIA (or static bus) transfers with
transfer times greater than 600ns. The presence of more system bus requestors or longer PCMCIA
transfer times increases the likelihood of a buffer under-flow, and the undesirable display artifacts.
4.1.2 LCD Controller sys_powerctrl[17] Setting
To further combat the effects of system bus latency, the Au1100 processor (stepping BE and newer)
features a setting in sys_powerctrl[17] to change the system bus arbitration scheme in favor of the
LCD controller. Setting sys_powerctrl[17] to 1 gives the LCD controller priority over other system
bus requestors.
Figure 4: System Bus Arbitration with sys_powerctrl[17]=1
Req A
Req B
Req C
LCD
SBUS
LCD
A
LCD
B
LCD
CX
The change in the arbitration scheme permits shorter system bus latency for the LCD controller, and
therefore more opportunities onto the system bus which in turn increases the likelihood of filling the
320-word buffer on time.
Note that this setting does not allow the LCD controller unconditional access to the system bus. The
LCD controller must still wait if another system bus master is using the system bus. It does, however,
reduce the number of arbitration cycles needed for the LCD controller to win the system bus. The end
result is that the system bus latency for the LCD controller decreases, while the latency for the other
bus masters slightly increases.
This setting is likely to help LCD display refresh in a system where many peripherals are requesting
the system bus, but may not help when the Au1 core is accessing slow PCMCIA cards during the fill
of the 320-word buffer.
Application Note
13
AMD Alchemy™ Solutions Au1100™ Processor LCD Performance
Rev. 30274A April 2003
5.0 LCD Performance Tuning
An Au1100 processor design has adequate SDRAM bandwidth and latency requirements to
successfully drive a display using the integrated LCD controller. The following sections detail
optimizations that can be made to improve overall system performance.
5.1 Hardware Design Considerations
Since the function of the LCD controller is fixed and predictable, there are only a few hardware
design decisions to be made. These decisions are:
•
LCD display size
•
LCD refresh rate/timing
•
Selection of Au1100 processor operating frequency
•
Selection of the SDRAM
•
Appropriate setting of lcd_control[22:21]
•
Static bus peripheral timings
The LCD display is the single largest factor affecting overall system performance. The display size,
depth and refresh rate determine the SDRAM bandwidth and the Au1 core graphics performance. The
larger the display, the more SDRAM bandwidth that is needed, and the more performance that is
needed from the Au1 core to do graphics. The choice of LCD display size must balance market/
customer requirements and application functionality.
The LCD refresh rate and timing must be optimized to demand the least possible bandwidth from the
Au1100 processor SDRAM. Aggressive refresh rates or timing merely consumes SDRAM bandwidth
and increases the chance for the under-flow condition and display artifacts.
The operating frequency of the Au1100 processor ultimately determines the overall system
performance and the SDRAM clock frequency. The design should use an Au1100 processor running
at an appropriate frequency to yield the desired application and graphics performance, as well as an
appropriate SDRAM bandwidth.
The SDRAMs selected for the design should provide the necessary SDRAM bandwidth; prototyping
and profiling the intended application is recommended. The “SDRAM Performance” application note
provides insight into the selection criteria and expected bandwidth for the SDRAM in an Au1100
processor design.
The lcd_control[22:21] bits should be set according to the needs of the system. For systems with long
latency static bus accesses, it may be necessary to use a setting of 4 SDRAM bursts per system bus
arbitration to improve the ability of the LCD controller to fill the 320-word buffer. This feature might
also prove useful for larger display panels that require aggressive refresh timings.
14
Application Note
Rev. 30274A April 2003
AMD Alchemy™ Solutions Au1100™ Processor LCD Performance
Accesses to static bus peripherals can have an unusually large transfer time, which directly translates
into a dramatic increase in system bus latency. System designers must carefully consider the timing of
all peripherals on the static bus and optimize the timings to consume the least amount of time
possible. The prime example is the PCMCIA interface, where card transfer times can vary from
150ns to 250ns depending upon the card inserted. In addition, the card can also assert PWAIT# to
extend the cycle time indefinitely.
5.2 Software Design Considerations
The LCD controller merely fetches pixel data from the framebuffer residing in SDRAM; it is the
responsibility of software executing on the Au1 core to perform all graphics operations. The graphics
driver for the Au1100 processor LCD controller can optimize framebuffer caching and mapping to
improve overall system performance.
5.2.1 Framebuffer Caching
Generally speaking caching data improves overall performance. However, a framebuffer presents a
unique challenge in that it is a large, infrequently referenced data structure. For even a small display
panel with resolution 320x240 at 16bpp, the resulting framebuffer of 153,600 bytes easily exceeds the
16KB data cache of the Au1 core. As a direct result, caching the framebuffer displaces other useful,
non-framebuffer data (such as working variables, data-sets, stack, etc.) from the cache. Furthermore,
the cache is best utilized when the memory is referenced frequently; framebuffers pixels are typically
only written once by graphics operations and remain unchanged until a subsequent graphics operation
changes the pixel.
The net result is that it is undesirable to have the framebuffer occupy the entire cache since it reduces
overall cache hit rate and in turn reduces overall system performance. However, for performance
reasons, it is always desirable to do the most efficient access possible to the framebuffer. The Au1100
processor offers several options for improving framebuffer accesses.
If using the translation look-aside buffers (TLB) to access the framebuffer (that is, KSEG0 or
KSEG1spaces are not used exclusively to access the framebuffer), then the framebuffer cache setting
in the TLB should be one of the following, in order of preference:
1. CCA=6 (cached into way 0), with the data cache way 0 locked
2. CCA=6 (cached into way 0), without the data cache way 0 locked
3. CCA=7 (non-cached, write buffer merging and gathering)
4. CCA=2 (non-cached, no write buffer merging and gathering)
5. CCA=3 (cached, uses entire data cache)
CCA, cache coherency attributes, is a field in the MIPS® TLB. See the Alchemy™ Au1100™
Processor from AMD Data Book “2.4 Virtual Memory” for more information. CCA values are
provided in “Table 2. CCA Values” of the data book.
Application Note
15
AMD Alchemy™ Solutions Au1100™ Processor LCD Performance
Rev. 30274A April 2003
5.2.2 Framebuffer and CCA=6
Case 1 is CCA=6, which is cached and streaming. Furthermore software locks way 0 of the data
cache. This configuration has two mutually beneficial effects:
1. it permits the framebuffer to be cached by confining framebuffer data to way 0 of the cache, and
2. non-framebuffer data is kept out of way 0 which prevents it from being purged by framebuffer
contents.
This configuration has a 4KB cache for the framebuffer, and a 12KB cache for non-framebuffer
items. The following example code configures the 4KB framebuffer cache by locking way 0 of the
cache. The parameter to this routine is the framebuffer address.
.global dcacheStreamInit
.set noreorder
dcacheStreamInit:
li t0,128 # number of dcache sets
dcsiloop:
cache 0x15,0(a0) # wb inv address if in cache
pref 0x4,0(a0) # streaming prefetch into way 0
cache 0x1D,0(a0) # dcache fetch and lock
addiu t0,t0,-1 # decrement sets
bne zero,t0,dcsiloop
addiu a0,a0,32 # increment address by cacheline size
j ra
nop
.set reorder
When this setting is used in conjunction with lcd_control[C]=1, there is no need to flush the data
cache to SDRAM; the data cache snoop mechanism returns current data for cache lines that contain
framebuffer data.
This is the preferred configuration as it permits framebuffer caching, coherent updates, and prevents
non-framebuffer items from being purged from the data cache.
Case 2 is CCA=6, and way 0 of the data cache is not locked. In this configuration, most of the
benefits just described are realized, but non-framebuffer data can land in way 0. In doing so,
framebuffer and non-framebuffer data can displace each other from way 0, degrading the full benefits
of locking way 0.
5.2.3 Framebuffer and CCA=7
Case 3 is CCA=7, which is non-cached, with write buffer merging and gathering. In this
configuration, the framebuffer is not cached, but writes (e.g. blits) to the framebuffer can be merged
and gathered for more efficient burst accesses to the SDRAM. Burst accesses to SDRAM result in
improved throughput and increase overall system performance.
16
Application Note
AMD Alchemy™ Solutions Au1100™ Processor LCD Performance
Rev. 30274A April 2003
The lcd_control[C] setting should be 0 (non-coherent) as the framebuffer contents are never in the
data cache.
5.2.4 Framebuffer and CCA=2
Case 4 is CCA=2 which is non-cached and non write buffer merging or gathering. In this
configuration, all framebuffer accesses travel through the writebuffer individually, thus consuming
more SDRAM bandwidth than burst accesses with CCA=7.
The lcd_control[C] setting should be 0 (non-coherent) as the framebuffer contents are never in the
data cache.
5.2.5 Framebuffer and CCA=3
Case 5 is CCA=3, which is cached. Furthermore, CCA=3 permits framebuffer contents to use the
entire data cache. As previously noted, if the framebuffer occupies the entire data cache, the overall
system performance degrades. Therefore using CCA=3 is not recommended.
If this setting is used, the lcd_control[C] setting must be 1 (coherent); the data cache snoop
mechanism returns current data for cache lines that contain framebuffer data.
5.2.6 Framebuffer Mapping
When using the translation look-aside buffers (TLB) to access the framebuffer (that is, KSEG0 or
KSEG1spaces are not used exclusively to access the framebuffer), the framebuffer mapping should
attempt to use a single TLB.
Most software environments/operating systems use a 4KB page size. The number of pages required to
cover an entire framebuffer of various sizes is provided in Table 9: “Number of Framebuffer Pages”.
Table 9: Number of Framebuffer Pages
Width
(pixels)
Height
(pixels)
Depth
(bits per pixel)
Size
(Bytes)
4KB Pages
QVGA
320
240
8
76,800
19
QVGA
320
240
16
153,600
38
VGA
640
480
16
614,400
150
XGA
800
600
16
960,000
235
SVGA
1024
768
16
1,572,864
384
The Au1 core has a 32 dual-entry TLB that can map a maximum of 64 pages. If the framebuffer is
mapped using 4KB pages, then as drawing takes place across the display, two performance limiting
effects come into play:
Application Note
17
AMD Alchemy™ Solutions Au1100™ Processor LCD Performance
Rev. 30274A April 2003
1. TLB misses occur more frequently which degrades the performance of the drawing routines, and
2. the updates to the TLB to map framebuffer pages displace other valid code and data mappings
from the TLB and degrade overall system performance.
The larger the display size, the higher the frequency of the TLB misses and the longer it takes for
graphics operations to complete.
As graphics load/store instructions miss in the TLB, an exception is taken. Software optionally stores
context, performs a table walk, updates the TLB, optionally restores context and re-initiates the load/
store operation that caused the TLB miss. Furthermore, the MIPS TLB contains mapping for both
code and data, so TLB misses due to framebuffer accesses can result in displacing valid TLB entries
for program instruction/code pages. In effect, program code, data and framebuffer all compete for the
limited number of entries in the TLB. Avoiding TLB misses is therefore desirable, and mapping the
entire framebuffer using a single TLB eliminates such performance limiting effects.
The Au1 TLB can handle page sizes up to 16MB (and in reality up to 32MB due to the dual-entry
TLB). For display sizes that the Au1100 processor LCD controller can handle, a 1MB page size
covers the entire framebuffer, and 2MB covers surface flipping/ping-pong buffers. Thus, the entire
framebuffer can be mapped with a single TLB entry.
In order to map the entire framebuffer with a single TLB, the following must occur:
•
The framebuffer memory must be a valid TLB PageSize bytes in size, or PageSize*2 in size. The
framebuffer memory must be mapped by exactly one TLB, with either one or both entries valid.
•
The framebuffer memory must be aligned on a PageSize, or PageSize*2, boundary, e.g. for a
1MB PageSize, the alignment of the physical address must be on 1 MB boundary.
•
The process virtual address into which the framebuffer is mapped must also be aligned on the
same boundary, e.g. for a 1MB PageSize, the alignment of the virtual address must be on a 1MB
boundary.
With a single TLB entry, TLB misses and the associated performance degradation are minimized.
Depending upon the software environment, one additional performance improvement can be realized
by mapping the framebuffer with a static, or wired, TLB entry. The MIPS32™ TLB permits certain
TLB entries to not participate in the random TLB replacement algorithm (dynamic) and thus remain
in the TLB indefinitely (static) until removed by software. By using a static TLB entry, TLB misses
caused by framebuffer accesses are completely eliminated.
Combining the optimizations for framebuffer caching and mapping, the ideal framebuffer
configuration uses a single, static TLB entry covering the entire framebuffer with CCA=6 and data
cache way 0 locked.
18
Application Note
Rev. 30274A April 2003
AMD Alchemy™ Solutions Au1100™ Processor LCD Performance
6.0 Conclusion
The Au1100 processor LCD controller provides a cost-effective, flexible solution for connecting to a
variety of displays. While the performance of the LCD controller is constant, system design issues, in
particular the long-latency static bus accesses, can impact the ability of the LCD controller to
maintain display refresh. Several optimizations including choice of LCD panel, Au1100 processor
operating frequency, and software optimizations of framebuffer caching and mapping are presented
for fine-tuning the Au1100 processor based design.
7.0 References
1. Alchemy™ Au1100™ Processor from AMD Data Book, AMD, 2002.
2. AMD Alchemy Solutions Au1000, Au1100 and Au1500 Processors SDRAM Performance - Application Note, AMD, 2003.
Application Note
19