
Evaluating and Programming the
29K RISC Family
Third Edition – DRAFT
ADVANCED MICRO DEVICES
IF YOU HAVE QUESTIONS, WE’RE HERE TO HELP YOU.
Customer Service
AMD’s customer service network includes U.S. offices, international offices, and
a customer training center. Expert technical assistance is available to answer 29K
Family hardware and software development questions from AMD’s worldwide
staff of field application engineers and factory support staff.
Hotline, Bulletin Board, and eMail Support
For answers to technical questions, AMD provides a toll-free number for direct
access to our engineering support staff. For overseas customers, the easiest way to
reach the engineering support staff is via fax with a short description of your
question. Also available is the AMD bulletin board service,
which provides the latest 29K product information, including technical information and data on upcoming product releases. AMD 29K Family customers also
receive technical support through electronic mail. This worldwide service is available to 29K product users via the International UNIX eMail service. To access the
service, use the AMD eMail address: “[email protected].”
Engineering Support Staff:
(800) 292-9263 ext. 2    toll free for U.S.
(512) 602-4118           local for U.S.
0800-89-1455             toll free for UK
0031-11-1163             toll free for Japan
(512) 602-5031           FAX for overseas
Bulletin Board:
(800) 292-9263 ext. 1    toll free for U.S.
(512) 602-4898           worldwide and local for U.S.
Documentation and Literature
The 29K Family Customer Support Group responds quickly to information and
literature requests. A simple phone call will get you free 29K Family information
such as data books, user’s manuals, data sheets, application notes, the Fusion29K
Partner Solutions Catalog and Newsletter, and other literature. Internationally,
contact your local AMD sales office for complete 29K Family literature.
Customer Support Group:
(800) 292-9263 ext. 3    toll free for U.S.
(512) 602-5651           local for U.S.
(512) 602-5051           FAX for U.S.
Evaluating and Programming the
29K RISC Family
Third Edition – DRAFT
Daniel Mann
Advanced Micro Devices
© 1995 Daniel Mann
Advanced Micro Devices reserves the right to make changes in its products
without notice in order to improve design or performance characteristics.
This publication neither states nor implies any warranty of any kind, including but
not limited to implied warranties of merchantability or fitness for a particular
application. AMD assumes no responsibility for the use of any circuitry other than
circuitry embodied in an AMD product.
The author and publisher of this book have used their best efforts in preparing this
book. Although the information presented has been carefully reviewed and is believed to be reliable, the author and publisher make no warranty of any kind, expressed or implied, with regard to example programs or documentation contained in
this book. The author and publisher shall not be liable in any event for accidental or
consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.
Trademarks
29K, Am29005, Am29027, Am29050, Am29030, Am29035, Am29040, Am29200,
Am29205, Am29240, Am29243, Am29245, EZ030, SA29200, SA29240,
SA29040, MiniMON29K, XRAY29K, ASM29K, ISS, SIM29, Scalable Clocking,
Traceable Cache, and UDI are trademarks of Advanced Micro Devices, Inc.
Fusion29K is a registered service trademark of Advanced Micro Devices, Inc.
AMD and Am29000 are registered trademarks of Advanced Micro Devices, Inc.
PowerPC is a trademark of International Business Machines Corp.
MRI and XRAY are trademarks of Microtec Research Inc.
High C is a registered trademark of MetaWare Inc.
i960 is a trademark of Intel, Inc.
MC68020 is a trademark of Motorola Inc.
UNIX is a trademark of AT&T.
NetROM is a trademark of XLNT Designs, Inc.
UDB and UMON are trademarks of CaseTools Inc.
Windows is a trademark of Microsoft Corp.
Product names used in this publication are for identification purposes only and may
be trademarks of their respective companies.
To my wife
Audrey
and my son
Geoffrey
Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxi
Chapter 1
Architectural Overview . . . . . 1
1.1 A RISC DEFINITION . . . . . 3
1.2 FAMILY MEMBER FEATURES . . . . . 5
1.3 THE Am29000 3–BUS MICROPROCESSOR . . . . . 7
    1.3.1 The Am29005 . . . . . 11
1.4 THE Am29050 3–BUS FLOATING–POINT MICROPROCESSOR . . . . . 11
1.5 THE Am29030 2–BUS MICROPROCESSOR . . . . . 13
    1.5.1 Am29030 Evaluation . . . . . 16
    1.5.2 The Am29035 . . . . . 16
1.6 THE Am29040 2–BUS MICROPROCESSOR . . . . . 17
    1.6.1 Am29040 Evaluation . . . . . 19
1.7 A SUPERSCALAR 29K PROCESSOR . . . . . 20
    1.7.1 Instruction Issue and Data Dependency . . . . . 21
    1.7.2 Reservation Stations . . . . . 25
    1.7.3 Register Renaming . . . . . 27
    1.7.4 Branch Prediction . . . . . 30
1.8 THE Am29200 MICROCONTROLLER . . . . . 33
    1.8.1 ROM Region . . . . . 35
    1.8.2 DRAM Region . . . . . 36
    1.8.3 Virtual DRAM Region . . . . . 37
    1.8.4 PIA Region . . . . . 37
    1.8.5 DMA Controller . . . . . 37
    1.8.6 16–bit I/O Port . . . . . 38
    1.8.7 Parallel Port . . . . . 38
    1.8.8 Serial Port . . . . . 39
    1.8.9 I/O Video Interface . . . . . 39
    1.8.10 The SA29200 Evaluation Board . . . . . 39
    1.8.11 The Prototype Board . . . . . 40
    1.8.12 Am29200 Evaluation . . . . . 40
    1.8.13 The Am29205 Microcontroller . . . . . 40
1.9 THE Am29240 MICROCONTROLLER . . . . . 41
    1.9.1 The Am29243 Microcontroller . . . . . 44
    1.9.2 The Am29245 Microcontroller . . . . . 44
    1.9.3 The Am2924x Evaluation . . . . . 45
1.10 REGISTER AND MEMORY SPACE . . . . . 46
    1.10.1 General Purpose Registers . . . . . 47
    1.10.2 Special Purpose Registers . . . . . 49
    1.10.3 Translation Look–Aside Registers . . . . . 61
    1.10.4 External Address Space . . . . . 62
1.11 INSTRUCTION FORMAT . . . . . 64
1.12 KEEPING THE RISC PIPELINE BUSY . . . . . 65
1.13 PIPELINE DEPENDENCIES . . . . . 67
1.14 ARCHITECTURAL SIMULATION, sim29 . . . . . 70
    1.14.1 The Simulation Event File . . . . . 75
    1.14.2 Analyzing the Simulation Log File . . . . . 83
Chapter 2
Applications Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
2.1 C LANGUAGE PROGRAMMING . . . . . 90
    2.1.1 Register Stack . . . . . 91
    2.1.2 Activation Records . . . . . 92
    2.1.3 Spilling And Filling . . . . . 93
    2.1.4 Global Registers . . . . . 94
    2.1.5 Memory Stack . . . . . 95
2.2 RUN–TIME HIF ENVIRONMENT . . . . . 95
    2.2.1 OS Preparations before Calling start In crt0 . . . . . 97
    2.2.2 crt0 Preparations before Calling main() . . . . . 100
    2.2.3 Run–Time HIF Services . . . . . 101
    2.2.4 Switching to Supervisor Mode . . . . . 103
2.3 C LANGUAGE COMPILER . . . . . 106
    2.3.1 Compiler Optimizations . . . . . 106
    2.3.2 Metaware High C 29K Compiler . . . . . 110
    2.3.3 Free Software Foundation, GCC . . . . . 112
    2.3.4 C++ Compiler Selection . . . . . 113
    2.3.5 Executable Code and Source Correspondence . . . . . 113
    2.3.6 Linking Compiled Code . . . . . 116
2.4 LIBRARY SUPPORT . . . . . 119
    2.4.1 Memory Allocation . . . . . 119
    2.4.2 Setjmp and Longjmp . . . . . 120
    2.4.3 Support Libraries . . . . . 122
2.5 C LANGUAGE INTERRUPT HANDLERS . . . . . 123
    2.5.1 An Interrupt Context Cache with High C 29K . . . . . 127
    2.5.2 An Interrupt Context Cache with GNU . . . . . 128
    2.5.3 Using Signals to Deal with Interrupts . . . . . 131
    2.5.4 Interrupt Tag Words . . . . . 134
    2.5.5 Overloaded INTR3 . . . . . 140
    2.5.6 A Signal Dispatcher . . . . . 143
    2.5.7 Minimizing Interrupt Latency . . . . . 152
    2.5.8 Signal Processing Without a HIF Operating System . . . . . 153
    2.5.9 An Example Am29200 Interrupt Handler . . . . . 153
2.6 SUPPORT UTILITY PROGRAMS . . . . . 156
    2.6.1 Examining Object Files (Type .o And a.Out) . . . . . 156
    2.6.2 Modifying Object Files . . . . . 158
    2.6.3 Getting a Program into ROM . . . . . 159
Chapter 3
Assembly Language Programming . . . . . . . . . . . . . . . . . . . . . . . . 161
3.1 INSTRUCTION SET . . . . . 162
    3.1.1 Integer Arithmetic . . . . . 162
    3.1.2 Compare . . . . . 164
    3.1.3 Logical . . . . . 165
    3.1.4 Shift . . . . . 165
    3.1.5 Data Movement . . . . . 168
    3.1.6 Constant . . . . . 172
    3.1.7 Floating–point . . . . . 173
    3.1.8 Branch . . . . . 173
    3.1.9 Miscellaneous Instructions . . . . . 175
    3.1.10 Reserved Instructions . . . . . 176
3.2 CODE OPTIMIZATION TECHNIQUES . . . . . 177
3.3 AVAILABLE REGISTERS . . . . . 178
    3.3.1 Useful Macro–Instructions . . . . . 180
    3.3.2 Using Indirect Pointers and gr0 . . . . . 181
    3.3.3 Using gr1 . . . . . 182
    3.3.4 Accessing Special Register Space . . . . . 183
    3.3.5 Floating–point Accumulators . . . . . 184
3.4 DELAYED EFFECTS OF INSTRUCTIONS . . . . . 185
3.5 TRACE–BACK TAGS . . . . . 186
3.6 INTERRUPT TAGS . . . . . 188
3.7 TRANSPARENT ROUTINES . . . . . 190
3.8 INITIALIZING THE PROCESSOR . . . . . 190
3.9 ASSEMBLER SYNTAX . . . . . 191
    3.9.1 The AMD Assembler . . . . . 191
    3.9.2 Free Software Foundation (GNU), Assembler . . . . . 192
Chapter 4
Interrupts and Traps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
4.1 29K PROCESSOR FAMILY INTERRUPT SEQUENCE . . . . . 196
4.2 29K PROCESSOR FAMILY INTERRUPT RETURN . . . . . 197
4.3 SUPERVISOR MODE HANDLERS . . . . . 199
    4.3.1 The Interrupt Environment . . . . . 199
    4.3.2 Interrupt Latency . . . . . 200
    4.3.3 Simple Freeze-mode Handlers . . . . . 201
    4.3.4 Operating in Freeze mode . . . . . 202
    4.3.5 Monitor mode . . . . . 203
    4.3.6 Freeze-mode Clock Interrupt Handler . . . . . 204
    4.3.7 Removing Freeze mode . . . . . 205
    4.3.8 Handling Nested Interrupts . . . . . 208
    4.3.9 Saving Registers . . . . . 210
    4.3.10 Enabling Interrupts . . . . . 212
    4.3.11 Restoring Saved Registers . . . . . 214
    4.3.12 An Interrupt Queuing model . . . . . 216
    4.3.13 Making Timer Interrupts Synchronous . . . . . 221
4.4 USER-MODE INTERRUPT HANDLERS . . . . . 222
    4.4.1 Supervisor mode Code . . . . . 223
    4.4.2 Register Stack Operation . . . . . 226
    4.4.3 SPILL and FILL Trampoline . . . . . 228
    4.4.4 SPILL Handler . . . . . 229
    4.4.5 FILL Handler . . . . . 230
    4.4.6 Register File Inconsistencies . . . . . 231
    4.4.7 Preparing the C Environment . . . . . 234
    4.4.8 Handling Setjmp and Longjmp . . . . . 236
Chapter 5
Operating System Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
5.1 REGISTER CONTEXT . . . . . 240
5.2 SYNCHRONOUS CONTEXT SWITCH . . . . . 242
    5.2.1 Optimizations . . . . . 245
5.3 ASYNCHRONOUS CONTEXT SWITCH . . . . . 249
5.4 INTERRUPTING USER MODE . . . . . 250
    5.4.1 Optimizations . . . . . 253
5.5 PROCESSING SIGNALS IN USER MODE . . . . . 254
5.6 INTERRUPTING SUPERVISOR MODE . . . . . 258
    5.6.1 Optimizations . . . . . 260
5.7 USER SYSTEM CALLS . . . . . 261
5.8 FLOATING–POINT ISSUES . . . . . 264
5.9 DEBUGGER ISSUES . . . . . 265
5.10 RESTORING CONTEXT . . . . . 266
5.11 INTERRUPT LATENCY . . . . . 268
5.12 ON–CHIP CACHE SUPPORT . . . . . 270
5.13 INSTRUCTION CACHE MAINTENANCE . . . . . 270
    5.13.1 Cache Locking and Invalidating . . . . . 273
    5.13.2 Instruction Cache Coherence . . . . . 275
    5.13.3 Branch Target Cache . . . . . 275
    5.13.4 Am29030 2–bus Microprocessor . . . . . 276
    5.13.5 Am29240 and Am29040 Processors . . . . . 277
5.14 DATA CACHE MAINTENANCE . . . . . 277
    5.14.1 Am29240 Microcontroller . . . . . 279
    5.14.2 Am29040 2–bus Microprocessor . . . . . 283
    5.14.3 Cache Locking and Invalidating . . . . . 288
    5.14.4 Cache Consistency . . . . . 288
5.15 SELECTING AN OPERATING SYSTEM . . . . . 290
5.16 SUMMARY . . . . . 294
Chapter 6
Memory Management Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
6.1 SRAM VERSUS DRAM PERFORMANCE . . . . . 296
6.2 TRANSLATION LOOK–ASIDE BUFFER (TLB) OPERATION . . . . . 300
    6.2.1 Dual TLB Processors . . . . . 305
    6.2.2 Taking a TLB Trap . . . . . 307
6.3 PERFORMANCE EQUATION . . . . . 308
6.4 SOFTWARE CONTROLLED CACHE MEMORY ARCHITECTURE . . . . . 310
    6.4.1 Cache Page Maintenance . . . . . 313
    6.4.2 Data Access TLB Miss . . . . . 315
    6.4.3 Instruction Access TLB Miss . . . . . 318
    6.4.4 Data Write TLB Protection . . . . . 319
    6.4.5 Supervisor TLB Signal Handler . . . . . 320
    6.4.6 Copying a Page into the Cache . . . . . 322
    6.4.7 Copying a Page Out of the Cache . . . . . 323
    6.4.8 Cache Set Locked . . . . . 325
    6.4.9 Returning from Signal Handler . . . . . 326
    6.4.10 Support Routines . . . . . 327
    6.4.11 Performance Gain . . . . . 328
Chapter 7
Software Debugging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
7.1 REGISTER ASSIGNMENT CONVENTION . . . . . 331
7.2 PROCESSOR DEBUG SUPPORT . . . . . 333
    7.2.1 Execution Mode . . . . . 333
    7.2.2 Memory Access Protection . . . . . 334
    7.2.3 Trace Facility . . . . . 334
    7.2.4 Program Counter register PC2 . . . . . 335
    7.2.5 Monitor Mode . . . . . 336
    7.2.6 Instruction Breakpoints . . . . . 336
    7.2.7 Data Breakpoints . . . . . 338
7.3 THE MiniMON29K DEBUGGER . . . . . 338
    7.3.1 The Target MiniMON29K Component . . . . . 339
    7.3.2 Register Usage . . . . . 340
    7.3.3 The DebugCore . . . . . 341
    7.3.4 DebugCore installation . . . . . 342
    7.3.5 Advanced DBG and CFG Module Features . . . . . 347
    7.3.6 The Message System . . . . . 348
    7.3.7 MSG Operation . . . . . 349
    7.3.8 MSG Virtual Interrupt Mechanism . . . . . 349
7.4 THE OS–BOOT OPERATING SYSTEM . . . . . 350
    7.4.1 Register Usage . . . . . 351
    7.4.2 OS–boot Operation . . . . . 351
    7.4.3 HIF Services . . . . . 352
    7.4.4 Adding New Device Drivers . . . . . 353
    7.4.5 Memory Access Protection . . . . . 354
    7.4.6 Down Loading a New OS . . . . . 358
7.5 UNIVERSAL DEBUG INTERFACE (UDI) . . . . . 359
    7.5.1 Debug Tool Developers . . . . . 360
    7.5.2 UDI Specification . . . . . 361
    7.5.3 P–trace . . . . . 363
    7.5.4 The GDB–UDI Connection . . . . . 364
    7.5.5 The UDI–MiniMON29K Monitor Connection, MonTIP . . . . . 365
    7.5.6 The MiniMON29K User–Interface, MonDFE . . . . . 366
    7.5.7 The UDI – Instruction Set Simulator Connection, ISSTIP . . . . . 368
    7.5.8 UDI Benefits . . . . . 369
    7.5.9 Getting Started with GDB . . . . . 370
    7.5.10 GDB and MiniMON29K Summary . . . . . 373
7.6 SIMPLIFYING ASSEMBLY CODE DEBUG . . . . . 374
7.7 SOURCE LEVEL DEBUGGING USING A WINDOW INTERFACE . . . . . 377
7.8 TRACING PROGRAM EXECUTION . . . . . 383
7.9 Fusion3D TOOLS . . . . . 397
    7.9.1 NetROM ROM Emulator . . . . . 397
    7.9.2 HP16500B Logic Analyzer . . . . . 400
    7.9.3 Selecting Trace Signals . . . . . 404
    7.9.4 Corelis PI – Am29040 Preprocessor . . . . . 406
    7.9.5 Corelis PI – Am29460 Preprocessor . . . . . 408
Chapter 8
Selecting a Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
8.1 THE 29K FAMILY . . . . . 419
    8.1.1 Selecting a Microcontroller . . . . . 423
    8.1.2 Moving up to an Am2920x Microcontroller . . . . . 431
    8.1.3 Selecting a Microprocessor . . . . . 434
    8.1.4 Reducing the Register Window Size . . . . . 443
Appendix A
HIF Service Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
A.1 Service Call Numbers And Parameters . . . . . 450
A.2 Error Numbers . . . . . 505
Appendix B
HIF Signal Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
B.1 User Trampoline Code . . . . . 512
B.2 Library Glue Routines to HIF Signal Services . . . . . 518
B.3 The Library signal() Routine for Registering a Handler . . . . . 519
Appendix C
Software Assigned Trap Numbers . . . . . . . . . . . . . . . . . . . . . . . . . 522
Appendix D
DebugCore 2.0 Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
D.1 INTRODUCTION . . . . . 526
D.2 REGISTER USAGE . . . . . 527
D.3 DEBUGCORE 1.0 ENHANCEMENTS . . . . . 527
    D.3.1 Executing OS Service Functions . . . . . 528
    D.3.2 Per–Process Breakpoints . . . . . 529
    D.3.3 Current PID . . . . . 530
    D.3.4 Virtual or Physical Breakpoints . . . . . 530
    D.3.5 Breakpoint Functions . . . . . 531
D.4 MODULE INTERCONNECTION . . . . . 531
    D.4.1 The DebugCore 2.0 . . . . . 531
    D.4.2 The Message System 1.0 . . . . . 536
    D.4.3 The DebugCore 2.0 Configuration . . . . . 539
References and Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544
Figures
Figure 1-1. RISC Pipeline . . . . . 3
Figure 1-2. CISC Pipeline . . . . . 4
Figure 1-3. Processor Price–Performance Summary . . . . . 6
Figure 1-4. Am29000 Processor 3–bus Harvard Memory System . . . . . 9
Figure 1-5. The Instruction Window for Out–of–Order Instruction Issue . . . . . 24
Figure 1-6. A Function Unit with Reservation Stations . . . . . 26
Figure 1-7. Register Dependency Resolved by Register Renaming . . . . . 28
Figure 1-8. Circular Reorder Buffer Format . . . . . 28
Figure 1-9. Multiple Function Units with a Reorder Buffer . . . . . 29
Figure 1-10. Instruction Decode with No Branch Prediction . . . . . 31
Figure 1-11. Four–Instruction Decoder with Branch Prediction . . . . . 32
Figure 1-12. Am29200 Microcontroller Address Space Regions . . . . . 35
Figure 1-13. Am29200 Microcontroller Block Diagram . . . . . 36
Figure 1-14. General Purpose Register Space . . . . . 48
Figure 1-15. Special Purpose Register Space for the Am29000 Microprocessor . . . . . 51
Figure 1-16. Am29000 Processor Program Counter . . . . . 55
Figure 1-17. Additional Special Purpose Registers for the Monitor Mode Support . . . . . 56
Figure 1-18. Additional Special Purpose Registers for the Am29050 Microprocessor . . . . . 57
Figure 1-19. Additional Special Purpose Registers for Breakpoint Control . . . . . 57
Figure 1-20. Additional Special Purpose Registers for On–Chip Cache Control . . . . . 58
Figure 1-21. Additional Special Purpose Register for the Am29050 Microprocessor . . . . . 61
Figure 1-22. Instruction Format . . . . . 64
Figure 1-23. Frequently Occurring Instruction–Field Uses . . . . . 66
Figure 1-24. Pipeline Stages for BTC Miss . . . . . 68
Figure 1-25. Pipeline Stages for a BTC Hit . . . . . 69
Figure 1-26. Data Forwarding and Bad–Load Scheduling . . . . . 69
Figure 1-27. Register Initialization Performed by sim29 . . . . . 74
Figure 2-1. Cache Window . . . . . 92
Figure 2-2. Overlapping Activation Record Registers . . . . . 94
Figure 2-3. 29K Microcontroller Interrupt Control Register . . . . . . . . . . . . . . . . . . . . . . . . 141
Figure 2-4. Processing Interrupts with a Signal Dispatcher . . . . . 146
Figure 3-1. The EXTRACT Instruction uses the Funnel Shifter . . . . . . . . . . . . . . . . . . . . . 166
Figure 3-2. LOAD and STORE Instruction Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Figure 3-3. General Purpose Register Usage . . . . . 179
Figure 3-4. Global Register gr1 Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
Figure 3-5. Trace–Back Tag Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Figure 3-6. Walking Back Through Activation Records . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Figure 3-7. Interrupt Procedure Tag Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
Figure 4-1. Interrupt Handler Execution Stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
Figure 4-2. The Format of Special Registers CPS and OPS . . . . . . . . . . . . . . . . . . . . . . . . 197
Figure 4-3. Interrupted Load Multiple Instruction . . . . . 209
Figure 4-4. Am29000 Processor Interrupt Enable Logic . . . . . 213
Figure 4-5. Interrupt Queue Entry Chaining . . . . . 216
Figure 4-6. An Interrupt Queuing Approach . . . . . 218
Figure 4-7. Queued Interrupt Execution Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
Figure 4-8. Saved Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
Figure 4-9. Register and Stack Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
Figure 4-10. Stack Upon Interrupt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
Figure 4-11. Stack After Fix–up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
Figure 4-12. Long–Jump to Setjmp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
Figure 5-1. A Consistent Register Stack Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
Figure 5-2. Current Procedures Activation Record . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
Figure 5-3. Overlapping Activation Records Eventual Spill Out of the
Register Stack Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
Figure 5-4. Context Save PCB Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
Figure 5-5. Register Stack Cut–Across . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
Figure 5-6. Instruction Cache Tag and Status bits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
Figure 5-7. Am29240 Microcontroller Cache Data Flow . . . . . . . . . . . . . . . . . . . . . . . . . . 281
Figure 5-8. Am29240 Data Cache Tag and Status bits . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
Figure 5-9. Am29040 2–bus Microprocessor Cache Data Flow . . . . . . . . . . . . . . . . . . . . . 284
Figure 5-10. Am29040 Data Cache Tag and Status bits . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
Figure 6-1. Average Cycles per Instruction Using DRAM . . . . . . . . . . . . . . . . . . . . . . . . . . 297
Figure 6-2. Average Cycles per Instruction Using SRAM . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
Figure 6-3. Block Diagram of Example Joint I/D System . . . . . . . . . . . . . . . . . . . . . . . . . . 299
Figure 6-4. Average Cycles per Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
Figure 6-5. Probability of a TLB Access per Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
Figure 6-6. TLB Field Composition for 4K Byte Page Size . . . . . . . . . . . . . . . . . . . . . . . . . 302
Figure 6-7. Block Diagram of Am29000 processor TLB Layout . . . . . . . . . . . . . . . . . . . . . 303
Figure 6-8. Am29000 Processor TLB Register Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
Figure 6-9. TLB Register Format for Processor with Two TLBs . . . . . . . . . . . . . . . . . . . . . 306
Figure 6-10. TLB Miss Ratio for Joint I/D 2–1 SRAM System . . . . . . . . . . . . . . . . . . . . . . . 309
Figure 6-11. Average Cycles Required per TLB Miss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
Figure 6-12. PTE Mapping to Cache Real Page Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . 312
Figure 6-13. Software Controlled Cache, K bytes paged–in . . . . . . . . . . . . . . . . . . . . . . . . 314
Figure 6-14. Probability of a Page–in Given a TLB Miss . . . . . . . . . . . . . . . . . . . . . . . . . . 314
Figure 6-15. TLB Signal Frame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
Figure 6-16. Cache Performance Gains with the Assembly Utility . . . . . . . . . . . . . . . . . . . 328
Figure 6-17. Cache Performance Gains with NROFF Utility . . . . . . . . . . . . . . . . . . . . . . . 329
Figure 6-18. Comparing Cache Based Systems with DRAM Only Systems . . . . . . . . . . . . . 329
Figure 7-1. 29K Development and Debug Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
Figure 7-2. MinMON29k Debugger Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
Figure 7-3. 29K Target Software Module Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
Figure 7-4. Vector Table Assignment for DebugCore 2.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
Figure 7-5. Processor Initialization Code Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
Figure 7-6. Operating System Information Passed to dbg_control() . . . . . . . . . . . . . . . . . . 345
Figure 7-7. Return Structure from dbg_control() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
Figure 7-8. Typical OS–boot Memory Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
Figure 7-9. Currently Available Debugging Tools that Conform to UDI Specification . . . . 362
Figure 7-10. The UDB to 29K Connection via the GIO Process . . . . . . . . . . . . . . . . . . . . . 378
Figure 7-11. UDB Main Window Showing Source Code . . . . . 380
Figure 7-12. UDB Window Showing the Assembly Code Associated with the
Previous Source Code Window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
Figure 7-13. UDB Window Showing Global Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
Figure 7-14. HP16500B Logic Analyzer Window Showing State Listing . . . . . . . . . . . . . . . . 386
Figure 7-15. Path Taken By Am29040 Recursive Trace Processing Algorithm . . . . . . . . . . . 390
Figure 7-16. UDB Console Window Showing Processed Trace Information . . . . . . . . . . . . . 395
Figure 7-17. UDB Trace Window Showing Processed Trace Information . . . . . . . . . . . . . . 396
Figure 7-18. PI–Am29460 Preprocessor Trace Capture Scheme . . . . . . . . . . . . . . . . . . . . . 409
Figure 7-19. PI–Am29460 Preprocessor Trace Capture Timing . . . . . . . . . . . . . . . . . . . . . 410
Figure 7-20. Slave Data Supporting Am29460 Traceable Cache . . . . . . . . . . . . . . . . . . . . 412
Figure 7-21. RLE Output Queue From Reorder Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
Figure 8-1. 29K Microcontrollers Running the LAPD Benchmark
With 16 MHz Memory Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
Figure 8-2. 29K Microcontrollers Running the LAPD Benchmark
With 20 MHz Memory Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428
Figure 8-3. 29K Microcontrollers Running the LAPD Benchmark
With 25 MHz Memory Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
Figure 8-4. 29K Microcontrollers Running the LAPD Benchmark
With 33 MHz Memory Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
Figure 8-5. Am2920x Microcontrollers Running the LAPD Benchmark with
8–bit and 16–bit Memory Systems Operating at 12 and 16 MHz . . . . . . . . . . . 432
Figure 8-6. 29K Microprocessors Running the LAPD Benchmark
with 16 MHz Memory systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
Figure 8-7. 29K Microprocessors Running the LAPD Benchmark
with 20 MHz Memory Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
Figure 8-8. 29K Microprocessors Running the LAPD Benchmark
with 25 MHz Memory Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
Figure 8-9. 29K Microprocessors Running the LAPD Benchmark
with 33 MHz Memory Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
Figure 8-10. Am29040 Microprocessors Running the LAPD Benchmark
with Various Register Stack Window Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
Figure 8-11. Am29200 Microcontroller Running the LAPD Benchmark
with Various Register Stack Window Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
Figure 8-12. Am29040 Microprocessors Running the Stanford Benchmark
with Various Register Stack Window Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
Figure 8-13. Reduction In Worst–Case Asynchronous Task Context Switch Times
with Various Register Stack Window Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
Figure A-1. HIF Register Preservation for Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
Figure D-1. 29K Target Software Module configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
Figure D-2. Data Structure Shared by Operating System and DebugCore 2.0 . . . . . . . . . . 529
Figure D-3. DebugCore 2.0 Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
Figure D-4. OS Information Passed to dbg_control() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
Figure D-5. Return Structure from dbg_control() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
Figure D-6. DebugCore 2.0 Receive Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
Figure D-7. Message System 1.0 Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
Figure D-8. Configuration Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539
Tables
Table 1-1. Pin Compatible 3–bus 29K Family Processors . . . . . 8
Table 1-2. Pin Compatible 2–bus 29K Family Processors . . . . . . . . . . . . . . . . . . . . . . . . . 14
Table 1-3. Am2920x Microcontroller Members of 29K Processor Family . . . . . . . . . . . . . . 34
Table 1-4. Am2924x Microcontroller Members of 29K Processor Family . . . . . . . . . . . . . . 42
Table 1-5. 3–bus Processor Memory Modeling Parameters for sim29 . . . . . . . . . . . . . . . . 76
Table 1-6. 3–bus Processor DRAM Modeling Parameters for sim29 (continued) . . . . . . . . 77
Table 1-7. 3–bus Processor Static Column Modeling Parameters for sim29 (continued) . . 77
Table 1-8. 3–bus Processor Memory Modeling Parameters for sim29 (continued) . . . . . . . 78
Table 1-9. 2–bus Processor Memory Modeling Parameters for older sim29 . . . . . . . . . . . . 78
Table 1-10. 2–bus Processor Memory Modeling Parameters for newer sim29 . . . . . . . . . . 79
Table 1-11. Microcontroller Memory Modeling Parameters for sim29 . . . . . . . . . . . . . . . . 81
Table 1-12. Microcontroller Processor Memory Modeling Parameters for newer sim29 . . 83
Table 2-1. Trap Handler Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Table 2-2. HIF Service Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Table 2-3. HIF Service Call Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Table 2-4. HIF Service Call Parameters (Concluded) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Table 3-1. Integer Arithmetic Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Table 3-2. Integer Arithmetic Instructions (Concluded) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
Table 3-3. Compare Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
Table 3-4. Compare Instructions (Concluded) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Table 3-5. Logical Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
Table 3-6. Shift Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
Table 3-7. Data Move Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Table 3-8. Data Move Instructions (Concluded) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Table 3-9. Constant Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Table 3-10. Floating–Point Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Table 3-11. Floating–Point Instructions (Concluded) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Table 3-12. Branch Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Table 3-13. Miscellaneous Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Table 4-1. Global Register Allocations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
Table 4-2. Expanded Register Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
Table 5-1. 29K Family Instruction and Data Cache Support . . . . . 271
Table 5-2. Instruction Cache Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
Table 5-3. Data Cache Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
Table 5-4. PGM Field of the Am29040 Microprocessor TLB . . . . . . . . . . . . . . . . . . . . . . . . 286
Table 6-1. PGM Field of the Am29040 Microprocessor TLB . . . . . . . . . . . . . . . . . . . . . . . . 307
Table 7-1. 29K Family On-chip Debug Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
Table 7-2. UDI–p Procedures (Version 1.2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
Table 7-3. ptrace() Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
Table 7-4. GDB Remote–Target Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
Table 7-5. PI–Am29040 Logic Analyzer Pod Assignment . . . . . 407
Table 7-6. PI–Am29460 Logic Analyzer Pod Assignment . . . . . 415
Table 8-1. Memory Access Times for Am2920x Microcontroller ROM Space . . . . . . . . . . . 426
Table 8-2. ROM and FLASH Memory Device Access Times . . . . . . . . . . . . . . . . . . . . . . . . 427
Table 8-3. Memory Access Times for Am2924x Microcontroller ROM Space . . . . . . . . . . . 427
Table 8-4. Cache Block Reload Times for Various Memory Types . . . . . . . . . . . . . . . . . . . . 436
Table A-1. HIF Open Service Mode Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
Table A-2. Default Signals Handled by HIF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496
Table A-3. HIF Signal Return Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
Table A-4. HIF Error Numbers Assigned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
Table A-5. HIF Error Numbers Assigned (continued) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
Table A-6. HIF Error Numbers Assigned (continued) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
Table A-7. HIF Error Numbers Assigned (continued) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
Table A-8. HIF Error Numbers Assigned (continued) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
Table A-9. HIF Error Numbers Assigned (concluded) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
Table C-1. Software Assigned Trap Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
Table C-2. Software Assigned Trap Numbers (continued) . . . . . . . . . . . . . . . . . . . . . . . . . . 523
Table C-3. Software Assigned Trap Numbers (concluded) . . . . . . . . . . . . . . . . . . . . . . . . . . 524
Preface
The first edition of this book brought together, for the first time, a
comprehensive collection of information required by the person developing software
for the Advanced Micro Devices 29K family of RISC microprocessors and
microcontrollers. This second edition contains all the material from the first and adds
many new topics, such as performance evaluation and on–chip cache operation. Topics
such as interrupt processing and software debugging are extended with new
techniques. The book is useful to the computer professional
and student interested in the 29K family RISC implementation. It does not assume
that the reader is familiar with RISC techniques.
Although certain members of the 29K family are equally suited to the
construction of a workstation or an embedded application, the material is mainly
applicable to embedded application development. This slant should suit most readers,
since early in the 29K’s introduction AMD has promoted the family as a collection of
processors spanning a wide range of embedded performance. Additionally, in recent
years, AMD introduced a range of microcontrollers, beginning with the Am29200. The
inclusion of on–chip peripherals in the microcontroller implementations has made
this particular extension to the family well received by the embedded processor
community.
The success of the 29K family, and of RISC technology in general, has created
considerable interest within the microprocessor industry. A growing number of
engineers are evaluating RISC, and an increasing number are selecting RISC rather
than CISC designs for new products. Higher processor performance is the main
reason cited for adopting new RISC designs. This book describes the methods used
by the 29K family, many of which are characteristic of the RISC approach, to obtain a
performance gain over CISC processors. Many of the processor and software features
described are compared with an equivalent CISC method; this should assist the
engineer making the CISC–to–RISC transition.
Because the 29K family architecture reveals the processor’s internal pipeline
operation much more than a CISC architecture, a better understanding of how the
software can control the hardware and avoid resource conflicts is required to obtain
the best performance. Up to this point, software engineers have had to glean
information about programming the 29K family from scattered application notes,
conference proceedings, and other publications. In addition, much of the necessary
information has never been documented. This has led to a number of difficulties,
particularly where the most efficient use of the RISC design features is sought.
The material presented is practical rather than theoretical. Each chapter is largely
self–contained, reducing the need to read earlier chapters before later
chapters are studied. Many of the code examples are directly usable in real embedded
systems rather than as student exercises. Engineers planning on using the 29K
family will be able to extract useful code sequences from the book for integration into
their own designs. Much of the material presented has been used by AMD, and other
independent companies, in building training classes for computer professionals
wishing to quickly gain an understanding of the 29K family.
This book is organized as follows:
Chapter 1 describes the architectural characteristics of the 29K RISC
microprocessor and microcontroller family. The original family member, the
Am29000 processor, is described first. Then the family tree evolution is dealt with in
terms of each member’s particular features. Although all 29K processors are
application code compatible they are not all pin compatible. The ability of the 29K
family to be flexible in its memory requirements is presented. In addition, the chapter
shows the importance of keeping the RISC pipeline busy if high performance is to be
achieved.
Chapter 2 deals with application programming. It covers the main topics
required by a software developer to produce code for execution on a 29K.
Application coding is typically done in a high–level language; the chapter assumes the
C language, which is the most widely used. The dual register and memory stack
technique used by the 29K procedure calling convention is described in detail, along
with the process of maintaining the processor’s local register file as a cache for the top
of the register stack. Application programs require run–time support; the library
services typically used by developers make demands upon operating system services.
The Host Interface (HIF) specifies a set of operating system services. The HIF services are
described and their relevance put in context.
Chapter 3 explains how to program a 29K at assembly level. Methods of
partitioning and accessing a processor’s register space are described. This includes the
special register space which can only be reached by assembly level instructions. The
reader is shown how to deal with such topics as branch delay slots and memory access
latency. It is not expected that application programs will be developed in assembly
language; rather, assembly language coding skills are required by the operating
system developer. Some developers may only be required to utilize assembly coding
to implement, say, a small interrupt handler routine.
Chapter 4 deals with the complex subject of 29K interrupts. Because 29K
processors make no use of microcode, the range of interrupt handler options is
extended over the typical CISC type processor. Techniques new to the reader familiar
with CISC, such as lightweight interrupts and interrupt context caching, are
presented. Most application developers are moving toward writing interrupt
handlers in a high level language, such as C. This chapter describes the process of
preparing the 29K to handle a C level signal handler after taking an interrupt or trap.
Chapter 5 deals with operating system issues. It describes, in detail, the process
of performing an application task context switch. This is one of the major services
performed by an operating system. A detailed knowledge of the utilized
procedural–linkage mechanism and 29K architectural features is required to
implement a high performance context switch. Also dealt with are issues concerning
the operation and maintenance of on–chip instruction and data memory cache.
Chapter 6 describes the Translation Look–Aside Buffer (TLB) which is
incorporated into many of the 29K family members. Its use as a basic building block
for a Memory Management Unit (MMU) is described. This chapter also
demonstrates the use of the TLB to implement a software–controlled cache which
improves overall system performance.
Chapter 7 explains the operation of popular software debugging tools such as
MiniMON29K and GDB. The process of building a debug environment for an
embedded application is described. Also dealt with is the Universal Debug Interface
(UDI) which is used to connect the user–interface process with the process
controlling the target hardware. The use of UDI introduces new freedom in tool
choice to the embedded product developer.
Chapter 8 helps with the sometimes difficult task of processor selection.
Performance benchmarks are presented for all the current 29K family members. The
effects that on–chip cache and memory system performance have on overall system performance are quantified. Systems are considered in terms of their performance and software
programming requirements.
Although I am the sole author of this book, I would like to thank my colleagues
at Advanced Micro Devices for their help with reviewing early manuscripts. I am
also grateful for their thoughtful suggestions, many of which were offered during the
porting of 4.3bsd UNIX to the Am29000 processor. I would also like to thank Grant
Maxwell for his helpful comments and in particular his review of chapters 1, 5 and 8.
Bob Brians also extensively reviewed the first edition and suggested a number of
improvements; he also made many helpful comments when he reviewed the
manuscript for this second edition. Mike Johnson and Steve Guccione reviewed the
section introducing superscalar processors. Chip Freitag reviewed chapter 8 and
helped me improve its quality. Discussions with Leo Lozano helped resolve many of
the issues concerning cache operation dealt with in chapter 5. Thanks also to
Embedded Systems Programming for allowing the use of material describing the
GDB debugger which first appeared in their volume 5 number 12 issue. Embedded
System Engineering is also thanked for allowing the reuse of material describing the
Am29040 processor and Architectural Simulator. Finally, I would like to thank the
Product Marketing Department of AMD’s Embedded Processor Division, for their
encouragement to complete this second edition.
Chapter 1
Architectural Overview
This chapter deals with a number of topics relevant to the selection of a 29K
family member. General RISC architecture characteristics are discussed before each
family member is described in more detail. A RISC microprocessor can achieve high
performance only if its pipeline is kept effectively busy — this is explained. Finally,
the architectural simulator is described; it is an important tool in evaluating a processor’s performance.
The instruction set of the 29K family was designed to closely match the internal
representation of operations generated by optimizing compilers. Instruction execution times are not burdened by redundant instruction formats and options. CISC microprocessors trap computational sequences in microcode. Microcode is a set of sequences of internal processor operations combined to perform a machine instruction.
A CISC microprocessor contains an on–chip microprogram memory to hold the microcode required to support the complex instructions. It is difficult for a compiler to
select CISC instruction sequences which result in the microcode being efficiently
applied to the overall computational task. The myopic microcode results in processor
operational overhead. The compiler for a CISC can not remove the overhead, it can
only reduce it by making the best selection from the array of instruction options and
formats — such as addressing modes. The compiler for a 29K RISC can exploit lean
instructions whose operation is free of microcode and always visible to the compiler
code–generator.
Each 29K processor has a 4–stage RISC pipeline: consisting of first, a fetch
stage, followed by decode, execute and write–back stages. Instructions, with few exceptions, execute in a single–cycle. Although instructions are streamlined, they still
support operations on two source operands, placing the result in a third operand. Registers are used to supply operands for most instructions, and the processor contains a
large number of registers to reduce the need to fetch data from off–chip memory.
When external memory is accessed it is via explicit load and store operations, and
never via extended instruction addressing modes. The large number of registers,
within the processor’s register file, act effectively as a cache for program data. However, the implementation of a multiport register file is superior to a conventional data
cache as it enables simultaneous access to multiple operands.
Parameter passing between procedure calls is supported by dynamically sized
register windows. Each procedure’s register window is allocated from a stack of 128
32–bit registers. This results in a very efficient procedure call mechanism, and is responsible for considerable operational benefits compared to the typical CISC method of pushing and popping procedure parameters from a memory stack.
Processors in the 29K family also make use of other techniques usually
associated with RISC, such as delayed branching, to keep the instruction hungry
RISC fed and prevent pipeline stalling.
The freedom from microcode not only benefits the effectiveness of the instruction processing stream, but also benefits the interrupt and trap mechanism required to
support such events as external hardware interrupts. The preparations performed by
29K hardware for interrupt processing are very brief, and this lightweight approach
enables programmers to define their own interrupt architecture, with optimizations selected which are best for, say, interrupt throughput or short latency in
commencing handler processing.
The 29K family includes 3–bus Harvard memory architecture processors,
2–bus processors which have simplified and flexible memory system interfaces, and
microcontrollers with considerable on–chip system support. The range is extensive,
yet User mode instruction compatibility is achieved across the entire family [AMD
1993a]. Within each family–grouping, there is also pin compatibility. The family
supports the construction of a scalable product range with regard to performance and
system cost. For example, all of the performance of the top–end processor configurations may not be required, or be appropriate, in a product today but it may be necessary in the future. Because of the range and scalability of the family, making a commitment to 29K processor technology is an investment supported by the ability to
scale–down or scale–up a design in the future. Many of the family’s advantages are
attained by the flexibility in memory architecture choice. This is significant because
of the important impact a memory system can have on performance, overall cost, and
design and test time [Olson 1988][Olson 1989].
The microcontroller family members contain all the necessary RAM and ROM
interface glue–logic on–chip, permitting memory devices to be directly connected to
the processor. Given that memory systems need only be 8–bit or 16–bit wide, the
introduction of these devices should hasten the selection of embedded RISC in future
product designs. The use of RISC need not be considered an expensive option in
terms of system cost or hardware and software design times. Selecting RISC is not
only the correct decision for expensive workstation designs, but increasingly for a
wide range of performance and price sensitive embedded products.
1.1 A RISC DEFINITION
The process of dealing with an instruction can be broken down into stages (see
Figure 1-1). An instruction must then flow through the pipeline of stages before its
processing is complete. Independent hardware is used at each pipeline stage. Information is passed to subsequent pipeline stages at the completion of each processor
cycle. At any instant, the pipeline stages are processing several instructions which are
each at a different stage of completion. Pipelining increases the utilization of the processor hardware, and effectively reduces the number of processor cycles required to
process an instruction.
Figure 1-1. RISC Pipeline (instructions #1, #2, and #3 flow through the fetch, decode, execute, and write–back stages; a new instruction enters the pipeline at each 1–cycle step)
With a 4–stage pipeline an instruction takes four cycles to complete, assuming
the pipeline stages are clocked at each processor cycle. However, the processor is
able to start a new instruction at each new processor cycle, and the average processing time for an instruction is reduced to 1–cycle. Instructions which execute in
1–cycle have only 1–cycle latency as their results are available to the next instruction.
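As an aside, the arithmetic behind that claim can be sketched in a few lines of C; the stage count and instruction count below are illustrative values only, not figures from any 29K data sheet.

    #include <stdio.h>

    /* Ideal pipeline model: the first instruction needs one cycle per
     * stage; every following instruction completes one cycle later,
     * provided the pipeline never stalls.                              */
    int main(void)
    {
        const int  stages = 4;           /* fetch, decode, execute, write-back */
        const long instructions = 1000;  /* illustrative stream length         */

        long total_cycles = stages + (instructions - 1);
        double avg = (double)total_cycles / instructions;

        printf("total cycles = %ld, average = %.3f cycles/instruction\n",
               total_cycles, avg);       /* approaches 1 as the stream grows   */
        return 0;
    }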
The 4–stage pipeline of the 29K processor family supports a simplified execute
stage. This is made possible by simplifying instruction formats, limiting instruction
complexity and operating on data held in registers. The simplified execute stage
means that only a single processor cycle is required to complete execute–stage processing and the cycle time is also minimized.
CISC processors support a complex execution–stage which requires several processor cycles to complete. When an instruction is ready for execution it is broken
down into a sequence of microinstructions (see Figure 1-2). These simplified
instructions are supplied by the on–chip microprogram memory. Each microinstruction must be decoded and executed separately before the instruction execution–stage
is complete. Depending on the amount of microcode needed to implement a CISC
instruction, the number of cycles required to complete instruction processing varies
from instruction to instruction.
Figure 1-2. CISC Pipeline (while instruction #1 occupies the execute stage for a sequence of microcode decode and execute cycles, the fetch of instruction #2 overlaps; each instruction spans several 1–cycle steps)
Because the hardware used by the execute–stage of a CISC processor is utilized
for a number of processor cycles, the other stages of the pipeline have available additional cycles for their own operation. For example, if an execute–stage requires four
processor cycles, the overlapping fetch–stage of the next instruction has four cycles to complete. If the fetch–stage takes four or fewer cycles, then no stalling of the pipeline due to execute–stage starvation shall occur. Starvation or pipeline stalling occurs
when a previous stage has not completed its processing and can not pass its results to
the input of the next pipeline stage.
During the evolution of microprocessors, earlier designs operated with slower
memories than are available today. Both processor and memory speeds have seen
great improvements in recent years. However, the low cost of high performance
memory devices now readily available has shifted microprocessor design. When
memory was slow it made sense to overlap multicycle instruction fetch stages with
multicycle execute stages. Once an instruction had been fetched it was worthwhile
getting as much execute–value as possible since the cost of fetching the instruction
was high. This approach drove processor development and led to the name Complex
Instruction Set Computer.
Faster memory means that instruction processing times are no longer fetch–
stage dominated. With a reduction in the number of cycles required by the fetch–
stage, the execute–stage becomes the dominant factor in determining processor performance. Consequently attention turned to the effectiveness of the microcode sequences used to perform CISC instruction execution. Careful analysis of CISC
instruction usage revealed that the simpler instructions were much more frequently
used than the complex ones which required long microcode sequences. The conclu-
sion drawn was that microcode rarely provides the exact sequence of operations required to support a high level language instruction.
The variable instruction execution times of CISC instructions result in complex pipeline management. It is also more difficult for a compiler to work out the
execution times for different combinations of CISC instructions. For that matter it is
harder for the assembly level programmer to estimate the execution times of, say, an
interrupt handler code sequence compared to the equivalent RISC code sequence.
More importantly, streamlining pipeline operations enables reduced processor cycle
times and greater control by a compiler of the processor’s operation. Given that the
execute–stage dominates performance, the RISC approach is to fetch more instructions which can be simply executed. Although a RISC program may contain 20%
more instructions than a program for a CISC, the total number of cycles required to
perform a task is reduced.
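A back–of–the–envelope sketch of that trade–off follows; the 20% figure comes from the text above, but the cycle counts per instruction are invented for illustration and are not measured data.

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical workload: a CISC version needs N instructions at an
         * assumed average of 4 cycles each; the RISC version needs 20% more
         * instructions but close to 1 cycle each (1.2 allows for stalls).   */
        const long   cisc_insts = 100000;
        const double cisc_cpi   = 4.0;                      /* assumed        */
        const long   risc_insts = cisc_insts + cisc_insts / 5;  /* +20%       */
        const double risc_cpi   = 1.2;                      /* assumed        */

        printf("CISC cycles: %.0f\n", cisc_insts * cisc_cpi);
        printf("RISC cycles: %.0f\n", risc_insts * risc_cpi);
        return 0;
    }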
A number of processor characteristics have been proposed in the press as indicative of RISC or CISC. Many of these proposals are made by marketing departments
which wish to control markets by using RISC and CISC labels as marketing rather
than engineering expressions. I consider a processor to be RISC if it is microcode free
and has a simple instruction execute–stage which can complete in a single cycle.
1.2 FAMILY MEMBER FEATURES
Although this book is about Programming the 29K RISC Family, the following
sections are not restricted to only describing features which can be utilized by software. They also briefly describe key hardware features which affect a processor’s
performance and hence its selection.
All members of the family have User mode binary code compatibility. This
greatly simplifies the task of porting application code from one processor to another.
Some system–mode code may need to be changed due to differences in such things as
field assignments of registers in special register space.
Given the variation between family members such as the 3–bus Am29050 floating–point processor and the Am29205 microcontroller, it is remarkable that there is
so much software compatibility. The number of family members is expected to continue to grow; but already there is a wide selection enabling systems of varying performance and cost to be constructed (see Figure 1-3). If AMD continues to grow the
family at “both ends of the performance spectrum”, we might expect to see new microcontroller family members as well as superscalar microprocessors [Johnson
1991]. AMD has stated that future microprocessors will be pin compatible with the
current 2–bus family members.
I think one of the key features of 29K family members is their ability to operate
with varying memory system configurations. It is possible to build very high performance Harvard type architectures, or low cost –– high access latency –– DRAM
based systems. Two types of instruction memory caching are supported. Branch Target Cache (BTC) memory is used in 3–bus family members to hide memory access latencies. The 2–bus family members make use of more conventional bandwidth improving instruction cache memory.

Figure 1-3. Processor Price–Performance Summary (29K processor performance in MIPS plotted against system cost, grouping microcontrollers, 3–bus processors, and 2–bus processors)
The higher performance 2–bus processors and microcontrollers have on–chip
data cache. When cache hit ratios are high, processing speeds can be decoupled from
memory system speeds; especially when the processor is clocked at a higher speed
than the off–chip memory system.
A second key feature of processors in the 29K family is that the programmer
must supply the interrupt handler save and restore mechanism. Typically a CISC type
processor will save the processor context, when an exception occurs, in accordance
with the on–chip microcode. The 29K family is free of microcode, making the user
free to tailor the interrupt and exception processing mechanism to suit the system.
This often leads to new and more efficient interrupt handling techniques. The fast interrupt response time, and large interrupt handling capacity made possible by the
flexible architecture, has been cited as one of the key reasons for selecting a 29K processor design.
All members of the 29K family make some use of burst–mode memory interfaces. Burst–mode memory accesses provide a simplified transfer mechanism for
high bandwidth memory systems. Burst–mode addressing only applies to consecutive access sequences; it is used for all instruction fetches and for load–multiple and
store–multiple data accesses.
The 3–bus microprocessors are dependent on burst–mode addressing to free–up
the address bus after a new instruction fetch sequence has been established. The
memory system is required to supply instructions at sequential addresses without the
processor supplying any further address information; at least until a jump or call type
instruction is executed. This makes the address bus free for use in data memory access.
The non 3–bus processors can not simultaneously support instruction fetching
and data access from external memory. Consequently the address bus continually
supplies address information for the instruction or data access currently being supported by the external memory. However, burst–mode access signals are still supplied by the processor, indicating that the processor will require another access at the next sequential address once the current access is complete; this is an aid in achieving maximum memory access bandwidth. There are also a number of memory devices
available which are internally organized to give highest performance when accessed
in burst–mode.
1.3 THE Am29000 3–BUS MICROPROCESSOR
The Am29000 processor is pin compatible with other 3–bus members of the
family (see Table 1-1) [AMD 1989][Johnson 1987]. It was the first member of the
family, introduced in 1987. It is the core processor for many later designs, such as the
current 2–bus processor product line. Much of this book describes the operation of
the Am29000 processor as the framework for understanding the rest of the family.
The processor can be connected to separate Instruction and data memory systems, thus exploiting the Harvard architectural advantages (See Figure 1-4). Alternatively, a simplified 2–bus system can be constructed by connecting the data and
address busses together; this enables a single memory system to be constructed.
When the full potential of the 3–bus architecture is utilized, it is usually necessary to
include in the memory system a bridge to enable instruction memory to be accessed.
The processor does not support any on–chip means to transfer information on the
instruction bus to the data bus.
The load and store instructions, used for all external memory access, have an
option field (OPT2–0) which is presented to device pins during the data transfer operation. Option field value OPT=4 is defined to indicate the bridge should permit
ROM space to be read as if it were data. Instructions can be located in two separate
spaces: Instruction space and ROM space. Often these spaces become the same, as
the IREQT pin (instruction request type) is not decoded so as to enable distinction
between the two spaces. When ROM and Instruction spaces are not common, a range
of data memory space can be set aside for accessing Instruction space via the bridge.
It is best to avoid overlapping external address spaces if high level code is to access
any memory located in the overlapping regions (see section 1.10.4).
Table 1-1. Pin Compatible 3–bus 29K Family Processors
Processor                     Am29000               Am29050               Am29005
Instruction Cache             BTC 32x4 words        BTC 64x4 or           No
                                                    128x2 words
I–Cache Associativity         2 Way                 2 Way                 N/A
Data Cache                    –                     –                     –
D–Cache Associativity         –                     –                     –
On–Chip Floating–Point        No                    Yes                   No
On–Chip MMU                   Yes                   Yes                   No
Integer Multiply in h/w       No                    Yes                   No
Programmable Bus Sizing       No                    No                    No
On–Chip Interrupt
  Controller Inputs           Yes, 6                Yes, 6                Yes, 6
Scalable Bus Clocking         No                    No                    No
Burst–mode Addressing         Yes, up to 1K bytes   Yes, up to 1K bytes   Yes, up to 1K bytes
Freeze Mode Processing        Yes                   Yes                   Yes
Delayed Branching             Yes                   Yes                   Yes
On–Chip Timer                 Yes                   Yes                   Yes
On–Chip Memory Controller     No                    No                    No
DMA Channels                  –                     –                     –
Byte Endian                   Big/Little            Big/Little            Big/Little
JTAG Debugging                No                    No                    No
Clock Speeds (MHz)            16,20,25,33           20,25,33,40           16
Figure 1-4. Am29000 Processor 3–bus Harvard Memory System (the Am29000 RISC processor’s separate 32–bit address, data, and instruction busses connect to a coprocessor, instruction ROM, instruction memory, data memory, input/output, and a bridge between the instruction and data memory systems)
All processors in the 29K family support byte and half–word size read and write
access to data memory. The original Am29000 (pre rev–D, 1990) only supported
word sized data access. This resulted in read–modify–write cycles to modify sub–
word sized objects. The processor supports insert– and extract–byte and half–word
instructions to assist with sub–word operations. These instructions are little used
today.
The processor has a Branch Target Cache (BTC) memory which is used to supply the first four instructions of previously taken branches. Successful branches are
20% of a typical instruction mix. Using burst–mode and interleaving techniques,
memory systems can sustain the high bandwidths required to keep the instruction
hungry RISC fed. However, when a branch occurs, memory systems can present considerable latency before supplying the first instruction of the branch target. For example, consider an instruction memory system which has a 3–cycle first access latency but can sustain 1–cycle access in burst–mode. Typically every 5th instruction is a
branch and for the example the branch instruction would take effectively 5–cycles to
complete its execution (the pipeline would be stalled for 4–cycles (see section 1.13)).
If all other instructions were executed in a single–cycle the average cycle time per
instruction would be 1.8 (i.e. 9/5); not the desired sustained single–cycle operation.
The BTC can hide all 3–cycles of memory access latency, and enable the branch
instruction to execute in a single–cycle.
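The 1.8 cycles–per–instruction figure can be reproduced with a few lines of arithmetic. The sketch below simply parameterizes the values assumed in the example above; none of them are measurements.

    #include <stdio.h>

    int main(void)
    {
        const double branch_fraction = 0.20;  /* every 5th instruction          */
        const int    first_access    = 3;     /* cycles before the first target
                                                 instruction arrives            */
        const int    extra_stall     = 4;     /* stall cycles charged to a taken
                                                 branch in the example          */

        /* Without the BTC a taken branch effectively takes 1 + extra_stall
         * cycles; every other instruction takes a single cycle.            */
        double cpi_no_btc = (1.0 - branch_fraction) * 1.0 +
                            branch_fraction * (1.0 + extra_stall);

        /* With a BTC hit the access latency is hidden and the branch also
         * completes in a single cycle.                                     */
        double cpi_btc_hit = 1.0;

        printf("CPI without BTC: %.1f\n", cpi_no_btc);   /* prints 1.8       */
        printf("CPI with BTC hits: %.1f (%d-cycle latency hidden)\n",
               cpi_btc_hit, first_access);
        return 0;
    }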
The programmer has little control over BTC operation; it is maintained internally by processor hardware. There are 32 cache entries (known as cache blocks) of four
instructions each. They are configured in a 2–way set associative arrangement. Entries are tagged to distinguish between accesses made in User mode and Supervisor
mode; they are also tagged to differentiate between virtual addresses and physical
addresses. Because the address in the program counter is presented to the BTC at the
same time it is presented to the MMU, the BTC does not operate with physical addresses. Entries are not tagged with per–process identifiers; consequently the BTC
can not distinguish between identical virtual addresses belonging to different processes operating with virtual addressing. Systems which operate with multiple tasks
using virtual addressing must invalidate the cache when a user–task context switch
occurs. Using the IRETINV (interrupt return and invalidate) instruction is one convenient way of doing this.
The BTC is able to hold the instructions of frequently taken trap handler routines, but there is no means to lock code sequences into the cache. Entries are replaced
in the cache on a random basis, the most recently occurring branches replacing the
current entries when necessary.
The 3–bus members of the 29K family can operate the shared address bus in
a pipeline mode. If a memory system is able to latch an address before an instruction
or data transfer is complete, the address bus can be freed to start a subsequent access.
Allowing two accesses to be in progress simultaneously can be effectively used by
the separate instruction and data memory systems of a Harvard architecture.
1.3.1 The Am29005
The Am29005 is pin compatible with other 3–bus members of the family (see
Table 1-1). It is an inexpensive version of the Am29000 processor. The Translation
Look–Aside Buffer (TLB) and the Branch Target Cache (BTC) have been omitted. It
is available at a lower clock speed, and only in the less expensive plastic packaging. It
is a good choice for systems which are price sensitive and do not require Memory
Management Unit support or the performance advantages of the BTC. An Am29005
design can always be easily upgraded with an Am29000 replacement later. In fact the
superior debugging environment offered by the Am29000 or the Am29050 may
make the use of one of these processors a good choice during software debugging. The
faster processor can always be replaced by an Am29005 when production commences.
1.4 THE Am29050 3–BUS FLOATING–POINT MICROPROCESSOR
The Am29050 processor is pin compatible with other 3–bus members of the
family (see Table 1-1) [AMD 1991a]. Many of the features of the Am29050 were already described in the section describing its closely related relative, the Am29000.
The Am29050 processor offers a number of additional performance and system support features when compared with the Am29000. The most notable is the direct
execution of double–precision (64–bit) and single–precision (32–bit) floating–point
arithmetic on–chip. The Am29000 has to rely on software emulation or the
Am29027 floating–point coprocessor to perform floating–point operations. The
introduction of the Am29050 eliminated the need to design the Am29027 coprocessor into floating–point intensive systems.
The processor contains a Branch Target Cache (BTC) memory system like the
Am29000; but this time it is twice as big, with 32 entries in each of the two sets rather
than the Am29000’s 16 entries per set. BTC entries are not restricted to four instructions per entry; there is an option (bit CO in the CFG register) to arrange the BTC as
64 entries per set, with each entry containing two instructions rather than four. The
smaller entry size is more useful with lower latency memory systems. For example, if
a memory system has a 2–cycle first–access start–up latency it is more efficient to
have a larger number of 2–instruction entries. After all, for this example system, the
third and fourth instructions in a four per entry arrangement could just as efficiently
be fetched from the external memory.
The Am29050 also incorporates an Instruction Forwarding path which additionally helps to reduce the effects of instruction memory access latency. When a new
instruction fetch sequence commences, and the target of the sequence is not found in
the BTC, an external memory access is performed to start filling the Instruction Prefetch Buffer (IPB). With the Am29000 processor the fetch stage of the processor
pipeline is fed from the IPB, but the Am29050 can by–pass the fetch stage and feed
the first instruction directly into the decode pipeline stage using the instruction forwarding technique. By–passing also enables up to four cycles of external memory
latency to be hidden when a BTC hit occurs (see section 1.10).
The Am29050 incorporates a Translation Look–Aside Buffer (TLB) for
Memory Management Unit support, just like the Am29000 processor. However it
also has two region mapping registers. These permit large areas of memory to be
mapped without using up the smaller TLB entries. They are very useful for mapping
large data memory regions, and their use reduces the TLB software management
overhead.
The processor can also speed up data memory accesses by making the access
address available a cycle earlier than the Am29000. The method is used to speed up memory load operations, which have a greater influence on pipeline stalling than
store operations. Normally the address of a load appears on the address bus at the start
of the cycle following the execution of the load instruction. If virtual addressing is in
use, then the TLB registers are used to perform address translation during the second
half of the load execute–cycle. To save a cycle, the Am29050 must make the physical
address of the load available at the start of the load instruction execution. It has two
ways of doing this.
The access address of a load instruction is specified by the RB field of the
instruction (see Figure 1–13). A 4–entry Physical Address Cache (PAC) memory is
used to store most recent load addresses. The cache entries are tagged with RB field
register numbers. When a load instruction enters the decode stage of the pipeline, the
RB field is compared with one of the PAC entries, using a direct mapping technique,
with the lower 2–bits of the register number being used to select the PAC entry. When
a match occurs the PAC supplies the address of the load, thus avoiding the delay of
reading the register file to obtain the address from the register selected by the RB field
of the LOAD instruction. If a PAC miss occurs, the new physical address is written to
the appropriate PAC entry. The user has no means of controlling the PAC; its operation is completely determined by the processor hardware.
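As a rough software analogy, the direct–mapped lookup behaves something like the sketch below. The structure and function names are invented for illustration; the real PAC is hardware that the programmer never touches.

    #include <stdio.h>
    #include <stdint.h>

    /* Toy model of the 4-entry Physical Address Cache, direct mapped on
     * the low 2 bits of a load's RB register number.                      */
    struct pac_entry {
        int      valid;
        unsigned rb;          /* tag: register number that held the address */
        uint32_t phys_addr;   /* most recent physical address for that RB   */
    };

    static struct pac_entry pac[4];

    /* Returns 1 on a hit (address available early); on a miss the newly
     * translated address is installed in the selected entry.              */
    static int pac_lookup(unsigned rb, uint32_t translated, uint32_t *addr)
    {
        struct pac_entry *e = &pac[rb & 0x3];   /* direct mapping           */
        if (e->valid && e->rb == rb) {
            *addr = e->phys_addr;
            return 1;
        }
        e->valid = 1;                           /* miss: refill the entry   */
        e->rb = rb;
        e->phys_addr = translated;
        return 0;
    }

    int main(void)
    {
        uint32_t a;
        printf("first lookup hit? %d\n", pac_lookup(66, 0x1000, &a));  /* miss */
        printf("second lookup hit? %d\n", pac_lookup(66, 0x1000, &a)); /* hit  */
        return 0;
    }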
The second method used by the Am29050 processor to reduce the effect of pipeline stalling occurring as a result of memory load latency is the Early Address Generator (EAG). Load addresses are frequently formed by preceding the load with
CONST, CONSTH and ADD type instructions. These instructions prepare a general
purpose register with the address about to be used during the load. The EAG circuitry
continually generates addresses formed by the use of the above instructions in the
hope that a load instruction will immediately follow and use the address newly
formed by the preceding instructions. The EAG must make use of the TLB address
translation hardware in order to make the physical address available at the start of the
load instruction. This happens when, fortunately, the RB field of the load instruction
matches with the destination register of the previous address computation instructions.
Software debugging is better supported on the Am29050 processor than on any
other current 29K family member. All 29K processors have a trace facility which enables single stepping of processor instructions. However, prior to the Am29050 processor, tracing did not apply to the processor operation while the DA bit (disable all
traps and interrupts) was set in the current processor status (CPS) register. The DA bit
is typically set while the processor is operating in Freeze mode (FZ bit set in the CPS
register). Freeze mode code is used during the entry and exit of interrupt and trap
handlers, as well as other critical system support code. The introduction of Monitor
mode operation with the Am29050 enables tracing to be extended to Freeze mode
code debugging. The processor enters Monitor mode when a synchronous trap occurs while the DA bit is set. The processor is equipped with a second set of PC buffer
registers, known as the shadow PC registers, which record the PC–bus activity while
the processor is operating in Monitor mode. The first set of PC buffer registers have
their values frozen when Freeze mode is entered.
The addition of two hardware breakpoint registers aids the Am29050 debug
support. As instructions move into the execute stage of the processor pipeline, the
instruction address is compared with the break address values. The processor takes a
trap when a match occurs. Software debug tools, such as monitors like MiniMON29K, used with other 29K family members, typically use illegal instructions to
implement breakpoints. The use of breakpoint registers has a number of advantages
over this technique. Breakpoints can be placed in read–only memories, and break addresses need not be physical but virtual, tagged with the per–process identifier.
1.5 THE Am29030 2–BUS MICROPROCESSOR
The Am29030 processor is pin compatible with other 2–bus members of the
family (see Table 1-2) [AMD 1991b]. It was the first member of the 2–bus family
introduced in 1991. Higher device construction densities enable it to offer high performance with a simplified system interface design. From a software point of view
the main differences between it and the Am29000 processor occur as a result of replacing the Branch Target Cache (BTC) memory with 8k bytes of instruction cache,
and connecting the instruction and data busses together on–chip. However, the system interface busses have gained a number of important new capabilities.
The inclusion of an instruction cache memory reduces off–chip instruction
memory access bandwidth requirements. This enables instructions to be fetched via
the same device pins used by the data bus. Only when instructions can not be supplied
by the cache is there contention for access to external memory. Research [Hill 1987]
has shown that with cache sizes above 4k bytes, a conventional instruction cache is
Table 1-2. Pin Compatible 2–bus 29K Family Processors
Processor                     Am29030               Am29035               Am29040
Instruction Cache             8K bytes              4K bytes              8K bytes
I–Cache Associativity         2–Way                 Direct–Mapped         2–Way
Data Cache (Physical)         –                     –                     4K bytes
D–Cache Associativity         –                     –                     2–Way
On–Chip Floating–Point        No                    No                    No
On–Chip MMU                   Yes                   Yes                   Yes
Integer Multiply in h/w       No                    No                    Yes, 2–cycles
Narrow Memory Reads           Yes, 8/16 bit         Yes, 8/16 bit         Yes, 8/16 bit
Programmable Bus Sizing       No                    Yes, 16/32 bit        Yes, 16/32 bit
On–Chip Interrupt
  Controller Inputs           Yes, 6                Yes, 6                Yes, 6
Scalable Clocking             1x,2x                 1x,2x                 1x,2x
Burst–mode Addressing         Yes, up to 1K bytes   Yes, up to 1K bytes   Yes, up to 1K bytes
Freeze Mode Processing        Yes                   Yes                   Yes
Delayed Branching             Yes                   Yes                   Yes
On–Chip Timer                 Yes                   Yes                   Yes
On–Chip Memory Controller     No                    No                    No
DMA Channels                  –                     –                     –
Byte Endian                   Big/Little            Big/Little            Big/Little
JTAG Debugging                Yes                   Yes                   Yes
Clock Speeds (MHz)            20,25,33              16                    0–33,40,50
more effective than a BTC. At these cache sizes the bandwidth requirements are sufficiently reduced as to make a shared instruction/data bus practicable.
Each cache entry (known as a block) contains four consecutive instructions.
They are tagged in a similar manner to the BTC mechanism of the Am29000 processor. This allows cache entries to be used for both User mode and Supervisor mode
code at the same time, and entries to remain valid during application system calls and
system interrupt handlers. However, since entries are not tagged with per–process
identifiers, the cache entries must be invalidated when a task context switch occurs.
The cache is 2–way set associative. The 4k bytes of instruction cache provided by
each set results in 256 entries per set (each entry being four instructions, i.e. 16 bytes).
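For a given instruction address, the cache geometry just described (16–byte blocks, 256 blocks per set) implies a simple address breakdown. The sketch below is only an illustration of that arithmetic; the example address is arbitrary and the tag width simply covers the remaining address bits.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t addr = 0x40001234;           /* arbitrary instruction address */

        uint32_t offset = addr & 0xF;         /* byte within the 16-byte block */
        uint32_t index  = (addr >> 4) & 0xFF; /* selects one of 256 blocks in
                                                 each of the two sets          */
        uint32_t tag    = addr >> 12;         /* remaining bits form the tag   */

        printf("offset=%u index=%u tag=0x%x\n",
               (unsigned)offset, (unsigned)index, (unsigned)tag);
        return 0;
    }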
When a branch instruction is executed and the block containing the target
instruction sequence is not found in the cache, the processor fetches the missing
block and marks it valid. Complete blocks are always fetched, even if the target
instruction lies at the end of the block. However, the cache forwards instructions to
the decoder without waiting for the block to be reloaded. If the cache is not disabled
and the block to be replaced in the cache is not valid–and–locked, then the fetched
block is placed in the cache. The 2–way cache associativity provides two possible
cache blocks for storing any selected memory block. When a cache miss occurs, and
both associated blocks are valid but not locked, a block is chosen at random for replacement.
Locking valid blocks into the cache is not provided for on a per–block basis but
in terms of the complete cache or one set of the two sets. When a set is locked, valid
blocks are not replaced; invalid blocks will be replaced and marked valid and locked.
Cache locking can be used to preload the cache with instruction sequences critical to
performance. However, it is often difficult to use cache locking in a way that can out–
perform the supported random replacement algorithm.
The processor supports Scalable Clocking which enables the processor to operate at the same or twice the speed of the off–chip memory system. A 33 MHz processor could be built around a 20 MHz memory system, and depending on cache utilization there may be little drop–off in performance compared to having constructed
a 33 MHz memory system. This provides for higher system performance without increasing memory system costs or design complexity. Additionally, a performance
upgrade path is provided for systems which were originally built to operate at lower
speeds. The processor need merely be replaced by a pin–compatible higher frequency device (at higher cost) to realize improved system performance.
Memory system design is further simplified by enforcing a 2–cycle minimum
access time for data and instruction accesses. Even if 1–cycle burst–mode is supported by a memory system, the first access in the burst is hardwired by the processor
to take 2–cycles. This is effective in relaxing memory system timing constraints and
generally appreciated by memory system designers. The high frequency operation of
the Am29030 processor can easily result in electrical noise [AMD 1992c]. Enforcing
2–cycle minimum access times ensures that the address bus has more time to settle
before the data bus is driven. This reduces system noise compared with the data bus
changing state during the same cycle as the address bus.
At high processor clock rates, it is likely that an interleaved memory system will be required to obtain bandwidths able to sustain 1–cycle burst mode access. Interleaving requires the construction of two, four or more memory systems (known as
banks), which are used in sequence. When accessed in burst–mode, each bank is given more time to provide access to its next storage location. The processor provides an
input pin, EARLYA (early address), by which a memory system can request early address generation by the processor. This can be used to simplify the implementation of
interleaved memory systems. When requested, the processor provides the address of even–addressed banks early, allowing the memory system to begin early accesses
to both even– and odd–addressed banks.
The processor can operate with memory devices which are not the full 32–bit
width of the data bus. This is achieved using the Narrow Read capability. Memory
systems which are only 8–bit or 16–bit wide are connected to the upper bits of the
data/instruction bus. They assert the RDN (read narrow) input pin along with the
RDY (ready) pin when responding to access requests. When this occurs the processor
will automatically perform the necessary sequences of accesses to assemble instructions or data which are bigger than the memory system width.
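Conceptually, the narrow–read sequencing behaves like the short routine below, which gathers a big–endian 32–bit word from an 8–bit wide device. This is only a software analogy of the hardware sequence; the ROM contents and function names are invented.

    #include <stdio.h>
    #include <stdint.h>

    /* Stand-in for the single 8-bit access the narrow memory system would
     * service; in hardware each call corresponds to one bus transaction.  */
    static const uint8_t narrow_rom[] = { 0x03, 0x00, 0x00, 0x0A };

    static uint8_t read_byte_from_narrow_rom(uint32_t addr)
    {
        return narrow_rom[addr];
    }

    /* Assemble a full 32-bit word from four consecutive 8-bit reads,
     * most significant byte first (big-endian is the 29K default).        */
    static uint32_t read_word_via_narrow_bus(uint32_t addr)
    {
        uint32_t word = 0;
        for (int i = 0; i < 4; i++)
            word = (word << 8) | read_byte_from_narrow_rom(addr + i);
        return word;
    }

    int main(void)
    {
        printf("0x%08x\n", (unsigned)read_word_via_narrow_bus(0));
        return 0;
    }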
The Narrow Read ability can not be used for data writing. However, it is very
useful for interfacing to ROM which contains system boot–up code. Only a single
8–bit ROM may be required to contain all the necessary system initialization code.
This can greatly simplify system design and reduce board space and cost. The ROM can be used
to initialize system RAM memory which, due to its 32–bit width, will permit faster
execution.
1.5.1 Am29030 Evaluation.
AMD provides a low cost evaluation board for the Am29030 at 16 MHz, known
as the EZ030 (pronounced easy–030). Like the microcontroller evaluation board, it is
a standalone board, requiring an external 5–volt power supply and connection to a remote computer via an RS–232 connection. The board is very small, measuring about 4 inches
by 4 inches (10x10 cm). The memory system is restricted to 16 MHz operation but
with scalable clocking the processor can run at 16 MHz or 33 MHz.
It contains 128k bytes of EPROM, which is accessed via 8–bit narrow bus protocol. There is also 1M byte of DRAM arranged as 256kx32 bits. The DRAM is expandable to 4M bytes. The EPROM is preprogrammed with the MiniMON29K debug monitor and the OS–boot operating system described in Chapter 7.
1.5.2 The Am29035
The Am29035 processor is pin compatible with other 2–bus members of the
family (see Table 1-2). As would be expected, given the AMD product number, its
operation is very similar to the Am29030 processor. It is only available at lower clock
frequencies, compared with its close relative. And with half the amount of instruction
cache memory, it contains one set of the two sets provided by the Am29030. That is, it
has 4k bytes of instruction memory cache which is directly mapped. Consequently it
can be expected to operate with reduced overall performance.
In all other aspects it is the same as the Am29030 processor, except it has Programmable Bus Sizing which the Am29030 processor does not. Programmable Bus
Sizing provides for lower cost system designs. The processor can be dynamically
programmed (via the configuration register) to operate with a 16–bit instruction/data
bus, performing both read and write operations. When the option is selected, 32–bit
data is accessed by the processor hardware automatically performing two consecutive accesses. The ability to operate with 16–bit and 32–bit memory systems makes
the 2–bus 29K family members well suited to scalable system designs, in terms of
cost and performance.
1.6 THE Am29040 2–BUS MICROPROCESSOR
The Am29040 processor is pin compatible with other 2–bus members of the
family (see Table 1-2). The processor was introduced in 1994 and offers higher performance than the 2–bus Am29030; it also has a number of additional system support
facilities.
There is an enhanced instruction cache, now 8k bytes, which is tagged in much
the same way as the Am29030’s instruction cache, except there are four valid bits per
cache block (compared to the Am29030’s one bit per block). Partially filled blocks
are supported, and block reload begins with the first required instruction (target of a
branch) rather than the first instruction in the block. An additional benefit of having a
valid bit per–instruction rather than per–block is that load or store instructions can
interrupt cache reload. With the Am29030 processor, once cache reload had started,
it could not be postponed or interrupted by a higher priority LOAD instruction.
The Am29040 was the first 29K microprocessor to have a data cache. The 4k
byte data cache is physically addressed and supports both “copy–back” and “write–
through” policies. Like other 29K Family members, the data cache always operates
with physical addresses and cache blocks are only allocated on LOAD instructions
which miss (a “read–allocate” or “load–allocate” policy). The block size is 16 bytes
and there is one valid bit per block. This means that complete data blocks must be
fetched when data cache reload occurs. Burst mode addressing is used to reload a
block, starting with the first word in the block. The addition of a data cache makes the
Am29040 particularly well–suited to high–performance data handling applications.
The default data cache policy is “copy–back”. A four word copy–back buffer is
used to improve the performance of the copy–back operation. Additionally, cache
blocks have an M bit–field, which becomes set when data in the block is modified. If
the M bit is not set when a cache block is reallocated, the out–going block is not copied back.
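The reallocation decision can be pictured with a short C sketch. The structure and function names below are invented for illustration and do not correspond to any programmer–visible Am29040 interface.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define BLOCK_BYTES 16

    struct dcache_block {
        int      valid;
        int      modified;        /* the M bit: set when the block is written */
        uint32_t phys_tag;
        uint8_t  data[BLOCK_BYTES];
    };

    static void copy_back(const struct dcache_block *b)
    {
        printf("copy-back of dirty block, tag 0x%x\n", (unsigned)b->phys_tag);
    }

    /* Reallocate a block for a new physical address; only a block whose
     * M bit is set needs to be written back to memory first.              */
    static void reallocate_block(struct dcache_block *b, uint32_t new_tag)
    {
        if (b->valid && b->modified)
            copy_back(b);

        b->valid    = 1;          /* reload of data[] would follow by burst */
        b->modified = 0;
        b->phys_tag = new_tag;
        memset(b->data, 0, BLOCK_BYTES);
    }

    int main(void)
    {
        struct dcache_block blk = { 1, 1, 0x123, {0} };
        reallocate_block(&blk, 0x456);   /* dirty block: prints the message */
        return 0;
    }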
When data cache is added to a processor, there can be difficulties dealing with
data consistency. Problems arise when there is more than one processor or data controller (such as a DMA controller) accessing the same memory region. The Am29040
processor uses bus snooping to solve this problem. The method relies on the processor monitoring all accesses performed on the memory system. The processor intervenes or updates its cache when an access is attempted on a currently cached data
value. Cache consistency is dealt with in detail in section 5.14.4.
Via the MMU, each memory page can be separately marked as “non cached”,
“copy–back”, or “write–through”. A two word write–through buffer is used to assist
with writes to memory. It enables multiple store instructions to be in–execution without the processor pipeline stalling. Data accesses which hit in the cache require
2–cycle access times. Two cycles, rather than one, are required due to the potentially
high internal clock speed. The data cache operation is explained in detail in section
5.14.2. However, load instructions do not cause pipeline stalling if the instruction immediately following the load does not require the data being accessed.
Scalable bus clocking is supported; enabling the processor to run at twice the
speed of the off–chip memory system. Scalable Clocking was first introduced with
the Am29030 processors, and is described in the previous section describing the
Am29030. If cache hit rates are sufficiently high, Scalable Clocking enables high
performance systems to be built around relatively slow memory systems. It also offers an excellent upgrade path when additional performance is required in the future.
The maximum on–chip clock speed is 50 MHz.
The Am29040 processor supports integer multiply directly. A latency of two
cycles applies to integer multiply instructions (most 29K instructions require only
one cycle). Again, this is a result of the potentially high internal clocking speeds of
the processor. Most 29K processors take a trap when an integer multiply is attempted.
It is left to trapware to emulate the missing instruction. The ability to perform high
speed multiply makes the processor a better choice for calculation intensive applications such as digital signal processing. Note, floating–point performance should also
improve with the Am29040 as floating–point emulation routines can make use of the
integer multiply instruction.
The Am29040 has two Translation Look–Aside Buffers (TLBs). Having two
TLBs enables a larger number of virtual to physical address translations to be cached
(held in a TLB register) at any time. This reduces the TLB reload overhead. The TLB
format is similar to the arrangement used with the Am29243 microcontroller. Each
TLB has 16 entries (8 sets, two entries per set). The page size used by each TLB can
be the same or different. If the TLB page sizes are the same, a four–way set associative MMU can be constructed with supporting software. Alternatively one TLB can
be used for code and the second, with a larger page size, for data buffers or shared
libraries. The TLB entries have a Global Page (GLB) bit; when set the mapped page
can be accessed by any process regardless of its process identifier (PID). The TLB
also enables parity checking to be enabled on a per page basis; and pages can be allocated from 16–bit or 32–bit wide memory regions.
On–chip debug support is extended with the inclusion of two Instruction Breakpoint Controllers and one Data Breakpoint Controller. This enables inexpensive debug monitors such as the DebugCore incorporated within MiniMON29K to be used
when developing software. Breakpoints are supported when physical or virtual addressing is in use. The JTAG test interface has also been extended over other 29K
family members to include several new JTAG–processed instructions. The effectiveness of the JTAG interface for hardware and software debugging is improved.
The Am29040 family grouping is implemented with a silicon process which enables processors to operate at 3.3–volts. However, the device is tolerant of 5–volt input/output signal levels. The lower power consumption achievable at 3.3–volts
makes the Am29040 suitable for hand–held type applications. Note, the device operates at a maximum clock frequency of 50 MHz.
A 29K processor enters Wait Mode when the Wait Mode bit is set in the Current
Processor Status (CPS) register. Wait Mode is extended to include a Snooze Mode
which is entered from Wait Mode while the interrupt and trap input lines are held inactive. An interrupt is normally used to depart Wait or Snooze Mode. While in
Snooze mode, Am29040 power consumption is reduced. Returning from Snooze
mode to an interrupt processing state requires approximately 256 cycles. The processor can be prevented from entering Snooze Mode while in Wait Mode by holding, for
example, the INTR3 input pin active and setting the interrupt mask such as to disable
the INTR3 interrupt.
If the input clock is held high or low while the processor is in Snooze mode,
Sleep Mode is entered. Minimum power consumption occurs in this mode. The processor returns to Snooze Mode when the input clock is restarted. Using Snooze and
Sleep modes enables the Am29040 processor to be used in applications which are
very power sensitive.
1.6.1 Am29040 Evaluation.
Like any 29K processor, the Am29040 can be evaluated using the Architectural
Simulator. But for those who wish for real hardware, AMD manufactures a number
of evaluation boards. The most popular is the SE29040 evaluation board. The
board, originally constructed in rev–A form, supports 4M bytes of DRAM (expandable to 64M bytes); DRAM timing is 3/1, i.e. 3–cycle first access then 1–cycle burst.
There is also 1M byte of 32–bit wide ROM and space for 1M byte of 2/1 SRAM.
Boards are typically populated with only 128K of SRAM. The memory system clock
speed is 25 MHz and the maximum processor speed of 50 MHz is supported.
There are connections for JTAG and a logic analyzer as well as two UARTs via
an 85C30 serial communications controller. The board requires a 5–volt power supply and there is a small wire–wrap area for placement of additional system components.
The later rev–B boards have an additional parallel port and Ethernet connection
(10–base–T). An AMD HiLANCE is used for Ethernet communication. The rev–B
board can also support memory system speeds up to 33 MHz.
1.7 A SUPERSCALAR 29K PROCESSOR
AMD representatives have talked at conferences and to the engineering press
about a superscalar 29K processor. No announcements have yet been made about
when such a processor will be available, but it is generally expected to be in the near
future. At the 1994 Microprocessor Forum, AMD presented a product overview, but
much of the specific details about the processor architecture were not announced.
However, piecing together available information, it is possible to form ideas about
what a superscalar 29K would look like.
This section does not describe a specific processor, but presents the superscalar
techniques which are likely to be utilized. A lead architect of the 29K family, Mike
Johnson, has written a text book, “Superscalar Microprocessor Design” [Johnson 1991], which covers the technology in depth. It might be expected that many of
the conclusions drawn in Johnson’s book will appear in silicon in a future 29K processor.
AMD has stated that future microprocessors will be pin compatible with the current 2–bus family members. This indicates that a superscalar 29K will be pin compatible with the Am29030 and Am29040 processors. It is much more likely that the processor will take 2–bus form rather than a microcontroller. User mode instruction
compatibility can also be expected. Given the usual performance increments that accompany a new processor’s introduction, it will likely sustain two–times the performance of an Am29040 processor. This may be an underestimate, as higher clock rates
or increased use of Scalable Clocking may allow for even higher performance. The
processor is certain to have considerable on–chip instruction and data cache. AMD’s
product overview indicates that 2x, 3x and 4x Scalable Clocking will be supported
and there will be an 8K byte instruction cache and an 8K byte data cache. Also reported was an internal clock speed up to 100 MHz at 3.3–volts.
A superscalar processor achieves higher performance than a conventional scalar processor by executing more than one instruction per cycle. To achieve this it must
have multiple function units which can operate in parallel. AMD has indicated that
the initial superscalar 29K processor will have six function units. And since about
50% of instructions perform integer operations, there will be two integer operation
units, one integer multiplier and one funnel shifter. If a future processor supports
floating–point operations directly, we can expect to see a floating–point execution
unit added. Other execution units are included to deal with off–chip access via load
and store instructions; and to deal with branch instruction execution. All six function
units, except the integer multiplier, produce their results in a single–cycle.
High speed operation can only be obtained if as many as possible of the function
units can be kept productively busy during the same processor cycles. This will place
a heavy demand on instruction decoding and operand forwarding. Several instructions will have to be decoded in the same cycle and forwarded to the appropriate
execution unit. The demand for operands for these instructions shall be considerably
higher than must be dealt with by a scalar processor. The following sections describe
some of the difficulties encountered when attempting to execute more than one
instruction per cycle. Architectural techniques which overcome the inherent difficulties are presented.
1.7.1 Instruction Issue and Data Dependency
The term instruction issue refers to the passing of an instruction from the processor decode stage to an execution unit. With a scalar processor, instructions are issued in–order. By that I mean in the order the decoder received the instructions from
cache or off–chip memory. Instructions naturally complete in–order. However with a
RISC processor out–of–order completion is not unusual for certain instructions. Typically load and store instructions are allowed to execute in parallel with other instructions. These instructions are issued in–order; they don’t complete immediately but
some time (a few cycles) later. The instructions following loads or stores are issued
and execute in parallel unless there are data dependencies. Dependencies arise when, for example, a load instruction is followed by an operation on the loaded data.
A superscalar processor can reduce total execution time for a code sequence if it
allows all instruction types to complete out–of–order. Instruction issue need not stop
after an instruction is issued to a function unit which takes multiple cycles to complete. Consequently, function units with long latency may complete their operation
after a subsequent instruction that was issued to a low latency function unit has completed. The Am29050
processor allows long latency floating–point operations to execute in parallel with
other integer operations. The processor has an additional port on its register file for
writing–back the results of floating–point operations. An additional port is required
to avoid the contention which would arise with an integer operation writing back its
result at the same time. Most instructions are issued to an integer unit which, with a
RISC processor, has only one cycle latency. However, there is very likely to be more
than one integer unit, each operating in parallel.
Write–Read Dependency
Even if a processor is able to support out–of–order instruction completion, it
still must deal with the data dependencies that flow through a program’s execution.
These flow dependencies (often known as true dependencies) represent the movement of operands between instructions in a program. Examine the code below:
    mul     gr96,lr2,lr5        ;write gr96, gr96 = lr2 * lr5
    add     gr97,gr96,1         ;read gr96, write–read dependency

The first instruction would be issued to the integer multiply unit; this will have (according to AMD’s product overview) two cycles of latency. The result is written to register gr96. The second instruction would be issued to a different integer handling unit. However, it has a source operand supplied in gr96. If the second instruction had no data dependencies on the first, it would be easy to issue the instruction while the first was still in execute. However, execution of the first instruction must complete before the second instruction can start execution. Steps must be taken to deal with the data dependency. This kind of dependency is also known as a write–read dependency, because gr96 must be written by an earlier instruction before a later one can read the result.
Some superscalar processors, such as the Intel i960 CA, use a
reduced–scoreboarding mechanism to resolve data dependencies [Thornton 1970].
When a register is required for a result, a one–bit flag is set to indicate the register is in
use. Currently in–execute instructions set the scoreboard bit for their result registers.
Before an instruction is issued the scoreboard bit is examined. Further instructions
are not issued if the scoreboard indicates that an in–execute instruction intends to
write a register which supplies a source operand for the instruction waiting for issue.
When an instruction completes, the relevant scoreboard bit is cleared. This may
result in a currently stalled instruction being issued.
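As a rough illustration of the scoreboarding mechanism just described, the C sketch below models a one–bit scoreboard in software. It is only a sketch: the register–file size, the structure layout, and the helper names are invented for the example and are not taken from any 29K or i960 implementation.

#include <stdbool.h>

#define NUM_REGS 256                  /* size of the architectural register file (assumed) */

static bool scoreboard[NUM_REGS];     /* true: an in-execute instruction will write this register */

typedef struct {
    int dest;                         /* destination register number */
    int src_a, src_b;                 /* source register numbers     */
} instr_t;

/* Issue is allowed only if no in-execute instruction still intends to write
 * one of the sources (the write-read case) or the destination (the
 * write-write case discussed in the next section).                         */
static bool can_issue(const instr_t *i)
{
    return !scoreboard[i->src_a] &&
           !scoreboard[i->src_b] &&
           !scoreboard[i->dest];
}

static void on_issue(const instr_t *i)    { scoreboard[i->dest] = true;  }   /* mark pending write        */
static void on_complete(const instr_t *i) { scoreboard[i->dest] = false; }   /* allow dependants to issue */

Clearing the bit when the instruction completes is what allows a currently stalled dependent instruction to be issued on a later cycle.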
It is unlikely a 29K processor will use scoreboarding; and even less likely it will
use a reduced–scoreboarding mechanism, such as the i960 CA, which only detects
data dependency for out–of–order instruction completion. A superscalar 29K
processor will support out–of–order instruction issue, which is described shortly.
Scoreboarding can resolve the resulting data dependencies. However, other
techniques, such as register renaming, enable instructions to be decoded and issued
further ahead than is possible with scoreboarding. This will be described in more
detail as we proceed.
Write–Write Dependency
A second type of data dependency can complicate out–of–order instruction
completion. Examine the code sequence shown below:
        mul     gr96,lr2,lr5     ;write gr96, gr96 = lr2 * lr5
        add     gr97,gr96,1
        add     gr96,lr5,1       ;write gr96, write–write dependency
The result of the third instruction has an output dependency on the first
instruction. The third instruction can not complete before the first. Both instructions
write their results to register gr96, and completing the first instruction last would
result in an out–of–date value being held in gr96. Steps must be taken to deal with the
data dependency. Because the completion of multiple instructions is dependent on
writing gr96 with the correct value, this kind of dependence is also known as a
write–write dependence.
Scoreboarding or reduced–scoreboarding can also resolve write–write
dependencies. Before an instruction is issued, the scoreboard bit for the result register
is tested. If there is a currently in–execute instruction planning on writing to the same
result register, the scoreboard bit will be set. This information can be used to stall
issuing until the result register is available.
The parallel execution possible with out–of–order completion enables higher
performance than in–order completion, but extra logic is required to deal with data
dependency checking. With in–order instruction issue, instructions can no longer be
issued when a dependency is detected. If instruction issue is to continue when data
dependencies are present, the processor architecture becomes yet more complicated;
but the performance reward is extended beyond that of out–of–order completion
with in–order issue.
Read–Write Dependency
Instruction issuing can continue even when the write–read and write–write
dependencies described above are present. The preceding discussion on data
dependency was restricted to in–order instruction issue. Certainly, when a data
dependency is detected, the unfortunate instruction can not be issued; but this need
not mean that future instructions can not be issued. Of course the future instruction
must be free of any dependencies. With out–of–order instruction issue, instructions
are decoded and placed in an instruction window. Instructions can be issued from the
window when they are free of dependencies and there is an available function unit.
The processes of decoding and executing an instruction are separated by the
instruction window, see Figure 1-5. This does not add an additional pipeline stage to
the superscalar processor. The decoder places instructions into the window. When an
instruction is free of dependencies it can be issued from the window to a function unit
for execution. The instruction window could be implemented as a large buffer within the
instruction decode unit, but this leads to a complex architecture. When an instruction
is issued, the op–code and operands must be communicated to the function unit.
When multiple instructions are issued in a single cycle, a heavy demand is placed on
system busses and register file access ports. An alternative window implementation
is to hold instructions at the function units in reservation stations. This way
instructions are sent during decode to the appropriate function unit along with any
available operands. They are issued from the reservation station (really the window)
when any remaining dependencies are resolved and the function unit is available for
execution. The operation of reservation stations is described in more detail in section
1.7.2.
Figure 1-5. The Instruction Window for Out–of–Order Instruction Issue
(Decoded instructions pass from the Instruction Decode stage into the instruction window, from which they are issued for Instruction Execute.)
An instruction is issued from the window when its operands are available for
execution. Future instructions may be issued ahead of earlier instructions which
become blocked due to data dependencies. Executing instructions out–of–order
introduces a new form of data dependency not encountered with in–order instruction
issue. Examine the code sequence below:
        mul     gr96,lr2,lr5     ;gr96 = lr2 * lr5
        add     gr97,gr96,1      ;read gr96
        add     gr96,lr5,1       ;write gr96, read–write dependency
The third instruction in the sequence uses gr96 for its result. The second
instruction receives an operand in the same gr96 register. The third instruction can
not complete and write its result until the second instruction begins execution;
otherwise the second will receive the wrong operand. The result of the third
instruction has an antidependency on the operand to the second instruction. The
dependency is very much like an in–order issue dependency but reversed. This kind
of dependency is also known as a read–write dependence, because gr96 must be read by
the second instruction before the third can write its result to gr96.
Registers are used to hold data values. The flow of data through a program is
represented by the registers accessed by instructions. When instructions execute
out–of–order, the flow of data between instructions is restricted by the reuse of
registers to hold different data values. In the above example we want to issue the third
instruction but its use of gr96 creates a problem. The second instruction is receiving,
via gr96, a data value produced by the first instruction. The register label gr96 is
merely used as an identifier for the data flow. What is intended is that data be passed
from the first instruction to the second. If our intentions could be communicated
without restricting data passing to gr96, then the third instruction could be executed
before the second.
The problem can be overcome by using register renaming, see section 1.7.3.
Briefly, when the first instruction in the above example is issued, it writes its result to
a temporary register identified by the name gr96. The second instruction receives its
operand from the same temporary register used by the first instruction. Execution of
the third instruction need not be stalled if it writes its result to a different copy of
register gr96. So now there are multiple copies of gr96. What really happens is
temporary registers are renamed to be gr96 for the duration of the data flow. These
temporary registers play the role of registers indicated by the instruction sequence.
They are tagged to indicate the register they are duplicating.
1.7.2 Reservation Stations
Each function unit has a number of reservation stations which hold instructions
and operands waiting for execution, see Figure 1-6. All the reservation stations for
each function unit combined represent the instruction window from which
instructions are issued. The decoder places instructions into reservation stations
[Tomasulo 1967] with copies of operands, when available. Otherwise operand values
are replaced with tags indicating the register supplying the missing data. Placing a
copy of a source operand into the reservation station when an instruction is decoded
prevents the operand from being updated by a future instruction, and hence eliminates
antidependency conflicts. A function unit issues instructions to its execute stage when
it is not busy and a reservation station has an instruction ready for execution. Once an
instruction is placed in a reservation station, its issue occurs regardless of any
instruction issue occurring in another function unit. There can be any number of
reservation stations attached to a function unit. The greater the number, the larger the
instruction window; and the further ahead the processor can decode and issue
instructions. Additionally, a greater number of reservation stations prevents short–term
demands on a function unit from stalling the decoder.
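The C sketch below shows, in simplified form, the kind of information a reservation station entry might hold: an op–code, a destination tag, and, for each source operand, either a value or a tag naming the pending result that will supply it. The field names and types are assumptions made for the example, not the actual hardware format.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     ready;        /* true: value holds the operand                        */
    uint32_t value;        /* operand value, once available                        */
    int      tag;          /* reorder-buffer tag of the producing instruction,
                              while ready is false                                 */
} operand_t;

typedef struct {
    bool      busy;        /* the station holds an instruction waiting to issue    */
    int       opcode;      /* operation for the function unit to perform           */
    int       dest_tag;    /* reorder-buffer entry assigned to receive the result  */
    operand_t a, b;        /* source operands, held as values or as tags           */
} reservation_station_t;

/* An instruction may issue from its station once both operands hold
 * values rather than tags and the function unit is free.              */
static bool station_ready(const reservation_station_t *rs)
{
    return rs->busy && rs->a.ready && rs->b.ready;
}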
An instruction may be stalled in a reservation station when a data dependency
causes a tag, rather than data, to be placed in the operand field. The necessary data
will become available when some other instruction completes and its result is made
available. The instruction producing the required data value may be in a reservation
station or in execution in the same function unit, or in another function unit. Result
values are tagged indicating the register they should be placed in. With a scalar
processor, the result is always written to the instruction’s destination register. But
when register renaming is used by a superscalar processor, results are written to a
register which is temporarily playing the role of the destination register. These
temporary registers, known as copy or duplicate registers, are tagged to indicate the
real register they are duplicating.
Figure 1-6. A Function Unit with Reservation Stations
(Each reservation station holds an OP–code, a destination or reorder–buffer register tag, and, for source operands A and B, either a tag or the operand value; results from the execution unit, and from other function units with their tag information, are fed back to waiting stations.)
When a function unit completes an instruction, it places the result along with the
tag information identifying the result register on a result bus. If several function units
complete in the same cycle, there can be competition for the limited number of result
busses. Other function units monitor the result bus (or busses). Their intention is to
obtain the missing operands for instructions held in reservation stations. When they
observe a data value tagged with a register number matching a missing operand, they
copy the data into the reservation station’s operand field. This may enable the
instruction to be issued.
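Continuing the sketch above (and with the same caveat that this is only an illustration), result–bus monitoring might be expressed as follows: when a result and its tag appear on a result bus, every waiting station compares the tag against its missing operands and copies the value in on a match.

/* Broadcast a completed result, identified by its reorder-buffer tag, to a
 * group of reservation stations; any station waiting on that tag captures
 * the value, which may make its instruction ready for issue.               */
static void capture_result(reservation_station_t *stations, int n,
                           int result_tag, uint32_t result_value)
{
    for (int i = 0; i < n; i++) {
        reservation_station_t *rs = &stations[i];
        if (!rs->busy)
            continue;
        if (!rs->a.ready && rs->a.tag == result_tag) {
            rs->a.value = result_value;
            rs->a.ready = true;
        }
        if (!rs->b.ready && rs->b.tag == result_tag) {
            rs->b.value = result_value;
            rs->b.ready = true;
        }
    }
}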
Once an instruction is placed into a reservation station it will execute in
sequence with other instructions held in other reservation stations within the same
function unit. Of course exceptional events, or the placing of instructions into the
instruction window which represent over speculation, can divert the planned
execution. The instruction window supports speculative instruction decoding. It is
possible that a branch instruction can result in unsuccessful speculation; and the
window must be refilled with instructions fetched from a new instruction sequence.
If a superscalar processor’s performance is to be kept high, it is important that
speculation be successful. For this to be accomplished, branch prediction techniques
must be employed; more on this is in section 1.7.4.
1.7.3 Register Renaming
As briefly described in the previous section dealing with read–write
dependency (antidependency), register renaming can help deal with the conflicts
which arise from the reuse of the same register to hold data values. Of course these
dependencies only arise from the out–of–order instruction issue which occurs with a
superscalar processor. Also described were write–write (output) dependencies,
which occur with even in–order instruction issue when more than one instruction
wishes to write the same result register. Both these types of dependency can be
grouped under the heading storage conflicts. Their interference with concurrent
instruction execution is only temporary. Duplication of the result register for the
duration of the conflict can resolve the dependency and enable superscalar
instruction execution to continue.
The temporary result registers are allocated from a reorder buffer which
consists of 10 registers and supporting tag information. Every new result value is
allocated a new copy of the original assignment register. Copies are tagged to enable
them to be used as source operands in future instructions. Register renaming is shown
for the example code sequence below.
;original code
        mul     gr96,lr2,lr5
        add     gr97,gr96,1
        add     gr96,lr5,1

;code after register renaming
        mul     RR1,lr2,lr5      ;gr96 = lr2 * lr5
        add     RR2,RR1,1
        add     RR3,lr5,1
The write–write dependency between the first and third instruction is resolved
by renaming register gr96 to be register RR3 in the third instruction. Renaming
gr96 to RR3 in the third instruction also resolves the read–write dependency
between the second and third instructions. Using register renaming, execution of the
third instruction need not be stalled due to storage (register) dependency. Figure 1-7
shows the dependencies before and after register renaming.
Let’s look in more detail at the operation of the reorder buffer. When an instruction is decoded and placed in the instruction window (in practice, a reservation station), a register in the reorder buffer is assigned to hold the instruction result.
Figure 1-8 shows the format of information held in the reorder buffer. When the
instruction is issued from the reservation station and, at some later time, completes execution, the result is written to the assigned reorder buffer entry.
If a future instruction refers to the result of a previous instruction, the reorder
buffer is accessed to obtain the necessary value. The reorder buffer is accessed via the
contents of the destination–tag field. This is known as a content–addressable
memory access. A parallel search of the reorder buffer is performed. All memory
locations are simultaneously examined to determine if they have the requested data.
Figure 1-7. Register Dependency Resolved by Register Renaming
(Before renaming, both the mul and the second add write gr96; after renaming, the mul writes RR1 (“gr96”), the first add writes RR2 (“gr97”), and the second add writes RR3 (“gr96”).)
If the instruction producing the result has not yet completed execution, then the
dispatched instruction is provided with a reorder–buffer–tag for the pending data.
For example, the second instruction in the above code sequence would receive
reorder–buffer–tag RR1.
It is likely that the reorder buffer contains entries which are destined (tagged) for
the same result register. When the reorder buffer is accessed with a destination–tag
which has multiple entries, the reorder buffer provides the most recent entry. This
ensures the most recently assigned (according to instruction decode) value is used. In
such a case, the older entry could be discarded; but it is kept in case an exceptional
event, such as an interrupt or trap, occurs.
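The C sketch below models, under simplifying assumptions, the content–addressable lookup just described: the newest in–use entry tagged with the requested register wins, and a miss means the real register file must be read instead. The structure layout is invented for the example; only the 10–entry size comes from the text above.

#include <stdbool.h>
#include <stdint.h>

#define ROB_SIZE 10                   /* the 10-register reorder buffer described above */

typedef struct {
    bool     in_use;
    int      dest_reg;                /* architectural register this entry is renaming  */
    bool     has_value;               /* result has been written back to this entry     */
    uint32_t value;
} rob_entry_t;

typedef struct {
    rob_entry_t entry[ROB_SIZE];
    int head;                         /* oldest entry, the next to retire                */
    int count;                        /* entries currently allocated                     */
} rob_t;

/* Search for the newest entry renaming 'reg'.  Returns the entry index,
 * which serves as the reorder-buffer tag, or -1 if no copy exists and the
 * real register file must supply the operand.                             */
static int rob_lookup(const rob_t *rob, int reg)
{
    for (int i = rob->count - 1; i >= 0; i--) {        /* newest entry first */
        int idx = (rob->head + i) % ROB_SIZE;
        if (rob->entry[idx].in_use && rob->entry[idx].dest_reg == reg)
            return idx;
    }
    return -1;
}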
When an instruction completes, the reorder buffer entry is updated with the result value. A number of result busses are used to forward result values, and their
associated tag information, to the reorder buffer.
Figure 1-8. Circular Reorder Buffer Format
(The buffer holds entries RR0 through RRn, each with a destination tag, a value, and a status field. In the example, RR1, RR2, and RR3 are in use with destination tags gr96, gr97, and gr96 respectively; a lookup for an instruction source operand address of gr96 selects RR3, the newer entry, rather than RR1, the older one; the remaining entries are free.)
Function units monitor the flow of
data along these buses in the hope of acquiring data values required by their reservation stations. In this way, instructions are supplied with the operands which were missing
when they were decoded. When a reorder buffer entry has been updated with a
result, the entry is ready for retiring. This is the term given to writing the result value
into the real register in the register file. There is a bus for this task which connects
read ports on the reorder buffer to write ports on the register file. The number of ports
assigned to this task (2) limits the number of instructions which can be retired in any
one processor cycle. A register file with two write ports supports a maximum of four
instructions being retired during the same cycle; two instructions which modify result registers, one store instruction, and one branch instruction (these last two instruction types do not write to result registers). Figure 1-9 outlines the system layout.
When the reorder buffer becomes full, no further instruction decoding can occur
until entries are made available via instruction retiring. Instructions are retired in the
order they are placed in the reorder buffer. This ensures in–order retiring of
instructions. Should an exceptional event occur during program execution, the state
of instruction retirement specifies the precise position which execution has reached
within the program. Only completed instructions, without exceptions, are retired.
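Retirement can be sketched in the same style (again, an illustration rather than the actual hardware): entries leave the reorder buffer oldest first, at most two per cycle to match the two register–file write ports, and retirement stops at the first entry whose result is not yet available, which is what keeps the architectural state precise.

/* Retire up to 'max_per_cycle' completed entries, oldest first, from the
 * reorder buffer sketched earlier into a register file (a plain array here). */
static void rob_retire(rob_t *rob, uint32_t *reg_file, int max_per_cycle)
{
    int retired = 0;
    while (rob->count > 0 && retired < max_per_cycle) {
        rob_entry_t *e = &rob->entry[rob->head];
        if (!e->has_value)
            break;                              /* oldest entry has not completed yet */
        reg_file[e->dest_reg] = e->value;       /* write back to the real register    */
        e->in_use = false;
        rob->head = (rob->head + 1) % ROB_SIZE;
        rob->count--;
        retired++;
    }
}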
Figure 1-9. Multiple Function Units with a Reorder Buffer
(Instruction memory and the instruction cache feed four instructions at a time to the instruction decoder. The decoder supplies three function units, each with two reservation stations, over the instruction bus and operand buses; results and their tags return over the result and tag buses to the 10–entry reorder buffer, which retires values to the register file over the retirement bus.)
Figure 1-9 shows the operand busses supplying source operands from the
reorder buffer to the reservation stations. However, in some cases, when an
instruction is decoded and the operand register’s number presented to the reorder
buffer, no entry is found. This indicates there is currently no copy of the required
register. Consequently, the real register in the register file must be accessed to obtain
the data. For this reason the register file is provided with read ports (4) which supply
data to the operand bus.
1.7.4 Branch Prediction
Out–of–order instruction issue places a heavy demand on instruction decoding.
If reservation stations are to be kept filled, instruction decode must proceed at a rate
equal to, or greater than, instruction execution. Otherwise, performance will be
limited by the ability to decode instructions. The major obstacle in the way of
achieving efficient decoder operation is branching instructions. Unfortunately,
instruction sequences typically contain only about five or six instructions before a
further branch–type instruction is encountered. Compilers directed to producing
code specifically for superscalar processor execution try to increase this critical
parameter. Additionally, the fact that the target of a branch instruction need not be
aligned on a cache block boundary can further reduce the efficiency of the decoding
process.
The decoder fetches instructions and places them into the instruction window
for issue by a function unit. If an average decode rate of more than two instructions
per cycle is to be achieved, it is likely that a four–instruction decoder (or better) will
be required. In fact, AMD’s product overview indicates a four–instruction decoder is
used. To study this further, first examine the code below. The first target sequence
begins at address label L13. The linker need not align the L13 label at a cache block
boundary –– a cache block size of four instructions will be assumed. The same
alignment issue occurs with the second target sequence beginning at label L14. The
decoder is presented with a complete cache block rather than sequential instructions
from within the block. This requires a 128–bit bus between the instruction cache and
the decode unit. However, this is essential if instructions are to be decoded in parallel.
Figure 1-10 shows a possible cache block assignment, assuming the target of the first
instruction sequence begins in the second entry of the cache block. The target of the
second sequence begins in the third instruction of the block.
L13:
        add     gr98,gr98,10     ;target of a branch, gr98 = gr98 + 10
        sll     gr99,gr99,2
        cpgt    gr97,gr97,gr98
        jmpt    gr97,L14         ;conditional branch to L14
        add     lr4,lr4,gr99     ;branch delay slot, see section 1.13
L15:
        load    0,0,gr97,lr4
        store   0,0,gr97,gr96
        . . .
L14:
        jmp     L16              ;target of branch, unconditional branch to L16
        const   lr10,0           ;branch delay slot, always executed
        . . .
The branch instruction from the first code sequence to label L14 is located in the
second instruction of the block. Assuming two cycles are required to fetch the target
block, the decoder is left with nothing to decode for several cycles. Additionally,
branch alignment has resulted in there being less than four instructions available for
decode during any cycle. The resulting decode rate is 1 instruction per cycle. This
would result in little better than scalar processor performance –– much less than the
desired 2 or more instructions per cycle.
Figure 1-10. Instruction Decode with No Branch Prediction
(With no prediction, the decoder works through the cache blocks one entry per cycle and sits idle for the two–cycle delay that follows the jmpt before the block containing the target can be decoded. Average decode rate = 7/7 = 1 instruction/cycle.)
In Figure 1-10 the target sequence is found in the cache. Of course the cost of the
branch would be much higher if the target instructions had to be fetched from
off–chip memory. Additionally, a two–cycle branch delay is shown. This is typically
defined as the time from decoding the branch instruction till decoding the target
instruction. The actual delay encountered is difficult to estimate, as the target address
is not known until the jump instruction is executed. Figure 1-10 shows the cycle
when the jump instruction is placed in the instruction window. When it will be issued
depends on a number of factors such as register dependency and reservation station
activity. Additionally the result of the jump must be forwarded to the decode unit
before further instruction decode can proceed. In practice, several cycles could
elapse before the decoder obtains the address of the cache block containing the target
instruction.
It is clear from the above discussion that a superscalar processor must take steps
to achieve a higher instruction decode rate. This is likely to involve some form of
branch prediction. The decoder can not wait for the outcome of the branch instruction
to be known before it starts fetching the new instruction stream. It must examine the
instruction currently being decoded, and determine if a branch is present. When a
branching instruction is found, the decoder must predict both whether the branch will be
taken and the target of the branch. This enables instructions to be fetched and
decoded along the predicted path. Of course, unconditional branches also benefit
from early fetching of their target instruction sequence; and they do not require
branch prediction support.
The instruction decode sequence for the previous code example is shown in
Figure 1-11 using branch prediction. Without waiting for the conditional–jump
instruction in the second entry of the cache block to execute, the decoder predicts the
branch will be taken and in the next cycle starts decoding the block containing the
target instruction. This results in a decode rate of 2.33 instructions per cycle. If the
prediction is correct, the decoder should be able to sustain a decode rate which
prevents starving the function units of instructions.
Figure 1-11. Four–Instruction Decoder with Branch Prediction
(The jmpt is predicted taken, so the block containing the target is decoded in the next cycle. Average decode rate = 7/3 = 2.33 instructions/cycle.)
Branch prediction supports speculative instruction fetching. It results in
instructions being placed in the instruction window which may be speculatively
dispatched and executed. If the branch is wrongly predicted, instructions still waiting
in reservation stations must be cancelled. Any wrongly predicted instructions which
reach execution must not be retired. This requires considerable support circuitry. For
this reason scoreboarding is used by some processors to support speculative
instruction fetching. With scoreboarding the decoder sets a scoreboard bit for each
instruction’s destination register. Since there is only one bit indicating there is a
pending update, there can be only one such update per register. Consequently, the
decoder stalls when encountering an instruction required to update a register which
already has a pending update. The scoreboarding mechanism is simpler to implement
than register renaming using a reorder buffer. However, its restrictions limit the
decoder’s ability to speculatively fetch instructions further ahead of actual execution.
This has been shown to result in about 21% poorer performance when a
four–instruction decoder is used [Johnson 1991].
It is certain that a superscalar 29K processor will incorporate a branch
prediction technique. Given that instruction compatibility is to be maintained, it is
likely that a hardware prediction rather than a software prediction method will be
employed. This will require the processor to keep track of previous branch activity.
An algorithm will likely help with selecting the most frequently taken branch paths; for example,
branches to lower addresses are more often taken than not –– the jump at the bottom of a loop.
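The following C sketch shows one way such a scheme might look, combining the backward–taken heuristic with a small table of two–bit counters that records previous branch behavior. This is only an illustration of the general idea; AMD's product overview does not describe the actual prediction hardware, and the table size and indexing below are assumptions.

#include <stdbool.h>
#include <stdint.h>

#define BHT_SIZE 1024                 /* branch-history table entries (assumed size) */

static uint8_t bht[BHT_SIZE];         /* two-bit saturating counters, 0..3 */

/* Predict taken when the counter is in one of the two taken states.  In
 * this sketch a counter value of 0 is treated as untrained, and the static
 * rule applies: backward branches (loops) taken, forward branches not.     */
static bool predict_taken(uint32_t branch_pc, uint32_t target_pc)
{
    uint32_t idx = (branch_pc >> 2) % BHT_SIZE;   /* instructions are word aligned */
    if (bht[idx] == 0)
        return target_pc < branch_pc;             /* static backward-taken rule */
    return bht[idx] >= 2;
}

/* Train the counter once the branch outcome is known. */
static void update_predictor(uint32_t branch_pc, bool taken)
{
    uint32_t idx = (branch_pc >> 2) % BHT_SIZE;
    if (taken && bht[idx] < 3)
        bht[idx]++;
    else if (!taken && bht[idx] > 0)
        bht[idx]--;
}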
1.8  THE Am29200 MICROCONTROLLER
The Am29200 was the first of the 29K family microcontrollers (see
Table 1-3) [AMD 1992b]. To date the Am29205 is the only other microcontroller
added to the family. Being microcontrollers, many of the device pins are assigned I/O
and other dedicated support tasks which reduce system glue logic requirements. For
this reason none of the devices are pin compatible. The system support facilities, included within the Am29200 package, make it ideal for many highly integrated and
low cost systems.
The processor supports a 32–bit address space which is divided into a number of
dedicated regions (see Figure 1-12). This means that ROM, for example, can only be
located in the region preallocated for ROM access. When an address value is generated, the associated control–logic for the region is activated and used to control data
or instruction access for the region.
There is a 32–bit data bus and a separate 24–bit address bus. The rest of the 104
pins used by the device are mainly for I/O and external peripheral control tasks
associated with each of the separate address regions.
By incorporating memory interface logic within the chip, the processor enables
lower system costs and simplified designs. In fact, DRAM devices can be wired directly to the microcontroller without the need for any additional circuitry.
At the core of the microcontroller is an Am29000 processor. The additional I/O
devices and region control mechanisms supported by the chip are operated by programmable registers located in the control register region of memory space. These
control registers are accessible from alternate address locations –– for historical reasons. It is best, and essential if C code is used, to access these registers from the optional word–aligned addresses.
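In C, such a programmable register is normally reached through a volatile pointer to its word–aligned address. The register name and address below are placeholders chosen for illustration, not the documented location of any particular Am29200 control register; the data book gives the real addresses and bit fields.

#include <stdint.h>

/* Placeholder address inside the control-register region. */
#define EXAMPLE_CTRL_REG  (*(volatile uint32_t *)0x80000020u)

static void example_region_setup(void)
{
    uint32_t cfg = EXAMPLE_CTRL_REG;   /* read the current register contents */
    cfg |= 0x1u;                       /* set a hypothetical enable bit      */
    EXAMPLE_CTRL_REG = cfg;            /* write the modified value back      */
}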
Accessing memory or peripherals located in each address region is achieved
with a dedicated region controller. While initializing the control registers for each
region it is possible to specify the access times and, say, the DRAM refresh requirements for memory devices located in the associated region.
Other peripheral devices incorporated in the microcontroller, such as the UART,
are accessed by specific control registers. The inclusion of popular peripheral devices and the associated glue logic for peripheral and memory interfaces within a
single RISC chip enables higher performance at lower costs than existing systems (see Figure 1-13).
Table 1-3. Am2920x Microcontroller Members of 29K Processor Family

Processor                              Am29200               Am29205
Instruction Cache                      –                     –
I–Cache Associativity                  –                     –
Data Cache                             –                     –
D–Cache Associativity                  –                     –
On–Chip Floating–Point                 No                    No
On–Chip MMU                            No                    –
Integer Multiply in h/w                No                    No
Programmable I/O                       16 pins               8 pins
ROM width                              8/16/32 bit           16 bit
DRAM width                             16/32 bit             16 bit
On–Chip Interrupt Controller Inputs    Yes, 14               Yes, 10
Scalable Clocking                      No                    No
Burst–mode Addressing                  Yes, up to 1K bytes   Yes, up to 1K bytes
Freeze Mode Processing                 Yes                   Yes
Delayed Branching                      Yes                   Yes
On–Chip Timer                          Yes                   Yes
On–Chip Memory Controller              Yes                   Yes
DMA Channels                           2                     1
Byte Endian                            Big                   Big
Serial Ports                           1                     1
JTAG Debugging                         Yes                   No
Clock Speeds (MHz)                     16.7, 20              12.5, 16.7
Figure 1-12. Am29200 Microcontroller Address Space Regions

Region Allocation        Address Range
reserved                 0x9600,0000 – 0xffff,ffff
PIA space                0x9000,0000 – 0x9600,0000
control regs.            0x8000,0000 – 0x9000,0000
video–DRAM               0x6000,0000 – 0x8000,0000
virtual–DRAM             0x5000,0000 – 0x6000,0000
DRAM                     0x4000,0000 – 0x5000,0000
ROM                      0x0         – 0x4000,0000
Let’s take a quick look at each of the region controllers and specialized on–chip peripherals in turn.
1.8.1 ROM Region
The first thing to realize is that ROM space is really intended for all types of
nonmultiplexed–address devices, such as ROM and SRAM. Controlling access to
these types of memories is very similar. The region is divided into four banks. Each
bank is individually configurable in width and timing characteristics. A bank can be
associated with 8–bit, 16–bit or 32–bit memory and can contain as much as 16M
bytes of memory (enabling a 64M–byte ROM region).
Bank 0, the first bank, is normally attached to ROM memory as code execution
after processor reset starts at address 0. During reset the BOOTW (boot ROM width)
input pin is tested to determine the width of Bank 0 memory. Initially the memory is
assumed to have 4–cycle access times (three wait states) and no burst–mode. The
SA29200 evaluation board contains an 8–bit EPROM at bank 0 (SA stands for stand–
alone). Other banks may contain, say, 32–bit SRAM with different wait state requirements. It is possible to arrange banks to form a contiguous address range.
Whenever memory in the ROM address range is accessed, the controller for the
region is activated and the required memory chip control signals such as CE (chip
enable), R/W, OE (output enable) and others are generated by the microcontroller.
Thus SRAM and EPROM devices are wired directly to pins on the microcontroller chip.
Figure 1-13. Am29200 Microcontroller Block Diagram
(An Am29000 processor core, with its address, instruction, and data buses, is surrounded by on–chip support blocks: a ROM controller driving external ROM or SRAM memory, a DRAM controller driving external DRAM memory, a PIA, a DMA controller, an interrupt controller, a parallel port, a serial port, a video interface, and an I/O port.)
1.8.2 DRAM Region
In a way similar to the ROM region, there is a dedicated controller for DRAM
devices which are restricted to being located in the DRAM address region. Once
again the region is divided into four banks which may each contain as much as 16M
bytes of off–chip memory. The DRAM region controller supports 16–bit or 32–bit
wide memory banks which may be arranged to appear as contiguous in address
range.
DRAM, unlike ROM, is always assumed to have 4–cycle access times. However, if page–mode DRAM is used it is possible to achieve 2–cycle rather than 4–cycle
burst–mode accesses. Burst–mode is used when consecutive memory addresses are
being accessed, such as during instruction fetching between program branches. The
DRAM memory is often referred to as 3/2 rather than 4/2. The four cycles consist of
1–cycle of precharge and 3 cycles of latency; under certain circumstances the 1–cycle of
precharge can be hidden. This is explained in section 1.14.1 under the Am29200 and
Am29205 subheading.
The control register associated with each DRAM bank maintains a field for
DRAM refresh support. This field indicates the number of processor cycles between
DRAM refresh. If refresh is not disabled, “CAS before RAS” cycles are performed
when required. Refresh is overlapped in the background with non–DRAM access
when possible.
If a DRAM bank contains video–DRAM rather than conventional DRAM, then
it is possible to perform data transfer to the VDRAM shift register via accesses in the
VDRAM address range. The VDRAM is aliased over the DRAM region. Accessing
the memory as VDRAM only changes the timing of memory control signals so as
to indicate a video shift register transfer is to take place rather than a CPU memory
access.
1.8.3 Virtual DRAM Region
A 16–Mbyte (24 address bit) virtual address space is supported via four mapping registers. The virtually addressed memory is divided into 64K byte (16 address
bits) memory pages which are mapped into physical DRAM. Each mapping register
has two 8–bit fields specifying the upper address bits of the mapped memory pages.
When memory is accessed in the virtual address space range, and one of the four
mapping registers contains a match for the virtually addressed page being accessed,
then the access is redirected to the physical DRAM page indicated by the mapping
register.
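The translation just described can be sketched in C as follows. The field layout of the mapping registers is an assumption made for the example; only the 24–bit virtual region, the 64K–byte page size, and the four mapping registers come from the description above.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool    valid;
    uint8_t virt_page;    /* upper 8 bits of the 24-bit virtual address       */
    uint8_t phys_page;    /* upper bits of the physical DRAM page it maps to  */
} vdram_map_t;

static vdram_map_t vdram_map[4];      /* the four mapping registers */

/* Translate a 24-bit offset within the virtual-DRAM region.  Returns true
 * and fills *phys_offset on a hit; on real hardware a miss raises a trap so
 * that software can load one of the mapping registers with a valid mapping. */
static bool vdram_translate(uint32_t virt_offset, uint32_t *phys_offset)
{
    uint8_t  page   = (uint8_t)(virt_offset >> 16);   /* 64K-byte pages */
    uint32_t offset = virt_offset & 0xFFFFu;

    for (int i = 0; i < 4; i++) {
        if (vdram_map[i].valid && vdram_map[i].virt_page == page) {
            *phys_offset = ((uint32_t)vdram_map[i].phys_page << 16) | offset;
            return true;
        }
    }
    return false;   /* no valid mapping: trap to memory-management software */
}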
When no mapping register contains a currently valid address translation for the
required virtual address, a processor trap occurs. In this case memory management
support software normally updates one of the mapping registers with a valid mapping
and normal program execution is restarted.
Only DRAM can be mapped into the virtual address space. The address region
supports functions such as image compression and decompression that yield lower
overall memory requirements and, thus, lower system costs. Images can be stored in
virtually addressed space in a compressed form, and only uncompressed into physically accessed memory when required for image manipulation or output video imaging.
1.8.4 PIA Region
The Peripheral Interface Adapter (PIA) region is divided into six banks, each of
24–bit address space. Each bank can be directly attached to a peripheral device. The
control registers associated with the region give extra flexibility in specifying the
timing for signal pins connecting the microcontroller and PIA peripherals. The PIA
device–enable and control signals are again provided on–chip rather than in external
support circuitry.
When external DMA is utilized, transfer of data is always between DRAM or
ROM space and PIA space. More on DMA follows.
1.8.5 DMA Controller
When an off–chip device wishes to gain access to the microcontroller DRAM, it
makes use of the Direct Memory Access (DMA) Controller. On–chip peripherals can
also perform DMA transfers; this is referred to as internal DMA. DMA is initiated by
an external or internally generated peripheral DMA request.
The only internal peripherals which can generate DMA requests are the parallel
port, the serial port and the video interface. These three devices are described shortly.
There are two external DMA request pins, one for each of the two on–chip DMA control units. Internal peripherals have a control register field which specifies which
DMA controller their DMA request relates to.
The DMA controllers must be initialized by software before data transfer from,
or to, DRAM takes place. The associated control registers specify the DRAM start
address and the number of transfers to take place. Once the DMA control registers
have been prepared, a DMA transfer will commence immediately upon request without
any further CPU intervention. Once the DMA transfer is complete, the DMA controller may generate an interrupt. The processor may then refresh the DMA control
unit parameters for the next expected DMA transfer.
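In outline, preparing a DMA channel might look like the C fragment below. The register names and addresses are placeholders invented for the example; the real Am29200 DMA control–register layout and addresses are given in the data book.

#include <stdint.h>

/* Placeholder control-register addresses for one DMA channel. */
#define DMA0_ADDRESS  (*(volatile uint32_t *)0x80000100u)   /* DRAM start address  */
#define DMA0_COUNT    (*(volatile uint32_t *)0x80000104u)   /* number of transfers */
#define DMA0_CONTROL  (*(volatile uint32_t *)0x80000108u)   /* enable, direction... */

static void dma0_prepare(uint32_t dram_addr, uint32_t transfers)
{
    DMA0_ADDRESS = dram_addr;     /* where in DRAM the transfer starts               */
    DMA0_COUNT   = transfers;     /* how many transfers before completion            */
    DMA0_CONTROL = 0x1u;          /* hypothetical enable bit: now wait for a request */
}

Once the registers are prepared, the transfer proceeds on the next DMA request without further CPU intervention, and an end–of–transfer interrupt (if enabled) tells software to reload the parameters.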
One of the DMA control units has the special feature of having a duplicate set of
DMA parameter registers. At the end of a DMA transfer, when the primary set of
DMA parameter registers has been exhausted, the duplicate set is immediately copied into the primary set. This means the DMA unit is instantly refreshed and prepared for a further DMA request. Ordinarily the DMA unit is not ready for further use
until the support software has executed, usually via an end of DMA interrupt request.
Just such an interrupt may be generated but it will now be concerned with preparing
parameters for the duplicate control registers for the one–after–next DMA request.
This DMA queue technique is very useful when DMA transfers are occurring to the
video controller. In such a case DMA can not be postponed, as video imaging requirements mean data must be available if image distortion is to be avoided.
External DMA can only occur between DRAM or ROM space and two of the six
PIA address space banks. DMA only supports an 8–bit address field within a PIA address bank.
One further note on DMA: the microcontroller does support an external DMA
controller, enabling random access by the external DMA device to DRAM and
ROM. The external DMA unit must activate the associated control pins and place the
address on the microcontroller address bus. In conjunction with the microcontroller,
the external DMA unit must complete the single 32–bit data access.
1.8.6 16–bit I/O Port
The I/O port supports bit programmable access to 16 input or output pins. These
pins can also be used to generate level–sensitive or edge–sensitive interrupts. When
used as outputs, they can be actively driven or used in open collector mode.
1.8.7 Parallel Port
The parallel port is intended for connecting the microcontroller chip to a host
processor, where the controller acts as an intelligent high performance control unit.
Data can be transferred in both directions, either via software controlled 8–bit or
32–bit data words, or via DMA unit control. Once again the associated control registers give the programmer flexibility in specifying the timing requirements for connecting the parallel port directly to the host processor.
1.8.8 Serial Port
The on–chip serial port supports high speed full duplex, bi–directional data
transfer using the RS–232 protocol. The serial port can be used in a polled or interrupt–driven mode. Alternatively, it may request DMA access. The lightweight interrupt structure of the Am29000 processor core, coupled with the smart on–chip peripherals, presents the software engineer with a wide range of options for controlling
the serial port.
1.8.9 I/O Video Interface
The video interface provides direct connection to a number of laser–beam
marking engines. It may also be used to receive data from a raster input device such as
a scanner, or to serialize/deserialize a data stream. With external circuitry
support, it is possible to generate a noninterleaved composite TV video signal.
The video shift register clock must be supplied on an asynchronous input pin,
which may be tied to the processor clock. (Note, a video image is built by serially
clocking the data in the shift register out to the imaging hardware. When the shift register is empty it must be quickly refilled before the next shift clock occurs.) The
imaged page may be synchronized to an external page–sync signal. Horizontal and
vertical image margins as well as image scan rates are all programmable via the now
familiar on–chip control register method.
The video shift registers are duplicated, much like some of the DMA control
registers. This reduces the need for rapid software response to maintain video shift
register update. When building an image, the shift register is updated from the duplicate support register. Software, possibly activated via a video–register–empty interrupt, must fill the duplicate shift register before it becomes used–up. Alternatively,
the video data register can be maintained by the DMA controller without the need for
frequent CPU intervention.
1.8.10 The SA29200 Evaluation Board
The SA29200 is an inexpensive software development board utilizing the
Am29200 microcontroller. Only a 5v supply and a serial cable connection to a host
computer are required to enable board operation. Included on the board is an 8–bit
wide EPROM (128Kx8) which contains the MiniMON29K debug monitor and the
OS–boot operating system. There is also 1M byte of 32–bit DRAM (256Kx32) into
which programs can be loaded via the on–chip UART. The processor clock rate is 16
MHz and the DRAM operates with 4–cycle initial access and 2–cycle subsequent
burst accesses. So, although the performance is good, it is not as high as other members of the 29K family.
The SA29200 board measures about 3 by 3.5 inches (9x10 cm) and has connections along both sides which enable attachment to an optional hardware prototyping
board (see following section). This extension board has additional I/O interface devices and a small wire–wrap area for inclusion of application specific hardware.
1.8.11 The Prototype Board
The prototyping board is inexpensive because it contains mainly sockets, which
can support additional memory devices, and a predrilled wire–wrap area. The RISC
microcontroller signals are made available on the prototyping board pins. Some of
these signals are routed to the empty memory sockets so as to enable simple memory
expansions for 8–bit, 16–bit or 32–bit EPROM or SRAM. There is also space for up
to 16M bytes of 32–bit DRAM.
Using the wire–wrap area the microcontroller I/O signals can be connected to
devices supporting specific application tasks, such as A/D conversion or peripheral
control. This makes the board ideal for a student project. Additionally, the access
times for memory devices are programmable, thus enabling the effects of memory
performance on overall system operation to be evaluated.
1.8.12 Am29200 Evaluation
The combination of the GNU tool chain, the low cost SA29200 evaluation
board, and the associated prototyping board makes available an evaluation environment
for the industry’s leading embedded RISC. The cost of getting started with embedded
RISC is very low and additional high performance products can be selectively purchased from specialized tool builders. The evaluation package should be of particular interest to university undergraduate and post–graduate courses studying RISC.
1.8.13 The Am29205 Microcontroller
The Am29205 is a microcontroller member of the 29K family (see Table 1-3). It
is functionally very similar to the Am29200 microcontroller. It differs as a result of
reduced system interface specifications. This reduction enables a lower device pin–
count and packaging cost. The Am29205 is available in a 100–lead Plastic Quad Flat
Pack (PQFP) package. It is suitable for use in price sensitive systems which can operate with the somewhat reduced on–chip support circuitry.
The reduction in pin count results in a 16–bit data/instruction bus. The processor
generates two consecutive memory requests to access instructions and data larger
than 16–bits. The memory system interface has also been simplified in other ways.
Only 16–bit transfers to memory are provided for; no 8–bit ROM banks are supported. The parallel port, DMA controller, and PIA also now support transfers limited to the 16–bit data width.
Generally, the service support resources have all been reduced: the programmable Input/Output pins (now 8, versus 16 for the Am29200 processor); the serial communication handshake
signals DTR and DSR; the DMA request signals; the interrupt request pins; and the number of decoded PIA and memory banks. The signal pins supporting video–DRAM and burst–mode ROM access have also been omitted. These omissions
do not greatly restrict the suitability of the Am29205 microcontroller for many projects. The need to make two memory accesses to fetch instructions, which are not supported by an on–chip cache memory, will result in reduced performance. However,
many embedded systems do not require the full speed performance of a 32–bit RISC
processor.
AMD provides a low cost evaluation board known as the SA29205. The board is
standalone and very like the SA29200 evaluation board; in fact, it will fit with the
same prototype expansion board used by the SA29200. It is provided with a 256k
byte EPROM, organized as 128kx16 bits. The EPROM memory is socket upgradable
to 1M byte. There is 512K byte of 16–bit wide DRAM. For debugging purposes, it
can use the MiniMON29K debug monitor utilizing the on–chip serial port.
1.9  THE Am29240 MICROCONTROLLER
The Am29240 is a follow–on to the Am29200 microcontroller (see Table 1-4).
It was first introduced in 1993. The Am29240 is a member of the Am2924x family
grouping which offers increased on–chip support and greater processing power. In
terms of peripherals, the Am29240 has two serial ports instead of the Am29200’s one.
It also has 4 DMA controllers instead of two.
Unlike the Am29200, all of the Am29240 DMA channels support queued data
transfer. Additionally, fly–by DMA transfers are optionally supported. Normal
DMA transfers require a read stage followed by a write stage. The data being transferred is temporarily held in an on–chip buffer after being read. With fly–by DMA the
read and write stages occur at the same time. This results in a faster DMA transfer.
However, the device being accessed must be able to transfer data at the maximum
DRAM access rate.
The Am2924x family grouping, unlike the Am2920x grouping, supports virtual
memory addressing. The Translation Look–Aside Buffer (TLB) used to construct an
MMU scheme supports larger page sizes than the Am29000 processor. The page size
can be up to 16M bytes. The large page size enables extensive memory regions to be
mapped with only a few TLB mapping entries. For this reason only 16 TLB entries
are provided (8 sets, two entries per set). A consequence of the relatively large page
size is pages can not be individually protected against Supervisor mode reads and
execution –– this is possible with the smaller pages used by the Am29000 processor
(see section 6.2.1). This loss is outweighed by the benefits of the larger page size
Table 1-4. Am2924x Microcontroller Members of 29K Processor Family

Processor                              Am29240               Am29243                  Am29245
Instruction Cache                      4K bytes              4K bytes                 4K bytes
I–Cache Associativity                  2–Way                 2–Way                    2–Way
Data Cache (Physical)                  2K bytes              2K bytes                 –
D–Cache Associativity                  2–Way                 2–Way                    –
On–Chip Floating–Point                 No                    No                       No
On–Chip MMU                            Yes                   Yes                      Yes
Integer Multiply in h/w                Yes, 1–cycle          Yes, 1–cycle             No
Programmable I/O                       16 pins               8 pins                   8 pins
ROM width                              8/16/32 bit           16/32 bit                8/16/32 bit
DRAM width                             16/32 bit             8/16/32 bit (parity)     16/32 bit
On–Chip Interrupt Controller Inputs    Yes, 14               Yes, 14                  Yes, 14
Scalable Clocking                      1x, 2x                1x, 2x                   No
Burst–mode Addressing                  Yes, up to 1K bytes   Yes, up to 1K bytes      Yes, up to 1K bytes
Freeze Mode Processing                 Yes                   Yes                      Yes
Delayed Branching                      Yes                   Yes                      Yes
On–Chip Timer                          Yes                   Yes                      Yes
On–Chip Memory Controller              Yes                   Yes                      Yes
DMA Channels                           4                     4                        2
Byte Endian                            Big                   Big                      Big
Serial Ports                           2                     2                        1
JTAG Debugging                         Yes                   Yes                      Yes
Clock Speeds (MHz)                     0–20, 25, 33          0–20, 25, 33             0–16
which achieves virtual memory addressing with little TLB reload activity and with
only a small amount of chip area being required.
Increased performance is achieved by the inclusion of separate 4k byte instruction and 2k byte data caches. As with all 29K instruction caches, address tags are
based on virtual addresses when address translation is turned on. The first processor
in the 29K Family to have a conventional instruction cache was the Am29030. The
Am29240 cache is similar in operation to the Am29030’s cache. However, the
Am29240 processor has four valid bits per cache entry (four instructions) in place of
the previous one bit. This offers a performance advantage as cache blocks need only
be partially filled and need not be fetched according to block boundaries (more on
this in section 5.13.5).
The data cache always operates with physical addresses. The block size is 16
bytes and there is one valid bit per block. This means that complete data blocks must
be fetched when data cache reload occurs. A “write–through” policy is supported by
the cache which ensures that external memory is always consistent with cache contents. Cache blocks are only allocated for data loaded from DRAM or ROM address
regions. Access to other address regions is not cached. A two word write–through
buffer is used to assist with writes to memory. It enables multiple store instructions to
be in–execution without the processor pipeline stalling. Data accesses which hit in
the cache require 1–cycle access times. The data cache operation is explained in detail in section 5.14.
Scalable bus clocking is supported; enabling the processor to run at twice the
speed of the off–chip memory system. Scalable Clocking was first introduced with
the Am29030 processors, and is described in the previous section describing the
Am29030. If cache hit rates are sufficiently high, Scalable Clocking enables high
performance systems to be built around relatively slow memory systems. It also offers an excellent upgrade path when additional performance is required in the future.
Initially the ROM memory region is assumed to have four cycle access times
(three wait states) and no burst–mode –– same as Am29200. The four banks within
the region can be programmed for zero wait–state read and one wait–state write, or
another combination suitable for slower memory devices.
DRAM, unlike ROM, is always assumed to have 3–cycle access times. However, if page–mode DRAM is used it is possible to achieve 1–cycle burst–mode accesses. Burst–mode is used when consecutive memory addresses are being accessed,
such as during instruction fetching. The Am29200 microcontroller supports 4–cycle
DRAM access with 2–cycle burst. The faster DRAM interface of the Am29240
should result in a substantial performance gain. Additionally, the 3–cycle initial
DRAM access can be reduced to 2–cycle if the required 1–cycle precharge can be
hidden. This is explained in section 1.14.1 under the Am29200 and Am29205 subheading. Consequently the Am29240 DRAM is often referred to as 2/1 rather than
3/1.
The Am29240 processor supports integer multiply directly in a single cycle.
Most 29K processors take a trap when an integer multiply is attempted. It is left to
trapware to emulate the missing instruction. The ability to perform high speed multiply makes the processor a better choice for calculation intensive applications such as
digital signal processing. Note, floating–point performance should also improve
with the Am29240 as floating–point emulation routines can make use of the integer
multiply instruction.
The Am2924x family grouping is implemented with a silicon process which enables processors to operate at 3.3–volts or 5–volts. The lower power consumption
achievable at 3.3–volts makes the Am29240 suitable for hand–held type applications.
1.9.1 The Am29243 Microcontroller
The Am29243 is an Am29240 microcontroller enhanced to deal with communication applications (see Table 1-4). For this reason the video interface is omitted.
The pins used have not been reassigned, and there is a possibility they will be allocated in a future microcontroller for an additional communications support function.
Communication applications frequently require large amounts of DRAM, and it
is often critical that no corruption of the data occur. Parity error checking is often performed by memory systems with the objective of detecting data corruption. It can be
difficult to build the necessary circuitry at high memory system speeds. The
Am29243 microcontroller has built–in parity generation and checking for all DRAM
accesses. When enabled by the DRAM controller, the processor will take trap number 4 when a parity error is detected. Having parity handling built–in enables single–
cycle DRAM accesses to be performed without any external circuitry required.
Because of the larger amounts of memory typically used in communication applications, the Am29243 has a second Translation Look–Aside Buffer (TLB). Having two TLBs enables a larger number of virtual to physical address translations to be
cached (held in a TLB register) at any time. This reduces the TLB reload overhead.
The second TLB also has 16 entries (8 sets, two entries per set), and the page size can
be the same or different. If the TLB page sizes are the same, a four–way set associative MMU can be constructed with supporting software. Alternatively one TLB can
be used for code and the second, with a larger page size, for data buffers or shared
libraries. The TLB entries have a Global Page (GLB) bit; when it is set, the mapped page
can be accessed by any process regardless of its process identifier (PID).
1.9.2 The Am29245 Microcontroller
The Am29245 is a low–cost version of the Am29240 microcontroller (see
Table 1-4). To enable the lower cost, the data cache and the integer multiply unit have
been omitted. Further, there are only two DMA channels in place of the Am29240’s
four. To further reduce cost, one of the two serial ports has also been omitted.
The Am29245 is intended for use in systems which do not need the maximum
performance of the Am29240 or all of its peripherals; and can benefit from a reduced
processor cost. The Am29245 does not support Scalable Clocking and is only available at relatively lower clock speeds.
1.9.3 The Am2924x Evaluation
AMD has a number of boards available for Am2924x evaluation. Microcontrollers in this family grouping all have the same pin configuration. This enables the
boards to operate with any of the Am2924x processors. The least expensive board is
the SD29240. It is a very small board, similar in form to the SA29200 board; it does
not have the expansion connector available with the SA29200. It is normally supplied with an Am29240 or Am29245 installed. There is 1M byte of 32–bit wide
DRAM which operates at 16 MHz. When an Am29240 is used, Scalable Clocking
can enable the processor to operate at 32 MHz. The board also has a JTAG and
RS–232 connector. The 1M byte of 32–bit wide EPROM supplied with the board is
preprogrammed for MiniMON29K operation.
Those with more money to spend, or requiring a more flexible evaluation board,
can use the SE29240 board. It contains an Am29243 processor but can be used to
evaluate an Am29240 or Am29245. Initially the board contains 1M byte of 36–bit
wide DRAM. However, this can be expanded considerably. The DRAM is 36–bits
wide due to the additional 4–bits required for parity checking. The maximum
memory speed is 25 MHz. Scalable Clocking can be used with a 32 MHz processor
when the memory system is configured for 16 MHz operation.
The SE29240 board has greater I/O capability than the SD29240 board. There
are connectors for two RS–232 ports and a parallel port. Debugging can be achieved
via a serial or parallel port link to the MiniMON29K DebugCore located in EPROM.
Debugging is also supported via the JTAG or Logic Analyzer connections. There is a
small wire–wrap area for additional circuitry, and extra boards can be connected via
an expansion connector.
AMD also has an evaluation board intended for certain communication applications. The NET29K board has a triple processor pad–site. The board can operate with
either an Am29205, Am29200 or Am2924x (probably an Am29243) processor. The
processor pad site is concentric, the larger processor being at the outer position. The
similarity in the memory region controllers enables the construction of this unusual
board.
The memory system consists of 4M bytes of 36–bit wide DRAM, which is expandable. There is also 2M bytes of 32–bit EPROM. The EPROM can be replaced
with 1M byte of Flash programmable memory. For communications there is an AMD
MACE chip which provides an Ethernet capability via a 10–Base–T connector. Two
of the processor's DMA channels are wired for MACE access. One channel of an
85C30 UART is connected to an RS–449 connector which supports RS–422 signal
level communication. This enables very fast UART communication. The MiniMON29K DebugCore and OS–boot operating system are initially installed in
EPROM (or Flash); and the DebugCore communicates via an on–chip UART connected to an RS–232 (9–way) connector.
When the NET29K board is used with an Am29205 processor, the 16–bit processor bus enables only half of the memory system to be accessed. The board is
physically small, measuring about 5 1/2 x 5 1/2 inches (14cm x 14cm). Debugging is
further supported by JTAG and Logic Analyzer connections. An inexpensive 9–volt
power supply is required.
1.10 REGISTER AND MEMORY SPACE
Most of the 29K instructions operate on information held in various processor
registers. Load and store type instructions are available for moving data between external memory and processor registers. Members of the 29K family generally support registers in three independent register regions which make up the 29K register
space. These regions are the General Purpose registers, Translation Look–Aside
(TLB) registers, and Special Purpose registers. Members of the 29K family which do
not support Memory Management Unit operation do not have TLB registers implemented.
There are currently two core processors within the 29K family, the Am29000
and the Am29050. Other processors are generally derived from one of these core processors. For example, the Am29030 has an Am29000 at its core, with additional silicon area being used to implement instruction cache memory and a 2–bus processor
interface. The differences between the core processors and their derivatives are reflected in expansions to the special register space.
However, the special register space does appear uniform throughout the 29K
family. Generally, only those generating operating system support
code are concerned with the details of the special register space. AMD has specified a
subset of special registers which are supported on all 29K family processors. This
aids in the development and porting of Supervisor mode code.
The core processors support a 3–bus Harvard Architecture, with instructions
and data being held in separate external memory systems. There is one 32–bit bus
each for the two memory systems and a shared 32–bit address bus. Some RISC chips
have a 4–bus system, where there is an address bus for each of the two memory systems. This avoids the contention for use of a shared address bus. Unfortunately, it also
results in increased pin–count and, consequently, processor cost. The 29K 3–bus processors avoid conflicts for the address bus by supporting burst mode addressing and a
large number of on–chip registers. It has been estimated that the Am29000 processor
loses only 5% performance as a result of the shared address bus.
All instruction fetches are directed to instruction memory; data accesses are directed to data memory or I/O space. These two externally accessible spaces consti-
tute two of the four external access spaces. The other two are the ROM space and the
coprocessor space. The ROM space is accessed via the instruction bus. Like the
instruction space it covers a 2^32 byte range.
1.10.1 General Purpose Registers
All members of the family have general purpose registers which are made up
from 128 local registers and more than 64 global registers (see Figure 1-14). These
registers are the primary source and destination for most 29K instructions. Instructions have three 8–bit operand fields which are used to supply the addresses of general registers. All User mode executable instructions and code produced by high level
language compilers are restricted to directly accessing only general purpose registers. The fact that these registers are all 32–bit and that there is a large number of
them, vis–a–vis CISC, reduces the need to access data held in external memory.
General purpose registers are implemented by a multiport register file. This file
has a minimum of three access ports; the Am29050 processor has an additional port
for writing–back floating–point results. Two of the three ports provide simultaneous
read access to the register file; the third port is for updating a register value. Instructions generally specify two general purpose register operands which are to be operated on. After these operands have been presented to the execution unit, the result of
the operation is made available in the following cycle. This allows the result of an
integer operation to be written back to the selected general purpose register in the
cycle following its execution. At any instant, the current cycle is used to write–back
the result of the previous computation.
The Am29050 can execute floating–point operations in parallel with integer
operations. The latency of floating–point instructions can be more than the 1–cycle
achieved by the integer operation unit. Floating–point results are written back, when
the operation is complete, via their own write–back port, without disrupting the integer unit's ability to write results into the general purpose register file.
Global Registers
The 8–bit operand addressing fields enable only the lower 128 of the possible
256 address values to be used for direct general purpose register addressing. This is
because the most significant address bit is used to select a register base–plus–offset
addressing mode. When the most significant bit is zero, the accessed registers are
known as Global Registers. Only the upper 64 of the global registers are implemented in the register file. These registers are known as gr64–gr127. Some of the lower
address–value global registers are assigned special support tasks and are not really
general purpose registers.
The Am29050 processor supports a condition code accumulator with global
registers gr2 and gr3. The accumulator can be used to concatenate the result of several Boolean comparison operations into a single condition code.
Absolute REG#     General–Purpose Register
0                 Indirect Pointer Access
1                 Stack Pointer
2 thru 63         not implemented
64                Global Register 64
65                Global Register 65
66                Global Register 66
 ...               ...
126               Global Register 126
127               Global Register 127
128               Local Register 125
129               Local Register 126
130               Local Register 127
131               Local Register 0
132               Local Register 1
 ...               ...
254               Local Register 123
255               Local Register 124

(Registers 64 through 127 are the Global Registers; registers 128 through 255 are the
Local Registers. In this example the Stack Pointer holds 131, so absolute register 131
corresponds to Local Register 0.)
Figure 1-14. General Purpose Register Space
Later the accumulated condition can be quickly tested. These registers are little used and, on the whole,
other, more efficient, techniques can be found in preference to their use.
Local Registers
When the most significant address bit is set, the upper 128 registers in the general purpose register file are accessed. The lower 7–bits of the address are used as an
offset to a base register which points into the 128 registers. These general purpose
registers are known as the Local Registers. The base register is located at the global
register address gr1. If the addition of the 7–bit operand address value and the register
base value produces a result too big to be contained in the 7–bit local register address
space, the result is rounded modulo–128. When producing a general purpose register
address from a local register address, the most significant bit of the general purpose
register address value is always set.
The local register base address can be read by accessing global register gr1.
However, the base register is actually a register which shadows global register gr1.
The shadow support circuitry requires that the base be written via an ALU operation
producing a result destined for gr1. This also requires that a one–cycle delay separate
the setting of the base register from any subsequent reference to local registers.
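As a minimal sketch of this timing rule (the register numbers and the 16–byte adjustment are illustrative assumptions, not part of any particular calling convention):

    sub   gr1, gr1, 16       ; lower the register stack pointer by four words;
                             ; gr1 must be written by an ALU operation
    add   gr96, gr97, 0      ; filler: touches neither gr1 nor any local register
    add   lr0, gr98, gr99    ; local registers may now be referenced relative
                             ; to the new base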
Global register address gr0 also has a special meaning. Each of the three operand fields has an indirect pointer register located in the special register space. When
address gr0 is used in an operand field, the indirect pointer is used to access a general
purpose register for the associated operand. Each of the three indirect pointers has an
8–bit field and can point anywhere in the general purpose register space. When indirect pointers are used, there is no distinction made between global and local registers.
All of the general purpose registers are accessible to the processor while executing in User mode unless register bank protection is applied. General purpose registers
starting with gr64 are divided into groups of 16 registers. Each group can have access
restricted to the processor operating in Supervisor mode only. The AMD high level
language calling convention specifies that global registers gr64–gr95 be reserved for
operating system support tasks. For this reason it is normal to see the special register
used to support register banking set to disable User mode access to global registers
gr64–gr95.
1.10.2 Special Purpose Registers
Special purpose register space is used to contain registers which are not accessed directly by high level languages. Registers such as the program counter and
the interrupt vector table base pointer are located in special register space. Normally
these registers are accessed by operating system code or assembly language helper
routines. Special registers can only be accessed by move–to and move–from type
instructions; except for the move–to–immediate case. Move–to and move–from
instructions require the use of a general purpose register. It is worth noting that
move–to special register instructions are among a small group of instructions which
cause processor serialization. That is, all outstanding operations, such as overlapping
load or store instructions, are completed before the serializing instruction commences.
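A minimal sketch of special register access (gr96 is an arbitrary scratch register chosen for illustration):

    mfsr   gr96, cps         ; move-from: copy the Current Processor Status to gr96
    mtsr   ops, gr96         ; move-to: write the value to the Old Processor Status
    mtsrim tmr, 0            ; small constants can be written directly with the
                             ; move-to-special-register-immediate form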
Special register space is divided into two regions (see Figure 1-15). Those registers whose address is below sr128 can only be accessed by the processor operating
in Supervisor mode. Different members of the 29K family have extensions to the
special registers shown in Figure 1-15. However, special registers sr0–sr14 are a subset which appears in all family members. Certain, generally lower cost, family members such as the Am29005 processor, which have no memory management unit, do
not have the relevant MMU support registers (sr13 and sr14). I shall first describe the
restricted access, or protected, special registers. I shall not go into the exact bit–field
operations in detail; for an expansion of field meanings, see later chapters or the relevant processor User’s Manual. The objective here is to provide a framework for better understanding the special register space.
Special registers are not generally known by their special register number. For
example, the program counter buffer register PC1 is known as PC1 by assembly language programming tools rather than sr11.
Vector Area Base
Special register sr0, better known as VAB, is a pointer to the base of a table of
256 address values. Each interrupt or trap is assigned a unique vector number. Vector
numbers 0–63 are assigned to specific processor support tasks. When an interrupt or
trap exception is taken, the vector number is used to index the table of address values.
The identified address value is read and used as the start address of the exception
handling routine. Alternatively with 3–bus members of the 29K family, the vector
table can contain 256 blocks of instructions. The VF bit (vector fetch) in the processor Configuration register (CFG) is used to select the vector table configuration.
Each block is limited to 64 instructions, but via this method the interrupt handler can
be reached faster as the start of, say, an interrupt handler need not be preceded by a
fetch of the address of the handler. In practice the table of vectors to handlers, rather
than handlers themselves, is predominantly used due to the more efficient use of
memory. For this reason the two later 2–bus members of the 29K family only support
the table of vectors method; and the VF bit in the CFG register is reserved and effectively set.
The first 29K processor, the Am29000, has a VAB register which requires the
base of the vector table to be aligned to a 64k byte address boundary. This can be inconvenient and lead to memory wastage. More recent family members provide for a
1k byte boundary. Because the 3–bus family members support instructions being located in Instruction space and ROM space (memory space is described in section
1.10.4), it is possible with these processors to specify that handler routines are in
ROM space by setting the RV bit (ROM vector area) in the CFG register when the VF bit is zero.
Special Purpose
Reg. No.    Protected Registers               Mnemonic
0           Vector Area Base Address          VAB
1           Old Processor Status              OPS
2           Current Processor Status          CPS
3           Configuration                     CFG
4           Channel Address                   CHA
5           Channel Data                      CHD
6           Channel Control                   CHC
7           Register Bank Protect             RBP
8           Timer Counter                     TMC
9           Timer Reload                      TMR
10          Program Counter 0                 PC0
11          Program Counter 1                 PC1
12          Program Counter 2                 PC2
13          MMU Configuration                 MMU
14          LRU Recommendation                LRU

Reg. No.    Unprotected Registers             Mnemonic
128         Indirect Pointer C                IPC
129         Indirect Pointer A                IPA
130         Indirect Pointer B                IPB
131         Q                                 Q
132         ALU Status                        ALU
133         Byte Pointer                      BP
134         Funnel Shift Count                FC
135         Load/Store Count Remaining        CR
160         Floating–Point Environment        FPE
161         Integer Environment               INTE
162         Floating–Point Status             FPS
Figure 1-15. Special Purpose Register Space for the Am29000 Microprocessor
Or, when the more typical table of vectors method is being used, ROM space handlers can be specified by setting
bit 1 of the handler address. Since handler routines all start on 4–byte instruction
boundaries, bits 0 and 1 of the vector address are not required to hold address information. The 2–bus and microcontroller members of the 29K family do not support
ROM space, and the RV bit in their CFG registers is reserved.
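As a minimal sketch of boot–time vector table setup (the symbol vector_table and the choice of gr96 are illustrative assumptions; the table must meet the alignment requirement of the particular processor):

    const   gr96, vector_table     ; low 16 bits of the table address
    consth  gr96, vector_table     ; high 16 bits
    mtsr    vab, gr96              ; VAB (sr0) now points at the vector table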
Processor Status
Two special registers, sr1 and sr2, are provided for processor status reporting
and control. The two registers OPS (old processor status) and CPS (current processor
status) have the same bit–field format. Each bit position has been assigned a unique
task. Some bit positions are not effective with particular family members. For example, the Am29030 processor does not use bit position 15 (CA). This bit is used to indicate coprocessor activity. Only the 3–bus family members support coprocessor operation in this way.
The CPS register reports and controls current processor operation. Supervisor
mode code is often involved with manipulating this register as it controls the enabling
and disabling of interrupts and address translation. When a program execution exception is taken, or an external event such as an interrupt occurs, the CPS register
value is copied to the OPS register and the processor modifies the CPS register to
enter Supervisor mode before execution continues in the selected exception handling
routine. When returning from the handler routine, the interrupted program is restarted with an IRET type instruction. Execution of an IRET instruction causes the
OPS register to be copied back to the CPS register, helping to restore the interrupted
program context. Supervisor mode code often prepares OPS register contents before
executing an IRET and starting User mode code execution.
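A sketch of this return path (not a complete context switch; the scratch register gr96 and the symbol SM_BIT, standing for the Supervisor mode bit position in the status registers, are assumptions made for illustration):

    mfsr   gr96, ops          ; fetch the status the interrupted program ran with
    andn   gr96, gr96, SM_BIT ; example modification: clear the Supervisor bit
    mtsr   ops, gr96          ; OPS now describes the context to resume
    iret                      ; CPS <- OPS; execution restarts at PC1/PC0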
Configuration
Special register sr3, known as the configuration control register (CFG), establishes the selected processor operation. Such options as big or little endian byte order, cache enabling, coprocessor enabling, and more are selected by the CFG setting.
Normally this register value is established at processor boot–up time and is infrequently modified.
The original Am29000 (rev C and later) only used the first six bits of the CFG
register for processor configuration. Later members of the family offer the selection
of additional processor options, such as instruction memory cache and early address
generation. Additional options are supported by extensions to the CFG bit–field assignment. Because there is no overlap with CFG bit–field assignment across the 29K
family, and family members offer a matrix of functionality, there are often reserved
bit–fields in the CFG register for any particular 29K processor. The function provided at each bit position is unique and if the function is not provided for by a processor, the bit position is reserved.
The upper 8–bits of the CFG register are used for processor version and revision
identification. The upper 3–bits of this field, known as the PRL (processor revision
level), identify the processor. The Am29000 processor is identified by processor
number 0, the Am29050 is processor number 1, and so on. The lower 5–bits of the
PRL give the revision level; a value of 3 indicates revision ‘D’. The PRL field is
read–only.
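A minimal sketch of run–time processor identification (register choices are illustrative; the shift counts assume the version and revision field occupies the top byte of CFG, as described above):

    mfsr  gr96, cfg          ; read the configuration register
    srl   gr97, gr96, 24     ; gr97 = 8-bit version and revision field
    srl   gr98, gr97, 5      ; gr98 = processor number (0 = Am29000, 1 = Am29050, ...)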
Data Access Channel
Three special registers, sr4–sr6, known as CHA (channel address), CHD (channel data) and CHC (channel control), are used to control and record all access to external data memory. Processors in the 29K family can perform data memory access in
parallel with instruction execution. This offers a considerable performance boost,
particularly where there is high data memory access latency. Parallel operation can
only occur if the instruction pipeline can be kept fed from the instruction prefetch
buffer (IPB), instruction memory cache, or via separate paths to data and instruction
memory (Harvard style 3–bus processors). It is an important task of a high level language compiler to schedule load and store instructions such that they can be successfully overlapped with other nondependent instructions (see section 1.13).
When data memory access runs in parallel, its completion will occur some time
after the instruction originally making the data access. In fact it could be several
cycles after the original request, and it may not be possible to determine the original
instruction. On many processors, keeping track of the original instruction is required
in case the load or store operation does not complete for some reason. The original
instruction is restarted after the interrupting complication has been dealt with. However, with the 29K family the original instruction is not restarted. All access to external memory is via the processor Data Channel. The three channel support registers
are used to restart any interrupted load or store operation. Should an exception occur
during data memory access, such as an address translation fault, memory access
violation, or external interrupt, the channel registers are updated by the processor reporting the state of the in–progress memory access.
The channel control register (CHC) contains a number of bit–fields. The contents–valid bit (CV) indicates that the channel support registers currently describe a
valid data access. The CV bit is normally seen set when a channel operation is interrupted. The ML bit indicates a load– or store–multiple operation is in progress.
LOADM and STOREM instructions set this bit when commencing and clear it when
complete. It is important to note that non–multiple LOAD and STORE instructions
do not set or clear the ML bit. When a load– or store–multiple operation is interrupted
and nested interrupt processing is supported, it is not sufficient to just clear the CV bit
to temporarily cancel the channel operation. If the ML bit were left set, a subsequent
load or store operation would become confused with a multiple type operation. The
ML bit should be cleared along with the CV bit; this is best done by writing zero into
the CHC register. (See section 4.3.8 for more information about clearing CHC.)
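As a sketch of the saving and clearing sequence (register numbers are illustrative; a real handler must also preserve the saved values across any nesting):

    mfsr   gr96, cha        ; save the three channel registers so the
    mfsr   gr97, chd        ; interrupted access can be restarted later
    mfsr   gr98, chc
    mtsrim chc, 0           ; write zero to CHC: CV and ML are both cleared,
                            ; so the handler's own loads and stores are not
                            ; mistaken for part of a multiple operation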
Integer operations complete in a single cycle, enabling the result of the previous
integer operation to be written back to the general purpose register file in the current
cycle. Because external memory reads are likely to take several cycles to complete,
and pipeline stalling is to be avoided, the accessed data value is not written back to the
global register file during the following instruction (the write–back cycle). This results in the load data being held by the processor until access to the write–back port is
available. This is certain to occur during the execution of any future load or store
instruction which itself can not make use of its own write–back cycle. The processor
makes available via load forwarding circuitry the load data which awaits write–back
to the register file.
Register Access Protection
Special register sr7, known as RBP (register bank protect), provides a means to
restrict the access of general purpose registers by programs executing in User mode.
General purpose registers starting with gr64 are divided into groups of 16 registers.
When the corresponding bit in the RBP register is set, the associated bank of 16 registers is protected from User mode access. The RBP register is typically used to prevent
User mode programs from accessing Supervisor–maintained information held in
global registers gr64–gr95. These registers are reserved by the AMD high level language calling convention for system level information.
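A one–line sketch (the value assumes that bit 0 of RBP protects gr64–gr79 and bit 1 protects gr80–gr95; check the processor User's Manual for the exact bank assignment):

    mtsrim rbp, 0x3         ; deny User mode access to gr64-gr95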
On–Chip Timer Control
Special registers sr8 and sr9, known as TMC (timer counter) and TMR (timer
reload value), support a 24–bit real–time clock. The TMC register decrements at the
rate of the processor clock. When it reaches zero it will generate an interrupt if enabled. In conjunction with support software these two registers can be used to implement many of the functions often supported by off–chip timer circuitry.
Program Counter
A 29K processor contains a Master and Slave PC (program counter) address
register. The Master PC register contains the address of the instruction currently being fetched. The Slave contains the address of the next sequential instruction. Once an instruction flows into the execution unit, unless interrupted, the following instruction, currently in decode, will always flow into the execution unit. This is true for all instructions except for a few such as IRET. Even if the instruction in execute is a
jump–type, the following instruction known as the delay–slot instruction is executed
before the jump is taken. This is known as delayed branching and can be very useful
in hiding memory access latencies, as the processor pipeline can be kept busy executing the delay–slot instruction while the new instruction sequence is fetched. It is an
important activity of high level language compilers to find useful instructions to
place in delay–slot locations.
The Master PC value flows along the PC–bus and the bus activity is recorded by
the PC buffer registers, see Figure 1-16. There are three buffer registers arranged in
sequence. These buffer registers are accessible within special register space as
sr10–sr12, better known as PC0, PC1 and PC2. The PC0 register contains the address
of the instruction currently in decode; register PC1 contains the address of the
instruction currently in execute; and PC2 the instruction now in write–back.
[Figure 1-16 is a block diagram: the Address Generation Unit and a 30–bit incrementer feed the Master PC and Slave PC via the PC–bus and a branch PC multiplexer; the Master PC supplies the instruction fetch (or cache fetch) address; the B–bus supplies branch addresses; a Return Address register connects to the R–bus; and the PC–buffer registers PC0, PC1 and PC2 record the decode, execute and write–back addresses respectively.]
Figure 1-16. Am29000 Processor Program Counter
When a program exception occurs the PC–buffer registers become frozen. This
is signified by the FZ bit in the current processor status register being set. When frozen, the PC–buffer registers accumulate no new PC–bus information. The frozen PC
information can be used later to restart program execution. An IRET instruction
causes the PC1 and PC0 register information to be copied to the Master and Slave PC
registers and instruction fetching to commence. For this reason it is important to
maintain both PC1 and PC0 values when dealing with such system level activities as
nested interrupt servicing. Since the PC2 register records the address of a now
executed instruction, maintenance of its value is less important, but it can play an important role in debugging.
When a CALL instruction is executed, the B–bus supplies the Master PC with
the address of the new instruction stream. Earlier, when the CALL instruction entered the decode stage, the PC–bus was used to fetch the delay–slot instruction; and
the address of the instruction following the delay–slot (the return address) was prepared for entry into the Slave PC. On the following cycle, the CALL instruction enters the execute stage and the return address enters the Return Address register. During CALL execution, the return address is transferred to the register file via the R–BUS.
MMU Control
The last of the generally available special registers are concerned with memory
management unit (MMU) operation. Processors which have the Translation Look–
Aside Buffer (TLB) registers omitted will not have these two special registers. The
operation of the MMU is quite complex, and Chapter 6 is fully dedicated to the description of its operation. Many computer professionals working in real–time projects may be unfamiliar with MMU operation. The MMU enables virtual addresses
generated by the processor to be translated into physical memory addresses. Additionally, memory is divided into page sized quantities which can be individually protected against User mode or Supervisor mode read and write access.
Special register sr13, known as MMU, is used to select the page size; a minimum of 1k bytes, and a maximum of 8k bytes. Also specified is the current User
mode process identifier. Each User mode process is given a unique identifier and Supervisor mode processes are assumed to have identifier 0.
Certain newer 29K processors support two TLB systems on–chip. Each TLB
has an independently programmable page size. These processors, and their close relatives, can be programmed for a maximum page size of 16M bytes.
Additional Protected Special Registers
Monitor Mode
Some newer members of the 29K family have additional Supervisor only accessible special registers which are addressed above sr14. Figure 1-17 shows the additional special registers for processors which support Monitor mode. Special register
sr15, known as RSN (reason vector), records the trap number causing Monitor mode to be entered.
Special Purpose
Reg. No.    Protected Registers            Mnemonic
15          Reason Vector                  RSN
20          Shadow Program Counter 0       SPC0
21          Shadow Program Counter 1       SPC1
22          Shadow Program Counter 2       SPC2
Figure 1-17. Additional Special Purpose Registers for Monitor Mode Support
Monitor mode extends the software debugging capability of the processor; it was briefly introduced in the earlier section on processor features, and is dealt with in detail in later chapters. The shadow Program Counter registers constitute a second set of PC–buffer registers. They record the PC–bus activity
and are used to support Monitor mode debugging.
Am29050
Figure 1-18 shows the additional special registers used by the Am29050 processor for region mapping. In the Am29050 case, the additional special registers support two functions: debugging and region mapping. Four special registers in the
range sr16–sr19 extend the virtual address mapping capabilities of the TLB registers.
They support the mapping of two regions which are of programmable size. Their use
reduces the demand placed on TLB registers to supply all of a system's address mapping and memory access protection requirements.
Special Purpose
Reg. No.    Protected Registers            Mnemonic
16          Region Mapping Address 0       RMA0
17          Region Mapping Control 0       RMC0
18          Region Mapping Address 1       RMA1
19          Region Mapping Control 1       RMC1
Figure 1-18. Additional Special Purpose Registers for the Am29050 Microprocessor
Instruction and Data Breakpoints
Figure 1-19 shows the additional special registers for processors which support
breakpoint debugging. They facilitate the control of separate instruction access breakpoints and data access breakpoints.
Special Purpose
Reg. No.    Protected Registers                 Mnemonic
23          Instruction Breakpoint Address 0    IBA0
24          Instruction Breakpoint Control 0    IBC0
25          Instruction Breakpoint Address 1    IBA1
26          Instruction Breakpoint Control 1    IBC1
27          Data Breakpoint Address 0           DBA0
28          Data Breakpoint Control 0           DBC0
Figure 1-19. Additional Special Purpose Registers for Breakpoint Control
Some 29K processors have instruction
breakpoints only; others support both types of breakpoint.
On–Chip Cache Control
Figure 1-20 shows the additional special registers required to access on–chip
cache. Only two additional registers, sr29 and sr30, are required. Both registers
are used for communicating with the instruction memory cache supported by many
29K processors. If a processor also contains data cache, the memory can similarly be
accessed via the same cache interface registers. Supervisor mode support code controls cache operation via the processor configuration register (CFG), and is not likely
to make use of the cache interface registers. These registers may be used by debuggers and monitors to preload and examine cache memory contents.
Special Purpose
Reg. No.    Protected Registers           Mnemonic
29          Cache Interface Register      CIR
30          Cache Data Register           CDR
Figure 1-20. Additional Special Purpose Registers for On–Chip Cache Control
User Mode Accessible Special Registers
Figure 1-15 showed the special register space with its two regions. The region
addressed above sr128 is always accessible; and below sr128, registers are only accessible to the processor when operating in Supervisor mode.
The original Am29000 processor defined a subset of User mode accessible registers, in fact those shown in Figure 1-15. Every 29K processor supports the use of
these special registers, but only the Am29050 has the full complement implemented.
Registers in the range sr128–sr135 are always present. However, the three registers sr160–sr162 are used to support floating–point and integer operations. Only
certain members of the 29K family directly support these operations in processor
hardware. Other 29K family members virtualize these three registers. When not
available, an attempt to access them causes a protection violation trap. The trap handler identifies the attempted operation and redirects the access to shadow copies of
the missing registers. The accessor is unaware that the virtualization has occurred,
except for the delay in completing the requested operation. In practice, floating–point
supporting special registers are not frequently accessed; except for the case of floating–point intensive systems which tend to be constructed around an Am29050 processor.
Indirect Pointers
Special registers sr128–sr130, better known as IPA, IPB and IPC, are the indirect pointers used to access the general purpose register file. For instructions which
make use of the three operand fields, RA, RB and RC, to address general purpose
registers, the indirect pointer can be used as an alternative operand address source.
For example, the RA operand field supplies the register number for the source operand–A; if global register address gr0 is used in the RA instruction field, then the operand register number is provided by the IPA register.
The IPA, IPB and IPC registers are pointers into the general purpose register file. They are
generally used to point to parameters passed to User mode helper routines. They are
also used to support instruction emulation, where trap handler routines perform the
missing instruction in software. The operands for the emulated instruction are
passed to the trap handler via the indirect pointers.
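A sketch of indirect operand access (the register number 100 and the scratch registers are illustrative; the indirect pointer is assumed here to hold the register number in bits 9-2, so the value written is the register number multiplied by four — confirm the field position in the processor User's Manual):

    const  gr96, 400          ; 100 shifted left by 2: select absolute register 100
    mtsr   ipa, gr96
    const  gr98, 0            ; unrelated filler: allow a cycle between setting an
                              ; indirect pointer and using it (see section 1.13)
    add    gr97, gr0, 0       ; gr0 in the RA field means "use IPA", so the
                              ; source operand is really register 100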
ALU Support
Special registers sr131–sr134 support arithmetic unit operation. Register
sr131, better known as Q, is used during floating–point and integer multiply and divide steps. Only the Am29050 processor can perform floating–point operations directly, that is, without coprocessor or software emulation help. It is also the only processor which directly supports integer multiply. All other current members of the
29K family perform these operations in a sequence of steps which make use of the Q
register.
The result of a comparison instruction is placed in a general purpose register, as
well as in the condition field of the ALU status register (special register sr132). However, the ALU status register is not conveniently tested by such instructions as conditional branch. Branch decisions are made on the basis of True or False values held in
general purpose registers. This makes a lot of sense, as contention for use of a single
resource such as the ALU status register would lead to a resource conflict which
would likely result in unwanted pipeline stalling.
The ALU status register controls and reports the operation of the processor integer operation unit. It is divided into a number of specialized fields which, in some
cases, can be more conveniently accessed via special registers sr133 (BP) and sr134 (FC). The
shorthand access provided by these additional registers avoids the read, shift and
mask operations normally required before writing to bit–fields in the ALU register.
Data Access Channel
The three channel control registers, CHA, CHD and CHC, were previously described in the protected special registers section. However, User mode programs
have a need to establish load– and store–multiple operations which are controlled by
the channel support registers. Special register sr135, known as CR, provides a means
for a User mode program to set the Count Remaining field of the protected CHC register. This field specifies the number of consecutive words transferred by the multiple
Chapter 1
Architectural Overview
59
data move operation. Should the operation be interrupted for any reason, the CR field
reports the number of transfers yet to be completed. Channel operation is typically
restarted (if enabled) when an IRET type instruction is issued.
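A sketch of a four–word block copy from User mode (register numbers are illustrative; the value written to CR is assumed here to be the transfer count minus one — check the convention in the processor User's Manual):

    mtsrim cr, 3              ; four consecutive words
    loadm  0, 0, gr96, gr104  ; gr96..gr99 <- memory at the address in gr104
    mtsrim cr, 3
    storem 0, 0, gr96, gr105  ; memory at the address in gr105 <- gr96..gr99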
Instruction Environment Registers
Special registers sr160 and sr162, known as FPE and FPS, are the floating–
point environment and status registers. The environment register is used by User
mode programs to establish the required floating–point operations, such as double–
or single–precision, IEEE specification conformance, and exception trap enabling.
The status register reports the outcome of floating–point operations. It is typically
examined as a result of a floating–point operation exception occurring. Only processors (Am29050) which support floating–point operations directly (free of trapware)
have real sr160 and sr162 registers. All other processors appear to have these registers via trapware support which creates virtual registers.
The integer environment is established by setting special register sr161, known
as INTE. There are two control bits which separately enable integer and multiplication overflow exceptions. If exception detection is enabled, the processor will take
an Out–of–Range trap when an overflow occurs. Only processors (Am29040,
Am29240 and Am29243) which support integer multiply directly (free of trapware)
have a real sr161 register. All other processors appear to have an sr161 register via
trapware support.
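A one–line sketch of enabling overflow trapping (the mask symbol INTE_ENABLES, standing for the two overflow exception enable bits, is an assumption; their exact positions are given in the processor User's Manual):

    mtsrim inte, INTE_ENABLES   ; Out-of-Range traps now occur on overflow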
Additional User Mode Special Registers
Am29050
The Am29050 has an additional special register, shown in Figure 1-21. Register
sr164, known as EXOP, reports the instruction operation code causing a trap. It is
used by floating–point instruction exceptions. Unlike other 29K processors the
Am29050 directly executes all floating–point instructions. Exception traps can occur during these operations. When instruction emulation techniques are being used, it
is an easy matter to determine the instruction being emulated at the time of the trap.
However, with direct execution things are not as simple. The processor could examine the memory at the address indicated by the PC–buffer registers to determine
the relevant instruction opcode. But the Am29050 supports a Harvard memory architecture and there is no path within the processor to access the instruction memory as if
it were data. The EXOP register solves this problem. Whenever an exception trap is
taken, the EXOP register reports the opcode of the instruction causing the exception.
Users of other 3–bus Harvard type processors such as the Am29000 and
Am29005 should take note; virtualizing the unprotected special registers sr160–162
requires that the instruction space be readable by the processor (virtualizing, in this
case, means making registers sr160–162 appear to be accessible even when they are
not physically present). This can only be achieved by connecting the instruction and data busses together (disabling the Harvard architecture advantages by creating a 2–bus system) or by providing an off–chip bridge.
Special Purpose
Reg. No.    Unprotected Registers     Mnemonic
164         Exception Opcode          EXOP
Figure 1-21. Additional Special Purpose Register for the Am29050 Microprocessor
This bridge must enable the instruction address
space to be reached from within some range of data memory space, at least for word–
size read accesses, albeit with additional access time penalties.
The Am29050 processor has an additional group of registers known as the floating–point accumulators. There are four 64–bit accumulators ACC3–0 which can be
used with certain floating–point operations. They can hold double– or single–precision numbers. They are not special registers, in the sense that they do not lie in special register
space. They are located in their own register space, giving the Am29050 one more
register space than the normal three register spaces of the other 29K family members.
However, like special registers, they can only be accessed by move–to and move–
from accumulator type instructions.
Double–precision numbers (64–bit) can be moved between accumulators and
general registers in a single cycle. Global registers are used in pairs for this operation.
This is possible because the Am29050 processor is equipped with an additional
64–bit write–back port for floating point data, and the register file is implemented
with a width of 64–bits.
1.10.3 Translation Look–Aside Registers
Although some 29K family members are equipped with region mapping registers, a Translation Look–Aside Buffer (TLB) technique is generally used to provide
virtual to physical address translation. The TLB is two–way set associative and up to
64 translations are cached in the TLB support registers.
The TLB registers form the basis for implementing a Memory Management
Unit. The scheme for reloading TLB registers is not dictated by processor microcode,
but left to the programmer to organize. This enables a number of performance boosting schemes to be implemented with low overhead costs. However, it does place the
burden of creating a TLB maintenance scheme on the user. Those used to having to
work around a processor’s microcode imposed scheme will appreciate the freedom.
TLB registers can only be accessed by move–to TLB and move–from TLB
instructions executed by the processor operating in Supervisor mode. Each of the
possible 64 translation entries (less than 64 with some 29K family members) requires
a pair of TLB registers to fully describe the address translation and access permissions for the mapped page. Pages are programmable in size from 1k bytes to 8k bytes
(to 16M byte with newer 29K processors), and separate read, write and execute permissions can be enabled for User mode and Supervisor mode access to the mapped
page.
There is only a single 32–bit virtual address space supported. This space is
mapped to real instruction, data or I/O memory. Address translation is performed in a
single cycle which is overlapped with other processor operations. This results in the
use of an MMU not imposing any run–time performance penalties, except where
TLB misses occur and the TLB cache has to be refilled. Each TLB entry is tagged
with a per–process identifier, avoiding the need to flush TLB contents when a user–
task context switch occurs. Chapter 6 fully describes the operation of the TLB.
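As a minimal sketch of writing one translation entry (the entry number, register choices and the contents of gr96/gr97 are illustrative assumptions, including the assumption that entry n occupies TLB registers 2n and 2n+1; Chapter 6 gives real TLB reload code):

    const  gr98, 6            ; TLB register number: word 0 of entry 3
    mttlb  gr98, gr96         ; gr96 holds the virtual tag, permissions and PID
    add    gr98, gr98, 1
    mttlb  gr98, gr97         ; gr97 holds the physical page address and flags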
1.10.4 External Address Space
The 3–bus members of the 29K family support five external 32–bit address
spaces. They are:
Data Memory — accessed via the data bus.
Input/Output — also accessed via the data bus.
Instruction — accessed via the instruction bus, normally read–only.
ROM — also accessed via the instruction bus, normally read–only.
Coprocessor — accessed via both data and address busses. Note, the address
bus is only used for stores to coprocessor space. This enables 64–bit transfers
during stores and 32–bit during loads.
The address bus is used for address information when accessing all address
spaces except the coprocessor space. During load and store operations to coprocessor
space, address information can be supplied in a limited way by the OPT2–0 field of
the load and store instructions. Of course, with off–chip address decoding support,
access to coprocessor space could always be made available via a region of I/O or
data space. Coprocessors support off–chip extensions to a processor’s execution
unit(s). In the past AMD supplied a coprocessor for floating–point support, the Am29027. It is possible that users could construct their own coprocessor for
some specialized support task.
Earlier sections discussed the read–only nature of the instruction bus of 3–bus
processors. Instructions are fetched along the instruction bus from either the ROM
space or the Instruction space. Access to the two 32–bit spaces is distinguished by the
IREQT processor pin. The state of this pin is determined by the RE (ROM enable) bit
of the current processor status register (CPS). This bit can be set by software or via
programmed event actions, such as trap processing. ROM space is intended for system level support code. Typically systems do not decode this pin and the two spaces
are combined into one.
The Input/Output (I/O) space can be reached by setting the AS (address space)
bit in load and store instructions. Transfers to I/O space, like coprocessor space and
data space transfers, are indicated by the appropriate value appearing on the
DREQT1–0 (data request type) processor pins. I/O space access is only convenient
for assembly level routines. There is typically no convenient way for a high level language to indicate an access is to be performed to I/O space rather than data space. For
this reason use of I/O space is often best avoided, unless it is restricted to accessing
some Supervisor maintained peripheral which is best handled via assembly language
code.
The 2–bus 29K family processors support a reduced number of off–chip address
spaces, in fact, only two: Input/Output space, and a combined Instruction/Data
memory space. Accessing both instructions and data via a shared instruction/data bus
simplifies the memory system design. It can also simplify the software; for example,
instruction space and data space can no longer overlap. Consider a 3–bus system
which has physical memory located at address 0x10000 in instruction space and also
different memory located at address 0x10000 in data space. Software errors can arise over which memory is being accessed at address 0x10000. Overlap of the spaces can also complicate system tasks such as virtual memory management, where separate free–page
lists would have to be kept for the different types of memory.
The Translation Look–Aside buffer (TLB), used to support virtual memory addressing, supports separate enabling of data and instruction access via the R/W/X
(read/write/execute) enable bits. However, permission checking is only performed
after address translation is performed. It is not possible to have two valid virtual–to–
physical address translations present in the TLB at the same time for the same virtual
address, even if one physical address is for data space and the other instruction space.
This complicates accessing overlapping address spaces via a single 32–bit virtual
space.
Accessing virtual memory has similar characteristics to accessing memory via a
high level language. For example, C normally supports a single address space. It is
difficult and nonportable to have C code which can reach different address spaces.
Except for instruction fetching, all off–chip memory accesses are via load and store
type instructions. The OPT2–0 field for these instructions specifies the size of the
data being transferred: byte, half–word or 32–bit. The compiler assigns OPT field
values for all load and store instructions it generates. Unless via C language extensions or assembly code post–processing, there is no way to set the load and store
instruction address–space–selecting options. Software is simplified by locating all
external peripherals and memory in a single address space; or when a Harvard architecture is used, by not overlapping the regions of data and instruction memory spaces
used.
1.11 INSTRUCTION FORMAT
All instructions for the Am29000 processor are 32 bits in length, and are divided
into four fields, as shown in Figure 1-22. These fields have several alternative definitions, as discussed below. In certain instructions, one or more fields are not used, and
are reserved for future use. Even though they have no effect on processor operation,
bits in reserved fields should be 0 to ensure compatibility with future processor versions.
Bits 31–24          Bits 23–16        Bits 15–8      Bits 7–0
Op (lsb = A//M)     RC                RA             RB
                    I17..I10          SA             RB or I
                    I15..I8                          I9..I2
                    VN                               I7..I0
                    CE//CNTL                         UI//RND//FD//FS
Figure 1-22. Instruction Format
The instruction fields are defined as follows:
BITS 31–24
Op
This field contains an operation code, defining the operation to be
performed. In some instructions, the least–significant bit of the operation code selects between two possible operands. For this reason,
the least–significant bit is sometimes labeled “A” or “M”, with the
following interpretations:
A
(Absolute): The A bit is used to differentiate between Program–
Counter relative (A = 0) and absolute (A = 1) instruction addresses,
when these addresses appear within instructions.
M
(IMmediate): The M bit selects between a register operand (M = 0)
and an immediate operand (M = 1), when the alternative is allowed by
an instruction.
BITS 23–16
RC
The RC field contains a global or local register–number, which is the
destination operand for many instructions.
I17..I10
This field contains the most–significant 8 bits of a 16–bit instruction
address. This is a word address, and may be Program–Counter relative or absolute, depending on the A bit of the operation code.
I15..I8
This field contains the most–significant 8 bits of a 16–bit instruction
constant.
VN
This field contains an 8–bit trap vector number.
CE//CNTL
This field controls a load or store access.
BITS 15–8
RA
The RA field contains a global or local register–number, which is a
source operand for many instructions.
SA
The SA field contains a special–purpose register–number.
BITS 7–0
RB
The RB field contains a global or local register–number, which is a
source operand for many instructions.
RB or I
This field contains either a global or local register–number, or an
8–bit instruction constant, depending on the value of the M bit of the
operation code.
I9..I2
This field contains the least–significant 8 bits of a 16–bit instruction
address. This is a word address, and may be Program–Counter relative, or absolute, depending on the A bit of the operation code.
I7..I0
This field contains the least–significant 8 bits of a 16–bit instruction
constant.
UI//RND//FD//FS
This field controls the operation of the CONVERT instruction.
The fields described above may appear in many combinations. However, certain combinations which appear frequently are shown in Figure 1-23.
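As a brief sketch of how one operation code covers several of these combinations (register numbers are arbitrary):

    add   gr96, gr97, gr98    ; three register operands (M = 0)
    add   gr96, gr97, 5       ; the RB field replaced by an 8-bit constant (M = 1)
    const gr96, 0x1234        ; one register operand with a 16-bit constant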
1.12 KEEPING THE RISC PIPELINE BUSY
If the external interface of a microprocessor can not support an instruction fetch
rate of one instruction per cycle, execution rates of one per cycle can not be sustained.
As described in detail in Chapter 6, a 4–1 DRAM (4–cycle first access, 1–cycle subsequent burst–mode access) memory system used with a 3–bus Am29000 processor,
can sustain an average processing time per instruction of typically two cycles, not the
desired 1–cycle per instruction. However, a 2–1 SRAM based system comes very
close to this target. From these example systems it can be seen that even if a memory
system can support 1–cycle burst–mode access, there are other factors which prevent
the processor from sustaining single–cycle execution rates.
It is important to keep the processor pipeline busy doing useful work. Pipeline
stalling is a major source of lost processor performance. Stalling occurs as a result of inadequate memory bandwidth, high memory access latency, bus access contention, excessive program branching, and instruction dependencies.
Three operands, with possible 8–bit constant:
    31..24: Op | M      23..16: RC          15..8: RA      7..0: RB or I
Three operands, without constant:
    31..24: Op | 0      23..16: RC          15..8: RA      7..0: RB
One register operand, with 16–bit constant:
    31..24: Op | 1      23..16: I15..I8     15..8: RA      7..0: I7..I0
Jumps and calls with 16–bit instruction address:
    31..24: Op | A      23..16: I17..I10    15..8: RA      7..0: I9..I2
Two operands with trap vector number:
    31..24: Op | M      23..16: VN          15..8: RA      7..0: RB or I
Loads and stores:
    31..24: Op | M      23..16: CE//CNTL    15..8: RA      7..0: RB or I
Figure 1-23. Frequently Occurring Instruction–Field Uses
To get the best from a
processor, an understanding of instruction stream dependencies is required. Processors in the 29K family all have pipeline interlocks supported by processor hardware.
The programmer does not have to ensure correct pipeline operation, as the processor
will take care of any dependencies. However, it is best that the programmer arranges
code execution to smooth the pipeline operation.
1.13 PIPELINE DEPENDENCIES
Modification of some registers has a delayed effect on processor behavior.
When developing assembly code, care must be taken to prevent unexpected behavior. The easiest of the delayed effects to remember is the one cycle that must pass
between setting an indirect pointer and using it. This occurs most often with the register stack pointer: a local register cannot be accessed in the instruction that follows the instruction that writes to gr1. An instruction that does not require gr1 (and
that means all local registers referenced via gr1) can be placed immediately after the
instruction that updates gr1.
Direct modification of the Current Processor Status (CPS) register must also be
done carefully, particularly where the Freeze (FZ) bit is reset. When the processor is
frozen, the special-purpose registers are not updated during instruction execution.
This means that the PC1 register does not reflect the actual program counter value at
the current execution address, but rather at the point where freeze mode was entered.
When the processor is unfrozen, either by an interrupt return or direct modification of
the CPS, two cycles are required before the PC1 register reflects the new execution
address. Unless the CPS register is being modified directly, this creates no problem.
Consider the following examples. If the FZ bit is reset and trace enable (TE) is
set at the same time, the next instruction should cause a trace trap, but the PC–buffer
registers frozen by the trap will not have had time to catch up with the current execution address. Within the trap code the processor will have appeared to have stopped at
some random address, held in PC1. If interrupts and traps are enabled at the same
time as the FZ bit is cleared, then the next instruction may suffer an external interrupt
or an illegal instruction trap. Once again, the PC–buffer register will not reflect the
true execution address. An interrupt return would cause execution to commence at a
random address. The above problems can be avoided by resetting FZ two cycles before enabling the processor to once again enter freeze mode.
Instruction Memory Latency
The Branch Target Cache (BTC), or the Instruction Memory Cache, can be used
to remove the pipeline stalling that normally occurs when the processor executes a
branch instruction. For the purpose of illustrating memory access latency, the effects
of the BTC shall be examined. The address of a branch target appears on the address
pins at the start of the write-back stage. Figure 1-24 shows the instruction flow
through the pipeline stages, assuming the external instruction memory returns the
target of a jump during the same cycle in which it was requested. This makes the Target instruction available at the fetch stage while the Delay instruction has to be stalled
before it can enter the execute stage. In this case, execution is stalled for two cycles
when the BTC is not used to supply the target instruction.
[Figure 1-24 (pipeline diagram): a jump, its delay–slot instruction and the jump target flow through the fetch, decode, execute and write–back stages with a 1–cycle instruction fetch; when the BTC does not supply the target, execution stalls for two cycles.]
Figure 1-24. Pipeline Stages for BTC Miss
The address of the fetch is presented to the BTC hardware during the execute
stage of the jump instruction, the same time the address is presented to the memory
management unit. When a hit occurs, the target instruction is presented to the decode
stage at the next cycle. This means no pipeline stalling occurs. The external instruction memory has up to three cycles to return the instruction four words past the target
address. That is, if single-cycle burst–mode can be established in three cycles (four
cycles for the Am29050 processor) or less, then continuous execution can be
achieved. The BTC supplies the target instructions and the following three instructions, assuming another jump is not taken. Figure 1-25 shows the flow of instructions through the pipeline stages.
Data Dependencies
Instructions that require the result of a load should not be placed immediately
after the load instruction. The Am29000 processor can overlap load instructions with
other instructions that do not depend on the result of the load. If 4-cycle data memory
is in use, then external data loads should (if possible) have four instructions
(4-cycles) between the load instructions and the first use of the data.
[Figure 1-25 (pipeline diagram): the BTC supplies the jump target and the following instructions; with a 3–cycle fetch of the instruction four words past the target, execution continues without stalling.]
Figure 1-25. Pipeline Stages for a BTC Hit
Instructions that depend on data whose loads have not yet completed cause a pipeline stall. The stall is
minimized by forwarding the data to the execution unit as soon as it is available.
Consider the example of an instruction sequence shown in Figure 1-26. The
instruction at Load+1 is dependent on the data loaded at Load. The address of load
data appears on the address pins at the start of the write-back stage. At this point,
instruction Load+1 has reached the execution stage and is stalled until the data is forwarded at the start of the next cycle, assuming the external data memory can return
data within one cycle.
[Figure 1-26 (pipeline diagram): instruction Load+1, which depends on the data loaded by the Load instruction, stalls for one cycle in the execute stage until the load data is forwarded.]
Figure 1-26. Data Forwarding and Bad–Load Scheduling
If the instruction were not dependent on the result of the load, it would have
executed without delay. Because of data forwarding and a 1-cycle data memory, the
load data would be available for instruction Load+2 without causing a pipeline stall.
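To make the scheduling advice concrete, the fragment below is a small sketch in 29K assembly; the register choices and the load operand fields are illustrative only and are not taken from any particular program.

        ; Poorly scheduled: the add uses gr96 immediately and stalls
        ; while the load completes.
        load    0,0,gr96,lr2     ; fetch operand addressed by lr2
        add     gr97,gr96,1      ; immediate use of the loaded data

        ; Better scheduled: independent instructions hide the load latency.
        load    0,0,gr96,lr2     ; fetch operand addressed by lr2
        add     gr98,gr99,4      ; unrelated work
        sll     gr100,gr100,2    ; more unrelated work
        add     gr97,gr96,1      ; loaded data now available; little or no stall

With a slower data memory, correspondingly more independent instructions are needed between the load and the first use to hide the latency completely.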
1.14 ARCHITECTURAL SIMULATION, sim29
AMD has for a long time made available a 29K simulator which accurately
models the processor operation. This simulator, known as the Architectural Simulator, can be configured to incorporate memory system characteristics. Since memory
system performance can greatly influence overall system performance, the use of the
simulator before making design decisions is highly recommended.
Simulation of all the 29K family members is supported, making the simulator
useful in determining processor choice [AMD 1991c][AMD 1993c]. For example,
does a floating–point intensive application require an Am29050 or will an Am29000
suffice? Alternatively, the performance penalties of connecting the data and instruction busses together on a 3–bus Harvard Architecture processor can be determined.
Because the simulator models detailed processor operation, such as pipeline
stages, cache memory, instruction prefetch, channel operation and much more, the
simulation run–times are longer than if the Instruction Set Simulator (ISS) were used.
Consequently, the Architectural Simulator is seldom used for program debugging.
The ISS simulator is described in Chapter 7 (Software Debugging). This is one of the
reasons that the Architectural Simulator does not utilize the Universal Debugger Interface (see section 7.5). Without a UDI interface, the simulator cannot support interactive debugging. Simulation results are directed to a log file. Interpreting their meaning and dealing with the log file format takes a little practice; more on this later.
When used with a HIF conforming operating system, the standard input and output for the simulated program use the standard input and output for the executable
simulator. Additionally, the 29K program standard output is also written to the simulation log file. AMD does not supply the simulator in source form; it is available in
binary for UNIX type hosts and 386 based PCs. The simulator driver, sim29, supports several command line options, as shown below. AMD updated the simulator
after version 1.1–8; the new version is compatible with the old and simulates at more
than four times the speed. The old simulator is still used with the Am29000 and
Am29050 processors. Only the new simulator models the Am2924x microcontrollers and newer 2–bus processors. The following description of command line options
covers both simulator versions.
sim29  [–29000 | –29005 | –29030 | –29035 | –29050 ... –29240]
       [–cfg=xx] [–d] [–e eventfile] [–f freq] [–h heapsize] [–L] [–n]
       [–o outputfile] [–p from–to] [–r osboot] [–t max_sys_calls]
       [–u] [–v] [–x[codes]] [–dcacheoff] [–icacheoff] [–dynmem <val>]
       execfile [... optional args for executable]
OPTIONS
–29000|29005|29030|29035|29040|29050|29200|29205|29240|...
Select the 29K processor; the default is the Am29000. Depending on the processor selected, the old or the new simulator is used.
–cfg=xx
Normally the simulator starts execution at address 0, with the processor Configuration Register (CFG) set to the hardware default value. It is the responsibility of the application code or the osboot code to modify the CFG register as necessary. Alternatively, the CFG register can be initialized from the command line. The –cfg option specifies the setting for CFG, where xx is a 1- to 5-digit hex number. If the –cfg option is used, no run-time change to CFG will take effect, unless an Am292xx processor is in use. The –cfg option is seldom used; it should be used where an osboot file is not supplied with the –r option. Alternatively, it can be used to override the cache enable/disable operation of the osboot code. This enables the effects of cache to be determined without the need to build a new osboot file. The –cfg option is not supported by the newer simulator; caches can be disabled using the new –icacheoff and –dcacheoff options.
–d
This option instructs the simulator to report the contents of the processor registers in the log file at the end of simulation.
–dcacheoff
This option is only available with the newer simulator. When used, it causes the Configuration Register (CFG) to be set for data cache disable.
–dynmem <val>
During execution a program may access a memory region outside any loaded memory segment or the heap and stack regions. The simulator can be instructed to automatically allocate (val=1) memory for the accessed region. Alternatively (the default, val=0), an access violation is reported.
–e eventfile
An event file is almost always used. It enables memory system characteristics to be defined and the simulation to be controlled (see section 1.14.1).
–f frequency
Specify the CPU frequency in MHz. The default for the Am292xx and Am29035 is 16 MHz; the Am2900x default is 25 MHz; and the default frequency for the Am29030 and Am29050 is 40 MHz.
–h heapsize
This option specifies the amount of resource memory available to the simulated 29K system. This memory is used for register stack and memory stack support as well as the run-time heap. The default size is 32K bytes; that is, a heapsize value of 32.
–icacheoff
This option is only available with the newer simulator. When used it
causes the Configuration Register (CFG) to be set for instruction
cache disable.
–L
This option is similar in nature to the –cfg option. It can be used to select the large memory model for the Am292xx memory banks. Normally this selection is performed in the osboot file. However, the –L
option can be used to override the osboot settings, without having to
build a new osboot file. This option is currently not supported in the
newer simulator.
–n
Normally the simulator will allow access to the two words following
the end of a data section, without generating an access violation.
Some of the support library routines, such as strcpy(), used by 29K
application code, use a read–ahead technique to improve performance. If the read–ahead option is not supported, then the –n option
should be used. Only the older simulator supports this option. The
newer simulator always allows access to the words just past the end of
the data section.
–o outputfile
The simulator normally presents simulation results in the file sim.out. However, an alternative result file can be selected with this option.
–p from–to
The simulator normally produces results of a general nature, such as the average number of instructions per second. It is possible, using this option, to examine the operation of specific code sequences within the address range given by from and to.
–r osboot
The simulator can load two 29K executable programs via command–
line direction: osboot and program. It is normal to load an operating
system to deal with application support services; this is accomplished
with osboot. It is sometimes referred to as the romfile, because when
used with 29K family members which support separate ROM and
Instruction spaces, osboot is loaded into ROM space. AMD supplies
a HIF conforming operating system called OS–boot which is generally used with the –r option. Your simulation tool installation should
have a 29K executable file called osboot, romboot or even pumaboot
which contains the OS–boot code. Care should be taken to identify
and use the correct file. The newer simulator will automatically select
a default osboot file from the library directory if the –r option is not
used.
–t max_sys_calls
Specify the maximum number of system call types that will be used during simulation. This switch controls the internal management of the simulator; it is seldom used and has a default value of 256. This option is not supported by the newer simulator.
–u
The Am292xx microcontroller family members have built–in ROM
and DRAM memory controllers. Programmable registers are used to
configure the ROM and DRAM region controllers. If the –u option is
used, application code in file program can modify the controller settings, otherwise only code in osboot can effect changes. This protects
application code from accidentally changing the memory region configuration.
–v
The OS–boot operating system, normally used to implement the osboot file, can modify its warm–start operation depending on the value
in register gr104 (see section 7.4). The –v switch causes gr104 to be
initialized to 0. When OS–boot is configured to operate with or without MMU support, a run–time gr104 value of 0 will turn off MMU
use.
–x[code]
If a 29K error condition occurs during simulation, execution is not
stopped. The –x option can be used to cause execution to stop under a
selected range of error conditions. Note, the option is not supported
by the newer simulator. Each error condition is given a code letter. If –
x is used with no selected codes, then all the available codes are assumed active. Supported codes are:
A    Address error; data or instruction address out of bounds.
K    Kernel error; illegal operation in Supervisor mode.
O    Illegal opcode encountered.
F    Floating–point exception occurred, such as divide by zero.
P    A protection violation occurred in User mode.
S    An event file error was detected.
execfile
Name of the executable program to be loaded into memory; followed
by any command–line arguments for the 29K executable. It is important that the program be correctly linked for the intended memory system. This is particularly true for systems based on Am292xx processors. They have ROM and DRAM regions which can have very different memory access performance. If SRAM devices are to be used
in the ROM region, it is important that the application be linked for
the ROM region use rather than the DRAM.
It is best to run sim29 with the –r osboot option (this is the default operation with
the newer simulator). This is sometimes called cold–start operation. The osboot program must perform processor initialization, bringing the processor into what is
known as the warm–start condition. At this point, execution of the loaded program
commences. It is possible to run the older simulator without the use of an osboot file;
this is known as warm–start simulation. When this is done the simulator initializes
the processor special registers CFG and CPS to a predefined warm–start condition.
AMD documentation explains the chosen settings; they are different for each processor. Basically, the processor is prepared to run in User mode with traps and interrupts
enabled and cache in use.
To support osboot operation, the simulator prepares processor registers before
osboot operation starts (see Figure 1-27).
gr105   address of end of physical memory
gr104   operating system control info.
gr103   start of command line args (argv)
gr102   register stack size
gr101   memory stack size
gr100   first instruction of User loaded code
gr99    end address of program data
gr98    start address of program data
gr97    end address of program text
gr96    start address of program text
lr3     argument pointer, argv
lr2     argument count, argc
Figure 1-27. Register Initialization Performed by sim29
The initial register information is extracted from the program file. Via the register data, the osboot code obtains the start address of the program code. If osboot code
is not used (no –r command–line switch when using the older simulator), the 29K
Program Counter is initialized to the start address of program code, rather than address 0. To support direct entry into warm–start code, the program argument information is duplicated in lr2 and lr3. Normally this information is obtained by osboot using the data structure pointed to by gr103.
The simulator intercepts a number of HIF service calls (see section 2.2). These
services mainly relate to operating system functions which are not simulated, but
dealt with directly by the simulator. All HIF services with identification numbers below 256 are intercepted. Additionally service 305, for querying the CPU frequency,
is intercepted. Operating system services which are not intercepted must be dealt with by the
osboot code. The simulator will intercept a number of traps if the –x[codes] command line option is used; otherwise all traps are directed to osboot support code, or
any other trapware installed during 29K run–time.
1.14.1 The Simulation Event File
Simulation is driven by modeling the 29K processor pipeline operation. Instructions are fetched from memory, and make their way through the decode, execute and
write–back stages of the four–stage pipeline. Accurate modeling of processor internals enables the simulator to faithfully represent the operation of real hardware.
The simulator can also be driven from an event file. This file contains commands which are to be performed at specified time values. All times are given in processor cycles, with simulation starting at cycle 0. The simulator examines the event
file and performs the requested command at the indicated cycle time.
The syntax of the command file is very simple; each command is entered on a
single line preceded with an integer cycle–time value. There are about 15 to 20 different commands; most of them enable extra information to be placed in the simulation
results file. Information such as recording register value changes, displaying cache
memory contents, monitoring floating–point unit operation, and much more. A second group of commands are mainly used with microcontroller 29K family members. They enable the on–chip peripheral devices to be incorporated in the simulation. For example, the Am29200 parallel port can receive and transmit data from files
representing off–chip hardware.
In practice, most of these commands are little used, with one exception: the SET command (see note below). Most users of sim29 simply wish to determine how a code sequence, representative of their application code, will perform on different 29K family members with varying memory system configurations. The SET command is used to configure simulation parameters and define the characteristics of system memory and bus arrangements. I will only describe the parameters used with the MEM option to the SET command. The cycle–time value used with the commands of interest is zero, as the memory system characteristics are established before simulation commences. One other option to the SET command of interest is SHARED_ID_BUS; when used, it indicates the Instruction and Data buses are connected together. This option only makes sense with 3–bus members of the 29K family. All the 2–bus members already share a single bus for data and instructions, the second bus being used for address values. The syntax for the commands of interest is shown below:
0    SET SHARED_ID_BUS
0    SET MEM access TO value
Note, the SET command is accepted by both the older and newer versions of the
simulator. However, the newer version has an abbreviation to the SET command
shown below; the “SET MEM” syntax is replaced by a direct command and there is
no need for the “TO”.
0    SET MEM IWIDTH TO 32       older syntax
0    ROMWIDTH 32                newer syntax
0    romwidth 32                newer syntax
Am29000 and Am29050
Note, when the Instruction bus and Data busses are tied together with 3–bus processors, the ROM space is still decoded separately from the Instruction space. Tying
the busses together will reduce system performance, because instructions can no
longer be fetched from Instruction space, or ROM space, while the Data bus is being
used.
Considering only the most popular event file commands simplifies the presentation of sim29 operation and encourages its use. Those wishing to know more about event file command options should contact AMD, which readily distributes the sim29 executable software for popular platforms along with relevant documentation.
Table 1-5 shows the allowed access and value parameters for 3–bus members of
the 29K family, that is, the Am29000 and Am29050 processors. Off–chip memory
can exist in three separately addressed spaces: Instruction, ROM, and Data. Memory
address–decode and access times (in cycles) must be entered for each address space
which will be accessed by the processor; default values are provided.
Table 1-5. 3–bus Processor Memory Modeling Parameters for sim29

Instruction  ROM       Data        Value    Default  Operation
IDECODE      RDECODE   DDECODE     0–n      0        Decode address
IACCESS      RACCESS   DRACCESS    1–n      1        First read
–            –         DWACCESS    1–n      1        First write
IBURST       RBURST    DBURST      T|F      false    Burst–mode supported
IBACCESS     RBACCESS  DBRACCESS   1–n      1        Burst read
–            –         DBWACCESS   1–n      1        Burst write
If a memory system supports burst mode, the appropriate *BURST access parameter must be set to value TRUE. The example below sets Instruction memory accesses to two cycles; subsequent burst mode accesses are single–cycle. The example
commands only affect Instruction memory; additional commands are required to establish Data memory access characteristics. Many users of the simulator only require
memory modeling parameters from Table 1-5, even if DRAM is in use.
0    SET MEM IACCESS TO 2
0    SET MEM IBURST TO true
0    SET MEM IBACCESS TO 1
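For completeness, a comparable set of commands can establish the Data memory characteristics. The sketch below simply assumes the Data-space parameter names of Table 1-5 and the same 2-cycle first access with single-cycle bursts:

0    SET MEM DRACCESS TO 2
0    SET MEM DWACCESS TO 2
0    SET MEM DBURST TO true
0    SET MEM DBRACCESS TO 1
0    SET MEM DBWACCESS TO 1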
If DRAM memory devices are used, there are several additional access parameters which can be used to support memory system modeling (see Table 1-6). DRAM
devices are indicated by the *PAGEMODE parameter being set. The 29K family internally operates with a page size of 256 words; external DRAM memory always operates with integer multiples of this value. For this reason, there is never any need to
change the *PGSIZE parameter setting from its default value. The first read access to
DRAM memory takes *PFACCESS cycles; second and subsequent read accesses
take *PSACCESS cycles. However, if the memory system supports burst mode, subsequent read accesses take *PBACCESS cycles rather than *PSACCESS.
If static column DRAM memories are used, then memory devices do not require
CAS signals between same–page accesses. Static column memory use is indicated by
the *STATCOL parameter. Initial page accesses suffer the additional *PRECHARGE access penalties; subsequent accesses all have the same access latency. Note, burst
mode access can also apply to static column DRAM memory. Table 1-7 shows
memory modeling parameters for static column memories.
Table 1-6. 3–bus Processor DRAM Modeling Parameters for sim29 (continued)
Instruction  ROM         Data         Value  Default  Operation
IPAGEMODE    RPAGEMODE   DPAGEMODE    T|F    false    Memory is paged
IPGSIZE      RPGSIZE     DPGSIZE      1–n    256      Page size in words
IPFACCESS    RPFACCESS   DPFRACCESS   1–n    1        First read in page mode
–            –           DPFWACCESS   1–n    1        First write in page mode
IPSACCESS    RPSACCESS   DPSRACCESS   1–n    1        Secondary read within page
–            –           DPSWACCESS   1–n    1        Secondary write within page
IPBACCESS    RPBACCESS   DPBRACCESS   1–n    1        Burst read within page
–            –           DPBWACCESS   1–n    1        Burst write within page
Table 1-7. 3–bus Processor Static Column Modeling Parameters for sim29 (continued)
Instruction  ROM         Data         Value  Default     Operation
ISTATCOL     RSTATCOL    DSTATCOL     T|F    false       Static column memory used
ISMASK       RSMASK      DSMASK       –      0xffffff00  Column address mask, 64–words
IPRECHARGE   RPRECHARGE  DPRECHARGE   0–n    0           Precharge on page crossing
ISACCESS     RSACCESS    DSRACCESS    1–n    1           Read access within static column
–            –           DSWACCESS    1–n    1           Write access within static column
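As an illustration only, the following event file lines sketch page-mode DRAM in the Instruction space, using the parameter names of Table 1-6 with a 3-cycle first in-page access, 2-cycle secondary accesses and single-cycle bursts; the cycle counts are assumptions, not recommendations:

0    SET MEM IPAGEMODE TO true
0    SET MEM IPFACCESS TO 3
0    SET MEM IPSACCESS TO 2
0    SET MEM IBURST TO true
0    SET MEM IPBACCESS TO 1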
Separate regions of an address space may contain more than one type of
memory device and control mechanism. To support this, memory banking is provided for in the simulator (see Table 1-8). The [I|R|D]BANKSTART parameter is
used to specify the start address of a memory bank; a bank is a contiguous region of
memory of selectable size, within an indicated address space. Once the *BANKSTART command has been used, all following commands relate to the current bank,
until a new bank is selected. This type of command is more frequently used with microcontroller members of the 29K family.
Table 1-8. 3–bus Processor Memory Modeling Parameters for sim29 (continued)
Instruction  ROM         Data         Value  Default  Operation
IBANKSTART   RBANKSTART  DBANKSTART   0–n    –        Start address of memory region
IBANKSIZE    RBANKSIZE   DBANKSIZE    1–n    1        Size in bytes of memory region
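A sketch of bank selection is shown below; the address and size values are purely illustrative. Once the bank has been selected, the access parameters that follow apply to that bank:

0    SET MEM DBANKSTART TO 0x400000
0    SET MEM DBANKSIZE TO 0x100000
0    SET MEM DPAGEMODE TO true
0    SET MEM DPFRACCESS TO 3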
Am29030 and Am29035
The parameters used with the SET command, when simulating 2–bus 29K family members are a little different from 3–bus parameters (see Table 1-9). The parameters shown are for the older simulator, but they are accepted by the new simulator. For
a list of alternative parameters, which are only accepted by the newer simulator, see
the following Am29040 section. There is no longer a ROM space, and although
instructions and data can be mixed in the same memory devices, separate modeling
parameters are provided for instruction and data accesses.
Table 1-9. 2–bus Processor Memory Modeling Parameters for older sim29
Instruction  Data         Value     Default  Operation
IACCESS      DRACCESS     2–n       2        First read from SRAM
–            DWACCESS     2–n       2        First write from SRAM
IBURST       DBURST       T|F       true     Burst–mode supported
IBACCESS     DBRACCESS    1–n       1        Burst read within page
–            DBWACCESS    1–n       1        Burst write within page
IWIDTH       DWIDTH       8,16,32   32       Memory width
IPRECHARGE   DPRECHARGE   0–n       0        Precharge on page crossing
IPACCESS     DPRACCESS    2–n       2        First access in page mode
–            DPWACCESS    2–n       2        First write in page mode
IBANKSTART   DBANKSTART   0–n       –        Start address of memory region
IBANKSIZE    DBANKSIZE    1–n       1        Size in bytes of memory region
HALFSPEED    HALFSPEED    T|F       false    Memory system is 1/2 CPU speed
Consider accessing memory for instructions: IACCESS gives the access time, unless DRAM is used, in which case the access time is given by IPACCESS. The use of DRAM is indicated by the *PRECHARGE parameter value being non-zero. First accesses to DRAM pages suffer an additional access delay of *PRECHARGE. If burst mode is supported, with all memory device types, the access times for instruction memory, other than the first access, are given by IBACCESS.
Both the current 2–bus 29K family members support Scalable Clocking, enabling a half speed external memory system. They also support narrow, 8–bit or 16–bit,
memory reads. The Am29035 processor also supports dynamic bus sizing. All external memory accesses can be 16–bit or 32–bit; processor hardware takes care of multiple memory accesses when operating on 32–bit data. As with the 3–bus 29K family
members, the simulator provides for memory banking. This enables different
memory devices to be modeled within specified address ranges.
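For instance, a 2-bus system with page-mode DRAM instruction memory might be sketched with the older-syntax parameters of Table 1-9 as shown below; the cycle counts are illustrative assumptions only:

0    SET MEM IPRECHARGE TO 1
0    SET MEM IPACCESS TO 2
0    SET MEM IBURST TO true
0    SET MEM IBACCESS TO 1
0    SET MEM IWIDTH TO 32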
Alternative Am29030, Am29035 and Am29040
As stated in the previous section, the newer sim29 can accept the memory modeling parameters used by the older sim29. However, the newer simulator can also operate with alternative modelling commands; these are shown in Table 1-10. Commands can be in upper or lower case, but they are shown here in lower case. A list of available simulator commands can be obtained by issuing the command "sim29 –29040 –help". An example of Am29040 processor simulation can be found in section 8.1.3.
Table 1-10. 2–bus Processor Memory Modeling Parameters for newer sim29
Command        Value                 Operation
rombank        <adds> <size>         Size and address of ROM/SRAM
rambank        <adds> <size>         Size and address of DRAM
halfspeedbus   true|false            Scalable Clocking (default=false)
logging        true|false            Logging to file sip.log (default=false)

ROM/SRAM   DRAM           Value    Default  Operation
romread    ramread        2–n      2        First read
romwrite   ramwrite       2–n      2        First write
romburst   ramburst       T|F      true     Enable burst mode addressing
rombread   rambread       1–n      1        Burst read within page
rombwrite  rambwrite      1–n      1        Burst write within page
rompage    rampage        T|F      true     Enable page mode
rompread   rampread       2–n      2        Single read within page
rompwrite  rampwrite      2–n      2        Single write within page
romwidth   ramwidth       16,32    32       Bit width of memory
–          ramprecharge   0–n      0        DRAM precharge time
–          rampprecharge  0–n      0        Page mode DRAM precharge
–          ramrefrate     0–n      0        DRAM refresh rate (0=off)
ROM and SRAM memory types are modeled with the same set of commands.
The simulator allocates a default ROM/SRAM memory bank starting at address 0x0.
Unless a RAMBANK command is used to allocate a DRAM memory section at a low
memory address, all code and data linked for low memory addresses will be allocated
to the default ROM/SRAM memory bank.
DRAM memory is modelled with the RAM* modelling commands. A default
DRAM memory section is established at address 0x4000,0000. Unless a
ROMBANK command is used to allocate a ROM/SRAM memory bank at this
address range, all accesses to high memory will be satisfied by the default DRAM
memory.
The default linker command files used with the High C 29K tool chain typically link programs for execution according to the above default memory regions. However, older releases of the compiler tool chain (or other tool chains) may link for different memory models. This would require the use of RAMBANK–type commands to establish the correct memory model. Alternatively, a compiler command file could be used to ensure a program is linked for the default simulator memory model (see section 2.3.6).
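The following event file sketch uses the newer command set of Table 1-10; the bank addresses, sizes and cycle counts are illustrative only:

0    rombank 0x0 0x80000
0    romread 2
0    romburst true
0    rombread 1
0    rambank 0x40000000 0x400000
0    ramread 3
0    rampage true
0    rampread 2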
Am29200 and Am29205
The simulator does not maintain different memory access parameters for
instruction and data access when modeling microcontroller members of the 29K
family. However, it does support separate memory modeling parameters for DRAM
and ROM address regions (see Table 1-11). Each of these two memory regions has its
own memory controller supporting up to four banks. A bank is a contiguous range of
memory within the address range accessed via the region controller. The DRAM region controller is a little more complicated than the ROM region controller. The parameters shown in Table 1-11 are for the older simulator, but they are accepted by the
new simulator. For a list of alternative parameters, which are only accepted by the
newer simulator, see the following Am29240 section.
The DRAM access is fixed at four cycles (1 for precharge + 3 for latency); it cannot be programmed. Subsequent accesses to the same page take four cycles unless pagemode memories are supported. Note the first access is only three cycles rather than four, as the RAS will already have met the precharge time. Basically, to precharge the RAS bit lines, all RAS lines need to be taken high in between each change of the row addresses. A separate cycle is needed for precharge when back–to–back DRAM accesses occur. Use of pagemode memories is indicated by the PAGEMODE parameter being set; when used, the processor need not supply RAS memory strobe signals before page CAS strobes for same-page accesses. This reduces subsequent page
access latency to three cycles. Additionally, when pagemode is used and a data burst
is attempted within a page, access latency is two cycles. The DRAM memory width
can be set to 16 or 32–bits. Of course when an Am29205 is used, all data memory
accesses are restricted by the 16–bit width of the processor data bus.
To explain further, access times to DRAM for non-pagemode memories follow the sequence:
X,3,4,4,4,X,3,4,4,4,X,X,3,X,3,...
Where X is a non–DRAM access, say to ROM or PIA space. For DRAM systems supporting pagemode the sequence would be:
X,3,2,2,2,<boundary crossing>,4,2,2,<boundary crossing>,X,3,2,2,2
Memory devices located in ROM space can be modeled with a wider range of
parameter values. Both SRAM and ROM devices can be modeled in ROM space. Using the RBANKNUM parameter, the characteristics of each bank can be selectively
described. Burst–mode addressing is only supported for instruction or data reading.
When the burst option is used (RBURST set to TRUE), read accesses, other than the
first for a new burst, take RBACCESS cycles rather than the standard RRACCESS
cycles. Memory device widths can be 8, 16 or 32–bits. If an Am29205 microcontroller is being modeled, memory accesses wider than the 16–bit bus width always require the processor to perform multiple memory transfers to access the required
memory location.
Table 1-11. Microcontroller Memory Modeling Parameters for sim29
ROM/SRAM   Value    Default   DRAM        Value  Default (Am29200)  Operation
–          –        –         (fixed)     –      1                  Precharge on page crossing
RRACCESS   1–n      1         (fixed)     –      3                  First read
RWACCESS   2–n      2         (fixed)     –      3                  First write
RBURST     T|F      false     –           –      –                  Burst address in ROM region
RBACCESS   1–n      1         (fixed)     –      2                  Burst read within page
–          –        –         (fixed)     –      2                  Burst write within page
ROMWIDTH   8,16,32  32        DRAMWIDTH   16,32  32                 Width of memory
–          –        –         PAGEMODE    T|F    false              Page mode supported
RBANKNUM   0–3      –         DBANKNUM    0–3    –                  Select which memory bank
Preparing sim29 for modeling an Am29200 system is not difficult. The following commands configure the first two ROM banks to access non–burst–mode memories which are 32–bits wide, and have a 1–cycle read access, and a 2–cycle write access.
0    COM ROM bank 0 parameters
0    SET MEM rbanknum to 0
0    SET MEM rraccess to 1
0    SET MEM rwaccess to 2

0    COM ROM bank 1 parameters
0    SET MEM rbanknum to 1
0    SET MEM rraccess to 1
0    SET MEM rwaccess to 2
The following DRAM parameters, like the ROM parameters above, are correct
for modeling an SA29200 evaluation board. The first DRAM bank is configured to
support pagemode DRAM access, giving access latencies of 4:3:2 (4 for first, 3 for
same–page subsequent, unless they are bursts which suffer only 2–cycle latency).
0    COM DRAM bank 0 parameters
0    SET MEM dbanknum to 0
0    SET MEM dpagemode to true
Alternative Am2920x and Am2924x
As stated in the previous section, the newer sim29 can accept the memory modeling parameters used by the older sim29. However, the newer simulator can also operate with alternative modelling commands; these are shown in Table 1-12. Commands can be in upper or lower case, but they are shown here in lower case. A list of available simulator commands can be obtained by executing the command "sim29 –29240 –help". An example of Am29200 microcontroller simulation can be found in section
8.1.
Table 1-12. Microcontroller Processor Memory Modeling Parameters for newer sim29
Command        Value                  Operation
rombank        <adds> <size>          Size and address of ROM/SRAM
rambank        <adds> <size>          Size and address of DRAM
halfspeedbus   true|false             Scalable Clocking (default=false)
logging        true|false             Logging to file sip.log (default=false)
parallelin     <file> [<speed>]       Parallel port input file
parallelout    <file> [<speed>]       Parallel port output file
serialin       a|b <file> [<baud>]    Serial port, a or b, input file
serialout      a|b <file> [<baud>]    Serial port, a or b, output file

ROM/SRAM  DRAM        Value    Default (Am29240)  Operation
romread   –           1–n      1                  First read
romwrite  –           2–n      2                  First write
romburst  –           T|F      false              Enable burst mode addressing
rombread  –           1–n      1                  Burst read within page
–         rampage     T|F      true               Enable page mode
romwidth  ramwidth    8,16,32  32                 Bit width of memory
–         ramrefrate  0–n      255                DRAM refresh rate (0=off)
ROM and SRAM memory types are modeled with the same set of commands.
The simulator automatically allocates ROM/SRAM memory bank 0. Using the ROMBANK parameter, the characteristics of each bank can be selectively described. The default parameters are typical of a relatively fast memory system.
The DRAM memory access times are fixed by the processor specification.
However, there are some DRAM modelling commands enabling selection of the memory system width and of pagemode devices. The simulator automatically allocates
DRAM memory bank 0 at address 0x4000,0000. All accesses to memory above this
address will be satisfied by the DRAM memory bank.
It is usually less of a problem linking programs for execution on a 29K microcontroller, as the processor hardware dictates, to some extent, the allowed memory regions. The default linker command files used with the High C 29K tool chain typically link programs for execution according to the processor-specific memory regions. Compiler command files are described in section 2.3.6.
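As a sketch of the newer microcontroller commands from Table 1-12, an event file might configure the default ROM bank and attach files to the on-chip serial and parallel ports; the bank size, baud rate and file names below are illustrative only:

0    rombank 0x0 0x80000
0    romread 1
0    serialout a uart_a.out 9600
0    parallelin pport.in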
1.14.2 Analyzing the Simulation Log File
Running the architectural simulator is simple but rather slow. The inclusion of
detail about the processor pipeline results in slow simulation speeds. For this reason,
users typically select a portion of their application code for simulation. This portion
is either representative of the overall code or subsections whose operation is critical
to overall system performance.
Older sim29 Log File Format
For demonstration purposes I have merely simulated the “hello world” program
running on an Am29000 processor. The C source file was compiled with the High C
29K compiler using the default compiler options; object file hello was produced by
the compile/link process. The memory model was the simulator default, single–cycle
operation. Given the selection of default memory parameters, there is no need for an
eventfile establishing memory parameters. However, I did use an eventfile with the
following contents:
0    log on channel
This option has not previously been described; it enables the simulator to produce an additional log file of channel activity. This can occasionally be useful when
studying memory system operation in detail. The simulator was started with the command:
sim29 –29000 –r /gnu/29k/src/osboot/sim/osboot –e eventfile hello
Two simulation result files were produced; the most important of these, the default simulation output file sim.out, we shall briefly examine. The channel.out file reports all instruction and data memory access activity. The contents of the sim.out file are shown below exactly as produced by the simulator:
AMD ARCHITECTURAL SIMULATOR, V# 1.0–17PC
### T=3267 Am29000 Simulation of "hello" complete –– successful
–––––––––––––––––––––––––––––––––––––––––––––––––––––
<<<<< S U M M A R Y   S T A T I S T I C S >>>>>
CPU Frequency = 25.00MHz
total instructions = 2992    Nops: 50
User Mode:        291 cycles (0.00001164 seconds)
Supervisor Mode: 2977 cycles (0.00011908 seconds)
Total:           3268 cycles (0.00013072 seconds)
Simulation speed: 22.89 MIPS (1.09 cycles per instruction)
–––––––––– Pipeline ––––––––––
8.45% idle pipeline:
    6.46% Instruction Fetch Wait
    0.46% Data Transaction Wait
    0.18% Page Boundary Crossing Fetch Wait
    0.00% Unfilled BTCache Fetch Wait
    0.49% Load/Store Multiple Executing
    0.03% Load/Load Transaction Wait
    0.83% Pipeline Latency
Total Wait: 276 cycles (0.00001104 seconds)
–––––––––– Branch Target Cache ––––––––––
Partial hits:             0
Branch btcache access:    2418
Branch btcache hits:      2143
Branch btcache hit ratio: 88.63%
–––––––––– Translation Lookaside Buffer ––––––––––
TLB access:    0
TLB hits:      0
TLB hit ratio: 0.00%
–––––––––– Bus Utilization ––––––––––
Inst Bus Utilization: 70.01%    2288 Instruction Fetches
Data Bus Utilization: 10.86%    20 Loads  335 Stores
–––––––––– Register File Spilling/Filling ––––––––––
0 Spills, 0 Fills
Opcode Histogram
ILLEGAL:    CONSTN:6    CONSTH:68   CONST:121
MTSRIM:5    CONSTHZ:    LOADL:      CLZ:
EXBYTE:
. . .
System Call Count Histogram
EXIT     1:1
GETARGS  260:1
SETVEC   289:2
. . .
–––––– M E M O R Y   S U M M A R Y ––––––
Memory Parameters for Non–banked Regions
I_SPEED: Idecode=0 Iaccess=1 Ibaccess=1
. . .
The simulator reports the total number of processor cycles simulated. Because
our example is brief, there are few User mode cycles. Most cycles are utilized by the
osboot operating system. The operating system runs in Supervisor mode and initializes the processor to run the “hello world” program in User mode. The fast memory
system has enabled the processor pipeline to be kept busy; an 8.45% idle pipeline is
reported. A breakdown of the activities contributing to pipeline stalling is shown.
Next reported is the Branch Target Cache (BTC) activity. If a processor incorporating an Instruction Cache Memory rather than a BTC had been simulated, the corresponding results would replace the BTC results shown. There were 2418 BTC accesses, of which 2143 found valid entries. This gives a hit ratio of 88.63%. Partial hits
refer to the number of BTC entries which were not fully used. This occurs when one
of the early entries in the four–entry cache block contains a jump.
If the operating system had arranged for Translation Look–Aside Buffer (TLB) use, then the next section would report its activity. In the example, the application ran with physical addressing, which does not require TLB support. Next reported is bus activity. The large number of processor registers results in little off–chip data memory access, and hence low Data Bus utilization. The Instruction Bus is used to fill the Instruction Prefetch Buffer and BTC, and shows much higher utilization. Typically, programs are more sensitive to instruction memory performance than to data memory performance.
The simulator then produces a histogram of instruction and system call usage.
The listing above only shows an extract of this information, as it is rather large. Examining this data can reveal useful information, such as extensive floating–point
instruction use.
Finally reported is a summary of the memory modeling parameters used during
simulation. This information should match with the default parameters or any parameters established by the eventfile. It is useful to have this information recorded along
with the simulation results.
Newer sim29 Log File Format
As with the previous demonstration, the “hello world” program is used here to
show the output format of the newer architectural simulator. The selected processor
is this time an Am29240 microcontroller. The C source file was compiled with the
High C 29K compiler using the –O4 compiler options; object file hello was produced
by the compile/link process. The memory model was the simulator default. Given the
selection of default memory parameters, there is no need for an eventfile to establish
memory parameters. The simulator was started with the command shown below.
Note, there is no need to use the –r option and specify an osboot file.
sim29 –29240 hello
The simulation result file, sim.out, was produced. The contents of the sim.out
file are shown below exactly as produced by the simulator:
Am292xx Architectural Simulator, Version# 2.4
Command line: /usr/29k/bin/sim240 –29240 hello
Boot file: /usr/29k/lib/osb24x
Text section: 00000000 – 0000001f
Text section: 00000020 – 00000333
Text section: 00000340 – 0000035f
Text section: 00000360 – 00006b6b
BSS section: 40000400 – 400007df
Application file: hello
Text section: 40010000 – 4001332b
Text section: 4001332c – 4001333b
Text section: 4001333c – 4001334b
Data section: 40014000 – 40014993
Lit section: 40014994 – 40014c63
BSS section: 40014c64 – 40014ca3
Argv memory: 400150a0 – 4001589f
Heap memory: 40015ca0 – 40035c9f
Memory stack: 40fbf7f0 – 40fdffef
Register stack: 40fdfff0 – 410007ef
Vector Area: 40000000 – 400003ff
ROM:  Address     Size  Rd  Wr  Bmd  BRd  Wid
      0x0         *     1   1   0    1    32
RAM:  Address     Size  Rd  Wr  Pmd  PRd  PWr  Wid  Ref
      0x40000000  *     2   2   1    1    1    32   255
Half speed memory = 0
Starting simulation...
hello world
HIF Exit: Value = 12
Simulation summary:
Cycles: 7101
Supervisor mode = 100.0%
User mode = 0.0%
MIPS = 18.8 (25.0 Mhz * ((5342 instructions)/(7101 cycles)))
Pipeline:
Average run length= 5.9 instructions between jumps taken
Fetches not used due to jumps = 299
PipeHold: 1759 cycles = 24.8%
Fetch waits: 1520 cycles = 21.4%
Load waits: 133 cycles = 1.9%
Store waits: 79 cycles = 1.1%
Load Multiple waits: 3 cycles = 0.0%
Store Multiple waits: 24 cycles = 0.3%
Channel:
    Rom: accesses = 809
    Rom: average cycles per access = 1.0
    Ram: accesses = 1959
    Ram: average cycles per access = 1.7
    Ram: average cycles waiting for precharge = 0.2
    Ram: average cycles waiting for refresh = 0.2
Instruction Cache Size = 4 Kbytes
Hit ratio = 66.4% (3766/5673)
Data Cache Size = 2 Kbytes
Hit ratio = 63.6% (136/214)
The format of the log file will appear familiar to those experienced with the older
architectural simulator; the total number of processor cycles simulated is reported.
There are no User mode cycles as the default osboot (osb24x) executed the hello
program in Supervisor mode. Most cycles are utilized by the osboot operating system. The relatively fast memory system has enabled the processor pipeline to be kept
busy; a 24.8% idle pipeline is reported. A breakdown of the activities contributing to
pipeline stalling is shown. Most pipeline stalls are due to instruction fetching; the
DRAM memory has a 2–cycle first access time, rather than the ideal 1–cycle. The
newer simulator reports the average number of instructions executed between jump
or branch instructions. The run length is shown to be 5.9 instructions, which is typical
of a 29K program.
Next reported is Channel activity. All load and store instructions make use of the
Channel. Statistics are presented separately for the ROM/SRAM and DRAM
memory systems. Typically, performance is much more sensitive to instruction
memory access rather than accesses to data. This is particularly true with the 29K
family due to its large number of on–chip registers.
Next reported is on–chip cache activity. There were 5673 accesses to the
instruction cache, of which 66.4% found valid entries. The Am29240 has the benefit
of a data cache and the results are shown. The hello program is small and only 214
data cache accesses were made, of which 63.6% hit in the cache.
Reported in the sim.out file before simulation started are the memory modeling
parameters used during simulation. This information should match with the default
parameters or any parameters established by the eventfile. It is useful to have this information recorded along with the simulation results. The values reported are shown
again below:
ROM:  Address     Size  Rd  Wr  Bmd  BRd  Wid
      0x0         *     1   2   0    1    32
RAM:  Address     Size  Rd  Wr  Pmd  PRd  PWr  Wid  Ref
      0x40000000  *     2   2   1    1    1    32   255
Half speed memory = 0
The ROM section refers to both ROM and SRAM memory. The tokens used are
a little cryptic. For example, “Rd” refers to memory read cycles. And “BRd” refers to
burst mode read times. The option to use Scalable Clocking was not selected; “Half
speed memory” is set to false.
Chapter 2
Applications Programming
Application programming refers to the process of developing task specific software. Typical 29K tasks are controlling a real–time process, processing communications data, processing real–time digital signals, and manipulating video images. There are many more types of applications, such as word processing, for which the 29K is suited, but the 29K is better known in the embedded engineering community, which typically deals with real–time processing.
This chapter deals with aspects of application programming which the Software
Engineer is required to know. Generally, computer professionals spend more time
developing application code, compared to other software development projects such
as operating systems. Additionally, applications are increasingly developed in a high
level language. Since C is the dominant language for this task, I shall present code
examples in terms of C. Assembly level programming is dealt with in a separate
chapter.
The first part of this chapter deals with the mechanism by which one C procedure calls another, and how they agree to communicate data and make use of processor resources [Mann et al. 1991b]. This is termed the Calling Convention. It is possible that different tool developers could construct their own calling mechanism, but
this may lead to incompatibilities in mixing routines compiled by different vendor
tools. AMD avoided this problem by devising a calling convention which was
adopted by all tool developers. Detailed knowledge, of say, individual register support tasks for the calling convention is not presented, except for the register and
memory stacks which play an important role in the 29K calling mechanism. In practice, C language developers typically do not need to be concerned about individual
register assignments, as it is taken care of by the compiler [Mann 1991c]. Chapter 3
expands on register assignment, and it is of concern here only in terms of understanding the calling convention concepts and run–time efficiencies.
Operating system support services (HIF services) are then dealt with. The transition from operating system to the application main() routine is described. Operating system services along with other support routines are normally accessed through
code libraries. These libraries are described for the predominant tool–chains. Using
the available libraries and HIF services, it is an easy task to arrange for interrupts to be
processed by C language handler routines; the mechanism is described. Finally, utility programs for operations such as PROM preparation are listed and their capabilities presented.
2.1 C LANGUAGE PROGRAMMING
Making a subroutine call on a processor with general-purpose registers is expensive in terms of time and resources. Because functions must compete for register
use, registers must be saved and restored through register-to-memory and memoryto-register operations. For example, a C function call on the MC68000 processor
[Motorola 1985] might use the statements:
char bits8;
short bits16;
printf ("char=%c short=%d", bits8, bits16);
After they are compiled, the above statements would generate the assembly-level code shown below:
L15:    .ascii  "char=%c short=%d\0"

        MOVE.W  –4[A6],D0      ;copy bits16 variable
        EXT.L   D0             ; to register
        MOVE.L  D0,–[A7]       ;now push on stack
        MOVE.B  –1[A6],D0      ;copy bits8 variable
        EXTB.L  D0             ; to register
        MOVE.L  D0,–[A7]       ;now push on stack
        PEA     L15            ;stack text string pointer
        JSR     _printf
        LEA     12[A7],A7      ;repair stack pointer
The assembly listing above shows how parameters pass via the memory stack to
the function being called. The LINK instruction copies the stack pointer A7 to the
local frame pointer A6 upon entry to a routine. Within the printf() routine, the parameters passed and local variables in memory are referenced relative to register A6.
To reduce future access delays, the printf() routine will normally copy data to
general-purpose registers before using them. For instance, using a memory-to-memory operation when moving data from the local frame of the function call stack
would reduce the number of instructions executed. However, these are CISC instructions that require several machine cycles before completion.
In the example, the C function call passes two variables, bits8 and bits16, to the
library function printf(). The following assembly code shows part of the printf()
function for the MC68020.
_printf:
        LINK    A6,#–32        ;local variable space
        LEA     8[A6],A0       ;unstack string pointer
        . . .
        UNLK    A6
        RTS                    ;return
Several multi–cycle instructions (like LINK and UNLK) are required to pass
the parameters and establish the function context. Unlike the variable instruction format in the MC68020, the 29K processor family has a fixed 32–bit instruction format
(see section 1.11). The same C statements compiled for the Am29000 processor generate the following assembly code for passing the parameters and establishing the
function context:
L1:     .ascii  "char=%c short=%d\0"

        const   lr2,L1
        consth  lr2,L1
        add     lr3,lr6,0      ;move bits8 and bits16
        add     lr4,lr8,0      ;to bottom of the
                               ;activation record
        call    lr0,printf     ;return address in lr0
The number of instructions required is certainly less, and they are all simple
single–cycle RISC instructions. However, to better understand just how parameters
are passed during a function call, explanation of the procedure activation records and
their use of the local register file is first required.
2.1.1 Register Stack
A register stack is assigned an area of memory used to pass parameters and allocate working registers to each procedure. The register cache replaces the top of the
register stack, as shown in Figure 2-1. All 29K processors have a 128–word local
register file; these registers are used to implement the cache for the top of the register
stack. Note, if desired only a portion of the 128–word register file need be allocated to
register cache use (see section 2.3.2).
The global registers rab (gr126) and rfb (gr127) point to the top and the bottom
of the register cache. Global register rsp (also known as gr1) points to the top of the
register stack. The register cache, or stack window, moves up and down the register
stack as the stack grows and shrinks. Use of the register cache, rather than the
memory portion of the register stack, allows data to be accessed through local registers at high speed. On–chip triple–porting of the register file (two read ports and one
write port for most 29K family members), enables the register stack to perform better
than a data memory cache, which cannot support read and write operations in the
same cycle.
[Figure: the register stack occupies external memory, growing down from high addresses; its cache-resident top is held in the register cache (the local register file). rfb points to the bottom of the cache register window, rab points to the top of the cache register window, and rsp points to the top of the register stack. The cache register window moves up and down the register stack as the stack grows and shrinks.]
Figure 2-1. Cache Window
2.1.2 Activation Records
A 29K processor does not apply push or pop instructions to external memory
when passing procedure parameters. Instead each function is allocated an activation
record in the register cache at compile time. Activation records hold any local variables and parameters passed to functions.
The caller stores its outgoing arguments at the bottom of the activation record. The called function establishes a new activation record below the caller’s record. The top of the new record overlaps the bottom of the old record, so that the outgoing parameters of the calling function are visible within the called function’s activation record.
Although the activation record can be any size within the limits of the physical
cache, the compiler will not allocate more than 16 registers to the parameter-passing
part of the activation record. Functions that cannot pass all of their outgoing parameters in registers must use a memory stack for additional parameters; global register
msp (gr125) points to the top of the memory stack. This happens infrequently, but is
required for parameters that have their address taken (for example in C, &variable).
Data parameters at known addresses cannot be supported in register address space
because data addresses always refer to memory, not to registers.
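As a small illustrative sketch (the function names are hypothetical), the C fragment below shows the consequence of taking a variable’s address: that variable must be kept on the memory stack, because data addresses refer to memory rather than registers, while other locals and the first few parameters can travel in local registers.

extern void consume(int *p, int q);

void produce(int x)
{
    int t = x + 1;       /* may be kept in a local register              */
    int y = x + 2;       /* address taken below, so it must be placed    */
                         /* on the memory stack (pointed to by msp)      */
    consume(&y, t);      /* &y refers to memory, never to a register     */
}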
The following code shows part of the printf() function for the 29K family:
printf:
        sub     gr1,gr1,16         ;function prologue
        asgeu   V_SPILL,gr1,rab    ;compare with top of window
        add     lr1,gr1,36         ;rab is gr126
        . . .
        jmpi    lr0                ;return
        asleu   V_FILL,lr1,rfb     ;compare with bottom
                                   ;of window gr127
The register stack pointer, rsp, points to the bottom of the current activation record. All local registers are referenced relative to rsp. Four new registers are required
to support the function call shown, so rsp is decremented 16 bytes. Register rsp performs a role similar to the MC68000’s A7 and A6 registers, except that it points to data
in high-speed registers, not data in external memory.
The compiler reserves local registers lr0 and lr1 for special duties within each
activation record. The lr0 contains the execution starting address when it returns to
the caller’s activation record. The lr1 points to the top of the caller’s activation record; the new frame allocates local registers lr2 and lr3 to hold printf() function local
variables.
As Figure 2-2 shows, the positions of five registers overlap. The three printf()
parameters enter from lr2, lr3 and lr4 of the caller’s activation record and appear as
lr6, lr7 and lr8 of the printf() function activation record.
2.1.3 Spilling And Filling
If not enough registers are available in the cache when it moves down the register stack, then a V_SPILL trap is taken, and the registers spill out of the cache into
memory. Only procedure calls that require more registers than currently are available
in the cache suffer this overhead.
Once a spill occurs, a fill (V_FILL trap) can be expected at a later time. The fill
does not happen when the function call causing the spill returns, but rather when
some earlier function that requires data held in a previous activation record (just below the cache window) returns. Just before a function returns, the lr1 register, which
points to the top of the caller’s activation record, is compared with the pointer to the
[Figure: the caller’s activation record and the printf() activation record overlap in the local registers. The caller’s outgoing parameters (lr2, lr3, lr4) appear as lr6, lr7, lr8 at the top of the printf() activation record; in each record lr1 holds the frame pointer and lr0 the return address. The printf() activation record is 9 words; register gr1 (rsp) is lowered 4 words (16 bytes) in the prologue of printf().]
Figure 2-2. Overlapping Activation Record Registers
bottom of the cache window (rfb). If the activation record is not stored completely in
the cache, then a fill overhead occurs.
The register stack improves the performance of call operations because most
calls and returns proceed without any memory access. The register cache contains
128 registers, so very few function calls or returns require register spilling or filling.
Because most of the data required by a function resides in local registers, there is
no need for elaborate memory addressing modes, which increase access latency. The
function-call overhead in the 29K family consists of a small number of single-cycle
instructions; the overhead in the MC68020 requires a greater number of multi-cycle
instructions.
2.1.4 Global Registers
In the discussion of activation records (section 2.1.2), it was stated that functions can use activation space (local registers) to hold procedure variables. This is
true, but procedures can also use processor global registers to hold variables. Each
29K processor has a group of registers (global registers) which are located in the register file, but are not part of the register cache. Global registers gr96–gr127 are used
by application programs. When developing software in C, there is no need to know
just how the compiler makes use of these global registers; the Assembly Level Programming chapter, section 3.3, discusses register allocation in detail.
Data held in global registers, unlike procedure activation records, do not survive
procedure calls. The compiler has 25 global registers available for holding temporary
variables. These registers perform a role very similar to the eight–data and eight–address general purpose registers of the MC68020. The first 16 of the global registers,
gr96–gr111, are used for procedure return value passing. Return objects larger than
16 words must use the memory stack to return data (see section 3.3).
An extension to some C compilers has been made (High C 29K compiler for
one), enabling a calling procedure to assume that some global registers will survive a
procedure call. If the called function is defined before calls are made to it, the compiler can determine its register usage. This enables the global register usage of the calling function to be restricted to available registers, and the calling function need only
save in local registers those global registers it knows are used by the callee.
2.1.5 Memory Stack
Because the register cache is limited in size, a separate memory stack is used to
hold large local variables (structs or arrays), as well as any incoming parameters beyond the 16th parameter. (Note, small structs can still be passed in local registers as
procedure parameters). Register msp is the memory stack pointer. (Note, having two
stacks generally requires several operating system support mechanisms not required
by a single stack CISC based system.)
2.2 RUN–TIME HIF ENVIRONMENT
Application programs need to interact with peripheral devices which support
communication and other control functions. Traditionally embedded program developers have not been well served by the tools available to tackle the related software
development. For example, performing the popular C library service printf(), using
a peripheral UART device, may involve developing the printf() library code and
then underlying operating system code which controls the communications UART.
One solution to the problem is to purchase a real–time operating system. They are
normally supplied with libraries which support printf() and other popular library services. In addition, operating systems contain code to perform task context switching
and interrupt handling.
Typically, operating system vendors have their own operating system interface
specification. This means that library code, like printf(), which ultimately makes operating system service requests, is not easily ported between different operating systems. In addition, compiler vendors which typically develop library code for the target processor for sale along with the compiler, can not be assured of a standard interface to the available operating system services.
AMD wished to relieve this problem and allow library code to be used on any
target 29K platform. In addition AMD wished to ensure a number of services would
be available. These operating system services were considered necessary to enable
performance benchmarking of application code (for example the cycles service returns a 56–bit elapsed processor cycle count). The result was the Host Interface specification, known as HIF. It specifies a number of operating system services which
must always be present. The list is very small, but it enables library producers to be
assured that their code will run on any 29K platform. The HIF specification states
how a system call will be made, how parameters will be passed to the operating system, and how results will be returned. Operating system vendors need not support
HIF conforming services; they could just continue to use their own operating system interface and related library routines. But to make use of the popular library routines supplied with Metaware's High C 29K compiler, the operating system vendor must virtualize the HIF services on top of the underlying operating system services.
The original specification grew into what is now known as HIF 2.0. The specification includes services for signal handling (see following sections on C language
interrupt handlers), memory management support, run–time environment initialization and other processor configuration options. Much of this development was a result of AMD developing a small collection of routines known as OS–boot (see section 7.4). This code can take control of the processor from RESET, prepare the run–
time environment for a HIF conforming application program, and support any HIF
request made by the application. OS–boot effectively implements a single application–task operating system. It is adequate for many user requirements, which may be
merely to benchmark 29K applications. With small additions and changes it is adequate for many embedded products. However, some of the HIF 2.0 services, requested by the community who saw OS–boot as an adequate operating system, were
of such a nature that they often cannot be implemented in an operating system vendor’s product. For example the settrap service enables an entry to be placed directly
into the processor’s interrupt vector table; some operating systems, for example
UNIX, will not permit this to occur as it is a security risk and, if improperly used, an
effective way to crash the system.
There are standard memory, register and other initializations that must be performed by a HIF-conforming operating system before entry into a user program. In C
language programs, this is usually performed by the module crt0.s. This module receives control when an application program is invoked, and executes prior to invocation of the user’s main() function. Other high-level languages have similar modules.
The following three sections describe: what a HIF conforming operating system
must perform before code in crt0.s starts executing; what is typically achieved in
crt0.s code; and finally, what run–time services are specified in HIF 2.0.
2.2.1 OS Preparations before Calling start in crt0
According to the HIF specification, operating system initialization procedures
must establish appropriate values for the general registers mentioned below before
execution of a user’s application code commences. Linked application code normally commences at address label start in module crt0.s. This module is automatically
linked with application code modules and libraries when the compiler is used to produce the final application executable. In addition, file descriptors for the standard input and output devices must be opened, and any Am29027 floating–point coprocessor support as well as other trapware support must be initialized.
Register Stack Pointer (gr1)
Register gr1 points to the top of the register stack. It contains the main memory
address in which the local register lr0 will be saved, should it be spilled, and from
which it will be restored. The processor can also use the gr1 register as the base in
base–plus–offset addressing of the local register file. The content of rsp is compared
to the content of rab to determine when it is necessary to spill part of the local register
stack to memory. On startup, the values in rab, rsp, and rfb should be initialized to
prevent a spill trap from occurring on entry to the crt0 code, as shown by the following relations:
((64*4) + rab) ≤ rsp < rfb
rfb = rab + 512
This provides the crt0 code with at least 64 registers on entry, which should be a sufficient number to accomplish its purpose. Note, rab and rfb are normally set to be a window distance apart, 128 words (512 bytes), but this is not the only valid setting; see sections 2.3.2 and 4.3.1.
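The start-up arithmetic can be summarized by the following sketch, assuming a hypothetical reg_stack_top holding the highest address of the memory reserved for the register stack (the variable names are illustrative only):

#define WINDOW_SIZE 512                      /* 128 words: the usual rab-to-rfb distance */

unsigned long rsp_init, rab_init, rfb_init;

void init_register_stack(unsigned long reg_stack_top)
{
    rfb_init = reg_stack_top;                /* highest address of the register stack area */
    rab_init = rfb_init - WINDOW_SIZE;       /* one window distance below rfb */
    rsp_init = rfb_init - (64 * 4);          /* ((64*4) + rab) <= rsp < rfb holds, leaving */
                                             /* crt0 at least 64 free local registers      */
}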
Register Free Bound (gr127)
The register stack free–bound pointer, rfb, contains the register stack address of
the lowest-addressed word not contained within the register file. Register rfb is referenced in the epilog of most user program functions to determine whether a register
fill operation is necessary to restore previously spilled registers needed by the function’s caller. The rfb register should be initialized to point to the highest address of the
memory region allocated for register stack use. It is recommended that this memory
region not be less than 6k bytes.
Register Allocate Bound (gr126)
The register stack allocate–bound pointer, rab, contains the register stack address of the lowest-addressed word contained within the register file. Register rab is
referenced in the prolog of most user program functions to determine whether a register spill operation is necessary to accommodate the local register requirements of the called function. Register rab is normally initialized to be a window distance (512 bytes) below the rfb register value.
Memory Stack Pointer (gr125)
The memory stack pointer (msp) register points to the top of the memory stack,
which is the lowest-addressed entry on the memory stack. Register msp should be
initialized to point to the highest address in the memory region allocated for memory
stack use. It is recommended that this region not be less than 2k bytes.
Am29027 Floating–Point Coprocessor Support
The Am29027 floating–point coprocessor has a mode register which has a
cumbersome access procedure. To avoid accessing the mode register a shadow copy
is kept by the operating system and accessed in preference when a mode register read
is required. The operating system shadow mode value is not accessible to User mode
code; therefore, an application must maintain its own shadow mode register value. The floating–point library code which maintains and accesses the shadow mode value is passed the mode setting, initialized by the operating system, when the crt0 code commences. Before entering crt0, the Am29027 mode register value is copied into
global registers gr96 and gr97. Register gr96 contains the most significant half of the
mode register value, and gr97 contains the least significant half.
Open File Descriptors
File descriptor 0 (corresponding to the standard input device) must be opened
for text mode input. File descriptors 1 and 2 (corresponding to standard output and
standard error devices) must be opened for text mode output prior to entry to the
user’s program. File descriptors 0, 1, and 2 are expected to be in COOKED mode (see
Appendix A, ioctl() service), and file descriptor 0 should also select ECHO mode, so
that input from the standard input device (stdin) is echoed to the standard output device (stdout).
Software Emulation and Trapware Support
A 29K processor may take a trap in support of the procedure call prologue and
epilogue mechanism. A HIF conforming operating system supports the associated
SPILL and FILL traps by normally maintaining two global registers (in the
gr64–gr95 range) which contain the address of the user's spill and fill code. Keeping
these addresses available in registers reduces the delay in reaching the typically User
mode support code. A HIF conforming operating system also installs the SPILL and
FILL trap handler code which bounces execution to the maintained handler addresses.
Table 2-1. Trap Handler Vectors

Trap    Description
32      MULTIPLY
33      DIVIDE
34      MULTIPLU
35      DIVIDU
36      CONVERT
42      FEQ
43      DEQ
44      FGT
45      DGT
46      FGE
47      DGE
48      FADD
49      DADD
50      FSUB
51      DSUB
52      FMUL
53      DMUL
54      FDIV
55      DDIV
64      V_SPILL (Set up by the user's task through a setvec call)
65      V_FILL (Set up by the user's task through a setvec call)
69      HIF System Call
Note: The V_SPILL (64) and V_FILL (65) traps are returned to the user’s code to perform the trap
handling functions. Application code normally runs in User mode.
Additionally, the trapware code enabling HIF operating system calls is
installed. Also, all HIF conforming operating systems provide unaligned memory
access trap handlers.
A number of 29K processors do not directly support floating–point instructions
in hardware (see section 3.1.7). However the HIF environment requires that all
Am29000 User mode accessible instructions be implemented across the entire 29K
family. This means that unless an Am29050 processor is being used, trapware must
be installed to emulate in software the floating–point instructions not directly supported by the hardware. Table 2-1 lists the traps which an HIF conforming operating
system must establish support for before calling crt0 code.
When a 29K processor is supported by an Am29027 floating–point coprocessor, the operating system may choose to use the coprocessor to support floating–point instruction emulation. For example, the trapware routine used for emulating the MULTIPLY instruction is known as Emultiply; however, if the coprocessor is available the E7multiply routine is used.
2.2.2 crt0 Preparations before Calling main()
Application code normally begins execution at address start in the crt0.s module. The previous section described the environment prepared by a HIF conforming
operating system before the code in crt0.s is executed. The crt0.s code makes final
preparations before the application main() procedure is called.
The code in crt0.s first copies the Am29027 shadow mode register value, passed
in gr96 and gr97, to memory location __29027Mode. If a system does not have an
Am29027 floating–point coprocessor then there is no useful data passed in these registers. However, application code linked with floating–point libraries which make
use of the Am29027 coprocessor, will access the shadow memory location to determine the coprocessor operating mode value.
The setvec system call is then used to supply the operating system with the addresses of the user’s SPILL and FILL handler code which is located in crt0.s. Because this code normally runs in User mode address space, and the user has the option
to tailor the operation of this code, an operating system cannot know in advance
(pre–crt0.s) the required SPILL and FILL handler code operation.
When procedure main() is called, it is passed two parameters: the argc parameter indicates the number of elements in argv; the second parameter, argv, is a pointer to an array of character strings:
int
main(argc, argv)
int     argc;
char*   argv[];
The getargs HIF service is used to get the address of the argv array. In many
real–time applications there are no parameters passed to main(). However, to support
porting of benchmark application programs, many systems arrange for main() parameters to be loaded into a user’s data space. The crt0.s code walks through the
array looking for a NULL terminating string; in so doing, it determines the argc value. The register stack pointer was lowered by the start() procedure’s prologue code
to create a procedure activation record for passing parameters to main().
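Expressed in C (the real code in crt0.s is assembly; the helper name here is illustrative), the argument count is recovered like this:

static int count_args(char **argv)      /* argv address obtained from the getargs service */
{
    int argc = 0;

    while (argv[argc] != NULL)          /* walk the array until the NULL terminator */
        argc++;
    return argc;                        /* becomes the argc value passed to main() */
}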
To aid run–time libraries, a memory variable, __LibInit, is defined in uninitialized data memory space (BSS) by the library code. If any library code needs initialization before use, then the __LibInit variable will be assigned to point to a library routine which will perform all necessary initialization. This is accomplished by the linker matching up the BSS __LibInit variable with an initialized __LibInit variable defined in the library code. The crt0.s code checks to see if the __LibInit variable contains a non–zero address; if so, the procedure is called.
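In C terms the check is equivalent to the sketch below; the function-pointer declaration and the wrapper name are assumptions made for illustration:

extern void (*__LibInit)(void);     /* BSS definition, matched by the linker against an
                                       initialized definition supplied by the library */
static void run_libinit(void)
{
    if (__LibInit != 0)             /* non-zero only if a library supplied an initializer */
        (*__LibInit)();
}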
The application main() procedure is ready to be called by start(). It is not expected that main() will return. Real–time programs typically never exit. However,
benchmark programs do, and this is accomplished by calling the HIF exit service. If a
main() routine does not explicitly call exit then it will return to start(), where exit is
called should main() return.
2.2.3 Run–Time HIF Services
Table 2-2 lists the HIF system call services, calling parameters, and the returned
values. If a column entry is blank, it means the register is not used or is undefined.
Table 2-3 describes the parameters given in Table 2-2. Before invoking a HIF service, the service number and any input parameters passed to the operating system are
loaded into assigned global registers. Each HIF service is identified by its associated
service number which is placed in global register gr121. Parameters are passed, as
with procedure calls, in local registers starting with lr2. Application programs do not
need to issue ASSERT instructions directly when making service calls. They normally use a library of assembly code glue routines. The write service glue routine is
shown below:
__write:                                ;HIF assembly glue routine for write service
        const   gr121,20                ;tav is gr121
        asneq   69,gr1,gr1              ;system call trap
        jmpti   gr121,lr0               ;return if successful
        const   gr122,_errno            ;pass error number
        consth  gr122,_errno
        store   0,0,gr121,gr122         ;store errno number
        jmpi    lr0                     ;return if failure
        constn  gr96,-1                 ;return value -1
Application programs need merely call the _write() leaf routine to issue the service request. The system call convention states that return values are placed in global
registers starting with gr96; this makes the transfer of return data by the assembly
glue routine very simple and efficient. If a service fails, due to, say, bad input parameters, global register gr121 is returned with an error number supplied by the operating
system. If the service was successful, gr121 is set to Boolean TRUE (0x80000000).
The glue routines check the gr121 value, and if it is not TRUE, copy the value to
memory location errno. This location, unlike gr121, is directly accessible by the C language application which requested the service.
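A usage sketch at the C level is shown below; the exact declarations come from the library headers in use, so the prototype and errno declaration shown here are assumptions:

extern int write(int fileno, const void *buffptr, unsigned nbytes);
extern int errno;

static int put_message(void)
{
    static const char msg[] = "hello from the 29K\n";
    int count = write(1, msg, sizeof(msg) - 1);     /* HIF write service, fileno 1 = stdout */

    if (count == -1)
        return errno;       /* the glue routine copied the gr121 error number to errno */
    return 0;
}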
Run–time HIF services are divided into two groups, separated by their
service number. Numbers 255 and less require the support of complex operating system services such as file system management. Service numbers 256 and higher relate
to simpler service tasks. Note, AMD reserves service numbers 0–127 and 256–383
for HIF use. Users are free to extend operating system services using the unreserved
service numbers. Operating systems which implement HIF, OS–boot for example,
do not always directly support services 255 and lower. These HIF services are often
translated into native operating system calls which virtualize the HIF services. For
example, when a HIF conforming application program is running on a UNIX–based
system, the HIF services are translated into the underlying UNIX services. OS–boot
supports the more complex services by making use of the MiniMON29K message
system to communicate the service request to a debug support host processor (see
Chapter 7). For this reason, services 255 and lower are not always available.
Table 2-2. HIF Service Calls

                         Calling Parameters                      Returned Values
Service    gr121    lr2          lr3         lr4         gr96           gr97          gr121
---------------------------------------------------------------------------------------------
exit       1        exitcode                             Service does not return
open       17       pathname     mode        pflag       fileno                       errcode
close      18       fileno                               retval                       errcode
read       19       fileno       buffptr     nbytes      count                        errcode
write      20       fileno       buffptr     nbytes      count                        errcode
lseek      21       fileno       offset      orig        where                        errcode
remove     22       pathname                             retval                       errcode
rename     23       oldfile      newfile                 retval                       errcode
ioctl      24       fileno       mode                                                 errcode
iowait     25       fileno       mode        count                                    errcode
iostat     26       fileno                               iostat                       errcode
tmpnam     33       addrptr                              filename                     errcode
time       49                                            secs                         errcode
getenv     65       name                                 addrptr                      errcode
gettz      67                                            zonecode       dstcode       errcode
sysalloc   257      nbytes                               addrptr                      errcode
sysfree    258      addrptr      nbytes                  retval                       errcode
getpsize   259                                           pagesize                     errcode
getargs    260                                           baseaddr                     errcode
clock      273                                           msecs                        errcode
cycles     274                                           LSBs cycles    MSBs cycles   errcode
setvec     289      trapno       funaddr                 trapaddr                     errcode
settrap    290      trapno       trapaddr                trapaddr                     errcode
setim      291      mask         di                      mask                         errcode
query      305      capcode                              hifvers                      errcode
                    capcode                              cpuvers
                    capcode                              027vers
                    capcode                              clkfreq
                    capcode                              memenv
signal     321      newsig                               oldsig                       errcode
sigdfl     322      [gr125 points to HIF signal frame]   Service does not return      errcode
sigret     323      [gr125 points to HIF signal frame]   Service does not return      errcode
sigrep     324      [gr125 points to HIF signal frame]   Service does not return      errcode
sigskp     325      [gr125 points to HIF signal frame]   Service does not return      errcode
sendsig    326      sig                                                               errcode
Services with numbers 256 and higher do not require the support of a remote host processor.
These services are implemented directly by OS–boot. If an underlying operating system, such as UNIX, is being used, then some of these services may not be available as
they may violate the underlying operating system’s security.
When application benchmark programs use HIF services, care should be taken.
If a program requests a service such as time (service 49) it will suffer the delays of
communicating the service request to a remote host if the OS–boot operating system
is used. This can greatly affect the performance of a program, as execution will be
delayed until the remote host responds to the service request. It is better to use services such as cycles (service 274) or clock (service 273) which are executed by the
29K processor and do not suffer the delays of remote host communication.
The assembly level glue routines for HIF services 255 and lower are rarely requested directly by an application program. They are more frequently called upon by
library routines. For example, use of the library printf() routine is the typical way of
generating a write HIF service request. The mapping between library routines and
HIF services may not always be direct. The printf() routine, when used with a device
operating in COOKED mode, may only request write services when flushing buffers
supporting device communication. Appendix A contains a detailed description of
each HIF service in terms of input and output parameters, as well as error codes.
2.2.4 Switching to Supervisor Mode
Operating systems which conform to HIF normally run application code in User
mode. However, many real–time applications require access to resources which are
restricted to Supervisor mode. If the HIF settrap service is supported, it is easy to
install a trap handler which causes application code to commence execution in Supervisor mode. The example code sequence below uses the settrap() HIF library routine to install a trap handler for trap number 70. The trap is then asserted using assembly language glue routine assert_70().
extern int super_mode();        /* Here in User mode */
_settrap(70, super_mode);       /* install trap handler */
assert_70();                    /* routine to assert trap */
. . .                           /* Here in Supervisor mode */
The trap handler is shown below. Its operation is very simple; it sets the Supervisor mode bit in the old processor status register (OPS) before issuing a trap return
instruction (IRET). Other application status information is not affected. For example, if the application was running with address translation turned on, then it will continue to run with address translation on, but now in Supervisor mode.
In fact the example relies on application code running with physical addressing;
or if the Memory Management Unit is used to perform address translation, then virtual addresses are mapped directly to physical addresses. This is because the Freeze
mode handler, super_mode(), runs in Supervisor mode with address translation
turned off. But the settrap system call, which installs the super_mode() handler address, runs in User mode and thus operates with User mode address values.
        .global _super_mode
_super_mode:                            ;gr64 is an OS temporary
        mfsr    gr64,ops                ;read the OPS register
        or      gr64,gr64,0x10          ;set SM bit in OPS
        mtsr    ops,gr64
        iret                            ;iret back to Supervisor mode
The super_mode() and assert_70() routines have to be written in assembly language. The IRET instruction in super_mode() starts execution of the JMPI instruction in the assert_70() routine shown below. The method shown of forcing a trap can
be used to test a system's interrupt and trap support software.
        .global _assert_70
_assert_70:                             ;leaf routine
        asneq   70,gr96,gr96            ;force trap 70
        jmpi    lr0                     ;return
        nop
Table 2-3. HIF Service Call Parameters

Parameter   Description
027vers     The version number of the installed Am29027 arithmetic accelerator chip (if any).
addrptr     A pointer to an allocated memory area, a command-line-argument array, a pathname buffer, or a NULL-terminated environment variable name string.
baseaddr    The base address of the command-line-argument vector returned by the getargs service.
buffptr     A pointer to the buffer area where data is to be read from or written to during the execution of I/O services, or the buffer area referenced by the wait service.
capcode     The capabilities request code passed to the query service. Code values are: 0 (request HIF version), 1 (request CPU version), 2 (request Am29027 arithmetic accelerator version), 3 (request CPU clock frequency), and 4 (request memory environment).
clkfreq     The CPU clock frequency (in Hertz) returned by the query service.
count       The number of bytes actually read from file or written to a file.
cpuvers     The CPU family and version number returned by the query service.
cycles      The number of processor cycles (returned value).
di          The disable interrupts parameter to the setim service.
dstcode     The daylight savings time in effect flag returned by the gettz service.
errcode     The error code returned by the service. These are usually the same as the codes returned in the UNIX errno variable.
exitcode    The exit code of the application program.
filename    A pointer to a NULL-terminated ASCII string that contains the directory path of a temporary filename.
fileno      The file descriptor which is a small integer number. File descriptors 0, 1, and 2 are guaranteed to exist and correspond to open files on program entry (0 refers to the UNIX equivalent of stdin and is opened for input; 1 refers to the UNIX stdout, and is opened for output; 2 refers to the UNIX stderr, and is opened for output).
funaddr     A pointer to the address of a spill or fill handler passed to the setvec service.
hifvers     The version of the current HIF implementation returned by the query service.
iostat      The input/output status returned by the iostat service.
mask        The interrupt mask value passed to and returned by the setim service.
memenv      The memory environment returned by the query service.
mode        A series of option flags whose values represent the operation to be performed. Used in the open, ioctl, and wait services to specify the operating mode.
msecs       Milliseconds returned by the clock service.
name        A pointer to a NULL-terminated ASCII string that contains an environment variable name.
nbytes      The number of data bytes requested to be read from or written to a file, or the number of bytes to allocate or deallocate from the heap.
newfile     A pointer to a NULL-terminated ASCII string that contains the directory path of a new filename.
newsig      The address of the new user signal handler passed to the signal service.
offset      The number of bytes from a specified position (orig) in a file, passed to the lseek service.
oldfile     A pointer to a NULL-terminated ASCII string that contains the directory path of the old filename.
oldsig      The address of the previous user signal handler returned by the signal service.
orig        A value of 0, 1, or 2 that refers to the beginning, the current position, or the position of the end of a file.
pagesize    The memory page size in bytes returned by the getpsize service.
pathname    A pointer to a NULL-terminated ASCII string that contains the directory path of a filename.
pflag       The UNIX file access permission codes passed to the open service.
retval      The return value that indicates success or failure.
secs        The seconds count returned by the time service.
sig         A signal number passed to the sendsig service.
trapaddr    The trap address returned by the setvec and settrap services. A trap address passed to and returned by the settrap service.
trapno      The trap number passed to the setvec and settrap services.
where       The current position in a specified file returned by the lseek service.
zonecode    The time zone minutes correction value returned by the gettz service.
2.3  C LANGUAGE COMPILER
I know of six C language compilers producing code for the 29K family. The
most widely used of these are: the High C 29K compiler developed by Metaware Inc;
and the GNU compiler, supported by the Free Software Foundation and Cygnus Support Inc. Developers of 29K software normally operate in a cross development environment, editing and compiling code on one machine, producing code which is intended to run on 29K target
hardware. The High C 29K compiler is sold by a number of companies, including
AMD, and packaged along with other vendor tools. High C 29K can produce code
for both big– and little–endian 29K operation. The GNU compiler, gcc, currently
(version 2.5) produces big–endian code. This does not present a problem as the 29K
is used predominantly in big–endian mode.
2.3.1 Compiler Optimizations
A RISC chip is very sensitive to code optimization. This is not surprising since
the RISC philosophy gives software greater access to a processor’s internals relative
to most CISC processors. Compilers make use of a number of code optimization techniques which are difficult for the assembly language programmer to apply consistently. Some of these techniques are briefly described below.
Common Sub–Expression Elimination
...
c = a + b;
...
d = a + b;          /* sub-expression used again */
...
The expression a+b is a common sub-expression; it does not need to be evaluated twice. A more efficient compiler would store the result of the first evaluation
in a local or global register and reuse the value in the second expression. Temporary
variables used during interim calculations are optimized by the compiler. These compiler-generated temporaries are allocated to register cache locations.
Strength Reduction
Whenever possible, “strength reduction” is performed. This refers to replacing expensive instructions with less expensive ones. For example, multiplies by powers of two are replaced with more efficient shift instructions.
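A trivial sketch:

int scale_by_eight(int x)
{
    return x * 8;           /* generated as a left shift by 3 rather than a multiply */
}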
Loop Invariant Code Motion
Sometimes a C programmer will place code in a loop which could have been
located outside of the loop. For example, variable initialization need not be repeatedly executed in a loop. The loop invariant initialization would be located before the loop code. Hence, the amount of code required to support each loop iteration is minimized.
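A small sketch of the effect (the code shown is what the compiler effectively produces):

void fill(int *dst, int n, int a, int b)
{
    int i;
    int scale = a * b;          /* loop-invariant product hoisted out of the loop */

    for (i = 0; i < n; i++)
        dst[i] = scale + i;     /* only the varying work remains inside the loop */
}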
Loop Unrolling
There are a number of optimization techniques applicable to code loops. The
objective is the same, to replace the loop with a sequence of faster executing code.
This often involves unrolling the loop partially or completely. For example, the
compiler may determine a loop is traversed, say, three times. It may be more effective
to replace the loop with three in–line versions of the loop. This would eliminate the
branching required by the loop. Additionally, when a loop is unrolled there are
generally increased opportunities to apply optimizations not available to the looped
alternative. Consequently, sections of the expanded loop need not be just duplications of a single loop iteration, but something smaller and more register efficient.
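For example, a loop with a trip count known at compile time:

int sum_first_three(const int *v)
{
    int i, sum = 0;

    for (i = 0; i < 3; i++)     /* trip count known at compile time */
        sum += v[i];
    return sum;                 /* may be generated as v[0] + v[1] + v[2], with no branches */
}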
Dead–Code Elimination
Code which can never be executed is eliminated. This saves on memory usage.
Unexecutable code can result from a branch which can never be taken. Compilers
generally issue a warning when they detect “unreachable code”. Additionally, result
values which are never used can be eliminated; this can remove unneeded store
instructions.
Improved Register Allocation
A processor’s registers are a critical resource in determining performance.
Accessing registers is very much more efficient than accessing off–chip memory.
The ability of the compiler to devise schemes to keep data within the available
registers is critical. Additionally, given that the 29K compiler determines the size of a
procedure’s register window, it is important to minimize register allocation if spilling
and filling are to be avoided.
Constant Propagation And Folding
Variables are often assigned constant values. Later, the variable is used in a calculation. The 29K instruction format supports 8–bit immediate data constants. Applying constant variables as immediate data rather than holding the variable in a register can be more efficient. Additionally, propagating an immediate value may enable
it to be combined with another immediate value at compile time. This is better than
performing a run–time calculation.
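A trivial sketch:

int rectangle_area(void)
{
    int w = 6;
    int h = 7;                  /* constants propagated into the expression below */

    return w * h;               /* folded at compile time to the immediate value 42 */
}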
Register–to–Register Copying (Copy Propagation)
When examining compiler generated code, particularly if the target is a CISC
processor, it is not unusual to see stores of register data to memory locations. This
makes the register available for reuse. Later, the stored data is reloaded for further
processing. The better RISC compilers try to keep data in registers longer, and use register–to–register copying rather than register–to–memory copying.
Memory Shadowing
The performance impact of a memory access is reduced when the access is performed to a copy–back data cache. However, most processors do not have this advantage available to them. The term “memory shadowing” refers to the increased use of
registers for data variable storage. Again, directing accesses to registers rather than
off–chip memory has significant performance advantages. Of course, if a variable is
defined volatile it cannot be held in a register.
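For example (the device register name here is hypothetical):

volatile int uart_status;       /* device register: every access must reach memory  */
int spin_count;                 /* ordinary variable: may be shadowed in a register */

void wait_until_ready(void)
{
    while ((uart_status & 0x1) == 0)    /* volatile forces a fresh load each iteration */
        spin_count++;
}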
Memory References Are Coalesced and Aligned
Data memory can be most efficiently accessed using burst–mode addressing.
This requires the use of load– and store–multiple instructions. When a sufficiently
large data object is being moved between memory and registers, it is best to use the
burst–mode supported instructions. The compiler can also arrange for frequently accessed data to be located (coalesced) in adjacent memory locations, even if the data
variables were not consecutively defined.
There are also performance benefits to be had by aligning target instructions on
cache block boundaries. For example, a procedure can be aligned to start on a 4–word
boundary. This improves cache utilization and performance –– particularly with
caches which do not support partially filled cache blocks.
Delay Slot Filling
The compilers perform “delay slot filling” (see section 3.1.8). Delay slots occur
whenever a 29K processor experiences a disruption in consecutive instruction execution. The processor always executes the instruction in the decode pipeline stage, even
if the execute stage contains a jump instruction. Delay slot is the term given to the
instruction following the jump or conditional branch instruction. Effectively, the
branch instruction is delayed one cycle. Unlike assembly language programmers, the
compiler easily finds useful instructions to insert after branching instructions. These
instructions, which are executed regardless of the branch condition, are effectively
achieved at no cost. Typically, an instruction that is invariant to the branch outcome is
moved into the delay slot just after the branch or jump instruction.
Jump Optimizations
Because of the pipeline stalling effects of jump instructions, scheduling these
instructions can achieve significant performance improvements. The objective is to
reduce the number of taken branches. For example, code loops typically have conditional tests at the top of the loop to test for loop completion. This results in branch
instructions at the top and the bottom of the loop. If the conditional branch is moved
to the bottom of the loop then the number of branches is reduced.
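The sketch below shows the effect; the transformation is written out by hand here, but the compiler performs it automatically:

/* As written: a conditional branch at the top and an unconditional branch at the bottom. */
int sum_while(const int *p, int n)
{
    int total = 0;
    while (n-- > 0)
        total += *p++;
    return total;
}

/* After the optimization: the test is at the bottom, so only one branch is taken per pass. */
int sum_rotated(const int *p, int n)
{
    int total = 0;
    if (n > 0) {
        do {
            total += *p++;
        } while (--n > 0);
    }
    return total;
}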
Instruction Scheduling
The 29K allows load and store instructions to be overlapped with other instructions that do not depend on the load or store data. Ordinarily, a processor will load
data into a register before it makes use of it in the subsequent instruction. To enable
overlapping of the external memory access, the load instruction must be executed at
an earlier stage, before it is required. Best results are obtained if code motion techniques are used to push the load instruction back by as many instructions as there are
memory access delay cycles (another name for this technique is instruction prescheduling). This will prevent processor pipeline stalling caused by an operand value
not being available. Once again, code motion is best left to the compiler to worry
about.
Leaf Procedure Optimization
Leaf procedures are procedures which do not call other procedures; at least they
do not contain any C level procedure calls. However, they can contain transparent
routine calls inserted by the compiler. Because of this unique characteristic of leaf
routines, a number of optimizations can be applied; for example, a simplified procedure prologue and epilogue, and alternative register usage. When a leaf is static in
scope (only known within the defining module) alternative parameter passing and
register allocation schemes can be applied.
With newer versions of the High C 29K compiler, it is possible to construct
simple procedures as transparent routines (see section 3.7). If a procedure qualifies
for a transparent–type implementation, then its parent (in the calling sequence) may
itself become a leaf procedure. This propagates the benefits obtained by leaf
procedures.
In–lining Simple Functions
The program may call a procedure but the compiler can replace the call with
equivalent in–line code. For very small procedures this can be a performance
advantage. However, as the called procedure grows in size and in–lining is frequently
applied, code space requirements will increase. In–lining is frequently utilized
with C++ code which often has classes with small member functions. The register
requirements of a procedure can grow when it has to deal with in–line code rather
than a procedure call. This does not present much difficulty for a 29K processor as it
can individually tailor the register allocation to each procedure’s requirements with
dynamically sized register windows.
As stated above, it is possible to construct simple functions as transparent
routines (see section 3.7). This is not really in–lining, but it does further reduce the
overhead associated with even a leaf procedure. Additionally, placing code in a
transparent routine, which is shared, helps reduce the code expansion which occurs
with in–lining. For this reason, using the C language keyword _Transparent to define the type of small procedures may be a performance advantage when used with C++ object member functions.
Global Function In–lining
When code in–lining is applied, it is typically limited to functions defined and
used within a single module. More elaborate schemes enable a function to be
defined in one module and the related code to be inserted in–line even if the call to the
function appears in another file. Applying function in–lining in this global fashion
can greatly extend the benefits of in–lining.
Two–pass Code Compilation
Most compilers apply their optimization statically; that is, entirely at compile
time. However, by observing the program in execution, optimizations can be further
refined. For example, branch prediction can be applied statically, but observing the
frequency of actual branching reveals the most traversed code paths. Additionally,
the data which is most frequently accessed can be determined. With this information
a second pass of the compiler can be applied and further code optimizations incorporated.
Superblock Formation
Software optimizations are normally only applied within a code block. A block
is a code sequence which is bounded by a single entry point (at the top –– a lower
address) and one or more exit points (a jump or call instruction). Instruction
scheduling and other optimizations can be better utilized if an instruction block is
large. For this reason techniques which enlarge a block’s size and create a superblock
are important.
A superblock may contain a number of basic blocks, yet code optimizations can
be applied over the larger superblock code sequence. Creation of a superblock can
require duplication of code. Typically the tail of a superblock will be duplicated (tail duplication) to eliminate side entry points to the superblock. Optimization techniques
which help superblock creation are: loop unrolling, function in–lining, jump
elimination, code duplication, code migration, and code profiling.
2.3.2 Metaware High C 29K Compiler
The Metaware Inc. compiler, invoked with the hc29 driver, has held the position
as the top performing 29K compiler for a number of years. It generally produces the
fastest code, which is of the smallest size. It is available on SUN and HP workstation
platforms as well as IBM PC–AT machines. It may be made available on other platforms depending on customer demand. A number of companies resell the compiler
along with other tools, such as debuggers and emulators.
The compiler typically allocates about 12 registers for use by each new
procedure. However, a very large procedure could be allocated up to 128 registers.
This requires the register–stack cache be assigned the maximum window size of 128
registers. The “lregs=n” compiler switch (minimum n=36) enables the maximum
number of registers allocated to a procedure to be limited to less than 128. If the
“lregs” switch is used, it is possible to operate with a reduced window size. This
would increase the frequency of stack spilling and filling (and hence reduce effective
execution speeds) but would enable a faster task context switch time (see section
8.1.4). The maximum number of local registers which would require saving or
restoring would be limited to the reduced window size (window size = rfb – rab).
A number of the example code sequences shown in this book, and provided by
AMD, are configured to operate with a fixed window size of 512 bytes; in particular,
repair_R_stack in file signal.s and signal_associate in file sig_code.s. These files
should be modified to reflect the reduced window size. Ideally a Supervisor mode
accessible memory location, say WindowSize, should be initialized by the operating
system to the chosen window size, and all subsequent code should access
WindowSize to determine the window size in use. Additionally, the spill handler
routine must be replaced with the code shown below. The replacement handler
requires three additional instructions. But, unlike the more frequently used spill
handler (section 4.4.4), it is not restricted to operating with a fixed window size of
512 bytes.
spill_handler:
        sub     tav,rab,rsp             ;calculate size of spill
        srl     gr96,tav,0x2            ;number of words
        sub     gr96,gr96,0x1
        mtsr    cr,gr96
        sub     tav,rfb,tav             ;determine new rfb position
        const   gr96,0x200
        or      gr96,tav,gr96
        mtsr    ipa,gr96                ;point into register file
        add     rab,rsp,0x0             ;adjust rab position
        storem  0,0x0,gr0,tav           ;move data
        jmpi    tpc
        add     rfb,tav,0x0             ;adjust rfb position
The above spill handler code may fail if there is a procedure which does not use
the gr96 register. The compiler may hold a value in gr96 and expect it to survive the
function call; and the function call may result in spill handler execution. This is not
likely, but the use of gr96 above must be done with care.
A number of non–standard C features have been added to the compiler. These
features are often useful, but their use reduces the portability of code between
different C compilers. For example, the High C 29K compiler does not normally pack
data structures. The type modifier _Packed can be used to specify packing on a
per–structure basis. If structure packing is selected on the compiler command line,
unpacked structures can be selectively specified with the _Unpacked type modifier.
For example:
typedef _Packed struct packet_str       /* packed structure */
{
    char    A;
    int     B;
    . . .
} packet_t;
A HIF conforming operating system provides unaligned memory access trap
handlers –– any 29K operating system may choose to do this. Hence, if an object larger than a byte is accessed and the object is not aligned to an object–sized boundary,
then a trap will be taken and the trap handler will perform the required access in
stages if necessary. The trap handler will require several processor cycles to perform
its task. To the programmer, the required data is accessed as if it were aligned on the
correct address boundary. In the example above, structure member B is of size int but
is not aligned on an int–sized boundary (given object A is a char and it is aligned on a
word–sized boundary).
Of course there is a performance penalty for use of trap handlers. For this reason,
packed data structures are seldom used. However, their use does reduce data
memory requirements, and for this reason data is often sent between processors in
packed data packets. When a data packet is received, its contents can be accessed as
bytes without any data alignment difficulties. Access of data larger than bytes may
require unaligned trap handler support, and thus suffer a performance penalty.
The High C 29K compiler offers a solution to the performance problem with the
type modifiers _ASSUME_ALIGNED and _ASSUME_UNALIGNED. They enable a
pointer to an unaligned structure to be declared. For example:
receive_packet(packet_p)
_ASSUME_UNALIGNED packet_t* packet_p;
{
    int data = packet_p->B;     /* unaligned access */
    . . .
The receive_packet() procedure is passed a pointer to a data structure which is
known to be unaligned. Normally, when member B of the packet structure is
accessed, an unaligned trap occurs. However, informing the compiler of the
unaligned nature of the data enables the compiler to replace the normal load
instruction used to read the B data with a transparent helper routine call (see section
3.7). The transparent helper routine performs the same task as the trap handler but
with a reduced overhead.
2.3.3 Free Software Foundation, GCC
The GNU compiler, gcc, can be obtained from any existing user who is in a position, and has the time, to duplicate their copy. Alternatively, the Free Software
Foundation can be contacted. For a small fee, Cygnus Support Inc. will ship you a
copy along with their documentation. The GNU compiler is available in source form,
and currently runs on UNIX type host machines as well as 386 based IBM PCs and
compatibles.
Considering the Stanford University benchmark suite, the gcc compiler (version 2.3) produces code which is on average 15–20% slower in execution compared
to hc29. The GNU compiler also uses considerably more memory to contain the
compiled code. Of course your application program may experience somewhat different results.
2.3.4 C++ Compiler Selection
Programmers first started developing C++ code for the 29K in 1988; they used
the AT&T preprocessor, cfront, along with the High C 29K compiler. A number of
support utilities were developed at that time to enable the use of cfront: nm29,
munch29, and szal29, which gave the size and alignment of 29K data objects (required for cross development environments).
Because the GNU tool chains can support C++ code development directly with
the GCC compiler, there is little use being made of the AT&T cfront preprocessor.
Additionally, MRI and Metaware have recently announced upgrades to their products which now enable C++ code development. (C++ makes extensive use of dynamic memory resources, see section 2.4.1.)
2.3.5 Executable Code and Source Correspondence
The typically high levels of optimization applied by a compiler producing code
for RISC execution can make it difficult to identify the relationship between 29K
instructions and the source level code. When looking at the 29K instructions
produced by the compiler, it is not always easy to identify the assembly instructions
which correspond to each line of C code. Optimizations such as: code motion,
sub–expression elimination, loop unrolling, instruction scheduling and more, all add
to the difficulty.
Fortunately, there is usually little need to study the resulting instructions
produced after compilation. However, it can occasionally be worth studying
compiler output when trying to understand the performance of critical code
segments. It is difficult to obtain a small example of C code which demonstrates all
the potential code optimizations. The example below is interesting, but illustrates
only a few of the difficulties of relating source code to 29K instructions.
int     strcmp(s1, s2)                          /* file strcmp.c */
char    *s1,*s2;
{
        int     cnt=0;
        for(cnt=0;;cnt++)
        {
                if(s1[cnt]!=s2[cnt]) return -1;
                if(s1[cnt]=='\0' || s2[cnt]=='\0')      /* line 8 */
                        if(s1[cnt]=='\0' && s2[cnt]=='\0')
                                return 0;
                        else
                                return -1;
        }
}                                               /* line 14 */
The procedure, strcmp(), is similar to the ANSI library routine of the same
name. It is passed the address of two strings. The strings are compared to determine if
they are the same. If they are the same, zero is returned, otherwise –1 is returned. This
is not exactly the same behavior as the ANSI routine.
The procedure is based on a for–loop statement which compares characters in
the two strings until they are found to be different or one of the strings is terminated.
The algorithm used by the C code is not optimal. But this makes the example more
interesting as it challenges the compiler to produce the minimum code sequence. The
Metaware compiler was first used to compile the code with a high level of
optimization selected (–O7). The command line used was “hc29 –S –Hanno –O7
strcmp.c”. The “–S” switch causes the compiler to stop after it has produced 29K
assembly code –– no linking with libraries is performed. The “–Hanno” switch
causes the source C code to be embedded in the output assembly code. This helps
identify the assembly code corresponding to each line of C code. The assembly code
produced is shown below. Note that some assembly level comment statements have
been added to help explain the code operation.
        .text
        .word   0x40000                 ; Tag: argcnt=2 msize=0
        .global _strcmp
_strcmp:
;4      |       int     cnt=0;
;5      |       for(cnt=0;;cnt++)
        jmp     L2
        const   gr97,0                  ;cnt=0
L3:                                     ;top of for-loop
L2:
;6      |       {       if(s1[cnt]!=s2[cnt])
        add     gr96,lr2,gr97
        load    0,1,gr99,gr96           ;load s1[cnt]
        add     gr96,lr3,gr97
        load    0,1,gr98,gr96           ;load s2[cnt]
        cpeq    gr96,gr99,gr98          ;compare characters
        jmpf    gr96,L4                 ; jump if different
        cpeq    gr96,gr99,0             ;test if s1[cnt] == '\0'
;8      |               if(s1[cnt]=='\0' || s2[cnt]=='\0')
        jmpt    gr96,L5                 ; jump if string end
        cpneq   gr96,gr98,0             ;test s2[cnt]
        jmpt    gr96,L3                 ;for-loop if not end
        add     gr97,gr97,1             ;increment cnt
L5:
;9      |               if(s1[cnt]=='\0' && s2[cnt]=='\0')
        cpneq   gr96,gr99,0             ;here is at end of string
        jmpt    gr96,L4                 ;jump if s1[]!='\0'
        cpneq   gr96,gr98,0
        jmpt    gr96,L7                 ;jump if s2[]!='\0'
        constn  gr96,-1
;10     |                       return 0;
        jmpi    lr0                     ;strings match
        const   gr96,0                  ;return 0
L4:
L7:
;12     |                       return -1;
        constn  gr96,-1                 ;no match
        jmpi    lr0
        nop
The body of the for–loop is contained between address labels L3 and L5. The
compiler has filled the delay slot of jump instructions with other useful instructions.
Within the for–loop, LOAD instructions are used to access the characters of each
string. Register gr97 is used to hold the loop–count value, cnt. The count value is
incremented each time round the for–loop. The value in gr97 is added to the base of
each string (lr2 and lr3) to obtain the address of each character required for comparison. The LOAD instructions have been scheduled to somewhat reduce conflict for
off–chip access and reduce the pipeline stalling effects of LOAD instructions.
Within the body of the loop three tests are applied: one to determine if the characters at the current position in the string match; the remaining two, to determine if
the termination character has been reached for either of the strings. The assembly
code after label L5 selects the correct return value when the tested characters do not
match or string termination is reached. There is unnecessary use of jump instructions
in the code following label L5 and also in the initial code jumping to label L2. It is
somewhat fortunate that this less optimal code does not appear within the more frequently executed for–loop body.
The same code was compiled with the GNU compiler using command “gcc –S
–O4 strcmp.c”. The assembly code produced is shown below; it is quite different
from the Metaware produced code.
        .text
        .align 4
        .global _strcmp
        .word 0x40000
_strcmp:
L2:                                     ;top of for-loop
        load    0,1,gr117,lr2           ;load s1[cnt]
        load    0,1,gr116,lr3           ;load s2[cnt]
        cpneq   gr116,gr117,gr116       ;compare characters
        jmpf    gr116,L5                ;jump if match
        cpneq   gr116,gr117,0           ;test for s1[] end
        jmpi    lr0                     ;no match
        constn  gr96,65535              ; return -1
L5:                                     ;here if s1[cnt]==s2[cnt]
        jmpfi   gr116,lr0               ;return if at string end
        const   gr96,0
        add     lr3,lr3,1               ;next s2[] character
        jmp     L2                      ;for-loop
        add     lr2,lr2,1               ;next s1[] character
All of the code is contained in the body of the for–loop. A for–loop transition
consists of 10 instructions, a decrease of one compared to the Metaware code. However, LOAD instructions are now placed back–to–back, and loaded data is used immediately. Additionally, the normal path through the for–loop contains an additional
jump to label L5. This will increase the actual number of cycles required to execute a
single for–loop to more than 10 cycles. It is likely the Metaware produced code will
execute in a shorter time.
No register (previously gr97) is used to contain the cnt value. The pointers to the
passed strings, lr2 and lr3, are advanced to point to the next character within the for–
loop. Delay slot instructions are productively filled and there are no unnecessary
jump instructions.
Lines 8 through 12 of the source code are only applied if the tested characters are
found not to match. Consequently, it is redundant to test if either string has reached
the termination character –– if one has, they both have. This optimization should
have been reflected in the source code. However, the GNU compiler has identified
that it need only test string s1[] for termination. This results in the elimination of 29K
instructions relating to later C code lines. For example, there is no code relating to the
if–statement on line 9. If an attempt is made to place a breakpoint on source line 9
using the GDB source level debugger, then no breakpoint will be installed. Other debuggers may give a warning message or place a breakpoint at the first line before or
after the requested source line.
Programmers familiar with older generation compilers applied to CISC code
generation will notice the increased complexity in associating 29K instructions to
source C statements –– even for the simple example shown. As procedures become
larger and more complex, code association becomes increasingly more difficult. The quality of 29K code produced by the better compilers available makes it very difficult
to consistently (or frequently) produce better code via hand crafting 29K instructions. Because of the difficulty of understanding the compiler generated code, it is
best to only incorporate hand–built code as separate procedures which comply with
the C language calling convention.
2.3.6 Linking Compiled Code
After application code modules have been compiled or assembled, they must be
linked together to form an executable file. There are three widely used linker tools:
Microtec Research Inc. developed ld29; Information Processing Corp. developed
ld29i; and the GNU tool chain offers gld. Sometimes these tools are repackaged by
vendors and made available under different names. They all operate on AMD COFF
formatted files. However, they each have different command line options and link
command–file formats. A further limitation when mixing the use of these tools is that
ld29 operates with a different library format compared to the others. It uses an MRI
format which is maintained by the lib29 tool. The others use a UNIX System V format supported by the well known ar librarian tool.
It is best to drive the linker from the compiler command line, rather than invoking the linker directly. The compiler driver program, gcc or hc29 for example, can
build the necessary link command file and include the necessary libraries. This is the
ideal way to link programs, even if assembly language modules are to be named on
the compiler command line. Note that the default link command files frequently align text (ALIGN .text=8192) and data sections to 8k (8192) byte boundaries. This
is because the OS–boot operating system (see Chapter 7) normally operates with address translation turned on. The maximum (for the Am29000 processor) page size of
8k bytes is used to reduce run–time Memory Management Unit support overheads.
Different 29K evaluation boards can have different memory maps. AMD normally supplies the High C 29K linker in a configuration which produces a final code
image linked for a popular evaluation board –– many boards share the same memory
map. Additionally, AMD supplies linker command files for currently available
boards, such as the EZ030 and SA29200 boards. The linker command files are located in the installation/lib directory; each command file ends with the file extension
.cmd. For example, the mentioned boards have command files: ez030.cmd and
sa200.cmd, respectively. The linker command files can be specified when the compiler is invoked. For example, the command “hc29 –o file –cmdez030.cmd file.c”
will cause the final image to be linked using the ez030.cmd command file. Using the
supplied linker command files is a convenient way to ensure a program is correctly
linked for the available memory resources.
The GNU compiler also allows options to be passed to the linker via the
“–Xlinker” flag. For example, the command line “gcc –Xlinker –c –Xlinker
ez030.cmd –o file file.c” will compile and link file.c. The linker will be passed the
option “–c ez030.cmd”. The GNU linker documentation claims the linker can
operate on MRI formatted command files. In practice, at least for the 29K, this is not
the case. The GNU linker expects MRI–MC68000 formatted command files, which
are a little different from MRI–29K formatted command files. Known differences are
the use of the “*” character rather than “#” before comments, and the key word
PUBLIC must be upper case. Those using the GNU tool chain generally prefer to use
the GNU linker command file syntax rather than attempt to use the AMD supplied
command files.
When developing software for embedded applications there is always the problem of what to do with initialized data variables. The problem arises because variables must be located in RAM, but embedded programs are typically not loaded by an
operating system which prepares the data memory locations with initialized values.
Embedded programs are stored in ROM; this means there is no problem with program instructions unless a program wishes to modify its own code at run–time.
Embedded system support tools typically provide a means of locating initialized data in ROM; and transferring the ROM contents to RAM locations before program execution starts. The High C 29K linker, ld29, provides the INITDATA command for this purpose. Programs must be linked such that all references to writeable
data occur to RAM addresses. The INITDATA command scans a list of sections and transfers the data variables found into a new .initdat section. The list contains the names of sections containing initialized data. The linker is then directed to locate the new .initdat section in ROM. The start address of the new section is marked with the symbol initdat.
Developers are provided with the source to a program called initcopy() which
must be included in the application program. This program accesses the data in ROM
starting at label initdat and transfers the data to RAM locations. The format of the
data located in the .initdat section is understood by the initcopy() routine. This routine must be run before the application main() program. A user could place a call to
the initialization routine inside crt0.s.
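The exact layout of the .initdat section is defined by the ld29 INITDATA implementation, and the AMD supplied initcopy() source should be used in practice. As an illustration only, the sketch below assumes a hypothetical table of (RAM destination, byte count, data) records terminated by a zero count; the record format shown is not the actual ld29 format.

    /* Illustration only: a hypothetical .initdat layout of
       (RAM destination, byte count, data bytes...) records,
       terminated by a record with a zero byte count.  The real
       format is defined by ld29; use the AMD supplied source.  */
    extern unsigned long initdat[];     /* start of the ROM copy table */

    void initcopy()
    {
        unsigned long *p = initdat;

        for (;;) {
            unsigned char *dst = (unsigned char *)*p++;  /* RAM address */
            unsigned long  len = *p++;                   /* byte count  */
            unsigned char *src = (unsigned char *)p;

            if (len == 0)
                break;                                   /* end of table */
            while (len--)
                *dst++ = *src++;                         /* ROM to RAM   */
            /* step to the next word aligned record */
            p = (unsigned long *)(((unsigned long)src + 3) & ~3UL);
        }
    }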
Note, because initcopy() must be able to read the appropriate ROM devices,
these devices must be placed in an accessible address space. This is not a problem for
2–bus members of the 29K family, but 3–bus members can have a problem if the .initdat section is located in a ROM device along with program code. Processors with
3–bus architectures, such as the Am29000, have separately addressed Instruction
and ROM spaces which are used for all instruction accesses. The Am29000 processor has no means of reading these two spaces to access data unless an external bridge
is provided. If program code and initialized data are located in the same ROM device,
the initcopy() program can only be used if an external bridge is provided. This bridge
connects the Am29000 processor data memory bus to the instruction memory bus. If
a 3–bus system does not have a bridge the romcoff utility can be used to initialize data
memory.
The romcoff utility can be used when the ld29 linker is not available and the
INITDATA linker command option is not provided. Besides being able to work with
3–bus architectures which have no bridge, it can be used to process program sections
other than just initialized data. Sections which ultimately must reside in RAM can be
initialized from code located in ROM.
Fully linked executables are processed by romcoff to produce a new linkable
COFF file. This new module has a section called RI_text which contains a routine
called RAMInit(). When invoked, this routine initializes the processed sections, during preparation of the relevant RAM regions. The new COFF file produced by romcoff must be relinked with the originally linked modules. Additionally, a call to RAMInit() must be placed in crt0.s or in the processor boot–up code (cold–start code) if
the linked executable is intended to control the processor during the processor RESET code sequence.
When romcoff is not used with the “–r” option, it assumes that the ROM
memory is not readable. This results in a RAMInit() function which uses CONST
and CONSTH instructions to produce the data values to be initialized in RAM. This
results in extra ROM memory requirements to contain the very much larger RAMInit() routine, but ensures that 3–bus architectures which do not incorporate a bridge
can initialize their RAM memory.
2.4
LIBRARY SUPPORT
2.4.1 Memory Allocation
The HIF specification requires that conforming operating systems maintain a
memory heap. An application program can acquire memory during execution by using the malloc() library routine. This routine makes use of the underlying sysalloc
HIF service. The malloc() call is passed the number of consecutive memory bytes
required; it returns a pointer to the start of the memory allocated from the heap.
Calls to malloc() should be matched with calls to library routine free(). This
routine is passed the start address of the previously allocated memory along with the
number of bytes acquired. The free() routine is supported by the sysfree HIF service.
The HIF specification states “no dynamic memory allocation structure is implied by
this service”. This means the sysfree may do nothing; in fact, this service with OS–
boot (version 0.5) simply returns. Continually using memory without ever releasing
it, and thus making it reusable, will be a serious problem for some application programs, in particular C++ programs, which frequently construct and destruct objects in heap
memory.
For this reason the library routines which interface to the HIF services perform
their own heap management. The first call to malloc() results in a sysalloc HIF request for 8k bytes, even if the malloc() request was for only a few bytes. Further malloc()
calls do not result in a sysalloc request until the 8k byte pool is used up. Calls to free()
enable previously allocated memory to be returned to the pool maintained by the library.
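The pooling behavior can be seen with a short usage sketch. Only the first small request results in a sysalloc call; the library satisfies later small requests, and reuses freed blocks, from its own pool. The exact pool size and behavior depend on the library release in use.

    #include <stdlib.h>

    int main()
    {
        char *a = malloc(100);   /* first call: library issues a sysalloc
                                    HIF request for an 8k byte pool       */
        char *b = malloc(100);   /* satisfied from the existing pool      */

        if (a == 0 || b == 0)
            return 1;

        free(b);                 /* returned to the library pool; the     */
        free(a);                 /* underlying sysfree may do nothing     */
        return 0;
    }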
The alloca() library routine provides a means of acquiring memory from the
memory stack rather than the heap. A pointer to the memory region within the calling
procedure’s memory stack frame is returned by alloca(). The advantage of this
method is that there is no need to call a corresponding free routine. The temporary
memory space is automatically freed when the calling procedure returns. Users of the
alloca() service must be careful to remember the limited lifetime of data objects
maintained on the memory stack. After returning from the procedure calling alloca(),
all related data variables cease to exist and should not be referenced.
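The lifetime restriction is easy to violate. In the sketch below, returning a pointer obtained from alloca() leaves a dangling reference, while using the buffer before the allocating procedure returns is safe. The declaration of alloca() shown is an assumption; the header or declaration required varies between toolchains.

    #include <string.h>
    #include <stdio.h>

    extern void *alloca();        /* declaration depends on the toolchain */

    /* Wrong: the buffer lives in broken()'s memory stack frame and
       ceases to exist as soon as broken() returns.                  */
    char *broken(char *s)
    {
        char *buf = alloca(strlen(s) + 1);
        strcpy(buf, s);
        return buf;               /* dangling pointer */
    }

    /* Right: the buffer is used before the allocating procedure
       returns and is freed automatically on return.               */
    void ok(char *s)
    {
        char *buf = alloca(strlen(s) + 1);
        strcpy(buf, s);
        printf("%s\n", buf);
    }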
2.4.2 Setjmp and Longjmp
The setjmp() and longjmp() library routines provide a means to jump from the
current procedure environment to a previous procedure environment. The setjmp()
routine is used to mark the position which a longjmp() will return to. A call to
setjmp() is made by a procedure, passing it a pointer to an environment buffer, as
shown below:
int     setjmp(env)
jmp_buf env;
The buffer definition is shown below. It records the value of register stack and
memory stack support registers in use at the time of the setjmp() call. The setjmp()
call returns a value zero.
typedef struct jmp_buf_str
{
    int*    gr1;
    int*    msp;
    int*    lr0;
    int*    lr1;
} *jmp_buf;
The setjmp() routine is very simple. It is listed below to assist with the understanding of the longjmp() routine. It is important to be aware that setjmp(),
longjmp(), SPILL and FILL handlers, along with the signal trampoline code (see
section 2.5.3) form a matched set of routines. Their operation is interdependent. Any
change to one may require changes to the others to ensure proper system operation.
_setjmp:
        store   0,0,gr1,lr2     ;lr2 points to buffer
                                ;copy gr1 to buffer
        add     lr2,lr2,4
        store   0,0,msp,lr2     ;copy msp
        add     lr2,lr2,4
        store   0,0,lr0,lr2     ;copy lr0
        add     lr2,lr2,4
        store   0,0,lr1,lr2     ;copy lr1
        jmpi    lr0             ;return
        const   gr96,0
When longjmp() is called it is passed a pointer to an environment buffer which
was initialized with a previous setjmp() call. The longjmp() call does not return directly. It does return, but as the corresponding setjmp() call which established the buffer data.
The longjmp() return–as–setjmp() can be distinguished from a setjmp() return as
itself, because the longjmp() appears as a setjmp() return with a non–zero value. In
fact the value parameter passed to longjmp() becomes the setjmp() return value. A
C language outline for the longjmp() routine is shown below:
void longjmp(env, value)
jmp_buf env;
int     value;
{
    gr1 = env->gr1;
    lr2addr = env->gr1 + 8;
    msp = env->msp;
    /* saved lr1 is invalid if saved lr2address > rfb */
    if (lr2addr > rfb) {
        /*
         * None of the registers are useful.
         * Set rfb to lr2address-512 & rab to rfb-512
         * the FILL assert will take care of filling
         */
        lr1 = env->lr1;
        rab = lr2addr - WindowSize;
        rfb = lr2addr;
    }
    lr0 = env->lr0;
    if (rfb < lr1)
        raise V_FILL;
    return value;
}
The actual longjmp() routine code, shown below, is written in assembly language. This is because the sequence of modifying the register stack support registers
is very important. An interrupt could occur during the longjmp() operation. That interrupt may require a C language interrupt handler to run. The signal trampoline code
is required to understand all the possible register stack conditions, and fix–up the
stack support registers to enable further C procedure calls to be made.
_longjmp:
        load    0,0,tav,lr2     ;gr1 = env->gr1
        add     gr97,lr2,4      ;gr97 now points to msp
        cpeq    gr96,lr3,0      ;test return "value", it must
        srl     gr96,gr96,31    ; be non zero
        or      gr96,lr3,gr96   ;gr96 has return value
        add     gr1,tav,0       ; gr1 = env->gr1;
        add     tav,tav,8       ;lr2address = env->gr1+8
        load    0,0,msp,gr97    ;msp = env->msp
        cpleu   gr99,tav,rfb    ;if (lr2address > rfb)
        jmpt    gr99,$1         ;{
        add     gr97,gr97,4     ;gr97 points to lr0
        add     gr98,gr97,4     ;gr98 points to lr1
        load    0,0,lr1,gr98    ;lr1 = value from jmpbuf
        sub     gr99,rfb,rab    ;gr99 has WindowSize
        sub     rab,tav,gr99    ;rab = lr2address-WindowSize
        add     rfb,tav,0       ;rfb = lr2address
$1:                             ;}
        load    0,0,lr0,gr97    ;lr0 = env->lr0
        jmpi    lr0             ;return
        asgeu   V_FILL,rfb,lr1  ;if (rfb < lr1) raise V_FILL;
                                ; may fill from rfb to lr1
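A short usage sketch illustrates the return-value convention described above: setjmp() returns zero when called directly, and returns the (forced non-zero) longjmp() value when control is passed back through the saved environment.

    #include <setjmp.h>
    #include <stdio.h>

    static jmp_buf env;

    static void fail(int code)
    {
        longjmp(env, code);        /* unwinds to the setjmp() below */
    }

    int main()
    {
        int code = setjmp(env);    /* 0 on the direct call */

        if (code == 0) {
            printf("normal path\n");
            fail(5);               /* does not return here */
        } else {
            printf("recovered, code=%d\n", code);   /* code == 5 */
        }
        return 0;
    }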
2.4.3 Support Libraries
The GNU tool chain is supported with a single library, libc.a. However the High
C 29K tool chain is supported with a range of library options. It is best to use the compiler driver, hc29, to select the appropriate library. This avoids having to master the
library naming rules and build linker command files.
The GNU libraries do not support word–sized–access–only memory systems.
Originally, the Am29000 processor could not support byte–sized accesses and all
memory accesses were performed on word sized objects. This required read–
modify–write access sequences to manipulate byte sized objects located in memory.
Because all current 29K processors support byte–size access directly, there is no need
to have specialized libraries for accessing bytes. However, the High C 29K tool chain
still ships the old libraries to support existing (pre–Rev D, 1990) Am29000 processors.
The hc29 driver normally links with three libraries: the ANSI standard C support library (libs*.lib), the IEEE floating–point routine library (libieee.lib), and the
HIF system call interface library (libos.lib). There are actually eight ANSI libraries.
The driver selects the appropriate library depending on the selected switches. The
reason for so many libraries is due to the support of the old word–only memory systems, the option to talk with an Am29027 coprocessor directly, and finally, the option
to select Am29050 processor optimized code.
The ANSI library includes transcendental routines (sin(), cos(), etc.) which
were developed by Kulus Inc. These routines are generally faster than the transcendental routines developed by QTC Inc., which were at one time shipped with High C
29K. The QTC transcendentals are still supplied as the libq*.lib libraries, and must
now be explicitly linked. The Kulus transcendentals also have the advantage that
they support both double and single floating–point precision. The routines are named
slightly differently, and the compiler automatically selects the correct routine depending on parameter type. The GNU libraries (version 2.1) include the QTC transcendental routines.
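For example, simply declaring the operands and result as float rather than double is enough for the compiler to substitute the single-precision entry point. The sketch below only illustrates the selection mechanism described above; it relies on the High C 29K behavior just mentioned.

    #include <math.h>

    double d_angle = 0.5, d_result;
    float  f_angle = 0.5f, f_result;

    void transcendental_demo()
    {
        d_result = sin(d_angle);   /* double-precision routine selected   */
        f_result = sin(f_angle);   /* single-precision routine selected
                                      by the compiler from the operand type */
    }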
Most 29K processors do not support floating–point instructions directly (see
section 3.1.7). When a non–implemented floating–point instruction is encountered,
the processor takes a trap, and operating system routines emulate the operation in
trapware code. If a system has an Am29027 floating–point coprocessor available,
then the trapware can make use of the coprocessor to achieve faster instruction
emulation. This is generally five times faster than software based emulation. Keeping the presence of the Am29027 coprocessor hidden in operating system support
trapware enables application programs to be easily moved between systems with
and without a coprocessor.
However, an additional (about two times) speed–up can be achieved by application programs talking to the Am29027 coprocessor directly, rather than via trapware.
When the High C 29K compiler is used with the “–29027” or “–f027” switches, inline
code is produced for floating–point operations which directly access the coprocessor.
Unfortunately the compiled code can not be run on a system which has no coprocessor. The ANSI standard C support libraries also support inline Am29027 coprocessor
access with the libs*7.lib library. When instructed to produce direct coprocessor access code, the compiler also instructs the linker to use this library in place of the standard library, libs*0.lib.
The Am29050 processor supports integer multiply directly in hardware rather
than via trapware. It also supports integer divide via converting operands to floating–
point before dividing and converting back to integer. The High C 29K compiler performs integer multiply and divide by using transparent helper routines (see section
3.7); this is faster than the trapware method used by the GNU compiler. When the
High C 29K compiler is used with the “–29050” switch, and the GNU compiler with
the “–m29050” switch, code optimized for use on an Am29050 processor is
used. This code may not run on other 29K family members, as the Am29050 processor has some additional instructions (see sections 3.1.6 and 3.1.7).
2.5
C LANGUAGE INTERRUPT HANDLERS
Embedded application code developers typically have to deal with interrupts
from peripheral devices requiring attention. As with general code development there
is a desire to deal with interrupts using C language code rather than assembly language code. Compared to CISC type processors, which generally do not have a register stack, this is a little more difficult to achieve with the 29K family. In addition, 29K
processors do not have microcode to automatically save their interrupted context.
The interrupt architecture of a 29K processor is very flexible and is dealt with in detail in Chapter 4. This section presents two useful techniques enabling C language
code to be used for interrupts supported by a HIF conforming operating system.
The characteristics of the C handler function are important in determining the
steps which must be taken before the handler can execute. It is desirable that the C
handler run in Freeze mode because this will reduce the overhead costs. These costs
are incurred because interrupts may occur at times when the processor is operating in
a condition not suitable for immediately commencing interrupt processing. Most of
these overheads are concerned with register stack support and are described in detail
in section 4.4. This section deals with establishing an interrupt handler which can run
in Freeze mode. The following section 2.5.3 deals with all other types of C language
interrupt handlers.
A C language interrupt handler qualifies for Freeze mode execution if it meets
a number of criteria:
It is a small leaf routine which does not attempt to lower the register stack
pointer. This means that, should the interrupt have occurred during a critical
stage in register stack management, the stack need not be brought to a valid
condition.
Floating–point instructions not directly supported by the processor are not used.
Many members of the 29K family emulate floating–point instructions in
software (see Chapter 3).
Instructions which may result in a trap are not used. All interrupts and traps are
disabled while in Freeze mode. This means the Memory Management Unit
cannot be used for memory access protection and address translation.
The handler’s execution is short. Because the handler is to be run in Freeze mode
its execution time will add to the system interrupt latency.
The handler does not attempt to execute LOADM and STOREM instructions
while in Freeze mode. When a performance gain can be had, the High C 29K
compiler will use these instructions to move blocks of data; this does not
typically happen with short Freeze mode interrupt handlers. However, the High
C 29K compiler supports the _LOADM_STOREM pragma which can be used
to turn off or on (default) the use of LOADM and STOREM instructions.
Transparent procedure calls are not used (see section 3.7). They typically
require the support of indirect pointers, which are not temporarily saved by the
code presented in this section. A sketch of a handler which meets these criteria
is shown below.
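As an illustration only, the routine below is the kind of small leaf handler that satisfies the criteria: it makes no calls, lowers no stack pointers, uses no floating-point, and needs only a couple of global registers. The device register address and bit meaning are hypothetical.

    /* Illustration only: a qualifying Freeze mode leaf handler.
       The status register address is hypothetical.              */
    volatile unsigned long *status_reg = (unsigned long *)0x90000000;
    volatile unsigned long  event_count;

    void count_handler()
    {
        unsigned long status = *status_reg;   /* reading clears the request */

        if (status & 1)
            event_count++;                    /* record the event */
    }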
The methods shown in this and the following section rely on application code
running with physical addressing; or if the Memory Management Unit is used to perform address translation, then virtual addresses are mapped directly to physical addresses. This is because the macros used to install the Freeze Mode trap handlers are
used to generate code in User mode and thus operate with User mode address values.
However, Freeze mode code runs in Supervisor mode with address translation turned
off.
The Metaware High C 29K and GCC compilers prior to release 3.2 have no C
language extension to aid with interrupt handling. Release 3.2, or newer, supports the
key word _Interrupt as a procedure return type. Use of this C language extension
results in additional tag data (see section 3.6) preceding the interrupt handler routine.
Without the interrupt tag data, the only way to identify if a handler routine qualifies
for the above Freeze mode handler status is to compile it with the “–S” option and
examine the assembly language code. Alternatively, handler routines which make
function calls can be immediately eliminated as unsuitable for operation in Freeze
mode. Examining the assembly language code would enable the nregs value used in
the following code to be determined. Small leaf routines operate with global registers
only. Starting with gr96, nregs is the number of global registers used by a C leaf handler routine.
The interrupt_handler macro defined below can be used to install a C level
interrupt handler which is called upon when the appropriate trap or interrupt occurs. The code is written in assembly language because it must use a carefully crafted
instruction sequence; the first part of which uses the HIF settrap service to install, in
the processor vector table, the address ($1) which will be vectored to when the interrupt occurs. The necessary code is written as a macro rather than a procedure call because the second part of the macro contains the start of the actual interrupt handler
code. This code, starting at address $1, is unique to each interrupt and can not be
shared. Note, the code makes use of push and pop macro instructions to transfer data
between registers and the memory stack. These assembly macros are described in
section 3.3.1.
        .reg    it0,gr64                ;freeze mode interrupt
        .reg    it1,gr65                ;temporary registers

        ;install interrupt handler
        .macro  interrupt_handler, trap_number, C_handler, nregs
        sub     gr1,gr1,4*4             ;get lr0-lr3 space
        asgeu   V_SPILL,gr1,rab         ;check for stack spill
        add     lr1,gr121,0             ;save gr121
        add     lr0,gr96,0              ;save gr96
        const   gr121,290               ;HIF 2.0 SETTRAP service
        const   lr2,trap_number         ;trap number, macro parameter
        const   lr3,$1                  ;trap handler address
        consth  lr3,$1
        asneq   69,gr1,gr1              ;HIF service request
        add     gr121,lr1,0             ;restore gr121
        add     gr96,lr0,0              ;restore gr96
        add     gr1,gr1,4*4             ;restore stack
        jmp     $2                      ;macro code finished
        asleu   V_FILL,lr1,rfb          ;check for stack fill
$1:
        push    msp,lr0                 ;start of Interrupt handler
        pushsr  msp,it1,ipa             ;save special reg. ipa
        const   it0,nregs-2             ;number of regs. to save
        const   it1,96<<2               ;starting with gr96
$3:
        mtsr    ipa,it1
        add     it1,it1,1<<2            ;increment ipa
        sub     msp,msp,4               ;decrement stack pointer
        jmpfdec it0,$3                  ;save global registers
        store   0,0,gr0,msp             ;
        const   lr0,C_handler           ;call C level handler
        consth  lr0,C_handler
        calli   lr0,lr0                 ;
        nop                             ;
        const   it0,nregs-2             ;number of global registers
        const   it1,(96+nregs-1)<<2
$4:
        mtsr    ipa,it1
        load    0,0,gr0,msp             ;restore global register
        sub     it1,it1,1<<2            ;decrement ipa
        jmpfdec it0,$4
        add     msp,msp,4               ;increment stack pointer
        popsr   ipa,it0,msp             ;restore ipa
        pop     lr0,msp                 ;restore lr0
        iret
$2:
        .endm
Because the C level handler is intended to run in Freeze mode, there is very little
code before the required handler, C_handler, is called. Registers lr0 and IPA are
saved on the memory stack before they are temporarily used. Then the required number of global registers (nregs) starting with gr96 are also saved on the stack. The programmer must determine the nregs value by examining the handler routine assembly
code.
The interrupt_handler macro must be used in an assembly language module.
Alternatively, a C language compiler extension can be used. The High C 29K compiler supports an extension which enables assembly code to be directly inserted into C
code modules. This enables a C macro to be defined which will call upon the assembly language macro code. The example code below shows the C macro definition.
#define interrupt_handler(trap_number, C_handler, nregs) \
/* int trap_number; \
   void (*C_handler)(); \
   int nregs; */ \
_ASM(" interrupt_handler "#trap_number","#C_handler","#nregs);
Alternatively the C macro could contain the assembly macro code directly. Using the technique shown, C modules which use the macro must be first compiled with
the “–S” option; this results in an assembly language output file. The assembly language file (.s file) is then assembled with an include file which contains the macro
definition. Note, C modules which use the macro must use the _ASM(“assembly–
string”) C extension to include the assembly language macro file (shown below) for
its later use by the assembler. The GCC compiler supports the asm(“assembly–
string”) C extension which achieves the same result as the High C 29K _ASM(“assembly–string”) extension.
_ASM(" .include \"interrupt_macros.h\"");
/* int2_handler uses 8 regs. and is called
   when hardware trap number 18 occurs */
interrupt_handler(18,int2_handler,8);
2.5.1 An Interrupt Context Cache with High C 29K
The interrupt_handler macro code, described in the previous section, prepares the processor to handle a C language interrupt handler which can operate within
the processor Freeze mode restrictions. The code saves the interrupted processor
context onto the current memory stack position before calling the C handler.
The interrupt_cache macro shown below can be used in place of the previously
described macro. Its use is also restricted to preparing the processor to handle a C
level handler which meets the Freeze mode execution criteria. However, its operation
is considerably faster due to the use of an Interrupt Context Cache. Section 4.3.9 describes context caching in more detail. A cache is used here only to save sufficient
context to enable a non–interruptable C level handler to execute.
The cache is implemented using operating system registers gr64–gr80. These
global registers are considered operating system temporaries, at least gr64–gr79 are
(also known as it0–it3 and kt0–kt11). Register gr80 (known as ks0) is generally used
to hold operating system static data (see section 3.3). Processors which do not directly support floating–point operations contain instruction emulation software (trapware) which normally uses registers in the gr64–gr79 range to support instruction
emulation. Given application code can perform a floating–point operation at any
time, an operating system cannot assume these registers’ contents remain static after
application code has run. Because, for this reason and others, floating–point trapware normally runs with interrupts turned off, it is convenient to use these registers for interrupted
context caching.
The interrupt_handler macro uses a loop to preserve the global registers used
by the Freeze mode interrupt handler. The interrupt_cache macro unrolls the loop
and uses register–to–register operations rather than register–to–memory. In place of
traversing the loop nregs times, the nregs value is used to determine the required
entry point to the unrolled code. These techniques reduce interrupt preparation times
and interrupt latency.
        .macro  interrupt_cache, trap_number, C_handler, nregs
        sub     gr1,gr1,4*4             ;get lr0-lr3 space
        asgeu   V_SPILL,gr1,rab         ;check for stack spill
        add     lr1,gr121,0             ;save gr121
        add     lr0,gr96,0              ;save gr96
        const   gr121,290               ;HIF 2.0 SETTRAP service
        const   lr2,trap_number         ;trap number, macro parameter
        const   lr3,$1-(nregs*4)        ;trap handler address
        consth  lr3,$1-(nregs*4)
        asneq   69,gr1,gr1              ;HIF service request
        add     gr121,lr1,0             ;restore gr121
        add     gr96,lr0,0              ;restore gr96
        add     gr1,gr1,4*4             ;restore stack
        jmp     $2                      ;macro code finished
        asleu   V_FILL,lr1,rfb          ;check for stack fill

        add     gr80,gr111,0            ;save gr111 to interrupt
        add     gr79,gr110,0            ; context cache
        add     gr78,gr109,0            ;the interrupt handler starts
        add     gr77,gr108,0            ;somewhere in this code range
        add     gr76,gr107,0            ;depending on the register
        add     gr75,gr106,0            ;usage of the C level code
        add     gr74,gr105,0
        add     gr73,gr104,0
        add     gr72,gr103,0
        add     gr71,gr102,0
        add     gr70,gr101,0
        add     gr69,gr100,0
        add     gr68,gr99,0
        add     gr67,gr98,0
        add     gr66,gr97,0             ;save gr97
        add     gr64,lr0,0              ;save lr0
$1:
        const   lr0,C_handler           ;call C level handler
        consth  lr0,C_handler
        calli   lr0,lr0                 ;
        add     gr65,gr96,0             ;save gr96

        jmp     $2-4-(nregs*4)          ;determine registers used
        add     lr0,gr64,0              ;restore lr0
        add     gr111,gr80,0            ;restore gr111 from interrupt
        add     gr110,gr79,0            ; context cache
        add     gr109,gr78,0
        add     gr108,gr77,0
        add     gr107,gr76,0
        add     gr106,gr75,0
        add     gr105,gr74,0
        add     gr104,gr73,0
        add     gr103,gr72,0
        add     gr102,gr71,0
        add     gr101,gr70,0
        add     gr100,gr69,0
        add     gr99,gr68,0
        add     gr98,gr67,0
        add     gr97,gr66,0
        add     gr96,gr65,0             ;restore gr96
        iret
$2:
        .endm
2.5.2 An Interrupt Context Cache with GNU
The previous section presented interrupt context caching when using the Metaware High C 29K compiler. Global register assignment with the Free Software
Foundation compiler, GCC, is very different from High C 29K. Global registers
gr96–gr111 are little used, except for return values. GCC has very frugal global register usage: it mainly uses global registers gr116–gr120. This affects the interrupt
preparation code required for Freeze mode C level handlers. High C 29K uses global
registers in the gr96–gr111 range as temporaries before starting to use gr116–gr120.
The reduced use of global registers might make GCC a better choice for building
Freeze mode C–level interrupt handlers.
The assembler, as29, supplied with the GCC compiler chain does not support
macros directly. But it is possible to use the C preprocessor, CPP, to do macro instruction expansion. The interrupt_cache macro shown below demonstrates the use of
CPP with 29K assembly code. The macro is used to install a C handler for the selected
trap_number. The early part of the macro code requests the HIF settrap service be
used to insert the interrupt handler address into the processor vector table. The actual
address inserted depends on the register usage of the C handler.
The handler must be examined to determine the registers used. Parameter nregs
is used to specify the number of registers used in the gr116–gr120 range. The handler
preparation code saves the necessary global registers in an interrupt context cache
before calling the C code. Global registers gr96–gr111 are not saved in the cache, as it
is likely that they are not used by the handler –– it certainly has no return value.
The context cache is formed with global registers gr64–gr80. Registers
gr64–gr79 are used by floating–point emulation routines, and hence their contents
are available for use between floating–point trap instructions. This assumes that the
trapware runs with interrupts turned off which is normally the case. For more details
see section 2.5. Saving the registers used by the handler in this way is much faster
than pushing the registers onto an off–chip memory stack.
#define interrupt_cache(trap_number, C_handler, nregs)\
;start of interrupt_cache macro, nregs must be >=1 _CR_\
  nop                              ;delay slot protection _CR_\
  sub     gr1,gr1,4*4              ;get lr0-lr3 space _CR_\
  asgeu   V_SPILL,gr1,rab          ;check for stack spill _CR_\
  add     lr1,gr121,0              ;save gr121 _CR_\
  add     lr0,gr96,0               ;save gr96 _CR_\
  const   gr121,290                ;HIF 2.0 SETTRAP service _CR_\
  const   lr2,trap_number          ;trap number, macro parameter_CR_\
  const   lr3,cache_##trap_number-(nregs*4)  ;handler adds._CR_\
  consth  lr3,cache_##trap_number-(nregs*4)  ; _CR_\
  asneq   69,gr1,gr1               ;HIF service request _CR_\
  add     gr121,lr1,0              ;restore gr121 _CR_\
  add     gr96,lr0,0               ;restore gr96 _CR_\
  add     gr1,gr1,4*4              ;restore stack _CR_\
  jmp     cache_end_##trap_number  ;install code finished _CR_\
  asleu   V_FILL,lr1,rfb           ;check for stack fill _CR_\
                                   ;START of interrupt handler code_CR_\
  add     gr70,gr120,0             ;save gr120 _CR_\
  add     gr69,gr119,0             ;save gr119 _CR_\
  add     gr68,gr118,0             ;save gr118 _CR_\
  add     gr67,gr117,0             ;save gr117 _CR_\
  add     gr64,lr0,0               ;save lr0 _CR_\
cache_##trap_number:               ;gr96-gr111 not saved in cache _CR_\
  const   lr0,C_handler            ;call C-level handler_CR_\
  consth  lr0,C_handler            ; _CR_\
  calli   lr0,lr0                  ;call C level handler _CR_\
  add     gr66,gr116,0             ;save gr116 _CR_\
  jmp     L2-4-(nregs*4)           ;determine registers used _CR_\
  add     lr0,gr64,0               ;restore lr0 _CR_\
  add     gr120,gr70,0             ;restore gr120 from cache _CR_\
  add     gr119,gr69,0             ; _CR_\
  add     gr118,gr68,0             ; _CR_\
  add     gr117,gr67,0             ; _CR_\
  add     gr116,gr66,0             ; _CR_\
  iret                             ; _CR_\
cache_end_##trap_number:           ;end of interrupt cache macro _CR_
The code example below shows how the macro can be invoked. The routine
install_handlers() is written in assembly code. It includes a macro for a C level interrupt handler, int2_handler(), assigned to 29K interrupt INTR2. The C level handler
was examined and found to be a qualifying leaf routine using only two global registers.
        .text
        .extern _int2_handler
        .global _install_handlers
_install_handlers:
        sub     gr1,gr1,2*4             ;prologue not really needed
        asgeu   V_SPILL,gr1,gr126       ;lower stack pointer
        interrupt_cache(18,_int2_handler,2) ;macro example
        add     gr1,gr1,2*4             ;raise stack pointer
        constn  gr96,-1                 ;return TRUE value
        jmpi    lr0                     ;return
        asleu   V_FILL,lr1,rfb          ;procedure epilogue
The C preprocessor is invoked with the app shell script program shown below.
This is a convenient way of directing CPP to process an assembly program source
file. The use of CPP has one problem; macros are expanded into long lines. The carriage returns in the macro source file do not appear in the expanded code. To reinsert
the carriage returns and make the assembly code lines compatible with assembler
syntax, each assembly line in the macro is marked with the token _CR_. The UNIX
stream editor, sed, is then used to replace the _CR_ with a carriage return.
#!/bin/sh
#start of app shell script
#example, "app file_in.s"
prams=$*
tmp=/tmp/expand.$$
cpp -P $prams > $tmp      #invoke CPP
sed 's/_CR_/\
/g' $tmp
rm $tmp
2.5.3 Using Signals to Deal with Interrupts
Some C language interrupt handlers will not be able to run in Freeze mode, because (as described in section 2.5) they are unsuitable leaf routines, or are not leaf
routines and thus require use of the register stack. In this case the signal trampoline
code described in section 4.4 and Appendix B must be used. The trampoline code is
called by the Freeze mode interrupt handler after critical registers have been saved on
the memory stack. The C language handler is called by the trampoline code after the
register stack is prepared for further use. Note that interrupts can occur at times when
the register stack condition is not immediately usable by a C language handler.
The signal mechanism works by registering a signal handler function address
for use when a particular signal number occurs. This is done with the library routine
signal(). Signals are normally generated by abnormal events and the signal() routine
allows a user supplied routine to be registered which the operating system will call to deal
with the event. The signal() function uses the signal HIF service to supply the address
of a library routine (sigcode) which will be called for all signals generated. (Note,
only the signal, settrap and sigret–type subset of HIF services are required.) The library routine is then responsible for calling the appropriate C handler from a table of
handlers indexed by the signal number. When signal() is used a table entry is
constructed for the indicated signal.
int  signal(sig_number, func)
int  sig_number;
void (*func)(sig_number);
A signal can only be generated for an interrupt if the code vectored to by the interrupt calls the shared library routine known as the trampoline code. It is known as
the trampoline code because signals bounce from this code to the registered signal
handler. To ensure that the trampoline code is called when an interrupt occurs, the
Freeze mode code vectored to by the interrupt must pass execution to the trampoline
code, indicating the signal which has occurred. The signal_associate macro shown
below can be used to install the Freeze Mode code and associate a signal number with
the interrupt or trap hardware number.
        .reg    it0,gr64                ;freeze mode interrupt
        .reg    it1,gr65                ;temporary registers

        .macro  signal_associate, trap_number, sig_number
        sub     gr1,gr1,4*4             ;get lr0-lr3 space
        asgeu   V_SPILL,gr1,rab         ;check for stack spill
        add     lr1,gr121,0             ;save gr121
        add     lr0,gr96,0              ;save gr96
        const   gr121,290               ;HIF 2.0 SETTRAP service
        const   lr2,trap_number         ;trap number, macro parameter
        const   lr3,$1                  ;trap handler address
        consth  lr3,$1
        asneq   69,gr1,gr1              ;HIF service request
        add     gr121,lr1,0             ;restore gr121
        add     gr96,lr0,0              ;restore gr96
        add     gr1,gr1,4*4             ;restore stack
        jmp     $2                      ;macro code finished
        asleu   V_FILL,lr1,rfb          ;check for stack fill
$1:
        const   it0,sig_number          ;start of Interrupt handler
        push    msp,it0                 ;push sig_number on
        push    msp,gr1                 ; interrupt context frame.
        push    msp,rab                 ;use push macro,
        const   it0,512                 ; see section 3.3.1
        sub     rab,rfb,it0             ;set rab = rfb-WindowSize

        pushsr  msp,it0,pc0             ;push special registers
        pushsr  msp,it0,pc1
        pushsr  msp,it0,pc2
        pushsr  msp,it0,cha
        pushsr  msp,it0,chd
        pushsr  msp,it0,chc
        pushsr  msp,it0,alu
        pushsr  msp,it0,ops
        push    msp,tav                 ;push tav (gr121)

                                        ; set DI in CPS, but timer
        mfsr    it0,ops                 ; interrupts are still on
        or      it0,it0,0x2             ;this disables interrupts
        mtsr    ops,it0                 ; in signal trampoline code

        mtsrim  chc,0                   ;the trampoline code is
        const   it1,RegSigHand          ; described in section 4.4.1
        consth  it1,RegSigHand          ;RegSigHand is a library
        load    0,0,it1,it1             ; variable
        cpeq    it0,it1,0               ;test for no handler
        jmpt    it0,SigDfl              ;jump if no handler(s)
        add     it0,it1,4               ;it1 has trampoline address
        mtsr    pc1,it1                 ;IRET to signal
        mtsr    pc0,it0                 ; trampoline code
        iret                            ;
$2:
        .endm                           ;end of macro
The above macro code does not disable the interrupt from the requesting device.
This is necessary for external interrupts; reenabling interrupts without having first
removed the current interrupt request will cause the interrupt to be taken again immediately. The code sets the DI–bit in the OPS special register; this means interrupts will remain disabled in the trampoline code. It will be the responsibility of the C
language handler to clear the interrupt request; this may require accessing an off–
chip peripheral device. An alternative is to clear the interrupt request in the above
Freeze mode code and not set the DI–bit in the OPS. This would enable the trampoline and C language handler code to execute with interrupts enabled. This would lead
to the possibility of nested signal events; however, the signal trampoline code is able
to deal with such complex events.
With the example signal_associate macro the trampoline code and the C handler run in the processor mode at the time the interrupt occurred. They can be forced
to run in Supervisor mode by setting the Supervisor mode bit (SM–bit) when OR–ing
the DI–bit in the OPS register. Supervisor mode may be required to enable accessing
of the interrupting device when disabling the interrupt request. The address translation bits (PA and PD) may also be set at this time to turn off virtual addressing during
interrupt processing. To make these changes to the above example code, the value
0x72 should be OR–ed with the OPS register rather than the 0x2 value shown.
As described in section 2.5, a C language macro can be used to access the assembly level macro instruction. When the High C 29K compiler is being used, the definition of the C macro is shown below. Users of the GCC compiler should replace the
_ASM() call with the equivalent asm() C language extension.
#define signal_associate(trap_number, sig_number) \
/* int trap_number; \
   int sig_number; */ \
_ASM(" signal_associate "#trap_number","#sig_number);
When the macro is used to associate a signal number with a processor trap number, it is also necessary to supply the address of the C language signal handler called
when the signal occurs. The following example associates trap number 18 (floating–
point exception) with signal number 8. This signal is known to UNIX and HIF users
as SIGFPE; when it occurs, the C handler sigfpe_handler is called.
_ASM(" .include \"interrupt_macros.h\"");
signal_associate(18,8);          /* trap 18, F-P */
signal(8,sigfpe_handler);        /* signal 8 handler */
C language signal handlers are free of many of the restrictions which apply to
Freeze mode interrupt handlers. However, the HIF specification still restricts their
operation to some extent. Signal handlers can only use HIF services with service
numbers greater than 256. This means that printf() cannot be used. The reason for
this is HIF services below 256 are not reentrant, and a signal may occur while just
such a HIF service request was being processed. Return from the signal handler must
be via one of the signal return services: sigdfl, sigret, sigrep or sigskp. If the signal
handler simply returns, the trampoline code will issue a sigdfl service request on behalf of the signal handler.
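A conforming signal handler therefore looks something like the sketch below; it avoids HIF services numbered below 256 (so no printf()) and finishes with an explicit signal return. The flag variable is simply an illustrative way of passing work back to the main-line code.

    extern void _sigret();            /* HIF signal return service */

    volatile int uart_event;          /* examined by main-line code */

    void sig_handler(sig_number)
    int sig_number;
    {
        /* no printf() here: HIF services below 256 are not reentrant */
        uart_event = sig_number;      /* record the event for later   */
        _sigret();                    /* return via a signal return
                                         service                       */
    }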
A single C level signal routine can be used to dispatch several C language interrupt handlers. Section 4.3.12 describes an interrupt queuing method, where interrupt
handlers run in Freeze mode and build an interrupt descriptor (bead). Each descriptor
is placed in a list (string of beads) and a Dispatcher routine is used to process descriptors. The signal handling method described above can be used to register a C level
Dispatcher routine. This results in C level context being prepared only once and the
Dispatcher routine calling the appropriate C handler.
2.5.4 Interrupt Tag Words
Release 3.2, or newer, of the High C 29K compiler supports routines of defined
return–type _Interrupt. The use of this C language extension causes an additional tag
word to be placed ahead of the procedure code. Section 3.6 explains the format of the
interrupt tag in detail. Note, to use the _Interrupt key word with a PC hosted compiler,
it is necessary to add the line “#define _Interrupt _CC(_INTERRUPT)” to file
29k/bin/hc29.pro. The _Interrupt key word, in conjunction with some simple support
routines presented below, makes optimizing of interrupt preparation very easy. By
examining the interrupt tag word it is possible to determine if a handler routine
qualifies for Freeze mode execution or will require HIF signal processing. The
example code shown below is for a HIF conforming operating system. However, a
different operating system may choose to respond to interrupt tag information in a
somewhat different manner. Only the signal, settrap and sigret–type subset of HIF
services are required. A different operating system may have equivalent support
services.
When an interrupt occurs, it would be possible to examine the interrupt tag word
of the assigned handler. However, this would be an overhead encountered at each
interrupt and it would increase interrupt processing time. It is better to examine the
tag word at interrupt installation time and determine the necessary interrupt
preparation code. Preceding sections have described interrupt context caching and
signal processing. It would be possible to examine the tag word in more detail than
the following example code undertakes. This would produce additional intermediate
performance points in the spectrum of interrupt preparation code; context caching
being the fastest point on the spectrum and signal processing the slowest. However,
signal processing can always be used and is free of the restrictions which apply to the
use of interrupt context caching, and context caching is frequently adequate. This
renders the chosen spectrum points as most practicable.
The example below shows two C language interrupt handler routines. The first,
f_handler(), looks like it will qualify for Freeze mode execution. The key word
_Interrupt has been used during the procedure definition and this will result in an
interrupt tag word. The second function, s_handler(), is not a leaf procedure and this
fact will be reported in its interrupt tag word. Being a non leaf routine, it will be
processed as a signal handler. Such routines receive a single parameter –– the signal
number.
extern int   sig_sig;           /* defined in library code */
extern int   sig_intr0;         /* signal for INTR0 */
extern char  *UART_p;           /* pointer to UART */

char recv_data[50];

_Interrupt f_handler()          /* Freeze mode handler */
{
    static int count=0;
    recv_data[count]=*UART_p;   /* read from UART */
    if(recv_data[count]=='\n')  /* test for end */
    {
        sig_sig=sig_intr0;      /* signal #30 */
        count=0;                /* reset counter */
    }
    else count++;
}

_Interrupt s_handler(sig_number)  /* signal handler */
int sig_number;                   /* for sig_intr0 */
{
    printf("in signal handler number=%d\n", sig_number);
    printf("received string=%s\n", recv_data);
    _sigret();
}
Most programmers do not want to become concerned with the details of
interrupt preparation. They simply wish to call an operating system service routine
which will examine the interrupt tag word and select the appropriate interrupt
preparation code. The library procedure, interrupt(), shown below, is just such a
service routine. The operation of this procedure will be described a little later. The
procedure ensures that either interrupt context caching or signal processing will be
applied for the supplied handler and selected 29K trap number. The interrupt()
routine must be executed during the system initialization stage, before traps or
interrupts are expected to occur. An example initialization sequence is shown below:
int sig_intr0;

main()
{
    . . .
    sig_intr0=interrupt(16,s_handler);   /* INTR0 */
    interrupt(17,f_handler);             /* INTR1 */
    . . .
Interrupt tag words are dealt with at interrupt installation time, and not at
program assembly or link time. There have been discussions about adding a compiler
pragma option to High C 29K release 4.0 which, when switched on, will cause a
macro instruction to be placed in output assembly code rather than an interrupt tag
word. This requires that the relevant C code be compiled, then assembled with an
include file which defines the replacement code for the interrupt macro instruction.
This technique has some disadvantages, principally that the macro must understand
the capabilities of the operating system and how it intends dealing with interrupts. In
particular: whether the interrupt should be processed in User or Supervisor mode, with
interrupts enabled or disabled, with or without address translation, and so on. Use of a
macro does have the advantage that the interrupt preparation code appears in the final
linked program image. The tag word method relies on the preparation code being
built in heap memory during interrupt installation. The preparation code is built in
consultation with the operating system and is thus more portable between different
operating systems which support somewhat different interrupt processing
environments.
Fortunately for the user, library routines are responsible for installing the
executable code into heap memory. The code itself is similar to the code of previous
sections. A portion of the code is linked into text space of the program image. At
installation time the code is copied into heap memory and further optimized. The
code sequence below is used for interrupt context caching.
        .text
        .align  4
        .global _interrupt_cache_code
        .global _interrupt_cache_end
        .extern _sig_sig
_interrupt_cache_code:
        add     gr80,gr111,0    ;save gr111 to interrupt
        add     gr79,gr110,0    ; context cache
        add     gr78,gr109,0
        add     gr77,gr108,0
        add     gr76,gr107,0
        add     gr75,gr106,0
        add     gr74,gr105,0
        add     gr73,gr104,0
        add     gr72,gr103,0
        add     gr71,gr102,0
        add     gr70,gr101,0
        add     gr69,gr100,0
        add     gr68,gr99,0
        add     gr67,gr98,0
        add     gr66,gr97,0
        add     gr64,lr0,0      ;save lr0
                                ;
        const   lr0,0           ;const and consth
        consth  lr0,0           ; need to be modified
        calli   lr0,lr0         ;call C handler
        add     gr65,gr96,0     ;
        add     gr111,gr80,0    ;restore gr111 from
        add     gr110,gr79,0    ; context cache
        add     gr109,gr78,0
        add     gr108,gr77,0
        add     gr107,gr76,0
        add     gr106,gr75,0
        add     gr105,gr74,0
        add     gr104,gr73,0
        add     gr103,gr72,0
        add     gr102,gr71,0
        add     gr101,gr70,0
        add     gr100,gr69,0
        add     gr99,gr68,0
        add     gr98,gr67,0
        add     gr97,gr66,0
        add     gr96,gr65,0
        add     lr0,gr64,0      ;restore lr0
        const   gr64,_sig_sig   ;the following eight
        consth  gr64,_sig_sig   ; instructions deal with
        load    0,0,gr64,gr64   ; sig_sig testing
        cpeq    gr65,gr64,0     ;test for zero
        const   gr66,_signal_associate_code + 4 ;no relative
        consth  gr66,_signal_associate_code + 4 ; addressing
        jmpfi   gr65,gr66       ;jump if sig_sig != 0
        nop
        iret
_interrupt_cache_end:
The context cache code is a little different from the code shown in section 2.5.1.
Eight extra instructions have been added to support a memory variable called sig_sig.
It supports a very useful technique of two–level interrupt processing. Predominantly
a Freeze mode interrupt handler is used alone. However, when the sig_sig variable is
set to a signal number before the Freeze mode handler completes, a signal is
generated causing a second signal handler routine to execute after the Freeze mode
handler returns.
Examine the example handler routines. When interrupt INTR1 (vector 17)
occurs, the Freeze mode handler, f_handler(), normally accesses the interrupting
UART and receives a character; it then increments the count value and returns. The
process of accessing the UART causes the interrupt request to be deasserted. This
results in a very fast interrupt handler written in C. However, when the received
character is a ‘\n’ (carriage return), sig_sig is set to the signal number allocated to the
INTR0 signal handler. This causes the s_handler() to be executed in response to the
signal. The occurrence of interrupt INTR0 (vector 16) also causes s_handler() to
execute as a signal handler associated with trap 16. The example interrupt() service
automatically allocates signal numbers, starting with SIGUSR1, to handler routines
which are to be processed via signal trampoline code. The interrupt() procedure
returns the selected signal number; zero is returned if a Freeze mode handler is
selected. An interrupt handler can be restricted to fast Freeze mode processing and
when more extensive processing is required the sig_sig variable can be set and a
second level handler invoked. (Note, the s_handler() routine calls the printf()
library routine. This is not permitted with the High C 29K library routines as the
printf() routine is not reentrant. However, the use of printf() helps illustrate the
two–stage principle.)
To perform signal processing, the trampoline code shown below is placed in
heap memory. It is similar to the code of section 2.5.3. Interrupts are disabled before
signal processing commences; this is not necessary if a Freeze mode handler has
already requested the interrupting device to deassert the interrupt request. If a Freeze
mode handler is always executed before the associated signal handler, the three
indicated lines of code can be removed. Doing so enables nested interrupts to be
supported without explicitly reenabling interrupts in the signal handler. However, if
the signal preparation code is called directly from the interrupt vector table (via an
interrupting device) then interrupts must be initially disabled by the shared signal
preparation code.
        .global _signal_associate_code
        .global _signal_associate_end
        .reg    it0,gr64
        .reg    it1,gr65
_signal_associate_code:
        const   gr64,0          ;signal number in it0
                                ;push signal number on stack
        const   it1,0           ;clear sig_sig variable
        const   it2,_sig_sig    ; need not do this if signal
        consth  it2,_sig_sig    ; handler is called directly
        store   0,0,it1,it2     ; from vector table entry
        push    msp,it0         ;interrupt context stack
        push    msp,gr1         ;use 'push' macro,
        push    msp,rab         ; see section 3.3.1
        const   it0,512
        sub     rab,rfb,it0     ;set rab=rfb-WindowSize
                                ;
        pushsr  msp,it0,pc0     ;push special registers
        pushsr  msp,it0,pc1
        pushsr  msp,it0,pc2
        pushsr  msp,it0,cha
        pushsr  msp,it0,chd
        pushsr  msp,it0,chc
        pushsr  msp,it0,alu
        pushsr  msp,it0,ops
        push    msp,tav         ;push tav (gr121)
                                ; set DI in CPS, but timer
        mfsr    it0,ops         ; interrupts are still on
        or      it0,it0,0x2     ;this disables interrupts
        mtsr    ops,it0         ; in signal trampoline code
                                ;
        mtsrim  chc,0           ;the trampoline code is
        const   it1,RegSigHand  ; described in section 4.4.1
        consth  it1,RegSigHand  ;RegSigHand is a library
        load    0,0,it1,it1     ; variable
        add     it0,it1,4       ;IRET to signal
        mtsr    pc1,it1         ; trampoline code
        mtsr    pc0,it0
        iret
_signal_associate_end:
All of the code presented is available from AMD in source and linkable library
form. Now to the interrupt() install routine itself: it is listed below and is
surprisingly short. Its operation is simple: it examines the interrupt tag word of the
supplied C handler. Note that it assumes that the interrupt procedure has a one–word
procedure tag preceded by an interrupt tag word –– this is almost always the case. If
no interrupt tag is found then signal handling is selected. This would be the case if the
handler routine had been built with the GNU compiler which does not currently
support interrupt tag words.
Depending on the tag word, Freeze mode or signal processing is selected and the
appropriate code copied into heap memory space. For Freeze mode processing, only
the required number of global registers is saved in the interrupt context cache
(gr64–gr80). Additionally, only the minimum required amount of heap memory is
requested via the HIF–library malloc() service. After copying code into heap
memory, some instruction patching is performed to correctly reference the assigned
C handler. Finally the HIF–library _settrap() service is used to assign a trap handler
address to the requested trap number. Note that when the copying is performed, the
heap memory is only written to and never read. This will prevent the code being
placed into on–chip data cache, as 29K family data caches only allocate cache blocks
on data reads. Avoiding caching of the relevant heap memory ensures that the new
code will be fetched from instruction memory (see sections 5.13.2 and 5.14.4).
int interrupt(trap_number, C_handler)
int trap_number;
void (*C_handler)();
{
    int *tag_p=(int*)C_handler - 2;
    int ret_sig;                         /* return signal value */
    int tag_word = *tag_p;
    int glob_regs, *trap_handler, i, size;
    _LOCK volatile int *code_p, *mem_p;  /* see section 5.14.1 */

    if((tag_word & 0xff000000) != 0)
        tag_word = -1;                   /* no interrupt tag word */
    if((tag_word & 0xffff00ff)==0)
    {
        glob_regs=(tag_word & 0xff00) >> 8;
        code_p=&interrupt_cache_code;
        size=4*((2*glob_regs)+6+8);      /* 8 for sig_sig code support */
        mem_p=(int*)malloc(size);        /* get heap memory */
        trap_handler=mem_p;
        code_p=code_p+(16-glob_regs);    /* find start of save */
        for(i=1; i <=glob_regs; i++)     /* copy save code */
            *mem_p++=*code_p++;
        /* supply address to CONST instruction */
        *mem_p++ =*code_p++ | ( (((int)C_handler&0xff00)<<8)
                              + ((int)C_handler&0xff) );
        /* supply address to CONSTH inst. */
        *mem_p++ =*code_p++ | ( (((int)C_handler&0xff000000) >>8)
                              + (((int)C_handler&0xff0000) >>16) );
        for(i=1; i <=(4-2); i++)         /* copy the call code */
            *mem_p++=*code_p++;
        code_p=code_p + (16-glob_regs);  /* find start of restore */
        for(i=1;i<=(glob_regs+2+8);i++)  /* copy restore code */
            *mem_p++=*code_p++;          /* 8 required for sig_sig code support */
        ret_sig=0;
    }
    else
    {   static int sig_number=30;        /* SIGUSR1 in SigEntry */
        ret_sig=sig_number;
        signal(sig_number,C_handler);
        size=4*(signal_associate_end-signal_associate_code);
        mem_p=(int*)malloc(size);        /* get heap memory */
        trap_handler=mem_p;
        code_p=signal_associate_code;
        /* supply sig_number to CONST instruction */
        *mem_p++ = *code_p++ | ( ((sig_number&0xff00)<<8)
                               + (sig_number&0xff) );
        for(i=1; i <=(size-1); i++)      /* copy rest of code */
            *mem_p++ = *code_p++;
        sig_number++;
    }
    _settrap(trap_number,(void(*)())trap_handler); /* HIF service */
    return ret_sig;
}
Users of the above code who do not want to make use of the two–level
interrupt processing supported via the sig_sig variable can remove the extra eight
instructions in the interrupt_cache_code and should also remove the extra code
copying indicated in the listing above. This will slightly improve interrupt
processing times for Freeze mode handlers. Other users who want to further exploit
the two–level approach can assign a single handler for all second level interrupt
processing; this is discussed in section 4.3.12. Interrupts are first dealt with in Freeze
mode by building an interrupt descriptor bead; then a second level Dispatcher routine
is responsible for popping beads off a string and calling the assigned second level
handler. Alternatively, a signal dispatcher technique can be applied; section 2.5.6
describes the method. Signal dispatching can be achieved entirely with support
routines accessible from C level –– this makes signal dispatching particularly
attractive.
If the interrupt() routine is used extensively for multiple signal handlers, it will
be necessary to increase the size of the signal handler array (SigEntry, described in
Appendix B). The array is normally large enough to hold signal numbers 1 through
32. Unless signal allocation is started at a number less than SIGUSR1 (30), there is
normally only sufficient space for two signal handlers.
2.5.5 Overloaded INTR3
The microcontroller members of the 29K family contain several on–chip peripherals. These peripherals can generate interrupts which are all directed to the core
29K processor via interrupt line INTR3. This causes overloading of the INTR3 vector handler. When a microcontroller receives an INTR3 interrupt, it must examine its
Interrupt Control Register (ICR) to determine the source of the interrupt. This requires all interrupts to initially be processed via the INTR3 vector handler. The
INTR3 handler must call the appropriate device service routine. The service routine
first clears the interrupt request by writing a one to the correct bit in the ICR; it can
then reenable interrupts and service the current request. The general format of the
ICR is shown in Figure 2-3.
[Figure diagram not reproduced: the ICR contains reserved and processor-specific fields plus individual interrupt request bits, including IOPI, VDI, and DMA0I; the assigned vector numbers fall in the range 220 through 237.]
Figure 2-3. 29K Microcontroller Interrupt Control Register
The overloading of INTR3 adds complexity to the task of building a Freeze
mode interrupt handler for each interrupting device. The problem can be resolved by
allocating a region of the vector table for use by the interrupting devices sharing
INTR3. The code below (intr3) reserves 33 vector table entries starting with vector
220 –– these vectors are not normally used by a 29K based system. When an INTR3
occurs, the code examines the ICR register with a Count Leading Zeros (CLZ)
instruction. This assigns the highest priority to the bit (interrupt) which is left–most
in the ICR register. The value produced by the CLZ instruction is added to the base
value of 220 and the result used to obtain the correct vector entry from the vector
table.
        .reg    it0,gr64
        .reg    it1,gr65
        .global _intr3
_intr3:
        const   it0,0x80000028  ;Interrupt Control register address
        consth  it0,0x80000028
        load    0,0,it1,it0
        clz     it1,it1         ;priority order index
        ;
        const   it0,220         ;base vector number
        add     it1,it0,it1     ;add offset to base
        sll     it1,it1,2       ;convert to word offset
        mfsr    it0,vab         ;get vector table base
        add     it1,it0,it1     ;get handler address
        load    0,0,it1,it1     ; from vector table
        jmpi    it1             ;jump to interrupt
        nop                     ; handler
The intr3 code completes by jumping to the selected vector handler. Note, the
code makes use of the four interrupt temporary registers (it0–it3, gr64–gr67) normally reserved by an operating system for interrupt handling. Each peripheral device
which can set a bit in the ICR register is assigned a unique vector handler number in
the range 220–252. If no bit is found to be set in the ICR register, vector 252 is selected.
Using the intr3 code, it is possible to use the previously described interrupt()
library routine to deal with interrupts. A call to the HIF library procedure
_settrap() is required to install the intr3 code for INTR3 handling. After this is done,
the interrupt() routine can be used to assign interrupt handlers for the selected vector
numbers in the 220–252 range, as shown below.
main()
{
. . .
_settrap(19,intr3); /* INTR3 handler */
interrupt(224,VD_handler); /* VDI */
interrupt(237,DMA_handler); /* DMA0I */
. . .
The intr3 code does not clear the interrupt request in the ICR register; this is left
to the specific interrupt handler. However, clearing the ICR bit alone is insufficient for level-sensitive I/O
port interrupts. In this case the interrupting condition must first be removed at the corresponding PIO signal; once the condition is removed, the clearing of the
bit in the ICR register becomes redundant.
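As a sketch of the idea only, a Freeze mode handler for a level-sensitive PIO input might look as shown below. It borrows the PIO support macros, the vector number 228 and the PIO0-to-PIO15 wiring from the example of section 2.5.9; a real handler would remove whatever condition is driving the pin.

_Interrupt PIO15_level_handler()    /* Freeze mode handler (sketch only) */
{
    PIO_out_m(0,0);                 /* first remove the interrupting condition
                                       at its source (here PIO0 drives PIO15) */
    ICR_clear_m(228);               /* redundant for a level-sensitive input,
                                       but harmless */
}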
AMD evaluation boards are normally supplied with a combined OS–boot operating system and MiniMON29K DebugCore in the ROM memory. When the target
processor is a microcontroller, the message system used to support OS–boot and DebugCore communication with MonTIP typically uses an on–chip UART. All interrupts generated by on–chip
peripherals are handled via INTR3. MiniMON29K bundle 3.0,
and earlier versions, are built using OS–boot version 2.0. This version of OS–boot
assigned the INTR3 handler for MiniMON29K's sole use. This makes it very difficult to add further interrupt handlers for on–chip peripherals. The problem can be
solved by applying the code shown below.
main()
{
    void    (*V_minimon)();
    . . .
    V_minimon=(void(*)())_settrap(19,intr3);    /* INTR3 */
    _settrap(220+24,V_minimon);                 /* RXSI interrupt */
    _settrap(220+25,V_minimon);                 /* RXDI interrupt */
    _settrap(220+26,V_minimon);                 /* TXDI interrupt */
    . . .
The _settrap() HIF library service is used to install a new INTR3 handler; the
address of the old handler is returned. The MiniMON29K code is used to process
three peripheral interrupts via INTR3. The _settrap() service is used again to separately reinstall the handlers required by MiniMON29K. New interrupt handlers for
additional on–chip peripherals can then be installed with further calls to _settrap() or
interrupt().
2.5.6 A Signal Dispatcher
Release 3.2, or newer, of the High C 29K compiler supports routines of defined
return–type _Interrupt. The use of this non–standard keyword was explained in section 2.5.4. The keyword is used here to support a signal dispatcher. The method relies
on interrupts being processed in two stages. The first stage operates in Freeze mode.
It responds immediately to the interrupting device, captures any critical data and
deactivates the interrupt request. The second stage, if required, takes the form of a
signal handler. The sig_sig variable is used by the Freeze mode handler to request
signal handler execution. A signal handler cannot be executed without a Freeze mode
handler making the necessary request. This is because interrupts are not disabled in
the signal_associate code, so a Freeze mode handler must first deactivate the interrupting device.
The technique has a number of benefits: It is seldom necessary to disable interrupts for long periods, as asynchronous interrupt events are only initially dealt with in
Freeze mode. This reduces interrupt latency. Signal handlers can be queued for processing when nested interrupts would occur. This eliminates the need to prepare a C
level interrupt processing environment for each interrupt. A C level environment
need only be built for a Signal Dispatcher routine. The Signal Dispatcher is then responsible for calling the appropriate signal handler for all signals generated by interrupts. The Signal Dispatcher is started in response to the first signal occurring. The
dispatcher causes execution of the first signal handler, then determines if other signal
handlers have been requested while the current signal handler was executing. The
dispatcher continues to process signals until there are none remaining. At this point
the original interrupted state is restored; the original state is the processor state at
the time the first interrupt in the sequence occurred. The first interrupt occurred while
no interrupt or signal handler was being processed, and it caused the Signal Dispatcher to start execution.
Avoiding nested interrupts, other than for Freeze mode handling, is most
beneficial when large numbers of multiply nested interrupts are expected, and the
cost of preparing C level context for interrupt processing is high. For example, using
interrupt context caching, the processor can be prepared for Freeze mode interrupt
processing in 1–2 microseconds (at 16 MHz). However, with an Am29205
microcontroller which has a 16–bit off–chip bus and relatively slow DRAM memory,
as much as 40 microseconds can be required to prepare the processor for a C level
signal handler. In this case it is best to prepare for C level interrupt handling only
once. Nested interrupts are avoided by adding new interrupts to a stack when further
interrupts occur while the Signal Dispatcher is executing.
As explained in section 2.5.4, a signal handler is requested when the sig_sig
variable is set by a Freeze mode handler. Previous example code showed how the signal handler could be started immediately after the Freeze mode handler completes.
The alternative code, shown here, causes the signal to be added to a stack of signals
waiting for processing. Both methods can coexist: setting the sig_sig variable to a
signal number ORed with 0x80000000 indicates the signal should be queued (if necessary) rather than processed immediately.
First, examine the two interrupt handlers shown below. The Freeze mode handlers, uart_handler() and timer_handler(), use the _Interrupt keyword. They both
qualify for Freeze mode execution. The UART handler is similar to the example of
section 2.5.4. However, this time sig_sig is set to the signal number held in uart_sig
and the most significant bit is also set when the end of a string is encountered. This
will request the associated signal handler to be placed in the signal queue.
_Interrupt uart_handler()           /* Freeze mode interrupt handler */
{
    static int count=0;
    recv[count]=*uart_p;            /* access UART */
    if(recv[count]=='\n')           /* end of string ? */
    {
        count=0;
        sig_sig=0x80000000 | uart_sig;
    }
    else
        count++;
}
The Freeze mode timer handler reloads the on–chip timer control registers for
repeated timer operation. Each timer interrupt causes the tick variable to be incremented, and when a tick value of 100 is reached, signal timer_sig is added to the signal
queue. The Freeze mode handler is written in C. However, it needs to access special
register 9 (TMR, the Timer Reload register) which is not normally accessible from C.
The problem is overcome by using the C language extensions _mfsr() and _mtsr().
They enable special registers to be read and written.
_Interrupt timer_handler()          /* Freeze mode interrupt handler */
{
    static int tick=0;
    int        tmr;
    tmr=_mfsr(9);                   /* read TMR special register */
    tmr=tmr&(-1-0x02000000);        /* clear IN bit-field */
    _mtsr(9,tmr);                   /* write to TMR register */
    if(tick++ > 100)
    {
        tick=0;
        sig_sig=0x80000000 | timer_sig;
    }
}
The second stage of the UART interrupt handler, the signal handler, is shown
below. Note, the sig_uart() routine calls the printf() library routine. This is not permitted with the High C 29K library routines as the printf() routine is not reentrant.
However, the use of printf() helps illustrate the operating principle. Normally a signal handler must use the _sigret() signal return service, at least with a HIF conforming operating system. However, when a signal handler is called from the dispatcher,
the signal return service should not be used. It is possible to determine if the dispatcher is in use by testing the variable dispatcher_running; it becomes non zero when
the dispatcher is in use. However, testing the dispatcher_running flag may be insufficient in some circumstances. It is possible that the Signal Dispatcher is running and
initiating signal handler execution. At the same time a signal handler may be requested directly by, say, an interrupt. The Dispatcher is running but the directly requested signal handler must use the signal return service.
Signals need not always be queued for processing. If a very high priority (immediate) interrupt occurs and further signal processing is necessary, sig_sig should
be simply set to the signal number. In this case it is important that the signal handler
use the _sigret() service.
_Interrupt sig_uart(sig_number)     /* signal handler for UART */
int     sig_number;
{
    printf("in signal handler number=%d\n", sig_number);
    printf("received string=%s\n", recv_data);
    if(!dispatcher_running)_sigret();   /* no _sigret() service call */
}
The Signal Dispatcher is implemented as a signal handler. The dispatcher removes signals from a stack and calls the appropriate signal handler. When a signal
handler is requested by a Freeze mode handler, and the Signal Dispatcher is not currently executing, the requested signal (sig_sig value) is not immediately started. In its
place the dispatcher signal handler is initiated.
Shown in Figure 2-4 is an example of the Signal Dispatcher in operation. The
first interrupt is from the UART. It is dealt with entirely in Freeze mode; the sig_sig
variable is not set, so no second stage signal handler is requested. The UART generates the second interrupt. This time the sig_sig variable is set to request the sig_uart()
signal handler be started by the Signal Dispatcher. While the second stage handler is
running, a timer interrupt occurs. The Freeze mode timer handler requests a second
stage handler be started by the Signal Dispatcher. When the dispatcher completes the
currently executing second stage handler (the UART's), it initiates the timer's second
stage handler. When there are no remaining second stage handler requests, the dispatcher issues a signal–return service request. The original program's context is then
restored and its execution restarted.

[Figure 2-4 traces the sequence: the first UART interrupt is handled entirely by the Freeze mode uart_handler() (sig_sig=0); the second UART interrupt sets sig_sig=uart_sig|0x8..., so the signal_associate code pushes uart_sig on the signal stack and starts the Signal Dispatcher (signal dispatcher_sig), which pops uart_sig and calls the second stage handler sig_uart(); a timer interrupt during sig_uart() pushes timer_sig, which the dispatcher pops and passes to sig_timer(); the dispatcher then calls the _sigret() signal return service and the main program resumes.]
Figure 2-4. Processing Interrupts with a Signal Dispatcher
Integer variable dispatcher_sig holds the signal number used by the Signal
Dispatcher. The user must select a signal number. The example code below uses 7
(SIGEMT). The signal() library routine is used to assign procedure sig_dispatcher()
to signal number 7. Before signal and trap handlers can be installed, the procedures
and variables defined in the support libraries must be declared external, as shown below.
extern void     signal(int, void (*handler)(int));
extern int      interrupt(int, _Interrupt (*C_handler)(int));
extern void     sig_dispatcher(int);
extern int      sig_sig;
extern int      dispatcher_sig;     /* dispatcher signal number */
int             uart_sig, timer_sig;
During program initialization, after main() is called, the handler routines and
other support services must be installed. The code below uses the interrupt() library
routine to install a signal handler (sig_timer() not shown) for timer interrupt support.
The call to interrupt() returns the allocated signal number, and this number is saved
in timer_sig. The timer Freeze mode handler uses the timer_sig value to request the
timer signal handler be executed. The interrupt() service is called a second time to
install the Freeze mode handler, timer_handler(). The second call causes vector
table entry 14 to be reassigned the address of the Freeze mode handler.
The UART handlers are installed using an alternative method. The signal() service rather than the interrupt() service is used to assign the sig_uart() signal handler
to signal number SIGUSR2. This method allows a specific signal number to be selected, rather than using the interrupt() service to allocate the next available signal
number. Most users will prefer the previous method, which selects signal numbers automatically.
main()
{
    _settrap(218,_disable);
    _settrap(217,_enable);
    _settrap(216,_timer_init);
    dispatcher_sig=7;                   /* select signal number for dispatcher */
    signal(dispatcher_sig,sig_dispatcher);
    timer_sig=interrupt(14,sig_timer);  /* install signal handler */
    if(interrupt(14,timer_handler))     /* install Freeze handler */
        printf("ERROR: Freeze mode handler not built for trap 14\n");
    if(interrupt(15,uart_handler))      /* install Freeze handler */
        printf("ERROR: Freeze mode handler not built for trap 15\n");
    uart_sig=SIGUSR2;                   /* choose a signal number */
    signal(uart_sig,sig_uart);          /* install signal handler */
    timer_init();                       /* initialize the timer */
    . . .
The sig_dispatcher() requires two helper services, disable() and enable().
They are described in more detail shortly, but are simply used to enable and disable
processor interrupts. The _settrap() service is used above to install trap handlers for
these services. The timer_init() routine is not required by the Signal Dispatcher. It is
included to simply make the example more complete.
The interrupt() routine uses the signal_associate method of assigning a trap
number to a signal handler. The code was described in section 2.5.4, but a few small
additions are required to support the Signal Dispatcher. The modified code is shown
below. There are two changes: Interrupts are not disabled (requiring that a Freeze
mode handler always be used for interrupt deactivation). A call to queue_sig is made
if the most significant bit of the signal number is set.
        .reg    it0,gr64
        .reg    it1,gr65
_signal_associate_code:
        const   gr64,0                  ;signal number in it0
        ;
        const   it1,0
        const   it2,_sig_sig
        consth  it2,_sig_sig
        store   0,0,it1,it2             ;clear sig_sig variable
                                        ; need not do this if signal
                                        ; handler is called directly
                                        ; from vector table entry
        ;
        const   it1,_queue_sig
        consth  it1,_queue_sig
        jmpti   gr64,it1                ;jump if msb-bit set
        nop
        push    msp,it0                 ;push signal number on stack
        push    msp,gr1                 ;interrupt context stack
        push    msp,rab                 ;use 'push' macro
                                        ; see section 3.3.1
        const   it0,512
        sub     rab,rfb,it0             ;set rab=rfb-WindowSize
        ;
        pushsr  msp,it0,pc0             ;push special registers
        pushsr  msp,it0,pc1
        pushsr  msp,it0,pc2
        pushsr  msp,it0,cha
        pushsr  msp,it0,chd
        pushsr  msp,it0,chc
        pushsr  msp,it0,alu
        pushsr  msp,it0,ops
        push    msp,tav                 ;push tav (gr121)
        ;
        mtsrim  chc,0                   ;the trampoline code is
                                        ; described in section 4.4.1
        const   it1,RegSigHand          ;RegSigHand is a library
        consth  it1,RegSigHand          ; variable
        load    0,0,it1,it1
        add     it0,it1,4
        mtsr    pc1,it1
        mtsr    pc0,it0
        iret                            ;IRET to signal
                                        ; trampoline code
_signal_associate_end:
The queue_sig routine is shown below. It pushes the signal number on a signal
stack and advances a stack pointer, sig_stack_p. The operation is performed while
still in Freeze mode and is therefore not interruptible. The variable
dispatcher_running is then tested. If it is set to TRUE, an interrupt return (IRET)
instruction is issued. If it is FALSE, the dispatcher_sig number is obtained and the
signal_associate code continues the process of starting a signal handler; but the
signal number now in use will cause the Signal Dispatcher (sig_dispatcher()) to
commence execution.
_queue_sig:                             ;jump here from signal_associate
        and     it0,it0,0xff            ;clear msb-bit
        ;
        const   it3,_sig_stack_p
        consth  it3,_sig_stack_p
        load    0,0,it2,it3             ;get pointer value
        store   0,0,it0,it2             ;store signal number on stack
        add     it2,it2,4               ;advance stack pointer
        store   0,0,it2,it3
        ;
        const   it3,_dispatcher_running
        consth  it3,_dispatcher_running
        load    0,0,it2,it3             ;test if signal dispatcher
        cpeq    it2,it2,0               ; already running
        jmpt    it2,_start_dispatcher
        constn  it2,-1
        iret                            ;IRET if running
        ;
_start_dispatcher:
        store   0,0,it2,it3             ;set dispatcher_running
        const   it3,_dispatcher_sig
        consth  it3,_dispatcher_sig
        const   it1,_signal_associate_code+5*4
        consth  it1,_signal_associate_code+5*4
        jmpi    it1                     ;start signal handler
        load    0,0,it0,it3             ;signal=dispatcher_sig
Before the signal_associate code starts the dispatcher signal handler, the
dispatcher_running variable is set to TRUE. Until this variable is cleared, further
signal requests (if the most significant bit of the signal number is set) will be added to
the queue of signals waiting for processing. The process of adding a signal to the
queue is kept simple –– a stack is used. Reducing the amount of code required results
in less interrupt latency as the queue_sig code runs in Freeze mode.
The signal handler which performs the dispatch operation is written in C. The
code is shown below. It requires some simple assembly–level support routines which
are described later. Having the code in C is a convenience as it simplifies the task of
modifying the code. Modification is necessary if a different execution schedule is
required for signals waiting in the signal stack. The variables used in the Signal
Dispatcher routine are described below. Note that sig_stack_p and
dispatcher_running are defined volatile. This is because they may also be modified
by a Freeze mode interrupt handler. It is important that the C compiler be informed
about this possibility. Otherwise it may perform optimizations which prevent value
changes from being observed, such as holding a copy of sig_stack_p in a
register, and repeatedly accessing the register.
extern void (*_SigEntry[])(int);    /* defined in HIF libraries */
int     sig_stack[200];             /* signal stack */
volatile int *sig_stack_p=&sig_stack[0];
volatile int dispatcher_running;    /* dispatcher running flag */
int     sig_sig=0;
int     dispatcher_sig;             /* dispatcher signal number */
The example sig_dispatcher() is relatively simple but effective. It first disables
interrupts before removing all current signals from the stack. The signal values are
transferred to an array. Interrupts are then reenabled. Performing this procedure with
interrupts disabled prevents other signals being added to the stack while the transfer
operation is being performed. Signals are transferred to the array in the reverse order
they were placed on the stack. This ensures that signals are ultimately processed in
the order in which they were originally requested.
No attempt is made to apply a priority order to pending signals. The necessary
code can be applied after the signals have been removed from the stack (a sketch of
one possible ordering step follows the discussion of the dispatcher code below). Performing
priority ordering at C level rather than in the queue_sig code has the advantage of
reducing interrupt latency. Given the fast operation of 29K processors the need to
priority order signals is low, as a signal request is not likely to be kept waiting
very long.
void sig_dispatcher(sig)            /* Signal Dispatcher */
int     sig;
{
    int         cps;
    int         *sig_p;             /* array of signals */
    static int  sig_array[20];      /* needing processing */
    cps=disable(0x20002);           /* set DI and TD in CPS */
    for(;;)
    {
        sig_p=&sig_array[0];        /* mark array empty */
        while(sig_stack_p!=&sig_stack[0])   /* remove signals from stack */
        {
            --sig_stack_p;
            *sig_p++=*(int*)sig_stack_p;    /* copy from stack to array */
        }
        enable(cps);                /* enable interrupts */
        while(sig_p!=&sig_array[0]) /* process signals removed from stack */
        {
            --sig_p;
            (*_SigEntry[(*sig_p)-1])(*sig_p);
        }
        cps=disable(0x20002);       /* disable interrupts */
        if(sig_stack_p==&sig_stack[0])      /* stack empty ? */
            break;
    }
    dispatcher_running=0;
    enable(cps);                    /* enable interrupts */
    _sigret();                      /* _sigret() HIF service */
}                                   /* would restore interrupted cps */
When there are no remaining signals to process, the dispatcher requests the
_sigret() signal–return service. The dispatcher_running flag is also cleared. It is
possible that a new signal arrives just after the flag is cleared but before the
signal–return service is complete; this cannot be avoided. It does not create a
problem (other than a loss of performance) as a new dispatcher signal handler is
simply started.
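If priority ordering of pending signals is required, the copied array can be ordered before the processing loop. The sketch below is one possible approach; sig_priority() is an assumed user-supplied function returning a numeric rank for each signal number (larger meaning more urgent), and is not part of the library code. Sorting the array in ascending rank leaves the most urgent signal at the highest index, which is processed first because sig_dispatcher() walks the array from its end downward.

extern int sig_priority(int);           /* assumed user-supplied ranking */

static void order_signals(first, last)  /* order sig_array entries */
int     *first, *last;                  /* the range [first, last) */
{
    int *i, *j, sig;
    for(i=first+1; i<last; i++)         /* simple insertion sort */
    {
        sig=*i;
        for(j=i; j>first && sig_priority(*(j-1))>sig_priority(sig); j--)
            *j=*(j-1);
        *j=sig;
    }
}

The dispatcher would call order_signals(&sig_array[0],sig_p) after its enable(cps) call and before its processing loop, so that the sort runs with interrupts enabled.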
The disable() and enable() support routines are used by the Signal Dispatcher to
enable and disable interrupts around critical code. Interrupts are disabled by setting
the DI bit in the Current Processor Status (CPS) register. Freeze mode handler routines can use the _mtsr() C language extensions to modify special registers. However,
they cannot be used by the dispatcher routine as it may be operating in User mode.
Accessing special register space from User mode would create a protection violation.
The problem is overcome by installing assembly level trap handlers which perform
the necessary special register access. The _settrap() HIF service is used to install the
trap handlers. Further assembly routines are required to assert the selected trap number. The code for disable() is shown below.
        .global _disable
_disable:
        asneq   218,gr96,gr96
        jmpi    lr0
        nop

        .global __disable
__disable:
        mfsr    gr96,ops        ;read OPS
        or      gr97,gr96,lr2   ;OR with passed value
        mtsr    ops,gr97        ;copy OPS to CPS
        iret
A single parameter is passed to disable(). The parameter is ORed with the CPS
value and the CPS register updated. Since this task is performed by a trap handler, the
OPS register is actually modified; and OPS is copied to CPS when an IRET is issued.
There is a further advantage of using a trap handler to perform the task; the operation
cannot be interrupted –– the read/modify/write of the CPS is atomic.
The code for enable() is similar to disable(). In this case the passed parameter is
simply copied to the CPS. The disable() routine returns the CPS value before modifying it. The value is normally stored and later passed to enable(). In this way only the
DI and TD (timer disable) bits in the CPS are temporarily modified. Note, older
members of the 29K family do not support the TD bit. In this case, the interrupt disable code used by the example sig_dispatcher() routine does not prevent interrupts
being generated by the on–chip timer. The problem can be resolved by modifying
the __enable and __disable assembly routines to clear and set the interrupt enable
(IE) bit in the Timer Reload register.
        .global _enable
_enable:
        asneq   217,gr96,gr96
        jmpi    lr0
        nop

        .global __enable
__enable:
        mtsr    ops,lr2
        iret
2.5.7 Minimizing Interrupt Latency
Interrupt latency is minimized if interrupts are never disabled. In practice this
can be difficult to achieve. There are often critical code sections which must run to
completion without interruption. Traditionally, interrupts are disabled before entering such code sections and reenabled upon critical section completion. However, if
interrupts are processed using the two–stage method described in section 2.5.6 (A
Signal Dispatcher), interrupt disabling can be eliminated.
In place of disabling interrupts around a critical code section, the Signal Dispatcher is effectively disabled. This allows a first stage interrupt handler to interrupt a
critical code section. Second stage interrupt handlers (signal handlers) are not initiated during the critical code section, as the Dispatcher is disabled. It is easy to disable
the Dispatcher by simply indicating that it is already active; this prevents the activation which would otherwise occur when a first stage handler completes (if the sig_sig variable is set). First stage handlers execute in Freeze mode and can be configured to
avoid access to the shared resource being accessed by critical code sections. The example below shows how the Signal Dispatcher can be deactivated around a critical
code section.
#define TRUE    -1
#define FALSE   0

    . . . interruptible code
    dispatcher_running=TRUE;        /* disable Dispatcher */
    . . . start of critical code section
                                    /* code only interruptible by
                                       Freeze mode handler */
    . . . end of critical code section
    dispatcher_running=FALSE;       /* enable Dispatcher */
    if(sig_stack_p!=&sig_stack[0])
        _sendsig(dispatcher_sig);
    . . .
When the critical task has been accomplished, the Dispatcher is reenabled by
clearing the dispatcher_running variable. It is possible that one or more signal numbers were pushed on the signal stack during the critical section. Hence, when the Dispatcher is reenabled, the signal stack must be tested to determine if there are any
pending signals. If there are, then the Signal Dispatcher must be started using the
_sendsig() HIF service.
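The technique can be packaged as a pair of small helper routines, as sketched below (the routine names are illustrative only; the TRUE and FALSE definitions are those given above).

void lock_dispatcher()                  /* defer second stage handlers */
{
    dispatcher_running=TRUE;
}

void unlock_dispatcher()                /* allow second stage handlers again */
{
    dispatcher_running=FALSE;
    if(sig_stack_p!=&sig_stack[0])      /* any signals queued meanwhile ? */
        _sendsig(dispatcher_sig);       /* start the Signal Dispatcher */
}

Critical code sections are then simply bracketed by lock_dispatcher() and unlock_dispatcher() calls.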
The method minimizes the latency in starting a Freeze mode interrupt handler
since their commencement is never disabled –– unless by another Freeze mode handler. The latency in starting a second stage handler is not reduced. Further restrictions
may have to be applied to first stage handlers to disallow access to resources which
must be atomically manipulated within critical code sections –– such as linked–list
data structures.
2.5.8 Signal Processing Without a HIF Operating System
A signal processing technique is recommended for dealing with complex C level interrupt handlers. The previous sections have described in detail how signal processing can be performed. AMD and other tool providers supply the necessary support code which has been well tested and is known to be reliable. However, some developers may select an operating system which does not support the HIF services required by the previous example code. Additionally, many embedded systems are dependent on simple home–made boot–up code, which provides few support services.
A commercial operating system will implement its own interrupt processing
services. It is likely these services will be somewhat based on the signal processing
code described in this book. However, the provided services should be used in preference to the HIF services. In fact, the chosen operating system may not provide any
support for HIF services.
When building simple boot–up and run–time support code for a small
embedded system, it is best to provide the necessary HIF services required for signal
processing. If the boot–up code is based on AMD’s OS–boot product, then all HIF
services will be provided. If OS–boot is not used, it is important that limited HIF
support be included in the developed code. Only the signal, settrap, sysalloc and
sigret–type subset of HIF services are required. A trap handler for HIF trap number
69 should be installed, and the code required to process the HIF service request
installed. Very little code is required and example code can be taken from OS–boot.
2.5.9 An Example Am29200 Interrupt Handler
The following example makes use of the code presented in the previous sections
of this chapter. The Programmable I/O (PIO) port of an Am29200 microcontroller is
configured such that PIO signal–pin PIO0 is an output, and PIO signal–pin PIO15 an
input. The system hardware ensures that the two pins are wired together. A two stage
interrupt handler is assigned to processing interrupts generated by a rising edge on
pin PIO15. By first clearing pin PIO0 and then setting it to one, an interrupt will be
generated.
First, a number of include files must be accessed to declare external data and
procedure type information. Newer versions of file signal.h contain the extern
declarations listed below. Hence, only when using an older signal.h file need the
extern statement be explicitly included.
#include <hif.h>
#include <signal.h>
extern  int interrupt(int, _Interrupt (*C_handler)(int));
extern  int sig_sig;
extern  int dispatcher_sig;
extern  void intr3(void);
extern  void _enable(void);
extern  void _disable(void);
extern  void enable(int);
extern  int disable(int);
extern  void sig_dispatcher(int);
It is best to access the Programmable I/O port via support macros or procedures.
Macros have a speed advantage (unless in–line procedures are used), and below are a
number of macros and support data structures which simplify control of the PIO port.
typedef volatile struct PIO_str         /* PIO class */
{
    unsigned int poct;
    unsigned int pin;
    unsigned int pout;
    unsigned int poen;
} PIO_t;
PIO_t *PIO_p=(PIO_t*)0x800000d0;        /* PIO object */
volatile unsigned int* ICR_p=(unsigned int*)0x80000028;     /* ICR pntr. */

#define PIO_enable_m(port)  PIO_p->poen |= (1 << (port))
#define PIO_disable_m(port) PIO_p->poen &= ~(1 << (port))
#define PIO_rising_m(port) \
    PIO_p->poct |= (0x2 << (2* (port))); \
    PIO_p->poct &= ~(1 << (port));
#define PIO_falling_m(port) \
    PIO_p->poct |= (0x2 << (2* (port))); \
    PIO_p->poct |= (1 << (port));
#define PIO_high_m(port) \
    PIO_p->poct |= (0x1 << (2* (port))); \
    PIO_p->poct |= (1 << (port));
#define PIO_out_m(port, val) \
    { unsigned int tmp = PIO_p->pout; \
      tmp &= ~(1 << (port)); \
      tmp |= (((val) & 1) << (port)); \
      PIO_p->pout = tmp; \
    }
#define ICR_clear_m(vec) *ICR_p |= (1<<(251-(vec)))
Using the _Interrupt keyword, first and second stage interrupt handlers are
defined below for the PIO15 interrupt. No real work is performed by the example
second stage handler, but it does demonstrate how a full–C–context handler can be
reached. The second stage handler does not qualify as a Freeze mode interrupt
handler because it is not a leaf routine.
int     PIO15_sig;              /* signal number allocated to second stage */

_Interrupt PIO15_handler()      /* first stage interrupt handlers */
{
    ICR_clear_m(228);           /* clear interrupt request */
    PIO_out_m(0,0);             /* clear PIO0 port bit */
    sig_sig=0x80000000|PIO15_sig;   /* request second stage */
}

_Interrupt sig_PIO15(sig_number)    /* second stage handlers */
int     sig_number;
{
    printf("Running PIO15 signal handler\n");
}
Before the interrupt mechanism can be put to work, the various support handlers
must be installed as shown below. The program is being developed with the
MiniMON29K DebugCore and this requires that the OS–boot support interrupt
handlers be preserved before the new interrupt handlers are added. The PIO support
macros are then used to establish the correct PIO port operation before an
interrupt is generated by forcing a 0–1 level transition on PIO pin PIO0.
int main()
{
    void (*V_minimon)();
    V_minimon=(void(*)())_settrap(19,intr3);    /* INTR3 */
    _settrap(220+24,V_minimon); /* MiniMON support interrupts */
    _settrap(220+25,V_minimon); /* see section 2.5.5 */
    _settrap(220+26,V_minimon);
    _settrap(218,_disable);     /* signal dispatcher support */
    _settrap(217,_enable);      /* see section 2.5.6 */
    dispatcher_sig=7;           /* signal number for dispatcher */
    signal(dispatcher_sig,sig_dispatcher);
    /* application interrupt handlers for I/O port PIO15 */
    PIO15_sig = interrupt(228,sig_PIO15);   /* second stage */
    if(interrupt(228,PIO15_handler))        /* first stage */
        printf("ERROR installing Freeze mode handler\n");
                                /* configure PIO port operation */
    PIO_p->poct=0;              /* clear control register */
    PIO_enable_m(0);            /* enable PIO0 output */
    PIO_rising_m(15);           /* PIO15 edge sensitive */
    PIO_out_m(0,0);             /* PIO0 = 0 */
    PIO_out_m(0,1);             /* generate an interrupt */
}
Users of the High C 29K tool chain could test the interrupt handling mechanism
without first building the necessary hardware by asserting the assigned trap number
as shown below.
_ASM(" asneq 228,gr1,gr1");     /* test interrupt mechanism */
2.6 SUPPORT UTILITY PROGRAMS
There are a number of important utility programs available to the software developer. These tools are generally available on all development platforms and are
shared by different tool vendors. Most of the programs operate on object files produced by the assembler or linker. All linkable object files and executable files are
maintained in AMD Common Object File Format (COFF). This standard is very
closely based on the AT&T standard used with UNIX System V. Readers wishing to
know more about the details of the format may consult the High C 29K documentation or the AT&T Programmer's Guide for UNIX System V. The coff.h include file,
found on most tool distributions, describes the C language data structures used by the
COFF standard –– often described as the COFF wrappers.
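For readers who have not met the COFF wrappers before, the sketch below shows the general shape of the COFF file header as it is typically declared in coff.h. The field names follow the AT&T System V convention; the exact declarations in a particular 29K tool distribution should be taken from its own coff.h.

struct filehdr                  /* COFF file header (sketch) */
{
    unsigned short  f_magic;    /* magic number identifying the target */
    unsigned short  f_nscns;    /* number of sections */
    long            f_timdat;   /* time and date stamp */
    long            f_symptr;   /* file pointer to the symbol table */
    long            f_nsyms;    /* number of symbol table entries */
    unsigned short  f_opthdr;   /* size of the optional (a.out) header */
    unsigned short  f_flags;    /* flags, such as "relocation stripped" */
};

Section headers, symbol table entries and relocation records are described by further structures in the same file.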
2.6.1 Examining Object Files (Type .o and a.out)
nm29
The nm29 utility program is used to examine the symbol table contained in a
binary COFF file produced by the compiler, assembler or linker. The format is
very much like the UNIX nm utility. Originally nm29 was written to supply
symbol table information to the munch29 utility in support of the AT&T C++
cfront program. A number of command line options have been added to enable
additional information to be printed, such as symbol type and section type.
One useful way to use nm29 is to pipe the output to the sort utility, for example:
“nm29 a.out | sort | more”; each symbol is printed preceded by its value. The sort
utility arranges for symbol table entries to be presented in ascending order of value.
Since most symbols are associated with address labels, this is a useful way to
locate an address relative to its nearest address labels.
munch29
This utility is used with the AT&T C++ preprocessor. This program is known as
cfront and converts C++ programs into C. After the C++ program has been
converted and linked with other modules and libraries, it is examined with
nm29 to determine the names of any static constructor and destructor functions.
The C++ translator builds these functions as necessary and tags their names with
predefined character sequences. The output from nm29 is passed to munch29
which looks for constructor and destructor names. If found, munch29 builds C
procedures which call all the identified object constructors and destructors.
Because the constructor functions must execute before the application main()
program, the original program is relinked with the constructor procedures being
called before main(). The main() entry is replaced with _main(). This also
enables the call to destructor procedures to be made in _main() when main()
returns.
Because G++ is now available for C++ code development (note, G++ is
incorporated into the GCC compiler), there is little use being made of the AT&T
cfront preprocessor. Additionally, MRI and Metaware are expected to shortly
have commercial C++ compilers available.
rdcoff
The rdcoff utility is only available to purchasers of the High C 29K product.
This utility prints the contents of a COFF conforming object file. Each COFF
file section is presented in an appropriate format. For example, text sections are
disassembled. If the symbol table has not been stripped from the COFF file, then
symbol values are shown. The utility is useful for examining COFF header
information, such as the text and data region start addresses. Those using GNU
tools can use the coff and objdump utilities to obtain this information.
coff This utility is a shorthand way of examining COFF files. It reports a summary of
COFF header information, followed by similar reports for each of the sections
found in the object file. The utility is useful for quickly checking the link
mapping of a.out type files; especially when a project is using a number of
different 29K target systems which have different memory system layouts,
requiring different program linkage.
objdump
This utility is supplied with the GNU tool chain. It can be used to examine
selected parts of object files. It has an array of command line options which are
compatible with the UNIX System V utility of the same name. In a similar way
to the rdcoff utility it attempts to format selected information in a meaningful
way.
swaf
This utility is used to produce a General–Purpose ASCII (GPA) symbols file for
use with Hewlett–Packard’s B3740A Software Analyzer tool. This tool enables
a 16500B card cage along with a selection of logic analyzer cards to support
high level software debugging. The swaf utility builds a GPA symbols file from
information extracted from a linked COFF file. When the GPA file is loaded into
the analyzer it is possible to display address values in symbol format rather than,
say, hex based integers. Via a remote computer, the HP16500B can be used to
support execution trace at source level.
mksym
This utility is required to build symbol table information for the UDB debugger.
The UDB debugger does not directly operate with COFF symbol information. A
mksym command is typically placed in a makefile; after the 29K program has
been linked a new symbol table file should be built.
2.6.2 Modifying Object Files
cvcoff
The COFF specification states that object file information is maintained in the
endian of the host processor. This need not be the endian of the target 29K
processor. As described in Chapter 1, 29K processors can run in big– or
little–endian but are almost exclusively used in big–endian format. Endian
refers to which byte position in a word is considered the byte of lowest address.
With big–endian, bytes further left have lower addresses. Machines such as
VAXs and IBM–PCs operate with little–endian; and machines from SUN and
HP tend to operate with big–endian.
What this means to the 29K software developer is that COFF files on, say, a PC
will have little–endian COFF wrappers. And COFF files on, say, a SUN
machine will have big–endian wrappers, regardless of the endianness of the 29K
target code. When object files or libraries containing object files are moved
between host machines of different endianness, the cvcoff utility must be used
to convert the endianness of the COFF wrapper information. The cvcoff utility
can also be used to check the endianness of an object file. Most utility programs
and software development tools expect to operate on object files which are in
host endian; however, there are a few tools which can operate on COFF files of
either host endianness. In practice this reduces the need to use the cvcoff utility.
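As an illustration of what the conversion involves, every field of the COFF wrappers is byte reversed; a 32-bit field swap looks like the sketch below (a conceptual example only, not the cvcoff implementation).

unsigned long swap32(x)         /* reverse the byte order of a 32-bit value */
unsigned long x;
{
    return ((x & 0x000000ff) << 24) |
           ((x & 0x0000ff00) <<  8) |
           ((x & 0x00ff0000) >>  8) |
           ((x & 0xff000000) >> 24);
}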
strpcoff
This utility can be used to remove unnecessary information from a COFF file.
When programs are compiled with the “–g” option, additional symbol
information is added to the COFF file. The strpcoff utility can be used to
remove this information and any other details such as relocation data and
line–number pointers. Typically linkers have an option to automatically strip
this information after linking. (ld29 has the “–s” option.) The COFF file header
information needed for loading a program is not stripped.
2.6.3 Getting a Program into ROM
After a program has been finally linked, and possibly adjusted to deal with any
data initialization problems (see section 2.3.6), it must be transferred into ROM devices. This is part of the typical software development cycle for embedded processor
products. A number of manufacturers make equipment for programming PROM devices. They normally operate with data files which must be appropriately formatted.
Tektronix Hex format and Motorola S3 Records are two of the commonly used file
formats. The coff2hex utility can be used to convert the COFF formatted executable
file produced by the linker into a new file which is correctly formatted for the selected
PROM programmer. If more than one PROM is required to store the program,
coff2hex can be instructed to divide the COFF data into a set of appropriate files. Alternatively, this task can be left to more sophisticated programming equipment. The
utility has a number of command line options; the width and size of PROM devices
can be chosen; alternatively, specific products can be selected by manufacturer part
number.
Chapter 3
Assembly Language Programming
Most developers of software for the 29K family will use a high level language,
such as C, for the majority of code development. This makes sense for a number of
reasons: Using a high level language enables a different processor to be selected at
some future date. The code, if written in a portable way, need only be recompiled for
the new target processor. The ever increasing size of embedded software projects
makes the higher productivity achievable with a high level language attractive. And
additionally, the 29K family has a RISC instruction set which can be efficiently used
by a high level language compiler [Mann et al 1991b].
However, the software developer must resort to the use of assembly code programming in a number of special cases. Because of the relentless efficiency of the
current C language compilers for the 29K, it is difficult for a programmer to out–perform the code generating abilities of a compiler for any reasonably sized program.
For this reason it is best to limit the use of assembly code as much as possible. Some
of the support tasks which do require assembly coding are:
Low–level support routines for interrupts and traps (see Chapter 4).
Operating system support services such as system calls and application–task
context switching (see Chapter 5). Also, taking control of the processor during
the power–up and initialization sequence.
Memory Management Unit trapware (see Chapter 6).
Floating–point and complex integer operation trapware, where the 29K family
member does not support the operation directly in hardware.
High performance versions of critical routines. In some cases it may be possible
to enhance a routine's performance by implementing assembly code short–cuts
not identified by a compiler.
This chapter deals with aspects of assembly level programming. There are some
differences between 29K family members, particularly in the area of on–chip peripherals for microcontrollers. The chapter does not go into details peculiar to individual
family members; for that it is best to study the processor User’s Manual.
The material covered is relevant to all 29K family members.
3.1 INSTRUCTION SET
The Am29000 microprocessor implements 112 instructions. All hardware implemented instructions execute in a single cycle, except for IRET, IRETINV,
LOADM and STOREM. Instruction format was discussed in section 1.11. All
instructions have a fixed 32–bit format, with an 8–bit opcode field and three 8–bit operand fields. Field–C specifies the result operand register (DEST), field–A and field–B
supply the source operands (SRCA and SRCB). Most instructions operate on data
held in global or local registers, and there are no complex addressing modes supported. Field–B, or field–B and field–A combined, can be used to provide 8–bit or
16–bit immediate data for instructions. Access to external memory can only be performed with the LOAD[M] and STORE[M] instructions. There are a number of
instructions, mostly used by operating system code, for accessing the processor special registers.
The following sections deal with the different instruction classes. Some of the
instructions described are not directly supported by all members of the 29K family.
In particular, many of the floating–point instructions are only directly executed by
the Am29050 processor. If an instruction is not directly supported by the processor
hardware, then a trap is generated during instruction execution. In this case, the operating system uses trapware to implement the instruction’s operation in software.
Emulating nonimplemented instructions in software means some instruction execution speeds are reduced, but the instruction set is compatible across all family members.
3.1.1 Integer Arithmetic
The Integer Arithmetic instructions perform add, subtract, multiply, and divide
operations on word–length (32–bit) integers. All instructions in this class set the
ALU Status Register. The integer arithmetic instructions are shown in Tables 3–1 and
3–2.
The MULTIPLU, MULTIPLY, DIVIDE, and DIVIDU instructions are not implemented directly on most 29K family members, but are supported by traps. To determine if your processor directly supports these instructions, check with the processor User’s Manual or the tables in Chapter 1. The Am29050 microprocessor supports
the multiply instructions directly but not the divide instructions.
Table 3-1. Integer Arithmetic Instructions

Mnemonic        Operation Description
ADD             DEST <- SRCA + SRCB
ADDS            DEST <- SRCA + SRCB
                IF signed overflow THEN Trap (Out Of Range)
ADDU            DEST <- SRCA + SRCB
                IF unsigned overflow THEN Trap (Out Of Range)
ADDC            DEST <- SRCA + SRCB + C (from ALU)
ADDCS           DEST <- SRCA + SRCB + C (from ALU)
                IF signed overflow THEN Trap (Out Of Range)
ADDCU           DEST <- SRCA + SRCB + C (from ALU)
                IF unsigned overflow THEN Trap (Out Of Range)
SUB             DEST <- SRCA - SRCB
SUBS            DEST <- SRCA - SRCB
                IF signed overflow THEN Trap (Out Of Range)
SUBU            DEST <- SRCA - SRCB
                IF unsigned underflow THEN Trap (Out Of Range)
SUBC            DEST <- SRCA - SRCB - 1 + C (from ALU)
SUBCS           DEST <- SRCA - SRCB - 1 + C (from ALU)
                IF signed overflow THEN Trap (Out Of Range)
SUBCU           DEST <- SRCA - SRCB - 1 + C (from ALU)
                IF unsigned underflow THEN Trap (Out Of Range)
SUBR            DEST <- SRCB - SRCA
SUBRS           DEST <- SRCB - SRCA
                IF signed overflow THEN Trap (Out Of Range)
SUBRU           DEST <- SRCB - SRCA
                IF unsigned underflow THEN Trap (Out Of Range)
SUBRC           DEST <- SRCB - SRCA - 1 + C (from ALU)
(continued)
Table 3-2. Integer Arithmetic Instructions (Concluded)

Mnemonic        Operation Description
SUBRCS          DEST <- SRCB - SRCA - 1 + C (from ALU)
                IF signed overflow THEN Trap (Out Of Range)
SUBRCU          DEST <- SRCB - SRCA - 1 + C (from ALU)
                IF unsigned underflow THEN Trap (Out Of Range)
MULTIPLU        Q//DEST <- SRCA * SRCB (unsigned)
MULTIPLY        Q//DEST <- SRCA * SRCB (signed)
MUL             Perform one-bit step of a multiply operation (signed)
MULL            Complete a sequence of multiply steps
MULU            Perform one-bit step of a multiply operation (unsigned)
DIVIDE          DEST <- (Q//SRCA)/SRCB (signed)
                Q <- Remainder
DIVIDU          DEST <- (Q//SRCA)/SRCB (unsigned)
                Q <- Remainder
DIV0            Initialize for a sequence of divide steps (unsigned)
DIV             Perform one-bit step of a divide operation (unsigned)
DIVL            Complete a sequence of divide steps (unsigned)
DIVREM          Generate remainder for divide operation (unsigned)
3.1.2 Compare
The Compare instructions test for various relationships between two values.
For all Compare instructions except the CPBYTE instruction, the comparisons are
performed on word–length signed or unsigned integers. There are two types of
compare instruction. The first writes a Boolean value into the result register (selected
by the instruction DEST operand) depending on the result of the comparison. A
Boolean TRUE value is represented by a 1 in the most significant bit position. A
Boolean FALSE is defined as a 0 in the most significant bit. The 29K uses a global or
local register to contain the comparison result rather than the ALU status register.
This offers a performance advantage as there is less conflict over access to a single
shared resource. Compare instructions are frequently followed by conditional Jump
or Call instructions which depend on the contents of the compare result register.
The second type of compare instruction incorporates a conditional test in the
same instruction cycle accomplishing the comparison. These type of instructions,
known as Assert instructions, allow instruction execution to continue only if the result of the comparison is TRUE. Otherwise a trap to operating system code is taken.
The trap number is supplied in the field–C (DEST) operand position of the instruction. Trap numbers 0 to 63 are reserved for Supervisor mode program use. If an Assert instruction with a trap number less than 64 is attempted while the processor is operating in User mode, a protection violation trap will be taken. Note, this will occur
even if the assertion would have been TRUE. Assert instructions are used in procedure prologue and epilogue routines to perform register stack bounds checking (see
Chapter 2). Their fast operation makes them ideal for reducing the overhead of register stack support. They are also used as a means of requesting an operating system
support service (system call). In this case a condition known to be FALSE is asserted,
and the trap number for the system call is supplied in instruction field–C. The
Compare instructions are shown in Tables 3–3 and 3–4.
The CPBYTE performs four comparisons simultaneously. The four bytes in the
SRCA operand are compared with the SRCB operand and if any of them match then
Boolean TRUE is placed in the DEST register. The instruction can be very efficiently
used when scanning character strings. In particular, the C programming language
marks the end of character strings with a 0 value. Using the CPBYTE instruction with
SRCB supplying an immediate value 0, the string length can be quickly determined.
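The idea can be sketched in C as shown below; a hand-coded 29K routine would replace the four byte tests in the loop with a single CPBYTE instruction comparing against an immediate 0. The sketch assumes 32-bit words and a word-aligned string, and is illustrative only.

unsigned int fast_strlen(s)     /* conceptual word-at-a-time scan */
char    *s;
{
    unsigned long *wp=(unsigned long*)s;
    char *p;
    /* skip whole words containing no zero byte (one CPBYTE per word) */
    while(((*wp & 0xff000000) != 0) && ((*wp & 0x00ff0000) != 0) &&
          ((*wp & 0x0000ff00) != 0) && ((*wp & 0x000000ff) != 0))
        wp++;
    /* locate the zero byte within the final word */
    for(p=(char*)wp; *p!='\0'; p++)
        ;
    return (unsigned int)(p-s);
}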
3.1.3 Logical
The Logical instructions perform a set of bit–by–bit Boolean functions on
word–length bit strings. All instructions in this class set the ALU Status Register.
These instructions are shown in Table 3-5.
3.1.4 Shift
The Shift instructions (Table 3-6) perform arithmetic and logical shifts on global and local register data. The one exception is the EXTRACT instruction which operates on double–word data. When EXTRACT is used, SRCA and SRCB operand
registers are concatenated to form a 64–bit data value. This value is then shifted by
the funnel shifter by the amount specified by the Funnel Shift Count register (FC).
The high order 32–bits of the shifted result are placed in the DEST register. The funnel shifter can be used to perform barrel shift and rotate operations in a single cycle.
Note, when the SRCA and SRCB operands are the same register, the 32–bit operand
is effectively rotated. The result may be written back to the same register or placed in
a different global or local register (see Figure 3-1). The funnel shifter is useful for
fixing–up unaligned memory accesses. The two memory words holding the unaligned data can be loaded into global registers, and then aligned by the EXTRACT
instruction into the destination register. A code example showing the rotate operation
of the funnel shifter is given below:

        mtsrim  fc,8                ;rotate 8-bits left
        extract gr96,gr97,gr97      ;source in gr97

[Figure 3-1 shows the SRCA and SRCB operand registers feeding the funnel shifter, the shift amount supplied by the Funnel Shift Count register (FC), and the shifted result placed in the DEST register.]
Figure 3-1. The EXTRACT Instruction uses the Funnel Shifter

Table 3-3. Compare Instructions

Mnemonic        Operation Description
CPEQ            IF SRCA = SRCB THEN DEST <- TRUE
                ELSE DEST <- FALSE
CPNEQ           IF SRCA <> SRCB THEN DEST <- TRUE
                ELSE DEST <- FALSE
CPLT            IF SRCA < SRCB THEN DEST <- TRUE
                ELSE DEST <- FALSE
CPLTU           IF SRCA < SRCB (unsigned) THEN DEST <- TRUE
                ELSE DEST <- FALSE
CPLE            IF SRCA <= SRCB THEN DEST <- TRUE
                ELSE DEST <- FALSE
CPLEU           IF SRCA <= SRCB (unsigned) THEN DEST <- TRUE
                ELSE DEST <- FALSE
CPGT            IF SRCA > SRCB THEN DEST <- TRUE
                ELSE DEST <- FALSE
CPGTU           IF SRCA > SRCB (unsigned) THEN DEST <- TRUE
                ELSE DEST <- FALSE
(continued)
Table 3-4. Compare Instructions (Concluded)

Mnemonic        Operation Description
CPGE            IF SRCA >= SRCB THEN DEST <- TRUE
                ELSE DEST <- FALSE
CPGEU           IF SRCA >= SRCB (unsigned) THEN DEST <- TRUE
                ELSE DEST <- FALSE
CPBYTE          IF (SRCA.BYTE0 = SRCB.BYTE0) OR
                (SRCA.BYTE1 = SRCB.BYTE1) OR
                (SRCA.BYTE2 = SRCB.BYTE2) OR
                (SRCA.BYTE3 = SRCB.BYTE3) THEN DEST <- TRUE
                ELSE DEST <- FALSE
ASEQ            IF SRCA = SRCB THEN Continue
                ELSE Trap (Vector Number - in field-C)
ASNEQ           IF SRCA <> SRCB THEN Continue
                ELSE Trap (Vector Number - in field-C)
ASLT            IF SRCA < SRCB THEN Continue
                ELSE Trap (Vector Number - in field-C)
ASLTU           IF SRCA < SRCB (unsigned) THEN Continue
                ELSE Trap (Vector Number - in field-C)
ASLE            IF SRCA <= SRCB THEN Continue
                ELSE Trap (Vector Number - in field-C)
ASLEU           IF SRCA <= SRCB (unsigned) THEN Continue
                ELSE Trap (Vector Number - in field-C)
ASGT            IF SRCA > SRCB THEN Continue
                ELSE Trap (Vector Number - in field-C)
ASGTU           IF SRCA > SRCB (unsigned) THEN Continue
                ELSE Trap (Vector Number - in field-C)
ASGE            IF SRCA >= SRCB THEN Continue
                ELSE Trap (Vector Number - in field-C)
ASGEU           IF SRCA >= SRCB (unsigned) THEN Continue
                ELSE Trap (Vector Number - in field-C)
Table 3-5. Logical Instructions

Mnemonic        Operation Description
AND             DEST <- SRCA & SRCB
ANDN            DEST <- SRCA & ~ SRCB
NAND            DEST <- ~ (SRCA & SRCB)
OR              DEST <- SRCA | SRCB
NOR             DEST <- ~ (SRCA | SRCB)
XOR             DEST <- SRCA ^ SRCB
XNOR            DEST <- ~ (SRCA ^ SRCB)
Table 3-6. Shift Instructions

Mnemonic        Operation Description
SLL             DEST <- SRCA << SRCB (zero fill)
SRL             DEST <- SRCA >> SRCB (zero fill)
SRA             DEST <- SRCA >> SRCB (sign fill)
EXTRACT         DEST <- high-order word of (SRCA//SRCB << FC)
3.1.5 Data Movement
The Data Movement instructions (Tables 3–7 and 3–8) move bytes, half–words,
and words between processor registers. In addition, the LOAD[M] and STORE[M]
instructions move data between general–purpose registers and external devices, memories or coprocessor. The Am29050 processor has two additional instructions not
shown in Table 3-7. They are MFACC and MTACC; and are used to access the four
double–word floating point accumulators (see section 3.3.5).
Table 3-7. Data Move Instructions

Mnemonic        Operation Description
LOAD            DEST <- EXTERNAL WORD [SRCB]
LOADL           DEST <- EXTERNAL WORD [SRCB]
                assert *LOCK output during access
LOADSET         DEST <- EXTERNAL WORD [SRCB]
                EXTERNAL WORD [SRCB] <- h'FFFFFFFF',
                assert *LOCK output during access
LOADM           DEST.. DEST + COUNT <-
                EXTERNAL WORD [SRCB] ..
                EXTERNAL WORD [SRCB + COUNT * 4]
STORE           EXTERNAL WORD [SRCB] <- SRCA
STOREL          EXTERNAL WORD [SRCB] <- SRCA
                assert *LOCK output during access
STOREM          EXTERNAL WORD [SRCB] ..
                EXTERNAL WORD [SRCB + COUNT * 4] <-
                SRCA .. SRCA + COUNT
EXBYTE          DEST <- SRCB, with low-order byte replaced
                by byte in SRCA selected by BP
EXHW            DEST <- SRCB, with low-order half-word replaced
                by half-word in SRCA selected by BP
EXHWS           DEST <- half-word in SRCA selected by BP,
                sign-extended to 32 bits
INBYTE          DEST <- SRCA, with byte selected by BP replaced
                by low-order byte of SRCB
INHW            DEST <- SRCA, with half-word selected by BP replaced
                by low-order half-word of SRCB
MFSR            DEST <- SPECIAL
MFTLB           DEST <- TLB [SRCA]
MTSR            SPDEST <- SRCB
(continued)
Table 3-8. Data Move Instructions (Concluded)

Mnemonic        Operation Description
MTSRIM          SPDEST <- 0I16 (16-bit data formed from the SRCA and SRCB fields)
MTTLB           TLB [SRCA] <- SRCB
The LOAD and STORE instructions are most interesting (see Figure 3-2 for the
instruction format). Instruction field–C is assigned a number of bit–field tasks which
control the external access operation. Bit CE, when set, indicates that the data transfer is to coprocessor space. AMD makes a floating–point coprocessor, Am29027,
which was frequently used with the Am29000 processor before the Am29050 processor became available. Because the Am29050 directly supports floating–point
instructions there are no new designs making use of the Am29027 coprocessor.
[Figure 3-2 shows the 32-bit LOAD and STORE instruction format: the opcode field (including the M bit), the field-C control bits CE, AS, PA, SB and UA together with the OPT field, the RA field, and the RB field or 8-bit immediate I.]
Figure 3-2. LOAD and STORE Instruction Format
Bit field AS when set is used to indicate that the access is to Input/Output (I/O)
space. I/O space is little used as there is no convenient means of accessing it from a
high level language such as C. For this reason peripheral devices are typically
mapped into external data memory space rather than I/O space.
The PA and UA bits are used by Supervisor mode code; PA is used by operating
systems which run with address translation turned on, but need to access an external memory physical address. When bit PA is set, address translation is turned off for
the LOAD or STORE data access. This is useful when accessing peripheral devices.
When operating system code wishes to access a User’s code space, it sets the UA bit.
This causes the data transfer operation to execute with User rather than Supervisor
permission. If the User mode program was running with address translation on, then
the PID field of the MMU register is used when checking TLB access permissions.
Normally Supervisor mode code operates with a fixed PID value of zero.
The original versions of the Am29000 processor (rev–A to rev–B) did not support byte sized access to external memory. For this reason bytes and half–words had
to be extracted from words after they had been read from memory; the Extract Byte
(EXBYTE) and Extract half–word (EXHW) instructions are supported by the processor for just this purpose. Additionally, when data objects smaller than a word were
written to external memory, a read–modify–write process had to be used. The Insert
Byte (INBYTE) and Insert half–word (INHW) instructions supported the process.
Rev–C and later versions of the Am29000 processor and all other 29K family
members directly support byte and half–word accesses to memory. The instructions
described above need no longer be used. To enable current versions of the Am29000
processor to be compatible with the original processor, the DW bit was added to the
processor configuration register (CFG). When the DW bit is clear the processor performs rev–A type memory accesses. All new designs operate with the DW bit set; and
other 29K family members operate with an implied DW bit set.
The OPT field bits specify the size of the data object being moved. They are also
used to indicate that a word sized access to Instruction ROM space is requested. External
logic must be incorporated in a memory system design if this option is to be supported. The OPT field appears on the OPT(2:0) output pins during the memory access. It is important that the object size is consistent with the address boundaries defined by the lower bits of the memory address. For example, if a word sized access
(OPT field value is 0) is attempted with lower address bits aligned to a byte boundary
(A[1:0] not equal 0) then an unaligned access trap may occur. The Unaligned Access
(UA) bit of the Current Processor Status (CPS) register must be set for the trap to be
taken. Additionally, alignment checking is only performed for instruction and data
memory, not for I/O or coprocessor space accesses.
The SB bit is used when reading bytes or half–words from external memory.
Sub–word sized accesses are determined by the OPT field; the processor right–justifies the accessed data within the destination register. The SB bit when set causes the
remainder of the destination to be sign extended with the sign of the loaded data object. When SB is clear, the destination register value is zero–extended. The SB bit has
no effect during external memory writes. During write operations, the data object is
replicated in all positions of the data bus. For example, a byte write would result in the
stored byte appearing in all four positions of the stored word. It is the responsibility of
external memory to decode the OPT field and lower address bits when determining
which byte position should be written. Note, the microcontroller members of the 29K
family implement the memory glue logic on–chip.
Instruction field–B (SRCB) supplies the external memory address for LOAD
and STORE instructions. Typically a CONST, or CONST and CONSTH, instruction
sequence precedes the LOAD or STORE instruction and establishes the access address for memory. However, the first 256 bytes of memory can be accessed with immediate addressing, where the 8–bit SRCB value contains the address. Some systems
may be able to make use of this feature where performance is critical and the use of
CONST type instructions is to be avoided.
As described in Chapter 1, the use of LOAD and STORE instructions can affect
the processor pipeline utilization. Members of the 29K family which support a Harvard memory architecture, or contain on–chip instruction memory cache, can perform LOAD and STORE operations in parallel with other instructions. This prevents
pipeline stalling, as the instruction execution sequence can continue in parallel with
the external memory access. However, if the instruction following a LOAD operates
on the accessed data then pipeline stalling will still occur. For this reason LOAD
instructions should be positioned early in the instruction sequence, enabling the data
memory access latency to be hidden. Pipeline stalling will also occur if LOAD and
STORE type instructions are placed back–to–back, as this can result in channel access
conflicts. For this reason, LOAD and STORE instructions should be separated with other instructions as much as possible.
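As a minimal sketch of this scheduling rule (the register numbers and the surrounding arithmetic are invented for illustration), the first sequence below stalls because the ADD consumes the loaded data in the very next instruction, while the second hoists the LOAD and fills the gap with independent work:

        ;poorly scheduled: the ADD needs gr96 immediately after the LOAD
        load    0,0,gr96,gr97      ;gr96 <– word at address held in gr97
        add     gr98,gr96,1        ;stalls until the load data returns

        ;rescheduled: independent instructions hide the access latency
        load    0,0,gr96,gr97      ;start the external access early
        add     gr99,gr99,4        ;independent work (invented filler)
        sub     gr100,gr100,1      ;more independent work
        add     gr98,gr96,1        ;loaded data is now more likely available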
3.1.6 Constant
The Constant instructions (Table 3-9) provide the ability to place half–word and
word constants into registers. Most instructions in the instruction set allow an 8–bit
constant as an operand. The Constant instructions allow the construction of larger
constants. The Am29050 processor has an additional instruction, CONSTHZ, not
supported in other 29K family members. It places a 16–bit constant in the upper half–
word position while the lower 16–bits are zero filled.
Table 3-9. Constant Instructions
Mnemonic
Operation Description
CONST
DEST <– 0I16 (16–bit data formed with the SRCA and SRCB fields)
CONSTH
Replace high–order half–word of SRCA by I16
CONSTN
DEST <– 1I16
3.1.7 Floating–point
The Floating–Point instructions (Tables 3–10 and 3–11) provide operations on
single–precision (32–bit) or double–precision (64–bit) floating–point data. In addition, they provide conversions between single–precision, double–precision, and integer number representations. In most 29K family members, these instructions cause
traps to routines which perform the floating–point operations in software. The
Am29050 processor supports all floating–point instructions directly in hardware. It
also has four additional instructions not shown in Tables 3–10 and 3–11. They are
FMAC, DMAC and FMSM, DMSM; they are used to perform single and double–
precision multiply–and–accumulate type operations (see section 3.3.5).
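As a one–line sketch (the register choices are invented), the same floating–point instruction is either executed directly by the Am29050 or trapped and emulated in software on other family members:

        fadd    gr96,gr97,gr98     ;gr96 <– gr97 + gr98 (single–precision);
                                   ;traps to an emulation routine on most
                                   ;other 29K family members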
3.1.8 Branch
The Branch instructions (Table 3-12) control the execution flow of instructions.
Branch target addresses may be absolute, relative to the Program Counter (with the
offset given by a signed instruction constant), or contained in a general–purpose register (indirect addressing). For conditional jumps, the outcome of the jump is based
on a Boolean value in a general–purpose register. Only the most significant bit in the
specified condition register is tested; Boolean TRUE is defined as bit–31 being set.
Procedure calls are unconditional, and save the return address in a general–purpose
register. All branches have a delayed effect; the instruction following the branch is
executed regardless of the outcome of the branch.
The instruction following the branch instruction is referred to as the delay slot
instruction. Assembly level programmers may have some difficulty in always finding a useful instruction to put in the delay slot. It is best to find an operation required
regardless of the outcome of the branch. As a last resort a NOP instruction can be
used, but this makes no effective use of the processor pipeline. When programming
in a high level language the compiler is responsible for making effective use of delay
slots. Programmers not familiar with delayed branching often forget the delay slot is
always executed, with unfortunate consequences. For this reason, the example code
throughout this book shows delay slot instructions indented one space compared to
other instructions. This has proven to be a useful reminder.
The delay slots of unconditional branches are easier to fill than conditional
branches. The instruction at the target of the branch can be moved to, or duplicated at,
the delay slot; and the jump address can be changed to the instruction following the
original target instruction.
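The fragment below sketches the technique for an unconditional branch; the registers and the trivial loop body are invented for illustration. The instruction at the branch target is duplicated into the delay slot and the jump is retargeted to the instruction which followed the original target:

        ;before: the delay slot is wasted on a NOP
        jmp     loop               ;branch back to the top of the loop
         nop                       ;delay slot does no useful work
loop:   add     gr96,gr96,lr2      ;original branch target
        load    0,0,lr2,lr3        ;fetch the next operand

        ;after: the target is duplicated into the delay slot and the jump
        ;is retargeted to the instruction following the original target
        jmp     top2               ;branch past the duplicated instruction
         add    gr96,gr96,lr2      ;delay slot now performs the target's work
top:    add     gr96,gr96,lr2
top2:   load    0,0,lr2,lr3        ;fetch the next operand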
The JMPFDEC instruction is very useful for implementing control loops based
on a decrementing counter. The counter register (SRCA) is first tested to determine if the
value is FALSE, then it is decremented. The jump is then taken if a FALSE value was
detected. The code example below shows how count words of external memory can
be written with zero. Note how the address pointer is incremented in the delay slot of
the jump instruction. Additionally, the SRCA register must be initialized to count–2;
Table 3-10. Floating–Point Instructions
Mnemonic
Operation Description
FADD
DEST (single–precision) <– SRCA (single–precision)
+ SRCB (single–precision)
DADD
DEST (double–precision) <– SRCA (double–precision)
+ SRCB (double–precision)
FSUB
DEST (single–precision) <– SRCA (single–precision)
– SRCB (single–precision)
DSUB
DEST (double–precision) <– SRCA (double–precision)
– SRCB (double–precision)
FMUL
DEST (single–precision) <– SRCA (single–precision)
* SRCB (single–precision)
DMUL
DEST (double–precision) <– SRCA (double–precision)
* SRCB (double–precision)
FDIV
DEST (single–precision) <– SRCA (single–precision)/
SRCB (single–precision)
DDIV
DEST (double–precision) <– SRCA (double–precision)/
SRCB (double–precision)
FEQ
IF SRCA (single–precision) = SRCB (single–precision)
THEN DEST <– TRUE
ELSE DEST <– FALSE
DEQ
IF SRCA (double–precision) = SRCB (double–precision)
THEN DEST <– TRUE
ELSE DEST <– FALSE
FGE
IF SRCA (single–precision) >= SRCB (single–precision)
THEN DEST <– TRUE
ELSE DEST <– FALSE
DGE
IF SRCA (double–precision) >= SRCB (double–precision)
THEN DEST <– TRUE
ELSE DEST <– FALSE
FGT
IF SRCA (single–precision) > SRCB (single–precision)
THEN DEST <– TRUE
ELSE DEST <– FALSE
Table 3-11. Floating–Point Instructions (Concluded)
Mnemonic
Operation Description
DGT
IF SRCA (double–precision) > SRCB (double–precision)
THEN DEST <– TRUE
ELSE DEST <– FALSE
SQRT
DEST (single–precision, double–precision, extended–precision)
<–SQRT[SRCA (single–precision, double–precision, extended–precision)]
CONVERT
DEST (integer,single–precision, double–precision)
<–SRCA (integer, single–precision, double–precision)
CLASS
DEST (single–precision, double–precision, extended–precision)
<–CLASS[SRCA (single–precision, double–precision, extended–precision)]
this is because the decrement is performed after the condition test, so the jump is still
taken when the counter holds 0 and the loop body executes one final time with the
counter at –1. In practice, memory systems supporting burst–mode accesses could alternatively use a STOREM instruction to clear data memory more efficiently.
        const   gr97,count–2       ;establish loop count
        const   gr98,0
        const   gr96,address       ;establish memory address
        consth  gr96,address
clear:  store   0,0,gr98,gr96      ;write zero to memory
        jmpfdec gr97,clear         ;test and decrement count
         add    gr96,gr96,4        ;advance pointer
        ;arrive here when loop finished, gr97=–2
3.1.9 Miscellaneous Instructions
The Miscellaneous instructions (Table 3-13) perform various operations which
cannot be grouped into other instruction classes. In certain cases, these are control
functions available only to Supervisor–mode programs.
The Count Leading Zeros instruction can be very useful to assembly level programmers. It determines the position of the most–significant one bit in the SRCB
operand. If all bits are zero, then 32 is returned. The instruction is useful when determining priorities for, say, queues of interrupt requests, where each interrupt may set a
bit in the register operated on. The highest priority interrupt in the queue can be
quickly determined by the CLZ instruction.
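As a brief sketch (the register choices and the bit–per–interrupt convention are assumptions for illustration), a word of pending requests is converted to the position of the highest–priority set bit:

        clz     gr96,gr97          ;gr96 <– number of leading zeros in gr97;
                                   ;gr96 = 32 means no request is pending,
                                   ;otherwise bit (31 – gr96) of gr97 is the
                                   ;highest priority request that is set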
Table 3-12. Branch Instructions
Mnemonic
Operation Description
CALL
DEST <– PC//00 + 8
PC <– TARGET
Execute delay instruction
CALLI
DEST <– PC//00 + 8
PC <– SRCB
Execute delay instruction
JMP
PC <– TARGET
Execute delay instruction
JMPI
PC <– SRCB
Execute delay instruction
JMPT
IF SRCA = TRUE THEN PC <– TARGET
Execute delay instruction
JMPTI
IF SRCA = TRUE THEN PC <– SRCB
Execute delay instruction
JMPF
IF SRCA = FALSE THEN PC <– TARGET
Execute delay instruction
JMPFI
IF SRCA = FALSE THEN PC <– SRCB
Execute delay instruction
JMPFDEC
IF SRCA = FALSE THEN
SRCA <– SRCA – 1
PC <– TARGET
ELSE
SRCA <– SRCA – 1
Execute delay instruction
3.1.10 Reserved Instructions
The remaining operation codes are reserved for instruction emulation. These
instructions cause traps, much like the unimplemented floating–point instructions,
but currently have no specified interpretation. The relevant operation codes, and the
corresponding trap vectors are given in the processor User’s Manual.
These instructions are intended for future processor enhancements, and users
desiring compatibility with future processor versions should not use them for any
purpose.
The software developer should be aware of the trap taken with the reserved
instruction opcode 0xff. When execution is attempted with this opcode a trap 63 is
Table 3-13. Miscellaneous Instructions
Mnemonic
Operation Description
CLZ
Determine number of leading zeros in a word
SETIP
Set IPA, IPB, and IPC with operand register–numbers
EMULATE
Load IPA and IPB with operand register–numbers, and Trap Vector
Number in field–C
INV [ID]
INV
reset all Valid bits in instruction and data caches
INV 1; reset all Valid bits in instruction cache
INV 2; reset all Valid bits in data cache
IRET
perform an interrupt return sequence
IRETINV [ID]
IRETINV perform an interrupt return and invalidate all caches
IRETINV 1; perform an interrupt return and invalidate instruction cache
IRETINV 2; perform an interrupt return and invalidate data cache
HALT
Enter Halt mode on next cycle
taken. This can occur when a program goes out–of–control and attempts to fetch
instructions from nonexistent memory.
3.2 CODE OPTIMIZATION TECHNIQUES
When a high level programming language is used for software development,
code optimization is left to the compiler. With assembly language programming, care
must be taken to avoid code sequences which impact upon the processor’s performance. Section 3.1.5 described how LOAD and STORE instructions must be carefully positioned if pipeline stalling is to be avoided. Section 3.1.8 discussed the delay
slot of branch instructions, and the importance of finding a useful instruction for
delay slot filling. This section describes a few more useful coding techniques which
can improve code execution performance.
Common Subexpression Elimination is a technique where a frequently occurring code sequence is reduced to a single occurrence. This usually requires the
result of the code sequence to be held in register space for frequent and fast access.
The trade–off between expression reevaluation and consuming additional register
resources is easily made with the 29K family because of the large number of general
purpose registers available. Code subexpressions need not be large. They may be as
short as an address calculation using a pair of CONST – CONSTH instructions. The
calculation can be done once and the address kept around in a register for reuse.
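A minimal sketch of the idea follows; the data label and register choices are invented. The two–instruction address construction is performed once and the register is reused for every later access:

        const   gr96,buffer        ;build the 32–bit address of buffer once
        consth  gr96,buffer        ;(buffer is a hypothetical data label)
        load    0,0,gr97,gr96      ;first access using the address
        add     gr97,gr97,1        ;some work on the loaded value
        store   0,0,gr97,gr96      ;later reuse: no second CONST/CONSTH pair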
Moving code out of loops is another technique frequently used to improve performance. However, the typically small number of registers in a CISC processor can
often mean loop invariant code results have to be held in external memory. This can
lead to trade–offs between adding code within a loop or suffering the external
memory access penalties. Again, the large number of general purpose registers in the
29K assists the programmer in producing improved code.
Branch instructions are to be avoided as their use impacts badly on performance. Processors supporting burst–mode addressing operate most efficiently when
instruction bursting is not disrupted with a branch instruction. This is particularly
true for memory systems which have a high first–access latency. The Branch Target
Cache incorporated in some 29K family members can help hide the effects of branching, but as the number of branch instructions is increased the chance of a hit occurring
in the cache is reduced.
Loop Inversion is a useful technique for reducing the use of branch instructions.
Often programmers will construct loops which have the loop condition test at the top
of the loop; this requires an additional unconditional branch at the bottom of the loop
to return to the test. If the conditional branch is instead placed at the bottom of the
loop, then only a single branch instruction is executed per iteration, as sketched below.
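The fragment below is a sketch only; the register numbers and the trivial count–down body are invented, and the inverted form assumes the loop body executes at least once. The first version tests at the top and executes two branches per iteration; the inverted version tests at the bottom and executes only one:

        ;top–tested loop: conditional exit at the top plus a jump at the bottom
        const   gr96,0             ;sum = 0
        const   gr97,10            ;counter = 10
top:    cpeq    gr98,gr97,0        ;TRUE when the counter reaches zero
        jmpt    gr98,done          ;conditional exit at the top
         nop
        add     gr96,gr96,gr97     ;loop body: accumulate the counter
        jmp     top                ;unconditional branch back to the test
         sub    gr97,gr97,1        ;decrement the counter in the delay slot
done:
        ;inverted loop: a single conditional branch at the bottom
        const   gr96,0             ;sum = 0
        const   gr97,10            ;counter = 10
loop:   add     gr96,gr96,gr97     ;loop body
        sub     gr97,gr97,1        ;decrement the counter
        cpgt    gr98,gr97,0        ;TRUE (bit 31 set) while counter > 0
        jmpt    gr98,loop          ;conditional branch at the bottom
         nop                       ;delay slot (fill with body work if possible)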
3.3 AVAILABLE REGISTERS
In essence, global registers gr64–gr95 are reserved for interrupt handler and
operating system use. The remaining 32 global registers (gr96–gr127) are reserved for holding User mode program context. The high level language calling convention described in Chapter 2 established this convention. Figure 3-3 illustrates the
partitioning of the global registers among the operating system and user program
tasks. General purpose registers 128–255 are better known as local registers, and are accessed via the register stack pointer, gr1, rather than directly addressed as global
registers. General purpose registers 128–255 can not be accessed like global registers
(gr96–gr127); they can only be accessed as local registers or via indirect pointers.
The calling convention goes further than just dividing the register space into two
groups. The user space registers are assigned particular high level language support
tasks. All but four registers (gr112–gr115) in user task space will be accessed and modified by compiler generated code at various times. Most of the registers are used as
compiler temporaries; three registers are used to support memory and register stacks;
and the remaining four registers support the high level language procedure call
[Figure omitted. It summarizes general purpose register usage: registers 128–255
form the local register cache, accessed as lr0–lr127 via gr1; User task space occupies
gr96–gr127, with gr96–gr111 used for compiler temporaries and function return
values, gr112–gr115 reserved for the programmer, gr116–gr120 used as further
compiler temporaries, and gr121–gr127 assigned to tav, tpc, lrp, slp, msp, rab and rfb
(trap argument vector, trap return pointer, large return pointer, static link pointer, and
memory and register stack support); operating system space occupies gr64–gr95,
subdivided into interrupt handler temporaries it0–it3 (gr64–gr67), operating system
temporaries kt0–kt11 (gr68–gr79) and operating system statics ks0–ks15
(gr80–gr95); gr2–gr63 are not implemented; and gr1 is the register stack pointer.]
Figure 3-3. General Purpose Register Usage
mechanism and system calls. Global registers in the range gr121–gr127 are known to
the programmer by special synonym; however the registers themselves operate no
differently from other global registers.
In particular gr121 (tav) and gr122 (tpc) are used to pass arguments to trap routines invoked with Assert type instructions. This occurs during procedure prologue
and epilogue as well as operating system service calls. At other times, the compiler
uses these registers to hold temporary data values.
Register gr123 (lrp) is known as the Large Return Pointer. It is used when a procedure is required to return an object which is larger than 16 words and therefore cannot fit in the normal return space (gr96–gr111). The caller must assign lrp to point to
memory which will hold the 17th and higher words of return data.
Register gr124 (slp) is known as the Static Link Pointer. It is used when accessing data variables defined in a parent procedure. This occurs in some languages,
Pascal for example, where nested procedure declarations are permitted. The High C
29K and GNU compilers do not use this register unless C language extensions are
used.
A called procedure can locate its dynamic parent and the variables of the dynamic parent because of the caller–callee activation record linkage (see section
3.5). However, the linkage is not adequate to locate variables of the static parent
which may be referenced in the procedure. If such references appear in a procedure,
the procedure must be provided with a local register which forms part of the static
link pointer chain. Since there can be a hierarchy of static parents, the slp points to the
slp of the immediate parent, which in turn points to the slp of its immediate parent.
The memory–stack support register gr125 (msp) and the register–stack support
registers gr126 (rab), gr127 (rfb) and gr1 (rsp), were discussed in detail in Chapter 2.
They maintain the current positions of the stack resources.
The calling convention does not assign any particular task to the registers in the
operating system (OS) group (gr64–gr95). However, over time a convention has
evolved among 29K processor users. The subdivision of the OS registers shown in
Figure 3-3 is widely adhered to. The subgroups are known as: the interrupt freeze
mode temporaries (given synonyms it0–it3); the operating system temporaries
(kt0–kt11); and the operating system statics support registers (ks0–ks15). Note, static
register ks0 is often combined with it0–kt11 to form an interrupt context cache (see
section 2.5.1). Consequently, ks1 is the first available static support register.
When developing a new assembly language procedure a useful technique is to
construct a C language routine which receives any passed parameters and implements the appropriate task. With the AMD High C 29K compiler, the procedure can
be compiled to produce an assembly language output file with the “–S –Hanno” compiler switches. The assembly level code can then be directly modified into the required code sequence.
3.3.1 Useful Macro–Instructions
The code examples shown in later chapters make use of a number of macros for
pushing and popping special registers to an external memory stack. A macro instruction is composed of a sequence of simpler instructions. Effectively, a macro is an in–
line procedure call. Using macros is faster than making an actual procedure call but
consumes more instruction memory space. The macro definitions are presented below:
        .macro  pushsr,sp,reg,sreg ;macro name and parameters
        mfsr    reg,sreg           ;copy from special
        sub     sp,sp,4            ;decrement pointer
        store   0,0,reg,sp         ;store on stack
        .endm
;
        .macro  popsr,sreg,reg,sp  ;macro name and parameters
        load    0,0,reg,sp         ;get from stack
        add     sp,sp,4            ;increment pointer
        mtsr    sreg,reg           ;move to special
        .endm
Note how the LOAD instruction is used first when popping. This enables the
ADD and MTSR instructions to overlap the LOAD execution and thus reduce pipeline stalling. This is particularly useful when popsr macro instructions are used
back–to–back in sequence. Such sequences are useful when a memory system does
not support burst mode addressing. If burst mode is supported then it can be more efficient to use a LOADM instruction and then transfer the global register data into the
special registers. However, LOADM and STOREM cannot be used in Freeze mode
code, which frequently requires popsr and pushsr instruction sequences. Similar macros are used to push and pop global registers:
        .macro  push,sp,reg        ;macro name and parameters
        sub     sp,sp,4            ;decrement pointer
        store   0,0,reg,sp         ;store on stack
        .endm
;
        .macro  pop,reg,sp         ;macro name and parameters
        load    0,0,reg,sp         ;get from stack
        add     sp,sp,4            ;increment pointer
        .endm
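As a hedged usage sketch (the msp memory stack pointer, the it0 scratch register and the CHA special register follow the conventions described earlier, and the assumption that the assembler accepts these synonyms is mine), a handler might save and restore state as follows:

        pushsr  msp,it0,cha        ;save special register CHA via scratch it0
        push    msp,gr96           ;save a global register
        ;... handler body ...
        pop     gr96,msp           ;restore in the reverse order
        popsr   cha,it0,msp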
3.3.2 Using Indirect Pointers and gr0
Three of the 29K special registers are known as indirect pointers: IPA, IPB, and
IPC. These registers are used to point into general purpose register space, and support
indirect register access. They hold the absolute register number of the general purpose register being accessed, and are used in instructions by referencing the pseudo–
register gr0. When an indirect pointer is to be used to identify an instruction operand,
gr0 is placed in the appropriate instruction operand field. Indirect pointer IPA is used
with the SRCA operand field. Similarly, IPB and IPC apply with the SRCB operand
and DEST instruction fields.
The indirect pointer registers are set with the SETIP and EMULATE instructions. Additionally, they are set when a trap is taken as a result of executing an
instruction which is not directly supported by the 29K processor. With some family
members this occurs with floating–point operations and integer multiply and divide.
The example code below shows how the gr0 register is used to select indirect pointer
use. Note, indirect pointers can not be accessed in the cycle following the one in
which they are set; this explains the NOP instruction.
        setip   gr98,lr2,gr96      ;set indirect pointers
        nop                        ;delay
        add     gr0,gr97,gr0       ;gr98 = gr97+gr96
The main use of indirect pointers is to support transparent routines (see section
3.7) and instruction emulation. With most 29K family members the integer multiply
instruction (MULTIPLY) is not directly supported.
        multiply lr4,gr98,lr10     ;integer multiply, lr4 = gr98*lr10
On entering the trapware support routine for the vector assigned to the MULTIPLY instruction (vector number 32) the indirect pointers are set to IPC=lr4,
IPA=gr98 and IPB=lr10 for the above example. This enables the trap handler to easily and efficiently access the register operands for the instruction without having to
examine the actual instruction in memory.
When using a MTSRIM instruction to set an indirect pointer register value, it is
important to remember that the most significant bit (bit position 9) must be set if local
registers are to be accessed. This is because indirect pointers operate with absolute
register numbers. See the following section discussing the use of gr1 for more details
on register addressing.
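A minimal sketch follows; the choice of lr2 and gr96, the assumption that the assembler accepts the ipa synonym, and the assumption that the indirect pointers hold the absolute register number shifted left two bits (as the bit position 9 remark above implies) are all mine:

        mtsrim  ipa,0x208          ;IPA <– absolute register 130 (lr2) shifted
                                   ;left 2 bits; bit 9 is set, marking a
                                   ;local register
        nop                        ;indirect pointer not usable the next cycle
        add     gr96,gr0,1         ;gr96 <– lr2 + 1, SRCA taken through IPA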
3.3.3 Using gr1
Global register gr1 performs the special task of supporting indirect access of the
128 local registers. When an instruction operand, say SRCA, has its top most bit set
then base–plus–offset addressing is used to access the operand. This means only general purpose registers in the range gr1–gr127 can be addressed via their absolute register number given the supported instruction operand decoding. (Indirect pointers enable all general purpose registers to be accessed via absolute address numbers.) The
lower 7–bits of the operand supply the offset which is shifted left 2–bits then added
with the local register base held in register gr1. Register gr1 is a 32–bit register, and
bits 8–2 contain the local register base (see Figure 3-4).
[Figure omitted. gr1 is a 32–bit register; bits 8–2 hold the local register base and
bits 1–0 are always zero.]
Figure 3-4. Global Register gr1 Fields
The base offset calculation is performed modulo–128. The most significant address bit is assumed set when forming the absolute address for all local register accesses.
29K processors use a shadow copy of the gr1 register when performing local
register addressing. The shadow copy can only be modified by an arithmetic or logical instruction; a shift or load into gr1 will not update the shadow copy. Because of
the shadow register technique, there must be a delay of one cycle before the register
file can be accessed after gr1 has been modified.
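A minimal sketch of the one–cycle rule (the register choices are invented, and a real procedure prologue would also compare the new gr1 value against the register allocate bound, rab, before using the newly allocated registers):

        sub     gr1,gr1,8          ;lower the stack pointer by two words; an
                                   ;arithmetic instruction, so the shadow
                                   ;copy of gr1 is also updated
        add     gr96,gr96,0        ;filler: must not reference a local register
        add     lr0,lr1,0          ;local registers now addressed from the
                                   ;new base held in gr1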
3.3.4 Accessing Special Register Space
The special registers control the operation of the processor. They are divided
into two groups: those that can be accessed only in Supervisor mode and those which
have unrestricted access. Accesses to special registers sr128 and above do not generate
a protection violation when accessed in User mode. Special register space was described in section 1.10.2. Not all 29K family members have fully implemented special register spaces. In the Supervisor–only accessible space there are a number of
differences due to differences in on–chip resources such as cache memory and hardware breakpoint registers. Because these are not accessible to application code they
do not affect application code portability.
However, some members of the 29K family do not implement, in hardware, all
of the special registers accessible by User mode programs. In particular the floating
point support registers (sr160–sr162) are only implemented on processors which directly support floating–point instructions in hardware. All other family members
virtualize these registers. An attempted access to unimplemented special registers
causes a Protection Violation trap to occur. The trapware code implements the access
and returns the result. Unfortunately, the trapware code does not use the indirect
pointers, as they are not set by a protection violation trap. This means the trapware
must read the instruction space to determine the special register being accessed. This
leads to the consequence that the special floating point support registers can not be
virtualized with Harvard memory architectures which do not provide a memory
bridge to enable instruction memory to be accessed as data. The emulation technique
also requires the support of three operating system registers. The trapware is typically configured to use global registers ks13–ks15 (gr93–gr95) for this task.
Special registers are located in their own register space. They can only be accessed by the move–from (MFSR) and move–to (MTSR) instructions which transfer
data between special register space and general purpose registers. In addition there is
a MTSRIM instruction which can be used to set a special register with 16–bit immediate data. The indirect pointers can not be used to access special register space.
This imposes some restrictions on accessing special registers but in practice is acceptable. However, where the address of a special register to be accessed is contained in a
general purpose register, the technique shown below can be used. In the example, lr2
contains the address of the special register to be read with a MFSR instruction. The
example assumes instruction memory can be written to; the required instruction is
built in gr97 and stored in memory at an address given by gr98. The instruction is
then visited with a JMPI instruction. A jump instruction’s target address is visited
when the jump instruction contains a further jump in its delay slot. The second jump
is in the decode stage of the processor pipeline when the first jump is in execute. This
means the second jump must be taken, and only the first instruction of the new
instruction stream is started before execution continues at label continue.
        const   gr98,I_memory      ;establish instruction address
        consth  gr98,I_memory
        const   gr97,0xC6600000    ;MFSR, DEST=gr96, SRCA=0
        consth  gr97,0xC6600000
        sll     lr2,lr2,8          ;lr2 has special register number
        or      lr2,lr2,gr97       ;instruction now constructed
        store   0,0,lr2,gr98       ;store target instruction
        jmpi    gr98               ;visit the target instruction
         jmp    continue           ;must execute the delay slot
continue:
The constructed MFSR instruction places the result in register gr96. The lr2
source address value had to be shifted left 8–bits into the SRCA field position of the
MFSR instruction.
3.3.5 Floating–point Accumulators
The Am29050 processor is currently the only member of the 29K family which
directly supports floating–point arithmetic operations in hardware. In addition to
supporting floating–point operations without using trapware emulation, functions
involving multiply–and–accumulate operations are supported by four additional
hardware instructions not implemented in other 29K family members. Sum–of–
product type operations are frequently required by many floating–point intensive applications, such as matrix multiplication. Implementing this operation efficiently in
hardware makes the Am29050 processor suitable for use in graphics and signal processing applications.
The FMAC and DMAC instructions can be used to multiply two general purpose register values together and sum the product with one of the four floating–point
accumulators. The DMAC instruction operates on double–precision operand data
and the FMAC operates on single–precision. Double–precision operands can be accessed from the register file in a single cycle as the register file is implemented as
64–bits wide, and there are 64–bit wide ports supplying data to the floating–point
execution unit components. Double–precision operands must be aligned on double–
register address boundaries.
The FMSM and DMSM instructions support single and double–precision floating–point multiply–and–sum. One operand for the multiplication is a general purpose register, the second is accumulator 0; the product is summed with the second
instruction operand and the result placed back in the register file. These two instructions can be used when the multiplier is a fixed value such as with SAXPY (single–
precision A times X plus Y).
The Floating–Point Unit on the Am29050 processor is constructed from a number of specialized operation pipelines; one for addition/subtraction, one for multiplication, and one for division/square root. The functional units used by the pipelines
all operate separately. This enables multiple floating–point instructions to be in
execute at the same time. Additionally floating–point operations can commence in
parallel with operations carried out by the processor’s integer pipeline. The operation
of some of the pipeline functional units can be multicycle and contention for resources can result if simultaneous floating–point operations are being performed.
However, all floating–point operations are fully interlocked, and operations requiring the result of a previous functional unit operation are prevented from proceeding
until that result is available. The programmer never has to become involved in the
pipeline stage details to ensure the success of an operation.
To sustain efficient use of the floating–point pipelines, four floating–point accumulator registers are provided. The programmer must multiplex their use during
heavily pipelined code sequences to reduce resource contention. The Am29050 processor can issue a new floating–point instruction every cycle but many of the operations have multicycle latency. Thus to avoid pipeline stalling, the results should not
be used until a sufficient number of delay cycles has passed (see Am29050 processor
User’s Manual). The processor has an additional 64–bit write port on the general purpose register file for use by the floating–point unit. This enables floating–point results to be written back at the same time as integer pipeline results.
The floating–point accumulators can be accessed by the MTACC (move–to)
and MFACC (move–from) instructions which are available to User mode code. Only
29K family members which directly support floating–point operations implement
these instructions.
3.4 DELAYED EFFECTS OF INSTRUCTIONS
Modification of some registers has a delayed effect on processor behavior.
When developing assembly code, care must be taken to prevent unexpected behavior. The easiest of the delayed effects to remember is the one cycle that must pass
between setting an indirect pointer and using it. This occurs most often with the register stack pointer: it cannot be used to access a local register in the instruction that follows the instruction that writes to gr1. An instruction that does not require gr1 (and
that means all local registers referenced via gr1) can be placed immediately after the
instruction that updates gr1.
Direct modification of the CPS register must also be done carefully, particularly
where the freeze (FZ) bit is cleared. When the processor is frozen, the special-purpose registers are not updated during instruction execution. This means that the PC1
register does not reflect the actual program counter value at the current execution address, but rather at the point where freeze mode was entered. When the processor is
unfrozen, either by an interrupt return or direct modification of the CPS, two cycles
are required before the PC1 buffer register reflects the new execution address. Unless
the CPS register is being modified directly, this creates no problem.
Consider the following examples. If the FZ bit is cleared and trace enable (TE) is
set at the same time, the next instruction should cause a trace trap, but the PC buffer
registers frozen by the trap will not have had time to catch up with the current execution address. Within the trap code the processor will have appeared to have stopped at
some random address, held in PC1. If interrupts and traps are enabled at the same
time the FZ bit is cleared, then the next instruction could suffer an external interrupt
or an illegal instruction trap. Once again, the PC buffer register will not reflect the
true execution address. An interrupt return would cause execution to commence at a
random address. The above problems can be avoided by clearing FZ two cycles before enabling the processor to once again enter freeze mode.
3.5 TRACE–BACK TAGS
When the compiler generates the code for a procedure, it places a one or two
word tag before the first instruction of the procedure. The tag information is used by
debuggers to determine the sequence of procedure calls and the value of program
variables at a given point in program execution. The trace–back tag describes the
memory frame size and the number of local registers used by the associated procedure. A one word tag is used unless the memory stack usage is greater than 2k words,
in which case a two–word tag is used. Figure 3-5 shows the format of the tag data.
Most of the tag data fields are self explanatory. The M bit–field is set if the
procedure uses the memory stack; in that case, msize is the memory stack frame size
in double words. The argcount is the number of in–coming arguments in registers
plus two. The T bit–field, when set, indicates the routine is transparent (see section
3.7).
When procedures are built in assembly language rather than, say C, the programmer is responsible for building the appropriate tag data word[s] ahead of the first
instruction. For an example see section 2.3.5. Figure 3-6 shows an example register
stack history for three levels of procedure calls. In the example, the current procedure
is a small leaf procedure. Small leaves differ from large leaf procedures in that they
do not lower the register stack and allocate new local registers to the procedure.
Looking at the parent of the current procedure, it can be seen the stack was lowered by six words (rsize) during the parent procedure prologue. The top of the activation record is identified by the procedure lr1 register value. In principle the start of the
grandparent procedure activation record can be found by subtracting the argcount
value from the address identified by the parent lr1. In this way the rsize for the parent
procedure can be determined; adding rsize to the parent’s gr1 value enables the
grandparent gr1 value to be obtained. Repeating the mechanism with the grandparent
lr1 value allows all previous activation records to be identified until the first procedure call is found. The first procedure is identified by a tag value of zero, and is normally the start procedure in file crt0.s.
[Figure omitted. The one–word tag packs the M and T bits, the argcount field, a
reserved field, the msize field and the fp field into a single word. The two–word tag
carries a full–width msize in its first word, with the M and T bits, argcount, reserved
field and fp field in the second word.]
Figure 3-5. Trace–Back Tag Format
However, there is a problem with this scheme as shown in Figure 3-6. Small leaf
procedures do not have lr1 values for their own activation record; they share the lr1
value of their parent. Additionally, large leaf procedures have a new lr1 register assigned, but because leaves do not call other procedures, the lr1 register is not assigned
to point to the top of the activation record. For this reason, the lr1 value can not be
initially used as a mechanism for walking back through procedure call register allocation.
In practice, most debuggers walk back through instruction memory till they find
the current procedure tag value, then they look at the immediately following prologue code. The first prologue instruction is normally a “SUB gr1, gr1, rsize*4”. If
the rsize is bigger than 64, then it is a CONST followed by a SUB. In any case the rsize
value is determined by this method rather than computing it from an lr1–argcount
based calculation.
Before the Am29050 processor became available, floating–point intensive applications were normally supported with an Am29027 coprocessor. The procedure
call mechanism specified that coprocessor float registers 0–2 are allowed to be modified by the callee and are not saved across calls. Float registers 3–7 may also be modified by the callee but are preserved across procedure calls. Thus the caller must first
save them before making a call, and restore them upon callee return. A region of the
procedure activation record is assigned for saving the coprocessor registers. Additionally, the fp field in the tag word is used to indicate the number of registers
saved.
[Figure omitted. It shows a register stack snapshot: the grandparent activation
record (size = 12) at the highest addresses; below it the parent activation record
(argcount = 6, rsize = 6) with its lr0 and lr1 registers and in–coming arguments; and
at gr1 the small leaf activation record, which shares the lr1 value of its parent.]
Figure 3-6. Walking Back Through Activation Records
When using an Am29050 processor the fp field value is always zero. The four
double–word accumulator registers are not preserved across a procedure call. If a
procedure uses the accumulators and wishes to preserve their contents, it must first
save them before making a procedure call. This may involve temporarily modifying
the special floating–point environment registers. Because the floating–point accumulators are normally accessed by assembly language leaf routines, caller saving of
the accumulators results in a reduced overhead compared to callee saving.
3.6 INTERRUPT TAGS
The High C 29K compiler will place an additional tag word before the normal
procedure tag when the key word _Interrupt is used to define a procedure’s return
type. Figure 3-7 shows the typical tag word combination produced. The first procedure of an interrupt handler, that is the procedure accessed after the interrupt vector is
processed and any necessary preparation work is performed, should be identified by
the _Interrupt key word. Examples of how interrupt tags are used by application code
are shown in section 2.5.4.
The second (or second and third) word of tag information has the same format as
all procedure tags. Only the first tag word is new and this word is known as the interrupt tag word. It has several bit–fields which describe the execution environment required by the procedure. These fields can be examined at interrupt occurrence time or
at interrupt installation time to determine the appropriate steps required to prepare
the interrupt processing environment. The objective is to optimize interrupt preparation by only preserving the minimum required context. Of course, the exact steps tak-
[Figure omitted. The interrupt tag word holds the C, F, I and Q bits together with the
temps., return registers and local registers fields; it is followed by the normal
one–word procedure tag (M and T bits, argcount, reserved, msize and fp fields).]
Figure 3-7. Interrupt Procedure Tag Words
will be very much dependent on the operating system in use. For example, some
operating systems may process interrupts in User mode with address translation in
use. Others may process interrupts in Supervisor mode with physical addressing.
The C bit is set if the procedure calls any other procedure (excluding transparent
routines). That is, the C bit is set if the procedure is not a leaf. When another procedure is called, it may be necessary to have the register stack repaired before the first
procedure is entered. The local registers bit–field indicates the number of registers
required from the register stack cache. However, other procedures called by the first
procedure may require additional local registers. Note that large leaf routines may
require local registers but of course the C bit will still not be set. When the C bit is set,
preparation code is unlikely to scan the other bit–fields as it is usually necessary to
assume that called functions may perform any 29K operation.
The F bit will be set if any floating–point operations are performed. Most 29K
family members do not directly support floating–point instructions but take a trap
when a floating–point instruction is encountered. Trap handlers can not be entered
from Freeze mode and execution of a floating–point operation could modify the state
of floating–point accumulators (Am29050) or coprocessor (Am29027) status registers.
The I bit is set if any of the indirect pointer registers (IPA, IPB and IPC) are modified by the procedure. These registers would be affected by a call to a transparent
helper routine which issues a trap. The High C 29K compiler uses a transparent routine to perform integer multiply with most 29K family members. If the I bit is set then
interrupt preparation code would be required to preserve the indirect pointer registers
before entering the first procedure. The Q bit is set when the Q register (sr131) is modified. This register is used during floating–point and integer multiply and divide
emulation routines.
The 29K calling convention states that a procedure returns its results in global
registers gr96–gr111. An interrupt handler routine does not have any return value.
However, it may use registers in this range to hold temporary values during procedure execution. The return registers bit–field indicates the number of registers used.
Additionally, temporary registers can be obtained from register range gr116–gr120.
In fact, the GNU compiler prefers to allocate temporary registers in this range before
allocation from the return registers range (see section 2.5.2). The temps. bit–field indicates the number of registers modified in the range gr116–gr124. When interrupt
processing is accomplished with a leaf routine, these bit–fields enable only the minimum number of global registers to be saved before the interrupt handler procedure is
entered.
3.7 TRANSPARENT ROUTINES
Transparent routines are used to support small highly efficient procedure calls.
They are like small leaf procedures in that they do not lower the register stack and
allocate a new activation record. They are unlike leaf procedures in that the only
global registers which the caller does not expect to survive the call are tav (gr121) and
tpc (gr122). They are normally used to support compiler specific support routines
such as integer multiply (where the 29K hardware does not directly support this operation).
Parameters are passed to transparent routines using tav and the indirect pointer
registers. Return values are via tpc and possibly the modified register identified by
indirect pointer ipc. Leaf procedures can call transparent routines without changing
their status as leaf routines.
Newer versions of the High C 29K compiler enable the user to select procedures
for implementation as transparent routines. For example, a procedure which would
normally be of return type “int” would be defined (and declared) as type “_Transparent int”. The _Transparent key word extends the C language. Of course there are a
number of restrictions which apply to transparent routine construction: They can
only receive two in–coming parameters (passed via IPA and IPB); They must be of
type void or return an object of word–size or less (return values are passed via IPC);
They must not perform any floating point (and some integer) operations which require trapware support; And of course, they must not call any other procedures (even
if they are transparent).
3.8 INITIALIZING THE PROCESSOR
Reset mode is entered when the processor’s *RESET pin is activated. This
causes the Current Processor Status (CPS) register to be set to the Reset mode values;
the processor operates in Supervisor mode with all data and instruction addresses being physical (no address translation); all traps and interrupts are disabled and the processor Freeze mode bit is set. (See the Initialization section of the processor User’s
Manual for the exact CPS register setting.) Individual 29K family members have
additional Reset mode conditions established, such as disabling cache memory
where appropriate.
Instruction execution begins at address 0. For processors supporting both
Instruction memory space and read–only memory (ROM) space, ROM space is used
when fetching instructions from external memory. However, many Am29000 processor systems apply no distinction when decoding instruction and ROM memory
space.
The programmer must establish control of the processor and available resources. Section 7.4 discusses how this is achieved with the OS–boot operating system. OS–boot is made available by AMD, and is used to implement a single–task
application environment which supports HIF (see Chapter 2) system call services.
Because OS–boot is so freely available to the 29K community, it is convenient to use
the included processor start–up code sequence for any new designs.
3.9 ASSEMBLER SYNTAX
Assembly language, like all languages, has a character set and a set of grammar
rules. Purchasers of the ASM29K assembly language tool package from AMD or
another tool company normally obtain a copy of the assembly language syntax specification. There are a number of assembler tools available and all of them comply (but
not always fully) with the AMD defined syntax for assembly level programming.
Many of the assemblers have options which are unique, but it has been my experience that assemblers will generally accept code which is produced by any of the
available compilers.
3.9.1 The AMD Assembler
The AMD assembly language tool package, ASM29K, was used to develop all
of the assembly language examples shown in this book. The assembler, linker and
librarian tools included in the package were developed by Microtec Research Inc.
(MRI) for AMD. The tools are available on a number of platforms; the most popular
being SUN and HP workstations and IBM PCs and compatibles. This section does
not cover the details of the AMD assembler (as29) and its options as they are well
documented in the literature supplied with each purchased tool package.
During the introduction of the Am29000 processor, AMD had a second assembly level tool package developed by Information Processing Techniques Corp. (IPT).
This second tool chain forms the basis of a number of elaborate tool packages made
available by third party tool suppliers. All of the tool suppliers are listed in the AMD
Fusion29Ksm Catalogue [AMD 1992a][AMD 1993b]. Both assemblers fully comply with the AMD assembler syntax for 29K code. However, the librarian tools supplied with the different tool packages maintain library code in different formats. This
means libraries cannot be shared unless reformatting is applied.
3.9.2 Free Software Foundation (GNU), Assembler
The Free Software Foundation Inc. is an organization based in Cambridge MA,
USA, which helps develop and distribute software development tools for a range of
processors. Anyone can contribute programs to the foundation and users of
foundation supplied tools have the freedom to distribute copies of tools freely (or can
charge for this service if they wish). The foundation tools (often known as GNU
tools) include a complete tool chain for software development for the 29K family.
The GNU assembler is known as GAS, and is available in source form from AMD
and from the Cygnus Support company.
GAS is primarily intended to assemble the output from the GNU C language
compiler, GCC (see Chapter 2). It does accept code complying with the AMD assembly language syntax; however, there are a number of differences. Most notably, it
does not support macro instructions. Developers may wish to use a UNIX utility such
as M4 or CPP to support macros with the GAS tool (section 2.5.2 has an example of
assembler macros using the C preprocessor, CPP).
A number of developers have compiled GAS for use in a cross–development
environment where the target processor is a 29K, but the development platform is a
SUN or HP workstation or an IBM 386–PC. These tools are available among the 29K
GNU community, many of which are university engineering departments. AMD has
a University Support Program which helps universities wishing to include the 29K in
educational programs, to obtain hardware and software development tools as well as
other class materials. There may be a university near you which will supply you with
a copy of the compiled GNU tools for a small tape handling charge.
If you get a copy of GAS from AMD or Cygnus or other Fusion29K partner,
then it is likely that the documentation supporting the tool was supplied. After installing the tools on a UNIX machine and updating the MANPATH variable to include
the GNU manual pages, it should be possible to just type “man gas” and obtain a display of the GAS program options. Alternatively look in the GAS source directories
for a file called 29k/src/gas/doc/gas.1 or as.1 to obtain the necessary documentation.
Below is an extract from the GAS manual pages which indicates some of the capabilities of the tool.
gas [–a | –al | –as] [–f] [–I path] [–K] [–L] [–o objfile] [–R] [–v] [–W] files...
OPTIONS
–a | –al | –as
Turn on assembly listing; –al, listing only; –as, symbols only; –a, everything.
–f
Fast ––skip preprocessing (assume source is compiler output).
–I path
Add path to search list for .include directives.
–K
Issue warning when difference tables altered for long displacements
–L
Keep (in symbol table) local symbols starting with L.
–o objfile
Name the object–file output for GAS.
–R
Fold data sections into text sections.
–v
Announce GAS version.
–W
Suppress warning messages.
Chapter 4
Interrupts and Traps
This chapter describes techniques for writing interrupt and trap handlers for
29K processor-based systems. It also describes the interrupt hardware for the 29K
Processor Family, and the software environment within which interrupt handlers
execute.
Handler descriptions are separated into two major sections. The first discusses
Supervisor mode handlers and the second covers User mode handlers. The
descriptions apply equally well to interrupts and traps. For the purposes of this
chapter, User mode handlers refer to interrupt and trap handlers written in a
high-order language. However, it is possible to enter User mode without first
establishing high-order language support. Additionally, for our purposes we shall
call assembly level handlers Supervisor mode handlers.
Although interrupts are largely asynchronous events, traps most often occur
synchronously with instruction execution; however, both share common logic in the
29K Processor Family and are often handled entirely in Supervisor mode, with interrupts disabled and Freeze mode (described later) in effect. However, interrupt and
trap handlers may execute in one or more of the stages shown in Figure 4-1. Each
stage implies an increased level of complexity, and may execute a return from interrupt (IRET instruction) if the process is complete. However, in the case where User
mode has been entered, the handler must first reenter Supervisor mode before executing an IRET instruction.
The first stage is entered when an interrupt occurs. In this stage the processor is
running in Supervisor mode, with Freeze mode enabled and interrupts disabled. In
the second stage Freeze mode is turned off (disabled), but the processor remains in
Supervisor mode with interrupts disabled. The third stage execution takes place with
interrupts enabled, but with the processor still operating in Supervisor mode. In the
[Figure omitted. It shows the four handler stages which may follow an interrupt: a
Supervisor mode handler with Freeze mode on; a Supervisor mode handler with
Freeze mode off; a Supervisor or User mode handler with interrupts enabled, after
the stacks have been prepared; and finally a C–level handler running in User mode.]
Figure 4-1. Interrupt Handler Execution Stages
fourth stage, execution continues in User mode. Each stage is discussed in the following sections of this chapter.
Before entering into a discussion of Supervisor mode interrupts and traps, it is
necessary to first understand the way interrupts are handled by the 29K family hardware.
4.1 29K PROCESSOR FAMILY INTERRUPT SEQUENCE
When an interrupt or trap occurs and is recognized, the processor initiates the
following sequence of steps.
Instruction execution is suspended.
Instruction fetching is suspended.
Any in-progress load or store operation, which was not the cause of a trap, is
completed. In the case of load- and store-multiple, any additional operations are
suspended.
The contents of the Current Processor Status (CPS) register are copied into the
Old Processor Status (OPS) register.
The CPS register is modified as shown below. The letter “u” means unaffected,
and “r” indicates that this bit depends on the value of the RV bit in the CFG
register, or the R bit in the fetched interrupt vector. Note, only 3–bus 29K
processors have the R bit–field implemented. The letter “f” is only supported by
the Am29040 processor; it is used here to indicate that the value of the PD bit is
unaffected when taking a trap or interrupt if the FPD (Freeze PD) bit is set in the
CFG register. Otherwise, the PD bit is set to a 1 (see section 5.14.2).
[Figure omitted. It shows the bit layout of the CPS and OPS registers, including the
CA, IP, TE, TP, TU, FZ, LK, RE, WM, PD, PI, SM, IM, DI, DA and MM fields,
annotated with the fixed, unaffected (u), vector–dependent (r) and FPD–dependent (f)
values described above.]
Figure 4-2. The Format of Special Registers CPS and OPS
The setting of the Freeze (FZ) bit freezes the Channel Address (CHA), Channel Data (CHD), Channel Control (CHC), Program Counters (PC0–PC2), and ALU Status registers.
The address of the first instruction of the interrupt or trap handler is determined.
If the VF bit of the Configuration register is 1, the address is obtained by
accessing a vector from data memory. The access is performed by using the
physical address obtained from the Vector Area Base Address register and the
vector number. If the VF bit is 0, the instruction address is directly given by the
Vector Area Base Address register and the vector number. For all 29K
processors other than 3–bus processors, the VF bit is reserved and effectively
set to 1.
With 3–bus processors, if the VF bit is 1, the R bit in the vector fetched above is
copied into the RE bit of the CPS register. If the VF bit is 0, the RV bit of the
Configuration register is copied into the RE bit. This determines whether the first instruction of the interrupt handler is fetched from instruction ROM space or from instruction space.
An instruction fetch is initiated using the instruction address determined above.
At this point, normal instruction execution resumes.
No registers (beyond the interrupted program’s CPS) are saved when an interrupt occurs. Any registers whose contents are essential to restarting the interrupted
program must be deliberately saved if they are going to be modified by the interrupt
handler.
4.2 29K PROCESSOR FAMILY INTERRUPT RETURN
After the handler has processed the interrupt, control is given back to the interrupted task by executing an IRET or IRETINV instruction, which causes the Am29000 processor to initiate the following steps.
Any in-progress LOAD or STORE operation is completed. If a load-multiple or
store-multiple sequence has been suspended, the interrupt return is not
completed until that operation is finished.
Interrupts and traps are disabled, regardless of the settings of the DA, DI, and IM
fields of the CPS register.
If the interrupt return instruction is an IRETINV, the Valid bit associated with
each entry in the Branch Target Cache memory is reset. In the case of
the Am29030 processor, the IRETINV instruction causes cache blocks to
become invalid, unless the blocks are locked and the cache is enabled.
The contents of the OPS register are copied into the CPS register. This normally
resets the FZ bit, allowing the Program Counters (PC0–PC2) and the CHA,
CHD, CHC, and ALU Status registers to update normally. The Interrupt
Pending bit (IP) of the CPS register is always updated by the processor. The
copy operation is irrelevant for this bit.
The address in Program Counter 1 (PC1) is used to fetch an instruction. The
CPS register conditions the fetch. This step is treated as a branch, in the sense
that the processor searches the Branch Target Cache memory for the target of the
fetch.
The fetched instruction above enters the decode stage of the pipeline.
The address in PC0 is used to fetch an instruction. The CPS register conditions
the fetch. This step is treated as a branch, in the sense that the processor searches
the Branch Target Cache memory for the target of the fetch.
The first fetched instruction enters the execute stage of the pipeline, and the
second instruction fetched enters the decode stage.
If the Contents Valid (CV) bit of the CHC register is 1, the Not Needed (NN) bit is 0, and the Multiple Operation (ML) bit is also 0, an external access is restarted.
If the PC1 register points to an interrupted load- or store-multiple instruction,
and the ML bit is one, then an interrupted load- or store-multiple operation is
restarted. The external memory access is continued based on the contents of the
CHA, CHD, and CHC registers. The interrupt return is not completed until this
operation is finished.
Interrupts and traps are enabled per the appropriate bits in the CPS register.
The processor resumes normal operation.
It is important to remember that once an interrupt or trap occurs, the processor is
immediately vectored to the appropriate handler, with interrupts disabled, Freeze
mode enabled, and Supervisor mode execution. The next section discusses Supervisor mode interrupt handlers. The final section describes User mode interrupt handlers. Both sections include 29K Processor Family assembly language source code
examples.
4.3 SUPERVISOR MODE HANDLERS
4.3.1 The Interrupt Environment
After an interrupt or trap occurs, and the event is recognized by the processor,
the 29K family hardware interrupt sequence, described earlier, is initiated. Interrupt
handler code begins execution at this point.
The amount of code necessary to handle an interrupt or trap depends on the nature of the interruption, and the degree to which a given operating system supports
interrupts and traps. For robust systems, interrupt and trap handlers must be sure to
return to an environment guaranteed to be intact when their processing is complete.
Some systems may elect to terminate a program if certain interrupts and traps occur,
while others may ignore these entirely. The operating system will also set some standards for register availability in interrupt routines. As stated in the section describing
the calling convention (Chapter 2), AMD recommends that the 29K processor’s
global registers gr95 and below be reserved for non User-mode code. Additionally, section 3.3 of Chapter 3 (Assembly Language Programming) goes further and suggests an allocation scheme for operating system reserved registers. (See Table 4-1.)
Table 4-1. Global Register Allocations

Registers    Name        Description
gr1          rsp         Local register stack pointer
gr64–67      it0–it3     Interrupt handler temporaries
gr68–79      kt0–kt11    Temporaries for use by operating system
gr80–92      ks0–ks12    Operating system statics
gr93–95      ks13–ks15   Floating-point trap handler statics
gr96–127     various     Reserved by Am29000 processor calling conventions
In essence, global registers (gr64–gr95) are reserved for interrupt handlers and
the operating system. The remaining 32 global registers (gr96–gr127) are reserved
for holding the interrupted program’s context.
Existing floating-point trap handlers use gr64–gr78 as temporary registers,
with interrupts disabled. In addition, registers gr93–gr95 are used to hold static variables for these routines. The register assignments in these routines can easily be
changed, but fifteen temporary global registers and three static global registers must
be allocated. Note, with the Am29050 processor, only the integer divide instructions
are not directly supported by the processor hardware and require trapware support.
This requires six temporary global registers and no static global registers.
If all of the local registers are given over to User-mode code use, then interrupt
and trap handlers must also assume that the local registers are being used and may not
be arbitrarily rewritten, unless the values they contain are saved upon entry, and are
restored prior to exit. If a cache window size (rfb–rab) less than the physical register
file size is used, then a number of non-static temporary local registers can be made
available for handler use.
Fortunately, most interrupt handlers can operate very efficiently using only a
few temporary registers. It is recommended that global registers gr64–gr67 (it0–it3)
be allocated for this purpose. However, additional temporary registers kt0–kt3 may
be used for interrupt handlers if these registers are not used by the operating system.
4.3.2 Interrupt Latency
The determination of the number of cycles required to reach the first instruction
of an interrupt or trap handler is a little complicated. First consider the case for the
non-vector fetch, table of handlers method.
An external interrupt line may have to be held active for one cycle before the
processor internally recognizes it. Once recognized, one cycle is required to internally synchronize the processor. Now any in-progress load or store must be completed
(Dc cycles, where 0 ≤ Dc ≤ Dw, note Dw is the number of cycles required to complete
a data memory write and is often greater than Dr, the number of cycles required to
complete a data memory read). One cycle is then required to calculate the vector. The
first instruction can then be fetched (Ir cycles) and presented to the instruction fetch
unit. One cycle is required by the fetch unit and a further cycle by the decode unit
before the instruction reaches execute. If the first instruction is found in the cache,
then the Branch Target Cache memory forwards the instruction directly to the decode
unit. The total latency (minimum of five cycles for the hit case) is given by the equation below.
delay(miss) = 1 + 1 + Dc + 1 + Ir + 1 + 1
delay(hit) = 1 + 1 + Dc + 1 + 1 + 1
Now let’s consider the case for a table of vectors, that is the VF bit in the CFG
register is set (always the case for 2–bus processors and microcontrollers). The vector must still be calculated and any in-progress load or store completed before the
vector can be fetched from data memory. Additionally, if the processor has a data
cache, the cache state is synchronized after any current data access is completed.
Data cache synchronizing is discussed in detail at the end of this section. The number
of cycles required to read the data memory is represented by Dr. Once the address of
the handler has been fetched it must be routed to the processor PC, this takes one
cycle. A further cycle occurs before the address reaches the Address Pins. Delays involved in fetching the first instruction are then the same as described above. Once
again, if the first instruction is found in the cache, the Branch Target Cache memory
forwards the instruction directly to the decode unit. The total latency (minimum of
seven cycles for the hit case) is given by the equation below.
delay(miss) = 1 + 1 + Dc + <cache sync.> + 1 + Dr + 1 + 1 + Ir + 1 + 1
delay(hit) = 1 + 1 + Dc + <cache sync.> + 1 + Dr + 1 + 1 + 1
The Am29050 processor supports instruction forwarding. This enables instructions to be forwarded directly to the decode unit, bypassing the fetch unit and saving
one cycle. The minimum latency for the Am29050 processor for the vector fetch and
non-vector fetch methods is six cycles and four cycles, respectively.
The Am29040 and Am2924x processors have a data cache, which can add to interrupt latency. Consider that the Am29240 has a two word write–buffer which must be
flushed before interrupt processing can be completed. This adds as much as 2xDw
cycles to interrupt latency. The processor could be performing a load when interrupted. If the load caused a block (cache entry) to be allocated, then the load would be
completed but block allocation canceled.
Cache synchronizing for the Am29040 processor is a little more complicated.
The worst case condition occurs when the write buffer is full and a load is performed.
The load can cause block allocation and because of the write–back policy, the selected block may have to be copied–back. The Am29040 always flushes the write–
buffer before reloading a new block. Cache reload can not be cancelled even if the
interrupt occurs before the write–buffer is flushed. However, the loaded block will be
held in the reload buffer (see Figure 5-9) and the copy–back buffer returned to the
cache. Unfortunately, the reload buffer contents will never make it into the cache.
The effects of data cache synchronizing on interrupt latency are summarized below:
Am29240 <cache sync.> = 2 x Dw
Am29040 <cache sync.> = (2 x Dw) + (4 x Dr)
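The arithmetic above can be checked with a short calculation. The following C fragment is not part of the original text; it simply evaluates the latency equations for one assumed set of cycle counts (the values of Dc, Dr, Ir, and the cache-sync penalty are system dependent and are chosen here only for illustration).

    /* Illustrative evaluation of the interrupt latency equations above.
     * The cycle counts are assumptions, not measurements.
     */
    #include <stdio.h>

    int main(void)
    {
        int Dc   = 2;   /* cycles to complete an in-progress data access     */
        int Dr   = 2;   /* cycles for a data memory read (vector fetch)      */
        int Ir   = 3;   /* cycles to fetch the first handler instruction     */
        int sync = 0;   /* <cache sync.> penalty, 0 if no data cache present */

        /* Non-vector fetch (table of handlers) method: */
        int nv_miss = 1 + 1 + Dc + 1 + Ir + 1 + 1;
        int nv_hit  = 1 + 1 + Dc + 1 + 1 + 1;

        /* Vector fetch (table of vectors, VF = 1) method: */
        int vf_miss = 1 + 1 + Dc + sync + 1 + Dr + 1 + 1 + Ir + 1 + 1;
        int vf_hit  = 1 + 1 + Dc + sync + 1 + Dr + 1 + 1 + 1;

        printf("non-vector fetch: miss=%d hit=%d cycles\n", nv_miss, nv_hit);
        printf("vector fetch:     miss=%d hit=%d cycles\n", vf_miss, vf_hit);
        return 0;
    }

With Dc = 0, sync = 0, and a Branch Target Cache hit, the fragment reproduces the five- and seven-cycle minimums quoted above.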
4.3.3 Simple Freeze-mode Handlers
The simplest interrupt or trap handler will execute in its entirety in Supervisor
mode, with interrupts disabled, and with the FZ bit set in the CPS register. This corresponds to the first stage depicted in Figure 4-1.
The FZ bit in the Current Processor Status register is responsible for locking the
values in the Program Counters (PC0–PC2), the Channel registers (CHA, CHD and
CHC), and the ALU status. As long as the FZ bit remains set, these registers will not
be updated. Note, the PC0–PC2 registers are not the actual Program counter, but a
three-stage buffer store that records the stages of program execution.
If the intention is to ignore the interrupt and return control to the interrupted process, the entire handler can consist of little more than an IRET instruction. After the
interrupt request has been cleared, execution of this instruction will cause the processor to perform the interrupt return sequence described above, resuming execution of
the interrupted program at the point of interruption.
4.3.4 Operating in Freeze mode
Interrupt or trap handlers executing only a small number of instructions before
returning will benefit from the very short latency of the interrupt sequence performed
by the 29K processor. This is because the 29K processor offers superior performance
when compared with conventional processors that save a great deal of context whenever an interrupt or trap occurs.
Because the executing program’s context is often not disturbed by the interrupt
or trap handling code, both the reaction time (latency) and processing time of the interrupt handler are minimized.
In this context, no registers (except the CPS) have been saved when an interrupt
or trap handler is given control by the processor. In addition, if the Program Counter
registers (PC0 and PC1) are left undisturbed, the 29K processor’s instruction pipeline is more quickly restarted when the handler returns.
But, because Freeze mode has frozen the contents of several important registers,
there are some instructions that should not be used in this context, or whose use is
restricted. These instructions are:
Instructions that can generate traps. These should not be used because traps are
disabled in Freeze mode. These include ASSERT, emulated floating-point
operations (e.g., FADD), and certain integer operations whose execution
could cause a trap to occur. Note, the Am29050 processor executes all floating
point operations directly and thus these instructions can be used with the
Am29050 processor as they will not generate a trap.
If a trap generating instruction is executed it will have the same effect as a NOP
instruction. An exception trap is caused by bad memory accesses. These traps are always taken, even if they occur in Freeze-mode code. Because the processor registers
were already frozen at the time of the nested trap, it can be difficult to determine the
cause of the trap or issue an IRET instruction.
However, if an Am29050 processor is being used and a trap occurs when the DA
bit is set in the CPS register, Monitor mode is entered. Monitor mode (section 4.3.5)
can be used by monitors to debug kernel Freeze-mode code.
Instructions that use special registers––these instructions may be used;
however, any modified registers may have to be saved and restored before the
interrupt handler returns. The EXTRACT and INSERT instructions are in this
category.
Instructions that modify special registers–– because of the normal side effect of
their operation, these instructions must be used with caution. There are three
subgroups within this group:
—Arithmetic and logical instructions that set the Z, N, V, and C status bits in the
ALU Status register. These instructions can be used in Freeze mode if the ALU
status bits are not used. Because Freeze mode disables updating the ALU Status
register, extended precision arithmetic instructions, such as ADDC or SUBC,
will not execute properly.
—Load-Multiple and Store Multiple. These instructions cannot be used in Freeze
mode, because the Channel registers (CHA, CHD, and CHC) upon which their
execution depends are frozen.
—LOAD and STORE instructions with the set BP option enabled, if the Data
Width Enable (DW bit) is 0. In this case, if BP must be set, it will have to
be done explicitly by using a Move-To-Special Register (MTSR) instruction.
Therefore, LOAD and STORE instructions with word-aligned addresses (i.e.,
those whose least significant 2 bits are 0) may be used without additional effort;
however, if byte or half-word instructions are needed, the BP register must be
explicitly set prior to execution of a non-word-aligned LOAD, STORE,
INSERT, or EXTRACT instruction.
All other instructions may be used without restriction, keeping in mind the inherent implications of Freeze mode. (Note: Other restrictions apply to Am29000
processors manufactured prior to revision C.)
4.3.5 Monitor mode
Monitor mode only applies to the Am29050 processor. If a trap occurs when the
DA bit in the CPS register is a 1, the processor starts executing at address 16 in
instruction ROM space. Monitor mode is not entered as a result of asynchronous
events such as timer interrupts or activation of the TRAP(1–0) or INTR(3–0) lines.
On taking a Monitor mode trap the Reason Vector register (RSN) is set to indicate the cause of the trap. Additionally, the MM bit in the CPS register is set to 1.
When the MM bit is set, the shadow program counters (SPC0, SPC1, and SPC2) are
frozen, in a similar way to the FZ bit freezing the PC0–PC2 registers. Because the
shadow program counters continue to record PC-bus activity when the FZ bit is set,
they can be used to restart Freeze mode execution. This is achieved by an IRET or
IRETINV instruction being executed while in Monitor mode.
Because Monitor mode traps are used by monitors in the debugging of trap and
interrupt handlers and are not intended for operating system use, they are dealt with
further in Chapter 7 (Software Debugging).
4.3.6 Freeze-mode Clock Interrupt Handler
The code shown in this example illustrates one way to program a clock that
keeps the current time. One important aspect of this routine is the need to minimize
overhead in the function, taking as little time as possible to update the clock when an
interrupt occurs. Allocating two Operating System Static registers (ks1, ks2) to contain millisecond and second values reduces the need to access data memory inside the
handler.
; freeze mode clock interrupt handler
;
        .equ    IN,0x0200000            ;IN-bit of TMR reg
        .reg    CLOCK,ks1               ;1 ms increments
        .reg    SECS,ks2                ;time in seconds
        .equ    CPUCLK,25               ;CPU clock in MHz
        .equ    RATE,1000               ;ints per second

intr14:
        const   it0,IN                  ;IN-bit in TMR
        consth  it0,IN
        mfsr    it1,tmr
        andn    it1,it1,it0             ;clear IN-bit
        mtsr    tmr,it1
        const   it0,RATE
        cplt    it0,CLOCK,it0           ;check if 1 sec
        jmpf    it0,carry               ;jump if CLOCK > RATE
        add     CLOCK,CLOCK,1           ;increment CLOCK
        iret
carry:
        const   CLOCK,0
        add     SECS,SECS,1             ;increment seconds
        iret
This handler executes once each time an interrupt from the on-board timer occurs. In the preceding code, timer interrupts are assumed to occur once each millisecond; therefore, the value in the CLOCK register will increment 1000 times in one second. When the 1000th interrupt occurs, the CLOCK register is set to 0, and the SECS
variable is incremented.
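For readers more comfortable with C, the counting logic of the handler can be modeled as shown below. This is a sketch only: the real handler runs as Freeze mode assembly code and keeps its counters in the global registers ks1 and ks2, whereas the C version uses ordinary static variables.

    /* C model of the Freeze-mode clock handler's counting logic.
     * CLOCK counts timer interrupts (one per millisecond); SECS counts
     * whole seconds.  RATE matches the .equ RATE,1000 in the handler.
     */
    #define RATE 1000

    static unsigned long CLOCK;     /* held in ks1 by the real handler */
    static unsigned long SECS;      /* held in ks2 by the real handler */

    void timer_tick(void)           /* called once per timer interrupt */
    {
        if (CLOCK < RATE) {
            CLOCK++;                /* not yet a full second            */
        } else {
            CLOCK = 0;              /* 1000th tick: roll over           */
            SECS++;
        }
    }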
The 29K processor Timer Counter register includes a 24-bit Timer Count Value
(TCV) field that is automatically decremented on every processor cycle. When the
TCV field decrements to 0, it is written with the Timer Reload Value (TRV) field of
the Timer Reload (TMR) register on the next cycle. The Interrupt (IN) bit
of the TMR register is set at the same time. The following code illustrates a technique
to initialize the timer for this purpose.
; freeze mode clock interrupt initialization
;
        .equ    TICKS,(CPUCLK*1000000/RATE)
        .equ    IE,0x1000000            ;IE-bit in TMR reg

clkinit:
        const   it0,TICKS               ;i.e., 25,000
        consth  it0,TICKS
        mtsr    tmc,it0                 ;set counter value
        const   it0,(IE|TICKS)          ;value+int.–enable
        consth  it0,(IE|TICKS)
        mtsr    tmr,it0                 ;set reload value
        const   SECS,0                  ;set seconds=0
        jmpi    lr0
        const   CLOCK,0                 ;set clock=0
Assuming the processor is running at 25 MHz, setting the timer reload and
count values to 25000 causes the count to decrement to 0 once each millisecond. This
will accumulate 1000 counts during one second of CPU execution. If two Operating
System Static registers can not be spared for this purpose, the SECS variable should
be located in main memory. The modified code for incrementing the seconds counter
in memory is shown below.
SECS:
        .word   0

carry:
        const   it0,SECS
        consth  it0,SECS
        load    0,0,it1,it0
        add     it1,it1,1
        const   CLOCK,0
        store   0,0,it1,it0
        iret
Because the SECS variable is only referenced once per second, the performance
degradation due to this change would be minimal. The initialization code would also
need to be modified to set the memory location for SECS to 0 in this case.
4.3.7 Removing Freeze mode
Some interrupt handlers will benefit from removing Freeze mode, without enabling interrupts, in order to use the load-multiple and store-multiple instructions. A
less common reason for removing Freeze mode is the ability to use the ALU Status bits:
V, N, Z, and C. In either case, several registers must be saved before the Freeze-mode
bit in the CPS register can be cleared.
The removal of Freeze mode represents entry into the second stage of interrupt
handling, as shown in Figure 4-1.
The frozen Program Counters (PC0 and PC1) must be saved so that the handler
will be able to resume execution of the interrupted program. If external data memory
is to be accessed, the CHA, CHD and CHC Channel registers must be saved so that
their contents can be restored after a load- or store-multiple instruction has been
executed. Saving the channel registers also saves the Count Remaining register,
which is contained within the CHC register. Additionally, before any ALU/Logical
operations are performed, the ALU register must be saved.
After the Program Counters have been saved and before any Channel or ALU
operation is executed, Freeze mode can be removed by clearing the Freeze (FZ) bit of
the CPS register. This immediately removes the freeze condition, and all registers,
including the Program Counters, will update normally. The PC0 register shall reflect
the PC-BUS activity on the cycle following the clearing of Freeze mode. One cycle
later, the PC1 register shall begin to reflect the PC-BUS activity for the current
execution stream. Other registers will only be updated when the relevant instructions
are performed (as described above).
The primary benefit of leaving Freeze mode is the ability to use the load- and
store-multiple instructions. After Freeze mode has been exited, the DA bit in the CPS
register is still set and instructions causing traps should not be used. Thus, many of
the restrictions listed in the section titled Operating in Freeze mode (section 4.3.4)
will still apply, with the additional requirement that several of the interrupt temporary
global registers will be needed to hold the saved registers.
An example of code that implements removing Freeze mode is shown below.
; Removing Freeze mode example code
;
        .equ    FZ,0x00000400           ;FZ-bit in CPS
        .equ    SM,0x00000010           ;SM-bit in CPS
        .equ    PD,0x00000040           ;PD-bit in CPS
        .equ    PI,0x00000020           ;PI-bit in CPS
        .equ    DI,0x00000002           ;DI-bit in CPS
        .equ    DA,0x00000001           ;DA-bit in CPS
        .equ    REMOVE,(SM|PD|PI|DI|DA)
        .equ    FREEZE,(REMOVE|FZ)

intr0:                                  ;interrupt vector points here
        mfsr    it0,pc0                 ;save PC0
        mfsr    it1,pc1                 ;save PC1
        mtsrim  cps,REMOVE              ;remove Freeze mode
        mfsr    it3,alu                 ;save ALU
        mfsr    kt0,cha                 ;save CHA
        mfsr    kt1,chd                 ;save CHD
        mfsr    kt2,chc                 ;save CHC
;
; The interrupt handler code goes here
;
        mtsr    chc,kt2                 ;restore CHC
        mtsr    chd,kt1                 ;restore CHD
        mtsr    cha,kt0                 ;restore CHA
        mtsr    alu,it3                 ;restore ALU
        mtsrim  cps,FREEZE              ;set Freeze mode
        mtsr    pc1,it1                 ;restore PC1
        mtsr    pc0,it0                 ;restore PC0
        iret
The example code begins by saving the Program Counters (PC0 and PC1), using MFSR instructions to move the values from special registers to temporary global
registers it0 and it1.
Freeze mode is then disabled by clearing the FZ bit in the CPS register. (Note the
bits set by the MTSRIM instruction are system implementation dependent; the RE bit
may be required.) Once Freeze mode is turned off, the ALU register will be modified
by any ALU/Logical operation. Thus, it is important that the ALU register be saved
now. (Note that two processor cycles are needed, after Freeze mode is removed, to
allow the program state to properly update the program counters.)
If interrupts are not to be re-enabled and the kernel does not require the use of
global registers (kt0–kt2), then these registers can be used to extend the number of
available interrupt temporary registers.
The ALU register is saved in interrupt temporary register it3. The Channel registers (CHA, CHD and CHC) are then saved in operating system temporary registers
kt0–kt2.
The interrupt handler is still executing with interrupts disabled at this point in
the program, but load- and store-multiple instructions can be freely used, as long as
they do not cause another interrupt or trap to occur. Note, even with the DA bit in the
CPS register set, certain traps such as a Data Access Exception can still be taken.
When the handler is finished, it must reverse the process by restoring all the saved
registers. No particular order of instructions is necessary, as long as Freeze mode is
entered before PC1 and PC0 are restored. Additionally, instructions affecting the
ALU register must not be used after the saved value has been restored. By restoring
the ALU unit after Freeze mode is entered, instructions are prevented from affecting
the ALU register.
When the IRET instruction is executed, the restored Program Counters
(PC0–PC1) are used to resume the interrupted program. The restored CPS (saved in
OPS by the CPU) and Channel register contents are used to restart any unfinished
operations.
If enough global registers are not available for saving the Program Counters and
Channel registers, memory could be used for this purpose. In this case, six words of
memory are needed. Example code for saving and restoring the registers in the user’s
memory stack is shown below. Note, the pushsr and popsr macro instructions first
introduced in section 3.3.1 (page 119), are used in the example code and are presented
again below:
        .macro  pushsr,sp,reg,sreg
        mfsr    reg,sreg
        sub     sp,sp,4
        store   0,0,reg,sp
        .endm
;
        .macro  popsr,sreg,reg,sp
        load    0,0,reg,sp
        add     sp,sp,4
        mtsr    sreg,reg
        .endm
; save registers on memory stack
;
        pushsr  msp,it0,pc0             ;save PC0
        pushsr  msp,it0,pc1             ;save PC1
        pushsr  msp,it0,alu             ;save ALU
        pushsr  msp,it0,cha             ;save CHA
        pushsr  msp,it0,chd             ;save CHD
        pushsr  msp,it0,chc             ;save CHC
;
        const   it3,FZ
        mfsr    it2,cps
        andn    it2,it2,it3
        mtsr    cps,it2                 ;remove Freeze mode
;
; The interrupt handler code goes here
;
        const   it3,FZ
        mfsr    it2,cps
        or      it2,it2,it3
        mtsr    cps,it2                 ;set Freeze mode
;
        popsr   chc,it0,msp             ;restore CHC
        popsr   chd,it0,msp             ;restore CHD
        popsr   cha,it0,msp             ;restore CHA
        popsr   alu,it0,msp             ;restore ALU
        popsr   pc1,it0,msp             ;restore PC1
        popsr   pc0,it0,msp             ;restore PC0
        iret
The previous code can be made more efficient by saving more registers at a time,
at the expense of using a greater number of global registers. Using store-multiple
instructions to save the registers’ contents takes advantage of Burst mode in the
processor memory system.
4.3.8 Handling Nested Interrupts
Handling Nested Interrupts is a complex topic, and the method presented in this
section discusses multiple levels of interrupt nesting [Mann 1992b]. Two methods
are presented. The first method results in an interrupt mechanism similar to the
interrupt scheme used by some CISC microprocessors. The second method takes
advantage of the 29K family RISC architecture, and offers better performance. The
section titled An Interrupt Queuing Model (section 4.3.12) provides an alternative solution to the problem that offers better interrupt processing throughput.
For any interrupt handler taking a significant amount of time to execute, it is
usually important to permit interrupts of a higher priority to occur. This keeps the
latency of higher priority interrupts within acceptable limits. Whenever an interrupt
is allowed to preempt the execution of another interrupt handler, the interrupts are
said to be “nested.” That is, execution of the lower priority handler is interrupted, and
the higher priority handler begins execution immediately.
To allow for nested interrupts, it is only necessary to save the registers or
temporary variables that could be overwritten by a new interrupt handler’s context.
As in the previous example, the program counters (PC0 and PC1) and channel
registers (CHA,CHD, and CHC) need to be saved. In addition, because more than
one execution thread may need to be restarted, the Old Processor Status (OPS) and
ALU registers must be saved.
Because an interrupt may occur immediately after being enabled, it is important
that the PC0 and PC1 registers reflect the PC-BUS activity of the current execution stream.
As already described in the Removing Freeze Mode section, a two cycle delay occurs
before the PC1 register starts updating. Thus Freeze mode must be removed two
cycles before interrupts are enabled.
If the interrupt handler intends to use integer multiply or divide instructions or
emulated floating point instructions, the contents of the Indirect Pointers (IPA, IPB
and IPC) and the Q register should also be saved. Before interrupts are enabled, it is
also important to clear the CHC register, so that incomplete load- or store-multiple
instructions are not restarted when the first interrupt return (IRET) instruction is
executed. Figure 4-3 illustrates the context in which this could lead to unfortunate
results.
In Figure 4-3, execution of a load-multiple instruction in the main program is in
progress when an external interrupt occurs. This results in control being given to a
first-level interrupt handler. The handler enables interrupts, and another interrupt occurs (e.g., a Timer Interrupt). When this happens, the second-level interrupt handler is given control.

[Figure 4-3. Interrupted Load Multiple Instruction. A LOADM in the main program is interrupted; the first-level (Freeze mode) handler enables interrupts and is itself interrupted; when the second-level handler executes its IRET, the LOADM is restarted in the context of the first-level handler rather than in the main program.]
After completing its processing, execution of an IRET instruction causes the
processor to use the information in its CHC register to resume the interrupted load-multiple instruction; but this is in the context of the first-level interrupt handler, rather than in the main program where it was interrupted.
This CHC discussion is merely an explanation to stress that the CHC register
should not only be saved and restored in each interrupt level, but that CHC should
also be cleared before interrupts are enabled. This will ensure that only when the
proper copy of the CHC is restored will execution of an IRET instruction restart the
interrupted load- or store-multiple operation.
A problem relating to clearing the CHC register has been observed with a number of 29K family members. The problem affects the last word of a LOADM instruction reaching its destination register when the LOADM is interrupted. The
problem can be overcome by performing a LOADM or STOREM instruction in the
interrupt handler after coming off Freeze mode but before reenabling interrupts. The
LOADM or STOREM must use a CR value of one or greater. Processors have a
hidden internal shadow CHC which may not be cleared by a move of zero into the
CHC register. A LOADM or STOREM instruction causes the hidden CHC register to
be cleared. The problem can also be overcome by performing a STORE or LOAD
instruction while still in Freeze mode. If interrupts are not reenabled by the interrupt
handler, no special steps are required to deal with the interrupted LOADM difficulty.
The problem is of little importance, as interrupt handlers generally perform the
solutions described without additional code being added.
Additionally, when a trap occurs as a result of a Data Exception Error (DERR), the TF bit in the CHC register will become set. It is important that the CHC register be cleared
rather than be restored for the context containing the violating data access. Otherwise
an interrupt handler loop will result.
4.3.9 Saving Registers
The following code illustrates saving the necessary registers, turning off Freeze
mode, and enabling interrupts.
;multi-level nested interrupt handler
;example code
;
intr0:                                  ;save registers
        pushsr  msp,it0,pc0             ;save PC0
        pushsr  msp,it0,pc1             ;save PC1
        pushsr  msp,it0,alu             ;save ALU
        pushsr  msp,it0,cha             ;save CHA
        pushsr  msp,it0,chd             ;save CHD
        pushsr  msp,it0,chc             ;save CHC
        pushsr  msp,it0,ops             ;save OPS
;
;come off freeze - could use mtsrim
        const   it1,FZ
        mfsr    it0,cps                 ;get CPS
        andn    it1,it0,it1
        mtsr    cps,it1                 ;remove Freeze mode
;
;save more regs while PC0,1 get in step
        pushsr  msp,it0,ipa             ;save IPA
        pushsr  msp,it0,ipb             ;save IPB
        pushsr  msp,it0,ipc             ;save IPC
        pushsr  msp,it0,q               ;save Q
;
        mtsrim  CHC,0                   ;clear CHC
        andn    it1,it1,(DI|DA)
        mtsr    cps,it1                 ;enable interrupts
dispatch:
;
; Interrupt handler code starts here.
; Dispatch to appropriate service routine.
Saving the Indirect Pointers and Q register is a user preference, but their contents are modified by several 29K processor instructions. It is important to bear this in
mind when writing interrupt handlers. The safest approach is to always save the contents of these registers.
The above code uses a stack to save the register contents, similar to the way a
CISC processor’s microcode saves processor state. However, better performance can
be achieved by use of the large number of processor registers to cache the interrupted
context before having to resort to an interrupt context stack. The following code performs much the same task as the previous code, but it can reach the interrupt dispatcher (label dispatch:) in twelve cycles less for the first interrupt and costs only an additional two cycles for interrupts at greater levels of nesting (assuming MTSRIM is
used to update the CPS register).
This code implements a first level interrupt context cache in global registers
kt4–kt10. Global register kt11 is used to keep a record of the current level of interrupt
nesting, and should be initialized to –1 (that is, cache empty). Considering the speed of
the 29K family, it is likely the first-level interrupt processing will be complete before
a further interrupt occurs, thus avoiding the need to save context on a memory stack.
The use of registers rather than memory to save context also results in reduced latency between the time the interrupt occurred and the appropriate service routine starts
executing.
The example code below does not store the indirect pointer registers (IPA, IPB,
IPC, and Q). These registers do not need to be saved except by interrupt handlers
which either make use of the indirect pointers, use emulated arithmetic instructions,
or use integer multiply or divide. Best performance is achieved by postponing the
saving of these registers to the specific handler routine which expects to use them.
Correspondingly, a handler which uses them is also responsible for restoring them.
        .equ    Kmode,(PD|PI|SM|IM)
;
not_1st:                                ;save on stack
        pushsr  msp,it0,pc0             ;save PC0
        pushsr  msp,it0,pc1             ;save PC1
        pushsr  msp,it0,alu             ;save ALU
        pushsr  msp,it0,cha             ;save CHA
        pushsr  msp,it0,chd             ;save CHD
        pushsr  msp,it0,chc             ;save CHC
        pushsr  msp,it0,ops             ;save OPS
        jmp     dispatch-8
        mtsrim  cps,REMOVE              ;remove Freeze mode
;
intr0:                                  ;save registers
        jmpf    kt11,not_1st            ;test cache in use
        add     kt11,kt11,1             ;level count
;
cache:                                  ;save in cache
        mfsr    kt4,pc0                 ;save PC0
        mfsr    kt5,pc1                 ;save PC1
        mtsrim  cps,REMOVE              ;remove Freeze mode
        mfsr    kt6,alu                 ;save ALU
        mfsr    kt7,cha                 ;save CHA
        mfsr    kt8,chd                 ;save CHD
        mfsr    kt9,chc                 ;save CHC
        mfsr    kt10,ops                ;save OPS
;
        mtsrim  chc,0                   ;clear CHC
        mtsrim  cps,Kmode               ;enable interrupts
dispatch:
;
; Interrupt handler code starts here.
; Dispatch to appropriate handler.
4.3.10 Enabling Interrupts
Interrupts are enabled by clearing the DI and DA bits of the CPS. If an unmasked interrupt, INTR[0..3], is pending at this point (the IP bit of the CPS register is
set to 1), the processor will immediately process the interrupt and execute the handler
at the new vector address.
In the previous code example, when interrupts are enabled, and if an interrupt
occurs, the succeeding register saves will not be performed; however, the recently
invoked interrupt handler will save these registers if it intends to enable interrupts
during its execution. The contents of the Indirect Pointers and Q register will be preserved, or not touched, depending on the nature of the nested interrupt handler.
When clearing the DI and DA bits of the CPS register, the state of the other bits
must be saved. The first example code illustrates this by using an ANDN instruction
to AND the current contents of the register, with a complement bit pattern of the DA
and DI bits in that register (i.e., 1111 1111 1111 1100).
Figure 4-4 shows the interrupt enable logic of the Am29000 processor. Notice
that interrupts generated by the on-chip timer are controlled by the DA bit in the CPS
register. This indicates it is impossible to enable traps for use by ASSERT and other
instructions, without also permitting asynchronous interrupts from the timer to occur
(unless the on-chip timer is not being used). If it is necessary to avoid timer interrupts,
the IE bit in the TMR register can be saved, then cleared to disable timer interrupts.
The interrupt inputs to the Prioritizer logic (as shown in Figure 4-4) are not
latched, and must be continuously asserted by an interrupting external device until
the interrupt has been recognized. Recognition of the interrupt is usually accomplished by executing an instruction that accesses the interrupting device. This removes the interrupt request, which must be done before interrupts are enabled; otherwise, the same interrupt will recur immediately when interrupts are enabled.
The Interrupt Mask (IM) field of the CPS register can be used to disable recognition of interrupt requests on the INTR inputs. The mask bits implement a simplified
interrupt priority scheme that can be set up to recognize only higher-priority interrupts, while another handler is in execution.
The two-bit IM field allows four priority levels to be established. An IM field
value of zero (IM=00) enables only the interrupts occurring at the INTR0 input.
When IM = 01, both INTR0 and INTR1 are enabled; if IM = 10, then INTR0, INTR1,
and INTR2 are enabled; and if IM = 11, then INTR0, INTR1, INTR2, and INTR3 are
enabled.

[Figure 4-4. Am29000 Processor Interrupt Enable Logic. The figure shows the trap sources (Instruction Access Exception, Traps 0–1, Data Access Exception, Internal Traps, Coprocessor Exception) and the on-chip timer (gated by its IE bit) controlled by the DA bit, while the external INTR0–3 requests pass through the IM-controlled prioritizer and are gated by the DI and DA bits.]

The only way to disable the INTR0 input is to set the DI (Disable Interrupts)
bit to 1 in the CPS register.
An example code fragment that sets the IM bits for a handler, according to its
priority, is shown below.
; set interrupt mask according to priority, then enable interrupts
;
        .equ    MYLEVEL,2
        .equ    IM,0b1100

setim:
        mfsr    it0,cps
        andn    it0,it0,(IM|DI|DA)
        or      it0,it0,((MYLEVEL-1)<<2)
        mtsr    cps,it0
In the above example, after the CPS has been moved to a global register, the bits
corresponding to the IM field, the DI bit, and the DA bit are cleared by ANDing them
with a complement mask. Next, the bits defined by the MYLEVEL definition (decreased by 1) are ORed into the proper position in the IM field, and the result is stored
back into the CPS. With the values shown, the IM field is set to the value 01, which
enables interrupts on INTR0 and INTR1.
In the main part of the handler, any Am29000 processor instructions can be
executed; however, because most of the global registers have not been saved, the handler may not have any extra working space. Depending on the number of registers
needed to carry out the handler’s task, a few additional global registers may have to
be saved, then restored.
4.3.11 Restoring Saved Registers
The final act of an interrupt or trap handler, before executing the IRET instruction, is to restore the contents of all saved registers so the complete environment of
the interrupted task is restored before execution is resumed. The proper approach to
restoring the saved registers is to reverse the steps taken to save them.
Any specific handler called by the interrupt dispatcher must restore any additional registers it saved before the generic interrupt return code
is executed. In the case of an external interrupt, it is also important that the specific
handler has cleared the external device causing the interrupt line to be held active.
Otherwise, the processor may be forced into an interrupt handler loop. Because of
internal delays in the processor, the external interrupt must be cleared at least three
cycles before interrupts are enabled. In practice this requirement is easily met.
At this point, interrupts are still enabled. The last portion of the restoration process must run with interrupts disabled, because important processor configuration
data is being reloaded, and an interrupt occurring during this phase could hopelessly
confuse the process.
The final code fragment is shown below:
; code to disable interrupts and complete
; the restoration of registers prior to
; issuing an IRET instruction.
;
        popsr   q,it0,msp               ;restore Q
        popsr   ipc,it0,msp             ; IPC
        popsr   ipb,it0,msp             ; IPB
        popsr   ipa,it0,msp             ; IPA
;
        const   it3,(FZ|DI|DA)
        mfsr    it2,cps                 ;disable interrupts
        or      it2,it2,it3             ;and
        mtsr    cps,it2                 ;set Freeze mode
;
        popsr   ops,it0,msp             ;restore OPS
        popsr   chc,it0,msp             ; CHC
        popsr   chd,it0,msp             ; CHD
        popsr   cha,it0,msp             ; CHA
        popsr   alu,it0,msp             ; ALU
        popsr   pc1,it0,msp             ; PC1
        popsr   pc0,it0,msp             ; PC0
        iret
The interrupt context restore code for the first-level context cache method is
shown below. Restoring the context from registers is much faster than accessing an
external memory stack.
        .equ    DISABLE,(PD|PI|SM|DI|DA)
        .equ    FREEZE,(DISABLE|FZ)
;
        sub     kt11,kt11,1             ;decrement level counter
        jmpf    kt11,not_1st
        mtsrim  cps,FREEZE              ;disable and Freeze
;                                       ;restore from cache
        mtsr    ops,kt10                ;restore OPS
        mtsr    chc,kt9                 ;restore CHC
        mtsr    chd,kt8                 ;restore CHD
        mtsr    cha,kt7                 ;restore CHA
        mtsr    alu,kt6                 ;restore ALU
        mtsr    pc1,kt5                 ;restore PC1
        mtsr    pc0,kt4                 ;restore PC0
        iret
not_1st:                                ;restore from stack
        popsr   ops,it0,msp             ;restore OPS
        popsr   chc,it0,msp             ;restore CHC
        popsr   chd,it0,msp             ;restore CHD
        popsr   cha,it0,msp             ;restore CHA
        popsr   alu,it0,msp             ;restore ALU
        popsr   pc1,it0,msp             ;restore PC1
        popsr   pc0,it0,msp             ;restore PC0
        iret
4.3.12 An Interrupt Queuing model
One approach to solving the latency demands of a high-performance system is
to simply queue interrupts in a linked list when they occur, and process them in a
higher-level context. Figure 4-5 illustrates the structure and linkages of individual
entries in the example queue. This method results in a greater interrupt processing
throughput. Less time is spent executing Freeze mode context stacking and unstacking when compared with the previously described nested interrupt handling method.
In the example program, only a few global registers are allocated—because
placing an entry into a global queue is a simple operation.
The example code in this section applies to handling receive data interrupts
from a UART port, but several types of interrupts can easily share the same queue.
For simplicity, queue entries consist of three words plus an optional data block; a C sketch of this layout follows the list.
Pointer to the next entry in the queue (forward link).
Received data count / active flag.
Pointer to the handler for this entry.
An optional data block.
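The following C structure is a hypothetical view of this layout. The assembly code shown later in this section manipulates the same fields by byte offset: forward link at offset 0, count at offset 4, handler pointer at offset 8, and the data block starting at offset 12.

    /* Hypothetical C view of a queue entry as described above. */
    struct irq_entry {
        struct irq_entry *next;      /* forward link to the next queued entry */
        unsigned long     count;     /* received data count / active flag     */
        void            (*handler)(struct irq_entry *);  /* high-level handler */
        unsigned char     data[256]; /* optional data block (256 bytes here)  */
    };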
Once an I/O operation has begun (in this case, reception of data from a UART),
an interrupt occurs for the UART device and the handler is called to place a new entry
into the global queue.
As each byte arrives, the first section of the handler continues the I/O process,
often by simply reading the data from the UART and indicating that the data has been
accepted. This causes the UART to remove the interrupt input and prepare to receive
new data.
[Figure 4-5. Interrupt Queue Entry Chaining. The head-of-queue pointer (IRQH) points to the first entry and the tail-of-queue pointer (IRQT) to the last; each entry holds a forward link to the next entry, a count/active flag, a pointer to its handler, and an optional data block.]
Only one receive operation for a given interrupt can be in progress at a time. This
allows the queue entry to contain three things: a static entry descriptor that holds a
pointer to the next entry in the queue, the byte count, and a pointer to the high-level
handler function.
The example shown below uses four global (interrupt temporary) registers for
its queue building processes. Because interrupts are disabled during this entire part of
the process, handlers for other interrupts can use these same registers.
After the first byte has been stored in the static buffer, the handler must determine if the queue is empty or if it already contains one or more entries. If empty, the
handler can immediately invoke a routine to process the entry. If the queue contains
one or more entries, the current entry is linked into the queue. The code is shown below.
; UART receive interrupt handler (intr0)
;
        .reg    irqh,gr80               ;queue head pointer
        .reg    irqt,gr81               ;queue tail pointer

        .data
entry:
        .word   0,0,receive             ;entry descriptor
        .block  256                     ; and data block

intr0:
        const   it1,entry+4             ;address of entry
        consth  it1,entry+4
        load    0,0,it0,it1             ;get count
        add     it3,it1,8               ;address of data
        add     it3,it3,it0             ;add count
        add     it0,it0,4               ;increment count
        store   0,0,it0,it1             ;count->entry+4
        const   it2,uart_rcv            ;UART data address
        consth  it2,uart_rcv
        load    0,1,it2,it2             ;get data from UART
        store   0,1,it2,it3             ;save in buffer

        cpeq    it0,it0,4               ;first byte?
        jmpt    it0,startup             ;yes, start daemon
        nop
        iret                            ;no, return

startup:                                ;go daemon if not already running
        cpeq    it2,irqh,0              ;is queue empty
        jmpf    it2,add                 ;no, link this entry
        sub     it1,it1,4               ;point to entry
        jmp     daemon                  ;yes, go daemon
        add     irqh,it1,0              ;init queue header

add:
        store   0,0,it1,irqt            ;tail->entry
        add     irqt,it1,0              ;entry->tail
        iret                            ;return
When interrupts occur for the second and succeeding bytes, they are stored in
the local data block, following the static descriptor entry.
After each byte has been stored, the handler can immediately return because a
routine has been invoked to process the entire queue. In UNIX systems, this routine is
often called a daemon. Once invoked, it continues to process entries until the queue is
empty, at which point it terminates its own execution. The title Dispatcher shall be
used to describe the routine invoked to process queue entries (see Figure 4-6). A
dispatcher routine may operate in User mode; in such a case its operation is very similar to that of a signal handler (described in section 4.4).
[Figure 4-6. An Interrupt Queuing Approach. The main program is interrupted several times; each Freeze mode interrupt handler adds a queue entry, and the Dispatcher processes the queued entries with interrupts enabled before the final IRET returns to the main program.]
The queue processing Dispatcher for this example must run with interrupts enabled; otherwise, new data bytes could not be received from the UART, and other interrupt driven processes could not execute. Before interrupts are enabled, a number
of processor registers must be saved, as indicated earlier. Nine kernel temporary registers are allocated for this purpose (kt3–kt11). Because the Dispatcher is used to process all queued interrupts, it will not be necessary to push these temporary registers
onto the memory stack. The example queue processing code is shown below.
; queue processing Dispatcher
;
        .equ    DISABLE,(PD|PI|SM|DI|DA)
        .equ    Kmode,(PD|PI|SM|IM)
        .equ    FREEZE,(DISABLE|FZ)
;
Dispatcher:
        mfsr    kt3,PC0                 ;save PC0
        mfsr    kt4,PC1                 ;save PC1
        mfsr    kt5,PC2                 ;save PC2
        mfsr    kt6,CHA                 ;save CHA
        mfsr    kt7,CHD                 ;save CHD
        mfsr    kt8,CHC                 ;save CHC
        mfsr    kt9,ALU                 ;save ALU
        mfsr    kt10,OPS                ;save OPS
        mtsrim  CPS,DISABLE             ;remove Freeze mode
        mtsrim  CHC,0                   ;clear CHC
        add     irqt,irqh,0             ;set tail = head

loop:
        mtsrim  CPS,Kmode               ;enable interrupts
        add     kt11,irqh,8             ;point to handler
        load    0,0,kt11,kt11           ;get address
        calli   kt11,kt11               ;call handler
        nop
        mtsrim  CPS,DISABLE             ;disable interrupts

        cpeq    kt11,irqt,irqh          ;queue empty?
        jmpt    kt11,finish             ;yes, wrapup
        nop
        load    0,0,kt11,irqh           ;no, get next entry
        jmp     loop                    ;and loop back
        add     irqh,kt11,0             ;with head<-next

finish:
        mtsrim  cps,FREEZE              ;enable freeze mode
        const   irqh,0                  ;make queue empty
        mtsr    PC0,kt3                 ;restore PC0
        mtsr    PC1,kt4                 ;restore PC1
        mtsr    PC2,kt5                 ;restore PC2
        mtsr    CHA,kt6                 ;restore CHA
        mtsr    CHD,kt7                 ;restore CHD
        mtsr    CHC,kt8                 ;restore CHC
        mtsr    ALU,kt9                 ;restore ALU
        mtsr    OPS,kt10                ;restore OPS
        iret                            ;terminate execution
Note that the example code does not save the Indirect Pointers (IPA–IPC) or the
Q register. If any of the individual high-level handlers will disturb the contents of
these registers, they must also be saved. If high-level handlers are written carefully, it
will not be necessary to save them.
The queue processor is responsible for removing entries from the queue and
calling the handler associated with each entry. In the above example, a pointer to the
high level handler is contained in the third word of the entry descriptor (in this case,
receive).
The handler is called after Freeze mode has been disabled, and interrupts are enabled. When the handler receives control, the IRQH register points to the queue entry.
The high-level handler is responsible for removing the data associated with a
queue entry, and it must do this with interrupts disabled; however, interrupts need
only be disabled while the data is being removed and when the queue entry data count
is reset to zero. Any other portions of the handler not relevant to these tasks can run
with interrupts enabled.
After the handler has disposed of the data, it returns control to the Dispatcher,
which disables interrupts, enables Freeze mode, and attempts to process the next
entry in the queue. If no entries remain, it restores the saved registers from kernel
temporary registers kt3–kt10, and executes an IRET instruction to return control to
the interrupted task.
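The Dispatcher's control flow can be modeled in C roughly as shown below. This is a sketch only: the register saving and restoring, Freeze mode changes, and interrupt masking performed by the assembly routine are reduced to comments, and the irq_entry structure repeats the hypothetical layout sketched earlier in this section.

    struct irq_entry {
        struct irq_entry *next;
        unsigned long     count;
        void            (*handler)(struct irq_entry *);
        unsigned char     data[256];
    };

    extern struct irq_entry *irq_head, *irq_tail;   /* model irqh and irqt */

    void dispatcher(void)
    {
        /* entered from Freeze mode code with interrupts disabled */
        irq_tail = irq_head;                 /* set tail = head              */

        for (;;) {
            struct irq_entry *e = irq_head;

            /* interrupts are enabled while the high-level handler runs     */
            e->handler(e);
            /* interrupts are disabled again before the queue is examined   */

            if (irq_tail == irq_head)        /* queue empty?                 */
                break;
            irq_head = irq_head->next;       /* advance to the next entry    */
        }

        irq_head = 0;                        /* mark the queue empty         */
        /* restore the saved registers, re-enter Freeze mode, and IRET      */
    }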
In cases where a transaction with an external device takes a long time, compared
with the execution time of the high-level handler, the data is moved in chunks.
An execution profile of this process might include the following threads.
[Figure 4-7. Queued Interrupt Execution Flow. The user process is interrupted by Interrupt-1; the Process daemon (Dispatcher) runs High-level Handler-1 and then Handler-2; the handlers are themselves interrupted by further Interrupt-1 and Interrupt-2 events; the user process resumes execution when the daemon returns.]
Interrupt function stores several bytes of data into the data block.
Process Dispatcher executes the high-level handler, which empties the bytes,
zeros the count in the queue entry.
Another handler might execute for another active interrupt task.
Interrupt function creates new queue entry for the next series of received data
bytes.
High-level handler gets called to remove the bytes after the process Dispatcher
has finished with the current queue entry.
Figure 4-7 illustrates this process. The occurrence of Interrupt-1 causes the ongoing User process to be interrupted, and initiates execution of its interrupt handler.
The process builds the first queue entry and initiates execution of the Process Dispatcher. The Dispatcher passes control to High-level Handler-1, which begins execution.
This handler is interrupted by the occurrence of Interrupt-2 and Interrupt-1 events, and continues executing between these interruptions. When Handler-1 completes, it returns control to the Process Dispatcher, which selects the next queue entry and turns
control over to high-level Handler-2.
During this execution, one more Interrupt-1 event occurs, which results in the
creation of another queue entry. This entry is processed when high-level Handler-2
finishes its execution and the Process Dispatcher again receives control.
High-level Handler-1 processes the remaining data and returns control to the
Process Dispatcher which, upon finding no more queue entries, returns to the interrupted user process.
Each execution of the Interrupt-1 or Interrupt-2 interrupt handlers, as well as of the Process Dispatcher and the high-level Handler-1 and Handler-2 code segments, is quite short; however, with this approach individual interrupt priorities are not taken into account. If priority handling of interrupts is important, a different approach is needed. For example, entries could be linked into a single
queue, with their position in the queue determined by their priority. In this case, more
sophisticated queue handling procedures would have to be implemented; however, a
given high-level handler would still execute to completion before another handler is
given control.
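A minimal sketch of such priority-ordered insertion is shown below. The prio field and the pq_entry structure are assumptions added for illustration; they are not part of the three-word entry used in the example code, and the routine must be called with interrupts disabled.

    /* Sketch of priority-ordered insertion into a singly linked queue.
     * Lower prio values are placed nearer the head and run first.
     */
    struct pq_entry {
        struct pq_entry *next;
        int              prio;
        void           (*handler)(struct pq_entry *);
    };

    void pq_insert(struct pq_entry **head, struct pq_entry *e)
    {
        struct pq_entry **pp = head;

        while (*pp != 0 && (*pp)->prio <= e->prio)
            pp = &(*pp)->next;           /* keep equal priorities FIFO */

        e->next = *pp;                   /* splice the new entry in    */
        *pp = e;
    }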
Handling fully nested, priority-oriented interrupts, that is, allowing a higher priority interrupt to preempt the execution of a lower priority handler, requires an interrupt stack (possibly with the support of an interrupt context cache). It is questionable whether the responsiveness of the nested interrupt technique justifies the increased overhead of saving and restoring many sets of registers.
In the approach shown in the previous code only nine global registers are required. These serve for all interrupt handlers in the system. During the execution
of the Freeze-mode interrupt handler only four interrupt temporary registers are used
(it0–it3).
4.3.13 Making Timer Interrupts Synchronous
The 29K on–chip timer can be configured to generate an interrupt when the Timer Counter Register (TCR) decrements to zero; more accurately, when the 24–bit TCV field of the Timer Counter register reaches zero. The TCV field is decremented with each processor cycle; when it reaches zero, it is loaded with the Timer Reload Value (TRV) field of the Timer Reload (TMR) register.
When the Interrupt Enable bit (IE) of the TMR register is set and the TCV reaches zero, the processor will take a timer interrupt unless the DA bit is set in the CPS register. Two–bus and microcontroller members of the 29K family can additionally disable timer interrupts by setting the TD bit in the CPS register. Timer interrupts are not disabled by setting the DI bit in the CPS register. This means timer interrupts can not be simply disabled along with other external asynchronous interrupts by setting DI. Note, the TRAP[1,0] asynchronous inputs are not disabled by setting the DI bit. For this reason, the use of the TRAP[1,0] pins requires complex software support. It is best to avoid the use of these input pins.
It is often desirable to disable timer interrupts during critical code stages, because timer interrupts often lead to such tasks as context switching. However, timer
interrupts may be required to support a real–time clock, and to maintain accuracy, a
timer interrupt can not be missed. The timer interrupt must be taken but processing
the event can be postponed until it can conveniently be dealt with. To do this efficiently,
the Freeze mode interrupt handler for the timer should set register ast to true. This
register is a kernel space support register chosen from the range ks0–ks15
(gr80–gr95). It indicates an Asynchronous Software Trap (AST) is ready for processing. The ast register can be quickly tested with an ASSERT type instruction, as
shown below:
        mfsr    it0,ops                 ;get OPS register, DA already clear
        andn    it0,it0,2               ;clear DI bit
        mtsr    ops,it0                 ;enable interrupts
        asneq   V_AST,ast,0             ;trap if ast != 0, timer ’event’
        iret                            ;otherwise iret
Clearing the DI bit reenables asynchronous interrupts (with the exception that
TRAP[1,0] are already active); but we must check to see if an AST is pending (timer
event). The high level timer processing is performed before the IRET instruction is
executed, via trapware supporting the V_AST trap.
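The pattern can be summarized in C as follows. This is a sketch of the idea only; in the real system the flag lives in the ast kernel support register and the check is performed by the ASSERT instruction shown above, with the V_AST trapware doing the deferred work.

    /* C view of the asynchronous software trap (AST) pattern described
     * above: the Freeze-mode timer handler only records that a timer
     * event occurred; the heavier processing runs later, outside Freeze
     * mode.  'ast_pending' models the ast kernel support register.
     */
    static volatile int ast_pending;

    void timer_freeze_handler(void)   /* minimal Freeze-mode work        */
    {
        ast_pending = 1;              /* note the event, return quickly  */
    }

    void ast_check(void)              /* run before returning to the     */
    {                                 /* interrupted task, traps enabled */
        if (ast_pending) {
            ast_pending = 0;
            /* ... context switching, real-time clock bookkeeping ...    */
        }
    }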
4.4 USER-MODE INTERRUPT HANDLERS
Many present day operating systems allow interrupt handlers to be written in
high-order languages. User mode routines for 29K Processor Family based systems
are no different. When providing this facility, the operating system designer must be
aware of the following concerns.
User mode programs are often written by programmers who lack specific knowledge of the operating system and its allocation of global registers.
The User mode handler, when written in a high-level language, such as C, will
require access to the local register stack, as well as global registers defined for
its management.
A good approach for addressing these concerns is to perform all necessary register saving, with interrupts disabled, while in Supervisor mode; remove the cause of
the interrupt, then enable interrupts and enter User mode to execute the user’s interrupt handler code. This allows interrupt (signal) handlers to be compatible with
AMD’s Host Interface (HIF) v2.0 Specification (see section 2.2), which includes the
definition of operating system services. These services install and invoke user-supplied interrupt handlers for floating-point exceptions and keyboard interrupt
222
Evaluating and Programming the 29K RISC Family
(SIGFPE and SIGINT) events. It also allows the operating system to perform its own
register preservation and restoration processes, without burdening the user with technical operating system details. Complete listings of the code contained in this section
are provided in Appendix B and also by AMD in their library source code. Users who
intend to modify any of this code should bear in mind that the SPILL, FILL, setjmp,
longjmp, and Signal Trampoline code are highly interdependent.
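From the application's point of view, the facility looks like conventional C signal
handling. The fragment below is a usage sketch only, assuming a HIF-conforming C
library that supplies signal(), SIGINT, and SIGFPE; the handler names are invented
for the example.

        /* Sketch: installing User mode handlers for keyboard interrupts and
         * floating-point exceptions through the C library signal() function,
         * which a HIF-conforming library maps onto the SigEntry mechanism
         * described later in this section.
         */
        #include <signal.h>
        #include <stdio.h>

        static volatile sig_atomic_t got_sigint;

        static void on_sigint(int sig)
        {
            got_sigint = 1;             /* record the event; main loop acts on it */
        }

        static void on_sigfpe(int sig)
        {
            printf("floating-point exception (signal %d)\n", sig);
        }

        int main(void)
        {
            signal(SIGINT, on_sigint);  /* keyboard interrupt handler  */
            signal(SIGFPE, on_sigfpe);  /* floating-point trap handler */
            /* ... application code ... */
            return 0;
        }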
The code uses an expanded interface definition that uses a set of global registers
to hold the important local register stack support values emitted by compiler generated code in User mode programs. The registers defined for this environment are
shown in Table 4-2, and were discussed in detail in section 3.3 (page 117).
Table 4-2. Expanded Register Usage

        Names   Registers   Usage Description
        tav     gr121       Trap Argument Vector
        tpc     gr122       Trap Return Pointer
        lrp     gr123       Large Return Pointer
        slp     gr124       Static Link Pointer
        msp     gr125       Memory Stack Pointer
        rab     gr126       Register Allocate Bound
        rfb     gr127       Register Free Bound
In order to prepare for execution of a User mode handler, the HIF specification
indicates that the Supervisor mode portion of the handler must save important registers in the user’s memory stack, as shown in Figure 4-8. In the figure, the stack pointer (msp) is shown decremented by 48 bytes (12 registers times 4 bytes each), and
positioned to point to the saved value of register tav.
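A C structure can serve as a model of this 12-word frame. The declaration below is
only an illustration of the ordering shown in Figure 4-8, listed from the lowest
address (where msp points after the saves) to the highest; the type and field names
are invented here and are not part of the HIF definition.

        /* Illustrative model of the Supervisor-saved signal frame of
         * Figure 4-8.  Fields run from the lowest address, where msp points
         * after the saves, up to the highest; the names are invented.
         */
        struct sig_frame {
            unsigned long tav;      /* saved last                     */
            unsigned long ops;
            unsigned long alu;
            unsigned long chc;      /* channel registers              */
            unsigned long chd;
            unsigned long cha;
            unsigned long pc2;      /* program counters               */
            unsigned long pc1;
            unsigned long pc0;
            unsigned long rab;      /* register allocate bound        */
            unsigned long gr1;      /* register stack pointer         */
            unsigned long signum;   /* signal number, saved first     */
        };                          /* 12 words = 48 bytes, msp -= 48 */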
Other registers may need to be saved to allow complete freedom in executing
29K processor instructions (such as multiply or divide trap routines) in the Usermode handler code. Candidates for saving are the Indirect Pointers (IPA–IPC), the Q
register, the stack frame pointer, fp (lr1), and the local register stack bounds in rfb. In
addition, because high-level languages use many of the global registers as temporaries, these (gr96–gr124) may also have to be saved.
4.4.1 Supervisor mode Code
When an interrupt occurs, the supervisor portion of the interrupt handler is
executed. This code is responsible for saving important processor registers, as shown
in Figure 4-8. The assembler macro instructions, used earlier in this chapter (push,
pop, pushsr and popsr), and described in detail in section 3.3.1 (page 119), are used in
the following code examples to aid in pushing special registers onto the memory
stack.
Figure 4-8. Saved Registers
(The user's memory stack, with higher addresses at the top of the figure and lower
addresses at the bottom: Signal Number, gr1, rab, PC0, PC1, PC2, CHA, CHD, CHC,
ALU, OPS, tav. Register msp points to the last register saved, tav, by the Supervisor
mode portion of the handler when control is given to the User mode code.)
The code to save the registers is executed in Supervisor mode, with Freeze mode
enabled, as indicated in prior section 4.3.7. This ensures that a higher priority interrupt does not disrupt this critical section of code. The code is shown below.
; supervisor portion of interrupt handler
;
sigint:
        jmp     interrupt
        const   it0,2           ;SIGINT
;
sigfpe:
        const   it0,8           ;SIGFPE
;
interrupt:
        sub     msp,msp,4
        store   0,0,it0,msp     ;save signal number
        sub     msp,msp,4
        store   0,0,gr1,msp     ;push gr1
        sub     msp,msp,4
        store   0,0,rab,msp     ;push rab
        const   it0,512
        sub     rab,rfb,it0     ;set rab = rfb-512
;
        pushsr  msp,it0,PC0
        pushsr  msp,it0,PC1
        pushsr  msp,it0,PC2
        pushsr  msp,it0,CHA
        pushsr  msp,it0,CHD
        pushsr  msp,it0,CHC
        pushsr  msp,it0,ALU
        pushsr  msp,it0,OPS
;
        sub     msp,msp,4
        store   0,0,tav,msp     ;push tav
At this point in the code, with all of the critical registers saved, the memory stack
will appear as shown in Figure 4-8. When the User mode interrupt handler is complete, these registers will be restored.
Special provisions were made in the code above in anticipation of the following
situation: If a FILL operation is interrupted, and the trampoline code has not yet
realigned the rab register to rfb-WindowSize, another interrupt occurring at that
point could again activate the trampoline code. This interrupt could cause the trampoline code to assume that a FILL operation was in progress, thereby causing it to
“reposition” the value in PC1 to recommence the (assumed) FILL operation.
; Now come off freeze, and go to user-mode code.
;
trampoline:
        mtsrim  chc,0           ;ensure load/store
                                ; does not restart
        const   it1,RegSigHand
        consth  it1,RegSigHand
        load    0,0,it1,it1
        cpeq    it0,it1,0
        jmpt    it0,SigDfl      ;jump if no handler(s)
        add     it0,it1,4
        mtsr    pc1,it1
        mtsr    pc0,it0
        iret
Two types of interrupts are handled by this code: keyboard interrupts and floating-point exceptions. It is assumed that the interrupt vectors were previously set to
vector to either sigint or sigfpe, depending on the type of interrupt. Interrupt temporary (it0 ) is used to contain the type of interrupt (signal), when entering the common
code at label interrupt.
Once the memory stack is set up as indicated, the User mode portion of the handler (beginning at label sigcode) is placed into execution by loading Program Counters (PC0 and PC1) with the address of the handler. Then while still in Freeze mode
with interrupts disabled, an IRET instruction is executed to begin execution of the
handler.
The HIF specification indicates that User mode signal handlers must call one of
the specified signal return services to return control to the user’s code at the appropriate point. When one of these services (sigret, sigrep, sigdfl, or sigskp) is called via an
ASSERT instruction, msp will point to the same location shown in Figure 4-8, so the
supervisor portion of the handler can properly restore the interrupted task’s environment.
Chapter 4
Interrupts and Traps
225
The following code fragment illustrates how one of the return services restores
all of the registers. It is invoked by the HIF Service Trap (69) with interrupts disabled
and Freeze mode enabled—as is the case with any interrupt or trap.
; Signal return service, restore registers
;
sigret:                         ; assume msp points to tav
        load    0,0,tav,msp     ;restore tav
        add     msp,msp,4
;
        popsr   OPS,it0,msp     ;pop specials
        popsr   ALU,it0,msp
        popsr   CHC,it0,msp
        popsr   CHD,it0,msp
        popsr   CHA,it0,msp
        popsr   PC2,it0,msp
        popsr   PC1,it0,msp
        popsr   PC0,it0,msp
        load    0,0,rab,msp     ;pop rab
        add     msp,msp,4
        load    0,0,it0,msp     ;pop rsp
        add     gr1,it0,0
        add     msp,msp,8       ;discount signal number
        iret
As indicated in the HIF Specification, User mode interrupt handlers must save a
number of additional registers, to prepare for executing high-level language code.
The following section discusses some of the necessary preparations.
4.4.2 Register Stack Operation
The 29K Processor Family contains 128 general registers that can be configured
as a register stack. In this case, global register (gr1) is used to point to the first register
in this group that belongs to the current process. This first register is addressed as lr0
(local register 0).
Several additional global registers provide other information describing the
register stack bounds. These are all shown in Figure 4-9, which illustrates the implementation of the local register file as a shadow copy of the memory-based register
stack cache.
The rab, rsp, lr1, and rfb registers (shown in Figure 4-9) contain the bounds of
the current register stack cache in the form of addresses.
The rsp (register stack pointer) shown in Figure 4-9 is assigned to global register gr1,
whose low-order 9 bits (bits 0 and 1 are not used for register addressing) are
used to address the local register file whenever a local register number is encountered
in a 29K processor instruction. Therefore, local register lr2 is actually referenced by
the CPU by adding 8 (2 words) to the value held in register gr1.
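This address calculation can be modeled with a few lines of C. The function below is
a sketch of the mapping only, not code that the processor or any library executes;
the function name is invented for the example.

        /* Sketch of how a local register number (lr0..lr127) selects an
         * entry in the 128-word local register file using gr1.  Bits 0
         * and 1 are ignored; bits 8..2 of the resulting address select
         * the register-file entry.
         */
        unsigned local_register_index(unsigned gr1, unsigned lrn)
        {
            unsigned addr = gr1 + 4 * lrn;  /* e.g. lr2 -> gr1 + 8        */
            return (addr >> 2) & 0x7f;      /* low-order bits select the  */
        }                                   /* entry in the register file */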
Other important details of the register stack and local register file are discussed
in Chapter 5 (Operating System Issues).
Figure 4-9. Register and Stack Cache
(The local register file acts as a cache of the memory-resident register stack.
Register rfb (gr127, register free bound) marks the top of the cached region, with
spilled activation records and other stack entries above it in memory. The current
activation record, from lr0 at gr1 (rsp, the register stack pointer) up to the frame
pointer fp held in lr1, occupies local registers. The region between rab (gr126,
register allocate bound) and gr1 is unused.)
The important concern in writing interrupt handlers that use local registers is
that the local register file bounds and contents at the time of an interrupt reflect the
current state of the interrupted program.
For example, looking at Figure 4-9, when an application calls a function, the activation record for the new function is allocated immediately below the current rsp;
occupying part of the register file whose corresponding section is indicated as “unused.” If the new activation record is larger than the currently unused space (i.e., rsp
is decremented to point below the current value in the rab register), the stack is said to
have overflowed. When this overflow occurs, some of the existing registers in the
local register file must be “spilled” to make room for the new activation record. The
number of registers involved in the “spill” must be sufficient to allow the entire new
activation record to be wholly contained in the local register stack.
A similar situation occurs when a called function is about to return to its caller
and the entire activation record of the caller is not currently contained in the local
register file. In this case, the portion of the caller’s activation record not located in the
register file must be “filled” from the memory stack cache. Management of the local
register file requires the use of User mode functions that perform SPILL and FILL
operations, in concert with a Supervisor mode trap handler when the SPILL or FILL
operation is needed.
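The two trigger conditions can be summarized in C as follows. This is a sketch only,
with the support registers passed in as plain variables; the function names are
invented for the example.

        /* Sketch of the register-stack bound checks (addresses grow upward). */
        typedef unsigned long addr_t;

        /* On procedure entry, after lowering the stack pointer:              */
        int spill_required(addr_t rsp /* gr1 */, addr_t rab)
        {
            return rsp < rab;       /* record does not fit: SPILL trap        */
        }

        /* On procedure return, before using the caller's registers:          */
        int fill_required(addr_t lr1 /* frame pointer */, addr_t rfb)
        {
            return lr1 > rfb;       /* caller's record not resident: FILL     */
        }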
4.4.3 SPILL and FILL Trampoline
High-level language compilers automatically generate code that tests for a required SPILL upon entry to a called function, and for a required FILL operation just
before a called function exits. In either case, the SPILL or FILL is initiated by an
ASSERT instruction whose assertion fails. This causes the SPILL or FILL trap handler to begin its execution in Supervisor mode with special registers frozen.
The Supervisor mode code must initiate execution of the appropriate handler by
leaving Supervisor mode, so that the processing is done in User mode. Several benefits are
obtained from operating the SPILL or FILL handlers in User mode. First, the overhead
of explicitly leaving Freeze mode is avoided; the handlers must leave Freeze mode because
they require the use of load- and store-multiple instructions. Additionally, FILL and
SPILL handlers may require several machine cycles to complete; if they were to operate
with DA set, a potential interrupt latency problem would result.
The following entry points, SpillTrap and FillTrap, are directly invoked by
their corresponding hardware vectors when the associated ASSERT instruction
is executed. The operands SpillAddrReg and FillAddrReg are aliased to kernel static registers (two of ks0–ks15), which hold the addresses of the User mode SPILL and
FILL handlers.
Because the processor’s execution jumps from Supervisor mode to User mode
in this fashion, the SpillTrap and FillTrap code is called a trampoline. The SpillTrap and FillTrap trampoline code is shown below.
SpillTrap:
        ;
        ; Preserve the return address in the
        ; designated register
        mfsr    tpc,PC1
        ;
        ; Fixup PC0 and PC1 to point at the user
        ; designated spill handler
        mtsr    PC1,SpillAddrReg
        add     tav,SpillAddrReg,4
        mtsr    PC0,tav
        ;
        ; And return to that handler
        iret
FillTrap:
        ;
        ; Preserve the return address in the
        ; designated register
        mfsr    tpc,PC1
        ;
        ; Fixup PC0 and PC1 to point at the user
        ; designated fill handler
        mtsr    PC1,FillAddrReg
        add     tav,FillAddrReg,4
        mtsr    PC0,tav
        ;
        ; And return to that handler
        iret
The SpillTrap and FillTrap routines both turn control over to the User mode
sections of their respective handlers by modifying the addresses held in the processor’s frozen PC0 and PC1 registers. This happens after the current address in PC1 has
been temporarily saved in register tpc (gr122).
When the IRET instruction is executed, the processor reenters User mode, with
the same interrupt enable state as when the trap occurred, and begins execution at the
address loaded into PC1.
4.4.4 SPILL Handler
The FILL and SPILL handlers are executed in User mode to ensure the greatest
processor performance for these operations. The handlers are invoked by the Supervisor mode trap handler, usually with interrupts enabled. This permits SPILL and
FILL operations to be interrupted, and to use load- and store-multiple operations to
accomplish their task.
An example User mode SPILL handler is shown below.
;
; spill handler
; spill registers from (*gr1-*rab)
; and move rab down to where gr1 points.
;
; On entry:     rfb - rab = windowsize,
;               gr1 < rab.
; Near the end: rfb - rab > windowsize,
;               gr1 == rab
; On exit:      rfb - rab = windowsize,
;               gr1 == rab
;
        .global spill_handler
spill_handler:
        sub     tav,rab,gr1     ;bytes to spill
        srl     tav,tav,2       ;bytes to words
        sub     tav,tav,1       ;make zero based
        mtsr    CR,tav          ;set CR register
        sub     tav,rab,gr1
        sub     tav,rfb,tav     ;dec. rfb by tav
        add     rab,gr1,0       ;copy rsp into rab
        storem  0,0,lr0,tav     ;store lr0..lr(tav)
        jmpi    tpc             ;return...
        add     rfb,tav,0
In the above code, the condition for entry is that global register gr1 (rsp) has
already been decremented to a value less than the current value in rab. This lower
value is what signals the need to spill some registers. The order in which the management registers are changed by the SPILL handler is very important, particularly if an
interrupt were to occur during the SPILL operation. In this case, register rab must be
changed before rfb.
The value in register rab is maintained for convenience and performance; it is a
cache of the rfb-WindowSize value. The rfb register is the anchor point at which the
local register file (cache) and the memory-resident portion of the register stack cross over.
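Expressed in C, the handler's arithmetic looks roughly as follows. The sketch below
merely mirrors the assembly above, modeling the memory-resident register stack as an
array and letting memcpy stand in for the CR/STOREM pair; the names and the
array-based memory model are inventions of this example.

        /* C model of the SPILL handler above. */
        #include <string.h>

        struct rstack_state {
            unsigned long gr1, rab, rfb;    /* byte addresses, as in the text */
        };

        void spill_model(struct rstack_state *s,
                         const unsigned long *lr,   /* local register file    */
                         unsigned long *mem)        /* register stack memory  */
        {
            unsigned long bytes = s->rab - s->gr1;  /* bytes to spill         */
            unsigned long words = bytes / 4;
            unsigned long dest  = s->rfb - bytes;   /* new rfb after spill    */

            s->rab = s->gr1;                        /* rab lowered first ...  */
            memcpy(&mem[dest / 4], lr, words * 4);  /* ... STOREM lr0..lr(n)  */
            s->rfb = dest;                          /* rfb lowered last       */
        }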
4.4.5 FILL Handler
The FILL handler is similar to the SPILL handler, except that bytes are moved
from the memory stack to the local register file. This handler is initiated when the
value in lr1 is larger than the current value in the rfb register.
;
; fill handler
; fill registers from [*rfb..*lr1)
; and move rfb up to where lr1 points.
;
; On entry:     rfb - rab = windowsize,
;               lr1 > rfb
; Near the end: rfb - rab < windowsize,
;               lr1 == rab + windowsize
; On exit:      rfb - rab = windowsize,
;               lr1 == rfb
;
        .global fill_handler
fill_handler:
        const   tav,(0x80<<2)   ;tav=[rfb]<<2
        or      tav,tav,rfb     ;ipa = [rfb]<<2
        mtsr    IPA,tav
        sub     tav,lr1,rfb     ;tav = byte count
        add     rab,rab,tav     ;push up rab
        srl     tav,tav,2       ;word count
        sub     tav,tav,1       ;zero based
        mtsr    CR,tav          ;set CR register
        loadm   0,0,gr0,rfb     ;load registers
        jmpi    tpc             ;return...
        add     rfb,lr1,0       ;...pushing up rfb
In the case of a fill condition, the rfb register must be changed only after the
FILL operation is complete; however, the rab register is modified prior to execution
of the LOADM instruction. That is, the anchor point indicated by register rfb must be
updated only after the data transfer has been accomplished.
4.4.6 Register File Inconsistencies
The discussion of SPILL and FILL User mode handlers is important when
writing interrupt routines because a SPILL or FILL may be incomplete at the time
the interrupt occurs. Depending on whether a SPILL or FILL is in progress, the interrupt handler must prepare the register stack support registers before attempting to
pass control to a User mode handler that makes use of the local register file.
Figure 4-10 illustrates a global view of the register stack, as it might appear both
in the local registers and in the memory stack cache at the time of an interrupt. In this
case, the interrupt occurred during execution of a SPILL operation, probably during
execution of the STOREM instruction. Therefore, the address in register gr1 has already
been decremented in anticipation that the proposed activation record will fit in
the local registers. In addition, because a SPILL operation was necessary, the rab
register has also been set equal to gr1 in the SPILL handler.

Figure 4-10. Stack Upon Interrupt
(Spilled activation records and other stack entries lie in the memory stack cache
above rfb; a setjmp marker refers to an activation record at that point in the stack
cache. The caller's activation record, bounded by the frame pointer fp in lr1, is in
local registers, while gr1 (rsp) and rab have already been lowered to make room for
the proposed activation record.)
The interrupt handler must recognize this condition because it must prepare the
register stack for entry into a C language user interrupt function. This will require the
stack management registers to be consistent. Repairing stack inconsistencies depends on the interrupt handler being able to recognize each unique situation where
such an inconsistency could occur. In the case of the C language environment, there
are three situations that must be detected.
The interrupt occurred when a SPILL was in progress, in which case the distance between the values in the rfb and rab registers exceeds the size of the local
register file (referred to as the WindowSize).
The interrupt occurred when a FILL operation was in progress, in which case
the distance between the values in the rfb and rab registers is less than the size of
the local register file.
The interrupt occurred during a far-longjmp operation (see Figure 4-12a), in
which case the value (gr1 + 8) — which is the address of local register lr2 on the
register memory stack — is greater than the value in the rfb register.
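Before turning to the assembly code, the three tests can be restated compactly in C.
The function below is a sketch only; the saved rab value and the window size are
passed in as plain parameters, and the names are invented for the example.

        /* Sketch of the three consistency tests, mirroring the assembly
         * that follows.  'rab_saved' is the rab value saved in the signal
         * frame and 'window' is the local register file size in bytes
         * (normally 512).
         */
        enum stack_state { STACK_NORMAL, STACK_SPILL, STACK_FILL, STACK_LONGJMP };

        enum stack_state classify(unsigned long gr1, unsigned long rfb,
                                  unsigned long rab_saved, unsigned long window)
        {
            if (rfb - rab_saved > window)   /* SPILL was interrupted        */
                return STACK_SPILL;
            if (rfb - rab_saved < window)   /* FILL was interrupted         */
                return STACK_FILL;
            if (gr1 + 8 > rfb)              /* far longjmp was interrupted  */
                return STACK_LONGJMP;
            return STACK_NORMAL;
        }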
The following code fragment illustrates a method of recognizing these inconsistent stack conditions.
The Supervisor mode portion of the interrupt handler has saved the important
processor registers as shown in Figure 4-8. Because the User mode portion of the
handler is intended to execute a C language function, additional registers will need to
be saved. The register stack support registers, indirect pointers (IPA–IPC), as well as
global registers (gr96–gr124) are pushed onto the memory stack just below the signal context frame.
sigcode:
        push    msp,lr1         ;push R-stack
        push    msp,rfb         ; support
        push    msp,msp         ;M-stack support
        sub     msp,msp,3*4     ;Floating Point
;
        pushsr  msp,tav,IPA     ;User mode specials
        pushsr  msp,tav,IPB
        pushsr  msp,tav,IPC
        pushsr  msp,tav,Q
;
        sub     msp,msp,29*4
        mtsrim  cr,29-1
        storem  0,0,gr96,msp    ;push gr96-gr124
Additional space on the memory stack is allocated for floating point registers. If
the C language signal handler is to make use of floating point resources, then the
necessary critical support registers should be saved. Further discussion of these, and an
explanation of the format of the saved context information, can be found in Chapter 5
(Operating System Issues). After the additional context status has been saved, the
register stack condition can then be examined.
;Recognize inconsistent stack conditions
        const   gr96,WindowSize ;get cache size
        consth  gr96,WindowSize
        load    0,0,gr96,gr96
        add     gr98,msp,SIGCTX_RAB
        load    0,0,gr98,gr98   ;interrupted rab
        sub     gr97,rfb,gr98   ;rfb-rab <= WS
        cpgeu   gr97,gr97,gr96  ;jmp if spill
        jmpt    gr97,nfill      ;or normal stack
        add     gr97,gr1,8
        cpgtu   gr97,gr97,rfb   ;longjmp test
        jmpt    gr97,nfill      ;yes, longjmp case
        nop                     ;jmp if gr1+8 > rfb
;
;Fixup registers to re-start FILL operation
ifill:
        add     gr96,msp,SIGCTX_RAB+4
        push    gr96,rab        ;resave rab=rfb-512
        const   gr98,fill_handler+4
        consth  gr98,fill_handler+4
        push    gr96,gr98       ;resave PC0
        sub     gr98,gr98,4
        push    gr96,gr98       ;resave PC1
        const   gr98,0
        sub     gr96,gr96,3*4   ;point to CHC
        push    gr96,gr98       ;resave CHC=0
The variable WindowSize is initialized to the size of the local register stack, in
bytes, when the library signal function is first called. Referring to Figure 4-10, and to
the code fragment shown above, it is clear that the result of subtracting the saved rab
from rfb will be larger than the local register stack size. Therefore, the program will
handle the spill (and normal interrupt) cases by jumping to label nfill. The longjmp
case, once detected, is also sent to the nfill label, where the code discriminates between the conditions.
;discriminate between SPILL, longjmp and
; normal interrupts
nfill:
        cpgtu   gr96,gr1,rfb    ;if gr1 > rfb
        jmpt    gr96,lower      ;then gr1 = rfb
        cpltu   gr96,gr1,rab    ;if gr1 < rab
        jmpt    gr96,raise      ;then gr1 = rab
        nop
sendsig:
In the interrupted FILL case, the saved rab value is over-written with the realigned rab value. The send-signal code (section 4.4.1) subtracted the WindowSize
from the value in rfb to determine the aligned rab value. This was done before issuing
an IRET to sigcode.
Essentially, this restores rab to where it pointed immediately before executing
the function call that caused the FILL operation. Note that this recomputation is also
valid for a normal case, where the management registers are consistent.
The two comparisons at the nfill label determine which method, if any, should be
used to repair the value in register gr1. The method depends on whether a longjmp,
SPILL, or normal interrupt occurred. This is required to align gr1 to a valid cache
position when a longjmp or SPILL is interrupted. The following code fragment
shows the code associated with the lower and raise labels.
;lower or raise value in gr1
lower:
        jmp     sendsig
        add     gr1,rfb,0       ;set gr1 = rfb
raise:
        jmp     sendsig
        add     gr1,rab,0       ;set gr1 = rab
According to the situation depicted in Figure 4-10, when a SPILL operation is
interrupted, code at the raise label is executed; however, the code resumes at the label
sendsig.
The code fragment titled “fix–up registers to restart FILL operation”, shown
above, is entered if the interrupt occurred during a FILL operation. If so, it is necessary to change the saved values for the Program Counters, PC0 and PC1, and clear the
value saved in the CHC register. These registers are assumed to have been saved in
the order shown in Figure 4-8. This is required in addition to realigning the register
stack support register, rab.
The identifiers SIGCTX_RAB and SIGCTX_SIG are defined as numeric offsets (to
be added) to the memory stack address held in register msp. Making these changes
will effectively restart the FILL operation from its beginning. This code then falls
through into the code beginning at label nfill, but in the case of an interrupted
FILL operation, the value in register gr1 will not be adjusted.
4.4.7 Preparing the C Environment
After stack repairs have been made to the (possibly inconsistent) management
registers, it is necessary to prepare for C language interrupt handler code execution.
These preparations consist mainly of setting up a new stack frame from which the
user’s handler can execute. At this point in the process, the register stack and memory
cache appear as shown in Figure 4-11.
Figure 4-11. Stack After Fix–up
(After repair, rfb (register free bound) again bounds the cached region, with spilled
activation records and other stack entries above it in the memory stack cache. The
current activation record is held in local registers, gr1 (the register stack pointer)
equals rab (rsp, rab), and the space below it is available for the proposed handler
activation record.)

The following code fragment picks up at the label sendsig, which is repeated for
clarity. The handler is almost ready to pass control to the user's C language handler
code, but first it must set up a stack frame that looks as though the user's function was
called in a normal fashion (rather than being invoked as part of an interrupt handler).
This is accomplished in the same way a normal C language function allocates its
stack frame upon entry.
; Create an activation record on the stack
; for our handler, so the user code will
; operate as though it has been “called”
;
        .equ    RALLOC,4*4      ;space for function
sendsig:
        sub     gr1,gr1,RALLOC
        asgeu   V_SPILL,gr1,rab
        add     lr1,rfb,0       ;set lr1 = rfb
        add     gr97,msp,SIGCTX_SIG
        load    0,0,lr2,gr97    ;restore sig number
        sub     gr97,lr2,1      ;get handler index
        sll     gr97,gr97,2     ;point to addresses
;Handler must not use HIF services other
; than the _sigret() type.
        const   gr96,SigEntry
        consth  gr96,SigEntry
        add     gr96,gr96,gr97
        load    0,0,gr96,gr96   ;registered handler
        cpeq    gr97,gr96,0
        jmpt    gr97,NoHandler
        nop
        calli   lr0,gr96        ;call C-level
        nop                     ;signal handler
NoHandler:
        jmp     __sigdfl
        nop
The user function called by the above code is assumed to be one that has been
passed to the signal library function to process either SIGINT or SIGFPE interrupts,
or both. The SigEntry label in the above code refers to a table of pointers. In the example, one contains the address of a user signal handler for keyboard interrupts (SIGINT) and the other points to the handler for floating-point exceptions (SIGFPE). A
pointer to the user handler for each of these is installed in the SigEntry table by the
signal library function.
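The table and the installation step can be modeled very simply in C. The fragment
below is an illustration of the mechanism, not the actual library source; the table
size, the function name, and the (signal number minus one) indexing merely follow
the assembly examples above.

        /* Illustrative model of the SigEntry table used by the sendsig code:
         * one pointer per signal number, indexed by (sig - 1).  The signal
         * numbering follows the earlier examples (SIGINT = 2, SIGFPE = 8);
         * this is not library source.
         */
        typedef void (*sig_handler_t)(int);

        #define MAX_SIG 8
        static sig_handler_t SigEntry[MAX_SIG];

        sig_handler_t install_handler(int sig, sig_handler_t fn)
        {
            sig_handler_t old = SigEntry[sig - 1];
            SigEntry[sig - 1] = fn;     /* sendsig loads SigEntry[lr2 - 1]  */
            return old;                 /* and calls it via calli lr0,gr96  */
        }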
4.4.8 Handling Setjmp and Longjmp
Although not strictly related to interrupt handling, many C language libraries
contain a setjmp routine used to record the values of the register and memory stack
support registers, and an additional longjmp routine that allows a program to jump to
a consistent environment saved by a previous call to the setjmp routine.
Figure 4-12 illustrates the location in the stack and memory cache to which the
saved information from a previously executed setjmp call might refer. The saved
information (stored in a special record specified in the call to setjmp) contains the
values of gr1, msp, lr0, and lr1 as they appear when the call to setjmp was made.
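As a reminder of what the support code has to cope with, conventional setjmp and
longjmp usage is shown below; the error-recovery scenario is, of course, only an
example.

        /* A conventional setjmp/longjmp example: the jmp_buf records the
         * stack support values described above, and longjmp() unwinds back
         * to the environment saved by setjmp().
         */
        #include <setjmp.h>
        #include <stdio.h>

        static jmp_buf recover;

        static void deep_in_the_call_chain(void)
        {
            /* ... some failure detected ... */
            longjmp(recover, 1);                /* unwind to the setjmp() below */
        }

        int main(void)
        {
            if (setjmp(recover) == 0) {
                deep_in_the_call_chain();       /* normal path                  */
            } else {
                printf("recovered via longjmp\n");  /* unwound path             */
            }
            return 0;
        }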
Interrupt handler code must make provisions for a User mode handler to call the
longjmp function from within the code. During the course of executing the longjmp,
the values stored in the marker record are loaded into their respective registers. The
processor is executing in User mode, with interrupts enabled, so this process might be
interrupted at any point. The interrupt handler code that recognizes stack inconsistencies (presented earlier) handles this case by fixing up the management registers, to
establish a consistent stack. When the interrupt handler returns, the longjmp will be
properly completed.
Not all User mode signal handlers will have to contend with the complexities
introduced by setjmp and longjmp function calls. In this case, the code presented
earlier can be somewhat simplified; however, because the amount of code devoted to
this potential situation is very small, it is recommended that users provide the additional checks and compensating code.
Figure 4-12. Long–Jump to Setjmp
((a) Long–jump to a far setjmp: the activation record recorded at setjmp() lies in
external memory, so after the longjmp() the cache must be filled in from memory;
rfb is set to the recorded lr1 and gr1 to the recorded gr1.
(b) Long–jump to a near setjmp: the recorded activation record is already in the
cache, so no fill is needed.)
Chapter 5
Operating System Issues
Because application programs make use of operating system services, the
overhead costs associated with typically requested services are of great interest. With
the performance levels offered by the best RISC implementations, these overhead
costs have become very low. However, the often increased complexity of RISC
operating systems has led to some confusion about the efficiency of operating
system implementations.
This chapter discusses in detail the various forms of context switching which
occur between operating system and application code. This particular task is one of
the more complex functions supported by a typical operating system. Also discussed
are general issues related to context switching. The large number of registers
available to application programs may initially suggest that the 29K is not ideal at
performing application context switching. However, there are a number of
optimizations which, when applied, greatly reduce context save and restore times
[Mann 1992a].
The code examples shown make use of a number of macros for pushing and
popping special registers to an external memory stack. These macros were presented
in section 3.3.1, Assembly Programming.
Within this chapter, context information will be frequently stored and reloaded
from a per–task data region known as the Process Control Block (PCB). An operating
system register in the range of ks1–ks12 is assumed to point within the PCB stack.
The example code assumes that the relevant register known as pcb has already been
assigned the correct memory address value by operating system specific code. The
example code also uses constants of the form CTX_CHC. These are offsets from the
top of the PCB stack (lower address) to the relevant address containing the desired
register information (the CHC register in the example). When a memory stack is used
to save the context in place of the PCB data structure, the CTX_ offset constants may
still be used.
5.1
REGISTER CONTEXT
Part of the increased performance of the 29K family comes from using 128
internal registers as a register stack cache. The cache holds the top of the run–time
stack. Each procedure obtains its necessary register allocation by claiming a region
of the register stack. The register cache does not have to be flushed (spilled) until
there is insufficient unallocated register space. This happens infrequently. The
register stack offers greater performance benefits over a data memory cache, due to
register cache triple porting on–chip (two read ports and one write port). Note, the
Am29050 has an additional write port which can be used to simultaneously
write–back a result from the floating–point unit. Chapter 2 explains in detail the
procedure calling mechanism’s use of the cache.
However, when a context switch is required from one user task to another user
task, it is necessary to copy all internal registers currently allocated to the current user
task to a data memory save region. This makes the registers available for use by the
in–coming task.
In performing a context switch, a clear understanding is required of processor
register usage. The AMD C Language register usage convention (see section 2.1)
makes 33 of the 65 global registers (gr1, gr96–gr127) available for User data storage.
Global registers gr128–gr255, used to implement the local register stack, are also
used by the compiler generated code. (See section 3.3 (page 117) of Chapter 3,
Assembly Language Programming, for global register assignment.)
Processor global registers gr64–gr95 are not accessed by C generated code.
These registers are normally used by the Supervisor to store operating system
information or implement interrupt handler temporary working space. Particular
Supervisor implementations may store data in registers gr64–gr95. This data is
relevant to the task currently executing, and includes such information as pointers to
memory resident data structures containing system support information. This data
may also have to be copied–out to memory when a task switch is required.
The C procedure calling convention specifies that global registers gr96–gr111
are used for return value passing. For a procedure returning a 32–bit integer, only
register gr96 is required to store return value information. The compiler generally
uses these global registers for temporary working space before the return value data
is determined. The compiler has nine more temporary registers in the gr112–gr127
range which can also be used for temporary data storage. Other registers in this range
are used to implement register stack support functions.
When more registers are required by a procedure for data storage, the local
register stack can be used. This reduces the need to use external data memory to store
procedure data.
The prologue of each procedure lowers the register stack pointer (gr1) by the
amount necessary to allocate space for a procedure’s in–coming and out–going
parameters. The prologue code is generated by the compiler, and can thus lower the
stack pointer by an additional amount to make temporary registers available to the
procedure. The compiler is more likely to do this when the “–O” optimization switch
is used and the procedure has an unusually large register requirement.
Each 29K processor reserves global register gr1 to implement a register stack
pointer, which points to the base of the current procedure register allocation
(activation record) (see Figure 5-1). Register gr1 points to the first local register
allocated to the procedure, known as lr0. Local register lr1, located in the register
cache at location [gr1]+4, is the second local register available to the procedure. The
C calling convention rules state that this register is reserved for pointing to the top of
the procedure activation record. The lr1 register, known as the frame pointer, points
to the first register above the register group allocated to the current procedure (see
Figure 5-2). The frame pointer is used during register stack filling (cache filling)
when it must be determined if the registers allocated to the current procedure are
located in the register stack and not spilled–out (flushed out) to external data
memory.

Figure 5-1. A Consistent Register Stack Cache
(The local register file caches the top of the memory-resident register stack.
Register rfb (register free bound) marks the top of the cache; spilled activation
records lie above it at higher memory addresses. The current activation records
occupy local registers from lr0, at the register stack pointer gr1 (rsp), up to rfb.
The region between rab (register allocate bound) and gr1 is unused. The register
stack grows down toward lower memory addresses.)
Figure 5-2. Current Procedure's Activation Record
(An example seven-word activation record held in local registers: gr1, lowered by
four words on entry, points to lr0 at the base of the record. The record contains
out-going parameter, frame pointer, return address, local, and in-coming parameter
words, with the frame pointer in lr1 marking the top of the activation record at
higher addresses.)
A leaf procedure is defined as one that does not call any other procedure.
Because leaf procedures have no out–going parameters (data passed to called
functions), they do not have to lower the register stack pointer and create an
activation record. It is likely they have sufficient temporary working space in the 25
global registers available to each procedure. Of course, when one procedure calls
another it must assume the called procedure will use available global registers, and
thus store critical data to local register locations or external data memory. However, a
particularly large leaf procedure may allocate an activation record to gain access to
additional local register storage. Leaf procedures which do this obtain a new lr1
register that need not be used to point to the top of the activation record (because
leaves do not call other procedures). In this case, a leaf procedure is free to use local
register lr1 as additional temporary storage.
It is interesting to note that a performance gain is achieved by some C compilers by
breaking a previously listed rule. That is, a calling procedure need not always assume
the called procedure will use all 25 global registers. If the called procedure is defined
before calls are made to it, the compiler can determine its register usage. This enables
the compiler to issue code to save only the global registers affected by the callee,
rather than preserve all global registers which are in use at the time of the call.
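The following C sketch illustrates the situation; the function names are invented,
and whether the optimization is applied depends on the compiler and the options used.

        /* Sketch: because the callee is defined before its caller in the
         * same file, a compiler performing this optimization can see which
         * global temporaries scale() actually uses and preserve only those
         * across the call, rather than all 25.
         */
        static int scale(int x)
        {
            return x * 3;               /* uses very few temporaries     */
        }

        int accumulate(int y)
        {
            int t = y + 1;              /* may be kept in a global       */
            return scale(t) + t;        /* temporary across the call     */
        }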
5.2
SYNCHRONOUS CONTEXT SWITCH
The discussion in the Register Context section is not a complete introduction to
the register stack mechanism required to support C procedures executing on a 29K
processor (see Chapter 2). However, the information is required to understand the
process of a synchronous context switch. In a synchronous context switch, the
currently executing user task voluntarily gives up the processor to enable another
task to start execution. This is normally done via a system call. Because of the C
calling rules, the procedure which makes the system call cannot itself be a leaf
function. This means that the lr1 value of the procedure making the system call
always contains a valid pointer to the top of the current activation record. If the
library routine implementing the system call does not lower the register stack (in
practical terms, it is a small leaf procedure), the current lr1 value is a valid pointer
to the top of the activation record.
At first glance it seems the large number of internal registers must result in an
expensive context save and restore time penalty. Further study shows that this is not
the case.
Much of the time required to complete a context switch is associated with
moving data between external memory and internal registers. However, a significant
portion of the time is associated with supervisor overhead activities.
When saving the context of the current process all the registers holding data
relevant to the current task must have their contents copied to the external data
memory save area.
A 29K processor contains a number of special purpose registers. There are eight
user task accessible special registers, sr128–sr135, used to support certain
instruction type execution. Assuming the exiting–task (the one that is being saved)
was written in C and the system call library code does not contain any explicit
move–to–special–register instructions, there is no need to save the registers as any
instructions requiring the support of special registers would have completed by the
time of the context switch system call. The AMD C calling convention does not
support preservation of these special registers across a procedure call.
Of the 15 supervisor–mode only accessible special registers (sr0–sr14), three
registers are allocated to controlling access to external data memory (the channel
registers). Because at the time of a synchronous context switch there is no
outstanding data memory access activity, these registers also need not be saved. This
is only true if an instruction causing a trap is used to issue the system call and there is
no outstanding data memory access DERR pending. The Am29000 processor
serializes (completes) all channel activity before trap handler code commences. For
more detail on the DERR pending issue, see the Optimization section which follows.
On entering the system call procedure, the 25 global registers used by the calling
procedure no longer contain essential data. This means that they need not be saved.
The register stack support registers and the relevant global supervisor registers must
be saved.
Additionally four global registers (gr112–gr115) reserved for the user (not
affected by the compiler) must be saved if any application program uses them. If
these registers are not being used on a per–User basis, but shared between all Users
and the Supervisor code, then they need not be saved. For example, a real–time
system may choose to place peripheral status information in these registers for users to
examine. The status information may be updated by Supervisor mode interrupt
handlers.
The context information is stored in a per–task data region known as the Process
Control Block (PCB). The example task context save code below assumes the
register pointing to the PCB data region, pcb, has already been assigned the correct
memory address starting value.
An operating system register in the range ks1–ks12 is assumed to point to the
bottom of the PCB stack. Note that the CPS register values set by the MTSRIM
instructions are system dependent; the RE bit may be required in some cases, and the
IM field value must also be chosen to suit the system.
        .equ    SIG_SYNC,-1     ;indicate a synchronous save
        .equ    ENABLE,(SM|PD|PI)
        .equ    DISABLE,(ENABLE|DI|DA)
        .equ    FPStat0,gr93    ;floating-point
        .equ    FPStat1,gr94    ;trapware support
        .equ    FPStat2,gr95    ;registers

sync_save:                      ;example synchronous context save
        constn  it0,SIG_SYNC
        push    pcb,it0
        push    pcb,gr1
        push    pcb,rab         ;push rab
        pushsr  pcb,it0,pc0     ;push specials
        pushsr  pcb,it0,pc1
        sub     pcb,pcb,1*4     ;space pc2
        pushsr  pcb,it0,cha     ;push CHA
        pushsr  pcb,it0,chd     ;push CHD
        pushsr  pcb,it0,chc     ;push CHC
        sub     pcb,pcb,1*4     ;space for alu
        pushsr  pcb,it0,ops     ;push OPS
        mtsrim  cps,DISABLE     ;remove freeze
        sub     pcb,pcb,1*4     ;space for tav
        mtsrim  chc,0           ;possible DERR
;
        push    pcb,lr1         ;push R-stack
        push    pcb,rfb         ; support
        push    pcb,msp         ;push M-stack pnt.
;
        mtsrim  cps,ENABLE      ;enable interrupts
;
        push    pcb,FPStat0     ;floating point
        push    pcb,FPStat1
        push    pcb,FPStat2
;
        sub     pcb,pcb,4*4     ;space for IPA..Q
;
        sub     pcb,pcb,9*4     ;space gr116-124
        sub     pcb,pcb,4*4     ;push gr112-115, optional
        mtsrim  cr,4-1
        storem  0,0,gr112,pcb
        sub     pcb,pcb,16*4    ;space for gr96-111
Local registers currently in use, those that lie in the region pointed to by gr1 and
rfb (gr127), require saving. Not all of the local register cache needs saving. The
example code below assumes the user was running with address translation on. Thus,
to gain access to the user’s register stack, the Supervisor must use the UA option bit
when storing out the cache contents. If the user had been running in physical address
mode, then there is no need for the Supervisor to use the UA option to temporarily
obtain User mode access permissions.
The context save code example above operates with physical addresses in
Supervisor mode. This means address translation is not enabled. To enable data
address translation when the UA bit is used, the PD bit in the CPS register must be
cleared. Some operating system developers may choose to run the Supervisor mode
code with address translation turned on; in such cases, the PD bit will already be
cleared. Remember, once the PD bit is reset, it is possible to take a TLB miss. With the
UA bit set during the cache store operation, the TLB miss will relate to the temporary
User mode data memory access.
        .equ    UA,0x08         ;UA access
        .equ    PD,0x40         ;PD bit

        mtsrim  cps,ENABLE&~PD  ;virtual data
;
        sub     kt0,rfb,gr1     ;get bytes in cache
        srl     kt0,kt0,2       ;adjust to words
        sub     kt0,kt0,1
        mtsr    cr,kt0
        storem  0,UA,lr0,gr1    ;save lr0-rfb
;
        mtsrim  cps,ENABLE      ;return to physical
5.2.1 Optimizations
When an ASSERT instruction is used to enter Supervisor mode, all outstanding
data memory access activity is completed before the trap handler gains control. If no
data access error (DERR) occurs then the channel registers will contain no valid data
and need not be saved. However, when the channel access is serialized and forced to
complete, a priority four DERR may have occurred. The DERR trap competes with
the priority three system call trap (higher than four), and thus the system call trap
handler commences but with the channel still containing information pertaining to
the failed data access.
A performance gain can be obtained by not saving the channel registers to
external data memory. If the memory system hardware is unable to generate the
DERR signal, then the channel registers should not be saved. Additionally, if the
software developer knows the previous data memory access has been completed or
was to a known memory location, there may be no need to save the channel registers.
The code shown below is an alternative to the previous system call trap handler entry
code; the transaction–fault bit (TF) in the channel control register (CHC) is tested to
determine if the channel registers need saving.
A further performance gain can be obtained by not saving the PC0 register.
When the PC1 register is restored, the PC0 register can be determined by adding 4 to
the PC1 address value. To achieve the best performance gains, the code in the
subsequent Restoring Context section may be optimized to avoid restoring channel
registers CHA and CHD if the CHC contents–valid (CV) bit is zero.
save_channel:                   ;deal with DERR fault
        pushsr  pcb,it0,cha
        pushsr  pcb,it0,chd
        pushsr  pcb,it0,chc
        jmp     channel_saved
        mtsrim  chc,0           ;clear TF

sync_save:                      ;example synchronous context save
        constn  it0,SIG_SYNC
        push    pcb,it0
        push    pcb,gr1
        push    pcb,rab         ;push rab
        sub     pcb,pcb,1*4     ;space for pc0
        pushsr  pcb,it0,pc1     ;push pc1
        sub     pcb,pcb,1*4     ;space for pc2
        mfsr    it0,chc         ;test TF bit
        sll     it0,it0,31-10   ; in CHC set
        jmpt    it0,save_channel
        sub     pcb,pcb,2*4     ;space for cha,chd
        const   it0,0
        push    pcb,it0         ;push CHC=0
channel_saved:
        sub     pcb,pcb,1*4     ;space for alu
        pushsr  pcb,it0,ops     ;push OPS
        mtsrim  cps,DISABLE     ;remove freeze
        sub     pcb,pcb,1*4     ;space for tav
128 local registers, or even the part of the register file in use at context save time
([gr1]––[rfb]). Only the activation record of the last executing procedure for the task
([gr1]––[lr1]) (see Figure 5-3). This greatly reduces the time required to restore a
task context originally saved by a synchronous context switch. Typically the size of a
procedure activation record ([gr1]–––[lr1]) is twelve words. To achieve this
optimization, the push of rab and rfb, shown in the previous code fragment, must be
changed to the code shown below. This ensures only one activation record is restored.
Figure 5-3. Overlapping Activation Records Eventually Spill Out of the Register Stack Cache
(The activation record of the last executing procedure, from lr0 at gr1 up to the
frame pointer in lr1, is wholly cache resident at the top of the stack. The records of
earlier procedures overlap it; the record bounded by rfb is only partially in the
cache, the remainder having spilled out to the memory-resident portion of the stack
at higher addresses.)
        .equ    WS,512          ;Window Size

        const   it0,WS          ;replacement for
        sub     rab,lr1,it0     ; push rab
        push    pcb,rab         ;push lr1-512
;
        push    pcb,lr1         ;replacement for push rfb
Figure 5-4. Context Save PCB Layout
(The PCB save frames, shown with higher addresses at the top of the figure and the
stack growing down. Both the asynchronous (interrupt) save frame and the synchronous
(system call) save frame hold, in order: signal number, gr1, rab, PC0, PC1, PC2,
CHA, CHD, CHC (saved as zero in the synchronous case), ALU, OPS, tav, lr1, rfb,
msp, floating-point support, IPA, IPB, IPC, Q, gr116–gr124, gr112–gr115, and
gr96–gr111, with the lower group marked as registers not normally saved.)

Burst mode enables data to be loaded from or stored to memory consecutively,
without the processor continuously supplying address information. An external
address latch/counter is required to support such memory systems with an Am29000
or Am29050 processor. A system designer can use this feature to reduce context
switch times.
5.3
ASYNCHRONOUS CONTEXT SWITCH
An asynchronous context switch occurs when the current task unexpectedly
gives up the processor to enable another task to execute. This may occur when a timer
interrupt results in the supervisor deciding the current task is no longer the task of
highest priority. Unlike a synchronous context switch, which happens at a well-defined
point, an interrupt can occur when the processor is in almost any state.
Because an interrupt may occur in a leaf procedure, it is not possible to
determine if the current lr1 value contains a valid pointer to the top of the procedure
activation record. Further, the interrupt may have occurred during a procedure
prologue, where the register stack pointer (gr1) has been lowered but the lr1 value
has not yet been updated. This means that when an asynchronously saved task is
switched back in, it is impossible to restore only the activation record of the
interrupted procedure. The register stack region containing valid data, that is
[gr1]–[rfb], must be restored. Assuming this amounts to half of the register file, an
additional 2.6 microseconds would be required to restore the task with a single–cycle
Am29000 processor memory system at 25MHz (64 words at one 40 ns cycle each is
roughly 2.6 microseconds).
A task voluntarily giving up the processor via a system call from within a
procedure of typical activation record size can be restored faster than a task giving up
the processor involuntarily via an asynchronous interrupt.
When a User mode program is interrupted it could mean the current process is to
be sent a signal, such as a segmentation violation. It could also mean that the
Supervisor wishes to gain control of the processor to support servicing the
interrupting device. If the current process is being signaled, the Supervisor mode
interrupt handler should jump to the label user_signal (see the example code
below). This is explained in the later section titled User Mode Signals (section 5.5). If
Supervisor support code is required for peripheral device servicing, then the action to
be taken is very much dependent on the interrupting device needs.
        .equ    SIGALRM,14      ;alarm signal

time_out:                       ;timer interrupt handler
        jmp     interrupt_common
        const   it0,SIGALRM     ;signal number

interrupt_common:
        ;Depending on required processing,
        ;jump to user_signal for current process signaling.
        ;Or, jump to user_interrupt to save the current process context.
Some interrupts can be serviced in Freeze mode, without the need to save the
current process context. Use of these so–called lightweight interrupt handlers can
offer significant performance gains. Other interrupts will require the interrupted
process context to be saved. This is described in the following section, Interrupting
User Mode (section 5.4).
It is possible an interrupt has arrived that requires a signal to be sent to a process
which is not the currently executing process. In this case, the operating system must
first save the current process context and then restore the context of the signaled
process. Once the in–coming process is prepared to run, using the code in the Restoring
Context section (section 5.10), the restored context will then have to be placed on the
signal stack as described in the User Mode Signals section (section 5.5). Thus, execution
would begin in the User mode trampoline code of the in–coming process. To follow
this in detail, the later sections of this chapter will have to be studied.
5.4
INTERRUPTING USER MODE
This section describes how the operating system can prepare the processor to
execute a C level interrupt handler, where the handler is to run in Supervisor mode
and the interrupt occurred during User mode code execution.
Because the User mode task is being asynchronously interrupted, the complete
processor state must be saved. The context information should be stored in the PCB
rather than a temporary stack, as a context switch to a new user task may occur after
the interrupt has been processed. Storing the state in the PCB saves having to copy
the state from the temporary stack to the PCB after the context switch decision has
been made. When saving task context, a performance optimization is obtained by
only saving the registers which are currently in use. However, such optimizations
typically only apply to synchronous–task context saving.
When User mode is interrupted, the special purpose support registers may
contain valid data. This means an additional nine special register data values must be
copied to external data memory, compared to the synchronous context switch.
Below is a code example of interrupt context saving. Notice the rab stack
support register is adjusted to a window distance below rfb within the interrupt
disabled portion of the code. This is to conform to the same PCB format used by those
who wish to perform the register stack fix–up with User mode code, rather than in the
Supervisor code shown. Register rab is merely a convenience value for determining
rfb–WindowSize (WindowSize normally 512) in detecting a SPILL condition.
However, it is also used to determine FILL or SPILL interruption. Should the User
mode stack fix–up code be interrupted during its operation, it is important that it does
not become confused with the original SPILL or FILL interrupt. Realigning the rab
register whilst interrupts are off prevents this confusion.
        .equ    WS,512          ;Window Size

user_interrupt:                 ;saving User mode context
        push    pcb,it0         ;stack signal id
        push    pcb,gr1
        push    pcb,rab         ;stack real rab
        const   it0,WS
        sub     rab,rfb,it0     ;set rab=rfb-512
;
        pushsr  pcb,it0,pc0     ;push specials
        pushsr  pcb,it0,pc1
        pushsr  pcb,it0,pc2
        pushsr  pcb,it0,cha
        pushsr  pcb,it0,chd
        pushsr  pcb,it0,chc
        pushsr  pcb,it0,alu
        pushsr  pcb,it0,ops
        mtsrim  cps,DISABLE     ;remove freeze
        push    pcb,tav
        mtsrim  chc,0           ;clear CHC
;
        push    pcb,lr1         ;push R-stack
        push    pcb,rfb         ; support
        push    pcb,msp         ;M-stack pnt.
;
        mtsrim  cps,ENABLE      ;enable interrupts
;
        push    pcb,FPStat0     ;floating point
        push    pcb,FPStat1
        push    pcb,FPStat2
;
        pushsr  pcb,kt0,ipa     ;more specials
        pushsr  pcb,kt0,ipb
        pushsr  pcb,kt0,ipc
        pushsr  pcb,kt0,q
The 25 global registers, known to contain no valid data during a synchronous
context switch, must also be considered active, and consequently saved. Because
these global registers are located adjacent to the four global registers reserved for the
user, a single store–multiple instruction can be used to save the relevant global
registers. With a single–cycle memory system, about two microseconds are required
to save the additional current task context.
        sub     pcb,pcb,29*4    ;push gr96-gr124
        mtsrim  cr,29-1         ;including optional save of
        storem  0,0,gr96,pcb    ; gr112-gr115
If the interrupt is expected to result in a context switch then the local registers
currently in use require saving. Note, this can be postponed (see the following
optimizations section). Not all of the local register cache needs to be saved. However,
as is explained below, do not simply assume that those that lie in the region pointed to
by gr1 and rfb (gr127) are the only active cache registers.
When a synchronous context switch occurs the register stack is known to be in a
valid condition (see Figure 5-1). With an asynchronous event causing a context
switch, the stack may not be in a valid condition. There are three inconsistent
situations that must be detected and dealt with.
The interrupt occurred when a SPILL was in progress, in which case the
distance between the values in the rfb and rab registers exceeds the size of the
local register file (referred to as the Window Size). All of the local register file
must be saved. Some of the cached data may have already been copied out to
memory locations just below rfb. This data should remain at this location on the
memory resident portion of the stack until the task is restarted.
The interrupt occurred when a FILL operation was in progress, in which case
the distance between the values in the rfb and rab registers is less than the size of
the local register file. Some data may have been copied in from the top of the
memory resident portion of the register stack into local registers just above rab.
These registers will not be saved during the normal cache save ([gr1]–[rfb]). To
deal with this the FILL must be restarted when the context is restored.
The interrupt occurred during a far–longjmp operation. A far–longjmp is
defined as one in which the future (gr1 + 8) value—which is the address of local
register lr2 on the register memory stack—is greater than the current value in
the rfb register. In this case the local registers contain no valuable data because a
previous activation record (present during setjmp) is about to be restored from
the memory resident portion of the stack.
        .equ    WS,512          ;Window Size

R_fixup:                        ;register stack fix-up
        add     kt0,pcb,CTX_RAB ;get rab value
        load    0,0,kt2,kt0
        sub     kt0,rfb,kt2     ;window size
        srl     kt0,kt0,2       ;convert to words
        cpeq    kt1,kt0,WS>>2   ;test for valid
        jmpt    kt1,norm        ; stack condition
        cpltu   kt1,kt0,WS>>2   ;test for FILL
        jmpt    kt1,ifill       ; interrupt
        add     kt1,gr1,8       ;test far-longjmp
        cpgtu   kt1,kt1,rfb     ; interrupt
        jmpt    kt1,illjmp      ;yes, gr1+8 > rfb
        nop
;
ispill:                         ;deal with interrupted SPILL
        const   kt1,WS
        jmp     norm
        sub     gr1,rfb,kt1     ;gr1=rfb-512
;
ifill:                          ;deal with interrupted FILL
        add     kt1,pcb,CTX_CHC
        const   kt0,0
        push    kt1,kt0         ;resave CHC=0
        add     kt0,FillAddrReg,4
        add     kt1,pcb,CTX_PC0
        push    kt1,kt0         ;resave PC0,PC1
        push    kt1,FillAddrReg
        add     kt1,pcb,CTX_RAB
        push    kt1,rab         ;resave rab=rfb-512
;
norm:                           ;deal with consistent stack
        sub     kt1,rfb,gr1     ;bytes in cache
        srl     kt1,kt1,2       ;convert to words
        sub     kt1,kt1,1       ;adjust for storem
        mtsr    cr,kt1
        mtsrim  cps,ENABLE&~PD  ;virtual data
        storem  0,UA,lr0,gr1    ;copy to stack
        mtsrim  cps,ENABLE      ;physical data
;
illjmp:                         ;valid local registers now saved
Once the user’s User mode register stack has been saved, the interrupt handler
continues using the user’s Supervisor mode register and memory stacks.
        .macro  const32,reg,data
        const   reg,data                ;zero high, set low
        consth  reg,data                ;high 16–bits
        .endm
        ;
        const32 msp,SM_STACK            ;Supervisor M–stack
        const32 rab,SR_STACK–WS         ;prepare Supervisor
        add     gr1,rfb,8               ; R–stack support
        const32 rfb,SR_STACK            ; registers
        add     lr1,rfb,0
        ;
        ;call appropriate C–level interrupt handler
The current task context has now been saved. After the interrupt has been
processed the operating system can select a different task to restore. This operation is
described in a subsequent section entitled Restoring Context (section 5.10). The PCB
structure for the out–going task shall not be accessed until the task is again restored as
the current executing task.
5.4.1 Optimizations
When User mode is interrupted, processing continues using the user’s
Supervisor mode stacks. This is necessary because the interrupt may result in the
process being put to sleep until some time later when it is again able to run. When the
process is put to sleep, the process state is stored in the Supervisor memory stack,
described in the Interrupting Supervisor Mode section (section 5.6). If the user’s
User mode context was saved on a shared interrupt stack rather than the per–process
Supervisor stack, then the context would have to be copied from the global interrupt
stack to the Supervisor stack before a context switch could proceed.
The code shown above determines the region of cache registers currently in use
and stores them out onto the top of the user’s User mode register stack. This operation
can be postponed. The interrupt handler will use the register cache in conjunction
with the Supervisor mode register stack. If the interrupt handler runs to completion
and no context switch occurs, then the cache need not be saved. If a context switch
does occur then the cache will be saved on the top of the user’s Supervisor mode
register stack. This means some User mode data contained in the cache may be
temporarily saved on the Supervisor stack; however, this is not a problem.
The previous code determines the region of the cache currently in use; it does
not bring the stack into a valid condition. The code following the label R_fixup: in
the User Mode Signals section (section 5.5) does bring the stack into a valid
condition, and can be used to replace the code shown above. Once the stack support
registers are restored to a valid state, the stack–cut–across method described in the
later User System Calls section (section 5.7) can be used to attach the cache to the
Supervisor mode stack. By this method the storing of cache data can be prevented and
any unused portion of the cache is made immediately available to the interrupt
service routine.
5.5 PROCESSING SIGNALS IN USER MODE
Asynchronous context switches often occur because an interrupt has occurred
and must be processed by a handler function developed in C. A technique often
overlooked in real–time applications is using a signal handler to process the interrupt.
This often avoids much of the Supervisor overhead associated with a context switch.
Additionally, a context switch requires the instruction cache to be flushed. Signal
handlers run in the context of the interrupted User mode process, which avoids the need
to flush the cache.
It is not necessary to store the contents of the local register file. After signal
support code has fixed–up the stack management support registers, the C level
handler code can continue to use the register stack as if the interrupted procedure had
executed a call to the handler function. In as little as 5.5 microseconds from the time
of receiving the interrupt, the Am29000 can be executing the interrupt handler code
which was written in C.
Unlike asynchronous context switching, the interrupted context can not be
saved in the PCB. Doing so would be convenient if a context switch were possible after
the signal handler had finished executing, because the PCB structure would already be
updated. However, a further interrupt may occur during the C level signal handler
execution, which may itself result in an immediate context switch and require the use
of the PCB data save area. Additionally, the signal handler may do a longjmp to a
setjmp which occurred in User mode code before the signal handler started
executing. For this reason the context information is placed on the User’s memory
stack pointed to by msp.
Users of operating systems complying with the AMD HIF–specification are
required to complete signal handler preparation tasks in User mode code supplied in
AMD libraries. HIF compliant operating systems only save the signal–number
through the tav register portion of the interrupt frame on the user’s memory stack.
The remaining part of the interrupt frame is saved by the user’s code. Any necessary
register stack management is performed. The User mode code is shown in Appendix
B and described in detail in section 4.4. The following code is for operating systems
which save the complete interrupt frame and prepare for a User mode signal while in
Supervisor mode.
        .equ    SIGILL,4                ;illegal operation
        .equ    WS,512                  ;Window Size
protect:                                ;Protection violation trap handler
        jmp     user_signal             ;send interrupted task a signal
        const   it0,SIGILL              ;signal number
If the interrupted User mode code was running with address translation turned
on, then the user’s memory stack must be accessed by the Supervisor using the UA bit
during LOAD and STORE instructions (note, this is also true for the push and pushsr
macros). The following code example shows pushing onto a physically accessible
user memory stack. If the user’s stack were virtually addressed, then the push
instructions would be replaced by move to temporary register instructions. After
interrupts were enabled the PD bit in the CPS register would be cleared to enable data
address translation, and then the temporary registers would be pushed onto the user’s
memory stack using the UA bit during the STORE instruction operation. Once the
frozen special registers had been saved, via the use of temporary registers, the
Supervisor could continue to run with the CPS register bits PD and DA cleared, and
store the remaining user state via push operations.
user_signal:                            ;prepare to process a signal
        push    msp,it0                 ;stack signal id
        push    msp,gr1
        const   it0,WS
        sub     rab,rfb,it0             ;set rab=rfb–512
        ;
        pushsr  msp,it0,pc0             ;push specials
        pushsr  msp,it0,pc1
        pushsr  msp,it0,pc2
        pushsr  msp,it0,cha
        pushsr  msp,it0,chd
        pushsr  msp,it0,chc
        pushsr  msp,it0,alu
        pushsr  msp,it0,ops
        mtsrim  cps,DISABLE             ;remove freeze
        push    msp,tav
        mtsrim  chc,0                   ;clear CHC
        ;
        push    msp,lr1                 ;push R–stack
        push    msp,rfb                 ; support
        push    msp,msp                 ;M–stack support
        mtsrim  cps,ENABLE              ;enable interrupts
        push    msp,FPStat0             ;floating point
        push    msp,FPStat1
        push    msp,FPStat2
        pushsr  msp,kt0,ipa             ;more specials
        pushsr  msp,kt0,ipb
        pushsr  msp,kt0,ipc
        pushsr  msp,kt0,q
        sub     msp,msp,29*4
        mtsrim  cr,29–1
        storem  0,0,gr96,msp            ;push gr96–gr124
                                        ;including optional save of
                                        ; gr112–gr115
The register stack must now be brought into a valid condition, if it is not already in
one. Valid is defined as consistent with the conditions supporting a function call
prologue. As described in the previous section 5.3, Asynchronous Context Switching,
the stack may not be valid if a SPILL, FILL, or far–longjmp operation was interrupted.
Unlike the asynchronous context save case, with signal processing our intention
is not to simply determine the active local registers for saving on the user’s memory
portion of the register stack, but to enable the user to continue making function calls
with the existing stack. That is, the C language signal handler will appear to have
been called in the normal manner, rather than as a result of an interrupt.
; Register stack fixup
R_fixup:
        const   kt0,WS                  ;WindowSize
        add     kt2,msp,CTX_RAB
        load    0,0,kt2,kt2             ;interrupted rab
        sub     kt1,rfb,kt2             ;determine if
        cpgeu   kt1,kt1,kt0             ;rfb–rab>=WindowSize
        jmpt    kt1,nfill               ;jmp if spill
                                        ; or valid stack
        add     kt1,gr1,8               ;check if
        cpgtu   kt1,kt1,rfb             ; gr1+8 > rfb
        jmpt    kt1,nfill               ;yes, far–longjmp
        nop
;
ifill:                                  ;here for interrupted FILL restart
        add     kt1,msp,CTX_CHC
        const   kt0,0
        push    kt1,kt0                 ;resave CHC=0
        add     kt0,FillAddrReg,4
        add     kt1,msp,CTX_PC0
        push    kt1,kt0                 ;resave PC0,PC1
        push    kt1,FillAddrReg
        add     kt1,msp,CTX_RAB
        push    kt1,rab                 ;resave rab=rfb–512
;
nfill:                                  ;move gr1 into valid range
        cpgtu   kt0, gr1, rfb           ;if gr1 > rfb
        jmpt    kt0, lower              ;far–longjmp case
        cpltu   kt0, gr1, rab           ;if gr1 < rab then
        jmpf    kt0, sendsig            ;interrupted spill
        nop
raise:
        add     gr1, rab, 0
        jmp     sendsig
        nop
lower:
        add     gr1, rfb, 0
        jmp     sendsig
        nop
Now use the signal number to determine the address of the corresponding signal
handler. The code below assumes there is an array of signal handlers. The first entry
of the array is held at memory address SigArray.
sendsig:                                ;prepare to leave Supervisor mode
        add     kt0,msp,CTX_SIGNUMB
        load    0,0,gr96,kt0            ;get signal numb.
        sub     kt2,gr96,1              ;handler index...
        sll     kt2,kt2,2               ; ...in words
        const   kt1,SigArray
        consth  kt1,SigArray
        add     kt2,kt2,kt1
        load    0,0,gr97,kt2            ;handler adds.
        ;
        mtsrim  cps,FREEZE              ;enter Freeze mode
        const   kt1,_trampoline
        add     kt0,kt1,4
        mtsr    pc1,kt1                 ;return to user
        mtsr    pc0,kt0                 ;and process signal
        iret
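Expressed in C, the lookup above amounts to indexing the handler table with the 1–based
signal number. SigArray is taken from the code; the handler type below is an assumption
used only for illustration.

/* Hedged C equivalent of the SigArray lookup performed by sendsig. */
typedef void (*sig_handler_t)(int signum, void *saved_context);

extern sig_handler_t SigArray[];            /* first entry at address SigArray */

sig_handler_t lookup_handler(int signum)
{
    return SigArray[signum - 1];            /* signal numbers start at 1 */
}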
Via an IRET, execution continues in the User mode procedure trampoline. This
procedure is often located in the memory page containing the PCB structure. Using
User accessible global registers gr96 and gr97, two parameters, the signal number
and a pointer to the signal handler routine, are passed to the trampoline code. The
handler routine is called, passing to it the signal number and a pointer to the saved
context.
;User mode entry to signal handler
_trampoline:                            ;Dummy Call
        sub     gr1,gr1,6*4             ;space for C–call
        asgeu   V_SPILL,gr1,rab
        add     lr1,gr1,6*4
        add     lr0,gr97,0              ;copy handler()
        add     lr2,gr96,0              ;copy signal #
        add     lr3,msp,0               ;pass CTX pointer
        calli   lr0,lr0                 ;call handler()
        nop
        add     gr1,gr1,6*4             ;restore stack
        nop
        asleu   V_FILL,lr1,rfb
        const   tav,SYS_SIGRETURN
        asneq   V_SYSCALL,gr1,gr1       ;system call
After the signal handler returns, the interrupted context is restored via the
sigreturn system call. The supervisor mode code used to implement the restoration
process is shown in the section titled Restoring Context (section 5.10). At the time of
the system call trap, the memory stack pointer, msp, must be pointing to the structure
containing the saved context. The system call code checks relevant register data to
ensure that the User is not trying to gain Supervisor access permissions as a result of
manipulating the context information during the signal handler execution. (Note, it is
likely that the assembly code library supporting the sigreturn system call will copy the
lr2 parameter value to the msp register before issuing the system call trap.)
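Seen from C, this mechanism makes the interrupt look like an ordinary call of the
installed handler. The sketch below is illustrative only; the type and function names are
assumptions, not part of any 29K library definition.

/* Hedged sketch of a C level signal handler as invoked by the
 * trampoline: it receives the signal number (from gr96) and a pointer
 * to the context saved on the user's memory stack (msp). */
struct signal_context;                      /* saved interrupt frame */

void example_handler(int signum, struct signal_context *ctx)
{
    /* ...process the event, using ctx if required...               */

    /* Returning normally lets the trampoline issue the sigreturn
       system call, restoring the saved context.  Alternatively the
       handler may longjmp() to a setjmp() taken before the signal
       was delivered, bypassing the normal return path. */
    (void)signum;
    (void)ctx;
}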
5.6 INTERRUPTING SUPERVISOR MODE
A user program may be in the process of executing a system call when an
interrupt occurs. This interrupt may require C level handler processing. In some
respects this is similar to a user program dealing with a C level signal handler;
however, there are some important differences. A User mode signal handlers may
chose not to run to completion by doing a longjmp out of the signal handler. Also,
signal handlers process User mode data. Supervisor mode interrupt handlers always
run to completion and process data relevant to the Supervisor’s support task rather
than the current User mode task.
Because a user task is being interrupted whilst operating in Supervisor mode,
the complete processor state must be saved in a similar way to an asynchronous
context switch. The context information can not be stored in the current user’s PCB
because it is used to hold the User mode status when Supervisor mode is entered via a
system call.
User programs usually switch stacks when executing system calls (see section
5.7). The user’s system stack is not accessible to the User mode program. This keeps
Supervisor information that appears on the stack during system call execution hidden
from the user. The user’s system stack can be used to support C function calls during
interrupt handler processing. Alternatively, an interrupt processing stack can be
used. Keeping a separate interrupt stack for Supervisor mode interrupt processing
enables a smaller system mode User stack to be supported, as the interrupt processing
does not cause the system stack to grow further. Remember, the per–user system
stack is already in use because the user was processing a system call when the
interrupt occurred.
The interrupt_common entry point to the interrupt handler shown in
Asynchronous Context Switch (section 5.3) needs to be expanded to distinguish
between interrupting User mode and interrupting Supervisor mode. The appropriate
processing requirement is determined by examining the OPS register in the interrupt
handler. The label user_interrupt should be used to select the code for an interrupt of
User mode code.
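The test can be modelled in C as a check of the SM bit in the saved OPS value. The bit
position assumed below is consistent with the shift of 27 used in the assembly code that
follows, but it is an assumption of this sketch rather than a definition.

/* Hedged C view of the mode test made by interrupt_common below. */
#define OPS_SM (1u << 4)        /* assumed Supervisor Mode bit of OPS */

int interrupted_user_mode(unsigned ops)
{
    return (ops & OPS_SM) == 0; /* SM clear: User mode was interrupted */
}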
interrupt_common:                       ;examine processor mode interrupted
        mfsr    it1,ops                 ;get OPS special
        sll     it1,it1,27              ;check SM bit
        jmpf    it1,user_interrupt      ;User mode inter.
        nop
The following code assumes Supervisor mode interrupts are not nested, because
the current context is pushed onto the interrupt processing stack which is assumed
empty. If interrupts are to be nested, then the context should be pushed on the current
memory stack once it has been determined that the msp has already been assigned to
the interrupt memory stack. IM_STACK and IR_STACK are the addresses of the
bottom of the interrupt memory and register stacks respectively.
        .equ    WS,512                  ;Window Size
        .macro  const32,reg,data
        const   reg,data                ;zero high, set low
        consth  reg,data                ;high 16–bits
        .endm
supervisor_interrupt:                   ;process Supervisor mode interrupt
        const32 it1,IM_STACK            ;interrupt M–stack
        push    it1,it0                 ;stack signal id
        push    it1,gr1
        const   it0,WS
        sub     rab,rfb,it0             ;set rab=rfb–512
        ;
        pushsr  it1,it0,pc0             ;push specials
        pushsr  it1,it0,pc1
        pushsr  it1,it0,pc2
        pushsr  it1,it0,cha
        pushsr  it1,it0,chd
        pushsr  it1,it0,chc
        pushsr  it1,it0,alu
        pushsr  it1,it0,ops
        mtsrim  cps,DISABLE             ;remove freeze
        push    it1,tav
        ;
        mtsrim  chc,0                   ;clear CHC
        ;
        push    it1,lr1                 ;push R–stack
        push    it1,rfb                 ; support
        push    pcb,msp                 ;push M–stack pntr.
        add     msp,it1,0               ;use msp pointer
        mtsrim  cps,ENABLE              ;enable interrupts
        push    msp,FPStat0             ;floating point
        push    msp,FPStat1
        push    msp,FPStat2
        pushsr  msp,kt0,ipa             ;more specials
        pushsr  msp,kt0,ipb
        pushsr  msp,kt0,ipc
        pushsr  msp,kt0,q
        sub     msp,msp,29*4
        mtsrim  cr,29–1
        storem  0,0,gr96,msp            ;push gr96–gr124
                                        ;including optional save of
                                        ; gr112–gr115
There is no need to save any of the register cache data. In the following code, the
register stack support registers are updated with the initial values of the supervisor
interrupt stack. If nested high level handler interrupts are to be supported, see the
following Optimizations section. The gr1 register stack pointer is then set to the top
(rab) of the cache, indicating the cache is fully in use. The new activation record size
pointer, lr1, is then set to the bottom of the cache (rfb). This ensures that when the
interrupt's C level service function returns, the cache will be repaired to exactly the
position at which the interrupt occurred. This is particularly important if a Supervisor
mode FILL was interrupted. The user’s system mode register data will be spilled
onto the interrupt stack, but this creates no problem.
        const32 rab,IR_STACK–WS         ;prepare interrupt
        add     gr1,rab,0               ; R–stack support
        const32 rfb,IR_STACK            ; registers
        add     lr1,rfb,0
        ;
        ;call appropriate C–level interrupt handler
5.6.1 Optimizations
The code shown above does not attempt to determine the region of cache
registers currently in use. This means that the first C level procedure call in the
interrupt handler will result in a cache spill trap.
By determining the region of the cache currently in use and by bringing the
register stack into a valid condition, any available cache registers can be made
immediately available to the interrupt handler C routines. The code following the
label R_fixup: in the previous User Mode Signals section (section 5.5) does bring the
stack into a valid condition and can be used to replace the code shown above. Once
the stack support registers are restored to a valid state, the stack–cut–across method
described in the User System Calls section (section 5.7) can attach the cache to the
interrupt register stack.
It is possible that while processing an interrupt (which means the processor is
already in Supervisor mode) an additional interrupt occurs. If an operating system
supports nested interrupts, then the code in the Interrupting Supervisor Mode section
(section 5.6) will be executed again. This overhead can be avoided by following the
Interrupt Queuing Model method described in section 4.3.12 of the Interrupts and
Traps chapter.
The method relies on supporting only lightweight interrupt nesting. The code in
this section is entered only once, to start the execution of a C level interrupt processing
Dispatcher. Each interrupt adds an interrupt request descriptor (bead) onto a queue of
descriptors (a string of beads). The Dispatcher removes the requests and processes the
interrupts until the list becomes empty. Lightweight interrupts enable the external
device to be responded to quickly, although the Dispatcher may not complete the
processing until some time later.
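A hedged sketch of such a queue of interrupt request descriptors and its Dispatcher is
shown below. All names are illustrative assumptions, and a real implementation must
protect the queue manipulation against the lightweight handlers, which run with
interrupts disabled.

/* Hedged sketch of the interrupt queuing model: lightweight handlers
 * add beads to a queue; the C level Dispatcher removes and services
 * them until the queue is empty. */
struct irq_bead {
    int              source;            /* which device interrupted       */
    void           (*service)(int);     /* C level completion handler     */
    struct irq_bead *next;
};

static struct irq_bead *irq_head;       /* string of beads                */

/* Called by a lightweight (Freeze mode) handler after it has cleared
 * the device request and captured any critical data. */
void enqueue_bead(struct irq_bead *b)
{
    b->next  = irq_head;                /* simple LIFO insert for brevity */
    irq_head = b;
}

/* Supervisor C level Dispatcher: entered once, runs until empty. */
void dispatcher(void)
{
    while (irq_head != 0) {
        struct irq_bead *b = irq_head;
        irq_head = b->next;
        b->service(b->source);          /* complete the deferred work     */
    }
}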
5.7 USER SYSTEM CALLS
User programs usually switch stacks when executing system calls. The user’s
system stack is not accessible to the User mode program. This keeps Supervisor
information which appears on the stack during system call execution hidden from the
user.
Synchronous context switching generally happens as a result of a system call.
However, system calls are also used to request the operating system to obtain
information for a user which is only directly obtainable with Supervisor access
privileges. The user’s state must be saved to the PCB structure in a similar way to a
synchronous context save. This makes the global and special registers available for
Supervisor mode C function use. There is no need to save the register cache until a
full context switch is known to be required.
        .equ    SIG_SYNC, –1
        .equ    ENABLE,(SM|PD|PI)
        .equ    DISABLE,(ENABLE|DI|DA)
syscall:                                ;V_SYSCALL trap handler
        constn  it0,SIG_SYNC            ; assumes no
        push    pcb,it0                 ; outstanding DERR
        push    pcb,gr1                 ;push gr1
        push    pcb,rab                 ;push rab
        pushsr  pcb,it0,pc0             ;push specials
        pushsr  pcb,it0,pc1
        sub     pcb,pcb,3*4             ;space pc2,cha,chd
        const   it0,0
        push    pcb,it0                 ;push CHC=0
        sub     pcb,pcb,1*4             ;space for alu
        pushsr  pcb,it0,ops             ;push OPS
        mtsrim  cps,DISABLE             ;remove freeze
        sub     pcb,pcb,1*4             ;space for tav
        push    pcb,lr1                 ;stack support
        push    pcb,rfb                 ;push rfb
        push    pcb,msp                 ;push M–stack pnt.
        mtsrim  cps,ENABLE              ;enable interrupts
        push    pcb,FPStat0             ;floating point
        push    pcb,FPStat1
        push    pcb,FPStat2
        ;
        ;Assume the same gr112–gr115 data is shared
        ;by all users and the supervisor, and
        ;therefore will not push gr112–gr115.
        ;
        ;Align pcb for system call return
        sub     pcb,pcb,(4+(124–96+1))*4
The system call code can continue to use the cache attached to the user’s system
mode registers stack. To do this the current top of stack position, gr1, must be
maintained. The register stack support registers are relocated to the system stack,
maintaining the existing stack position offset. The following code performs this stack
cut–across operation. It assumes the system call is made from a valid stack condition.
However, it includes bounds protection because operating systems can never
completely rely on users always maintaining valid stack support registers.
        sub     gr96,rfb,gr1            ;determine rfb–gr1
        andn    gr96,gr96,3             ;stack is double word aligned
        const   gr97,(128*4)            ;max allowed value for
        cpleu   gr97,gr96,gr97          ; rfb–gr1 is 128*4
        jmpt    gr97,$1                 ;jump if normal register usage
        const   gr97,0x1fc              ;mask for page displacement math
        const   gr96,512                ;limit register use to max (512)
$1:
        and     gr1,gr1,gr97            ;determine gr1 displacement within
        const   gr97,SR_STACK–1024      ; 512–byte page
        consth  gr97,SR_STACK–1024
        add     gr1,gr1,gr97            ;gr1=SR_STACK–1024+displacement
        add     rfb,gr1,gr96            ;rfb=(new gr1)+
        const   gr97,(128*4)            ; min(512,rfb–gr1)
        sub     rab,rfb,gr97            ;set rab=rfb–512
        add     lr1,rfb,0               ;ensure all User mode registers
                                        ; restored
The technique relies on keeping bits 8–2 of the stack pointer, gr1, unchanged. In
other words, the lr0 register has the same position in the cache after the memory
resident stack portion has been exchanged. This is achieved by calculating the
address displacement of gr1 within a 512–byte page. The gr1 displacement
remains the same when the memory resident portion of the register stack has been
exchanged. SM_STACK and SR_STACK are the addresses of the bottom of the
per–user system memory and register stacks respectively (see Figure 5-5).
[Figure 5-5. Register Stack Cut–Across: the register stack support registers (gr1, rfb)
are shown before the system call, on the User mode register stack (UR_STACK), and after
the cut–across, on the Supervisor mode register stack (SR_STACK); 512–byte page
boundaries are marked and the page displacement of gr1 is preserved.]
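The address arithmetic performed by the cut–across code can be summarized by the
following hedged C model; register values are treated as plain unsigned addresses and
the structure and function names are assumptions.

/* Hedged C model of the stack cut-across arithmetic shown above. */
struct rstack { unsigned gr1, rfb, rab; };

struct rstack cut_across(unsigned gr1, unsigned rfb, unsigned sr_stack)
{
    struct rstack s;
    unsigned in_use = (rfb - gr1) & ~3u;         /* bytes of cache in use       */
    if (in_use > 512u)
        in_use = 512u;                           /* bound by the window size    */
    s.gr1 = (sr_stack - 1024u) + (gr1 & 0x1fcu); /* keep bits 8-2 of gr1        */
    s.rfb = s.gr1 + in_use;                      /* new gr1 + min(512, rfb-gr1) */
    s.rab = s.rfb - 512u;                        /* rab = rfb - 512             */
    return s;
}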
Once stack cut–across has been completed, a call to the C level system call
handler can be issued. The C code may get its incoming parameters from the register
stack, or the system call trap handler code may copy the parameters from the local
registers to memory locations accessible by the Supervisor mode C handler.
        ;copy lr2,... arguments to memory locations
        add     gr96,tav,0              ;save service numb.
        sub     gr1,gr1,4*4             ;new stack frame
        asgeu   V_SPILL,gr1,rab
        add     lr1,gr1,4*2             ;ensure lr1 restore
        const32 lr0,_syscall            ;C handler
        calli   lr0,lr0                 ; call
        add     lr2,gr96,0              ;pass service numb.
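The C level handler reached by the calli above might have the following shape. This is
only a hedged sketch: the service numbering and the pcb_update_tav() helper are
assumptions; as the text notes, success or failure is ultimately reported by modifying the
saved tav (gr121) location in the PCB.

/* Hedged sketch of a C level system call handler (_syscall above). */
extern void pcb_update_tav(int success);    /* assumed helper: writes the
                                               saved gr121 slot in the PCB */

int _syscall(int service)                   /* service number from lr2     */
{
    switch (service) {
    /* case SYS_SIGRETURN: arrange an asynchronous context restore     */
    /* case ...:           other operating system services             */
    default:
        pcb_update_tav(0);                  /* unknown service: FALSE     */
        return -1;
    }
}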
The C system call handler may place its return values in known memory
locations, rather than global registers gr96–gr111. If this is the case, then the values
shall have to be copied to the normal return registers. System calls indicate their
successful or unsuccessful completion to their callers by setting register tav (gr121)
to TRUE or FALSE; the high level handler achieves this by modifying the gr121
register location in the PCB before the system call return code is executed. A FILL
assertion is used to repair the cache to the position at which the system call was
issued.
        add     gr1,gr1,16              ;restore system
        nop                             ; call frame lr1
        asleu   V_FILL,lr1,rfb          ;restore all cache
        ;copy return values from memory to gr96,...
        jmp     resume                  ;restore context
        nop
Because a User mode signal handler may use the system call mechanism to issue
a sigreturn, it is possible an asynchronous context restore may be required in place of
the normal synchronous context restore associated with a system call. The resume label
is jumped to; it is described in the Restoring Context section (section 5.10). If an
asynchronous context is being restored, then a pointer to the context being restored
will have been passed to the sigreturn system call. The high level C handler will have
copied this data over the PCB data stored at the time of the system call trap entry. The
C handler must change the SIG_SYNC value stored in the PCB by the system call
trap handler. This will cause the resume code to perform an asynchronous rather than
synchronous context restore.
5.8 FLOATING–POINT ISSUES
The example code presented saves only three supervisor accessible global
registers under the heading floating–point support. These registers are typically
ks13–ks15. This is sufficient to save and restore floating–point context when an
Am29000 processor is being used with trapware emulation. This is only true if
interrupts are turned off during floating–point trapware execution. If floating–point
trapware is interruptible, then the Am29000 trapware support registers (typically
it0–it3 and kt0–kt11) would have to be saved.
When an Am29027 floating–point coprocessor is used, either inline or via
trapware support, the complete state of the coprocessor must be saved. This requires
an additional 35 words space in addition to the three Am29000 global support
registers.
Some real–time operating systems may run floating–point trapware with
interrupts off and choose to save no floating–point support registers at all. This will improve
context switch times. User programs typically only change the rounding mode
information in the support registers. If all user tasks run with the same rounding
information, then there is no need to save and restore the three floating–point support
registers.
The Am29050 directly executes floating point instructions without the need for
trapware. It has four floating point support registers, special registers sr160–sr162 and
sr164. In fact, the three support registers required by the Am29000 are used to
virtualize these Am29050 registers. Saving Am29050 floating point context would
be achieved by saving these four registers and the four double word accumulator
values. However, the Am29050 does not directly support integer DIVIDE and
DIVIDU instructions. The trapware which implements these instructions requires
six support registers (typically kt0–kt5). If this trapware is interruptible, then these
registers would also have to be saved.
5.9 DEBUGGER ISSUES
Debuggers such as AMD’s MiniMON29K monitor have a special context
switch requirement. They need to be able to switch context to the debugger from a
running application or operating system without losing the contents of any processor
register. One possibility is to reserve a global register in the range gr64–gr95,
specifically for debugger support. But, most operating system developers are
unwilling to give up a register.
A technique which avoids losing a register for operating system use is to use gr4
to first store a single operating system register, and then use this register to start
saving the rest of the processor context. The Am29000 does not have a gr4 register
but the ALU forwarding logic enables this technique to work. The code example
below, taken from MiniMON29K, shows how the processor context save gets
started. Note, _dbg_glob_reg is the memory address used by the debugger to save
global registers.
        .macro  const32,reg,data
        const   reg,data                ;zero high, set low
        consth  reg,data                ;high 16–bits
        .endm
dbg_V_bkpt:
        const32 gr4,_dbg_glob_reg+96*4
        store   0,0,gr96,gr4            ;save gr96
        const32 gr96,_dbg_glob_reg+97*4
        store   0,0,gr97,gr96           ;save gr97
        add     gr96,gr96,4
        store   0,0,gr98,gr96           ;save gr98
        ;
        call    gr96,store_state
        const   gr97,V_BKPT
Label dbg_V_bkpt is the address vectored to by an illegal opcode
(MiniMON29K uses these to implement breakpoints on the Am29000). When
function store_state is reached, global registers gr96–gr98 have already been saved.
The gr4 user should be careful to remember that the Am29000 ALU forwarding
logic only keeps the gr4 register value alive for 1–cycle following its modification.
Additionally, because emulators also make use of gr4 in analyzing processor
registers, it is not possible to use an emulator to debug the monitor entry code shown
above.
5.10 RESTORING CONTEXT
The supervisor register pcb must point to the top of the process control block
stack describing the previously saved context. A test of the signal number data
located at the bottom of the PCB stack enables us to determine if the stack was saved
synchronously or asynchronously. Restoring synchronously saved tasks can be
achieved more quickly because there is less relevant data in the PCB stack.
resume:
        add     kt0,pcb,CTX_SIGNUMB
        load    0,0,kt0,kt0             ;sync/async save ?
        jmpt    kt0,sync_resume
        nop
Asynchronously saved states have a greater number of global registers to be
restored. There are also additional special register values.
async_resume:
        mtsrim  cr,29–1
        sub     pcb,pcb,29*4
        loadm   0,0,gr96,pcb            ;restore gr96–124
        ;
        popsr   q,it0,pcb               ;restore specials
        popsr   ipc,it0,pcb
        popsr   ipb,it0,pcb
        popsr   ipa,it0,pcb
        ;
        jmp     fp_resume
        nop
Now that the context information, unique to an asynchronously saved state, has
been restored, the context which is common between asynchronous and
synchronous save states can be restored via a jump to fp_resume.
sync_resume:
        add     pcb,pcb,16*4            ;space for gr96–111
        ;
        mtsrim  cr,4–1
        loadm   0,0,gr112,pcb           ;optional restore of gr112–115
        add     pcb,pcb,4*4
        ;
        add     pcb,pcb,9*4             ;space for gr116–124
        add     pcb,pcb,4*4             ;space for IPA–Q
fp_resume:
        pop     FPStat2,pcb             ;floating point
        pop     FPStat1,pcb
        pop     FPStat0,pcb
Now that most of the global and User mode accessible special registers have
been restored, it is time to restore the register cache. In the case where they were
saved due to an asynchronous event, this requires care. First the register stack support
registers must be restored.
        .equ    DISABLE,(SM|PD|PI|DI|DA)
        mtsrim  cps,DISABLE
        pop     msp,pcb                 ;M–stack support
        pop     rfb,pcb                 ;R–stack support
        pop     lr1,pcb
        add     kt1,pcb,9*4
        pop     rab,kt1
        pop     gr1,kt1
        add     gr1,gr1,0               ;alu operation
By examining the register stack support pointers it is possible to determine if the
process state was stored during a SPILL interrupt. In this case the saved gr1 will be
more than a window distance below rfb; this means [gr1]–[rfb] should not be
restored. In the case of restoring an interrupted far–longjmp, the cache need not be
restored.
        .equ    WS,512                  ;Window Size

        ;If User mode uses virtual addressing,
        ;restore PID field in MMU register
        ;to PID of incoming task.

        sub     kt0,rfb,rab             ;window size
        srl     kt0,kt0,2               ;convert to words
        cpleu   kt1,kt0,WS>>2           ;test for normal
        jmpt    kt1,rnorm               ; or FILL interrupt
        cpgtu   kt1,gr1,rfb             ;test for far–
        jmpt    kt1,rlljmp              ; longjmp interrupt
        nop
;
rspill:                                 ;restore interrupted spill
        const   kt0,WS
        sub     kt1,rfb,kt0             ;determine rab
        add     kt0,gr1,0               ;save interrupted gr1
        add     gr1,kt1,0               ;set gr1=rfb–(window size)
        mtsrim  CR,(512>>2)–1
        mtsrim  cps,ENABLE&~PD          ;virtual data
        loadm   0,UA,lr0,kt1            ;load all of cache
        mtsrim  cps,ENABLE              ;physical data
        jmp     rlljmp
        add     gr1,kt0,0               ;restore interrupted gr1
When synchronously saved tasks are restored, or asynchronously saved tasks
which were interrupted during either a normal register stack condition or an
interrupted FILL, local registers [gr1]–[rfb] are restored to the cache.
rnorm:
        sub     kt0,rfb,gr1             ;determine number of bytes
        srl     kt0,kt0,2               ;adjust to words
        sub     kt0,kt0,1
        mtsr    CR,kt0
        mtsrim  cps,ENABLE&~PD          ;virtual data
        loadm   0,UA,lr0,gr1            ;restore R–stack cache
        mtsrim  cps,ENABLE              ;physical data
Now that the local registers have been restored, all that remains to do is restore
the remaining special registers. This requires applying Freeze mode with interrupts
disabled during this critical stage.
rlljmp:
        pop     tav,pcb
        mtsrim  cps,FREEZE
        popsr   ops,it0,pcb             ;frozen specials
        popsr   alu,it0,pcb
        popsr   chc,it0,pcb
        popsr   chd,it0,pcb
        popsr   cha,it0,pcb
        popsr   pc2,it0,pcb
        popsr   pc1,it0,pcb
        popsr   pc0,it0,pcb
        iret
5.11 INTERRUPT LATENCY
Interrupt latency is an important issue for many real–time applications. I
define it as the time which elapses between identifying the interrupting device's
request and performing the necessary processing to remove the request. Latency is
increased by having interrupts disabled for long periods of time. Unfortunately it is
desirable to have operating system code perform context switching with interrupts
disabled.
Consider the case where a User mode process is interrupted and a signal is to be
sent to the process. The operating system starts saving the interrupted process context
on the user’s memory stack. However, in the process of doing this an interrupt is
generated by a peripheral device requiring Supervisor mode C level interrupt handler
support. This second interrupt requires a context switch to the Supervisor mode
interrupt stack. In the process of preparing the processor to run the C level handler,
the context switch code may become confused about the state of the stack support
registers as a result of partial changes made by the interrupted signal handler
operating system code. Additionally, there is likely to be register usage conflict
between the different operating system code support routines.
The status confusion and register conflict is avoided by disabling interrupts
during the critical portions of the operating system code. The code shown in this
chapter enables interrupts after the frozen special registers and stack support
registers have been saved. This is insufficient to deal with the nested interrupt
situation described above. However, this does reduce interrupt latency, which is a
concern to real–time 29K users. Some implementors may choose to move the enabling
of interrupts to a later stage in the operating system support code — more
specifically, to a point after register stack support registers have been assigned their
new values. Register usage changes will also be required to avoid conflict.
Within the example code used throughout this chapter, interrupts can be enabled
just after special register CHC has been saved (before lr1 is pushed on the PCB). This
low latency technique enables lightweight interrupt handlers to be supported during
the operation of normally critical operating system code. Lightweight handlers
typically only run in Freeze mode and can easily avoid register conflict if they are
restricted to global registers it0–it3. Using the Interrupt Queuing Model described in
section 4.3.12, or the Signal Dispatcher described in section 2.5.6, a lightweight
handler responds to the peripheral device interrupt. It transfers any critical peripheral
device data and clears the interrupt request. In doing so, it inserts an
interrupt–descriptor, or signal number, into a queue for later processing.
A Supervisor C level interrupt handler known as the Dispatcher removes queue
entries and calls the appropriate handler to process them. If the operating system is
interrupted in a non–critical region by a device requiring a Supervisor mode C level
handler, then the dispatcher is immediately started. If the interrupt is in a critical
region then the Dispatcher shall be started later when the current critical tasks have
been completed. If the Dispatcher is already running when the interrupt occurred,
then the associated interrupt descriptor shall wait in the queue until the Dispatcher
removes it for processing.
The use of a Dispatcher and interrupt queuing helps to reduce interrupt latency
via the use of lightweight interrupts when building queue entries. However, the
method has some restrictions. It works where troublesome nested interrupt servicing
can be partially delayed for later high level handler completion. But some interrupts
can not be delayed. For example an operating system may be running with address
translation turned on, and a TLB miss may occur for an operating system memory
page which needs the support of a high level handler to page–in the data from a
secondary disk device. In this case the interrupt must be completely serviced
immediately. This is not a typical environment for 29K users in real–time
applications. And even in many non–real–time operating system cases the operating
system runs in physical mode or all instructions and data are known to be currently in
physical memory. The trade–offs required in deciding when to enable interrupts and
resolving register conflict are specific to each operating system implementation.
5.12 ON–CHIP CACHE SUPPORT
First level caches are small on–chip memories which can respond on behalf of
off–chip memory when a processor attempts a memory access. When the required
access is satisfied by the cache, known as a cache hit, a performance advantage is
obtained when compared to accessing slower off–chip memory. Caches enable high
performance systems to be constructed without the expense and complexity of fast
system memory.
The 29K family supports a mixture of different cache schemes, see Table 5-1.
Some of the inexpensive devices such as the Am29005 processor and the Am29200
microcontroller have no on–chip cache. Other family members generally have some
kind of instruction memory cache; and in some of the top performing processors, data
cache is provided. The individual processor User’s Manual describes the operation of
the available cache in detail. Chapter 1 outlined the basic cache capabilities of the
family (see sections 1.3–1.9). This section deals with the support code needed to
maintain cache operation. Some cache operations are described in more detail for the
purpose of showing how cache maintenance affects system performance.
When a cache is provided, the 29K family supports two–way set associative
caching. The two–way cache associativity (see section 6.2) provides two possible
locations (blocks or cache entries) for caching any selected memory location. A
block contains four contiguous words from memory and associated tag and status
bit–fields. When a cache miss occurs, and both associated blocks are valid but not
locked (can be displaced), a block is chosen at random for replacement (known as
reload). Investigations have shown that random replacement can be more successful
than a Least Recently Used (LRU) replacement scheme.
When a 29K processor is reset, the processor disables all caches by setting the
cache disable bit–fields in the CFG configuration register. Cache entries must first be
invalidated before the cache is enabled. Supervisor mode code can perform most
operating system cache maintenance services by simply manipulating the bit–fields
of the CFG register. In addition Supervisor mode privileged instructions are provided
for cache invalidation.
5.13 INSTRUCTION CACHE MAINTENANCE
Instruction cache memory typically has a larger impact on performance than
data cache with the 29K family. This is due to the reduced number of data accesses
required by application code. The reduction is relative to other processors, generally
CISC, which have a small number of on–chip registers. Application data is normally
held in the 128–word register file which is a cache of the top of the application
register stack.
The potentially higher performance of a RISC chip is only achieved if the
instruction pipeline is kept effectively busy. The RISC engine is instruction hungry
Table 5-1. 29K Family Instruction and Data Cache Support
(all cache sizes in bytes)

3–bus Microprocessors      I–cache              D–cache
  Am29000                  BTC 32x16            –
  Am29005                  –                    –
  Am29050                  BTC 64x16 or 128x8   –

2–bus Microprocessors      I–cache              D–cache
  Am29035                  4k                   –
  Am29030                  8k                   –
  Am29040                  8k                   4k

Microcontrollers           I–cache              D–cache
  Am29205                  –                    –
  Am29200                  –                    –
  Am29245                  4k                   –
  Am29240                  4k                   2k
  Am29243                  4k                   2k
and to prevent stalling it must be kept fed with instructions from cache memory or a
high bandwidth off–chip memory system (see section 1.10). On–chip cache can
supply instruction sequences at a rate of one per cycle without any initial access
penalties. Thus they can keep the pipeline fed without any stalling due to lack of
available instructions to process.
The original 3–bus family members have a Branch Target cache due to the
improved access to off–chip memory made possible with three busses. Later 2–bus
and microcontroller family members have a more conventional, bandwidth
improving, instruction cache. It is interesting to consider the benefits of an
instruction cache when the memory system is able to support single cycle memory
access. For example, the built–in DRAM controller used in the Am29240
microcontroller is able to support single cycle burst–mode access. An instruction
cache can not improve on the 1–cycle memory access. However, the cache still hides
the initial access penalties incurred when starting a new burst sequence. It also
enables parallel LOAD and STORE instruction execution, the processor pipeline
being supplied by the instruction cache while the data bus is free to perform a data
access (see section 1.7.2).
The required cache maintenance software does not present much of an
overhead. Because the address in the program counter is presented to the instruction
cache at the same time it is presented to the MMU, the instruction cache does not
operate with physical addresses if the MMU is in use. Thus, the 29K family
instruction caches operate with virtual addresses when testing for a cache hit.
Because cache entries are not tagged with a per–process identifier the cache must be
flushed when a process (or task) context switch occurs. This is to prevent a previous
process’s virtual address appearing to match with the current task’s virtual address.
Only systems which operate with multiple tasks using virtual addressing must
invalidate the cache when a user–task context switch occurs. Using the IRETINV
(interrupt return and invalidate) instruction is one convenient way of doing this.
However, if the processor runs tasks with physical addressing, there is no need to
flush the cache on a process (task) context switch. With physical addressing, each
task is restricted to execution within a limited and possibly unique range of the
available address space.
The instruction cache is enabled by clearing the Instruction Cache Disable (ID)
bit of the CFG configuration register (the CD bit is used with 3–bus processors).
Cache entries are built around blocks of four consecutive instructions. Each block
has some associated tag and status information. This information, shown on
Figure 5-6, is the same for each processor. However, the exact layout of the bit–fields
may vary among family members.
[Figure 5-6. Instruction Cache Tag and Status Bits: each cache block tag holds an
Address Tag together with Valid (V), Physical (P), and User/Supervisor (US) status
bit–fields.]
The Valid (V) bit–field indicates if the cache entry is valid. For processors which
have a 1–bit field, setting this bit means all four instructions are valid cache entries.
When a family member supports a 4–bit field, a separate bit is used to indicate a valid
entry for each of the four cached instructions.
Each block has a P bit–field. This bit indicates that the tagged address relates to a
physical address value. The P bit becomes set when the cache is reloaded while the PI
(Physical Instruction) bit in the CPS register is set. This allows cache entries to hold
interrupt handlers which typically run with physical addressing. The interrupt
handler code can be distinguished from User mode and Supervisor mode virtually
addressed code.
When the cache is invalidated using an INV type instruction all valid bits are
reset, even entries which were valid and had their P bit set. In some cases there may be
a performance gain to be had by not invalidating physical cache entries but only
virtually addressed entries. However, the performance gain is small and the on–chip
silicon overhead for this feature would be relatively high.
The US bit–field of each cache block tag indicates if the address relates to User
mode or Supervisor mode code. The US bit becomes set when the cache is reloaded
while the SM bit is set in the CPS register. This allows cache entries to be used for
both User mode and Supervisor mode code at the same time, and entries can remain
valid during application system calls and system interrupt handlers which execute in
Supervisor mode.
Following sections present further detail about instruction caching for
individual 29K family members. Table 5-2 summarizes this information.
Table 5-2. Instruction Cache Comparison

                            Am29000             Am29030             Am29240
Processor                   Am29050                                 Am29040
Addressing                  Virtual             Virtual             Virtual
Cache associativity         2–way set           2–way set           2–way set
Valid bits per block        4 bit               1 bit               4 bit
Per–process identifiers     No                  No                  No
Replacement selection       Random              Random              Random
Direct cache access         No                  via CIR and CDR     via CIR and CDR
Reload blocking             No                  Yes                 No
Target word first reload    Yes                 No                  Yes
Cache locking               No                  Per–column          Per–column
5.13.1 Cache Locking and Invalidating
Cache locking is an issue when addressing techniques other than physical are
used by an application or operating system. There is often an expressed desire to lock
critical data into the cache and prevent its displacement when User mode address
translation changes. The objective is to improve performance by out–smarting the
random replacement algorithm used for cache reload. In practice this objective is
difficult to achieve. If code is frequently executed, and thus critical to overall
performance, it will naturally be placed in the cache. The random replacement
technique is effective at finding the critical code. It would be difficult and possibly
over ambitious to consider that a programmer, unless supported with sophisticated
tools, could achieve a better result.
The cache can be invalidated in a single cycle using an INV or IRETINV type
instruction. However this invalidates all User and Supervisor mode entries. It might
be possible to improve the execution speeds of Supervisor mode code and interrupt
handlers by keeping them locked in the cache. This may also reduce interrupt latency
times, but no doubt at the cost of reduced User mode code execution. The non BTC
processors, that is, the 2–bus processors and microcontrollers, provide a means of
locking the cache.
Locking valid blocks (or entries) into the cache is not provided on a
per–block basis but in terms of the complete cache or one of the two columns.
When a column is locked, valid blocks are not replaced; invalid blocks will be
replaced and marked valid and locked. Cache locking can be applied before
preloading the cache with instruction sequences critical to performance. Instruction
cache locking is achieved by setting the IL field of the CFG configuration register.
When the cache is locked, an INV type instruction will not cause block invalidation
unless the cache is also disabled. Column 0 and column 1 of each set can be locked or
only column 0 locked. When only column 0 is locked, replacement of blocks in
column 1 continues on a direct mapping basis. That is, there is only one location in the
cache which can cache any particular memory address. This results in increased
cache reload activity which reduces the effectiveness of cache.
As an illustrative exercise, consider the code necessary to invalidate only User
mode cache entries. For a 4K byte Instruction cache there are 1K instructions cached
in 256 blocks of four instructions. Given the two–way–set approach, there are 128
sets; each set containing one block in each of the two columns. The following code
scans the 128 blocks of column 0, and invalidates the block only if the entry is found
to cache User mode code. Note, the cache must be disabled while being accessed via
the Cache Interface (CIR) and Cache Data (CDR) registers. These registers enable
cache tags and data to be directly read and written.
        const   gr64, 0x100             ; set the ID–bit
        mfsr    gr65, cfg               ; read CFG register
        or      gr65, gr65, gr64        ; disable cache
        mtsr    cfg, gr65               ; write CFG config.
        ;
        const   gr64, 128–2             ; scan 128 blocks
        const32 gr65, 0x10000000        ; FSEL=01, tag read
        const32 gr67, 0x01000000        ; R/W OR mask
        const   gr68, 0                 ; zero value
next:
        mtsr    cir, gr65               ; prepare to read tag
        mfsr    gr66, cdr               ; read tag–status word
        sll     gr66, gr66, 31          ; test US–bit
        jmpt    gr66, keep              ; jump if Super. mode
        or      gr66, gr65, gr67        ; set the RW–bit to write
        mtsr    cir, gr66               ; prepare to write tag
        mtsr    cdr, gr68               ; write zero into status
keep:
        jmpfdec gr64, next              ; test if all blocks tested
        add     gr65, gr65, 1*16        ; point to next block
        ;
        const   gr64, 0x100             ; set the ID–bit
        mfsr    gr65, cfg               ; read CFG register
        andn    gr65, gr65, gr64        ; enable cache
        mtsr    cfg, gr65               ; write CFG register
With a 2/1 memory system, testing and invalidating each block takes 10 cycles
(2/1 refers to the memory system access times: 2 cycles for the first access and 1 cycle
for each subsequent access). This amounts to 1280 cycles for all blocks in column 0, or
51.2 microseconds for a 25 MHz processor. Actual use of the example code presents a
considerable overhead and is unlikely to achieve an overall system benefit over
simply invalidating the whole cache in a single cycle.
5.13.2 Instruction Cache Coherence
The 29K family does not contain unified instruction and data caches. Unified
caches can give a higher hit rate than split caches of the same total size. However,
separate instruction and data caches enable a higher performance due to
simultaneous accesses during the same processor cycle. There are fewer problems with
instruction cache coherence than data cache coherence. This is because a memory
supplying instructions is unlikely to be modified by another processor or external
DMA controller. Yet, a processor can use store instructions to place new instructions
in memory (assuming a write–through policy described in the following Data Cache
Maintenance section). When this occurs it is possible that the affected memory may
be already located in instruction cache. It is important that the instruction cache be
invalidated after self modifying code has changed memory which will later be
accessed for instructions. Because cache invalidation can only be performed by
Supervisor mode code, a system call service may be required to invalidate the cache.
The Instruction cache operates with virtual address tags when address
translation is in use (physical instruction (PI) bit clear in CPS register). The cache
tags do not contain any per–process identifiers, but can distinguish between User or
Supervisor mode access. When address translation is used, it is possible that a User
mode virtual address maps to the same physical address as a Supervisor mode virtual
address. However, the cache would assign separate blocks to each of the virtual
addresses. Hence, the instructions on shared instruction pages could be cached twice.
This results in inefficient use of the cache but is unlikely to lead to any problems
unless the instructions on the shared physical page are modified. Note, double caching
between two User mode processes which map their virtual addresses to the same physical
page can not occur, as the cache must be invalidated when a process context switch occurs.
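For example, a loader or a program generating code at run time might use a helper of the
following form. This is only a hedged sketch; the system call name is an assumption, and
only the need for a Supervisor mode service to perform the invalidation comes from the
text above.

/* Hedged sketch: write new instructions to memory, then ask the
 * operating system to invalidate the instruction cache. */
extern int sys_icache_inval(void);          /* assumed OS service wrapping
                                               an INV/INVI instruction     */

void install_code(unsigned *dst, const unsigned *src, unsigned nwords)
{
    unsigned i;
    for (i = 0; i < nwords; i++)
        dst[i] = src[i];                    /* data writes go to memory    */
    (void)sys_icache_inval();               /* drop stale cached copies    */
}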
5.13.3 Branch Target Cache
The Am29000 and Am29050 3–bus processors have a Branch Target Cache
(BTC) which can supply the first four instructions of a previously taken branch.
The Am29000 processor can cache 32 branch targets. The arrangement is the
usual two sets with 16 blocks (or entries) in each set. The Am29050 processor is
configurable to cache 64 branch targets, each block containing four instructions.
Alternatively, 128 blocks, still arranged in two sets, can be used, each containing only
two instructions. The smaller block size makes more effective use of the cache when the
BTC is required to hide a smaller instruction memory access latency (see section
1.9).
The programmer has little control over BTC operation; it is maintained
internally by processor hardware. There is no means of accessing or preloading the
cache via the cache interface registers provided on other 29K family members.
Additionally, there are no cache lock bits provided for in the CFG register. The cache
can be disabled by setting the CD bit in the CFG register; and invalidated by
executing an INV or IRETINV instruction.
5.13.4 Am29030 2–bus Microprocessor
The Am29030 has an 8K byte instruction cache; 4K bytes being provided by
each of the two columns. The Am29035 only provides column 0 and hence has 4K of
cache (this results in the Am29030 having typically a 20% performance advantage
for large programs). These processors were the first 29K family members to have non
BTC–type instruction cache. When a branch instruction is executed and the block
(cache entry) containing the target instruction sequence is not found in the cache, the
processor fetches the missing block and marks it valid. Complete blocks are always
fetched, even if the target instruction lies at the end of the block. However, the cache
forwards instructions to the decoder without waiting for the block to be reloaded. If
the cache is enabled and the block to be replaced in the cache is not both valid and locked,
then the fetched block is placed in the cache. Note, complete blocks are fetched even
when the cache is disabled. This is a little wasteful if the target of a jump or branch is
not the first address in a block.
Blocks are tagged on a per–block basis. There is only one Valid bit in the block
status information. This bit is not set until the processor has fetched an entire block
with no errors. Blocks which are fetched ahead during prefetch buffer filling are not
marked valid if execution does not continue into the block. Filling the prefetch buffer
in this way enables burst–mode access to be maintained for longer intervals; and
hence reduce overall access delays. LOAD or STORE instructions can occur at any
time; however, the Am29030 processor completes the fetch of the current block
before starting the data access. This is because it is probably more efficient to
complete the instruction fetch, which is likely in single–cycle burst–mode. The
cache reload characteristics of the Am29030 processor (reload blocking) further
emphasise the importance of scheduling LOAD instructions ahead of the time the
data is required for further operations. The current tools for the 29K family do not
support code positioning such that the targets of call and jump instructions begin on a
block boundary. This would lead to an expansion of code space requirements and is
likely to produce little performance improvement.
5.13.5 Am29240 and Am29040 Processors
The Am29240 microcontroller has a 4K byte instruction cache. The Am29040
2–bus microprocessor has an 8K byte instruction cache. The caches are implemented
using a similar two–way set associative architecture. The major difference from the
earlier Am29030 processor cache is that the block status information has a valid bit
per instruction. The resulting four bits enable partially filled cache blocks to be
supported. This has been shown to produce an average performance gain of 4% over
the valid bit per block method. However, the performance difference may be larger
for code which contains an unusually large number of branch instructions. Note, the
Am29240 microcontroller only caches instructions held in DRAM or SRAM
address regions.
Because cache block contents are validated per instruction rather than on a per–block
basis, it is possible to interrupt cache reload with a higher priority operation. This means
LOAD instructions need not wait
till the end of the current block reload before they can gain access to the processor
busses. Unlike the block oriented cache of the Am29030, cache reload begins with
the target instruction of a branch, not the first instruction of the block. As with the
Am29030, instructions are forwarded for execution in parallel with cache block
reload. During instruction prefetch, the next block is fetched ahead if it is not already
in the cache or if any of its valid bits are clear.
The instruction cache can be invalidated in a single cycle using an INV or
IRETINV instruction. These instructions also simultaneously invalidate the data
cache. To invalidate only the instruction cache, instructions INVI and IRETINVI are
provided.
5.14 DATA CACHE MAINTENANCE
Newer members of the 29K family can operate with internal processor speeds
which are higher than the off–chip memory system speeds. This ability is known as
Scalable Clocking. To obtain the processing benefits of the higher internal pipeline
speed, it becomes important to prevent pipeline stalling due to accesses to any
off–chip data memory. For this reason, on–chip data cache has been incorporated into
the 29K family. When a cache hit occurs, the accessed data is supplied by the cache
rather than off–chip memory. If the number of cache hits can be kept high, the
potential pipeline stalling which results from a cache miss can be minimized.
As with instruction caches, two–way set associative addressing is used (see
section 6.2). However, unlike instruction caches, 29K family data caches are always
accessed with physical rather than potentially virtual addresses. Physically
addressed caches have advantages over virtually addressed caches. For example,
they do not need to be invalidated on a task context switch; they do not need extra tag
information to distinguish virtual from physical access and Supervisor from User
mode access; and importantly, cache coherence problems are more easily solved with
a physically addressed cache. It is somewhat more difficult to implement a physically
addressed data cache. Virtual data addresses must first be converted to physical
addresses before cache access can be attempted. The required address translation
followed by the cache access overhead can introduce a delay before the cache can
respond with the requested data. As internal processor speeds increase, the cache
may not be able to respond within a single cycle, thus introducing the potential for
pipeline stalling if load instruction scheduling is not performed.
The data cache is enabled by clearing the Data Cache Disable (DD) bit in the
CFG configuration register. Data caches support accesses to byte and half–word
sized objects within a cached word. Cache tag information is associated with each
block (or cache entry), and the block size is four words (16 bytes). A 2K byte data
cache would have 64 sets, each containing two blocks (a total of 128 blocks given
there is a block for each of the two columns in a set). Individual cache entries can be
accessed via the Cache Interface (CIR) and Cache Data (CDR) registers. These
registers enable the data and tags of a cache block to be directly read and written.
There is only one Valid (V) bit for each block. This means blocks are never
partially filled and marked valid. A 29K data cache only allocates cache blocks to
data when a miss occurs during a data load operation. This is known as a
"read–allocate" policy. When a data store is performed and an address match is not
found in the cache, no cache block is allocated. This "no write–allocation"
policy has some advantages. It simplifies the cache design, as an “allocate on write”
policy may require a currently valid block to be written–back to memory before the
block is reallocated to cache the data block causing the cache miss. This would be a
complicated process as the reload and write–back activities both require access to the
system busses. Additionally, the instructions following the load instruction may also
require access to the system bus if they are not being provided by the instruction
cache. To implement an "allocate on write" policy which avoids this potentially
severe pipeline stalling would be expensive in terms of on–chip (silicon) resources.
Typically, when data is written–out to memory it is no longer required, as compilers
prefer to keep critical data in registers. Thus, typical patterns of data access indicate
that data written–out should not cause block allocation as the data is somewhat less
likely to be accessed again in the near future.
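As an illustration of the read–allocate, no write–allocate behavior just described, the following C fragment models the decision made on each access for a write–through configuration. It is a toy software model only; the structure layout, set count, and helper names are invented for the example and do not correspond to the actual cache hardware or to any AMD–supplied interface.

#include <stdint.h>

#define NUM_SETS    64        /* e.g. a 2K byte cache: 64 sets x 2 columns x 16 bytes */
#define NUM_COLUMNS 2         /* two-way set associative */
#define BLOCK_WORDS 4         /* four-word (16 byte) blocks */

typedef struct {
    uint32_t tag;             /* upper address bits */
    int      valid;
    uint32_t data[BLOCK_WORDS];
} block_t;

static block_t cache[NUM_SETS][NUM_COLUMNS];

/* A load allocates a block on a miss (read-allocate) and fills it from memory. */
uint32_t model_load(uint32_t addr,
                    void (*mem_read_block)(uint32_t block_addr, uint32_t *dest))
{
    uint32_t set = (addr >> 4) % NUM_SETS;     /* bits 9-4 select the set */
    uint32_t tag = addr >> 10;                 /* remaining upper bits    */
    int col;

    for (col = 0; col < NUM_COLUMNS; col++)
        if (cache[set][col].valid && cache[set][col].tag == tag)
            return cache[set][col].data[(addr >> 2) & 3];      /* hit */

    /* Miss: allocate a block and reload the whole 16 byte block. */
    static unsigned int victim_toggle;
    block_t *b = &cache[set][victim_toggle++ & 1];  /* real hardware picks a column randomly */
    b->tag = tag;
    b->valid = 1;
    mem_read_block(addr & ~0xFu, b->data);
    return b->data[(addr >> 2) & 3];
}

/* A store updates the cache only on a hit and never allocates (no write-allocate);
   with a write-through policy the data always goes to memory as well. */
void model_store(uint32_t addr, uint32_t value,
                 void (*mem_write_word)(uint32_t word_addr, uint32_t data))
{
    uint32_t set = (addr >> 4) % NUM_SETS;
    uint32_t tag = addr >> 10;
    int col;

    for (col = 0; col < NUM_COLUMNS; col++)
        if (cache[set][col].valid && cache[set][col].tag == tag)
            cache[set][col].data[(addr >> 2) & 3] = value;     /* hit: update cache */
    mem_write_word(addr, value);                               /* always write memory */
}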
When stores are performed on data which is not currently in the cache, or to data
which is supported with a “write–through” policy, a write–through buffer is used to
assist the operation. The buffer is two words deep and holds store–data which is
waiting for access to the memory bus. This enables the processor to continue
executing new instructions and not wait till the store is complete. The pipeline only
stalls when there are more than two outstanding stores waiting to be written into
memory. This seldom happens, but when it does, the write–buffer which normally
has the lowest priority is given a higher priority for accessing the system busses.
Because load instructions have a bigger impact on performance than store
instructions, cache reload may be performed before the write–buffer is emptied. The
Am29240 has dependency logic to detect if a load is performed on a data address
which is currently pending in the write–buffer. The data is forwarded from the
write–buffer when necessary. Because the Am29040 has a copy–back rather than
write–through policy, the write–buffer is first flushed before loads that miss in the
cache are performed –– this is explained in the later Am29040 Microprocessor
section.
The write–buffer is disabled when the data cache is disabled. In this case the
processor is not decoupled from the performance of memory writes. Before interrupt
processing commences or when a serializing instruction is executed, the write buffer
is flushed. Additionally, execution of LOADL or LOADSET instructions (which
bypass the data cache) is preceded by write–buffer flushing. Store instructions are
properly ordered, and since the STOREM instruction bypasses the write–buffer, the
buffer is emptied before the STOREM commences.
Data cache reload, resulting from a load access which missed, always fills a
complete block. The process of reloading the cache is assisted with a reload buffer
which temporarily holds the data fetched from memory. The cache reload buffer is
four words deep. When the buffer is full it is transferred into the cache in a single
cycle when the cache is currently not being accessed. Code continues to execute
during cache reload; and the cache will continue to service cache accesses which hit.
However, if a further data load operation is performed on data not found in the cache,
the processor pipeline will stall until the current reload operation is complete. When
the reload buffer becomes available the second reload operation will commence (if
necessary) and the pipeline will restart instruction processing.
The following sections present further detail about data caching for individual
29K family members. Table 5-3 summarizes this information.
5.14.1 Am29240 Microcontroller
A block diagram of the Am29240 cache architecture is shown on Figure 5-7.
The precise cache implementation may differ from the diagram but the data flow
paths can be seen.
A buffered “write–through” policy is implemented for all data stores. If write
data matches with a cached entry, then the cache is updated during the same cycle as
the store. All stores cause writes to off–chip memory, but the write–through buffer
enables the processor to continue code execution while the stores are completed in
parallel.
The cache is accessed in the execute stage of the pipeline even if address
translation is in use. This makes data that hits in the cache available for the instruction
following the load without any pipeline stalling. However, scheduling of load
instructions is still required in case of data misses, which are still subject to the
access latencies of the external memory.

Table 5-3. Data Cache Comparison

Processor                     Am29040              Am29240, Am29243
Addressing                    Physical             Physical
Cache associativity           2–way set            2–way set
Valid bits per block          1 bit                1 bit
Write–through buffer          2 words              2 words
Reload buffer                 4 words              4 words
Copy–back buffer              4 words              –
Copy–back policy              Selectable           No
Write–through policy          Selectable           Always
Non cachable regions          On a per–page basis  For PIA space
Critical word first reload    No                   Yes
Reload memory access          Burst mode           Page mode
Bus snooping                  Yes                  No
LOADM causes reload           No                   Yes
Cache locking                 Per–column           Per–column
Cache block allocation        Only on LOAD         Only on LOAD[M]
LOAD hit latency              2–cycles             1–cycle
Replacement selection         Random               Random
Data cache reload always fills a complete block. The format of the cache tag and
status information is very simple, as shown in Figure 5-8. Reload always begins with
the “critical word first”. The critical word is the word containing the requested data.
The critical word is fetched and forwarded to the appropriate execution unit and to
the cache reload buffer. Reload continues with the remaining words in the block and,
if necessary, wraps at the end of the block to fill the remainder of the block. To
increase cache reload speed, the processor attempts to use page–mode accesses
when loading from DRAM. Note that burst–mode addressing can not be used, as the
block may not be accessed with consecutive addresses due to critical word first
reload.

Figure 5-7. Am29240 Microcontroller Cache Data Flow (diagram: instruction cache
and instruction prefetch buffer on the PC and instruction busses; data cache with its
reload buffer and write–through buffer on the instruction/data and address busses)

Figure 5-8. Am29240 Data Cache Tag and Status bits (Address Tag, V)
The processor only caches accesses made to DRAM or ROM address regions.
The write–through policy ensures that data in external memory is always consistent
with data held in cache. Accesses to other address regions or on–chip peripherals are
not cached. When polling the status of a peripheral device, it is important that status
data not be cached. This means that off–chip peripherals should be placed in PIA
space or other non cached space.
When developing code in the C programming language, the key word volatile
can be used to indicate that data should not be held in internal registers. However, this
data may still be cached. Hence, marking data volatile is insufficient to ensure that it
is always accessed from off–chip memory. If memory can be modified by some other
device, either via dual–port memory or external DMA controller, it is important that
the cache be kept coherent with memory. This can be accomplished by signaling the
processor when a DMA type transfer is complete. The processor can then invalidate
the cache. Because the cache normally contains a copy of the memory data (due to the
write–through policy), all modifications to cached data are already reflected in the
memory state. Note that marking data volatile may reduce the compiler's ability to
produce highly optimized code, as load scheduling is restricted across the boundary
created by a volatile memory access.
Cache invalidation due to DMA type access can be avoided if the data
concerned is never cached. With the Am29240 microcontroller there is no way of
marking data as non–cacheable. However, data which is accessed via LOADL
(load and lock) instructions is never allocated for cache use. A convenient way of
ensuring that the compiler only generates code which accesses the critical data with
LOADL and STOREL instructions has been added to newer versions of the High C
compiler. When the key word _LOCK is used (along with volatile) to define the data
type of a variable, LOADL instructions are used in place of LOAD when accessing
the associated data. Consider the example below:
typedef _LOCK volatile unsigned char UINT_8;

unsigned char uart_data;     /* cacheable copy of UART data */
UINT_8 *uart_p;              /* uart_p must hold uart address */

uart_data = *uart_p;         /* access the UART */
If the _LOCK volatile approach is not available, it may be possible to take an
object–oriented approach to DMA–affected data. The critical data could be
modified only with an object member function. The member function (probably a leaf)
could be written in assembler and use the LOADL instruction. Of course, using such a
simple function to perform a task which would normally be accomplished with
in–line code would have a performance impact. However, this may be better than
invalidating the whole cache with each DMA occurrence. Note that directly setting
the Lock (LK) bit in the Current Processor Status (CPS) register will ensure that the
Lock pin is asserted during load and store operations, but does not result in data cache
bypassing.
When register stack filling occurs, a LOADM instruction is used to restore local
registers which were previously spilled. The data loaded during the filling will be
allocated in the cache and possibly displace other cached data. However, the filled
data is intended for the register file only, and will never be accessed by load and store
instructions. This somewhat reduces the effectiveness of the cache; but, since
register stack filling is a very infrequent occurrence it is unlikely to have any serious
impact on performance.
If filling always occurred in Supervisor mode, it would be very easy to add code
to the fill_handler (see section 4.4.5) which disabled the data cache on entry and
reenabled the cache after the LOADM instruction. Valid data is retained in the cache
while it is disabled. The cache is disabled and enabled in Supervisor mode by
respectively setting and clearing the DD bit in the CFG configuration register. This
would prevent any cached data being replaced by the fill operation. However, filling
is normally accomplished by trampolining from a Supervisor mode trap handler,
FillTrap (see section 4.4.3), to the User mode fill_handler. This introduces a
difficulty. It would be simple to disable the cache in the FillTrap code, but after
returning to User mode, access to the CFG register is not directly permitted. It would
be possible to take a trap at the end of fill_handler to reenter Supervisor mode,
enable the data cache and then IRET back, but it seems unlikely that the additional
overhead (although small) would produce a noticeable performance gain. Another
difficulty with temporarily disabling the cache is that an interrupt may occur. The
interrupt handler or operating system support code would then have the burden of
reenabling the cache. However, it may be worthwhile for operating system code to
disable the data cache while reloading the local register file during a task context
restore.
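A minimal sketch of the idea just described, for operating system code that restores the local register file in Supervisor mode, is shown below. The helpers read_cfg() and write_cfg(), the mask CFG_DD, and restore_local_registers() are hypothetical stand–ins for the mfsr/mtsr accesses to the CFG register, for the DD bit position (which should be taken from the processor data sheet), and for the LOADM–based fill; none of them are AMD–supplied routines.

/* Sketch only: disable the data cache around the LOADM used to reload the
   local register file during a task context restore (Supervisor mode). */

extern unsigned int read_cfg(void);              /* wraps an mfsr from CFG  */
extern void write_cfg(unsigned int value);       /* wraps an mtsr to CFG    */
extern void restore_local_registers(void);       /* performs the LOADM fill */

#define CFG_DD  (1u << 0)    /* assumed position of the Data Cache Disable bit */

void restore_task_registers(void)
{
    unsigned int cfg = read_cfg();

    write_cfg(cfg | CFG_DD);         /* disable the data cache; contents retained */
    restore_local_registers();       /* filled data does not displace cached data */
    write_cfg(cfg);                  /* restore the previous DD setting           */
}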
5.14.2 Am29040 2–bus Microprocessor
A block diagram of the Am29040 cache architecture is shown on Figure 5-9.
The precise silicon implementation may differ from the diagram but the data flow
paths can be seen.
The default policy of the cache is “copy–back” rather than “write–through”.
Stores do not always cause writes to off–chip memory, as they would with a
write–through policy. Consider when a currently valid cache block is to be
reassigned to a new memory location. The write–through policy enables the block to
be simply reallocated without having to copy its contents to memory. The copy–back
policy eliminates the need to write all stores to memory, but requires that reallocated
blocks be copied–back to memory before they can be used for higher priority data.
To improve the performance of the copy–back policy, the processor has a four
word copy–back buffer which is loaded in a single cycle. This makes the selected
block immediately available for reload. The copy–back buffer data is transferred to
memory when the system bus becomes available –– certainly after reload is
complete. Thus, loads that miss in the cache do not need to wait for a block to be
written back to memory before the required data can be read from memory.

Figure 5-9. Am29040 2–bus Microprocessor Cache Data Flow (diagram: instruction
cache and instruction prefetch buffer on the PC and instruction busses; data cache
with its reload buffer, write–through buffer, and four–word copy–back buffer on the
instruction/data and address busses)
The cache is accessed in the write–back stage of the pipeline. Tag comparison
and any required address translation is performed during the execute stage. This
makes data that hits in the cache available for the second instruction following the
load without any pipeline stalling. Compared to the Am29240 microcontroller, this is
an extra cycle of latency. The reason is the higher internal clock speeds of the
Am29040 processor. Scheduling of load instructions is always required, particularly
in the case of a data miss, which will stall the pipeline for a period that increases with
the access latency of the external memory.
Store operations that hit in the cache also require two cycles to complete. To
enable the cache to satisfy a load instruction which follows a store, the second cycle
needed for cache access can be postponed. Stores that hit in the cache make use of the
write–through buffer just like stores that miss. The write–through buffer completes
the second required cycle of a store when the cache is free.
Because the write–through buffer can contain data for a store that hit in the
cache, the write–buffer must be flushed before cache reload can be performed. To
understand this, consider that the write–buffer may contain data for a modified block
which must be written back before the block can be reallocated. The write–buffer can
not forward the store data to the cache block after it has been assigned to a new
memory address.
Not all cache blocks need to be written back to the system memory. The format
of the cache tag and status information is shown on Figure 5-10. The tag information
contains a Modify (M) bit. When a block is first reloaded the valid bit is set and the M
bit is cleared. If a store (which is not write–through) is performed to an address in the
block, a hit occurs and the cache satisfies the access. At the same time the M bit is set
indicating the block has been modified. If the block is reallocated, it will be copied
back only if the M bit is set. Otherwise the block can be reloaded without the
copy–back being performed.
Figure 5-10. Am29040 Data Cache Tag and Status bits (Address Tag, V, S, M)
Data cache reload always fills a complete block. Unlike the Am29240
microcontroller, reload with critical word first is not performed. The processor will
use burst mode when reloading a block and will start with the first word in the block.
When the critical word is accessed during reload it is forwarded to the execute unit.
This enables reload to continue in parallel with code execution. If the critical word
had been accessed first, and it was not the first word in the block, burst mode access to
the memory block would have to be disrupted. This would increase the overall reload
time and would be particularly noticeable for back–to–back loads which miss in the
cache. Data cache reload is given priority over instruction cache for access to the
system busses. Loads issued while the cache is disabled, or to non cachable data, only
fetch the critical word from memory.
There is a minimum access latency of 3–cycles for the first word in a reloaded
cache block. This is true even if the off–chip memory system has the minimum access
latency of 2–cycles. When a block is reloaded it is possible the block will be supplied
by another Am29040 processor (via data intervention) rather than the memory
system. Data intervention is not asserted until the third cycle after the address of the
first word in the block appears on the address bus. The memory system may supply
the data in two cycles, but the processor holds the data internally for one cycle in case
data intervention occurs. Because cache reload is always block oriented,
intervention only occurs with the first word of the block. If the memory system
latency is 3 cycles or more, the processor does not delay the forwarding of the first
data word. For single–cycle burst–mode memories, the remaining data words are not
delayed internally by an additional cycle (given 2–cycle first access) unless a
load–multiple is being performed. Only accessed data, rather than reloaded data
values, are sent to the 29K data channel.
Peripheral devices such as a UART can be accessed at physical memory
locations determined by specific system hardware. Because the status and data of an
external device can change at any time, it is undesirable to cache their contents.
Access to these devices is normally accomplished in Supervisor mode. On entering
Supervisor mode the data cache could be disabled by setting the DD bit in the CFG
configuration register (this happens automatically if the FPD bit is not set in the CFG
register). This may be convenient for assembly level code as the cache may only be
disabled for a short time. Alternatively, assembly code could use LOADL
instructions (which, unlike the Am29240, may cause block allocation) when
accessing peripherals. The LOADL instruction always accesses off–chip memory.
However, if operating system code is implemented in C then it is desirable, for
performance reasons, that the operating system data also be cached. A note of
caution: when the cache is disabled its contents are retained. Consequently, if
currently cached memory locations are modified while the cache is disabled, the
cache will supply out–of–date data when it is reenabled. This must be avoided.
The key word volatile can be used in C to indicate that data should not be held in
internal registers. However, this data may still be cached. Hence, marking data
volatile is insufficient to ensure that it is always accessed from off–chip memory. As
described in the previous Am29240 section, defining the data type to be _LOCK
volatile is one way of instructing the compiler to use only LOADL instructions when
accessing peripherals. The Am29040 has an alternative; the MMU can be configured
to disable caching of selected memory pages. This means the operating system code
(or for that matter User mode code) must run with address translation turned on. When
TLB reload occurs, the memory management software must ensure the two–bit field
(PGM) of the TLB registers is set to “non cachable” for memory addresses
containing mapped peripherals. The PGM field format is shown on Table 5-4.
Table 5-4. PGM Field of the Am29040 Microprocessor TLB

PGM1   PGM0   Effect
0      0      normal (copy–back)
0      1      write–through
1      0      reserved
1      1      non cachable
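As a sketch of how TLB reload software might mark a peripheral page non cachable, the fragment below builds the PGM field into a TLB entry word. The constant PGM_SHIFT and the helper write_tlb_entry() are assumptions made for the example only; the actual field position and the TLB register write sequence are defined in the Am29040 documentation.

/* Sketch only: encode the PGM field for a non cachable page. The PGM values
   follow Table 5-4; PGM_SHIFT and write_tlb_entry() are hypothetical. */

#define PGM_NORMAL        0u   /* copy-back (default) */
#define PGM_WRITE_THROUGH 1u
#define PGM_NON_CACHABLE  3u

#define PGM_SHIFT         6    /* assumed bit position of the two-bit PGM field */

extern void write_tlb_entry(unsigned int set, unsigned int word1);

void map_peripheral_page(unsigned int set, unsigned int word1)
{
    word1 &= ~(3u << PGM_SHIFT);                 /* clear any existing PGM value */
    word1 |=  (PGM_NON_CACHABLE << PGM_SHIFT);   /* mark the page non cachable   */
    write_tlb_entry(set, word1);
}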
Data loads performed to memory locations which are marked non cachable are
not subject to data intervention. There is never any internal delaying of data in case
data intervention occurs late in the access. Hence, the critical word can be accessed in
a minimum of 2–cycles –– given a 2–cycle memory system. This requires that a data
region which is not cached by a processor also not be cached by any other processor
(in a multiprocessor system).
With virtual addressing in use, the MMU can be configured to select a
write–through policy on a per–page basis. The write–through policy ensures that
data in external memory is always consistent with data held in cache, as all store
instructions are applied to memory (and to the cache if they hit). Selecting this policy
for all memory pages would result in a poorer overall performance compared to the
default copy–back policy. However, regions of address space may be allocated to
peripherals which require immediate update. For example, video memory should be
marked as write–through rather than non cachable. There is a definite advantage to
accessing video frame information from the cache when manipulating images.
Additionally, system implementations which fail to deal with the additional
hardware signals needed to support bus snooping may use write–through access to
assist with cache coherence problems. A write–through policy would only enable an
external agent to read shared data; it would not be able to modify the data.
To achieve the best performance, application code will likely use the data cache
with copy–back operation selected. However, there are situations when an
application will prefer write–through cache operation –– at least for portions of the
memory space. Memory locations are frequently used to pass data between operating
system and application code. If the operating system were to use copy–back data
cache operation (the default), there would be a danger that some data blocks
(accessed by the operating system) would be cached and their M bit set; later, when
returning to the application, the block may be within a memory page which is marked
write–through, which would prevent the block being copied–back should the block be
reallocated. It is best to run the operating system with address translation turned on.
This enables the MMU to control the cache operation for memory pages which are
jointly accessed by the operating system and application code. To simplify this task,
the configuration register has a Freeze–PD (FPD) bit. When this bit is set the
Physical Data (PD) bit in the CPS register is not set when the operating system is
entered via a trap or interrupt. The FPD bit enables the PD bit to remain unchanged.
Thus, if address translation was enabled in the application, it will remain enabled
after a trap or interrupt. The data cache need not be disabled when the operating
system is entered. However, the MMU must ensure a consistent cache policy for
memory pages which are jointly accessed by operating system and application code.
The Am29040 processor deals with other agents, such as a DMA controller,
accessing the same memory by performing bus snooping. Multiprocessor designs,
with on–chip caches, are also supported by the snooping protocol. The Shared (S) bit
in the cache tag is used to support the protocol. The S bit becomes set when a write is
performed to an address which causes block reload, and the block is supplied by
another cache –– more on this in the following Cache Consistency section.
The Am29040 processor, unlike the Am29240, does not allocate cache blocks
for data fetched with a LOADM instruction. This prevents needless displacement of
valid cache blocks when a register stack fill is performed. Block allocation during a
LOADM in conjunction with a copy–back policy would have poor performance
given that the copy–back buffer is only four words deep. The copy–back buffer and
the LOADM instruction would both be competing for access to the system bus.
5.14.3 Cache Locking and Invalidating
Valid data cache blocks can be locked by appropriately setting the DL field of
the CFG configuration register. The entire cache can be locked or only column 0. If a
block is locked but still invalid, it can be allocated for caching. Critical data can be
placed in the cache by first locking the cache and then loading the required data. This
effectively turns the cache into a small fast RAM for critical data. (However, a
write–through policy, if used, will still cause all writes to be duplicated in off–chip
memory). If only column 0 is locked the remaining column 1 will still cache entries
with a direct–mapping replacement scheme. Typical applications show best
performance when the cache is not locked but left to the default scheme of caching
the most recently accessed data.
The cache can be invalidated in a single cycle by issuing an INV or IRETINV
type instruction. All blocks are marked invalid during this process unless the cache is
locked. A locked cache can only be marked invalid if it is first disabled before
invalidating.
The copy–back policy of the Am29040 makes cache invalidation more difficult.
Valid cache blocks which have been modified can not be simply marked invalid.
Failure to write–back modified blocks would leave the memory in an out–of–date
state. Because the data cache operates with physical address tags and performs bus
snooping, there is very little reason to invalidate the cache. Cache invalidation can be
safely performed by using the cache interface registers (CIR and CDR) to examine
each block to determine if the block is valid and if the modified bit (M bit) is set.
When set, the block must be written out to memory before an INV type instruction is
used.
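A sketch of this safe invalidation sequence is shown below. The helpers read_block_status() and write_back_block(), the invalidate_data_cache() wrapper, and the geometry constants are hypothetical stand–ins for the accesses that would actually be made through the CIR and CDR registers and for the INV type instruction; they are not an AMD–supplied interface.

/* Sketch only: write back modified blocks before invalidating the Am29040
   copy-back data cache. */

#define DC_SETS    64    /* assumed cache geometry: sets per column */
#define DC_COLUMNS 2     /* two-way set associative                 */

struct block_status { int valid; int modified; };

extern struct block_status read_block_status(int set, int column);
extern void write_back_block(int set, int column);    /* copy the block to memory     */
extern void invalidate_data_cache(void);              /* issues an INV type instruction */

void safe_invalidate(void)
{
    int set, col;

    for (set = 0; set < DC_SETS; set++)
        for (col = 0; col < DC_COLUMNS; col++) {
            struct block_status s = read_block_status(set, col);
            if (s.valid && s.modified)
                write_back_block(set, col);   /* preserve modified data */
        }
    invalidate_data_cache();                  /* now safe to mark all blocks invalid */
}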
5.14.4 Cache Consistency
The Am29040 is currently the only processor in the 29K family which contains
on–chip data cache consistency hardware. Cache consistency becomes an issue
when there is more than one cache in a multiprocessor system or when a DMA type
device is also accessing data regions which are cached. When there is more than one
agent trying to access data, it is important that all agents agree upon a single (and
most recent) value. A solution to this problem is for each processor to make virtual
address access to the shared memory pages and mark the pages as non cachable.
However, unless all memory is marked non cachable the plan would require that
software arrange for data intended for shared memory to appear in a range of
contiguous non cached memory. There would need to be an agreement with the
operating system that the selected address range was not to be cached. Such a
mechanism would be undesirable, inflexible, and difficult to retrofit to existing
software.
With systems incorporating multiple Am29040 processors, each processor may
cache the same memory location. This is desirable, as access to the cache is much
faster than off–chip access. The processor supports three interface signal pins which
facilitate “bus watching” for data reads with cache block granularity. The technique
requires little software support, and existing programs can benefit without any
modifications. The on–chip protocol supporting the interface signals ensures that
each memory access is consistent.
When a load is performed, all processors watching the bus determine if they
have a currently cached copy of the requested data. If they do, they assert the HIT
signal pin. The protocol will enable one cache to identify itself as the owner of the
data. This cache will assert both the HIT and the DI (“data intervention”) signals. The
processor requesting the load is satisfied by the intervening cache. The load will
cause a block to be allocated with the S bit set in the tag. This indicates the data is
shared. The processor can continue to access the data from the cache. Additionally,
all processors asserting the HIT signal will realize that another processor is sharing
the data and will set the S bit in their cached copy. If any processor modifies a block
tagged with the same address, that processor will perform a “write broadcast” as a
result of the S bit being set. This does not cause the system memory to be updated, but
enables the snooping processors to update their cached copies. A processor asserts
the WBC signal pin during the write broadcast and becomes the owner of the shared
block. The processor will remain the owner of the block until another processor gains
ownership by performing a write broadcast itself. When a processor performs a write
broadcast it checks to see if another processor is asserting the HIT signal; if not, then
the processor realizes it is now the only processor caching the data and therefore clears
the S bit.
To summarize, bus watching of reloads is used to detect sharing of data. When
data is shared all caches set the S bit in the cached block. The processor which
satisfied the block reload (in place of the memory) is the owner of the block and has
the S and M (modified) bits set in the block tag. Writes to shared data create write
broadcasts on the bus to inform other caches of the change of value. Ownership of a
block is transferred to the processor performing the write broadcast. Cache–to–cache
communication via write broadcasts is a lot faster than accessing slower system
memory.
Bus watching monitors write–through and copy–back of cached data. Memory
regions which are accessed as write–through never have cache blocks which are
modified (that is, their tag M bit is never set). All writes to such regions are performed
to the system memory. Caches with matching blocks will update their data when the
write–through takes place. Only blocks which have been modified get copied back
when the block is reallocated. When a block is copied–back, other caches will retain
their clean copies of the shared data. There will now be no owner of the data. If
another cache performs a load for the data, no processor will intervene and the data
block will be fetched from memory. The data consistency protocol is sometimes
referred to as a “MOESI” protocol (reflecting the five states: Modified, Owned,
Exclusive, Shared, and Invalid).
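For reference, a minimal C rendering of the five states named above might look like the following; the state names simply mirror the protocol description and carry no further implementation detail.

/* The five MOESI block states named in the text. */
typedef enum {
    BLOCK_INVALID,
    BLOCK_SHARED,
    BLOCK_EXCLUSIVE,
    BLOCK_OWNED,
    BLOCK_MODIFIED
} moesi_state_t;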
The Am29040 processor supports an optimization for use with binary
semaphores. They are frequently used to enable or disable access to shared resources.
A processor can gain exclusive access to a resource via the LOADSET instruction.
The instruction atomically loads the value from the semaphore memory location and
then writes the set–value (0xffffffff) to the location. The loaded value can then be
tested; if it was already set, access is disallowed. Access to a shared resource is
granted when a zero semaphore is read. The process of accessing the semaphore with
a LOADSET instruction disables allocation of the resource to other requesting
agents. When acquiring unset semaphores, the processor maintains exclusive control
of the system bus.
When access is not granted, a processor will, typically, repeatedly access the
semaphore waiting for it to become unset. However, continually polling a memory
location which is held in shared memory can be a serious performance problem. To
prevent the associated bus activity, the Am29040 can cache binary semaphores. If a
processor busy–waits, the semaphore traffic is isolated to the processor's data cache.
Additionally, when a semaphore value is found to be set, further LOADSET
instructions are not granted access to the external bus until the semaphore is cleared.
The processor knows the semaphore is set by testing bit–31 of the cached value; in
such case there is no need to perform the SET portion of the LOADSET as the
semaphore is already set. The processor currently holding access to the semaphore
will perform a write broadcast when it releases the shared semaphore. A STOREL
atomic instruction is used to clear the semaphore value. The STOREL instruction,
like a LOADSET to an unset semaphore, has exclusive control of the bus during its
execution. The mechanism ensures that at any time only one processor can gain
access to a shared resource.
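A minimal sketch of a spin lock built on this mechanism is shown below. The wrapper routines loadset_29k() and storel_29k() are hypothetical names for small assembly routines (or compiler intrinsics) that execute the LOADSET and STOREL instructions; they are not part of any AMD–supplied library.

/* Sketch only: a binary semaphore using the LOADSET/STOREL mechanism
   described above. */

extern unsigned int loadset_29k(volatile unsigned int *sem);  /* atomic read, then write 0xffffffff */
extern void storel_29k(volatile unsigned int *sem, unsigned int value);

void semaphore_acquire(volatile unsigned int *sem)
{
    /* Busy-wait until the value read back was zero (unset). While the
       semaphore remains set, the polling traffic stays in the data cache. */
    while (loadset_29k(sem) != 0)
        ;
}

void semaphore_release(volatile unsigned int *sem)
{
    storel_29k(sem, 0);   /* atomic clear; a write broadcast informs other caches */
}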
5.15 SELECTING AN OPERATING SYSTEM
I am often asked by engineers about to start a 29K project what they should
look for when selecting an operating system. There are a number of companies
offering operating systems with a range of different capabilities; alternatively a
home–grown system could be constructed. The material covered in this chapter and
others should help in either constructing or selecting a suitable operating system. I
would certainly advise seriously considering purchasing rather than constructing.
The task may be enjoyable but probably more lengthy than most project timetables
will allow. However, for those who insist on building their own operating system,
AMD has a collection of useful routines which make a good starting point. Contact
AMD 29K customer support for a copy of the code.
There is usually no one right operating system. The choice depends on a number
of criteria which may vary from project to project. The following list presents several
questions which you need to ask yourself and possibly operating system vendors.
You can decide the importance of each item with regard to your project requirements.
Are 3–bus family members as well as 2–bus members supported? If the
Am29000 or Am29050 processors are to be used, and the data bus and
instruction bus are not to be tied together, then the operating system must be
clear about maintaining code and data in separate regions. The Harvard
architecture, supported by 3–bus memory systems, typically achieves a 20%
performance gain over 2–bus memory systems. Additionally, when 3–bus
systems are supported, the operating system may require the support of a
hardware bridge allowing the instruction memory to be reached (usually with
access delays) via a data memory access.
Is interruptible SPILL and FILL code supported? By running them with
interrupts disabled the difficulties of performing repair of the register stack
support registers can be avoided, should they be interrupted. However, they
require the support of multi–cycle LOADM and STOREM instructions, which
results in increased interrupt latency. Additionally, SPILL and FILL support
with interrupts disabled results in a larger overhead compared with
trampolining to support routines; thus it is non–optimal, as SPILLing and
FILLing occur a lot more often than their interruption.
Given that SPILL and FILL are interruptible, their operation is interdependent
with the longjmp() library routine and the signal trampoline code. All four of
these services must coordinate their manipulation of register stack support
registers if interrupts are to be reliably supported.
Some operating systems support nested interrupts, others do not; without nested
interrupt support, interrupt latency can be increased. The use of kernel threads
to complete interrupt processing is one way to keep down latency. If interrupt
handlers are to be written in a high level language such as C, it may be desirable
to support Freeze mode handlers in C. This greatly reduces the interrupt support
overhead, because the overhead of preparing the register stack for use by
non–leaf procedures is relatively high. Does the operating system under
consideration use interrupt tagwords to support interrupt context caching for
Freeze mode handlers?
An interrupt can be configured to generate a task context switch, the new task
being responsible for completing interrupt processing. This method has a
greater overhead associated with it than processing the interrupt in the context
of the interrupted task. Task context switching requires the register cache to be
flushed and reloaded with the incoming task’s register data. A C–level interrupt
handler can use the stack cut–across technique to avoid flushing the register
cache. Certainly some interrupts must cause task context switching to occur, but
it is best to avoid this approach as a general mechanism for dealing with
interrupts. Additionally, if tasks run in User mode, the instruction cache must be
flushed on a task context switch. It is best to reduce the number of cache flushes
due to interrupt support.
If the system is to support a high interrupt throughput, then processing interrupts
with a Dispatcher will be more efficient. The Dispatcher can execute at
assembly level or C level. If C, then the interrupted register stack condition need
only be repaired once before entering the Dispatcher, rather than for every
interrupt (see section 2.5.6).
Interrupt latency can be reduced if Freeze mode interrupt processing is never
disabled. For a HIF conforming operating system, the technique was described
in section 2.5.7 (Minimizing Interrupt Latency).
Synchronous context switching times are greatly improved by only restoring
the activation record of the procedure about to start execution. This can only be
done for tasks which were synchronously switched out; but is a better method
than restoring the register stack to the exact position in use at the time of the task
context save.
Many embedded operating systems run tasks in Supervisor mode rather than
User mode. This gives each task direct access to critical resources; there is no
need to use system calls (which use a trap instruction to enter Supervisor mode)
to gain access to restricted resources. Always running in Supervisor mode has
the additional advantage that the Instruction cache need not be flushed on a task
context switch. However, the benefits of memory access protection are typically
reduced or unavailable with such systems.
Operating systems each have their own system call interface which is usually a
little different from HIF (see Appendix C). However, it may be still useful to
have HIF services available. The HIF services can often be supported by
translating them into the underlying operating system services. The High C 29K
and GNU library services generate HIF service calls. These libraries can be used
with a non–HIF operating system; but care must be taken as library routines
such as printf() are not reentrant. The OS–boot operating system, most often
used with HIF conforming library services, does not support task switching, but
other operating systems will, and the limitations of the non–reentrant library procedures will
become a problem.
Not all members of the 29K family support floating–point instructions directly
in hardware. It is the operating system’s responsibility to ensure that the desired
floating–point emulation routines (trapware) are installed. The operating
system vendor should also supply the appropriate transcendental library
services (sin(), cos(), etc.) for the chosen processor.
Floating–point instruction emulation is typically configured to operate with
interrupts not enabled. This avoids the need to save interrupted floating–point
context. However, the addition of floating–point environment saving during
application context switching is a requirement for some systems and an
unwanted burden for other systems. It is worth knowing the options an
operating system supports in this area.
It is often desirable and less expensive to purchase an operating system in
linkable or binary form, rather than source. This makes it more difficult to make
changes to the operating system code; this can be required to incorporate
support for specialized peripheral devices. It is best that the operating system
not consume all of the 32 global registers assigned for operating system use
(gr64–gr95). Additionally, linkable operating system images can use link–time
register assignment rather than compile time. This enables the user to rearrange
the global register usage and utilize unassigned registers for peripheral support
tasks.
The 29K family has no hierarchical memory management unit policy built into
the hardware. Support of the translation look–aside buffers is left to software.
This offers great flexibility, but generates questions about the MMU support
policy adopted by the operating system. Even if address translation is not
supported by an operating system, it is still desirable to use the MMU hardware
(where available) to support address access protection with one–to–one address
translation.
There is a movement in the operating system business, which includes real–time
variants, to support POSIX conforming system calls. It may be worth knowing
how, and to what extent, the operating system vendor plans to support POSIX.
Support for debugging operating system activity and application code is very
important. Often operating systems have weaknesses in this area. The
Universal Debug Interface (UDI) has been influential in the 29K debug tool
business. It offers flexibility in debug tool configuration and
selection. Debug tools are generally more available for DOS and UNIX based
cross development environments.
5.16 SUMMARY
Typical RISC processors, including the 29K, require more complex system
software than their CISC counterparts. The manageability of such software development is very much a function
of the particular RISC processor implementation. Increased knowledge of how the
compiler utilizes the processor registers is required to achieve best performance. The
availability of a large number of internal registers leads to improved operation
speeds; although the performance gains are at the cost of a somewhat more complex
application task context switch.
The use of interrupt processing via lightweight interrupts and signal handling
methods, along with the relative infrequency of context switching, enable the system
designer to implement a supervisor of generally much improved performance,
vis–a–vis CISC processors. Fortunately, application developers can make use of
RISC technology without having to solve the supervisor design problems
themselves, as there are a number of operating system products available.
Chapter 6
Memory Management Unit
Address values generated by the program counter and data load and store operations appear on the Am29000 processor address bus. Certain members of the 29K
family contain instruction caches, which eliminates the need for the processor to request instructions from external memory when the required instruction can be obtained from the cache. However, unless the Memory Management Unit (MMU) is in
operation, address values will flow directly on to the pins assigned to the address bus.
The MMU enables address values to be translated, to some extent, into a different physical address. This means that the address values generated by a program need
not directly correspond to the physical address values which appear on the chip's address pins. The program generates virtual addresses for data and instructions which
are located in physical memory at addresses determined by the MMU address
translation hardware.
With the Am29000 processor, virtual address space is broken into pages of 1K
byte, 2K byte, 4K byte or 8K byte size. The first page begins at address 0 and subsequent pages are aligned to page boundaries. The MMU does not modify the lower
address bits used to address data within a page. For example, with a 4K page size, the
lower 12 address bits are never modified. However, the MMU translates the upper 20
virtual address bits into a new 20–bit value. The translated upper 20–bits and the
original lower 12–bits are combined to produce a 32–bit physical address value.
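As a small worked example of this composition, assuming the 4K byte page size used above, the following C fragment shows how a 32–bit virtual address splits into the translated and untranslated parts; the helper names are chosen for the example only.

#include <stdint.h>

#define PAGE_SHIFT  12                    /* 4K byte pages              */
#define OFFSET_MASK 0x00000FFFu           /* lower 12 bits pass through */

/* The virtual page number: the upper 20 bits presented for translation. */
static uint32_t virtual_page(uint32_t va)  { return va >> PAGE_SHIFT; }

/* The page offset: never modified by the MMU. */
static uint32_t page_offset(uint32_t va)   { return va & OFFSET_MASK; }

/* Combine a translated page frame number with the original offset to form
   the 32-bit physical address. */
static uint32_t physical_address(uint32_t frame, uint32_t va)
{
    return (frame << PAGE_SHIFT) | page_offset(va);
}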
The use of an MMU enables a program to appear to have memory located over
the complete 32–bit virtual address space (4G bytes). The physical memory system
is, of course, much smaller. Virtually addressed pages are mapped (via address
translation) into physical pages located in the available memory, typically 1M to 4M
bytes. A secondary memory is used to store virtually addressed pages which are not currently located in the physical memory due to its limited size.
The secondary memory is typically a disk. When the MMU identifies the program’s need to access data stored on a page currently out on disk, it must instruct the
operating system to page–in the required page into the physical memory. The page
may be located almost anywhere in physical memory, but the address translation capability of the MMU will make the page appear at the desired virtual address accessed by the program. In the process of paging–in from disk, the operating system
may have to page–out to disk a page currently located in physical memory. In this
way memory space is made available for the in–coming page.
Within the 29K family, the MMU is located on–chip, and is constructed using Translation Look–Aside Buffers (TLBs). This chapter describes in detail how the
TLB hardware operates, and how it can be used to implement a virtual address capability. The TLBs provide other functions in addition to address translation, such as
separate access permissions for data read, write and instruction execution. These important functions will be explained and highlighted in example code.
6.1
SRAM VERSUS DRAM PERFORMANCE
As already stated, secondary memory is typically disk. However, it is difficult to
show example code relying on disk controller operation. The example code would be
too large and too much time would be spent dealing with disk controller operation.
This is not our intention. I have chosen to use SRAM devices for physical memory,
with DRAM and EPROM devices playing the role of secondary
memory.
SRAM devices are much faster than most DRAM memory system arrangements. Thus, by paging the program into SRAM, a very desirable speed gain should
be obtained. Certainly the secondary memory capacity is limited to the typically 1M
to 4M bytes made available by the DRAM and EPROM combination. But programs
will execute from SRAM alone, which may be limited to as little as 128K bytes. For
large programs this is likely to result in SRAM pages being paged out to secondary
DRAM to make space available for incoming pages.
The SRAM will effectively be a memory cache for the secondary DRAM; the
Am29000 processor MMU being used to implement a software controlled cache
mechanism. The performance difference shown by programs executing from SRAM
versus DRAM is large. Figure 6-1 shows the average cycles required per instruction
execution for four well known UNIX utility programs. The influence of memory performance on these benchmarks is likely to be similar to that experienced by large embedded application programs. The DRAM memory system used is termed 4–1. This
terminology is used throughout this chapter. In this case it means the memory system
requires four cycles for a random access and one cycle for a burst–mode access.
Burst–mode enables multiple instructions and data to be accessed consecutively
after a single start address has been supplied. The first data or instruction word in the
burst suffers the access penalties of a random access, but subsequent accesses are
much less expensive in terms of access delay cycles. The external memory system is
responsible for generating access addresses after the processor has supplied the start
address for the burst. This can be simply achieved with an address latch and counter.
Figure 6-1. Average Cycles per Instruction Using DRAM (joint I/D versus separate
I/D for compress, diff, nroff, and assembler on a 4–1 DRAM memory system)
The Am29000 processor can execute a new instruction every cycle if supported
by the memory system. Figure 6-1 shows that the desired 1 cycle per instruction is far
from achievable by the utility programs using a 4–1 memory system. Certain members of the 29K family (the Am29000 and the Am29050 processors) support a 3–bus
architecture. One bus is used for physical address values, and there are separate
busses for instruction and data information. This bus structure allows simultaneous
instruction and data transfer. Once the address bus has been used to supply the start
address of an instruction burst, the address bus is free for use in random or burst–
mode data accesses. Figure 6-1 shows performance values for both separate (separate I/D), and joint instruction and data (joint I/D) busses. It can be clearly seen that
separate busses offer a significant performance gain. Figure 6-2 shows the average
cycles per instruction for the same four benchmarks executing on a 2–1 memory system.
Implementing a 2–1 memory system at 25 MHz processor speeds, in particular
obtaining a 2–cycle first access, requires SRAM memory devices. The results on
Figure 6-2 show that 1–cycle per instruction is almost achieved when a separate
instruction and data bus is used with 2–1 memory.
Figure 6-2. Average Cycles per Instruction Using SRAM (joint I/D versus separate
I/D for compress, diff, nroff, and assembler on a 2–1 SRAM memory system)

29K family members supporting separate busses do not have any means within
the chip of reading data which is located in instruction memory. If instructions and
data are to be located in the same memory pages, then an off–chip bridge must be
constructed between the data and instruction busses. Accessing data located in
the instruction memory system via the bridge connected to the data bus will require
more access cycles than accessing data located in the data memory system connected
to the data bus directly. The bridge could support accessing instructions located in
data memory, but the performance penalties seem too great to implement. The bridge
mechanism is acceptable if used for the occasional read of data located in EPROM
attached to the instruction bus. It can also be used for reading, as data, an instruction
which has caused an execute exception violation.
The construction of two memory systems, one for data and a second for instructions, is undesirable. But it does allow a performance gain. This chapter shall deal
with an example system using a joint I/D memory system, because this keeps the example code simple. A separate I/D memory system would require separate instruction and data
memory caches and associated support data structures. A block diagram of the example system is shown in Figure 6-3.
Even with a joint I/D memory system it may still be necessary to build two
memory systems to achieve a low number of cycles per instruction. This is because it
is difficult to achieve single cycle burst–mode access with current memory devices at
25 MHz processor rates. Two memory systems are required and are used alternately. This technique is often called memory system interleaving. One memory system
supplies words lying on even word boundaries and the second memory system supplies words lying on odd word boundaries. In this way each memory system has twice
as many cycles to respond to consecutive memory accesses compared to a single
memory system acting alone.
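To make the even/odd split concrete, a two–bank interleave can be modeled with the small helper below; the bank–selection rule is the only point being illustrated, and the function name is chosen for the example.

#include <stdint.h>

/* With two-way interleaving, words on even word boundaries come from one
   memory system and words on odd word boundaries from the other. */
static int interleave_bank(uint32_t byte_address)
{
    return (byte_address >> 2) & 1;   /* bit 2 of the address selects the bank */
}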
Figure 6-3. Block Diagram of Example Joint I/D System (Am29000 processor with
cache SRAM and secondary memory of DRAM and ROM sharing the address bus
and data/instruction bus)
Interleaving can not guarantee a faster random or burst–mode first access, because the first access can not be overlapped with another access in the way achievable by consecutive burst–mode accesses. However, some implementations may
achieve some savings if the first access happens to fall to the memory system which
did not provide the previous access.
With joint I/D systems, 4 cycle first access is very punishing on performance.
This is because instruction bursts must be suspended when a data access occurs. To
start a data access costs 4 cycles. After it has completed, the joint I/D bus can restart
the instruction burst at a cost of 4 cycles. Thus accessing a single data word will effectively cost 8 cycles. The 4 cycle memory response latency is hidden by the branch
target cache (BTC) for branches and calls, but not for interruption of contiguous instruction execution. Separate I/D systems do not suffer to the same extent from memory
latency effects, as the instruction bus can continue to supply instructions in parallel
with the data bus operation. Members of the 29K family, such as the Am29030 processor, which only support joint I/D systems, have instruction cache memory on–
chip rather than BTC memory. This will enable the effects of instruction stream interruption to be better hidden, as the on–chip cache can be used to restart the instruction
stream after data access has occurred.
Figure 6-4 shows average cycles per instruction for the four benchmark programs running on various joint I/D memory systems. The 4–2 DRAM system does
not support single cycle burst–mode (2–cycle burst), and the performance reduction
from a 4–1 DRAM system is apparent. The MMU and associated software will be
used in the example system to construct a software controlled cache.

Figure 6-4. Average Cycles per Instruction (4–2 DRAM, 4–1 DRAM, and 2–1 SRAM
joint I/D memory systems for compress, diff, nroff, and assembler)

The TLB support software is based on an Am29000 TLB register format. Members of the 29K family
supporting two TLBs will require some small changes to the example code. The secondary memory shall be a 4–2 or 4–1 DRAM memory system. Programs shall be
paged into a small 2–1 SRAM memory. If the paging activity can be kept to a minimum, it is possible that the effective average cycle per instruction will approach that
of SRAM acting alone.
Current costs for DRAM devices are about $5 for 256Kx4 DRAMs and $10.50
for 32Kx8 SRAMs. At these prices 1M byte of DRAM would cost $40 and 1M byte
of SRAM $336. Prices will of course continue to fall on a per–byte basis. However, a
large difference between SRAM and DRAM prices will remain, and SRAM memory
system costs will remain an obstacle in obtaining the highest system performances. A
128K byte SRAM memory cache would cost $42. Using such a cache in conjunction
with a secondary DRAM memory is a cost effective way of achieving high performance. Because the Am29000 processor implements TLBs and lightweight interrupts (see section 4.3.3) on–chip, it is an ideal processor to implement a software
cache mechanism.
6.2
TRANSLATION LOOK–ASIDE BUFFER (TLB) OPERATION
The Am29000 processor has a number of special purpose support registers accessible only by the processor operating in Supervisor mode. Special register 2, known as the Current Processor Status (CPS) register, has two bits which are used to enable or disable the MMU operation. Bit PI, if set, disables the MMU for all instruction accesses. Bit PD, if set, disables the MMU for all data accesses. When these bit fields are both set, program address values flow directly to the address unit unmodified. This is simply known as physical addressing.
By clearing both bits PI and PD, program instruction address values and data
address values are presented to the MMU for translation and other checking. The
Am29000 generates addresses early. This means addresses are presented to the
MMU during instruction execution. The MMU completes the translation during the
execution cycle, making the translated address available at the start of the next processor cycle. The MMU does not need to check every address value; all data access
LOAD and STORE instruction addresses are translated. For instruction accesses,
only JMP and CALL type instructions are translated, as well as whenever the current
execution address crosses a page boundary. Figure 6-5 shows the probability of an
instruction requiring an address translation for the four utility programs previously
studied. Typically about 30% of instructions are shown to require address translations.
Figure 6-5. Probability of a TLB Access per Instruction (compress, diff, nroff, and assembler; joint I/D 2–1 SRAM memory system)
The MMU is constructed using a 64 entry Translation Look–Aside buffer
(TLB). Let’s first deal with how the TLB registers are configured, and how address
translation is performed. Later, the additional functions supported by the TLB registers will be studied. TLB registers are arranged in pairs which form a single TLB
entry.
The Am29000 processor can support 1K, 2K, 4K, and 8K byte page sizes. Special register 13, the Memory Management Unit configuration register (MMU register), has a two bit field (PS) which is used to select the page size. For the following
discussion let’s assume the PS bits are set to give a page size of 4K bytes.
The lower 12 address bits are unmodified by the MMU translation; they flow directly to the address pins. The next five address bits (bits 12 to 16) are used to select a TLB set. See Figure 6-6 for the address field composition. If the page size had been 2K bytes, then address bits 11–15 would be used to obtain the five bits for TLB set selection. Whatever the page size, five bits are required to select one of the 32 TLB sets. The Am29000 processor actually has 64 TLB entries, arranged as two per TLB set.
Figure 6-6. TLB Field Composition for 4K Byte Page Size (bits 31–17: virtual address tag comparison; bits 16–12: TLB set; bits 11–0: address offset within page)
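For reference, the field split of Figure 6-6 can be reproduced with a little shifting and masking. The following C fragment is only an illustrative sketch; the constants assume the 4K byte page size used in this discussion, and the virtual address is an arbitrary made-up value.

#include <stdio.h>
#include <stdint.h>

#define PAGE_BITS 12u                       /* 4K byte page   */
#define SET_BITS   5u                       /* 32 TLB sets    */

int main(void)
{
    uint32_t va     = 0x40123ABCu;          /* arbitrary virtual address */
    uint32_t offset = va & ((1u << PAGE_BITS) - 1);                /* bits 11-0  */
    uint32_t set    = (va >> PAGE_BITS) & ((1u << SET_BITS) - 1);  /* bits 16-12 */
    uint32_t vtag   = va >> (PAGE_BITS + SET_BITS);                /* bits 31-17 */

    printf("offset=0x%03lx set=%lu vtag=0x%04lx\n",
           (unsigned long)offset, (unsigned long)set, (unsigned long)vtag);
    return 0;
}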
Each TLB entry contains an address translation for a single page. Therefore the MMU contains translations for a maximum of 64 pages. It is possible that the address requiring translation does not match any of the current TLB entries, but this will be discussed later. The virtual address space is divided into 32 sets of equal sized pages (known as sets 0 to 31). Page 0, starting at address 0, belongs to set 0; page 1 belongs to set 1, and so on. Pages 32, 64, and so on also belong to set 0, and likewise pages 31, 63, and so on belong to set 31. All addresses falling on pages which are members of a set must obtain an address translation from the TLB entries associated with that set. This is known as Set Associative Translation. If a page address could be translated by an entry in any TLB location, then the translation technique is known as Fully Associative.
Set associative translation requires less chip area to implement than a fully associative mechanism, and can more easily operate at higher speeds. However, there are still many pages which compete with each other to get their address translation stored in a TLB entry assigned to the associated TLB set. For this reason the Am29000 processor supports two TLB entries per set. This is often expressed as "two columns per set". A page associated with a particular set can have its address translation located in either of the two possible TLB entries. This leads to the title: Two–way Set Associative Translation.
To determine which TLB entry holds a valid entry for the page currently being translated, the upper address bits, 17–31 in our 4K byte page example, are compared with the VTAG field in the TLB entry. The VTAG contains the corresponding upper bits for the TLB entry’s current translation. If a match occurs, and the other TLB permission bit field requirements are also satisfied, then the TLB RPN field supplies the upper address bits for the now translated physical address. In our 4K byte page example the RPN (Real Page Number) field would supply upper address bits 12 to 31, which, when combined with the page offset bits 0 to 11, produce a 32–bit physical address. See Figure 6-7 for a block diagram of the TLB layout.
Figure 6-7. Block Diagram of Am29000 Processor TLB Layout (TLB columns 0 and 1, each containing one entry for each of sets 0 through 31)
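The two–way lookup just described can be modeled in a few lines of C. The sketch below is an illustration under stated assumptions: a simplified entry structure holding only the VE, VTAG and RPN values, a 4K byte page size, and an arbitrary example mapping. It deliberately ignores the TID and permission checks described in the following paragraphs.

#include <stdio.h>
#include <stdint.h>

#define PAGE_BITS 12u   /* 4K byte pages */
#define SET_BITS   5u   /* 32 TLB sets   */

struct tlb_entry {      /* simplified model, not the TLB register layout */
    int      ve;        /* valid entry                   */
    uint32_t vtag;      /* virtual address bits 31-17    */
    uint32_t rpn;       /* real page number, bits 31-12  */
};

static struct tlb_entry tlb[2][32];          /* two columns per set */

/* Return 1 and write *pa on a hit, 0 on a miss (where a trap would occur). */
static int translate(uint32_t va, uint32_t *pa)
{
    uint32_t set  = (va >> PAGE_BITS) & ((1u << SET_BITS) - 1);
    uint32_t vtag = va >> (PAGE_BITS + SET_BITS);

    for (int column = 0; column < 2; column++) {
        const struct tlb_entry *e = &tlb[column][set];
        if (e->ve && e->vtag == vtag) {
            *pa = (e->rpn << PAGE_BITS) | (va & ((1u << PAGE_BITS) - 1));
            return 1;                        /* RPN supplies the upper bits */
        }
    }
    return 0;
}

int main(void)
{
    uint32_t va = 0x00405ABCu, pa;
    uint32_t set = (va >> PAGE_BITS) & ((1u << SET_BITS) - 1);

    /* install a mapping for the page containing va in column 0 */
    tlb[0][set].ve   = 1;
    tlb[0][set].vtag = va >> (PAGE_BITS + SET_BITS);
    tlb[0][set].rpn  = 0x12345u;

    if (translate(va, &pa))
        printf("0x%08lx -> 0x%08lx\n", (unsigned long)va, (unsigned long)pa);
    return 0;
}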
TLB entries are constructed from fields requiring 64–bit storage. This results in
128 TLB registers supporting the 64 TLB entries (32 sets, 2 ways per set). Two TLB
registers are required to describe a TLB entry. The first TLB register holds entry word
0 and a second register holds entry word 1. Figure 6-8 shows the TLB register layout.
Now that the address translation mechanism has been discussed, the TLB entry
fields can be examined in more detail. The VTAG and RPN fields have already been
Figure 6-8. Am29000 Processor TLB Register Format (TLB Entry Word 0: VTAG, TID, VE, and the SR, SW, SE, UR, UW, UE permission bits; TLB Entry Word 1: RPN, PGM, U, IO, and reserved fields)
described. Word 0 contains access permission fields. First look at the TID field of
word 0. For a TLB entry to match with the current translation, not only must the
VTAG match with the upper virtual address bits, but the current process identifier
(PID) must match with the task identifier in the TID field. The PID is located in an
8–bit field in the MMU configuration register.
Multi–tasking operating systems assign a unique PID to each task. Whenever a
context switch occurs to a new task the MMU register is updated with the PID for the
currently executing task. This enables the MMU to support multi–tasking without
having to flush the TLB registers at every context switch. TLB entries are likely to
remain until a task is again restored and the TLB entries reused. TLB entries are only
valid if the VE bit is set; the VE bit for each TLB entry should be cleared before address translation is enabled.
When the processor is running in Supervisor Mode (the SM bit in the CPS register is set), then the current PID value is zero, regardless of the PID value located in the
MMU register. Each TLB entry can separately enable read, write and execute permissions for accesses to the mapped page. The SE, SR and SW bits control access
permissions for Supervisor accesses to the page. The UR, UW and UE bits control
access permissions for the TID identified user.
If no currently valid mapping can be found in the two associated TLB entries,
then a TLB miss trap occurs. There are four traps assigned to support address translation misses, two are reserved for the processor operating in Supervisor mode, and a
additional two can be taken when a translation is not found when the processor is operating in User mode. Each mode has separate traps for instruction address transla-
tion and data address translation. A subsequent section describes the process of taking a trap.
Two additional traps are assigned to Supervisor and User mode protection
violations. These occur when a TLB entry has a valid entry but the permission fields
do not allow the type of access being attempted. For example, unless the UW bit is set,
a User mode process can not write to the mapped page, even if all other TLB entry
fields indicate a match with the translation address.
Now examine the bit fields of word 1. The IO bit is little used; it enables a virtual
address to be associated with a physical page in I/O space. The U bit is maintained by
the Am29000 processor. Whenever a TLB set is used in a valid translation the U bit
associated with the set is updated to indicate which of the two TLB entries was used.
In other words, the U bit selects the column within the set. The U bit is used to supply
the most significant bit in the least–recently used (LRU) register. Special register 14
has a 6–bit field which is updated whenever an address translation fails and a TLB
access trap occurs. The lower 5–bits of the LRU register are loaded with the TLB set
number. Thus the LRU register supplies to the trap handler a recommendation for
TLB entry replacement. The trap handler typically builds a new valid TLB entry at
the recommended location before execution of the interrupted program is continued.
The 2–bit PGM field is not assigned a task by the Am29000 processor; these bits are placed on the PGM[1:0] output pins when a translation occurs. Developers can
place any information they wish in the PGM bits. These bits are particularly useful
for multiprocessor applications when one processor wishes to signal other processors
about page cache–ability information.
All data accesses have their translated address and corresponding PGM value presented on the chip pins in the cycle following the cycle executing the LOAD or STORE instruction. Pages containing instructions have their corresponding PGM bits presented to the chip pins when a jump or call to an address within the page first occurs. However, if the target of the jump or call is found in the on–chip instruction cache and the address bus is in use when the jump or call instruction is in execute, the PGM bits for the target instruction page will not be presented to the chip PGM[1:0] pins.
In this chapter, the software controlled cache code example shall use the PGM
bits to store page–lock and page–dirty information in bits PGM[0] and PGM[1], respectively.
6.2.1 Dual TLB Processors
Newer microprocessor and microcontroller members of the 29K family do not
have the full complement of 64 address translations cached in their TLB. A smaller
TLB size of 16 entries enables valuable silicon space to be used for on–chip functions such as peripherals. To support the smaller number of TLB entries, the maximum page size has been increased from 8K bytes to 16M bytes. This enables a large
amount of virtual memory to be mapped with the reduced number of translation entries. Note, the Page Size (PS) field in the MMU configuration register is increased
from 2–bits to 3–bits to support wider page–size selection.
A consequence of the smaller number of TLB sets (8 for 16 two–way entries) is
a larger VTAG field. The Am29000 processor uses 5–bits to select from its 32 sets
(64 entries). The Am29240 only requires 3–bits to select the correct set. The loss of
2–bits for set selection causes a corresponding increase in the VTAG field. With a
minimum page size of 1k bytes (10 address bits), a maximum VTAG field of 19–bits
is required. To enable the VTAG field to fit within the TLB Entry Word 0, two permission bits are omitted. The Supervisor Read (SR) and Supervisor Execute (SE) access protections are not available with processors supporting the larger page sizes. Consequently, Supervisor mode programs can always read and execute code/data from pages which have a currently valid mapping. The TLB register format is shown in
Figure 6-9.
Figure 6-9. TLB Register Format for Processor with Two TLBs (TLB Entry Word 0: VTAG, TID, VE, SW, UW, UR, UE; TLB Entry Word 1: RPN, PGM, U, IO, D16, PCE, GLB, and reserved fields)
The Am29243 microcontroller supports two TLBs. This enables valid translations for a larger virtual address space to be maintained at any time. Each TLB operates independently and they can be programmed with different page sizes. The MMU
configuration register has two Page Size (PS) fields; one for each TLB. Dividing the
TLB register space (128 registers) into two TLBs enables up to 32 translations to be
held in each TLB. Each Am29243 TLB implements 16 of the possible 32 translations. The Least Recently Used (LRU) register has two LRU–recommendation
fields, one for each TLB. The fields are arranged such that future processors can implement the complete complement of 16 sets (32 translations) per TLB. When a TLB miss occurs both LRU fields are updated. Support software must decide which LRU
field to use and consequently which TLB to update. If the TLBs are allocated to different address regions, the miss address can be used to select the appropriate field.
TLB Entry Word 1 has an additional entry compared with the Am29000 register format –– the Global Page (GLB) bit. When set, the mapped page can be accessed by any process regardless of its process identifier (PID). This can be very useful when dealing with regions of shared code or data. Multiple processes can access, say, a shared library, without each process having to have valid translation entries for the memory pages containing the shared information.
The Am29040 2–bus processor also supports two TLBs. The TLB register format is the same as used with the Am29240 microcontroller. However, there are a
number of additional fields implemented in Entry Word 1. The width of the data bus used for external memory accesses can be reduced to 16 bits if the D16 bit is set. When set, a 32–bit data object is accessed via two 16–bit accesses. The D16 bit simplifies access to memory or other devices which must be accessed with a 16–bit width. The PCE bit enables parity checking for the mapped page. Parity is odd or even depending on the POE bit in the Configuration Register (CFG).
Table 6-1. PGM Field of the Am29040 Microprocessor TLB
PGM1   PGM0   Effect
0      0      normal (copy–back)
0      1      write–through
1      0      reserved
1      1      non–cacheable
With virtual addressing in use, the Am29040 TLB entries enable a data cache maintenance policy to be selected on a per–page basis (see Table 6-1). The default copy–back policy generally achieves the highest performance. When the MMU is not in use (physical addressing), a copy–back policy is applied for cached data. See section 5.14.2 for more details about Am29040 data cache policy. Note, when the D16 bit is set, the access is considered non–cacheable.
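Table 6-1 amounts to a simple decode of the two PGM bits. The following C sketch is only an illustration of the table, not processor code:

#include <stdio.h>

/* Decode the Am29040 per-page data cache policy from the PGM field
 * (see Table 6-1). The encoding PGM1=1, PGM0=0 is reserved.        */
static const char *cache_policy(unsigned pgm1, unsigned pgm0)
{
    switch ((pgm1 << 1) | pgm0) {
    case 0:  return "normal (copy-back)";
    case 1:  return "write-through";
    case 2:  return "reserved";
    default: return "non-cacheable";
    }
}

int main(void)
{
    for (unsigned pgm1 = 0; pgm1 < 2; pgm1++)
        for (unsigned pgm0 = 0; pgm0 < 2; pgm0++)
            printf("PGM1=%u PGM0=%u : %s\n", pgm1, pgm0, cache_policy(pgm1, pgm0));
    return 0;
}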
The example code presented in this chapter for a software controlled cache is
based on the Am29000–type TLB register format. To make the code work with an
Am2924x or Am29040 processor would require some small changes. The code sequences requiring modification would be in the construction of TLB entry Word 0
and Word 1. This does not detract from the value of the example code.
6.2.2 Taking a TLB Trap
The address translation performed by the MMU is determined by the trap handler routines which are used to update the TLB registers. When the current processor
status register bits PD and PI are both clear, enabling the MMU hardware for both
data and instruction address translation, the DA and FZ bits in the CPS register must
also be cleared. Clearing these bits disables Am29000 special register freezing and
enables traps to be taken.
When the MMU does not contain a match for the current address translation, a
trap is taken by the processor. This also happens for valid translations not meeting
permission requirements. The software executed by the trap handler must construct a
TLB entry for the failing address from page table entries (PTEs) stored in memory.
The TLB registers simply act as a cache for the currently–needed translations stored
in off–chip data memory.
Many CISC–type processors have algorithms in the chip microcode for automatically updating the MMU hardware from more extensive data located in external
data memory. Because the Am29000 does not implement this function in hardware,
the user is free to construct a software algorithm for TLB reloading which best suits
the memory management architecture. This increased flexibility outweighs any reduction in TLB register reload time that may occur for some configurations. The
flexibility is what makes possible the software controlled cache described later.
When the Am29000 takes a trap the processor enters Supervisor mode with frozen critical support registers. This is known as Freeze mode. A more complete explanation is given in Chapter 4 (Interrupts and Traps). The frozen special registers
describe the state of the processor at the time of the address translation failure. Examining these registers enables the trap handler software to determine the necessary
action and eventually restart the instruction in execute when the trap occurred. After
the trapware routines have constructed the required TLB entry, the faulting instruction will be able to complete execution.
Later sections will deal with the trapware for the example software controlled cache system in detail. Since the code is memory architecture specific, the operation of the software controlled cache needs to be discussed first. This discussion is in the later section entitled Software Controlled Cache Memory Architecture (section 6.4).
6.3  PERFORMANCE EQUATION
Performance has been considered in terms of average number of cycles per
instruction execution. This is a useful metric when considering memory system architectures. Figure 6-1, Figure 6-2 and Figure 6-4 give average cycles per user
instruction execution (AC/I). However, if a TLB miss occurs during instruction
execution, a number of Supervisor mode trapware instructions will be required to
prepare the TLB registers before the user’s code can continue. If TLB trapware is activated in support of too many instructions, then the effective number of cycles required per application instruction will increase.
The effective average cycles per instruction is given by Aeffective = P × AC/I, where AC/I is the average number of cycles per instruction for the program running in physical mode, without the MMU in operation. The multiplying factor, P, determines how much performance is reduced by the use of the MMU hardware. The value of P is given by:

P = 1 + (PTLB/I × Pmiss × Tcycles) / AC/I
We shall look at the terms of this equation individually to determine their effect.
Term PTLB/I is the probability an instruction shall cause a TLB access. Figure 6-5
showed average figures for PTLB/I observed with the four benchmark programs examined. Given that a TLB access occurs, we are then interested in the probability that
an entry is not found and a miss trap is taken. This conditional probability is given by
term Pmiss, and Figure 6-10 shows average Pmiss values for the four benchmark programs running on the software controlled cache system.
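As a worked example of the equation, the short C sketch below computes P and the effective number of cycles per instruction. The trial values are made up purely for illustration; they are not measurements taken from the benchmark programs.

#include <stdio.h>

int main(void)
{
    /* illustrative trial values only, not measured data: */
    double ac_i     = 2.0;    /* AC/I: cycles per instruction, physical mode */
    double p_tlb_i  = 0.30;   /* PTLB/I: probability an instruction accesses the TLB */
    double p_miss   = 0.01;   /* Pmiss: probability a TLB access misses             */
    double t_cycles = 600.0;  /* Tcycles: average cycles to service a TLB miss      */

    double p = 1.0 + (p_tlb_i * p_miss * t_cycles) / ac_i;
    double a_effective = p * ac_i;   /* equals ac_i + p_tlb_i * p_miss * t_cycles */

    printf("P = %.3f, effective cycles per instruction = %.3f\n", p, a_effective);
    return 0;
}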
Figure 6-10. TLB Miss Ratio for Joint I/D 2–1 SRAM System (TLB miss ratio versus page size, 1K to 8K bytes, directly mapped; compress, diff, nroff, and assembler)
What matters at present is the observation that TLB miss rates increase as page size decreases. This is expected because smaller page sizes mean a smaller portion of the program’s pages have mappings currently cached in the TLB registers. Given that the Am29000 processor has a fixed number of TLB entries, it is best to have large page sizes if TLB misses are to be reduced. However, the finer granularity of small page sizes may lead to better physical memory utilization. An additional consideration is the size of pages transported from secondary memory such as disk or network connections. Secondary memory communication efficiency may be improved with larger page sizes.
Figure 6-11. Average Cycles Required per TLB Miss (cycles per miss versus page size, 1K to 8K bytes; nroff and assembler, 4–1 DRAM with 128 page 2–1 cache)
The final term of the equation, Tcycles, is the average number of cycles required to
process a TLB miss. Figure 6-11 shows values for the four benchmark programs running on the cache system. When a TLB miss occurs for a page which is not currently
located in the physical memory but in secondary memory, a large number of processor cycles is required to first transfer the page from secondary memory to physical
memory and then build a valid TLB entry. As the page size increases the TLB miss
trap handler execution time increases substantially.
The product PTLB/I × Pmiss × Tcycles gives the average number of cycles of overhead
added to each application instruction in order to support the MMU operation. After
studying the software cache memory architecture, the effective number of cycles per
instruction achieved will be reexamined and compared with the non–cache memory
architecture performance.
6.4  SOFTWARE CONTROLLED CACHE MEMORY ARCHITECTURE
By studying a software controlled cache mechanism we can achieve three objectives: first, a better understanding of the non–TLB–cached page–table layout; second, further understanding of TLB trapware implementation detail; and third, an awareness of software controlled cache benefits.
When a TLB miss occurs, the trap handler must determine the replacement TLB
entry data. It does this by indexing a table of Page Table Entries (PTEs). Each PTE
contains information on how to map a physical page into its corresponding secondary
memory page. In our example system, the physical memory is SRAM and the secondary memory is DRAM. In fact, the secondary memory is physically addressable, but execution of all programs from within the limited–size SRAM cache will be attempted, and the DRAM will only be accessed when a page needs copying to or from secondary memory.
There are many different PTE table arrangements. Some systems have multiple
layers of PTEs, where a higher level PTE points to tables of lower level PTEs. In multi–tasking systems, each task may have its own table of PTEs. And if the Supervisor
code also executes with address translation, then it may also have a table of PTEs. To
simplify our example system, we will assume the supervisor always runs in physical
mode, and there is a single table of PTEs shared by all User mode programs. To evaluate the system performance, only single User mode tasks will be run, in particular the
nroff and assembler utility programs.
PTEs need not have the same structure as TLB entries. They typically do not.
This enables the memory management system to keep additional page information in
memory and only cache critical data in the TLB registers. In addition it may be possible to compact information into a smaller PTE structure, which results in a substantial space saving in systems which keep extensive PTE tables permanently in physical memory (in our case SRAM). For the example system, PTEs shall have exactly
the same format as TLB entries. The method has the benefit that TLB entries can be
loaded directly from the PTE memory locations without additional processor cycles being expended in reformatting.
The PTE format will be 4–way set associative. The number of sets shall be limited by the amount of available SRAM cache memory, but a lower limit of 32, established by the Am29000, is required. Given a minimum page size of 1K bytes, the
SRAM can not be smaller than 128K bytes (1K x 4 x 32). If the number of PTE sets is
greater than 32, then the cache has more set resolution than the TLBs. In this case a
TLB set caches entries for more than one PTE set, and the TLB VTAG field has more
address resolution than the PTE VTAG field requires.
Each TLB entry indicates how the user’s virtual address is mapped into an
SRAM page number (given by the TLB RPN entry). The PTE entries must have a
mapping relationship with DRAM memory pages and SRAM memory pages. The
entries use the PTE RPN field to store the DRAM page number. PTEs also have a
mapping relationship with SRAM pages. This enables the memory page maintained
by the PTE to be moved between SRAM and DRAM. The PTE SRAM mapping is
simple. PTEs are stored consecutively in memory, as are SRAM pages. Given the PTE address, the corresponding SRAM page address can be found
by determining the PTE address displacement from the PTE table base. The PTE displacement, multiplied by the page size, will locate the SRAM page relative to the
base address of SRAM pages. Figure 6-12 outlines the system.
Figure 6-12. PTE Mapping to Cache Real Page Numbers (the PTE table, four entries per set, sits at the start of SRAM where upte points; cache memory pages 0 through 7 follow, with kmsp located on a locked and invalid page)
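The address arithmetic just described can be sketched in C. This is an illustration only: the base address is hypothetical, while the 8–byte PTE size and 1K byte page size mirror the example code presented later in this chapter.

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 1024u      /* 1K byte pages (PGSIZE = 10)   */
#define PTE_BYTES    8u      /* each PTE is two 32-bit words  */

/* upte models the register of the same name: it points at the base of
 * the PTE table, which occupies the first SRAM cache page.           */
static uint32_t sram_page_for_pte(uint32_t upte, uint32_t pte_address)
{
    uint32_t index = (pte_address - upte) / PTE_BYTES;  /* PTE displacement   */
    return upte + index * PAGE_SIZE;                    /* cache page address */
}

int main(void)
{
    uint32_t upte = 0x00400000u;             /* hypothetical SRAM base   */
    uint32_t pte  = upte + 5 * PTE_BYTES;    /* sixth PTE in the table   */
    printf("PTE at 0x%08lx maps cache page at 0x%08lx\n",
           (unsigned long)pte, (unsigned long)sram_page_for_pte(upte, pte));
    return 0;
}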
Because the TLB entries are not an exact copy of the PTE entries, due to the RPN field differences, TLB register word 1 must be adjusted accordingly before the TLB register can be updated from the PTE entry.
The Am29000 C language calling convention reserves processor registers
gr64–gr95 for operating system use. To improve trap handler performance a number
of these registers are used by these critical routines. For temporary use, six registers
are required, and for static information caching two registers are used. The particular
registers used are described later along with the example code. The two static registers are of particular interest; they will be given the synonyms upte and kmsp.
It is desirable to keep critical data and routines in SRAM memory. For example,
the TLB miss handler routines should be stored in cache memory. Cached pages can
be marked as locked–in; this prevents them from being paged–out to DRAM.
However, the SRAM is only intended to hold User mode application pages. Trap handlers and other critical operating system routines run in Supervisor mode, and in our
example system, without address translation. In practice, a larger SRAM could be
implemented and, say, half allocated for cache use; the other half being reserved for
operating system code and data. This may not lead to the most efficient use of such an
effective resource as SRAM. The problem can be overcome by marking certain PTE
entries as invalid but locked. The SRAM pages corresponding to these PTEs can then
be accessed in non–translated address mode by Supervisor mode code.
Since the PTE table is frequently accessed by TLB trapware, it is important that
quick access to the table is supported. For this reason register upte is initialized to
point to the base of the PTE table, and the table is located in the first SRAM page. One
SRAM page can contain 32 sets of PTE data. In multi–tasking systems, with each
task having its own PTE table, the upte value is normally stored in a per–task data
structure know as the Process Control Block (PCB), and the upte register is updated
from the PCB data at each context switch.
The Am29000 takes traps very quickly, without expending a number of internal
processor cycles preparing an interrupt processing context for the processor. This advantage over typical CISC processor operation enables the Am29000 to process the
trap quickly in Freeze mode and return to the user’s program. It is the Freeze mode
processing capability of the Am29000 that makes a soft cache mechanism attractive.
However, TLB miss handlers can not always complete their handling quickly in
Freeze mode code. In such cases they must signal the operating system to continue
with further processing; Freeze mode is then departed, and Supervisor mode with freeze
disabled is entered. Before Freeze mode can be exited, the frozen special registers
must be stored on a Supervisor mode memory stack. They will have to be restored
from this stack once the operating system completes the TLB miss processing. The
operating system stack is located on page 4, which is in a different set from the PTE
table. Operating system accessible register kmsp is used as a stack pointer.
Using the cache architecture described, the nroff and assembler utilities were
observed running in a 128 page SRAM based system. The page–in activity is shown
in Figure 6-13. It appears the two programs were too large to execute in 128K byte
SRAM (1K byte page size). The paging activity is at a minimum with a 256K byte
cache (2K byte page size). It is possible the increased paging activity is due to cache
sets being only 4–way. In the case of nroff, it is more likely the page replacement algorithm was having difficulty in keeping the desired pages in the cache for such a
large program.
As page sizes get larger, the probability of a TLB miss diminishes. Since, for an SRAM with a fixed number of pages, the cache also gets larger, the probability that a TLB miss requires a page–in can be expected to increase as page size increases. This reflects the fact that with large caches a TLB miss tends to occur only when a page–in is needed, while the TLB maintains cached entries for the permanently resident pages. Figure 6-14 gives the probability of a page–in given a TLB miss has occurred.
With the nroff utility, the probability actually reduces when the page size is increased from 1K byte to 2K byte. This is because of the cache–thrashing occurring
with the 128K byte cache used with the 1K byte page size.
6.4.1 Cache Page Maintenance
The example software controlled cache system only supports User mode address translation. This means Supervisor mode TLB miss handlers will not be considered. TLB entries shall always enable instruction execution for each page; this eliminates support for the TLB instruction access protection violation trap. Pages will be initially marked as non–writeable; as will be seen, this supports maintenance of the page–dirty bit. So in total, we need only deal with three traps: instruction access miss, data access miss, and data access protection violation.
Figure 6-13. Software Controlled Cache, K bytes paged–in (K bytes paged in versus page size, 1K to 8K bytes; nroff and assembler, 4–1 DRAM with 128 page 2–1 cache)
Figure 6-14. Probability of a Page–in Given a TLB Miss (probability versus page size, 1K to 8K bytes; nroff and assembler, 4–1 DRAM with 128 page 2–1 cache)
The Am29000 has 65 global registers (gr1, gr64–gr127); of these, 32 are reserved for operating system use only (gr64–gr95). To improve the performance of
the trapware, several of the operating system registers have been assigned TLB handler support functions. The following code uses register synonyms, so the actual register assignments can be easily changed.
        .reg    it0,gr64        ;Freeze mode
        .reg    it1,gr65        ;temporary regs
        .reg    it2,gr66
        .reg    it3,gr67
        .reg    kt0,gr68        ;temporary regs
        .reg    kt1,gr69
        .reg    kmsp,gr93       ;supervisor M-stack
        .reg    upte,gr95
The code shown within this chapter makes use of a number of macros for pushing and popping special registers to an external memory stack. These macros, push, pushsr, pop and popsr, were described in section 3.3.1 (Useful Macro–Instructions).
The example code can be used to construct a cache of various number of PTE
entries (ways or columns) per set, and total number of sets. The constant definitions
shown below are used to control the cache size.
        .equ    PGSIZE,10       ;Page size
        .equ    C_SETS,6        ;cache sets
        .equ    C_COLUMNS,2     ;columns per set
        .equ    WSIZE,512       ;window size
        .equ    SIG_ICMISS,1    ;signal I-miss
        .equ    SIG_DCMISS,3    ;signal D-miss
        .equ    SIG_PROTECT,5   ;signal W-protect
        .equ    CTX_CHC,3*4     ;context offset

        .sect   cache,bss
        .use    cache
cache_adds:
        .block  (1<<PGSIZE)*(1<<C_SETS)*(1<<C_COLUMNS)
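For reference, the amount of memory reserved by the .block directive follows directly from these constants. The small C sketch below merely repeats the same arithmetic (the values mirror the assembly definitions above); it is an illustration, not part of the example trapware.

#include <stdio.h>

#define PGSIZE    10   /* page size = 1 << PGSIZE bytes     */
#define C_SETS     6   /* number of sets = 1 << C_SETS      */
#define C_COLUMNS  2   /* columns per set = 1 << C_COLUMNS  */

int main(void)
{
    unsigned long pages = (1ul << C_SETS) * (1ul << C_COLUMNS);
    unsigned long bytes = (1ul << PGSIZE) * pages;

    printf("%lu pages of %lu bytes = %luK bytes reserved for the cache\n",
           pages, 1ul << PGSIZE, bytes / 1024);
    return 0;
}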
The operating system code, which is not shown, is responsible for initializing the support registers kmsp and upte. It must also mark the PTEs as locked and invalid for any SRAM pages which are not to be used for caching but are instead reserved for the operating system. The example code uses pages 0 and 4 to store performance critical support data.
6.4.2 Data Access TLB Miss
When a read or write data access occurs for a page whose translation from virtual to physical address is currently not in the TLB registers, a TLB miss is taken. This
causes execution to vector to trap number 9. The address of the trapware handler,
UDTLBmiss, is at location 9 in the vector table. A miss may occur because the ac-
cessed page is currently not in the cache, or, more importantly, because the PTE mapping the cached page is currently not cached by the TLB registers. The PTEs for the
appropriate set must be scanned to determine if the page is in the cache.
When a trap is taken, the Am29000 processor special support registers are frozen, their contents report the state of the processor at the time of the trap. Special register CHA contains the virtual address for the failing data access. Using the CHA value, the cache set is determined and the 4 PTE columns assigned to the set are scanned.
The PTE valid bit must be set and the PTE VTAG field must match with the upper bits
of the CHA address for a match to be found. Note, the example code does not
compare the TID field; this would be necessary if the cache were supporting a multi–
tasking operating system.
UDTLBmiss:
        mfsr    it0,cha
        const   kt1,SIG_DCMISS          ;signal number
        srl     it2,it0,PGSIZE          ;select cache set
        and     it2,it2,(1<<C_SETS)-1
        sll     it2,it2,3+C_COLUMNS
        add     it2,it2,upte            ;adds of 1st PTE
        ;
scan_columns:
        srl     it0,it0,PGSIZE+5
        sll     it0,it0,PGSIZE+5
        const   kt0,(1<<C_COLUMNS)-1
next_column:
        jmpt    kt0,not_cached
        sub     kt0,kt0,1               ;dec column count
        load    0,0,it1,it2             ;load word 0
        add     it2,it2,8               ;next PTE entry
        sll     it3,it1,31-14           ;test VE-bit
        jmpf    it3,next_column
        srl     it3,it1,PGSIZE+5        ;mask PTE VTAG
        sll     it3,it3,PGSIZE+5
        cpeq    it3,it0,it3             ;compare VTAG
        jmpf    it3,next_column
        mfsr    it3,LRU
        sub     it2,it2,4               ;adds word 1
If a PTE is found in the set which matches with the CHA address, then the TLB
entry of the associated set, selected by the LRU register, is updated with the contents
of the matching PTE. Field RPN of word 1 of the TLB entry is not filled with the secondary
memory (DRAM) page number taken from the PTE, but with the page number of the
SRAM cache page.
in_cache:
;Word 0 in it1, it2 points to PTE word 1
        load    0,0,it0,it2             ;load word 1
        mttlb   it3,it1                 ;assign Word 0
        add     it3,it3,1
        and     it0,it0,0xc1            ;mask out RPN
        sub     it1,it2,upte            ;set offset
        srl     it1,it1,3               ;set index
        sll     it1,it1,PGSIZE          ;cache page offset
        add     it1,it1,upte            ;cache RPN
        or      it0,it1,it0             ;or in cache RPN
        mttlb   it3,it0                 ;assign Word 1
        iret
When the required page is found in the cache, the TLB handler executes very
quickly without ever leaving Freeze mode. After the TLB entry has been updated an
IRET instruction causes execution to be restarted from the state defined by the frozen
special registers. The trapware is arranged so the most frequently occurring events
are processed first and suffer the lowest support overhead. However, if the page is not
found in the cache (no matching PTE) then the trapware must call on the operating
system to complete the necessary processing. It does this by sending a signal. The
code following label not_cached pushes the contents of the special registers as well
as other signal information onto a signal frame on the Supervisor memory stack.
Execution is then forced to continue in Supervisor mode with non–translated addressing at tlb_sig_handler. The signal frame shall be used to repair the special registers
after the higher level operating system support code has completed.
not_cached:
;Send a signal to the operating system
        push    kmsp,kt1                ;push signal number
        push    kmsp,gr1                ;push gr1
        push    kmsp,rab                ;push rab
        const   it0,WSIZE
        sub     gr1,rfb,it0             ;set gr1=rfb-WSIZE
        sub     rab,rfb,it0             ;set rab=rfb-WSIZE
        pushsr  kmsp,it0,pc0            ;push pc0
        pushsr  kmsp,it0,pc1
        pushsr  kmsp,it0,pc2
        pushsr  kmsp,it0,cha
        pushsr  kmsp,it0,chd
        pushsr  kmsp,it0,chc
        pushsr  kmsp,it0,alu
        pushsr  kmsp,it0,ops            ;push ops
        ;
        push    kmsp,tav                ;push tav
        cpeq    tav,kt1,SIG_ICMISS
        jmpt    tav,i_miss
        mfsr    tav,pc1                 ;pass address
        mfsr    tav,cha
        ;
i_miss:
        mtsrim  chc,0                   ;cancel load/store
        mtsrim  ops,0x70                ;set PD|PI|SM
        ;
        const   it1,tlb_sig_handler
        consth  it1,tlb_sig_handler
        add     it0,it1,4               ;trampoline signal
        mtsr    pc1,it1                 ; handler
        mtsr    pc0,it0
        iret
The signal frame has a signal number field which is used to report the type of
TLB trap which occurred. The layout of the frame is given in Figure 6-15. Global
register tav (gr121) is used to pass the address causing the trap to occur. For a TLB
data miss, the address is already contained in the CHA register, but copying it to tav is
convenient because the signal handler code is also shared by other routines.
Figure 6-15. TLB Signal Frame (Supervisor memory stack, higher addresses at the top of the figure: signal number, gr1, rab, PC0, PC1, PC2, CHA, CHD, CHC, ALU, OPS, tav; kmsp points to the bottom of the frame)
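The frame of Figure 6-15 can also be pictured as a C structure. The layout below is an illustrative assumption based on the push order used by the trapware (tav at the lowest address, the signal number at the highest); conveniently, the chc member lands at the offset given by the CTX_CHC constant used later by the page–in code.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative view of the TLB signal frame, in ascending address order.
 * kmsp points at the tav slot once the frame has been built.            */
struct tlb_signal_frame {
    uint32_t tav;            /* offset  0            */
    uint32_t ops;            /* offset  4            */
    uint32_t alu;            /* offset  8            */
    uint32_t chc;            /* offset 12 = CTX_CHC  */
    uint32_t chd;
    uint32_t cha;
    uint32_t pc2;
    uint32_t pc1;
    uint32_t pc0;
    uint32_t rab;
    uint32_t gr1;
    uint32_t signal_number;  /* highest address, pushed first */
};

int main(void)
{
    /* the chc offset matches the CTX_CHC constant used by the signal handler */
    printf("chc offset = %lu\n",
           (unsigned long)offsetof(struct tlb_signal_frame, chc));
    return 0;
}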
6.4.3 Instruction Access TLB Miss
Instruction access TLB misses are dealt with in the same way as data access misses. Only the signal number is different, and the faulting address is contained in special register PC1 rather than CHA. Register PC1 contains the address of the instruction in execute at the time of the failing address translation. Since cache pages contain both instructions and data, the same set of PTEs applies to both data and instruction address values. Via the interrupt vector table, the User mode instruction access trap
number 8 causes execution to continue at address label UITLBmiss.
UITLBmiss:
        mfsr    it0,pc1
        const   kt1,SIG_ICMISS          ;signal number
        srl     it2,it0,PGSIZE          ;select cache set
        and     it2,it2,(1<<C_SETS)-1
        sll     it2,it2,3+C_COLUMNS     ;PTE set offset
        jmp     scan_columns
        add     it2,it2,upte            ;adds of 1st PTE
6.4.4 Data Write TLB Protection
The following signal handler code is responsible for moving pages from secondary DRAM to SRAM cache memory (paging–in). When pages are first paged–in
they are given read and execute permissions only, unless the initial faulting access is
due to a data write. At some time later during program execution, a write to the
cached page may occur. When this happens, a data write protection trap is taken, and
execution is vectored to address label tlb_data_prot.
In the same way as a data TLB miss, the associated PTE entries are scanned to
find the matching entry. There must be a matching entry and, in addition, a cached
TLB entry which is disallowing write access. Once the PTE has been found, the CHA
address value is again used to find the associated TLB entry. Note, the LRU register
can not be used because it is only updated on TLB misses. To find the TLB entry, the
VTAG portion of the CHA address is compared with the only two possible TLB entries associated with the set.
;A write request to a read-only page has occurred.
tlb_data_prot:
        mfsr    it0,cha
        const   kt1,SIG_PROTECT         ;signal
        srl     it2,it0,PGSIZE          ;select cache line
        and     it2,it2,(1<<C_SETS)-1
        sll     it2,it2,3+C_COLUMNS     ;PTE set offset
        add     it2,it2,upte            ;adds of 1st PTE
        ;
scan:
        srl     it0,it0,PGSIZE+5        ;adds VTAG
        sll     it0,it0,PGSIZE+5
        const   kt0,(1<<C_COLUMNS)-1
nxt_column:
        jmpt    kt0,not_cached
        sub     kt0,kt0,1               ;dec column count
        load    0,0,it1,it2             ;load word 0
        add     it2,it2,8               ;next PTE entry
        sll     it3,it1,31-14           ;test VE-bit
        jmpf    it3,nxt_column
        srl     it3,it1,PGSIZE+5        ;mask PTE VTAG
        sll     it3,it3,PGSIZE+5
        cpeq    it3,it0,it3             ;compare VTAG
        jmpf    it3,nxt_column
        ;
        mfsr    it3,cha                 ;find TLB entry
        srl     it3,it3,PGSIZE-1        ;get TLB set
        and     it3,it3,0x3e
        mfsr    kt0,cha
        srl     kt0,kt0,PGSIZE+5        ;form adds VTAG
        sll     kt0,kt0,PGSIZE+5
        mftlb   it0,it3                 ;read Word 0
        srl     it0,it0,PGSIZE+5        ;form TLB VTAG
        sll     it0,it0,PGSIZE+5
        cpeq    it0,it0,kt0
        jmpt    it0,entry_found
        sub     it2,it2,8               ;PTE adds word 0
        add     it3,it3,64              ;Word 0 in set 1
Once the PTE and TLB entries have been found execution continues at label
entry_found. Both entries must now be updated to set the UW bit enabling User
mode write access. In addition, the PGM[1] bit used to keep a record of any data
writes to the SRAM page is also set. This bit, known as the dirty–bit, will be used in
the page–out algorithm. Once the TLB register reporting the access permission fault
has been updated, an IRET instruction is used to restart the program using the contents of the still frozen special registers.
entry_found:
;Word 0 in it1, it2 points to PTE word 0
        const   kt1,0x200               ;UW-bit
        or      it1,it1,kt1
        store   0,0,it1,it2             ;store new word 0
        mttlb   it3,it1                 ;assign Word 0
        ;
        add     it2,it2,4
        load    0,0,it0,it2             ;load word 1
        add     it3,it3,1
        or      it0,it0,0x80            ;set PGM[1] dirty
        store   0,0,it0,it2             ;store new word 1
        and     it0,it0,0xc1            ;mask out RPN
        sub     it1,it2,upte            ;set offset
        srl     it1,it1,3               ;set index
        sll     it1,it1,PGSIZE          ;cache page offset
        add     it1,it1,upte            ;cache RPN
        or      it0,it1,it0             ;or in cache RPN
        mttlb   it3,it0                 ;assign Word 1
        iret
6.4.5 Supervisor TLB Signal Handler
When trapware code is unable to complete the necessary TLB update, for example, if the corresponding address is for a page not currently in the cache, the operating
system receives a signal and information on its memory stack required to continue
the TLB update process. An IRET instruction is used to trampoline to the signal handler address tlb_sig_handler. The IRET does not cause the faulting User mode
instruction to restart, because after the frozen special registers are saved on the stack,
the PC registers are loaded with the address of the signal handler. Additionally, the
OPS status register is modified so that Supervisor mode with non–translated addressing commences after the IRET, rather than the interrupted User mode with address translation on.
A small number of support registers were required to support the trapware routines. The higher level signal handler code requires registers for its own operation. It
is undesirable to use some of the remaining operating system registers in the
gr64–gr95 range to support this code. Global registers are a scarce resource and likely needed by other critical operating system tasks. The registers used by the trap handlers (it0–it3) are by convention used by all Freeze mode handlers, since during
Freeze mode, interrupts are disabled and therefore there are no register access conflicts. However, the signal handler code runs with interrupts turned on. An interrupt
occurring during signal processing would likely use the interrupt temporary registers
(it0–it3), and therefore the signal handler must acquire additional registers for its operation. It does this by pushing some of the User mode assigned global registers
(gr96–gr127) onto the Supervisor stack, just below the signal frame.
;Try and find an empty PTE entry in the column.
;Register tav has the offending address.
tlb_sig_handler:
        push    kmsp,gr96               ;get some registers
        push    kmsp,gr97
        push    kmsp,gr98
        push    kmsp,gr99
        push    kmsp,gr100
        ;
        mfsr    gr96,tmc                ;get random value
        ;
        srl     gr98,tav,PGSIZE         ;select cache set
        and     gr98,gr98,(1<<C_SETS)-1
        sll     gr98,gr98,3+C_COLUMNS   ;PTE set offset
        add     gr98,gr98,upte          ;PTE column 0 address
        ;
        const   gr100,(1<<C_COLUMNS)-1
column_loop:
        jmpt    gr100,page_out
        and     gr96,gr96,((1<<C_COLUMNS)-1)<<3
        add     gr99,gr98,gr96          ;column wrap-around
        load    0,0,gr97,gr99           ;load word 0
        add     gr96,gr96,8             ;next PTE entry
        sll     gr99,gr97,31-14         ;test VE-bit
        jmpt    gr99,column_loop
        sub     gr100,gr100,1           ;dec column count
        ;
        sub     gr96,gr96,8
        call    gr100,store_locals      ;destroys gr96
        add     gr98,gr98,gr96          ;PTE adds of Word 0
page_in:
;Page-in code follows . . .
The four PTE entries associated with the set are then scanned to find an unused
entry (i.e., the VE bit is not set). If all PTEs are marked valid, then execution continues at page_out. Once an empty entry is found, a call to routine store_locals is made.
This call causes all 128 local registers within the Am29000 processor to be copied
onto the Supervisor memory stack just below the user’s saved global registers. Note,
when the set of four PTEs is scanned, a random column in the set is initially selected. This may reduce column scan times. After the local registers have
been made available for signal handler use, execution continues at label page_in.
6.4.6 Copying a Page into the Cache
Once a PTE for the in–coming page has been selected the corresponding SRAM
cache page can be easily determined with a little address–based calculation. Words 0
and 1 for the TLB entry are now formed and stored in the TLB selected by the LRU
register. The TLB entry is also copied to the PTE location, with the one difference
that PTEs have the DRAM page number in the RPN field rather than the SRAM page
number.
The Dirty bit, PGM[1], is cleared and the page is marked for read and execute
permissions, unless the signal is from a failing data write access; in this case, the page
is marked dirty and write permission is granted. To determine if a write access failed,
the channel control register CHC is checked for a valid data write access in progress.
The CHC register is obtained by referencing the signal frame stored on the Supervisor memory stack. Fortunately, the LRU register did not need to be saved on the
memory stack, because the LRU will remain unchanged during signal code execution. The LRU register is only updated when an address translation fails, this can not
happen when the operating system is running in physical address mode.
The DRAM page is copied into SRAM memory in bursts of 128 words. Bursting
is repeated several times depending on page size. Using long data bursts to transfer
data is most efficient. The LOADM and STOREM instructions remain in execute until all their data has been transferred, which is only dependent on the access delay of
the memory. Once the SRAM page has been filled, the user’s local registers are repaired via a call to load_locals, and a jump to ret_user starts the process of restoring
the processor to its state at the time of the trap.
page_in:
        srl     gr96,tav,PGSIZE+5       ;form VTAG
        sll     gr96,gr96,PGSIZE+5
        mfsr    gr97,mmu                ;get TID
        and     gr97,gr97,0xff
        or      gr96,gr96,gr97          ;or in TID
        ;
        const   gr100,0x00              ;PGM[1]=0 clean
        const   gr97,512 + 5*4 + CTX_CHC
        add     gr97,kmsp,gr97          ;get chc
        load    0,0,gr97,gr97
        mtsrim  fc,31-0
        extract gr97,gr97,gr97          ;rotate
        jmpf    gr97,i_page             ;test CV-bit
        const   gr99,0x4500             ;VE|UR|UE
        mtsrim  fc,1+31-15
        extract gr97,gr97,gr97          ;rotate LS-bit
        jmpt    gr97,i_page             ;jump for
        nop                             ; data load
        const   gr99,0x4700             ;VE|UR|UW|UE
        const   gr100,0x80              ;PGM[1]=1 dirty
i_page:
        or      gr97,gr96,gr99          ;or in permissions
        store   0,0,gr97,gr98           ;store Word 0
        mfsr    gr96,lru
        mttlb   gr96,gr97               ;assign TLB word 0
        ;
        add     gr96,gr96,1
        add     gr98,gr98,4             ;PTE adds Word 1
        srl     gr97,tav,PGSIZE
        sll     gr97,gr97,PGSIZE
        or      gr97,gr97,gr100         ;assign PGM[1]
        store   0,0,gr97,gr98           ;store Word 1
        sub     gr99,gr98,upte          ;set offset
        srl     gr99,gr99,3             ;set index
        sll     gr99,gr99,PGSIZE        ;cache page offset
        add     gr99,gr99,upte          ;cache RPN
        mttlb   gr96,gr99               ;assign TLB word 1
        ;
        mtsrim  cr,128-1
        const   gr96,(1<<PGSIZE)/512    ;burst count
        sub     gr96,gr96,2
        const   gr100,512
        srl     gr97,tav,PGSIZE         ;get page address
        sll     gr97,gr97,PGSIZE
more_in:
        loadm   0,0,lr0,gr97            ;read in a block
        storem  0,0,lr0,gr99            ;copy out a block
        add     gr97,gr97,gr100         ;advance pointer
        jmpfdec gr96,more_in
        add     gr99,gr99,gr100         ;advance pointer
        ;
        call    gr100,load_locals       ;destroys gr96
        nop
        jmp     ret_user
        nop
6.4.7 Copying a Page Out of the Cache
If a TLB miss occurs and all PTE entries for the associated set are marked valid,
then a PTE must be selected and the corresponding SRAM page copied back to
DRAM. This makes room for the page containing the miss address to be copied
into the space made available by the out–going page. The PTEs for the set are
scanned and if a non–dirty page is found, it is selected for paging–out. If all pages are
marked dirty, then a jump to label all_dirty is taken, as further column scanning is
required to determine if a page can be paged–out.
;All columns are in use. Select a column which is not locked and
;not dirty for paging out.
;Register gr98 points to a random column in current set.
page_out:
        call    gr100,store_locals      ;destroys gr96
        add     gr98,gr98,4             ;pnts to PTE word 1
        mfsr    gr96,tmc                ;get random number
        const   gr100,(1<<C_COLUMNS)-1  ;column counter
dirty_loop:
        jmpt    gr100,all_dirty
        and     gr96,gr96,((1<<C_COLUMNS)-1)<<3
        add     gr99,gr98,gr96          ;column wrap
        load    0,0,gr97,gr99           ;load PTE word 1
        add     gr96,gr96,8             ;next PTE entry
        sll     gr99,gr97,31-7          ;test PGM[1] dirty
        jmpt    gr99,dirty_loop
        sll     gr99,gr97,31-6          ;test PGM[0] locked
        jmpt    gr99,dirty_loop
        sub     gr100,gr100,1           ;dec column count
Once a PTE for the out–going page is selected, the two TLB entries for the
associated set must be checked to determine if they are caching an entry for the selected PTE. If there is a valid TLB entry, then it must be marked invalid as the
associated SRAM page is about to be assigned to a different virtual page address.
page_selected:
;Must first page-out selected cache page before filling the cache
;with the new selected page.
        sub     gr96,gr96,8
        add     gr98,gr98,gr96          ;adds of PTE Word 1
        ;
;Invalidate any processor TLB entries for the outgoing page.
;Could check VE bit in each TLB entry first.
        srl     gr96,gr97,PGSIZE+5      ;form VTAG
        sll     gr96,gr96,PGSIZE+5
        srl     gr100,gr97,PGSIZE-1     ;get TLB set
        and     gr100,gr100,0x3e
        mftlb   gr99,gr100              ;read Word 0
        srl     gr99,gr99,PGSIZE+5      ;form VTAG
        sll     gr99,gr99,PGSIZE+5
        cpeq    gr99,gr99,gr96
        jmpf    gr99,test_column_1
invalidate_tlb:
        const   gr99,0                  ;clear TLB VE-bit
        jmp     tlb_clear
        mttlb   gr100,gr99
test_column_1:
        add     gr100,gr100,64          ;Word 0 in column 1
        mftlb   gr99,gr100
        srl     gr99,gr99,PGSIZE+5      ;form VTAG
        sll     gr99,gr99,PGSIZE+5
        cpeq    gr99,gr99,gr96
        jmpt    gr99,invalidate_tlb
        nop
It is during the page–out routine that the maintenance of a dirty–bit pays back its
dividend. If the page is not dirty then there is no need to copy it back to DRAM, because the DRAM copy is exactly the same as the SRAM copy. If no writes have occurred to the page then the copy–out is avoided.
tlb_clear:
        sll     gr96,gr97,31-7          ;test dirty bit
        jmpf    gr96,page_in
        sub     gr98,gr98,4             ;gr98 pnts. word 0
        srl     gr97,gr97,PGSIZE        ;secondary mem RPN
        sll     gr97,gr97,PGSIZE
        sub     gr99,gr98,upte          ;set offset
        srl     gr99,gr99,3             ;set index
        sll     gr99,gr99,PGSIZE        ;cache page offset
        add     gr99,gr99,upte          ;cache RPN
        ;
        mtsrim  cr,128-1
        const   gr96,(1<<PGSIZE)/512    ;burst count
        sub     gr96,gr96,2
        const   gr100,512
The page–out routine, like the page–in routine, makes use of burst–mode data copying to greatly speed up the process of moving data.
more_out:
        loadm   0,0,lr0,gr99            ;read in a block
        storem  0,0,lr0,gr97            ;copy out a block
        add     gr97,gr97,gr100         ;advance pointer
        jmpfdec gr96,more_out
        add     gr99,gr99,gr100         ;advance pointer
        jmp     page_in                 ;gr98 pnts word 0
        nop
6.4.8 Cache Set Locked
The signal processing software, like the trapware, has its code ordered to deal
with the most frequently occurring events first. This results in shorter processing
times. There is no need to burden the simpler tasks with overheads supporting the
operation of less frequently occurring events. However, this does lead to some repetition in code for the most infrequent signal processing events. For example, if a page
must be copied–out and all the pages are marked dirty, then the PTEs in the set must
be scanned again to find an unlocked page. The selected page is then paged–out.
;All pages are dirty, page-out a non locked page
all_dirty:
        const   gr100,(1<<C_COLUMNS)-1  ;column counter
lock_loop:
        jmpt    gr100,cache_locked
        and     gr96,gr96,((1<<C_COLUMNS)-1)<<3
        add     gr99,gr98,gr96          ;column wrap
        load    0,0,gr97,gr99           ;load word 1
        add     gr96,gr96,8             ;next PTE entry
        sll     gr99,gr97,31-6          ;test PGM[0] lock
        jmpt    gr99,lock_loop
        sub     gr100,gr100,1           ;dec column count
        jmp     page_selected
        nop
        ;
If all pages associated with the current set are marked locked, then the signal
handler arranges to have the DRAM page mapped directly to the faulting virtual address. Accesses to the data and instructions contained in the page are then made to the slower secondary DRAM. The algorithm does not try to restore the page to SRAM at a later date.
;All columns for the current set are locked.
;Map the virtual address to non-cache secondary memory.
cache_locked:
        srl     gr96,tav,PGSIZE+5       ;form VTAG
        sll     gr96,gr96,PGSIZE+5
        mfsr    gr97,mmu                ;get TID
        and     gr97,gr97,0xff
        or      gr96,gr96,gr97          ;or in TID
        const   gr97,0x4700             ;VE|UR|UW|UE
        or      gr97,gr96,gr97          ;or in permissions
        mfsr    gr98,lru
        mttlb   gr98,gr97               ;assign Word 0
        ;
        add     gr98,gr98,1
        srl     gr96,tav,PGSIZE         ;form RPN
        sll     gr96,gr96,PGSIZE
        mttlb   gr98,gr96               ;assign Word 1
6.4.9 Returning from Signal Handler
When the signal handler has completed its processing, the context of the processor at the time of the original TLB trap must be restored and execution continued.
First, the user’s global registers, temporarily made use of by the operating system,
must be restored from the Supervisor memory stack. Interrupts must be disabled and
the processor state frozen while the special support registers are restored from the
signal frame. Once this has been accomplished and the memory stack is restored to its
pre–trap value, an IRET instruction is used to restart the instruction in execute at the
time the translation trap was taken.
;Pop registers off supervisor mode stack and
;return to program causing the TLB miss.
ret_user:
        pop     gr100,kmsp
        pop     gr99,kmsp
        pop     gr98,kmsp
        pop     gr97,kmsp
        pop     gr96,kmsp
        ;
        mtsrim  cps,0x73                ;disable interrupts
        pop     tav,kmsp                ;restore tav
        mtsrim  cps,0x473               ;turn on FREEZE
        popsr   ops,it0,kmsp
        popsr   alu,it0,kmsp
        popsr   chc,it0,kmsp
        popsr   chd,it0,kmsp
        popsr   cha,it0,kmsp
        popsr   pc2,it0,kmsp
        popsr   pc1,it0,kmsp
        popsr   pc0,it0,kmsp
        pop     rab,kmsp                ;pop rab
        pop     it1,kmsp                ;pop rsp
        add     gr1,it1,0               ;alu operation
        add     kmsp,kmsp,4             ;discount signal
        iret
6.4.10 Support Routines
The example code used two support routines to copy the 128 32–bit local registers to and from the Supervisor memory stack. Most operating systems assign all of
the local registers for use by the user’s application code. The large number of registers effectively implements a data cache. The advantage to having several registers is
that, unlike data memory, the register file supports simultaneous read and write access. In order to support maximum length data bursts on page transfers, the register
file is made available to the signal processing routine.
;Push local registers onto Supervisor M-stack
store_locals:
        const   gr96,512                ;Window Size
        sub     kmsp,kmsp,gr96
        mtsrim  cr,128-1                ;save 128 registers
        jmpi    gr100                   ;return
        storem  0,0,lr0,kmsp

;Pop local registers off Supervisor M-stack
load_locals:
        const   gr96,512                ;Window Size
        mtsrim  cr,128-1                ;load 128 registers
        loadm   0,0,lr0,kmsp
        jmpi    gr100                   ;return
        add     kmsp,kmsp,gr96
6.4.11 Performance Gain
The benefits of using a software controlled cache to take advantage of limited
SRAM availability should be seen in a reduced average number of cycles per application instruction. Ideally, the cache performance should approach that of a single large
SRAM memory system. However, the cost of TLB and cache maintenance is not insignificant, especially when small page sizes are used. Figure 6-16 and Figure 6-17
show the effective average cycle times per instruction observed for a 128 page cache
system. The cache memory was 2–1 and the secondary memory 4–1.
Figure 6-16. Cache Performance Gains with the Assembly Utility (effective number of cycles per instruction versus page size, 1K to 8K bytes; 4–1 DRAM joint I/D memory system compared with 4–1 DRAM plus a 128 page 2–1 cache)
Compare results for the smallest cache system of 128 1K byte cache pages. The
effective performance is more divergent from the maximum achievable SRAM performance with this cache size. When the page size is 2K bytes or greater, the cache
overhead reduces noticeably. With a DRAM–only system, an 8K byte page size
would be selected to reduce TLB handler support overheads. This means the 128K
byte cache model should really be compared with the 8K byte DRAM only model. In
this case, the cache achieved an average performance gain of 28% for the two utility
programs tested.
Using a cache has some additional benefits for embedded systems. Often initialization code and data are placed in EPROM, which can be slow to access. When the
EPROM is accessed, the associated page would be automatically copied to SRAM.
Additionally, application read/write data which is not located in an uninitialized data
(BSS) section, and therefore requires initialization before the application program
commences, would be automatically initialized from the EPROM data pages. This
will remove the burden from the operating system routines responsible for application environment initialization.
[Figure 6-17 plots the effective number of cycles per instruction (0 to 3) against page size (1K, 2K, 4K and 8K bytes) for the nroff utility, comparing a 4–1 DRAM joint I/D memory system, a 4–1 DRAM system using a 128 page 2–1 cache, and a 2–1 SRAM system.]
Figure 6-17. Cache Performance Gains with NROFF Utility
The software controlled cache benefits become larger when the secondary
DRAM memory becomes relatively slower. Figure 6-18 shows a comparison of a
4–2 DRAM system with a 128 1K byte page SRAM. The benchmark programs show
an average performance gain of 39.4%.
[Figure 6-18 compares the effective number of cycles per instruction (0 to 4) for the nroff and assembler benchmarks on four joint I/D memory systems: 4–2 DRAM with 8K pages, 4–1 DRAM with 8K pages, 4–2 DRAM with a 128 1K byte page 2–1 cache, and 2–1 SRAM with 8K pages.]
Figure 6-18. Comparing Cache Based Systems with DRAM Only Systems
Chapter 7
Software Debugging
This chapter supports software engineers wishing to develop application code
or operating system code for execution on a 29K RISC microprocessor.
Debugging tools which can be used in both a hardware and software debugging
role, such as in–circuit emulators and logic analyzers, are not described; that is left to
the individual tool manufacturer. The material presented concentrates on describing
the operation of inexpensive tools based on the MiniMON29K debug monitor and
Universal Debug Interface (UDI). Figure 7-1 shows the various tools used during
the different stages of an embedded processor–based project. Debug monitors are
typically used during the initial processor evaluation and selection stage, and later
when software is debugged with a working hardware system.
Also described are processor features which were specifically included in the
design for the purpose of debugging. The precise details of how these features are
configured to build a debug monitor such as MiniMON29K will not be described in
detail. This chapter is not intended to show how debug tools are constructed, but rather to show how existing tools can be utilized and describe their inherent limitations.
However, readers wishing to build their own tools will be able to glean the information required.
7.1 REGISTER ASSIGNMENT CONVENTION
The 29K processor calling convention divides the processor registers into two
groups: those available to the run–time application, and hence used by compiler generated code, and those reserved for operating system use.
All the 29K processor’s 128 local registers, used to implement the register stack
cache, are allocated to application code use. In addition, 32 (gr96–gr127) of the 64
global registers are assigned to application use, and the remaining group of 32 (gr64–gr95) are for operating system use.

[Figure 7-1 shows the tools used at each project stage: processor selection (packaged benchmarks, AMD literature, compiler tool chain, architectural simulator, evaluation board, application benchmarks); the start of system design, with a s/w path (compiler tool chain, instruction set simulator, evaluation board) and a h/w path (ASIC, waveform simulator, logic analyser) leading to initial s/w and initial h/w without a working memory system; s/w and h/w development (logic analyzer, scope and ICE) producing a debugged memory system; s/w and h/w integration (ICE, analyser plus disassembler, JTAG emulator, ROM emulator) producing s/w running on h/w; and finally MiniMON29K with working h/w and s/w.]
Figure 7-1. 29K Development and Debug Tools
The processor does not assign any particular task to the global registers in the
operating system group. However, over time a convention has evolved among 29K
processor users. The subdivision of global registers gr64–gr95 into subgroups was described in section 3.3, and is widely adhered to; the methods presented in this chapter continue with that convention.
The subgroups are known as: the interrupt Freeze mode temporaries (given synonyms it0–it3); the operating system temporaries (kt0–kt11); and the operating system static support registers (ks0–ks15).
7.2 PROCESSOR DEBUG SUPPORT
7.2.1 Execution Mode
The processor is in Supervisor mode whenever the SM bit in the Current Processor Status register (CPS) is 1. If the SM bit is 0, the processor is executing in User
mode. When operating in User mode the processor cannot access protected resources or execute privileged instructions.
Generally a processor maintains context information which refers to operating
system status and various user processes. Operating in User mode is a means of preventing a User mode process from accessing information which belongs to another
task or information that the operating system wishes to keep hidden.
If a User mode task breaks any of the privilege rules described in the processor’s
User Manual, then a protection violation trap is taken. Traps cause the operating system to regain control of execution. Typically the operating system will then send a
software signal to the User mode process reporting its violation and possibly stopping its execution. The exact action which takes place is particular to each operating
system implementation.
Besides preventing User mode programs from using processor instructions which are reserved for operating system use only, an operating system can precisely control a process’s access to memory and registers. This can be very useful when debugging
User mode software. The following section describes the processor’s memory management support. The register protection scheme is very simple. Special register
RBP is used to restrict banks of global registers to Supervisor mode access only. Each
bank consists of 16 registers and a 1 in each RBP bit position restricts the corresponding bank to Supervisor mode access only. Thus, it is normal to set RBP=0x3F, which
allows User mode processes to access global registers gr96 and higher. These are the
only registers which can be affected by compiler generated code. Note however, global registers gr0 and gr1, which perform special support tasks, are affected by compiler generated code and their access is not restricted by the RBP protection scheme.
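As a rough illustration of the convention just described, the fragment below computes the RBP value that leaves global registers gr96 and above accessible to User mode. It is only a sketch; the helper name is invented, and writing the result into the RBP special register (for example with an mtsrim instruction in start–up code) is left to the operating system.

#include <stdint.h>

/* Each RBP bit b, when set, restricts the bank of 16 global registers
 * gr[16*b] .. gr[16*b + 15] to Supervisor mode access.  With gr96 as the
 * first User accessible register, banks 0..5 must be protected, giving
 * the usual value 0x3F.  (Sketch only; not part of OS-boot.)
 */
static uint32_t rbp_value(unsigned first_user_gr)
{
    unsigned first_user_bank = first_user_gr / 16;   /* gr96 -> bank 6 */
    return ((uint32_t)1 << first_user_bank) - 1;     /* banks 0..5 set */
}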
7.2.2 Memory Access Protection
A number of the 29K processor family members are equipped with a Translation Look–aside buffer (TLB). It is intended for construction of a Memory Management Unit (MMU) scheme. A complete description of the TLB operation is given in
Memory Management Unit (Chapter 6).
An MMU is normally used to provide virtual memory support. However, it can
also play an important debugging role, even in embedded applications. Note, this
function is not intended to be performed by the Region Mapping facility provided on
some family members. The Region Mapping facility does not support the address
space granularity supported by the TLB hardware. In addition, Region Mapping in
some cases only allows address mapping to a limited region of physical memory. For
example, on the Am29200 microcontroller, only the DRAM memory and not the
ROM memory can be accessed in virtual address space.
When code is being developed, often an erroneous data reference will occur. If
no memory is located at the particular address then the target memory system should
generate a hardware access error (such as DERR or IERR on some family members).
However, address aliasing often results in the access being performed on some other
address location for which address decoding determines physical memory has been
assigned. This kind of programming bug can be difficult to detect. Using the TLB,
address access errors can be immediately detected and reported to the operating system via access protection violations.
The OS–boot operating system, used by many customers, can provide memory access protection by mapping virtual addresses to physical addresses one–to–one. This is adequate for many embedded applications where memory paging does not occur and application programs can be completely located in available memory. When an access violation occurs, OS–boot informs the MiniMON29K monitor, which reports the violation to the process controlling debugging. The details
of this mechanism are described in later sections.
Whether you intend to use OS–boot or some other operating system, it is likely
you would benefit from using the on–chip TLB hardware to support a more powerful
debug environment, via the detection of invalid memory references.
7.2.3 Trace Facility
Using the Trace Facility, a program can be executed one instruction at a time.
This allows the execution of a program to be followed and the state of the processor
to be examined and modified after each instruction has executed.
The 29K family has a four stage pipeline: Fetch, Decode, Execute and Write–
back. Tracing is enabled by setting the Trace enable (TE) bit in the CPS register.
When an instruction passes from the execute stage of the pipeline into the write–back
stage, the TE bit is copied into the TP bit. The Trap Pending (TP) bit is also located in
the CPS register, and when it becomes set the processor takes a trace trap. The Super-
visor mode code normally arranges for the vector table entry for the trace trap to
cause the debug monitor to gain control of the processor.
The debug monitor, normally MiniMON29K, uses the IRET instruction to restart program execution after the Trace trap handler has completed. Execution of an
IRET causes the Old Processor Status register (OPS) to be copied into the CPS register before the next program instruction is executed. The TP bit in the OPS is normally
cleared by the debug monitor before the IRET is executed. If the TE bit in the OPS is
set then tracing of the restarted instruction sequence shall continue after executing
the IRET.
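As a sketch of the sequence just described (not MiniMON29K’s actual source), the fragment below shows how a monitor might arm a single step: clear any pending TP and set TE in the saved OPS image before the operating system issues the IRET. The read_saved_ops()/write_saved_ops() helpers and the CPS_TE/CPS_TP bit masks are assumptions standing in for the monitor’s real register shadow access and the bit positions given in the processor User’s Manual.

#include <stdint.h>

extern uint32_t read_saved_ops(void);          /* hypothetical: shadow copy of OPS */
extern void     write_saved_ops(uint32_t ops); /* hypothetical: update shadow OPS  */

#define CPS_TE  ((uint32_t)1 << 14)   /* assumed bit position of Trace Enable */
#define CPS_TP  ((uint32_t)1 << 13)   /* assumed bit position of Trap Pending */

/* Arrange for a trace trap after one instruction of the restarted code. */
static void arm_single_step(void)
{
    uint32_t ops = read_saved_ops();
    ops &= ~CPS_TP;               /* no trace trap left pending                 */
    ops |=  CPS_TE;               /* TE is copied into TP as the first restarted
                                     instruction completes                      */
    write_saved_ops(ops);
    /* the operating system (or dbg_iret) then executes the IRET */
}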
Note, when the disable all (DA) bit in the CPS register is set the trace trap cannot
be taken unless the processor supports Monitor mode (described below). Should the
program being debugged issue an instruction such as ASNEQ, it will then take a trap
and the DA bit will become set. The OPS and CPS registers will have the TP bit set
but a trace trap will not be taken. This means that Freeze mode code (trap handlers
which execute with the DA bit set) cannot be debugged by a software debug monitor
unless the processor supports Monitor mode. Most members of the 29K processor
family do not support Monitor mode.
7.2.4 Program Counter register PC2
The instruction following a branch instruction, known as the delay instruction,
is executed regardless of the outcome of the branch. This performance improving
technique requires that two registers be used to record the addresses of the instructions currently in the execute and decode stages of the processor pipeline. When a
branch is taken the PC0 register contains the address of the target instruction as it enters the decode stage of the pipeline. Register PC1 always contains the address of the
instruction in execute. When the target instruction of a branch enters decode the
instruction in execute is the delay slot instruction following the branch.
Program counter registers PC0 and PC1 are required to restart the processor
pipeline in the event of a trap or an interrupt occurring. Many of the synchronous traps, such as a register access privilege violation, cause execution to be stopped with the address of the instruction causing the violation held in PC1 (execute address).
Asynchronous traps, such as an external interrupt, and instruction traps, such as ASSERT instructions, cause the address of the instruction following the one in execute
at the time of the interrupt to be held in the PC1 register. In fact when a trap or interrupt is taken the PC register values are frozen and used to restart program execution
later. The frozen PC values are held in a 3 register PC–buffer. Of course, the actual PC
registers continue to be used. Instructions such as MTSR and MFSR (move–to and
move–from special register) can be used to modify the PC–buffer register values.
The address of the instruction previously in execute and now in write–back is
held in the PC2 register. This is very convenient because a debugger can determine
the instruction which was in execute at the time the interrupt or trap occurred. The
trace trap is an asynchronous trap, and thus after the trap is taken the next instruction
about to execute is addressed by PC1. Some family members support Instruction
Breakpoint registers, which can be used to stop execution when a certain address
reaches execute. When this occurs a synchronous trap is taken and the instruction is
stopped before execution is completed.
Debug monitors, such as MiniMON29K, understand the operation of the PC
registers and can use them to control program execution. When MiniMON29K is
used with a processor which has no Breakpoint registers, a technique relying on temporarily replacing instructions with illegal opcode instructions is used to implement
breakpoints. Illegal opcode instructions are used in preference to trap instructions because execution is stopped with the PC–buffer recording execution a cycle earlier.
That is, the breakpoint address is in PC1 rather than PC2, as would happen with a trap
instruction.
One further useful feature of the PC2 register occurs when breakpoints are set to
the first instruction of a new instruction sequence — typically the first instruction of a
procedure. When the breakpoint is taken and program execution is stopped, the PC2
register contains the address of the delay slot instruction executed before the new
instruction sequence started. This is very useful in determining where a program was
previously executing.
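A small sketch of how a debugger might use these registers when execution stops is shown below; read_pc1() and read_pc2() are hypothetical wrappers (in the MiniMON29K case the values actually come from the DebugCore’s register shadow area rather than from mfsr instructions executed on the spot).

#include <stdint.h>
#include <stdio.h>

extern uint32_t read_pc1(void);   /* address of the next instruction to execute */
extern uint32_t read_pc2(void);   /* address previously in execute (write-back) */

/* Report where the program stopped and where it was executing just before. */
static void report_stop(void)
{
    printf("stopped before 0x%08lx, previously executing 0x%08lx\n",
           (unsigned long)read_pc1(), (unsigned long)read_pc2());
}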
7.2.5 Monitor Mode
Monitor Mode currently only applies to a limited number of 29K processors, see
Table 7-1. If a trap occurs when the DA bit in the CPS register is a 1, the processor
starts executing at address 16 in instruction ROM space. Monitor Mode is not entered
as a result of asynchronous events such as timer interrupts or activation of the
TRAP(1–0) or INTR(3–0) lines.
On taking a Monitor Mode trap the Reason Vector register (RSN) is set by the
processor to indicate the cause of the trap. Additionally, the MM bit in the CPS register is set to 1. When the MM bit is set, the shadow program counters (SPC0, SPC1,
and SPC2) are frozen, in a similar way to the FZ bit freezing the PC0–PC2 buffer
registers. Because the shadow program counters continue to record PC-BUS activity
when the FZ bit is set, they can be used to restart Freeze Mode execution. This is
achieved by an IRET or IRETINV instruction being executed while in Monitor
Mode.
Monitor mode traps are used by monitors in the debugging of trap and interrupt
handlers and are not intended for operating system use.
7.2.6 Instruction Breakpoints
Some members of the 29K processor family support Instruction Breakpoint registers, see Table 7-1. These registers can be used to stop a program’s execution when
an instruction at a specified address enters execute. The control mechanism for Breakpoints is flexible, allowing a User process ID to be specified.

Table 7-1. 29K Family On–chip Debug Support

Processor   Monitor mode   virtual memory   instruction breakpoints   data breakpoints   data breakpoint value ranges
Am29000     no             yes              –                         –                  –
Am29005     no             no               –                         –                  –
Am2903x     no             yes              –                         –                  –
Am29040     no             yes              2                         1                  no
Am29050     yes            yes              2                         –                  –
Am2920x     no             no               –                         –                  –
Am2924x     no             yes              –                         –                  –
Am29460     yes            yes              2                         1                  yes
With 3–bus processors, breakpoints can be assigned to Instruction space or
ROM space. Both of these spaces normally contain instructions but the ROM space
typically contains ROM rather than RAM memory devices. No matter which kind of
memory device is utilized the Breakpoint registers can be used.
When a processor does not support Breakpoint registers, illegal instructions or
traps are used to stop execution at desired address locations. Debug monitors are,
however, unable to manipulate instructions which are located in ROM devices. Thus the main use of the Breakpoint registers is to support breakpoints when ROM devices are in use. Additionally, they are used in the rare case where a 3–bus Harvard architecture memory system is constructed without providing a means for the processor to read and write instruction space. In this case the processor will not be able to replace instructions at breakpoint addresses with temporary illegal instructions.
The MiniMON29K debug monitor, described in detail later, must make some
decision about the values to put in the breakpoint register fields: BTE (break on
translation enable) and BPID. The debug tool user (Debugger Front End user) normally selects the process identifier (PID) of the application process containing the
breakpoint. However, the DFE often does not know if the 29K operating system is
running with address translation turned on. The DebugCore accesses a data structure
shared with the operating system to determine the value for the BTE field, see section
D.3.4. The operating system is required to fill in the appropriate sections of the shared
data structure, informing the DebugCore of the CPS register value to be used during
program execution. When the PI (physical instruction) bit in the CPS register is clear
the BTE bit is set.
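The decision just described amounts to a one–line test; a sketch is shown below, where CPS_PI is an assumed mask for the PI bit and the CPS value is the one the operating system places in the shared data structure.

#include <stdint.h>

#define CPS_PI  ((uint32_t)1 << 5)   /* assumed position of the PI (physical
                                        instruction) bit; see the User's Manual */

/* BTE (break on translation enable) is set when instruction address
 * translation will be on during execution, i.e. when the PI bit is clear. */
static unsigned bte_field(uint32_t os_cps)
{
    return (os_cps & CPS_PI) ? 0u : 1u;
}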
7.2.7 Data Breakpoints
Currently only the Am29040 processor supports Data Breakpoint registers, see
Table 7-1. These registers can be used to stop a program’s execution when data is either read or written from an address which lies within a specified range. The control
mechanism is flexible and shares much of the characteristics provided by the Instruction Breakpoints Control registers described in the previous section.
When an address match is detected, a trace trap is taken after the load or store
instruction is completed (this is also true for loadm and storem instructions). When a
trap is taken, the PC1 register points to the instruction following the load or store and
the data transfer has occurred.
To make effective use of data breakpoints it is important that the selected debugger has support for controlling the operation of the on–chip support registers. Data
Breakpoint Control registers are a relatively new feature and many debuggers have
not yet been extended to incorporate the necessary command and control functions.
7.3 THE MiniMON29K DEBUGGER
Developers of software for embedded applications are used to working with
emulators. They enable code to be down–loaded to application memory or installed
in substitute overlay memory. This avoids having the development delays associated
with running code from EPROM. The use of emulators may be a necessary stage in
first getting the target hardware functional; for this task their ability to work with partially functioning hardware makes them indispensable. However, once the processor
is able to execute out of target system memory and a communications channel such as
a serial link is available, the need for an emulator is reduced. Emulators are expensive, and it is not always possible to make one available to each team member. The use of a debug monitor such as the MiniMON29K monitor during the software debug stage of a project is an economical alternative to an emulator.

[Figure 7-2 shows the MiniMON29K debugger components: MonDFE, the debug process running on the DFE host processor, connected over a UDI link to MonTIP, the target–interface process running on the TIP host processor, which in turn connects over a private link to the 29K RISC processor and application–specific hardware, with the MiniMON29K DebugCore and OS–boot in ROM memory and the application downloaded from MonDFE into RAM memory.]
Figure 7-2. MiniMON29K Debugger Components
The MiniMON29K monitor is not intended to be a standalone monitor. That is,
it requires the support of a software module known as the Target Interface Process
(TIP). The TIP executes on a separate host processor. The embedded 29K target processor communicates with the TIP via a serial link or other higher performance channel (see Figure 7-2). The User–Interface process, known as the Debugger Front End
(DFE), communicates with the TIP via the inter–process communication mechanism
known as UDI which is described later.
Most monitors do not offer high level language support. Assembly code instructions must be debugged rather than the original, say C, code. Using GDB in conjunction with the MiniMON29K monitor enables source level code to be debugged,
which is far more productive and necessary for large software projects. (More on this
in the UDI section).
MiniMON29K has a small memory requirement, for both instruction memory
and data memory of the target 29K system. The size is reduced by implementing
much of the support code in the TIP host machine, and communicating with the target
via high–level messages. The amount of communication required is reduced by incorporating sophisticated control mechanisms in the target DebugCore.
Much of the following discussion in this section is concerned with describing the operating principles of the software components resident in the target hardware. Other MiniMON29K components such as MonTIP and MonDFE are described in the later UDI
sections.
7.3.1 The Target MiniMON29K Component
The embedded portion of the MiniMON29K monitor must be installed in target
system ROM or down–loaded by the host via a shared memory interface. The target
application code and additional operating system code can then be down–loaded via
the message system. If changes to the code are required, then the message system can
be used to quickly down–load new code without changing any ROM devices.
The software installed in the target hardware consists of a number of modules,
described in Figure 7-3. When the embedded Am29000 processor is reset, the initial
operating system module, OS, takes control. This module initializes the processor
and the other support modules. The monitor components are required to implement a
message communications driver and a debug control core (DebugCore).
The operating system module is not part of the MiniMON29K monitor. This allows developers to build their own operating system or make use of a 3rd–party real–
time executive product. However, AMD does supply processor initialization code
and HIF system call support routines. HIF is an embedded system call interface specification, which many of the 29K processor support library services make use of. The AMD supplied operating system code is known as OS–boot, and it is normally supplied in the same ROM containing the MiniMON29K target component software. (All of the OS–boot and MiniMON29K source code is freely available from AMD).

[Figure 7-3 shows the 29K target software modules: the OS module (OS–boot), providing run–time support, HIF, floating–point trapware, memory management and initialization; the MiniMON29K monitor modules MSG (message system), SER (communication drivers), DBG (DebugCore 2.0) and CFG (DebugCore configuration), with the link to MonTIP; and the down–loaded Application.]
Figure 7-3. 29K Target Software Module Configuration
7.3.2 Register Usage
The DebugCore, message driver and other MiniMON29K monitor modules do
not require any processor registers to be reserved for their use. This means that all the
processor registers are available for use by the operating system and application
code.
What this really means is that any registers temporarily used by MiniMON29K
code are always restored. The only exception to this occurs with global register gr4
and the TE and TP bits of the CPS special register.
Global register gr4 is implemented in some members of the 29K family but not
reported in the relevant User Manual, as it is never used by application or operating
system code. With family members which have no gr4 register, the ALU forwarding
logic can be used to keep a temporary register alive for 1 processor cycle following its
modification. The gr4 data is lost during the write–back stage when there is no real
gr4 register in the global register file. Note, software such as the MiniMON29K DebugCore can be difficult to debug because emulators also make use of gr4 in analyzing processor registers.
The TE and TP bits, located in the Current Processor Status register, belong to
the MiniMON29K DebugCore. However, the CPS register really belongs to the operating system and the OS should not modify the TE and TP bits which are maintained by the DebugCore. When the operating system issues an IRET instruction it
updates the CPS register with the contents of the OPS register. Normally the DebugCore will set the TE bit in the OPS before the operating system performs an IRET.
However, initially the operating system must call the support routine dbg_iret() to
perform the IRET on behalf of the operating system. This gives the DebugCore an
opportunity to gain control of the TE bit.
7.3.3 The DebugCore
The TIP host processor controlling the target 29K processor sends messages via
the available link to the DebugCore module. The message system enables the host to
examine and change registers and memory in the target hardware. Program execution can also be controlled via instruction breakpoints and single stepping instructions. Messages are provided specifically for controlling processor execution.
The DebugCore decodes the messages, giving access to the 29K processor registers and the target system memory. However, it does not access the non–processor
resources directly. The Configuration Control Module supports the peek and poke
functions shown below. These functions are used for all non–register space target
hardware access. Note, all functions and data variables defined in the configuration
MiniMON29K module begin with the cfg_ prefix.
void cfg_peek(to, from, count, space, size)
void cfg_poke(to, from, count, space, size)
The peek function is used to read from target space into temporary debug core
BSS memory space. The poke function is used when writing to target space. The
‘space’ parameter is used to indicate the target address space, according to the received message parameters. Typical space field values would enable instruction
space, data space or I/O space access. The ‘size’ field is used to indicate the size, in
bytes, of the objects being transferred. The CFG module normally tries to make
memory accesses in the size indicated. However, if a memory system does not support, say, byte–write access to ROM–space, then the CFG access functions can be
configured to perform byte manipulation via word–sized memory accesses. By keeping the access functions separate, a user can configure the peek and poke functions
for any special requirements without having to understand the DebugCore module.
Peek and poke functions are supplied for typical existing target hardware.
For example, if a system uses Flash memory devices, the erase and programming sequences required to write to Flash should be built into the cfg_poke() procedure. If Flash and other device types, such as DRAM, are used in the same memory
space, then the cfg_poke() procedure can examine the address value in the ‘to’ parameter to determine the correct operation. Recent versions of MiniMON29K have
included CFG module support for Flash memory devices.
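A minimal sketch of this kind of address–based dispatch is shown below. The FLASH_BASE/FLASH_LIMIT window, the flash_write() programming routine and the parameter types are assumptions for illustration; a real CFG module also honours the ‘space’ and ‘size’ parameters and matches the board’s actual memory map.

#include <stdint.h>
#include <string.h>

#define FLASH_BASE   0x80000000u          /* hypothetical Flash window */
#define FLASH_LIMIT  0x80100000u

extern void flash_write(void *to, const void *from, unsigned count); /* erase + program */

void cfg_poke(void *to, const void *from, unsigned count,
              unsigned space, unsigned size)
{
    uintptr_t addr = (uintptr_t)to;

    (void)space;                          /* this sketch ignores the space and */
    (void)size;                           /* size hints; a real CFG does not   */

    if (addr >= FLASH_BASE && addr < FLASH_LIMIT)
        flash_write(to, from, count);     /* Flash needs its program sequence  */
    else
        memcpy(to, from, count);          /* RAM can simply be copied to       */
}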
When the target processor stops executing operating system or application
code, a context switch occurs into the DebugCore context. The state of the processor
is recorded when switching context, thus enabling execution to be resumed without
any apparent interruption. The DebugCore context may be entered for a number of
reasons, such as: a message was received from the TIP host, an instruction breakpoint
was encountered, a memory access violation occurred. Whenever the DebugCore
gains control a ‘halt’ message is sent to the TIP host processor. The TIP host and target can then exchange messages as necessary to analyse or change the state of the
processor or memory.
DebugCore 2.0 shares a data structure with the operating system (OS). Vector table entry 71 is initialized by the OS to point to the data structure. Appendix D describes the DebugCore and OS interface in detail. The data structure is mainly used to pass the address of entry points within the two software modules. Address labels can be determined at link time. However, when, say, a new OS is loaded at run time it must reconnect with the DebugCore. This requires that address labels be available at run time. In addition to the address labels there are various fields which support the DebugCore in installing per–process breakpoints and requesting OS supplied service functions.
7.3.4 DebugCore installation
It is very simple to install the DebugCore with any operating system. What is mainly required is the use of a number of Vector Table entries which are not normally required for operating system operation, and a call to the DebugCore initialization routine dbg_control(). Figure 7-4 shows the vector table entries required. The two
most obvious entries are for trap number 0 (illegal opcode) and number 15 (trace
trap). These table entries point to the DebugCore entry labels dbg_V_bkpt and
dbg_V_trace, respectively. Note, all the DBG module functions and data structure
names begin with the dbg_ prefix.
OS Cold–Start
The operating system is responsible for inserting the necessary address labels
into the vector table. Any vector table entries which are not required by the operating
system can be defaulted to the DebugCore, via the dbg_trap entry. When this entry is
used, register gr64 should contain the original trap number (see section D.4.1 for alternatives). It can be very useful to direct traps such as protection violation (number
5) and Data TLB protection Violation (number 13) into the DebugCore. This is much
better than just issuing a HALT instruction in the operating system. When a trap is
taken into the DebugCore a message is sent to the MonTIP process which will inform
the DFE process when execution has halted. DFEs such as MonDFE and GDB understand the 29K trap number assignment and can report a trap number 13 as a User
mode data protection violation (Segmentation fault in Unix language).
Initializing the vector table is part of what is known as the operating system
cold–start code. The operating system start–up sequence is shown in Figure 7-5.
When the processor’s power is applied, or when the *RESET pin is asserted, the
CPU begins executing instructions at location 0 in ROM instruction space (ROM
space and instruction space are the same in many 29K family members). Control is
usually passed directly to the operating system cold–start code. To save the contents
of all the processor’s registers before the system is initialized, the user may modify
the code in the operating system to jump to the debugger core–dump facility. Once the registers have been saved, then the cold–start code is executed by passing control to the os_cold_start label in ROM space.

Vector table entries (relative to VAB):
  0    V_BKPT      dbg_V_bkpt
  5    V_PROT      dbg_trap, gr64=5
  15   V_TRACE     dbg_V_trace
  16   V_INTR0     msg_intr (example)
  71   V_71        OS/DebugCore shared data structure
  75   V_DBG_MSG   dbg_V_msg (virtual interrupt)
  76   V_OS_MSG    os_V_msg (virtual interrupt)
  255  V_RESET
Figure 7-4. Vector Table Assignment for DebugCore 2.0
Normally an operating system will begin cold–start code immediately at address 0. However, certain software bugs may cause program execution to become
out–of–control, and the only way to regain control is to activate the processor reset
pin. This is particularly the case when the TLB registers are not used by the operating
system to implement address access protection. A jump to dbg_coredump at address 0 enables the processor state to be recorded at the time reset was asserted. By examining the PC and channel special registers, some understanding of the cause of the loss of proper program execution may be gained. To restart execution after the
core–dump data has been examined, a MiniMON29K RESET message must be issued by MonTIP. This causes the dbg_trap_num variable to be cleared and the processor state to be restored to the hardware reset condition before execution is started
at address os_cold_start.
DebugCore 2.0 requires that vector table entry 71 point to a memory region
shared by the DebugCore and the operating system. The operating system must initialize several fields of the shared data structure, see section D.3. For DebugCore 1.0 compatibility, the data structure can be initialized to zero. After the interrupt and trap handler vectors are installed, the cold–start code performs one–time initialization of target system hardware, then calls msg_init() to initialize the message system and the underlying communication drivers. The precise action taken by msg_init() is dependent on the communications hardware used to support message sending.
When the cold–start sequence is complete, a call is made to dbg_control()
which initializes the DebugCore. The point at which the entry to the DebugCore is made actually defines the boundary between operating system cold–start and warm–start code.

[Figure 7-5 shows the processor initialization code sequence: address 0 holds an optional jump to the dbg_coredump entry, and address 16 a jump to dbg_m_trap (the Monitor mode DebugCore entry). The operating system cold–start code at os_cold_start sets VAB, calls msg_init(), then calls dbg_control(), which acts as a built–in breakpoint. The warm–start code which follows analyzes the return values, enables tracing and jumps to dbg_iret.]
Figure 7-5. Processor Initialization Code Sequence

The parameters passed to the function are shown below:
return_struct dbg_control(
        int   dbg_trap_num,    /* lr2 value */
        int  *os_info_p)       /* lr3 value */
It is called just like a C language routine. Local register lr2 contains a copy of the
value held at memory location dbg_trap_num. Register lr3 contains the address of a
data structure which describes the memory layout of the target system. The operating
system is responsible for determining the amount and address range of available
memory. Although this information is passed to the DebugCore, it does not itself require this information. It merely keeps a record of the relevant data structure address
so it can pass the information to the DFE process. Debug tool users interact with the
DFE and generally like to know about the target memory availability. Figure 7-6
shows the layout of the structure passed to the DebugCore. Note, where a 29K system
is based on a single memory space containing both instructions and data, the d_mem,
i_mem and r_mem parameters are the same.
The lr2 parameter is required so that the DebugCore knows whether a call to dbg_coredump has already been performed. Whenever the DebugCore is entered, the variable dbg_trap_num takes on the trap number causing DebugCore invocation; for example, number 15 when a trace trap occurs. When a core dump has been performed, trap number 255 is recorded, and when the DebugCore is reentered with this number the state of the processor is not recorded again.
This is necessary because the call to dbg_control() appears as a built–in breakpoint. Whenever a breakpoint is taken the complete state of the processor is recorded,
in effect a context switch into the DebugCore occurs. The original context is restored
when the DebugCore receives a GO or STEP message from the MonTIP process.
Whenever the DebugCore gains control a HALT message is sent to MonTIP. Under
DFE direction, MonTIP can then send messages to the DebugCore to examine and change the saved processor status.

[Figure 7-6 shows the operating system information structure pointed to by register lr3 when dbg_control() is called. All structure members are 32–bit in size and are laid out, from lowest to highest address, as: i_mem_start, i_mem_size, d_mem_start, d_mem_size, r_mem_start, r_mem_size, Am29027_prl, OS_version.]
Figure 7-6. Operating System Information Passed to dbg_control()
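For reference, the same layout can be written as a C structure. The sketch below uses the figure’s field names with an illustrative struct tag; the real definition should be taken from the OS–boot sources.

#include <stdint.h>

/* Layout of the block pointed to by lr3 (os_info_p) when dbg_control()
 * is called; all members are 32-bit, lowest address first.  Sketch only. */
struct os_info {
    uint32_t i_mem_start;    /* instruction memory start and size         */
    uint32_t i_mem_size;
    uint32_t d_mem_start;    /* data memory start and size                */
    uint32_t d_mem_size;
    uint32_t r_mem_start;    /* third memory region (same as the others
                                on a single memory space system)          */
    uint32_t r_mem_size;
    uint32_t Am29027_prl;    /* Am29027 coprocessor revision information  */
    uint32_t OS_version;
};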
OS Warm–Start
The DebugCore records the return address for dbg_control() when it is first
called. The address is important because it is the start of the operating system warm–
start code. When an application program is down–loaded to the target hardware, an
INIT message is normally sent. The message contains information extracted from the
application COFF file. This information along with other operating system run–time
support data is passed to the operating system when the dbg_control() function returns. As is normal for C procedures, the return information is placed in global registers starting with gr96. Figure 7-7 shows the format of the operating system warm–
start data.
After the DFE (MonDFE for example) has been instructed to load a new program into memory, the return registers can be examined to verify their contents.
Note, with some DFEs it is possible to load a COFF file without sending an INIT message. In this case the return registers are not affected and the PCs are not forced to the
dbg_control() return address.
After loading a program a user will normally start execution, which causes the
DebugCore to switch out of context and restore the context described in the register
shadow memory. If an INIT message was received then execution will commence in
the operating system warm–start code. Otherwise, it continues from wherever the restored PC registers are located. Warm–start code normally examines the return structure values and prepares the operating system run–time support accordingly. For example, register gr100 contains the start address of the down–loaded application program. The address value may be loaded in the PC–buffer registers before an IRET
instruction is used to start program execution. However, it is important to note that
the warm–start operation is entirely operating system dependent, and the code need pay no attention to the return structure information. The operation of OS–boot, normally supplied along with MiniMON29K, is described in a later section.

gr96–gr99   start and end addresses of the program text and program data
gr100       first instruction of User loaded code
gr101       memory stack size
gr102       register stack size
gr103       start of command line args (argv)
gr104       Operating system control info.
gr105       this register always 0
Figure 7-7. Return Structure from dbg_control()
7.3.5 Advanced DBG and CFG Module Features
Normally the call to dbg_control() implies that a built–in breakpoint should be
taken. This gives the user an opportunity to down–load an application program before execution is continued. However, by setting the call lr2 parameter to V_NOBRK
(254), no breakpoint will be taken and the call will return with no need for a GO message from MonTIP. This enables the DebugCore to be initialized for operation, and is
useful where there is no requirement to download an application program. Of course
there are no call return values for the operating system warm–start to examine. The
facility enables the DebugCore to remain in a final system and only be called upon in
an emergency such as a memory access violation.
The CFG module is used to configure the operation of the DBG module. There
is really no need to have the source code for the DBG module, only the CFG module.
After configuring the CFG, it can be assembled and linked with the .o debug core
modules (dbg_core.o and dbg.o). The CFG supplies the cfg_peek() and cfg_poke()
functions, as well as defining the number of breakpoints supported and the size of the
DebugCore message send buffer. Note, however, that there is conditional assembly
code in the CFG module for a wide range of target hardware systems. In practice configuring CFG normally means defining the correct symbol value during assembly.
Whenever the DebugCore is entered, the routine cfg_core_enter() is called.
This gives the DebugCore user an opportunity to control the state of the processor
during DebugCore operation. For example, normally the DebugCore runs with the
on–chip timer turned off. This means no timer progress is made and no timer interrupts will occur while the DebugCore is in context. The timer can be re–enabled by
changing the code in cfg_core_enter(). The supplied code also locks the processor
cache (only with processor members supporting cache). This prevents application
and operating system relevant data being displaced with DebugCore information.
The DebugCore is mainly written in the C language and makes use of application space processor registers during its operation. On taking, say, a breakpoint and
entering the DebugCore, all the processor registers are copied to shadow memory
locations. Users examine and change the shadow values before they are returned to
registers when the DebugCore context is exited. It is possible that an external hardware device could generate an interrupt when the DebugCore is in–context (interrupts may be enabled in the cfg_core_enter() procedure). This could cause some
confusion as the interrupt handler may wish to modify some operating system assigned registers to record a change in the interrupting device status. The change
would be lost when the DebugCore exited. To overcome this problem, global registers gr64–gr95 are not shadowed if memory location dbg_shadow_os contains a 0
(normally set to –1). This can be done in the cfg_core_enter() procedure.
When dbg_shadow_os is cleared, physical registers gr64–gr95 are always accessed with MiniMON29K READ and WRITE messages. However, messages such
as FIND and COPY operate on the shadow copies only, and this creates some minor
restrictions in DebugCore operation.
If cfg_core_enter() is modified to enable the on–chip timer to continue interrupting during DebugCore operation, then memory location dbg_shadow_timer
should also be set to 0 (normally –1). This prevents the TMR and TMC timer special
registers from being restored from their corresponding shadow memory locations
when the DebugCore context is exited.
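A sketch of the kind of change described above is shown below. The timer_resume() helper and the cfg_core_enter() prototype shown are assumptions; the real cfg_core_enter() supplied with MiniMON29K should be used as the starting point.

extern int  dbg_shadow_timer;     /* DebugCore flag: 0 = do not shadow TMR/TMC */
extern void timer_resume(void);   /* hypothetical: re-enable the on-chip timer */

/* Called by the DebugCore every time it gains control. */
void cfg_core_enter(void)
{
    dbg_shadow_timer = 0;         /* normally -1; 0 stops TMR/TMC being
                                     restored from shadow memory on exit       */
    timer_resume();               /* let timer interrupts continue while the
                                     DebugCore is in context                   */
}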
Interrupts must be enabled during DebugCore operation if, say, an interrupt
driven UART is being used for MiniMON29K message communication. It is sometimes possible to use the message system in a poll–mode (described in the following
section), in this case interrupts can be disabled. Additionally, it may be possible to
selectively enable device interrupts in cfg_core_enter(). However, care should be
taken if any of the interrupts require C level context for interrupt processing. The DebugCore continues to use the register stack in place at the time the DebugCore was
entered. The DebugCore will not need to lower the stack support registers, but any C
level interrupt handler may make temporary use of the stack (this is very much operating system dependent). Further, it is important that no attempt is made to reenter
the DebugCore, via, say, a memory access error during an interrupt service routine
which interrupted the DebugCore operation.
Breakpoints located at both physical and virtual addresses are supported if the processor has on–chip breakpoint control registers. Without breakpoint registers, breakpoints are always located at physical addresses. However, per–process breakpoints are supported even if the processor has no on–chip MMU support, or if the MMU is not in use because separate processes are each running in Supervisor mode.
Breakpoint capabilities are presented in detail in section D.3.
7.3.6 The Message System
After the message system has been initialized with a call to msg_init(), the DebugCore responds to MonTIP host messages appropriately and sends acknowledge
messages to the host containing any requested data. The operating system can also
make use of the message system to support application services such as access to the
file system on the TIP host machine. The msg_send() function is used to request a
single message be sent. A similar function is made available by the message system
module on the TIP host processor.
int msg_send(struct message *msg_pointer);
The function returns 0 if the message was accepted for sending and –1 if the
message system is currently too busy. Variable msg_sbuf_p is maintained by MSG
to point to the message buffer currently being sent. When this variable becomes 0, the
message system is ready to send another message. The message buffer pointer passed to msg_send() is copied into msg_sbuf_p; the contents of the buffer are not copied.
Thus the user must be careful not to modify the buffer data until the message has been
completely sent.
Messages are received by asking the message system to poll the message driver
hardware until a message is available. Function msg_wait_for() is provided for this
task. Alternatively, the message system can interrupt the operating system or the DebugCore when a message is received from the TIP host processor. Received messages are normally located at address msg_rbuf. There is no danger of the receive
buffer being overwritten by a new incoming message, as the MonTIP always expects to receive a message before it will reply with a new message to the target.
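A usage sketch following these rules is shown below; the struct message declaration and the extern prototypes are illustrative stand–ins for the MiniMON29K headers.

struct message;                              /* layout defined by the MSG module   */

extern int             msg_send(struct message *msg_pointer);
extern struct message *msg_sbuf_p;           /* buffer currently being transmitted */

/* Send one message and wait until the buffer may safely be reused. */
static void send_and_wait(struct message *msg)
{
    while (msg_send(msg) == -1)
        ;                                    /* message system busy: try again     */
    while (msg_sbuf_p != 0)
        ;                                    /* buffer still being sent            */
}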
7.3.7 MSG Operation
The MSG module may require the support of communications port specific
driver modules, most notably the SER module. This module contains the code necessary to talk to serial communication UARTs which support target and MonTIP connection. The MSG contains a number of shared memory communication drivers for IBM PC–AT plug–in cards, such as the PCEB, EB29K, YARC and others.
Messages all have the same format: a 32–bit message number, then a 32–bit length field, followed by any message–related data. When the MSG determines that a
new message has been received, and its message number is greater than 64, the operating system is interrupted (if interrupts are enabled), and execution continues at the
address given in the vector table for entry number V_OS_MSG (76). In OS–boot this
is address os_V_msg. This means that the operating system does not have to poll the
message interface for service request completion. Polling is required when the message system can continue to operate with interrupts turned off. The message system
can be used to support HIF services (see the later OS–boot section).
Received messages with identification numbers less than 64 are intended for the
DebugCore. The MSG causes the DebugCore to be interrupted via vector table entry
V_DBG_MSG (75). This causes execution to continue at address dbg_V_msg.
When execution begins at this address, the processor state appears as if a hardware
interrupt has just occurred while executing User mode code or an operating system
service. The virtual interrupt mechanism is used to support this technique and is described below.
7.3.8 MSG Virtual Interrupt Mechanism
Consider what happens when a UART receives a character and an interrupt is
generated:
1. The UART serial driver enters Freeze mode and execution continues at the address given in the vector table for the interrupt handler. (Note, it is the operating system cold–start code’s responsibility to install the trap handler for this interrupt, even if a MiniMON29K SER module driver is used).
2. Next the SER driver saves some global registers to memory.
3. The driver talks to the UART, receives the character and places the new data into the msg_rbuf buffer at the location given by the pointer msg_next_p. The registers are restored and the pointer incremented.
4. The SER driver then jumps (virtual vectors) to address msg_V_arrive in the MSG module. This whole procedure appears to the message system as if the interrupt had been directed to msg_V_arrive when a character arrived in its buffer.
5. The MSG saves its own working register space and examines the size of the incoming message and decides if it is complete or if more data is required. If incomplete the registers are repaired and an IRET is issued. When complete, working registers are repaired and the PC–buffer registers are updated with the address of the operating system handler or DebugCore handler accessed from the vector table.
Using the sequence described above, messages arrive via a V_DBG_MSG or
V_OS_MSG virtual interrupt directly to the appropriate message processing handler. The operating system and the DebugCore need never be concerned about any
registers used by the MSG or SER modules in the process of preparing the received
message, as their temporary register usage is kept hidden.
When interrupts are being used, rather than polling for a new message to arrive,
the msg_wait_for() function simply returns 0 indicating that no message is available. If the SER module is making use of polling and interrupts are turned off, then
the msg_wait_for() function returns –1 when a complete new message is available in
the msg_rbuf. In fact the MSG sets variable msg_rbuf_p to point to the just–received message buffer. The DebugCore interrupt handler dereferences this pointer
when accessing any received messages.
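In poll–mode the receive side reduces to a loop like the sketch below; the msg_wait_for() prototype and the other declarations are assumptions (the real ones are in the MiniMON29K sources), and the loop relies on the –1 return described above.

struct message;

extern int             msg_wait_for(void);   /* assumed prototype                   */
extern struct message *msg_rbuf_p;           /* set by MSG to the received message  */

/* Poll the message driver until a complete message has arrived. */
static struct message *receive_message(void)
{
    while (msg_wait_for() != -1)
        ;                                    /* -1 means a message is in msg_rbuf   */
    return msg_rbuf_p;
}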
7.4 THE OS–BOOT OPERATING SYSTEM
MiniMON29K is a debugger. It does not initialize the processor, service interrupts, support HIF system calls or even install itself into the target system. All these
tasks must be performed by an operating system. It does seem a rather grand title but
OS–boot does perform these tasks. If a user does not build an operating system or buy
an operating system from a third party then OS–boot may be adequate for their project needs.
AMD generally supplies OS–boot along with MiniMON29K for each 29K
evaluation system. Because OS–boot supports the HIF system call services it is useful for running evaluation software. However, OS–boot is a simple operating system; it does not support multi–tasking or other grander operating system concepts. As well
as supplying MiniMON29K and OS–boot in EPROM, users get the source to OS–
boot, enabling them to make any necessary changes.
Typically, users will add operating system code to support additional peripheral devices, or use OS–boot as a means of launching into another, more sophisticated operating system. This is described in more detail later. The technique is useful because it avoids the need to install MiniMON29K with the new operating system in
EPROM. The new operating system need merely be down–loaded via MiniMON29K debugger messages into available target memory.
This section does not describe OS–boot in detail. It is mainly an overview of its
operation. Hopefully users will gain an understanding of its relevance in the debug
process.
7.4.1 Register Usage
According to the register usage convention, an operating system is free to use
global registers in the range gr64–gr95. OS–boot uses a good number of these registers. Many of the floating–point instructions and some integer instructions are not
implemented directly by hardware with some members of the 29K family. This requires that trapware be used to support the non–existing instructions. The floating–
point trapware included with OS–boot requires as many as 15 temporary registers
and three static registers to support the trapware code. OS–boot is typically configured to assign registers it0–kt11 (gr64–gr79) for temporary use and ks13–ks15
(gr93–gr95) for static use.
The exact register assignment for OS–boot is determined by file register.s in the
osboot directory. Other than trapware support, registers are required for run–time
management and HIF services. These registers are typically allocated from the range
ks0–ks12 (gr80–gr92). There are a number of free registers available to those who need to add operating system support code.
7.4.2 OS–boot Operation
Operation begins at address label os_cold_start. The processor special registers, such as CPS and CFG, are initialized to enable the processor start–up sequence
to commence. OS–boot does not contain very much cold–start code. However, the
code is complicated by the incorporation of options enabling any member of the 29K
family to be dealt with.
The vector table entries are constructed. Most of the unused entries are set to
cause DebugCore entry. Thus, should any unexpected trap or interrupt happen the
DebugCore will be able to report it. The vector table is normally placed at the start of
data memory.
The memory system is then analyzed in the process of building the data structure passed to dbg_control(). In some cases this involves the operation of dynamic
memory sizing code. The floating–point trap handlers are then prepared for operation. Initialization of floating–point support is a one–time operation, so it occurs before dbg_control() is called.
Before the cold–start operation is complete, additional vector table entries are
made to support DebugCore operation, entries such as V_TRACE. The DebugCore/
OS shared data structure is then initialized and vector table entry 71 is set to point to
the base of the data structure. The message system is then initialized with a call to
msg_init() and dbg_control() is called, indicating the completion of operating system cold–start code.
The return from dbg_control() causes execution of the operating system
warm–start code to commence at address warm_start. The run time environment is
now prepared. Much of this is concerned with memory management. The memory
and register stack support registers are assigned values before any loaded application
code starts. The warm–start code examines the return parameters from dbg_control() in preparing the run–time environment.
With 29K family members which have TLB hardware, OS–boot is normally
configured to start application code execution in User mode with address translation
turned on. Warm–start code gets the application code start address from return register gr100. This address is loaded into the frozen PC–buffer registers and an IRET is used to leave the operating system Supervisor mode code and enter the application
code in User mode. Register gr104 is used to select operating system warm–start options. If bit 31 is set then application code is started with no address translation enabled. (To use this feature set gr104 to –1 after using the MonDFE y command to
yank–in application code into target system memory.) Note, warm–start code does
not issue an IRET instruction directly, it jumps to the DebugCore service dbg_iret.
This enables the DebugCore to set the TE bit in the OPS register and so enable single
stepping of the first application code instruction. Additionally the BTE and BPID
fields of any breakpoint registers in use are also set by dbg_iret.
7.4.3 HIF Services
Once application code has started, operating system code will only be called into play again when: a floating–point trap occurs; a peripheral generates an interrupt; or a HIF service is requested. HIF is a system call interface specification.
OS–boot supplies the necessary support code which is accessed by a system call trap
instruction. Many of the library calls, such as printf(), result in HIF trapware being
called. HIF trapware support starts at address label HIFTrap.
HIF services are divided into two groups, those that can be satisfied by the 29K
itself (such as the sysalloc service), and those that need MonTIP support (such as
open). The HIF specification states that the service request number be placed in register gr121; if this number is less than 256 then MonTIP must assist. A request for MonTIP assistance (to, say, open a file for writing) is accomplished by the operating
system sending a MiniMON29K message to the TIP process. There are currently
three types of messages used by the OS: HIF–request, CHANNEL1 (used when
printing to stdout), and CHANNEL0_ACK (used when acknowledging data from
stdin). Note, it is easy to extend the operating system message system usage and
create new operating system message types. This may be useful if virtual memory
paging was being supported by an operating system, where the MonTIP was acting as
the secondary memory controller.
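The dispatch decision just described can be summarized by the following C sketch. It is only an illustration of the logic: OS–boot performs this test in 29K assembly inside the HIF trapware, and the function and message names used here are invented for the example rather than taken from the OS–boot source.

    /* Sketch only: the HIF dispatch decision described above, written in C
       for clarity. OS-boot implements this in 29K assembly; the helper
       names are illustrative.                                              */

    #define HIF_LOCAL_MIN  256            /* services >= 256 handled on the 29K */

    extern unsigned long hif_service_number(void);       /* value passed in gr121 */
    extern int  local_hif_service(unsigned long svc);     /* e.g. sysalloc         */
    extern void send_minimon_message(unsigned long svc);  /* HIF-request to MonTIP */
    extern void wait_for_hif_ack(void);

    int hif_dispatch(void)
    {
        unsigned long svc = hif_service_number();

        if (svc >= HIF_LOCAL_MIN) {
            /* Services such as sysalloc can be satisfied by the 29K itself. */
            return local_hif_service(svc);
        }
        /* Services such as open need MonTIP assistance: send a HIF-request
           message and wait for the corresponding HIF_ACK reply message.    */
        send_minimon_message(svc);
        wait_for_hif_ack();
        return 0;
    }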
MonTIP replies to HIF MiniMON29K messages by sending messages to the
DebugCore to accomplish the requested task. It then sends a HIF_ACK message to
the operating system acknowledging the completion of the requested service.
CHANNEL1 and CHANNEL0 messages are used by the operating system to
support display and keyboard data passing between the application program and the
user. Note these are the only operating system messages which the MonTIP passes
via UDI to the MonDFE process. MonTIP responds to stdout service requests with
CHANNEL1_ACK message, and supplies new keyboard input characters with a
CHANNEL0 message sent to the operating system. (Note, some early versions of
MonTIP did not make use of the operating system *_ACK messages, they used the
DebugCore instead. This created difficulties for multitasking operating systems. If
you have this problem, you need to update your MonTIP program.)
Previously, the OS–boot implementation entered Wait mode after issuing a MiniMON29K message. This was accomplished by setting the WM bit in the OPS register before using an IRET to return to application code from the HIF trap handler. Wait mode was exited when the message system interrupted the operating system in response to a MonTIP reply–message. Because Wait mode was used, OS–boot had to run with interrupts turned on. However, the MiniMON29K DebugCore has no such restriction and can operate in a poll–mode fashion. Recent versions of OS–boot can also operate the message system in poll–mode and need not have interrupts permanently enabled. The latest OS–boot code no longer uses Wait mode while waiting for a message system interrupt. Either the message system or a flag variable is continually polled; the flag is set by the message system interrupt handler which previously cleared the WM bit.
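The poll–mode wait can be pictured with the short C sketch below. The flag and function names are illustrative only; they are not the identifiers used in the OS–boot source.

    /* Sketch only: the poll-mode wait described above. */

    static volatile int msg_reply_arrived = 0;   /* set by the message-system
                                                    interrupt handler          */
    extern int poll_message_system(void);        /* non-interrupt alternative  */

    void wait_for_reply(void)
    {
        /* Instead of setting the WM bit and entering Wait mode, spin until
           either the interrupt handler sets the flag or a poll of the message
           system reports that the MonTIP reply message has arrived.          */
        while (!msg_reply_arrived && !poll_message_system())
            ;                                    /* busy-wait                  */
        msg_reply_arrived = 0;                   /* consume the event          */
    }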
7.4.4 Adding New Device Drivers
OS–boot is a very simple operating system and it does not offer support for additional I/O devices. However, the HIF specification states that file descriptors 0,1 and
2 are assigned to: standard in, standard out and standard error. Normally any open()
library calls issued by an application program will result in the HIF open service returning a new file descriptor for a file maintained on the TIP host by MonTIP.
Target hardware can often have additional UART or parallel port hardware
available for communication. If OS–boot is not completely replaced with a new OS,
then these devices should be accessed via the normal library/HIF interface. OS–boot
can be extended to include a driver to support any new peripheral device. Each device
should be pre–allocated a file descriptor value starting with number 3. All access to
peripherals can then be to the pre–allocated file descriptors. If the application code
calls open() then the HIF open service should initially return 4, or some larger number depending on the number of peripheral devices added.
The Metaware libraries, supplied with the High C 29K compiler package, pre–
allocate buffer and MODE settings for file descriptors 0, 1 and 2. Assuming no access
to the library source file _iob.c, then calls to open() should be placed inside the crt0.s
file. These open() calls should be for each of the pre–allocated file descriptors and
will result in library initialization. The code inside crt0.o runs before the application
main() code. Note, the MODE value for the open() calls may be restricted due to driver or peripheral limitations, and communication with the devices may have to be in RAW mode rather than the buffered mode supported by the library when a device is opened in COOKED mode.
When library calls, or HIF calls such as _read() or _write(), are issued for the file descriptor associated with a peripheral, the OS–boot trapware for the HIF services calls upon the required device driver to perform the requested task.
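As an illustration of the dispatch involved, the following C sketch routes a HIF write request for a pre–allocated descriptor to a board–level driver and passes every other descriptor on to MonTIP. The descriptor assignment and all of the function names are hypothetical examples, not part of OS–boot.

    /* Sketch only: routing a HIF write request for a pre-allocated descriptor
       to a board-level driver. The descriptor value (3 for an extra UART) and
       every function name here are hypothetical examples.                     */

    #define FD_EXTRA_UART   3   /* pre-allocated descriptor for the new device */

    extern int uart_driver_write(const char *buf, int len);        /* new driver */
    extern int montip_hif_write(int fd, const char *buf, int len); /* host file  */

    int hif_write(int fd, const char *buf, int len)
    {
        if (fd == FD_EXTRA_UART)
            return uart_driver_write(buf, len);  /* satisfied on the target      */
        return montip_hif_write(fd, buf, len);   /* forwarded to MonTIP          */
    }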
7.4.5 Memory Access Protection
The OS–boot operating system includes an optional memory access protection
scheme which is useful for embedded system debugging. It functions only with
29K family members which contain TLB hardware. When used, the operating system runs application programs in User mode with address translation turned on.
Thus, all application addresses are virtual, but the memory management hardware is
configured to map virtual to physical addresses with a one–to–one scheme. No
memory paging takes place and the entire program is at all times located in the available target system memory.
The benefit of the system is that bad addresses, generated by unexpected program execution, can be detected immediately. The operation of the 29K Translation
Look–aside Buffer (TLB) used to construct the management scheme was briefly described earlier in section 7.2.2, Memory Access Protection. This section deals with the OS–boot code implementation. For more information about the operation of TLB hardware see Chapter 6 (Memory Management Unit).
First consider the typical OS–boot memory configuration shown in Figure 7-8.
Some 29K family members have a 3–bus architecture. This enables two memory
systems to be utilized, one for instruction memory and the second for data memory. If
instructions are to be accessed from data memory devices, or data placed in
instruction memory, then a bridge must be built between the data and instruction
memory busses. Note, a single address bus is shared by both memory systems.
Typically, designers will build a bridge enabling instruction memory to be accessed
from data memory address space. In such a case the two address spaces do not overlap. However, without a bridge it is possible to have physical memory located in the different address spaces but at the same address offset location.
Most of the newer 29K family members have a conventional 2–bus architecture,
which results in instructions and data being located in the same memory devices, located in a single address space. OS–boot caters for all the different memory configuration options, and this is reflected in the layout shown in Figure 7-8.

Figure 7-8. Typical OS–boot Memory Layout. (The figure shows Data Memory holding, from HIGHMEM downwards, the register stack (gr1), the memory stack (msp), unused space, the heap (heapptr), and program data between start (gr98) and end (gr99), with a possible DebugCore + OS–boot region; Instruction Memory holds program text between start (gr96) and end (gr97) and the DebugCore + OS–boot. Note, Data and Instruction memory space may overlap, or be only one single address space.)
Operating system warm–start code knows the address regions allocated to a
loaded application program by examining the data structure returned from the
dbg_control() call. OS–boot actually saves the data to memory locations for future
use, as we will see. Applications can be expected to access a limited number of regions out–with the data region loaded from the application COFF file. This is required to support the memory allocation heap and the register and memory stacks.
The allocated access regions are shown shaded in Figure 7-8. An attempt to access an
address out–with allowed regions will cause the DebugCore to gain control of program execution.
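For reference, the saved region data can be pictured as the C structure below. The layout is inferred from the offsets used by the TLB miss handlers later in this section (text start at offset 0, text end at offset 4, data start at offset 8); the field names are illustrative, and Figure 7-7 should be consulted for the full dbg_control() return information.

    /* Sketch only: a C view of the region information OS-boot saves at
       ret_struct, inferred from the offsets used by the trap handlers below.
       Field names are illustrative; see Figure 7-7 for the full layout.      */

    struct ret_struct_sketch {
        unsigned long text_start;    /* loaded program text, start (gr96) */
        unsigned long text_end;      /* loaded program text, end   (gr97) */
        unsigned long data_start;    /* loaded program data, start (gr98) */
        unsigned long data_end;      /* loaded program data, end   (gr99) */
    };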
During normal code execution, instruction and data TLB misses will occur. This
requires that the TLB registers be refreshed with a valid address translation. OS–boot
trap handlers are used to perform this task. If a bad address is generated the trap handlers must detect it.
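Before looking at the assembly, the check performed by the instruction miss handler can be paraphrased in C as shown below. The helper names are invented for the illustration; OS–boot implements the test directly in assembly using the it0–it3 temporary registers.

    /* Sketch only: the range check performed by the instruction TLB miss
       handler shown below, paraphrased in C. All helper names are
       illustrative.                                                        */

    extern unsigned long ret_text_start, ret_text_end;   /* saved from dbg_control() */
    extern const unsigned long VE_UE;                    /* the (VE|UE) access bits  */
    extern void build_one_to_one_tlb_entry(unsigned long vaddr,
                                           unsigned long access_bits);
    extern void enter_debug_core(int trap_number);

    void uitlb_miss(unsigned long pc1)
    {
        if (pc1 < ret_text_start || pc1 > ret_text_end) {
            enter_debug_core(8);                 /* bad address: hand over control  */
            return;
        }
        build_one_to_one_tlb_entry(pc1, VE_UE);  /* refresh the TLB, then restart   */
    }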
Two kinds of traps are expected: Instruction TLB misses and data TLB misses.
The trap handler for instruction misses is shown below. The return values from the
dbg_control(), shown in Figure 7-7, are stored by OS–boot in a structure at address
ret_struct. The PC1 value is compared with the start and end addresses of the loaded
program. If the PC1 address is within this range then a new valid TLB entry is built
and program execution restarted. If the address is out of the allowed range then a
jump to the DebugCore entry point, dbg_trap, is taken.
UITLBmiss:
	mfsr	it0,pc1			;PC address
	const	it1,ret_struct
	consth	it1,ret_struct
	load	0,0,it2,it1		;TEXT start
	cpltu	it2,it0,it2		;jump if
	jmpt	it2,UIinvalid		;PC < start
	add	it1,it1,4
	load	0,0,it2,it1		;TEXT end
	cpgtu	it2,it0,it2		;jump if
	jmpt	it2,UIinvalid		;PC > end
	const	it2,(VE|UE)
one_to_one:
TLB register Word 0 has access control bits which separately enable Read,
Write and Execution of data for the addressed page. The example code assumes that
data and instructions are not located on the same page, as pages containing instructions are marked for execution only.
one_to_one:				;it2 has RWE bits
	mfsr	it3,mmu			;need page size
	srl	it3,it3,8		;get PS bits.
	and	it3,it3,3		;1k page min
	add	it3,it3,10+5		;32–sets
	srl	it1,it0,it3		;form
	sll	it1,it1,it3		; VTAG
	or	it1,it1,it2		;add RWE bits
	mfsr	it2,mmu			;get PID
	and	it2,it2,0xFF
	or	it1,it1,it2		;add PID bits
	sub	it3,it3,5		;page size
	srl	it2,it0,it3		;form
	sll	it2,it2,it3		; RPN
	mfsr	it0,lru			;select column
	mttlb	it0,it1			;word 0
	add	it0,it0,1
	mttlb	it0,it2			;word 1
	iret
;
UIinvalid:
	jmp	dbg_trap		;enter DebugCore
	const	gr64,8
The data–miss trap handler is a little more complicated. The address under consideration appears in channel register CHA. The address is first tested to see if it is
greater than the data region start address and less than the current heap pointer. The
operating system maintained heap was initialized just above the end of the loaded
program data region. If the address is not within this range then it is tested to determine if it is within the memory or register stack regions. The stacks are located at the
very top of physical data memory.
UDTLBmiss:
	mfsr	it0,cha			;data address
	const	it1,ret_struct+8
	consth	it1,ret_struct+8
	load	0,0,it2,it1		;DATA start
	cpltu	it2,it0,it2		;jump if
	jmpt	it2,UDinvalid		;adds < start
	cpltu	it2,it0,heapptr		;adds < heapptr
	jmpt	it2,one_to_one
	const	it2,(VE|UR|UW)
stacks:
	const	it2,HIGHMEM
	consth	it2,HIGHMEM
	load	0,0,it2,it2		;DATA end
	cpgeu	it2,it0,it2		;jump if
	jmpt	it2,UDinvalid		;adds >= end
	cpgeu	it2,it0,msp		;jump if
	jmpt	it2,one_to_one		;adds >= msp
	const	it2,(VE|UR|UW)
;
UDinvalid:
	jmp	dbg_trap		;enter DebugCore
	const	gr64,9
The example trap handler marks data pages for read and write access only. If the
CHA address does not fall within an allowed region, then a TLB entry is not built and program execution is not restarted. Instead, the DebugCore is entered
and the trap number passed.
7.4.6 Down Loading a New OS
One way to replace OS–boot with another operating system is to simply link the
new operating system with the MiniMON29K modules and, if necessary, place the result in EPROM memory, or alternatively down–load the linked image to the target
29K system using MiniMON29K messages. However, many users like to keep the
existing OS–boot/MiniMON29K combination in place and down–load only the new
operating system (or a portion of it) –– this can create complications. Assuming no
changes are made to the supplied OS–boot, then, when the loaded OS’s execution is
started, with say a MonDFE ‘g’ command, warm–start code will prepare for execution to begin at the first instruction of the new operating system. Generally a HIF setrap service call is made followed by an assertion of the assigned trap number. This
allows Supervisor mode to be entered.
The new operating system must first run in Supervisor mode to take over processor resources initially under OS–boot control. If the floating–point trap handlers are to remain installed, then the new operating system must be careful to remember their global register support requirement. If the new operating system is still
supporting HIF services then it must also pay attention to the HIF trapware register
usage. HIF traps will occur if any application code run by the new OS is linked with
libraries intended for use with a HIF conforming operating system. However, often a
new operating system will replace the HIF libraries with new libraries which do not
call HIF, but make use of the system call services of the new operating system.
The HIF trapware code can be replaced with new code, whose register usage is
better integrated with the new operating system, by the new system taking over the
HIF vector table entry. If this is done, then it is likely that the operating system message interrupt handler will also be taken over. Unless the os_V_msg trap handler address is replaced, the message system will continue to call the OS–boot interrupt handler, and the associated operating system register usage should be taken into account. Alternatively, AMD supplies driver routines which make it easy for a new operating system to use the original message system for standard input and output communication. This eliminates the need for the new OS to take over the message system interrupt handlers.
The MiniMON29K message system is typically supported by low level driver
code which is often interrupt driven. Most often this is a UART interrupt handler. The
message system will not generate virtual interrupts if the low level handler vector
table entry is taken over. This can be necessary because of interrupt overloading. For
example, the Am29200 interrupt INTR3 is used for all peripheral devices including
the on–chip UART. A new operating system may wish to add support, for say, DMA
activity, which was not supported by OS–boot. This may require an interrupt handler
activated by INTR3. If the MiniMON29K message system is to continue operation,
then the new operating system must take over the INTR3 vector table entry. But after the new operating system handler completes, it must jump to the original vector handler address rather than IRETing; a sketch of this chaining appears below. This gives the message system low level interrupt
handler an opportunity to run. A better alternative is to use the technique described in
section 2.5.5 to deal with INTR3 overloading.
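The chaining idea can be expressed in C as the sketch below. A real 29K handler performs these steps in assembly and ends with a jump to the saved vector address instead of an IRET; every identifier in the sketch, including the vector number, is illustrative rather than taken from OS–boot or MiniMON29K source.

    /* Sketch only: chaining an overloaded interrupt vector, expressed in C. */

    typedef void (*vector_handler)(void);

    extern vector_handler read_vector_entry(int vector);    /* hypothetical helpers */
    extern void write_vector_entry(int vector, vector_handler h);
    extern void service_dma_device(void);      /* the new OS's own INTR3 work       */

    #define V_INTR3  19                         /* illustrative vector number        */

    static vector_handler saved_intr3_handler;  /* original low-level handler        */

    static void new_intr3_handler(void)
    {
        service_dma_device();          /* handle the overloaded interrupt source     */
        saved_intr3_handler();         /* then let the message system's low-level
                                          handler run; do not IRET here              */
    }

    void install_new_intr3_handler(void)
    {
        saved_intr3_handler = read_vector_entry(V_INTR3);
        write_vector_entry(V_INTR3, new_intr3_handler);
    }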
7.5 UNIVERSAL DEBUG INTERFACE (UDI)
Code development for an embedded processor is generally more costly than development of code of equivalent complexity intended for execution on an engineering workstation. The embedded application code cannot benefit from an underlying support operating system such as UNIX. In some cases, developers may choose to first install a small debug support monitor, such as MiniMON29K, or a third–party executive which can offer a somewhat improved development environment. In the process
install a small debug support monitor, such as MiniMON29K, or third–party executive which can offer a somewhat improved development environment. In the process
of getting an embedded support monitor running or developing application code to
run directly on the processor, emulation hardware may be employed. The availability
of debug tools and their configurability is an important factor when selecting a processor for an embedded project.
The architecture of the latest RISC processors may be simplified compared to
their CISC predecessors, but the complexity of controlling the processor operation
has not been reduced. The use of register stacks, instruction delay slots, and other performance enhancing techniques has led to increased use of high level programming languages such as C. The compiler has been given the responsibility of producing efficient assembly code, and the developer rarely deals with code which manipulates data at the processor register level. The increased productivity achievable by
this approach is dependent on high level debug support tools.
Developers of products containing embedded processors are looking to RISC
for future products offering increased capability. The greater performance relative to
RISC processor cost should make this possible. The suitability, cost, and productivity of the tools available for code development are likely to be the major factor when deciding how to tool up for RISC.
The following sections describe the Universal Debug Interface (UDI), which is
processor independent and enables greater debug tool configurability. A number of
emulator and embedded monitor suppliers, as well as high level language debug
tools suppliers, are currently configuring their tools to comply with the proposed
UDI standard. Current implementations are targeted for RISC processor code development. UDI should ease the choice in selecting tools and, consequently, selecting
RISC. This section concentrates on describing the integration of the Free Software Foundation's GDB C language source debugger with UDI.
7.5.1 Debug Tool Developers
A debug tool developer typically arranges for their product to be available for a
range of popular processors. This normally means rebuilding the tool with the knowledge required to understand the peculiarities of each processor. If an enhancement is
made to the debugger user–interface, then normally the debugger source and the processor specific information must be recompiled and tested before customers are updated.
When developing code to run on an engineering workstation, the processor supporting the debugger execution is the same processor running the program being developed. This means the debugger can make use of operating system services such as
ptrace() (see section 7.5.3), to examine and control the program being debugged.
When developing code for an embedded application, the program being developed is
known as the Target Program and executes on the Target Processor which is usually a
different processor than the one supporting the debugger, known as the Host Processor. The host processor and target processor do not communicate via the ptrace() system call, but via whatever hardware communication path links the two processors.
The portion of the debugger which controls communication with the target processor
is known as the target interface module, and whenever a change or addition is required in the communications mechanism, the debugger must be once again recompiled to produce a binary executable which is specific to the target–processor and target–communications requirements.
When the chipmakers turn out their latest whiz–bang RISC processor, the tool
developer companies are faced with considerable development costs in ensuring
their tools function with the new architecture. It is not uncommon for the availability
of debug tools to lag behind RISC chip introduction. Often tools are introduced with
limited configuration options. For example, target processor communication may be
according to a low level debug monitor protocol, or an in–circuit emulator (ICE) protocol. Each debugger product has its own target interface module; this module must
be developed for each debugger in order to communicate with the new target RISC
processor.
An embedded application developer may have prior experience or a preference
for a particular debug tool, but the only available communications path to the target
may not be currently supported. This incompatibility may discourage the developer
from choosing to use a new processor. It is desirable that debuggers share communication modules and be more adaptable to available target processor interfaces.
Ideally a debugger from one company should be able to operate with, say, an
emulator from another company. This would make it possible for a customer to select
a little used debugger with a popular target monitor or vice versa.
The goal of the Universal Debug Interface (UDI) is to provide a standard interface between the debugger developer and the target communications module, so the
two can be developed and supplied separately. In fact, an applications developer
could construct their own communications module, for some special hardware communications link, as long as it complied with the standard.
7.5.2 UDI Specification
If UDI were a specification at procedural level, then debugger developers and
communication module developers would have to supply linkable images of their
code so the debug tool combination could be linked by the intended user. This is undesirable because it would require a linked image for every tool combination. Additionally, the final linked program would be required to run on a single debug host.
UDI actually relies on an interprocess communication (IPC) mechanism to connect
two different processes. The debugger is linked into an executable program to run on the host processor; this process is known as the Debugger Front End (DFE). The communications module is linked as a separate process which runs on the same or a different host processor; this process is known as the Target Interface Process (TIP). The
two processes communicate via the UDI interprocess communication specification.
Two IPC mechanisms have so far been specified: one uses shared memory and is
intended for DOS developers, the second uses sockets and is intended for UNIX and
VMS developers. Of course, when the shared memory IPC implementation is used
the DFE and TIP processes must both execute on the same host processor. Using
sockets with Internet domain communication enables the DFE and TIP to each
execute on separate hosts on a computer network. Thus an applications developer
can, from the workstation on his desk, debug a target processor which is connected to
a network node located in a remote hardware lab. Using sockets with UNIX domain
addresses (the method used to implement UNIX pipes) enables both processes to run
on the same host.
Some of the currently available UDI conforming debug tools are presented in
Figure 7-9. The interprocess communications layer defined by UDI enables the applications developer to select any front end tool (DFE) with any of the target control
tools (TIP).
Because developers of UDI conforming tools must each have code which interfaces with the IPC mechanism according to the UDI protocol, the UDI community
freely shares a library of code known as the UDI–p library. This code presents a procedural layer which hides the IPC implementation. For example, consider the following procedure:
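(The declaration below is reproduced as an illustration of a UDI–p entry point; the parameter names and UDI typedefs shown are from memory and should be checked against the UDI 1.2 header before use.)

    UDIError UDIRead (UDIResource  from,        /* in:  target memory space and offset */
                      UDIHostMemPtr to,         /* out: buffer on the DFE host         */
                      UDICount     count,       /* in:  number of objects to read      */
                      UDISizeT     size,        /* in:  size of each object            */
                      UDICount     *count_done, /* out: number actually transferred    */
                      UDIBool      host_endian);/* in:  byte-order handling flag       */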
The DFE code calls the UDIRead function which transports the function call to
the TIP process. The TIP code developer must resolve the function request, by adding
code which is specific to controlling the particular target. The IPC layer is effectively transparent; the TIP developer is unaware that the procedure caller is from a different
process, possibly on a different host machine. Table 7-2 lists most of the UDI–p procedures available.
Figure 7-9. Currently Available Debugging Tools that Conform to UDI Specification. (The Debugger Front–end Processes are the GDB, XRAY, SDB and UDB source–level debuggers and the MiniMON29K MonDFE interface; these connect through the UDI–IPC Layer to the Remote–target Interface Processes: the ISS instruction simulator ISSTIP, an ICE In–Circuit Emulator with its emulator cable and pod, and the MiniMON29K MonTIP, optionally with a ROM emulator or an HP16500B logic analyser, which reaches the target monitor DebugCore over a private link, maybe rs232.)
Because the DFE and TIP processes may be running on different machines, care
must be taken when moving data objects between hosts. An “int” sized object on the
DFE supporting machine may be a different size from an “int” on the TIP supporting
machine. Further, the machines may be of different endianness. The UDI–p procedures
make use of a machine independent data description technique similar to the XDR
library available with UNIX. Data is converted into a universal data representation
(UDR) format before being transferred via sockets. On being received, the data is
converted from UDR format into data structures which are appropriate for the receiving machine. The UDI–p procedures keep the UDR activity hidden from the UDI
user.
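The byte–order half of the problem can be illustrated with the small, self–contained C program below. This is not the UDR code itself; it simply shows the kind of conversion that the UDI–p layer performs on the caller's behalf before data crosses the socket.

    /* Sketch only: an illustration of the byte-order problem the UDR layer
       hides, using the standard htonl()/ntohl() conversions.               */

    #include <stdio.h>
    #include <stdint.h>
    #include <arpa/inet.h>      /* htonl(), ntohl() */

    int main(void)
    {
        uint32_t reg  = 0x12345678;   /* register value on the sending host */
        uint32_t wire = htonl(reg);   /* canonical (big-endian) wire form   */
        uint32_t back = ntohl(wire);  /* receiving host's native form again */

        printf("native 0x%08lx  wire 0x%08lx  recovered 0x%08lx\n",
               (unsigned long)reg, (unsigned long)wire, (unsigned long)back);
        return 0;
    }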
Table 7-2. UDI–p Procedures (Version 1.2)

Procedure                   Operation
UDIConnect                  Connect to selected TIP
UDIDisconnect               Disconnect from TIP
UDISetCurrentConnection     For multiple TIP selection
UDICapabilities             Obtain DFE and TIP capability information
UDIEnumerateTIPs            List multiple TIPs available
UDICreateProcess            Load a program for debugging
UDISetCurrentProcess        Select from multiple loaded programs
UDIDestroyProcess           Discontinue program debugging
UDIInitializeProcess        Prepare runtime environment
UDIRead                     Read data from target–processor memory
UDIWrite                    Write data to target–processor memory
UDICopy                     Duplicate a block of data in target memory
UDIExecute                  Start/continue target–processor execution
UDIStep                     Execute the next instruction
UDIStop                     Request the target to stop execution
UDIWait                     Inquire about target status
UDISetBreakpoint            Insert a breakpoint
UDIQueryBreakpoint          Inquire about a breakpoint
UDIClearBreakpoint          Remove a breakpoint
In later sections of this chapter, the development of a UDI conforming GDB, a
source level debugger from the Free Software Foundation and Cygnus Support, is
discussed in more detail. GDB is an example of a DFE process. As an example of a
TIP process, we shall look at the MiniMON29K monitor and the Instruction Set Simulator from AMD. Most users of GDB will have some knowledge of the ptrace()
system call which enables GDB to examine the state of the process being debugged.
A brief description of ptrace() is beneficial along with further explanation of its unsuitability for embedded application software development.
7.5.3 P–trace
The UNIX system call ptrace() provides a means by which a process may control
the execution of another process executing on the same processor. The process being
debugged is said to be “traced”. However, this does not mean that the execution path
of a process is recorded in a “trace buffer” as is the case with many processor emulators. Debugging with ptrace() relies on the use of instruction breakpoints and other
hardware or processor generated signals causing execution to stop.
ptrace(request, pid, addr, data)
There are four arguments whose interpretation depends on the request argument. Generally, pid is the process ID of the traced process. A process being debugged behaves normally until it encounters some signal, whether internally (processor) generated, like an illegal instruction, or externally generated, like an interrupt. Then
the traced process enters a stopped state and the tracing process is notified using the
wait() system call. When the traced process is in the stopped state, its core image can
be examined and modified using the ptrace() service. If desired, another ptrace() request can then cause the traced process either to terminate or to continue. Table 7-3
lists the ptrace() request services available.
Table 7-3. ptrace() Services

Request       Operation
TraceMe       Declare that the process is being traced
PeekText      Read one word in the process's instruction space
PeekData      Read one word in the process's data space
PeekUser      Examine the process–control data structure
PokeText      Write one word in the process's text space
PokeData      Write one word in the process's data space
PokeUser      Write one word in the process–control data structure
Cont          Start up process execution
Kill          Terminate the process being debugged
SingleStep    Execute the next instruction
GetRegs       Read processor registers
SetRegs       Write processor registers
ReadText      Read data from the process's instruction space
ReadData      Read data from the process's data space
WriteText     Write data into the process's instruction space
WriteData     Write data into the process's data space
SysCall       Continue execution until a system call
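A minimal host–side sketch of this arrangement, written in C, is shown below. The PTRACE_* request spellings follow common UNIX practice and correspond to the traditional names in Table 7-3 (the exact names vary between UNIX variants), and the address peeked at is purely illustrative. This is exactly the arrangement that breaks down when the program being debugged runs on a separate embedded processor.

    /* Sketch only: a native-debug loop using ptrace(); request names vary
       between UNIX variants.                                              */

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        pid_t pid;
        int status;
        long word;

        if (argc < 2) {
            fprintf(stderr, "usage: %s program [args]\n", argv[0]);
            return 1;
        }

        pid = fork();
        if (pid == 0) {                               /* the traced process         */
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);    /* the "TraceMe" request      */
            execvp(argv[1], &argv[1]);                /* stops with SIGTRAP at exec */
            exit(1);
        }

        wait(&status);                                /* tracer is notified of stop */

        /* "PeekText": read one word from the traced process's instruction space.
           The address used here is purely illustrative.                           */
        word = ptrace(PTRACE_PEEKTEXT, pid, (void *)0x10000, NULL);
        printf("word at 0x10000: 0x%lx\n", word);

        ptrace(PTRACE_CONT, pid, NULL, NULL);         /* the "Cont" request         */
        wait(&status);
        return 0;
    }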
Because the process with the user–interface controlling the debugging and the application process being debugged may not be executing on the same processor, it is not possible to use the ptrace() system call mechanism to debug embedded application software. The debugger process (DFE) must run on a separate processor
and communicate with the processor supporting execution of the application code.
The Free Software Foundation’s source level debugger, GDB, makes use of the
ptrace() system call. However, it can alternatively use a collection of procedures
which support communication to a remote processor. These procedures implement
the necessary protocols to control the hardware connecting the remote processor to
the “host” debug processor. By this means, GDB can be used to debug embedded application software running on application specific hardware. The following section
discusses the method in more detail.
7.5.4 The GDB–UDI Connection
GDB can, in place of ptrace(), make use of a procedural interface which allows
communication with a remote target processor. Newer versions of GDB (version
3.98 and later) achieve this via procedure pointers which are members of a target_ops
structure. The procedures currently available are listed in Table 7-4. According to
GDB configuration convention, the file remote–udi.c must be used to implement the
remote interface procedures. In the case of interfacing to the IPC mechanism used by
UDI, the procedures in Table 7-4 are mapped into the UDI–p procedures given in
Table 7-2. With the availability of the UDI–p library, it is a simple task to map the
GDB remote interface procedures for socket communication with a remote target
processor.
Table 7-4. GDB Remote–Target Operations

Function                  Operation
to_open()                 Open communication connection to remote target
to_close()                Close connection to remote target
to_attach()               Attach to a loaded and running program
to_detach()               Detach for multitarget debugging
to_start()                Load program into target–system memory
to_wait()                 Wait until target–system execution stops
to_resume()               Startup/Continue target–system execution
to_fetch_register()       Read target–system processor register(s)
to_store_register()       Write register(s) in target–system processor
to_xfer_memory()          Read/Write data to target–system memory
to_insert_breakpoint()    Establish an instruction break address
to_remove_breakpoint()    Remove a breakpoint
to_load()                 Load a program into target–processor memory
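As an illustration of how a remote–target module in the spirit of remote–udi.c can map one of the Table 7-4 operations onto the UDI–p calls of Table 7-2, consider the C sketch below. The structure layout, the simplified scalar types standing in for the UDI typedefs, and the omitted error handling are all simplifications; this is not GDB's actual source.

    /* Sketch only: mapping to_xfer_memory() onto UDIRead()/UDIWrite().
       Simplified scalar types stand in for the UDI typedefs.            */

    typedef int UDIError;
    typedef struct { unsigned long Space, Offset; } UDIResource;

    extern UDIError UDIRead(UDIResource from, void *to, unsigned count,
                            unsigned size, unsigned *count_done, int host_endian);
    extern UDIError UDIWrite(const void *from, UDIResource to, unsigned count,
                             unsigned size, unsigned *count_done, int host_endian);

    /* A cut-down target_ops: just the operation used in this example. */
    struct target_ops_sketch {
        int (*to_xfer_memory)(unsigned long addr, char *buf, int len, int write);
    };

    static int udi_xfer_memory(unsigned long addr, char *buf, int len, int write)
    {
        UDIResource r = { 0 /* data space, illustrative */, addr };
        unsigned done = 0;

        if (write)
            UDIWrite(buf, r, (unsigned)len, 1, &done, 1);
        else
            UDIRead(r, buf, (unsigned)len, 1, &done, 1);
        return (int)done;                 /* bytes actually transferred */
    }

    static struct target_ops_sketch udi_ops = { udi_xfer_memory };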
7.5.5 The UDI–MiniMON29K Monitor Connection, MonTIP
MiniMON29K monitor code can not function without the support of a software
module located in a support processor; the software module is known as the target
interface process (TIP). The 29K target processor communicates with the processor
running the TIP process via a serial link or other higher performance channel. This
link supports a message system which is private to the MiniMON29K monitor, by
that I mean it is completely independent of the UDI protocol. (See Figure 7-2.)
MiniMON29K must be installed in target system ROM memory or down–
loaded by the TIP host via a shared memory interface. The target application code,
and additional operating system code, can then be down–loaded via the message system. If changes to the code are required, then the message system can be used to
quickly down–load new code without changing any ROM devices.
The MiniMON29K TIP process, montip, converts UDI service requests into
MiniMON29K messages. The montip program, when run on UNIX machines, typically communicates with the target using an rs232 link. When run on DOS machines, it may communicate using an rs232 connection or a PC plug–in board shared
memory scheme. Note, UNIX machines can also be used to debug PC plug–in cards; the pcserver program, run on DOS machines, enables the PC serial port to be connected to a UNIX machine. The MiniMON29K messages, transferred to the DOS
host via plug–in card shared memory, are sent to the TIP host via the rs232 connection. The montip program supports several command–line options, as shown below.
Not all are applicable to both DOS and UNIX host machines.
montip –t target [–r OS–boot] [–m msg_log] [–com serial_port]
[–re msg_retries] [–mbuf msg_bufsize] [–bl msg_loopcount]
[–to timeout] [–seg PC_seg_addr] [–port PC_port_base]
[–baud baudrate] [–le] [–R|P]
An explanation of the command line options can be obtained by just entering
montip on your TIP host machine. When the montip process is started it advertises
its readiness to service UDI requests. A DFE process will typically connect to the
TIP process and a debug session will commence. Alternatively, there is no need to
first start the TIP process. When a DFE process is started, such as mondfe, it will look
for the advertised TIP; if the TIP process is not found the DFE will automatically start
the TIP. This is how montip is normally started. The start–up montip parameters are
taken from the “UDI Configuration File”. The format of this file is explained in the
following section discussing mondfe.
7.5.6 The MiniMON29K User–Interface, MonDFE
The MiniMON29K DFE process, mondfe, is a primitive 29K debugger. It provides a basic user–interface for the MiniMON29K product. It is fully UDI compliant
(at least UDI version 1.2), and it can be used with any of the available TIP processes
such as isstip, mtip, montip, etc. It is very easy to operate but has less debugging
capability compared to other DFEs, such as gdb, xray29u or UDB (see section 7.7)
etc.; for example it does not support symbolic debugging.
It is very useful for simply loading application programs and starting their
execution where no debugging support is required. Its simple command set also
makes it easy to learn; when running, simply type the h command to obtain a complete list of available commands. The h command can also be used to explain each
command’s operation; for example, “h s” will explain the operation of the set command. Several command–line options are supported.
mondfe
[–D] –TIP tip_id [–q] [–e echo_file] [–c command_file]
[–ms mem_stack_size] [–rs reg_stack_size] [–le]
[–log logfile] [pgm_name [arg_list]]
A list of command line options can be obtained by entering mondfe on your DFE host processor. The process is typically started by entering a command such as "mondfe –D –TIP serial". The "–D" option causes an interactive debug session to commence.
The UDI conforming TIP process communicating with mondfe is identified by the
“–TIP serial” command line option.
DFEs and TIPs establish communication via a UDI Configuration File. On
UNIX machines this file is called udi_soc; on DOS machines it is called udiconfs.txt.
The Configuration File is found by first looking in the current working directory. If
not found, the file given by environment variable UDICONF is searched for. Lastly,
the executable PATH is searched. The format of these files is very similar, on UNIX:
session_id   AF_UNIX   socket_name   tip_exe tip_parameters
session_id   AF_INET   host_name     port    <not required>
The first column gives the session_id, which is used to select the appropriate
line. The serial key–word used with the "–TIP" option in the example is compared
with the session_id for each line in the Configuration File. The first matching line
provides all the necessary data for connecting to a TIP process which is already running; or, if necessary, starting TIP process execution.
The second column gives the socket domain used by the socket IPC mechanism
connecting the two processes. Two domains are supported. The AF_UNIX domain
indicates both processes reside on the same host processor. Use of the AF_INET domain indicates the TIP process is on another networked host machine. In such a case,
the host name and socket port number are supplied in the following columns. The
UDI specification does not support DFEs starting TIP processes on remote hosts.
When the AF_INET domain is used to connect to the TIP, the TIP process must be
first up and running before connection is attempted.
When the AF_UNIX domain is used, the third column gives the name of the
socket used by the TIP to advertise its UDI services. If the DFE is unable to connect to
the named socket, it will assume the TIP is not running. In such a case the remaining
line information gives the name of the TIP executable and the start–up parameters.
Example udi_soc file contents are shown below.
mon        AF_UNIX   mon_soc   montip –t serial –baud 38400 –com /dev/ttya
serial     AF_UNIX   *         montip –t serial –baud 9600 –com /dev/ttya
iss        AF_UNIX   iss_soc   iss –r ../../src/osboot/sim/osboot
pcserver   AF_UNIX   pc_soc    pcserver –t serial –baud 9600 –com /dev/ttya
cruncher   AF_INET   hotbox    7000
netrom     AF_UNIX   net_soc   montip –t netrom –netaddr 163.181.22.41 ...
The relative path names given with montip start–up parameters are relative to <montip executable directory>/../lib. The path given with the "–r" option is required
to find the OS–boot code for 29K start–up. When the DFE is always used to automatically start the TIP process, a “*” can be used for the socket name field. This causes the
DFE to generate a random name for the socket file. This file will be removed when
the DFE and TIP discontinue execution at the end of the debug session.
The DOS Configuration File (udiconfs.txt) format is a little simpler. There are
only three entry fields, as shown by the example below:
mon        montip.exe   –t serial –baud 38400
serial     montip.exe   –t serial –baud 9600
sim        iss.exe      –r ..\..\src\osboot\sim\osboot
eb29K      montip.exe   –t eb29K –r ..\..\src\minimon\eb29K\mon.o
yarcrev8   montip.exe   –t yarcrev8 –r ..\..\src\minimon\yarcrev8\mon.os
The first field is again the session identifier. The second and third fields contain
the TIP executable file name and its start–up option switches. All DFEs have some
kind of command–line or interactive command which allows the session_id value to
be entered. The DFE then reads the UDI Configuration File to determine the TIP with
which communication is to be established. Most DFEs have a command (mondfe has the disc command) which enables the DFE to disconnect from the TIP and cease to execute, but leave the TIP running. Because the TIP is still alive and ready to service
UDI requests, a DFE can start–up later and reconnect with the TIP. However, typically the DFE and TIP processes are terminated at the same time.
7.5.7 The UDI – Instruction Set Simulator Connection, ISSTIP
An Instruction Set Simulator, isstip, is available for DOS and UNIX type hosts.
The isstip process is fully UDI conforming and can be used by any DFE. Because of
existing contract limitations, AMD normally ships isstip in binary rather than source
form. Using the simulator along with, say, the gdb DFE, is a convenient and powerful
way of exercising 29K code without ever having to build hardware. Thus, software
engineers can use the simulator while a project’s hardware is still being debugged.
The Instruction Set Simulator can not be used for accurate application benchmarking, as the system memory model can not be incorporated into the simulation.
AMD supplies the architectural simulator, sim29, for that purpose (see Chapter 1).
The simulator supports several command line options, as shown below. For an explanation of these options, enter isstip or man isstip, on your TIP host machine.
isstip [–r osboot_file] [–29000|–29050|–29030|–29200] [–t] [–tm]
[–id <0|1>] [–sp <0|1>] [–st <hexaddr>] [–ww] [–le] [–p|v]
With the –r option, the osboot_file is loaded into memory at address 0. This is
useful for installing operating systems like OS–boot before application code starts
executing. With processors which support separate Instruction and ROM memory
spaces, the osboot_file is loaded into ROM space. If the –r option is not used, the simulator will intercept HIF service calls and perform the necessary operating system
support service. The simulator always intercepts HIF services with service numbers
255 and less regardless of the –r option. These HIF services are provided directly by
the simulator.
The simulator is very useful for debugging Freeze mode code. It will allow
single stepping through Freeze mode code which is not possible with a real processor
unless it supports Monitor mode. Freeze mode code is normally supplied in the optional osboot_file. Thus, the –r option must be used to enable Freeze mode debugging. Additionally, to enable debugging of Freeze mode timer interrupts the –tm option must also be selected to enable timer interrupt simulation. The simulator normally intercepts floating–point traps and performs the necessary calculation directly.
Simulation speeds are reduced if floating–point trapware is simulated. However, if
the trapware is to be debugged the –t option must be used to enable trapware simulation.
When the isstip process is started it advertises its readiness to service UDI requests. A DFE process can then connect to the TIP process and a debug session will commence. However, it is more typical to have the DFE start the TIP. The
mondfe process starts the TIP during the DFE start–up process. The gdb DFE starts
the TIP after the target gdb command is used. The start–up isstip parameters are taken from the “UDI Configuration File”. The format of this file is explained in the previous section discussing mondfe.
7.5.8 UDI Benefits
A number of debug tool developers are currently, or will be shortly, offering
tools which are UDI compliant. Typically the DFEs are C source level debuggers.
This is not surprising, as the increased use of RISC processor designs has resulted in a
corresponding increase in software complexity. The use of a high level language,
such as C, is more productive than developing code at machine instruction level. And
further, the use of C enables much greater portability of code among current and
future projects. The low cost of GDB makes it an attractive choice for developers.
Target processors and their control mechanisms are much more varied than
Debugger Front Ends (DFEs). I have briefly described the MiniMON29K TIP, which
is a process which controls the execution of a 29K processor. A small amount of code
known as the DebugCore is placed in processor ROM memory and enables
examination of the processor state. The MiniMON29K TIP communicates with the
DebugCore via a hardware link which is specific to the embedded application
hardware.
Other TIPs already exist and more are under development. There is a 29K
simulator (ISS) which runs on UNIX and DOS hosts. The DFE communicating with
the simulator TIP is unaware that the 29K processor is not present, but being
simulated by a process, executing on, say, a UNIX workstation. There are also tool
developers constructing TIP programs to control processor emulators. This will
make possible a top–of–the–line debug environment.
UDI makes possible a wider tool choice for application code developers.
Debugger front end tools are supplied separately from target control programs. The
user can consider cost, availability and functionality when selecting the debug
environment. This level of debug tool configurability has not been available to the
embedded application development community in the past.
Because debuggers like GDB are available in source form, developers can add
additional debug commands, such as examination of real–time operating system
performance. This would require adding operating system structural information
into GDB. When the debugger front end and, for example, an emulator interface
module, are supplied as a single executable, adding new commands is not possible.
Via the use of Internet sockets the debugger may execute on a different networked
host than the node supporting the emulator control process.
7.5.9 Getting Started with GDB
To demonstrate the operation of GDB debugging a program running on an
Am29000 processor, the program below was compiled using the Free Software
Foundation’s GCC compiler. The example is simple, but it does help to understand
the GDB–MiniMON29K monitor debug mechanism. A stand–alone Am29000
processor development card was used. It contains a UART and space for RAM and
EPROM devices. The MiniMON29K monitor modules were linked with a HIF
operating system support module (OS–boot) and an Am85C30 UART message
driver module [AMD 1988]. The linked image was installed in EPROM devices in
the target hardware. A serial cable was then used to connect the UART to a port on a
SUN–3/80 workstation.
The demonstration could equally well have been performed on a 386–based IBM–PC, with the target hardware connected via a PC serial port.
Alternatively, there are a number of manufacturers building evaluation cards which
support a dual–ported memory located on a PC plug–in card containing the RISC
processor. The 386 communicates with the target processor via a shared memory
interface. This requires a TIP which can communicate via shared memory with the
DebugCore running on the target hardware. A number of such TIP control processes
have been built. A board developer has only to implement the TIP portion of the
debug mechanism to gain access to a number of debuggers such as GDB which are
UDI conforming. Note, due to an implementation limitation of the current DOS
version of GDB, it is necessary to start the TIP process manually. GDB is unable to
automatically start montip or isstip. The command shown below must be used to
start montip on a DOS host before GDB can communicate with the target 29K
system.
montip montip.exe
The demonstration program, listed below, simply measures the number of characters in the string supplied as a parameter to the main() function.
/* program measure.c */
main(argc, argv)
int	argc;
char	*argv[];
{
	int len;

	if(argc < 2) return;
	len = strlen(argv[1]);
	printf("length=%d\n", len);
}

int strlen(s)
char	*s;
{
	int n;

	for (n = 0; *s != '\0'; s++)
		n++;
	return(n);
}
GDB was started running on the UNIX machine. The target command was used
to establish communication with the DebugCore running in the standalone development card. The UDI Configuration file was used to establish DFE and TIP communication. The format of the Configuration File was described in section 7.5.6. The
UDI session_id for the example shown is monitor. The list below presents the response seen by the user. The keyboard entries made by the user are shown in bold
type.
gdb
GDB is free software and you are welcome to distribute copies of
it under certain conditions; type "show copying" to see the
conditions. There is absolutely no warranty for GDB; type "show
warranty" for details. GDB 4.5.2, Copyright 1992 Free Software
Foundation, Inc.
(gdb) target udi monitor measure
Remote debugging Am29000 rev D
Remote debugging an Am29000 connected via UDI socket, DFE–IPC version 1.2.1
TIP–IPC version 1.2.1 TIP version 2.5.1 MONTIP UDI 1.2 Conformant
Once communication had been established, a breakpoint was set at the entry to
the strlen() function. Execution was then started using the run command. GDB informs the user that the program is being loaded. This is accomplished by the TIP
sending messages to the DebugCore, which transfers the accompanying message data
into Am29000 processor memory before Am29000 processor execution commences.
(gdb) symbol measure
Reading in symbols for measure.c...done.
(gdb) break strlen
Breakpoint 1 at 0x10200: file measure.c, line 14.
(gdb) run measure_my_length
Loading TEXT section at 0x10000 (24408 bytes) ...
Loading DATA section at 0x80003000 (4096 bytes) ...
Clearing BSS section at 0x80004000 (0 bytes) ...
Breakpoint 1, strlen (s=0x80004013 "measure_my_length") (measure.c line 17)
17		for (n = 0; *s != '\0'; s++)
The program ran until the requested breakpoint was encountered. At this point a source code listing was requested. Typically, debug monitors do not allow source
code to be viewed. The use of GDB makes this important advantage available to the
embedded software developer.
(gdb) list
11
12	int strlen(s)
13	char	*s;
14	{
15		int n;
16
17		for (n = 0; *s != '\0'; s++)
18			n++;
19		return (n);
20	}
The user then examined the call–stack history using the info stack command.
This is currently inefficiently implemented. GDB uses the to_xfer_memory() procedure to send read messages to the target DebugCore. Examining the instruction
memory in this way is much less efficient than requesting the DebugCore to search
back through its own memory for procedural tag words. Each procedure has a non–
executable trace–back tag word, or two, placed before the first instruction of the procedure (see Chapter 3). Tag words enable debuggers to quickly gain information
about a procedure frame, and hence variable values. Adding the procedural “hook” to
GDB to make use of the MiniMON29K monitor FIND service would greatly reduce
message traffic and improve the user's response time for the info stack command.
(gdb) info stack
#0 strlen (s=0x80004013 "measure_my_length") (measure.c line 17)
#1 0x101ac in main (argc=2, argv=0x80004000) (measure.c line 8)
GDB enables single stepping of source code with the step or next commands.
The listing shows a source–level step request followed by the printing of procedural
variables “n” and “s”. With large embedded programs it is important to be able to debug at source–level, and examine variables without having to look at cross–listing
mapping tables to find the address associated with a variables memory location. Typically small embedded debug monitors do not support this kind of debugging.
(gdb) step
17		for (n = 0; *s != '\0'; s++)
(gdb) print n
$1 = 0
(gdb) print s
$2 = (unsigned char *) 0x80004013 "measure_my_length"
Embedded applications often deal with controlling special purpose hardware
devices. This may involve interrupt handlers and assembly–level code which operates with processor registers reserved for the task. GDB does support examination of
assembly code and registers by name. The listing below shows disassembly from the
current PC location (PC1 on the Am29000 processor). The si command was then
used to single step at machine instruction level. The cont command caused execution
to continue to completion, as no further breakpoints were encountered.
The result of the printf() function call can finally be seen. This function relies on
the operating system making use of MiniMON29K monitor messages. The HIF–OS
write() system call, like the DebugCore, sends the required message to the host processor. However, in the case of operating system messages, the message is not normally sent to the GDB module but to the HIF–OS support module. An exception is
made in the case of a read() or write() to the standard–in or –out channel. Related
messages are relayed via UDI to GDB which must control both the displaying of received data on the screen and sharing the keyboard between the application and the
debugger itself.
(gdb) x/4i $pc
0x10228 <strlen+64>:	sub	gr117,lr1,8
0x1022c <strlen+68>:	load	0,0x0,gr118,gr117
0x10230 <strlen+72>:	add	gr118,gr118,1
0x10234 <strlen+76>:	store	0,0x0,gr118,gr117
(gdb) si
0x1022c	18		n++;
(gdb) p/x $pc
$3 = 0x0001022c
(gdb) cont
Continuing.
length=17
7.5.10 GDB and MiniMON29K Summary
GDB is a powerful debug tool which can be applied to the problem of developing software for embedded applications. The MiniMON29K monitor DebugCore
and message handling modules enable GDB to be simply incorporated in a wide
range of embedded systems. The MiniMON29K monitor has only a small memory
requirement and does not require processor registers to be reserved for its use.
Users are free to incorporate their own real–time operating system, or alternatively make use of the HIF operating system module. Because GDB is available in
source form, it can be extended to understand real–time operating system support
data structures. Purchasers of third party executives, or those who choose to build
their own, should not find it difficult to extend GDB to analyze the real–time operating system control parameters, via the Universal Debugger Interface standard.
The increased complexity of the applications being tackled by RISC processor designs brings a corresponding increase in software complexity. The low cost of
GDB and its associated productivity make it an attractive choice for developers.
7.6 SIMPLIFYING ASSEMBLY CODE DEBUG
It would be ideal to have a whole chapter dedicated to the subject of Designing
for Debug. However, size constraints have restricted this section to a few hints about
how to better develop assembly code. Certainly those developing 29K based systems
should first consider the difficulties (if any) of connecting logic analyzers, ROM
emulators or in–circuit emulators to their designs before constructing any circuitry.
Tool suppliers as well as AMD support services and literature provide useful information with regard to planning for debug. This information should be obtained
and studied at the early stages of a project.
When developing a program in a high level language such as C, the compiler can be directed to provide the necessary debug information in the output object file (COFF file). With the High C 29K compiler, as with most C compilers, the "–g" switch informs the compiler that additional debug information should be produced. Source level debuggers, such as UDB or GDB, need the additional information
in order to correctly perform their task. Using High C, it is possible to examine the
assembly level directives which result from the use of the “–g” compiler switch. For
example, use the command “hc29 –S –Hanno –g file.c” to produce a file called
“file.s” which has high level language debug directives embedded among the 29K
assembly code.
When developing programs at assembly level it is best to include the high level
debug directives –– too frequently assembly language developers omit this task. Directives can be added to provide symbol–table and line number information for the
assembly files. This simplifies the task of later debugging the assembly code. For example, the swaf utility can be used to read COFF files and produce an information file
in Hewlett Packard’s General Purpose Ascii (GPA) format. The GPA file can be
loaded into an HP16500B logic analyzer, enabling the analyzer to display symbol information rather than, say, hex address values. Further, using HP’s B3740A Software
Analyzer product in conjunction with their logic analyzer, trace of source line execution is possible if line number information has been provided.
It is best to use macro instructions to embed the required symbol–table and line number information. The FUNC macro is used in the following example to provide information about the function dbg_w_glob(), which is written in 29K assembler.
;Write absolute global registers with memory resident data.
;dbg_w_glob(dest_p, src_p);
FUNC _dbg_w_glob, __LINE__
	mtsr	IPA, lr2	;IPA set to destination
	jmpi	lr0		;return
	load	0,0,gr0,lr3	;read memory
ENDFUNC _dbg_w_glob, __LINE__
Macro instruction ENDFUNC is used to mark the end of the function. Both
macros receive two parameters; the first is the name of the function, the second provides line number information. Symbol __LINE__ is expanded by the C pre–processor utility, cpp, which is available with most systems supporting Unix. Note, user
of High C 29K version 3.3 or newer will not need to use cpp as the assembler directly
supports the use of the __LINE__ symbol. When a file is processed by cpp, the
__LINE__ symbol is replaced by the current line number. Unfortunately, cpp adds a
line at the start of its output file which does not comply with 29K assembler syntax.
This line is simply removed using the tail Unix utility. In general, to support line
number expansion, command lines similar to the following three must be added to a
Unix makefile for each assembly source file.
cpp file.s > tmp.s		# run C pre–processor
tail +2 tmp.s > _file.s		# Use Unix "tail" utility
as29 _file.s			# assemble file
The listing below shows the code used to implement the FUNC macro. The return type of the function is “int” (T_INT). A tagword is provided but the field details
are not constructed.
	.macro	FUNC, fname, fline
	.def	fname		;start symbol–table entry
	.val	fname		;value of symbol = address
	.scl	2		;storage class = C_EXT
	.type	0x24		;type of symbol = T_INT and DT_FCN
	.endef			;end of symbol–table entry
	.word	0x0		;Tag word
	.global	fname
fname:
	.def	.bf		;start symbol–table entry
	.val	.		;value of symbol = PC address
	.scl	101		;storage class = C_FCN
	.line	fline		;source line number
	.endef			;end of symbol–table entry
	.ln	1		;line number within new section
	.endm
The listing below shows the code used to implement the ENDFUNC macro.
There are a number of high level language support directives required to specify an
end of function. The comment fields explain the symbol–table definition used.
	.macro	ENDFUNC, fname, fline
	.def	.ef		;start symbol–table entry
	.val	.		;value of symbol = address
	.scl	101		;storage class = C_FCN
	.line	fline		;source line number
	.endef			;end of symbol–table entry
	.def	fname		;start symbol–table entry
	.val	.		;value of symbol = PC address
	.scl	–1		;class = C_EFCN (func. end)
	.endef			;end of symbol–table entry
	.endm
Assembly macros can also be used to provide type information for data which is
defined in assembly level modules. Symbol–table information for data variables is
usually less useful than information about functions. However, if assembly level directives are not used, then all data will appear to be of type “char” (T_CHAR). Some
debuggers may be confused by this and will not be able to correctly report which symbol has been accessed during a load or store operation. The example below shows
how the INT_32 and INT_32_ARY macros can be used to define variables. In the
example, the variables are located in a BSS (un–initialized data) region. The macros
provide the high level language directives which result in the correct symbol–table
information.
	.sect	dbg_bss,bss	;32–bit uninitialised data
	.use	dbg_bss
	.align	4
	INT_32	_dbg_tmp_reg	;4–byte data
	INT_32	_dbg_tmp_p	;4–byte data
	INT_32_ARY	_dbg_return,8	;8 * 4–byte array
The listing below shows the code used to implement the INT_32 macro. The
symbol is of type “int” (T_INT). The enumeration (4) used for this type can be found
in the documentation supporting COFF. Alternatively, the C compiler can be run with
the “–g” switch and the output examined.
	.macro	INT_32, name
name:	.block	4
	.def	name		;start symbol–table entry
	.val	name		;value of symbol, address
	.scl	2		;storage class, C_EXT
	.type	0x4		;type of symbol, T_INT
	.endef			;end of symbol–table entry
	.endm
The INT_32_ARY macro is shown below. This macro is a little more complex
as it declares an array. The two macros shown here are useful but do not represent the
complete range of macros which would be required to describe all data types. However, given these examples, it should not be difficult to construct any other macros
required.
	.macro	INT_32_ARY, name, size
name:	.block	4 * size
	.def	name		;start symbol–table entry
	.val	name		;value of symbol, address
	.scl	2		;storage class, C_EXT
	.dim	size		;dimension of array
	.type	0x34		;symbol type = T_INT and DT_ARY
	.endef			;end of symbol–table entry
	.endm
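For comparison, the three variables defined in the earlier BSS example carry roughly the symbol–table information the compiler itself would generate for C declarations along the following lines (a sketch only; note that INT_32 describes _dbg_tmp_p as an int, not as a pointer).
/* C declarations whose "-g" output corresponds to the macro examples above */
int dbg_tmp_reg;        /* T_INT                                         */
int dbg_tmp_p;          /* T_INT (the macro does not describe a pointer) */
int dbg_return[8];      /* T_INT with DT_ARY, dimension 8                */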
7.7	SOURCE LEVEL DEBUGGING USING A WINDOW INTERFACE
There are a number of source level debuggers available for the 29K family
which support a windowed user interface; primarily windowed–gdb, xray29u and
UDB. A windowed debugger is appealing to many development engineers because
of its convenient interface and potentially greater productivity. For the benefit of
those engineers involved in embedded processor development who have not yet had
the opportunity to experience a windowed debugger, this section gives a brief
introduction to the topic. For illustration purposes, the UDB universal source level
debugger is used.
UDB was specifically designed for embedded software development.
Consequently, UDB provides a Generic I/O (GIO) interface alternative to ptrace()
for communicating with the target 29K system. The GIO code runs as a separate
process from the UDB process. The two processes communicate via a socket
connection on Unix hosts. This enables the GIO process to be provided in source
form without having to make UDB source code available. A UDI conformant version
of UDB is available for Unix hosts. This was achieved by connecting the GIO
interface to the UDI–p library (see Figure 7-10). Currently, a UDI interface for PC
Windows is under development, and this will lead to a UDI conformant version of
UDB for PC Windows. CaseTools Inc., the developer of UDB, has a GIO
implementation available which is not interfaced to the UDI standard, but
communicates directly with a CaseTools maintained monitor known as UMON.
Currently, UDB for Windows operates with UMON rather than the DebugCore
which is supplied as part of the MiniMON29K bundle. This will be the position until
the UDI for PC Windows specification has been completed.
When a UDI conformant version of UDB is used with a 29K evaluation board,
establishing UDB operation is very simple. The mktarget command in the udb.rc
start–up file is used to start a GIO process which supports the UDI interface. The
GIO process uses the assigned mktarget parameters to select the entry in the udi_soc
file which is then used to establish the DFE–TIP connection (see section 7.9.2). In
this way, it is particularly convenient to use UDB with the instruction set simulator
ISSTIP. Similarly, UDB can be configured to connect to MonTIP which itself
communicates, for example, via a serial link or NetROM (ROM emulator), to the
DebugCore running in the 29K system.
Using UDB with UMON rather than MiniMON29K is also simple when the
29K evaluation board has UMON installed and running. CaseTools recommends that
UMON first be linked with 29K boot–up code known as boot–crt0 and then installed
in, say, ROM on the target 29K system. A CaseTools customer is required to
construct their own boot–crt0 code. This could be accomplished using the OS–boot
code provided by AMD.
However, because AMD provides 29K evaluation boards with MiniMON29K
already installed, some developers and evaluators may wish to run UMON without
first constructing a boot–crt0. The UMON monitor can be run on top of the
MiniMON29K DebugCore, and where necessary the application can make use of
services provided by OS–boot in place of the missing boot–crt0. A number of
preparation steps must be taken to make this tool combination operate correctly.
These steps are explained below. It is important to remember these steps only apply
when launching UMON from MiniMON29K. No special linking and loading steps
are required if UDB is used directly with MiniMON29K or directly with UMON
combined with an appropriate boot–crt0.
[Figure 7-10 is not reproduced here. It shows the UDB debug process (DFE) on the host connecting, through the GIO process and the UDI–p library, over a UDI link to the MonTIP target–interface process (TIP), which drives a 29K RISC running the MiniMON29K DebugCore and OS–boot; alternatively, a private GIO connection established by the mktarget command links UDB to a 29K RISC running UMON.]
Figure 7-10. The UDB to 29K Connection via the GIO Process
MiniMON29K – UMON Differences
When using UMON rather than MiniMON29K components, a different crt0 file
must be linked with application code (the crt0 file linked with application code
should not be confused with the boot–crt0 file linked with UMON). There are a number of reasons for this; for example, MiniMON29K normally clears the application
BSS data region when a new program is prepared for execution. UMON does not
clear the BSS region and hence this task must now be performed by the crt0 file
linked with the application. Normally the default crt0 file provided with the compiler
is linked ahead of any application code.
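The fragment below is only a sketch of the additional work such a crt0 must perform; the symbols used to bound the BSS region are hypothetical placeholders for whatever symbols the linker command file in use actually defines.
/* Sketch: zero the application BSS region before main() is entered.
   The bounding symbols are hypothetical, not the names used by the
   High C 29K tool chain.                                            */
extern char bss_start[], bss_end[];

static void clear_bss(void)
{
    char *p;

    for (p = bss_start; p < bss_end; p++)
        *p = 0;
}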
UMON expects OS–boot to satisfy HIF service requests –– at least for service
numbers 256 and greater. However, service numbers 255 and lower require OS–boot
to request help from MonTIP. For this purpose, OS–boot exchanges MiniMON29K
messages with MonTIP. When the DebugCore is replaced by UMON, the
MiniMON29K message system is also replaced by the UMON communication
mechanism. Hence, OS–boot can not be used to perform HIF services with service
number 255 or lower.
To support HIF services such as write (service number 20) which is used by the
printf() library routine, UDB is provided with a library which supplies routines, such
as _write(), which interface to the UMON communication mechanism. This library,
libudb.lib, must be linked with application code. The libudb.lib library must be
linked ahead of the default libraries supplied with the compiler, as the default libraries also contain system call glue–routines such as _write(); but these now unwanted routines request HIF services supported by OS–boot.
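To make the division of labor concrete, the sketch below shows the general shape of such a glue routine. It is not the libudb.lib source; the transport function name is hypothetical, and the exact _write() prototype used by the library may differ.
/* Conceptual sketch only. printf() in the C library eventually calls
   _write(); because libudb.lib is linked ahead of the default libraries,
   a version like this one, which talks to UMON, services the request.   */
extern int umon_send(int fd, const char *buf, int nbytes);  /* hypothetical */

int _write(int fd, const char *buf, int nbytes)
{
    return umon_send(fd, buf, nbytes);   /* forward via the UMON channel */
}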
Compiling a Program for UMON Debugging
When UMON is launched from MiniMON29K, care must be taken when
building application programs for debug. A makefile for driving the High C 29K
compiler is provided with UMON as a template for building application programs.
The makefile builds a link–command–file and ensures the correct files are linked in
the correct order. The “APP=fib” line at the top of the makefile must be modified to
change the application program being built.
Alternatively, the compiler can be driven directly from the command line. First,
alternative files must be copied from the UMON installation to the High C 29K
installation directories. Copy the file /udb/apps/crt0.o to /29k/lib/udb_crt0.o and file
/udb/apps/libudb.lib to /29k/lib/libudb.lib. These files are referenced by the command file /udb/apps/udb.cmd which should also be copied to /29k/lib/udb.cmd. File
fib.c can then be compiled with the command:
hc29 –g –o fib.abs –nocrt0 –cmdudb.cmd fib.c
The “–nocrt0” option suppresses linking of the default /29k/lib/crt0.o file. The
udb.cmd file is configured to link programs starting at memory location
0x40040000. This is suitable for use with an SA29200 evaluation board. The address
is higher than usual because UMON is also installed in the 29K’s program memory.
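The fib program itself is not listed in this text; a stand–in along the following lines (purely illustrative) is sufficient for exercising the build commands above and the breakpoint examples later in this section.
/* fib.c -- an illustrative stand-in for the fib example program */
#include <stdio.h>

static int fib(int n)
{
    if (n < 2)
        return n;
    return fib(n - 1) + fib(n - 2);
}

int main(void)
{
    int i;

    for (i = 0; i < 10; i++)
        printf("fib(%d) = %d\n", i, fib(i));
    return 0;
}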
Figure 7-11. UDB Main Window Showing Source Code Frame
UDB does not load symbol information directly from 29K COFF files. The
utility mksym (see section 2.6.1) must be used to build a symbol file in a format
understood by UDB. The command below builds a symbol file for the fib.abs COFF
file compiled earlier. It is convenient to place the mksym command in the build
makefile.
mksym fib.abs fib.sym
Preparing for UMON Debugging
When MiniMON29K is used to launch UMON, the UMON monitor must first
be installed in the target system memory. Once installed, control of the processor is
passed from the DebugCore to UMON. MonDFE can be used to install UMON. Assuming the udi_soc file (udiconfs.txt for PC hosts) has an entry “serial” for establishing operation with a 29K target board, MonDFE can be started with the command:
mondfe –D –TIP serial.
Once MonDFE is started, the UMON program can be loaded and execution
started. At this stage MiniMON29K including MonTIP is no longer needed. The
MonDFE command sequence below is all that is needed to get UMON running.
y /udb/umon/sa200/sa200x.abs
g
q
Debugging a Program
As noted previously, if MiniMON29K is running on the target 29K system, a
UDB can be started which utilizes a UDI conformant GIO. If UMON is running on
the 29K target system, a non–UDI conformant UDB should be started. Double clicking on the UDB29K icon or starting UDB program execution from the command line
will establish a connection with the debug monitor (assuming the udb.rc command
file contains a mktarget command). The file udb.rc is read by UDB during the startup sequence. It can be used to customize UDB operation.
Once UDB has started, a 29K application program can be loaded. This is done
by using the upper left menu item File–load–Symbols & Executable. Then use the
menu item Execute–Run until, and enter the label “main” when prompted for an
address; the source file will then be displayed in the window, as shown for example in
Figure 7-11. Once a window has been created it can be used to display different
display frames. In Figure 7-11 a source code frame is displayed in the newly created
window. In general, any window can display any frame type. The following
discussion does not strictly adhere to the correct terminology for frames and
windows. In particular, where it is convenient, the term “window” may be used to
refer to a “frame” within a given window.
It is usually necessary to pop–up a window displaying a Console frame to enable
program input/output. This can be done by clicking on the Con button (lower right)
while holding down the shift key. A Console window will appear which enables application input/output information to be displayed. The keyboard echo option must
be enabled by first clicking the right mouse button while in the Console window. The
left button can then be used to select “echo on” from the provided menu. Program
execution will continue when the Go button (top middle) is pressed.
Figure 7-12. UDB Window Showing the Assembly Code Associated with the Previous Source Code Frame
The method used to pop–up a Console window can also be used to pop–up a
range of other debug support windows. The Asm button produces a window displaying assembly level code. The assembly code window shown in Figure 7-12 was produced by clicking the Asm button. The high–lighted code line (with arrow) corresponds to the current Program Counter (PC) position. The corresponding source line
was high–lighted in the source code window (Figure 7-11). All windows which have
a mode setting (Source or Code mode) set to display the current code position
are updated automatically when the PC changes. A new PC value is reported whenever program execution stops due to, say, single–stepping a source or assembly code
line, or hitting a breakpoint. The example windows have a breakpoint set at the first
line of the fib() function. The right mouse button can be used to select the current display mode for a window.
This section is too brief to fully describe the capabilities of UDB. Developers
typically pop–up a number of windows displaying code, memory and register contents. Windows can be selected and arranged in a way suited to an individual developer or project’s requirements. As a further example, Figure 7-13 shows a window displaying global register contents. The window is updated whenever a register value
changes. A new value can be entered into a register by placing the cursor over the
selected register data value and typing in the new value. The right mouse button can
be used to select other types of registers for displaying. For more information about
UDB commands, consult the UDB User’s Guide.
7.8	TRACING PROGRAM EXECUTION
Tracing program execution refers to recording the instruction execution and
data accesses performed by a processor. Programs are normally traced up to a breakpoint or other event causing normal instruction execution to halt. A software engineer can examine the trace information and determine the program’s operation prior
to the event. The technique provides the software developer with a powerful tool for
eliminating software bugs.
A tracing capability is normally provided by an In–Circuit Emulator (ICE) or
logic analyzer. The task of tracing is complicated by on–chip instruction and data
caches. Without caches, processor activity is fully visible from the memory interface.
When an access is performed to on–chip cache, it is not normally possible to determine the address or the data accessed. ICE developers can overcome this problem,
but often at increased tool cost. Those using a logic analyzer to perform tracing are
traditionally limited to debugging with caches turned off; or if caches are enabled, not
being able to observe all of a program’s execution. Embedded systems typically have
to meet stringent timing requirements and consequently it is not usually possible to
turn off caches.
Figure 7-13. UDB Window Showing Global Registers
It is unfortunate that the use of a logic analyzer is restricted. Logic
analyzers are not processor specific and are universally used by hardware development engineers; they are frequently available to the software engineer working on
embedded product development.
The 29K family helps overcome the problem of tracing while caches are turned
on by employing Traceable Cache technology. The 2–bus microprocessors and the
high–end microcontrollers support traceable caching. Later in this section, traceable
cache operation is described in detail for the Am29040 processor.
This section deals with the use of a logic analyzer as a software debugging tool.
As well as describing the problem in general terms, specific material is included relating to the use of Hewlett–Packard (HP) logic analyzers. HP was chosen because
their logic analyzers are popular, and many of the accompanying support tools have
been adapted to operate with HP analyzers. Other logic analyzer manufacturers and
their partners have also developed tool combinations which support source level
software debugging.
Logic Analyzer Connection
Many of the evaluation boards offered by AMD contain sockets suitable for
quick connection of a Hewlett–Packard logic analyzer. This simplifies the process of
connecting the analyzer to the processor’s signal pins. Certain other logic analyzer
manufacturers support a compatible termination adapter (pod) format. Logic analyzers such as the HP16500B (system) and the HP166x series connect directly to the
evaluation board connectors. This is convenient, as connecting to devices in packages other than PGA can be cumbersome and unreliable. Connection to your own
board can be achieved via a logic analyzer preprocessor: a preprocessor consists of a
small circuit board which connects directly into the processor socket (possibly with
the aid of a socket extender). A replacement processor is located on the board along
with an array of analyzer connection sockets. Corelis Inc. supply preprocessors for
microprocessors and microcontrollers in the 29K family.
Microcontroller members of the 29K family incorporate on–chip memory interface controllers. This results in the microcontroller providing RAS and CAS address
information separately (multiplexed on the same address pins) rather than a complete
DRAM address value. Consequently, it is necessary to latch the RAS address information and later combine the CAS address bits to produce a complete DRAM address. If the address latching technique is not used, then the logic analyzer can not
display the complete address used for a DRAM access. This is very inconvenient. For
this reason, AMD provide address latching circuitry on their more recent microcontroller evaluation boards. Corelis also provide address latches on their preprocessors.
The active components on the preprocessor draw power from the pins supplying
power to the processor.
A Logic Analyzer as a Software Development Tool
Logic analyzers can be used to study a circuit’s state and timing information.
Hardware engineers typically display state information in hexadecimal or binary
format (see Figure 7-14). Software developers need a format which is more relevant
to their task. To this end, Corelis provide a tool which runs on the logic analyzer and
enables the processor bus signals to be displayed in assembly instruction format. The
tool is used in conjunction with a configuration file which formats the analyzer to the
assigned preprocessor signals. (For example, file POD_040._D for the Am29040
preprocessor.) When the configuration file is used, the task of first assigning labels to
the termination connector signals is eliminated. When the inverse assembler tool is
used, the DATA label shown on Figure 7-14 can optionally be displayed in terms of
29K assembly instructions rather than the hexadecimal equivalent.
Figure 7-14. HP16500B Logic Analyzer Window Showing State Listing
Using the swaf utility described in section 2.6.1, it is possible to display the
ADDR label shown on Figure 7-14 in terms of program address labels. The utility
builds a GPA formatted symbol file from information extracted from a linked COFF
file. The GPA file must be transferred to the analyzer; this is best done using a LAN
connection.
Hewlett–Packard has further extended the ability of their analyzers to support
source level debugging. Their B3740A Software Analyzer tool enables trace information to be displayed at source level. The tool runs on a Unix workstation or on a
PC running Windows. An HP16500B logic analyzer must be connected to a computer system via an HP16500L LAN card. Once the analyzer is connected to the LAN, it
can be controlled from the workstation or PC. For example, an X–terminal connected
to a computer running Unix can use the Software Analyzer tool to display program
trace information in terms of the original C code. For convenience it is also possible
to display the equivalent assembly level trace normally presented on the dedicated
analyzer display. Analyzer trigger logic can be set from the X–terminal and is presented in terms of address symbols rather than hexadecimal values. The Software
Analyzer tool currently runs with HP16550A (6–pod, 102 channel) and HP16555A
(4–pod, 68 channel) analyzer cards which can be installed in the HP16500B logic
analyzer system.
Hewlett–Packard’s Software Analyzer tool is very useful; however, it suffers from displaying trace information corresponding to instruction fetch activity
rather than instruction execution. Not all fetched instructions which are observed on
the system bus flow through the processor pipeline and are executed. Instructions
can be fetched due to cache block reload or instruction fetch–ahead. The Software
Analyzer indicates that these instructions have been fetched for execution in the same
way as instructions which really are executed –– they are indistinguishable. This
problem is overcome when the logic analyzer is driven by the MonTIP program. The
UDI conformant MonTIP has been extended to include support for the HP16500B
logic analyzer. Algorithms have been incorporated within MonTIP for processing
trace information. These algorithms, described in more detail later, are able to eliminate unwanted trace information and consequently produce trace data which corresponds to the execution path taken by the processor.
Traceable Caching
Traceable Caching is accomplished using two processors in tandem: a main processor and a slave processor. The two processors are connected together, pin–to–pin,
except the slave uses its address bus and a few other signal pins to indicate cache hit
activity. The main processor performs all the required operations, and the tracing
processor duplicates the operation of the main processor except that the output pins
connected in parallel are disabled. All processor outputs to the system are driven by
the main processor. The slave processor simply latches the results of the accesses performed by the main processor.
With the Am29040 processor, the address bus A31–A0 of the slave (the tracing
processor), along with output pins REQ, R/W and I/D, reports physical branch addresses even if the target instruction is provided by the on–chip instruction cache. By tracing the slave processor signals along with those of the master, it is possible to exactly reconstruct the sequence of instructions executed. Instruction execution is considered consecutive until a further nonsequential event (such as a branch or an interrupt) is reported by the slave processor.
When a load or store hits in the data cache, the Am29040 slave processor provides the corresponding physical address on its address bus. The slave also indicates
when a data access results in cache block allocation. When an instruction executes,
the corresponding processor status (signals STAT2–STAT0) is reported on the following cycle –– when the instruction is in the write–back pipeline stage. Load and
store instructions are reported in the same way as other instructions, at the write–back
stage, rather than when the actual data transfer is accomplished.
The Am29040 and Am29030 processors perform traceable caching at the internal speed of the processor; this may be twice the speed of the off–chip memory system. This ensures that the processor operation can always be fully reported. The
Am29240 microcontroller performs traceable caching at the off–chip memory system speed. This can lead to difficulties when the processor is running internally at
twice the memory system speed. For example, it is not possible to report the target
address of the first jump in a back–to–back sequence of jump instructions (instruction visiting). Only the target of the second jump is reported by the Am29240 slave
processor. Additionally, if a branch instruction executes in the same memory cycle as
a load or store instruction, the slave only reports the address of the branch instruction.
Traceable caching is enabled via the JTAG interface. Boundary–scan instructions for enabling or disabling tracing can be entered via the JTAG port. Corelis Inc.
manufacture preprocessor boards supporting traceable caching. The preprocessor
contains two processors: a master and a slave. The second processor is switched into
slave–type operation during reset. Active components on the preprocessor board
drive a TRACECACHE instruction into the slave processor. Around the perimeter of
the Am29040 preprocessor are nine logic analyzer connectors. An unusually large
number of analyzer pods is required due to the need to trace both master and slave
operation. In the Am29040 case, it is possible to operate with a minimum of eight
pods if the optional connector J7 is eliminated. This enables tracing to be accomplished with a minimum of two HP16550A or two HP16555A logic analyzer cards
inserted into an HP16500B analyzer system.
The analyzer cards should be connected together in master and slave
mode. This requires physically connecting ribbon cables on the cards. The cards can
be placed anywhere in the HP16500B card cage, as MonTIP scans for their actual
location. Assuming two HP16550A cards are located in slots D and E, pod E1 (slot E)
should be connected to position J1 on the Corelis preprocessor, and pod E2 to J2, and
so on. Pods D1–D3 must be connected to J7–J9. The POD_040._D file formats the D
and E analyzer cards for this configuration. The file, supplied by AMD or Corelis, is
normally located on the HP16500B directory /AMD/CONFIG/POD_040._D. Note,
a different POD_040._D file is required if HP16555A cards are used rather than the
lower cost HP16550A cards. The HP16550A card has a 4K sample memory depth (at
full channel width); the HP16555A can store 1024K samples.
Processing Trace Information
Enhancing MonTIP to control operation of the HP16500B logic analyzer offers
a number of advantages to the software engineer. It enables a UDI conformant debugger to access the analyzer. This makes the analyzer usable with a range of different
Debugger Front Ends (DFEs), such as UDB or xray29u. It also enables trace information to be processed before it is presented to the DFE. It is desirable that only
the execution instruction path be included in the trace data. This, after all, is what
software developers expect, given their previous experience using In–Circuit Emulators (ICE). A further advantage is that an analyzer can be combined with other UDI
conformant debug tools to produce a debug environment similar to that achieved
with an ICE.
The MonTIP program controls the logic analyzer and processes trace information. The same MonTIP can also control the target 29K system via commands sent to
a MiniMON29K DebugCore. The operation of MonTIP is directed by the chosen
DFE. The user enters commands to the executing DFE program. When the DFE is
started it typically initiates the operation of MonTIP. When started, MonTIP establishes communication with the DebugCore and, via a LAN, the HP16500B logic
analyzer. The DFE user interface will appear on the display, along with the
HP16500B user interface which is requested by MonTIP. In addition to entering DFE
commands, it is possible to enter HP16500B commands directly into the logic analyzer window. Note, a colour terminal simplifies the process of entering analyzer
commands.
Using the logic analyzer window, unprocessed analyzer trace can be viewed.
This is a tedious task, particularly when the 29K processor is operating with its on–
chip caches turned on. The DFE can also be used to display analyzer trace information, but this time in a fully processed format. Only instructions which actually
execute are reported in the trace listing.
The format of the displayed processed trace is dependent on whether the DFE
has been extended to display trace information. If a DFE has not been enhanced to
display trace in, say, source format, then the DFE must rely on MonTIP’s ability to
prepare trace information for display; this is achieved using a transparent–mode of
operation, which is described shortly. Bus signals selected for display in the processed trace must be included in the format for unprocessed (raw) trace. However,
they need not actually appear in the analyzer state listing window.
A processed trace line contains the instruction which was in write–back during
the captured trace cycle, or data which was transferred during the cycle. Put another
way, if an instruction is fetched from memory, then during its write–back cycle (if it
reaches execute) the op–code is reported in the processed trace. Let’s look at the algorithm used with the Am29040 Traceable Cache preprocessor. The DATA and ADDR
labels have their values changed by the algorithm to reflect the instruction which was
executed during the traced cycle. The DATA and ADDR labels in the raw trace indicate the instruction which was fetched during the traced cycle or data which was accessed during the cycle. If no data access or instruction execution occurs in a cycle,
then there is no processed trace line corresponding to the raw trace line. MonTIP only
reports lines which are considered valid.
The algorithm operates in two stages; first data accesses are processed, then
instruction flow is determined. Data accesses are examined to determine if there are
any repeat accesses reported due to the use of Scalable Clocking. Trace information
is captured at the internal processor speed. The memory system may be running at
half this speed. Consequently, accesses to memory are captured twice in adjacent
trace cycles. Only the final access is considered valid.
Data transfer, due to a load or store instruction, can occur during the same cycle
another instruction is executed. When this happens, the algorithm moves the reporting of the data access to a future trace cycle which contains no valid trace information. If another data transfer occurs before the previous is reported, then the previous
data value will not be reported. The R/_W and I/_D information is repositioned
where necessary and possible, so as to report data accesses which occurred. Note,
LOADM and STOREM data transfers are reported before the instruction execution
is reported; this reflects the correct operation of a 29K processor. Currently, the algorithm is being enhanced to enable multiple instruction executions or data accesses to
be reported on different processed trace lines which correspond to the same
captured trace cycle. This eliminates the need to reposition data accesses or drop their
reporting. These algorithm enhancements are required by superscalar processors.
[Figure 7-15 is not reproduced here. It shows instruction sequences, each running from a branch target through a sequential instruction sequence to a branch and its delay–slot instruction, with each branch target starting the next sequence processed recursively.]
Figure 7-15. Path Taken By Am29040 Recursive Trace Processing Algorithm
Data accesses which generate a cache hit are given the same treatment applied to
memory–resident data accesses. When a data transfer occurs, it will be reported in the
next available cycle (could be the current) which is not being used to report an
instruction’s execution or other valid data access. When the data cache is turned on, it
will not always be possible to report the value of the data transferred. The slave processor does not provide the cached data value, only the address.
For vector fetches, the vector fetch and the address of the first instruction as well
as the VECT status are reported on the same processed trace line.
The second stage of the algorithm is a little more complicated; it produces the
complete address–flow for executed code. It currently only operates with 32–bit
memory accesses. Programs should not be traced when executing from 8–bit
memory devices. A recursive algorithm determines consecutive instruction execution sequences, as shown on Figure 7-15. The algorithm starts with a branch instruction and stops when it reaches a delay–slot instruction. Branch instructions initiate
new instruction sequences for the algorithm to recursively process.
Once the address flow is determined, a second recursive routine determines the
instructions which correspond to the address flow. Often these instructions are
fetched from memory and can be found in the DATA field of a previous trace cycle.
However, if the instruction is supplied by the instruction cache then XXXXXXXX is
entered into the DATA column. If an address value lies in the loaded TEXT region
and the DATA column is marked XXXXXXXX, then the op–code is obtained from
the loaded COFF file and placed in the DATA field.
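The sketch below illustrates the shape of this second, address–flow stage. It is not the MonTIP implementation; the data structures and helper names are assumed purely for illustration, and the listing ignores data accesses, Scalable Clocking duplicates and the special cases described above.
/* Sketch: reconstruct the executed address flow from the non-sequential
   events reported by the slave processor.  Execution is assumed
   consecutive (pc advances by one word) until the next reported branch;
   the branch and its delay-slot instruction end the sequence, and the
   reported target starts a new sequence which is processed recursively. */
#include <stdio.h>

struct branch_event {
    unsigned long branch_pc;          /* address of the taken branch          */
    unsigned long target_pc;          /* target address reported by the slave */
    struct branch_event *next;        /* next non-sequential event in trace   */
};

static void emit(unsigned long pc)    /* record pc in the processed trace     */
{
    printf("%08lx\n", pc);
}

static void walk(unsigned long pc, const struct branch_event *ev)
{
    if (ev == NULL)
        return;                       /* end of the captured trace            */

    while (pc != ev->branch_pc) {     /* consecutive instruction execution    */
        emit(pc);
        pc += 4;                      /* 29K instructions are one word long   */
    }
    emit(ev->branch_pc);              /* the branch itself                    */
    emit(ev->branch_pc + 4);          /* its delay-slot instruction           */

    walk(ev->target_pc, ev->next);    /* recurse on the new sequence          */
}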
MonTIP Commands
Strictly speaking, commands should be processed by the Debugger Front End
(DFE), such as MonDFE. However, MonTIP has the capability of also processing
commands. The range of commands dealt with by MonTIP is greatly limited. Each
DFE has a mechanism by which its command processing can be placed in
transparent–mode. This causes commands to be passed to the TIP. With MonDFE,
commands beginning with the keyword “tip” are passed to MonTIP. A number of
commands have been added to MonTIP to support analyzer operation. By typing the
MonDFE command “tip lahelp” a list of the commands will be displayed. The
MonTIP man–page describes the commands in more detail.
The MonTIP command “latadd label, width” is used to add a column to the trace
listing produced by MonTIP. A “latadd” should be used for each column in the
processed trace listing. Acceptable values for “label” are defined by the labels which
appear in the raw trace listing. The only exception to this rule is for labels
SYMADDR and ASMDATA. These are pseudo labels derived from the raw labels
ADDR and DATA respectively. The use of the SYMADDR label causes the
hexadecimal address value to be replaced by an address symbol. For this to work the
“lacoff file” command must be used to specify the file to be used during symbol table
look–up. Addresses which are not found in the COFF file are presented in
hexadecimal format. Use of the ASMDATA label indicates that the DATA label
information should be disassembled –– when the corresponding address is known to
lie in a TEXT region.
The “latd start, end” command is used to display processed trace information
based on stored lines. Processed line and raw line numbers are the same with regard
to the processor status during the traced processor cycle, but only valid lines appear in
the processed trace listing. A valid line is one in which useful processor activity was
performed. For valid lines, the ADDR, DATA, R/_W and I/_D labels are reevaluated
to correspond with the associated processor status value.
MonDFE Trace Access Commands
This section briefly describes MonDFE commands relating to displaying trace
information. A complete list of MonDFE commands is obtained by entering the
command “?”. MonDFE supports command files with the “zc file” command. It is
useful to place a list of “tip latadd SYMADDR”–type commands in a command file
such as la.rc. This enables the “zc la.rc” command to initialize MonTIP trace
processing. The MonDFE command “ze file” can be used to record displayed
information into a log file.
Before a trigger can occur, trigger conditions must be installed in the analyzer.
The command “latrig term, label=pattern” can be used when setting trigger patterns
in the logic analyzer. Trigger logic and sequence control must be specified
using the analyzer window. Once the trigger has been established (and the
POD_040._D setup may be adequate) it is usually only necessary to use “latrig”
commands such as “tip latrig a, ADDR=10004”. Trigger patterns can be entered
directly using the analyzer window; this also requires that the hexadecimal pattern (rather
than symbol) value be known. If the swaf program is used to build symbol table
information in Hewlett–Packard’s GPA format, it can be directly loaded into the
analyzer. This enables trigger labels to be set using symbols when directly using the
analyzer window.
The analyzer can be triggered at each breakpoint by setting the break–address
to the illegal op–code vector address (vector 0). This technique is useful when
breakpoints are implemented by temporarily replacing instructions with illegal
op–code instructions. The MiniMON29K DebugCore uses this technique when
on–chip breakpoint registers are not available.
UDB Commands
UDB has been enhanced by CaseTools to support displaying trace information
in source format. This makes UDB a preferred tool for use with a logic analyzer.
Additionally, UDB, like other non–enhanced source level debuggers, can also be used
in transparent–mode. With transparent–mode operation, it is possible to issue
commands for MonTIP processing. Given that UDB supports source level tracing, it
is unlikely that transparent–mode operation would be selected for use with UDB.
However, it is described here to aid users of other source level debuggers which only
have access to transparent–mode commands.
UDB is a window based debugger; however, command line processing is
supported. When in the Main window or the Console window, a command line
sequence begins with an <ESC> character. For example, to issue a “latadd” MonTIP
command, use the command sequence “<ESC>ioctl tip latadd label”.
Alternatively, the “latadd label” command can be directly entered at the
Console window. If the console is not currently gathering input for an outstanding
standard input request (such as a scanf()), the keyboard input is sent to MonTIP for
processing rather than the application or target operating system. It is useful to place a
list of “ioctl tip latadd SYMADDR”–type commands in a command file such as la.rc.
The “<ESC>ioctl tip exec la.rc” command can then be used to process the MonTIP
command file. An example la.rc file for use with UDB is shown below:
ioctl tip latclr;
ioctl tip latadd LINE;
ioctl tip latadd ADDR;
ioctl tip latadd SYMADDR;
ioctl tip latadd ASMDATA;
ioctl tip latadd R/_W,6;
ioctl tip latadd *STAT_,6;
ioctl tip lamore 20;
The trace listing produced by commands such as “<ESC>ioctl tip latd 0, 20” will
appear in the console window along with any other console output information such
as printf() output. It is also convenient to use UDB’s macro instruction capability to
bind macros to buttons associated with the console frame. This allows user defined
buttons (left side of frame) to be simply clicked to issue the required MonTIP
command. The macro instructions shown below can be placed in the udb.rc startup
file.
macro m=mcon –f –”echo”  {lb6}  ”{com}stty +echo\r”
macro m=mcon –f –”trig”  {lb7}  ”{com}ioctl tip latrig a, ADDR=”
macro m=mcon –f –”sync”  {lb8}  ”{com}ioctl tip lasync\r”
macro m=mcon –f –”la.rc” {lb9}  ”{com}exec la.rc\r”
macro m=mcon –f –”coff”  {lb10} ”{com}ioctl tip lacoff ”
macro m=mcon –f –”help”  {lb11} ”{com}ioctl tip lahelp\r”
macro m=mcon –f –”latd”  {lb12} ”{com}ioctl tip latd ”
Unprocessed analyzer trace was shown on Figure 7-14. The corresponding processed trace is shown on Figure 7-16. Trace information is presented in the console
frame using UDB in a transparent–mode of operation. Although the Console window
is adequate, it is easier to study program execution from the enhanced trace window.
The enhanced trace window is shown on Figure 7-17. This window appears
when the trace view–toolbar button (bottom right of window) is selected. The trace
listing window is formatted via the “trcol” command. This command, along with a
number of other trace modifying commands, can be interactively entered at the UDB
command line. However, it is much more typical to arrange for “trcol” commands to
be processed during UDB start–up. This is accomplished by entering a “trcol” command sequence, such as the following example, into the udb.rc file.
trcol –d –w 8  ADDR
trcol –d –w 15 SYM
trcol –d –w 8  DATA
trcol –d –w 27 DASM
trcol –d –w 6  TYPE
trcol –d –w 6  STAT_
UDB fetches new trace information when the Fetch button is pressed. This
button should be used whenever the analyzer has acquired new trace data. The Start
and Stop buttons are provided for remote Running and Halting of the analyzer data
acquisition. This is equivalent to using the top–right–hand corner button on the
analyzer display. The Line button is used for displaying a desired line number ––
entered via a dialog box. The Top button moves the current line (indicated by the
cursor position) to the top of the display.
The Set button is particularly useful. The selected line is highlighted in red and
the raw analyzer display is adjusted as necessary to show the corresponding raw trace
line. If any source–display windows are opened (in update mode rather than edit
mode), they are adjusted to show and highlight (in red) the corresponding C source
line. The Loc button can be used to relocate the current highlighted line.
After a trace line has been selected and the Set button applied, the Next and Prev
buttons can be used to single step through source level trace. Using the Next button,
all three displays (if in use) will be updated with the next line corresponding to the
recorded source execution. If the shift key is held down while using the Next or Prev
button, assembly level stepping rather than source level stepping is performed.
Trace column SYM is a synonym for column ADDR formatted symbolically.
The address symbols are taken from the loaded symbol file. Hence the need to load a
UDB symbol file produced by the mksym utility. The DASM column is a synonym
for DATA presented in disassembly format. When an instruction is supplied by the
on–chip cache, XXXXXXXX is placed in the DATA column. However, the Xs will
be replaced with the actual instruction if an executable program has been loaded. For
more information on preparing the trace display see section 7.9.3.
Figure 7-16. UDB Console Window Showing Processed Trace Information
Figure 7-17. UDB Trace Window Showing Processed Trace Information
7.9	Fusion3D TOOLS
Shortened development times and increased product complexity have
necessitated the use of powerful software development tools. Unfortunately,
however, the higher processor speeds and on–chip integration provided by many of
the newer embedded RISC processors have led to an increased cost associated with
traditional debug tools such as In–Circuit Emulators (ICE). Additionally, the rapid
changes occurring in the embedded processor market and the frequent introduction
of processor variations has placed emphasis on the need for tool reusability.
Reusable and low cost tools have a broad appeal among software designers.
The term Fusion3D refers to a Distributed Design and Debug environment. The
purpose behind Fusion3D is to provide cost effective design and debug tool
alternatives selectable from a range of compatible products. This is achieved by
distributing the primary tool functions. For example, traditionally a full–function
ICE has been chosen as the primary debug tool. However, the overlay or substitute
target memory provided by an ICE is alternatively available with a ROM emulator;
the ICE’s tracing capability can be effectively achieved with a logic analyzer; and
controlling program execution can be accomplished with a debug monitor and
on–chip debug support hardware. The capabilities inherent in a full–function ICE are
distributed among the selected Fusion3D components. AMD has identified and
worked with key Fusion3D partners to bring together the necessary components of
the Fusion3D environment.
The Fusion3D approach is flexible. The scalable nature of Fusion3D enables the
software developer to construct a debug environment which is adequate for the task
to be undertaken, yet does not incur the high costs typically associated with a
full–function ICE. At a later stage, if a project requires an additional debug
capability, the chosen tool combination can be enhanced.
Many of the tools provided by the Fusion3D program are useable with any
member of the 29K family or other processor family such as the X86. This helps
reduce the cost associated with tooling–up for a new project. For example, the
HP16500B logic analyzer is widely used within the industry. Traditionally it has been
used by hardware development engineers. Extending its utility as a software
development tool, useable across a wide range of processors, is very cost effective.
7.9.1 NetROM ROM Emulator
A ROM emulator is used to replace a system’s ROM or SRAM type memory
devices with substitute memory. Typically, ROM devices are removed from socket
locations on the target system and a cable used to connect the ROM emulator to the
vacated sockets. The processor can read the emulated memory as if it were real
ROM. Occasionally there may be differences in memory access times due to
different memory access wait states; but essentially the system runs as normal.
October 13 1995, Draft 1
ROM emulators always provide a second access port to the emulated memory.
Via this second port, the contents of the memory can be read or written. This is
generally accomplished by a host computer to which the ROM emulator is attached.
The technique enables programs to be installed in system memory without the need
to prepare (often termed burn) new ROM devices. During the process of developing
and debugging software, modification of the program code frequently occurs.
Consequently, an updated program must be reinstalled in the target system’s
memory. The process of preparing new ROMs is slow, and a ROM emulator with a
fast computer link provides an alternative means of updating the system memory.
NetROM is a ROM emulator product provided by a Fusion29K tool
development partner. It can emulate 8–bit, 16–bit or 32–bit wide memory devices as
required. Depending on the width (number of bits) of the memory being emulated,
between one and four cables are required to connect the NetROM to the target system
memory. Up to 1M byte of memory can be emulated, depending on the pin layout of
the memory devices in use. The 1M byte limitation does not restrict NetROM’s use
for developing programs which are larger than 1M byte –– this is achieved via the
on–board UART. The UART is mapped into a location within the emulated memory
space. The 29K processor can exchange data with the UART. The host computer can
also access the UART and hence exchange information with the 29K processor.
The MiniMON29K bundle contains a driver for the NetROM UART (often
referred to as a virtual UART). This enables the TIP program (MonTIP) running on
the host computer to communicate with the DebugCore software running on the 29K
target system. The method enables programs to be downloaded into the target
system’s DRAM memory. Typically, OS–boot, the DebugCore and support drivers
are placed in emulation memory; then, via MonTIP support, programs are loaded and
executed out of the target system’s DRAM memory.
The NetROM equipment connects to the host computer via an Ethernet
connection. A NetROM can be used with an IBM PC compatible machine running
Windows; however, because of its Ethernet connection, it is much more
frequently used with networked Unix based systems. The Unix machine serving the
NetROM will have an entry in (typically) its /etc/bootptab file, specifying the IP and
Ethernet addresses allocated to the NetROM. Also specified in the bootptab file is the
path to the NetROM configuration file. An example bootptab file entry for a
CMU type server is shown below. The actual NetROM configuration file is
/tftpboot/netrom/startup2.bat. Note, for servers running in “secure” mode, the
/tftpboot directory must be at the root of the path to the NetROM configuration file.
The NetROM (client) Ethernet hardware address is given by the “:ha=” field.
netrom2:hd=/tftpboot/netrom:bf=startup2.bat:sm=255.255.255.255:
ht=1:ha=00402f008444:ip=163.181.22.60
When the host server connects to the NetROM client, the configuration file is
downloaded into the NetROM. A portion of the /tftpboot/netrom/startup2.bat is
shown below. The loadfile and loadpath variables are used to specify the default
image file to load into emulation memory. In the example below, the default image
file is /tftpboot/netrom/target/sa29040.hex. The key parameters in the configuration
file should be arranged to describe the type of memory being emulated. The example
below shows four 27c020 memory devices combined to produce a 32–bit memory
system. This will require four NetROM cables. It is possible (and common) to
emulate only an 8–bit wide memory system.
;part of the startup2.bat NetROM configuration file
setenv host     163.181.22.9    ;server IP address
setenv loadfile sa29040.hex     ;29K program (image file)
setenv loadpath /netrom/target  ;path to 29k program
setenv romtype  27c020          ;ROM type
setenv romcount 4               ;number of ROMs
setenv podorder 0:1:2:3         ;pod order
setenv wordsize 32              ;memory width
A NetROM can support TELNET and direct TCP connections simultaneously.
The MonTIP program forms a direct connection to the NetROM via the parameter
information located in the udi_soc file (see section 7.5.6). An example udi_soc entry
is shown below.
# udi_soc file entry to support NetROM
netrom2 AF_UNIX soc montip –t netrom –netaddr 163.181.22.60 –netport 1234
It is possible to have a TELNET session active with a NetROM while also
running MonTIP. Of course, the user controls the NetROM via a front–end debug
tool such as UDB which directs the operation of MonTIP via the UDI interface. From
a window running the TELNET command “telnet netrom2” (for example), the
“newimage” NetROM command can be used to download a file (usually the default)
into emulation memory. The sa29040.hex image file contains OS–boot, the
DebugCore, and support driver code for an SA29040 evaluation board. Once
installed it enables DebugCore messages to be exchanged between the 29K target
and the host computer running MonTIP.
A software reset can be performed by issuing a reset command from UDB.
Normally the DebugCore is successfully running and will perform the reset. Under
extreme conditions the DebugCore may no longer be in control of the 29K processor.
In this case a hardware reset can be performed. This requires that the 29K reset pin be
asserted. From the TELNET session this is accomplished via the “tgtreset”
command. The technique requires that a reset wire be used to connect the reset output
pin on the back of the NetROM (marked R) to a connection post on the target system.
The connection post must be wired to the processor reset pin. For this reason, it is best
to incorporate a reset connection post on each 29K target system for use by the
NetROM.
Once a NetROM has been added to a network, a TELNET connection can be
used to confirm its correct installation. After issuing a “newimage” command, and
October 13 1995, Draft 1
possibly a “tgtreset” command, the 29K target system is ready for operation. The
chosen debug tool (UDB, GDB, etc.) can then be invoked and used to examine,
modify and control the target 29K processor in the normal way. Once correct
installation has been confirmed, there is no need to first establish a TELNET
connection before initiating normal program debug. All that is necessary is to start
execution of the chosen debugger.
The NetROM driver (for the 29K side of the virtual UART) that is built into the
image file typically operates in poll–mode. This refers to the 29K processor on
occasion polling the UART to determine if it is receiving a message from MonTIP.
The image file can be built with an interrupt–mode driver. This enables MonTIP to
interrupt the 29K at any time (if interrupts are enabled) when it wishes to send a
message (such as halt) to the DebugCore. To enable operation of this technique, an
interrupt wire must be used to connect the interrupt output pin on the back of the
NetROM to an interrupt input post on the 29K system. Once again, the post should be
incorporated into any design which wishes to make use of a NetROM.
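Conceptually, the difference between the two driver modes is small; the sketch below is purely illustrative (every name in it is hypothetical, and the real virtual UART driver supplied in the MiniMON29K bundle is not reproduced here).
/* Poll mode: the DebugCore checks the emulated-memory mailbox from time
   to time.  Interrupt mode: the same work is done in an interrupt handler,
   so MonTIP can halt the target whenever interrupts are enabled.
   All names below are hypothetical.                                       */
extern int  netrom_msg_pending(void);       /* hypothetical status check    */
extern void netrom_process_message(void);   /* hypothetical message handler */

void netrom_poll(void)                      /* called occasionally (poll mode) */
{
    if (netrom_msg_pending())
        netrom_process_message();
}

void netrom_interrupt_handler(void)         /* installed for interrupt mode */
{
    netrom_process_message();
}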
7.9.2 HP16500B Logic Analyzer
Network Installation
The use of logic analyzers for tracing program execution was previously
presented in section 7.8. This section briefly deals with the details of configuring the
logic analyzer’s operation for use in the Fusion3D environment. A high speed
connection to the analyzer is achieved via the optional Ethernet link. This requires
that the analyzer be allocated a unique IP address. Using the analyzer’s
communications set–up window, the IP address is recorded for future use. With Unix
networks, the IP address and chosen name are entered into the network database file
/etc/hosts. The following example allocates an IP address for an analyzer called
“hpla”.
#entries in /etc/hosts file allocating IP addresses
163.181.22.117 hpla	# logic analyzer
163.181.22.121 ginger	# X–terminal
The analyzer connection can be confirmed by establishing a telnet connection.
This is accomplished with a “telnet hpla 5025” command. Port number 5025 enables
access to the analyzer command parser. Commands can then be directly issued to the
analyzer. One very useful command, “xwin on, ’163.181.22.121:0.0’”, establishes a
remote window interface to the analyzer. The example command shown causes an
analyzer front panel interface to be presented on the display determined by the IP
address 163.181.22.121. Checking the example /etc/hosts file, it appears to be an
X–terminal known to the network by the name ginger. It is important that the X server
allow the analyzer to make connection to the server. The “xhost + hpla” command
can be used to add hpla to the list of machines that are allowed to make connection to
the X server. To obtain the name of your terminal’s display, print the environment
variable DISPLAY as shown below.
echo $DISPLAY		#Unix shell command
ginger:0.0		#response
It is important for the successful operation of MonTIP that the environment
variable DISPLAY be correctly initialized. Note that some HP workstations set the
variable to the value “local:0.0”; this does not create any difficulty for MonTIP.
UDI Installation
The udi_soc file (for Unix based systems) must contain an entry for
establishing, via UDI, the MonTIP to analyzer connection. The MonTIP option “–la
name” is provided for this purpose. The example below shows a udi_soc entry for a
session identified by the name “trace”. Note that the udi_soc file format was
described in detail in section 7.5.6. If a logic analyzer were being used alone, the
example udi_soc entry would be adequate. However, a NetROM is typically
combined with an analyzer. In this case the two entries shown below would be
combined to produce a single entry with a unique session identifier.
# udi_soc file entries to support a logic analyzer and a NetROM
# (the first field is the UDI session_id)
trace   AF_UNIX soc montip –la hpla
rom     AF_UNIX soc montip –t netrom –netaddr 163.181.22.60 –netport 1234
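As a sketch only, a combined entry might give both sets of MonTIP options under a
single session name; the name “latrace” is purely illustrative.

# hypothetical combined entry: logic analyzer plus NetROM
latrace AF_UNIX soc montip –la hpla –t netrom –netaddr 163.181.22.60 –netport 1234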
When using the UDB source level debugger to control a logic analyzer, a
mktarget command must be placed in the udb.rc start–up command file. As
explained in section 7.7, a GIO process, controlled by UDB, uses the assigned
mktarget parameters to connect to a 29K target (in this case via MonTIP). An
example udb.rc entry is shown below.
# udb.rc, UDB startup command file
#driver args (GIO ID, GIO executable, exec. flags, udi_soc session ID)
#mktarget name id type  driver (args....)
mktarget  LA   1  29040 dr_gio 0 ios_udi –be trace
#                                executable   UDI session_id
Note that normally the GIO and UDB processes determine the endian–ness of the 29K
target by examining the processor’s CFG special register. When an analyzer is used
alone, there is no connection to the 29K processor and the CFG register can not be
accessed. This necessitates that the mktarget command specify the target endian.
The “–be” switch is used in the example to select big–endian operation. The “–le”
switch is available for selecting little–endian. The following section 7.9.3 describes
how user defined buttons can be used to issue mktarget commands.
Accessing the Analyzer File System
It is very convenient to be able to drive the logic analyzer remotely from, say, the
X–terminal on your desk. Note that a colour monitor is required to achieve full
control of the analyzer. As described above, remote control of the analyzer is enabled
via the “xwin on” command. When remote control of the analyzer is no longer
required, the command “xwin off”, entered via the telnet connection to the analyzer,
discontinues the remote display. Connection to the analyzer command parser is
broken when the TELNET session is terminated.
Only one user can be in control of the analyzer at any time. This means the
analyzer can not be driven from the front panel when a remote window is active.
When MonTIP controls the analyzer, it requests a remote window be presented on the
MonTIP host computer (actually, the DISPLAY variable identifies the screen).
Consequently, it is not possible for another user to establish a second remote window
connection. However, it is possible to simultaneously have an FTP connection active
when remotely controlling the analyzer. The example command sequence below
demonstrates how this is achieved.
1% ftp hpla                       #Unix shell command
Connected to hpla.
220 HP16500B V01.00 FUSION FTP server (Version 3.3) ready.
Name (hpla:danm): data
230 User DATA logged in.
ftp> cd system/disk/hard/amd/danm
200 Remote Directory changed to ”/system/disk/hard/amd/danm”.
ftp>
When entering a login name, the identifier “data” was used in the above
example. This enables read access to files located on the analyzer disk system.
Entering the identifier “control” enables read–write access to the file system.
However, logging in as “control” is not permitted if another user is identified as
already controlling the analyzer. Files can be transferred from/to the analyzer using
the FTP commands get/put; remember you may have to first use the binary command
to enable transfer of binary data files.
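For example, a transfer session might continue along the following lines; the file
names are purely illustrative, and the server responses are omitted.

ftp> binary                 #switch to binary mode before transferring data files
ftp> get trace.dat          #copy a file from the analyzer disk
ftp> put setup.cfg          #copy a file to the analyzer (requires a “control” login)
ftp> quit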
Triggering the Analyzer for Trace Capture
The HP16500B logic analyzer is equipped with a very sophisticated triggering
capability. Hence, debuggers controlling the logic analyzer tend to rely on the
analyzer’s triggering logic. When using the PI–Am29040 preprocessor, the
POD_040 configuration file prepares the analyzer for triggering on access to the
memory location described by analyzer trigger term A. This may, or may not, be
adequate for your triggering requirements. All changes to trigger logic must be
entered using the logic analyzer front panel display (remotely if desired). If using the
POD_040 file, all that is necessary is to supply the trigger address in the ADDR field
of term A. Of course the address must be entered in hexadecimal format unless a
symbol file has been loaded into the analyzer.
When on–chip caches are used, instruction and data accesses may not always
appear on the processor bus. This complicates the task of triggering the analyzer. It is
also not possible to simply use the ADDR field of term A when a microcontroller is
being used and hence the full 32–bit address value is not observable, even if an
off–chip memory access is performed. (Microcontrollers divide the address space
into regions, only the lower address bits for any particular region may appear on the
microcontroller address bus.) Dealing with these problems can require the user to be
creative when developing alternative triggering logic.
With processors which have on–chip breakpoint control registers, a SYNC
pulse can be generated when a specified data or instruction access occurs. The
analyzer trigger logic can be configured to trigger on the occurrence of the SYNC
pulse. Alternatively, for processors without breakpoint control registers, a simple
arrangement can be used to trigger the analyzer when any execution breakpoint is
taken: When a breakpoint is taken, the illegal opcode trap is taken (trap number zero).
The analyzer should be set to trigger on a read of the first entry in the vector table. The
address is specified by the contents of special register VAB (Vector Area Base).
For convenience, UDB provides a remote method of entering data into the logic
analyzer trigger setup. Using the “trigterm” command shown below, trigger patterns
can be specified for different labels and patterns.
trigterm <term> <label> {<pattern> | <address>}
Normally, one simply specifies a <pattern> for a label. The format of the pattern
is assumed to be hexadecimal unless a base is explicitly specified. However, in the
case where the <label> is ADDR, then an <address> should be provided instead; and
UDB will convert the address, which may be specified as a symbol, into a
hexadecimal string of eight characters.
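For illustration only (term A and the ADDR label follow the POD_040 description
above, and the symbol main is assumed to exist in the loaded program), trigger
patterns might be entered as follows:

trigterm A ADDR main          #set term A to the address of the symbol main
trigterm A ADDR 40000100      #or give the address directly as a hexadecimal pattern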
To further simplify issuing “trigterm” commands, a Trig button has been added
to the View, Var and Dasm frames. In the View frame, clicking on a line and then
clicking the Trig button will set term A of the ADDR column to the address of the
source line the cursor is currently on. Clicking on a variable and then shift–clicking
the Trig button will set term A of the ADDR column to the address of the variable the
cursor is currently on. This only works in the case where the variable is allocated to a
memory location and not held in an on–chip register.
In the Dasm frame, clicking on a line and then clicking the Trig button will set
term A of the ADDR column to the address of the disassembly line the cursor is
currently on.
Searching Through Trace Data
The HP16500B logic analyzer provides, via the front panel display, a means of
searching for patterns in the captured trace data. However, without symbolic address
support, and given the fact that raw trace data is not limited to just the execution
stream, it is often more convenient to search for patterns in the processed trace data.
UDB provides support for trace searching with the “trsearchnext” and
“trsearchprev” commands. The command format is shown below:
trsearchnext [<label> {<pattern> | <address>}]
trsearchprev [<label> {<pattern> | <address>}]
Normally, one specifies a search <pattern> for a selected <label>. However, in
the case where the <label> is ADDR, then an <address> should be provided instead.
In such case, UDB will convert the address, which may be specified as a symbol, into
a hexadecimal string of eight characters.
After a <label> and <pattern> have been specified, UDB remembers them to
allow for further searching without having to specify the <label> and <pattern>
again. In particular, the Next and Prev keys in the trace frame have been overloaded,
such that Ctrl–Shift–Clicking them will issue these commands with no parameters.
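As a sketch (the symbol main is an assumption), a search session might proceed as
follows:

trsearchnext ADDR main        #search forward for the address of main
trsearchnext                  #repeat the search further down the trace
trsearchprev                  #search back up the trace using the remembered pattern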
UDB supports the binding of buttons to macro commands. This is a convenient
means of issuing “trsearchnext” commands. The following udb.rc command
sequence assigns buttons to the macro table associated with the Trace frame. Note
that user programmed buttons should be restricted to the left hand side of the window.
The example command creates two buttons. The next button can be used initiate a
“trsearchnext” command. Because the command–string does not finish with a “\r”
character, the user can enter the <pattern> from the command line interface.
# macro table            button   position  command
# ––––––––––––––––––     ––––––   ––––––––  ––––––––––––––––– . . .
macro m=mtrace –f        “next”   {lb1}     “{com}trsearchnext ADDR ”
macro m=mtrace –f        “prev”   {lb2}     “{com}trsearchprev ADDR ”
7.9.3 Selecting Trace Signals
In section 7.8 under the headings MonTIP Commands and UDB Commands,
techniques for formatting the trace display were presented. Processor signals, such as those of the address bus, are grouped together and assigned labels. The user can
always rely on the following four labels being available for display: ADDR, DATA,
LINE, TYPE. Different 29K family members, and different system configurations,
will provide a number of other useful labels, such as R/_W. Given the limited size of
the trace display, it is necessary to limit the number of trace labels.
Depending on the source level debugger selected, or if MonTIP is being used to
format the trace display, there may be synonyms for the main trace labels. For example: ADDR is also known as SYM by UDB, and as SYMADDR by MonTIP. These
alternatives to ADDR enable the traced address values to be presented symbolically,
even if the logic analyzer is configured to display them in, say, hexadecimal. An alternative implementation would have been to support a format parameter for controlling the display of selected labels. But, so far, this has not been the route taken by
source level debugger implementors. Note that if the SYM label is selected, it is necessary to load a UDB symbol file. This file is produced by the mksym utility.
The DASM column is a synonym for DATA presented in disassembly format.
The trace processing algorithms used with the Am29040 processor place
XXXXXXXX in the DATA column when an instruction is supplied by the on–chip
cache. However, the Xs will be replaced with the actual instruction if an executable
program has been loaded. For replacement to be successful, the Am29040 target
processor must be executing with physical addressing or with a one–to–one virtual to
physical address translation scheme. This is because the Am29040 slave processor
produces physical address values. The virtual addresses inherent in the loaded
program must correspond to the physical addresses appearing on the processor
address bus. The Am29460 slave processor produces virtual addresses, but this does
not entirely solve the problems created by the use of address translation. I will say
more about this in section 7.9.5.
As explained in section 7.8, the trace listing frame is formatted with the “trcol”
command. This is accomplished by entering a “trcol” command sequence, such as
the following example, into the start–up udb.rc file. The “–w” parameter specifies the
maximum display width (in characters) for a label. The udb.rc file is accessed from
the current working directory or from your home directory.
trcol –d –w 8    ADDR     #general trace labels
trcol –d –w 15   SYM      #symbolic address
trcol –d –w 8    DATA     #data bus value
trcol –d –w 27   DASM     #disassembled DATA label
trcol –d –w 6    TYPE     #type of operation
trcol –d –w 6    STAT_    #additional Am29040 trace label
Displaying a large number of labels will require a wide trace frame. It is useful to
initially define a large View frame, which can later be switched to displaying trace.
When UDB is invoked, a fixed–size null frame is randomly positioned on the display. Using the following udb.rc command sequence, the null frame can be replaced
with a user specified window size positioned at the top left hand corner of the screen.
After creating the View frame, the null window is deleted.
#create  X  Y  –g  rows  columns  type    udb.rc command
wcreate  0  0  –g  24    90       view    #create new View frame
wdelete  0                                #delete original null window
UDB supports the binding of buttons to macro commands. This is a convenient
means of issuing mktarget commands, rather than hard–wiring them into udb.rc.
This simplifies the task of selecting from a number of different mktarget options.
Given that the View frame is established during UDB start–up (as described above),
the following udb.rc commands assign buttons to the macro table associated with the
View frame. Note that user programmed buttons should be restricted to the left hand
side of the window. The example command sequence creates two buttons. The button
marked “LA” can be used to establish a connection to a logic analyzer. The “view –u”
command causes a source frame to be invoked. If the 29K program counter is
currently not in source but in disassembly, the disassembly view will be invoked.
# macro table           button  position  command
# –––––––––––––––––     ––––––  ––––––––  ––––––––––––––––– . . .
macro m=mview –f        “iss”   {lb1}     “mktarget ISS 1 29040 dr_gio 0 ios_udi simulator; view –u\r”
macro m=mview –f        “LA”    {lb2}     “mktarget LA 1 29040 dr_gio 0 ios_udi –be trace; view –u\r”
7.9.4 Corelis PI–Am29040 Preprocessor
A logic analyzer preprocessor simplifies the connection of the analyzer to the
target system. The principles behind its operation were discussed in section 7.8. This
section briefly deals with the operating details encountered with the Am29040 preprocessor. To prepare the preprocessor based system, a number of steps must be taken:
1. The PI–Am29040 Preprocessor hardware unit replaces the Am29040
processor in the target system. The preprocessor contains two Am29040
processors, one operating in master mode, the other in slave mode. Earlier
versions of the preprocessor required that certain pins such as MEMCLK
(H–14) on the slave processor be removed. Later versions do not require pin
removal. There is a jumper option for removing the slave MEMCLK signal if
it is configured as an output. If MEMCLK is configured as an input, the slave
and master MEMCLKs must be tied together. Because of the high speed
operation of Am29040 based systems, the use of PGA socket extenders
should be limited as they add to signal propagation delays. It is often
desirable to add extenders to the preprocessor connection pins to protect them
from damage. If a pin gets broken, it is less expensive to replace a socket than
to replace the preprocessor. Zero ohm resistors have been incorporated in
series with a number of signal pins, such as MEMCLK and INCLK.
Impedance matching, and hence better signal conditioning, can be achieved
by replacing these resistors with an appropriate value resistor.
2. If HP16550A logic analyzer cards are being used with the HP16500B system,
then two cards should be wired together in accordance with the HP
Installation Manual. Two analyzer cards provide a total of 12 trace pods.
Assuming the cards are located in slots D and E, pod E1 (slot E – master)
should be connected to position J1 on the preprocessor. Pod E2 to position J2,
and so on. Pods D1–D3 should be connected to J7–J9 (see Table 7-5). The
analyzer configuration file POD_040._D will format the analyzer cards for
this configuration. (The ._D file name postfix is because the master analyzer
card is located in card cage slot D.) The POD_040 configuration file is
available from AMD or Corelis. It is important to obtain a copy of the
configuration file, as it is much too time consuming to reassign the
pod–to–label mapping by hand.
Table 7-5. PI–Am29040 Logic Analyzer Pod Assignment

PI–Am29040               HP16550A       HP16555A
Preprocessor connector   Analyzer pod   Analyzer pod
J1                       master 1       expander 1
J2                       master 2       expander 2
J3                       master 3       expander 3
J4                       master 4       expander 4
J5                       master 5       master 1
J6                       master 6       master 2
J7                       expander 1     (not used)
J8                       expander 2     master 3
J9                       expander 3     master 4
(The pods for capturing clock signals are those on the master analyzer card.)
3. If the more expensive HP16555A logic analyzer cards are selected, two cards
are still required. Once again they should be wired together in accordance
with the HP Installation Manual. Note that even if a pair of cards are
purchased together, they may not be interconnected in accordance with HP’s
manual recommendations. Two HP16555A analyzer cards provide a total of
6 trace pods. Assuming the cards are located in slots A and B, pod B1 (slot B –
expander) should be connected to position J1 on the preprocessor. Pod B2 to
position J2, and so on. Pods A1–A2 should be connected to J5–J6, and pods
A3–A4 to connections J8–J9 (skipping J7), see Table 7-5. The analyzer
configuration file POD_040._A will format the analyzer cards for this
configuration. Note that the configuration file required for HP16555A cards,
although it has the same name, is not the same file required to configure HP16550A
cards. The reason B–pods are allocated before the A–pods is because the card
in the B slot is wired as an expander card and all clock signals must be
acquired by the master card in slot A. The configuration file specifies that
trace signals are captured on MEMCLK signal edges, and MEMCLK is
provided on connector J5.
4. If the 29K target system is operating with 1x clocking, then the master clock
should be configured to acquire trace data on the rising edge of MEMCLK. If
2x Scalable Clocking is being used, the master clock should acquire trace
data on both the rising and falling edge of MEMCLK. The appropriate
selection can be made from the analyzer control panel (remotely if desired).
With newer versions of the preprocessor (those that provide access to the
processor’s DIV2 pin), the MonTIP software warns the user that the wrong
edge–selection has been made.
5. There is no need to install the Corelis preprocessor support software supplied
with the PI–Am29040. However, if it is installed then it will be possible to
present disassembled instructions on the analyzer display. Using the Corelis
software, trace label DATA can be displayed in Invasm (inverse assembler)
format without resulting in any conflicts with MonTIP’s access to DATA.
There is no advantage to using the disassembly software, as the analyzer
display shows instructions which are fetched but not necessarily executed.
6. There are a number of limitations imposed by the Am29040 Traceable Cache
architecture. These were previously discussed in section 7.8 under the
heading Processing Trace Information. Very briefly, the complete instruction
flow is reported: labels DATA, ADDR, R/_W and I/_D have their values
manipulated to report the instruction which was executed during the traced
cycle.
7. The trace data processing algorithms built into MonTIP need to know the
endian–ness of the 29K target processor. When connection to the analyzer is
established, a window displaying the analyzer control panel will appear.
MonTIP prints a message in this window indicating the endian–ness of the
target processor. If the endian–ness is unknown, MonTIP will continue
operating; but sub word–sized data accesses will only be partially processed.
To fully process data accesses, the “Analyzer 1:Name” field provided under
the logic analyzer “Configuration” window should be set to AM29040B or
AM29040L, respectively for big or little endian operation.
8. The MonTIP algorithms are currently restricted to operating with systems
which fetch instructions from 32–bit memory. This does not necessitate that
32–bit ROM emulation be used with NetROM. If application programs are
loaded and execute from 32–bit memory, they can be successfully traced.
However, if interrupt handlers or other support code is run from 8–bit
memory, tracing will not be possible.
9. Can’t reduce capture rules.
7.9.5 Corelis PI–Am29460 Preprocessor
A logic analyzer preprocessor simplifies the connection of the analyzer to the
target system. The principles behind its operation were discussed in section 7.8. This
section briefly deals with the operating details encountered with the Am29460
preprocessor. For those simply interesting in getting their preprocessor working, and
not at this stage needing to understand the background behind its operation, proceed
to the section with the subheading PI–Am29460 Setup and Limitation.
The Traceable Cache information provided by the Am29040 slave processor is
synchronous with program execution. If this approach were taken with the Am29460
microcontroller, the superscalar execution capability would necessitate very high
speed trace reporting. To reduce the slave processor’s information bandwidth
requirements, the Am29460 does not synchronize trace reporting with program
execution. The Am29460 trace information is compressed, relative to the Am29040
trace data, and held in an output queue before being transferred off–chip.
PI–Am29460 Preprocessor Operation
The main processor operation is driven by the PCI INCLK signal pin. However,
for data capturing purposes, the logic analyzer master clock is the slave trace clock
(TRACECLK). It runs at half the internal clock speed. Note that the internal clock
(not available on a pin) runs at 2x, 3x or 4x the INCLK speed. With 2x clocking and
single–cycle MCU transfer rates, the MCU access speed would equal the
TRACECLK speed. With 2x, or higher scaling ratios, the PCI and MCU data transfer
rates can not exceed the frequency of TRACECLK. This enables TRACECLK to be
used as the master clock by the logic analyzer.
The analyzer captures signal values when the master clock is active. There also
has to be at least one of the following conditions: valid trace information, a valid PCI
access, a valid MCU access. PCI accesses are first captured by the analyzer slave
clock. The PCI INCLK is used to capture logic analyzer slave information. Analyzer
slave signals are transferred into the logic analyzer trace buffer on the next master
clock signal. If more than one slave value is captured before the next master clock,
then only the most recent slave values are stored in the analyzer trace buffer. For this
reason it is important that the master clock operate at a higher frequency than the
slave clock.
The HP16500B analyzer only supports one slave clock. For this reason, MCU
accesses are latched and held until the next master clock edge, during which any PCI
accesses captured by the slave clock are recorded by the analyzer (see Figure 7-18). It is
impossible for two MCU accesses to occur before the next TRACECLK, even if
another agent uses the PCI to access an MCU. Anyway, the trace processing
algorithms are only interested in MCU accesses initiated by the processor, not the
PCI. The preprocessor and its supporting software are not intended to form a general
purpose PCI probe.
[Diagram: the delayed TRACECLK provides the logic analyzer master clock; INCLK
(and other conditions) provides the slave clock; MCU latched signals and PCI signals
(with _STROBE, CAS and RAS shown) are entered into the logic analyzer trace buffer.]
Figure 7-18. PI–Am29460 Preprocessor Trace Capture Scheme
The HP logic analyzer specification states that there must be 4 ns separating the
active edge of the slave clock and the active edge of the master clock. Master clock to
slave clock separation is specified as 0 ns. The delay is required to ensure that the
slave information is valid before it is entered into the logic analyzer trace buffer at the
active master clock signal. Signals captured directly by the master clock have a 4 ns
set–up time and a 0 ns hold time. Consequently, the active edge of the master clock
must not be allowed to arrive within 4 ns of an active slave clock’s arrival. The
preprocessor achieves this by delaying the TRACECLK signal used to generate the
analyzer master clock. An example timing sequence is presented in Figure 7-19.
[Timing diagram: INCLK, internal CLK (2x), MCU CLK, TRACECLK and the LA master
sample clock, with an example MCU access (preprocessor latched) and an example
PCI access (analyzer latched).]
Figure 7-19. PI–Am29460 Preprocessor Trace Capture Timing
There is an additional reason for delaying the TRACECLK; the slave processor
output signals, including TRACECLK, are actively driven at the same time. Consequently, slave signals which are to be sampled using the TRACECLK may be changing at the same time as TRACECLK.
Using MCU latching together with PCI slave clocking also gives better utilization
of the analyzer trace depth. When a trace buffer entry is recorded during an active
TRACECLK edge, trace information as well as PCI and MCU information is captured
in a single trace line. This results in more efficient use of the trace buffer than if each
of these three asynchronous events were separately captured by the logic analyzer.
RLE Data Pairs
As with the Am29040 processor, the second Am29460 slave processor is
entirely responsible for providing the data required to reconstruct the instruction
execution stream. The second processor, the slave, provides three types of trace data:
information about MCU accesses, information about PCI accesses, and instruction
execution flow. Unlike the Am29040, the slave does not provide any information
about data cache hit activity. Data about instruction flow is provided in the form of
address–length pairs, known as Run Length Encoding (RLE).
The Am29040 slave processor does not need to provide access type
information about data accesses, as they can be fully observed by monitoring the
main processor’s data busses. However, many of the bus signals available with a
2–bus processor are not available with a 29K microcontroller. For example, the I/_D
pin is not available. This means when a memory read access is performed, it is not
possible to determine if data or an instruction is being fetched. With the Am29460
processor, the slave provides this type of information. This explains why the slave
provides trace data for both MCU and PCI accesses performed by the master
processor.
Before describing the RLE technique in more detail, we must first remind
ourselves of the speculative execution nature of the Am29460 processor. As
explained in section 1.7, instructions are fetched and speculatively executed.
However, instructions are not truly considered to have executed until they have been
retired. This introduces the notion of a Retire Program Counter (R–PC). At any time,
instructions whose addresses are ahead of the R–PC may be held in the reorder buffer
waiting for potential retirement. Special register PC1 contains the address of the
instruction currently in execution. Because the processor supports precise interrupts,
the PC1 register can never get ahead of the R–PC address. When a trap or interrupt is
taken, the R–PC value will appear in register PC1 or PC0 (decode address)
depending on the stage at which the processor pipeline is interrupted.
Each RLE (TRACEADDR, TRACERUN) data pair specifies that
TRACERUN instructions, starting from the current R–PC, have been retired, and
subsequent retirement is to continue from an R–PC value of TRACEADDR. A
TRACERUN value of zero is permitted; it is used to redirect trace flow without
recording any instruction execution (retirement). In such case a TRACEADDR
value change accompanies the TRACERUN zero value. A value of zero is also used
to indicate that no instructions are available for retiring. In such case the
TRACEADDR value does not change. An example RLE trace sequence is shown on
Figure 7-20.
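To make the encoding concrete, the following minimal sketch (not AMD–supplied
code) decodes a short RLE sequence into a retired–instruction address stream. The
pair values and the starting R–PC are invented for illustration, and 4–byte 29K
instructions are assumed.

/* Illustrative RLE decoding sketch; the data values are hypothetical. */
#include <stdio.h>

struct rle_pair { unsigned long traceaddr; unsigned int tracerun; };

int main(void)
{
    struct rle_pair q[] = {
        { 0x40000014UL, 5 },  /* 5 instructions retired from the current R-PC,
                                 retirement then continues at 0x40000014 */
        { 0x40000014UL, 0 },  /* nothing retired this TRACECLK, R-PC unchanged */
        { 0x40000100UL, 2 },  /* 2 more retired, a branch then redirects the
                                 R-PC to 0x40000100 */
    };
    unsigned long rpc = 0x40000000UL;  /* assumed starting R-PC */
    size_t i;
    unsigned int n;

    for (i = 0; i < sizeof q / sizeof q[0]; i++) {
        for (n = 0; n < q[i].tracerun; n++)
            printf("retired instruction at 0x%08lx\n", rpc + 4UL * n);
        rpc = q[i].traceaddr;          /* retirement continues from TRACEADDR */
    }
    return 0;
}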
An RLE data pair can be output by the slave processor during the rising edge of
each TRACECLK. The RLE data is not provided directly by the reorder buffer, but
by a buffer queue which receives its input from the reorder buffer, see Figure 7-21.
This is necessary as very short run lengths produce RLE data at a rate faster than the
TRACECLK can report them. Using a queue reduces the need to stop instruction
retirement until the RLE data has been presented on the slave output pins. However,
the queue is limited in size and consequently, under rare circumstances, it can
become full. This results in the master processor postponing instruction retirement
until a queue entry is available. Without this throttling back approach, uninterrupted
reporting of instruction flow could not be guaranteed.
[Diagram: program flow shown as consecutive groups of retired instructions. Each
address–length pair (TRACEADDR, TRACERUN) marks a consecutive group of retired
instructions of length TRACERUN; sequential instruction retirement is redirected at a
branch instruction (and its delay–slot) to the TRACEADDR target instruction.]
Figure 7-20. Slave Data Supporting Am29460 Traceable Cache
[Diagram: reorder buffer retirement feeds RLE entries into a Run–Length Encoding
output queue; the slave processor output pins (TRACERUN, TRACEADDR) are clocked
at TRACECLK speed.]
Figure 7-21. RLE Output Queue From Reorder Buffer
Given that the TRACECLK runs at half the internal speed of the processor, as
many as 8 instructions could be retired at each TRACECLK interval. However, as a
means of reducing RLE queue entries, an entry is not placed in the queue if the current
sequence of instructions being retired does not contain a branch instruction. In such
case, the TRACERUN is allowed to accumulate to a maximum of 31, the largest value
which can be reported on the 5–pins allocated to TRACERUN. Research (see section
1.7.4, [Johnson 1991]) has shown that instruction sequences typically contain five or
six instructions before branching. This would indicate that TRACERUN values near
31 will not be a frequent occurrence, and values in the range five or six are to be
expected.
PI–Am29460 Setup and Limitations
As with the PI–Am29040 preprocessor, a number of restrictions and
preparation steps apply when using the PI–Am29460 preprocessor:
1. The trace processing algorithm places the value X_SLAVE_ in the DATA
column for all instruction accesses. It is necessary to have access to the COFF
file(s) for the loaded program to ensure the X_SLAVE_ value is replaced with
the actual 29K instruction executed. Debuggers such as UDB can retain
multiple COFF file images at the same time. This enables simultaneous
tracing of application space and operating system space (say, interrupt
handlers).
2. Considering MCU performed memory accesses, only data accesses are
reported. Data transfer is shown at the time it appears on the system busses;
which, for data stores, may be several cycles after the corresponding STORE
instruction.
Two–bus 29K processors have OPT pins and lower (A1–A0) address pins
which indicate the size and alignment of the object currently being accessed.
The Am29460 microcontroller does not have these pins. Consequently it is
not possible to determine the alignment and size for reads of sub word–sized
objects. Fortunately, the microcontroller has four byte enable pins which are
used for data writes. This enables the alignment and size of objects which are
written to be determined, and improves the trace reporting for data writes.
Only MCU accesses performed on behalf of the Am29460 processor are
reported. Accesses initiated by another processor via the PCI interface will
not appear in the trace.
3. Considering accesses to the PCI bus, as with MCU accesses, only data
transfers are reported. By monitoring the PCI command provided during the
address–phase of a PCI access, it is possible to determine the object size for
sub word–sized objects.
4. The PI–Am29460 preprocessor does not reconstruct 32–bit MCU addresses.
This can complicate logic analyzer triggering. One solution is to use the
on–chip breakpoint control registers to generate a _SYNC pulse which is
then used to trigger the analyzer. The UDB debugger has a convenient user
interface for specifying breakpoint control register operation. Unfortunately,
however, breakpoint control registers are a limited resource, and they are also
used to control program execution.
5. Unlike a scalar processor, the Am29460’s processed trace lines indicate
multiple instruction execution per trace line. The number of instructions
reported executed by a trace line has less to do with the instruction retirement
rate and more to do with the run–length between branch instructions. When
an MCU or PCI access occurs at the same time as RLE reporting, the
processed trace indicates all activity on the same processed trace line.
6. The trace data processing algorithms built into MonTIP need to know the
endian–ness of the 29K target processor. When connection to the analyzer is
established, a window displaying the analyzer control panel will appear.
MonTIP prints a message in this window indicating the endian–ness of the
target processor. If the endian–ness is unknown, MonTIP will continue
operating; but sub word–sized data accesses will only be partially processed.
To fully process data accesses the “Analyzer 1:Name” field provided under
the logic analyzer “Configuration” window should be set to AM29460B or
AM29460L, respectively for big or little endian operation.
7. If HP16550A logic analyzer cards are being used with the HP16500B system,
then two cards should be wired together in accordance with the HP
Installation Manual. Two analyzer cards provide a total of 12 trace pods. If
the more expensive HP16555A logic analyzer cards are selected, three cards
are required. Once again they should be wired together in accordance with the
HP Installation Manual. Note that even if cards are purchased together, they
may not be interconnected in accordance with HP’s manual
recommendations.
Table 7-6 shows the assignment of analyzer pods to preprocessor connectors.
The analyzer configuration file POD_460._A will format the analyzer cards
for this configuration. (The ._A file name postfix is because the master
analyzer card is located in card cage slot A.) The POD_460 configuration file
is available from AMD or Corelis. It is important to obtain a copy of the
configuration file, as it is much too time consuming to reassign the
pod–to–label mapping by hand.
Note that the configuration file required for HP16555A cards, although the
same name, is not the same file required to configure HP16550A cards.
8. A logic analyzer, controlled by UDB, may be attached to a 29K target system
which is not under UDB control. This is the case where a logic analyzer is
used alone, without the utilization of, say, a NetROM. It is also the case when
previously captured trace data is reexamined. To enable 32–bit address
reconstruction, the algorithms built into MonTIP need to know the
processor’s bank profile register (BPR) settings. MonTIP normally
accomplishes this by accessing the DebugCore each time trace data is
fetched. When no DebugCore is present, MonTIP is provided the BPR values
from the udb.rc initialization file.
Table 7-6. PI–Am29460 Logic Analyzer Pod Assignment

PI–Am29460               HP16550A       HP16555A
Preprocessor connector   Analyzer pod   Analyzer pod
J1                       master 1       master 1
J2                       master 2       master 2
J3                       master 3       master 3
J4                       master 4       master 4
J5                       master 5       first expander 1
J6                       master 6       first expander 2
J7                       expander 1     first expander 3
J8                       expander 2     first expander 4
J9                       expander 3     second expander 1
J10                      expander 4     second expander 2
J11                      expander 5     second expander 3
J12                      expander 6     second expander 4
(The pods for capturing clock signals are those on the master analyzer card.)
The BPR registers are mapped into the processor’s I/O address space. The
UDB “outl <address> <value>” command can be used to write a 32–bit value
to an I/O location. Note that the command can also be used to write to a
memory location, but this requires that an “<ESC> ioctl space d” command
first be used to switch output to memory space (“d”) rather than the default
I/O space (“i”). After connection to the analyzer has been established, “outl”
commands can be used to set BPR values for use by MonTIP. This is best done
by binding a user defined button to the Trace frame. When MonTIP has no
connection to a 29K target, it does not try to set the real BPR registers, but
retains the values for future use. The following udb.rc command sequence
defines a BPR button for an example register initialization. Before analyzer
data is fetched, the BPR button should first be clicked. This enables the
algorithms to correctly build address values.
#ioctl space i
macro m=mtrace –f “BPR” {lb3} “{.com}outl ffffff80 00001003H;
outl ffffffa0 800c6005H; outl ffffffb0 900c6105H; outl ffffffc0
a00c6205H; outl ffffffd0 b00c6305H;\r”
9. The TRACEADDR addresses provided by the slave processor are virtual ––
assuming address translation is in use. However, the address values observed
for MCU and PCI accesses are always physical. This creates difficulty when
looking–up the MCU and PCI addresses in the loaded COFF file. A program
must run with physical addressing, or with one–to–one virtual to physical
address translation, if MCU and PCI address symbols are to be correctly
reported.
10. The PI–Am29460 preprocessor contains additional analyzer connections
(J13–J16). These are for use by hardware development engineers. They are
not required for program tracing. They are provided to enable capturing of
unlatched processor signals. A number of the connectors used for software
tracing latch their signal values, and this disrupts the analyzer’s visibility of
timing relationships. A hardware engineer can use the alternative connectors
to view unlatched versions of the main processor signal pins.
Chapter 8
Selecting a Processor
This chapter helps with the sometimes difficult task of processor selection.
Processors are considered in terms of their performance and software programming
requirements. There is little attempt to review, say, development tools or bus timing
for alternative processors. Consequently the chapter is of most interest to software
engineers and computer scientists. In undertaking comparative processor evaluation,
the often confusing task of performance benchmarking is studied for dissimilar
processors. This chapter will enable you to better understand the methodologies used
by manufacturers trying to win the benchmark race, and presents an approach which
will enable you to more accurately determine a processor’s performance for your
own application.
Processor execution speeds are restricted to integer performance evaluation. No
attention is given to floating–point performance. This should not be disappointing, as
the selection of a processor is greatly limited if floating–point performance is critical.
Most manufacturers have processors (such as the Am29050) which are specifically
intended for floating–point use. Additionally, the large majority of systems are not
concerned with floating–point operations.
The well known Stanford benchmark developed at Stanford University is used
for performance comparisons. It is easy to criticize the choice of a synthetic
benchmark. However, it is difficult to come up with a more acceptable
alternative which everyone will agree upon. At least the Stanford benchmark is more
revealing than the overused (and often unreliable) Dhrystone benchmark. Separate
results for six routines taken from the integer part of the Stanford code
will be shown. The six were chosen because of their diversity in function and
similarity in execution times. This similarity made for clearer scaling and hence
easier comparison of the results.
The Stanford benchmark is relatively small and can have high instruction cache
hit ratios. It also does not exhibit the large data movement activities typical of
communications applications. For this reason a LAPD benchmark, which is larger
and more representative of communication applications, is also used. The LAPD
acronym stands for Link Access Protocol–D. It is an ISDN protocol used by the
communications industry when sending packet information between a caller and
callee. The benchmark is intended to measure a processor’s ability to crunch a
typical layered protocol stack. The LAPD code used is based on a prior AMD
software product, “AmLink”. The benchmark is in three parts: send an information
package and receive an un–numbered acknowledge; receive an information
package and respond with an un–numbered acknowledge; and send an information
package and receive an information package. Results are presented in terms of
geometric mean values for packet switching speeds for the three parts (the geometric
mean is found by multiplying the three results and taking the cube–root of the
product).
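For example (with invented packet rates), results of 4000, 5000 and 6250 packets per
second give a geometric mean of (4000 x 5000 x 6250)^(1/3) = (1.25 x 10^11)^(1/3) =
5000 packets per second.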
The performance results presented can act as a guideline for your own
application. However, the only certain way to know a processor’s performance for
any particular processor/memory configuration is to benchmark your own code on
the system or an Architectural Simulator.
8.1 THE 29K FAMILY
Chapter 1 described the features of the 29K family members in detail. The
family is divided into three main groupings: three–bus microprocessors, two–bus
microprocessors, and microcontrollers. This section will concentrate on the
sometimes difficult task of selecting a particular family member. When designing a
new microprocessor system, price and performance expectations restrict the choice
of available processors. It is not acceptable to select a high–end processor with a fast
memory system when the budget requires a low system cost. It is equally important to
be aware that a low–end processor with inexpensive memory system may not have
the required performance. There can also be other design restrictions, such as low
power consumption or short development time, that further influence the processor
selection. The problem of selecting a processor is dominated by the difficulty of
evaluating relative performance of different processor and memory combinations.
To help resolve this problem, I have simulated a wide range of potential systems and
determined their relative performance. The results are presented in the following
sections.
The review is divided into two sections: first, microcontrollers; and second, all
types of microprocessors. The division is natural. One of the first decisions to be
made is whether to use a microcontroller or a microprocessor.
They each have advantages and disadvantages, summarized below:
Performance
The 29K microcontrollers make available a wide range of system
performance. However, they do not enable construction of the
fastest systems. The 29K 2–bus microprocessors have the
advantage of higher processor clock speeds and larger on–chip
cache. They can also operate with faster memory systems,
although the construction of these fast memory systems is not as
simple as attaching a memory system to a microcontroller.
Design Time
The hardware design time is less with a microcontroller. This is
mainly because the microcontroller contains memory interface
controllers on–chip. There is no need to build any DRAM refresh
circuitry or memory interface logic. A number of frequently
required peripheral devices, such as UARTs and Input/Output
(I/O) ports, are also incorporated into the microcontrollers,
eliminating the need to select, integrate, and debug these
peripherals when they are required by the system, which is an advantage.
System Cost
Microcontroller systems generally cost less to design and
construct; they offer good value. The higher price of 2–bus
microprocessors is justifiable if higher system performance is
required. Additionally, the higher cost of the associated high
performance memory system makes the higher microprocessor
cost more acceptable.
Future Expansion
Frequently, systems are required to be built at different
price/performance combinations. Both types of processor have
something to offer in this area. The 2–bus processors are all pin
and bus compatible. It is possible to initially design with an
Am29035 processor using a 2/1 DRAM memory at 16 MHz. The
processor can be directly replaced with an Am29030 or Am29040,
each offering additional performance. Additionally, the faster
processors could be used at 33 MHz using Scalable Clocking to
achieve the highest performance system. Each processor has a
different cost. But, without redesigning the system, a simple
performance upgrade (or alternatively downgrade) path is
available.
The upgrade path is not as simple with microcontrollers. However,
it is possible (and frequently done) to build a system with a
multiple microcontroller foot print. The Am29240 device is
bigger than the Am29200, which in turn is bigger than the
Am29205. The difference in the physical size of the devices (the
foot prints) enables a board layout with a concentric pad site for all
three microcontrollers. Hence, the same board can be utilized with
different processors. However, because of the different access
timing of DRAMs used by microcontrollers, it would be necessary
to also upgrade the memory devices. This does not generally
present much of a problem, as a single board layout can easily
accommodate different memory device speeds.
Peripherals
Microcontrollers have the advantage of on–chip peripherals. As
well as simplifying the design process, they enable a smaller board
layout area and reduced system power consumption. The close
coupling of the on–chip peripherals to the processor enables fast
communication between the two, even at high clock speeds. There
is also no extra cost for the peripherals.
Memory Choice When DRAM is used, microprocessors enable a wider range of
memory systems to be constructed. The Am2920x
microcontrollers only support 3/2 DRAM access (3–cycle first
access, thereafter 2–cycle burst). The higher performance
Am2924x microcontrollers only support 2/1 DRAM access.
Burst–mode can be disabled resulting in slower 3/3 and 2/2
memory access respectively, but there is little else that can be
modified. However, for many systems, the restrictions inherent in
the built–in microcontroller DRAM interface will present no
problem.
Board Size
Microcontrollers are much more likely to enable a smaller board
layout area. They have less need for peripheral support circuitry,
particularly because of their built–in memory interfaces.
Power Consumption The Am2924x microcontrollers can operate at 3.3 volts and
support power saving operating modes. Additionally,
microcontroller based systems have less need for power
consuming peripherals. This gives the microcontrollers the
advantage when constructing a low power system. However, the
Am2920x low cost microcontrollers do not have the power saving
features. The only microprocessor particularly suited to low
power systems is the Am29040. Because of its higher cost and
higher clock rate, its use is restricted to higher performance systems.
Tool Selection
User mode code is compatible across the entire 29K family. This
means, for instance, a C compiler can produce code for any of the
processors. However, there are optimizations, such as the use of
integer multiply with the Am29240 and Am29040 processors,
which can improve a particular processor’s performance. Tool
selection, with the exception of certain debuggers, is likely to be
equally available, priced, and effective across the 29K family.
Multiprocessor
The bus snooping capability of the Am29040 makes this
microprocessor the clear choice for a complex multiprocessor
design. For less complex systems, where a 29K may be used as a
coprocessor for a peripheral task, the microcontrollers have an
on–chip parallel port which can be used to communicate with the
main processor. When a processor is used in conjunction with an
off–chip DMA controller, or other agent accessing shared
memory, it is important that a data cache (if used) support a
write–through or copy–back (with snooping) policy. However, a
write–through data cache still has problems with data consistency
when another agent wishes to write shared memory (see section
5.14). The techniques supported by the 29K members are superior,
in terms of data consistency, to simply using on–chip SRAM.
Debug Support
Processors are equally supported with software simulators and
low cost debug tools such as ROM emulators. The effectiveness of
low cost software debug tools, such as ROM emulators and debug
monitors, is enhanced with on–chip debug features such as
Monitor mode and breakpoint control registers. It is mainly the
higher performance processors which have these on–chip debug
features. The most popular processors are supported with In
Circuit Emulators (ICE) supplied by AMD partners. There are
also hardware and software personality modules which enable
logic analyzers to be used across the family for hardware and
low–level software debugging.
The simulation results presented in the following sections were obtained using
the Architectural Simulator. This simulator accurately models processor operation,
and can be used to evaluate any potential 29K system. Use of the simulator was
described in detail in section 1.14. An event file is required to describe the system’s
characteristics. For example, the file below, 200_3232_2232.evt, was used to
describe an Am29200 microcontroller which had a 32–bit ROM and DRAM
memory system (the 3232 part of the file name), with 2/2 ROM access and 3/2
DRAM access (the 2232 part of the file name).
;Architectural Simulator event file, 200_3232_2232.evt
romread    2      ;ROM space, 2/2 access
romwrite   2
romburst   false  ;burst off
rombread   2
romwidth   32     ;32–bit ROM–space
ramread    3      ;DRAM space, 3/2 access
ramwrite   3
rampage    true   ;pagemode on
rampread   2
rampwrite  2
ramwidth   32     ;32–bit DRAM space
By building new event files, it is possible to re–run simulations and evaluate the
effect on the system’s performance. The simulator was run using the command
below:
sim29 –29200 –e 200_3232_2232.evt a.out
The program being simulated, shown as a.out above, was the LAPD
benchmark. I chose to use LAPD rather than Stanford because of the high instruction
cache hit ratio of the Stanford benchmark –– above 90% with even very small caches.
I believe modeling the performance of LAPD is more likely to reflect the actual
performance most users will experience with their own application code. However,
as always, I urge you to use your own code when benchmarking various processors.
The LAPD benchmark is good at testing data movement and bit field (packet header)
operations, but this may not be your requirement. Additionally, the Metaware
compiler was used with a high level of optimization (–O7) when compiling the
benchmark. This produces the best performance but may require additional memory
to hold the expanded code which results from such optimizations as loop unrolling.
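As a further illustration of building a new event file, the sketch below models 1/1
ROM reads with the same 3/2 DRAM, reusing the keywords of the example file
above; the file name and its contents are hypothetical, and ROM writes stay at two
cycles because ROM–space writes require one wait state. It would be run with the
same form of sim29 command.

;hypothetical event file, 200_3232_1132.evt –– 1/1 ROM reads, 3/2 DRAM
romread    1      ;ROM space, 1–cycle read access
romwrite   2      ;ROM–space writes require one wait state
romburst   false
rombread   2
romwidth   32
ramread    3      ;DRAM space, 3/2 access
ramwrite   3
rampage    true
rampread   2
rampwrite  2
ramwidth   32

sim29 –29200 –e 200_3232_1132.evt a.out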
8.1.1 Selecting a Microcontroller
Microcontrollers are studied and grouped in this section according to their
memory system speed. Initially, systems based on 16 MHz memory are analyzed.
The performance of both 16– and 32–bit wide memories is presented. However, no
8–bit systems are included. Very small systems based on 8–bit memories and using
the Am2920x microcontrollers are evaluated in a separate section (section 8.1.2,
Moving up to an Am2920x Microcontroller). Memory systems operating at 12.5
MHz are also covered in that section on very small systems.
16 MHz Memory Systems
Setting 12.5 MHz systems aside, 16 MHz is the entry level system speed. This
can be achieved using a 16 MHz Am29200, Am29205 or Am29245 processor, or an
Am29240 using Scalable Clocking. When Scalable Clocking technology is used, a
33 MHz Processor would be combined with a 16 MHz memory system. Both
instruction and data accesses are satisfied by the slower 16 MHz memory. The
simulation results for various systems running the LAPD benchmark are shown in
Figure 8-1. Memory access times for the evaluated systems are shown in the
notation: (initial/subsequent), for example 2/1.
[Figure 8-1 (bar chart; bar lengths not reproduced). Each entry names a processor
followed by its ROM–or–SRAM and DRAM access times in initial/subsequent notation;
“*/*” indicates the memory space was not used. The legend distinguishes 32–bit memory,
32–bit memory at 1/2 CPU speed, 16–bit memory, 16–bit memory at 1/2 CPU speed, and
SRAM. The horizontal axis is packets per second, from 0 to 10000; bigger is better.
Entries, from top to bottom:
Am29205 */* 3/2,   Am29205 2/2 3/2,   Am29205 2/2 */*,   Am29200 */* 3/2,
Am29205 2/1 3/2,   Am29205 1/1 3/2,   Am29205 2/1 */*,   Am29245 */* 2/1,
Am29200 2/2 3/2,   Am29205 1/1 */*,   Am29200 2/2 */*,   Am29200 2/1 3/2,
Am29245 */* 2/1,   Am29245 2/2 */*,   Am29200 2/1 */*,   Am29245 2/2 2/1,
Am29240 */* 2/1,   Am29200 1/1 3/2,   Am29240 1/1 2/1,   Am29240 1/1 */*,
Am29240 2/2 */*,   Am29240 2/2 2/1,   Am29200 1/1 */*,   Am29245 1/1 */*,
Am29245 1/1 2/1,   Am29240 */* 2/1,   Am29240 1/1 2/1,   Am29240 1/1 */*.]
Figure 8-1. 29K Microcontrollers Running the LAPD Benchmark
With 16 MHz Memory Systems

Programmable Data width was used to model 16–bit and 32–bit memories. As
expected and supported by the results, the 16–bit memory systems offer less
performance. Not all of the modeled systems are likely candidates for construction.
They are shown merely to report their relative performance. Some of the most
interesting systems are highlighted. For example, the second from the top entry
shows an Am29205 system with 16–bit 2/2 ROM and 3/2 DRAM. This is an entry
level system. The first entry shows an Am29205 operating from 16–bit DRAM
alone. The notation “*/*” in the ROM/SRAM column indicates that no ROM
memory was used. Such a system would require initialization of the DRAM memory.
This could be achieved with an 8–bit ROM which transferred its contents to DRAM
before application code execution commenced. Note, it is not possible to build a
DRAM–only system where a dual–ported DRAM is initialized by another processor.
This is because after reset, program execution commences from ROM region 0. This
does mean an SRAM–only system could be constructed; assuming that the SRAM is
located in ROM region 0, and is somehow initialized before processor reset.
The second entry, the 2/2–3/2 system, was linked such that instructions were
fetched from the 2/2 ROM space; all data was accessed from the 3/2 DRAM. The
combined ROM–DRAM system is faster than the 3/2 DRAM–only system. The
DRAM–only system has 81% of the faster system’s performance. This is due to
instruction accesses being directed to the faster 2/2 memory and the frequent
occurrence of DRAM precharge cycles. The Am29200 DRAM is frequently referred
to as 3/2; this assumes the 1–cycle of RAS–precharge is hidden. When DRAM–only
systems are used, the precharge is not likely to be hidden, and the access is truly 4/2
rather than 3/2. This is explained in section 1.14.1 under the Am29200 and Am29205
subheading. Given that even inexpensive EPROM devices can be 1.5 to 2–times the
cost of DRAM (per byte), it is less expensive to use a single 8–bit EPROM to
initialize the DRAM, and then execute the program from DRAM. However, there is a
loss of performance with this technique.
The sixth entry shows an Am29205 system with 1/1 ROM and 3/2 DRAM. The
system has substantially increased performance over the 2/2 ROM system (66%
faster). The notation 1/1 is used here to indicate instruction read access times only.
The microcontroller family requires one wait state when writing to ROM space. This
results in a minimum write access time of 2–cycles for ROM space. Although this is
important to note, it has no impact here as the system performs all data writes to
DRAM. However, the system is unbuildable due to the unavailability of ROM
devices which can deal with the very fast access times.
The access times for ROM space are determined by three parameters. First, the
period of the memory system clock (CP) –– all memory accesses are synchronized to
the system clock. Second, the delay before synchronous outputs become valid (OV).
Third, the input setup time (IS) for synchronous input signals. When performing
single–cycle memory access, the access time is determined from the ROMOE signal
becoming valid after the falling edge of MEMCLK (OVF). When wait states (WS)
are used, the access time is determined from the address outputs becoming valid after
the rising edge of MEMCLK (OVR). The equations below can be used to calculate
the required minimum memory access times.

Memory Access Time = (Clock Period)/2 – (Output Valid) – (Input Setup)
                   = (CP/2) – OVF – IS                                    (WS = 0)

Memory Access Time = (Clock Period) * (1 + Wait States) – (Output Valid) – (Input Setup)
                   = (CP * (1 + WS)) – OVR – IS                           (WS > 0)

Table 8-1. Memory Access Times for Am2920x Microcontroller ROM Space

Memory Bus    Clock Period   Output Valid     Input Setup   Memory Access Times (ns)
Speed (MHz)   (CP ns)        (OVR, OVF ns)    (IS ns)       0–Wait    1–Wait    2–Wait
12.5          80             15, 15           12            13        133       213
16            62.5           11, 10           10            11.25     105       167.5
20            50             11, 10           10            5         79        129
Shown in Table 8-1 are the required memory access times for Am2920x ROM
space memory. The 1/1 access times are given in the 0–Wait column. At
16 MHz, an 11.25 ns access time must be supported. ROM devices at this speed are
not available. However, the access times for 2/2 ROM (1–Wait) are reasonable, and
can be achieved with readily available 90 ns ROM devices.
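The two equations are simple enough to evaluate by hand, but the small C program
below may be a convenient way to explore other clock rates. It merely restates the
formulas above; the 12.5 MHz parameters used in the example are taken from Table 8-1,
and figures for other devices should come from the relevant data sheet.

#include <stdio.h>

/* Required memory access time (ns) for Am2920x ROM space.
 * cp  = memory clock period (ns)
 * ovr = output valid after rising edge of MEMCLK (ns)
 * ovf = output valid after falling edge of MEMCLK (ns)
 * is  = input setup time (ns)
 * ws  = number of wait states
 */
static double access_time(double cp, double ovr, double ovf,
                          double is, int ws)
{
    if (ws == 0)
        return cp / 2.0 - ovf - is;      /* single-cycle access */
    return cp * (1 + ws) - ovr - is;     /* wait-state access   */
}

int main(void)
{
    /* 12.5 MHz memory bus figures from Table 8-1 */
    double cp = 80.0, ovr = 15.0, ovf = 15.0, is = 12.0;
    int ws;

    /* Prints 13, 133 and 213 ns, matching the 12.5 MHz table row. */
    for (ws = 0; ws <= 2; ws++)
        printf("%d-wait: %.2f ns\n", ws, access_time(cp, ovr, ovf, is, ws));
    return 0;
}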
It is not until a 20 MHz memory system is required that particularly fast ROM
need be used. At this stage an interleaved ROM system could be built, or faster ROM
purchased at a higher cost. Alternatively, FLASH memory could be used. FLASH is
generally available with faster access times than EPROM. Table 8-2 lists a number of
current AMD memory devices and their access times. Faster and larger devices are
always being developed by AMD and other manufacturers. It is likely that before long
new memory devices will become available and enable faster systems to be
constructed at lower cost.
The ninth entry shows a 32–bit Am29200–based system using 2/2 ROM and 3/2
DRAM. This may be a popular system for construction. The 32–bit DRAM–only
system has only 71% of the performance of the combined ROM–DRAM system. Thus,
the addition of 32–bit wide ROM is justifiable for systems requiring extra performance.
Faster DRAM–based systems can be constructed using the 2/1 DRAM
controller incorporated into the more expensive Am2924x microcontrollers. The
Am29245 is the least expensive, and it is shown with a 32–bit DRAM–only system
(*/* 2/1) in entry thirteen of the table. The previous Am29200 system based on
ROM–DRAM has only 71% of the performance of the DRAM–only Am29245
system. Interestingly, the Am29240 system using 16–bit DRAM and Scalable
Clocking is shown to be faster than the Am29245 using 32–bit DRAM. This is due to
the higher internal clock rate and data cache of the Am29240.
Table 8-2. ROM and FLASH Memory Device Access Times

AMD Device    Speeds (ns)      Capacity    Memory Type
Am27C010      90, 120, 150     128k x 8    EPROM
Am27C020      80, 120, 150     256k x 8    EPROM
Am27C040      90, 120, 150     512k x 8    EPROM
Am27C080      90, 120, 150     1M x 8      EPROM
Am28F010      90, 120, 150     128k x 8    FLASH
Am28F020      90, 120, 150     256k x 8    FLASH
Am29F010      45, 55, 70       128k x 8    FLASH
Am29F040      70, 90, 120      512k x 8    FLASH
The fastest DRAM–only system, third from the bottom, is an Am29240 using
Scalable Clocking and 32–bit DRAM. This system is 130% faster than the examined
Am29200 using ROM–DRAM. However, it is more expensive due to the premium
speed microcontroller. An alternative is to use a less expensive Am29200 with
SRAM. Shown in Figure 8-1 is a 32–bit 1/1 SRAM–based system which is 100%
faster than the studied ROM–DRAM system. When examining SRAM–only
systems (such as the 1/1 */* example), the benchmark program was linked such that
both instructions and data were accessed from SRAM. In practice this would likely
require programs to be located in 8–bit ROM, and transferred to SRAM during the
initialization stage. Unfortunately, SRAM is about eight times the cost of DRAM on
a per–byte basis. However, if only a small amount of SRAM is required, the system
may be cost effective, given the lower processor cost. At 12.5 MHz, zero wait state
access requires 13 ns SRAM. Such devices are readily available.
Table 8-3. Memory Access Times for Am2924x Microcontroller ROM Space

Memory Bus    Clock Period   Output Valid     Input Setup   Memory Access Times (ns)
Speed (MHz)   (CP ns)        (OVR, OVF ns)    (IS ns)       0–Wait    1–Wait    2–Wait
16            62.5           10, 9            7             15.25     108       170.5
20            50             10, 9            7             9         83        133
25            40             10, 9            7             4         63        103
33            31.25          10, 9            7             –         45.5      76.75
20 MHz Memory Systems
Microcontroller–based systems using 20 MHz memory systems are shown in
Figure 8-2. When using DRAM, these systems are always faster than 16 MHz
systems. However, a 20 MHz Am29200 system using 2/2 ROM and 3/2 DRAM has
only 62% of the performance of a 16 MHz SRAM system.
[Bar chart: LAPD packets per second (0–9000) for Am29200 and Am29240 systems with 32–bit and 16–bit memory; configurations range from Am29200 */* 3/2 through Am29240 1/1 */*, with ROM or SRAM memory access distinguished from DRAM memory access.]
Figure 8-2. 29K Microcontrollers Running the LAPD Benchmark
With 20 MHz Memory Systems
Building an Am29200 system with 1/1 SRAM at 20 MHz requires 5 ns memory
access times. These are much more expensive than the 11.25 ns memories required at
16 MHz. To reduce cost, an interleaved SRAM system could be constructed. This
would result in 2/1 SRAM access. However, this achieves only 90% of the
performance of a 1/1 SRAM system operating at 16 MHz. It would be better to
build the slower, less expensive, yet higher–performing 16 MHz system.
With 20 MHz memory systems, the Am2920x microcontrollers are operating at
their maximum frequency. As more performance is required, the likelihood of
selecting an Am29240 processor increases. This is particularly true if DRAM–only is
to be used. An Am29240 using 32–bit DRAM–only (2/1) is 151% faster than an
Am29200 using a 3/2 DRAM–only system.
It is possible to build SRAM based systems using an Am29240 processor.
Shown in Table 8-3 are the required memory access times for Am2924x ROM space
memory. The table is based on preliminary AMD data which may change in the
future. The 1/1 access times are given under the 0–Wait column. At 20 MHz a 9 ns
access time must be supported. This is difficult to achieve, and probably not
worthwhile economically. In practice, it would be better to slow the clock speed
down to 19.2 MHz and use 10 ns SRAM devices.
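The 19.2 MHz figure follows directly from the zero wait state equation, rearranged to
give the longest clock period (and hence highest frequency) that a given SRAM speed can
support. The sketch below shows that arithmetic; the OVF and IS values are the
preliminary Am2924x figures from Table 8-3 and may change.

#include <stdio.h>

/* Maximum zero-wait-state memory clock (MHz) for a given SRAM access
 * time, using  access = CP/2 - OVF - IS  rearranged for CP.
 */
static double max_clock_mhz(double sram_ns, double ovf, double is)
{
    double cp = 2.0 * (sram_ns + ovf + is);   /* minimum clock period, ns */
    return 1000.0 / cp;                       /* convert ns period to MHz */
}

int main(void)
{
    /* A 10 ns SRAM limits the memory clock to roughly 19.2 MHz. */
    printf("%.1f MHz\n", max_clock_mhz(10.0, 9.0, 7.0));
    return 0;
}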
However, the Am29240 system using 32–bit 2/1 DRAM–only has 76% of the
performance of a 32–bit 1/1 SRAM system. The performance benefit of SRAM,
relative to DRAM, is diminished when used with an Am2924x microcontroller. This
is partly due to the 2–cycle requirement for all data writes performed to ROM space.
The 1/1 access is only achieved with instruction fetching and data reading. All data
writes are performed with, at best, 2/2 access times. Conversely, DRAM supports 2/1
for all types of access.
25 MHz Memory Systems
The performance of 25 MHz memory systems is shown in Figure 8-3. These
systems can only be built using Am29240 and Am29243 microcontrollers. At this
speed it is not possible to use 1–cycle first–access memory. Further, 2/1 SRAM has
poorer performance than 2/1 DRAM due to the 2–cycle data–write limitation.
Scalable Clocking is not available with memory systems running at 20 MHz and above;
hence, all memory systems must run at the speed of the processor. The fast (2/1) DRAM controller
incorporated into the Am2924x microcontrollers makes DRAM the correct memory
choice with these processors. Additionally, the 2/2 ROM which could be used with
such systems would degrade performance compared with a DRAM–only system. Hence, it
makes sense to use only a slow 8–bit ROM to initialize the DRAM. Program code,
and initialized data, should be transferred from narrow ROM to DRAM during
program initialization. If a program is too large to fit within a single 8–bit ROM, it
would then make sense to use 16–bit ROM for additional capacity.
[Bar chart: LAPD packets per second (0–14000) for Am29240 systems with 32–bit and 16–bit memory; configurations include */* 2/1, 2/2 */*, 2/2 2/1, 2/1 2/1, and 1/1 */*, with ROM or SRAM memory access distinguished from DRAM memory access.]
Figure 8-3. 29K Microcontrollers Running the LAPD Benchmark
With 25 MHz Memory Systems
When executing from DRAM there is always the danger of accidentally writing
to memory holding instructions and damaging the program. This can be avoided by
using the on–chip MMU to protect the relevant memory regions (see section 7.4.5).
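A minimal sketch of the idea follows. The protect_region() helper is hypothetical and
simply stands in for the TLB programming described in section 7.4.5; the addresses and
sizes shown are also assumptions for illustration only.

/* Hedged sketch only: mark the DRAM pages that hold instructions as
 * execute/read-only so a stray store traps instead of corrupting code.
 * protect_region() is a hypothetical helper, not a library call; the
 * base address and size below are assumed values.
 */
#define CODE_BASE  0x40000000UL   /* assumed DRAM load address      */
#define CODE_SIZE  0x00040000UL   /* assumed size of the code image */

extern void protect_region(unsigned long base, unsigned long size,
                           int readable, int writable, int executable);

void lock_down_code(void)
{
    /* Readable and executable, but not writable. */
    protect_region(CODE_BASE, CODE_SIZE, 1, 0, 1);
}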
33 MHz Memory Systems
The performance of systems operating at 33 MHz is shown in Figure 8-4. As
with 25 MHz systems, 2/1 DRAM–only memory is most practical. At 25 and 33 MHz
the choice of practical systems is limited compared to the selection at 16 or
even 20 MHz. In fact, the word “practical” should not be interpreted to mean easy or
readily available. At 33 MHz a DRAM system is challenged to meet the 2/1 timing
specification; currently only the fastest DRAM devices are usable. For example,
DRAM with 60 ns access times is required by 25 MHz systems. The most practical
way to use 33 MHz processors is with Scalable Clocking, which reduces the memory
system speed to 16 MHz. At these higher clock rates, the Am29240 microcontroller
is able to perform as well as many systems built around a 2–bus microprocessor.
Further Observations
Memory system requirements are not likely to be the only influence on processor
selection. The Am2924x microcontrollers offer additional on–chip peripherals
compared to the Am2920x processors, which may direct processor selection towards
the more expensive Am2924x grouping.
[Bar chart: LAPD packets per second (0–14000) for Am29240 systems at 33 MHz with 32–bit and 16–bit memory; configurations include */* 2/1, 2/2 */*, 2/2 2/1, and 2/1 2/1, with ROM or SRAM memory access distinguished from DRAM memory access.]
Figure 8-4. 29K Microcontrollers Running the LAPD Benchmark
With 33 MHz Memory Systems
The Am2924x processors also have additional power–saving features. Note that, as a
means of saving power, the Am2920x can be temporarily clocked down to 8 MHz and,
when necessary, the clock returned to the normal (higher) operating speed.
All microcontrollers are able to use interrupt context caching (see section 2.5.4).
This improves interrupt processing and is somewhat independent of the off–chip
memory system performance. With interrupt context caching, the processor state is
saved and restored from on–chip registers rather than an external memory stack. Hence,
even the least expensive system can support interrupts with a performance matching
that of the more expensive systems.
8.1.2 Moving up to an Am2920x Microcontroller
This section presents the performance of Am2920x microcontrollers operating
at 12.5 and 16 MHz. The intention is to evaluate the smallest, least expensive systems
possible. This section should be of interest to the designer looking to use a RISC
microcontroller to upgrade a system which would have previously used an
inexpensive CISC processor. The performance of various 8–bit and 16–bit memory
systems is shown in Figure 8-5.
Much of the information presented in the previous 16 MHz microcontroller
section is applicable to the low–cost systems studied here. There are three systems of
primary interest; each can be constructed at both frequencies. First, consider systems
which operate with 16–bit DRAM only, including all systems which have slow ROM or
ROM which is only 8 bits wide.
[Bar chart: LAPD packets per second (0–3600, bigger is better) for Am2920x systems at 12.5 and 16 MHz; R=ROM, D=DRAM, S=SRAM, and */* indicates the memory is not used. Configurations cover 8–bit and 16–bit ROM or SRAM with 16–bit DRAM, for example 16 R=*/* D=3/2 and 16 S=1/1 D=*/*.]
Figure 8-5. Am2920x Microcontrollers Running the LAPD Benchmark with 8–bit and 16–bit
Memory Systems Operating at 12 and 16 MHz
Having slow or narrow ROM can help to keep the system cost down. The program must
be copied from ROM to DRAM after processor power–up. Hence, the DRAM is the only
memory which influences program execution speed. Unlike the Am2924x
microcontrollers, the Am2920x processors have no Translation Look–Aside Buffers
(TLBs). Consequently, they cannot protect the DRAM from accidental damage during
program execution. This may be more of an issue during code development than in a
final production product.
It is important to note here that the Am29205 processor does not have a
BOOTW (boot width) pin, and hence must initially operate from 16–bit wide
memory. Only the Am29200 processor can begin operation from 8–bit ROM.
Consequently, DRAM–only systems are more applicable to the Am29200. This is a
little unfortunate, as only the Am29205 is available at the lower–cost 12.5 MHz
frequency. Highlighted in Figure 8-5 are the simulation results for an Am29200
processor operating at 16 MHz using DRAM only (16 R=*/* D=3/2).
The second type of system of interest uses 16–bit ROM (2/2) with 16–bit
DRAM (3/2). This is faster than operating from DRAM–only. If ROM is to be used,
it must support 2/2 access or faster, and it must be 16 bits wide. If it is
slower or narrower, it is best to execute from DRAM–only. A 12.5 MHz Am29205
with ROM (2/2) and DRAM (3/2) has 97% of the performance of a 16 MHz
Am29200 with DRAM–only.
The third type of system of interest uses 1/1 SRAM. Given the higher cost of
SRAM compared to DRAM, this configuration is only applicable when extra
performance is required. The SRAM–only systems shown in Figure 8-5 would
require an 8–bit ROM for program initialization –– much the same as DRAM–only
systems. The simulation results show that a 16–bit DRAM–only system has only
79% of the performance of an 8–bit 1/1 SRAM system. The 8–bit SRAM system has
2% more performance than the 16–bit 2/2–3/2 system (ROM–DRAM). The reason
for the higher performance can be understood by examining the number of cycles
required to fetch a single 32–bit instruction. With a 16–bit 3/2 DRAM–only system,
6–cycles are required to fetch the first instruction; 4–cycles for burst–mode fetched
instructions. With 8–bit 1/1 SRAM, 4–cycles are required to fetch instructions. The
8–bit SRAM has the advantage.
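The cycle counts quoted above follow from simple arithmetic on the bus width and the
first/burst access times. The helper below reproduces that arithmetic; it assumes the
first DRAM access pays the one cycle of precharge discussed earlier, and it is an
illustration rather than a timing model.

/* Cycles to fetch one 32-bit instruction over a narrow bus.
 *
 * width_bits   : memory bus width (8 or 16)
 * first_cycles : cycles for the first access (including any
 *                precharge that cannot be hidden)
 * burst_cycles : cycles for each following burst access
 * first_fetch  : nonzero for the first instruction of a run,
 *                zero for burst-mode fetched instructions
 */
int fetch_cycles(int width_bits, int first_cycles,
                 int burst_cycles, int first_fetch)
{
    int accesses = 32 / width_bits;     /* bus cycles per instruction */
    if (first_fetch)
        return first_cycles + (accesses - 1) * burst_cycles;
    return accesses * burst_cycles;
}

/* Examples matching the text:
 *   16-bit 3/2 DRAM, first instruction : fetch_cycles(16, 4, 2, 1) == 6
 *   16-bit 3/2 DRAM, burst instruction : fetch_cycles(16, 4, 2, 0) == 4
 *   8-bit  1/1 SRAM, any instruction   : fetch_cycles(8,  1, 1, 0) == 4
 */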
Building a 16–bit 1/1 SRAM system produces a system with 140% of the
performance of a 16–bit DRAM–only system. At 12.5 MHz, single–cycle access
requires 13 ns SRAM, which is readily available. Simple SRAM–based designs can
offer surprisingly good performance, but the small size of SRAM devices means such
systems are only suitable for applications requiring small amounts of memory.
Otherwise the cost of the SRAM is likely to be prohibitively high.
8.1.3 Selecting a Microprocessor
The highest performance systems are constructed around 2–bus 29K
processors. The following subsections present the performance obtained for a
complete range of 29K processors running the LAPD benchmark. Systems are
studied and grouped in subsections according to their memory system speed. All
values were obtained using 32–bit DRAM or SRAM memory systems. Processors
ran, as indicated, either at the same speed as the system memory or at twice that
speed using Scalable Clocking technology. Comparing the fastest and slowest
systems, there is a performance difference of more than 6–to–1, with a range of
intermediate systems offering a wide selection of performance configurations.
DRAM is frequently referred to as, say, 2/1. This assumes the often required
1–cycle of precharge (RAS precharge) is hidden. When DRAM–only systems are
constructed, the precharge encountered when accessing a new memory page cannot
always be hidden, and the access is thus 3/1 rather than 2/1. This is explained in
section 1.14.1 under the Am29200 and Am29205 subheading. The previous section
on selecting a microcontroller also referred to DRAM memory speeds without
including the necessary precharge time. The terminology is acceptable because the
precharge can frequently be hidden when the ROM region is used in conjunction with
the DRAM region. Consequently, precharge has little effect on performance.
However, when DRAM–only systems are constructed, the effect precharge has on
system performance is more noticeable. Even if a 2/1 DRAM–only system suffers a
1–cycle precharge on all new page accesses, thus resulting in a 3/1 access, it shall still
be termed 2/1. Consistently maintaining the same notation for memory access
throughout this book helps with system comparisons. In summary, a 2/1 DRAM has
2–cycle initial access followed by 1–cycle for same–page accesses. With
DRAM–only systems a 2/1 system equates to a 3/1 system for all new page accesses.
With microcontrollers, the access times for the DRAM memory region
controller are built into the Architectural Simulator. Constructing an event file for
2–bus processors is a little more difficult. The event file shown below describes a 2/1
DRAM system used with a 2–bus processor. The required precharge and refresh
times are included. These parameters are also built into the simulation model for
microcontrollers. If a 2/1 SRAM system were being modeled, the precharge and
refresh parameters would be omitted. Note that, in the example, Scalable Clocking is
not selected.
;Architectural Simulator event file
spacerambank  80000000 100000     ;memory location
ramwidth      32                  ;32–bit DRAM
ramread       2                   ;2/1 access
ramwrite      2
ramburst      true
rambread      1
rambwrite     1
;b