IDT79R3500 RISC CPU PROCESSOR RISCore™
MILITARY AND COMMERCIAL TEMPERATURE RANGES
Integrated Device Technology, Inc.

FEATURES:
• Efficient Pipelining: The CPU's 5-stage pipeline design assists in obtaining an execution rate approaching one instruction per cycle. Pipeline stalls and exceptions are handled precisely and efficiently.
• On-Chip Cache Control: The IDT79R3500 provides a high-bandwidth memory interface that handles separate external Instruction and Data Caches ranging in size from 4 to 256kB each. Both caches are accessed during a single CPU cycle. All cache control is on-chip.
• On-Chip Memory Management Unit: A fully-associative, 64-entry Translation Lookaside Buffer (TLB) provides fast address translation for virtual-to-physical memory mapping of the 4GB virtual address space.
• Dynamically able to switch between Big- and Little-Endian byte ordering conventions.
• Optimizing compilers are available for C, FORTRAN, Pascal, COBOL, Ada, PL/1 and C++.
• 20 through 40MHz clock rates yield up to 32 VUPS sustained throughput.
• Supports independent multi-word block refill of both the instruction and data caches with variable block sizes.
• Supports concurrent refill and execution of instructions.
• Partial word stores executed as read-modify-write.
• 6 external interrupt inputs, 2 software interrupts, with single cycle latency to exception handler routine.
• Flexible multiprocessing support on chip with no impact on uniprocessor designs.
• A single chip integrating the R3000 CPU and R3010 FPA execution units, using the R3000A pinout.
• Software compatible with R3000, R2000 CPUs and R3010, R2010 FPAs.
• TLB disable feature allowing a simple memory model for embedded applications.
• Programmable Tag bus width allowing reduced cost cache.
• Hardware support of single- and double-precision floating-point operations that include Add, Subtract, Multiply, Divide, Comparisons, and Conversions.
• Sustained floating-point performance of 11 MFLOPS single-precision LINPACK and 7.3 MFLOPS double precision.
• Supports full conformance with the IEEE 754-1985 Floating Point Specification.
• 64-bit FP operation using sixteen 64-bit data registers.
• Military product compliant to MIL-STD-883, Class B.

[Block diagram: IDT79R3500 processor. Master Pipeline/Bus Control; CP0 (System Control Coprocessor) with Exception/Control Registers, Memory Management Unit Registers, Local Control Logic, and Translation Lookaside Buffer (64 entries); CPU with General Registers (32x32), ALU, Shifter, Integer Multiplier/Divider, Address Adder, and PC Increment/Mux; FPA with FPA Registers, Exponent Add Unit, Divide Unit, and Multiply Unit; Virtual Page Number/Virtual Address; external buses Data (32+4), Tag (20+4), and Address (18).]

The IDT logo is a registered trademark and RISCore, CEMOS are trademarks of Integrated Device Technology, Inc.
© 1992 Integrated Device Technology, Inc. OCTOBER 1992 DSC-9054/3

DESCRIPTION:
The IDT79R3500 RISC Microprocessor consists of three tightly-coupled processors integrated on a single chip. The first processor is a full 32-bit CPU based on RISC (Reduced Instruction Set Computer) principles to achieve a new standard of microprocessor performance. The second processor is a system control coprocessor, called CP0, containing a fully-associative 64-entry TLB (Translation Lookaside Buffer), MMU (Memory Management Unit) and control registers, supporting a 4GB virtual memory subsystem, and a Harvard Architecture Cache Controller achieving a bandwidth of 320MB/second using industry standard static RAMs. The third processor is the Floating Point Accelerator (FPA), which performs arithmetic operations on values in floating-point representations. This processor fully conforms to the requirements of ANSI/IEEE Standard 754-1985, "IEEE Standard for Binary Floating-Point Arithmetic." In addition, the architecture fully supports the standard's recommendations.

The programmer model of this device is the same as the programmer model of a system which uses a discrete IDT79R3000 with the IDT79R3010: 32 integer registers; 16 floating point registers; coprocessor 0 registers; floating point status and control register; RISC integer ALU; Integer Multiply and Divide ALU; Floating Point Add/Subtract, Multiply, and Divide ALUs. The device pipeline is the same as for the IDT79R3000, as is the coprocessor 0 functionality. No new instructions have been introduced. Pin compatibility extends to AC and DC characteristics, software execution and initialization mode vector selection.

This data sheet provides an overview of the features and architecture of the IDT79R3500 CPU, Revision 3.0. A more detailed description of the operation of the device is incorporated in the R3500 Family Hardware User Manual, and a more detailed architectural overview is provided in the MIPS RISC Architecture book, both available from IDT. Documentation providing details of the software and development environments supporting this processor is also available from IDT.

IDT79R3500 CPU Registers
The IDT79R3500 CPU provides 32 general purpose 32-bit registers, a 32-bit Program Counter, and two 32-bit registers that hold the results of integer multiply and divide operations. Only two of the 32 general registers have a special purpose: register r0 is hardwired to the value "0", which is a useful constant, and register r31 is used as the link register in jump-and-link instructions (return address for subroutine calls).

The CPU registers are shown in Figure 2. Note that there is no Program Status Word (PSW) register shown in this figure: the functions traditionally provided by a PSW register are instead provided in the Status and Cause registers incorporated within the System Control Coprocessor (CP0).

[Figure: General Purpose Registers r0 through r31 (bits 31:0); Multiply/Divide Registers HI and LO (bits 31:0); Program Counter PC (bits 31:0).]
Figure 2. IDT79R3500 CPU Registers

FPA REGISTERS
The IDT79R3010A FPA provides 32 general purpose 32-bit registers, a Control/Status register, and a Revision Identification register. Floating-point coprocessor operations reference three types of registers:
• Floating-Point Control Registers (FCR)
• Floating-Point General Registers (FGR)
• Floating-Point Registers (FPR)

Floating-Point General Registers (FGR)
There are 32 Floating-Point General Registers (FGR) on the FPA. They represent directly-addressable 32-bit registers, and can be accessed by Load, Store, or Move operations.

Floating-Point Registers (FPR)
The 32 FGRs described in the preceding paragraph are also used to form sixteen 64-bit Floating-Point Registers (FPR). Pairs of general registers (FGRs), for example FGR0 and FGR1 (Figure 3), are physically combined to form a single 64-bit FPR. The FPRs hold a value in either single- or double-precision floating-point format. Double-precision format FPRs are formed from two adjacent FGRs.

Floating-Point Control Registers (FCR)
There are 2 Floating-Point Control Registers (FCR) on the FPA. They can be accessed only by Move operations and include the following:
• Control/Status register, used to control and monitor exceptions, operating modes, and rounding modes;
• Revision register, containing revision information about the FPA.

[Figure: General Purpose Registers (FGR/FPR), each 64-bit FPR formed from an odd FGR in bits 63:32 and the adjacent even FGR in bits 31:0 (FGR1/FGR0, FGR3/FGR2, FGR5/FGR4, ..., FGR27/FGR26, FGR29/FGR28, FGR31/FGR30); Control/Status Register (bits 31:0, Exceptions/Enables/Modes); Implementation/Revision Register (bits 31:0).]
Figure 3. FPA Registers

Instruction Set Overview
All IDT79R3500 instructions are 32 bits long, and there are only three instruction formats. This approach simplifies instruction decoding, thus minimizing instruction execution time. The IDT79R3500 processor initiates a new instruction on every run cycle, and is able to complete an instruction on almost every clock cycle. The only exceptions are the Load instructions and Branch instructions, which each have a single cycle of latency associated with their execution. Note, however, that in the majority of cases the compilers are able to fill these latency cycles with useful instructions which do not require the result of the previous instruction. This effectively eliminates these latency effects.

The actual instruction set of the CPU was determined after extensive simulations to determine which instructions should be implemented in hardware, and which operations are best synthesized in software from other basic instructions. This methodology resulted in the IDT79R3500 having the highest performance of any available microprocessor.

Format               Fields (bit positions)
I-Type (Immediate)   op (31:26), rs (25:21), rt (20:16), immediate (15:0)
J-Type (Jump)        op (31:26), target (25:0)
R-Type (Register)    op (31:26), rs (25:21), rt (20:16), rd (15:11), re (10:6), funct (5:0)
Figure 4. IDT79R3500 Instruction Formats

The IDT79R3500 instruction set can be divided into the following groups:
• Load/Store instructions move data between memory and general registers. They are all I-type instructions, since the only addressing mode supported is base register plus 16-bit, signed immediate offset. The Load instruction has a single cycle of latency, which means that the data being loaded is not available to the instruction immediately after the load instruction. The compiler will fill this delay slot with either an instruction which is not dependent on the loaded data, or with a NOP instruction. There is no latency associated with the store instruction.
Loads and Stores can be performed on byte, half-word, word, or unaligned word data (32-bit data not aligned on a modulo-4 address). The CPU cache is constructed as a write-through cache.
• Computational instructions perform arithmetic, logical and shift operations on values in registers. They occur in both R-type (both operands and the result are registers) and I-type (one operand is a 16-bit immediate) formats. FP computational instructions perform arithmetic operations on floating point values in the FPA registers. Note that computational instructions are three-operand instructions; that is, the result of the operation can be stored into a different register than either of the two operands. This means that operands need not be overwritten by arithmetic operations, which results in a more efficient use of the large register set.
• Conversion instructions perform conversion operations on the floating point values in the FPA registers.
• Compare instructions perform comparisons of the contents of FPA registers and set a condition bit based on the results. The result of the compare operations is tied directly to CpCond(1) for software testing.
• Jump and Branch instructions change the control flow of a program. Jumps are always to a paged absolute address formed by combining a 26-bit target with four bits of the Program Counter (J-type format, for subroutine calls), or to 32-bit register byte addresses (R-type, for returns and dispatches). Branches have 16-bit offsets relative to the program counter (I-type). Jump and Link instructions save a return address in Register 31. The R3500 instruction set features a number of branch conditions. Included is the ability to compare a register to zero and branch, and also the ability to branch based on a comparison between two registers. Thus, net performance is increased since software does not have to perform arithmetic instructions prior to the branch to set up the branch conditions.
• Coprocessor instructions perform operations in the coprocessors. Coprocessor Loads and Stores are I-type.
• Coprocessor 0 instructions perform operations on the System Control Coprocessor (CP0) registers to manipulate the memory management and exception handling facilities of the processor.
• Special instructions perform a variety of tasks, including movement of data between special and general registers, system calls, and breakpoint. They are always R-type.

OP          Description

Load/Store Instructions
LB          Load Byte
LBU         Load Byte Unsigned
LH          Load Halfword
LHU         Load Halfword Unsigned
LW          Load Word
LWL         Load Word Left
LWR         Load Word Right
SB          Store Byte
SH          Store Halfword
SW          Store Word
SWL         Store Word Left
SWR         Store Word Right

Arithmetic Instructions (ALU Immediate)
ADDI        Add Immediate
ADDIU       Add Immediate Unsigned
SLTI        Set on Less Than Immediate
SLTIU       Set on Less Than Immediate Unsigned
ANDI        AND Immediate
ORI         OR Immediate
XORI        Exclusive OR Immediate
LUI         Load Upper Immediate

Arithmetic Instructions (3-operand, register-type)
ADD         Add
ADDU        Add Unsigned
SUB         Subtract
SUBU        Subtract Unsigned
SLT         Set on Less Than
SLTU        Set on Less Than Unsigned
AND         AND
OR          OR
XOR         Exclusive OR
NOR         NOR

Shift Instructions
SLL         Shift Left Logical
SRL         Shift Right Logical
SRA         Shift Right Arithmetic
SLLV        Shift Left Logical Variable
SRLV        Shift Right Logical Variable
SRAV        Shift Right Arithmetic Variable

Multiply/Divide Instructions
MULT        Multiply
MULTU       Multiply Unsigned
DIV         Divide
DIVU        Divide Unsigned
MFHI        Move From HI
MTHI        Move To HI
MFLO        Move From LO
MTLO        Move To LO

Jump and Branch Instructions
J           Jump
JAL         Jump and Link
JR          Jump to Register
JALR        Jump and Link Register
BEQ         Branch on Equal
BNE         Branch on Not Equal
BLEZ        Branch on Less than or Equal to Zero
BGTZ        Branch on Greater Than Zero
BLTZ        Branch on Less Than Zero
BGEZ        Branch on Greater than or Equal to Zero
BLTZAL      Branch on Less Than Zero and Link
BGEZAL      Branch on Greater than or Equal to Zero and Link

Special Instructions
SYSCALL     System Call
BREAK       Break

Coprocessor Instructions
LWCz        Load Word from Coprocessor
SWCz        Store Word to Coprocessor
MTCz        Move To Coprocessor
MFCz        Move From Coprocessor
CTCz        Move Control to Coprocessor
CFCz        Move Control From Coprocessor
COPz        Coprocessor Operation
BCzT        Branch on Coprocessor z True
BCzF        Branch on Coprocessor z False

System Control Coprocessor (CP0) Instructions
MTC0        Move To CP0
MFC0        Move From CP0
TLBR        Read indexed TLB entry
TLBWI       Write Indexed TLB entry
TLBWR       Write Random TLB entry
TLBP        Probe TLB for matching entry
RFE         Restore From Exception

FPA Load/Store/Move Instructions
LWC1        Load Word to FPA
SWC1        Store Word from FPA
MTC1        Move Word to FPA
MFC1        Move Word from FPA
CTC1        Move Control word to FPA
CFC1        Move Control word from FPA

FPA Conversion Instructions
CVT.S.fmt   Floating point Convert to Single FP
CVT.D.fmt   Floating point Convert to Double FP
CVT.W.fmt   Floating point Convert to fixed point

FPA Computational Instructions
ADD.fmt     Floating point Add
SUB.fmt     Floating point Subtract
MUL.fmt     Floating point Multiply
DIV.fmt     Floating point Divide
ABS.fmt     Floating point Absolute value
MOV.fmt     Floating point Move
NEG.fmt     Floating point Negate

FPA Compare Instructions
C.cond.fmt  Floating point Compare

Table 1. IDT79R3500 Instruction Summary
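The three instruction formats of Figure 4 can be illustrated with a short field-extraction sketch. This is only an illustration in Python, not part of any IDT tool; the function names are hypothetical. The bit positions are those of Figure 4. The jump address calculation assumes the standard MIPS convention that the 26-bit target is a word address (shifted left by two) combined with the upper four bits of the address of the instruction following the jump; the data sheet itself only states that the target is combined with four bits of the Program Counter.

```python
def decode_fields(word):
    """Extract every field of Figure 4 from a 32-bit instruction word.

    All fields are returned; which ones are meaningful depends on
    whether the instruction is I-type, J-type, or R-type.
    """
    word &= 0xFFFFFFFF
    return {
        "op": (word >> 26) & 0x3F,       # bits 31:26, all formats
        "rs": (word >> 21) & 0x1F,       # bits 25:21, I- and R-type
        "rt": (word >> 16) & 0x1F,       # bits 20:16, I- and R-type
        "rd": (word >> 11) & 0x1F,       # bits 15:11, R-type
        "re": (word >> 6) & 0x1F,        # bits 10:6,  R-type
        "funct": word & 0x3F,            # bits 5:0,   R-type
        "immediate": word & 0xFFFF,      # bits 15:0,  I-type
        "target": word & 0x03FFFFFF,     # bits 25:0,  J-type
    }


def sign_extend16(value):
    """Interpret a 16-bit immediate as a signed offset."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def jump_address(next_pc, target):
    """Form a paged absolute jump address (assumed MIPS convention).

    next_pc is the address of the instruction after the jump; its
    upper four bits select the 256MB page of the destination.
    """
    return (next_pc & 0xF0000000) | ((target & 0x03FFFFFF) << 2)
```

For example, decode_fields(0x00221821) yields op 0, rs 1, rt 2, rd 3, re 0 and funct 0x21, and sign_extend16(0xFFFC) gives an offset of -4 for a base-plus-offset load or store.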
Table 1 lists the instruction set of the IDT79R3500 processor.

IDT79R3500 System Control Coprocessor (CP0)
The IDT79R3500 can operate with up to four tightly-coupled coprocessors (designated CP0 through CP3). The System Control Coprocessor (or CP0) is incorporated on the IDT79R3500 chip and supports the virtual memory system and exception handling functions of the IDT79R3500. The virtual memory system is implemented using a Translation Lookaside Buffer and a group of programmable registers as shown in Figure 5.

System Control Coprocessor (CP0) Registers
The CP0 registers shown in Figure 5 are used to control the memory management and exception handling capabilities of the IDT79R3500. Table 2 provides a brief description of each register.

[Figure: CP0 registers STATUS, CAUSE, EPC, ENTRYHI, ENTRYLO, INDEX, RANDOM, CONTEXT, and BADVA, together with the TLB (entries 63 through 0; entries 7 through 0 are not accessed by RANDOM). The legend distinguishes registers used with the virtual memory system from registers used with exception processing.]
Figure 5. The System Coprocessor Registers

Register   Description
EntryHi    High half of a TLB entry
EntryLo    Low half of a TLB entry
Index      Programmable pointer into TLB array
Random     Pseudo-random pointer into TLB array
Status     Mode, interrupt enables, and diagnostic status info
Cause      Indicates nature of last exception
EPC        Exception Program Counter
Context    Pointer into kernel's virtual Page Table Entry array
BadVA      Most recent bad virtual address
PRId       Processor revision identification (Read only)
Table 2. System Control Coprocessor (CP0) Registers

Memory Management System
The IDT79R3500 has an addressing range of 4GB. However, since most IDT79R3500 systems implement a physical memory smaller than 4GB, the IDT79R3500 provides for the logical expansion of memory space by translating addresses composed in a large virtual address space into available physical memory addresses. Two TLB modes are supported. When the TLB is used, the 4GB address space is divided into 2GB which can be accessed by both the users and the kernel, and 2GB for the kernel only. Virtual addresses within the kernel/user segment are translated to physical addresses on a 4kB page basis. This mode is typical of UNIX and other sophisticated operating systems. When the TLB is disabled, the mapping is fixed, with 2GB as kernel/user and 1.5GB as kernel only. This mode requires no TLB manipulation, provides a large linear address space, and is typical for embedded applications.

TLB (Translation Lookaside Buffer)
Virtual memory mapping is assisted by the Translation Lookaside Buffer (TLB). The on-chip TLB provides very fast virtual memory access and is well-matched to the requirements of multi-tasking operating systems. The fully-associative TLB contains 64 entries, each of which maps a 4kB page, with controls for read/write access, cacheability, and process identification. The TLB allows each user to access up to 2GB of virtual address space.

Figure 6 illustrates the format of each TLB entry. The translation operation involves matching the current Process ID (PID) and upper 20 bits of the address against the PID and VPN (Virtual Page Number) fields in the TLB. When both match (or the TLB entry is Global), the VPN is replaced with the PFN (Physical Frame Number) to form the physical address.

TLB misses are handled in software, with the entry to be replaced determined by a simple RANDOM function. The routine to process a TLB miss in the UNIX environment requires only 10-12 cycles, which compares favorably with many CPUs which perform the operation in hardware.

Register   Field    Bits    Description
ENTRYHI    VPN      63:44   Virtual Page Number
ENTRYHI    TLBPID   43:38   Process ID
ENTRYHI    0        37:32   Reserved
ENTRYLO    PFN      31:12   Physical Frame Number
ENTRYLO    N        11      Non-cacheable flag
ENTRYLO    D        10      Dirty flag (Write protect)
ENTRYLO    V        9       Valid entry flag
ENTRYLO    G        8       Global flag (ignore PID)
ENTRYLO    0        7:0     Reserved
Figure 6. TLB Entry Format

TLB Disabled Operation
Many embedded system designers prefer to avoid the complexity or uncertainty associated with the on-chip TLB, yet many systems still need a kernel/user mode. Ordinarily, implementing a hierarchical task model requires the TLB. The R3500 gives the system designer one more option, allowing the TLB to be disabled and performing a fixed mapping of virtual to physical addresses, while maintaining separation of kernel and user resources.

The user may elect to disable the TLB through the reset mode vectors. In this case, the mapping shown in Figure 8 is used, and device power consumption is reduced. Note that a "cached" segment means that there is no mechanism to exclude addresses in that region from the cache. This mapping means that applications designed to run in kseg0 and kseg1 (to avoid the TLB) can use the R3500, disable the TLB to reduce power, and not have to change software to take advantage of this new feature.

Virtual address range       Segment                                   Physical address
0xC0000000 - 0xFFFFFFFF     Kernel mapped, cacheable (kseg2)          Any
0xA0000000 - 0xBFFFFFFF     Kernel unmapped, uncached (kseg1)         Memory, 512 MB (0x00000000 - 0x1FFFFFFF)
0x80000000 - 0x9FFFFFFF     Kernel unmapped, cached (kseg0)           Memory, 512 MB (0x00000000 - 0x1FFFFFFF)
0x00000000 - 0x7FFFFFFF     Kernel/user mapped, cacheable (kuseg)     Any
The remaining physical memory (0x20000000 - 0xFFFFFFFF, 3584 MB) is reachable only through mapped segments.
Figure 7. IDT79R3500 Virtual Address Mapping

MMU Address Translation, Virtual to Physical (TLB Disabled)
Virtual address range       Segment                      Physical region
0xc0000000 - 0xffffffff     Kernel Cached (kseg2)        Kernel Cacheable Tasks, 1024 MB
0xa0000000 - 0xbfffffff     Kernel Uncached (kseg1)      Kernel Boot and I/O, 512 MB
0x80000000 - 0x9fffffff     Kernel Cached (kseg0)        Kernel Boot and I/O, 512 MB
0x00000000 - 0x7fffffff     User Cached (kuseg)          Kernel/User Cacheable Tasks, 2048 MB
(none)                                                   Inaccessible, 512 MB
NOTE: This model is consistent with the mapping available in the IDT79R3051 family. The identical mapping provides software compatibility to the lower cost CPUs.
Figure 8. TLB Disabled Mapping

Operating Modes
The IDT79R3500 has two operating modes: User mode and Kernel mode. The IDT79R3500 normally operates in the User mode until an exception is detected, forcing it into the Kernel mode. It remains in the Kernel mode until a Restore From Exception (RFE) instruction is executed. The manner in which memory addresses are translated or mapped depends on the operating mode of the IDT79R3500. Figure 7 shows the MMU translation performed for each of the operating modes.

User Mode: in this mode, a single, uniform virtual address space (kuseg) of 2GB is available. When the TLB is used, each virtual address is extended with a 6-bit process identifier field to form unique virtual addresses. All references to this segment are mapped through the TLB. Use of the cache for up to 64 processes is determined by bit settings for each page within the TLB entries. If the TLB is not used, these addresses are translated to begin at 1GB of the physical address space.

Kernel Mode: four separate segments are defined in this mode:
• kuseg: when in the kernel mode, references to this segment are treated just like user mode references, thus streamlining kernel access to user data.
• kseg0: references to this 512MB segment use cache memory but are not mapped through the TLB. Instead, they always map to the first 0.5GB of physical address space.
• kseg1: references to this 512MB segment are not mapped through the TLB and do not use the cache. Instead, they are hard-mapped into the same 0.5GB segment of physical address space as kseg0.
• kseg2: when the TLB is not used, references to this 1GB segment directly address the upper 1GB of physical address space. These addresses are accessible only in kernel mode and are cacheable.
When the TLB is used, references to this 1GB segment are always mapped through the TLB, and use of the cache is determined by bit settings within the TLB entry.

FPA COPROCESSOR OPERATION (CP1)
The FPA continually monitors the processor instruction stream. If an instruction does not apply to the coprocessor, it is ignored; if an instruction does apply to the coprocessor, the FPA executes that instruction and transfers necessary result and exception data synchronously to the main processor.

The FPA performs three types of operations:
• Loads and Stores;
• Moves;
• Two- and three-register floating-point operations.

Load, Store, and Move Operation
Load, Store, and Move operations move data between memory or the integer registers and the FPA registers. These operations perform no format conversions and cause no floating-point exceptions. Load, Store, and Move operations reference a single 32-bit word of either the Floating-Point General Registers (FGR) or the Floating-Point Control Registers (FCR).

Floating-Point Operations
The FPA supports the following single- and double-precision format floating-point operations:
• Add
• Subtract
• Multiply
• Divide
• Absolute Value
• Move
• Negate
• Compare
In addition, the FPA supports conversions between single- and double-precision floating-point formats and fixed-point formats.

The FPA incorporates separate Add/Subtract, Multiply, and Divide units, each capable of independent and concurrent operation. Thus, to achieve very high performance, floating point divides can be overlapped with floating point multiplies and floating point additions. These floating point operations occur independently of the actions of the CPU, allowing further overlap of integer and floating point operations. Figure 9 illustrates an example of the types of overlap permissible.
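Because Load, Store, and Move operations reference a single 32-bit FGR word, a double-precision value held in a 64-bit FPR is transferred as two words. The following Python sketch illustrates the pairing shown in Figure 3, where the even-numbered FGR holds bits 31:0 and the adjacent odd-numbered FGR holds bits 63:32. The function names and the list used to model the register file are hypothetical, introduced only for this illustration; no format conversion takes place, matching the description above.

```python
import struct


def double_to_fgr_words(value):
    """Split an IEEE 754 double into (low_word, high_word)."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    return bits & 0xFFFFFFFF, bits >> 32


def fgr_words_to_double(low_word, high_word):
    """Rebuild a double from the two 32-bit words of an FGR pair."""
    bits = ((high_word & 0xFFFFFFFF) << 32) | (low_word & 0xFFFFFFFF)
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


def write_fpr(fgr, even_index, value):
    """Place a double in the FPR formed by fgr[even_index] and the next FGR."""
    if even_index % 2 != 0 or not 0 <= even_index <= 30:
        raise ValueError("an FPR starts at an even FGR number from 0 to 30")
    low_word, high_word = double_to_fgr_words(value)
    fgr[even_index] = low_word        # bits 31:0 of the FPR
    fgr[even_index + 1] = high_word   # bits 63:32 of the FPR


def read_fpr(fgr, even_index):
    """Read back the double held in an FGR pair."""
    return fgr_words_to_double(fgr[even_index], fgr[even_index + 1])


# Model of the 32 Floating-Point General Registers.
fgr = [0] * 32
write_fpr(fgr, 0, 1.0)
```

After the last line, fgr[1] holds 0x3FF00000 and fgr[0] holds 0, the two halves of the double-precision encoding of 1.0.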
Exceptions
The FPA supports all five IEEE standard exceptions:
• Invalid Operation
• Inexact Operation
• Division by Zero
• Overflow
• Underflow
The FPA also supports the optional Unimplemented Operation exception, which allows unimplemented instructions to trap to software emulation routines.

The FPA provides precise exception capability to the CPU; that is, the execution of a floating point operation which generates an exception causes that exception to occur at the CPU instruction which caused the operation. This precise exception capability is a requirement in applications and languages which provide a mechanism for local software exception handlers within software modules.

[Figure: timing diagram over cycles 0 through 12 showing DIV.S, MUL.S, and ADD.S operations overlapped with LOAD (LWC1), STORE (SWC1), and non-FPU instructions. Legend: only Load, Store, and Move operations are permitted in the FPA during some cycles; other FPA instructions can proceed during other cycles, although two multiply or two divide operations cannot be overlapped; the remaining cycles are free for integer operations in the CPU.]
Figure 9. Examples of Overlapping Floating Point Operation

IDT79R3500 PIPELINE ARCHITECTURE
The execution of a single IDT79R3500 integer instruction consists of five pipe stages, while a floating point instruction takes six pipe stages. They are:
1) IF: Instruction fetch. The processor calculates the instruction address required to read from the I cache.
2) RD: The instruction is present on the data bus during phase one of this pipe stage. Instruction decode occurs during phase two. Operands are read from the registers if required.
3) ALU: Perform the required operation on instruction operands. If this is an FPA instruction, instruction execution commences.
4) MEM: Access memory. If the instruction is a load or store, the data is presented or captured during phase two of this pipe stage.
5) WB: Write integer results back into the register file. In FPA cycles this pipe stage is used for exceptions.
6) FWB: The FPA uses this stage to write back ALU results to its register file.

Each of these steps requires approximately one FPA cycle as shown in Figure 10 (parts of some operations spill over into another cycle while other operations require only 1/2 cycle).

The CPU uses a five-stage pipeline while the FPA uses a six-stage pipeline to achieve an instruction execution rate approaching one instruction per cycle. Thus, execution of six instructions at a time is overlapped as shown in Figure 11. This pipeline operates efficiently because different CPU resources (address and data bus accesses, ALU operations, register accesses, and so on) are utilized on a non-interfering basis.

[Figure: one instruction passing through the stages IF (I-Cache), RD (RF), ALU (OP), MEM (D-Cache), WB (register file write back or FP exceptions), and FWB (FP ops only), each stage taking one cycle.]
Figure 10. Instruction Execution

[Figure: six instructions, each offset by one cycle from the previous one, passing through IF, RD, ALU, MEM, WB, and FWB (FP ops only), so that six instructions are in progress during the current CPU cycle.]
Figure 11. IDT79R3500 Execution Sequence

MEMORY SYSTEM HIERARCHY
The high performance capabilities of the IDT79R3500 processor demand system configurations incorporating techniques frequently employed in large, mainframe computers but seldom encountered in systems based on more traditional microprocessors.

A primary goal of systems employing RISC techniques is to minimize the average number of cycles each instruction requires for execution. Techniques to reduce cycles per instruction include a compact and uniform instruction set, a deep instruction pipeline (as described above), and utilization of optimizing compilers. Many of the advantages obtained from these techniques can, however, be negated by an inefficient memory system.

Figure 12 illustrates memory in a simple microprocessor system. In this system, the CPU outputs addresses to memory and reads instructions and data from memory or writes data to memory. The address space is completely undifferentiated: instructions, data, and I/O devices are all treated the same. In such a system, a primary limiting performance factor is memory bandwidth.

[Figure: a Microprocessor (CPU) connected by Data and Address buses to Memory (and I/O).]
Figure 12. A Simple Microprocessor Memory System

[Figure: the IDT79R3500 Microprocessor connected by Data and Address buses to an Instruction Cache and a Data Cache, and through a Write Buffer (Data, Address) to Main Memory.]
Figure 13. An IDT79R3500 System with a High-Performance Memory System

Figure 13 illustrates a memory system that supports the significantly greater memory bandwidth required to take full advantage of the IDT79R3500's performance capabilities. The key features of this system are:
• External Cache Memory: Local, high-speed memory (called cache memory) is used to hold instructions and data that are repeatedly accessed by the CPU (for example, within a program loop) and thus reduces the number of references that must be made to the slower-speed main memory. Some microprocessors provide a limited amount of cache memory on the CPU chip itself. The external caches supported by the IDT79R3500 can be much larger; while a small cache can improve performance of some programs, significant improvements for a wide range of programs require large caches.
• Separate Caches for Data and Instructions: Even with high-speed caches, memory speed can still be a limiting factor because of the fast cycle time of a high-performance microprocessor. The IDT79R3500 supports separate caches for instructions and data and alternates accesses of the two caches during each CPU cycle.
Thus, the processor can obtain data and instructions at the cycle rate of the CPU using caches constructed with commercially available IDT static RAM devices. In order to maximize bandwidth in the cache while minimizing the requirement for SRAM access speed, the IDT79RR3500 divides a single-processor clock cycle into two phases. During one phase, the address for the data cache access is presented while data previously addressed in the instruction cache is read; during the next phase, the data operation is completed while the instruction cache is being addressed. Thus, both caches are read in a single processor cycle using only one set of address and data pins. • Write Buffer—in order to ensure data consistency, all data that is written to the data cache must also be written out to main memory. The cache write model used by the IDT79R3500 is that of a write-through cache; that is, all data written by the CPU is immediately written into the main memory. To relieve the CPU of this responsibility (and the inherent performance burden) the IDT79R3500 supports an interface to a write buffer. The IDT79R3020 Write Buffer captures data (and associated addresses) output by the CPU and ensures that the data is passed on to main memory. IDT79R3500 Processor Subsystem Interfaces Figure 14 illustrates the three subsystem interfaces provided by the IDT79R3500 processor: • Cache control interface (on-chip) for separate data and instruction caches permits implementation of off-chip caches using standard IDT SRAM devices. The IDT79R3500 directly controls the cache memory with a minimum of external components. Both the instruction and data cache can vary from 0 to 256kB (64K entries). The IDT79R3500 also includes the TAG control logic which determines whether or not the entry read from the cache is the desired data. 
The IDT79RR3500 implements an advanced feature that allows certain tag comparisons to MILITARY AND COMMERCIAL TEMPERATURE RANGES be eliminated, which in turn reduces the number of cache SRAMs required. The Int(5) reset mode vector contains two bits which sets the tag comparison options. Table 3 illustrates the tag disable encoding. The first row in the table implements the standard IDT79R3000A operating mode where all the tag and tag parity are used. The second row eliminates the upper 4 tag bits, eliminating normally required SRAMs and limiting main memory addressing to 128mB. The third row elimnates the lower 4 tag bits, which requires the cache to be at least 64kB each. The fourth row eliminates the upper 4 and lower 4 tag bits, requiring at least 16K cache entries, and limits main memory addressing to 128mB. In all cases, the IDT79R3500 continues to check tag parity which are selected as driven from the cache. The IDT79R3500 cache controller implements a direct mapped cache for high net performance (bandwidth). It has the ability to refill multiple words when a cache miss occurs, thus reducing the effective miss rate to less than 2% for large caches. When a cache miss occurs, the IDT79R3500 can support refilling the cache in 1, 4, 8, 16, or 32 word blocks to minimize the effective penalty of having to access main memory. The IDT79R3500 also incorporates the ability to perform instruction streaming; while the cache is refilling, the processor can resume execution once the missed word is obtained from main memory. In this way, the processor can continue to execute concurrently with the cache block refill. • Memory controller interface for system (main) memory. This interface also includes the logic and signals to allow operation with a write buffer to further improve memory bandwidth. In addition to the standard full word access, the memory controller supports the ability to write bytes and half-words by using partial word operations. 
The memory controller also supports the ability to retry memory accesses if, for example, the data returned from memory is invalid and a bus error needs to be signalled.

• Coprocessor Interface: The IDT79R3500 features a set of on-board, tightly coupled coprocessors. Coprocessor 0 is defined to be the system control coprocessor and Coprocessor 1 is the Floating Point Accelerator. They have direct access to the internal data bus, which allows them to load and store data directly, in the same fashion as accessing the CPU registers. This relieves the typical bottleneck of having to load data into the CPU register set and then pass that data on to the coprocessors.

In applications where the FPA was off chip, as when using the IDT79R3010A, several control pins were used for communications with the CPU, and a Phase Lock Loop was located on the IDT79R3010A to synchronize the two together. As they are now integrated into a single chip, these are no longer needed. The FpCond output, which is used in coprocessor branch instructions, is now internally tied to the CpCond(1) input of the CPU, leaving the external CpCond(1) pin available for another function. This pin can be selected to output either FpBusy or FpInt. The CpCond(1) output selection is determined at reset time according to the value read on Int(4). Table 4 illustrates the options that allow FpInt to be routed either to the CpCond(1) output or to one of the internal Int pins. If it is internally routed, that interrupt is dedicated to FpInt and the corresponding external input will no longer affect the IDT79R3500. Selecting CpCond(1) allows external logic to be added to the path, which might be required in some applications.

Another method for FpInt handling is also accommodated. A mode pin, previously a VCC pin, can be programmed to route the FPU interrupt to a dedicated FpInt output on a pin that was previously GND.
If the mode pin is sampled at reset as a 0, the dedicated FpInt output indicates the FPU interrupt; if it is sampled as a 1, the routing of Table 4 applies.

The internal CpBusy input, which is used to stall the CPU if the coprocessor needs to hold off subsequent operations, has two sources, FpBusy and the external CpBusy pin, which are logically ORed together. Further, Run and Exception of both the FPA and CPU are internally tied and brought out with the external CpBusy input to accommodate off-chip coprocessors 2 and 3. This external interface is available to support application-specific functions.

Tag Mode 1   Tag Mode 0   Check Which Tags   Ignore Which Tags
0            0            Tag (31:12)        None
0            1            Tag (27:12)        Tag (31:28)
1            0            Tag (31:16)        Tag (15:12)
1            1            Tag (27:16)        Tag (31:28; 15:12)

Table 3. Tag Disable Encoding

W Cycle   X Cycle   Y Cycle   Z Cycle   Action
X         X         X         "HIGH"    FpInt driven onto CpCond(1)
"LOW"     "LOW"     "LOW"     "LOW"     Use Int(3) for FpInt
"LOW"     "LOW"     "HIGH"    "LOW"     Use Int(1) for FpInt
"LOW"     "HIGH"    "LOW"     "LOW"     Use Int(2) for FpInt
"LOW"     "HIGH"    "HIGH"    "LOW"     Use Int(0) for FpInt
"HIGH"    "LOW"     "LOW"     "LOW"     Use Int(4) for FpInt
"HIGH"    "LOW"     "HIGH"    "LOW"     Use Int(5) for FpInt
"HIGH"    "HIGH"    "LOW"     "LOW"     Reserved, Undefined
"HIGH"    "HIGH"    "HIGH"    "LOW"     Reserved, Undefined

Table 4. Int(4) Encoding for FpInt

MULTIPROCESSING SUPPORT

The IDT79R3500 supports multiprocessing applications in a simple but effective way. Multiprocessing applications require cache coherency across the multiple processors. The IDT79R3500 offers two signals to support cache coherency. The first, MPStall, stalls the processor within two cycles of being received and keeps it from accessing the cache. This allows an external agent to snoop into the processor data cache. The second signal, MPInvalidate, causes the processor to write data on the data cache bus which indicates that the externally addressed cache entry is invalid. Thus, a subsequent access to that location would result in a cache miss, and the data would be obtained from main memory.

The two MP signals would be generated by external logic which utilizes a secondary cache to perform bus snooping functions. The IDT79R3500 does not impose an architecture for this secondary cache, but rather is flexible enough to support a variety of application-specific architectures and still maintain cache coherency. Further, there is no impact on designs which do not require this feature. The IDT79R3500 further allows the use of cache RAMs with internal address latches in multiprocessor systems.

ADVANCED FEATURES

The IDT79R3500 offers a number of additional features, such as the ability to swap the instruction and data caches, facilitating diagnostics and cache flushing. Another feature isolates the caches, which forces cache hits to occur regardless of the contents of the tag fields. The IDT79R3500 allows the processor to execute user tasks of the opposite byte ordering (endianness) from that of the operating system, has a programmable Tag bus width, and further allows certain parity checking to be disabled. More details on these features can be found in the IDT79R3000A Family Hardware User’s Manual.

Further features of the IDT79R3500 are configured during the last four cycles prior to the negation of the RESET input. These functions include the ability to select cache sizes and cache refill block sizes; the ability to utilize the multiprocessor interface; whether or not instruction streaming is enabled; whether byte ordering follows “Big-Endian” or “Little-Endian” protocols, etc. Additionally, the IDT79R3500 mode must be true to enable any of the new features that the X, Y, and Z cycles define. Table 6 shows the configuration options selected at Reset.
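The reset-time encodings of Table 3 and Table 4 can be sketched as two lookup tables. This is only an illustrative sketch: the names `TAG_MODES`, `FPINT_ROUTING`, `checked_tag_bits`, and `fpint_route` are hypothetical, the table contents are as read from Table 3 and Table 4, and the sketch does not model the separate mode pin that selects the dedicated FpInt output.

```python
# Table 3: (Tag Mode 1, Tag Mode 0) -> (tag bit ranges checked, tag bit ranges ignored).
# Each range is (high bit, low bit), inclusive.
TAG_MODES = {
    (0, 0): ([(31, 12)], []),
    (0, 1): ([(27, 12)], [(31, 28)]),
    (1, 0): ([(31, 16)], [(15, 12)]),
    (1, 1): ([(27, 16)], [(31, 28), (15, 12)]),
}

# Table 4: Int(4) sampled in the W, X, Y cycles (with Z sampled LOW)
# -> number of the Int input used for FpInt. Missing combinations are reserved.
FPINT_ROUTING = {
    ("LOW", "LOW", "LOW"): 3,
    ("LOW", "LOW", "HIGH"): 1,
    ("LOW", "HIGH", "LOW"): 2,
    ("LOW", "HIGH", "HIGH"): 0,
    ("HIGH", "LOW", "LOW"): 4,
    ("HIGH", "LOW", "HIGH"): 5,
}


def checked_tag_bits(tag_mode1, tag_mode0):
    """Return how many tag bits are compared for a given tag mode."""
    checked, _ignored = TAG_MODES[(tag_mode1, tag_mode0)]
    return sum(high - low + 1 for high, low in checked)


def fpint_route(w, x, y, z):
    """Return where FpInt is routed for the Int(4) values sampled at reset."""
    if z == "HIGH":
        # W, X, and Y are don't-care when Z is HIGH.
        return "CpCond(1)"
    line = FPINT_ROUTING.get((w, x, y))
    return "reserved" if line is None else f"Int({line})"


if __name__ == "__main__":
    for mode in sorted(TAG_MODES):
        print(mode, checked_tag_bits(*mode), "tag bits checked")
    print(fpint_route("LOW", "LOW", "LOW", "LOW"))
```

In this sketch the standard mode compares all 20 tag bits, and each eliminated group of 4 bits reduces the count accordingly, which is what allows the narrower tag SRAM arrangement described in the text.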
The reset configuration options are further discussed in the IDT79R3000A Family Hardware User’s Manual.

Input     Int0           Int1           Int2             Int3             Int4                Int5
W Cycle   DBlkSize0      IBlkSize0      DispPar/RevEnd   Reserved(1)      FPINT decode        79R3500 mode
X Cycle   DBlkSize1      IBlkSize1      IStream          StorePartial     FPINT decode        TLB disable
Y Cycle   Extend Cache   MPAdrDisable   IgnoreParity     MultiProcessor   FPINT decode        Tag Mode 1
Z Cycle   Big Endian     TriState       NoCache          BusDriveOn       FPINT onto CpCond   Tag Mode 0

NOTES:
1. Reserved entries must be driven high.
2. These values must be driven stable throughout the entire RESET period.

Table 6. R3500 Mode Selectable Features

BACKWARD COMPATIBILITY

The primary goal of the IDT79R3500 is the ability to replace the IDT79R3000A and IDT79R3010A with a single chip solution. The pinout of the IDT79R3500 has been selected to ensure this compatibility, with new functions mapped onto previously used pins. The instruction set is compatible with that of the R2000 at the binary level. As a result, code written for the older processor can be executed on the IDT79R3500. In most IDT79R3000A applications, the IDT79R3500 can be placed in the socket with no modification to initialization settings. Additionally, the IDT79R3500 can be used in systems that did not include the IDT79R3010 in the original design. Further application assistance on these topics is available from IDT.

PACKAGE THERMAL SPECIFICATIONS

The IDT79R3500 utilizes special packaging techniques to improve both the thermal and electrical characteristics of the microprocessor.

In order to improve the electrical characteristics of the device, the package is constructed using multiple signal planes, including individual power planes and ground planes to reduce noise associated with high-frequency TTL parts. In addition, the 161-pin PGA package utilizes extra power and ground pins to reduce the inductance from the internal power planes to the power planes of the PC board.

In order to improve the thermal characteristics of the microprocessor, the device is housed using cavity-down packaging. In addition, these packages incorporate a copper-tungsten thermal slug designed to efficiently transfer heat from the die to the case of the package, and thus effectively lower the thermal resistance of the package. The use of an additional external heat sink affixed to the package thermal slug further decreases the effective thermal resistance of the package.

The case temperature may be measured in any environment to determine whether the device is within the specified operating range. The case temperature should be measured at the center of the top surface opposite the package cavity (the package cavity is the side where the package lid is mounted).

The equivalent allowable ambient temperature, TA, can be calculated using the thermal resistance from case to ambient (∅ca) for the given package. The following equation relates ambient and case temperature:

TA = TC - P × ∅ca

where P is the maximum power consumption, calculated by using the maximum ICC from the DC Electrical Characteristics section. Typical values for ∅ca at various airflows are shown in Table 5 for the various CPU packages.

Airflow (ft/min)    0     200   400   600   800   1000
∅ca (161-PGA)       21    7     3     2     1     0.5
∅ca (160 MQUAD)     17    11    9     8     7     6.5

Table 5. R3500 Package Characteristics

[Figure: IDT79R3500 processor with system control coprocessor, connected to the instruction cache and data cache over the Data, Tag, and AdrLo buses (through transparent latches), with the memory interface, clock, coprocessor, and hardware interrupt Int[5:0] signal groups.]

Figure 14.
IDT79R3500 Subsystem Interfaces Example; 64KB Caches

PIN CONFIGURATION

[Figure: 161-Pin PGA (Top View), pin grid with rows A through Q and columns 1 through 15.]

NOTE:
1. AdrLo 16 and 17 are multifunction pins which are controlled by mode select programming on the interrupt pins at reset time. AdrLo 16: MP Invalidate, CpCond(2). AdrLo 17: MP Stall, CpCond(3).
2. This package is pin-compatible with the 175-pin PGA for the R3000A.
PIN CONFIGURATION

[Figure: 160-Pin EIAJ MQUAD, Top Side View (Cavity Down), pins numbered 1 through 160.]

NOTE:
1. AdrLo 16 and 17 are multifunction pins which are controlled by mode select programming on the interrupt pins at reset time. AdrLo 16: MP Invalidate, CpCond(2). AdrLo 17: MP Stall, CpCond(3).
2. This package is pin-compatible with the 175-pin PGA for the R3000A.
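As a worked illustration of the ambient temperature relation given under PACKAGE THERMAL SPECIFICATIONS, the following sketch evaluates TA = TC - P × ∅ca using the typical ∅ca values of Table 5. The function name `max_ambient`, the case temperature of 85, and the power figure of 3.0 W are hypothetical values chosen only for illustration; the actual limits come from the DC Electrical Characteristics section. The sketch also assumes ∅ca is expressed in degrees C per watt, which the table does not state explicitly.

```python
# Typical case-to-ambient thermal resistance from Table 5, indexed by airflow in ft/min.
# Units are assumed to be degrees C per watt.
THETA_CA = {
    "161-PGA": {0: 21, 200: 7, 400: 3, 600: 2, 800: 1, 1000: 0.5},
    "160-MQUAD": {0: 17, 200: 11, 400: 9, 600: 8, 800: 7, 1000: 6.5},
}


def max_ambient(t_case, power_w, package, airflow):
    """Allowable ambient temperature: TA = TC - P * theta_ca."""
    return t_case - power_w * THETA_CA[package][airflow]


if __name__ == "__main__":
    T_CASE = 85.0   # hypothetical case temperature limit, degrees C
    POWER = 3.0     # hypothetical maximum power, watts
    for package in THETA_CA:
        for airflow in (0, 200, 1000):
            t_ambient = max_ambient(T_CASE, POWER, package, airflow)
            print(f"{package:10s} {airflow:5d} ft/min -> TA = {t_ambient:.1f}")
```

With these assumed inputs, increasing airflow from 0 to 200 ft/min on the 161-pin PGA raises the allowable ambient temperature by P × (21 - 7), which shows why the package choice and airflow dominate the thermal budget.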