TriCore Family

AP32168
Application Performance Optimization for TriCore V1.6 Architecture

Application Note
V1.0, 2011-03

Microcontrollers

Edition 2011-03
Published by Infineon Technologies AG, 81726 Munich, Germany
© 2011 Infineon Technologies AG. All Rights Reserved.

LEGAL DISCLAIMER
THE INFORMATION GIVEN IN THIS APPLICATION NOTE IS GIVEN AS A HINT FOR THE IMPLEMENTATION OF THE INFINEON TECHNOLOGIES COMPONENT ONLY AND SHALL NOT BE REGARDED AS ANY DESCRIPTION OR WARRANTY OF A CERTAIN FUNCTIONALITY, CONDITION OR QUALITY OF THE INFINEON TECHNOLOGIES COMPONENT. THE RECIPIENT OF THIS APPLICATION NOTE MUST VERIFY ANY FUNCTION DESCRIBED HEREIN IN THE REAL APPLICATION. INFINEON TECHNOLOGIES HEREBY DISCLAIMS ANY AND ALL WARRANTIES AND LIABILITIES OF ANY KIND (INCLUDING WITHOUT LIMITATION WARRANTIES OF NON-INFRINGEMENT OF INTELLECTUAL PROPERTY RIGHTS OF ANY THIRD PARTY) WITH RESPECT TO ANY AND ALL INFORMATION GIVEN IN THIS APPLICATION NOTE.

Information
For further information on technology, delivery terms and conditions and prices, please contact the nearest Infineon Technologies Office (www.infineon.com).

Warnings
Due to technical requirements, components may contain dangerous substances. For information on the types in question, please contact the nearest Infineon Technologies Office. Infineon Technologies components may be used in life-support devices or systems only with the express written approval of Infineon Technologies, if a failure of such components can reasonably be expected to cause the failure of that life-support device or system or to affect the safety or effectiveness of that device or system. Life support devices or systems are intended to be implanted in the human body or to support and/or maintain and sustain and/or protect human life. If they fail, it is reasonable to assume that the health of the user or other persons may be endangered.
AP32168
Application Performance Optimization for TriCore V1.6 Architecture

Revision History: V1.0, 2011-03
Previous Version: none
Page | Subjects (major changes since last revision)

We Listen to Your Comments
Is there any information in this document that you feel is wrong, unclear or missing? Your feedback will help us to continuously improve the quality of this document. Please send your proposal (including a reference to this document) to: [email protected]

Application Note 3 V1.0, 2011-03

Table of Contents
1        Introduction ................................................. 5
1.1      Naming Convention ............................................ 5
1.2      Tool chain and Target Hardware ............................... 5
1.3      Application Benchmarks ....................................... 5
2        Application Performance Optimization ......................... 5
2.1      Performance Criteria ......................................... 5
2.2      Measuring Performance using TriCore Performance Counters ..... 6
3        Hardware Configuration ....................................... 7
3.1      CPU Clock .................................................... 7
3.2      Memory System ................................................ 9
3.2.1    PMU0 and PMU1 Flash memories ................................ 10
3.2.1.1  PMU0 and PMU1 Flash wait states configuration ............... 11
3.2.2    Caches: ICACHE and DCACHE ................................... 11
3.2.3    Instruction Cache ICACHE .................................... 12
3.2.4    Data Cache: DCACHE .......................................... 12
3.2.5    ICACHE and DCACHE configuration ............................. 13
3.2.6    Evaluating execution time in a cacheable architecture ....... 13
4        Application Software ........................................ 15
4.1      Compiler Optimizations ...................................... 15
4.1.1    Predefined compiler configurations .......................... 15
4.1.2    Optimization Tradeoffs: speed vs. code size ................. 16
4.1.3    Short addressing ............................................ 17
4.1.3.1  Near Addressing ............................................. 17
4.1.3.2  Configuration for near addressing ........................... 17
4.1.3.3  Base + Long Offset addressing using global Base Registers (A0, A1, A8, A9) .. 18
4.1.3.4  Configuration for A0 / A1 and A8 / A9 addressing ............ 18
4.1.3.5  Qualifiers controlling all addressing modes ................. 19
4.1.4    Memory Location (Typical Use Case) .......................... 19
4.2      Linker Script files ......................................... 20
4.3      Additional optimization options ............................. 20
4.3.1    Floating-point arithmetic/algorithms ........................ 20
4.3.2    Fixed-point arithmetic/algorithms ........................... 20
4.3.3    Intrinsic Functions ......................................... 20
4.3.4    Inline assembler ............................................ 21
5        Performance Optimization Checklist .......................... 21
6        Performance relevant differences TC1.6 vs. TC1.3.1 .......... 21
6.1      Pipeline .................................................... 21
6.2      Base + Long Offset addressing ............................... 22
6.3      Hardware Floating Point Unit (FPU) .......................... 22
6.4      Integer Division ............................................ 22
6.5      Performance TC1.6 vs. TC1.3.1 architecture .................. 22
7        References .................................................. 24
8        Appendix A .................................................. 25
8.1      Compiler Options ............................................ 25
8.2      Compiler optimization flags ................................. 28
8.3      Predefined compiler optimization profiles ................... 28
8.4      Compiler generated sections ................................. 29
8.5      Compiler memory qualifiers .................................. 30
8.6
Libraries ................................................... 30

1 Introduction
TriCore, the new processor family generation based on the TC1.6 core, provides a high performance architecture for embedded applications. To fully utilize its capability, this document describes the necessary hardware configuration and software optimization steps for achieving the best application performance. Hardware configuration is a well defined task of setting parameters appropriate for the target system; completeness and correctness of these settings are crucial for best performance. Software optimization is an iterative process of compiler/linker settings with no clear convergence to an optimum. The interaction between the various settings, and the absence of a single best selection satisfying all application functions, determine the complexity of the optimization. This application note guides you through the optimization process to achieve the best application performance. The included application benchmark results demonstrate the effect of various optimization settings.

1.1 Naming Convention
• TC1.6 - TriCore Architecture V1.6
• TC1.3.1 - TriCore Architecture V1.3.1
• TC1793 - Processor based on the TC1.6 Architecture
• TC1797 - Processor based on the TC1.3.1 Architecture
• SFR-R - TriCore registers stored in the Altium TASKING install directory under regtc179x.sfr

1.2 Tool chain and Target Hardware
• All included tool chain configuration data are based on the Altium TASKING VX-toolset for TriCore V3.4r1.
• Target hardware for performance evaluation: TC1793 and TC1797 TriBoards.

1.3 Application Benchmarks Included
• APP-1: Application with code size smaller than the cache size, linear code with limited loops.
• APP-2: Application with code size smaller than the cache size, including many loops.
• APP-3: Complete application with code size significantly exceeding the cache size; intensive use of load/store operations and integer arithmetic.
• APP-4: DSP fixed-point algorithm.

2 Application Performance Optimization

2.1 Performance Criteria
In this document performance optimization and measurement focus on:
• Code size
• CPU execution time

Performance evaluation is used to:
• Demonstrate that the system meets the performance criteria.
• Compare two systems to find which performs better.
• Measure which parts of the system or workload cause the system to perform badly.

In performance testing it is often crucial for the test conditions to be similar to the expected actual use. The included performance data is primarily based on real applications expected to be implemented on the described devices.

2.2 Measuring Performance using TriCore Performance Counters
Real-time measurement of core performance provides useful insights to system developers, compiler developers, application developers and OS developers. TriCore includes the ability to measure different performance aspects of the processor without any real-time effect on its execution. The following performance counters are integrated in the TriCore core module:
• Dedicated counters:
  o CCNT: CPU Clock Counter
  o ICNT: Instruction Counter
• Configurable counters, each with one of four selectable count modes:
  o M1CNT: 1. IP_DISPATCH_STALL, 2. PCACHE_HIT, 3. DCACHE_HIT, 4. TOTAL_BRANCH
  o M2CNT: 1. LS_DISPATCH_STALL, 2. PCACHE_MISS, 3. DCACHE_MISS_CLEAN, 4. PMEM_STALL
  o M3CNT: 1. LP_DISPATCH_STALL, 2. MULTI_ISSUE, 3. DCACHE_MISS_DIRTY, 4. DMEM_STALL

You can use all the available performance counters directly to measure execution time, instruction count or other values. Alternatively, dedicated performance analyzing tools integrated within debuggers can be used.
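The counters listed above deliver values only after they have been enabled (CCTRL.CE = 1, see the note under Figure 1). A minimal target-only sketch of the enable sequence, assuming the TASKING __mfcr/__mtcr/__isync intrinsics and the CCTRL symbol from the SFR file; the mask name CCTRL_CE_MASK is illustrative, not an SFR-file name:

    /* Sketch (target-only): enable the TriCore performance counters.
     * CCTRL comes from the SFR file; CCTRL_CE_MASK is an illustrative
     * name for the CE bit mask -- check the core register manual. */
    void enable_performance_counters(void)
    {
        unsigned int cctrl = __mfcr(CCTRL);   /* read counter control  */
        cctrl |= CCTRL_CE_MASK;               /* CCTRL.CE = 1          */
        __mtcr(CCTRL, cctrl);
        __isync();                            /* let the CSFR write take effect */
    }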
Instructions Per Cycle (IPC), defined as the ratio (Number of Instructions)/(CPU Clocks), is very useful as a measure of optimization progress. It reflects the compiler configuration quality, the efficiency of the memory system and the code/data location. Figure 1 shows measured IPC values of some representative benchmarks, obtained with the TriCore Performance Counters and calculated as the ratio ICNT/CCNT. This measurement is very accurate and can be evaluated without any special equipment. The measurement code shown in Figure 1:

    // Start Measure
    t1a = __mfcr(CCNT);
    t1b = __mfcr(ICNT);

    Benchmark();

    // Stop Measure
    t2a = __mfcr(CCNT);
    t2b = __mfcr(ICNT);

    cpu_clk   = t2a - t1a;
    instr_cnt = t2b - t1b;

    // CCNT, ICNT are defined in the SFR file

Figure 1 IPC Measurement with TriCore Performance Counters for APP-1 to APP-4 (higher is better; measured IPC bar values in the chart: 1.700, 0.898, 0.554, 0.476)

Note: The Performance Counters need to be enabled before they can be used. To enable all the counters, set the CCTRL.CE bit to 1.

3 Hardware Configuration
Hardware configuration is a well defined task of setting parameters appropriate for the target system, including:
• CPU and other clock frequencies
• Flash wait states
• ICACHE and DCACHE configuration

3.1 CPU Clock
The CPU clock is the most fundamental configuration parameter: it has impact on the other system settings and is a primary factor determining execution time. Figure 2 shows the clock control unit providing separate clocks for several TriCore modules. Each clock frequency is configured by a dedicated divider driven by the PLL or PLL_ERAY clocks. The output frequency is the result of the input frequency and the divider value. Due to the common input clock, the output frequency of each module is not freely selectable. Maximal allowable frequencies and some defined constraints determine the possible clock frequencies.
Figure 2 Clock Control Unit

Rules for the calculation of the module clocks:
• fCPU : fFSI = 1:1 or 2:1
• fCPU : fFPI = n:1 (n = 1...16)
• fPCP : fFPI = 1:1 or 2:1
• fFPI : fMCDS = 1:n, or the special case fFPI : fMCDS = 2:3
• fCPU : fSRI = 1:1
• fMCDS : fCPU = 1:2, 1:1 or 2:1
• fBBB : fMCDS = 1:1 or 1:2

Table 1 Examples of module clock configurations

fPLL [MHz] | SRIDIV | fCPU [MHz] | FSIDIV | fFSI [MHz] | PCPDIV | fPCP [MHz] | FPIDIV | fFPI [MHz]
600        | 1      | 300        | 3      | 150        | 2      | 200        | 5      | 100
520        | 1      | 260        | 3      | 130        | 2      | 173.3      | 5      | 86.6
200        | 0      | 200        | 1      | 100        | 0      | 200        | 1      | 100

Note: To obtain the real clock divider values from Table 1 you need to add 1 to the divider register values (e.g. FSIDIV = 3 means divide by (3+1)).
Note: Configuration of the PLL module used for generation of the CPU clock involves the correct sequence of register configuration, monitoring of the lock condition and the final configuration.

3.2 Memory System
The gap between processor and memory performance is steadily growing. To compensate for this speed difference, primarily between the PMU Flash units and the TriCore CPU, two 16KB caches, ICACHE for instructions and DCACHE for data, are part of the TriCore memory hierarchy. The 32KB Program Scratchpad RAM (PSPR) and the 128KB Data Scratchpad RAM (DSPR) provide fast, deterministic program fetch and data access for use by performance critical code sequences.

Figure 3 TriCore Memory Hierarchy (TC1.6): the TriCore CPU connects through the PMI (32KB PSPR, 16KB ICACHE) and DMI (128KB DSPR, 16KB DCACHE) to the Cross Bar Interconnect (SRI), which links PMU0 (2MB PFlash, 192KB DFlash, 16KB BROM), PMU1 (2MB PFlash), the LMU (128KB SRAM), the EBU bridge to external RAM/Flash, and the PCP2 core (16KB PRAM, 32KB CMEM).

Table 2 contains additional details of the available TriCore memories, organized from fastest to slowest. The "Memory Config.
WS" column lists the configurable wait states, which must be set according to the working conditions. The "Segment NC/C" (not cacheable/cacheable) columns contain the segments through which each memory is addressable and whether the access is cacheable or not.

Table 2 TriCore TC1.6 Memories (starting from the fastest)

Memory Type                   | Config. WS            | Segment NC* | Segment C | Remarks
PSPR: Program Scratch-Pad RAM | -                     | 0xC         | -         | Used for deterministic and performance critical code (e.g. DSP algorithms, OS)
ICACHE: Instruction Cache     | -                     | -           | -         | Reduces significantly the average access time of memories used through cacheable segments
DSPR: Data Scratch-Pad RAM    | -                     | 0xD         | -         | Used for deterministic and performance critical data access (e.g. DSP algorithms, OS)
DCACHE: Data Cache            | -                     | -           | -         | Reduces significantly the average access time of memories used through cacheable segments
PMU0 - PFlash: Program Flash  | 1..15                 | 0xA         | 0x8       | See WS Table 3
PMU1 - PFlash: Program Flash  | 1..15                 | 0xA         | 0x8       | See WS Table 3
LMU SRAM                      | -                     | 0xB         | 0x9       | Used as overlay RAM, general data and code
EXT RAM                       | Memory type dependent | 0xA         | 0x8       | To be set according to the used memory type and EBU frequency
EXT FLASH                     | Memory type dependent | 0xA         | 0x8       | To be set according to the used memory type and EBU frequency

*NC - Not cacheable, C - Cacheable

3.2.1 PMU0 and PMU1 Flash memories
The CPU and the PMU Flashes are connected to separate clocks, derived from the same PLL output but using the dedicated clock dividers SRIDIV and FSIDIV. The Flashes and the CPU can use the same clock frequency if it does not exceed 150 MHz; otherwise a higher FSIDIV value has to be used, because the FSI frequency is limited to 150 MHz. The required number of wait states for an initial access to PFlash or DFlash is related to the maximum FSI frequency.
Because the default after reset is a worst case setting sufficient for all frequencies, the access times have to be configured by the user according to the application's frequency for optimum performance. The wait states (in number of FSI clock cycles) are configured via the 4-bit fields WSPFLASH and WSDFLASH in the FCON register (Flash Configuration Register). Table 3 includes example configurations for some selected CPU frequencies of the TC1.6 derivatives. The PLL frequency determines the FSIDIV value, which is selected to keep the FSI frequency below 150 MHz.

Table 3 PMU0, PMU1 Flash wait states (WS) with PFlash Ta = 26 ns and DFlash Ta = 50 ns

fPLL [MHz] | SRIDIV | fCPU [MHz] | FSIDIV | fFSI [MHz] | PFlash WS computed | PFlash WS rounded | DFlash WS computed | DFlash WS rounded
600        | 1      | 300        | 3      | 150        | 3.9                | 4                 | 7.5                | 8
520        | 1      | 260        | 3      | 130        | 3.4                | 4                 | 6.5                | 7
200        | 0      | 200        | 1      | 100        | 2.6                | 3                 | 5.0                | 5

Note: The calculated wait states are relative to fFSI. To calculate the wait states relative to fCPU: WScpu = (fCPU / fFSI) * WS; e.g. the PFlash wait states relative to the CPU are 8, 8 and 6.

3.2.1.1 PMU0 and PMU1 Flash wait states configuration
To configure the FSI clock divider use the CCUCON0 register (SFR-R SCU_CCUCON0):
• FSIDIV = bits [19:16] (divide value = FSIDIV + 1, e.g.
FSIDIV = 3 means divide by 4)

To configure the PMU0 Program and Data Flash wait states use the FLASH0 FCON register (SFR-R FLASH0_FCON) bit fields:
• WSPFLASH = bits [3:0]
• WSECPF = bit [4]
• WSDFLASH = bits [11:8]
• WSECDF = bit [12]

To configure the PMU1 Program and Data Flash wait states use the FLASH1 FCON register (SFR-R FLASH1_FCON) bit fields:
• WSPFLASH = bits [3:0]
• WSECPF = bit [4]
• WSDFLASH = bits [11:8]
• WSECDF = bit [12]

Figure 4 IPC versus PMU0 Program Flash wait states for APP-1 to APP-4 (higher is better; chart legend: 4, 6, 8 and 15 wait states; annotated relative IPC differences: 0.0%, -0.1%, -2.6%, -3.6%)

3.2.2 Caches: ICACHE and DCACHE
The 16KB ICACHE and the 16KB DCACHE are used to reduce the average access time of the much slower but much bigger PMU0/PMU1 Flashes, LMU RAM and external RAM/Flash. The code and data used can be fully cacheable, partially cacheable or not cacheable, depending on the application requirements.

3.2.3 Instruction Cache ICACHE
The ICACHE is a four-way set-associative cache with a Pseudo Least-Recently-Used (PLRU) replacement algorithm. Each ICACHE line contains 256 bits of instructions along with a single associated valid bit and associated ECC bits.

Code located in PMU0/PMU1, external RAM/Flash or the LMU is cacheable if all of these conditions are fulfilled:
• The code is located in the 0x8 segment (non-LMU memories) or, for the LMU, in the 0x9 segment
• The ICACHE is enabled (default, not configurable)
• The ICACHE bypass is deactivated

Code located in PMU0/PMU1, external RAM/Flash or the LMU is not cacheable if one of these conditions is fulfilled:
• The code is located in the 0xA segment (non-LMU memories)
• The code is located in the 0xB segment (LMU)
• The ICACHE bypass is activated

Note: By default all the segments are defined as described above. In addition, the PMA0 register can partially change the cacheability of some segments.
Note: Code can be located in the 0x8 segment (or, for the LMU, the 0x9 segment) and still be not cacheable.
Note: If some code should be not cacheable (e.g. for deterministic behavior), locate it in the 0xA segment or, for the LMU, in the 0xB segment.

Code located in the following memories can be cacheable:
• PMU0 PFlash and BROM
• PMU1 PFlash
• External memories
• LMU

3.2.4 Data Cache: DCACHE
The DCACHE is a four-way set-associative cache with a Pseudo Least-Recently-Used (PLRU) replacement algorithm:
• Cache line size: 256 bits
• Validity granularity: one valid bit per cache line
• Write-back cache: write-back granularity of 256 bits
• Refill mechanism: full cache line refill

Data located in PMU0/PMU1, external RAM/Flash or the LMU is cacheable if all of these conditions are fulfilled:
• The data is located in the 0x8 segment (non-LMU memories) or, for the LMU, in the 0x9 segment
• The DCACHE is enabled (default, not configurable)
• The DCACHE bypass is deactivated

Data located in PMU0/PMU1, external RAM/Flash or the LMU is not cacheable if one of these conditions is fulfilled:
• The data is located in the 0xA segment (non-LMU memories)
• The data is located in the 0xB segment (LMU)
• The DCACHE bypass is activated

Note: By default all the segments are defined as described above. In addition, the PMA0 register can partially change the cacheability of some segments.
Note: Data can be located in the 0x8 segment (or, for the LMU, the 0x9 segment) and still be not cacheable.
Note: If some data should be not cacheable (e.g. for deterministic behavior), locate it in the 0xA segment or, for the LMU, in the 0xB segment.

Data located in the following memories can be cacheable:
• PMU0 PFlash and DFlash
• PMU1 PFlash
• External memory
• LMU

3.2.5 ICACHE and DCACHE configuration
After reset, DCACHE and ICACHE are enabled but bypassed.
To use the ICACHE you need to disable the bypass:
• Set the configuration register bit PCON0.PCBYP = 0 (SFR-R PCON0)

To use the DCACHE you need to disable the bypass:
• Set the configuration register bit DCON0.DCBYP = 0 (SFR-R DCON0)

3.2.6 Evaluating execution time in a cacheable architecture
The application code or benchmarks to be optimized, and whose achieved execution speed is measured, often consist of a small selected part of an application or, in special cases, the complete code. During the optimization process, reducing the execution time is one of the most important targets. The progress can be evaluated based on the relative performance compared to other settings measured under the same conditions. In case the absolute (not relative) execution times are important, the cache impact under different test conditions is noticeable and the interpretation of the results is not straightforward.

By running the same application more than once (assuming the same code and conditions), different run time values can be observed for the first and the following runs. Different behavior can also be observed for code small enough to fit completely in the cache compared to a big application causing cache swapping. For small applications, the first execution starts with empty caches, while in the following runs the complete code and data are already in the caches, executing as fast as from the scratchpad RAMs. The run time differences depend on the code and data structure, as seen in Figure 5 for APP-1 and APP-2. Small differences are a hint of insensitivity to cache swapping and to the Flash wait states when using cacheable segments. Mapping the first/second execution time results to the real execution environment is not straightforward. For a big application, cache swapping takes place already in the first as well as in the following runs; the execution time of the second run should be similar to the real execution environment.
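The first-run versus second-run comparison described above can be sketched with the cycle counter from Section 2.2. This is a target-only sketch: __mfcr and CCNT are assumed to come from the TASKING intrinsics and the SFR file, and Benchmark() is a placeholder for the code under test:

    /* Sketch (target-only): run the same benchmark twice to expose the
     * cold-cache (run1) versus warm-cache (run2) difference. */
    extern void Benchmark(void);

    void measure_cold_vs_warm(unsigned int *run1_cycles, unsigned int *run2_cycles)
    {
        unsigned int t0, t1, t2;

        t0 = __mfcr(CCNT);
        Benchmark();          /* run1: starts with empty caches            */
        t1 = __mfcr(CCNT);
        Benchmark();          /* run2: code/data cached (small application) */
        t2 = __mfcr(CCNT);

        *run1_cycles = t1 - t0;
        *run2_cycles = t2 - t1;
    }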
Figure 5 IPC dependency on the ICACHE and DCACHE states (run1, run2) for APP-1 to APP-4; higher is better (bar value labels recovered from the chart: 1.70, 1.72, 0.78, 0.90, 0.90, 0.55, 0.55, 0.48)

4 Application Software

4.1 Compiler Optimizations
The most significant opportunity for influencing the performance of a given application is compiler and linker optimization. Optimizing is a tradeoff between code size and performance. Code optimization is primarily controlled by a set of optimization flags which together make up an optimization profile. Proven predefined optimization profiles are usually provided by the compiler vendors, with an additional tradeoff parameter for speed or code size.

Application performance optimization is an iterative process of:
• Selecting an optimization profile
• Executing
• Recording the resulting execution time and code size
• Comparing with other results
• Repeating if not satisfied

Two main approaches can be used:
• Global optimization: the same optimization setting is applied to the complete code.
• Profiling-driven selective optimization: based on profiling information the critical functions are identified and the best (custom) optimization is applied to each.

The focus in this document is on global optimization; the included results are also based on global optimization.

Note: Before you start with compiler optimization you should be aware that many options are predefined (having default values), influencing the performance without being explicitly defined. To see the option summary including the default values, run the ctc.exe using the -? option (ctc -?)
located in the Altium TASKING compiler directory.

4.1.1 Predefined compiler configurations
Unless you have your own proven preferred configuration, you should start the optimization process with the available predefined optimization profiles -O0 to -O3.

Table 4 Predefined compiler optimization profiles

--optimize[=flags] / -Oflags | Description
--optimize=0 / -O0 | No optimization; alias for -OaCEFGIKLMNOPRSUVWY
--optimize=1 / -O1 | Optimize; alias for -OaCefgIKLMNOPRSUVWy
--optimize=2 / -O2 | Optimize more (default); alias for -OacefgIklMNoprsUvwy
--optimize=3 / -O3 | Optimize most; alias for -OacefgiklmNoprsuvwy

The predefined optimization profiles, each consisting of 18 compiler optimization flags (a small letter means active), build proven recommended configurations. You can see the differences between the various settings by comparing the different letters; for example, -O3 has the additional optimizations m and u activated. You can use your own setting by explicitly defining all options (letters), or make a small modification to an available configuration; e.g. to use -O2 but disable code compaction (r/R), use -O2 -OR.
Table 5 C compiler optimization flag definitions

--optimize[=flags] / -O[=flags] | Description
+/-coalesce  / a/A | Coalescer: remove unnecessary moves
+/-cse       / c/C | Common subexpression elimination
+/-expression / e/E | Expression simplification
+/-flow      / f/F | Control flow simplification
+/-glo       / g/G | Generic assembly code optimizations
+/-inline    / i/I | Automatic function inlining
+/-schedule  / k/K | Instruction scheduler
+/-loop      / l/L | Loop transformations
+/-simd      / m/M | Perform SIMD optimizations
+/-align-loop / n/N | Align loop bodies
+/-forward   / o/O | Forward store
+/-propagate / p/P | Constant propagation
+/-compact   / r/R | Code compaction (reverse inlining)
+/-subscript / s/S | Subscript strength reduction
+/-unroll    / u/U | Unroll small loops
+/-ifconvert / v/V | Convert IF statements using predicates
+/-pipeline  / w/W | Software pipelining
+/-peephole  / y/Y | Peephole optimizations

The optimization profiles are always extended and influenced by the --tradeoff parameter controlling the balance of speed versus code size.

4.1.2 Optimization Tradeoffs: speed vs. code size
An important part of the optimization is finding the appropriate tradeoff between code size and performance, using the tradeoff parameter --tradeoff={0|1|2|3|4} or -t{0|1|2|3|4} (default: --tradeoff=4, optimize for size):
• --tradeoff=0: optimize for speed
• --tradeoff=2: balance speed and size
• --tradeoff=4: optimize for code size

If the compiler uses certain optimizations (e.g. -O2), you can use this option to specify whether they should optimize for more speed (regardless of code size) or for smaller code size (regardless of speed). If you have not specified the option --optimize (or -O), the compiler uses the default "optimize more" setting (-O2); in this case it is still useful to specify a tradeoff level. It is recommended to start with the settings -O2 and --tradeoff=2, then analyze the speed and code size variation as you modify the optimization (-O) and --tradeoff settings using your target application.
In addition to the compiler optimization settings, using efficient addressing modes can significantly improve the overall performance and code size, as described in the next paragraphs.

4.1.3 Short addressing
The TriCore architecture has an address width of 32 bit and can access up to 4 GB of memory. Within limited addressing ranges, more efficient code and faster execution can be achieved. Absolute addressing is available for the first 16 KB of each segment. Base + Long Offset addressing using the global base registers (A0, A1, A8, A9) provides efficient data access within an address range of 64KB. Appropriate tool-chain settings are required to use these memory ranges.

4.1.3.1 Near Addressing
Figure 6 shows the location of the near segments, occupying the first 16KB of each 256MB TriCore memory segment. You can use them to locate variables (initialized or not) and constants. The included example demonstrates the efficiency of the absolute addressing used in the near segments. Table 6 shows the memories suitable for near addressing.

Note: Do not block the near segments with CSA, stacks, etc.
that are not accessed with near addressing.

Figure 6 Near versus far Addressing

Table 6 Memories suitable for near data addressing (memories including the 0..0x3FFF range)
Memory        Data types                              Compiler default Section
PMU0-PFlash   constant                                .zrodata
DSPR          variables, initialized/uninitialized    .zdata .zbss .nearnoclear
LMU           variables, initialized/uninitialized    .zdata .zbss .nearnoclear

4.1.3.2 Configuration for Near Addressing
Near addressing can be enabled by:
• Compiler option: --default-near-size[=threshold] (puts all data objects with a size of [threshold] bytes or smaller in __near sections; default threshold = 8)
• Pragma in C source code: default_near_size [value] [default | restore]
• Memory qualifier: __near

When the compiler option is used, the setting is valid for the entire program, provided the same options are used for all modules. Pragmas overrule the compiler options and can be used for defined code blocks. With the __near memory qualifier you can control the location of individual data objects.

4.1.3.3 Base + Long Offset Addressing using Global Base Registers (A0, A1, A8, A9)
Figure 7 shows the location of the A0/A1 and A8/A9 addressing ranges, each occupying a 64 KB memory segment. The simple example demonstrates the efficiency of this addressing mode.

Figure 7 A0/A1 versus far addressing

The A0 register is only available for variables (initialized or not). The A1 register is only available for constants. A8 and A9 are both available for variables (initialized or not) and for constants.

Table 7 Memories suitable for A0/A1 addressing
Memory        Data types                                   Compiler default Section
PMU0-PFlash   constant (A1)                                .ldata
DSPR*         variables, initialized/uninitialized (A0)    .sdata .sbss
LMU           variables, initialized/uninitialized (A0)    .sdata .sbss
Ext. RAM      variables, initialized/uninitialized (A0)    .sdata .sbss
Ext. FLASH    constant (A1)                                .ldata

Table 8 Memories suitable for A8/A9 addressing
Memory        Data types                                Compiler default Section
PMU0-PFlash   constant                                  .rodata_a8 or .rodata_a9
DSPR*         variables, initialized/uninitialized      .data_a8 .bss_a8 or .data_a9 .bss_a9
LMU           variables, initialized/uninitialized      .data_a8 .bss_a8 or .data_a9 .bss_a9
Ext. RAM      variables, initialized/uninitialized      .data_a8 .bss_a8 or .data_a9 .bss_a9
Ext. FLASH    constant                                  .rodata_a8 or .rodata_a9

4.1.3.4 Configuration for A0/A1 and A8/A9 Addressing
A0/A1 addressing can be enabled by:
• Compiler option: --default-a0-size[=threshold] (puts all data objects with a size of [threshold] bytes or smaller in A0 sections; default threshold = 0)
• Compiler option: --default-a1-size[=threshold] (puts all data objects with a size of [threshold] bytes or smaller in A1 sections; default threshold = 0)
• Pragma in C source code: default_a0_size [value] [default | restore]
• Pragma in C source code: default_a1_size [value] [default | restore]
• Memory qualifiers: __a0, __a1

A8/A9 addressing can be enabled by:
• Memory qualifiers: __a8, __a9

When the compiler options are used, the settings are valid for the entire program, provided the same options are used for all modules. Pragmas overrule the compiler options and can be used for defined code blocks. With the __a0, __a1, __a8 and __a9 memory qualifiers you can control the location of individual data objects.

Note: The A8/A9 location options do not include a default-a8/a9-size threshold option as available for A0/A1 (e.g. --default-a1-size).

4.1.3.5 Pragmas Controlling All Addressing Modes
The following pragmas control all addressing modes:
• for_constant_data_use_memory memory
• for_extern_data_use_memory memory
• for_initialized_data_use_memory memory
• for_uninitialized_data_use_memory memory

These use the specified memory for the type of data mentioned in the pragma name. You can specify the following memories: near, far, a0, a8 or a9. For the pragma for_constant_data_use_memory you can also specify the a1 memory.

4.1.4 Memory Location (Typical Use Case)
Figure 8 Memory locations (Typical Use Case)

4.2 Linker Script Files
The TriCore memory system, with its hierarchy of different memories and caches, plays an important role in overall system performance. Controlling the location of data and code, and whether it should be cacheable or not, in fast or slow memories, is primarily resolved by linker script files. Altium TASKING provides a predefined linker script file for each device, stored in *.lsl files.

4.3 Additional Optimization Options
You can further improve the performance of your software if the following measures are suitable for your application.

4.3.1 Floating-Point Arithmetic/Algorithms
FPU: The TriCore architecture includes a high-performance, IEEE-754 compliant, single-precision Floating Point Unit. Set the compiler option --fpu-present to use the FPU instead of the emulation library.

Double as float: In some cases, double-precision floating-point data is included in the source code (e.g. by model-based automatic code generation) although single precision would suffice. By setting the --no-double option, the compiler treats variables of type double as float.
Exception handling: Hardware-implemented FPU exception monitoring with appropriate trapping functionality releases the application from continuous monitoring in software.

4.3.2 Fixed-Point Arithmetic/Algorithms
TriLib - TriCore DSP Library: A highly optimized, C-callable library of common DSP algorithms using fixed-point arithmetic, implemented in hand-coded assembly.

4.3.3 Intrinsic Functions
Some specific assembly instructions have no equivalent in C. Intrinsic functions make it possible to use these instructions. Intrinsic functions are predefined functions that are recognized by the compiler, which always inlines the corresponding assembly instructions in the assembly source (rather than calling them as functions).

Table 9 Intrinsic Functions overview
Intrinsic function group                  Example                                                   Description
Minimum and maximum of (short) integers   int __min( int, int )                                     Return minimum of two integers
Fractional data type support              __sfract __round16( __fract )                             Convert __fract to __sfract
Packed data type support                  char __extractbyte1( __packb )                            Extract first byte from a __packb
Interrupt handling                        void __enable( void )                                     Enable interrupts immediately at function entry
Insert single assembly instruction        void __nop( void )                                        Insert NOP instruction
Register handling                         int __clz( int )                                          Count leading zeros in int
Insert / extract bit-fields and bits      void __putbit( int value, int* address, int bitoffset )   Store a single bit

4.3.4 Inline Assembler
Using inline assembly you can use assembly instructions in the C source and pass C variables as operands to the assembly code. It is primarily useful for small, efficient code sequences or for target-specific operations not available in ANSI C.
5 Performance Optimization Checklist

Table 10 Hardware Configuration
Description                Details                                                            Default (after reset)
CPU Clock                  Check the CPU clock frequency (see 3.1)                            Free running mode (~17 MHz)
ICACHE                     Verify the ICACHE is usable, i.e. bypass is disabled (see 3.2.5)   Enabled but bypassed
DCACHE                     Verify the DCACHE is usable, i.e. bypass is disabled (see 3.2.5)   Enabled but bypassed
PMU0 FLASH0, PMU1 FLASH1   Check the wait-state settings for PFLASH and DFLASH (see 3.2.1.1)  PFLASH: 8 wait states, DFLASH: 15 wait states
Performance Counters       Must first be enabled before use (see Note)                        Disabled

Table 11 Software Configuration
Description                        Details                                                                                                          Default
Compiler configuration             To check the used compiler configuration, see the generated *.src files.                                         You can see the default settings by running ctc -?
Linker configuration and location  Check the MAP file to control the used memories, segments and data/code location.                                -
Use of cacheable segments          To use the caches, in addition to the HW configuration the code/data must be located in a cacheable segment (see 3.2.3 and 3.2.4).   Default
Use of FPU                         The FPU should be used instead of the emulation library. To verify, check the MAP file; it should include libc_fpu.a or libcs_fpu.a and libfp_fpu.a. See 4.3.1 and Table 17.   Derived from the --cpu option or explicitly defined by --fpu-present
Double as single precision         If applicable, replace double with single precision; libcs_fpu.a instead of libc_fpu.a will be used. See 4.3.1 and Table 17.   Not set.

6 Performance Relevant Differences: TC1.6 vs. TC1.3.1
The TriCore 1.6 architecture includes many hardware improvements that affect application performance but are not visible to the software developer. Still, there are changes, such as a longer pipeline, new instructions, and extended instruction functionality, that affect performance and should be considered in the optimization process.

6.1 Pipeline
The longer pipeline of TC1.6 affects optimal coding for dual-issue instructions involving the Load/Store and Integer pipelines. This is primarily relevant for hand-coded assembly implementations.
For C code, the compiler is already adapted to handle this change.

6.2 Base + Long Offset Addressing
In TC1.3.1, Load/Store instructions using Base + Long Offset addressing are limited to the Word data type. TC1.6 supports the Word, Halfword and Byte data types. This addressing mode, used for the A0/A1 and A8/A9 addressing segments, can therefore be used for all three data types over the full segment range of 64 KB.

6.3 Hardware Floating Point Unit (FPU)
TC1.6 implements a fully pipelined FPU working in parallel with the existing integer pipeline. Overall higher floating-point performance can be expected.

6.4 Integer Division
TC1.3.1 32-bit/32-bit division (~20 cycles):

  dvinit  e8,d4,d0
  dvstep  e8,e8,d0
  dvstep  e8,e8,d0
  dvstep  e8,e8,d0
  dvstep  e8,e8,d0
  dvadj   e8,e8,d0

In TC1.6, the new instructions DIV and DIV.U have been implemented, executing in ~9 cycles.

6.5 Performance: TC1.6 vs. TC1.3.1 Architecture
Figure 9 and Figure 10 show the performance comparison of both architectures under the same test conditions. In general, the new TC1.6 architecture is faster and generates smaller code.

Figure 9 Execution Time of TC-1.6 vs. TC-1.3.1 (same configuration and CPU Clk)

[Figure 10 plot: relative code size (lower is better) of TC-1.3.1 vs. TC-1.6 for APP-1 to APP-4; TC-1.6 deltas: -11.9%, -7.1%, -3.5%, 0.0%]

Figure 10 Code Size of TC-1.6 vs.
TC-1.3.1 (same configuration and CPU Clk)

7 References
[1] Infineon: TriCore V1.6 User Manual (Vol. 1)
[2] Infineon: TriCore V1.6 User Manual (Vol. 2)
[3] Altium TASKING User Manual: ctc_user_guide.pdf
[4] Altium TASKING AppNote: LSL Sample Cases using the Control Program

8 Appendix A
Summary of Altium TASKING configuration options and naming conventions

8.1 Compiler Options

Table 12 Compiler Options (command line syntax)

--optimize[=flags] (-Oflags), default: -O2
C compiler optimization options. Options can have flags or suboptions. To switch a flag on, use a lowercase letter or a +longflag. To switch a flag off, use an uppercase letter or a -longflag. Separate longflags with commas. See also Table 13 and Table 14.

--tradeoff={0|1|2|3|4} (-t{0|1|2|3|4}), default: 4
If the compiler uses certain optimizations (option --optimize), you can use this option to specify whether the used optimizations should optimize for more speed (regardless of code size) or for smaller code size (regardless of speed).

--default-near-size[=threshold] (-N[threshold]), default: 8
With this option you can specify a threshold value for __near allocation. If you do not specify __near or __far in the declaration of an object, the compiler chooses where to place the object. The compiler allocates objects smaller than or equal to the threshold in __near sections. Larger objects are allocated in __a0, __a1 or __far sections. If you omit a threshold value, all objects will be allocated __near, including arrays and string constants.

--default-a0-size[=threshold] (-Z[threshold]), default: 0
With this option you can specify a threshold value for __a0 allocation. Used for data: initialized/uninitialized (a0data, a0bss). Allocation in __a0 memory means that the object is addressed indirectly, using A0 as the base pointer. The total amount of memory that can be addressed this way is 64 KB. If you do not specify a memory qualifier such as __near or __far in the declaration of an object, the compiler chooses where to place the object based on its size. First, the size of the object is checked against the near size threshold, according to the description of the --default-near-size (-N) option. If the size of the object is larger than the near size threshold, but smaller than or equal to the a0 size threshold, the object is allocated in __a0 memory. Larger objects, arrays and strings will be allocated __far.

--default-a1-size[=threshold] (-Y[threshold]), default: 0
Same as --default-a0-size, but used for constants (a1rom).

--fpu-present, default: -
With this option the compiler can generate single-precision floating-point instructions in the assembly file. If you select a valid target processor (command line option --cpu (-C)), this option is automatically set, based on the chosen target processor.

--core=core, default: derived from --cpu, if used, otherwise tc1.3
With --core=tc1.6, the compiler can generate TriCore 1.6 instructions in the assembly file. If you select a valid target processor (command line option --cpu (-C)), the core is automatically set, based on the chosen target processor.

--cpu=cpu (-Ccpu), default: -
With this option you define the target processor for which you create your application. Based on this option the compiler includes the special function register file regcpu.sfr, unless you disable the option "Automatic inclusion of '.sfr' file" on the Preprocessing page (option --no-tasking-sfr). Based on the target processor the compiler automatically detects whether an FPU unit is present and whether the architecture is a TriCore 1.6. This means you do not have to specify the compiler options --fpu-present and --core=tc1.6 explicitly when one of the supported derivatives is selected.

--switch=auto, default: auto
You can give one of the following arguments: auto (choose most optimal code), jumptab (generate jump tables), linear (use linear jump chain code), lookup (generate lookup tables).

--align=value, default: 0
By default the C compiler aligns objects to the minimum alignment required by the architecture. With this option you can increase this alignment for objects of four bytes or larger. The value must be a power of two.

--inline-max-size=threshold, default: -1
With this option you can specify the maximum size of functions that the compiler inlines as part of the optimization process. The compiler always inlines all functions that are smaller than the specified threshold. The threshold is measured in compiler-internal units, and the compiler uses this measure to decide which functions are small enough to inline. The default threshold is -1, which means that the threshold depends on the option --tradeoff.

--inline-max-incr=percentage, default: -1
After the compiler has inlined all functions that have the function qualifier inline and all functions that are smaller than the specified threshold, the compiler checks whether it can inline more functions without increasing the code size too much. With this option you can specify how much the code size is allowed to increase. The default value is -1, which means that the value depends on the option --tradeoff.

--immediate-in-code, default: -
By default the TriCore C compiler creates a data object to represent an immediate value of 32 or 64 bits, then loads this constant value directly into a register. With this option you can tell the compiler to code the immediate values directly into the instructions, thus using less data but more code. When option --default-near-size < 4, 32-bit immediates will be coded into instructions anyhow; when it is >= 4 they will be located in neardata. When --default-near-size < 8, 64-bit immediates will be located in fardata; when it is >= 8 they will be located in neardata as well.

--compact-max-size=value, default: 200
This option is related to the compiler optimization --optimize=+compact (code compaction or reverse inlining). Code compaction is the opposite of inlining functions: large sequences of code that occur more than once are transformed into a function. This reduces code size (possibly at the cost of execution speed). However, in the process of finding sequences of matching instructions, compile time and compiler memory usage increase quadratically with the number of instructions considered for code compaction. With this option you tell the compiler to limit the number of matching instructions it considers for code compaction.

--max-call-depth=value, default: -1
This option is related to the compiler optimization --optimize=+compact (code compaction or reverse inlining). During code compaction it is possible that the compiler generates nested calls. This may cause the program to run out of its stack. To prevent stack overflow caused by too deeply nested function calls, you can use this option to limit the call depth. This option can have the following values:
-1: poses no limit to the call depth (default)
0: the compiler will not generate any function calls (effectively the same as if you turned off code compaction with option --optimize=-compact)
>0: code sequences are only reversed if this will not lead to code at a call depth larger than specified with value. Function calls will be placed at a call depth no larger than value-1. (Note that if you specified a value of 1, the option --optimize=+compact may remain without effect when code sequences for reversing contain function calls.)
8.2 Compiler Optimization Flags

Table 13 Compiler --optimize (-O) Flags (command line syntax)
--optimize[=flags]   -O[=flags]   Description
+/-coalesce          a/A          Coalescer: remove unnecessary moves
+/-cse               c/C          Common subexpression elimination
+/-expression        e/E          Expression simplification
+/-flow              f/F          Control flow simplification
+/-glo               g/G          Generic assembly code optimizations
+/-inline            i/I          Automatic function inlining
+/-schedule          k/K          Instruction scheduler
+/-loop              l/L          Loop transformations
+/-simd              m/M          Perform SIMD optimizations
+/-align-loop        n/N          Align loop bodies
+/-forward           o/O          Forward store
+/-propagate         p/P          Constant propagation
+/-compact           r/R          Code compaction (reverse inlining)
+/-subscript         s/S          Subscript strength reduction
+/-unroll            u/U          Unroll small loops
+/-ifconvert         v/V          Convert IF statements using predicates
+/-pipeline          w/W          Software pipelining
+/-peephole          y/Y          Peephole optimizations

8.3 Predefined Compiler Optimization Profiles

Table 14 Predefined compiler optimization profiles
--optimize[=flags]   -Oflags   Description
--optimize=0         -O0       No optimization; alias for -OaCEFGIKLMNOPRSUVWY
--optimize=1         -O1       Optimize; alias for -OaCefgIKLMNOPRSUVWy
--optimize=2         -O2       Optimize more (default); alias for -OacefgIklMNoprsUvwy
--optimize=3         -O3       Optimize most; alias for -OacefgiklmNoprsuvwy

8.4 Compiler Generated Sections

Table 15 Compiler generated sections
Section type   Name prefix   Description
code           .text         program code
neardata       .zdata        initialized __near data
fardata        .data         initialized __far data
nearrom        .zrodata      constant __near data
farrom         .rodata       constant __far data
nearbss        .zbss         uninitialized __near data (cleared)
farbss         .bss          uninitialized __far data (cleared)
nearnoclear    .zbss         uninitialized __near data
farnoclear     .bss          uninitialized __far data
a0data         .sdata        initialized __a0 data
a0bss          .sbss         uninitialized __a0 data (cleared)
a1rom          .ldata        constant __a1 data
a8data         .data_a8      initialized __a8 data
a8rom          .rodata_a8    constant __a8 data
a8bss          .bss_a8       uninitialized __a8 data (cleared)
a9data         .data_a9      initialized __a9 data
a9rom          .rodata_a9    constant __a9 data
a9bss          .bss_a9       uninitialized __a9 data (cleared)

8.5 Compiler Memory Qualifiers

Table 16 Compiler memory qualifiers
Qualifier   Description                      Location                                               Maximum object size   Pointer size   Section types
__near      Near data, direct addressable    First 16 kB of a 256 MB segment                        16 kB                 32-bit         neardata, nearrom, nearbss, nearnoclear
__far       Far data, indirect addressable   Anywhere                                               no limit              32-bit         fardata, farrom, farbss, farnoclear
__a0        Small data                       Sign-extended 16-bit offset from address register A0   64 kB                 32-bit         a0data, a0bss
__a1        Literal data, read-only          Sign-extended 16-bit offset from address register A1   64 kB                 32-bit         a1rom
__a8        Data, reserved for OS            Sign-extended 16-bit offset from address register A8   64 kB                 32-bit         a8data, a8rom, a8bss
__a9        Data, reserved for OS            Sign-extended 16-bit offset from address register A9   64 kB                 32-bit         a9data, a9rom, a9bss

8.6 Libraries

Table 17 Libraries
Libraries                                      Description
libc[s].a, libc[s]_fpu.a                       C libraries. Optional letter: s = single-precision floating-point (compiler option --no-double); _fpu = with FPU instructions (compiler option --fpu-present)
libfp[t].a, libfp[t]_fpu.a                     Floating-point libraries. Optional letter: t = trapping (control program option --fp-trap); _fpu = with FPU instructions (compiler option --fpu-present)
librt.a                                        Run-time library
libpb.a, libpc.a, libpct.a, libpd.a, libpt.a   Profiling libraries: pb = block/function counter, pc = call graph, pct = call graph and timing, pd = dummy, pt = function timing
libcp[s][x].a                                  C++ libraries. Optional letters: s = single-precision floating-point, x = exception handling
libstl[s]x.a                                   STLport C++ libraries (exception handling variants only). Optional letter: s = single-precision floating-point

www.infineon.com
Published by Infineon Technologies AG