AN89610 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization Authors: Mark Ainsworth, Asha Ganesan, Mahesh Balan, Keith Mikoleit Associated Project: No Associated Part Family: All PSoC 4 and PSoC 5LP Parts Software Version: PSoC Creator™ 3.0 Related Documents: For a complete list, click here. To get the latest version of this application note, or the associated project file, please visit http://www.cypress.com/go/AN89610. ® AN89610 shows how to optimize C and assembler code for the ARM Cortex CPUs in PSoC 4 and PSoC 5LP. Coding techniques exist for improved CPU performance and effective use of the PSoC memory architecture, which can lead to increased efficiency and reduced power consumption. This application note covers both the gcc and Keil Microcontroller Development Kit (MDK) C compilers supported by PSoC Creator™. Contents 1 2 3 4 5 6 7 8 9 Introduction ...............................................................2 PSoC 4 and PSoC 5LP Architectures .......................3 2.1 Register Set ........................................................3 2.2 Address Map .......................................................4 2.3 Interrupts .............................................................7 Compiler General Topics ..........................................8 3.1 Compiler Predefined Macros...............................8 3.2 Viewing Compiler Output ....................................8 3.3 Compiler Optimizations .......................................9 3.4 Attributes .............................................................9 Accessing Variables ............................................... 10 4.1 Global and Static Variables ............................... 10 4.2 Automatic Variables .......................................... 11 4.3 Function Arguments and Result ........................ 11 4.4 LDR and STR instructions................................. 12 Mixing C and Assembler Code ............................... 13 5.1 Syntax ............................................................... 13 5.2 Automatic Variables .......................................... 14 5.3 Global and Static Variables ............................... 15 5.4 Function Arguments .......................................... 16 Special-Function Instructions.................................. 17 6.1 Saturation Instructions ...................................... 17 6.2 Intrinsic Functions ............................................. 18 6.3 Assembler ......................................................... 19 Packed and Unpacked Structures .......................... 19 Compiler Libraries .................................................. 21 Placing Code and Variables ................................... 23 www.cypress.com 9.1 Linker Script Files ............................................. 23 9.2 Placement Procedure ....................................... 27 9.3 Example ............................................................ 29 9.4 General Considerations .................................... 29 9.5 EMIF Considerations (PSoC 5LP Only) ............ 30 10 Cortex-M3 Bit Band (PSoC 5LP Only) .................... 31 11 DMA Addresses (PSoC 5LP only) .......................... 32 12 Summary ................................................................ 32 12.1 Use All of the Resources in Your PSoC......... 33 13 Related Documents ................................................ 34 13.1 Application Notes ........................................... 34 13.2 C Documentation ........................................... 34 13.3 ARM Cortex Documentation .......................... 34 A Appendix A: Compiler Output Details ..................... 35 A.1 Assembler Examples, gcc for Cortex-M3 ....... 35 A.2 Assembler Examples, gcc for Cortex-M0 ....... 42 A.3 Assembler Examples, MDK for Cortex-M3 .... 49 A.4 Assembler Examples, MDK for Cortex-M0 .... 54 A.5 Compiler Test Program.................................. 60 Document History............................................................ 64 Worldwide Sales and Design Support ............................. 65 Products .......................................................................... 65 PSoC Solutions ............................................................... 65 Cypress Developer Community....................................... 65 Technical Support ........................................................... 65 Document No. 001-89610 Rev. *A 1 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 1 Introduction The ARM Cortex CPUs in the PSoC 4 and PSoC 5LP devices are designed to implement C code in a highly efficient manner. Thus, most of the time, you will not need any special knowledge to do C programming for PSoC 4 or PSoC 5LP. This application note helps you to solve more advanced, unique problems, typically around: Fitting an application into a small amount of flash or SRAM Time-constrained applications, that is, maximizing code speed and efficiency A number of methods are provided to solve these types of problems. This application note assumes that you know how to program embedded applications in the C language. Some knowledge of the gcc (GNU Compiler Collection) or Keil MDK (Microcontroller Development Kit) C compiler is recommended. Knowledge of the Thumb-2 assembly language used by the CPUs will also help. You should also know how to use PSoC Creator, the integrated development environment for PSoC 3, PSoC 4, and PSoC 5LP. If you are new to PSoC 4 or PSoC 5LP, you can find introductions in AN79953, Getting Started with PSoC 4 and AN77759, Getting Started with PSoC 5LP. If you are new to PSoC Creator, see the PSoC Creator home page. Note: Although many of the examples show code in Thumb-2, the Cortex assembly language, this application note is not intended to be a tutorial on this language. For details and tutorials on Thumb-2 assembler, see ARM Cortex Documentation. For information on optimizing C code for the 8051 CPU in PSoC 3, see AN60630, PSoC 3 8051 Code and Memory Optimization. www.cypress.com Document No. 001-89610 Rev. *A 2 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 2 PSoC 4 and PSoC 5LP Architectures To effectively use the methods described in this application note, it is important to understand the register and address architectures on which they are based. This section describes those architectures. 2.1 Register Set The Cortex register set and instruction set are the basis for implementing highly efficient C code. The PSoC 4 CortexM0 and the PSoC 5LP Cortex-M3 registers are very similar, as Figure 1 shows. Figure 1. Cortex CPU Architectures Cortex-M0 in PSoC 4 Cortex-M3 in PSoC 5LP All registers are 32-bit. There are 12 general-purpose registers (low registers R0 – R7 have more extensive support in the instruction set). Special registers include: Dual stack pointers (R13) for more efficient implementation of a real-time operating system (RTOS) Link register (R14) for fast return from function calls Program counter (R15) Program status register (PSR) contains instruction results such as zero and carry flags Interrupt mask register (Cortex-M0) / exception mask registers (Cortex-M3) Control register The PSoC 5LP Cortex-M3 has more features in stack management, and in the PSR, interrupt, and control registers. The Cortex-M3 also has a more extensive instruction set, including divide (UDIV, SDIV), multiply and accumulate (MLA, MLS), saturate (USAT, SSAT), and bitfield instructions. See Special-Function Instructions for information on how to take advantage of these instructions. www.cypress.com Document No. 001-89610 Rev. *A 3 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 2.2 Address Map The Cortex-M0 and Cortex-M3 have a very similar address map, as Figure 2 shows. Figure 2. Cortex Address Map Bit band feature on Cortex-M3 only. For more information see Cortex-M3 Bit Band. The address space is 4 Gbyte (32-bit addressing), and is divided into the access regions shown in Figure 2. The CPUs can execute instructions in the Code, SRAM, and External RAM regions; you can put code or data in any of these regions. The CPUs have a 3-instruction pipeline, which enables parallel fetch and execution of instructions. The PSoC 5LP Cortex-M3 has a bit band feature, where accessing an address in an alias region results in bit-level access in the corresponding bit band region. This lets you quickly set, clear or test a single bit in the bottom 1 Mbyte of the region. See Cortex-M3 Bit Band for more information. Although the Cortex CPUs can access a 4 Gbyte address space, within the PSoC devices only a small fraction of these addresses access PSoC memory or registers. Following is an overview of where in the Cortex address space the PSoC memory and registers are located; for details see the memory maps in the device datasheets or Technical Reference Manuals (TRMs). www.cypress.com Document No. 001-89610 Rev. *A 4 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 2.2.1 P S o C 4 Ad d r e s s M a p Figure 3 shows that a single Cortex-M0 bus, the System Bus, is used to access most of the regions in the address map. Figure 3. PSoC 4 Address Map Cortex-M0 Addresses 0xFFFF FFFF System Region 0xE000 0000 0xDFFF FFFF External Device Region 0xA000 0000 0x9FFF FFFF External RAM Region PSoC 4 Memory and Register Addresses 0x6000 0000 0x5FFF FFFF Cortex-M0 Core Peripheral Region System Bus 0x4000 0000 0x3FFF FFFF PSoC 4 Registers See TRM for specific register addresses SRAM Region Up to 4 KB SRAM 0x2000 0000 0x1FFF FFFF Code Region 0x0000 0000 Up to 32 KB flash 0x2000 0FFF 0x2000 0000 0x0000 7FFF 0x0000 0000 The PSoC 4 memory and registers are addressed as follows: The PSoC 4 flash starts at address 0, in the Cortex Code region. The flash block includes a read accelerator; see a PSoC 4 device datasheet for details. The PSoC 4 SRAM starts at address 0x20000000, in the Cortex SRAM region. The PSoC 4 registers are addressed starting at 0x40000000, in the Cortex Peripheral region. See a PSoC 4 Technical Reference Manual (TRM) for specific register addresses. All memory accesses are 32-bit. Code can be placed in PSoC 4 SRAM; see Placing Code and Variables for details. Note: Because PSoC 4 has only one bus, the speed and efficiency of code execution and data access depend solely on the speed of the memory occupying those regions. SRAM is usually faster than flash; however, the combination of the Cortex instruction pipeline and the flash read accelerator makes Code region accesses almost as fast as SRAM region accesses. It is possible to execute code from SRAM but significant performance gains may not necessarily be realized. www.cypress.com Document No. 001-89610 Rev. *A 5 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 2.2.2 PSoC 5LP Address Map The PSoC 5LP / Cortex-M3 architecture is more complex and has more features than that of the PSoC 4, as Figure 4 shows. The Cortex-M3 has three buses instead of one: I (instruction) Bus and D (data) Bus: for reading instructions and accessing data, respectively, from the Code region. In PSoC 5LP, the I and D Buses are multiplexed to a single C (code) Bus for accessing the Code region. S (system) Bus: for reading instructions and accessing data from the other regions Because the C Bus and the S Bus are separate, the Cortex-M3 can do simultaneous parallel accesses of the Code region and the other regions, for more efficient operation. Figure 4. PSoC 5LP Address Map Cortex-M3 Addresses 0xFFFF FFFF System Region 0xE000 0000 0xDFFF FFFF External Device Region 0xA000 0000 0x9FFF FFFF External Ram Region PSoC 5LP Memory and Register Addresses 0x60FF FFFF EMIF 0x6000 0000 0x6000 0000 0x5FFF FFFF S Bus Peripherals Region 0x4000 0000 0x3FFF FFFF Cortex-M3 Core SRAM Region 0x2000 0000 0x1FFF FFFF I Bus Bit band alias region Up to 64 KB SRAM Code Region C Bus D Bus PSoC 5LP Registers See TRM for specific register addresses 0x0000 0000 Up to 256 KB flash 0x2200 0000 0x2000 7FFF 0x1FFF 8000 0x0003 FFFF 0x0000 0000 The PSoC 5LP memory and registers are addressed as follows: The PSoC 5LP flash starts at address 0, in the Cortex Code region. A flash cache is included; see a PSoC 5LP device datasheet for details. The PSoC 5LP SRAM is logically split in half, centered at address 0x20000000. For example, in a device with 64 KB SRAM, half of the SRAM, 32 KB, is addressed below 0x20000000 and the other half above 0x20000000. So the SRAM addresses range from 0x1FFF8000 to 0x20007FFF. The addresses in a device with 32 KB SRAM range from 0x1FFFC000 to 0x20003FFF. The lower half of SRAM, called code SRAM, is located in the Cortex Code region. The upper half, called upper SRAM, is located in the Cortex SRAM region. The two halves are accessed by different buses, as Figure 4 shows. Locating half of the SRAM in the Code region enables placement of code and data for possible faster access – see Placing Code and Variables for details. Note: Within the PSoC 5LP, SRAM accesses are usually faster than flash accesses, however the combination of the Cortex instruction pipeline and the flash cache makes flash accesses almost as fast as SRAM accesses. It is possible to execute code from either code SRAM or upper SRAM but significant performance gains may not necessarily be realized. www.cypress.com Document No. 001-89610 Rev. *A 6 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization Note that only upper SRAM is in the Cortex-M3 bit band region. The PSoC 5LP registers are addressed starting at 0x40000000, in the Cortex Peripheral region. See a PSoC 5LP Technical Reference Manual (TRM) for specific register addresses. The PSoC 5LP External Memory Interface (EMIF) addresses start at 0x60000000, in the Cortex External RAM region. For more information on PSoC 5LP EMIF see the device datasheet, TRM, or the EMIF Component datasheet. All memory accesses are 32-bit except EMIF, which can be set to either 8-bit or 16-bit. The PSoC 5LP also includes a direct memory access (DMA) controller. It shares bandwidth with the CPU as dual bus masters, using bus arbitration techniques. For more information, see AN52705, Getting Started with PSoC DMA. See also DMA Addresses in this application note. 2.3 Interrupts Both Cortex CPUs offer sophisticated support for rapid and deterministic interrupt handling. For more information see a device datasheet or TRM, the PSoC Creator Interrupt Component datasheet, or AN54460, PSoC Interrupts. www.cypress.com Document No. 001-89610 Rev. *A 7 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 3 Compiler General Topics Before we begin in-depth examination of the gcc and MDK compilers, let us examine a few general compiler topics. Note: All of the C code examples shown in this application note are designed for use with the C compilers supported by PSoC Creator 3.0: gcc 4.7.3 and the Keil Microcontroller Development Kit (MDK) version 5.03. The gcc 4.7.3 compiler is included free with your PSoC Creator installation. MDK must be purchased however an object-size-limited evaluation version, MDK-Lite, is available free from Keil. Compiler optimizations are turned off (the PSoC Creator default) except where noted. All of the C code examples in this application note use ANSI standard C except for compiler-specific extensions. 3.1 Compiler Predefined Macros It is a best practice to write C code that can be directly ported between as many different compilers as possible. However there are cases where this is not possible and you must write multiple versions of the same code, to be used with multiple compilers. If you need to do this you can use predefined macros, provided with most compilers, to identify the compiler being used. This allows you to compile only the code for the compiler being used, for example: #if defined(MY_COMPILER_MACRO) /* put your compiler-unique code here */ #endif To apply this technique to PSoC Creator projects, use the following macros that are included with the gcc and MDK compilers, respectively. Note that for MDK you are checking just for whether __ARMCC_VERSION is defined, indicating that that compiler is being used. You do not necessarily need to care about its actual value, i.e., the compiler version. #if defined(__GNUC__) /* put your gcc unique code here */ #elif defined(__ARMCC_VERSION) /* put your MDK unique code here */ #endif 3.2 Viewing Compiler Output To understand how a compiler performs under different conditions, you must review the output assembler code. There are two ways to do that in PSoC Creator. 1. Figure 5. Listing Files Open the list file corresponding to the compiled C file (filename.lst), as Figure 5 shows. The default PSoC Creator project build setting is to create a list file; see menu item Project > Build Settings > Compiler > General. 2. Use the disassembly window in the debugger (menu item Debug > Windows > Disassembly). Right-click in that window to bring up options to show mixed source and assembler. Of course, this method has the disadvantage that you must have working target hardware and a PSoC Creator project that builds correctly before you can use the debugger. Note: The free evaluation version of MDK, MDKLite, does not include assembler in the .lst file, so method 2 must be used to see the output assembler code. www.cypress.com Document No. 001-89610 Rev. *A 8 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 3.3 Compiler Optimizations Turning on optimization options makes the compiler attempt to improve the C code‟s performance and/or size, at the expense of compilation time and possibly the ability to debug the program. PSoC Creator allows you to set compiler optimizations for an entire project, under Project > Build Settings > Compiler > Optimization. The optimizations offered by PSoC Creator for both gcc and MDK are just for “speed” or “size”. (This is different from the 11 levels of optimization offered for the Keil 8051 compiler.) The optimization option selected in the Build Settings dialog applies to all C files in the project. You can also apply optimizations to individual C files – in the Workspace Explorer window right-click on the file and select Build Settings. With gcc you can‟t set optimization levels for individual functions except for using certain function attributes. With MDK you can use #pragma to set optimization for an individual function. For more information see C Documentation. It is strongly recommended that after compiling C code with optimizations you carefully review the assembler output and confirm that it is doing what you expect. Stepping through the assembler code in the debugger may also be helpful. One best practice is to get your C code working without optimizations, then rebuild with optimizations and repeat your tests. You can do this using the Debug and Release configurations in your PSoC Creator project build settings. For specific examples of how various optimization options work, see Appendix A. 3.4 Attributes An extension to the C language that is supported by both gcc and MDK is to apply attributes to functions, variables, and structure types. Attributes can be used, for example, to control: specific function optimizations how structures occupy memory (see Packed and Unpacked Structures) function and variable location in memory (see Placing Code and Variables) The syntax is (two underscore characters before and after the "attribute"): __attribute__ ((<attribute-list>)) Specific attributes are described in detail in subsequent sections in this application note. For more information, see C Documentation. www.cypress.com Document No. 001-89610 Rev. *A 9 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 4 Accessing Variables When reviewing compiler output, one of the first areas to examine is how variables (and arrays and structures) are read and written. In their assembly language output, the gcc and MDK compilers both implement certain techniques for accessing: global and static variables automatic (local) variables function arguments and function result Let us examine how each of these is done. 4.1 Global and Static Variables The Thumb-2 assembly language used by both Cortex CPUs does not generally support loading 32-bit immediate values into a register. (There are exceptions; for example small immediate values can be loaded and sign-extended.) This makes it difficult to load the address of a global or static variable, or in general to load any address. Table 1 shows two methods for handling this problem, for the following example C code: /* loading a global variable */ uint32 myVar; . . . myVar = 7; Table 1. Example Methods for Loading Addresses into CPU Registers Method 1: two 16-bit immediate loads ; rx = address of myVar ; load the lower and upper halves of the ; address movw rx, #<LS word> ; 32-bit instruction movt rx, #<MS word> ; 32-bit instruction . . . movs ry, #7 str ry, [rx, #0] ; myVar = 7 Method 2: PC-relative load ; rx = address of myVar ; load the value stored in flash, below ldr rx, [pc, #<offset>] ; 16-bit instruction . . . movs ry, #7 str ry, [rx, #0] ; myVar = 7 . . . ; address value stored after the end of ; the function .word <address of myVar> ; 32-bit value In general method 2 is preferred for size-limited applications because it uses two fewer bytes (one 16-bit word). However, method 1 may execute faster due to the Cortex instruction pipeline. Note that with PSoC, instruction execution speed also depends on the flash cache (PSoC 5LP) or accelerator (PSoC 4) and thus is not necessarily deterministic. See Memory Map for details. Different compiler optimizations implement one or the other of these two methods; for detailed examples see Appendix A. It is a coding best practice to minimize use of global variables. Doing so with the PSoC Cortex CPUs may also act to reduce code size by reducing the loading of memory and PSoC register addresses. www.cypress.com Document No. 001-89610 Rev. *A 10 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 4.2 Automatic Variables In C, automatic variables are variables that are defined within (local to) a function. Depending on the size and complexity of the function, and the compiler optimization setting, an automatic variable may be assigned to a CPU register or it may be saved on the stack, as Table 2 shows: Table 2. Example Methods for Using Automatic Variables C Code void MyFunc(void) { uint8 i = 3; uint8 j = 10; . . . } Assembler Code ; use rx as i, do NOT save it on the stack movs rx, #3 ; initialize i ; store j on the stack; use ry to temporarily ; hold the initial value movs ry, #10 ; initialize j strb ry, [sp, #<offset>] ; on the stack Both size and speed optimizations tend to reduce stack usage for automatic variables; see Appendix A for examples. 4.3 Function Arguments and Result The Procedure Call Standard for ARM Architecture allocates registers R0 – R3 for passing arguments to a function, and R0 for passing a function result. If the number of arguments is greater than four, the first four arguments are placed in the registers and the rest are pushed onto the stack. Within the function, the arguments may be maintained in their respective registers, transferred to other registers, or saved on the stack. Given that a function‟s automatic variables may also be stored on the stack (Table 2), stack management may become complex. To handle this complexity two sequences of instructions, known as prolog and epilog, may be included in a compiled function, as the example in Table 3 shows: Table 3. Example Function With Prolog and Epilog Instructions C Code Assembler Code /* Function with 6 arguments and a return value */ uint32 MyFunc(uint32 a, uint32 b, uint32 c, uint32 d, uint32 e, uint32 f) { return a + b + c + d + e + f; } ; function prolog push {r7} sub sp, sp, #20 add r7, sp, #0 str r0, [r7, #12] str r1, [r7, #8] str r2, [r7, #4] str r3, [r7, #0] ; function body ldr r2, [r7, #12] ldr r3, [r7, #8] adds r2, r2, r3 ldr r3, [r7, #4] adds r2, r2, r3 ldr r3, [r7, #0] adds r2, r2, r3 ldr r3, [r7, #24] adds r2, r2, r3 ldr r3, [r7, #28] adds r3, r2, r3 mov r0, r3 ; function epilog add r7, r7, #20 mov sp, r7 pop {r7} bx lr ; ; ; ; ; make room on the stack for the arguments use r7 as a base pointer save the arguments on the stack ; build the sum ; in r2 and r3 ; argument e ; argument f ; return value in r0 ; restore stack and r7 ; return To minimize code size and maximize speed, you should limit the number of function arguments to four. www.cypress.com Document No. 001-89610 Rev. *A 11 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization Depending on the size and complexity of the function, the size optimization tries to reduce the function prolog and epilog code; see Appendix A for examples. 4.4 LDR and STR instructions These instructions are used to read to and write from memory and PSoC registers. They are quite powerful and offer many flexible options. Variants of the instructions support byte and halfword (16-bit) accesses, zero and sign extensions, and immediate and register offsets. The offset options are particularly useful for handling pointer offsets and for accessing members of arrays and structures, as Table 4 shows. Table 4. Example Usage of LDR and STR Instructions C Code Assembler Code /* loading an array member */ uint8 myArray[100]; . . . myArray[6] = 7; ldr rx, [pc, #<offset>] ; rx = address of myArray movs strb ry, #7 ry, [rx, #6] uint8 myArray[100]; . . . uint8 i; . . . myArray[i] = 7; ldr rx, [pc, #<offset>] ; rx = address of myArray ldrb ry, [sp, #<offset>] ; ry = i (automatic variable) movs rz, #7 strb rz, [rx, ry] Note that the LDR and STR instructions are always register-relative, so before an LDR or STR instruction there is always another instruction to load a register with the target address; see Table 1. Variations and options for these instructions are different in the Cortex-M0 (PSoC 4) and the Cortex-M3 (PSoC 5LP). For more information, see ARM Cortex Documentation. www.cypress.com Document No. 001-89610 Rev. *A 12 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 5 Mixing C and Assembler Code One of the most effective ways to make your code shorter, faster, and more efficient is to write it in assembler. Using assembler may also enable you to take advantage of special-function instructions that are supported by the CPU but are not used by the C compiler; see Special-Function Instructions. However, coding in assembler is a daunting task for all but the smallest applications, and once written the code is not easy to maintain or port to other compilers or CPUs. That is why most code is written in C, and if assembler is used at all it is used only for a few critical functions. Another problem with assembler is that it must be written in its own file, separate from the C files with which it must coexist. This can cause difficulties integrating and maintaining the code. A solution to both of these problems is an extension to the C language called inline assembler, where assembler code can be placed directly in C files and is treated as just another C statement. This lets you use assembler only where it‟s needed to increase efficiency, and makes it easier to mix C and assembler. The gcc and MDK compilers both support inline assembler. In addition, MDK supports a similar feature called embedded assembler, where a function is written entirely in assembler but is included in a C file. This section shows how to use combined C and assembler code, for both the gcc and MDK compilers. To effectively use the methods described in this section, it is important to understand the register architectures on which they are based – see CPU Register Architectures for details. Note: The following examples show assembler for the Cortex-M3; the Cortex-M0 uses a more limited subset of the Cortex-M3 instructions. For details see ARM Cortex Documentation. Note: Most assembler instructions act on the Cortex registers (see Register Set). The Procedure Call Standard for ARM Architecture requires that some of these registers be preserved by functions. If needed, use the PUSH and POP instructions to save and restore registers on the stack. 5.1 Syntax The gcc syntax for inline assembler is: asm("assembler instruction"); which adds a single line of assembler code to the C code. For example, the following increments the R0 register: asm("ADD r0, r0, #1") ; /* R0 = R0 + 1 */ The syntax for multi-line inline assembler is: asm("line "line "line . . . "line 1\n" 2\n" 3\n" __asm("line "line "line . . . "line 1\n" 2\n" 3\n" n"); __asm return-type function-name(argument-list) { /* This is a C comment */ instruction ; assembler comment ... instruction } For example: /* R0 = R0 + 1; R1 = R0 */ asm("ADD r0, r0, #1\n" "MOV r1, r0"); Note: The keyword __asm__ can be used instead of asm; see C Documentation for details. Note: You can add the keyword volatile to prevent the statement from being optimized out by the compiler: www.cypress.com __asm("assembler instruction"); The syntax for the MDK embedded assembler is: n"); asm volatile(" ... "); The MDK syntax for inline assembler is the same as that for gcc, except that the “asm” is preceded by two underscore characters: For example: __asm int DoSum(int x, int y) { ADD r0, r0, r1 BX lr } Document No. 001-89610 Rev. *A 13 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 5.2 Automatic Variables With gcc, to access an automatic (or local) variable from inline assembler, you must first force the variable to occupy a register Rx. Declare the variable as follows: register int foo asm("r0"); /* foo occupies register R0 */ Note: gcc actually supports a complex language of C expression operands for the asm keyword. A tutorial on this language is beyond the scope of this application note. Details can be found in C Documentation, especially section 6.41 of “Using the GNU Compiler Collection”. As an example, let us define two automatic variables, „foo‟ and „bar‟, and do a simple math operation between them: void main() { register int foo asm("r0") = 5L; /* register variables can be initialized */ register int bar asm("r1"); bar = foo + 1; /* C code version */ asm("ADD r1, r0, #1"); /* bar = foo + 1 */ } In the above example, the C code and the inline assembler do the same operation. However, the compiled C code (no optimization) uses an intermediate register and consequently produces 3x the instructions using 2x the flash memory, as this excerpt from the .lst file shows: 20:.\main.c **** 42 0008 0346 mov 43 000a 03F10103 add 44 000e 1946 mov bar r3, r3, r1, = foo + 1; r0 r3, #1 r3 22:.\main.c **** 47 0010 00F10101 ADD asm("ADD r1, r0, #1"); /* bar = foo + 1 */ r1, r0, #1 Depending on the function size and complexity it may be possible to eliminate the intermediate register by using a compiler optimization option. With MDK, there is no need to force an automatic (local) variable to occupy a register. Instead, you can access the variables directly: void main() { /* no need to declare variables in registers */ int foo = 5; int bar; bar = foo + 1; /* C code version */ __asm("ADDS bar, foo, #1"); /* bar = foo + 1 */ } In this example, the C code and the inline assembler do the same operation and produce the exact same code. www.cypress.com Document No. 001-89610 Rev. *A 14 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 5.3 Global and Static Variables The previous methods can also be used with global and static variables (“globals”). Note that before accessing a global you must load a register with the address of the variable – see Global and Static Variables. With gcc, use the following syntax to load an address: LDR rx, =variable_name LDR ry, =0x1FFF9000 ; or any address Let us repeat the previous example using globals instead of automatic variables: int foo = 5L; int bar; void main() { bar = foo + 1; /* bar = asm("LDR "LDR "LDR "ADD "STR foo r0, r1, r2, r2, r2, Again, the C code and the inline assembler do the same operation but due to different address load methods (see Table 1) the compiled C code (no optimizations) produces two more instructions and uses four more bytes of memory than the inline assembler (for the Cortex-M3), as the following debugger snip shows: /* bar = foo asm("LDR r0, "LDR r1, "LDR r2, "ADD r2, "STR r2, 4804 4905 6802 F1020201 600A + 1 */ =foo\n" =bar\n" [r0]\n" r2, #1\n" [r1]"); ldr r0, ldr r1, ldr r2, add.w r2, str r2, However, the assembler output is quite different: 4805 6800 1C40 4905 6008 ldr ldr adds ldr str r0, r0, r0, r1, r0, [pc, #14] [r0, #0] r0, #1 [pc, #14] [r1, #0] The embedded assembler method looks like this: } r3, r3, r3, r2, r3, r3, r2, /* bar = foo + 1 */ __asm("ADDS bar, foo, #1"); The additional instructions are required for loading the variable addresses; see Global and Static Variables for more information. In this case the inline assembler is effectively a pseudoinstruction, generating five actual assembler instructions. The output is the same as if it were written in C, so in this case there is no advantage to using inline assembler. + 1 */ =foo\n" =bar\n" [r0]\n" r2 #1\n" [r1]"); bar = foo + 1; F248130C movw F6C173FF movt 681B ldr F1030201 add.w F248132C movw F6C173FF movt 601A str With MDK there are two methods to access global variables. The first method, inline assembler, is similar to that for accessing automatic variables: #810c #1fff [r3, #0] r3, #1 #812c #1fff [r3, #0] __asm void AddGlobals(void) { extern foo extern bar LDR LDR LDR ADD STR BX r0, r1, r0, r0, r0, lr =foo =bar [r0] r0, #1 [r1] } In this case the resultant code is the same as for the inline assembler method. [pc, #10] [pc, #14] [r0, #0] r2, #1 [r1, #0] The above results may be different if compiler optimizations are used. www.cypress.com Document No. 001-89610 Rev. *A 15 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 5.4 Function Arguments As noted in Function Arguments and Result, the Procedure Call Standard for ARM Architecture allocates registers R0 – R3 for passing arguments to a function, and R0 for passing a function result. If the number of arguments is greater than four, the first four arguments are placed in the registers and the rest are pushed onto the stack. So if you limit the number of arguments to four, you can write assembler to directly access the registers that have those arguments. The following example shows multiple ways to implement a function to calculate the sum of four arguments: uint32 addFunc(uint32 a, uint32 b, uint32 c, uint32 d) { return a + b + c + d; } gcc: uint32 addFunc(uint32 a, uint32 b, uint32 c, uint32 d) { /* define return value in R0 */ /* does not overwrite input argument 'a' in R0 */ register uint32 rtnval asm("r0"); /* the arguments are in registers R0 – R3 */ asm volatile ("add r0, r0, r1\n" "add r0, r0, r2\n" "add r0, r0, r3"); return rtnval; /* return value in R0 */ } MDK embedded assembler: __asm uint32 addFunc(uint32 a, uint32 b, uint32 c, uint32 d) { ; the arguments are in registers R0 – R3 ADD r0, r0, r1 ADD r0, r0, r2 ADD r0, r0, r3 ; return value in R0 BX lr } www.cypress.com Document No. 001-89610 Rev. *A 16 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 6 Special-Function Instructions Some CPUs have special-function instructions which are not normally used by C compilers; this type of instruction can be accessed with C intrinsic functions or in some cases can only be accessed using assembler. This section explains how to use C intrinsic functions and mixed C and assembler to more easily gain access to these instructions. Let us look at the PSoC 5 Cortex-M3 saturation instructions as an example; for other special-function instructions see C Documentation. 6.1 Saturation Instructions Saturation is commonly used in signal processing, for example when a signal is amplified, as Figure 6 shows. Suppose we are using a 16 bit ADC and are interested in just the 12 LS bits. After amplification, if the value is adjusted by simply removing the unused MS bits, overflow may seriously distort the resulting signal. Saturation avoids overflow and reduces distortion. Figure 6. Saturation Operation Saturation can be done in C using multiple comparison and if-else statements, but the Cortex-M3 has two assembler instructions that make the process far more efficient: SSAT and USAT for signed and unsigned, respectively. These instructions work as follows: 6.1.1 S S AT I n s t r u c t i o n n–1 n–1 The SSAT instruction saturates to the signed range −2 ≤ x ≤ 2 −1: if the value to be saturated is less than −2 n−1 if the value to be saturated is greater than 2 , the result returned is −2 n−1 n-1 n-1 −1, the result returned is 2 −1 otherwise, the result returned is the same as the value to be saturated. www.cypress.com Document No. 001-89610 Rev. *A 17 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 6.1.2 USAT Instruction n−1 The USAT instruction saturates to the unsigned range 0 ≤ x ≤ 2 : 6.1.3 if the value to be saturated is less than 0, the result returned is 0 if the value to be saturated is greater than 2 n−1 n−1 , the result returned is 2 otherwise, the result returned is the same as the value to be saturated S yn t a x op Rd, #n, Rm where: op is one of the following: SSAT saturates a signed value to a signed range USAT saturates a signed value to an unsigned range Rd is the destination register. n specifies the bit position to saturate to: n ranges from 1 to 32 for SSAT n ranges from 0 to 31 for USAT Rm is the register containing the value to saturate. Note: The SSAT and USAT instructions operate on a 32-bit value in the input register. The corresponding C variable should be of type int, int32 or uint32. Sign extension may be required before executing the saturation instruction. 6.2 Intrinsic Functions In C, an intrinsic function has the appearance of a function call but is replaced during compilation by a specific sequence of one or more assembler instructions. The ARM Cortex Microcontroller Software Interface Standard (CMSIS) library includes a set of intrinsic functions for most of the Cortex special-function assembler instructions. After a PSoC Creator project is built, you can find these functions in the Workspace Explorer window in the folder Generated Source > PSoCx > cyboot > core_cmInstr.h. The saturation intrinsics look like this: __SSAT(ARG1, ARG2) __USAT(ARG1, ARG2) where ARG1 is the input value to be saturated and ARG2 is the bit position to saturate to. Call the functions as follows: int data_unsat = -1L; int data_sat = __USAT(data_unsat, 8); In this example, we saturate data_unsat to 8 bits, unsigned. If the value of data_unsat exceeds 255 (0xFF), the result is saturated to 255 (0xFF) and is stored in data_sat. If the value of data_unsat is negative, 0 is stored in data_sat. www.cypress.com Document No. 001-89610 Rev. *A 18 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 6.3 Assembler You can also use the techniques described in Mixing C and Assembler Code to insert special-function instructions, as Table 5 shows. Table 5. Using Saturation Instructions in Mixed C and Assembler gcc Example MDK Example void main() void main() { { register int data_unsat asm("r0"); register int data_sat asm("r3"); int data_unsat; int data_sat; asm("ssat r3, 8, r0"); . . . . . . __asm("ssat data_sat, 8, data_unsat"); } } In this example, we saturate data_unsat to 8 bits, signed. With 8-bit signed saturation the value can range from -128 to +127. So if the value is less than -128, the result is -128 and if the value is greater than +127, the result is +127. So the result is saturated to 0x7F in the positive direction and 0x80 in the negative direction. We can use the saturation instructions to saturate a value to the required number of saturation bits. USAT can be used with an ADC configured in single ended mode and SSAT can be used with an ADC configured in differential mode. 7 Packed and Unpacked Structures 2 In most embedded systems, data is transmitted in a byte-by-byte fashion, for example with a UART or I C port. (The SPI protocol is an exception, for details see one of the PSoC Creator SPI Component datasheets.) With 8-bit CPUs complex data structures can be transmitted and received byte by byte and the result will exactly match the original. However with larger CPUs (16-bit, 32-bit, etc.) this is not necessarily true. Let us examine in detail why this is so, using the PSoC 4 Cortex-M0 and PSoC 5LP Cortex-M3 CPUs as examples. The 32-bit Cortex CPUs in PSoC access memory as 32-bit words, and therefore they work most efficiently when data is stored on 32-bit boundaries, that is, where the two LS address bits are zero. If for example a 16-bit or 32-bit variable is saved starting at an odd address, where the LS bit of the address is 1, then two 32-bit memory reads are required to read it, and two read-modify-write cycles are required to write it. This can significantly impact execution speed. Unfortunately, in C it is easy to create structures where words are on odd boundaries. Consider the following example: struct myStruct { uint8 m1; /* stored on a 32-bit boundary, address = ...xx00 */ uint32 m2; /* stored on an 8-bit boundary, address = ...xx01 */ } The above is called a packed structure, because the structures are placed in memory byte-by-byte regardless of address boundary considerations. Compilers for 8-bit CPUs usually generate packed structures. The gcc and MDK compilers as a default for the Cortex CPUs save structures in unpacked format, where the address is determined by the size of the structure member. For example: struct myUnpackedStruct { uint8 m1; /* stored on a 32-bit boundary, address = ...xx00 */ /* 3 unused filler bytes */ uint32 m2; /* stored on a 32-bit boundary, address = ...yy00 */ } www.cypress.com Document No. 001-89610 Rev. *A 19 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization An unpacked structure can be accessed more efficiently but is larger, which may be a problem with devices with limited SRAM. But a more serious problem can occur when for example an unpacked structure is transmitted byte-bybyte and the receiver saves the bytes in a packed structure – the data becomes corrupted causing hard-to-find system-level defects. There are several ways to correct this problem in code; the easiest is to simply optimize the order of structure member declarations. For example, we could reorder the original structure as: struct myStruct { uint32 m2; /* stored on a 32-bit boundary, address = ...xx00 */ uint8 m1; /* stored on a 32-bit boundary, address = ...yy00 */ } Now the structure is packed and each of its members‟ addresses are on 32-bit boundaries. 7.1.1 Compiler Considerations For both gcc and MDK compilers, by default, structures are unpacked according to the following rules: A char or uint8 (one byte) is 1-byte aligned A short or uint16 (two bytes) is 2-byte aligned; LS address bit is 0 A long or uint32 (four bytes) is 4-byte aligned; two LS address bits are 00 A float (four bytes) is 4-byte aligned; two LS address bits are 00 Any pointer, e.g., char *, int * (four bytes) is 4-byte aligned; two LS address bits are 00. It is possible to force a structure to be packed, using the following syntax: gcc: MDK: struct myStruct { . . . } __attribute__ ((packed)); __packed struct myStruct { . . . }; Note: It is recommended to use the packed statement in structure definitions only. It should not be used in declarations of actual structure variables nor should it be used in typedef declarations. www.cypress.com Document No. 001-89610 Rev. *A 20 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 8 Compiler Libraries For both gcc and MDK compilers, replacing C standard library function calls with equivalent C statements can significantly reduce memory usage. For example, consider the following C fragment: #include <math.h> uint32 a, b; a = 5; b = pow(a,3); Table 6 shows the Flash and SRAM memory consumption for both compilers for PSoC 5LP and PSoC 4: Table 6. Memory Consumption With a pow() Function Call PSoC 5LP gcc MDK PSoC 4 gcc MDK Flash 8939 7696 Flash 14374 7700 SRAM 405 196 SRAM 364 312 If the call to the pow() library function is replaced with the following equivalent code, you use a lot less memory, as Table 7 shows: b = a * a * a; Table 7. Memory Consumption Without Using a pow() Function Call PSoC 5LP gcc MDK Flash 1582 (-82.3%) 1444 (-81.2%) SRAM 301 (-25.7%) 200 (-32.4%) PSoC 4 gcc MDK Flash 1198 (-91.7%) 1004 (-87.0%) SRAM 252 (-30.8%) 216 (-30.8%) The reason for the size reduction is that by ANSI C definition the pow() function takes arguments of type double and returns a type double. When you call this function with integers they are automatically cast to the proper type before and after the function call, and this requires a lot of code to implement. With PSoC Creator 3.0, a choice of gcc libraries is available: newlib and newlib-nano. The newlib-nano library cuts some less-used features from the standard C library functions, to reduce memory usage. Note: One of features removed from newlib-nano is floating-point support in printf(), which may cause problems if you intend to display floating-point values. For example, consider the following code fragment: char My_String[30]; float My_Float = 3.14159; sprintf(My_String, "Value of pi is: %.2f to 2dp", My_Float); www.cypress.com Document No. 001-89610 Rev. *A 21 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization With newlib-nano, the string "Value of pi is: to 2dp" is created in My_String; the expected value 3.14 is not included. There are two possible work-arounds: 1. Enable floating-point formatting support in newlib-nano, as Figure 7 shows. This feature is disabled as a default. Enabling it increases flash usage by 10K to 15K bytes. Figure 7. Enable floating-point formatting support in newlib-nano 2. Change the library to the full-featured newlib, as Figure 8 shows. Note that the Use Default Libraries option must also be set to False. The default is to use newlib-nano; changing to newlib increases flash usage by 25K to 30K bytes, and increases SRAM usage by approximately 2K bytes. Figure 8. Disabling newlib-nano MDK also offers a reduced-function library called Microlib. For more information see the MDK documentation. www.cypress.com Document No. 001-89610 Rev. *A 22 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 9 9.1 Placing Code and Variables This section shows how to place C code and variables into custom locations in memory. There are a number of reasons to do this, see Define Custom Locations for examples. If you use a custom linker script file, it is a best practice to add it to the project (menu Project > Existing Item…) and save it in the project folder. A custom .scat file must be saved in the PSoCx folder under Generated Source. To effectively use the methods described in this section, it is important to understand the CPU architectures on which they are based – see Address Map for details. Figure 9. PSoC Creator Linker Script Files Linker Script Files To place code and variables in custom locations, you must know how to modify linker script files. This section shows the basics of how linker script files control the use of memory in PSoC 4 and PSoC 5LP. Details can be found in your gcc or MDK documentation. After your PSoC Creator project is built, the default linker script files can be found in the Generated Source folder, as Figure 9 shows. For gcc the linker script file is of type .ld, and for MDK the linker script file is of type .scat (for “scatter”). Note: Linker script files are automatically generated by PSoC Creator at project build time, and changes that you make to those files may be overwritten on the next build. You can instruct PSoC Creator to use a custom script file, using the PSoC Creator menu Project > Build Settings > Linker > General > Custom Linker Script. 9.1.1 Linker Script File for gcc An .ld file has two major commands: MEMORY {} and SECTIONS {}. The MEMORY command describes the type, location and usage of all physical memory in the PSoC. For example, for a PSoC 4 with 32 KB flash and 4 KB SRAM: MEMORY { rom (rx) : ORIGIN = 0x0, LENGTH = 32768 ram (rwx) : ORIGIN = 0x20000000, LENGTH = 4096 } The region rom describes the PSoC flash and the region ram describes the PSoC SRAM. The letters "rwx" are memory attribute indicators: read, write, and execute, respectively. All origin and length units are in bytes, and the values can be in decimal or hexadecimal. PSoC 5LP is similar; the ram ORIGIN value describes the SRAM crossing the Cortex-M3 Code / SRAM region boundary (Figure 4). For example, for a PSoC 5LP with 256 KB flash and 64 KB SRAM: MEMORY { rom (rx) : ORIGIN = 0x0, LENGTH = 262144 ram (rwx) : ORIGIN = 0x20000000 - (65536 / 2), LENGTH = 65536 } www.cypress.com Document No. 001-89610 Rev. *A 23 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization The SECTIONS command lists all of the sections in address order, for example: SECTIONS { .text: { ... } .rodata: { ... } .ramvectors: { ... } .noinit: { ... } .data: { ... } .bss: { ... } .heap: { ... } .stack: { ... } } Figure 10. SECTIONS Command and PSoC Memory EMIF (PSoC 5LP) stack (Not all of the sections are shown above, only the major ones.) Figure 10 shows where these sections are placed in PSoC 4 and PSoC 5LP flash and SRAM: .text: executable code SRAM .rodata: const variables; initialization data ram (rwx) heap .bss .ramvectors: Cortex exception vectors table .data .noinit: variables that are not initialized .noinit .data: variables that are explicitly initialized .ramvectors .bss: variables that are initialized to 0 heap stack .text: { ... } >rom .data: { ... } >ram AT>rom .heap: { ... } >ram Flash The AT statement enables explicit initialization of variables; see Variable Initialization. Note: For PSoC 5LP, the placement of the sections in SRAM are indeterminate relative to position of the code SRAM / upper SRAM boundary (0x20000000; see Figure 4. rom (rx) A closer examination of the linker script file shows that many of the sections end with a region statement. This statement tells the linker the memory region in which to place the section, for example: .rodata other sections .text Note: For most applications it can be assumed that the stack is in upper SRAM and the other sections are all in code SRAM. This can be changed; see Modify the Linker Script File. Complete documentation of .ld file usage can be found in your gcc documentation. www.cypress.com Document No. 001-89610 Rev. *A 24 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 9.1.2 Linker Script File for MDK A .scat (for “scatter”) file has no commands; instead it defines regions and sections. It has a single load region called APPLICATION {}. The load region contains several execution regions, which in turn contain one or more section attributes, for example: APPLICATION ... // load region { CODE ... // execution region { * (+RO) // section attribute } ISRVECTORS ... { * (.ramvectors) } NOINIT_DATA ... { * (.noinit) } DATA ... { .ANY (+RW, +ZI) } ARM_LIB_HEAP ... { } ARM_LIB_STACK ... { } } Note: . For most applications it can be assumed that the stack and heap are in upper SRAM and the other sections are all in code SRAM. This can be changed; see Modify the Linker Script File. Complete documentation of .scat file usage can be found in your MDK documentation. Figure 11. .scat File Sections and PSoC Memory EMIF (PSoC 5LP) stack heap SRAM (Not all of the execution regions and section attributes are shown above, only the major ones.) Figure 11 shows where the regions and sections are placed in PSoC 4 and PSoC 5LP flash and SRAM. Special sections RO, RW, and ZI are defined as follows: RO: all code, const variables, initialization bytes for the RW section RW: all variables that are explicitly initialized; see Variable Initialization ZI: all variables that are initialized to zero; see Variable Initialization DATA { } .noinit .ramvectors and Other attributes such as .noinit cause placement of code or variables that match that attribute; see Define Custom Locations. Flash Note: For PSoC 5LP, the placement of the regions and sections in SRAM is indeterminate relative to the position of the code SRAM / upper SRAM boundary (0x20000000; see Figure 4. CODE { } other sections www.cypress.com Document No. 001-89610 Rev. *A 25 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization The following points are valid for both linker file types: 9.1.3 1. Only global and C static variables are included. C automatic variables are handled differently; see Automatic Variables. 2. The size of the heap and stack are defined in the PSoC Creator project DWR window, System tab. The stack pointer is initialized to the highest SRAM address plus 1, and the stack grows downward. The heap, which is used by C functions such as malloc() and free(), grows upward from its base. 3. Although the heap has a defined size in the DWR window, in practice it can use all of SRAM between the sections in SRAM and the current value of the stack pointer. Similarly, the stack can grow downward beyond the defined stack section. If the stack starts to overlap the memory regions below it, hard-to-find defects can occur. One way to detect stack overflow is to add code to each function to check the current value of the stack pointer. Variable Initialization When placing global and static variables in custom locations, it is important to understand how they are initialized. For example, consider these global variable definitions: uint8 foo = 5; uint16 myArray[10] = {1234, 12, ... }; Because these variables are located in SRAM, their values are undefined when the PSoC is powered up. To properly initialize them the values are saved in flash and the C startup code, i.e., the code that is executed before main(), copies the values from flash to SRAM. The values in flash are in the .rodata section for gcc (Figure 10) or the CODE (RO) section for MDK (Figure 11). They are copied into the variables in the .data section for gcc or the DATA (RW) section for MDK – explicitly initialized variables must be located in these sections. Global and static variables that are not explicitly initialized are set to zero by the C startup code. They must be located in the .bss section (gcc) or the DATA (ZI) section (MDK). Global and static variables that are located outside the above sections are not initialized and their initial values are undefined – they must be initialized in your code. Explicit initializations are ignored. 9.1.4 Map File The gcc and MDK linkers both have an option to produce a .map file when a project is built. You can find the file in the Results tab in the Workspace Explorer window, as Figure 12 shows. Figure 12. .map File in PSoC Creator The .map file shows where in memory all code modules and variables have been placed by the linker. You should review it after a build operation and confirm that: All code and variables have been placed in the expected locations, and There are no section overlaps. For MDK, the linker‟s default is to produce a .map file with no symbols, which makes it difficult to determine where your code and variables have been placed. To include the symbols, add --symbols to the linker command line, using the PSoC Creator menu Project > Build Settings > Linker > Command Line > Custom Flags. Now that we have seen the basics of how linker script files work, we can examine how to use them to place code or variables in custom locations in PSoC memory. www.cypress.com Document No. 001-89610 Rev. *A 26 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 9.2 Placement Procedure To place C functions or variables (including arrays and structures) in custom locations, do the following: 1. Define the custom locations. 2. In the C source code, declare the functions and variables that are to be located, along with their custom sections. 3. Build the PSoC Creator project. Copy the generated linker script file to a custom file, then modify it to add and locate the sections from step 2. 4. Rebuild the PSoC Creator project. Review the .map file and confirm that the custom locations have been filled correctly and that there are no section overlaps. Let us examine each of these steps in detail. 9.2.1 Define Custom Locations Before using custom locations, you should understand clearly the reasons why you want to use them. For example, do you want to: Place a function, or a variable of type const, in a custom location in flash? Place a variable such that it is not initialized by C startup code? This is typically used to maintain a variable‟s state through a device reset (except for a power-cycle reset). Place variables in PSoC 5LP upper SRAM, for bit band access? Place a function in SRAM, for possible faster execution? If so, note that in PSoC 4 and PSoC 5LP flash accesses are almost as fast as SRAM accesses, so significant performance gains may not be realized. See Cortex-M0 in PSoC 4 and Cortex-M3 in PSoC 5LP for details. Place variables in other custom locations in SRAM? Place variables in PSoC 5LP EMIF memory? Your answers to the above questions will help you to determine the addresses of your custom locations. 9.2.2 Declare Functions and Variables Once you have determined the custom location addresses, declare your functions and variables. In the declarations, add the sections in which they will reside, using the __attribute__ keyword (two underscore characters before and after the "attribute"): uint8 foo __attribute__ ((section(".MY_section"))); This keyword can be used with both gcc and MDK. PSoC Creator provides a convenient macro CY_SECTION to simplify the above statement. Following are some examples of its usage: uint8 foo CY_SECTION(".MY_section"); /* no explicit initialization, = 0 */ uint8 foo CY_SECTION(".MY_section") = 10; /* explicit initialization */ /* declare a function’s section in the prototype only, and not in the actual function */ uint16 MyFunction(char *x) CY_SECTION(".MY_section"); /* CYISR is a PSoC Creator macro to define an interrupt handler function. See Related Documents for more information. */ CYISR(MyFunction) CY_SECTION(".MY_section"); PSoC Creator also provides a convenient macro CY_NOINIT, which places a variable in the .noinit section; see Figure 10 and Figure 11: /* no initialization by C startup code, initial value is undefined */ uint8 foo CY_NOINIT; www.cypress.com Document No. 001-89610 Rev. *A 27 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 9.2.3 Modify the Linker Script File The final task is to modify your project‟s linker script file, to declare where you want the previously defined sections to be placed. For gcc, modify the linker script .ld file – change the SECTIONS {} command and possibly the MEMORY {} command. A common way to modify it would be to add statements for EMIF memory, or to split the SRAM, for example: MEMORY { rom (rx) : ORIGIN = 0x0, LENGTH = 262144 coderam (rwx) : ORIGIN = 0x20000000 - (65536 / 2), LENGTH = (65536 / 2) upperram (rwx) : ORIGIN = 0x20000000, LENGTH = (65536 / 2) EMIF (rwx) : ORIGIN = 0x60000000, LENGTH = 0x1000000 } Note: Changing the SRAM region names will cause errors in some section definitions in the default file, so you must change each section definition as needed. For example, change .stack: { ... } >ram to .stack: { ... } >upperram . To locate a section, add a section definition to the SECTIONS {} command. The syntax for a section definition is: .MY_section <address> <(NOLOAD)> : <alignment> { *(.MY_section) } ><memory region> Note that the section definition name and the name of the section within that section can be the same. Having the first character of the name be a period “.” is not required but is a common convention. Use a memory region name from the MEMORY {} command, as described previously. You can also just place your section within an existing section, for example: .data: { ... *(.MY_section) ... } >ram AT>rom For MDK, modify the linker script .scat file. The procedure is similar to that for gcc but simpler – there is no MEMORY {} or SECTIONS {} command to change. Instead, just add your execution region, for example: MY_REGION <address> <UNINIT> <length> { * (.MY_section) } You can also just place your section within an existing section, for example: DATA { * (.MY_section) .ANY (+RW, +ZI) } www.cypress.com Document No. 001-89610 Rev. *A 28 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 9.3 Example As noted previously, there are several different applications for custom locations. Let us examine one of them as an example of Placement Procedure. In this example we will place an array in PSoC 5LP upper SRAM so that it can be accessed by the Cortex-M3 bit band feature – see Cortex-M3 Bit Band. First, we define the array to occupy a section that we will call “.bitband”: uint8 myArray[10] CY_SECTION(".bitband"); Then we must modify the linker script file to tell it that we want to place the .bitband section in upper SRAM, starting at address 0x20000000. Table 8 shows how to do this for gcc (.ld file) or MDK (.scat file): Table 8. Example Modifications of Linker Script Files gcc Example, .ld File MDK Example, .scat File /* put our .bitband section between the .heap section in code SRAM and the .stack section in upper SRAM */ .heap (NOLOAD) : { . = _end; . += 256; __cy_heap_limit = .; } >ram .bitband 0x20000000 (NOLOAD) : { *(.bitband) } >ram .stack (__cy_stack - 256) (NOLOAD) : { __cy_stack_limit = .; . += 256; } >ram /* put our BITBAND execution region between the DATA exection region in code SRAM and the heap in upper SRAM */ DATA +0 { .ANY (+RW, +ZI) } BITBAND 0x20000000 UNINIT { * (.bitband) } ARM_LIB_HEAP (0x20000000 + (65536 / 2) - 256 - 256) EMPTY 256 { } ARM_LIB_STACK (0x20000000 + (65536 / 2)) EMPTY -256 { } Build the project, then check the .map file and confirm that the array has been located correctly and that there are no section overlaps. Note the initial values of myArray are undefined; they must be initialized in your code. See Variable Initialization for details. See Cortex-M3 Bit Band for details on how to do bit-level access of variables located in the bit band region. 9.4 General Considerations When declaring code and variables in custom sections, keep the following in mind: Explicitly initialized variables must be placed in the .data section (gcc, Figure 10) or the DATA execution region (MDK, Figure 11). See Variable Initialization for details. Similarly, variables for which there is no explicit initialization and which you expect to be auto-initialized to zero must be placed in the .bss section (gcc) or the DATA execution region (MDK). (The MDK compiler automatically gives variables an RW or ZI section attribute, depending on whether or not they‟re explicitly initialized.) Variables that are not placed in the above sections are not initialized. Explicit initializations are ignored. Functions that are to be located in SRAM must be placed in the .data section (gcc, Figure 10) or the DATA execution region (MDK, Figure 11). It is more efficient to have all constant data, for example fixed data tables in arrays, be located in flash, however the default is to place them in SRAM. To force placement of a variable in flash, set its type to const and explicitly initialize it, for example: uint32 const var_in_flash = 0x12345678; www.cypress.com Document No. 001-89610 Rev. *A 29 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization If you are using custom locations in flash, note that the PSoC Creator bootloader uses the top one or two rows of flash to store information about bootloadable files. For more information see the Bootloader Component datasheet. With MDK, the easiest way to put a variable at any __attribute__((at(address))) variable attribute. For example: specified address is to use the uint32 const var_in_flash[] __attribute__((at(0x300))) = { . . . }; The linker defines a special section and places the variable at the desired address, adjusting the placement of other code and variables as needed, as the following .map file snippet shows: Symbol Name Value . . . .text 0x000002f4 .text 0x000002f4 .ARM.__AT_0x00000300 0x00000300 main.o(.ARM.__AT_0x00000300) .text 0x0000030c .text 0x00000394 . . . Ov Type Size Section Section Section 0 0 12 Section Section 0 0 Object(Section) indicate_semi.o(.text) exit.o(.text) init_alloc.o(.text) h1_free.o(.text) For more information, see C Documentation. The .ramvectors section should always be at the bottom of SRAM and the stack section should always be at the top of SRAM. Complete documentation of .ld file usage can be found in your PSoC Creator installation folder, typically: C:\Program Files\Cypress\PSoC Creator\3.0\PSoC Creator\import\gnu_cs\arm\4.7.3\share\doc\gcc-arm-noneeabi\pdf\ld.pdf Complete documentation of .scat file usage can be found in your MDK installation folder, typically: C:\Keil\ARM\Hlp\armlink.chm and C:\Keil\ARM\Hlp\armlinkref.chm 9.5 EMIF Considerations (PSoC 5LP Only) It is possible to place variables in the external memory (EMIF) supported by PSoC 5LP, using the techniques described previously, but there are some restrictions: You must have an EMIF Component placed on your PSoC Creator project schematic. Note that the external memory address, data and control lines can use a significant number of device pins – plan your design accordingly. See the EMIF Component datasheet for details. You cannot access the external memory until the EMIF API function EMIF_Start() is called. So you can‟t initialize EMIF variables in C startup; you must initialize them after the code reaches main() and EMIF_Start() is called. The EMIF supports 8-bit and 16-bit memories; placement and access of different size variables may be a consideration. It is recommended to align 16-bit and 32-bit variables and structure members on 2-byte and 4-byte boundaries, respectively. Code can be executed from EMIF, but only with 16-bit external memories. The code executes much more slowly than from device internal flash or SRAM. It is also difficult to initialize code in external memory. In general, having code in external memory is not recommended. www.cypress.com Document No. 001-89610 Rev. *A 30 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 10 Cortex-M3 Bit Band (PSoC 5LP Only) As indicated in Address Map, Figure 2 and Figure 4, the PSoC 5LP Cortex-M3 has a bit band feature, where accessing an address in an alias region results in bit-level access in the corresponding bit band region. This lets you quickly set, clear or test a single bit in the first 1 Mbyte of the region. For a given bit in a given byte address, the formula for the corresponding alias address is: alias_address = 0x22000000 + 32 * (byte_address – 0x20000000) + 4 * bit_number So for example if you want to set bit 5 in address 0x20000001, write a 1 to address: 0x22000000 + 32 * 1 + 4 * 5 = 0x22000034. Similarly, to clear the bit, write a 0 to the alias address. To test the bit, read the alias address and test bit 0. Note: In addition to the SRAM region, the Cortex-M3 supports bit band for the peripheral region, in which all of the PSoC 5LP registers are located. However, peripheral region bit band is not supported in the PSoC 5LP. Writing to the peripheral region‟s bit band alias region (0x42000000 – 0x43FFFFFF) may give unpredictable results in the PSoC 5LP registers and is not recommended. To use the bit band feature with a variable, first place the variable in upper SRAM using the techniques described in Placing Code and Variables. Then, define macros to calculate and use the corresponding addresses in the bit band alias region: #define BIT_BAND_ALIAS_BASE 0x22000000 /* 'byte' should be a number 0x20000000 to 0x200FFFFF 'bit' should be a number 0 to 7 */ #define BIT_BAND_ALIAS_ADDR(byte, bit) ((BIT_BAND_ALIAS_BASE + \ 32 * ((uint32)(byte) – \ CYREG_SRAM_DATA_MBASE) + \ 4 * (uint8)(bit)) /* 'a' should be an address (uint32 *) */ #define GET_BIT(a, bit) *(uint32 *)BIT_BAND_ALIAS_ADDR(a, bit) /* 'val' should be 0 or 1 */ #define SET_BIT(a, bit, val) GET_BIT(a, bit) = (uint32)(val) #define TEST_BIT(a, bit, val) (GET_BIT(a, bit) == (uint32)(val)) You can then use the macros to set or test a bit: SET_BIT(&foo, 5, 1); /* set bit 5 of foo */ if (TEST_BIT(&foo, 5, 1)) { ... } /* test bit 5 */ In general, it is more efficient to set or clear a bit with the bit band technique than by reading, modifying and writing the variable, as Table 9 shows. When bit band is used, the read-modify-write cycle is done internally by the CPU, thereby saving one instruction. Table 9. Assembly Language for Bit Band vs Direct Set C Code Assembler Code /* direct set bit */ foo |= (1 << 5); ; R3 = address of foo ldr r2, [r3] orr r2, r2, #32 str r2, [r3] /* use bit band */ SET_BIT(&foo, 5, 1); ; R3 = bit band alias address for foo bit 5 */ mov r2, #1 str r2, [r3] Note that there is no efficient way use bit banding to toggle a bit. It is possible to do: However it is simpler and just as efficient to do: foo ^= (1 << 5); SET_BIT(&foo, 5, GET_BIT(&foo, 5) ^ 1); www.cypress.com Document No. 001-89610 Rev. *A 31 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 11 DMA Addresses (PSoC 5LP only) This section assumes that you know how to use the direct memory access (DMA) controller in PSoC 5LP. The DMA controller can transfer data from a source to a destination with no CPU intervention. This allows the CPU to handle other tasks while the DMA does data transfers, thereby achieving a “multiprocessing” environment. The DMA controller is highly flexible and capable of doing complex transfers of data between PSoC memory and onchip peripherals including ADCs, DACs, the Digital Filter Block (DFB), USB, UART, and SPI. There are 24 independent DMA channels. For more information, see AN52705, Getting Started with PSoC DMA. In PSoC 5LP, the DMA shares the Cortex-M3 S Bus (Figure 4) with the CPU. However, because the S Bus does not access the Code region, the DMA cannot directly access code SRAM (0x1FFF8000 to 0x1FFFFFFF). PSoC 5LP handles this by implementing remapping so that the DMA can access the code SRAM by accessing corresponding addresses 0x20008000 to 0x2000FFFF, as Figure 13 shows. Figure 13. DMA Remapping of Code SRAM DMA Mapping 0x2000 FFFF 0x2000 8000 0x2000 7FFF CPU Mapping Upper SRAM Code SRAM (Remapped) Upper SRAM 0x2000 0000 0x1FFF FFFF Code SRAM 0x1FFF 8000 Since the DMA is a 16-bit subsystem, when it increments an address only the lower 16 bits are incremented, with rollover. Therefore, the next DMA address following 0x2000FFFF (which is mapped to 0x1FFFFFFF) is 0x20000000. This means that the SRAM still functions as a contiguous 64-KB block of memory for DMA. This is also true for devices with less than 64K SRAM because the SRAM is always centered around 0x20000000. The remapping is taken into account by the PSoC Creator DMA Component API functions used to set up a DMA channel. If you do not use the API, always set the upper 16 bits of a DMA address for SRAM to 0x2000 regardless of the actual address. 12 Summary This application note has presented a number of methods to increase the efficiency of your C code for the Cortex CPUs in PSoC 4 and PSoC 5LP. The gcc and MDK compilers supported by PSoC Creator work well for most applications without using these techniques; they are needed for special problems in meeting code size or execution speed requirements. The methods presented, in no particular order, are: Limit the number of function arguments to no more than 4. See Function Arguments and Result. Use inline or embedded assembler to maximize efficiency in critical sections; see Mixing C and Assembler Code. You can also use this technique, or intrinsic functions, to take advantage of special instructions, especially in the Cortex-M3; see Special-Function Instructions. Minimize the number of global and static variables. Not only is this a coding best practice but it may reduce code size by reducing the number of address load operations. See Global and Static Variables. Also see Appendix A for examples of how to write efficient assembler code. Be careful when using standard compiler libraries, as they may use a lot of memory; consider using inline code instead. See Compiler Libraries. www.cypress.com Document No. 001-89610 Rev. *A 32 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 12.1 When using structures, pay careful attention to whether they should be packed or unpacked – there are advantages and disadvantages for each. See Packed and Unpacked Structures. Place speed critical code in SRAM; see Placing Code and Variables. Note that speed gains using this technique may not be realized. Place variables to take advantage of the bit band feature in the PSoC 5LP Cortex-M3; see Cortex-M3 Bit Band and Example. Use All of the Resources in Your PSoC There is one final method available for reducing code size. It is based on the fact that PSoC is designed to be a flexible device that enables you to build custom functions in programmable analog and digital blocks. For example, in PSoC 5LP you have the following peripherals that can act as “co-processors”: DMA Controller. Note that the most common CPU assembler instructions are MOV, LDR, and STR, which implies that the CPU spends a lot of cycles just moving bytes around. Let the DMA controller do that instead. Digital Filter Block (DFB) – a sophisticated 24-bit sum of products calculator The UDBs also have programmable logic devices (PLDs) which can be used to build state machines, c.f. the Lookup Table (LUT) Component datasheet. LUTs can be an effective alternative to programming state machines in the CPU using C switch / case statements. Analog components including ADCs, DACs, comparators, opamps, as well as programmable switched capacitor / continuous time (SC/CT) blocks from which you can create programmable gain amplifiers (PGAs), transimpedance amplifiers (TIAs), and mixers. Consider doing your processing in the analog domain instead of the digital domain. Universal Digital Blocks (UDBs). There are as many as 24 UDBs, and each UDB has an 8-bit datapath that can add, subtract, and do bitwise operations, shifts, and cyclic redundancy check (CRC). The datapaths can be chained for word-wide calculations. Consider offloading CPU calculations to the datapaths. PSoC Creator offers a large number of Components to implement various functions in these peripherals. This allows you to develop an effective multiprocessing system in a single chip, offloading a lot of functionality from the CPU. This in turn can not only reduce code size, but by reducing the number of tasks that the CPU must perform, you can reduce CPU speed and thereby reduce power. For example, with PSoC 5LP a digital system can be designed to control multiplexed ADC inputs, and interface with DMA to save the data in SRAM, to create an advanced analog data collection system with zero usage of the CPU. Cypress offers extensive application note support for PSoC peripherals, as well as detailed data in the device datasheets and technical reference manuals (TRMs). For more information see Related Documents. www.cypress.com Document No. 001-89610 Rev. *A 33 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization 13 Related Documents 13.1 Application Notes Name: Mark Ainsworth AN77759 – Getting Started with PSoC 5LP Title: Applications Engineer Principal Background: Mark Ainsworth has a BS in Computer Engineering from Syracuse University and an MSEE from University of Washington, as well as many years experience designing and building embedded systems. AN90799 – PSoC 4 Interrupts Name: Asha Ganesan Title: Applications Engineer Background: Asha Ganesan, a gold medalist from College of Engineering Guindy, India, earned her BE in Electronics and Communication Engineering. She is currently working on PSoC 3/4/5LP based projects and assisting PSoC users with their designs. Name: Mahesh Balan Title: Applications Engineer Background: Mahesh Balan earned his BTech in Electronics and Communication Engineering from Model Engineering College. He is currently working on PSoC 3/4/5LP based projects and assisting PSoC users with their designs. Name: Keith Mikoleit Title: Systems Engineer Background: Keith Mikoleit graduated from Western Washington University with a Bachelor's Degree in Electrical Engineering Technology. 13.2 About the Authors AN79953 – Getting Started with PSoC 4 AN52705 – Getting Started with PSoC DMA AN54460 – PSoC 3 and PSoC 5LP Interrupts AN60630 – PSoC 3 8051 Code and Memory Optimization C Documentation gcc documentation can be found in your PSoC Creator installation folder. The compiler documentation can be found in: C:\Program Files\Cypress\PSoC Creator\3.0\ PSoC Creator\import\gnu_cs\arm\4.7.3\share\ doc\gcc-arm-none-eabi\pdf\gcc\gcc.pdf The linker script file documentation can be found in: C:\Program Files\Cypress\PSoC Creator\3.0\ PSoC Creator\import\gnu_cs\arm\4.7.3\share\ doc\gcc-arm-none-eabi\pdf\ld.pdf For MDK, the documentation can be found in your MDK installation folder, typically: C:\Keil\ARM\Hlp. Start with armtools.chm. For the compiler, see armccref.chm. armcc.chm and The linker script file documentation can be found in armlink.chm and armlinkref.chm. 13.3 ARM Cortex Documentation ARM provides on their web site a wealth of information about the Cortex-M3 and the Cortex-M0 CPUs: Cortex-M0 Instruction Set ARM Related Books Cortex-M3 Instruction Set Cortex Microcontroller Software Interface Standard (CMSIS) library www.cypress.com Document No. 001-89610 Rev. *A 34 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization A Appendix A: Compiler Output Details This section shows in detail the assembler output for both compilers supported by PSoC Creator (gcc and MKD) and both PSoC CPUs (Cortex-M0 and Cortex-M3), with and without optimizations. The details are shown in several tables, which are organized as follows: Table 10. Compiler Output Details for gcc Compiler for Cortex-M3 CPU Table 11. Compiler Output Details for gcc Compiler for Cortex-M0 CPU Table 12. Compiler Output Details for MDK Compiler for Cortex-M3 CPU Table 13. Compiler Output Details for MDK Compiler for Cortex-M0 CPU Although it may not be exactly what you get when you compile your C code, the assembler code in the tables can serve as useful examples that you can incorporate in your code. For details see Mixing C and Assembler Code. The test program used to generate the tables can be found in Compiler Test Program. A.1 Assembler Examples, gcc for Cortex-M3 Table 10 shows, for the gcc compiler for the Cortex-M3, examples of compiler output for different optimization options. The examples were extracted from the .lst files generated by the compiler. See Function Arguments for details on register usage and stack usage in compiler functions. Table 10. Compiler Output Details for gcc Compiler for Cortex-M3 CPU gcc, Cortex-M3 No Optimization C Code gcc, Cortex-M3, Size Optimization gcc, Cortex-M3, Speed Optimization // Calling a function // with no arguments LCD_Start(); ; do the function call bl LCD_Start ; same as for no optimization ; same as for no optimization // Calling a function with // one argument LCD_PrintInt8(128); ; R0 = first argument ; conditional flags are NOT ; updated by mov mov r0, #80 bl LCD_PrintInt8 ; R0 = first argument ; conditional flags ARE updated ; by movs movs r0, #80 bl LCD_PrintInt8 ; same as for size optimization // Calling a function with // two arguments LCD_Position(0, 2); ; R0 ; R1 mov mov bl ; R0 = first argument ; R1 = second argument movs r1, #2 movs r0, #0 bl LCD_Position ; same as for size optimization www.cypress.com = first argument = second argument r0, #0 r1, #2 LCD_Position Document No. 001-89610 Rev. *A 35 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization gcc, Cortex-M3 No Optimization C Code // For loop: void ForLoop(uint8 i) { for(i = 0; i < 10; i++) { LCD_PrintInt8(i); } } ; function prolog ; i is saved on the stack sub sp, sp, #16 add r7, sp, #0 mov r3, r0 strb r3, [r7, #15] ; i = mov strb b 0 r3, #0 r3, [r7, #15] .L2 ; do the function call with ; i as the argument in R0 .L3: ldrb r3, [r7, #15] uxth r3, r3 ; sign extend mov r0, r3 bl LCD_PrintInt8 gcc, Cortex-M3, Size Optimization gcc, Cortex-M3, Speed Optimization ; function prolog push {r4, lr} ; function prolog push {r3, lr} ; R4 = i movs r4, #0 ; unroll the loop ; do the function call ; 10 times ; i as the argument in R0 movs r0, #0 bl LCD_PrintInt8 .L2: ; do the function call with ; i as the argument in R0 mov r0, r4 ; sign extend adds r4, r4, #1 ; i++ uxtb r4, r4 bl LCD_PrintInt8 movs bl . . . ; check i not equal to 10 cmp r4, #10 bne .L2 ; function epilog pop {r4, pc} ; return movs r0, #9 pop {r3, lr} ; function returns back to ; caller of this function b LCD_PrintInt8 ; i++ ldrb r3, [r7, #15] add r3, r3, #1 strb r3, [r7, #15] ; check i 10, by comparing ; it with 9 .L2: ldrb r3, [r7, #15] cmp r3, #9 bls .L3 ; function epilog add r7, r7, #16 mov sp, r7 pop {r7, pc} ; return www.cypress.com Document No. 001-89610 Rev. *A r0, #1 LCD_PrintInt8 36 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization gcc, Cortex-M3 No Optimization C Code // While loop // i is type automatic, see // Accessing Automatic Variables // for details uint8 i = 0; ; prolog not shown ; i = 0 mov r3, #0 strb r3, [r7, #7] b .L5 while (i < 10) { LCD_PrintInt8(i); i++; } .L6: ; LCD_PrintInt8(i) ldrb r3, [r7, #7] uxth r3, r3 mov r0, r3 bl LCD_PrintInt8 ; i++ ldrb r3, [r7, #7] add r3, r3, #1 strb r3, [r7, #7] gcc, Cortex-M3, Size Optimization ; prolog not shown ; i = 0 movs r4, #0 .L6: mov r0, r4 adds r4, r4, #1 ; i++ uxtb r4, r4 ; LCD_PrintInt8(i) bl LCD_PrintInt8 ; check i not equal to 10 cmp r4, #10 bne .L6 gcc, Cortex-M3, Speed Optimization ; function prolog push {r3, lr} ; unroll the loop ; do the function call ; 10 times ; i as the argument in R0 movs r0, #0 bl LCD_PrintInt8 movs bl r0, #1 LCD_PrintInt8 . . . ; epilog not shown movs r0, #9 pop {r3, lr} ; function returns back to ; caller of this function b LCD_PrintInt8 ; no prolog ; if(j == 1) cmp r1, #1 beq .L10 ; same as for size optimization .L5: ; while(i 10) ldrb r3, [r7, #7] cmp r3, #9 bls .L6 ; epilog not shown // Conditional statement void Conditional(uint8 i, uint8 j) { if(j == 1) { LCD_PrintInt8(i); } else { LCD_PrintInt8(i + 1); } } ; prolog not shown ; if(j == 1) ldrb r3, [r7, #6] cmp r3, #1 bne .L5 ; LCD_PrintInt8(i) ldrb r3, [r7, #7] uxth r3, r3 mov r0, r3 bl LCD_PrintInt8 b .L4 ; LCD_PrintInt8(i) adds r0, r0, #1 uxtb r0, r0 .L10: ; function returns back to ; caller of this function b LCD_PrintInt8 .L5: ; LCD_PrintInt8(i + 1) ldrb r3, [r7, #7] uxth r3, r3 add r3, r3, #1 uxth r3, r3 mov r0, r3 bl LCD_PrintInt8 .L4: ; epilog not shown www.cypress.com Document No. 001-89610 Rev. *A 37 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization gcc, Cortex-M3 No Optimization C Code // Switch case statements void SwitchCase(uint8 j) { switch(j) { case 0: LCD_PrintInt8(1); break; case 1: LCD_PrintInt8(2); break; default: LCD_PrintInt8(0); break; } } gcc, Cortex-M3, Size Optimization ; prolog not shown ; switch(j) ldrb r3, [r7, #7] cmp r3, #0 beq .L9 cmp r3, #1 beq .L10 b .L12 ; no prolog ; switch(j) cbz r0, .L13 cmp r0, #1 bne .L15 .L9: mov bl b .L13: movs r0, #1 ; case 0 b .L16 ; case 0 r0, #1 LCD_PrintInt8 .L7 ; break .L10: mov bl b ; case 1 r0, #2 LCD_PrintInt8 .L9 ; break .L12: mov.w bl nop ; default r0, #0 LCD_PrintInt8 ; break movs b r0, #2 ; case 1 .L16 .L15: movs r0, #0 ; default .L16: ; no epilog b LCD_PrintInt8 gcc, Cortex-M3, Speed Optimization ; no prolog ; switch(j) cbnz r0, .L12 movs r0, #1 ; case 0 ; no epilog b LCD_PrintInt8 .L12: cmp r0, #1 beq .L13 movs r0, #0 ; default ; no epilog b LCD_PrintInt8 movs r0, #2 ; case 1 ; no epilog b LCD_PrintInt8 .L7: ; epilog not shown // Ternary operator void Ternary(uint8 i) { LCD_PrintInt8( (i == 1) ? 80 : 100); } ; prolog not shown ; check value of i ldrb r3, [r7, #7] cmp r3, #1 bne .L17 ; no prolog ; check value of i cmp r0, #1 .L17: mov r3, #100 ; “ite” stands for if-then; else instruction ; “ne” condition checks ; if the previous compare ; instruction has cleared the ; “equal to” flag ite ne .L18: mov r0, r3 bl LCD_PrintInt8 ; epilog not shown ; mov if the result of the ; previous “ite” instruction is ; “not equal” movne r0, #100 mov b r3, #80 .L18 ; same as for size optimization ; mov if the result of the ; previous “ite” instruction is ; “equal” moveq r0, #80 ; no epilog b LCD_PrintInt8 www.cypress.com Document No. 001-89610 Rev. *A 38 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization C Code gcc, Cortex-M3 No Optimization gcc, Cortex-M3, Size Optimization gcc, Cortex-M3, Speed Optimization // Addition operation int DoAdd(int x, int y) { return x + y; } ; prolog not shown ldr r2, [r7, #4] ldr r3, [r7, #0] adds r3, r2, r3 mov r0, r3 ; return value ; epilog not shown ; no prolog adds r0, r0, r1 bx lr ; return with result ; same as for size optimization // Subtraction operation int DoSub(int x, int y) { return x - y; } ; prolog not shown ldr r2, [r7, #4] ldr r3, [r7, #0] subs r3, r2, r3 mov r0, r3 ; return value ; epilog not shown ; no prolog subs r0, r0, r1 bx lr ; return with result ; same as for size optimization // Multiplication int DoMul(int x, int y) { return x * y; } ; prolog not shown ldr r3, [r7, #4] ldr r2, [r7, #0] mul r3, r2, r3 mov r0, r3 ; return value ; epilog not shown ; no prolog muls r0, r1, r0 bx lr ; return with result ; same as for size optimization // Division int DoDiv(int x, int y) { return x / y; } ; prolog not shown ldr r2, [r7, #4] ldr r3, [r7, #0] sdiv r3, r2, r3 mov r0, r3 ; return value ; epilog not shown ; no prolog sdiv r0, r0, r1 bx lr ; return with result ; same as for size optimization // Modulo operator int DoMod(int x, int y) { return x % y; } ; prolog not shown ldr r3, [r7, #4] ldr r2, [r7, #0] ; truncated quotient sdiv r2, r3, r2 ; quotient * divisor ldr r1, [r7,#0] mul r2, r1, r2 ; remainder = dividend ; (quotient * divisor) subs r3, r3, r2 mov r0, r3 ; return value ; epilog not shown ; no prolog sdiv r3, r0, r1 ; multiply and subtract instruction ; implements remainder = ; dividend - (quotient * divisor) mls r0, r3, r1, r0 bx lr ; return with result ; same as for size optimization www.cypress.com Document No. 001-89610 Rev. *A 39 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization C Code // Pointer void Pointer(uint8 x, uint8 *ptr) { *ptr = *ptr + x; ptr++; LCD_PrintInt8(*ptr); } gcc, Cortex-M3 No Optimization ; *ptr = *ptr + x ldr r3, [r7, #0] ; ptr ldrb r2, [r3, #0] ldrb r3, [r7, #7] ; x adds r3, r2, r3 uxtb r2, r3 ldr r3, [r7, #0] ; ptr strb r2, [r3, #0] gcc, Cortex-M3, Size Optimization ; *ptr = *ptr + x ldrb r3, [r1, #0] ; R1 = ptr adds r0, r0, r3 ; R0 = x strb r0, [r1, #0] gcc, Cortex-M3, Speed Optimization ; same as for size optimization ; ptr++ ; LCD_PrintInt8(*ptr) ldrb r0, [r1, #1] b LCD_PrintInt8 ; ptr++ ldr r3, [r7, #0] ; ptr add r3, r3, #1 str r3, [r7, #0] ; LCD_PrintInt8(*ptr) ldr r3, [r7, #0] ldrb r3, [r3, #0] mov r0, r3 bl LCD_PrintInt8 // Function pointer void FuncPtr(uint8 x, void *fptr(uint8)) { (*fptr)(x); } www.cypress.com ; (*fptr)(x) ldrb r2, [r7, #7] ; x ldr r3, [r7, #0] ; fptr mov r0, r2 ; in a blx instruction, the ; LS bit of the register ; must be 1 to keep the CPU ; in Thumb mode, or an ; exception occurs blx r3 Document No. 001-89610 Rev. *A ; (*fptr)(x) ; in a blx instruction, the LS ; bit of the register must be ; 1 to keep the CPU in Thumb ; mode, or an exception occurs blx r1 ; same as for size optimization 40 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization C Code // Packed structures struct FOO_P { uint8 membera; uint8 memberb; uint32 memberc; uint16 memberd; } __attribute__ ((packed)); extern struct FOO_P myfoo_p; void PackedStruct(void) { myfoo_p.membera = 5; myfoo_p.memberb = 10; myfoo_p.memberc = 15; myfoo_p.memberd = 20; } // unpacked structures struct FOO { uint8 membera; uint8 memberb; uint32 memberc; uint16 memberd; }; extern struct FOO myfoo; void PackedStruct(void) { myfoo.membera = 5; myfoo.memberb = 10; myfoo.memberc = 15; myfoo.memberd = 20; } www.cypress.com gcc, Cortex-M3 No Optimization gcc, Cortex-M3, Size Optimization ; membera movw r3, movt r3, mov r2, strb r2, ; memberb movw r3, movt r3, mov r2, strb r2, ; memberc movw r3, movt r3, mov r2, orr r2, strb r2, mov r2, strb r2, mov r2, strb r2, mov r2, strb r2, ; memberd movw r3, movt r3, mov r2, orr r2, strb r2, mov r2, strb r2, = 5 #:lower16:myfoo_p #:upper16:myfoo_p #5 [r3, #0] = 10 #:lower16:myfoo_p #:upper16:myfoo_p #10 [r3, #1] = 15 #:lower16:myfoo_p #:upper16:myfoo_p #0 r2, #15 [r3, #2] #0 [r3, #3] #0 [r3, #4] #0 [r3, #5] = 20 #:lower16:myfoo_p #:upper16:myfoo_p #0 r2, #20 [r3, #6] #0 [r3, #7] ldr movs movs strb strb movs movs movs strb strb strb strb strb strb bx ; membera movw r3, movt r3, mov r2, strb r2, ; memberb movw r3, movt r3, mov r2, strb r2, ; memberc movw r3, movt r3, mov r2, str r2, ; memberd movw r3, movt r3, mov r2, strh r2, = 5 #:lower16:myfoo #:upper16:myfoo #5 [r3, #0] = 10 #:lower16:myfoo #:upper16:myfoo #10 [r3, #1] = 15 #:lower16:myfoo #:upper16:myfoo #15 [r3, #4] = 20 #:lower16:myfoo #:upper16:myfoo #20 [r3, #8] ldr movs strb movs movs movs strb str strh bx Document No. 001-89610 Rev. *A .L28: .word .L31: .word r3, r2, r0, r2, r0, r2, r1, r0, r1, r2, r2, r2, r0, r2, lr [pc, #5 #10 [r3, [r3, #0 #15 #20 [r3, [r3, [r3, [r3, [r3, [r3, .L28] #0] ; membera = 5 #1] ; memberb = 10 #2] ; memberc = 15 #3] #4] #5] #6] ; memberd = 20 #7] myfoo_p r3, r2, r2, r0, r1, r2, r0, r1, r2, lr [pc, #5 [r3, #10 #15 #20 [r3, [r3, [r3, myfoo .L31] #0] ; membera = 5 #1] ; memberb = 10 #4] ; memberc = 15 #8] ; memberd = 20 gcc, Cortex-M3, Speed Optimization movw movt movs movs movs strb strb 10 movs movs strb 15 strb strb strb strb 20 strb bx r3, r3, r1, r0, r2, r1, r0, movw movt movs strb movs movs movs strb 10 str 15 strh 20 bx r3, r3, r2, r2, r0, r1, r2, r0, 41 #:lower16:myfoo_p #:upper16:myfoo_p #5 #10 #0 [r3, #0] ; membera = 5 [r3, #1] ; memberb = r1, #15 r0, #20 r1, [r3, #2] ; memberc = r2, r2, r2, r0, [r3, [r3, [r3, [r3, #3] #4] #5] #6] ; memberd = r2, [r3, #7] lr #:lower16:myfoo #:upper16:myfoo #5 [r3, #0] ; membera = 5 #10 #15 #20 [r3, #1] ; memberb = r1, [r3, #4] ; memberc = r2, [r3, #8] ; memberd = lr PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization A.2 Assembler Examples, gcc for Cortex-M0 Table 11 shows, for the gcc compiler for the Cortex-M0, examples of compiler output for different optimization options. The examples were extracted from the .lst files generated by the compiler. See Function Arguments for details on register usage and stack usage in compiler functions. Table 11. Compiler Output Details for gcc Compiler for Cortex-M0 CPU gcc, Cortex-M0, No Optimization C Code gcc, Cortex-M0, Size Optimization gcc, Cortex-M0, Speed Optimization // Calling a function // with no arguments LCD_Start(); ; do the function call bl LCD_Start ; same as for no optimization ; same as for no optimization // Calling a function with // one argument LCD_PrintInt8(128); ; R0 = first argument ; conditional flags are NOT ; updated by mov mov r0, #128 bl LCD_PrintInt8 ; same as for no optimization ; same as for no optimization // Calling a function with // two arguments LCD_Position(0, 2); ; R0 ; R1 mov mov bl ; same as for no optimization ; same as for no optimization www.cypress.com = first argument = second argument r1, #2 r0, #0 LCD_Position Document No. 001-89610 Rev. *A 42 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization gcc, Cortex-M0, No Optimization C Code // For loop: void ForLoop(uint8 i) { for(i = 0; i < 10; i++) { LCD_PrintInt8(i); } } ; function prolog ; i is saved on the stack push {r7, lr} sub sp, sp, #16 add r7, sp, #0 mov r2, r0 add r3, r7, #7 strb r2, [r3] ; i = mov add mov strb b 0 r3, r3, r2, r2, .L2 r7 r3, #15 #0 [r3] ; do the function call with ; i as the argument in R0 .L3: mov r3, r7 add r3, r3, #15 ldrb r3, [r3] mov r0, r3 bl LCD_PrintInt8 ; i++ mov add mov add ldrb add strb r3, r3, r2, r2, r2, r2, r2, ; check i ; it with .L2: mov r3, add r3, ldrb r3, cmp r3, bls .L3 gcc, Cortex-M0, Size Optimization gcc, Cortex-M0, Speed Optimization ; function prolog push {r4, lr} ; function prolog push {r3, lr} ; R4 = i mov r4, #0 ; unroll the loop ; do the function call ; 10 times ; i as the argument in R0 mov r0, #0 bl LCD_PrintInt8 .L2: ; do the function call with ; i as the argument in R0 mov r0, r4 ; sign extend add r4, r4, #1 ; i++ uxtb r4, r4 bl LCD_PrintInt8 mov bl . . . ; check i not equal to 10 cmp r4, #10 bne .L2 mov bl r0, #9 LCD_PrintInt8 pop pop {r3, pc} ; return {r4, pc} ; return r7 r3, #15 r7 r2, #15 [r2] r2, #1 [r3] 10, by comparing 9 r7 r3, #15 [r3] #9 ; function epilog mov sp, r7 add sp, sp, #16 pop {r7, pc} www.cypress.com Document No. 001-89610 Rev. *A r0, #1 LCD_PrintInt8 43 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization gcc, Cortex-M0, No Optimization C Code // While loop // i is type automatic, see // Accessing Automatic Variables // for details uint8 i = 0; while (i < 10) { LCD_PrintInt8(i); i++; } ; prolog not shown ; i = 0 add r3, r7, #7 mov r2, #0 strb r2, [r3] b .L5 .L6: ; LCD_PrintInt8(i) add r3, r7, #7 ldrb r3, [r3] mov r0, r3 bl LCD_PrintInt8 ; i++ add add ldrb add strb r3, r2, r2, r2, r2, r7, #7 r7, #7 [r2] r2, #1 [r3] gcc, Cortex-M0, Size Optimization ; prolog not shown ; i = 0 movs r4, #0 .L6: mov r0, r4 add r4, r4, #1 ; i++ uxtb r4, r4 ; LCD_PrintInt8(i) bl LCD_PrintInt8 ; check i not equal to 10 cmp r4, #10 bne .L6 gcc, Cortex-M0, Speed Optimization ; prolog not shown ; unroll the loop ; do the function call ; 10 times ; i as the argument in R0 mov r0, #0 bl LCD_PrintInt8 mov bl . . . mov bl ; epilog not shown ; epilog not shown Document No. 001-89610 Rev. *A r0, #9 LCD_PrintInt8 ; epilog not shown .L5: ; while(i 10) add r3, r7, #7 ldrb r3, [r3] cmp r3, #9 bls .L6 www.cypress.com r0, #1 LCD_PrintInt8 44 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization gcc, Cortex-M0, No Optimization C Code // Conditional statement void Conditional(uint8 i, uint8 j) { if(j == 1) { LCD_PrintInt8(i); } else { LCD_PrintInt8(i + 1); } } ; prolog not shown ; if(j == 1) add r3, r7, #6 ldrb r3, [r3] cmp r3, #1 bne .L8 ; LCD_PrintInt8(i) add r3, r7, #7 ldrb r3, [r3] mov r0, r3 bl LCD_PrintInt8 b .L7 gcc, Cortex-M0, Size Optimization ; prolog not shown ; if(j == 1) cmp r1, #1 beq .L11 gcc, Cortex-M0, Speed Optimization ; same as for size optimization ; LCD_PrintInt8(i) add r0, r0, #1 uxtb r0, r0 .L11: bl LCD_PrintInt8 ; epilog not shown .L8: ; LCD_PrintInt8(i + 1) add r3, r7, #7 ldrb r3, [r3] add r3, r3, #1 uxtb r3, r3 mov r0, r3 bl LCD_PrintInt8 .L7: ; epilog not shown // Switch case statements void SwitchCase(uint8 j) { switch(j) { case 0: LCD_PrintInt8(1); break; case 1: LCD_PrintInt8(2); break; default: LCD_PrintInt8(0); break; } } ; prolog not shown ; switch(j) add r3, r7, #7 ldrb r3, [r3] cmp r3, #0 beq .L12 cmp r3, #1 beq .L13 b .L15 ; prolog not shown ; switch(j) cmp r0, #0 beq .L14 cmp r0, #1 bne .L17 ; prolog not shown ; switch(j) cmp r0, #0 bne .L14 mov r0, #1 ; case 0 bl LCD_PrintInt8 mov b .L8: pop {r3, pc} ; return .L12 mov bl b .L14: mov r0, #1 ; case 0 b .L18 ; case 0 r0, #1 LCD_PrintInt8 .L10 ; break .L13: ; case 1 mov r0, #2 bl LCD_PrintInt8 b .L10 ; break r0, #2 ; case 1 .L18 .L17: mov r0, #0 ; default .L14: cmp r0, #1 beq .L15 mov r0, #0 ; default bl LCD_PrintInt8 b .L8 .L18: bl LCD_PrintInt8 ; epilog not shown mov bl b .L15: ; default mov r0, #0 bl LCD_PrintInt8 mov r8, r8 ; break - nop .L10: ; epilog not shown www.cypress.com Document No. 001-89610 Rev. *A 45 r0, #2 ; case 1 LCD_PrintInt8 .L8 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization gcc, Cortex-M0, No Optimization C Code // Ternary operator void Ternary(uint8 i) { LCD_PrintInt8( (i == 1) ? 80 : 100); } ; prolog not shown ; check value of i add r3, r7, #7 ldrb r3, [r3] cmp r3, #1 bne .L17 mov b r3, #80 .L18 gcc, Cortex-M0, Size Optimization ; prolog not shown mov r3, #100 cmp r0, #1 bne .L20 mov r3, #80 .L20: mov r0, r3 bl LCD_PrintInt8 ; epilog not shown .L17: mov r3, #100 gcc, Cortex-M0, Speed Optimization ; prolog not shown mov r3, #100 cmp r0, #1 beq .L19 .L17: mov r0, r3 bl LCD_PrintInt8 pop {r3, pc} ; return .L19: mov r3, #80 b .L17 .L18: mov r0, r3 bl LCD_PrintInt8 ; epilog not shown // Addition operation int DoAdd(int x, int y) { return x + y; } ; prolog ldr r2, ldr r3, add r3, mov r0, ; epilog not shown [r7, #4] [r7, #0] r2, r3 r3 ; return value not shown ; no prolog add r0, r0, r1 bx lr ; return with result ; same as for size optimization // Subtraction operation int DoSub(int x, int y) { return x - y; } ; prolog ldr r2, ldr r3, sub r3, mov r0, ; epilog not shown [r7, #4] [r7, #0] r2, r3 r3 ; return value not shown ; no prolog sub r0, r0, r1 bx lr ; return with result ; same as for size optimization // Multiplication int DoMul(int x, int y) { return x * y; } ; prolog ldr r3, ldr r2, mul r3, mov r0, ; epilog not shown [r7, #4] [r7, #0] r2 r3 ; return value not shown ; no prolog mul r0, r1 bx lr ; return with result ; same as for size optimization // Division int DoDiv(int x, int y) { return x / y; } ; prolog not shown ldr r0, [r7, #4] ldr r1, [r7, #0] bl __aeabi_idiv mov r3, r0 mov r0, r3 ; return value ; epilog not shown push {r3, lr} bl __aeabi_idiv ; return with result pop {r3, pc} ; same as for size optimization // Modulo operator int DoMod(int x, int y) { return x % y; } ; prolog not shown ldr r3, [r7, #4] mov r0, r3 ldr r1, [r7] bl __aeabi_idivmod mov r3, r1 mov r0, r3 ; return value ; epilog not shown push {r3, lr} bl __aeabi_idivmod ; return with result mov r0, r1 pop {r3, pc} ; same as for size optimization www.cypress.com Document No. 001-89610 Rev. *A 46 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization gcc, Cortex-M0, No Optimization C Code // Pointer void Pointer(uint8 x, uint8 *ptr) { *ptr = *ptr + x; ptr++; LCD_PrintInt8(*ptr); } ; *ptr = *ptr + x ldr r3, [r7] ; ptr ldrb r2, [r3] add r3, r7, #7 ; x ldrb r3, [r3] add r3, r2, r3 uxtb r2, r3 ldr r3, [r7] ; ptr strb r2, [r3] gcc, Cortex-M0, Size Optimization ; *ptr = *ptr + x ldrb r3, [r1] ; R1 = ptr add r0, r0, r3 ; R0 = x strb r0, [r1] gcc, Cortex-M0, Speed Optimization ; same as for size optimization ; ptr++ ; LCD_PrintInt8(*ptr) ldrb r0, [r1, #1] bl LCD_PrintInt8 ; ptr++ ldr r3, [r7] ; ptr add r3, r3, #1 str r3, [r7] ; LCD_PrintInt8(*ptr) ldr r3, [r7] ldrb r3, [r3] mov r0, r3 bl LCD_PrintInt8 // Function pointer void FuncPtr(uint8 x, void *fptr(uint8)) { (*fptr)(x); } www.cypress.com ; (*fptr)(x) add r3, r7, #7 ; x ldrb r2, [r3] ldr r3, [r7] ; fptr mov r0, r2 ; in a blx instruction, the ; LS bit of the register ; must be 1 to keep the CPU ; in Thumb mode, or an ; exception occurs blx r3 Document No. 001-89610 Rev. *A ; (*fptr)(x) ; in a blx instruction, the LS ; bit of the register must be ; 1 to keep the CPU in Thumb ; mode, or an exception occurs blx r1 ; same as for size optimization 47 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization gcc, Cortex-M0, No Optimization C Code // Packed structures struct FOO_P { uint8 membera; uint8 memberb; uint32 memberc; uint16 memberd; } __attribute__ ((packed)); extern struct FOO_P myfoo_p; void PackedStruct(void) { myfoo_p.membera = 5; myfoo_p.memberb = 10; myfoo_p.memberc = 15; myfoo_p.memberd = 20; } ; membera ldr r3, mov r2, strb r2, ; memberb ldr r3, mov r2, strb r2, ; memberc ldr r3, ldrb r1, mov r2, and r2, mov r1, orr r2, strb r2, ldrb r1, mov r2, and r2, strb r2, ldrb r1, mov r2, and r2, strb r2, ldrb r1, mov r2, and r2, strb r2, ; memberd ldr r3, ldrb r1, mov r2, and r2, mov r1, orr r2, strb r2, ldrb r1, mov r2, and r2, strb r2, .L32: .word www.cypress.com = 5 [pc, #5 [r3] = 10 [pc, #10 [r3, = 15 [pc, [r3, #0 r1 #15 r1 [r3, [r3, #0 r1 [r3, [r3, #0 r1 [r3, [r3, #0 r1 [r3, = 20 [pc, [r3, #0 r1 #20 r1 [r3, [r3, #0 r1 [r3, .L32] .L32] #1] .L32] #2] #2] #3] gcc, Cortex-M0, Size Optimization ldr mov mov strb strb mov mov mov strb strb strb strb strb strb bx .L30: .word r3, r2, r0, r2, r0, r2, r1, r0, r1, r2, r2, r2, r0, r2, lr [pc, #5 #10 [r3] [r3, #0 #15 #20 [r3, [r3, [r3, [r3, [r3, [r3, .L30] gcc, Cortex-M0, Speed Optimization ; same as for size optimization ; membera = 5 #1] ; memberb = 10 #2] ; memberc = 15 #3] #4] #5] #6] ; memberd = 20 #7] myfoo_p #3] #4] #4] #5] #5] .L32] #6] #6] #7] #7] myfoo_p Document No. 001-89610 Rev. *A 48 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization gcc, Cortex-M0, No Optimization C Code // unpacked structures struct FOO { uint8 membera; uint8 memberb; uint32 memberc; uint16 memberd; }; extern struct FOO myfoo; void PackedStruct(void) { myfoo.membera = 5; myfoo.memberb = 10; myfoo.memberc = 15; myfoo.memberd = 20; } A.3 ; membera ldr r3, mov r2, strb r2, ; memberb ldr r3, mov r2, strb r2, ; memberc ldr r3, mov r2, str r2, ; memberd ldr r3, mov r2, strh r2, .L35: .word = 5 [pc, #5 [r3] = 10 [pc, #10 [r3, = 15 [pc, #15 [r3, = 20 [pc, #20 [r3, .L35] .L35] #1] .L35] #4] gcc, Cortex-M0, Size Optimization ldr mov strb mov mov mov strb str strh bx .L33: .word r3, r2, r2, r0, r1, r2, r0, r1, r2, lr [pc, #5 [r3] #10 #15 #20 [r3, [r3, [r3, .L33] gcc, Cortex-M0, Speed Optimization ; same as for size optimization ; membera = 5 #1] ; memberb = 10 #4] ; memberc = 15 #8] ; memberd = 20 myfoo .L35] #8] myfoo Assembler Examples, MDK for Cortex-M3 Note: Table 12 shows, for the MDK compiler for the Cortex-M3, examples of compiler output for different optimization options. Since the free evaluation version of MDK, MDK-Lite, does not include assembler in the .lst file, the examples were extracted from the assembler-level debug window in PSoC Creator. See Function Arguments for details on register usage and stack usage in compiler functions. Table 12. Compiler Output Details for MDK Compiler for Cortex-M3 CPU C Code MDK, Cortex-M3, No Optimization MDK, Cortex-M3, Size Optimization MDK, Cortex-M3, Speed Optimization // Calling a function // with no arguments LCD_Start(); ; do the function call bl LCD_Start ; same as for no optimization ; same as for no optimization // Calling a function with // one argument LCD_PrintInt8(128); ; R0 = first argument movs r0, #80 bl LCD_PrintInt8 ; same as for no optimization ; same as for no optimization // Calling a function with // two arguments LCD_Position(0, 2); ; R0 = first argument ; R1 = second argument movs r1, #2 movs r0, #0 bl LCD_Position ; same as for no optimization ; same as for no optimization www.cypress.com Document No. 001-89610 Rev. *A 49 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization MDK, Cortex-M3, No Optimization C Code // For loop: void ForLoop(uint8 i) { for(i = 0; i < 10; i++) { LCD_PrintInt8(i); } } ; prolog push {r4, lr} mov r4, r0 r4, #0 ; i = 0 <ForLoop+0x12> <ForLoop+0x8>: ; do the function call with ; i as the argument in R0 mov r0, r4 bl LCD_PrintInt8 r0, r4, #1 ; i++ r4, r0 ; sign extend <ForLoop+0x12>: cmp r4, #a ; i < 10 blt.n <ForLoop+0x8> pop while (i < 10) { LCD_PrintInt8(i); i++; } r4, #0 ; same as for size optimization ; i = 0 adds uxtb cmp bcc.n r4, r4, #1 ; i++ r4, r4 ; sign extend r4, #a ; i < 10 <ForLoop+0x4> pop {r4, pc} ; return {r4, pc} ; return ; prolog push {r4, lr} movs b.n movs r4, #0 ; i = 0 <WhileLoop+0x10> <WhileLoop+0x6>: ; do the function call with ; i as the argument in R0 mov r0, r4 bl LCD_PrintInt8 r0, r4, #1 ; i++ r4, r0 ; sign extend <WhileLoop+0x10>: cmp r4, #a ; i < 10 blt.n <WhileLoop+0x6> pop MDK, Cortex-M3, Speed Optimization <ForLoop+0x4>: ; do the function call with ; i as the argument in R0 mov r0, r4 bl LCD_PrintInt8 ; prolog push {r4, lr} adds uxtb www.cypress.com ; prolog push {r4, lr} movs movs b.n adds uxtb // While loop // i is type automatic, see // Accessing Automatic Variables // for details uint8 i = 0; MDK, Cortex-M3, Size Optimization r4, #0 ; same as for size optimization ; i = 0 <WhileLoop+0x4>: ; do the function call with ; i as the argument in R0 mov r0, r4 bl LCD_PrintInt8 adds uxtb r4, r4, #1 ; i++ r4, r4 ; sign extend cmp bcc.n r4, #a ; i < 10 <WhileLoop+0x4> pop {r4, pc}; return {r4, pc}; return Document No. 001-89610 Rev. *A 50 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization MDK, Cortex-M3, No Optimization C Code // Conditional statement void Conditional(uint8 i, uint8 j) { if(j == 1) { LCD_PrintInt8(i); } else { LCD_PrintInt8(i + 1); } } ; prolog push {r4, r5, r6, lr} mov mov cmp bne.n r4, r0 r5, r1 r5, #1 ; j == 1 <Conditional+0x12> mov bl b.n r0, r4 LCD_PrintInt8 <Conditional+0x1a> MDK, Cortex-M3, Size Optimization ; no prolog ; if(j == 1) cmp r1, #1 beq.n <Conditional+0x8> adds uxtb MDK, Cortex-M3, Speed Optimization ; same as for size optimization r0, r0, #1 ; i + 1 r0, r0 <Conditional+0x8>: ; function returns back to ; caller of this function b.w LCD_PrintInt8 <Conditional+0x12>: ; LCD_PrintInt8(i + 1) adds r1, r4, #1 uxtb r0, r1 ; sign extend bl LCD_PrintInt8 <Conditional+0x1a>: ; return pop {r4, r5, r6, pc} // Switch case statements void SwitchCase(uint8 j) { switch(j) { case 0: LCD_PrintInt8(1); break; case 1: LCD_PrintInt8(2); break; default: LCD_PrintInt8(0); break; } } ; prolog push {r4, lr} ; switch(j) mov r4, r0 cbz r4, <SwitchCase+0xc> cmp r4, #1 bne.n <SwitchCase+0x1c> b.n <SwitchCase+0x14> <SwitchCase+0xc>: movs r0, #1 ; case 0 bl LCD_PrintInt8 b.n <SwitchCase+0x24> <SwitchCase+0x14>: movs r0, #2 ; case 1 bl LCD_PrintInt8 b.n <SwitchCase+0x24> ; prolog not shown ; switch(j) cbz r0, <SwitchCase+0xa> cmp r0, #1 beq.n <SwitchCase+0xe> movs b.n ; same as for size optimization r0, #0 ; default <SwitchCase+0x10> <SwitchCase+0xa>: movs r0, #1 ; case 0 b.n <SwitchCase+0x10> <SwitchCase+0xe>: movs r0, #2 ; case 1 <SwitchCase+0x10>: ; function returns back to ; caller of this function b.w LCD_PrintInt8 <SwitchCase+0x1c>: movs r0, #0 ; default bl LCD_PrintInt8 nop <SwitchCase+0x24>: nop pop {r4, pc} ; return www.cypress.com Document No. 001-89610 Rev. *A 51 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization MDK, Cortex-M3, No Optimization C Code // Ternary operator void Ternary(uint8 i) { LCD_PrintInt8( (i == 1) ? 80 : 100); } ; prolog push {r4, lr} mov cmp bne.n r4, r0 ; i == 1 r4, #1 <Ternary+0xc> movs b.n r1, #50 <Ternary+0xe> <Ternary+0xc>: movs r1, #64 MDK, Cortex-M3, Size Optimization ; no prolog cmp r0, #1 beq.n <Ternary+0xa> MDK, Cortex-M3, Speed Optimization ; same as for size optimization <Ternary+0x6>: movs r0, #64 ; 0x64 ; function returns back to ; caller of this function b.w LCD_PrintInt8 <Ternary+0xa>: movs r0, #50 ; 0x50 b.n <Ternary+0x6> <Ternary+0xe>: mov r0, r1 bl LCD_PrintInt8 pop {r4, pc} ; return // Addition operation int DoAdd(int x, int y) { return x + y; } ; no prolog mov r2, r0 adds r0, r2, r1 bx lr ; return with result ; no prolog add r0, r1 bx lr ; return with result ; same as for size optimization // Subtraction operation int DoSub(int x, int y) { return x - y; } ; no prolog mov r2, r0 subs r0, r2, r1 bx lr ; return with result ; no prolog subs r0, r0, r1 bx lr ; return with result ; same as for size optimization // Multiplication int DoMul(int x, int y) { return x * y; } ; no prolog mov r2, r0 mul.w r0, r2, r1 bx lr ; return with result ; no prolog muls r0, r1 bx lr ; return with result ; same as for size optimization // Division int DoDiv(int x, int y) { return x / y; } ; no prolog mov r2, r0 sdiv r0, r2, r1 bx lr ; return with result ; no prolog sdiv r0, r0, r1 bx lr ; return with result ; same as for size optimization // Modulo operator int DoMod(int x, int y) { return x % y; } ; no prolog mov r2, r0 ; truncated quotient sdiv r0, r2, r1 ; multiply and subtract instruction ; implements remainder = ; dividend - (quotient * divisor) mls r0, r1, r0, r2 bx lr ; return with result ; no prolog ; truncated quotient sdiv r2, r0, r1 ; multiply and subtract instruction ; implements remainder = ; dividend - (quotient * divisor) mls r0, r1, r2, r0 bx lr ; return with result ; same as for size optimization www.cypress.com Document No. 001-89610 Rev. *A 52 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization MDK, Cortex-M3, No Optimization C Code // Pointer void Pointer(uint8 x, uint8 *ptr) { *ptr = *ptr + x; ptr++; LCD_PrintInt8(*ptr); } // Function pointer void FuncPtr(uint8 x, void *fptr(uint8)) { (*fptr)(x); } ; prolog push {r4, r5, r6, lr} mov r5, r0 mov r4, r1 ; *ptr = *ptr + x; ldrb r0, [r4, #0] add r0, r5 strb r0, [r4, #0] adds r4, r4, #1 ; ptr++; ldrb bl r0, [r4, #0] LCD_PrintInt8 pop {r4, r5, r6, pc} ; return ; prolog push {r4, r5, r6, lr} mov r5, r0 mov r4, r1 ; (*fptr)(x) mov r0, r5 ; in a blx instruction, the ; LS bit of the register ; must be 1 to keep the CPU ; in Thumb mode, or an ; exception occurs blx r4 pop // Packed structures struct FOO_P { uint8 membera; uint8 memberb; uint32 memberc; uint16 memberd; } __attribute__ ((packed)); extern struct FOO_P myfoo_p; void PackedStruct(void) { myfoo_p.membera = 5; myfoo_p.memberb = 10; myfoo_p.memberc = 15; myfoo_p.memberd = 20; } www.cypress.com MDK, Cortex-M3, Size Optimization ; no prolog ; *ptr = *ptr + x ldrb r2, [r1, #0] ; R1 = ptr add r0, r2 ; R0 = x strb r0, [r1, #0] MDK, Cortex-M3, Speed Optimization ; same as for size optimization ; ptr++ ; LCD_PrintInt8(*ptr) ldrb r0, [r1, #1] ; function returns back to ; caller of this function b.w LCD_PrintInt8 ; ; ; ; ; (*fptr)(x) in a bx instruction, the LS bit of the register must be 1 to keep the CPU in Thumb mode, or an exception occurs ; same as for size optimization ; function returns back to ; caller of this function bx r1 {r4, r5, r6, pc} ; return ; no prolog ; membera = 5 movs r0, #5 ldr r1, [pc, #14] strb r0, [r1, #0] ; memberb = 10 movs r0, #a strb r0, [r1, #1] ; memberc = 15 movs r0, #f ; word unaligned access str.w r0, [r1, #2] ; memberd = 20 movs r0, #14 strh r0, [r1, #6] bx lr ; return ; no prolog ldr r0, [pc, #14] ; membera = 5 movs r1, #5 strb r1, [r0, #0] ; memberb = 10 movs r1, #a strb r1, [r0, #1] ; memberc = 15 movs r1, #f ; word unaligned access str.w r1, [r0, #2] ; memberd = 20 movs r1, #14 strh r1, [r0, #6] bx lr ; return .word .word &myfoo_p Document No. 001-89610 Rev. *A ; same as for size optimization &myfoo_p 53 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization MDK, Cortex-M3, No Optimization C Code // unpacked structures struct FOO { uint8 membera; uint8 memberb; uint32 memberc; uint16 memberd; }; extern struct FOO myfoo; void PackedStruct(void) { myfoo.membera = 5; myfoo.memberb = 10; myfoo.memberc = 15; myfoo.memberd = 20; } A.4 MDK, Cortex-M3, Size Optimization ; no prolog ; membera = 5 movs r0, #5 ldr r1, [pc, #10] strb r0, [r1, #0] ; memberb = 10 movs r0, #a strb r0, [r1, #1] ; memberc = 15 movs r0, #f str r0, [r1, #4] ; memberd = 20 movs r0, #14 strh r0, [r1, #8] bx lr ; return ; no prolog ldr r0, [pc, #10] ; membera = 5 movs r1, #5 strb r1, [r0, #0] ; memberb = 10 movs r1, #a strb r1, [r0, #1] ; memberc = 15 movs r1, #f str r1, [r0, #4] ; memberd = 20 movs r1, #14 strh r1, [r0, #8] bx lr ; return .word .word &myfoo MDK, Cortex-M3, Speed Optimization ; same as for size optimization &myfoo Assembler Examples, MDK for Cortex-M0 Table 13 shows, for the MDK compiler for the Cortex-M0, examples of compiler output for different optimization options. Since the free evaluation version of MDK does not produce a usable .lst file, the examples were extracted from the assembler-level debug window in PSoC Creator. See Function Arguments for details on register usage and stack usage in compiler functions. Table 13. Compiler Output Details for MDK Compiler for Cortex-M0 CPU C Code MDK, Cortex-M0, No Optimization MDK, Cortex-M0, Size Optimization MDK, Cortex-M0, Speed Optimization // Calling a function // with no arguments LCD_Start(); ; do the function call bl LCD_Start ; same as for no optimization ; same as for no optimization // Calling a function with // one argument LCD_PrintInt8(128); ; R0 = first argument movs r0, #80 bl LCD_PrintInt8 ; same as for no optimization ; same as for no optimization // Calling a function with // two arguments LCD_Position(0, 2); ; R0 = first argument ; R1 = second argument movs r1, #2 movs r0, #0 bl LCD_Position ; same as for no optimization ; same as for no optimization www.cypress.com Document No. 001-89610 Rev. *A 54 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization MDK, Cortex-M0, No Optimization C Code // For loop: void ForLoop(uint8 i) { for(i = 0; i < 10; i++) { LCD_PrintInt8(i); } } ; prolog push {r4, lr} mov r4, r0 r4, #0 ; i = 0 <ForLoop+0x12> <ForLoop+0x8>: ; do the function call with ; i as the argument in R0 mov r0, r4 bl LCD_PrintInt8 r0, r4, #1 ; i++ r4, r0 ; sign extend <ForLoop+0x12>: cmp r4, #a ; i < 10 blt.n <ForLoop+0x8> pop while (i < 10) { LCD_PrintInt8(i); i++; } r4, #0 ; same as for size optimization ; i = 0 <ForLoop+0x4>: ; do the function call with ; i as the argument in R0 mov r0, r4 bl LCD_PrintInt8 adds uxtb cmp bcc.n r4, r4, #1 ; i++ r4, r4 ; sign extend r4, #a ; i < 10 <ForLoop+0x4> pop {r4, pc} ; return ; prolog push {r4, lr} ; prolog push {r4, lr} movs b.n movs r4, #0 ; i = 0 <WhileLoop+0x10> <WhileLoop+0x6>: ; do the function call with ; i as the argument in R0 mov r0, r4 bl LCD_PrintInt8 adds uxtb pop MDK, Cortex-M0, Speed Optimization {r4, pc} ; return r0, r4, #1 ; i++ r4, r0 ; sign extend <WhileLoop+0x10>: cmp r4, #a ; i < 10 blt.n <WhileLoop+0x6> www.cypress.com ; prolog push {r4, lr} movs movs b.n adds uxtb // While loop // i is type automatic, see // Accessing Automatic Variables // for details uint8 i = 0; MDK, Cortex-M0, Size Optimization r4, #0 ; same as for size optimization ; i = 0 <WhileLoop+0x4>: ; do the function call with ; i as the argument in R0 mov r0, r4 bl LCD_PrintInt8 adds uxtb r4, r4, #1 ; i++ r4, r4 ; sign extend cmp bcc.n r4, #a ; i < 10 <WhileLoop+0x4> pop {r4, pc}; return {r4, pc}; return Document No. 001-89610 Rev. *A 55 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization MDK, Cortex-M0, No Optimization C Code // Conditional statement void Conditional(uint8 i, uint8 j) { if(j == 1) { LCD_PrintInt8(i); } else { LCD_PrintInt8(i + 1); } } MDK, Cortex-M0, Size Optimization ; prolog push {r4, r5, r6, lr} ; prolog push {r4, lr} mov mov cmp bne.n r4, r0 r5, r1 r5, #1 ; j == 1 <Conditional+0x12> ; if(j == 1) cmp r1, #1 beq.n <Conditional+0xa> mov bl b.n r0, r4 LCD_PrintInt8 <Conditional+0x1a> <Conditional+0x12>: ; LCD_PrintInt8(i + 1) adds r1, r4, #1 uxtb r0, r1 ; sign extend bl LCD_PrintInt8 adds uxtb MDK, Cortex-M0, Speed Optimization ; same as for size optimization r0, r0, #1 ; i + 1 r0, r0 <Conditional+0xa>: bl LCD_PrintInt8 ; return pop {r4, pc} <Conditional+0x1a>: ; return pop {r4, r5, r6, pc} // Switch case statements void SwitchCase(uint8 j) { switch(j) { case 0: LCD_PrintInt8(1); break; case 1: LCD_PrintInt8(2); break; default: LCD_PrintInt8(0); break; } } ; prolog push {r4, lr} ; prolog push {r4, lr} ; switch(j) mov r4, r0 cmp r4, #0 beq.n <SwitchCase+0xe> cmp r4, #1 bne.n <SwitchCase+0x1e> b.n <SwitchCase+0x16> ; switch(j) cmp r0, #0 beq.n <SwitchCase+0xe> cmp r0, #1 beq.n <SwitchCase+0x12> <SwitchCase+0xe>: movs r0, #1 ; case 0 bl LCD_PrintInt8 b.n <SwitchCase+0x26> <SwitchCase+0x14>: movs r0, #2 ; case 1 bl LCD_PrintInt8 b.n <SwitchCase+0x26> <SwitchCase+0x1e>: movs r0, #0 ; default bl LCD_PrintInt8 nop movs b.n ; same as for size optimization r0, #0 ; default <SwitchCase+0x14> <SwitchCase+0xe>: movs r0, #1 ; case 0 b.n <SwitchCase+0x14> <SwitchCase+0x12>: movs r0, #2 ; case 1 <SwitchCase+0x14>: bl LCD_PrintInt8 pop {r4, pc} ; return <SwitchCase+0x26>: nop pop {r4, pc} ; return www.cypress.com Document No. 001-89610 Rev. *A 56 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization MDK, Cortex-M0, No Optimization C Code // Ternary operator void Ternary(uint8 i) { LCD_PrintInt8( (i == 1) ? 80 : 100); } MDK, Cortex-M0, Size Optimization ; prolog push {r4, lr} ; prolog push {r4, lr} mov cmp bne.n r4, r0 ; i == 1 r4, #1 <Ternary+0xc> cmp beq.n r0, #1 ; i == 1 <Ternary+0xe> movs r0, #64 movs b.n r1, #50 <Ternary+0xe> <Ternary+0xc>: movs r1, #64 <Ternary+0xe>: mov r0, r1 bl LCD_PrintInt8 pop MDK, Cortex-M0, Speed Optimization ; same as for size optimization <Ternary+0x8>: bl LCD_PrintInt8 pop {r4, pc} ; return <Ternary+0xe>: movs r0, #50 b.n <Ternary+0x8> {r4, pc} ; return // Addition operation int DoAdd(int x, int y) { return x + y; } ; no prolog mov r2, r0 adds r0, r2, r1 bx lr ; return with result ; no prolog adds r0, r1 bx lr ; return with result ; same as for size optimization // Subtraction operation int DoSub(int x, int y) { return x - y; } ; no prolog mov r2, r0 subs r0, r2, r1 bx lr ; return with result ; no prolog subs r0, r0, r1 bx lr ; return with result ; same as for size optimization // Multiplication int DoMul(int x, int y) { return x * y; } ; no prolog mov r2, r0 muls r0, r2, r1 bx lr ; return with result ; no prolog muls r0, r1 bx lr ; return with result ; same as for size optimization // Division int DoDiv(int x, in`t y) { return x / y; } ; prolog push {r4, r5, r6, lr} mov r4, r0 mov r5, r1 mov r1, r5 mov r0, r4 bl __aeabi_idiv ; return with result pop {r4, r5, r6, pc} ; prolog push {r4, lr} bl __aeabi_idiv ; return with result pop {r4, pc} ; same as for size optimization www.cypress.com Document No. 001-89610 Rev. *A 57 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization MDK, Cortex-M0, No Optimization C Code MDK, Cortex-M0, Size Optimization MDK, Cortex-M0, Speed Optimization // Modulo operator int DoMod(int x, int y) { return x % y; } ; prolog push {r4, r5, r6, lr} mov r4, r0 mov r5, r1 mov r1, r5 mov r0, r4 bl __aeabi_idiv ; return with result mov r0, r1 pop {r4, r5, r6, pc} ; prolog push {r4, lr} bl __aeabi_idiv ; return with result mov r0, r1 pop {r4, pc} ; same as for size optimization // Pointer void Pointer(uint8 x, uint8 *ptr) { *ptr = *ptr + x; ptr++; LCD_PrintInt8(*ptr); } ; prolog push {r4, r5, r6, lr} mov r5, r0 mov r4, r1 ; prolog push {r4, lr} ; same as for size optimization // Function pointer void FuncPtr(uint8 x, void *fptr(uint8)) { (*fptr)(x); } ; *ptr = *ptr + x; ldrb r0, [r4, #0] adds r0, r0, r5 strb r0, [r4, #0] adds r4, r4, #1 ; ptr++; ldrb bl r0, [r4, #0] LCD_PrintInt8 pop {r4, r5, r6, pc} ; return ; prolog push {r4, r5, r6, lr} mov r5, r0 mov r4, r1 ; (*fptr)(x) mov r0, r5 ; in a blx instruction, the ; LS bit of the register ; must be 1 to keep the CPU ; in Thumb mode, or an ; exception occurs blx r4 pop www.cypress.com ; *ptr = *ptr + x ldrb r2, [r1, #0] ; R1 = ptr adds r0, r2, r0 ; R0 = x strb r0, [r1, #0] ; ptr++ ; LCD_PrintInt8(*ptr) ldrb r0, [r1, #1] bl LCD_PrintInt8 pop ; ; ; ; ; {r4, pc} ; return (*fptr)(x) in a bx instruction, the LS bit of the register must be 1 to keep the CPU in Thumb mode, or an exception occurs ; same as for size optimization ; function returns back to ; caller of this function bx r1 {r4, r5, r6, pc} ; return Document No. 001-89610 Rev. *A 58 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization MDK, Cortex-M0, No Optimization C Code // Packed structures struct FOO_P { uint8 membera; uint8 memberb; uint32 memberc; uint16 memberd; } __attribute__ ((packed)); extern struct FOO_P myfoo_p; void PackedStruct(void) { myfoo_p.membera = 5; myfoo_p.memberb = 10; myfoo_p.memberc = 15; myfoo_p.memberd = 20; } ; prolog push {r4, lr} ; membera = 5 movs r0, #5 ldr r1, [pc, #18] strb r0, [r1, #0] ; memberb = 10 movs r0, #a strb r0, [r1, #1] ; memberc = 15 adds r1, r1, #2 movs r0, #f bl __aeabi_uwrite4 ; memberd = 20 movs r1, #14 ldr r0, [pc,#8] strb r1, [r0, #6] movs r1, #0 strb r1, [r0, #7] pop {r4, pc} ; return MDK, Cortex-M0, Size Optimization ; prolog push {r4, lr} ldr r4, [pc, #18] ; membera = 5 movs r0, #5 strb r0, [r4, #0] ; memberb = 10 movs r0, #a strb r0, [r4, #1] ; memberc = 15 adds r1, r4, #2 movs r0, #f bl __aeabi_uwrite4 ; memberd = 20 movs r0, #14 strb r1, [r4, #6] movs r0, #0 strb r1, [r4, #7] pop {r4, pc} ; return .word .word // unpacked structures struct FOO { uint8 membera; uint8 memberb; uint32 memberc; uint16 memberd; }; extern struct FOO myfoo; void PackedStruct(void) { myfoo.membera = 5; myfoo.memberb = 10; myfoo.memberc = 15; myfoo.memberd = 20; } www.cypress.com ; same as for size optimization &myfoo_p &myfoo_p ; no prolog ; membera = 5 movs r0, #5 ldr r1, [pc, #10] strb r0, [r1, #0] ; memberb = 10 movs r0, #a strb r0, [r1, #1] ; memberc = 15 movs r0, #f str r0, [r1, #4] ; memberd = 20 movs r0, #14 strh r0, [r1, #8] bx lr ; return .word MDK, Cortex-M0, Speed Optimization &myfoo Document No. 001-89610 Rev. *A ; same as for no optimization ; no prolog ldr r0, [pc, #10] ; membera = 5 movs r1, #5 strb r1, [r0, #0] ; memberb = 10 movs r1, #a strb r1, [r0, #1] ; memberc = 15 movs r1, #f str r1, [r0, #4] ; memberd = 20 movs r1, #14 strh r1, [r0, #8] bx lr ; return .word 59 &myfoo PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization A.5 Compiler Test Program The following C code was used to generate the compiler output in the previous tables. It compiles for PSoC 4 and PSoC 5LP, for gcc and MDK, with no optimization and with size and speed optimization. It can be added to a PSoC Creator project; the following must also be done in the project: Add a Character LCD Component to the project schematic, and rename it to “LCD”. For PSoC 4, reduce the heap and stack size settings for these lower-memory parts. This is done in the DesignWide Resource (DWR) window, System tab. Values of 0x100 and 0x400, for heap size and stack size respectively, are usually appropriate. The code is in two files, main.c and test.c. This is main.c: #include <project.h> extern void ForLoop(uint8); extern void WhileLoop(void); extern void Conditional(uint8, uint8); extern void SwitchCase(uint8); extern void Ternary(uint8); extern int DoAdd(int, int); extern int DoSub(int, int); extern int DoMul(int, int); extern int DoDiv(int, int); extern int DoMod(int, int); extern void Pointer(uint8, uint8 *); extern void FuncPtr(uint8, void (*)(uint8)); extern void PackedStruct(void); extern void UnpackedStruct(void); struct FOO /* structures are unpacked by default */ { uint8 membera; uint8 memberb; uint32 memberc; uint16 memberd; }; struct FOO_P /* packed structure */ { uint8 membera; uint8 memberb; uint32 memberc; uint16 memberd; } __attribute__ ((packed)); uint8 myData = 6; struct FOO_P myfoo_p; struct FOO myfoo; int main() { /* Place your initialization/startup code here (e.g. MyInst_Start()) */ LCD_Start(); /* CyGlobalIntEnable; */ /* Uncomment this line to enable global interrupts. */ for(;;) { LCD_PrintInt8(128); LCD_Position(0, 2); ForLoop(9); WhileLoop(); www.cypress.com Document No. 001-89610 Rev. *A 60 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization Conditional(3, 4); SwitchCase(4); Ternary(5); LCD_PrintNumber((uint16)DoAdd(5, LCD_PrintNumber((uint16)DoSub(5, LCD_PrintNumber((uint16)DoMul(5, LCD_PrintNumber((uint16)DoDiv(5, LCD_PrintNumber((uint16)DoMod(5, Pointer(4, &myData); FuncPtr(3, &LCD_PrintInt8); PackedStruct(); UnpackedStruct(); } /* end of for(;;) */ } /* end of main() */ 4)); 4)); 4)); 4)); 4)); And this is test.c: #include <project.h> struct FOO /* structures are unpacked by default */ { uint8 membera; uint8 memberb; uint32 memberc; uint16 memberd; }; struct FOO_P /* packed structure */ { uint8 membera; uint8 memberb; uint32 memberc; uint16 memberd; } __attribute__ ((packed)); extern struct FOO_P myfoo_p; extern struct FOO myfoo; void ForLoop(uint8 i) { for(i = 0; i < 10; i++) { LCD_PrintInt8(i); } } void WhileLoop(void) { uint8 i = 0; while(i < 10) { LCD_PrintInt8(i); i++; } } void Conditional(uint8 i, uint8 j) { if(j == 1) { www.cypress.com Document No. 001-89610 Rev. *A 61 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization LCD_PrintInt8(i); } else { LCD_PrintInt8(i + 1); } } void SwitchCase(uint8 j) { switch(j) { case 0: LCD_PrintInt8(1); break; case 1: LCD_PrintInt8(2); break; default: LCD_PrintInt8(0); break; } } void Ternary(uint8 i) { LCD_PrintInt8((i == 1) ? 80 : 100); } int DoAdd(int x, int y) { return x + y; } int DoSub(int x, int y) { return x - y; } int DoMul(int x, int y) { return x * y; } int DoDiv(int x, int y) { return x / y; } int DoMod(int x, int y) { return x % y; } void Pointer(uint8 x, uint8 *ptr) { *ptr = *ptr + x; ptr++; LCD_PrintInt8(*ptr); www.cypress.com Document No. 001-89610 Rev. *A 62 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization } void FuncPtr(uint8 x, void *fptr(uint8)) { (*fptr)(x); } void PackedStruct(void) { myfoo_p.membera = 5; myfoo_p.memberb = 10; myfoo_p.memberc = 15; myfoo_p.memberd = 20; } void UnpackedStruct(void) { myfoo.membera = 5; myfoo.memberb = 10; myfoo.memberc = 15; myfoo.memberd = 20; } www.cypress.com Document No. 001-89610 Rev. *A 63 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization Document History ® Document Title: PSoC 4 and PSoC 5LP ARM Cortex Code Optimization - AN89610 Document Number: 001-89610 Revision ECN Orig. of Change Submission Date Description of Change ** 4275133 MKEA 02/07/2014 New application note *A 4994599 MKEA 10/29/2015 Clarified that PSoC 5LP cannot execute code from an 8-bit EMIF memory. Added a reference to AN90799, PSoC 4 Interrupts. www.cypress.com Document No. 001-89610 Rev. *A 64 PSoC® 4 and PSoC 5LP ARM® Cortex® Code Optimization Worldwide Sales and Design Support Cypress maintains a worldwide network of offices, solution centers, manufacturer‟s representatives, and distributors. To find the office closest to you, visit us at Cypress Locations. Products PSoC Solutions Automotive cypress.com/go/automotive psoc.cypress.com/solutions Clocks & Buffers cypress.com/go/clocks PSoC 1 | PSoC 3 | PSoC 4 | PSoC 5LP Interface cypress.com/go/interface Cypress Developer Community Lighting & Power Control cypress.com/go/powerpsoc Memory cypress.com/go/memory PSoC cypress.com/go/psoc Touch Sensing cypress.com/go/touch USB Controllers cypress.com/go/usb Wireless/RF cypress.com/go/wireless Community | Forums | Blogs | Video | Training Technical Support cypress.com/go/support All other trademarks or registered trademarks referenced herein are the property of their respective owners. Cypress Semiconductor 198 Champion Court San Jose, CA 95134-1709 Phone Fax Website : 408-943-2600 : 408-943-4730 : www.cypress.com © Cypress Semiconductor Corporation, 2014-2015. The information contained herein is subject to change without notice. Cypress Semiconductor Corporation assumes no responsibility for the use of any circuitry other than circuitry embodied in a Cypress product. Nor does it convey or imply any license under patent or other rights. Cypress products are not warranted nor intended to be used for medical, life support, life saving, critical control or safety applications, unless pursuant to an express written agreement with Cypress. Furthermore, Cypress does not authorize its products for use as critical components in life-support systems where a malfunction or failure may reasonably be expected to result in significant injury to the user. The inclusion of Cypress products in life-support systems application implies that the manufacturer assumes all risk of such use and in doing so indemnifies Cypress against all charges. This Source Code (software and/or firmware) is owned by Cypress Semiconductor Corporation (Cypress) and is protected by and subject to worldwide patent protection (United States and foreign), United States copyright laws and international treaty provisions. Cypress hereby grants to licensee a personal, non-exclusive, non-transferable license to copy, use, modify, create derivative works of, and compile the Cypress Source Code and derivative works for the sole purpose of creating custom software and or firmware in support of licensee product to be used only in conjunction with a Cypress integrated circuit as specified in the applicable agreement. Any reproduction, modification, translation, compilation, or representation of this Source Code except as specified above is prohibited without the express written permission of Cypress. Disclaimer: CYPRESS MAKES NO WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, WITH REGARD TO THIS MATERIAL, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. Cypress reserves the right to make changes without further notice to the materials described herein. Cypress does not assume any liability arising out of the application or use of any product or circuit described herein. Cypress does not authorize its products for use as critical components in life-support systems where a malfunction or failure may reasonably be expected to result in significant injury to the user. The inclusion of Cypress‟ product in a life-support systems application implies that the manufacturer assumes all risk of such use and in doing so indemnifies Cypress against all charges. Use may be limited by and subject to the applicable Cypress software license agreement. www.cypress.com Document No. 001-89610 Rev. *A 65