APPLICATION NOTE

How to Optimize Usage of SAM S70/E70/V7x Architecture

Atmel | SMART SAM S70/E70

Introduction

The purpose of this application note is to explain the architecture of the Cortex®-M7 processor and of the Atmel® | SMART SAM S70/E70/V7x devices, and to show how to tune application code to benefit from it and maximize performance.

The first part gives a short introduction to the Cortex-M7 core, with details on its implementation in the SAM S70/E70/V7x devices. The second part focuses on the architecture of the SAM S70/E70/V7x devices themselves. The last part explains how to enable and properly use the features previously introduced. A concrete example illustrates this: application code performing an FFT computation is executed successively out of different memories to show the impact on performance.

Atmel-44047B-ATARM-Optimize-Usage-SAM-V71-V70-E70-S70-Architecture_Application-note_032016

Table of Contents

1 Introduction to ARM Cortex-M7 Processor
  1.1 Six-stage superscalar pipeline
  1.2 Instruction and Data Tightly Coupled Memories
  1.3 Instruction and Data Cache
  1.4 Floating Point Unit
  1.5 Memory Interface
    1.5.1 AXI Master Interface
    1.5.2 AHB Peripheral Interface
    1.5.3 AHB Slave Interface
    1.5.4 AHB Debug Interface
2 SAM S70/E70/V7x Devices Architecture and Highlights
  2.1 High Performance Implementation
  2.2 Multi-port SRAM
    2.2.2 System RAM
    2.2.3 Tightly Coupled Memories
  2.3 Internal Flash
3 How to Benefit from the SAM S70/E70/V7x Architecture to Optimize Performance
  3.1 Use of Floating Point Unit
    3.1.1 Enable the Floating Point Unit
    3.1.2 Compiler Configuration
  3.2 Enable I-cache and D-cache
  3.3 Relocate Critical Part of Code in Tightly Coupled Memory
    3.3.1 TCM Configuration
    3.3.2 Linker Script Configuration
    3.3.3 Code Copy in TCM
  3.4 Optimize Pipeline Usage
4 Software Example: FFT Computation Code Example
  4.1 Introduction to the Example
  4.2 Usage
  4.3 Results
5 Conclusion
6 Suggested Readings
7 Revision History

1 Introduction to ARM Cortex-M7 Processor

The ARM® Cortex-M7 processor is the high-end processor of the Cortex-M family, providing 5 CoreMark®/MHz and up to 630 DMIPS. It delivers the best performance level thanks to various new features:

- Six-stage superscalar pipeline allowing parallel execution of instructions (dual instruction issue)
- Instruction and Data Tightly Coupled Memories (ITCM and DTCM respectively), accessed at processor clock speed with no wait state penalty and with deterministic behavior
- Inner Data and Instruction caches to compensate for the wait state penalty when executing code out of external memory
- DSP extension and single/double precision Floating Point Unit (FPv5) with extended instruction set
- Embedded Trace Module (ETM) with instruction and data trace capability
- Various memory interfaces to increase overall bandwidth:
  - 64-bit AXI Master interface to access memories and peripherals. This memory interface has been optimized for throughput
  - 32-bit AHB Peripheral interface (AHBP) to access low-latency peripherals rather than memory (peripheral data access only)
  - 32-bit AHB Slave interface (AHBS) providing a direct access path between the DMA and the TCM
  - 32-bit AHB Debug port providing a debug interface (JTAG or SWD interfaces)

The Cortex-M7 processor is based on the ARMv7-M architecture, so it is binary compatible with other Cortex-M processors.

Figure 1-1. Cortex-M7 Processor Overview

1.1 Six-stage superscalar pipeline

The Cortex-M7 has a six-stage, in-order, dual-issue superscalar pipeline with branch prediction. Thanks to the dual instruction issue, the Cortex-M7 processor is able to execute two instructions in parallel.
The pipeline features an optional floating-point pipeline, so that floating-point instructions can also be dual-issued with integer instructions. In addition, memory accesses are interleaved with computation to reduce latency, which is a major improvement over previous Cortex-M processors (the Cortex-M0/M3/M4 processors feature neither dual issue nor interleaving). Integer MAC instruction execution has also been improved and takes only one cycle (two to four cycles for floating-point MAC instructions), so it is used whenever possible. As a consequence, the Cortex-M7 processor doubles the performance of the Cortex-M4 processor when executing math functions (FFT, FIR, etc.).

1.2 Instruction and Data Tightly Coupled Memories

The TCM controller provides direct access between the processor and two memory areas: Instruction TCM (ITCM) and Data TCM (DTCM). The size of these memory areas can be up to 16 MB each, but depends on the silicon vendor integration (up to 128 kB each on the SAM S70/E70/V7x devices, see Section 3.3.1 for more information).

As shown in Figure 1-2, the interface between the processor and the ITCM is a 64-bit interface, so that the processor can fetch two 32-bit instructions in a single access and thus benefit from the dual-issue capability of the pipeline. The DTCM interface is a dual 32-bit interleaved interface (DTCM0 and DTCM1), so that concurrent accesses (from the DMA and the core, for instance) can be optimized.

Figure 1-2. Tightly Coupled Memories

Instructions and data located in TCM can be directly accessed at processor speed (e.g. up to 300 MHz on the SAM S70/E70) with no wait state penalty, as opposed to other memories such as the flash, which are accessed at bus speed through the AXI master interface. The purpose of the TCM is to store the critical parts of the code, which need to be processed as fast as possible. In addition, TCM has a deterministic behavior, which makes it well suited to RTOS-based applications.
1.3 Instruction and Data Cache

The Cortex-M7 embeds an Instruction cache (I-cache) and a Data cache (D-cache) to compensate for the wait state penalty when executing code out of external memory (typically flash):

- The instruction cache is 2-way associative, up to 64 kB with optional ECC (16 kB in SAM S70/E70/V7x)
- The data cache is 4-way associative, up to 64 kB with optional ECC (16 kB in SAM S70/E70/V7x)

I-cache and D-cache are disabled by default and must be enabled by the application code.

1.4 Floating Point Unit

The Cortex-M7 processor can optionally integrate a Floating Point Unit (FPU). This FPU, corresponding to the FPv5 extension, shares the same instruction set as the previous version (FPv4), which was implemented on the Cortex-M4 processor, and adds double-precision operand support. It also introduces new instructions such as rounding functions.

For more information on floating point instructions, refer to the Cortex-M7 Devices Generic User Guide from ARM: http://infocenter.arm.com/help/topic/com.arm.doc.dui0646a/CHDHHAJF.html.

1.5 Memory Interface

As shown in Figure 1-1, the Cortex-M7 core has four main memory interfaces.

1.5.1 AXI Master Interface

As stated in the introduction, the Cortex-M7 processor features a new 64-bit interface running at processor frequency. This interface is used for both instruction and data accesses and is dedicated to on-chip and off-chip memories and devices (typically flash or RAM).

This interface has been optimized for performance: AXI accesses are not made in a predictable order, as this bus can re-order instruction or data accesses to reduce latency and increase bandwidth. If predictability is critical when executing some part of the application code, it is recommended to:

- Configure the Memory Protection Unit (MPU) to define memory regions with proper attributes. For more information on MPU configuration, refer to the following application note: http://www.atmel.com/Images/Atmel-42128-AT02346-Using-the-MPU-on-Atmel-Cortex-M3-M4-basedMicrocontroller_Application-Note.pdf
  A typical MPU configuration example is also included in the software examples from the SAM S70/E70/V7x Software Package.
- Use memory barriers. More information can be found in a dedicated application note from ARM: http://infocenter.arm.com/help/topic/com.arm.doc.dai0321a/DAI0321A_programming_guide_memory_barriers_for_m_profile.pdf

1.5.2 AHB Peripheral Interface

The AHB Peripheral interface (AHBP) is dedicated to accessing low-latency peripherals rather than memory. Note that the AHBP does not support instruction fetches (which are performed through the AXI master interface), only data transfers. This interface has been added to avoid overloading the AXI master interface with additional data transfers, and thus to increase the overall bandwidth.

Unlike the AXI master interface, no optimization is done and speculative access is not supported (e.g. buffering is ordered). In addition, bursts are not supported, so only single accesses can be performed. For burst accesses to a peripheral, it is recommended to use the DMA.

1.5.3 AHB Slave Interface

The AHB Slave port (AHBS), also called the DMA slave port, provides system access between the DMA and the TCM through a 32-bit interface. This interface is also usable when the processor is in sleep state. Thanks to this interface, both the processor and the DMA can access the TCM in parallel, so bandwidth is maximized when running resource-demanding applications.

1.5.4 AHB Debug Interface

The Cortex-M7 processor implements a complete hardware debug solution, accessible through the AHB Debug interface (AHBD).
This provides high system visibility of the processor and memory through either a traditional JTAG port or a 2-pin Serial Wire Debug (SWD) port.

2 SAM S70/E70/V7x Devices Architecture and Highlights

2.1 High Performance Implementation

In order to support a larger flash memory range and benefit from the AXI bus performance, Atmel chose to implement the Cortex-M7 processor according to the High Performance configuration recommended by ARM. Figure 2-1 shows a simplified diagram of this implementation, with the example of the SAM S70/E70/V7x devices.

Figure 2-1. Cortex-M7 Processor Implementation

Unlike in the Simple Microcontroller implementation, application code can run from flash (connected to the AXI master interface) or from TCM, and thus benefits from both large storage (flash is up to 2 MB) and the performance boost of the TCM running at 300 MHz. This architecture also enables the use of the I/D cache to accelerate code execution when running out of flash.
The table below shows the options selected by Atmel when implementing the Cortex-M7 processor in the SAM S70/E70/V7x devices:

Features                          | Configurable options                                                      | SAM S70/E70/V7x implementation
----------------------------------|---------------------------------------------------------------------------|----------------------------------------
FPU                               | No FPU / Single Precision (SP) only / SP and DP                           | Single and Double Precision FPU
ITCM max size                     | No ITCM / 4kB-16MB                                                        | 128kB
DTCM max size                     | No DTCM / 4kB-16MB                                                        | 128kB
I-cache size                      | 4, 8, 16, 32, 64kB                                                        | 16kB
D-cache size                      | 4, 8, 16, 32, 64kB                                                        | 16kB
AHB Peripheral size (AHBP)        | 64, 128, 256, 512MB                                                       | 512MB
ECC support on caches             | Implemented or not                                                        | Implemented
MPU                               | 0, 8, 16 regions                                                          | 16 regions
Interrupts                        | 1-240                                                                     | 72
Debug Watchpoints and Breakpoints | 2 data watchpoints + 4 breakpoints / 4 data watchpoints + 8 breakpoints   | 4 data watchpoints + 8 breakpoints
ITM and DWT Trace                 | Implemented or not                                                        | Implemented
ETM                               | No ETM / ETM instruction only / ETM instruction and data                  | ETM Instruction Trace only
Cross Trigger Interface (CTI), Wakeup Interrupt Controller (WIC) | Implemented or not                         | Not Implemented

The following sections review the key features to show how the SAM S70/E70/V7x devices have been designed to optimize performance.

2.2 Multi-port SRAM

The SAM S70/E70/V7x devices feature a multi-port SRAM, which can be up to 384 kB. This SRAM space operates at the bus clock (i.e. processor clock/2 = up to 150 MHz) and has four ports to optimize bandwidth and latency. As shown in Figure 2-2, two ports are dedicated to the Cortex-M7 processor and two ports are shared by the AHB masters (Central DMA, EMAC DMA, USB DMA, etc.).

Figure 2-2.
Multi-port SRAM Implementation

The purpose of the multi-port capability is to decrease latency when several masters try to access the SRAM simultaneously: the integrated controller manages interleaved addressing of the SRAM blocks so that another master is able to access the SRAM on the next cycle. As an example, when a 16-word burst is performed by the DMA, another master is able to access the SRAM on the next cycle (n+1) and not on cycle (n+16).

The multi-port SRAM is split into two regions:

- System RAM region
- TCM region

Figure 2-3. Multi-port SRAM

The size of the TCM region can be configured through NVM bits, and the remaining SRAM size is automatically assigned to the System RAM region.

2.2.2 System RAM

System RAM is accessed by the AHB masters (peripherals, DMA) through the AHB matrix, or by the Cortex-M7 core through a privileged access (see Figure 2-2), at bus clock frequency (up to 150 MHz).

2.2.3 Tightly Coupled Memories

As explained previously, ITCM and DTCM can be directly accessed by the processor at 300 MHz. Transfers between the DMA and the TCM are done through the dedicated AHBS interface to maximize bandwidth and free CPU resources.

2.3 Internal Flash

Internal flash is accessed through the AXI Master interface and the AHB matrix at bus clock. Wait states must be added depending on the operating frequency (up to five wait states when the processor is running at maximum frequency), but this penalty can be compensated by the I-cache and D-cache.

Figure 2-4.
Connection Between Flash and Memory Interface

3 How to Benefit from the SAM S70/E70/V7x Architecture to Optimize Performance

This chapter gives more details on how to properly enable and configure the features previously introduced to reach the maximum performance level. Usage of the Floating Point Unit, I- and D-cache and Tightly Coupled Memories is discussed in this part, illustrated with code snippets for Atmel Studio, Keil and IAR™ users. More information is also given on how to optimize the load/store process and boost performance.

3.1 Use of Floating Point Unit

By default the Floating Point Unit is not enabled. Developers have to enable it in the application code and make sure that FPU instructions are used by the compiler.

3.1.1 Enable the Floating Point Unit

When developing application code that requires the FPU, one has to ensure that the FPU is enabled by software using the following code sequence:

    /** Address for ARM CPACR */
    #define ADDR_CPACR 0xE000ED88

    /** CPACR Register */
    #define REG_CPACR (*((volatile uint32_t *)ADDR_CPACR))

    /**
     * \brief Enable FPU
     */
    __always_inline static void fpu_enable(void)
    {
        irqflags_t flags;
        flags = cpu_irq_save();
        /* Grant full access to coprocessors CP10 and CP11 (the FPU) */
        REG_CPACR |= (0xFu << 20);
        __DSB();
        __ISB();
        cpu_irq_restore(flags);
    }

Note: __DSB() is a memory barrier used to ensure that all data memory transfers are complete. __ISB() is also a memory barrier, used to flush the instruction pipeline. More information can be found in Section 1.5.1 of this document.

3.1.2 Compiler Configuration

The user has to make sure that FPU instructions are used when compiling the application code, and configure the compiler accordingly.
For the ARM GCC compiler, the following flags need to be added:

- -mfloat-abi=hard or -mfloat-abi=softfp
  - softfp: to be used only if the user intends to use the same binary on Cortex-M4F and Cortex-M7F based products
  - hard: in any other case
- -mfpu=fpv5-sp-d16 or -mfpu=fpv5-d16
  - fpv5-sp-d16 is for single precision
  - fpv5-d16 is for double precision

In Atmel Studio, those flags are added in Project Properties menu > Toolchain tab > ARM/GNU C Compiler menu > Miscellaneous.

For IAR users, there is no need to add flags: simply select VFPv5 single-precision or VFPv5 double-precision in the project options (General Options > Target tab > FPU drop-down menu).

3.2 Enable I-cache and D-cache

By default I-cache and D-cache are disabled and, like the Floating Point Unit, must be enabled by the application code. Below is a code example to enable them:

    __STATIC_INLINE void SCB_EnableICache(void)
    {
    #if (__ICACHE_PRESENT == 1)
        __DSB();
        __ISB();
        SCB->ICIALLU = 0;               // invalidate I-Cache
        SCB->CCR |= SCB_CCR_IC_Msk;     // enable I-Cache
        __DSB();
        __ISB();
    #endif
    }

    __STATIC_INLINE void SCB_EnableDCache(void)
    {
    #if (__DCACHE_PRESENT == 1)
        uint32_t ccsidr, sshift, wshift, sw;
        uint32_t sets, ways;

        ccsidr = SCB->CCSIDR;
        sets   = CCSIDR_SETS(ccsidr);
        sshift = CCSIDR_LSSHIFT(ccsidr) + 4;
        ways   = CCSIDR_WAYS(ccsidr);
        wshift = __CLZ(ways) & 0x1f;

        __DSB();
        do {                            // invalidate D-Cache
            int32_t tmpways = ways;
            do {
                sw = ((tmpways << wshift) | (sets << sshift));
                SCB->DCISW = sw;
            } while(tmpways--);
        } while(sets--);
        __DSB();
        SCB->CCR |= SCB_CCR_DC_Msk;     // enable D-Cache
        __DSB();
        __ISB();
    #endif
    }

The two functions above are included in the software library provided for free on Atmel.com (SAM V71 / V70 / E70 / S70 Software Package).
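The shift amounts used by the D-cache invalidation loop above come from the CCSIDR register fields. The following is a minimal host-side sketch of that decoding, assuming the ARMv7-M CCSIDR layout (LineSize in bits [2:0], Associativity in bits [12:3], NumSets in bits [27:13]). The helper names (cache_geometry_t, clz32, ccsidr_decode, dcisw_operand) are hypothetical and not part of CMSIS or of the Atmel software package, and the value 0x000FE019 is constructed by hand to describe a 16 kB, 4-way cache with 32-byte lines; it is not a register dump.

```c
#include <stdint.h>

/* Geometry decoded from a CCSIDR value (hypothetical helper type). */
typedef struct {
    uint32_t line_bytes;  /* cache line length in bytes */
    uint32_t ways;        /* associativity */
    uint32_t sets;        /* number of sets */
    uint32_t size_bytes;  /* total cache size in bytes */
} cache_geometry_t;

/* Portable count-leading-zeros, standing in for the __CLZ() intrinsic. */
static uint32_t clz32(uint32_t x)
{
    uint32_t n = 0u;
    if (x == 0u) {
        return 32u;
    }
    while ((x & 0x80000000u) == 0u) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Decode the geometry fields; each field stores "value - 1" except
 * LineSize, which stores log2(words per line) - 2. */
static cache_geometry_t ccsidr_decode(uint32_t ccsidr)
{
    cache_geometry_t g;
    uint32_t line_field = ccsidr & 0x7u;
    g.line_bytes = 1u << (line_field + 4u);     /* 4 bytes * 2^(field + 2) */
    g.ways = ((ccsidr >> 3) & 0x3FFu) + 1u;
    g.sets = ((ccsidr >> 13) & 0x7FFFu) + 1u;
    g.size_bytes = g.line_bytes * g.ways * g.sets;
    return g;
}

/* Operand written to DCISW for one (set, way) pair, built the same way
 * as the "sw" variable in the invalidation loop. */
static uint32_t dcisw_operand(uint32_t ccsidr, uint32_t set, uint32_t way)
{
    uint32_t sshift = (ccsidr & 0x7u) + 4u;
    uint32_t wshift = clz32((ccsidr >> 3) & 0x3FFu) & 0x1Fu;
    return (way << wshift) | (set << sshift);
}
```

With the hand-built value 0x000FE019, ccsidr_decode() yields 32-byte lines, 4 ways and 128 sets, i.e. the 16 kB D-cache size given for these devices, so the invalidation loop performs 128 x 4 = 512 writes to DCISW.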
3.3 Relocate Critical Part of Code in Tightly Coupled Memory

As explained previously, Tightly Coupled Memories are intended to store the critical parts of the code, which need to be executed with maximum performance. When determining which functions or data will be executed or stored out of TCM, it is important to understand that power consumption increases with TCM usage: the more TCM is used, the more power is consumed. The user therefore has to find the best trade-off between performance and power consumption, and make sure that the application does not exceed its power budget when moving code to TCM.

3.3.1 TCM Configuration

The first step is to configure the TCM size and enable the TCM in the application code. This step is performed at startup, before initializing the clocks and other peripherals. The TCM size is set with GPNVM bits 7 and 8 according to the below table.
GPNVM8 | GPNVM7 | ITCM/DTCM size [kB] | System RAM [kB] (384kB - TCMs size)
-------|--------|---------------------|------------------------------------
0      | 0      | 0 / 0               | 384
0      | 1      | 32 / 32             | 320
1      | 0      | 64 / 64             | 256
1      | 1      | 128 / 128           | 128

GPNVM bits are configured with specific EFC commands, as shown in the code example below:

    /* Configure TCM sizes to: 128 kB ITCM - 128 kB DTCM (set GPNVM7 and GPNVM8) */
    EFC->EEFC_FCR = (EEFC_FCR_FKEY_PASSWD | EEFC_FCR_FCMD_SGPB | EEFC_FCR_FARG(7));
    EFC->EEFC_FCR = (EEFC_FCR_FKEY_PASSWD | EEFC_FCR_FCMD_SGPB | EEFC_FCR_FARG(8));

Then the TCM must be enabled by writing to the ITCMCR and DTCMCR registers of the System Control Block (SCB) of the Cortex-M7 processor:

    __STATIC_INLINE void TCM_Enable(void)
    {
        __DSB();
        __ISB();
        SCB->ITCMCR = (SCB_ITCMCR_EN_Msk | SCB_ITCMCR_RMW_Msk | SCB_ITCMCR_RETEN_Msk);
        SCB->DTCMCR = (SCB_DTCMCR_EN_Msk | SCB_DTCMCR_RMW_Msk | SCB_DTCMCR_RETEN_Msk);
        __DSB();
        __ISB();
    }

3.3.2 Linker Script Configuration

In the linker file, a new region corresponding to the TCM must be created so that the relevant code and data can be linked to the right memory addresses.

For the Atmel Studio/GCC compiler, the linker file is a .ld file. First, the ITCM and DTCM memory regions have to be defined, with their respective base addresses, list of attributes (read, write, execute, etc.) and size:

    /* ITCM and DTCM are 128KB each */
    itcm (rwx) : ORIGIN = 0x00000000, LENGTH = 0x00020000
    dtcm (rw)  : ORIGIN = 0x20000000, LENGTH = 0x00020000

Then a section needs to be created to place the relevant code and data in ITCM and DTCM respectively.
Below is an example from the Atmel Studio project given with this application note, in which the FFT computation functions are placed in TCM:

    SECTIONS
    {
        .code_TCM 0x00000000: AT ( _itcm_lma )
        {
            _sitcm = .;
            fft.o (.text.*)
            fft.o (.rodata)
            *(EXCLUDE_FILE (library/* signal.o main.o wdt.o) .text)
            *(EXCLUDE_FILE (library/* signal.o main.o wdt.o) .text.*)
            *(EXCLUDE_FILE (library/* signal.o main.o wdt.o) .rodata.*)
            *(EXCLUDE_FILE (library/* signal.o main.o wdt.o) .fini*)
            _eitcm = .;
        } > itcm

        .data_TCM 0x20000000:
        {
            _sdtcm = .;
            fft.o (.data)
            fft.o (.bss)
            fft.o (COMMON)
            *(EXCLUDE_FILE (library/* signal.o main.o wdt.o) .data)
            *(EXCLUDE_FILE (library/* signal.o main.o wdt.o) .bss*)
            _edtcm = .;
        } > dtcm

        .DTCM_stack :
        {
            . = ALIGN(8);
            _sdtcm_stack = .;
            . += STACK_SIZE;
            _edtcm_stack = .;
        } > dtcm
    }

For more information on linker script concepts, refer to the following application note: http://www.atmel.com/Images/doc32158.pdf.

For IAR users, linker files are in the .icf format. In this .icf file, two regions must be created for ITCM and DTCM (ITCM_region and DTCM_region in the example below) with their respective start address and size:

    define symbol __ICFEDIT_region_ITCM_start__ = 0x00000000;
    define symbol __ICFEDIT_region_DTCM_start__ = 0x20000000;
    define symbol __ICFEDIT_size_itcm__ = 0x20000;
    define symbol __ICFEDIT_size_dtcm__ = 0x20000;

    define region ITCM_region = mem:[from __ICFEDIT_region_ITCM_start__ size __ICFEDIT_size_itcm__];
    define region DTCM_region = mem:[from __ICFEDIT_region_DTCM_start__ size __ICFEDIT_size_dtcm__];

Then relevant functions and data are placed in TCM.
The example below is from the IAR project provided with the application note:

    place in ITCM_region { readwrite object fft.o,
                           readwrite object arm_cortexM7lfdp_math.lib,
                           readwrite object dl7M_tln.a,
                           readwrite object m7M_tlv.a,
                           readwrite object rt7M_tl.a };

    place in DTCM_region { zi object fft.o,
                           zi object arm_cortexM7lfdp_math.lib,
                           zi object dl7M_tln.a,
                           zi object m7M_tlv.a,
                           zi object rt7M_tl.a };

    initialize by copy { object fft.o,
                         object arm_cortexM7lfdp_math.lib,
                         object dl7M_tln.a,
                         object m7M_tlv.a,
                         object rt7M_tl.a };

For more information on IAR linker files, refer to the following application note: http://supp.iar.com/FilesPublic/UPDINFO/005316/xlink.ENU.pdf.

3.3.3 Code Copy in TCM

When compiling application code with the Atmel Studio/GCC compiler, the code to be moved to the TCM area must be copied manually. This is usually done during initialization, just before reaching the main function. The code snippet below shows how to do it:

    /* _sitcm, _eitcm and _itcm_lma are symbols defined in the linker script */
    /* copy code_TCM from flash to ITCM */
    volatile char *dst = &_sitcm;
    volatile char *src = &_itcm_lma;
    while(dst < &_eitcm){
        *dst++ = *src++;
    }

For IAR users, this copy loop is already implemented in the IAR cstartup file, so there is no need to add it to the application code.

3.4 Optimize Pipeline Usage

In order to fully benefit from the dual-issue capability of the Cortex-M7 pipeline, it is recommended to interleave load and store instructions with computation, as shown below:

    Xn1 = pIn[0];
    Xn2 = pIn[1];
    Xn3 = pIn[2];
    acc1 = b0 * Xn1 + d1;
    Xn4 = pIn[3];
    d1 = b1 * Xn1 + d2;
    Xn5 = pIn[4];
    d2 = b2 * Xn1;
    Xn6 = pIn[5];
    d1 += a1 * acc1;
    Xn7 = pIn[6];
    d2 += a2 * acc1;

Compilers are not able to perform this kind of optimization, so it must be done at application code level.
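The interleaving recommended in Section 3.4 can be sketched on a complete filter routine. The example below is an illustration only, not the CMSIS-DSP implementation: it assumes a single biquad stage with the same update equations as the snippet above, a coefficient order {b0, b1, b2, a1, a2}, and hypothetical function names (biquad_plain, biquad_interleaved). The first function loads each sample right before using it; the second processes two samples per iteration and places the load of the second sample between the arithmetic operations of the first. Both compute the same result; whether the compiler keeps the source ordering in the generated instructions depends on the toolchain and optimization level.

```c
#include <stddef.h>

/* Reference version: each sample is loaded right before it is used. */
static void biquad_plain(const float *pIn, float *pOut, size_t n,
                         const float coef[5], float state[2])
{
    const float b0 = coef[0], b1 = coef[1], b2 = coef[2];
    const float a1 = coef[3], a2 = coef[4];
    float d1 = state[0], d2 = state[1];

    for (size_t i = 0; i < n; i++) {
        float Xn1 = pIn[i];
        float acc1 = b0 * Xn1 + d1;
        d1 = b1 * Xn1 + d2;
        d2 = b2 * Xn1;
        d1 += a1 * acc1;
        d2 += a2 * acc1;
        pOut[i] = acc1;
    }
    state[0] = d1;
    state[1] = d2;
}

/* Interleaved version: two samples per iteration, with the load of the
 * second sample issued between the arithmetic of the first one. */
static void biquad_interleaved(const float *pIn, float *pOut, size_t n,
                               const float coef[5], float state[2])
{
    const float b0 = coef[0], b1 = coef[1], b2 = coef[2];
    const float a1 = coef[3], a2 = coef[4];
    float d1 = state[0], d2 = state[1];
    size_t i = 0;

    for (; i + 2 <= n; i += 2) {
        float Xn1 = pIn[i];
        float acc1 = b0 * Xn1 + d1;
        float Xn2 = pIn[i + 1];        /* early load for the next sample */
        d1 = b1 * Xn1 + d2;
        d2 = b2 * Xn1;
        d1 += a1 * acc1;
        d2 += a2 * acc1;
        pOut[i] = acc1;

        float acc2 = b0 * Xn2 + d1;
        d1 = b1 * Xn2 + d2;
        d2 = b2 * Xn2;
        d1 += a1 * acc2;
        d2 += a2 * acc2;
        pOut[i + 1] = acc2;
    }

    /* Remaining sample when n is odd */
    for (; i < n; i++) {
        float Xn1 = pIn[i];
        float acc1 = b0 * Xn1 + d1;
        d1 = b1 * Xn1 + d2;
        d2 = b2 * Xn1;
        d1 += a1 * acc1;
        d2 += a2 * acc1;
        pOut[i] = acc1;
    }
    state[0] = d1;
    state[1] = d2;
}
```

Keeping a plain reference version next to the hand-scheduled one makes it easy to check on a host that the reordering did not change the filter output before measuring cycle counts on the target.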
Note: The ARM CMSIS library for Cortex-M7, which is used in all software examples for SAM S70/E70/V7x, already implements this optimization.

4 Software Example: FFT Computation Code Example

This part introduces the code example provided with this application note, which illustrates the different concepts previously detailed. This code example is provided for Atmel Studio 7 (build 7.0.567 or above), and requires the Atmel | SMART SAM V71 Xplained Ultra board. It is based on the ARM CMSIS-DSP library version 1.4.3 (CMSIS version 3.20).

4.1 Introduction to the Example

The purpose of the example is to generate a 256-point FFT of a sine wave using the 32-bit floating-point data type (F32). The following values are displayed in real time on the DEBUG USB Virtual COM port:

- FFT computation time (in µs)
- CPU load (in %)

As shown in the below screenshot, the project is made of three folders:

- libchip: provides the API for the different embedded peripherals
- libboard: provides the API for the different on-board components and the board low-level initialization
- linkerScripts: contains the linker scripts

The following source files are also noteworthy:

- main.c: program entry point. This file contains the initialization function of the device and the sine wave generation function. The I- and D-cache enable functions are called at the beginning of the main function (SCB_EnableICache() and SCB_EnableDCache() functions).
- fft.c: contains all functions related to the FFT computation process. This process calls the mathematics functions from the ARM CMSIS-DSP library (included in Atmel Studio).
- startup_sam.c (in the libboard folder): contains the exception table and the reset handler. The reset handler sets the vector table, configures the TCM (GPNVM bits) and enables it.
- board_lowlevel.c (in the libboard folder): performs the low-level initialization of the chip, including EEFC (Flash Controller), MPU and master clock configuration. The code copy to TCM can also be found at the end of the LowLevelInit() function in this file.
- samv71q21_xxx.ld (in the linkerScripts directory): program linker file. As explained previously, its purpose is to describe how the sections in the input files should be mapped into the output file (.hex file), and to control the memory layout of the output file.

4.2 Usage

The project contains several configurations so that the FFT computation process runs out of flash (with I-/D-cache enabled or not), System RAM or TCM. Figure 4-1 shows the case where the FFT computation is executed out of TCM:

Figure 4-1.
FFT Computation Process Running out of TCM

[Figure: memory map. Code memory space: ITCM at 0x00000000 (FFT computation and maths library, instructions), Flash at 0x00400000 (other functions such as result display), ROM at 0x00800000. SRAM memory space: DTCM at 0x20000000 (FFT computation and maths library, data), System RAM at 0x20400000, with 0x20C00000 as the last address shown.]

The configuration is changed through the drop-down menu at the top of the toolbar:

- FLASH-CACHE executes the FFT computation process out of flash with I-/D-cache enabled
- FLASH-NO-CACHE executes the FFT computation process out of flash with I-/D-cache disabled
- RAM1-CACHE executes the FFT computation process out of System RAM with I-/D-cache enabled
- RAM1-NO-CACHE executes the FFT computation process out of System RAM with I-/D-cache disabled
- TCM-CACHE executes the FFT computation process out of TCM with I-/D-cache enabled
- TCM-NO-CACHE executes the FFT computation process out of TCM with I-/D-cache disabled

4.3 Results

As mentioned in the introduction, the code example is based on the ARM CMSIS-DSP library version 1.4.3 (CMSIS version 3.20) and runs on the SAM V71 Xplained Ultra board. The expected results are the following (FFT computation time in µs, CPU load in %):

                   | Flash        | System RAM   | TCM
-------------------|--------------|--------------|-------------
I/D cache disabled | 358µs, 6.1%  | 341µs, 5.8%  | 79µs, 1.3%
I/D cache enabled  | 102µs, 1.7%  | 95µs, 1.5%   | 79µs, 1.3%

The maximum performance level is achieved when the FFT computation code runs out of TCM, thanks to the direct access from the Cortex-M7 processor at processor clock speed (300 MHz). The results when running out of flash with cache enabled come fairly close (102 µs versus 79 µs), because instructions and data held in the cache are also fetched at processor clock speed.
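As a worked calculation on the figures in the results table, the ratios between configurations can be computed directly. The sketch below only restates the published measurements; the array and function names are hypothetical and chosen for illustration.

```c
/* FFT computation times copied from the results table, in microseconds.
 * Index order: flash, System RAM, TCM. */
enum { MEM_FLASH, MEM_SYSTEM_RAM, MEM_TCM, MEM_COUNT };

static const double time_cache_off_us[MEM_COUNT] = { 358.0, 341.0, 79.0 };
static const double time_cache_on_us[MEM_COUNT]  = { 102.0,  95.0, 79.0 };

/* Speed-up of a measured configuration relative to a baseline time. */
static double speedup(double baseline_us, double measured_us)
{
    return baseline_us / measured_us;
}
```

Taking flash without cache as the baseline, speedup(358, 102) is about 3.5 for flash with cache and speedup(358, 79) is about 4.5 for TCM. Comparing the two fast options, speedup(102, 79) is about 1.3, i.e. cached flash execution takes roughly 29% longer than TCM execution in this example.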
As explained previously, the main difference between executing code out of cache and out of TCM is determinism: TCM is more suitable for real-time tasks because it is fully deterministic. The other difference is that TCM management is simpler: unlike cache, there are no maintenance operations to perform, such as invalidate, clean, etc.

5 Conclusion

This application note presented the new features introduced by the Cortex-M7 core and how the core is integrated in the SAM S70/E70/V7x devices. It also reviewed the SAM S70/E70/V7x architecture to give a better understanding of why these devices are well suited to performance-demanding applications. Finally, a concrete example was given to illustrate how to enable and properly use these specific features.

6 Suggested Readings

- Cortex-M7 Processor Technical Reference Manual: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0489b/DDI0489B_cortex_m7_trm.pdf
- Atmel | SMART SAM S70 Product Datasheet: http://www.atmel.com/Images/Atmel-11242-32-bit-Cortex-M7-Microcontroller-SAM-S70Q-SAM-S70NSAM-S70J_Datasheet.pdf
- Atmel | SMART SAM E70 Product Datasheet: http://www.atmel.com/Images/Atmel-11296-32-bit-Cortex-M7-Microcontroller-SAM-E70Q-SAM-E70NSAM-E70J_Datasheet.pdf
- Using the MPU on Atmel Cortex-M3 / Cortex-M4 based Microcontrollers: http://www.atmel.com/Images/Atmel-42128-AT02346-Using-the-MPU-on-Atmel-Cortex-M3-M4-basedMicrocontroller_Application-Note.pdf
- ARM Cortex-M Programming Guide to Memory Barrier Instructions: http://infocenter.arm.com/help/topic/com.arm.doc.dai0321a/DAI0321A_programming_guide_memory_barriers_for_m_profile.pdf

7 Revision History

Doc Rev.    Date       Comments
44047B      03/2016    Code example updated to Atmel Studio 7.
44047A      06/2015    Initial document release.

Atmel Corporation 1600 Technology Drive, San Jose, CA 95110 USA T: (+1)(408) 441.0311 F: (+1)(408) 436.4200 │ www.atmel.com

© 2016 Atmel Corporation. / Rev.: Atmel-44047B-ATARM-Optimize-Usage-SAM-V71-V70-E70-S70-Architecture_Application-note_032016.

Atmel®, Atmel logo and combinations thereof, Enabling Unlimited Possibilities®, and others are registered trademarks or trademarks of Atmel Corporation in U.S. and other countries. ARM®, ARM Connected® logo, Cortex®, and others are the registered trademarks or trademarks of ARM Ltd. Other terms and product names may be trademarks of others.

DISCLAIMER: The information in this document is provided in connection with Atmel products. No license, express or implied, by estoppel or otherwise, to any intellectual property right is granted by this document or in connection with the sale of Atmel products. EXCEPT AS SET FORTH IN THE ATMEL TERMS AND CONDITIONS OF SALES LOCATED ON THE ATMEL WEBSITE, ATMEL ASSUMES NO LIABILITY WHATSOEVER AND DISCLAIMS ANY EXPRESS, IMPLIED OR STATUTORY WARRANTY RELATING TO ITS PRODUCTS INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT. IN NO EVENT SHALL ATMEL BE LIABLE FOR ANY DIRECT, INDIRECT, CONSEQUENTIAL, PUNITIVE, SPECIAL OR INCIDENTAL DAMAGES (INCLUDING, WITHOUT LIMITATION, DAMAGES FOR LOSS AND PROFITS, BUSINESS INTERRUPTION, OR LOSS OF INFORMATION) ARISING OUT OF THE USE OR INABILITY TO USE THIS DOCUMENT, EVEN IF ATMEL HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
Atmel makes no representations or warranties with respect to the accuracy or completeness of the contents of this document and reserves the right to make changes to specifications and products descriptions at any time without notice. Atmel does not make any commitment to update the information contained herein. Unless specifically provided otherwise, Atmel products are not suitable for, and shall not be used in, automotive applications. Atmel products are not intended, authorized, or warranted for use as components in applications intended to support or sustain life.

SAFETY-CRITICAL, MILITARY, AND AUTOMOTIVE APPLICATIONS DISCLAIMER: Atmel products are not designed for and will not be used in connection with any applications where the failure of such products would reasonably be expected to result in significant personal injury or death (“Safety-Critical Applications”) without an Atmel officer's specific written consent. Safety-Critical Applications include, without limitation, life support devices and systems, equipment or systems for the operation of nuclear facilities and weapons systems. Atmel products are not designed nor intended for use in military or aerospace applications or environments unless specifically designated by Atmel as military-grade. Atmel products are not designed nor intended for use in automotive applications unless specifically designated by Atmel as automotive-grade.