Cell/B.E. SDK: Code sample directory
Keep track of where to find code examples for the SDK

Anita Bateman, Certified Senior IT Architect, Cell Solutions, IBM
VanDung To, Software Engineer, IBM
15 July 2008
developerWorks® ibm.com/developerWorks/

In this article, you'll find tables indicating the locations of code samples that illustrate how to use the IBM SDK for Multicore Acceleration. This article will be updated with new code samples.

About the SDK for Multicore Acceleration

To enable you to take full advantage of the Cell Broadband Engine® (Cell/B.E.) architecture and the PowerXCell 8i processor, IBM has developed a software development kit designed to accelerate production-ready, multi-core programming. The IBM Software Development Kit (SDK) for Multicore Acceleration provides the libraries, tools, and resources that businesses need to develop and tune applications for Cell/B.E. technology. With the SDK, you can:

• Port and optimize applications and algorithms quickly.
• Increase ease of programming and developer productivity.
• Obtain a reliable development tool kit with warranty and support.
• Plug in third-party ISV libraries to integrate and build your software ecosystem.

Cell/B.E. demos

These full demonstration programs are designed and optimized to show the capabilities of the Cell/B.E.

Table 1. Cell/B.E. demos

• FFT16M: Tuned fast Fourier transform. Performs a 4-way SIMD, single-precision, complex FFT on an array of 16,777,216 elements.
  RPM: cell-demos-source; default directory: /opt/cell/sdk/src/demos/FFT16M; SDK: 3.0
• Julia Set: Quaternion Julia set ray-tracing sample.
  RPM: cell-demos-source; default directory: /opt/cell/sdk/src/demos/julia_set; SDK: 3.0
• Matrix Multiply: Parallel matrix multiplication workload for the Cell/B.E.
  SDK: 3.0

Cell/B.E. examples

These small examples show different features of the SDK and the Cell/B.E. architecture.

Table 2. Cell/B.E. examples

• cache-cp: Demonstrates a standalone spulet that copies one file to another. The file contents are mapped into the effective address space, and a software-managed cache stages the data into local store (LS).
  RPM: cell-examples-source; default directory: /opt/cell/sdk/src/examples/cache/cache-cp; SDK: 3.0
• cache: sort: Demonstrates quicksort and heapsort implementations that use a software-managed data cache.
  RPM: cell-examples-source; default directory: /opt/cell/sdk/src/examples/cache/sort; SDK: 3.0
• ch_prof: Demonstrates a statistical profiling utility that can determine the channel read and write activity of an SPU program. It also includes an example application that shows how to use the ch_prof utility.
  RPM: cell-examples-source; default directory: /opt/cell/sdk/src/examples/ch_prof; SDK: 3.0
• DMA: simple: Demonstrates simple DMA calls within one SPE.
  RPM: cell-examples-source; default directory: /opt/cell/sdk/src/examples/DMA/simple; SDK: 3.0
• DMA: complex: Demonstrates DMA calls for multiple SPEs using single-buffered DMA, double-buffered DMA, single-buffered DMA lists, and double-buffered DMA lists.
  RPM: cell-examples-source; default directory: /opt/cell/sdk/src/examples/DMA/complex; SDK: 3.0
• overlay: overview: Demonstrates an example implementation of SPU code overlays.
  RPM: cell-examples-source; default directory: /opt/cell/sdk/src/examples/overlay/overview; SDK: 3.0
• overlay: simple: Demonstrates a simple use of SPU code overlays.
  RPM: cell-examples-source; default directory: /opt/cell/sdk/src/examples/overlay/simple; SDK: 3.0
• overlay: large matrix: Demonstrates SPU code overlays applied to large-matrix math.
  RPM: cell-examples-source; default directory: /opt/cell/sdk/src/examples/overlay/large_matrix; SDK: 3.0
• ppe_address_space: Demonstrates a quicksort of a random list of floats that uses the compiler's PPE address space support. The example illustrates the automatic software caching feature supported by the SPU compiler. The "__ea" qualifier is used in the source file to tell the SPU compiler that a memory reference is in the effective address space rather than in local store.
  RPM: cell-examples-source; default directory: /opt/cell/sdk/src/examples/ppe_address_space; SDK: 3.0
• spe_interrupt: Demonstrates an assembly-language first-level interrupt handler (FLIH) that calls ABI-compliant, registered, C-language second-level interrupt handlers (SLIH), selected according to the interrupt events.
  RPM: cell-examples-source; default directory: /opt/cell/sdk/src/examples/spu_interrupt; SDK: 3.0
• spe_interrupt_fast: Demonstrates the use of the fast interrupt handler.
  RPM: cell-examples-source; default directory: /opt/cell/sdk/src/examples/spu_interrupt; SDK: 3.0
• spu_let: Demonstrates standalone SPU programs called spulets. These are C-language programs compiled to run on an SPU, and they can invoke C-library functions such as printf(3) and open(2). A spulet can be executed directly from the Linux® command prompt. The following entries are spulet examples.
  Default directory: /opt/cell/sdk/src/examples/spulet; SDK: 3.0
• spu_let: hello: Demonstrates a spulet version of hello world.
  RPM: cell-examples-source; default directory: /opt/cell/sdk/src/examples/spulet/hello; SDK: 3.0
• spu_let: spe-sum: Demonstrates a simple spulet that computes a checksum for a file. The file contents are mmap'ed into the effective address space, and DMA transfers stage the data into LS. The example illustrates how an SPE-based filter might be constructed.
  RPM: cell-examples-source; default directory: /opt/cell/sdk/src/examples/spulet/spe-sum; SDK: 3.0
• spu_let: spe-audio: Demonstrates audio filtering on the SPU with two examples. cpaudio copies one channel of a stereo audio file to the other channel, and normalize normalizes the volume of a mono audio file. Both take raw, unsigned, halfword audio files as input and produce output in the same format.
  RPM: cell-examples-source; default directory: /opt/cell/sdk/src/examples/spulet/spe-audio; SDK: 3.0
• spu_let: spe-cp: Demonstrates a simple spulet that copies one file to another. The file contents are mmap'ed into the effective address space, and multibuffered DMA transfers stage the data into and then out of LS.
  RPM: cell-examples-source; default directory: /opt/cell/sdk/src/examples/spulet/spe-cp; SDK: 3.0
• sync: Demonstrates and tests low-level atomic and thread-type synchronization functions from the sample library libsync.
  RPM: cell-examples-source; default directory: /opt/cell/sdk/src/examples/sync; SDK: 3.0

Cell/B.E. libraries

These example libraries provide different sets of functions to the programmer. The examples are optimized to run on the PPE and SPE. For detailed information about each routine in each library, see the SDK 3.0 document SDK_Example_Library_API_v3.0.pdf.

Table 3. Cell/B.E. libraries

• fft: Contains a collection of fast Fourier transform routines, including the following two.
  RPM: cell-libs-source; default directory: /opt/cell/sdk/src/lib/fft; SDK: 3.0
• fft_1d_r2: Computes a single-precision, complex discrete Fourier transform using a radix-2, decimation-in-time FFT. This function is built only for use on the SPE.
  RPM: cell-libs-source; default directory: /opt/cell/sdk/src/lib/fft; SDK: 3.0
• fft_2d: Transforms four rows of complex 2D data from the time domain to the frequency domain, or vice versa. This function is available on both the PPE and the SPE.
  RPM: cell-libs-source; SDK: 3.0
• gmath: Contains the following game math functions: cos8, cos8_v, cos14, cos14_v, cos18, cos18_v, pack_normal16, pack_normal16_v, pack_rgba8_v, sin8, sin8_v, sin14, sin14_v, sin18, sin18_v, spec9, spec9_v, unpack_normal16, unpack_normal16_v, unpack_rgba8, unpack_rgba8_v, pack_color8, pack_rgba8, set_spec_exponent9, specThresholds, unpack_color8.
  RPM: cell-libs-source; default directory: /opt/cell/sdk/src/lib/gmath; SDK: 3.0
• image: Contains a series of convolution and histogram functions.
  RPM: cell-libs-source; default directory: /opt/cell/sdk/src/lib/image; SDK: 3.0
• large_matrix: Contains several large-matrix functions for the SPE. The matrix must fit in the SPE local store.
  RPM: cell-libs-source; default directory: /opt/cell/sdk/src/lib/large_matrix; SDK: 3.0
• matrix: Contains matrix routines for the PPE and SPE.
  RPM: cell-libs-source; default directory: /opt/cell/sdk/src/lib/matrix; SDK: 3.0
• misc: Contains miscellaneous functions for the PPE and SPE, including calloc_align, clamp, clamp_0_to_1, clamp_0_to_1_v, clamp_minus1_to_1, clamp_minus1_to_1_v, clamp_v, free_align, load_vec_unaligned, malloc_align, max_float_v, max_int_v, max_vec_float3, max_vec_float4, max_vec_int3, max_vec_int4, min_float_v, min_int_v, min_vec_float3, min_vec_float4, min_vec_int3, min_vec_int4, rand_0_to_1, rand_0_to_1_v, rand_minus1_to_1, rand_minus1_to_1_v, rand_v, realloc_align, srand_v, store_vec_unaligned, copy_from_ls, copy_from_ls_aligned, copy_to_ls, copy_to_ls_aligned, and ls_sync.
  RPM: cell-libs-source; default directory: /opt/cell/sdk/src/lib/misc; SDK: 3.0
• mpm: Contains several functions that perform multi-precision math on large numbers on the SPE.
  RPM: cell-libs-source; default directory: /opt/cell/sdk/src/lib/mpm; SDK: 3.0
• sync: Contains several interfaces for the Cell/B.E. that are patterned after POSIX threads mutexes and condition variables.
  RPM: cell-libs-source; default directory: /opt/cell/sdk/src/lib/sync; SDK: 3.0
• vector: Contains several vector math functions.
  RPM: cell-libs-source; default directory: /opt/cell/sdk/src/lib/vector; SDK: 3.0

Cell/B.E. tutorials

These examples are included in the SDK Programming Tutorial.

Table 4. Cell/B.E. tutorials

• simple: Demonstrates a simple program that creates eight identical SPE threads, which take turns printing messages to stdout and then terminate.
  RPM: cell-tutorial-source; SDK: 3.0
• euler: Demonstrates how a simplified scalar function can be ported and accelerated for parallel execution on the Cell/B.E.
  RPM: cell-tutorial-source; SDK: 3.0

ALF Cell/B.E. examples

These examples show how the Accelerated Library Framework (ALF) works on the Cell/B.E.

Table 5. ALF Cell/B.E. examples

• BlackScholes_ALF: Demonstrates the Black-Scholes pricing model by computing its closed-form solution with ALF.
  RPM: alf-examples-source; default directory: /opt/cell/sdk/src/alf/BlackScholes_ALF/; SDK: 3.0
• FFT16M_ALF: Demonstrates an ALF version of the FFT16M workload from the cell-demos-source RPM (see Table 1).
  RPM: alf-examples-source; default directory: /opt/cell/sdk/src/alf/FFT16M_ALF; SDK: 3.0
• hello_world: Demonstrates a simple ALF program whose SPU computational kernel prints a hello world message with printf.
  RPM: alf-examples-source; default directory: /opt/cell/sdk/src/alf/hello_world; SDK: 3.0
• inout_buffer: Demonstrates the use of overlapped I/O buffers. The first example implements C = A + B, where A, B, and C are matrices. The second example implements A = A + B, where matrix A is overwritten by the result.
  RPM: alf-examples-source; default directory: /opt/cell/sdk/src/alf/inout_buffer; SDK: 3.0
• inverse_matrix_ovl: Demonstrates the use of overlays with ALF by calculating the inverse of a block diagonal matrix.
  RPM: alf-examples-source; default directory: /opt/cell/sdk/src/alf/inverse_matrix_ovl; SDK: 3.0
• matrix_add: Demonstrates how to program with the ALF API. By adding two 1024x512 single-precision floating-point matrices, the sample code shows how a simplified scalar function can be ported and accelerated for parallel execution on ALF. This simple example illustrates several ALF features, including data partitioning on the host, data partitioning on the accelerator, overlapped input/output buffers, and data sets.
  RPM: alf-examples-source; default directory: /opt/cell/sdk/src/alf/matrix_add; SDK: 3.0
• matrix_transpose: Demonstrates how to program with the ALF API.
  By transposing a 1024x512 single-precision floating-point matrix, the sample code shows how a simplified scalar function can be ported and accelerated for parallel execution on ALF. The sample contains a scalar, non-ALF version of the code, a version with data partitioning on the host, a version with data partitioning on the accelerator, and a tuned SIMD version.
  RPM: alf-examples-source; default directory: /opt/cell/sdk/src/alf/matrix_transpose; SDK: 3.0
• PI: Demonstrates computing pi with Buffon's needle method using ALF. It shows how to use the ALF context buffer for global parameters and how to collect the results computed by each task instance. The program also displays a progress bar during the computation by using the ALF sync point feature.
  RPM: alf-examples-source; default directory: /opt/cell/sdk/src/alf/PI; SDK: 3.0
• pipe_line: Demonstrates the task dependency feature in ALF by using it in a two-stage pipeline application. The application is a simple simulation: an object P is placed in the middle of a flat surface enclosed by a rectangular bounding box. On each simulation step, the object moves a random distance in a random direction, and it returns to its initial position when it hits a side wall of the box. The problem is to count the number of hits on the four walls over a given time period.
  RPM: alf-examples-source; default directory: /opt/cell/sdk/src/pipe_line; SDK: 3.0
• task_context: Contains a collection of small samples that demonstrate the use of task context in ALF. The four samples follow.
  RPM: alf-examples-source; default directory: /opt/cell/sdk/src/task_context; SDK: 3.0
• task_context: dot_prod: Computes the dot product of two large vectors. It shows how to use bundled work block distribution together with the task context to handle cases where a work block cannot hold the partitioned data because of the local memory size limit.
  Default directory: /opt/cell/sdk/src/task_context/dot_prod; SDK: 3.0
• task_context: dot_prod_multi: Computes the dot product of two large vectors.
  It shows how to use multi-use work blocks together with work block parameter and context buffers.
  Default directory: /opt/cell/sdk/src/task_context/dot_prod_multi; SDK: 3.0
• task_context: min_max: Shows how to use the task context to keep the partial results of each task instance and then combine these partial results into the final result.
  Default directory: /opt/cell/sdk/src/task_context/min_max; SDK: 3.0
• task_context: table_lookup: Shows how the task context buffer is used as a large lookup table to convert 16-bit input data into 8-bit output data.
  Default directory: /opt/cell/sdk/src/task_context/table_lookup; SDK: 3.0

ALF hybrid examples

These examples show how the Accelerated Library Framework works on the hybrid system.

Table 6. ALF hybrid examples

• BlackScholes_ALF: Demonstrates the Black-Scholes pricing model by computing its closed-form solution with ALF.
  RPM: alf-hybrid-examples-source; default directory: /opt/cell/sdk/prototype/src/alf/BlackScholes_ALF/; SDK: 3.0
• FFT16M_ALF: Demonstrates an ALF version of the FFT16M workload from the cell-demos-source RPM (see Table 1).
  RPM: alf-hybrid-examples-source; default directory: /opt/cell/sdk/prototype/src/alf/FFT16M_ALF; SDK: 3.0
• hello_world: Demonstrates a simple ALF program whose SPU computational kernel prints a hello world message with printf.
  RPM: alf-hybrid-examples-source; default directory: /opt/cell/sdk/prototype/src/alf/hello_world; SDK: 3.0
• inout_buffer: Demonstrates the use of overlapped I/O buffers. The first example implements C = A + B, where A, B, and C are matrices. The second example implements A = A + B, where matrix A is overwritten by the result.
  RPM: alf-hybrid-examples-source; default directory: /opt/cell/sdk/prototype/src/alf/inout_buffer; SDK: 3.0
• inverse_matrix_ovl: Demonstrates the use of overlays with ALF by calculating the inverse of a block diagonal matrix.
  RPM: alf-hybrid-examples-source; default directory: /opt/cell/sdk/prototype/src/alf/inverse_matrix_ovl; SDK: 3.0
• matrix_add: Demonstrates how to program with the ALF API.
  By adding two 1024x512 single-precision floating-point matrices, the sample code shows how a simplified scalar function can be ported and accelerated for parallel execution on ALF. This simple example illustrates several ALF features, including data partitioning on the host, data partitioning on the accelerator, overlapped input/output buffers, and data sets.
  RPM: alf-hybrid-examples-source; default directory: /opt/cell/sdk/prototype/src/alf/matrix_add; SDK: 3.0
• matrix_transpose: Demonstrates how to program with the ALF API. By transposing a 1024x512 single-precision floating-point matrix, the sample code shows how a simplified scalar function can be ported and accelerated for parallel execution on ALF. The sample contains a scalar, non-ALF version of the code, a version with data partitioning on the host, a version with data partitioning on the accelerator, and a tuned SIMD version.
  RPM: alf-hybrid-examples-source; default directory: /opt/cell/sdk/prototype/src/alf/matrix_transpose; SDK: 3.0
• PI: Demonstrates computing pi with Buffon's needle method using ALF. It shows how to use the ALF context buffer for global parameters and how to collect the results computed by every task instance. The program also displays a progress bar during the computation by using the ALF sync point feature.
  RPM: alf-hybrid-examples-source; default directory: /opt/cell/sdk/prototype/src/alf/PI; SDK: 3.0
• pipe_line: Demonstrates the task dependency feature in ALF by using it in a two-stage pipeline application. The application is a simple simulation: an object P is placed in the middle of a flat surface enclosed by a rectangular bounding box. On each simulation step, the object moves a random distance in a random direction, and it returns to its initial position when it hits a side wall of the box. The problem is to count the number of hits on the four walls over a given time period.
  RPM: alf-hybrid-examples-source; default directory: /opt/cell/sdk/prototype/src/pipe_line; SDK: 3.0
• task_context: Contains a collection of small samples that demonstrate the use of task context in ALF (the same samples listed under task_context in Table 5).
  RPM: alf-hybrid-examples-source; default directory: /opt/cell/sdk/prototype/src/task_context; SDK: 3.0

LAPACK examples

These examples show how to use the LAPACK library.

Table 7. LAPACK examples

• dsteqr_F: Computes all eigenvalues and, optionally, the eigenvectors of a symmetric tridiagonal matrix using the implicit QL or QR method.
  RPM: lapack-examples-source; default directory: /opt/cell/sdk/src/lapack-examples/dsteqr_F; SDK: 3.0.0.3
• inverse_matrix: Computes the inverse of a matrix on the Cell/B.E.
  RPM: lapack-examples-source; default directory: /opt/cell/sdk/src/lapack-examples/inverse_matrix; SDK: 3.0.0.3

BLAS examples

These examples show how to use the BLAS library.

Table 8. BLAS examples

• blas_simple: Demonstrates the use of the BLAS library, which provides the standard BLAS interface, at the PPU level.
  RPM: blas-devel; default directory: /opt/cell/sdk/src/blas-examples/blas_simple; SDK: 3.0
• blas_thread: Demonstrates the use of the BLAS library in a multithreaded PPU application.
  RPM: blas-devel; default directory: /opt/cell/sdk/src/blas-examples/blas_thread; SDK: 3.0
• spulet: Demonstrates the use of the SPU BLAS library, which contains SPU BLAS kernel routines.
  RPM: blas-devel; default directory: /opt/cell/sdk/src/blas-examples/spulet; SDK: 3.0

LIB FFT examples

These examples show how to use the FFT library.

Table 9. LIB FFT examples

• FFT1D: Demonstrates the use of the prototype libfft code. In particular, this example shows how the library can be made to perform with high efficiency on carefully crafted 1D FFTs.
  RPM: libfft-examples-source; default directory: /opt/cell/sdk/prototype/src/libfft/FFT1D; SDK: 3.0
• FFT1D.spu_only: Demonstrates the use of the prototype libfft code.
  In particular, this example shows how the library can be made to perform with high efficiency on carefully crafted 1D FFTs on an SPU.
  RPM: libfft-examples-source; default directory: /opt/cell/sdk/prototype/src/libfft/FFT1D.spu_only; SDK: 3.0
• FFT2D: Demonstrates the use of the prototype libfft code. In particular, this example shows how the library can be made to perform with high efficiency on carefully crafted 2D FFTs.
  RPM: libfft-examples-source; default directory: /opt/cell/sdk/prototype/src/libfft/FFT2D; SDK: 3.0
• FFT2D.stream: Demonstrates the use of the prototype libfft code. In particular, this example shows how the library can be made to perform with high efficiency on carefully crafted 2D FFTs.
  RPM: libfft-examples-source; default directory: /opt/cell/sdk/prototype/src/libfft/FFT2D.stream; SDK: 3.0

Resources

Learn
• Learn more about Cell/B.E. programming from the developerWorks series:
  • "Programming high-performance applications on the Cell/B.E. processor"
  • "PS3 fab-to-lab"
  • "The little broadband engine that could"
• Refer to the Cell Broadband Engine documentation section of the IBM Semiconductor Solutions Technical Library for a wealth of downloadable manuals, specifications, and more.
• Sign up for the developerWorks newsletter to get the latest developer news and Cell/B.E. happenings delivered to your inbox each week. Check Power Architecture® when you sign up to receive Cell/B.E. news in your newsletter.

Get products and technologies
• Download additional code samples by choosing the IBM Cell Education link at this Information Center site.
• Get your copy of the IBM SDK for Multicore Acceleration 3.0, or browse through the library of Cell/B.E. documentation.
• Find all Cell/B.E.-related articles, discussion forums, downloads, and more at the IBM developerWorks Cell Broadband Engine resource center: your definitive resource for all things Cell/B.E.
• Contact IBM about custom Cell/B.E.-based or custom-processor-based solutions.
Discuss
• Check out the Cell Broadband Engine Architecture forum to get your technical questions about the processor answered. Juicy problems and answers from the forums are rounded up periodically and highlighted in the "Forum watch" blog series.
• Go to the Cell Broadband Engine/Power Architecture blog for news, downloads, instructional resources, and event notifications for Cell/B.E. and other Power Architecture-related technologies. There you can find the popular "Forum watch" blog series (Q&A roundup), the "FixIt" technology updates, and the Infobomb quick-read technology introductions.

About the authors

Anita Bateman
Anita Bateman is a Senior IT Architect with the IBM Corporation. She has worked at IBM in software development and architecture since 1998. She is a certified architect with both IBM and The Open Group, and she has filed and published several patents. She holds an M.S. in computer science from the University of Texas at Austin and a B.S. in computer science from Hope College in Holland, Michigan. She is currently a Cell Broadband Engine solutions architect, working with partners and customers to adopt the IBM multi-core technology and to improve Cell/B.E. programmability.

VanDung To
VanDung To received a computer science degree from Rice University in 2002 and joined IBM shortly afterward. She is an advisory software engineer in the Quasar Design Center at IBM, and she has been working with Cell/B.E. technology since 2002.

© Copyright IBM Corporation 2008 (www.ibm.com/legal/copytrade.shtml)
Trademarks (www.ibm.com/developerworks/ibm/trademarks/)