=============================
User Guide for NVPTX Back-end
=============================

.. contents::
   :local:
   :depth: 3


Introduction
============

To support GPU programming, the NVPTX back-end supports a subset of LLVM IR
along with a defined set of conventions used to represent GPU programming
concepts. This document provides an overview of the general usage of the
back-end, including a description of the conventions used and the set of
accepted LLVM IR.

.. note::

   This document assumes a basic familiarity with CUDA and the PTX
   assembly language. Information about the CUDA Driver API and the PTX
   assembly language can be found in the `CUDA documentation
   <http://docs.nvidia.com/cuda/index.html>`_.


Conventions
===========

Marking Functions as Kernels
----------------------------

In PTX, there are two types of functions: *device functions*, which are only
callable by device code, and *kernel functions*, which are callable by host
code. By default, the back-end will emit device functions. Metadata is used to
declare a function as a kernel function. This metadata is attached to the
``nvvm.annotations`` named metadata object, and has the following format:

.. code-block:: llvm

    !0 = metadata !{<function-ref>, metadata !"kernel", i32 1}

The first parameter is a reference to the kernel function. The following
example shows a kernel function calling a device function in LLVM IR. The
function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not.

.. code-block:: llvm

    define float @my_fmad(float %x, float %y, float %z) {
      %mul = fmul float %x, %y
      %add = fadd float %mul, %z
      ret float %add
    }

    define void @my_kernel(float* %ptr) {
      %val = load float* %ptr
      %ret = call float @my_fmad(float %val, float %val, float %val)
      store float %ret, float* %ptr
      ret void
    }

    !nvvm.annotations = !{!1}
    !1 = metadata !{void (float*)* @my_kernel, metadata !"kernel", i32 1}

When compiled, the PTX kernel functions are callable by host-side code.


.. _address_spaces:

Address Spaces
--------------

The NVPTX back-end uses the following address space mapping:

  ============= ======================
  Address Space Memory Space
  ============= ======================
  0             Generic
  1             Global
  2             Internal Use
  3             Shared
  4             Constant
  5             Local
  ============= ======================

Every global variable and pointer type is assigned to one of these address
spaces, with 0 being the default address space. Intrinsics are provided which
can be used to convert pointers between the generic and non-generic address
spaces.

As an example, the following IR will define an array ``@g`` that resides in
global device memory.

.. code-block:: llvm

    @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]

LLVM IR functions can read and write to this array, and host-side code can
copy data to it by name with the CUDA Driver API.

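For illustration, the following sketch shows how host code might initialize
``@g`` by name through the Driver API, assuming the module defining it has
already been loaded into a ``CUmodule`` named ``cudaModule`` (error checking
omitted):

.. code-block:: c++

    CUdeviceptr devG;     // device address of @g
    size_t      devGSize; // size of @g in bytes

    // Look up the global by its IR name, then overwrite its contents.
    cuModuleGetGlobal(&devG, &devGSize, cudaModule, "g");
    int hostG[4] = {10, 11, 12, 13};
    cuMemcpyHtoD(devG, hostG, sizeof(hostG));
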
Note that since address space 0 is the generic space, it is illegal to have
global variables in address space 0. Address space 0 is the default address
space in LLVM, so the ``addrspace(N)`` annotation is *required* for global
variables.


Triples
-------

The NVPTX target uses the module triple to select between 32/64-bit code
generation and the driver-compiler interface to use. The triple architecture
can be one of ``nvptx`` (32-bit PTX) or ``nvptx64`` (64-bit PTX). The
operating system should be one of ``cuda`` or ``nvcl``, which determines the
interface used by the generated code to communicate with the driver. Most
users will want to use ``cuda`` as the operating system, which makes the
generated PTX compatible with the CUDA Driver API.

Example: 32-bit PTX for CUDA Driver API: ``nvptx-nvidia-cuda``

Example: 64-bit PTX for CUDA Driver API: ``nvptx64-nvidia-cuda``


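The triple is normally set by the ``target triple`` line in the IR module,
but it can also be supplied on the ``llc`` command line with ``-mtriple``. A
sketch (the input file name is a placeholder):

.. code-block:: text

    # llc -mtriple=nvptx64-nvidia-cuda -mcpu=sm_20 kernel.ll -o kernel.ptx
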
.. _nvptx_intrinsics:

NVPTX Intrinsics
================

Address Space Conversion
------------------------

'``llvm.nvvm.ptr.*.to.gen``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics. You can use these on any pointer types.

.. code-block:: llvm

    declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)
    declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)
    declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
    declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic
address space to a generic address space pointer.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid generic address space
pointer.


'``llvm.nvvm.ptr.gen.to.*``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics. You can use these on any pointer types.

.. code-block:: llvm

    declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*)
    declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)
    declare i8 addrspace(4)* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8*)
    declare i8 addrspace(5)* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic
address space to a pointer in the target address space. Note that these
intrinsics are only useful if the non-generic address space that the pointer
actually refers to is known. It is not legal to use address space conversion
intrinsics to convert a pointer from one non-generic address space to another
non-generic address space.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid pointer in the target
non-generic address space.


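As an illustrative sketch, a shared-memory pointer can be converted to the
generic space before being passed to code that expects a generic pointer (the
buffer and function names here are hypothetical):

.. code-block:: llvm

    @scratch = internal addrspace(3) global [64 x i8] zeroinitializer

    declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)

    define i8* @get_scratch() {
      ; Convert the shared-space pointer to a generic pointer
      %ptr = getelementptr [64 x i8] addrspace(3)* @scratch, i32 0, i32 0
      %gen = call i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)* %ptr)
      ret i8* %gen
    }
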
Reading PTX Special Registers
-----------------------------

'``llvm.nvvm.read.ptx.sreg.*``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()

Overview:
"""""""""

The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX
special registers, in particular the kernel launch bounds. These registers
map in the following way to CUDA builtins:

  ============= =====================================
  CUDA Builtin  PTX Special Register Intrinsic
  ============= =====================================
  ``threadIdx`` ``@llvm.nvvm.read.ptx.sreg.tid.*``
  ``blockIdx``  ``@llvm.nvvm.read.ptx.sreg.ctaid.*``
  ``blockDim``  ``@llvm.nvvm.read.ptx.sreg.ntid.*``
  ``gridDim``   ``@llvm.nvvm.read.ptx.sreg.nctaid.*``
  ============= =====================================


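For example, these intrinsics can be combined to compute a thread's global
index, mirroring the common CUDA expression
``blockIdx.x * blockDim.x + threadIdx.x``. A minimal sketch:

.. code-block:: llvm

    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()

    define i32 @global_thread_id() {
      %tid   = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
      %ntid  = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
      %ctaid = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
      ; id = ctaid.x * ntid.x + tid.x
      %tmp = mul i32 %ctaid, %ntid
      %id  = add i32 %tmp, %tid
      ret i32 %id
    }
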
Barriers
--------

'``llvm.nvvm.barrier0``'
^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

    declare void @llvm.nvvm.barrier0()

Overview:
"""""""""

The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
instruction, equivalent to the ``__syncthreads()`` call in CUDA.


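As a sketch, the barrier can be used to make one thread's writes to shared
memory visible to the rest of its CTA (the shared buffer and function here
are hypothetical):

.. code-block:: llvm

    @sdata = internal addrspace(3) global [256 x float] zeroinitializer

    declare void @llvm.nvvm.barrier0()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()

    define void @stage(float addrspace(1)* %in) {
      %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
      %src = getelementptr float addrspace(1)* %in, i32 %tid
      %val = load float addrspace(1)* %src, align 4
      %dst = getelementptr [256 x float] addrspace(3)* @sdata, i32 0, i32 %tid
      store float %val, float addrspace(3)* %dst, align 4
      ; All threads in the CTA wait here; afterwards, every element of
      ; @sdata stored above is visible to every thread in the CTA.
      call void @llvm.nvvm.barrier0()
      ret void
    }
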
Other Intrinsics
----------------

For the full set of NVPTX intrinsics, please see the
``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.


.. _libdevice:

Linking with Libdevice
======================

The CUDA Toolkit comes with an LLVM bitcode library called ``libdevice`` that
implements many common mathematical functions. This library can be used as a
high-performance math library for any compilers using the LLVM NVPTX target.
The library can be found under ``nvvm/libdevice/`` in the CUDA Toolkit, and
there is a separate version for each compute architecture.

For a list of all math functions implemented in libdevice, see
`libdevice Users Guide <http://docs.nvidia.com/cuda/libdevice-users-guide/index.html>`_.

To accommodate various math-related compiler flags that can affect code
generation of libdevice code, the library code depends on a special LLVM IR
pass (``NVVMReflect``) to handle conditional compilation within LLVM IR. This
pass looks for calls to the ``@__nvvm_reflect`` function and replaces them
with constants based on the defined reflection parameters. Such conditional
code often follows a pattern:

.. code-block:: c++

    float my_function(float a) {
      if (__nvvm_reflect("FASTMATH"))
        return my_function_fast(a);
      else
        return my_function_precise(a);
    }

The default value for all unspecified reflection parameters is zero.

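For example, with ``FASTMATH`` unspecified (and therefore zero), the pass
replaces the reflection call with a constant, and later optimization folds
the branch away. Conceptually (a sketch, not literal pass output):

.. code-block:: llvm

    ; Before NVVMReflect: %str points at the string "FASTMATH"
    %reflect = call i32 @__nvvm_reflect(i8* %str)
    %cond = icmp ne i32 %reflect, 0
    br i1 %cond, label %fast, label %precise

    ; After NVVMReflect, the call is the constant 0, so the branch becomes:
    br i1 false, label %fast, label %precise
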
The ``NVVMReflect`` pass should be executed early in the optimization
pipeline, immediately after the link stage. The ``internalize`` pass is also
recommended to remove unused math functions from the resulting PTX. For an
input IR module ``module.bc``, the following compilation flow is recommended
(a command-line sketch appears after the list):

1. Save the list of external functions in ``module.bc``
2. Link ``module.bc`` with ``libdevice.compute_XX.YY.bc``
3. Internalize all functions not in the list from (1)
4. Eliminate all unused internal functions
5. Run the ``NVVMReflect`` pass
6. Run the standard optimization pipeline

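A sketch of this flow with the command-line tools (file names are
placeholders; ``-internalize`` with a public API list covers step (3), and
the unused internals from step (4) are removed by global DCE during
``-O3``):

.. code-block:: text

    # llvm-link module.bc libdevice.compute_20.10.bc -o module.linked.bc
    # opt -internalize -internalize-public-api-list=kernel -nvvm-reflect -O3 module.linked.bc -o module.opt.bc
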
.. note::

   ``linkonce`` and ``linkonce_odr`` linkage types are not suitable for the
   libdevice functions, because it must remain possible to link two IR modules
   that have been linked against libdevice using different reflection
   variables.

Since the ``NVVMReflect`` pass replaces conditionals with constants, it will
often leave behind dead code of the form:

.. code-block:: llvm

    entry:
      ..
      br i1 true, label %foo, label %bar
    foo:
      ..
    bar:
      ; Dead code
      ..

Therefore, it is recommended that ``NVVMReflect`` is executed early in the
optimization pipeline, before dead-code elimination.


Reflection Parameters
---------------------

The libdevice library currently uses the following reflection parameters to
control code generation:

==================== ======================================================
Flag                 Description
==================== ======================================================
``__CUDA_FTZ=[0,1]`` Use optimized code paths that flush subnormals to zero
==================== ======================================================


Invoking NVVMReflect
--------------------

To ensure that all dead code caused by the reflection pass is eliminated, it
is recommended that the reflection pass is executed early in the LLVM IR
optimization pipeline. The pass takes an optional mapping of reflection
parameter name to an integer value. This mapping can be specified as either a
command-line option to ``opt`` or as an LLVM ``StringMap<int>`` object when
programmatically creating a pass pipeline.

With ``opt``:

.. code-block:: text

    # opt -nvvm-reflect -nvvm-reflect-list=<var>=<value>,<var>=<value> module.bc -o module.reflect.bc


With a programmatic pass pipeline:

.. code-block:: c++

    extern ModulePass *llvm::createNVVMReflectPass(const StringMap<int>& Mapping);

    StringMap<int> ReflectParams;
    ReflectParams["__CUDA_FTZ"] = 1;
    Passes.add(createNVVMReflectPass(ReflectParams));


Executing PTX
=============

The most common way to execute PTX assembly on a GPU device is to use the CUDA
Driver API. This API is a low-level interface to the GPU driver and allows for
JIT compilation of PTX code to native GPU machine code.

Initializing the Driver API:

.. code-block:: c++

    CUdevice device;
    CUcontext context;

    // Initialize the driver API
    cuInit(0);
    // Get a handle to the first compute device
    cuDeviceGet(&device, 0);
    // Create a compute device context
    cuCtxCreate(&context, 0, device);

JIT compiling a PTX string to a device binary:

.. code-block:: c++

    CUmodule module;
    CUfunction function;

    // JIT compile a null-terminated PTX string
    cuModuleLoadData(&module, (void*)PTXString);

    // Get a handle to the "myfunction" kernel function
    cuModuleGetFunction(&function, module, "myfunction");

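Launching the compiled kernel is similarly brief. A sketch assuming a
one-dimensional launch of 16 threads and a kernel taking a single device
buffer (``devPtr`` is hypothetical):

.. code-block:: c++

    CUdeviceptr devPtr;
    cuMemAlloc(&devPtr, sizeof(float) * 16);

    // Kernel arguments are passed as an array of pointers to the values.
    void *params[] = { &devPtr };
    cuLaunchKernel(function, 1, 1, 1,   // grid dimensions
                   16, 1, 1,            // block dimensions
                   0, NULL,             // shared memory size, stream
                   params, NULL);
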
For full examples of executing PTX assembly, please see the `CUDA Samples
<https://developer.nvidia.com/cuda-downloads>`_ distribution.


Common Issues
=============

ptxas complains of undefined function: __nvvm_reflect
-----------------------------------------------------

When linking with libdevice, the ``NVVMReflect`` pass must be used. See
:ref:`libdevice` for more information.


Tutorial: A Simple Compute Kernel
=================================

To start, let us take a look at a simple compute kernel written directly in
LLVM IR. The kernel implements vector addition, where each thread computes one
element of the output vector C from the input vectors A and B. To make this
easier, we also assume that only a single CTA (thread block) will be launched,
and that it will be one dimensional.


The Kernel
----------

.. code-block:: llvm

    target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
    target triple = "nvptx64-nvidia-cuda"

    ; Intrinsic to read X component of thread ID
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind

    define void @kernel(float addrspace(1)* %A,
                        float addrspace(1)* %B,
                        float addrspace(1)* %C) {
    entry:
      ; What is my ID?
      %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind

      ; Compute pointers into A, B, and C
      %ptrA = getelementptr float addrspace(1)* %A, i32 %id
      %ptrB = getelementptr float addrspace(1)* %B, i32 %id
      %ptrC = getelementptr float addrspace(1)* %C, i32 %id

      ; Read A, B
      %valA = load float addrspace(1)* %ptrA, align 4
      %valB = load float addrspace(1)* %ptrB, align 4

      ; Compute C = A + B
      %valC = fadd float %valA, %valB

      ; Store back to C
      store float %valC, float addrspace(1)* %ptrC, align 4

      ret void
    }

    !nvvm.annotations = !{!0}
    !0 = metadata !{void (float addrspace(1)*,
                          float addrspace(1)*,
                          float addrspace(1)*)* @kernel, metadata !"kernel", i32 1}


We can use the LLVM ``llc`` tool to directly run the NVPTX code generator:

.. code-block:: text

    # llc -mcpu=sm_20 kernel.ll -o kernel.ptx


.. note::

   If you want to generate 32-bit code, change ``p:64:64:64`` to ``p:32:32:32``
   in the module data layout string and use ``nvptx-nvidia-cuda`` as the
   target triple.


The output we get from ``llc`` (as of LLVM 3.4):

.. code-block:: text

    //
    // Generated by LLVM NVPTX Back-End
    //

    .version 3.1
    .target sm_20
    .address_size 64

      // .globl kernel
                                            // @kernel
    .visible .entry kernel(
      .param .u64 kernel_param_0,
      .param .u64 kernel_param_1,
      .param .u64 kernel_param_2
    )
    {
      .reg .f32   %f<4>;
      .reg .s32   %r<2>;
      .reg .s64   %rl<8>;

    // BB#0:                                // %entry
      ld.param.u64    %rl1, [kernel_param_0];
      mov.u32         %r1, %tid.x;
      mul.wide.s32    %rl2, %r1, 4;
      add.s64         %rl3, %rl1, %rl2;
      ld.param.u64    %rl4, [kernel_param_1];
      add.s64         %rl5, %rl4, %rl2;
      ld.param.u64    %rl6, [kernel_param_2];
      add.s64         %rl7, %rl6, %rl2;
      ld.global.f32   %f1, [%rl3];
      ld.global.f32   %f2, [%rl5];
      add.f32         %f3, %f1, %f2;
      st.global.f32   [%rl7], %f3;
      ret;
    }


Dissecting the Kernel
---------------------

Now let us dissect the LLVM IR that makes up this kernel.

Data Layout
^^^^^^^^^^^

The data layout string determines the size in bits of common data types, their
ABI alignment, and their storage size. For NVPTX, you should use one of the
following:

32-bit PTX:

.. code-block:: llvm

    target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"

64-bit PTX:

.. code-block:: llvm

    target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"


Target Intrinsics
^^^^^^^^^^^^^^^^^

In this example, we use the ``@llvm.nvvm.read.ptx.sreg.tid.x`` intrinsic to
read the X component of the current thread's ID, which corresponds to a read
of register ``%tid.x`` in PTX. The NVPTX back-end supports a large set of
intrinsics. A short list is shown below; please see
``include/llvm/IR/IntrinsicsNVVM.td`` for the full list.


================================================ ====================
Intrinsic                                        CUDA Equivalent
================================================ ====================
``i32 @llvm.nvvm.read.ptx.sreg.tid.{x,y,z}``     threadIdx.{x,y,z}
``i32 @llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z}``   blockIdx.{x,y,z}
``i32 @llvm.nvvm.read.ptx.sreg.ntid.{x,y,z}``    blockDim.{x,y,z}
``i32 @llvm.nvvm.read.ptx.sreg.nctaid.{x,y,z}``  gridDim.{x,y,z}
``void @llvm.nvvm.barrier0()``                   __syncthreads()
================================================ ====================


Address Spaces
^^^^^^^^^^^^^^

You may have noticed that all of the pointer types in the LLVM IR example had
an explicit address space specifier. What is address space 1? NVIDIA GPU
devices (generally) have four types of memory:

- Global: Large, off-chip memory
- Shared: Small, on-chip memory shared among all threads in a CTA
- Local: Per-thread, private memory
- Constant: Read-only memory shared across all threads

These different types of memory are represented in LLVM IR as address spaces.
There is also a fifth address space used by the NVPTX code generator that
corresponds to the "generic" address space. This address space can represent
addresses in any other address space (with a few exceptions). This allows
users to write IR functions that can load/store memory using the same
instructions. Intrinsics are provided to convert pointers between the generic
and non-generic address spaces.

See :ref:`address_spaces` and :ref:`nvptx_intrinsics` for more information.


Kernel Metadata
^^^^^^^^^^^^^^^

In PTX, a function can be either a `kernel` function (callable from the host
program), or a `device` function (callable only from GPU code). You can think
of `kernel` functions as entry-points in the GPU program. To mark an LLVM IR
function as a `kernel` function, we make use of special LLVM metadata. The
NVPTX back-end will look for a named metadata node called
``nvvm.annotations``. This named metadata must contain a list of metadata that
describe the IR. For our purposes, we need to declare a metadata node that
assigns the "kernel" attribute to the LLVM IR function that should be emitted
as a PTX `kernel` function. These metadata nodes take the form:

.. code-block:: text

    metadata !{<function ref>, metadata !"kernel", i32 1}

For the previous example, we have:

.. code-block:: llvm

    !nvvm.annotations = !{!0}
    !0 = metadata !{void (float addrspace(1)*,
                          float addrspace(1)*,
                          float addrspace(1)*)* @kernel, metadata !"kernel", i32 1}

Here, we have a single metadata declaration in ``nvvm.annotations``. This
metadata annotates our ``@kernel`` function with the ``kernel`` attribute.


Running the Kernel
------------------

Generating PTX from LLVM IR is all well and good, but how do we execute it on
a real GPU device? The CUDA Driver API provides a convenient mechanism for
loading PTX, JIT compiling it to native GPU machine code, and launching
kernels. The API is similar to OpenCL. A simple example showing how to load
and execute our vector addition code is shown below. Note that for brevity
this code does not perform much error checking!

.. note::

   You can also use the ``ptxas`` tool provided by the CUDA Toolkit to compile
   PTX to machine code (SASS) offline for a specific GPU architecture. Such
   binaries can be loaded by the CUDA Driver API in the same way as PTX. This
   can be useful for reducing startup time by precompiling the PTX kernels.


.. code-block:: c++

    #include <iostream>
    #include <fstream>
    #include <cassert>
    #include "cuda.h"


    void checkCudaErrors(CUresult err) {
      assert(err == CUDA_SUCCESS);
    }

    /// main - Program entry point
    int main(int argc, char **argv) {
      CUdevice    device;
      CUmodule    cudaModule;
      CUcontext   context;
      CUfunction  function;
      CUlinkState linker;
      int         devCount;

      // CUDA initialization
      checkCudaErrors(cuInit(0));
      checkCudaErrors(cuDeviceGetCount(&devCount));
      checkCudaErrors(cuDeviceGet(&device, 0));

      char name[128];
      checkCudaErrors(cuDeviceGetName(name, 128, device));
      std::cout << "Using CUDA Device [0]: " << name << "\n";

      int devMajor, devMinor;
      checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
      std::cout << "Device Compute Capability: "
                << devMajor << "." << devMinor << "\n";
      if (devMajor < 2) {
        std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
        return 1;
      }

      std::ifstream t("kernel.ptx");
      if (!t.is_open()) {
        std::cerr << "kernel.ptx not found\n";
        return 1;
      }
      std::string str((std::istreambuf_iterator<char>(t)),
                      std::istreambuf_iterator<char>());

      // Create driver context
      checkCudaErrors(cuCtxCreate(&context, 0, device));

      // Create module for object
      checkCudaErrors(cuModuleLoadDataEx(&cudaModule, str.c_str(), 0, 0, 0));

      // Get kernel function
      checkCudaErrors(cuModuleGetFunction(&function, cudaModule, "kernel"));

      // Device data
      CUdeviceptr devBufferA;
      CUdeviceptr devBufferB;
      CUdeviceptr devBufferC;

      checkCudaErrors(cuMemAlloc(&devBufferA, sizeof(float)*16));
      checkCudaErrors(cuMemAlloc(&devBufferB, sizeof(float)*16));
      checkCudaErrors(cuMemAlloc(&devBufferC, sizeof(float)*16));

      float* hostA = new float[16];
      float* hostB = new float[16];
      float* hostC = new float[16];

      // Populate input
      for (unsigned i = 0; i != 16; ++i) {
        hostA[i] = (float)i;
        hostB[i] = (float)(2*i);
        hostC[i] = 0.0f;
      }

      checkCudaErrors(cuMemcpyHtoD(devBufferA, &hostA[0], sizeof(float)*16));
      checkCudaErrors(cuMemcpyHtoD(devBufferB, &hostB[0], sizeof(float)*16));


      unsigned blockSizeX = 16;
      unsigned blockSizeY = 1;
      unsigned blockSizeZ = 1;
      unsigned gridSizeX  = 1;
      unsigned gridSizeY  = 1;
      unsigned gridSizeZ  = 1;

      // Kernel parameters
      void *KernelParams[] = { &devBufferA, &devBufferB, &devBufferC };

      std::cout << "Launching kernel\n";

      // Kernel launch
      checkCudaErrors(cuLaunchKernel(function, gridSizeX, gridSizeY, gridSizeZ,
                                     blockSizeX, blockSizeY, blockSizeZ,
                                     0, NULL, KernelParams, NULL));

      // Retrieve device data
      checkCudaErrors(cuMemcpyDtoH(&hostC[0], devBufferC, sizeof(float)*16));


      std::cout << "Results:\n";
      for (unsigned i = 0; i != 16; ++i) {
        std::cout << hostA[i] << " + " << hostB[i] << " = " << hostC[i] << "\n";
      }


      // Clean up after ourselves
      delete [] hostA;
      delete [] hostB;
      delete [] hostC;

      // Clean-up
      checkCudaErrors(cuMemFree(devBufferA));
      checkCudaErrors(cuMemFree(devBufferB));
      checkCudaErrors(cuMemFree(devBufferC));
      checkCudaErrors(cuModuleUnload(cudaModule));
      checkCudaErrors(cuCtxDestroy(context));

      return 0;
    }


You will need to link with the CUDA driver and specify the path to ``cuda.h``.

.. code-block:: text

    # clang++ sample.cpp -o sample -O2 -g -I/usr/local/cuda-5.5/include -lcuda

We don't need to specify a path to ``libcuda.so`` since this is installed in a
system location by the driver, not the CUDA toolkit.

If everything goes as planned, you should see the following output when
running the compiled program:

.. code-block:: text

    Using CUDA Device [0]: GeForce GTX 680
    Device Compute Capability: 3.0
    Launching kernel
    Results:
    0 + 0 = 0
    1 + 2 = 3
    2 + 4 = 6
    3 + 6 = 9
    4 + 8 = 12
    5 + 10 = 15
    6 + 12 = 18
    7 + 14 = 21
    8 + 16 = 24
    9 + 18 = 27
    10 + 20 = 30
    11 + 22 = 33
    12 + 24 = 36
    13 + 26 = 39
    14 + 28 = 42
    15 + 30 = 45

.. note::

   You will likely see a different device identifier based on your hardware.


Tutorial: Linking with Libdevice
================================

In this tutorial, we show a simple example of linking LLVM IR with the
libdevice library. We will use the same kernel as the previous tutorial,
except that we will compute ``C = pow(A, B)`` instead of ``C = A + B``.
Libdevice provides an ``__nv_powf`` function that we will use.

.. code-block:: llvm

    target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
    target triple = "nvptx64-nvidia-cuda"

    ; Intrinsic to read X component of thread ID
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
    ; libdevice function
    declare float @__nv_powf(float, float)

    define void @kernel(float addrspace(1)* %A,
                        float addrspace(1)* %B,
                        float addrspace(1)* %C) {
    entry:
      ; What is my ID?
      %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind

      ; Compute pointers into A, B, and C
      %ptrA = getelementptr float addrspace(1)* %A, i32 %id
      %ptrB = getelementptr float addrspace(1)* %B, i32 %id
      %ptrC = getelementptr float addrspace(1)* %C, i32 %id

      ; Read A, B
      %valA = load float addrspace(1)* %ptrA, align 4
      %valB = load float addrspace(1)* %ptrB, align 4

      ; Compute C = pow(A, B)
      %valC = call float @__nv_powf(float %valA, float %valB)

      ; Store back to C
      store float %valC, float addrspace(1)* %ptrC, align 4

      ret void
    }

    !nvvm.annotations = !{!0}
    !0 = metadata !{void (float addrspace(1)*,
                          float addrspace(1)*,
                          float addrspace(1)*)* @kernel, metadata !"kernel", i32 1}


To compile this kernel, we perform the following steps:

1. Link with libdevice
2. Internalize all but the public kernel function
3. Run ``NVVMReflect`` and set ``__CUDA_FTZ`` to 0
4. Optimize the linked module
5. Codegen the module


These steps can be performed by the LLVM ``llvm-link``, ``opt``, and ``llc``
tools. In a complete compiler, these steps can also be performed entirely
programmatically by setting up an appropriate pass configuration (see
:ref:`libdevice`).

.. code-block:: text

    # llvm-link t2.bc libdevice.compute_20.10.bc -o t2.linked.bc
    # opt -internalize -internalize-public-api-list=kernel -nvvm-reflect-list=__CUDA_FTZ=0 -nvvm-reflect -O3 t2.linked.bc -o t2.opt.bc
    # llc -mcpu=sm_20 t2.opt.bc -o t2.ptx

.. note::

   The ``-nvvm-reflect-list=__CUDA_FTZ=0`` option is not strictly required, as
   any undefined variables will default to zero. It is shown here for
   illustrative purposes.


This gives us the following PTX (excerpt):

.. code-block:: text

    //
    // Generated by LLVM NVPTX Back-End
    //

    .version 3.1
    .target sm_20
    .address_size 64

      // .globl kernel
                                            // @kernel
    .visible .entry kernel(
      .param .u64 kernel_param_0,
      .param .u64 kernel_param_1,
      .param .u64 kernel_param_2
    )
    {
      .reg .pred  %p<30>;
      .reg .f32   %f<111>;
      .reg .s32   %r<21>;
      .reg .s64   %rl<8>;

    // BB#0:                                // %entry
      ld.param.u64    %rl2, [kernel_param_0];
      mov.u32         %r3, %tid.x;
      ld.param.u64    %rl3, [kernel_param_1];
      mul.wide.s32    %rl4, %r3, 4;
      add.s64         %rl5, %rl2, %rl4;
      ld.param.u64    %rl6, [kernel_param_2];
      add.s64         %rl7, %rl3, %rl4;
      add.s64         %rl1, %rl6, %rl4;
      ld.global.f32   %f1, [%rl5];
      ld.global.f32   %f2, [%rl7];
      setp.eq.f32     %p1, %f1, 0f3F800000;
      setp.eq.f32     %p2, %f2, 0f00000000;
      or.pred         %p3, %p1, %p2;
      @%p3 bra        BB0_1;
      bra.uni         BB0_2;
    BB0_1:
      mov.f32         %f110, 0f3F800000;
      st.global.f32   [%rl1], %f110;
      ret;
    BB0_2:                                  // %__nv_isnanf.exit.i
      abs.f32         %f4, %f1;
      setp.gtu.f32    %p4, %f4, 0f7F800000;
      @%p4 bra        BB0_4;
    // BB#3:                                // %__nv_isnanf.exit5.i
      abs.f32         %f5, %f2;
      setp.le.f32     %p5, %f5, 0f7F800000;
      @%p5 bra        BB0_5;
    BB0_4:                                  // %.critedge1.i
      add.f32         %f110, %f1, %f2;
      st.global.f32   [%rl1], %f110;
      ret;
    BB0_5:                                  // %__nv_isinff.exit.i

      ...

    BB0_26:                                 // %__nv_truncf.exit.i.i.i.i.i
      mul.f32         %f90, %f107, 0f3FB8AA3B;
      cvt.rzi.f32.f32 %f91, %f90;
      mov.f32         %f92, 0fBF317200;
      fma.rn.f32      %f93, %f91, %f92, %f107;
      mov.f32         %f94, 0fB5BFBE8E;
      fma.rn.f32      %f95, %f91, %f94, %f93;
      mul.f32         %f89, %f95, 0f3FB8AA3B;
      // inline asm
      ex2.approx.ftz.f32 %f88,%f89;
      // inline asm
      add.f32         %f96, %f91, 0f00000000;
      ex2.approx.f32  %f97, %f96;
      mul.f32         %f98, %f88, %f97;
      setp.lt.f32     %p15, %f107, 0fC2D20000;
      selp.f32        %f99, 0f00000000, %f98, %p15;
      setp.gt.f32     %p16, %f107, 0f42D20000;
      selp.f32        %f110, 0f7F800000, %f99, %p16;
      setp.eq.f32     %p17, %f110, 0f7F800000;
      @%p17 bra       BB0_28;
    // BB#27:
      fma.rn.f32      %f110, %f110, %f108, %f110;
    BB0_28:                                 // %__internal_accurate_powf.exit.i
      setp.lt.f32     %p18, %f1, 0f00000000;
      setp.eq.f32     %p19, %f3, 0f3F800000;
      and.pred        %p20, %p18, %p19;
      @!%p20 bra      BB0_30;
      bra.uni         BB0_29;
    BB0_29:
      mov.b32         %r9, %f110;
      xor.b32         %r10, %r9, -2147483648;
      mov.b32         %f110, %r10;
    BB0_30:                                 // %__nv_powf.exit
      st.global.f32   [%rl1], %f110;
      ret;
    }