]>
Commit | Line | Data |
---|---|---|
1a4d82fc JJ |
1 | ============================= |
2 | User Guide for NVPTX Back-end | |
3 | ============================= | |
4 | ||
5 | .. contents:: | |
6 | :local: | |
7 | :depth: 3 | |
8 | ||
9 | ||
10 | Introduction | |
11 | ============ | |
12 | ||
13 | To support GPU programming, the NVPTX back-end supports a subset of LLVM IR | |
14 | along with a defined set of conventions used to represent GPU programming | |
15 | concepts. This document provides an overview of the general usage of the back- | |
16 | end, including a description of the conventions used and the set of accepted | |
17 | LLVM IR. | |
18 | ||
19 | .. note:: | |
20 | ||
21 | This document assumes a basic familiarity with CUDA and the PTX | |
22 | assembly language. Information about the CUDA Driver API and the PTX assembly | |
23 | language can be found in the `CUDA documentation | |
24 | <http://docs.nvidia.com/cuda/index.html>`_. | |
25 | ||
26 | ||
27 | ||
28 | Conventions | |
29 | =========== | |
30 | ||
31 | Marking Functions as Kernels | |
32 | ---------------------------- | |
33 | ||
34 | In PTX, there are two types of functions: *device functions*, which are only | |
35 | callable by device code, and *kernel functions*, which are callable by host | |
36 | code. By default, the back-end will emit device functions. Metadata is used to | |
37 | declare a function as a kernel function. This metadata is attached to the | |
38 | ``nvvm.annotations`` named metadata object, and has the following format: | |
39 | ||
40 | .. code-block:: llvm | |
41 | ||
42 | !0 = metadata !{<function-ref>, metadata !"kernel", i32 1} | |
43 | ||
44 | The first parameter is a reference to the kernel function. The following | |
45 | example shows a kernel function calling a device function in LLVM IR. The | |
46 | function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not. | |
47 | ||
48 | .. code-block:: llvm | |
49 | ||
50 | define float @my_fmad(float %x, float %y, float %z) { | |
51 | %mul = fmul float %x, %y | |
52 | %add = fadd float %mul, %z | |
53 | ret float %add | |
54 | } | |
55 | ||
56 | define void @my_kernel(float* %ptr) { | |
57 | %val = load float* %ptr | |
58 | %ret = call float @my_fmad(float %val, float %val, float %val) | |
59 | store float %ret, float* %ptr | |
60 | ret void | |
61 | } | |
62 | ||
63 | !nvvm.annotations = !{!1} | |
64 | !1 = metadata !{void (float*)* @my_kernel, metadata !"kernel", i32 1} | |
65 | ||
66 | When compiled, the PTX kernel functions are callable by host-side code. | |
67 | ||
68 | ||
69 | .. _address_spaces: | |
70 | ||
71 | Address Spaces | |
72 | -------------- | |
73 | ||
74 | The NVPTX back-end uses the following address space mapping: | |
75 | ||
76 | ============= ====================== | |
77 | Address Space Memory Space | |
78 | ============= ====================== | |
79 | 0 Generic | |
80 | 1 Global | |
81 | 2 Internal Use | |
82 | 3 Shared | |
83 | 4 Constant | |
84 | 5 Local | |
85 | ============= ====================== | |
86 | ||
87 | Every global variable and pointer type is assigned to one of these address | |
88 | spaces, with 0 being the default address space. Intrinsics are provided which | |
89 | can be used to convert pointers between the generic and non-generic address | |
90 | spaces. | |
91 | ||
92 | As an example, the following IR will define an array ``@g`` that resides in | |
93 | global device memory. | |
94 | ||
95 | .. code-block:: llvm | |
96 | ||
97 | @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ] | |
98 | ||
99 | LLVM IR functions can read and write to this array, and host-side code can | |
100 | copy data to it by name with the CUDA Driver API. | |
101 | ||
102 | Note that since address space 0 is the generic space, it is illegal to have | |
103 | global variables in address space 0. Address space 0 is the default address | |
104 | space in LLVM, so the ``addrspace(N)`` annotation is *required* for global | |
105 | variables. | |
106 | ||
107 | ||
108 | Triples | |
109 | ------- | |
110 | ||
111 | The NVPTX target uses the module triple to select between 32/64-bit code | |
112 | generation and the driver-compiler interface to use. The triple architecture | |
113 | can be one of ``nvptx`` (32-bit PTX) or ``nvptx64`` (64-bit PTX). The | |
114 | operating system should be one of ``cuda`` or ``nvcl``, which determines the | |
115 | interface used by the generated code to communicate with the driver. Most | |
116 | users will want to use ``cuda`` as the operating system, which makes the | |
117 | generated PTX compatible with the CUDA Driver API. | |
118 | ||
119 | Example: 32-bit PTX for CUDA Driver API: ``nvptx-nvidia-cuda`` | |
120 | ||
121 | Example: 64-bit PTX for CUDA Driver API: ``nvptx64-nvidia-cuda`` | |
122 | ||
123 | ||
124 | ||
125 | .. _nvptx_intrinsics: | |
126 | ||
127 | NVPTX Intrinsics | |
128 | ================ | |
129 | ||
130 | Address Space Conversion | |
131 | ------------------------ | |
132 | ||
133 | '``llvm.nvvm.ptr.*.to.gen``' Intrinsics | |
134 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
135 | ||
136 | Syntax: | |
137 | """"""" | |
138 | ||
139 | These are overloaded intrinsics. You can use these on any pointer types. | |
140 | ||
141 | .. code-block:: llvm | |
142 | ||
143 | declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*) | |
144 | declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*) | |
145 | declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*) | |
146 | declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*) | |
147 | ||
148 | Overview: | |
149 | """"""""" | |
150 | ||
151 | The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic | |
152 | address space to a generic address space pointer. | |
153 | ||
154 | Semantics: | |
155 | """""""""" | |
156 | ||
157 | These intrinsics modify the pointer value to be a valid generic address space | |
158 | pointer. | |
159 | ||
160 | ||
161 | '``llvm.nvvm.ptr.gen.to.*``' Intrinsics | |
162 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
163 | ||
164 | Syntax: | |
165 | """"""" | |
166 | ||
167 | These are overloaded intrinsics. You can use these on any pointer types. | |
168 | ||
169 | .. code-block:: llvm | |
170 | ||
171 | declare i8* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8 addrspace(1)*) | |
172 | declare i8* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8 addrspace(3)*) | |
173 | declare i8* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8 addrspace(4)*) | |
174 | declare i8* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8 addrspace(5)*) | |
175 | ||
176 | Overview: | |
177 | """"""""" | |
178 | ||
179 | The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic | |
180 | address space to a pointer in the target address space. Note that these | |
181 | intrinsics are only useful if the address space of the target address space of | |
182 | the pointer is known. It is not legal to use address space conversion | |
183 | intrinsics to convert a pointer from one non-generic address space to another | |
184 | non-generic address space. | |
185 | ||
186 | Semantics: | |
187 | """""""""" | |
188 | ||
189 | These intrinsics modify the pointer value to be a valid pointer in the target | |
190 | non-generic address space. | |
191 | ||
192 | ||
193 | Reading PTX Special Registers | |
194 | ----------------------------- | |
195 | ||
196 | '``llvm.nvvm.read.ptx.sreg.*``' | |
197 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
198 | ||
199 | Syntax: | |
200 | """"""" | |
201 | ||
202 | .. code-block:: llvm | |
203 | ||
204 | declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() | |
205 | declare i32 @llvm.nvvm.read.ptx.sreg.tid.y() | |
206 | declare i32 @llvm.nvvm.read.ptx.sreg.tid.z() | |
207 | declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x() | |
208 | declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y() | |
209 | declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z() | |
210 | declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() | |
211 | declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y() | |
212 | declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z() | |
213 | declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x() | |
214 | declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y() | |
215 | declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z() | |
216 | declare i32 @llvm.nvvm.read.ptx.sreg.warpsize() | |
217 | ||
218 | Overview: | |
219 | """"""""" | |
220 | ||
221 | The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX | |
222 | special registers, in particular the kernel launch bounds. These registers | |
223 | map in the following way to CUDA builtins: | |
224 | ||
225 | ============ ===================================== | |
226 | CUDA Builtin PTX Special Register Intrinsic | |
227 | ============ ===================================== | |
228 | ``threadId`` ``@llvm.nvvm.read.ptx.sreg.tid.*`` | |
229 | ``blockIdx`` ``@llvm.nvvm.read.ptx.sreg.ctaid.*`` | |
230 | ``blockDim`` ``@llvm.nvvm.read.ptx.sreg.ntid.*`` | |
231 | ``gridDim`` ``@llvm.nvvm.read.ptx.sreg.nctaid.*`` | |
232 | ============ ===================================== | |
233 | ||
234 | ||
235 | Barriers | |
236 | -------- | |
237 | ||
238 | '``llvm.nvvm.barrier0``' | |
239 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
240 | ||
241 | Syntax: | |
242 | """"""" | |
243 | ||
244 | .. code-block:: llvm | |
245 | ||
246 | declare void @llvm.nvvm.barrier0() | |
247 | ||
248 | Overview: | |
249 | """"""""" | |
250 | ||
251 | The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0`` | |
252 | instruction, equivalent to the ``__syncthreads()`` call in CUDA. | |
253 | ||
254 | ||
255 | Other Intrinsics | |
256 | ---------------- | |
257 | ||
258 | For the full set of NVPTX intrinsics, please see the | |
259 | ``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree. | |
260 | ||
261 | ||
262 | .. _libdevice: | |
263 | ||
264 | Linking with Libdevice | |
265 | ====================== | |
266 | ||
267 | The CUDA Toolkit comes with an LLVM bitcode library called ``libdevice`` that | |
268 | implements many common mathematical functions. This library can be used as a | |
269 | high-performance math library for any compilers using the LLVM NVPTX target. | |
270 | The library can be found under ``nvvm/libdevice/`` in the CUDA Toolkit and | |
271 | there is a separate version for each compute architecture. | |
272 | ||
273 | For a list of all math functions implemented in libdevice, see | |
274 | `libdevice Users Guide <http://docs.nvidia.com/cuda/libdevice-users-guide/index.html>`_. | |
275 | ||
276 | To accommodate various math-related compiler flags that can affect code | |
277 | generation of libdevice code, the library code depends on a special LLVM IR | |
278 | pass (``NVVMReflect``) to handle conditional compilation within LLVM IR. This | |
279 | pass looks for calls to the ``@__nvvm_reflect`` function and replaces them | |
280 | with constants based on the defined reflection parameters. Such conditional | |
281 | code often follows a pattern: | |
282 | ||
283 | .. code-block:: c++ | |
284 | ||
285 | float my_function(float a) { | |
286 | if (__nvvm_reflect("FASTMATH")) | |
287 | return my_function_fast(a); | |
288 | else | |
289 | return my_function_precise(a); | |
290 | } | |
291 | ||
292 | The default value for all unspecified reflection parameters is zero. | |
293 | ||
294 | The ``NVVMReflect`` pass should be executed early in the optimization | |
295 | pipeline, immediately after the link stage. The ``internalize`` pass is also | |
296 | recommended to remove unused math functions from the resulting PTX. For an | |
297 | input IR module ``module.bc``, the following compilation flow is recommended: | |
298 | ||
299 | 1. Save list of external functions in ``module.bc`` | |
300 | 2. Link ``module.bc`` with ``libdevice.compute_XX.YY.bc`` | |
301 | 3. Internalize all functions not in list from (1) | |
302 | 4. Eliminate all unused internal functions | |
303 | 5. Run ``NVVMReflect`` pass | |
304 | 6. Run standard optimization pipeline | |
305 | ||
306 | .. note:: | |
307 | ||
308 | ``linkonce`` and ``linkonce_odr`` linkage types are not suitable for the | |
309 | libdevice functions. It is possible to link two IR modules that have been | |
310 | linked against libdevice using different reflection variables. | |
311 | ||
312 | Since the ``NVVMReflect`` pass replaces conditionals with constants, it will | |
313 | often leave behind dead code of the form: | |
314 | ||
315 | .. code-block:: llvm | |
316 | ||
317 | entry: | |
318 | .. | |
319 | br i1 true, label %foo, label %bar | |
320 | foo: | |
321 | .. | |
322 | bar: | |
323 | ; Dead code | |
324 | .. | |
325 | ||
326 | Therefore, it is recommended that ``NVVMReflect`` is executed early in the | |
327 | optimization pipeline before dead-code elimination. | |
328 | ||
329 | ||
330 | Reflection Parameters | |
331 | --------------------- | |
332 | ||
333 | The libdevice library currently uses the following reflection parameters to | |
334 | control code generation: | |
335 | ||
336 | ==================== ====================================================== | |
337 | Flag Description | |
338 | ==================== ====================================================== | |
339 | ``__CUDA_FTZ=[0,1]`` Use optimized code paths that flush subnormals to zero | |
340 | ==================== ====================================================== | |
341 | ||
342 | ||
343 | Invoking NVVMReflect | |
344 | -------------------- | |
345 | ||
346 | To ensure that all dead code caused by the reflection pass is eliminated, it | |
347 | is recommended that the reflection pass is executed early in the LLVM IR | |
348 | optimization pipeline. The pass takes an optional mapping of reflection | |
349 | parameter name to an integer value. This mapping can be specified as either a | |
350 | command-line option to ``opt`` or as an LLVM ``StringMap<int>`` object when | |
351 | programmatically creating a pass pipeline. | |
352 | ||
353 | With ``opt``: | |
354 | ||
355 | .. code-block:: text | |
356 | ||
357 | # opt -nvvm-reflect -nvvm-reflect-list=<var>=<value>,<var>=<value> module.bc -o module.reflect.bc | |
358 | ||
359 | ||
360 | With programmatic pass pipeline: | |
361 | ||
362 | .. code-block:: c++ | |
363 | ||
364 | extern ModulePass *llvm::createNVVMReflectPass(const StringMap<int>& Mapping); | |
365 | ||
366 | StringMap<int> ReflectParams; | |
367 | ReflectParams["__CUDA_FTZ"] = 1; | |
368 | Passes.add(createNVVMReflectPass(ReflectParams)); | |
369 | ||
370 | ||
371 | ||
372 | Executing PTX | |
373 | ============= | |
374 | ||
375 | The most common way to execute PTX assembly on a GPU device is to use the CUDA | |
376 | Driver API. This API is a low-level interface to the GPU driver and allows for | |
377 | JIT compilation of PTX code to native GPU machine code. | |
378 | ||
379 | Initializing the Driver API: | |
380 | ||
381 | .. code-block:: c++ | |
382 | ||
383 | CUdevice device; | |
384 | CUcontext context; | |
385 | ||
386 | // Initialize the driver API | |
387 | cuInit(0); | |
388 | // Get a handle to the first compute device | |
389 | cuDeviceGet(&device, 0); | |
390 | // Create a compute device context | |
391 | cuCtxCreate(&context, 0, device); | |
392 | ||
393 | JIT compiling a PTX string to a device binary: | |
394 | ||
395 | .. code-block:: c++ | |
396 | ||
397 | CUmodule module; | |
398 | CUfunction funcion; | |
399 | ||
400 | // JIT compile a null-terminated PTX string | |
401 | cuModuleLoadData(&module, (void*)PTXString); | |
402 | ||
403 | // Get a handle to the "myfunction" kernel function | |
404 | cuModuleGetFunction(&function, module, "myfunction"); | |
405 | ||
406 | For full examples of executing PTX assembly, please see the `CUDA Samples | |
407 | <https://developer.nvidia.com/cuda-downloads>`_ distribution. | |
408 | ||
409 | ||
410 | Common Issues | |
411 | ============= | |
412 | ||
413 | ptxas complains of undefined function: __nvvm_reflect | |
414 | ----------------------------------------------------- | |
415 | ||
416 | When linking with libdevice, the ``NVVMReflect`` pass must be used. See | |
417 | :ref:`libdevice` for more information. | |
418 | ||
419 | ||
420 | Tutorial: A Simple Compute Kernel | |
421 | ================================= | |
422 | ||
423 | To start, let us take a look at a simple compute kernel written directly in | |
424 | LLVM IR. The kernel implements vector addition, where each thread computes one | |
425 | element of the output vector C from the input vectors A and B. To make this | |
426 | easier, we also assume that only a single CTA (thread block) will be launched, | |
427 | and that it will be one dimensional. | |
428 | ||
429 | ||
430 | The Kernel | |
431 | ---------- | |
432 | ||
433 | .. code-block:: llvm | |
434 | ||
435 | target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64" | |
436 | target triple = "nvptx64-nvidia-cuda" | |
437 | ||
438 | ; Intrinsic to read X component of thread ID | |
439 | declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind | |
440 | ||
441 | define void @kernel(float addrspace(1)* %A, | |
442 | float addrspace(1)* %B, | |
443 | float addrspace(1)* %C) { | |
444 | entry: | |
445 | ; What is my ID? | |
446 | %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind | |
447 | ||
448 | ; Compute pointers into A, B, and C | |
449 | %ptrA = getelementptr float addrspace(1)* %A, i32 %id | |
450 | %ptrB = getelementptr float addrspace(1)* %B, i32 %id | |
451 | %ptrC = getelementptr float addrspace(1)* %C, i32 %id | |
452 | ||
453 | ; Read A, B | |
454 | %valA = load float addrspace(1)* %ptrA, align 4 | |
455 | %valB = load float addrspace(1)* %ptrB, align 4 | |
456 | ||
457 | ; Compute C = A + B | |
458 | %valC = fadd float %valA, %valB | |
459 | ||
460 | ; Store back to C | |
461 | store float %valC, float addrspace(1)* %ptrC, align 4 | |
462 | ||
463 | ret void | |
464 | } | |
465 | ||
466 | !nvvm.annotations = !{!0} | |
467 | !0 = metadata !{void (float addrspace(1)*, | |
468 | float addrspace(1)*, | |
469 | float addrspace(1)*)* @kernel, metadata !"kernel", i32 1} | |
470 | ||
471 | ||
472 | We can use the LLVM ``llc`` tool to directly run the NVPTX code generator: | |
473 | ||
474 | .. code-block:: text | |
475 | ||
476 | # llc -mcpu=sm_20 kernel.ll -o kernel.ptx | |
477 | ||
478 | ||
479 | .. note:: | |
480 | ||
481 | If you want to generate 32-bit code, change ``p:64:64:64`` to ``p:32:32:32`` | |
482 | in the module data layout string and use ``nvptx-nvidia-cuda`` as the | |
483 | target triple. | |
484 | ||
485 | ||
486 | The output we get from ``llc`` (as of LLVM 3.4): | |
487 | ||
488 | .. code-block:: text | |
489 | ||
490 | // | |
491 | // Generated by LLVM NVPTX Back-End | |
492 | // | |
493 | ||
494 | .version 3.1 | |
495 | .target sm_20 | |
496 | .address_size 64 | |
497 | ||
498 | // .globl kernel | |
499 | // @kernel | |
500 | .visible .entry kernel( | |
501 | .param .u64 kernel_param_0, | |
502 | .param .u64 kernel_param_1, | |
503 | .param .u64 kernel_param_2 | |
504 | ) | |
505 | { | |
506 | .reg .f32 %f<4>; | |
507 | .reg .s32 %r<2>; | |
508 | .reg .s64 %rl<8>; | |
509 | ||
510 | // BB#0: // %entry | |
511 | ld.param.u64 %rl1, [kernel_param_0]; | |
512 | mov.u32 %r1, %tid.x; | |
513 | mul.wide.s32 %rl2, %r1, 4; | |
514 | add.s64 %rl3, %rl1, %rl2; | |
515 | ld.param.u64 %rl4, [kernel_param_1]; | |
516 | add.s64 %rl5, %rl4, %rl2; | |
517 | ld.param.u64 %rl6, [kernel_param_2]; | |
518 | add.s64 %rl7, %rl6, %rl2; | |
519 | ld.global.f32 %f1, [%rl3]; | |
520 | ld.global.f32 %f2, [%rl5]; | |
521 | add.f32 %f3, %f1, %f2; | |
522 | st.global.f32 [%rl7], %f3; | |
523 | ret; | |
524 | } | |
525 | ||
526 | ||
527 | Dissecting the Kernel | |
528 | --------------------- | |
529 | ||
530 | Now let us dissect the LLVM IR that makes up this kernel. | |
531 | ||
532 | Data Layout | |
533 | ^^^^^^^^^^^ | |
534 | ||
535 | The data layout string determines the size in bits of common data types, their | |
536 | ABI alignment, and their storage size. For NVPTX, you should use one of the | |
537 | following: | |
538 | ||
539 | 32-bit PTX: | |
540 | ||
541 | .. code-block:: llvm | |
542 | ||
543 | target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64" | |
544 | ||
545 | 64-bit PTX: | |
546 | ||
547 | .. code-block:: llvm | |
548 | ||
549 | target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64" | |
550 | ||
551 | ||
552 | Target Intrinsics | |
553 | ^^^^^^^^^^^^^^^^^ | |
554 | ||
555 | In this example, we use the ``@llvm.nvvm.read.ptx.sreg.tid.x`` intrinsic to | |
556 | read the X component of the current thread's ID, which corresponds to a read | |
557 | of register ``%tid.x`` in PTX. The NVPTX back-end supports a large set of | |
558 | intrinsics. A short list is shown below; please see | |
559 | ``include/llvm/IR/IntrinsicsNVVM.td`` for the full list. | |
560 | ||
561 | ||
562 | ================================================ ==================== | |
563 | Intrinsic CUDA Equivalent | |
564 | ================================================ ==================== | |
565 | ``i32 @llvm.nvvm.read.ptx.sreg.tid.{x,y,z}`` threadIdx.{x,y,z} | |
566 | ``i32 @llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z}`` blockIdx.{x,y,z} | |
567 | ``i32 @llvm.nvvm.read.ptx.sreg.ntid.{x,y,z}`` blockDim.{x,y,z} | |
568 | ``i32 @llvm.nvvm.read.ptx.sreg.nctaid.{x,y,z}`` gridDim.{x,y,z} | |
569 | ``void @llvm.cuda.syncthreads()`` __syncthreads() | |
570 | ================================================ ==================== | |
571 | ||
572 | ||
573 | Address Spaces | |
574 | ^^^^^^^^^^^^^^ | |
575 | ||
576 | You may have noticed that all of the pointer types in the LLVM IR example had | |
577 | an explicit address space specifier. What is address space 1? NVIDIA GPU | |
578 | devices (generally) have four types of memory: | |
579 | ||
580 | - Global: Large, off-chip memory | |
581 | - Shared: Small, on-chip memory shared among all threads in a CTA | |
582 | - Local: Per-thread, private memory | |
583 | - Constant: Read-only memory shared across all threads | |
584 | ||
585 | These different types of memory are represented in LLVM IR as address spaces. | |
586 | There is also a fifth address space used by the NVPTX code generator that | |
587 | corresponds to the "generic" address space. This address space can represent | |
588 | addresses in any other address space (with a few exceptions). This allows | |
589 | users to write IR functions that can load/store memory using the same | |
590 | instructions. Intrinsics are provided to convert pointers between the generic | |
591 | and non-generic address spaces. | |
592 | ||
593 | See :ref:`address_spaces` and :ref:`nvptx_intrinsics` for more information. | |
594 | ||
595 | ||
596 | Kernel Metadata | |
597 | ^^^^^^^^^^^^^^^ | |
598 | ||
599 | In PTX, a function can be either a `kernel` function (callable from the host | |
600 | program), or a `device` function (callable only from GPU code). You can think | |
601 | of `kernel` functions as entry-points in the GPU program. To mark an LLVM IR | |
602 | function as a `kernel` function, we make use of special LLVM metadata. The | |
603 | NVPTX back-end will look for a named metadata node called | |
604 | ``nvvm.annotations``. This named metadata must contain a list of metadata that | |
605 | describe the IR. For our purposes, we need to declare a metadata node that | |
606 | assigns the "kernel" attribute to the LLVM IR function that should be emitted | |
607 | as a PTX `kernel` function. These metadata nodes take the form: | |
608 | ||
609 | .. code-block:: text | |
610 | ||
611 | metadata !{<function ref>, metadata !"kernel", i32 1} | |
612 | ||
613 | For the previous example, we have: | |
614 | ||
615 | .. code-block:: llvm | |
616 | ||
617 | !nvvm.annotations = !{!0} | |
618 | !0 = metadata !{void (float addrspace(1)*, | |
619 | float addrspace(1)*, | |
620 | float addrspace(1)*)* @kernel, metadata !"kernel", i32 1} | |
621 | ||
622 | Here, we have a single metadata declaration in ``nvvm.annotations``. This | |
623 | metadata annotates our ``@kernel`` function with the ``kernel`` attribute. | |
624 | ||
625 | ||
626 | Running the Kernel | |
627 | ------------------ | |
628 | ||
629 | Generating PTX from LLVM IR is all well and good, but how do we execute it on | |
630 | a real GPU device? The CUDA Driver API provides a convenient mechanism for | |
631 | loading and JIT compiling PTX to a native GPU device, and launching a kernel. | |
632 | The API is similar to OpenCL. A simple example showing how to load and | |
633 | execute our vector addition code is shown below. Note that for brevity this | |
634 | code does not perform much error checking! | |
635 | ||
636 | .. note:: | |
637 | ||
638 | You can also use the ``ptxas`` tool provided by the CUDA Toolkit to offline | |
639 | compile PTX to machine code (SASS) for a specific GPU architecture. Such | |
640 | binaries can be loaded by the CUDA Driver API in the same way as PTX. This | |
641 | can be useful for reducing startup time by precompiling the PTX kernels. | |
642 | ||
643 | ||
644 | .. code-block:: c++ | |
645 | ||
646 | #include <iostream> | |
647 | #include <fstream> | |
648 | #include <cassert> | |
649 | #include "cuda.h" | |
650 | ||
651 | ||
652 | void checkCudaErrors(CUresult err) { | |
653 | assert(err == CUDA_SUCCESS); | |
654 | } | |
655 | ||
656 | /// main - Program entry point | |
657 | int main(int argc, char **argv) { | |
658 | CUdevice device; | |
659 | CUmodule cudaModule; | |
660 | CUcontext context; | |
661 | CUfunction function; | |
662 | CUlinkState linker; | |
663 | int devCount; | |
664 | ||
665 | // CUDA initialization | |
666 | checkCudaErrors(cuInit(0)); | |
667 | checkCudaErrors(cuDeviceGetCount(&devCount)); | |
668 | checkCudaErrors(cuDeviceGet(&device, 0)); | |
669 | ||
670 | char name[128]; | |
671 | checkCudaErrors(cuDeviceGetName(name, 128, device)); | |
672 | std::cout << "Using CUDA Device [0]: " << name << "\n"; | |
673 | ||
674 | int devMajor, devMinor; | |
675 | checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device)); | |
676 | std::cout << "Device Compute Capability: " | |
677 | << devMajor << "." << devMinor << "\n"; | |
678 | if (devMajor < 2) { | |
679 | std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n"; | |
680 | return 1; | |
681 | } | |
682 | ||
683 | std::ifstream t("kernel.ptx"); | |
684 | if (!t.is_open()) { | |
685 | std::cerr << "kernel.ptx not found\n"; | |
686 | return 1; | |
687 | } | |
688 | std::string str((std::istreambuf_iterator<char>(t)), | |
689 | std::istreambuf_iterator<char>()); | |
690 | ||
691 | // Create driver context | |
692 | checkCudaErrors(cuCtxCreate(&context, 0, device)); | |
693 | ||
694 | // Create module for object | |
695 | checkCudaErrors(cuModuleLoadDataEx(&cudaModule, str.c_str(), 0, 0, 0)); | |
696 | ||
697 | // Get kernel function | |
698 | checkCudaErrors(cuModuleGetFunction(&function, cudaModule, "kernel")); | |
699 | ||
700 | // Device data | |
701 | CUdeviceptr devBufferA; | |
702 | CUdeviceptr devBufferB; | |
703 | CUdeviceptr devBufferC; | |
704 | ||
705 | checkCudaErrors(cuMemAlloc(&devBufferA, sizeof(float)*16)); | |
706 | checkCudaErrors(cuMemAlloc(&devBufferB, sizeof(float)*16)); | |
707 | checkCudaErrors(cuMemAlloc(&devBufferC, sizeof(float)*16)); | |
708 | ||
709 | float* hostA = new float[16]; | |
710 | float* hostB = new float[16]; | |
711 | float* hostC = new float[16]; | |
712 | ||
713 | // Populate input | |
714 | for (unsigned i = 0; i != 16; ++i) { | |
715 | hostA[i] = (float)i; | |
716 | hostB[i] = (float)(2*i); | |
717 | hostC[i] = 0.0f; | |
718 | } | |
719 | ||
720 | checkCudaErrors(cuMemcpyHtoD(devBufferA, &hostA[0], sizeof(float)*16)); | |
721 | checkCudaErrors(cuMemcpyHtoD(devBufferB, &hostB[0], sizeof(float)*16)); | |
722 | ||
723 | ||
724 | unsigned blockSizeX = 16; | |
725 | unsigned blockSizeY = 1; | |
726 | unsigned blockSizeZ = 1; | |
727 | unsigned gridSizeX = 1; | |
728 | unsigned gridSizeY = 1; | |
729 | unsigned gridSizeZ = 1; | |
730 | ||
731 | // Kernel parameters | |
732 | void *KernelParams[] = { &devBufferA, &devBufferB, &devBufferC }; | |
733 | ||
734 | std::cout << "Launching kernel\n"; | |
735 | ||
736 | // Kernel launch | |
737 | checkCudaErrors(cuLaunchKernel(function, gridSizeX, gridSizeY, gridSizeZ, | |
738 | blockSizeX, blockSizeY, blockSizeZ, | |
739 | 0, NULL, KernelParams, NULL)); | |
740 | ||
741 | // Retrieve device data | |
742 | checkCudaErrors(cuMemcpyDtoH(&hostC[0], devBufferC, sizeof(float)*16)); | |
743 | ||
744 | ||
745 | std::cout << "Results:\n"; | |
746 | for (unsigned i = 0; i != 16; ++i) { | |
747 | std::cout << hostA[i] << " + " << hostB[i] << " = " << hostC[i] << "\n"; | |
748 | } | |
749 | ||
750 | ||
751 | // Clean up after ourselves | |
752 | delete [] hostA; | |
753 | delete [] hostB; | |
754 | delete [] hostC; | |
755 | ||
756 | // Clean-up | |
757 | checkCudaErrors(cuMemFree(devBufferA)); | |
758 | checkCudaErrors(cuMemFree(devBufferB)); | |
759 | checkCudaErrors(cuMemFree(devBufferC)); | |
760 | checkCudaErrors(cuModuleUnload(cudaModule)); | |
761 | checkCudaErrors(cuCtxDestroy(context)); | |
762 | ||
763 | return 0; | |
764 | } | |
765 | ||
766 | ||
767 | You will need to link with the CUDA driver and specify the path to cuda.h. | |
768 | ||
769 | .. code-block:: text | |
770 | ||
771 | # clang++ sample.cpp -o sample -O2 -g -I/usr/local/cuda-5.5/include -lcuda | |
772 | ||
773 | We don't need to specify a path to ``libcuda.so`` since this is installed in a | |
774 | system location by the driver, not the CUDA toolkit. | |
775 | ||
776 | If everything goes as planned, you should see the following output when | |
777 | running the compiled program: | |
778 | ||
779 | .. code-block:: text | |
780 | ||
781 | Using CUDA Device [0]: GeForce GTX 680 | |
782 | Device Compute Capability: 3.0 | |
783 | Launching kernel | |
784 | Results: | |
785 | 0 + 0 = 0 | |
786 | 1 + 2 = 3 | |
787 | 2 + 4 = 6 | |
788 | 3 + 6 = 9 | |
789 | 4 + 8 = 12 | |
790 | 5 + 10 = 15 | |
791 | 6 + 12 = 18 | |
792 | 7 + 14 = 21 | |
793 | 8 + 16 = 24 | |
794 | 9 + 18 = 27 | |
795 | 10 + 20 = 30 | |
796 | 11 + 22 = 33 | |
797 | 12 + 24 = 36 | |
798 | 13 + 26 = 39 | |
799 | 14 + 28 = 42 | |
800 | 15 + 30 = 45 | |
801 | ||
802 | .. note:: | |
803 | ||
804 | You will likely see a different device identifier based on your hardware | |
805 | ||
806 | ||
807 | Tutorial: Linking with Libdevice | |
808 | ================================ | |
809 | ||
810 | In this tutorial, we show a simple example of linking LLVM IR with the | |
811 | libdevice library. We will use the same kernel as the previous tutorial, | |
812 | except that we will compute ``C = pow(A, B)`` instead of ``C = A + B``. | |
813 | Libdevice provides an ``__nv_powf`` function that we will use. | |
814 | ||
815 | .. code-block:: llvm | |
816 | ||
817 | target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64" | |
818 | target triple = "nvptx64-nvidia-cuda" | |
819 | ||
820 | ; Intrinsic to read X component of thread ID | |
821 | declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind | |
822 | ; libdevice function | |
823 | declare float @__nv_powf(float, float) | |
824 | ||
825 | define void @kernel(float addrspace(1)* %A, | |
826 | float addrspace(1)* %B, | |
827 | float addrspace(1)* %C) { | |
828 | entry: | |
829 | ; What is my ID? | |
830 | %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind | |
831 | ||
832 | ; Compute pointers into A, B, and C | |
833 | %ptrA = getelementptr float addrspace(1)* %A, i32 %id | |
834 | %ptrB = getelementptr float addrspace(1)* %B, i32 %id | |
835 | %ptrC = getelementptr float addrspace(1)* %C, i32 %id | |
836 | ||
837 | ; Read A, B | |
838 | %valA = load float addrspace(1)* %ptrA, align 4 | |
839 | %valB = load float addrspace(1)* %ptrB, align 4 | |
840 | ||
841 | ; Compute C = pow(A, B) | |
842 | %valC = call float @__nv_powf(float %valA, float %valB) | |
843 | ||
844 | ; Store back to C | |
845 | store float %valC, float addrspace(1)* %ptrC, align 4 | |
846 | ||
847 | ret void | |
848 | } | |
849 | ||
850 | !nvvm.annotations = !{!0} | |
851 | !0 = metadata !{void (float addrspace(1)*, | |
852 | float addrspace(1)*, | |
853 | float addrspace(1)*)* @kernel, metadata !"kernel", i32 1} | |
854 | ||
855 | ||
856 | To compile this kernel, we perform the following steps: | |
857 | ||
858 | 1. Link with libdevice | |
859 | 2. Internalize all but the public kernel function | |
860 | 3. Run ``NVVMReflect`` and set ``__CUDA_FTZ`` to 0 | |
861 | 4. Optimize the linked module | |
862 | 5. Codegen the module | |
863 | ||
864 | ||
865 | These steps can be performed by the LLVM ``llvm-link``, ``opt``, and ``llc`` | |
866 | tools. In a complete compiler, these steps can also be performed entirely | |
867 | programmatically by setting up an appropriate pass configuration (see | |
868 | :ref:`libdevice`). | |
869 | ||
870 | .. code-block:: text | |
871 | ||
872 | # llvm-link t2.bc libdevice.compute_20.10.bc -o t2.linked.bc | |
873 | # opt -internalize -internalize-public-api-list=kernel -nvvm-reflect-list=__CUDA_FTZ=0 -nvvm-reflect -O3 t2.linked.bc -o t2.opt.bc | |
874 | # llc -mcpu=sm_20 t2.opt.bc -o t2.ptx | |
875 | ||
876 | .. note:: | |
877 | ||
878 | The ``-nvvm-reflect-list=_CUDA_FTZ=0`` is not strictly required, as any | |
879 | undefined variables will default to zero. It is shown here for evaluation | |
880 | purposes. | |
881 | ||
882 | ||
883 | This gives us the following PTX (excerpt): | |
884 | ||
885 | .. code-block:: text | |
886 | ||
887 | // | |
888 | // Generated by LLVM NVPTX Back-End | |
889 | // | |
890 | ||
891 | .version 3.1 | |
892 | .target sm_20 | |
893 | .address_size 64 | |
894 | ||
895 | // .globl kernel | |
896 | // @kernel | |
897 | .visible .entry kernel( | |
898 | .param .u64 kernel_param_0, | |
899 | .param .u64 kernel_param_1, | |
900 | .param .u64 kernel_param_2 | |
901 | ) | |
902 | { | |
903 | .reg .pred %p<30>; | |
904 | .reg .f32 %f<111>; | |
905 | .reg .s32 %r<21>; | |
906 | .reg .s64 %rl<8>; | |
907 | ||
908 | // BB#0: // %entry | |
909 | ld.param.u64 %rl2, [kernel_param_0]; | |
910 | mov.u32 %r3, %tid.x; | |
911 | ld.param.u64 %rl3, [kernel_param_1]; | |
912 | mul.wide.s32 %rl4, %r3, 4; | |
913 | add.s64 %rl5, %rl2, %rl4; | |
914 | ld.param.u64 %rl6, [kernel_param_2]; | |
915 | add.s64 %rl7, %rl3, %rl4; | |
916 | add.s64 %rl1, %rl6, %rl4; | |
917 | ld.global.f32 %f1, [%rl5]; | |
918 | ld.global.f32 %f2, [%rl7]; | |
919 | setp.eq.f32 %p1, %f1, 0f3F800000; | |
920 | setp.eq.f32 %p2, %f2, 0f00000000; | |
921 | or.pred %p3, %p1, %p2; | |
922 | @%p3 bra BB0_1; | |
923 | bra.uni BB0_2; | |
924 | BB0_1: | |
925 | mov.f32 %f110, 0f3F800000; | |
926 | st.global.f32 [%rl1], %f110; | |
927 | ret; | |
928 | BB0_2: // %__nv_isnanf.exit.i | |
929 | abs.f32 %f4, %f1; | |
930 | setp.gtu.f32 %p4, %f4, 0f7F800000; | |
931 | @%p4 bra BB0_4; | |
932 | // BB#3: // %__nv_isnanf.exit5.i | |
933 | abs.f32 %f5, %f2; | |
934 | setp.le.f32 %p5, %f5, 0f7F800000; | |
935 | @%p5 bra BB0_5; | |
936 | BB0_4: // %.critedge1.i | |
937 | add.f32 %f110, %f1, %f2; | |
938 | st.global.f32 [%rl1], %f110; | |
939 | ret; | |
940 | BB0_5: // %__nv_isinff.exit.i | |
941 | ||
942 | ... | |
943 | ||
944 | BB0_26: // %__nv_truncf.exit.i.i.i.i.i | |
945 | mul.f32 %f90, %f107, 0f3FB8AA3B; | |
946 | cvt.rzi.f32.f32 %f91, %f90; | |
947 | mov.f32 %f92, 0fBF317200; | |
948 | fma.rn.f32 %f93, %f91, %f92, %f107; | |
949 | mov.f32 %f94, 0fB5BFBE8E; | |
950 | fma.rn.f32 %f95, %f91, %f94, %f93; | |
951 | mul.f32 %f89, %f95, 0f3FB8AA3B; | |
952 | // inline asm | |
953 | ex2.approx.ftz.f32 %f88,%f89; | |
954 | // inline asm | |
955 | add.f32 %f96, %f91, 0f00000000; | |
956 | ex2.approx.f32 %f97, %f96; | |
957 | mul.f32 %f98, %f88, %f97; | |
958 | setp.lt.f32 %p15, %f107, 0fC2D20000; | |
959 | selp.f32 %f99, 0f00000000, %f98, %p15; | |
960 | setp.gt.f32 %p16, %f107, 0f42D20000; | |
961 | selp.f32 %f110, 0f7F800000, %f99, %p16; | |
962 | setp.eq.f32 %p17, %f110, 0f7F800000; | |
963 | @%p17 bra BB0_28; | |
964 | // BB#27: | |
965 | fma.rn.f32 %f110, %f110, %f108, %f110; | |
966 | BB0_28: // %__internal_accurate_powf.exit.i | |
967 | setp.lt.f32 %p18, %f1, 0f00000000; | |
968 | setp.eq.f32 %p19, %f3, 0f3F800000; | |
969 | and.pred %p20, %p18, %p19; | |
970 | @!%p20 bra BB0_30; | |
971 | bra.uni BB0_29; | |
972 | BB0_29: | |
973 | mov.b32 %r9, %f110; | |
974 | xor.b32 %r10, %r9, -2147483648; | |
975 | mov.b32 %f110, %r10; | |
976 | BB0_30: // %__nv_powf.exit | |
977 | st.global.f32 [%rl1], %f110; | |
978 | ret; | |
979 | } | |
980 |