[doc] Compile CUDA with LLVM

Summary: This patch adds documentation on compiling CUDA with LLVM, as
requested by many engineers and researchers. It includes not only user guides
but also some internals (mostly optimizations) so that early adopters can
start hacking and contributing. Quite a few researchers who contacted us
haven't used LLVM before, which is unsurprising, as it hasn't been long since
LLVM picked up CUDA. So I added a short summary to help these folks get
started with LLVM.

I expect this document to evolve substantially down the road. The user guides
will be much simplified after the Clang integration is done. However, the
internals should continue growing to include, for example, performance
debugging and key areas to improve.

Reviewers: chandlerc, meheff, broune, tra

Subscribers: silvas, jingyue, llvm-commits, eliben

Differential Revision: http://reviews.llvm.org/D14370

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@252660 91177308-0d34-0410-b5e6-96231b3b80d8

Author: Jingyue Wu
2 changed files with 196 additions and 0 deletions.
===================================
Compiling CUDA C/C++ with LLVM
===================================

.. contents::
   :local:

Introduction
============

This document contains the user guides and the internals of compiling CUDA
C/C++ with LLVM. It is aimed at both users who want to compile CUDA with LLVM
and developers who want to improve LLVM for GPUs. This document assumes a basic
familiarity with CUDA. Information about CUDA programming can be found in the
`CUDA programming guide `_.

How to Build LLVM with CUDA Support
===================================

CUDA support is still in progress and temporarily relies on `this patch
`_. Below is a quick summary of downloading and building LLVM with CUDA
support. Consult the `Getting Started `_ page for more details on setting up
LLVM.

#. Check out LLVM.

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm

#. Check out Clang.

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ cd llvm/tools
     $ svn co http://llvm.org/svn/llvm-project/cfe/trunk clang

#. Apply the temporary patch for CUDA support.

   If you have installed `Arcanist `_, you can apply this patch using

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ cd llvm/tools/clang
     $ arc patch D14452

   Otherwise, go to `its review page `_, download the raw diff, and apply it
   manually using

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ cd llvm/tools/clang
     $ patch -p0 < D14452.diff

#. Configure and build LLVM and Clang.

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ mkdir build
     $ cd build
     $ cmake [options] ..
     $ make

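The ``[options]`` placeholder is up to you; as one illustration (these
particular flags are only an example, not a requirement), a Release build with
assertions enabled looks like this:

.. code-block:: console

  $ cmake -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON ..
  $ make
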
How to Compile CUDA C/C++ with LLVM
===================================

We assume you have installed the CUDA driver and runtime. Consult the `NVIDIA
CUDA Installation Guide `_ if you have not.

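Before compiling anything, it can help to confirm the installation is visible
(a quick check assuming a standard Linux setup; paths may differ on your
system):

.. code-block:: console

  $ nvidia-smi           # the driver is loaded and sees the GPU
  $ ls /usr/local/cuda   # typical CUDA toolkit install root
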
Suppose you want to compile and run the following CUDA program (``axpy.cu``),
which multiplies a ``float`` array by a ``float`` scalar (AXPY).

.. code-block:: c++

  #include <helper_cuda.h>  // for checkCudaErrors

  #include <iostream>

  __global__ void axpy(float a, float* x, float* y) {
    y[threadIdx.x] = a * x[threadIdx.x];
  }

  int main(int argc, char* argv[]) {
    const int kDataLen = 4;

    float a = 2.0f;
    float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
    float host_y[kDataLen];

    // Copy input data to device.
    float* device_x;
    float* device_y;
    checkCudaErrors(cudaMalloc(&device_x, kDataLen * sizeof(float)));
    checkCudaErrors(cudaMalloc(&device_y, kDataLen * sizeof(float)));
    checkCudaErrors(cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
                               cudaMemcpyHostToDevice));

    // Launch the kernel.
    axpy<<<1, kDataLen>>>(a, device_x, device_y);

    // Copy output data to host.
    checkCudaErrors(cudaDeviceSynchronize());
    checkCudaErrors(cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
                               cudaMemcpyDeviceToHost));

    // Print the results.
    for (int i = 0; i < kDataLen; ++i) {
      std::cout << "y[" << i << "] = " << host_y[i] << "\n";
    }

    checkCudaErrors(cudaDeviceReset());
    return 0;
  }

The command line for compilation is similar to what you would use for C++.

.. code-block:: console

  $ clang++ -o axpy -I<CUDA install path>/samples/common/inc -L<CUDA install path>/<lib64 or lib> axpy.cu -lcudart_static -lcuda -ldl -lrt -pthread
  $ ./axpy
  y[0] = 2
  y[1] = 4
  y[2] = 6
  y[3] = 8

Note that ``helper_cuda.h`` comes from the CUDA samples, so you need the
samples installed for this example. ``<CUDA install path>`` is the root
directory where you installed the CUDA SDK, typically ``/usr/local/cuda``.

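If you do not have the CUDA samples installed, a minimal stand-in for
``checkCudaErrors`` along the following lines is sufficient for this example
(this is our own sketch, not the macro shipped in ``helper_cuda.h``):

.. code-block:: c++

  #include <cstdio>
  #include <cstdlib>
  #include <cuda_runtime.h>

  // Minimal replacement for the samples' checkCudaErrors: print the
  // error string and abort if a CUDA runtime call fails.
  #define checkCudaErrors(call)                                     \
    do {                                                            \
      cudaError_t err = (call);                                     \
      if (err != cudaSuccess) {                                     \
        std::fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                     cudaGetErrorString(err), __FILE__, __LINE__);  \
        std::exit(EXIT_FAILURE);                                    \
      }                                                             \
    } while (0)

With this macro in place of the ``helper_cuda.h`` include, the program above
compiles without the samples.
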
Optimizations
=============

CPUs and GPUs have different design philosophies and architectures. For
example, a typical CPU features branch prediction, out-of-order execution, and
superscalar issue, whereas a typical GPU has none of these. Due to such
differences, an optimization pipeline well-tuned for CPUs may not be suitable
for GPUs.

LLVM performs several general and CUDA-specific optimizations for GPUs. The
list below shows some of the more important optimizations for GPUs. Most of
them have been upstreamed to ``lib/Transforms/Scalar`` and
``lib/Target/NVPTX``. A few of them have not been upstreamed yet due to the
lack of a customizable target-independent optimization pipeline.

* **Straight-line scalar optimizations**. These optimizations reduce redundancy
  in straight-line code. Details can be found in the `design document for
  straight-line scalar optimizations `_.

* **Inferring memory spaces**. `This optimization
  `_
  infers the memory space of an address so that the backend can emit faster
  special loads and stores from it. Details can be found in the `design
  document for memory space inference `_.

* **Aggressive loop unrolling and function inlining**. Loop unrolling and
  function inlining need to be more aggressive for GPUs than for CPUs because
  control flow transfer on a GPU is more expensive. They also promote other
  optimizations, such as constant propagation and SROA, which sometimes speed
  up code by over 10x. An empirical inline threshold for GPUs is 1100. This
  configuration has yet to be upstreamed with a target-specific optimization
  pipeline. LLVM also provides `loop unrolling pragmas
  `_
  and ``__attribute__((always_inline))`` for programmers to force unrolling
  and inlining.

* **Aggressive speculative execution**. `This transformation
  `_ is
  mainly for promoting straight-line scalar optimizations, which are most
  effective on code along dominator paths.

* **Memory-space alias analysis**. `This alias analysis
  `_ infers that two pointers in different
  special memory spaces do not alias. It has yet to be integrated into the new
  alias analysis infrastructure; the new infrastructure does not run
  target-specific alias analysis.

* **Bypassing 64-bit divides**. `An existing optimization
  `_
  that is enabled in the NVPTX backend. 64-bit integer divides are much slower
  than 32-bit ones on NVIDIA GPUs because the hardware lacks a divide unit.
  Many of the 64-bit divides in our benchmarks have a divisor and dividend
  which fit in 32 bits at runtime. This optimization provides a fast path for
  this common case.
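
The runtime check behind this bypass can be pictured in plain C++ (our own
sketch of the idea; the actual pass rewrites IR rather than emitting this
source code, and ``div64`` is a name we made up for illustration):

```cpp
#include <cstdint>

// If both operands happen to fit in 32 bits at runtime, a much cheaper
// 32-bit divide produces the same quotient as the 64-bit divide.
uint64_t div64(uint64_t dividend, uint64_t divisor) {
  if (((dividend | divisor) >> 32) == 0) {
    // Fast path: both values fit in 32 bits.
    return static_cast<uint32_t>(dividend) / static_cast<uint32_t>(divisor);
  }
  // Slow path: full 64-bit divide.
  return dividend / divisor;
}
```

On NVIDIA GPUs the fast path replaces an expensive emulated 64-bit divide with
a 32-bit one, which is why this common case is worth special-casing.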
The commit also registers the new document in ``docs/index.rst``. The page is
added to the toctree::

   GetElementPtr
   Frontend/PerformanceTips
   MCJITDesignAndImplementation
   CompileCudaWithLLVM

   :doc:`GettingStarted`
     Discusses how to get up and running quickly with the LLVM infrastructure.

and a summary entry is added just before the "Development Process
Documentation" section::

   :doc:`FaultMaps`
     LLVM support for folding control flow into faulting machine instructions.

   :doc:`CompileCudaWithLLVM`
     LLVM support for CUDA.