llvm.org GIT mirror llvm / 6986553
Add documentation for various aspects of the AMDGPU backend. Differential Revision: https://reviews.llvm.org/D33736 git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@304831 91177308-0d34-0410-b5e6-96231b3b80d8 Tony Tye 2 years ago
4 changed file(s) with 3475 addition(s) and 179 deletion(s). Raw diff Collapse all Expand all
None ==============================
1 User Guide for AMDGPU Back-end
2 ==============================
0 =============================
1 User Guide for AMDGPU Backend
2 =============================
3
4 .. contents::
5 :local:
36
47 Introduction
58 ============
69
7 The AMDGPU back-end provides ISA code generation for AMD GPUs, starting with
8 the R600 family up until the current Volcanic Islands (GCN Gen 3).
9
10 Refer to `AMDGPU section in Architecture & Platform Information for Compiler Writers `_
11 for additional documentation.
12
13 Conventions
14 ===========
10 The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
11 R600 family up until the current GCN families. It lives in the
12 ``lib/Target/AMDGPU`` directory.
13
14 LLVM
15 ====
16
17 .. _amdgpu-target-triples:
18
19 Target Triples
20 --------------
21
22 Use the ``clang -target ---`` option to
23 specify the target triple:
24
25 .. table:: AMDGPU Target Triples
26 :name: amdgpu-target-triples-table
27
28 ============ ======== ========= ===========
29 Architecture Vendor OS Environment
30 ============ ======== ========= ===========
31 r600 amd
32 amdgcn amd
33 amdgcn amd amdhsa
34 amdgcn amd amdhsa opencl
35 amdgcn amd amdhsa amdgizcl
36 amdgcn amd amdhsa amdgiz
37 amdgcn amd amdhsa hcc
38 ============ ======== ========= ===========
39
40 ``r600-amd--``
41 Supports AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders executed on
42 the MESA runtime.
43
44 ``amdgcn-amd--``
45 Supports AMD GPUs GCN 6 onwards for graphics and compute shaders executed on
46 the MESA runtime.
47
48 ``amdgcn-amd-amdhsa-``
49 Supports AMD GCN GPUs GFX6 onwards for compute kernels executed on HSA [HSA]_
50 compatible runtimes such as AMD's ROCm [AMD-ROCm]_.
51
52 ``amdgcn-amd-amdhsa-opencl``
53 Supports AMD GCN GPUs GFX6 onwards for OpenCL compute kernels executed on HSA
54 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
55 :ref:`amdgpu-opencl`.
56
57 ``amdgcn-amd-amdhsa-amdgizcl``
58 Same as ``amdgcn-amd-amdhsa-opencl`` except a different address space mapping
59 is used (see :ref:`amdgpu-address-spaces`).
60
61 ``amdgcn-amd-amdhsa-amdgiz``
62 Same as ``amdgcn-amd-amdhsa-`` except a different address space mapping is
63 used (see :ref:`amdgpu-address-spaces`).
64
65 ``amdgcn-amd-amdhsa-hcc``
66 Supports AMD GCN GPUs GFX6 onwards for AMD HC language compute kernels
67 executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
68 :ref:`amdgpu-hcc`.
69
70 .. _amdgpu-processors:
71
72 Processors
73 ----------
74
75 Use the ``clang -mcpu `` option to specify the AMD GPU processor. The
76 names from both the *Processor* and *Alternative Processor* can be used.
77
78 .. table:: AMDGPU Processors
79 :name: amdgpu-processors-table
80
81 ========== =========== ============ ===== ======= ==================
82 Processor Alternative Target dGPU/ Runtime Example
83 Processor Triple APU Support Products
84 Architecture
85 ========== =========== ============ ===== ======= ==================
86 **R600** [AMD-R6xx]_
87 --------------------------------------------------------------------
88 r600 r600 dGPU
89 r630 r600 dGPU
90 rs880 r600 dGPU
91 rv670 r600 dGPU
92 **R700** [AMD-R7xx]_
93 --------------------------------------------------------------------
94 rv710 r600 dGPU
95 rv730 r600 dGPU
96 rv770 r600 dGPU
97 **Evergreen** [AMD-Evergreen]_
98 --------------------------------------------------------------------
99 cedar r600 dGPU
100 redwood r600 dGPU
101 sumo r600 dGPU
102 juniper r600 dGPU
103 cypress r600 dGPU
104 **Northern Islands** [AMD-Cayman-Trinity]_
105 --------------------------------------------------------------------
106 barts r600 dGPU
107 turks r600 dGPU
108 caicos r600 dGPU
109 cayman r600 dGPU
110 **GCN GFX6 (Southern Islands (SI))** [AMD-Souther-Islands]_
111 --------------------------------------------------------------------
112 gfx600 - SI amdgcn dGPU
113 - tahiti
114 gfx601 - pitcairn amdgcn dGPU
115 - verde
116 - oland
117 - hainan
118 **GCN GFX7 (Sea Islands (CI))** [AMD-Sea-Islands]_
119 --------------------------------------------------------------------
120 gfx700 - bonaire amdgcn dGPU - Radeon HD 7790
121 - Radeon HD 8770
122 - R7 260
123 - R7 260X
124 \ - kaveri amdgcn APU - A6-7000
125 - A6 Pro-7050B
126 - A8-7100
127 - A8 Pro-7150B
128 - A10-7300
129 - A10 Pro-7350B
130 - FX-7500
131 - A8-7200P
132 - A10-7400P
133 - FX-7600P
134 gfx701 - hawaii amdgcn dGPU ROCm - FirePro W8100
135 - FirePro W9100
136 - FirePro S9150
137 - FirePro S9170
138 \ dGPU ROCm - Radeon R9 290
139 - Radeon R9 290x
140 - Radeon R390
141 - Radeonb R390x
142 gfx702 - kabini amdgcn APU - E1-2100
143 - mullins - E1-2200
144 - E1-2500
145 - E2-3000
146 - E2-3800
147 - A4-5000
148 - A4-5100
149 - A6-5200
150 - A4 Pro-3340B
151 **GCN GFX8 (Volcanic Islands (VI))** [AMD-Volcanic-Islands]_
152 --------------------------------------------------------------------
153 gfx800 - iceland amdgcn dGPU - FirePro S7150
154 - FirePro S7100
155 - FirePro W7100
156 - Radeon R285
157 - Radeon R9 380
158 - Radeon R9 385
159 - Mobile FirePro
160 M7170
161 gfx801 - carrizo amdgcn APU - A6-8500P
162 - Pro A6-8500B
163 - A8-8600P
164 - Pro A8-8600B
165 - FX-8800P
166 - Pro A12-8800B
167 \ amdgcn APU ROCm - A10-8700P
168 - Pro A10-8700B
169 - A10-8780P
170 \ amdgcn APU - A10-9600P
171 - A10-9630P
172 - A12-9700P
173 - A12-9730P
174 - FX-9800P
175 - FX-9830P
176 \ amdgcn APU - E2-9010
177 - A6-9210
178 - A9-9410
179 gfx802 - tonga amdgcn dGPU ROCm Same as gfx800
180 gfx803 - fiji amdgcn dGPU ROCm - Radeon R9 Nano
181 - Radeon R9 Fury
182 - Radeon R9 FuryX
183 - Radeon Pro Duo
184 - FirePro S9300x2
185 \ - polaris10 amdgcn dGPU ROCm - Radeon RX 470
186 - Radeon RX 480
187 \ - polaris11 amdgcn dGPU ROCm - Radeon RX 460
188 gfx804 amdgcn dGPU Same as gfx803
189 gfx810 - stoney amdgcn APU
190 **GCN GFX9**
191 --------------------------------------------------------------------
192 gfx900 amdgcn dGPU - FirePro W9500
193 - FirePro S9500
194 - FirePro S9500x2
195 gfx901 amdgcn dGPU ROCm Same as gfx900
196 except XNACK is
197 enabled
198 gfx902 amdgcn APU *TBA*
199
200 .. TODO
201 Add product
202 names.
203 gfx903 amdgcn APU Same as gfx902
204 except XNACK is
205 enabled
206 ========== =========== ============ ===== ======= ==================
207
208 .. _amdgpu-address-spaces:
15209
16210 Address Spaces
17211 --------------
18212
19 The AMDGPU back-end uses the following address space mapping:
20
21 ================== =================== ==============
22 LLVM Address Space DWARF Address Space Memory Space
23 ================== =================== ==============
24 0 1 Private
25 1 N/A Global
26 2 N/A Constant
27 3 2 Local
28 4 N/A Generic (Flat)
29 5 N/A Region
30 ================== =================== ==============
31
32 The terminology in the table, aside from the region memory space, is from the
33 OpenCL standard.
34
35 LLVM Address Space is used throughout LLVM (for example, in LLVM IR). DWARF
36 Address Space is emitted in DWARF, and is used by tools, such as debugger,
37 profiler and others.
213 The AMDGPU backend uses the following address space mappings.
214
215 The memory space names used in the table, aside from the region memory space, is
216 from the OpenCL standard.
217
218 LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
219
220 .. table:: Address Space Mapping
221 :name: amdgpu-address-space-mapping-table
222
223 ================== ================= ================= ================= =================
224 LLVM Address Space Memory Space
225 ------------------ -----------------------------------------------------------------------
226 \ Current Default amdgiz/amdgizcl hcc Future Default
227 ================== ================= ================= ================= =================
228 0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
229 1 Global Global Global Global
230 2 Constant Constant Constant Region (GDS)
231 3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
232 4 Generic (Flat) Region (GDS) Region (GDS) Constant
233 5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
234 ================== ================= ================= ================= =================
235
236 Current Default
237 This is the current default address space mapping used for all languages
238 except hcc. This will shortly be deprecated.
239
240 amdgiz/amdgizcl
241 This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
242 is specified as the target triple environment value.
243
244 hcc
245 This is the current address space mapping used when ``hcc`` is specified as
246 the target triple environment value.This will shortly be deprecated.
247
248 Future Default
249 This will shortly be the only address space mapping for all languages using
250 AMDGPU backend.
251
252 .. _amdgpu-memory-scopes:
253
254 Memory Scopes
255 -------------
256
257 This section provides LLVM memory synchronization scopes supported by the AMDGPU
258 backend memory model when the target triple OS is ``amdhsa`` (see
259 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
260
261 The memory model supported is based on the HSA memory model [HSA]_ which is
262 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
263 relation is transitive over the synchonizes-with relation independent of scope,
264 and synchonizes-with allows the memory scope instances to be inclusive (see
265 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table`).
266
267 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
268 inclusion and requires the memory scopes to exactly match. However, this
269 is conservatively correct for OpenCL.
270
271 .. table:: AMDHSA LLVM Sync Scopes for AMDHSA
272 :name: amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table
273
274 ================ ==========================================================
275 LLVM Sync Scope Description
276 ================ ==========================================================
277 *none* The default: ``system``.
278
279 Synchronizes with, and participates in modification and
280 seq_cst total orderings with, other operations (except
281 image operations) for all address spaces (except private,
282 or generic that accesses private) provided the other
283 operation's sync scope is:
284
285 - ``system``.
286 - ``agent`` and executed by a thread on the same agent.
287 - ``workgroup`` and executed by a thread in the same
288 workgroup.
289 - ``wavefront`` and executed by a thread in the same
290 wavefront.
291
292 ``agent`` Synchronizes with, and participates in modification and
293 seq_cst total orderings with, other operations (except
294 image operations) for all address spaces (except private,
295 or generic that accesses private) provided the other
296 operation's sync scope is:
297
298 - ``system`` or ``agent`` and executed by a thread on the
299 same agent.
300 - ``workgroup`` and executed by a thread in the same
301 workgroup.
302 - ``wavefront`` and executed by a thread in the same
303 wavefront.
304
305 ``workgroup`` Synchronizes with, and participates in modification and
306 seq_cst total orderings with, other operations (except
307 image operations) for all address spaces (except private,
308 or generic that accesses private) provided the other
309 operation's sync scope is:
310
311 - ``system``, ``agent`` or ``workgroup`` and executed by a
312 thread in the same workgroup.
313 - ``wavefront`` and executed by a thread in the same
314 wavefront.
315
316 ``wavefront`` Synchronizes with, and participates in modification and
317 seq_cst total orderings with, other operations (except
318 image operations) for all address spaces (except private,
319 or generic that accesses private) provided the other
320 operation's sync scope is:
321
322 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
323 and executed by a thread in the same wavefront.
324
325 ``singlethread`` Only synchronizes with, and participates in modification
326 and seq_cst total orderings with, other operations (except
327 image operations) running in the same thread for all
328 address spaces (for example, in signal handlers).
329 ================ ==========================================================
330
331 AMDGPU Intrinsics
332 -----------------
333
334 The AMDGPU backend implements the following intrinsics.
335
336 *This section is WIP.*
337
338 .. TODO
339 List AMDGPU intrinsics
340
341 Code Object
342 ===========
343
344 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
345 can be linked by ``lld`` to produce a standard ELF shared code object which can
346 be loaded and executed on an AMDGPU target.
347
348 Header
349 ------
350
351 The AMDGPU backend uses the following ELF header:
352
353 .. table:: AMDGPU ELF Header
354 :name: amdgpu-elf-header-table
355
356 ========================== =========================
357 Field Value
358 ========================== =========================
359 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
360 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
361 ``e_ident[EI_OSABI]`` ``ELFOSABI_AMDGPU_HSA``
362 ``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA``
363 ``e_type`` ``ET_REL`` or ``ET_DYN``
364 ``e_machine`` ``EM_AMDGPU``
365 ``e_entry`` 0
366 ``e_flags`` 0
367 ========================== =========================
368
369 ..
370
371 .. table:: AMDGPU ELF Header Enumeration Values
372 :name: amdgpu-elf-header-enumeration-values-table
373
374 ============================ =====
375 Name Value
376 ============================ =====
377 ``EM_AMDGPU`` 224
378 ``ELFOSABI_AMDGPU_HSA`` 64
379 ``ELFABIVERSION_AMDGPU_HSA`` 1
380 ============================ =====
381
382 ``e_ident[EI_CLASS]``
383 The ELF class is always ``ELFCLASS64``. The AMDGPU backend only supports 64 bit
384 applications.
385
386 ``e_ident[EI_DATA]``
387 All AMDGPU targets use ELFDATA2LSB for little-endian byte ordering.
388
389 ``e_ident[EI_OSABI]``
390 The AMD GPU architecture specific OS ABI of ``ELFOSABI_AMDGPU_HSA`` is used to
391 specify that the code object conforms to the AMD HSA runtime ABI [HSA]_.
392
393 ``e_ident[EI_ABIVERSION]``
394 The AMD GPU architecture specific OS ABI version of
395 ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA runtime
396 ABI to which the code object conforms.
397
398 ``e_type``
399 Can be one of the following values:
400
401
402 ``ET_REL``
403 The type produced by the AMD GPU backend compiler as it is relocatable code
404 object.
405
406 ``ET_DYN``
407 The type produced by the linker as it is a shared code object.
408
409 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
410
411 ``e_machine``
412 The value ``EM_AMDGPU`` is used for the machine for all members of the AMD GPU
413 architecture family. The specific member is specified in the
414 ``NT_AMD_AMDGPU_ISA`` entry in the ``.note`` section (see
415 :ref:`amdgpu-note-records`).
416
417 ``e_entry``
418 The entry point is 0 as the entry points for individual kernels must be
419 selected in order to invoke them through AQL packets.
420
421 ``e_flags``
422 The value is 0 as no flags are used.
423
424 Sections
425 --------
426
427 An AMDGPU target ELF code object has the standard ELF sections which include:
428
429 .. table:: AMDGPU ELF Sections
430 :name: amdgpu-elf-sections-table
431
432 ================== ================ =================================
433 Name Type Attributes
434 ================== ================ =================================
435 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
436 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
437 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
438 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
439 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
440 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
441 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
442 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
443 ``.note`` ``SHT_NOTE`` *none*
444 ``.rela``\ *name* ``SHT_RELA`` *none*
445 ``.rela.dyn`` ``SHT_RELA`` *none*
446 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
447 ``.shstrtab`` ``SHT_STRTAB`` *none*
448 ``.strtab`` ``SHT_STRTAB`` *none*
449 ``.symtab`` ``SHT_SYMTAB`` *none*
450 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
451 ================== ================ =================================
452
453 These sections have their standard meanings (see [ELF]_) and are only generated
454 if needed.
455
456 ``.debug``\ *\**
457 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
458 DWARF produced by the AMDGPU backend.
459
460 ``.dynamic``, ``.dynstr``, ``.dynstr``, ``.hash``
461 The standard sections used by a dynamic loader.
462
463 ``.note``
464 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
465 backend.
466
467 ``.rela``\ *name*, ``.rela.dyn``
468 For relocatable code objects, *name* is the name of the section that the
469 relocation records apply. For example, ``.rela.text`` is the section name for
470 relocation records associated with the ``.text`` section.
471
472 For linked shared code objects, ``.rela.dyn`` contains all the relocation
473 records from each of the relocatable code object's ``.rela``\ *name* sections.
474
475 See :ref:`amdgpu-relocation-records` for the relocation records supported by
476 the AMDGPU backend.
477
478 ``.text``
479 The executable machine code for the kernels and functions they call. Generated
480 as position independent code. See :ref:`amdgpu-code-conventions` for
481 information on conventions used in the isa generation.
482
483 .. _amdgpu-note-records:
484
485 Note Records
486 ------------
487
488 As required by ``ELFCLASS64``, minimal zero byte padding must be generated after
489 the ``name`` field to ensure the ``desc`` field is 4 byte aligned. In addition,
490 minimal zero byte padding must be generated to ensure the ``desc`` field size is
491 a multiple of 4 bytes. The ``sh_addralign`` field of the ``.note`` section must
492 be at least 4 to indicate at least 8 byte alignment.
493
494 The AMDGPU backend code object uses the following ELF note records in the
495 ``.note`` section. The *Description* column specifies the layout of the note
496 record’s ``desc`` field. All fields are consecutive bytes. Note records with
497 variable size strings have a corresponding ``*_size`` field that specifies the
498 number of bytes, including the terminating null character, in the string. The
499 string(s) come immediately after the preceding fields.
500
501 Additional note records can be present.
502
503 .. table:: AMDGPU ELF Note Records
504 :name: amdgpu-elf-note-records-table
505
506 ===== ========================== ==========================================
507 Name Type Description
508 ===== ========================== ==========================================
509 "AMD" ``NT_AMD_AMDGPU_METADATA``
510 "AMD" ``NT_AMD_AMDGPU_ISA``
511 ===== ========================== ==========================================
512
513 ..
514
515 .. table:: AMDGPU ELF Note Record Enumeration Values
516 :name: amdgpu-elf-note-record-enumeration-values-table
517
518 ============================= =====
519 Name Value
520 ============================= =====
521 *reserved* 0-9
522 ``NT_AMD_AMDGPU_METADATA`` 10
523 ``NT_AMD_AMDGPU_ISA`` 11
524 ============================= =====
525
526 ``NT_AMD_AMDGPU_ISA``
527 Specifies the instruction set architecture used by the machine code contained
528 in the code object.
529
530 This note record is required for code objects containing machine code for
531 processors matching the ``amdgcn`` architecture in table
532 :ref:`amdgpu-processors`.
533
534 The null terminated string has the following syntax:
535
536 *architecture*\ ``-``\ *vendor*\ ``-``\ *os*\ ``-``\ *environment*\ ``-``\ *processor*
537
538 where:
539
540 *architecture*
541 The architecture from table :ref:`amdgpu-target-triples-table`.
542
543 This is always ``amdgcn`` when the target triple OS is ``amdhsa`` (see
544 :ref:`amdgpu-target-triples`).
545
546 *vendor*
547 The vendor from table :ref:`amdgpu-target-triples-table`.
548
549 For the AMDGPU backend this is always ``amd``.
550
551 *os*
552 The OS from table :ref:`amdgpu-target-triples-table`.
553
554 *environment*
555 An environment from table :ref:`amdgpu-target-triples-table`, or blank if
556 the environment has no affect on the execution of the code object.
557
558 For the AMDGPU backend this is currently always blank.
559 *processor*
560 The processor from table :ref:`amdgpu-processors-table`.
561
562 For example:
563
564 ``amdgcn-amd-amdhsa--gfx901``
565
566 ``NT_AMD_AMDGPU_METADATA``
567 Specifies extensible metadata associated with the code object. See
568 :ref:`amdgpu-code-object-metadata` for the syntax of the code object metadata
569 string.
570
571 This note record is required and must contain the minimum information
572 necessary to support the ROCM kernel queries. For example, the segment sizes
573 needed in a dispatch packet. In addition, a high level language runtime may
574 require other information to be included. For example, the AMD OpenCL runtime
575 records kernel argument information.
576
577 .. TODO
578 Is the string null terminated? It probably should not if YAML allows it to
579 contain null characters, otherwise it should be.
580
581 .. _amdgpu-code-object-metadata:
582
583 Code Object Metadata
584 --------------------
585
586 The code object metadata is specified by the ``NT_AMD_AMDHSA_METADATA`` note
587 record (see :ref:`amdgpu-note-records`).
588
589 The metadata is specified as a YAML formated string (see [YAML]_ and
590 :doc:`YamlIO`).
591
592 The metadata is represented as a single YAML document comprised of the mapping
593 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
594 referenced tables.
595
596 For boolean values, the string values of ``false`` and ``true`` are used for
597 false and true respectively.
598
599 Additional information can be added to the mappings. To avoid conflicts, any
600 non-AMD key names should be prefixed by "*vendor-name*.".
601
602 .. table:: AMDHSA Code Object Metadata Mapping
603 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
604
605 ========== ============== ========= =======================================
606 String Key Value Type Required? Description
607 ========== ============== ========= =======================================
608 "Version" sequence of Required - The first integer is the major
609 2 integers version. Currently 1.
610 - The second integer is the minor
611 version. Currently 0.
612 "Printf" sequence of Each string is encoded information
613 strings about a printf function call. The
614 encoded information is organized as
615 fields separated by colon (':'):
616
617 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
618
619 where:
620
621 ``ID``
622 A 32 bit integer as a unique id for
623 each printf function call
624
625 ``N``
626 A 32 bit integer equal to the number
627 of arguments of printf function call
628 minus 1
629
630 ``S[i]`` (where i = 0, 1, ... , N-1)
631 32 bit integers for the size in bytes
632 of the i-th FormatString argument of
633 the printf function call
634
635 FormatString
636 The format string passed to the
637 printf function call.
638 "Kernels" sequence of Required Sequence of the mappings for each
639 mapping kernel in the code object. See
640 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
641 for the definition of the mapping.
642 ========== ============== ========= =======================================
643
644 ..
645
646 .. table:: AMDHSA Code Object Kernel Metadata Mapping
647 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
648
649 ================= ============== ========= ================================
650 String Key Value Type Required? Description
651 ================= ============== ========= ================================
652 "Name" string Required Source name of the kernel.
653 "SymbolName" string Required Name of the kernel
654 descriptor ELF symbol.
655 "Language" string Source language of the kernel.
656 Values include:
657
658 - "OpenCL C"
659 - "OpenCL C++"
660 - "HCC"
661 - "OpenMP"
662
663 "LanguageVersion" sequence of - The first integer is the major
664 2 integers version.
665 - The second integer is the
666 minor version.
667 "Attrs" mapping Mapping of kernel attributes.
668 See
669 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
670 for the mapping definition.
671 "Arguments" sequence of Sequence of mappings of the
672 mapping kernel arguments. See
673 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
674 for the definition of the mapping.
675 "CodeProps" mapping Mapping of properties related to
676 the kernel code. See
677 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
678 for the mapping definition.
679 "DebugProps" mapping Mapping of properties related to
680 the kernel debugging. See
681 :ref:`amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table`
682 for the mapping definition.
683 ================= ============== ========= ================================
684
685 ..
686
687 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
688 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
689
690 =================== ============== ========= ==============================
691 String Key Value Type Required? Description
692 =================== ============== ========= ==============================
693 "ReqdWorkGroupSize" sequence of The dispatch work-group size
694 3 integers X, Y, Z must correspond to the
695 specified values.
696
697 Corresponds to the OpenCL
698 ``reqd_work_group_size``
699 attribute.
700 "WorkGroupSizeHint" sequence of The dispatch work-group size
701 3 integers X, Y, Z is likely to be the
702 specified values.
703
704 Corresponds to the OpenCL
705 ``work_group_size_hint``
706 attribute.
707 "VecTypeHint" string The name of a scalar or vector
708 type.
709
710 Corresponds to the OpenCL
711 ``vec_type_hint`` attribute.
712 =================== ============== ========= ==============================
713
714 ..
715
716 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
717 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
718
719 ================= ============== ========= ================================
720 String Key Value Type Required? Description
721 ================= ============== ========= ================================
722 "Name" string Kernel argument name.
723 "TypeName" string Kernel argument type name.
724 "Size" integer Required Kernel argument size in bytes.
725 "Align" integer Required Kernel argument alignment in
726 bytes. Must be a power of two.
727 "ValueKind" string Required Kernel argument kind that
728 specifies how to set up the
729 corresponding argument.
730 Values include:
731
732 "ByValue"
733 The argument is copied
734 directly into the kernarg.
735
736 "GlobalBuffer"
737 A global address space pointer
738 to the buffer data is passed
739 in the kernarg.
740
741 "DynamicSharedPointer"
742 A group address space pointer
743 to dynamically allocated LDS
744 is passed in the kernarg.
745
746 "Sampler"
747 A global address space
748 pointer to a S# is passed in
749 the kernarg.
750
751 "Image"
752 A global address space
753 pointer to a T# is passed in
754 the kernarg.
755
756 "Pipe"
757 A global address space pointer
758 to an OpenCL pipe is passed in
759 the kernarg.
760
761 "Queue"
762 A global address space pointer
763 to an OpenCL device enqueue
764 queue is passed in the
765 kernarg.
766
767 "HiddenGlobalOffsetX"
768 The OpenCL grid dispatch
769 global offset for the X
770 dimension is passed in the
771 kernarg.
772
773 "HiddenGlobalOffsetY"
774 The OpenCL grid dispatch
775 global offset for the Y
776 dimension is passed in the
777 kernarg.
778
779 "HiddenGlobalOffsetZ"
780 The OpenCL grid dispatch
781 global offset for the Z
782 dimension is passed in the
783 kernarg.
784
785 "HiddenNone"
786 An argument that is not used
787 by the kernel. Space needs to
788 be left for it, but it does
789 not need to be set up.
790
791 "HiddenPrintfBuffer"
792 A global address space pointer
793 to the runtime printf buffer
794 is passed in kernarg.
795
796 "HiddenDefaultQueue"
797 A global address space pointer
798 to the OpenCL device enqueue
799 queue that should be used by
800 the kernel by default is
801 passed in the kernarg.
802
803 "HiddenCompletionAction"
804 *TBD*
805
806 .. TODO
807 Add description.
808
809 "ValueType" string Required Kernel argument value type. Only
810 present if "ValueKind" is
811 "ByValue". For vector data
812 types, the value is for the
813 element type. Values include:
814
815 - "Struct"
816 - "I8"
817 - "U8"
818 - "I16"
819 - "U16"
820 - "F16"
821 - "I32"
822 - "U32"
823 - "F32"
824 - "I64"
825 - "U64"
826 - "F64"
827
828 .. TODO
829 How can it be determined if a
830 vector type, and what size
831 vector?
832 "PointeeAlign" integer Alignment in bytes of pointee
833 type for pointer type kernel
834 argument. Must be a power
835 of 2. Only present if
836 "ValueKind" is
837 "DynamicSharedPointer".
838 "AddrSpaceQual" string Kernel argument address space
839 qualifier. Only present if
840 "ValueKind" is "GlobalBuffer" or
841 "DynamicSharedPointer". Values
842 are:
843
844 - "Private"
845 - "Global"
846 - "Constant"
847 - "Local"
848 - "Generic"
849 - "Region"
850
851 .. TODO
852 Is GlobalBuffer only Global
853 or Constant? Is
854 DynamicSharedPointer always
855 Local? Can HCC allow Generic?
856 How can Private or Region
857 ever happen?
858 "AccQual" string Kernel argument access
859 qualifier. Only present if
860 "ValueKind" is "Image" or
861 "Pipe". Values
862 are:
863
864 - "ReadOnly"
865 - "WriteOnly"
866 - "ReadWrite"
867
868 .. TODO
869 Does this apply to
870 GlobalBuffer?
871 "ActualAcc" string The actual memory accesses
872 performed by the kernel on the
873 kernel argument. Only present if
874 "ValueKind" is "GlobalBuffer",
875 "Image", or "Pipe". This may be
876 more restrictive than indicated
877 by "AccQual" to reflect what the
878 kernel actual does. If not
879 present then the runtime must
880 assume what is implied by
881 "AccQual" and "IsConst". Values
882 are:
883
884 - "ReadOnly"
885 - "WriteOnly"
886 - "ReadWrite"
887
888 "IsConst" boolean Indicates if the kernel argument
889 is const qualified. Only present
890 if "ValueKind" is
891 "GlobalBuffer".
892
893 "IsRestrict" boolean Indicates if the kernel argument
894 is restrict qualified. Only
895 present if "ValueKind" is
896 "GlobalBuffer".
897
898 "IsVolatile" boolean Indicates if the kernel argument
899 is volatile qualified. Only
900 present if "ValueKind" is
901 "GlobalBuffer".
902
903 "IsPipe" boolean Indicates if the kernel argument
904 is pipe qualified. Only present
905 if "ValueKind" is "Pipe".
906
907 .. TODO
908 Can GlobalBuffer be pipe
909 qualified?
910 ================= ============== ========= ================================
911
912 ..
913
914 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
915 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
916
917 ============================ ============== ========= =====================
918 String Key Value Type Required? Description
919 ============================ ============== ========= =====================
920 "KernargSegmentSize" integer Required The size in bytes of
921 the kernarg segment
922 that holds the values
923 of the arguments to
924 the kernel.
925 "GroupSegmentFixedSize" integer Required The amount of group
926 segment memory
927 required by a
928 work-group in
929 bytes. This does not
930 include any
931 dynamically allocated
932 group segment memory
933 that may be added
934 when the kernel is
935 dispatched.
936 "PrivateSegmentFixedSize" integer Required The amount of fixed
937 private address space
938 memory required for a
939 work-item in
940 bytes. If
941 IsDynamicCallstack
942 is 1 then additional
943 space must be added
944 to this value for the
945 call stack.
946 "KernargSegmentAlign" integer Required The maximum byte
947 alignment of
948 arguments in the
949 kernarg segment. Must
950 be a power of 2.
951 "WavefrontSize" integer Required Wavefront size. Must
952 be a power of 2.
953 "NumSGPRs" integer Number of scalar
954 registers used by a
955 wavefront for
956 GFX6-GFX9. This
957 includes the special
958 SGPRs for VCC, Flat
959 Scratch (GFX7-GFX9)
960 and XNACK (for
961 GFX8-GFX9). It does
962 not include the 16
963 SGPR added if a trap
964 handler is
965 enabled. It is not
966 rounded up to the
967 allocation
968 granularity.
969 "NumVGPRs" integer Number of vector
970 registers used by
971 each work-item for
972 GFX6-GFX9
973 "MaxFlatWorkgroupSize" integer Maximum flat
974 work-group size
975 supported by the
976 kernel in work-items.
977 "IsDynamicCallStack" boolean Indicates if the
978 generated machine
979 code is using a
980 dynamically sized
981 call stack.
982 "IsXNACKEnabled" boolean Indicates if the
983 generated machine
984 code is capable of
985 supporting XNACK.
986 ============================ ============== ========= =====================
987
988 ..
989
990 .. table:: AMDHSA Code Object Kernel Debug Properties Metadata Mapping
991 :name: amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table
992
993 =================================== ============== ========= ==============
994 String Key Value Type Required? Description
995 =================================== ============== ========= ==============
996 "DebuggerABIVersion" string
997 "ReservedNumVGPRs" integer
998 "ReservedFirstVGPR" integer
999 "PrivateSegmentBufferSGPR" integer
1000 "WavefrontPrivateSegmentOffsetSGPR" integer
1001 =================================== ============== ========= ==============
1002
1003 .. TODO
1004 Plan to remove the debug properties metadata.
1005
1006 .. _amdgpu-symbols:
1007
1008 Symbols
1009 -------
1010
1011 Symbols include the following:
1012
1013 .. table:: AMDGPU ELF Symbols
1014 :name: amdgpu-elf-symbols-table
1015
1016 ===================== ============== ============= ==================
1017 Name Type Section Description
1018 ===================== ============== ============= ==================
1019 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
1020 - ``.rodata``
1021 - ``.bss``
1022 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
1023 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
1024 ===================== ============== ============= ==================
1025
1026 Global variable
1027 Global variables both used and defined by the compilation unit.
1028
1029 If the symbol is defined in the compilation unit then it is allocated in the
1030 appropriate section according to if it has initialized data or is readonly.
1031
1032 If the symbol is external then its section is ``STN_UNDEF`` and the loader
1033 will resolve relocations using the defintion provided by another code object
1034 or explicitly defined by the runtime.
1035
1036 All global symbols, whether defined in the compilation unit or external, are
1037 accessed by the machine code indirectly throught a GOT table entry. This
1038 allows them to be preemptable. The GOT table is only supported when the target
1039 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
1040
1041 .. TODO
1042 Add description of linked shared object symbols. Seems undefined symbols
1043 are marked as STT_NOTYPE.
1044
1045 Kernel descriptor
1046 Every HSA kernel has an associated kernel descriptor. It is the address of the
1047 kernel descriptor that is used in the AQL dispatch packet used to invoke the
1048 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
1049 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
1050
1051 Kernel entry point
1052 Every HSA kernel also has a symbol for its machine code entry point.
1053
1054 .. _amdgpu-relocation-records:
1055
1056 Relocation Records
1057 ------------------
1058
1059 AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
1060 relocatable fields are:
1061
1062 ``word32``
1063 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
1064 alignment. These values use the same byte order as other word values in the
1065 AMD GPU architecture.
1066
1067 ``word64``
1068 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
1069 alignment. These values use the same byte order as other word values in the
1070 AMD GPU architecture.
1071
1072 Following notations are used for specifying relocation calculations:
1073
1074 **A**
1075 Represents the addend used to compute the value of the relocatable field.
1076
1077 **G**
1078 Represents the offset into the global offset table at which the relocation
1079 entry’s symbol will reside during execution.
1080
1081 **GOT**
1082 Represents the address of the global offset table.
1083
1084 **P**
1085 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
1086 of the storage unit being relocated (computed using ``r_offset``).
1087
1088 **S**
1089 Represents the value of the symbol whose index resides in the relocation
1090 entry.
1091
1092 The following relocation types are supported:
1093
1094 .. table:: AMDGPU ELF Relocation Records
1095 :name: amdgpu-elf-relocation-records-table
1096
1097 ========================== ===== ========== ==============================
1098 Relocation Type Value Field Calculation
1099 ========================== ===== ========== ==============================
1100 ``R_AMDGPU_NONE`` 0 *none* *none*
1101 ``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
1102 ``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
1103 ``R_AMDGPU_ABS64`` 3 ``word64`` S + A
1104 ``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
1105 ``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
1106 ``R_AMDGPU_ABS32`` 6 ``word32`` S + A
1107 ``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
1108 ``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
1109 ``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
1110 ``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
1111 ``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
1112 ========================== ===== ========== ==============================
1113
1114 .. _amdgpu-dwarf:
1115
1116 DWARF
1117 -----
1118
1119 Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
1120 information that maps the code object executable code and data to the source
1121 language constructs. It can be used by tools such as debuggers and profilers.
1122
1123 Address Space Mapping
1124 ~~~~~~~~~~~~~~~~~~~~~
1125
1126 The following address space mapping is used:
1127
1128 .. table:: AMDGPU DWARF Address Space Mapping
1129 :name: amdgpu-dwarf-address-space-mapping-table
1130
1131 =================== =================
1132 DWARF Address Space Memory Space
1133 =================== =================
1134 1 Private (Scratch)
1135 2 Local (group/LDS)
1136 *omitted* Global
1137 *omitted* Constant
1138 *omitted* Generic (Flat)
1139 *not supported* Region (GDS)
1140 =================== =================
1141
1142 See :ref:`amdgpu-address-spaces` for infomration on the memory space terminology
1143 used in the table.
1144
1145 An ``address_class`` attribute is generated on pointer type DIEs to specify the
1146 DWARF address space of the value of the pointer when it is in the *private* or
1147 *local* address space. Otherwise the attribute is omitted.
1148
1149 An ``XDEREF`` operation is generated in location list expressions for variables
1150 that are allocated in the *private* and *local* address space. Otherwise no
1151 ``XDREF`` is omitted.
1152
1153 Register Mapping
1154 ~~~~~~~~~~~~~~~~
1155
1156 *This section is WIP.*
1157
1158 .. TODO
1159 Define DWARF register enumeration.
1160
1161 If want to present a wavefront state then should expose vector registers as
1162 64 wide (rather than per work-item view that LLVM uses). Either as seperate
1163 registers, or a 64x4 byte single register. In either case use a new LANE op
1164 (akin to XDREF) to select the current lane usage in a location
1165 expression. This would also allow scalar register spilling to vector register
1166 lanes to be expressed (currently no debug information is being generated for
1167 spilling). If choose a wide single register approach then use LANE in
1168 conjunction with PIECE operation to select the dword part of the register for
1169 the current lane. If the separate register approach then use LANE to select
1170 the register.
1171
1172 Source Text
1173 ~~~~~~~~~~~
1174
1175 *This section is WIP.*
1176
1177 .. TODO
1178 DWARF extension to include runtime generated source text.
1179
1180 .. _amdgpu-code-conventions:
1181
1182 Code Conventions
1183 ================
1184
1185 AMDHSA
1186 ------
1187
1188 This section provides code conventions used when the target triple OS is
1189 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
1190
1191 Kernel Dispatch
1192 ~~~~~~~~~~~~~~~
1193
1194 The HSA architected queuing language (AQL) defines a user space memory interface
1195 that can be used to control the dispatch of kernels, in an agent independent
1196 way. An agent can have zero or more AQL queues created for it using the ROCm
1197 runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1198 *HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1199 mechanics and packet layouts.
1200
1201 The packet processor of a kernel agent is responsible for detecting and
1202 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1203 packet processor is implemented by the hardware command processor (CP),
1204 asynchronous dispatch controller (ADC) and shader processor input controller
1205 (SPI).
1206
1207 The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1208 mode driver to initialize and register the AQL queue with CP.
1209
1210 To dispatch a kernel the following actions are performed. This can occur in the
1211 CPU host program, or from an HSA kernel executing on a GPU.
1212
1213 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1214 executed is obtained.
1215 2. A pointer to the kernel descriptor (see
1216 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1217 obtained. It must be for a kernel that is contained in a code object that that
1218 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1219 associated.
1220 3. Space is allocated for the kernel arguments using the ROCm runtime allocator
1221 for a memory region with the kernarg property for the kernel agent that will
1222 execute the kernel. It must be at least 16 byte aligned.
1223 4. Kernel argument values are assigned to the kernel argument memory
1224 allocation. The layout is defined in the *HSA Programmer’s Language Reference*
1225 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1226 memory in the same way constant memory is accessed. (Note that the HSA
1227 specification allows an implementation to copy the kernel argument contents to
1228 another location that is accessed by the kernel.)
1229 5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1230 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1231 packet. The packet must be set up, and the final write must use an atomic
1232 store release to set the packet kind to ensure the packet contents are
1233 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1234 notify the kernel agent that the AQL queue has been updated. These rules, and
1235 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1236 System Architecture Specification* [HSA]_.
1237 6. A kernel dispatch packet includes information about the actual dispatch,
1238 such as grid and work-group size, together with information from the code
1239 object about the kernel, such as segment sizes. The ROCm runtime queries on
1240 the kernel symbol can be used to obtain the code object values which are
1241 recorded in the :ref:`amdgpu-code-object-metadata`.
1242 7. CP executes micro-code and is responsible for detecting and setting up the
1243 GPU to execute the wavefronts of a kernel dispatch.
1244 8. CP ensures that when the a wavefront starts executing the kernel machine
1245 code, the scalar general purpose registers (SGPR) and vector general purpose
1246 registers (VGPR) are set up as required by the machine code. The required
1247 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1248 register state is defined in
1249 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
1250 9. The prolog of the kernel machine code (see
1251 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1252 before continuing executing the machine code that corresponds to the kernel.
1253 10. When the kernel dispatch has completed execution, CP signals the completion
1254 signal specified in the kernel dispatch packet if not 0.
1255
1256 .. _amdgpu-amdhsa-memory-spaces:
1257
1258 Memory Spaces
1259 ~~~~~~~~~~~~~
1260
1261 The memory space properties are:
1262
1263 .. table:: AMDHSA Memory Spaces
1264 :name: amdgpu-amdhsa-memory-spaces-table
1265
1266 ================= =========== ======== ======= ==================
1267 Memory Space Name HSA Segment Hardware Address NULL Value
1268 Name Name Size
1269 ================= =========== ======== ======= ==================
1270 Private private scratch 32 0x00000000
1271 Local group LDS 32 0xFFFFFFFF
1272 Global global global 64 0x0000000000000000
1273 Constant constant *same as 64 0x0000000000000000
1274 global*
1275 Generic flat flat 64 0x0000000000000000
1276 Region N/A GDS 32 *not implemented
1277 for AMDHSA*
1278 ================= =========== ======== ======= ==================
1279
1280 The global and constant memory spaces both use global virtual addresses, which
1281 are the same virtual address space used by the CPU. However, some virtual
1282 addresses may only be accessible to the CPU, some only accessible by the GPU,
1283 and some by both.
1284
1285 Using the constant memory space indicates that the data will not change during
1286 the execution of the kernel. This allows scalar read instructions to be
1287 used. The vector and scalar L1 caches are invalidated of volatile data before
1288 each kernel dispatch execution to allow constant memory to change values between
1289 kernel dispatches.
1290
1291 The local memory space uses the hardware Local Data Store (LDS) which is
1292 automatically allocated when the hardware creates work-groups of wavefronts, and
1293 freed when all the wavefronts of a work-group have terminated. The data store
1294 (DS) instructions can be used to access it.
1295
1296 The private memory space uses the hardware scratch memory support. If the kernel
1297 uses scratch, then the hardware allocates memory that is accessed using
1298 wavefront lane dword (4 byte) interleaving. The mapping used from private
1299 address to physical address is:
1300
1301 ``wavefront-scratch-base +
1302 (private-address * wavefront-size * 4) +
1303 (wavefront-lane-id * 4)``
1304
1305 There are different ways that the wavefront scratch base address is determined
1306 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1307 memory can be accessed in an interleaved manner using buffer instruction with
1308 the scratch buffer descriptor and per wave scratch offset, by the scratch
1309 instructions, or by flat instructions. If each lane of a wavefront accesses the
1310 same private address, the interleaving results in adjacent dwords being accessed
1311 and hence requires fewer cache lines to be fetched. Multi-dword access is not
1312 supported except by flat and scratch instructions in GFX9.
1313
1314 The generic address space uses the hardware flat address support available in
1315 GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1316 local appertures), that are outside the range of addressible global memory, to
1317 map from a flat address to a private or local address.
1318
1319 FLAT instructions can take a flat address and access global, private (scratch)
1320 and group (LDS) memory depending in if the address is within one of the
1321 apperture ranges. Flat access to scratch requires hardware aperture setup and
1322 setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1323 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1324 (see :ref:`amdgpu-amdhsa-m0`).
1325
1326 To convert between a segment address and a flat address the base address of the
1327 appertures address can be used. For GFX7-GFX8 these are available in the
1328 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1329 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1330 GFX9 the appature base addresses are directly available as inline constant
1331 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1332 address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1333 which makes it easier to convert from flat to segment or segment to flat.
1334
1335 HSA Image and Samplers
1336 ~~~~~~~~~~~~~~~~~~~~~~
1337
1338 Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1339 hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1340 HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1341 enumeration values for the queries that are not trivially deducible from the S#
1342 representation.
1343
1344 HSA Signals
1345 ~~~~~~~~~~~
1346
1347 Signal handles created by the ROCm runtime are 64 bit addresses of a structure
1348 allocated in memory accessible from both the CPU and GPU. The structure is
1349 defined by the ROCm runtime and subject to change between releases (see
1350 [AMD-ROCm-github]_).
1351
1352 .. _amdgpu-amdhsa-hsa-aql-queue:
1353
1354 HSA AQL Queue
1355 ~~~~~~~~~~~~~
1356
1357 The AQL queue structure is defined by the ROCm runtime and subject to change
1358 between releases (see [AMD-ROCm-github]_). For some processors it contains
1359 fields needed to implement certain language features such as the flat address
1360 aperture bases. It also contains fields used by CP such as managing the
1361 allocation of scratch memory.
1362
1363 .. _amdgpu-amdhsa-kernel-descriptor:
1364
1365 Kernel Descriptor
1366 ~~~~~~~~~~~~~~~~~
1367
1368 A kernel descriptor consists of the information needed by CP to initiate the
1369 execution of a kernel, including the entry point address of the machine code
1370 that implements the kernel.
1371
1372 Kernel Descriptor for GFX6-GFX9
1373 +++++++++++++++++++++++++++++++
1374
1375 CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1376
1377 .. table:: Kernel Descriptor for GFX6-GFX9
1378 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1379
1380 ======= ======= =============================== ===========================
1381 Bits Size Field Name Description
1382 ======= ======= =============================== ===========================
1383 31:0 4 bytes group_segment_fixed_size The amount of fixed local
1384 address space memory
1385 required for a work-group
1386 in bytes. This does not
1387 include any dynamically
1388 allocated local address
1389 space memory that may be
1390 added when the kernel is
1391 dispatched.
1392 63:32 4 bytes private_segment_fixed_size The amount of fixed
1393 private address space
1394 memory required for a
1395 work-item in bytes. If
1396 is_dynamic_callstack is 1
1397 then additional space must
1398 be added to this value for
1399 the call stack.
1400 95:64 4 bytes max_flat_workgroup_size Maximum flat work-group
1401 size supported by the
1402 kernel in work-items.
1403 96 1 bit is_dynamic_call_stack Indicates if the generated
1404 machine code is using a
1405 dynamically sized call
1406 stack.
1407 97 1 bit is_xnack_enabled Indicates if the generated
1408 machine code is capable of
1409 suppoting XNACK.
1410 127:98 30 bits Reserved. Must be 0.
1411 191:128 8 bytes kernel_code_entry_byte_offset Byte offset (possibly
1412 negative) from base
1413 address of kernel
1414 descriptor to kernel's
1415 entry point instruction
1416 which must be 256 byte
1417 aligned.
1418 383:192 24 Reserved. Must be 0.
1419 bytes
1420 415:384 4 bytes compute_pgm_rsrc1 Compute Shader (CS)
1421 program settings used by
1422 CP to set up
1423 ``COMPUTE_PGM_RSRC1``
1424 configuration
1425 register. See
1426 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table`.
1427 447:416 4 bytes compute_pgm_rsrc2 Compute Shader (CS)
1428 program settings used by
1429 CP to set up
1430 ``COMPUTE_PGM_RSRC2``
1431 configuration
1432 register. See
1433 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
1434 448 1 bit enable_sgpr_private_segment Enable the setup of the
1435 _buffer SGPR user data registers
1436 (see
1437 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1438
1439 The total number of SGPR
1440 user data registers
1441 requested must not exceed
1442 16 and match value in
1443 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1444 Any requests beyond 16
1445 will be ignored.
1446 449 1 bit enable_sgpr_dispatch_ptr *see above*
1447 450 1 bit enable_sgpr_queue_ptr *see above*
1448 451 1 bit enable_sgpr_kernarg_segment_ptr *see above*
1449 452 1 bit enable_sgpr_dispatch_id *see above*
1450 453 1 bit enable_sgpr_flat_scratch_init *see above*
1451 454 1 bit enable_sgpr_private_segment *see above*
1452 _size
1453 455 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
1454 _count_X should always be 0.
1455 456 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
1456 _count_Y should always be 0.
1457 457 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
1458 _count_Z should always be 0.
1459 463:458 6 bits Reserved. Must be 0.
1460 511:464 4 Reserved. Must be 0.
1461 bytes
1462 512 **Total size 64 bytes.**
1463 ======= ===================================================================
1464
1465 ..
1466
1467 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1468 :name: amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table
1469
1470 ======= ======= =============================== ===========================
1471 Bits Size Field Name Description
1472 ======= ======= =============================== ===========================
1473 5:0 6 bits granulated_workitem_vgpr_count Number of vector registers
1474 used by each work-item,
1475 granularity is device
1476 specific:
1477
1478 GFX6-9
1479 roundup((max-vgpg + 1)
1480 / 4) - 1
1481
1482 Used by CP to set up
1483 ``COMPUTE_PGM_RSRC1.VGPRS``.
1484 9:6 4 bits granulated_wavefront_sgpr_count Number of scalar registers
1485 used by a wavefront,
1486 granularity is device
1487 specific:
1488
1489 GFX6-8
1490 roundup((max-sgpg + 1)
1491 / 8) - 1
1492 GFX9
1493 roundup((max-sgpg + 1)
1494 / 16) - 1
1495
1496 Includes the special SGPRs
1497 for VCC, Flat Scratch (for
1498 GFX7 onwards) and XNACK
1499 (for GFX8 onwards). It does
1500 not include the 16 SGPR
1501 added if a trap handler is
1502 enabled.
1503
1504 Used by CP to set up
1505 ``COMPUTE_PGM_RSRC1.SGPRS``.
1506 11:10 2 bits priority Must be 0.
1507
1508 Start executing wavefront
1509 at the specified priority.
1510
1511 CP is responsible for
1512 filling in
1513 ``COMPUTE_PGM_RSRC1.PRIORITY``.
1514 13:12 2 bits float_mode_round_32 Wavefront starts execution
1515 with specified rounding
1516 mode for single (32
1517 bit) floating point
1518 precision floating point
1519 operations.
1520
1521 Floating point rounding
1522 mode values are defined in
1523 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1524
1525 Used by CP to set up
1526 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1527 15:14 2 bits float_mode_round_16_64 Wavefront starts execution
1528 with specified rounding
1529 denorm mode for half/double (16
1530 and 64 bit) floating point
1531 precision floating point
1532 operations.
1533
1534 Floating point rounding
1535 mode values are defined in
1536 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1537
1538 Used by CP to set up
1539 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1540 17:16 2 bits float_mode_denorm_32 Wavefront starts execution
1541 with specified denorm mode
1542 for single (32
1543 bit) floating point
1544 precision floating point
1545 operations.
1546
1547 Floating point denorm mode
1548 values are defined in
1549 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1550
1551 Used by CP to set up
1552 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1553 19:18 2 bits float_mode_denorm_16_64 Wavefront starts execution
1554 with specified denorm mode
1555 for half/double (16
1556 and 64 bit) floating point
1557 precision floating point
1558 operations.
1559
1560 Floating point denorm mode
1561 values are defined in
1562 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1563
1564 Used by CP to set up
1565 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1566 20 1 bit priv Must be 0.
1567
1568 Start executing wavefront
1569 in privilege trap handler
1570 mode.
1571
1572 CP is responsible for
1573 filling in
1574 ``COMPUTE_PGM_RSRC1.PRIV``.
1575 21 1 bit enable_dx10_clamp Wavefront starts execution
1576 with DX10 clamp mode
1577 enabled. Used by the vector
1578 ALU to force DX-10 style
1579 treatment of NaN's (when
1580 set, clamp NaN to zero,
1581 otherwise pass NaN
1582 through).
1583
1584 Used by CP to set up
1585 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
1586 22 1 bit debug_mode Must be 0.
1587
1588 Start executing wavefront
1589 in single step mode.
1590
1591 CP is responsible for
1592 filling in
1593 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
1594 23 1 bit enable_ieee_mode Wavefront starts execution
1595 with IEEE mode
1596 enabled. Floating point
1597 opcodes that support
1598 exception flag gathering
1599 will quiet and propagate
1600 signaling-NaN inputs per
1601 IEEE 754-2008. Min_dx10 and
1602 max_dx10 become IEEE
1603 754-2008 compliant due to
1604 signaling-NaN propagation
1605 and quieting.
1606
1607 Used by CP to set up
1608 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
1609 24 1 bit bulky Must be 0.
1610
1611 Only one work-group allowed
1612 to execute on a compute
1613 unit.
1614
1615 CP is responsible for
1616 filling in
1617 ``COMPUTE_PGM_RSRC1.BULKY``.
1618 25 1 bit cdbg_user Must be 0.
1619
1620 Flag that can be used to
1621 control debugging code.
1622
1623 CP is responsible for
1624 filling in
1625 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
1626 31:26 6 bits Reserved. Must be 0.
1627 32 **Total size 4 bytes**
1628 ======= ===================================================================
1629
1630 ..
1631
1632 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1633 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1634
1635 ======= ======= =============================== ===========================
1636 Bits Size Field Name Description
1637 ======= ======= =============================== ===========================
1638 0 1 bit enable_sgpr_private_segment Enable the setup of the
1639 _wave_offset SGPR wave scratch offset
1640 system register (see
1641 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1642
1643 Used by CP to set up
1644 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
1645 5:1 5 bits user_sgpr_count The total number of SGPR
1646 user data registers
1647 requested. This number must
1648 match the number of user
1649 data registers enabled.
1650
1651 Used by CP to set up
1652 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
1653 6 1 bit enable_trap_handler Set to 1 if code contains a
1654 TRAP instruction which
1655 requires a trap hander to
1656 be enabled.
1657
1658 CP sets
1659 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1660 if the runtime has
1661 installed a trap handler
1662 regardless of the setting
1663 of this field.
1664 7 1 bit enable_sgpr_workgroup_id_x Enable the setup of the
1665 system SGPR register for
1666 the work-group id in the X
1667 dimension (see
1668 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1669
1670 Used by CP to set up
1671 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
1672 8 1 bit enable_sgpr_workgroup_id_y Enable the setup of the
1673 system SGPR register for
1674 the work-group id in the Y
1675 dimension (see
1676 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1677
1678 Used by CP to set up
1679 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
1680 9 1 bit enable_sgpr_workgroup_id_z Enable the setup of the
1681 system SGPR register for
1682 the work-group id in the Z
1683 dimension (see
1684 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1685
1686 Used by CP to set up
1687 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
1688 10 1 bit enable_sgpr_workgroup_info Enable the setup of the
1689 system SGPR register for
1690 work-group information (see
1691 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1692
1693 Used by CP to set up
1694 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
1695 12:11 2 bits enable_vgpr_workitem_id Enable the setup of the
1696 VGPR system registers used
1697 for the work-item ID.
1698 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1699 defines the values.
1700
1701 Used by CP to set up
1702 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
1703 13 1 bit enable_exception_address_watch Must be 0.
1704
1705 Wavefront starts execution
1706 with address watch
1707 exceptions enabled which
1708 are generated when L1 has
1709 witnessed a thread access
1710 an *address of
1711 interest*.
1712
1713 CP is responsible for
1714 filling in the address
1715 watch bit in
1716 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1717 according to what the
1718 runtime requests.
1719 14 1 bit enable_exception_memory Must be 0.
1720
1721 Wavefront starts execution
1722 with memory violation
1723 exceptions exceptions
1724 enabled which are generated
1725 when a memory violation has
1726 occurred for this wave from
1727 L1 or LDS
1728 (write-to-read-only-memory,
1729 mis-aligned atomic, LDS
1730 address out of range,
1731 illegal address, etc.).
1732
1733 CP sets the memory
1734 violation bit in
1735 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1736 according to what the
1737 runtime requests.
1738 23:15 9 bits granulated_lds_size Must be 0.
1739
1740 CP uses the rounded value
1741 from the dispatch packet,
1742 not this value, as the
1743 dispatch may contain
1744 dynamically allocated group
1745 segment memory. CP writes
1746 directly to
1747 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1748
1749 Amount of group segment
1750 (LDS) to allocate for each
1751 work-group. Granularity is
1752 device specific:
1753
1754 GFX6:
1755 roundup(lds-size / (64 * 4))
1756 GFX7-GFX9:
1757 roundup(lds-size / (128 * 4))
1758
1759 24 1 bit enable_exception_ieee_754_fp Wavefront starts execution
1760 _invalid_operation with specified exceptions
1761 enabled.
1762
1763 Used by CP to set up
1764 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1765 (set from bits 0..6).
1766
1767 IEEE 754 FP Invalid
1768 Operation
1769 25 1 bit enable_exception_fp_denormal FP Denormal one or more
1770 _source input operands is a
1771 denormal number
1772 26 1 bit enable_exception_ieee_754_fp IEEE 754 FP Division by
1773 _division_by_zero Zero
1774 27 1 bit enable_exception_ieee_754_fp IEEE 754 FP FP Overflow
1775 _overflow
1776 28 1 bit enable_exception_ieee_754_fp IEEE 754 FP Underflow
1777 _underflow
1778 29 1 bit enable_exception_ieee_754_fp IEEE 754 FP Inexact
1779 _inexact
1780 30 1 bit enable_exception_int_divide_by Integer Division by Zero
1781 _zero (rcp_iflag_f32 instruction
1782 only)
1783 31 1 bit Reserved. Must be 0.
1784 32 **Total size 4 bytes.**
1785 ======= ===================================================================
1786
1787 ..
1788
1789 .. table:: Floating Point Rounding Mode Enumeration Values
1790 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1791
1792 ===================================== ===== ===============================
1793 Enumeration Name Value Description
1794 ===================================== ===== ===============================
1795 AMD_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
1796 AMD_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
1797 AMD_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
1798 AMD_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
1799 ===================================== ===== ===============================
1800
1801 ..
1802
1803 .. table:: Floating Point Denorm Mode Enumeration Values
1804 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1805
1806 ===================================== ===== ===============================
1807 Enumeration Name Value Description
1808 ===================================== ===== ===============================
1809 AMD_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
1810 Denorms
1811 AMD_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
1812 AMD_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
1813 AMD_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
1814 ===================================== ===== ===============================
1815
1816 ..
1817
1818 .. table:: System VGPR Work-Item ID Enumeration Values
1819 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1820
1821 ===================================== ===== ===============================
1822 Enumeration Name Value Description
1823 ===================================== ===== ===============================
1824 AMD_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension ID.
1825 AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
1826 dimensions ID.
1827 AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
1828 dimensions ID.
1829 AMD_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
1830 ===================================== ===== ===============================
1831
1832 .. _amdgpu-amdhsa-initial-kernel-execution-state:
1833
1834 Initial Kernel Execution State
1835 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1836
1837 This section defines the register state that will be set up by the packet
1838 processor prior to the start of execution of every wavefront. This is limited by
1839 the constraints of the hardware controllers of CP/ADC/SPI.
1840
1841 The order of the SGPR registers is defined, but the compiler can specify which
1842 ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
1843 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
1844 for enabled registers are dense starting at SGPR0: the first enabled register is
1845 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
1846 an SGPR number.
1847
1848 The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
1849 all waves of the grid. It is possible to specify more than 16 User SGPRs using
1850 the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
1851 initialized. These are then immediately followed by the System SGPRs that are
1852 set up by ADC/SPI and can have different values for each wave of the grid
1853 dispatch.
1854
1855 SGPR register initial state is defined in
1856 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
1857
1858 .. table:: SGPR Register Set Up Order
1859 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
1860
1861 ========== ========================== ====== ==============================
1862 SGPR Order Name Number Description
1863 (kernel descriptor enable of
1864 field) SGPRs
1865 ========== ========================== ====== ==============================
1866 First Private Segment Buffer 4 V# that can be used, together
1867 (enable_sgpr_private with Scratch Wave Offset as an
1868 _segment_buffer) offset, to access the private
1869 memory space using a segment
1870 address.
1871
1872 CP uses the value provided by
1873 the runtime.
1874 then Dispatch Ptr 2 64 bit address of AQL dispatch
1875 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
1876 actually executing.
1877 then Queue Ptr 2 64 bit address of amd_queue_t
1878 (enable_sgpr_queue_ptr) object for AQL queue on which
1879 the dispatch packet was
1880 queued.
1881 then Kernarg Segment Ptr 2 64 bit address of Kernarg
1882 (enable_sgpr_kernarg segment. This is directly
1883 _segment_ptr) copied from the
1884 kernarg_address in the kernel
1885 dispatch packet.
1886
1887 Having CP load it once avoids
1888 loading it at the beginning of
1889 every wavefront.
1890 then Dispatch Id 2 64 bit Dispatch ID of the
1891 (enable_sgpr_dispatch_id) dispatch packet being
1892 executed.
1893 then Flat Scratch Init 2 This is 2 SGPRs:
1894 (enable_sgpr_flat_scratch
1895 _init) GFX6
1896 Not supported.
1897 GFX7-GFX8
1898 The first SGPR is a 32 bit
1899 byte offset from
1900 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1901 to per SPI base of memory
1902 for scratch for the queue
1903 executing the kernel
1904 dispatch. CP obtains this
1905 from the runtime.
1906
1907 This is the same offset used
1908 in computing the Scratch
1909 Segment Buffer base
1910 address. The value of
1911 Scratch Wave Offset must be
1912 added by the kernel machine
1913 code and moved to SGPRn-4
1914 for use as the FLAT SCRATCH
1915 BASE in flat memory
1916 instructions.
1917
1918 The second SGPR is 32 bit
1919 byte size of a single
1920 work-item’s scratch memory
1921 usage. This is directly
1922 loaded from the kernel
1923 dispatch packet Private
1924 Segment Byte Size and
1925 rounded up to a multiple of
1926 DWORD.
1927
1928 The kernel code must move to
1929 SGPRn-3 for use as the FLAT
1930 SCRATCH SIZE in flat memory
1931 instructions. Having CP load
1932 it once avoids loading it at
1933 the beginning of every
1934 wavefront.
1935 GFX9
1936 This is the 64 bit base
1937 address of the per SPI
1938 scratch backing memory
1939 managed by SPI for the queue
1940 executing the kernel
1941 dispatch. CP obtains this
1942 from the runtime (and
1943 divides it if there are
1944 multiple Shader Arrays each
1945 with its own SPI). The value
1946 of Scratch Wave Offset must
1947 be added by the kernel
1948 machine code and moved to
1949 SGPRn-4 and SGPRn-3 for use
1950 as the FLAT SCRATCH BASE in
1951 flat memory instructions.
1952 then Private Segment Size 1 The 32 bit byte size of a
1953 (enable_sgpr_private single work-item’s scratch
1954 _segment_size) memory allocation. This is the
1955 value from the kernel dispatch
1956 packet Private Segment Byte
1957 Size rounded up by CP to a
1958 multiple of DWORD.
1959
1960 Having CP load it once avoids
1961 loading it at the beginning of
1962 every wavefront.
1963
1964 This is not used for
1965 GFX7-GFX8 since it is the same
1966 value as the second SGPR of
1967 Flat Scratch Init. However, it
1968 may be needed for GFX9 which
1969 changes the meaning of the
1970 Flat Scratch Init value.
1971 then Grid Work-Group Count X 1 32 bit count of the number of
1972 (enable_sgpr_grid work-groups in the X dimension
1973 _workgroup_count_X) for the grid being
1974 executed. Computed from the
1975 fields in the kernel dispatch
1976 packet as ((grid_size.x +
1977 workgroup_size.x - 1) /
1978 workgroup_size.x).
1979 then Grid Work-Group Count Y 1 32 bit count of the number of
1980 (enable_sgpr_grid work-groups in the Y dimension
1981 _workgroup_count_Y && for the grid being
1982 less than 16 previous executed. Computed from the
1983 SGPRs) fields in the kernel dispatch
1984 packet as ((grid_size.y +
1985 workgroup_size.y - 1) /
1986 workgroupSize.y).
1987
1988 Only initialized if <16
1989 previous SGPRs initialized.
1990 then Grid Work-Group Count Z 1 32 bit count of the number of
1991 (enable_sgpr_grid work-groups in the Z dimension
1992 _workgroup_count_Z && for the grid being
1993 less than 16 previous executed. Computed from the
1994 SGPRs) fields in the kernel dispatch
1995 packet as ((grid_size.z +
1996 workgroup_size.z - 1) /
1997 workgroupSize.z).
1998
1999 Only initialized if <16
2000 previous SGPRs initialized.
2001 then Work-Group Id X 1 32 bit work-group id in X
2002 (enable_sgpr_workgroup_id dimension of grid for
2003 _X) wavefront.
2004 then Work-Group Id Y 1 32 bit work-group id in Y
2005 (enable_sgpr_workgroup_id dimension of grid for
2006 _Y) wavefront.
2007 then Work-Group Id Z 1 32 bit work-group id in Z
2008 (enable_sgpr_workgroup_id dimension of grid for
2009 _Z) wavefront.
2010 then Work-Group Info 1 {first_wave, 14’b0000,
2011 (enable_sgpr_workgroup ordered_append_term[10:0],
2012 _info) threadgroup_size_in_waves[5:0]}
2013 then Scratch Wave Offset 1 32 bit byte offset from base
2014 (enable_sgpr_private of scratch base of queue
2015 _segment_wave_offset) executing the kernel
2016 dispatch. Must be used as an
2017 offset with Private
2018 segment address when using
2019 Scratch Segment Buffer. It
2020 must be used to set up FLAT
2021 SCRATCH for flat addressing
2022 (see
2023 :ref:`amdgpu-amdhsa-flat-scratch`).
2024 ========== ========================== ====== ==============================
2025
2026 The order of the VGPR registers is defined, but the compiler can specify which
2027 ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2028 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2029 for enabled registers are dense starting at VGPR0: the first enabled register is
2030 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2031 VGPR number.
2032
2033 VGPR register initial state is defined in
2034 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2035
2036 .. table:: VGPR Register Set Up Order
2037 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2038
2039 ========== ========================== ====== ==============================
2040 VGPR Order Name Number Description
2041 (kernel descriptor enable of
2042 field) VGPRs
2043 ========== ========================== ====== ==============================
2044 First Work-Item Id X 1 32 bit work item id in X
2045 (Always initialized) dimension of work-group for
2046 wavefront lane.
2047 then Work-Item Id Y 1 32 bit work item id in Y
2048 (enable_vgpr_workitem_id dimension of work-group for
2049 > 0) wavefront lane.
2050 then Work-Item Id Z 1 32 bit work item id in Z
2051 (enable_vgpr_workitem_id dimension of work-group for
2052 > 1) wavefront lane.
2053 ========== ========================== ====== ==============================
2054
2055 The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2056
2057 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2058 registers.
2059 2. Work-group Id registers X, Y, Z are set by ADC which supports any
2060 combination including none.
2061 3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2062 cannot included with the flat scratch init value which is per queue.
2063 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2064 or (X, Y, Z).
2065
2066 Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2067 value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2068
2069 The global segment can be accessed either using buffer instructions (GFX6 which
2070 has V# 64 bit address support), flat instructions (GFX7-9), or global
2071 instructions (GFX9).
2072
2073 If buffer operations are used then the compiler can generate a V# with the
2074 following properties:
2075
2076 * base address of 0
2077 * no swizzle
2078 * ATC: 1 if IOMMU present (such as APU)
2079 * ptr64: 1
2080 * MTYPE set to support memory coherence that matches the runtime (such as CC for
2081 APU and NC for dGPU).
2082
2083 .. _amdgpu-amdhsa-kernel-prolog:
2084
2085 Kernel Prolog
2086 ~~~~~~~~~~~~~
2087
2088 .. _amdgpu-amdhsa-m0:
2089
2090 M0
2091 ++
2092
2093 GFX6-GFX8
2094 The M0 register must be initialized with a value at least the total LDS size
2095 if the kernel may access LDS via DS or flat operations. Total LDS size is
2096 available in dispatch packet. For M0, it is also possible to use maximum
2097 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2098 GFX7-GFX8).
2099 GFX9
2100 The M0 register is not used for range checking LDS accesses and so does not
2101 need to be initialized in the prolog.
2102
2103 .. _amdgpu-amdhsa-flat-scratch:
2104
2105 Flat Scratch
2106 ++++++++++++
2107
2108 If the kernel may use flat operations to access scratch memory, the prolog code
2109 must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2110 are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2111 Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2112
2113 GFX6
2114 Flat scratch is not supported.
2115
2116 GFX7-8
2117 1. The low word of Flat Scratch Init is 32 bit byte offset from
2118 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2119 being managed by SPI for the queue executing the kernel dispatch. This is
2120 the same value used in the Scratch Segment Buffer V# base address. The
2121 prolog must add the value of Scratch Wave Offset to get the wave's byte
2122 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2123 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2124 by 8 before moving into FLAT_SCRATCH_LO.
2125 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2126 work-items scratch memory usage. This is directly loaded from the kernel
2127 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2128 DWORD. Having CP load it once avoids loading it at the beginning of every
2129 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2130 SIZE.
2131 GFX9
2132 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2133 memory being managed by SPI for the queue executing the kernel dispatch. The
2134 prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2135 pair for use as the flat scratch base in flat memory instructions.
2136
2137 .. _amdgpu-amdhsa-memory-model:
2138
2139 Memory Model
2140 ~~~~~~~~~~~~
2141
2142 This section describes the mapping of LLVM memory model onto AMDGPU machine code
2143 (see :ref:`memmodel`). *The implementation is WIP.*
2144
2145 .. TODO
2146 Update when implementation complete.
2147
2148 Support more relaxed OpenCL memory model to be controled by environment
2149 component of target triple.
2150
2151 The AMDGPU backend supports the memory synchronization scopes specified in
2152 :ref:`amdgpu-memory-scopes`.
2153
2154 The code sequences used to implement the memory model are defined in table
2155 :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2156
2157 The sequences specify the order of instructions that a single thread must
2158 execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2159 to other memory instructions executed by the same thread. This allows them to be
2160 moved earlier or later which can allow them to be combined with other instances
2161 of the same instruction, or hoisted/sunk out of loops to improve
2162 performance. Only the instructions related to the memory model are given;
2163 additional ``s_waitcnt`` instructions are required to ensure registers are
2164 defined before being used. These may be able to be combined with the memory
2165 model ``s_waitcnt`` instructions as described above.
2166
2167 The AMDGPU memory model supports both the HSA [HSA]_ memory model, and the
2168 OpenCL [OpenCL]_ memory model. The HSA memory model uses a single happens-before
2169 relation for all address spaces (see :ref:`amdgpu-address-spaces`). The OpenCL
2170 memory model which has separate happens-before relations for the global and
2171 local address spaces, and only a fence specifying both global and local address
2172 space joins the relationships. Since the LLVM ``memfence`` instruction does not
2173 allow an address space to be specified the OpenCL fence has to convervatively
2174 assume both local and global address space was specified. However, optimizations
2175 can often be done to eliminate the additional ``s_waitcnt``instructions when
2176 there are no intervening corresponding ``ds/flat_load/store/atomic`` memory
2177 instructions. The code sequences in the table indicate what can be omitted for
2178 the OpenCL memory. The target triple environment is used to determine if the
2179 source language is OpenCL (see :ref:`amdgpu-opencl`).
2180
2181 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2182 operations.
2183
2184 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
2185 termed vector memory operations.
2186
2187 For GFX6-GFX9:
2188
2189 * Each agent has multiple compute units (CU).
2190 * Each CU has multiple SIMDs that execute wavefronts.
2191 * The wavefronts for a single work-group are executed in the same CU but may be
2192 executed by different SIMDs.
2193 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
2194 executing on it.
2195 * All LDS operations of a CU are performed as wavefront wide operations in a
2196 global order and involve no caching. Completion is reported to a wavefront in
2197 execution order.
2198 * The LDS memory has multiple request queues shared by the SIMDs of a
2199 CU. Therefore, the LDS operations performed by different waves of a work-group
2200 can be reordered relative to each other, which can result in reordering the
2201 visibility of vector memory operations with respect to LDS operations of other
2202 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
2203 ensure synchonization between LDS operations and vector memory operations
2204 between waves of a work-group, but not between operations performed by the
2205 same wavefront.
2206 * The vector memory operations are performed as wavefront wide operations and
2207 completion is reported to a wavefront in execution order. The exception is
2208 that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
2209 vector memory order if they access LDS memory, and out of LDS operation order
2210 if they access global memory.
2211 * The vector memory operations access a vector L1 cache shared by all wavefronts
2212 on a CU. Therefore, no special action is required for coherence between
2213 wavefronts in the same work-group. A ``buffer_wbinvl1_vol`` is required for
2214 coherence between waves executing in different work-groups as they may be
2215 executing on different CUs.
2216 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
2217 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2218 scalar operations are used in a restricted way so do not impact the memory
2219 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2220 * The vector and scalar memory operations use an L2 cache shared by all CUs on
2221 the same agent.
2222 * The L2 cache has independent channels to service disjoint ranges of virtual
2223 addresses.
2224 * Each CU has a separate request queue per channel. Therefore, the vector and
2225 scalar memory operations performed by waves executing in different work-groups
2226 (which may be executing on different CUs) of an agent can be reordered
2227 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
2228 synchonization between vector memory operations of different CUs. It ensures a
2229 previous vector memory operation has completed before executing a subsequent
2230 vector memory or LDS operation and so can be used to meet the requirements of
2231 acquire and release.
2232 * The L2 cache can be kept coherent with other agents on some targets, or ranges
2233 of virtual addresses can be set up to bypass it to ensure system coherence.
2234
2235 Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-8),
2236 or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2237 memory, atomic memory orderings are not meaningful and all accesses are treated
2238 as non-atomic.
2239
2240 Constant address space uses ``buffer/global_load`` instructions (or equivalent
2241 scalar memory instructions). Since the constant address space contents do not
2242 change during the execution of a kernel dispatch it is not legal to perform
2243 stores, and atomic memory orderings are not meaningful and all access are
2244 treated as non-atomic.
2245
2246 A memory synchronization scope wider than work-group is not meaningful for the
2247 group (LDS) address space and is treated as work-group.
2248
2249 The memory model does not support the region address space which is treated as
2250 non-atomic.
2251
2252 Acquire memory ordering is not meaningful on store atomic instructions and is
2253 treated as non-atomic.
2254
2255 Release memory ordering is not meaningful on load atomic instructions and is
2256 treated a non-atomic.
2257
2258 Acquire-release memory ordering is not meaningful on load or store atomic
2259 instructions and is treated as acquire and release respectively.
2260
2261 AMDGPU backend only uses scalar memory operations to access memory that is
2262 proven to not change during the execution of the kernel dispatch. This includes
2263 constant address space and global address space for program scope const
2264 variables. Therefore the kernel machine code does not have to maintain the
2265 scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2266 and vector L1 caches are invalidated between kernel dispatches by CP since
2267 constant address space data may change between kernel dispatch executions. See
2268 :ref:`amdgpu-amdhsa-memory-spaces`.
2269
2270 The one exeception is if scalar writes are used to spill SGPR registers. In this
2271 case the AMDGPU backend ensures the memory location used to spill is never
2272 accessed by vector memory operations at the same time. If scalar writes are used
2273 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2274 return since the locations may be used for vector memory instructions by a
2275 future wave that uses the same scratch area, or a function call that creates a
2276 frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2277 as all scalar writes are write-before-read in the same thread.
2278
2279 Scratch backing memory (which is used for the private address space) is accessed
2280 with MTYPE NC_NV (non-coherenent non-volatile). Since the private address space
2281 is only accessed by a single thread, and is always write-before-read,
2282 there is never a need to invalidate these entries from the L1 cache. Hence all
2283 cache invalidates are done as ``*_vol`` to only invalidate the volatile cache
2284 lines.
2285
2286 On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2287 to invalidate the L2 cache. This also causes it to be treated as non-volatile
2288 and so is not invalidated by ``*_vol``. On APU it is accessed as CC (cache
2289 coherent) and so the L2 cache will coherent with the CPU and other agents.
2290
2291 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2292 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2293
2294 ============ ============ ============== ========== =======================
2295 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2296 Ordering Sync Scope Address
2297 Space
2298 ============ ============ ============== ========== =======================
2299 **Non-Atomic**
2300 ---------------------------------------------------------------------------
2301 load *none* *none* - global non-volatile
2302 - generic 1. buffer/global/flat_load
2303 volatile
2304 1. buffer/global/flat_load
2305 glc=1
2306 load *none* *none* - local 1. ds_load
2307 store *none* *none* - global 1. buffer/global/flat_store
2308 - generic
2309 store *none* *none* - local 1. ds_store
2310 **Unordered Atomic**
2311 ---------------------------------------------------------------------------
2312 load atomic unordered *any* *any* *Same as non-atomic*.
2313 store atomic unordered *any* *any* *Same as non-atomic*.
2314 atomicrmw unordered *any* *any* *Same as monotonic
2315 atomic*.
2316 **Monotonic Atomic**
2317 ---------------------------------------------------------------------------
2318 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2319 - wavefront - generic
2320 - workgroup
2321 load atomic monotonic - singlethread - local 1. ds_load
2322 - wavefront
2323 - workgroup
2324 load atomic monotonic - agent - global 1. buffer/global/flat_load
2325 - system - generic glc=1
2326 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2327 - wavefront - generic
2328 - workgroup
2329 - agent
2330 - system
2331 store atomic monotonic - singlethread - local 1. ds_store
2332 - wavefront
2333 - workgroup
2334 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2335 - wavefront - generic
2336 - workgroup
2337 - agent
2338 - system
2339 atomicrmw monotonic - singlethread - local 1. ds_atomic
2340 - wavefront
2341 - workgroup
2342 **Acquire Atomic**
2343 ---------------------------------------------------------------------------
2344 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2345 - wavefront - local
2346 - generic
2347 load atomic acquire - workgroup - global 1. buffer/global_load
2348 load atomic acquire - workgroup - local 1. ds/flat_load
2349 - generic 2. s_waitcnt lgkmcnt(0)
2350
2351 - If OpenCL, omit
2352 waitcnt.
2353 - Must happen before
2354 any following
2355 global/generic
2356 load/load
2357 atomic/store/store
2358 atomic/atomicrmw.
2359 - Ensures any
2360 following global
2361 data read is no
2362 older than the load
2363 atomic value being
2364 acquired.
2365
2366 load atomic acquire - agent - global 1. buffer/global_load
2367 - system glc=1
2368 2. s_waitcnt vmcnt(0)
2369
2370 - Must happen before
2371 following
2372 buffer_wbinvl1_vol.
2373 - Ensures the load
2374 has completed
2375 before invalidating
2376 the cache.
2377
2378 3. buffer_wbinvl1_vol
2379
2380 - Must happen before
2381 any following
2382 global/generic
2383 load/load
2384 atomic/atomicrmw.
2385 - Ensures that
2386 following
2387 loads will not see
2388 stale global data.
2389
2390 load atomic acquire - agent - generic 1. flat_load glc=1
2391 - system 2. s_waitcnt vmcnt(0) &
2392 lgkmcnt(0)
2393
2394 - If OpenCL omit
2395 lgkmcnt(0).
2396 - Must happen before
2397 following
2398 buffer_wbinvl1_vol.
2399 - Ensures the flat_load
2400 has completed
2401 before invalidating
2402 the cache.
2403
2404 3. buffer_wbinvl1_vol
2405
2406 - Must happen before
2407 any following
2408 global/generic
2409 load/load
2410 atomic/atomicrmw.
2411 - Ensures that
2412 following loads
2413 will not see stale
2414 global data.
2415
2416 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2417 - wavefront - local
2418 - generic
2419 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
2420 atomicrmw acquire - workgroup - local 1. ds/flat_atomic
2421 - generic 2. waitcnt lgkmcnt(0)
2422
2423 - If OpenCL, omit
2424 waitcnt.
2425 - Must happen before
2426 any following
2427 global/generic
2428 load/load
2429 atomic/store/store
2430 atomic/atomicrmw.
2431 - Ensures any
2432 following global
2433 data read is no
2434 older than the
2435 atomicrmw value
2436 being acquired.
2437
2438 atomicrmw acquire - agent - global 1. buffer/global_atomic
2439 - system 2. s_waitcnt vmcnt(0)
2440
2441 - Must happen before
2442 following
2443 buffer_wbinvl1_vol.
2444 - Ensures the
2445 atomicrmw has
2446 completed before
2447 invalidating the
2448 cache.
2449
2450 3. buffer_wbinvl1_vol
2451
2452 - Must happen before
2453 any following
2454 global/generic
2455 load/load
2456 atomic/atomicrmw.
2457 - Ensures that
2458 following loads
2459 will not see stale
2460 global data.
2461
2462 atomicrmw acquire - agent - generic 1. flat_atomic
2463 - system 2. s_waitcnt vmcnt(0) &
2464 lgkmcnt(0)
2465
2466 - If OpenCL, omit
2467 lgkmcnt(0).
2468 - Must happen before
2469 following
2470 buffer_wbinvl1_vol.
2471 - Ensures the
2472 atomicrmw has
2473 completed before
2474 invalidating the
2475 cache.
2476
2477 3. buffer_wbinvl1_vol
2478
2479 - Must happen before
2480 any following
2481 global/generic
2482 load/load
2483 atomic/atomicrmw.
2484 - Ensures that
2485 following loads
2486 will not see stale
2487 global data.
2488
2489 fence acquire - singlethread *none* *none*
2490 - wavefront
2491 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2492
2493 - If OpenCL and
2494 address space is
2495 not generic, omit
2496 waitcnt. However,
2497 since LLVM
2498 currently has no
2499 address space on
2500 the fence need to
2501 conservatively
2502 always generate. If
2503 fence had an
2504 address space then
2505 set to address
2506 space of OpenCL
2507 fence flag, or to
2508 generic if both
2509 local and global
2510 flags are
2511 specified.
2512 - Must happen after
2513 any preceding
2514 local/generic load
2515 atomic/atomicrmw
2516 with an equal or
2517 wider sync scope
2518 and memory ordering
2519 stronger than
2520 unordered (this is
2521 termed the
2522 fence-paired-atomic).
2523 - Must happen before
2524 any following
2525 global/generic
2526 load/load
2527 atomic/store/store
2528 atomic/atomicrmw.
2529 - Ensures any
2530 following global
2531 data read is no
2532 older than the
2533 value read by the
2534 fence-paired-atomic.
2535
2536 fence acquire - agent *none* 1. s_waitcnt vmcnt(0) &
2537 - system lgkmcnt(0)
2538
2539 - If OpenCL and
2540 address space is
2541 not generic, omit
2542 lgkmcnt(0).
2543 However, since LLVM
2544 currently has no
2545 address space on
2546 the fence need to
2547 conservatively
2548 always generate
2549 (see comment for
2550 previous fence).
2551 - Could be split into
2552 separate s_waitcnt
2553 vmcnt(0) and
2554 s_waitcnt
2555 lgkmcnt(0) to allow
2556 them to be
2557 independently moved
2558 according to the
2559 following rules.
2560 - s_waitcnt vmcnt(0)
2561 must happen after
2562 any preceding
2563 global/generic load
2564 atomic/atomicrmw
2565 with an equal or
2566 wider sync scope
2567 and memory ordering
2568 stronger than
2569 unordered (this is
2570 termed the
2571 fence-paired-atomic).
2572 - s_waitcnt lgkmcnt(0)
2573 must happen after
2574 any preceding
2575 group/generic load
2576 atomic/atomicrmw
2577 with an equal or
2578 wider sync scope
2579 and memory ordering
2580 stronger than
2581 unordered (this is
2582 termed the
2583 fence-paired-atomic).
2584 - Must happen before
2585 the following
2586 buffer_wbinvl1_vol.
2587 - Ensures that the
2588 fence-paired atomic
2589 has completed
2590 before invalidating
2591 the
2592 cache. Therefore
2593 any following
2594 locations read must
2595 be no older than
2596 the value read by
2597 the
2598 fence-paired-atomic.
2599
2600 2. buffer_wbinvl1_vol
2601
2602 - Must happen before
2603 any following global/generic
2604 load/load
2605 atomic/store/store
2606 atomic/atomicrmw.
2607 - Ensures that
2608 following loads
2609 will not see stale
2610 global data.
2611
2612 **Release Atomic**
2613 ---------------------------------------------------------------------------
2614 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2615 - wavefront - local
2616 - generic
2617 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2618 - generic
2619 - If OpenCL, omit
2620 waitcnt.
2621 - Must happen after
2622 any preceding
2623 local/generic
2624 load/store/load
2625 atomic/store
2626 atomic/atomicrmw.
2627 - Must happen before
2628 the following
2629 store.
2630 - Ensures that all
2631 memory operations
2632 to local have
2633 completed before
2634 performing the
2635 store that is being
2636 released.
2637
2638 2. buffer/global/flat_store
2639 store atomic release - workgroup - local 1. ds_store
2640 store atomic release - agent - global 1. s_waitcnt vmcnt(0) &
2641 - system - generic lgkmcnt(0)
2642
2643 - If OpenCL, omit
2644 lgkmcnt(0).
2645 - Could be split into
2646 separate s_waitcnt
2647 vmcnt(0) and
2648 s_waitcnt
2649 lgkmcnt(0) to allow
2650 them to be
2651 independently moved
2652 according to the
2653 following rules.
2654 - s_waitcnt vmcnt(0)
2655 must happen after
2656 any preceding
2657 global/generic
2658 load/store/load
2659 atomic/store
2660 atomic/atomicrmw.
2661 - s_waitcnt lgkmcnt(0)
2662 must happen after
2663 any preceding
2664 local/generic
2665 load/store/load
2666 atomic/store
2667 atomic/atomicrmw.
2668 - Must happen before
2669 the following
2670 store.
2671 - Ensures that all
2672 memory operations
2673 to global have
2674 completed before
2675 performing the
2676 store that is being
2677 released.
2678
2679 2. buffer/global/ds/flat_store
2680 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2681 - wavefront - local
2682 - generic
2683 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2684 - generic
2685 - If OpenCL, omit
2686 waitcnt.
2687 - Must happen after
2688 any preceding
2689 local/generic
2690 load/store/load
2691 atomic/store
2692 atomic/atomicrmw.
2693 - Must happen before
2694 the following
2695 atomicrmw.
2696 - Ensures that all
2697 memory operations
2698 to local have
2699 completed before
2700 performing the
2701 atomicrmw that is
2702 being released.
2703
2704 2. buffer/global/flat_atomic
2705 atomicrmw release - workgroup - local 1. ds_atomic
2706 atomicrmw release - agent - global 1. s_waitcnt vmcnt(0) &
2707 - system - generic lgkmcnt(0)
2708
2709 - If OpenCL, omit
2710 lgkmcnt(0).
2711 - Could be split into
2712 separate s_waitcnt
2713 vmcnt(0) and
2714 s_waitcnt
2715 lgkmcnt(0) to allow
2716 them to be
2717 independently moved
2718 according to the
2719 following rules.
2720 - s_waitcnt vmcnt(0)
2721 must happen after
2722 any preceding
2723 global/generic
2724 load/store/load
2725 atomic/store
2726 atomic/atomicrmw.
2727 - s_waitcnt lgkmcnt(0)
2728 must happen after
2729 any preceding
2730 local/generic
2731 load/store/load
2732 atomic/store
2733 atomic/atomicrmw.
2734 - Must happen before
2735 the following
2736 atomicrmw.
2737 - Ensures that all
2738 memory operations
2739 to global and local
2740 have completed
2741 before performing
2742 the atomicrmw that
2743 is being released.
2744
2745 2. buffer/global/ds/flat_atomic*
2746 fence release - singlethread *none* *none*
2747 - wavefront
2748 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2749
2750 - If OpenCL and
2751 address space is
2752 not generic, omit
2753 waitcnt. However,
2754 since LLVM
2755 currently has no
2756 address space on
2757 the fence need to
2758 conservatively
2759 always generate
2760 (see comment for
2761 previous fence).
2762 - Must happen after
2763 any preceding
2764 local/generic
2765 load/load
2766 atomic/store/store
2767 atomic/atomicrmw.
2768 - Must happen before
2769 any following store
2770 atomic/atomicrmw
2771 with an equal or
2772 wider sync scope
2773 and memory ordering
2774 stronger than
2775 unordered (this is
2776 termed the
2777 fence-paired-atomic).
2778 - Ensures that all
2779 memory operations
2780 to local have
2781 completed before
2782 performing the
2783 following
2784 fence-paired-atomic.
2785
2786 fence release - agent *none* 1. s_waitcnt vmcnt(0) &
2787 - system lgkmcnt(0)
2788
2789 - If OpenCL and
2790 address space is
2791 not generic, omit
2792 lgkmcnt(0).
2793 However, since LLVM
2794 currently has no
2795 address space on
2796 the fence need to
2797 conservatively
2798 always generate
2799 (see comment for
2800 previous fence).
2801 - Could be split into
2802 separate s_waitcnt
2803 vmcnt(0) and
2804 s_waitcnt
2805 lgkmcnt(0) to allow
2806 them to be
2807 independently moved
2808 according to the
2809 following rules.
2810 - s_waitcnt vmcnt(0)
2811 must happen after
2812 any preceding
2813 global/generic
2814 load/store/load
2815 atomic/store
2816 atomic/atomicrmw.
2817 - s_waitcnt lgkmcnt(0)
2818 must happen after
2819 any preceding
2820 local/generic
2821 load/store/load
2822 atomic/store
2823 atomic/atomicrmw.
2824 - Must happen before
2825 any following store
2826 atomic/atomicrmw
2827 with an equal or
2828 wider sync scope
2829 and memory ordering
2830 stronger than
2831 unordered (this is
2832 termed the
2833 fence-paired-atomic).
2834 - Ensures that all
2835 memory operations
2836 to global have
2837 completed before
2838 performing the
2839 following
2840 fence-paired-atomic.
2841
2842 **Acquire-Release Atomic**
2843 ---------------------------------------------------------------------------
2844 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
2845 - wavefront - local
2846 - generic
2847 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
2848
2849 - If OpenCL, omit
2850 waitcnt.
2851 - Must happen after
2852 any preceding
2853 local/generic
2854 load/store/load
2855 atomic/store
2856 atomic/atomicrmw.
2857 - Must happen before
2858 the following
2859 atomicrmw.
2860 - Ensures that all
2861 memory operations
2862 to local have
2863 completed before
2864 performing the
2865 atomicrmw that is
2866 being released.
2867
2868 2. buffer/global_atomic
2869 atomicrmw acq_rel - workgroup - local 1. ds_atomic
2870 2. s_waitcnt lgkmcnt(0)
2871
2872 - If OpenCL, omit
2873 waitcnt.
2874 - Must happen before
2875 any following
2876 global/generic
2877 load/load
2878 atomic/store/store
2879 atomic/atomicrmw.
2880 - Ensures any
2881 following global
2882 data read is no
2883 older than the load
2884 atomic value being
2885 acquired.
2886
2887 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2888
2889 - If OpenCL, omit
2890 waitcnt.
2891 - Must happen after
2892 any preceding
2893 local/generic
2894 load/store/load
2895 atomic/store
2896 atomic/atomicrmw.
2897 - Must happen before
2898 the following
2899 atomicrmw.
2900 - Ensures that all
2901 memory operations
2902 to local have
2903 completed before
2904 performing the
2905 atomicrmw that is
2906 being released.
2907
2908 2. flat_atomic
2909 3. s_waitcnt lgkmcnt(0)
2910
2911 - If OpenCL, omit
2912 waitcnt.
2913 - Must happen before
2914 any following
2915 global/generic
2916 load/load
2917 atomic/store/store
2918 atomic/atomicrmw.
2919 - Ensures any
2920 following global
2921 data read is no
2922 older than the load
2923 atomic value being
2924 acquired.
2925 atomicrmw acq_rel - agent - global 1. s_waitcnt vmcnt(0) &
2926 - system lgkmcnt(0)
2927
2928 - If OpenCL, omit
2929 lgkmcnt(0).
2930 - Could be split into
2931 separate s_waitcnt
2932 vmcnt(0) and
2933 s_waitcnt
2934 lgkmcnt(0) to allow
2935 them to be
2936 independently moved
2937 according to the
2938 following rules.
2939 - s_waitcnt vmcnt(0)
2940 must happen after
2941 any preceding
2942 global/generic
2943 load/store/load
2944 atomic/store
2945 atomic/atomicrmw.
2946 - s_waitcnt lgkmcnt(0)
2947 must happen after
2948 any preceding
2949 local/generic
2950 load/store/load
2951 atomic/store
2952 atomic/atomicrmw.
2953 - Must happen before
2954 the following
2955 atomicrmw.
2956 - Ensures that all
2957 memory operations
2958 to global have
2959 completed before
2960 performing the
2961 atomicrmw that is
2962 being released.
2963
2964 2. buffer/global_atomic
2965 3. s_waitcnt vmcnt(0)
2966
2967 - Must happen before
2968 following
2969 buffer_wbinvl1_vol.
2970 - Ensures the
2971 atomicrmw has
2972 completed before
2973 invalidating the
2974 cache.
2975
2976 4. buffer_wbinvl1_vol
2977
2978 - Must happen before
2979 any following
2980 global/generic
2981 load/load
2982 atomic/atomicrmw.
2983 - Ensures that
2984 following loads
2985 will not see stale
2986 global data.
2987
2988 atomicrmw acq_rel - agent - generic 1. s_waitcnt vmcnt(0) &
2989 - system lgkmcnt(0)
2990
2991 - If OpenCL, omit
2992 lgkmcnt(0).
2993 - Could be split into
2994 separate s_waitcnt
2995 vmcnt(0) and
2996 s_waitcnt
2997 lgkmcnt(0) to allow
2998 them to be
2999 independently moved
3000 according to the
3001 following rules.
3002 - s_waitcnt vmcnt(0)
3003 must happen after
3004 any preceding
3005 global/generic
3006 load/store/load
3007 atomic/store
3008 atomic/atomicrmw.
3009 - s_waitcnt lgkmcnt(0)
3010 must happen after
3011 any preceding
3012 local/generic
3013 load/store/load
3014 atomic/store
3015 atomic/atomicrmw.
3016 - Must happen before
3017 the following
3018 atomicrmw.
3019 - Ensures that all
3020 memory operations
3021 to global have
3022 completed before
3023 performing the
3024 atomicrmw that is
3025 being released.
3026
3027 2. flat_atomic
3028 3. s_waitcnt vmcnt(0) &
3029 lgkmcnt(0)
3030
3031 - If OpenCL, omit
3032 lgkmcnt(0).
3033 - Must happen before
3034 following
3035 buffer_wbinvl1_vol.
3036 - Ensures the
3037 atomicrmw has
3038 completed before
3039 invalidating the
3040 cache.
3041
3042 4. buffer_wbinvl1_vol
3043
3044 - Must happen before
3045 any following
3046 global/generic
3047 load/load
3048 atomic/atomicrmw.
3049 - Ensures that
3050 following loads
3051 will not see stale
3052 global data.
3053
3054 fence acq_rel - singlethread *none* *none*
3055 - wavefront
3056 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3057
3058 - If OpenCL and
3059 address space is
3060 not generic, omit
3061 waitcnt. However,
3062 since LLVM
3063 currently has no
3064 address space on
3065 the fence need to
3066 conservatively
3067 always generate
3068 (see comment for
3069 previous fence).
3070 - Must happen after
3071 any preceding
3072 local/generic
3073 load/load
3074 atomic/store/store
3075 atomic/atomicrmw.
3076 - Must happen before
3077 any following
3078 global/generic
3079 load/load
3080 atomic/store/store
3081 atomic/atomicrmw.
3082 - Ensures that all
3083 memory operations
3084 to local have
3085 completed before
3086 performing any
3087 following global
3088 memory operations.
3089 - Ensures that the
3090 preceding
3091 local/generic load
3092 atomic/atomicrmw
3093 with an equal or
3094 wider sync scope
3095 and memory ordering
3096 stronger than
3097 unordered (this is
3098 termed the
3099 fence-paired-atomic)
3100 has completed
3101 before following
3102 global memory
3103 operations. This
3104 satisfies the
3105 requirements of
3106 acquire.
3107 - Ensures that all
3108 previous memory
3109 operations have
3110 completed before a
3111 following
3112 local/generic store
3113 atomic/atomicrmw
3114 with an equal or
3115 wider sync scope
3116 and memory ordering
3117 stronger than
3118 unordered (this is
3119 termed the
3120 fence-paired-atomic).
3121 This satisfies the
3122 requirements of
3123 release.
3124
3125 fence acq_rel - agent *none* 1. s_waitcnt vmcnt(0) &
3126 - system lgkmcnt(0)
3127
3128 - If OpenCL and
3129 address space is
3130 not generic, omit
3131 lgkmcnt(0).
3132 However, since LLVM
3133 currently has no
3134 address space on
3135 the fence need to
3136 conservatively
3137 always generate
3138 (see comment for
3139 previous fence).
3140 - Could be split into
3141 separate s_waitcnt
3142 vmcnt(0) and
3143 s_waitcnt
3144 lgkmcnt(0) to allow
3145 them to be
3146 independently moved
3147 according to the
3148 following rules.
3149 - s_waitcnt vmcnt(0)
3150 must happen after
3151 any preceding
3152 global/generic
3153 load/store/load
3154 atomic/store
3155 atomic/atomicrmw.
3156 - s_waitcnt lgkmcnt(0)
3157 must happen after
3158 any preceding
3159 local/generic
3160 load/store/load
3161 atomic/store
3162 atomic/atomicrmw.
3163 - Must happen before
3164 the following
3165 buffer_wbinvl1_vol.
3166 - Ensures that the
3167 preceding
3168 global/local/generic
3169 load
3170 atomic/atomicrmw
3171 with an equal or
3172 wider sync scope
3173 and memory ordering
3174 stronger than
3175 unordered (this is
3176 termed the
3177 fence-paired-atomic)
3178 has completed
3179 before invalidating
3180 the cache. This
3181 satisfies the
3182 requirements of
3183 acquire.
3184 - Ensures that all
3185 previous memory
3186 operations have
3187 completed before a
3188 following
3189 global/local/generic
3190 store
3191 atomic/atomicrmw
3192 with an equal or
3193 wider sync scope
3194 and memory ordering
3195 stronger than
3196 unordered (this is
3197 termed the
3198 fence-paired-atomic).
3199 This satisfies the
3200 requirements of
3201 release.
3202
3203 2. buffer_wbinvl1_vol
3204
3205 - Must happen before
3206 any following
3207 global/generic
3208 load/load
3209