llvm.org GIT mirror llvm / 0a09220
[AMDGPU] Corrections to memory model description.

- Add description on nontemporal support.
- Correct OpenCL sequentially consistent and fence code sequences.
- Minor test cleanup.

Differential Revision: https://reviews.llvm.org/D39073

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@316131 91177308-0d34-0410-b5e6-96231b3b80d8

Tony Tye 1 year, 10 months ago
1 changed file(s) with 379 addition(s) and 177 deletion(s).
12391239 =================================== ============== ========= ==============
12401240
12411241 .. TODO
1242 Plan to remove the debug properties metadata.
1242 Plan to remove the debug properties metadata.
12431243
12441244 Kernel Dispatch
12451245 ~~~~~~~~~~~~~~~
14301430 .. table:: Kernel Descriptor for GFX6-GFX9
14311431 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
14321432
1433 ======= ======= =============================== ===========================
1433 ======= ======= =============================== ============================
14341434 Bits Size Field Name Description
1435 ======= ======= =============================== ===========================
1435 ======= ======= =============================== ============================
14361436 31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
14371437 address space memory
14381438 required for a work-group
14601460 97 1 bit IsXNACKEnabled Indicates if the generated
14611461 machine code is capable of
14621462 supporting XNACK.
1463 127:98 30 bits Reserved. Must be 0.
1463 127:98 30 bits Reserved, must be 0.
14641464 191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
14651465 negative) from base
14661466 address of kernel
14681468 entry point instruction
14691469 which must be 256 byte
14701470 aligned.
1471 383:192 24 Reserved. Must be 0.
1471 383:192 24 Reserved, must be 0.
14721472 bytes
14731473 415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
14741474 program settings used by
14761476 ``COMPUTE_PGM_RSRC1``
14771477 configuration
14781478 register. See
1479 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table`.
1479 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
14801480 447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
14811481 program settings used by
14821482 CP to set up
15081508 should always be 0.
15091509 457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
15101510 should always be 0.
1511 463:458 6 bits Reserved. Must be 0.
1512 511:464 6 Reserved. Must be 0.
1511 463:458 6 bits Reserved, must be 0.
1512 511:464 6 Reserved, must be 0.
15131513 bytes
15141514 512 **Total size 64 bytes.**
1515 ======= ===================================================================
1515 ======= ====================================================================
15161516
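The bit positions in the kernel descriptor table can be read back from a raw 64-byte blob. The following is a minimal sketch, not part of LLVM: the function name ``decode_kernel_descriptor`` is hypothetical, only the fields visible in this excerpt are decoded, and it assumes a little-endian layout in which bit 0 is the least significant bit of byte 0.

```python
import struct

KERNEL_DESCRIPTOR_SIZE = 64  # "Total size 64 bytes" per the table


def decode_kernel_descriptor(raw):
    """Decode a few kernel descriptor fields (hypothetical helper).

    Assumption: little-endian storage, bit 0 = LSB of byte 0.
    """
    if len(raw) != KERNEL_DESCRIPTOR_SIZE:
        raise ValueError("kernel descriptor must be exactly 64 bytes")

    # Bits 31:0 -> bytes 0..3
    (group_segment_fixed_size,) = struct.unpack_from("<I", raw, 0)
    # Bits 127:96 -> bytes 12..15; bit 97 is bit 1 of this word
    (word_96,) = struct.unpack_from("<I", raw, 12)
    is_xnack_enabled = bool((word_96 >> 1) & 1)
    reserved_127_98 = word_96 >> 2  # "Reserved, must be 0"
    # Bits 191:128 -> bytes 16..23, signed ("possibly negative")
    (entry_byte_offset,) = struct.unpack_from("<q", raw, 16)
    # Bits 415:384 -> bytes 48..51, bits 447:416 -> bytes 52..55
    (compute_pgm_rsrc1,) = struct.unpack_from("<I", raw, 48)
    (compute_pgm_rsrc2,) = struct.unpack_from("<I", raw, 52)

    return {
        "GroupSegmentFixedSize": group_segment_fixed_size,
        "IsXNACKEnabled": is_xnack_enabled,
        "Reserved127_98": reserved_127_98,
        "KernelCodeEntryByteOffset": entry_byte_offset,
        "ComputePgmRsrc1": compute_pgm_rsrc1,
        "ComputePgmRsrc2": compute_pgm_rsrc2,
    }
```

Reading ``KernelCodeEntryByteOffset`` as a signed 64-bit value follows the table's note that the offset may be negative.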
15171517 ..
15181518
15191519 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1520 :name: amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table
1520 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
15211521
15221522 ======= ======= =============================== ===========================================================================
15231523 Bits Size Field Name Description
15281528 specific:
15291529
15301530 GFX6-9
1531 roundup((max-vgpg + 1)
1532 / 4) - 1
1531 - max_vgpr 1..256
1532 - roundup((max_vgpr + 1)
1533 / 4) - 1
15331534
15341535 Used by CP to set up
15351536 ``COMPUTE_PGM_RSRC1.VGPRS``.
15391540 specific:
15401541
15411542 GFX6-8
1542 roundup((max-sgpg + 1)
1543 / 8) - 1
1543 - max_sgpr 1..112
1544 - roundup((max_sgpr + 1)
1545 / 8) - 1
15441546 GFX9
1545 roundup((max-sgpg + 1)
1546 / 16) - 1
1547 - max_sgpr 1..112
1548 - roundup((max_sgpr + 1)
1549 / 16) - 1
15471550
15481551 Includes the special SGPRs
15491552 for VCC, Flat Scratch (for
16271630 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
16281631 with DX10 clamp mode
16291632 enabled. Used by the vector
1630 ALU to force DX-10 style
1633 ALU to force DX10 style
16311634 treatment of NaN's (when
16321635 set, clamp NaN to zero,
16331636 otherwise pass NaN
16751678 CP is responsible for
16761679 filling in
16771680 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
1678 26 1 bit FP16_OVFL GFX6-8:
1679 Reserved. Must be 0.
1680 GFX9:
1681 Wavefront starts
1682 execution with specified
1683 fp16 overflow mode.
1684
1685 - If 0, then fp16
1686 overflow generates
1681 26 1 bit FP16_OVFL GFX6-8
1682 Reserved, must be 0.
1683 GFX9
1684 Wavefront starts execution
1685 with specified fp16 overflow
1686 mode.
1687
1688 - If 0, fp16 overflow generates
16871689 +/-INF values.
1688 - If 1, then fp16
1689 overflow that is the
1690 result of an +/-INF
1691 input value or divide
1692 by 0 generates a
1693 +/-INF, otherwise
1694 clamps computed
1695 overflow to +/-MAX_FP16
1696 as appropriate.
1690 - If 1, fp16 overflow that is the
1691 result of an +/-INF input value
1692 or divide by 0 produces a +/-INF,
1693 otherwise clamps computed
1694 overflow to +/-MAX_FP16 as
1695 appropriate.
16971696
16981697 Used by CP to set up
16991698 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
1700 31:27 5 bits Reserved. Must be 0.
1699 31:27 5 bits Reserved, must be 0.
17011700 32 **Total size 4 bytes**
17021701 ======= ===================================================================================================================
17031702
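The granulated register count formulas in the table above all share one shape and differ only in the granule size. The sketch below transcribes that formula as written; the helper name ``granulated_count`` is hypothetical, and the text here does not settle whether ``max_vgpr``/``max_sgpr`` denote a register count or the highest register index, so treat the inputs as an assumption.

```python
def granulated_count(max_reg, granule):
    """Literal transcription of roundup((max_reg + 1) / granule) - 1.

    Granule sizes taken from the table:
      VGPRs, GFX6-9: 4
      SGPRs, GFX6-8: 8
      SGPRs, GFX9:   16
    """
    if max_reg < 0 or granule <= 0:
        raise ValueError("max_reg must be >= 0 and granule must be > 0")
    # Integer ceiling division avoids floating point rounding.
    rounded_up = -(-(max_reg + 1) // granule)
    return rounded_up - 1
```

For instance, ``granulated_count(111, 8)`` and ``granulated_count(111, 16)`` apply the GFX6-8 and GFX9 SGPR granules to the same input.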
18541853 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
18551854 _ZERO (rcp_iflag_f32 instruction
18561855 only)
1857 31 1 bit Reserved. Must be 0.
1856 31 1 bit Reserved, must be 0.
18581857 32 **Total size 4 bytes.**
18591858 ======= ===================================================================================================================
18601859
22442243 .. TODO
22452244 Update when implementation complete.
22462245
2247 Support more relaxed OpenCL memory model to be controlled by environment
2248 component of target triple.
2249
22502246 The AMDGPU backend supports the memory synchronization scopes specified in
22512247 :ref:`amdgpu-memory-scopes`.
22522248
22632259 defined before being used. These may be able to be combined with the memory
22642260 model ``s_waitcnt`` instructions as described above.
22652261
2266 The AMDGPU memory model supports both the HSA [HSA]_ memory model, and the
2267 OpenCL [OpenCL]_ memory model. The HSA memory model uses a single happens-before
2268 relation for all address spaces (see :ref:`amdgpu-address-spaces`). The OpenCL
2269 memory model which has separate happens-before relations for the global and
2270 local address spaces, and only a fence specifying both global and local address
2271 space joins the relationships. Since the LLVM ``memfence`` instruction does not
2272 allow an address space to be specified the OpenCL fence has to convervatively
2273 assume both local and global address space was specified. However, optimizations
2274 can often be done to eliminate the additional ``s_waitcnt``instructions when
2275 there are no intervening corresponding ``ds/flat_load/store/atomic`` memory
2276 instructions. The code sequences in the table indicate what can be omitted for
2277 the OpenCL memory. The target triple environment is used to determine if the
2278 source language is OpenCL (see :ref:`amdgpu-opencl`).
2262 The AMDGPU backend supports the following memory models:
2263
2264 HSA Memory Model [HSA]_
2265 The HSA memory model uses a single happens-before relation for all address
2266 spaces (see :ref:`amdgpu-address-spaces`).
2267 OpenCL Memory Model [OpenCL]_
2268 The OpenCL memory model has separate happens-before relations for the
2269 global and local address spaces. Only a fence specifying both global and
2270 local address space, and seq_cst instructions join the relationships. Since
2271 the LLVM ``memfence`` instruction does not allow an address space to be
2272 specified the OpenCL fence has to conservatively assume both local and
2273 global address spaces were specified. However, optimizations can often be
2274 done to eliminate the additional ``s_waitcnt`` instructions when there are
2275 no intervening memory instructions which access the corresponding address
2276 space. The code sequences in the table indicate what can be omitted for the
2277 OpenCL memory model. The target triple environment is used to determine if the
2278 source language is OpenCL (see :ref:`amdgpu-opencl`).
22792279
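The conservative fence handling described for the OpenCL memory model can be illustrated with the workgroup-scope release fence row of the code sequence table. This is only a sketch under stated assumptions: both function names are hypothetical, address spaces are modelled as plain strings, and ``None`` stands for the current situation in which the LLVM fence carries no address space.

```python
def opencl_fence_address_space(local_flag, global_flag):
    """Address space a fence would carry if LLVM fences had one.

    Per the table: the address space of the OpenCL fence flag,
    or generic if both local and global flags are specified.
    """
    if local_flag and global_flag:
        return "generic"
    if local_flag:
        return "local"
    if global_flag:
        return "global"
    raise ValueError("at least one fence flag must be specified")


def workgroup_release_fence_needs_lgkmcnt(is_opencl, address_space=None):
    """True if 's_waitcnt lgkmcnt(0)' must be emitted for the fence."""
    if not is_opencl:
        return True  # HSA: single happens-before relation, always wait
    if address_space is None:
        return True  # no address space on the fence: assume both
    # OpenCL with a known, non-generic address space may omit the wait.
    return address_space == "generic"
```

With no address space available the second function always returns ``True``, which is the conservative behaviour the text describes.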
22802280 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
22812281 operations.
23072307 that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
23082308 vector memory order if they access LDS memory, and out of LDS operation order
23092309 if they access global memory.
2310 * The vector memory operations access a vector L1 cache shared by all wavefronts
2311 on a CU. Therefore, no special action is required for coherence between
2312 wavefronts in the same work-group. A ``buffer_wbinvl1_vol`` is required for
2313 coherence between waves executing in different work-groups as they may be
2314 executing on different CUs.
2310 * The vector memory operations access a single vector L1 cache shared by all
2311 SIMDs on a CU. Therefore, no special action is required for coherence between the
2312 lanes of a single wavefront, or for coherence between wavefronts in the same
2313 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
2314 executing in different work-groups as they may be executing on different CUs.
23152315 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
23162316 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
23172317 scalar operations are used in a restricted way so do not impact the memory
23752375 frame at the same address, respectively. There is no need for a ``s_dcache_inv``
23762376 as all scalar writes are write-before-read in the same thread.
23772377
2378 Scratch backing memory (which is used for the private address space) is accessed
2379 with MTYPE NC_NV (non-coherenent non-volatile). Since the private address space
2380 is only accessed by a single thread, and is always write-before-read,
2381 there is never a need to invalidate these entries from the L1 cache. Hence all
2382 cache invalidates are done as ``*_vol`` to only invalidate the volatile cache
2383 lines.
2378 Scratch backing memory (which is used for the private address space)
2379 is accessed with MTYPE NC_NV (non-coherent non-volatile). Since the private
2380 address space is only accessed by a single thread, and is always
2381 write-before-read, there is never a need to invalidate these entries from the L1
2382 cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2383 volatile cache lines.
23842384
23852385 On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2386 to invalidate the L2 cache. This also causes it to be treated as non-volatile
2387 and so is not invalidated by ``*_vol``. On APU it is accessed as CC (cache
2388 coherent) and so the L2 cache will coherent with the CPU and other agents.
2386 to invalidate the L2 cache. This also causes it to be treated as
2387 non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2388 (cache coherent) and so the L2 cache will be coherent with the CPU and other
2389 agents.
23892390
23902391 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
23912392 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
23922393
2393 ============ ============ ============== ========== =======================
2394 ============ ============ ============== ========== ===============================
23942395 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
23952396 Ordering Sync Scope Address
23962397 Space
2397 ============ ============ ============== ========== =======================
2398 ============ ============ ============== ========== ===============================
23982399 **Non-Atomic**
2399 ---------------------------------------------------------------------------
2400 load *none* *none* - global non-volatile
2401 - generic 1. buffer/global/flat_load
2402 volatile
2400 -----------------------------------------------------------------------------------
2401 load *none* *none* - global - !volatile & !nontemporal
2402 - generic
2403 - private 1. buffer/global/flat_load
2404 - constant
2405 - volatile & !nontemporal
2406
24032407 1. buffer/global/flat_load
24042408 glc=1
2409
2410 - nontemporal
2411
2412 1. buffer/global/flat_load
2413 glc=1 slc=1
2414
24052415 load *none* *none* - local 1. ds_load
2406 store *none* *none* - global 1. buffer/global/flat_store
2416 store *none* *none* - global - !nontemporal
24072417 - generic
2418 - private 1. buffer/global/flat_store
2419 - constant
2420 - nontemporal
2421
2422 1. buffer/global/flat_store
2423 glc=1 slc=1
2424
24082425 store *none* *none* - local 1. ds_store
24092426 **Unordered Atomic**
2410 ---------------------------------------------------------------------------
2427 -----------------------------------------------------------------------------------
24112428 load atomic unordered *any* *any* *Same as non-atomic*.
24122429 store atomic unordered *any* *any* *Same as non-atomic*.
24132430 atomicrmw unordered *any* *any* *Same as monotonic
24142431 atomic*.
24152432 **Monotonic Atomic**
2416 ---------------------------------------------------------------------------
2433 -----------------------------------------------------------------------------------
24172434 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
24182435 - wavefront - generic
24192436 - workgroup
24392456 - wavefront
24402457 - workgroup
24412458 **Acquire Atomic**
2442 ---------------------------------------------------------------------------
2459 -----------------------------------------------------------------------------------
24432460 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
24442461 - wavefront - local
24452462 - generic
2446 load atomic acquire - workgroup - global 1. buffer/global_load
2447 load atomic acquire - workgroup - local 1. ds/flat_load
2448 - generic 2. s_waitcnt lgkmcnt(0)
2449
2450 - If OpenCL, omit
2451 waitcnt.
2463 load atomic acquire - workgroup - global 1. buffer/global/flat_load
2464 load atomic acquire - workgroup - local 1. ds_load
2465 2. s_waitcnt lgkmcnt(0)
2466
2467 - If OpenCL, omit.
24522468 - Must happen before
24532469 any following
24542470 global/generic
24612477 older than the load
24622478 atomic value being
24632479 acquired.
2464
2465 load atomic acquire - agent - global 1. buffer/global_load
2480 load atomic acquire - workgroup - generic 1. flat_load
2481 2. s_waitcnt lgkmcnt(0)
2482
2483 - If OpenCL, omit.
2484 - Must happen before
2485 any following
2486 global/generic
2487 load/load
2488 atomic/store/store
2489 atomic/atomicrmw.
2490 - Ensures any
2491 following global
2492 data read is no
2493 older than the load
2494 atomic value being
2495 acquired.
2496 load atomic acquire - agent - global 1. buffer/global/flat_load
24662497 - system glc=1
24672498 2. s_waitcnt vmcnt(0)
24682499
25152546 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
25162547 - wavefront - local
25172548 - generic
2518 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
2519 atomicrmw acquire - workgroup - local 1. ds/flat_atomic
2520 - generic 2. waitcnt lgkmcnt(0)
2521
2522 - If OpenCL, omit
2523 waitcnt.
2549 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
2550 atomicrmw acquire - workgroup - local 1. ds_atomic
2551 2. s_waitcnt lgkmcnt(0)
2552
2553 - If OpenCL, omit.
25242554 - Must happen before
25252555 any following
25262556 global/generic
25342564 atomicrmw value
25352565 being acquired.
25362566
2537 atomicrmw acquire - agent - global 1. buffer/global_atomic
2567 atomicrmw acquire - workgroup - generic 1. flat_atomic
2568 2. s_waitcnt lgkmcnt(0)
2569
2570 - If OpenCL, omit.
2571 - Must happen before
2572 any following
2573 global/generic
2574 load/load
2575 atomic/store/store
2576 atomic/atomicrmw.
2577 - Ensures any
2578 following global
2579 data read is no
2580 older than the
2581 atomicrmw value
2582 being acquired.
2583
2584 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
25382585 - system 2. s_waitcnt vmcnt(0)
25392586
25402587 - Must happen before
25912638
25922639 - If OpenCL and
25932640 address space is
2594 not generic, omit
2595 waitcnt. However,
2596 since LLVM
2641 not generic, omit.
2642 - However, since LLVM
25972643 currently has no
25982644 address space on
25992645 the fence need to
26322678 value read by the
26332679 fence-paired-atomic.
26342680
2635 fence acquire - agent *none* 1. s_waitcnt vmcnt(0) &
2636 - system lgkmcnt(0)
2681 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
2682 - system vmcnt(0)
26372683
26382684 - If OpenCL and
26392685 address space is
26402686 not generic, omit
26412687 lgkmcnt(0).
2642 However, since LLVM
2688 - However, since LLVM
26432689 currently has no
26442690 address space on
26452691 the fence need to
26712717 - s_waitcnt lgkmcnt(0)
26722718 must happen after
26732719 any preceding
2674 group/generic load
2720 local/generic load
26752721 atomic/atomicrmw
26762722 with an equal or
26772723 wider sync scope
26982744
26992745 2. buffer_wbinvl1_vol
27002746
2701 - Must happen before
2702 any following global/generic
2747 - Must happen before any
2748 following global/generic
27032749 load/load
27042750 atomic/store/store
27052751 atomic/atomicrmw.
27092755 global data.
27102756
27112757 **Release Atomic**
2712 ---------------------------------------------------------------------------
2758 -----------------------------------------------------------------------------------
27132759 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
27142760 - wavefront - local
27152761 - generic
27162762 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2717 - generic
2718 - If OpenCL, omit
2719 waitcnt.
2763
2764 - If OpenCL, omit.
27202765 - Must happen after
27212766 any preceding
27222767 local/generic
27362781
27372782 2. buffer/global/flat_store
27382783 store atomic release - workgroup - local 1. ds_store
2739 store atomic release - agent - global 1. s_waitcnt vmcnt(0) &
2740 - system - generic lgkmcnt(0)
2784 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2785
2786 - If OpenCL, omit.
2787 - Must happen after
2788 any preceding
2789 local/generic
2790 load/store/load
2791 atomic/store
2792 atomic/atomicrmw.
2793 - Must happen before
2794 the following
2795 store.
2796 - Ensures that all
2797 memory operations
2798 to local have
2799 completed before
2800 performing the
2801 store that is being
2802 released.
2803
2804 2. flat_store
2805 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
2806 - system - generic vmcnt(0)
27412807
27422808 - If OpenCL, omit
27432809 lgkmcnt(0).
27692835 store.
27702836 - Ensures that all
27712837 memory operations
2772 to global have
2838 to memory have
27732839 completed before
27742840 performing the
27752841 store that is being
27802846 - wavefront - local
27812847 - generic
27822848 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2783 - generic
2784 - If OpenCL, omit
2785 waitcnt.
2849
2850 - If OpenCL, omit.
27862851 - Must happen after
27872852 any preceding
27882853 local/generic
28022867
28032868 2. buffer/global/flat_atomic
28042869 atomicrmw release - workgroup - local 1. ds_atomic
2805 atomicrmw release - agent - global 1. s_waitcnt vmcnt(0) &
2806 - system - generic lgkmcnt(0)
2870 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2871
2872 - If OpenCL, omit.
2873 - Must happen after
2874 any preceding
2875 local/generic
2876 load/store/load
2877 atomic/store
2878 atomic/atomicrmw.
2879 - Must happen before
2880 the following
2881 atomicrmw.
2882 - Ensures that all
2883 memory operations
2884 to local have
2885 completed before
2886 performing the
2887 atomicrmw that is
2888 being released.
2889
2890 2. flat_atomic
2891 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
2892 - system - generic vmcnt(0)
28072893
28082894 - If OpenCL, omit
28092895 lgkmcnt(0).
28412927 the atomicrmw that
28422928 is being released.
28432929
2844 2. buffer/global/ds/flat_atomic*
2930 2. buffer/global/ds/flat_atomic
28452931 fence release - singlethread *none* *none*
28462932 - wavefront
28472933 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
28482934
28492935 - If OpenCL and
28502936 address space is
2851 not generic, omit
2852 waitcnt. However,
2853 since LLVM
2937 not generic, omit.
2938 - However, since LLVM
28542939 currently has no
28552940 address space on
28562941 the fence need to
28572942 conservatively
2858 always generate
2859 (see comment for
2860 previous fence).
2943 always generate. If
2944 fence had an
2945 address space then
2946 set to address
2947 space of OpenCL
2948 fence flag, or to
2949 generic if both
2950 local and global
2951 flags are
2952 specified.
28612953 - Must happen after
28622954 any preceding
28632955 local/generic
28822974 following
28832975 fence-paired-atomic.
28842976
2885 fence release - agent *none* 1. s_waitcnt vmcnt(0) &
2886 - system lgkmcnt(0)
2977 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
2978 - system vmcnt(0)
28872979
28882980 - If OpenCL and
28892981 address space is
28902982 not generic, omit
28912983 lgkmcnt(0).
2892 However, since LLVM
2984 - If OpenCL and
2985 address space is
2986 local, omit
2987 vmcnt(0).
2988 - However, since LLVM
28932989 currently has no
28942990 address space on
28952991 the fence need to
28962992 conservatively
2897 always generate
2898 (see comment for
2899 previous fence).
2993 always generate. If
2994 fence had an
2995 address space then
2996 set to address
2997 space of OpenCL
2998 fence flag, or to
2999 generic if both
3000 local and global
3001 flags are
3002 specified.
29003003 - Could be split into
29013004 separate s_waitcnt
29023005 vmcnt(0) and
29323035 fence-paired-atomic).
29333036 - Ensures that all
29343037 memory operations
2935 to global have
3038 have
29363039 completed before
29373040 performing the
29383041 following
29393042 fence-paired-atomic.
29403043
29413044 **Acquire-Release Atomic**
2942 ---------------------------------------------------------------------------
3045 -----------------------------------------------------------------------------------
29433046 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
29443047 - wavefront - local
29453048 - generic
29463049 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
29473050
2948 - If OpenCL, omit
2949 waitcnt.
3051 - If OpenCL, omit.
29503052 - Must happen after
29513053 any preceding
29523054 local/generic
29643066 atomicrmw that is
29653067 being released.
29663068
2967 2. buffer/global_atomic
3069 2. buffer/global/flat_atomic
29683070 atomicrmw acq_rel - workgroup - local 1. ds_atomic
29693071 2. s_waitcnt lgkmcnt(0)
29703072
2971 - If OpenCL, omit
2972 waitcnt.
3073 - If OpenCL, omit.
29733074 - Must happen before
29743075 any following
29753076 global/generic
29853086
29863087 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
29873088
2988 - If OpenCL, omit
2989 waitcnt.
3089 - If OpenCL, omit.
29903090 - Must happen after
29913091 any preceding
29923092 local/generic
30073107 2. flat_atomic
30083108 3. s_waitcnt lgkmcnt(0)
30093109
3010 - If OpenCL, omit
3011 waitcnt.
3110 - If OpenCL, omit.
30123111 - Must happen before
30133112 any following
30143113 global/generic
30213120 older than the load
30223121 atomic value being
30233122 acquired.
3024 atomicrmw acq_rel - agent - global 1. s_waitcnt vmcnt(0) &
3025 - system lgkmcnt(0)
3123
3124 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
3125 - system vmcnt(0)
30263126
30273127 - If OpenCL, omit
30283128 lgkmcnt(0).
30603160 atomicrmw that is
30613161 being released.
30623162
3063 2. buffer/global_atomic
3163 2. buffer/global/flat_atomic
30643164 3. s_waitcnt vmcnt(0)
30653165
30663166 - Must happen before
30843184 will not see stale
30853185 global data.
30863186
3087 atomicrmw acq_rel - agent - generic 1. s_waitcnt vmcnt(0) &
3088 - system lgkmcnt(0)
3187 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
3188 - system vmcnt(0)
30893189
30903190 - If OpenCL, omit
30913191 lgkmcnt(0).
31563256
31573257 - If OpenCL and
31583258 address space is
3159 not generic, omit
3160 waitcnt. However,
3259 not generic, omit.
3260 - However,
31613261 since LLVM
31623262 currently has no
31633263 address space on
31953295 stronger than
31963296 unordered (this is
31973297 termed the
3198 fence-paired-atomic)
3199 has completed
3298 acquire-fence-paired-atomic
3299 ) has completed
32003300 before following
32013301 global memory
32023302 operations. This
32163316 stronger than
32173317 unordered (this is
32183318 termed the
3219 fence-paired-atomic).
3220 This satisfies the
3319 release-fence-paired-atomic
3320 ). This satisfies the
32213321 requirements of
32223322 release.
32233323
3224 fence acq_rel - agent *none* 1. s_waitcnt vmcnt(0) &
3225 - system lgkmcnt(0)
3324 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
3325 - system vmcnt(0)
32263326
32273327 - If OpenCL and
32283328 address space is
32293329 not generic, omit
32303330 lgkmcnt(0).
3231 However, since LLVM
3331 - However, since LLVM
32323332 currently has no
32333333 address space on
32343334 the fence need to
32733373 stronger than
32743374 unordered (this is
32753375 termed the
3276 fence-paired-atomic)
3277 has completed
3376 acquire-fence-paired-atomic
3377 ) has completed
32783378 before invalidating
32793379 the cache. This
32803380 satisfies the
32943394 stronger than
32953395 unordered (this is
32963396 termed the
3297 fence-paired-atomic).
3298 This satisfies the
3397 release-fence-paired-atomic
3398 ). This satisfies the
32993399 requirements of
33003400 release.
33013401
33163416 acquire.
33173417
33183418 **Sequential Consistent Atomic**
3319 ---------------------------------------------------------------------------
3419 -----------------------------------------------------------------------------------
33203420 load atomic seq_cst - singlethread - global *Same as corresponding
3321 - wavefront - local load atomic acquire*.
3322 - workgroup - generic
3323 load atomic seq_cst - agent - global 1. s_waitcnt vmcnt(0)
3324 - system - local
3325 - generic - Must happen after
3421 - wavefront - local load atomic acquire,
3422 - generic except must generate
3423 all instructions even
3424 for OpenCL.*
3425 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
3426 - generic
3427 - Must
3428 happen after
3429 preceding
3430 global/generic load
3431 atomic/store
3432 atomic/atomicrmw
3433 with memory
3434 ordering of seq_cst
3435 and with equal or
3436 wider sync scope.
3437 (Note that seq_cst
3438 fences have their
3439 own s_waitcnt
3440 lgkmcnt(0) and so do
3441 not need to be
3442 considered.)
3443 - Ensures any
3444 preceding
3445 sequential
3446 consistent local
3447 memory instructions
3448 have completed
3449 before executing
3450 this sequentially
3451 consistent
3452 instruction. This
3453 prevents reordering
3454 a seq_cst store
3455 followed by a
3456 seq_cst load. (Note
3457 that seq_cst is
3458 stronger than
3459 acquire/release as
3460 the reordering of
3461 load acquire
3462 followed by a store
3463 release is
3464 prevented by the
3465 waitcnt of
3466 the release, but
3467 there is nothing
3468 preventing a store
3469 release followed by
3470 load acquire from
3471 completing out of
3472 order.)
3473
3474 2. *Following
3475 instructions same as
3476 corresponding load
3477 atomic acquire,
3478 except must generate
3479 all instructions even
3480 for OpenCL.*
3481 load atomic seq_cst - workgroup - local *Same as corresponding
3482 load atomic acquire,
3483 except must generate
3484 all instructions even
3485 for OpenCL.*
3486 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
3487 - system - generic vmcnt(0)
3488
3489 - Could be split into
3490 separate s_waitcnt
3491 vmcnt(0)
3492 and s_waitcnt
3493 lgkmcnt(0) to allow
3494 them to be
3495 independently moved
3496 according to the
3497 following rules.
3498 - waitcnt lgkmcnt(0)
3499 must happen after
3500 preceding
3501 global/generic load
3502 atomic/store
3503 atomic/atomicrmw
3504 with memory
3505 ordering of seq_cst
3506 and with equal or
3507 wider sync scope.
3508 (Note that seq_cst
3509 fences have their
3510 own s_waitcnt
3511 lgkmcnt(0) and so do
3512 not need to be
3513 considered.)
3514 - waitcnt vmcnt(0)
3515 must happen after
33263516 preceding
33273517 global/generic load
33283518 atomic/store
33503540 prevents reordering
33513541 a seq_cst store
33523542 followed by a
3353 seq_cst load (Note
3543 seq_cst load. (Note
33543544 that seq_cst is
33553545 stronger than
33563546 acquire/release as
33593549 followed by a store
33603550 release is
33613551 prevented by the
3362 waitcnt vmcnt(0) of
3552 waitcnt of
33633553 the release, but
33643554 there is nothing
33653555 preventing a store
33713561 2. *Following
33723562 instructions same as
33733563 corresponding load
3374 atomic acquire*.
3375
3564 atomic acquire,
3565 except must generate
3566 all instructions even
3567 for OpenCL.*
33763568 store atomic seq_cst - singlethread - global *Same as corresponding
3377 - wavefront - local store atomic release*.
3378 - workgroup - generic
3569 - wavefront - local store atomic release,
3570 - workgroup - generic except must generate
3571 all instructions even
3572 for OpenCL.*
33793573 store atomic seq_cst - agent - global *Same as corresponding
3380 - system - generic store atomic release*.
3574 - system - generic store atomic release,
3575 except must generate
3576 all instructions even
3577 for OpenCL.*
33813578 atomicrmw seq_cst - singlethread - global *Same as corresponding
3382 - wavefront - local atomicrmw acq_rel*.
3383 - workgroup - generic
3579 - wavefront - local atomicrmw acq_rel,
3580 - workgroup - generic except must generate
3581 all instructions even
3582 for OpenCL.*
33843583 atomicrmw seq_cst - agent - global *Same as corresponding
3385 - system - generic atomicrmw acq_rel*.
3584 - system - generic atomicrmw acq_rel,
3585 except must generate
3586 all instructions even
3587 for OpenCL.*
33863588 fence seq_cst - singlethread *none* *Same as corresponding
3387 - wavefront fence acq_rel*.
3388 - workgroup
3389 - agent
3390 - system
3391 ============ ============ ============== ========== =======================
3589 - wavefront fence acq_rel,
3590 - workgroup except must generate
3591 - agent all instructions even
3592 - system for OpenCL.*
3593 ============ ============ ============== ========== ===============================
33923594
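The non-atomic rows of the table above choose the ``glc`` and ``slc`` bits from the ``volatile`` and ``nontemporal`` properties of the access. The following sketch restates those rows for the global, generic, private and constant address spaces (local accesses use ``ds_load``/``ds_store`` and are not covered). The function names are hypothetical and the return value is only the modifier text, not real compiler output.

```python
def non_atomic_load_modifiers(volatile, nontemporal):
    """Cache modifiers for buffer/global/flat_load per the table."""
    if nontemporal:
        return "glc=1 slc=1"
    if volatile:
        return "glc=1"
    return ""  # !volatile & !nontemporal: plain load


def non_atomic_store_modifiers(nontemporal):
    """Cache modifiers for buffer/global/flat_store per the table."""
    if nontemporal:
        return "glc=1 slc=1"
    return ""  # !nontemporal: plain store
```

The table lists the nontemporal load row without a volatile qualifier, so the sketch checks ``nontemporal`` first; that ordering is an interpretation of the table rather than something it states outright.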
33933595 The memory order also adds the single thread optimization constraints defined in
33943596 table
37984000 - *kernel_code_entry_byte_offset* defaults to 256.
37994001 - *wavefront_size* defaults to 6.
38004002 - *kernarg_segment_alignment*, *group_segment_alignment*, and
3801 *private_segment_alignment* default to 4. Note that alignments are specified
4003 *private_segment_alignment* default to 4. Note that alignments are specified
38024004 as a power of two, so a value of **n** means an alignment of 2^ **n**.
38034005
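Since the alignment fields hold an exponent rather than the alignment itself, a small conversion makes the default concrete. The helper name ``alignment_from_log2`` is hypothetical and is shown only to illustrate the 2^ **n** rule stated above.

```python
def alignment_from_log2(n):
    """Return the alignment 2**n encoded by a *_segment_alignment value n."""
    if n < 0:
        raise ValueError("alignment exponent must be non-negative")
    return 1 << n


# The documented default of 4 therefore encodes an alignment of 2**4.
DEFAULT_SEGMENT_ALIGNMENT = alignment_from_log2(4)
```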
38044006 The *.amd_kernel_code_t* directive must be placed immediately after the