llvm.org GIT mirror llvm / c49e383
[llvm-mca][docs] Always use `llvm-mca` in place of `MCA`.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@338394 91177308-0d34-0410-b5e6-96231b3b80d8

Andrea Di Biagio 1 year, 2 months ago
1 changed file(s) with 49 addition(s) and 52 deletion(s).
206206 :program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
207207 to standard error, and the tool returns 1.
208208
209 HOW MCA WORKS
210 -------------
211
212 MCA takes assembly code as input. The assembly code is parsed into a sequence
213 of MCInst with the help of the existing LLVM target assembly parsers. The
214 parsed sequence of MCInst is then analyzed by a ``Pipeline`` module to generate
215 a performance report.
209 HOW LLVM-MCA WORKS
210 ------------------
211
212 :program:`llvm-mca` takes assembly code as input. The assembly code is parsed
213 into a sequence of MCInst with the help of the existing LLVM target assembly
214 parsers. The parsed sequence of MCInst is then analyzed by a ``Pipeline`` module
215 to generate a performance report.
216216
217217 The Pipeline module simulates the execution of the machine code sequence in a
218218 loop of iterations (default is 100). During this process, the pipeline collects
219219 a number of execution related statistics. At the end of this process, the
220220 pipeline generates and prints a report from the collected statistics.
221221
222 Here is an example of a performance report generated by MCA for a dot-product
223 of two packed float vectors of four elements. The analysis is conducted for
224 target x86, cpu btver2. The following result can be produced via the following
225 command using the example located at
222 Here is an example of a performance report generated by the tool for a
223 dot-product of two packed float vectors of four elements. The analysis is
224 conducted for target x86, cpu btver2. The following result can be produced via
225 the following command using the example located at
226226 ``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:
227227
228228 .. code-block:: bash
315315
316316 Timeline View
317317 ^^^^^^^^^^^^^
318 MCA's timeline view produces a detailed report of each instruction's state
318 The timeline view produces a detailed report of each instruction's state
319319 transitions through an instruction pipeline. This view is enabled by the
320320 command line option ``-timeline``. As instructions transition through the
321321 various stages of the pipeline, their states are depicted in the view report.
330330
331331 Below is the timeline view for a subset of the dot-product example located in
332332 ``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
333 MCA using the following command:
333 :program:`llvm-mca` using the following command:
334334
335335 .. code-block:: bash
336336
365365 2. 3 5.7 0.0 0.0 vhaddps %xmm3, %xmm3, %xmm4
366366
367367 The timeline view is interesting because it shows instruction state changes
368 during execution. It also gives an idea of how MCA processes instructions
368 during execution. It also gives an idea of how the tool processes instructions
369369 executed on the target, and how their timing information might be calculated.
370370
371371 The timeline view is structured in two tables. The first table shows
414414
415415 Table *Average Wait times* helps diagnose performance issues that are caused by
416416 the presence of long latency instructions and potentially long data dependencies
417 which may limit the ILP. Note that MCA, by default, assumes at least 1cy
418 between the dispatch event and the issue event.
417 which may limit the ILP. Note that :program:`llvm-mca`, by default, assumes at
418 least 1cy between the dispatch event and the issue event.
419419
420420 When the performance is limited by data dependencies and/or long latency
421421 instructions, the number of cycles spent while in the *ready* state is expected
601601 the target scheduling model.
602602
603603 Instructions that are dispatched to the schedulers consume scheduler buffer
604 entries. MCA queries the scheduling model to determine the set of
605 buffered resources consumed by an instruction. Buffered resources are treated
606 like scheduler resources.
604 entries. :program:`llvm-mca` queries the scheduling model to determine the set
605 of buffered resources consumed by an instruction. Buffered resources are
606 treated like scheduler resources.
607607
608608 Instruction Issue
609609 """""""""""""""""
611611 has to wait in the scheduler's buffer until input register operands become
612612 available. Only at that point, does the instruction becomes eligible for
613613 execution and may be issued (potentially out-of-order) for execution.
614 Instruction latencies are computed by MCA with the help of the scheduling
615 model.
616
617 MCA's scheduler is designed to simulate multiple processor schedulers. The
618 scheduler is responsible for tracking data dependencies, and dynamically
619 selecting which processor resources are consumed by instructions.
620
621 The scheduler delegates the management of processor resource units and resource
622 groups to a resource manager. The resource manager is responsible for
623 selecting resource units that are consumed by instructions. For example, if an
624 instruction consumes 1cy of a resource group, the resource manager selects one
625 of the available units from the group; by default, the resource manager uses a
614 Instruction latencies are computed by :program:`llvm-mca` with the help of the
615 scheduling model.
616
617 :program:`llvm-mca`'s scheduler is designed to simulate multiple processor
618 schedulers. The scheduler is responsible for tracking data dependencies, and
619 dynamically selecting which processor resources are consumed by instructions.
620 It delegates the management of processor resource units and resource groups to a
621 resource manager. The resource manager is responsible for selecting resource
622 units that are consumed by instructions. For example, if an instruction
623 consumes 1cy of a resource group, the resource manager selects one of the
624 available units from the group; by default, the resource manager uses a
626625 round-robin selector to guarantee that resource usage is uniformly distributed
627626 between all units of a group.
628627
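The round-robin selection described in the hunk above can be illustrated with a small Python sketch. It is not taken from the llvm-mca sources: the class ``RoundRobinGroup``, its methods, and the unit names ``U0``/``U1`` are hypothetical, and the sketch assumes each unit is simply busy until a known cycle.

```python
class RoundRobinGroup:
    """Hypothetical resource group that hands out units in round-robin order."""

    def __init__(self, units):
        self.units = list(units)
        # Cycle at which each unit becomes free again (assumption of this sketch).
        self.busy_until = {unit: 0 for unit in self.units}
        self.next_index = 0  # where the next search starts

    def select(self, cycle, duration=1):
        """Return a unit that is free at `cycle`, or None if all are busy."""
        count = len(self.units)
        for offset in range(count):
            index = (self.next_index + offset) % count
            unit = self.units[index]
            if self.busy_until[unit] <= cycle:
                self.busy_until[unit] = cycle + duration
                # Start the next search just past the unit we picked.
                self.next_index = (index + 1) % count
                return unit
        return None


group = RoundRobinGroup(["U0", "U1"])
picks = [group.select(cycle) for cycle in range(4)]
print(picks)
```

With two units and one 1cy request per cycle, the picks alternate between the units, which is the uniform distribution the text refers to.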
629 MCA's scheduler implements three instruction queues:
628 :program:`llvm-mca`'s scheduler implements three instruction queues:
630629
631630 * WaitQueue: a queue of instructions whose operands are not ready.
632631 * ReadyQueue: a queue of instructions ready to execute.
637636
638637 Every cycle, the scheduler checks if instructions can be moved from the
639638 WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be
640 issued. The algorithm prioritizes older instructions over younger
641 instructions.
639 issued to the underlying pipelines. The algorithm prioritizes older instructions
640 over younger instructions.
642641
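As a rough illustration of the per-cycle queue handling described in the hunk above, the following Python sketch models only the WaitQueue and the ReadyQueue. The names ``ToyScheduler``, ``dispatch``, ``cycle`` and ``issue_width`` are hypothetical and do not come from llvm-mca. The sketch assumes the caller reports which producers have completed, and that a lower instruction index means an older instruction.

```python
class ToyScheduler:
    """Hypothetical two-queue scheduler; not the llvm-mca implementation."""

    def __init__(self, issue_width=1):
        self.wait_queue = []   # (index, dependencies) with operands not ready
        self.ready_queue = []  # indices of instructions ready to execute
        self.issue_width = issue_width

    def dispatch(self, index, deps):
        """Enter an instruction into the scheduler with its producer indices."""
        self.wait_queue.append((index, set(deps)))

    def cycle(self, completed):
        """Advance one cycle; `completed` is the set of finished producers."""
        still_waiting = []
        for index, deps in self.wait_queue:
            if deps <= completed:
                # All operands available: move WaitQueue -> ReadyQueue.
                self.ready_queue.append(index)
            else:
                still_waiting.append((index, deps))
        self.wait_queue = still_waiting
        # Older instructions (lower index) are issued first.
        self.ready_queue.sort()
        issued = self.ready_queue[:self.issue_width]
        self.ready_queue = self.ready_queue[self.issue_width:]
        return issued


sched = ToyScheduler(issue_width=1)
sched.dispatch(0, [])
sched.dispatch(1, [0])   # depends on instruction 0
sched.dispatch(2, [])
first = sched.cycle(completed=set())
second = sched.cycle(completed={0})
third = sched.cycle(completed={0})
print(first, second, third)
```

In the second cycle, instruction 1 becomes ready and is issued ahead of instruction 2, which was already waiting in the ReadyQueue, because it is older.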
643642 Write-Back and Retire Stage
644643 """""""""""""""""""""""""""
655654
656655 Load/Store Unit and Memory Consistency Model
657656 """"""""""""""""""""""""""""""""""""""""""""
658 To simulate an out-of-order execution of memory operations, MCA utilizes a
659 simulated load/store unit (LSUnit) to simulate the speculative execution of
660 loads and stores.
661
662 Each load (or store) consumes an entry in the load (or store) queue. The
663 number of slots in the load/store queues is unknown by MCA, since there is no
664 mention of it in the scheduling model. In practice, users can specify flags
665 ``-lqueue`` and ``-squeue`` to limit the number of entries in the load and
666 store queues respectively. The queues are unbounded by default.
657 To simulate an out-of-order execution of memory operations, :program:`llvm-mca`
658 utilizes a simulated load/store unit (LSUnit) to simulate the speculative
659 execution of loads and stores.
660
661 Each load (or store) consumes an entry in the load (or store) queue. Users can
662 specify flags ``-lqueue`` and ``-squeue`` to limit the number of entries in the
663 load and store queues respectively. The queues are unbounded by default.
667664
668665 The LSUnit implements a relaxed consistency model for memory loads and stores.
669666 The rules are:
700697 loads, the scheduling model provides an "optimistic" load-to-use latency (which
701698 usually matches the load-to-use latency for when there is a hit in the L1D).
702699
703 MCA does not know about serializing operations or memory-barrier like
704 instructions. The LSUnit conservatively assumes that an instruction which has
705 both "MayLoad" and unmodeled side effects behaves like a "soft" load-barrier.
706 That means, it serializes loads without forcing a flush of the load queue.
707 Similarly, instructions that "MayStore" and have unmodeled side effects are
708 treated like store barriers. A full memory barrier is a "MayLoad" and
709 "MayStore" instruction with unmodeled side effects. This is inaccurate, but it
710 is the best that we can do at the moment with the current information available
711 in LLVM.
700 :program:`llvm-mca` does not know about serializing operations or memory-barrier
701 like instructions. The LSUnit conservatively assumes that an instruction which
702 has both "MayLoad" and unmodeled side effects behaves like a "soft"
703 load-barrier. That means, it serializes loads without forcing a flush of the
704 load queue. Similarly, instructions that "MayStore" and have unmodeled side
705 effects are treated like store barriers. A full memory barrier is a "MayLoad"
706 and "MayStore" instruction with unmodeled side effects. This is inaccurate, but
707 it is the best that we can do at the moment with the current information
708 available in LLVM.
712709
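The conservative barrier rule in the hunk above amounts to a small decision table. The Python sketch below restates it. The function name ``classify_barrier`` and its boolean parameters are hypothetical stand-ins for the "MayLoad", "MayStore" and unmodeled-side-effects properties mentioned in the text, not an llvm-mca API.

```python
def classify_barrier(may_load, may_store, has_unmodeled_side_effects):
    """Return the kind of barrier the rule in the text would assume.

    Hypothetical helper: the result is one of "none", "load", "store", "full".
    """
    if not has_unmodeled_side_effects:
        # Without unmodeled side effects, no barrier is assumed.
        return "none"
    if may_load and may_store:
        return "full"    # treated like a full memory barrier
    if may_load:
        return "load"    # "soft" load barrier: serializes loads, no flush
    if may_store:
        return "store"   # treated like a store barrier
    return "none"


print(classify_barrier(True, False, True))
print(classify_barrier(True, True, True))
```

As the text notes, this classification is an approximation: it can flag instructions that are not real barriers.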
713710 A load/store barrier consumes one entry of the load/store queue. A load/store
714711 barrier enforces ordering of loads/stores. A younger load cannot pass a load