llvm.org GIT mirror llvm / d08e6c7
[llvm-mca][docs] Add instruction flow documentation. NFC. Summary: This patch mostly copies the existing Instruction Flow, and stage descriptions from the mca README. I made a few text tweaks, but no semantic changes, and made reference to the "default pipeline." I also removed the internals references (e.g., reference to class names and header files). I did leave the LSUnit name around, but only as an abbreviated word for the load-store unit. Reviewers: andreadb, courbet, RKSimon, gbedwell, filcab Reviewed By: andreadb Subscribers: tschuett, jfb, llvm-commits Differential Revision: https://reviews.llvm.org/D49692 git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@338319 91177308-0d34-0410-b5e6-96231b3b80d8 Matt Davis 1 year, 2 months ago
1 changed file(s) with 177 addition(s) and 0 deletion(s).
548 548
549 549 In this example, we can conclude that the IPC is mostly limited by data
550 550 dependencies, and not by resource pressure.
551
552 Instruction Flow
553 ^^^^^^^^^^^^^^^^
554 This section describes the instruction flow through MCA's default out-of-order
555 pipeline, as well as the functional units involved in the process.
556
557 The default pipeline implements the following sequence of stages used to
558 process instructions.
559
560 * Dispatch (Instruction is dispatched to the schedulers).
561 * Issue (Instruction is issued to the processor pipelines).
562 * Write Back (Instruction is executed, and results are written back).
563 * Retire (Instruction is retired; writes are architecturally committed).
564
565 The default pipeline only models the out-of-order portion of a processor.
566 Therefore, the instruction fetch and decode stages are not modeled. Performance
567 bottlenecks in the frontend are not diagnosed. MCA assumes that instructions
568 have all been decoded and placed into a queue. Also, MCA does not model branch
569 prediction.
570
571 Instruction Dispatch
572 """"""""""""""""""""
573 During the dispatch stage, instructions are picked in program order from a
574 queue of already decoded instructions, and dispatched in groups to the
575 simulated hardware schedulers.
576
577 The size of a dispatch group depends on the availability of the simulated
578 hardware resources. The processor dispatch width defaults to the value
579 of the ``IssueWidth`` in LLVM's scheduling model.
580
581 An instruction can be dispatched if:
582
583 * The size of the dispatch group is smaller than the processor's dispatch width.
584 * There are enough entries in the reorder buffer.
585 * There are enough physical registers to do register renaming.
586 * The schedulers are not full.
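
The four conditions above can be read as a single predicate. The following is a minimal Python sketch, not MCA's implementation; ``DispatchState``, ``can_dispatch``, and all field names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class DispatchState:
    """Hypothetical snapshot of the simulated resources in one cycle."""
    dispatch_width: int        # maximum size of a dispatch group
    group_size: int            # size of the current dispatch group so far
    free_rob_entries: int      # unused reorder buffer entries
    free_phys_regs: int        # physical registers left for renaming
    free_scheduler_slots: int  # unused scheduler buffer entries


def can_dispatch(state, uops, reg_writes):
    """Return True if an instruction with `uops` micro-ops and
    `reg_writes` register definitions passes all four checks."""
    return (
        state.group_size < state.dispatch_width
        and state.free_rob_entries >= uops
        and state.free_phys_regs >= reg_writes
        and state.free_scheduler_slots >= 1
    )
```

Any single failing check is enough to stall dispatch for the cycle, which is why the checks are combined with ``and``.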
587
588 Scheduling models can optionally specify which register files are available on
589 the processor. MCA uses that information to initialize register file
590 descriptors. Users can limit the number of physical registers that are
591 globally available for register renaming by using the command option
592 ``-register-file-size``. A value of zero for this option means *unbounded*.
593 By knowing how many registers are available for renaming, MCA can predict
594 dispatch stalls caused by the lack of registers.
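
To illustrate how a bounded register file leads to dispatch stalls, here is a small Python sketch. The ``RegisterFile`` class and its methods are hypothetical; the only detail taken from the text is that a size of zero means unbounded.

```python
class RegisterFile:
    """Hypothetical pool of physical registers used for renaming."""

    def __init__(self, size):
        # Following the text above, a size of zero means "unbounded".
        self.size = size
        self.in_use = 0

    def can_allocate(self, count):
        return self.size == 0 or self.in_use + count <= self.size

    def allocate(self, count):
        """Reserve `count` registers; return False on a dispatch stall."""
        if not self.can_allocate(count):
            return False
        self.in_use += count
        return True

    def release(self, count):
        """Free registers, for example when an instruction retires."""
        self.in_use = max(0, self.in_use - count)
```

A failed ``allocate`` call corresponds to a cycle in which the instruction cannot be dispatched because too few registers are available for renaming.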
595
596 The number of reorder buffer entries consumed by an instruction depends on the
597 number of micro-opcodes specified by the target scheduling model. The purpose
598 of MCA's reorder buffer is to track the progress of instructions that are
599 "in-flight," and to retire instructions in program order. The number of
600 entries in the reorder buffer defaults to the ``MicroOpBufferSize`` provided by
601 the target scheduling model.
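
The reorder buffer behavior described above (one entry per micro-op, retirement in program order) can be sketched as follows. This is an illustrative Python model under simplifying assumptions; the ``ReorderBuffer`` class and its method names are hypothetical.

```python
from collections import deque


class ReorderBuffer:
    """Hypothetical reorder buffer that retires strictly in program order."""

    def __init__(self, capacity):
        self.capacity = capacity  # e.g. the model's MicroOpBufferSize
        self.used = 0             # entries currently occupied
        self.entries = deque()    # [name, uops, executed], in program order

    def dispatch(self, name, uops):
        """Reserve one entry per micro-op; return False if there is no room."""
        if self.used + uops > self.capacity:
            return False
        self.entries.append([name, uops, False])
        self.used += uops
        return True

    def mark_executed(self, name):
        for entry in self.entries:
            if entry[0] == name:
                entry[2] = True

    def retire(self):
        """Retire executed instructions from the head; stop at the first
        instruction that has not finished executing."""
        retired = []
        while self.entries and self.entries[0][2]:
            name, uops, _ = self.entries.popleft()
            self.used -= uops
            retired.append(name)
        return retired
```

Note that a younger instruction that finishes early still has to wait behind an older, unfinished one before it can retire.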
602
603 Instructions that are dispatched to the schedulers consume scheduler buffer
604 entries. MCA queries the scheduling model to determine the set of
605 buffered resources consumed by an instruction. Buffered resources are treated
606 like scheduler resources.
607
608 Instruction Issue
609 """""""""""""""""
610 Each processor scheduler implements a buffer of instructions. An instruction
611 has to wait in the scheduler's buffer until input register operands become
612 available. Only at that point does the instruction become eligible for
613 execution, and it may then be issued (potentially out of order).
614 Instruction latencies are computed by MCA with the help of the scheduling
615 model.
616
617 MCA's scheduler is designed to simulate multiple processor schedulers. The
618 scheduler is responsible for tracking data dependencies, and dynamically
619 selecting which processor resources are consumed by instructions.
620
621 The scheduler delegates the management of processor resource units and resource
622 groups to a resource manager. The resource manager is responsible for
623 selecting resource units that are consumed by instructions. For example, if an
624 instruction consumes one cycle of a resource group, the resource manager selects one
625 of the available units from the group; by default, the resource manager uses a
626 round-robin selector to guarantee that resource usage is uniformly distributed
627 between all units of a group.
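
A round-robin selector of the kind described above might look like the following Python sketch. It is not the resource manager's actual code; ``RoundRobinGroup``, ``select``, and the unit names in the test of the idea are hypothetical.

```python
class RoundRobinGroup:
    """Hypothetical resource group that hands out its units in rotation."""

    def __init__(self, units):
        self.units = list(units)
        self.busy_until = {u: 0 for u in self.units}  # cycle each unit frees up
        self.next_index = 0                           # where the rotation resumes

    def select(self, now, cycles=1):
        """Pick the next available unit at cycle `now` and reserve it for
        `cycles` cycles. Return None if every unit is busy."""
        count = len(self.units)
        for offset in range(count):
            index = (self.next_index + offset) % count
            unit = self.units[index]
            if self.busy_until[unit] <= now:
                self.busy_until[unit] = now + cycles
                self.next_index = (index + 1) % count
                return unit
        return None
```

Resuming the search just past the last selected unit is what spreads usage evenly across the group instead of always favoring the first unit.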
628
629 MCA's scheduler implements three instruction queues:
630
631 * WaitQueue: a queue of instructions whose operands are not ready.
632 * ReadyQueue: a queue of instructions ready to execute.
633 * IssuedQueue: a queue of instructions executing.
634
635 Depending on the operand availability, instructions that are dispatched to the
636 scheduler are either placed into the WaitQueue or into the ReadyQueue.
637
638 Every cycle, the scheduler checks if instructions can be moved from the
639 WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be
640 issued. The algorithm prioritizes older instructions over younger
641 instructions.
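
The interplay of the three queues can be sketched in Python as below. This is a simplified illustration, not MCA's scheduler: instructions are plain dictionaries, and ``issue_width`` is a hypothetical stand-in for the availability of processor resources.

```python
class Scheduler:
    """Hypothetical three-queue scheduler; instructions are dicts with
    'id' (program order), 'deps' (ids of producers), and 'latency'."""

    def __init__(self, issue_width=1):
        self.issue_width = issue_width
        self.wait, self.ready, self.issued = [], [], []
        self.done = set()  # ids whose results have been written back

    def dispatch(self, instr):
        queue = self.ready if set(instr["deps"]) <= self.done else self.wait
        queue.append(instr)

    def cycle(self):
        """Advance one cycle and return the ids issued in this cycle."""
        # Instructions that finish executing leave the IssuedQueue.
        for instr in self.issued:
            instr["remaining"] -= 1
        finished = [i for i in self.issued if i["remaining"] == 0]
        self.issued = [i for i in self.issued if i["remaining"] > 0]
        self.done.update(i["id"] for i in finished)

        # Promote instructions whose operands are now available.
        promoted = [i for i in self.wait if set(i["deps"]) <= self.done]
        self.wait = [i for i in self.wait if i not in promoted]
        self.ready.extend(promoted)

        # Issue the oldest ready instructions first.
        self.ready.sort(key=lambda i: i["id"])
        to_issue = self.ready[: self.issue_width]
        self.ready = self.ready[self.issue_width:]
        for instr in to_issue:
            instr["remaining"] = instr["latency"]
            self.issued.append(instr)
        return [i["id"] for i in to_issue]
```

Sorting the ReadyQueue by program-order id before issuing is how this sketch expresses the preference for older instructions.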
642
643 Write-Back and Retire Stage
644 """""""""""""""""""""""""""
645 Issued instructions are moved from the ReadyQueue to the IssuedQueue. There,
646 instructions wait until they reach the write-back stage. At that point, they
647 get removed from the queue and the retire control unit is notified.
648
649 When an instruction is executed, the retire control unit flags it as
650 "ready to retire."
651
652 Instructions are retired in program order. The register file is notified of
653 the retirement so that it can free the temporary registers that were allocated
654 for the instruction during the register renaming stage.
655
656 Load/Store Unit and Memory Consistency Model
657 """"""""""""""""""""""""""""""""""""""""""""
658 To simulate out-of-order execution of memory operations, MCA uses a
659 simulated load/store unit (LSUnit) that models the speculative execution of
660 loads and stores.
661
662 Each load (or store) consumes an entry in the load (or store) queue. The
663 number of slots in the load/store queues is unknown to MCA, since there is no
664 mention of it in the scheduling model. In practice, users can specify flags
665 ``-lqueue`` and ``-squeue`` to limit the number of entries in the load and
666 store queues respectively. The queues are unbounded by default.
667
668 The LSUnit implements a relaxed consistency model for memory loads and stores.
669 The rules are:
670
671 1. A younger load is allowed to pass an older load only if there are no
672 intervening stores or barriers between the two loads.
673 2. A younger load is allowed to pass an older store provided that the load does
674 not alias with the store.
675 3. A younger store is not allowed to pass an older store.
676 4. A younger store is not allowed to pass an older load.
677
678 By default, the LSUnit optimistically assumes that loads do not alias store
679 operations (``-noalias=true``). Under this assumption, younger loads are
680 always allowed to pass older stores. Essentially, the LSUnit does not attempt
681 to run any alias analysis to predict when loads and stores do not alias with
682 each other.
683
684 Note that, in the case of write-combining memory, rule 3 could be relaxed to
685 allow reordering of non-aliasing store operations. That being said, at the
686 moment, there is no way to further relax the memory model (``-noalias`` is the
687 only option). Essentially, there is no option to specify a different memory
688 type (e.g., write-back, write-combining, write-through, etc.) and consequently
689 to weaken, or strengthen, the memory model.
690
691 Other limitations are:
692
693 * The LSUnit does not know when store-to-load forwarding may occur.
694 * The LSUnit does not know anything about cache hierarchy and memory types.
695 * The LSUnit does not know how to identify serializing operations and memory
696 fences.
697
698 The LSUnit does not attempt to predict if a load or store hits or misses the L1
699 cache. It only knows if an instruction "MayLoad" and/or "MayStore." For
700 loads, the scheduling model provides an "optimistic" load-to-use latency (which
701 usually matches the load-to-use latency for when there is a hit in the L1D).
702
703 MCA does not know about serializing operations or memory-barrier-like
704 instructions. The LSUnit conservatively assumes that an instruction which has
705 both "MayLoad" and unmodeled side effects behaves like a "soft" load-barrier.
706 That is, it serializes loads without forcing a flush of the load queue.
707 Similarly, instructions that "MayStore" and have unmodeled side effects are
708 treated like store barriers. A full memory barrier is a "MayLoad" and
709 "MayStore" instruction with unmodeled side effects. This is inaccurate, but it
710 is the best that we can do at the moment with the current information available
711 in LLVM.
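
The barrier heuristic described above reduces to a small decision function. The Python sketch below is illustrative only; ``classify_barrier`` and its string results are hypothetical names, while the three inputs correspond to the "MayLoad", "MayStore", and unmodeled side effects properties mentioned in the text.

```python
def classify_barrier(may_load, may_store, has_side_effects):
    """Return the barrier kind implied by the three instruction flags."""
    if not has_side_effects:
        return "none"    # ordinary instruction, load, or store
    if may_load and may_store:
        return "full"    # treated as a full memory barrier
    if may_load:
        return "load"    # "soft" load barrier
    if may_store:
        return "store"   # store barrier
    return "none"        # side effects, but no memory access flags
```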
712
713 A load/store barrier consumes one entry of the load/store queue. A load/store
714 barrier enforces ordering of loads/stores. A younger load cannot pass a load
715 barrier. Also, a younger store cannot pass a store barrier. A younger load
716 has to wait for the memory/load barrier to execute. A load/store barrier is
717 "executed" when it becomes the oldest entry in the load/store queue(s). That
718 also means that, by construction, all of the older loads/stores have been executed.
719
720 In conclusion, the full set of load/store consistency rules is:
721
722 #. A store may not pass a previous store.
723 #. A store may not pass a previous load (regardless of ``-noalias``).
724 #. A store has to wait until an older store barrier is fully executed.
725 #. A load may pass a previous load.
726 #. A load may not pass a previous store unless ``-noalias`` is set.
727 #. A load has to wait until an older load barrier is fully executed.
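
The six rules above can be written as a lookup table. The following Python sketch is a hedged illustration rather than the LSUnit's logic; ``may_pass`` and the operation labels are hypothetical, and pairs that the list does not mention are rejected instead of guessed.

```python
def may_pass(younger, older, noalias=True):
    """Return True if the `younger` memory operation may execute before
    the `older` one, according to the six rules listed above."""
    rules = {
        ("store", "store"): False,          # rule 1
        ("store", "load"): False,           # rule 2, regardless of noalias
        ("store", "store_barrier"): False,  # rule 3
        ("load", "load"): True,             # rule 4
        ("load", "store"): noalias,         # rule 5
        ("load", "load_barrier"): False,    # rule 6
    }
    if (younger, older) not in rules:
        raise ValueError("pair not covered by the six rules")
    return rules[(younger, older)]
```

The default of ``noalias=True`` mirrors the default ``-noalias=true`` setting, under which a load may pass an older store.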