[MCA] Highlight kernel bottlenecks in the summary view.

This patch adds a new flag named -bottleneck-analysis to print out information about throughput bottlenecks.

MCA knows how to identify and classify dynamic dispatch stalls. However, it doesn't know how to analyze and highlight kernel bottlenecks. The goal of this patch is to teach MCA how to correlate increases in backend pressure to backend stalls (and therefore, the loss of throughput).

From a Scheduler point of view, backend pressure is a function of the scheduler buffer usage (i.e. how the number of uOps in the scheduler buffers changes over time). Backend pressure increases (or decreases) when there is a mismatch between the number of opcodes dispatched and the number of opcodes issued in the same cycle. Since buffer resources are limited, continuous increases in backend pressure eventually lead to dispatch stalls. So, there is a strong correlation between dispatch stalls and how backend pressure changes over time.

This patch teaches MCA how to identify situations where backend pressure increases because of:
 - unavailable pipeline resources.
 - data dependencies.

Data dependencies may delay execution of instructions and therefore increase the time that uOps have to spend in the scheduler buffers. That often translates to an increase in backend pressure which may eventually lead to a bottleneck. Contention on pipeline resources may also delay execution of instructions and lead to a temporary increase in backend pressure.

Internally, the Scheduler classifies instructions based on whether register / memory operands are available or not. An instruction is marked as "ready to execute" only if data dependencies are fully resolved. Every cycle, the Scheduler attempts to execute all instructions that are ready to execute. If an instruction cannot execute because of unavailable pipeline resources, then the Scheduler internally updates a BusyResourceUnits mask with the ID of each unavailable resource.

ExecuteStage is responsible for tracking changes in backend pressure. If backend pressure increases during a cycle because of contention on pipeline resources, then ExecuteStage sends a "backend pressure" event to the listeners. That event contains information about instructions delayed by resource pressure, as well as the BusyResourceUnits mask. Note that ExecuteStage also knows how to identify situations where backend pressure increased because of delays introduced by data dependencies.

The SummaryView observes "backend pressure" events and prints out a "bottleneck report". Example of a bottleneck report:

```
Cycles with backend pressure increase [ 99.89% ]

Throughput Bottlenecks:
  Resource Pressure         [ 0.00% ]
  Data Dependencies:        [ 99.89% ]
  - Register Dependencies   [ 0.00% ]
  - Memory Dependencies     [ 99.89% ]
```

A bottleneck report is printed out only if increases in backend pressure eventually caused backend stalls.

About the time complexity: it is linear in the number of instructions in the Scheduler::PendingSet. The average slowdown tends to be in the range of ~5-6%. For memory-intensive kernels, the slowdown can be significant if flag -noalias=false is specified. In the worst-case scenario I have observed a slowdown of ~30% when flag -noalias=false was specified. We can definitely recover part of that slowdown if we optimize class LSUnit (by doing extra bookkeeping to speed up queries). For now, this new analysis is disabled by default, and it can be enabled via flag -bottleneck-analysis.
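As a usage sketch (the input file name kernel.s is a placeholder; the other flags mirror the RUN lines of the tests added below):

```
$ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=100 \
    -bottleneck-analysis kernel.s
```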
Users of MCA as a library can enable the generation of pressure events through the constructor of ExecuteStage.

This patch partially addresses https://bugs.llvm.org/show_bug.cgi?id=37494

Differential Revision: https://reviews.llvm.org/D58728

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@355308 91177308-0d34-0410-b5e6-96231b3b80d8

Andrea Di Biagio
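For library users, a minimal sketch of the two ways to opt in; it assumes an mca::InstrBuilder IB, an mca::SourceMgr SrcMgr, and STI/MRI objects set up as in the llvm-mca driver, and the option values below are arbitrary examples:

```cpp
// Sketch: request bottleneck analysis when building the default pipeline.
mca::Context MCA(*MRI, *STI);
mca::PipelineOptions PO(/*DispatchWidth=*/2, /*RegisterFileSize=*/0,
                        /*LoadQueueSize=*/0, /*StoreQueueSize=*/0,
                        /*AssumeNoAlias=*/true,
                        /*ShouldEnableBottleneckAnalysis=*/true);
auto P = MCA.createDefaultPipeline(PO, IB, SrcMgr);

// Alternatively, when assembling a custom pipeline, construct the stage
// directly with pressure events enabled:
//   ExecuteStage Execute(S, /*ShouldPerformBottleneckAnalysis=*/true);
```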
15 changed files with 561 additions and 21 deletions.
167167 view because it doesn't require that the code is simulated. It instead prints
168168 the theoretical uniform distribution of resource pressure for every
169169 instruction in sequence.
170
171 .. option:: -bottleneck-analysis
172
173 Print information about bottlenecks that affect the throughput. This analysis
174 can be expensive, and it is disabled by default. Bottlenecks are highlighted
175 in the summary view.
170176
171177
172178 EXIT STATUS
3131 /// the pre-built "default" out-of-order pipeline.
3232 struct PipelineOptions {
3333 PipelineOptions(unsigned DW, unsigned RFS, unsigned LQS, unsigned SQS,
34 bool NoAlias)
34 bool NoAlias, bool ShouldEnableBottleneckAnalysis = false)
3535 : DispatchWidth(DW), RegisterFileSize(RFS), LoadQueueSize(LQS),
36 StoreQueueSize(SQS), AssumeNoAlias(NoAlias) {}
36 StoreQueueSize(SQS), AssumeNoAlias(NoAlias),
37 EnableBottleneckAnalysis(ShouldEnableBottleneckAnalysis) {}
3738 unsigned DispatchWidth;
3839 unsigned RegisterFileSize;
3940 unsigned LoadQueueSize;
4041 unsigned StoreQueueSize;
4142 bool AssumeNoAlias;
43 bool EnableBottleneckAnalysis;
4244 };
4345
4446 class Context {
124124 const InstRef &IR;
125125 };
126126
127 // A HWPressureEvent describes an increase in backend pressure caused by
128 // the presence of data dependencies or unavailability of pipeline resources.
129 class HWPressureEvent {
130 public:
131 enum GenericReason {
132 INVALID = 0,
133 // Scheduler was unable to issue all the ready instructions because some
134 // pipeline resources were unavailable.
135 RESOURCES,
136 // Instructions could not be issued because of register data dependencies.
137 REGISTER_DEPS,
138 // Instructions could not be issued because of memory dependencies.
139 MEMORY_DEPS
140 };
141
142 HWPressureEvent(GenericReason reason, ArrayRef<InstRef> Insts,
143 uint64_t Mask = 0)
144 : Reason(reason), AffectedInstructions(Insts), ResourceMask(Mask) {}
145
146 // Reason for this increase in backend pressure.
147 GenericReason Reason;
148
149 // Instructions affected (i.e. delayed) by this increase in backend pressure.
150 ArrayRef<InstRef> AffectedInstructions;
151
152 // A mask of unavailable processor resources.
153 const uint64_t ResourceMask;
154 };
155
127156 class HWEventListener {
128157 public:
129158 // Generic events generated by the pipeline.
132161
133162 virtual void onEvent(const HWInstructionEvent &Event) {}
134163 virtual void onEvent(const HWStallEvent &Event) {}
164 virtual void onEvent(const HWPressureEvent &Event) {}
135165
136166 using ResourceRef = std::pair<uint64_t, uint64_t>;
137167 virtual void onResourceAvailable(const ResourceRef &RRef) {}
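To illustrate the new hook, here is a hypothetical listener (not part of this patch) that counts pressure events by cause; it assumes the llvm::mca namespace is in scope:

```cpp
// Hypothetical listener: counts cycles of increased backend pressure by cause.
struct PressureCounter : public HWEventListener {
  unsigned ResourcePressureCycles = 0;
  unsigned RegisterDepCycles = 0;
  unsigned MemoryDepCycles = 0;

  void onEvent(const HWPressureEvent &Event) override {
    switch (Event.Reason) {
    case HWPressureEvent::RESOURCES:
      ++ResourcePressureCycles;
      break;
    case HWPressureEvent::REGISTER_DEPS:
      ++RegisterDepCycles;
      break;
    case HWPressureEvent::MEMORY_DEPS:
      ++MemoryDepCycles;
      break;
    default:
      break;
    }
  }
};
```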
179179 /// Check if the instruction in 'IR' can be dispatched during this cycle.
180180 /// Return SC_AVAILABLE if both scheduler and LS resources are available.
181181 ///
182 /// This method internally sets field HadTokenStall based on the Scheduler
183 /// Status value.
182 /// This method is also responsible for setting field HadTokenStall if
183 /// IR cannot be dispatched to the Scheduler due to unavailable resources.
184184 Status isAvailable(const InstRef &IR);
185185
186186 /// Reserves buffer and LSUnit queue resources that are necessary to issue
224224 /// resources are not available.
225225 InstRef select();
226226
227 /// Returns a mask of busy resources. Each bit of the mask identifies a unique
228 /// processor resource unit. In the absence of bottlenecks caused by resource
229 /// pressure, the mask value returned by this method is always zero.
230 uint64_t getBusyResourceUnits() const { return BusyResourceUnits; }
231227 bool arePipelinesFullyUsed() const {
232228 return !Resources->getAvailableProcResUnits();
233229 }
234230 bool isReadySetEmpty() const { return ReadySet.empty(); }
235231 bool isWaitSetEmpty() const { return WaitSet.empty(); }
232
233 /// This method is called by the ExecuteStage at the end of each cycle to
234 /// identify bottlenecks caused by data dependencies. Vector RegDeps is
235 /// populated by instructions that were not issued because of unsolved
236 /// register dependencies. Vector MemDeps is populated by instructions that
237 /// were not issued because of unsolved memory dependencies.
238 void analyzeDataDependencies(SmallVectorImpl<InstRef> &RegDeps,
239 SmallVectorImpl<InstRef> &MemDeps);
240
241 /// Returns a mask of busy resources, and populates vector Insts with
242 /// instructions that could not be issued to the underlying pipelines because
243 /// not all pipeline resources were available.
244 uint64_t analyzeResourcePressure(SmallVectorImpl<InstRef> &Insts);
236245
237246 // Returns true if the dispatch logic couldn't dispatch a full group due to
238247 // unavailable scheduler and/or LS resources.
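The intended per-cycle call pattern for the two new queries, condensed from ExecuteStage::cycleEnd later in this patch (the SmallVector sizes are assumptions):

```cpp
// At the end of a cycle, ask the Scheduler what held instructions back.
SmallVector<InstRef, 8> Insts;
uint64_t BusyMask = HWS.analyzeResourcePressure(Insts);
if (BusyMask) {
  // Some pipeline resources were unavailable during this cycle.
  notifyEvent(HWPressureEvent(HWPressureEvent::RESOURCES, Insts, BusyMask));
} else {
  SmallVector<InstRef, 8> RegDeps, MemDeps;
  HWS.analyzeDataDependencies(RegDeps, MemDeps);
  if (!RegDeps.empty())
    notifyEvent(HWPressureEvent(HWPressureEvent::REGISTER_DEPS, RegDeps));
  if (!MemDeps.empty())
    notifyEvent(HWPressureEvent(HWPressureEvent::MEMORY_DEPS, MemDeps));
}
```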
447447 // Retire Unit token ID for this instruction.
448448 unsigned RCUTokenID;
449449
450 // A bitmask of busy processor resource units.
451 // This field is set to zero only if execution is not delayed during this
452 // cycle because of unavailable pipeline resources.
450453 uint64_t CriticalResourceMask;
454
455 // An instruction identifier. This field is only set if execution is delayed
456 // by a memory dependency.
451457 unsigned CriticalMemDep;
452458
453459 public:
498504 Stage = IS_RETIRED;
499505 }
500506
501 void updateCriticalResourceMask(uint64_t BusyResourceUnits) {
502 CriticalResourceMask |= BusyResourceUnits;
503 }
504507 uint64_t getCriticalResourceMask() const { return CriticalResourceMask; }
508 unsigned getCriticalMemDep() const { return CriticalMemDep; }
509 void setCriticalResourceMask(uint64_t ResourceMask) {
510 CriticalResourceMask = ResourceMask;
511 }
505512 void setCriticalMemDep(unsigned IID) { CriticalMemDep = IID; }
506 unsigned getCriticalMemDep() const { return CriticalMemDep; }
507513
508514 void cycleEvent();
509515 };
2727 class ExecuteStage final : public Stage {
2828 Scheduler &HWS;
2929
30 unsigned NumDispatchedOpcodes;
31 unsigned NumIssuedOpcodes;
32
33 // True if this stage should notify listeners of HWPressureEvents.
34 bool EnablePressureEvents;
35
3036 Error issueInstruction(InstRef &IR);
3137
3238 // Called at the beginning of each cycle to issue already dispatched
4046 ExecuteStage &operator=(const ExecuteStage &Other) = delete;
4147
4248 public:
43 ExecuteStage(Scheduler &S) : Stage(), HWS(S) {}
49 ExecuteStage(Scheduler &S) : ExecuteStage(S, false) {}
50 ExecuteStage(Scheduler &S, bool ShouldPerformBottleneckAnalysis)
51 : Stage(), HWS(S), NumDispatchedOpcodes(0), NumIssuedOpcodes(0),
52 EnablePressureEvents(ShouldPerformBottleneckAnalysis) {}
4453
4554 // This stage works under the assumption that the Pipeline will eventually
4655 // execute a retire stage. We don't need to check if pipelines and/or
5968 // Instructions that transitioned to the 'Executed' state are automatically
6069 // moved to the next stage (i.e. RetireStage).
6170 Error cycleStart() override;
71 Error cycleEnd() override;
6272 Error execute(InstRef &IR) override;
6373
6474 void notifyInstructionIssued(
4141 auto Fetch = llvm::make_unique<EntryStage>(SrcMgr);
4242 auto Dispatch = llvm::make_unique<DispatchStage>(STI, MRI, Opts.DispatchWidth,
4343 *RCU, *PRF);
44 auto Execute = llvm::make_unique<ExecuteStage>(*HWS);
44 auto Execute =
45 llvm::make_unique<ExecuteStage>(*HWS, Opts.EnableBottleneckAnalysis);
4546 auto Retire = llvm::make_unique<RetireStage>(*RCU, *PRF);
4647
4748 // Pass the ownership of all the hardware units to this Context.
182182 InstRef &IR = ReadySet[I];
183183 if (QueueIndex == ReadySet.size() ||
184184 Strategy->compare(IR, ReadySet[QueueIndex])) {
185 const InstrDesc &D = IR.getInstruction()->getDesc();
186 uint64_t BusyResourceMask = Resources->checkAvailability(D);
187 IR.getInstruction()->updateCriticalResourceMask(BusyResourceMask);
185 Instruction &IS = *IR.getInstruction();
186 uint64_t BusyResourceMask = Resources->checkAvailability(IS.getDesc());
187 IS.setCriticalResourceMask(BusyResourceMask);
188188 BusyResourceUnits |= BusyResourceMask;
189189 if (!BusyResourceMask)
190190 QueueIndex = I;
226226 IssuedSet.resize(IssuedSet.size() - RemovedElements);
227227 }
228228
229 uint64_t Scheduler::analyzeResourcePressure(SmallVectorImpl<InstRef> &Insts) {
230 Insts.insert(Insts.end(), ReadySet.begin(), ReadySet.end());
231 return BusyResourceUnits;
232 }
233
234 void Scheduler::analyzeDataDependencies(SmallVectorImpl<InstRef> &RegDeps,
235 SmallVectorImpl<InstRef> &MemDeps) {
236 const auto EndIt = PendingSet.end() - NumDispatchedToThePendingSet;
237 for (InstRef &IR : make_range(PendingSet.begin(), EndIt)) {
238 Instruction &IS = *IR.getInstruction();
239 if (Resources->checkAvailability(IS.getDesc()))
240 continue;
241
242 if (IS.isReady() ||
243 (IS.isMemOp() && LSU.isReady(IR) != IR.getSourceIndex())) {
244 MemDeps.emplace_back(IR);
245 } else {
246 RegDeps.emplace_back(IR);
247 }
248 }
249 }
250
229251 void Scheduler::cycleEvent(SmallVectorImpl &Freed,
230252 SmallVectorImpl &Executed,
231253 SmallVectorImpl &Ready) {
5353 SmallVector<std::pair<ResourceRef, ResourceCycles>, 4> Used;
5454 SmallVector<InstRef, 4> Ready;
5555 HWS.issueInstruction(IR, Used, Ready);
56 NumIssuedOpcodes += IR.getInstruction()->getDesc().NumMicroOps;
5657
5758 notifyReservedOrReleasedBuffers(IR, /* Reserved */ false);
5859
8889 SmallVector<InstRef, 4> Ready;
8990
9091 HWS.cycleEvent(Freed, Executed, Ready);
92 NumDispatchedOpcodes = 0;
93 NumIssuedOpcodes = 0;
9194
9295 for (const ResourceRef &RR : Freed)
9396 notifyResourceAvailable(RR);
103106 notifyInstructionReady(IR);
104107
105108 return issueReadyInstructions();
109 }
110
111 Error ExecuteStage::cycleEnd() {
112 if (!EnablePressureEvents)
113 return ErrorSuccess();
114
115 // Always conservatively report any backpressure events if the dispatch logic
116 // was stalled due to unavailable scheduler resources.
117 if (!HWS.hadTokenStall() && NumDispatchedOpcodes <= NumIssuedOpcodes)
118 return ErrorSuccess();
119
120 SmallVector<InstRef, 8> Insts;
121 uint64_t Mask = HWS.analyzeResourcePressure(Insts);
122 if (Mask) {
123 LLVM_DEBUG(dbgs() << "[E] Backpressure increased because of unavailable "
124 "pipeline resources: "
125 << format_hex(Mask, 16) << '\n');
126 HWPressureEvent Ev(HWPressureEvent::RESOURCES, Insts, Mask);
127 notifyEvent(Ev);
128 return ErrorSuccess();
129 }
130
131 SmallVector<InstRef, 8> RegDeps;
132 SmallVector<InstRef, 8> MemDeps;
133 HWS.analyzeDataDependencies(RegDeps, MemDeps);
134 if (RegDeps.size()) {
135 LLVM_DEBUG(
136 dbgs() << "[E] Backpressure increased by register dependencies\n");
137 HWPressureEvent Ev(HWPressureEvent::REGISTER_DEPS, RegDeps);
138 notifyEvent(Ev);
139 }
140
141 if (MemDeps.size()) {
142 LLVM_DEBUG(dbgs() << "[E] Backpressure increased by memory dependencies\n");
143 HWPressureEvent Ev(HWPressureEvent::MEMORY_DEPS, MemDeps);
144 notifyEvent(Ev);
145 }
146
147 return ErrorSuccess();
106148 }
107149
108150 #ifndef NDEBUG
146188 // be released after MCIS is issued, and all the ResourceCycles for those
147189 // units have been consumed.
148190 bool IsReadyInstruction = HWS.dispatch(IR);
191 NumDispatchedOpcodes += IR.getInstruction()->getDesc().NumMicroOps;
149192 notifyReservedOrReleasedBuffers(IR, /* Reserved */ true);
150193 if (!IsReadyInstruction)
151194 return ErrorSuccess();
0 # NOTE: Assertions have been autogenerated by utils/update_mca_test_checks.py
1 # RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -timeline -timeline-max-iterations=1 -bottleneck-analysis < %s | FileCheck %s
2
3 add %eax, %ebx
4 add %ebx, %ecx
5 add %ecx, %edx
6 add %edx, %eax
7
8 # CHECK: Iterations: 100
9 # CHECK-NEXT: Instructions: 400
10 # CHECK-NEXT: Total Cycles: 403
11 # CHECK-NEXT: Total uOps: 400
12
13 # CHECK: Dispatch Width: 2
14 # CHECK-NEXT: uOps Per Cycle: 0.99
15 # CHECK-NEXT: IPC: 0.99
16 # CHECK-NEXT: Block RThroughput: 2.0
17
18 # CHECK: Cycles with backend pressure increase [ 94.04% ]
19 # CHECK-NEXT: Throughput Bottlenecks:
20 # CHECK-NEXT: Resource Pressure [ 0.00% ]
21 # CHECK-NEXT: Data Dependencies: [ 94.04% ]
22 # CHECK-NEXT: - Register Dependencies [ 94.04% ]
23 # CHECK-NEXT: - Memory Dependencies [ 0.00% ]
24
25 # CHECK: Instruction Info:
26 # CHECK-NEXT: [1]: #uOps
27 # CHECK-NEXT: [2]: Latency
28 # CHECK-NEXT: [3]: RThroughput
29 # CHECK-NEXT: [4]: MayLoad
30 # CHECK-NEXT: [5]: MayStore
31 # CHECK-NEXT: [6]: HasSideEffects (U)
32
33 # CHECK: [1] [2] [3] [4] [5] [6] Instructions:
34 # CHECK-NEXT: 1 1 0.50 addl %eax, %ebx
35 # CHECK-NEXT: 1 1 0.50 addl %ebx, %ecx
36 # CHECK-NEXT: 1 1 0.50 addl %ecx, %edx
37 # CHECK-NEXT: 1 1 0.50 addl %edx, %eax
38
39 # CHECK: Resources:
40 # CHECK-NEXT: [0] - JALU0
41 # CHECK-NEXT: [1] - JALU1
42 # CHECK-NEXT: [2] - JDiv
43 # CHECK-NEXT: [3] - JFPA
44 # CHECK-NEXT: [4] - JFPM
45 # CHECK-NEXT: [5] - JFPU0
46 # CHECK-NEXT: [6] - JFPU1
47 # CHECK-NEXT: [7] - JLAGU
48 # CHECK-NEXT: [8] - JMul
49 # CHECK-NEXT: [9] - JSAGU
50 # CHECK-NEXT: [10] - JSTC
51 # CHECK-NEXT: [11] - JVALU0
52 # CHECK-NEXT: [12] - JVALU1
53 # CHECK-NEXT: [13] - JVIMUL
54
55 # CHECK: Resource pressure per iteration:
56 # CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13]
57 # CHECK-NEXT: 2.00 2.00 - - - - - - - - - - - -
58
59 # CHECK: Resource pressure by instruction:
60 # CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] Instructions:
61 # CHECK-NEXT: - 1.00 - - - - - - - - - - - - addl %eax, %ebx
62 # CHECK-NEXT: 1.00 - - - - - - - - - - - - - addl %ebx, %ecx
63 # CHECK-NEXT: - 1.00 - - - - - - - - - - - - addl %ecx, %edx
64 # CHECK-NEXT: 1.00 - - - - - - - - - - - - - addl %edx, %eax
65
66 # CHECK: Timeline view:
67 # CHECK-NEXT: Index 0123456
68
69 # CHECK: [0,0] DeER .. addl %eax, %ebx
70 # CHECK-NEXT: [0,1] D=eER.. addl %ebx, %ecx
71 # CHECK-NEXT: [0,2] .D=eER. addl %ecx, %edx
72 # CHECK-NEXT: [0,3] .D==eER addl %edx, %eax
73
74 # CHECK: Average Wait times (based on the timeline view):
75 # CHECK-NEXT: [0]: Executions
76 # CHECK-NEXT: [1]: Average time spent waiting in a scheduler's queue
77 # CHECK-NEXT: [2]: Average time spent waiting in a scheduler's queue while ready
78 # CHECK-NEXT: [3]: Average time elapsed from WB until retire stage
79
80 # CHECK: [0] [1] [2] [3]
81 # CHECK-NEXT: 0. 1 1.0 1.0 0.0 addl %eax, %ebx
82 # CHECK-NEXT: 1. 1 2.0 0.0 0.0 addl %ebx, %ecx
83 # CHECK-NEXT: 2. 1 2.0 0.0 0.0 addl %ecx, %edx
84 # CHECK-NEXT: 3. 1 3.0 0.0 0.0 addl %edx, %eax
0 # NOTE: Assertions have been autogenerated by utils/update_mca_test_checks.py
1 # RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=100 -timeline -timeline-max-iterations=1 -bottleneck-analysis < %s | FileCheck %s
2
3 vhaddps %xmm0, %xmm0, %xmm1
4
5 # CHECK: Iterations: 100
6 # CHECK-NEXT: Instructions: 100
7 # CHECK-NEXT: Total Cycles: 106
8 # CHECK-NEXT: Total uOps: 100
9
10 # CHECK: Dispatch Width: 2
11 # CHECK-NEXT: uOps Per Cycle: 0.94
12 # CHECK-NEXT: IPC: 0.94
13 # CHECK-NEXT: Block RThroughput: 1.0
14
15 # CHECK: Cycles with backend pressure increase [ 76.42% ]
16 # CHECK-NEXT: Throughput Bottlenecks:
17 # CHECK-NEXT: Resource Pressure [ 76.42% ]
18 # CHECK-NEXT: - JFPA [ 76.42% ]
19 # CHECK-NEXT: - JFPU0 [ 76.42% ]
20 # CHECK-NEXT: Data Dependencies: [ 0.00% ]
21 # CHECK-NEXT: - Register Dependencies [ 0.00% ]
22 # CHECK-NEXT: - Memory Dependencies [ 0.00% ]
23
24 # CHECK: Instruction Info:
25 # CHECK-NEXT: [1]: #uOps
26 # CHECK-NEXT: [2]: Latency
27 # CHECK-NEXT: [3]: RThroughput
28 # CHECK-NEXT: [4]: MayLoad
29 # CHECK-NEXT: [5]: MayStore
30 # CHECK-NEXT: [6]: HasSideEffects (U)
31
32 # CHECK: [1] [2] [3] [4] [5] [6] Instructions:
33 # CHECK-NEXT: 1 4 1.00 vhaddps %xmm0, %xmm0, %xmm1
34
35 # CHECK: Resources:
36 # CHECK-NEXT: [0] - JALU0
37 # CHECK-NEXT: [1] - JALU1
38 # CHECK-NEXT: [2] - JDiv
39 # CHECK-NEXT: [3] - JFPA
40 # CHECK-NEXT: [4] - JFPM
41 # CHECK-NEXT: [5] - JFPU0
42 # CHECK-NEXT: [6] - JFPU1
43 # CHECK-NEXT: [7] - JLAGU
44 # CHECK-NEXT: [8] - JMul
45 # CHECK-NEXT: [9] - JSAGU
46 # CHECK-NEXT: [10] - JSTC
47 # CHECK-NEXT: [11] - JVALU0
48 # CHECK-NEXT: [12] - JVALU1
49 # CHECK-NEXT: [13] - JVIMUL
50
51 # CHECK: Resource pressure per iteration:
52 # CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13]
53 # CHECK-NEXT: - - - 1.00 - 1.00 - - - - - - - -
54
55 # CHECK: Resource pressure by instruction:
56 # CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] Instructions:
57 # CHECK-NEXT: - - - 1.00 - 1.00 - - - - - - - - vhaddps %xmm0, %xmm0, %xmm1
58
59 # CHECK: Timeline view:
60 # CHECK-NEXT: Index 0123456
61
62 # CHECK: [0,0] DeeeeER vhaddps %xmm0, %xmm0, %xmm1
63
64 # CHECK: Average Wait times (based on the timeline view):
65 # CHECK-NEXT: [0]: Executions
66 # CHECK-NEXT: [1]: Average time spent waiting in a scheduler's queue
67 # CHECK-NEXT: [2]: Average time spent waiting in a scheduler's queue while ready
68 # CHECK-NEXT: [3]: Average time elapsed from WB until retire stage
69
70 # CHECK: [0] [1] [2] [3]
71 # CHECK-NEXT: 0. 1 1.0 1.0 0.0 vhaddps %xmm0, %xmm0, %xmm1
0 # NOTE: Assertions have been autogenerated by utils/update_mca_test_checks.py
1 # RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=1500 -noalias=false -timeline -timeline-max-iterations=1 -bottleneck-analysis < %s | FileCheck %s
2
3 vmovaps (%rsi), %xmm0
4 vmovaps %xmm0, (%rdi)
5 vmovaps 16(%rsi), %xmm0
6 vmovaps %xmm0, 16(%rdi)
7 vmovaps 32(%rsi), %xmm0
8 vmovaps %xmm0, 32(%rdi)
9 vmovaps 48(%rsi), %xmm0
10 vmovaps %xmm0, 48(%rdi)
11
12 # CHECK: Iterations: 1500
13 # CHECK-NEXT: Instructions: 12000
14 # CHECK-NEXT: Total Cycles: 36003
15 # CHECK-NEXT: Total uOps: 12000
16
17 # CHECK: Dispatch Width: 2
18 # CHECK-NEXT: uOps Per Cycle: 0.33
19 # CHECK-NEXT: IPC: 0.33
20 # CHECK-NEXT: Block RThroughput: 4.0
21
22 # CHECK: Cycles with backend pressure increase [ 99.89% ]
23 # CHECK-NEXT: Throughput Bottlenecks:
24 # CHECK-NEXT: Resource Pressure [ 0.00% ]
25 # CHECK-NEXT: Data Dependencies: [ 99.89% ]
26 # CHECK-NEXT: - Register Dependencies [ 0.00% ]
27 # CHECK-NEXT: - Memory Dependencies [ 99.89% ]
28
29 # CHECK: Instruction Info:
30 # CHECK-NEXT: [1]: #uOps
31 # CHECK-NEXT: [2]: Latency
32 # CHECK-NEXT: [3]: RThroughput
33 # CHECK-NEXT: [4]: MayLoad
34 # CHECK-NEXT: [5]: MayStore
35 # CHECK-NEXT: [6]: HasSideEffects (U)
36
37 # CHECK: [1] [2] [3] [4] [5] [6] Instructions:
38 # CHECK-NEXT: 1 5 1.00 * vmovaps (%rsi), %xmm0
39 # CHECK-NEXT: 1 1 1.00 * vmovaps %xmm0, (%rdi)
40 # CHECK-NEXT: 1 5 1.00 * vmovaps 16(%rsi), %xmm0
41 # CHECK-NEXT: 1 1 1.00 * vmovaps %xmm0, 16(%rdi)
42 # CHECK-NEXT: 1 5 1.00 * vmovaps 32(%rsi), %xmm0
43 # CHECK-NEXT: 1 1 1.00 * vmovaps %xmm0, 32(%rdi)
44 # CHECK-NEXT: 1 5 1.00 * vmovaps 48(%rsi), %xmm0
45 # CHECK-NEXT: 1 1 1.00 * vmovaps %xmm0, 48(%rdi)
46
47 # CHECK: Resources:
48 # CHECK-NEXT: [0] - JALU0
49 # CHECK-NEXT: [1] - JALU1
50 # CHECK-NEXT: [2] - JDiv
51 # CHECK-NEXT: [3] - JFPA
52 # CHECK-NEXT: [4] - JFPM
53 # CHECK-NEXT: [5] - JFPU0
54 # CHECK-NEXT: [6] - JFPU1
55 # CHECK-NEXT: [7] - JLAGU
56 # CHECK-NEXT: [8] - JMul
57 # CHECK-NEXT: [9] - JSAGU
58 # CHECK-NEXT: [10] - JSTC
59 # CHECK-NEXT: [11] - JVALU0
60 # CHECK-NEXT: [12] - JVALU1
61 # CHECK-NEXT: [13] - JVIMUL
62
63 # CHECK: Resource pressure per iteration:
64 # CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13]
65 # CHECK-NEXT: - - - 2.00 2.00 4.00 4.00 4.00 - 4.00 4.00 - - -
66
67 # CHECK: Resource pressure by instruction:
68 # CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] Instructions:
69 # CHECK-NEXT: - - - - 1.00 1.00 - 1.00 - - - - - - vmovaps (%rsi), %xmm0
70 # CHECK-NEXT: - - - - - - 1.00 - - 1.00 1.00 - - - vmovaps %xmm0, (%rdi)
71 # CHECK-NEXT: - - - 1.00 - 1.00 - 1.00 - - - - - - vmovaps 16(%rsi), %xmm0
72 # CHECK-NEXT: - - - - - - 1.00 - - 1.00 1.00 - - - vmovaps %xmm0, 16(%rdi)
73 # CHECK-NEXT: - - - - 1.00 1.00 - 1.00 - - - - - - vmovaps 32(%rsi), %xmm0
74 # CHECK-NEXT: - - - - - - 1.00 - - 1.00 1.00 - - - vmovaps %xmm0, 32(%rdi)
75 # CHECK-NEXT: - - - 1.00 - 1.00 - 1.00 - - - - - - vmovaps 48(%rsi), %xmm0
76 # CHECK-NEXT: - - - - - - 1.00 - - 1.00 1.00 - - - vmovaps %xmm0, 48(%rdi)
77
78 # CHECK: Timeline view:
79 # CHECK-NEXT: 0123456789
80 # CHECK-NEXT: Index 0123456789 0123456
81
82 # CHECK: [0,0] DeeeeeER . . . .. vmovaps (%rsi), %xmm0
83 # CHECK-NEXT: [0,1] D=====eER . . . .. vmovaps %xmm0, (%rdi)
84 # CHECK-NEXT: [0,2] .D=====eeeeeER . . .. vmovaps 16(%rsi), %xmm0
85 # CHECK-NEXT: [0,3] .D==========eER. . .. vmovaps %xmm0, 16(%rdi)
86 # CHECK-NEXT: [0,4] . D==========eeeeeER. .. vmovaps 32(%rsi), %xmm0
87 # CHECK-NEXT: [0,5] . D===============eER .. vmovaps %xmm0, 32(%rdi)
88 # CHECK-NEXT: [0,6] . D===============eeeeeER. vmovaps 48(%rsi), %xmm0
89 # CHECK-NEXT: [0,7] . D====================eER vmovaps %xmm0, 48(%rdi)
90
91 # CHECK: Average Wait times (based on the timeline view):
92 # CHECK-NEXT: [0]: Executions
93 # CHECK-NEXT: [1]: Average time spent waiting in a scheduler's queue
94 # CHECK-NEXT: [2]: Average time spent waiting in a scheduler's queue while ready
95 # CHECK-NEXT: [3]: Average time elapsed from WB until retire stage
96
97 # CHECK: [0] [1] [2] [3]
98 # CHECK-NEXT: 0. 1 1.0 1.0 0.0 vmovaps (%rsi), %xmm0
99 # CHECK-NEXT: 1. 1 6.0 0.0 0.0 vmovaps %xmm0, (%rdi)
100 # CHECK-NEXT: 2. 1 6.0 0.0 0.0 vmovaps 16(%rsi), %xmm0
101 # CHECK-NEXT: 3. 1 11.0 0.0 0.0 vmovaps %xmm0, 16(%rdi)
102 # CHECK-NEXT: 4. 1 11.0 0.0 0.0 vmovaps 32(%rsi), %xmm0
103 # CHECK-NEXT: 5. 1 16.0 0.0 0.0 vmovaps %xmm0, 32(%rdi)
104 # CHECK-NEXT: 6. 1 16.0 0.0 0.0 vmovaps 48(%rsi), %xmm0
105 # CHECK-NEXT: 7. 1 21.0 0.0 0.0 vmovaps %xmm0, 48(%rdi)
2424 SummaryView::SummaryView(const MCSchedModel &Model, ArrayRef<MCInst> S,
2525 unsigned Width)
2626 : SM(Model), Source(S), DispatchWidth(Width), LastInstructionIdx(0),
27 TotalCycles(0), NumMicroOps(0),
27 TotalCycles(0), NumMicroOps(0), BPI({0, 0, 0, 0}),
28 ResourcePressureDistribution(Model.getNumProcResourceKinds(), 0),
2829 ProcResourceUsage(Model.getNumProcResourceKinds(), 0),
2930 ProcResourceMasks(Model.getNumProcResourceKinds()),
30 ResIdx2ProcResID(Model.getNumProcResourceKinds(), 0) {
31 ResIdx2ProcResID(Model.getNumProcResourceKinds(), 0),
32 PressureIncreasedBecauseOfResources(false),
33 PressureIncreasedBecauseOfDataDependencies(false),
34 SeenStallCycles(false) {
3135 computeProcResourceMasks(SM, ProcResourceMasks);
3236 for (unsigned I = 1, E = SM.getNumProcResourceKinds(); I < E; ++I) {
3337 unsigned Index = getResourceStateIndex(ProcResourceMasks[I]);
6064 }
6165 }
6266
67 void SummaryView::onEvent(const HWPressureEvent &Event) {
68 assert(Event.Reason != HWPressureEvent::INVALID &&
69 "Unexpected invalid event!");
70
71 switch (Event.Reason) {
72 default:
73 break;
74
75 case HWPressureEvent::RESOURCES: {
76 PressureIncreasedBecauseOfResources = true;
77 ++BPI.ResourcePressureCycles;
78 uint64_t ResourceMask = Event.ResourceMask;
79 while (ResourceMask) {
80 uint64_t Current = ResourceMask & (-ResourceMask);
81 unsigned Index = getResourceStateIndex(Current);
82 unsigned ProcResID = ResIdx2ProcResID[Index];
83 const MCProcResourceDesc &PRDesc = *SM.getProcResource(ProcResID);
84 if (!PRDesc.SubUnitsIdxBegin) {
85 ResourcePressureDistribution[Index]++;
86 ResourceMask ^= Current;
87 continue;
88 }
89
90 for (unsigned I = 0, E = PRDesc.NumUnits; I < E; ++I) {
91 unsigned OtherProcResID = PRDesc.SubUnitsIdxBegin[I];
92 unsigned OtherMask = ProcResourceMasks[OtherProcResID];
93 ResourcePressureDistribution[getResourceStateIndex(OtherMask)]++;
94 }
95
96 ResourceMask ^= Current;
97 }
98 }
99
100 break;
101 case HWPressureEvent::REGISTER_DEPS:
102 PressureIncreasedBecauseOfDataDependencies = true;
103 ++BPI.RegisterDependencyCycles;
104 break;
105 case HWPressureEvent::MEMORY_DEPS:
106 PressureIncreasedBecauseOfDataDependencies = true;
107 ++BPI.MemoryDependencyCycles;
108 break;
109 }
110 }
111
112 void SummaryView::printBottleneckHints(raw_ostream &OS) const {
113 if (!SeenStallCycles || !BPI.PressureIncreaseCycles)
114 return;
115
116 double PressurePerCycle =
117 (double)BPI.PressureIncreaseCycles * 100 / TotalCycles;
118 double ResourcePressurePerCycle =
119 (double)BPI.ResourcePressureCycles * 100 / TotalCycles;
120 double DDPerCycle = (double)BPI.DataDependencyCycles * 100 / TotalCycles;
121 double RegDepPressurePerCycle =
122 (double)BPI.RegisterDependencyCycles * 100 / TotalCycles;
123 double MemDepPressurePerCycle =
124 (double)BPI.MemoryDependencyCycles * 100 / TotalCycles;
125
126 OS << "\nCycles with backend pressure increase [ "
127 << format("%.2f", floor((PressurePerCycle * 100) + 0.5) / 100) << "% ]";
128
129 OS << "\nThroughput Bottlenecks: "
130 << "\n Resource Pressure [ "
131 << format("%.2f", floor((ResourcePressurePerCycle * 100) + 0.5) / 100)
132 << "% ]";
133
134 if (BPI.PressureIncreaseCycles) {
135 for (unsigned I = 0, E = ResourcePressureDistribution.size(); I < E; ++I) {
136 if (ResourcePressureDistribution[I]) {
137 double Frequency =
138 (double)ResourcePressureDistribution[I] * 100 / TotalCycles;
139 unsigned Index = ResIdx2ProcResID[getResourceStateIndex(1ULL << I)];
140 const MCProcResourceDesc &PRDesc = *SM.getProcResource(Index);
141 OS << "\n - " << PRDesc.Name << " [ "
142 << format("%.2f", floor((Frequency * 100) + 0.5) / 100) << "% ]";
143 }
144 }
145 }
146
147 OS << "\n Data Dependencies: [ "
148 << format("%.2f", floor((DDPerCycle * 100) + 0.5) / 100) << "% ]";
149
150 OS << "\n - Register Dependencies [ "
151 << format("%.2f", floor((RegDepPressurePerCycle * 100) + 0.5) / 100)
152 << "% ]";
153
154 OS << "\n - Memory Dependencies [ "
155 << format("%.2f", floor((MemDepPressurePerCycle * 100) + 0.5) / 100)
156 << "% ]\n\n";
157 }
158
63159 void SummaryView::printView(raw_ostream &OS) const {
64160 unsigned Instructions = Source.size();
65161 unsigned Iterations = (LastInstructionIdx / Instructions) + 1;
84180 TempStream << "\nBlock RThroughput: "
85181 << format("%.1f", floor((BlockRThroughput * 10) + 0.5) / 10)
86182 << '\n';
183
184 printBottleneckHints(TempStream);
87185 TempStream.flush();
88186 OS << Buffer;
89187 }
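The percentages printed by printBottleneckHints are plain ratios over TotalCycles, rounded half-up to two decimal places. A worked example with illustrative counts (379 pressure-increase cycles out of 403 total cycles; 379 is an assumption, not a value reported by the tool):

```cpp
// Worked example with illustrative counts, mirroring the rounding idiom above.
double PressurePerCycle = 379 * 100.0 / 403;                   // 94.0447...
double Printed = floor((PressurePerCycle * 100) + 0.5) / 100;  // 94.04 -> "[ 94.04% ]"
```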
4444 unsigned TotalCycles;
4545 // The total number of micro opcodes contributed by a block of instructions.
4646 unsigned NumMicroOps;
47
48 struct BackPressureInfo {
49 // Cycles where backpressure increased.
50 unsigned PressureIncreaseCycles;
51 // Cycles where backpressure increased because of pipeline pressure.
52 unsigned ResourcePressureCycles;
53 // Cycles where backpressure increased because of data dependencies.
54 unsigned DataDependencyCycles;
55 // Cycles where backpressure increased because of register dependencies.
56 unsigned RegisterDependencyCycles;
57 // Cycles where backpressure increased because of memory dependencies.
58 unsigned MemoryDependencyCycles;
59 };
60 BackPressureInfo BPI;
61
62 // Resource pressure distribution. There is an element for every processor
63 // resource declared by the scheduling model. Quantities are number of cycles.
64 llvm::SmallVector<unsigned, 8> ResourcePressureDistribution;
65
4766 // For each processor resource, this vector stores the cumulative number of
4867 // resource cycles consumed by the analyzed code block.
4968 llvm::SmallVector<unsigned, 8> ProcResourceUsage;
5776 // Used to map resource indices to actual processor resource IDs.
5877 llvm::SmallVector<unsigned, 8> ResIdx2ProcResID;
5978
79 // True if resource pressure events were notified during this cycle.
80 bool PressureIncreasedBecauseOfResources;
81 bool PressureIncreasedBecauseOfDataDependencies;
82
83 // True if throughput was affected by dispatch stalls.
84 bool SeenStallCycles;
85
6086 // Compute the reciprocal throughput for the analyzed code block.
6187 // The reciprocal block throughput is computed as the MAX between:
6288 // - NumMicroOps / DispatchWidth
6389 // - Total Resource Cycles / #Units (for every resource consumed).
6490 double getBlockRThroughput() const;
6591
92 // Prints a bottleneck message to OS.
93 void printBottleneckHints(llvm::raw_ostream &OS) const;
94
6695 public:
6796 SummaryView(const llvm::MCSchedModel &Model, llvm::ArrayRef<llvm::MCInst> S,
6897 unsigned Width);
6998
70 void onCycleEnd() override { ++TotalCycles; }
99 void onCycleEnd() override {
100 ++TotalCycles;
101 if (PressureIncreasedBecauseOfResources ||
102 PressureIncreasedBecauseOfDataDependencies) {
103 ++BPI.PressureIncreaseCycles;
104 if (PressureIncreasedBecauseOfDataDependencies)
105 ++BPI.DataDependencyCycles;
106 PressureIncreasedBecauseOfResources = false;
107 PressureIncreasedBecauseOfDataDependencies = false;
108 }
109 }
71110 void onEvent(const HWInstructionEvent &Event) override;
111 void onEvent(const HWStallEvent &Event) override {
112 SeenStallCycles = true;
113 }
114
115 void onEvent(const HWPressureEvent &Event) override;
72116
73117 void printView(llvm::raw_ostream &OS) const override;
74118 };
174174 cl::desc("Print all views including hardware statistics"),
175175 cl::cat(ViewOptions), cl::init(false));
176176
177 static cl::opt<bool> EnableBottleneckAnalysis(
178 "bottleneck-analysis",
179 cl::desc("Enable bottleneck analysis (disabled by default)"),
180 cl::cat(ViewOptions), cl::init(false));
181
177182 namespace {
178183
179184 const Target *getTarget(const char *ProgName) {
386391 mca::Context MCA(*MRI, *STI);
387392
388393 mca::PipelineOptions PO(Width, RegisterFileSize, LoadQueueSize,
389 StoreQueueSize, AssumeNoAlias);
394 StoreQueueSize, AssumeNoAlias,
395 EnableBottleneckAnalysis);
390396
391397 // Number each region in the sequence.
392398 unsigned RegionIdx = 0;