llvm.org GIT mirror llvm / f818386
Here are the notes from our Reoptimizer meetings. git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@6923 91177308-0d34-0410-b5e6-96231b3b80d8 Brian Gaeke 16 years ago
2 changed file(s) with 247 addition(s) and 0 deletion(s).
0 Wed Jun 25 15:13:51 CDT 2003
1
2 First-level instrumentation
3 ---------------------------
4
5 We use opt to do bytecode-to-bytecode instrumentation. We look at
6 back-edges and insert a call to llvm_first_trigger(), a function that
7 takes no arguments and returns no value. This instrumentation is
8 designed to be easy to remove, for instance by writing a NOP over the
9 function call instruction.
10
11 Keep count of every call to llvm_first_trigger(), and maintain
12 counters in a map indexed by return address. If the trigger count
13 exceeds a threshold, we identify a hot loop and perform second-level
14 instrumentation on the hot loop region (the instructions between the
15 target of the back-edge and the branch that causes the back-edge). We
16 do not move code across basic-block boundaries.
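
A minimal C++ sketch of the counting scheme described above. The class
name TriggerCounter, its methods, and the threshold value are hypothetical
and are not taken from the Reoptimizer sources; only the idea of a map
indexed by return address comes from these notes.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical first-level bookkeeping (names and threshold are assumptions).
class TriggerCounter {
public:
  explicit TriggerCounter(std::uint64_t threshold) : Threshold(threshold) {}

  // Called once per llvm_first_trigger() hit. Returns true exactly once per
  // site: on the call that first pushes the count past the threshold.
  bool recordTrigger(std::uintptr_t returnAddress) {
    std::uint64_t &count = Counts[returnAddress];
    ++count;
    return count == Threshold + 1;
  }

  // Number of triggers seen so far for this return address.
  std::uint64_t countFor(std::uintptr_t returnAddress) const {
    auto it = Counts.find(returnAddress);
    return it == Counts.end() ? 0 : it->second;
  }

private:
  std::uint64_t Threshold;
  std::unordered_map<std::uintptr_t, std::uint64_t> Counts;
};
```

When recordTrigger() returns true, the caller would treat the back-edge at
that return address as the start of a hot loop and move on to second-level
instrumentation.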
17
18
19 Second-level instrumentation
20 ----------------------------
21
22 We remove the first-level instrumentation by overwriting the CALL to
23 llvm_first_trigger() with a NOP.
24
25 The reoptimizer maintains a map between machine-code basic blocks and
26 LLVM BasicBlock*s. We only keep track of paths that start at the
27 first machine-code basic block of the hot loop region.
28
29 How do we keep track of which edges to instrument, and which edges are
30 exits from the hot region? We use a 3-step process:
31
32 1) Do a DFS from the first machine-code basic block of the hot loop
33 region and mark reachable edges.
34
35 2) Do a DFS from the last machine-code basic block of the hot loop
36 region IGNORING back edges, and mark the edges which are reachable in
37 1) and also in 2) (i.e., must be reachable from both the start BB and
38 the end BB of the hot region).
39
40 3) Mark BBs which end in edges that exit the hot region; we need to
41 instrument these differently.
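
The three steps above can be sketched in C++ as follows. This is one
possible reading, not the Reoptimizer's code: it assumes that step 2 walks
predecessor edges backwards from the last block, that the only back edge to
ignore is the one from the last block to the first, and that blocks are
small integers. All type and function names are hypothetical.

```cpp
#include <set>
#include <utility>
#include <vector>

using Block = int;
using Edge = std::pair<Block, Block>;
using Graph = std::vector<std::vector<Block>>; // neighbours of each block

struct HotRegion {
  std::set<Edge> edges;       // edges inside the hot region
  std::set<Block> exitBlocks; // blocks ending in an edge that leaves it
};

// Depth-first search that never follows the edge `skip`.
static void dfs(const Graph &g, Block b, const Edge &skip,
                std::set<Block> &seen) {
  if (!seen.insert(b).second) return;
  for (Block next : g[b])
    if (Edge(b, next) != skip) dfs(g, next, skip, seen);
}

HotRegion markHotRegion(const Graph &succs, Block first, Block last) {
  const Edge backEdge(last, first);

  // Predecessor lists, needed for the backward walk in step 2.
  Graph preds(succs.size());
  for (Block b = 0; b < static_cast<Block>(succs.size()); ++b)
    for (Block s : succs[b]) preds[s].push_back(b);

  std::set<Block> fromFirst, toLast;
  dfs(succs, first, Edge(-1, -1), fromFirst); // step 1: forward from first
  // Step 2 (assumed backward). In the reversed graph the back edge shows up
  // as first -> last, so that is the edge to skip.
  dfs(preds, last, Edge(first, last), toLast);

  HotRegion r;
  for (Block b : fromFirst) {
    if (!toLast.count(b)) continue; // block is outside the region
    for (Block s : succs[b]) {
      if (Edge(b, s) == backEdge) continue;
      if (toLast.count(s)) r.edges.insert(Edge(b, s));
      else r.exitBlocks.insert(b);  // step 3: edge exits the region
    }
  }
  return r;
}
```

For a loop 0 -> {1, 2} -> 3 -> 0 with extra edges 2 -> 4 and 3 -> 4, this
marks the four edges between blocks 0..3 as region edges and reports blocks
2 and 3 as exit blocks.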
42
43 Assume that there is 1 free register. On SPARC we use %g1, which LLC
44 has agreed not to use. Shift a 1 into it at the beginning. At every
45 edge which corresponds to a conditional branch, we shift 0 for not
46 taken and 1 for taken into that register. This uniquely numbers the
47 paths through the hot region. Silently fail if we need more than 64 bits.
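
A small C++ model of this numbering, for illustration only. The function
name encodePath and its signature are hypothetical; a 64-bit integer stands
in for %g1. The leading 1 is what keeps paths of different lengths apart
(for example, "not taken" is 0b10 while "not taken, not taken" is 0b100).

```cpp
#include <cstdint>
#include <vector>

// Computes the path number for a sequence of conditional-branch outcomes
// (true = taken). Returns false, leaving `out` untouched, when the path
// would need more than 64 bits. Hypothetical helper, not Reoptimizer code.
bool encodePath(const std::vector<bool> &outcomes, std::uint64_t &out) {
  std::uint64_t reg = 1; // the initial 1 shifted in at the start
  for (bool taken : outcomes) {
    if (reg >> 63) return false; // another shift would lose the leading 1
    reg = (reg << 1) | (taken ? 1u : 0u);
  }
  out = reg;
  return true;
}
```

With the leading 1 occupying one bit, this model can record at most 63
branch outcomes before it reports failure.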
48
49 At the end BB we call countPath and increment the counter based on %g1
50 and the return address of the countPath call. We keep track of the
51 number of iterations and the number of paths. We only run this
52 version 30 or 40 times.
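
A hedged C++ sketch of the bookkeeping behind countPath. The class
PathProfile and its members are hypothetical names; the only details taken
from the notes are that counters are keyed by the value of %g1 together
with the return address, and that iterations and paths are both tracked.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical stand-in for the state updated by countPath.
class PathProfile {
public:
  // pathBits models the value of %g1 when the end BB is reached.
  void countPath(std::uint64_t pathBits, std::uintptr_t returnAddress) {
    ++Counts[std::make_pair(pathBits, returnAddress)];
    ++Iterations;
  }

  std::uint64_t iterations() const { return Iterations; }
  std::size_t distinctPaths() const { return Counts.size(); }

  // Executions recorded for one (path, return address) pair.
  std::uint64_t countFor(std::uint64_t pathBits,
                         std::uintptr_t returnAddress) const {
    auto it = Counts.find(std::make_pair(pathBits, returnAddress));
    return it == Counts.end() ? 0 : it->second;
  }

private:
  std::map<std::pair<std::uint64_t, std::uintptr_t>, std::uint64_t> Counts;
  std::uint64_t Iterations = 0;
};
```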
53
54 Find the BBs that total 90% or more of execution, and aggregate them
55 together to form our trace. But we do not allow more than 5 paths; if
56 we have more than 5 we take the ones that are executed the most. We
57 verify our assumption that we picked a hot back-edge in first-level
58 instrumentation, by making sure that the number of times we took an
59 exit edge from the hot trace is less than 10% of the number of
60 iterations.
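
The selection rule above can be sketched in C++ like this. It works on
per-path counts rather than per-BB counts, which is an assumption on our
part; the function names and the tie-breaking rule are also hypothetical.
The 90%, 5-path, and 10% figures are the ones quoted in these notes.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using PathCount = std::pair<std::uint64_t, std::uint64_t>; // (path, executions)

// Picks the hottest paths until they cover coveragePercent of all
// executions, but never more than maxPaths of them.
std::vector<std::uint64_t> selectHotPaths(std::vector<PathCount> counts,
                                          std::size_t maxPaths = 5,
                                          unsigned coveragePercent = 90) {
  std::uint64_t total = 0;
  for (const PathCount &pc : counts) total += pc.second;

  // Hottest first; ties broken by path number so the result is deterministic.
  std::sort(counts.begin(), counts.end(),
            [](const PathCount &a, const PathCount &b) {
              if (a.second != b.second) return a.second > b.second;
              return a.first < b.first;
            });

  std::vector<std::uint64_t> chosen;
  std::uint64_t covered = 0;
  for (const PathCount &pc : counts) {
    if (chosen.size() == maxPaths) break;
    if (covered * 100 >= total * coveragePercent) break;
    chosen.push_back(pc.first);
    covered += pc.second;
  }
  return chosen;
}

// The sanity check on the first-level choice: exits must stay below 10%
// of the iterations.
bool backEdgeLooksHot(std::uint64_t exitsTaken, std::uint64_t iterations) {
  return exitsTaken * 10 < iterations;
}
```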
61
62 LLC has been taught to recognize llvm_first_trigger() calls and NOT
63 generate saves and restores of caller-saved registers around these
64 calls.
65
66
67 Phase behavior
68 --------------
69
70 We turn off llvm_first_trigger() calls with NOPs, but this would hide
71 phase behavior from us (when some functions/traces stop being hot and
72 others become hot).
73
74 We have a SIGALRM timer that counts time for us. Every time we get a
75 SIGALRM we look at our priority queue of locations where we have
76 removed llvm_first_trigger() calls. Each location is inserted along
77 with a time when we will next turn instrumentation back on for that
78 call site. If the time has arrived for a particular call site, we pop
79 that off the priority queue and turn instrumentation back on for that
80 call site.
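
A C++ sketch of the re-enable queue, under stated assumptions: time is
modelled as an integer tick count, the names DisabledSite, LaterFirst, and
ReenableQueue are hypothetical, and the sketch ignores the restrictions on
what may safely run inside a real signal handler.

```cpp
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

struct DisabledSite {
  std::uint64_t reenableAt; // tick at which instrumentation comes back
  std::uintptr_t callSite;  // address of the NOP-ed llvm_first_trigger() call
};

// Orders the heap so that the earliest reenableAt is on top.
struct LaterFirst {
  bool operator()(const DisabledSite &a, const DisabledSite &b) const {
    return a.reenableAt > b.reenableAt;
  }
};

class ReenableQueue {
public:
  // Records a call site whose trigger call has just been removed.
  void disable(std::uintptr_t callSite, std::uint64_t reenableAt) {
    Queue.push(DisabledSite{reenableAt, callSite});
  }

  // Models the work done on each SIGALRM: pops and returns every call
  // site whose re-enable time has arrived.
  std::vector<std::uintptr_t> onAlarm(std::uint64_t now) {
    std::vector<std::uintptr_t> due;
    while (!Queue.empty() && Queue.top().reenableAt <= now) {
      due.push_back(Queue.top().callSite);
      Queue.pop();
    }
    return due;
  }

  std::size_t pending() const { return Queue.size(); }

private:
  std::priority_queue<DisabledSite, std::vector<DisabledSite>, LaterFirst>
      Queue;
};
```

The caller would write the original CALL back over the NOP for each address
that onAlarm() returns.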
81
82
83 Generating traces
84 -----------------
85
86 When we finally generate an optimized trace we first copy the code
87 into the trace cache. This leaves us with 3 copies of the code: the
88 original code, the instrumented code, and the optimized trace. The
89 optimized trace does not have instrumentation. The original code and
90 the instrumented code are modified to have a branch to the trace
91 cache, where the optimized traces are kept.
92
93 We copy the code from the original to the instrumentation version
94 by tracing the LLVM-to-Machine code basic block map and then copying
95 each machine code basic block we think is in the hot region into the
96 trace cache. Then we instrument that code. The process is similar for
97 generating the final optimized trace; we copy the same basic blocks
98 because we might need to put in fixup code for exit BBs.
99
100 LLVM basic blocks are not typically used in the Reoptimizer except
101 for the mapping information.
102
103 We are restricted to using single instructions to branch between the
104 original code, trace, and instrumented code. So we have to keep the
105 code copies in memory near the original code (they can't be far enough
106 away that a single pc-relative branch would not work.) Space from
107 malloc() or the data region is too far away. This impacts the design
108 of the trace cache.
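
To make the distance limit concrete, here is a C++ check for one SPARC
branch format. The notes do not say which branch instruction the
Reoptimizer uses; this sketch assumes the 22-bit word displacement of the
Bicc family (such as "ba"), which reaches roughly 8 MB in either direction.
The function name is hypothetical.

```cpp
#include <cstdint>

// True if a branch with a signed 22-bit word displacement located at
// `from` can reach `to`. Assumes the SPARC disp22 format; other branch
// formats have different ranges.
bool reachableByDisp22(std::uint64_t from, std::uint64_t to) {
  const std::int64_t delta =
      static_cast<std::int64_t>(to) - static_cast<std::int64_t>(from);
  if (delta % 4 != 0) return false; // instructions are word-aligned
  const std::int64_t words = delta / 4;
  const std::int64_t limit = std::int64_t(1) << 21; // 2^21 words = 8 MB
  return words >= -limit && words < limit;
}
```

A trace cache placed so that every patched branch passes a check of this
kind can be entered and left with a single instruction.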
109
110 We use a dummy function that is full of a bunch of for loops which we
111 overwrite with trace-cache code. The trace manager keeps track of
112 whether or not we have enough space in the trace cache, etc.
113
114 The trace insertion routine takes an original start address, a vector
115 of machine instructions representing the trace, an index of branches
116 and their corresponding absolute targets, and an index of calls and
117 their corresponding absolute targets.
118
119 The trace insertion routine is responsible for inserting branches from
120 the beginning of the original code to the beginning of the optimized
121 trace. This is because at some point the trace cache may run out of
122 space and it may have to evict a trace, at which point the branch to
123 the trace would also have to be removed. The cache uses a round-robin
124 replacement policy; we have found that this is almost as good as LRU
125 and better than random (especially because of problems fitting the new
126 trace in.)
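
A C++ sketch of round-robin replacement in a single contiguous buffer.
Only the policy name comes from these notes; the byte-offset layout, the
wrap-around rule, and the names ResidentTrace and RoundRobinTraceCache are
assumptions made for illustration. The sketch tracks placement only and
does not copy any code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct ResidentTrace {
  std::size_t offset, size;     // position inside the cache buffer
  std::uintptr_t originalStart; // original code that branches to this trace
};

class RoundRobinTraceCache {
public:
  explicit RoundRobinTraceCache(std::size_t capacity) : Capacity(capacity) {}

  // Reserves `size` bytes at the cursor, wrapping to the start when the
  // trace does not fit in the remaining space. Returns false if the trace
  // can never fit. `evicted` receives the original start addresses whose
  // branches into the cache must now be removed.
  bool insert(std::uintptr_t originalStart, std::size_t size,
              std::vector<std::uintptr_t> &evicted) {
    evicted.clear();
    if (size == 0 || size > Capacity) return false;
    if (Cursor + size > Capacity) Cursor = 0; // traces must be contiguous
    const std::size_t begin = Cursor, end = Cursor + size;

    std::vector<ResidentTrace> kept;
    for (const ResidentTrace &t : Traces) {
      const bool overlaps = t.offset < end && begin < t.offset + t.size;
      if (overlaps) evicted.push_back(t.originalStart);
      else kept.push_back(t);
    }
    kept.push_back(ResidentTrace{begin, size, originalStart});
    Traces.swap(kept);
    Cursor = end;
    return true;
  }

  std::size_t residentCount() const { return Traces.size(); }

private:
  std::size_t Capacity;
  std::size_t Cursor = 0;
  std::vector<ResidentTrace> Traces;
};
```

Because the cursor only moves forward, the traces overwritten first are
always the oldest ones in that part of the buffer, which is what makes the
policy round-robin.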
127
128 We cannot deal with discontiguous trace cache areas. The trace cache
129 is supposed to be cache-line-aligned, but it is not page-aligned.
130
131 We generate instrumentation traces and optimized traces into separate
132 trace caches. We keep the instrumented code around because you don't
133 want to delete a trace when you still might have to return to it
134 (i.e., return from an llvm_first_trigger() or countPath() call.)
135
136
0 Thu Jun 26 14:43:04 CDT 2003
1
2 Information about BinInterface
3 ------------------------------
4
5 BinInterface takes in a set of instructions with some particular register
6 allocation. It allows you to add, modify, or delete some instructions,
7 in SSA form (kind of like LLVM's MachineInstrs.) Then re-allocate
8 registers. It assumes that the transformations you are doing are safe.
9 It does not update the mapping information or the LLVM representation
10 for the modified trace (so it would not, for instance, support
11 multiple optimization passes; passes have to be aware of and update
12 manually the mapping information.)
13
14 To use it, you provide the original code to BinInterface, then
15 apply your optimizations to it, and finally put the result in the
16 trace cache.
17
18 The BinInterface tries to find live-outs for traces so that it can do
19 register allocation on just the trace, and stitch the trace back into
20 the original code. It has to preserve the live-ins and live-outs when
21 it does its register allocation. (On exits from the trace we have
22 epilogues that copy live-outs back into the right registers, but
23 live-ins have to be in the right registers.)
24
25
26 Limitations of BinInterface
27 ---------------------------
28
29 It does copy insertions for PHIs, which it infers from the machine
30 code. The mapping info inserted by LLC is not sufficient to determine
31 the PHIs.
32
33 It does not handle integer or floating-point condition codes and it
34 does not handle floating-point register allocation.
35
36 It cannot make aggressive use of a large number of registers.
37
38 There is a problem with alloca: we cannot find our spill space for
39 spilling registers, normally allocated on the stack, if the trace
40 follows an alloca(). What might be an acceptable solution would be to
41 disable trace generation on functions that have variable-sized
42 alloca()s. Variable-sized allocas in the trace would also probably
43 screw things up.
44
45 Because of the FP and alloca limitations, the BinInterface is
46 completely disabled right now.
47
48
49 Demo
50 ----
51
52 This is a demo of the Ball & Larus version that does NOT use 2-level
53 profiling.
54
55 1. Compile program with llvm-gcc.
56 2. Run opt -lowerswitch -paths -emitfuncs on the bytecode.
57 -lowerswitch changes switch statements to branches
58 -paths Ball & Larus path-profiling algorithm
59 -emitfuncs emit the table of functions
60 3. Run llc to generate SPARC assembly code for the result of step 2.
61 4. Use g++ to link the (instrumented) assembly code.
62
63 We use a script to do all this:
64 ------------------------------------------------------------------------------
65 #!/bin/sh
66 llvm-gcc $1.c -o $1
67 opt -lowerswitch -paths -emitfuncs $1.bc > $1.run.bc
68 llc -f $1.run.bc
69 LIBS=$HOME/llvm_sparc/lib/Debug
70 GXX=/usr/dcs/software/evaluation/bin/g++
71 $GXX -g -L $LIBS $1.run.s -o $1.run.llc \
72 $LIBS/tracecache.o \
73 $LIBS/mapinfo.o \
74 $LIBS/trigger.o \
75 $LIBS/profpaths.o \
76 $LIBS/bininterface.o \
77 $LIBS/support.o \
78 $LIBS/vmcore.o \
79 $LIBS/transformutils.o \
80 $LIBS/bcreader.o \
81 -lscalaropts -lscalaropts -lanalysis \
82 -lmalloc -lcpc -lm -ldl
83 ------------------------------------------------------------------------------
84
85 5. Run the resulting binary. You will see output from BinInterface
86 (described below) intermixed with the output from the program.
87
88
89 Output from BinInterface
90 ------------------------
91
92 BinInterface's debugging code prints out the following stuff in order:
93
94 1. Initial code provided to BinInterface with original register
95 allocation.
96
97 2. Section 0 is the trace prolog, consisting mainly of live-ins and
98 register saves which will be restored in epilogs.
99
100 3. Section 1 is the trace itself, in SSA form used by BinInterface,
101 along with the PHIs that are inserted.
102 PHIs are followed by the copies that implement them.
103 Each branch (i.e., out of the trace) is annotated with the
104 section number that represents the epilog it branches to.
105
106 4. All the other sections starting with Section 2 are trace epilogs.
107 Every branch from the trace has to go to some epilog.
108
109 5. After the last section is the register allocation output.