llvm.org GIT mirror llvm / ee47edf
docs: Sphinxify `docs/tutorial/`

Sorry for the massive commit, but I just wanted to knock this one down and it is really straightforward.

There are still a couple trivial (i.e. not related to the content) things left to fix:

- Use of raw HTML links where :doc:`...` and :ref:`...` could be used instead. If you are a newbie and want to help fix this it would make for some good bite-sized patches; more experienced developers should be focusing on adding new content (to this tutorial or elsewhere, but please _do not_ waste your time on formatting when there is such dire need for documentation (see docs/SphinxQuickstartTemplate.rst to get started writing)).
- Highlighting of the kaleidoscope code blocks (currently left as bare `::`). I will be working on writing a custom Pygments highlighter for this, mostly as training for maintaining the `llvm` code-block's lexer in-tree. I want to do this because I am extremely unhappy with how it just "gives up" on the slightest deviation from the expected syntax and leaves the whole code-block un-highlighted. More generally I am looking at writing some Sphinx extensions and keeping them in-tree as well, to support common use cases that currently have no good solution (like "monospace text inside a link").

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@169343 91177308-0d34-0410-b5e6-96231b3b80d8

Sean Silva 6 years ago
34 changed file(s) with 17107 addition(s) and 19075 deletion(s).
4949 @# Kind of a hack, but HTML-formatted docs are on the way out anyway.
5050 @echo "Copying legacy HTML-formatted docs into $(BUILDDIR)/html"
5151 @cp -a *.html $(BUILDDIR)/html
52 @mkdir -p $(BUILDDIR)/html/tutorial
53 @cp tutorial/*.html tutorial/*.png $(BUILDDIR)/html/tutorial
5452 @echo "Build finished. The HTML pages are in $(BUILDDIR)/html."
5553
5654 dirhtml:
+0
-348
docs/tutorial/LangImpl1.html

Kaleidoscope: Tutorial Introduction and the Lexer

  • Up to Tutorial Index
  • Chapter 1
    • Tutorial Introduction
    • The Basic Language
    • The Lexer
  • Chapter 2: Implementing a Parser and AST

    Written by Chris Lattner

    29
    30
    31
    32

    Tutorial Introduction

    33
    34
    35
    36
    37

    Welcome to the "Implementing a language with LLVM" tutorial. This tutorial

    38 runs through the implementation of a simple language, showing how fun and
    39 easy it can be. This tutorial will get you up and started as well as help to
    40 build a framework you can extend to other languages. The code in this tutorial
    41 can also be used as a playground to hack on other LLVM specific things.
    42

    43
    44

    45 The goal of this tutorial is to progressively unveil our language, describing
    46 how it is built up over time. This will let us cover a fairly broad range of
    47 language design and LLVM-specific usage issues, showing and explaining the code
    48 for it all along the way, without overwhelming you with tons of details up
    49 front.

    50
    51

    It is useful to point out ahead of time that this tutorial is really about

    52 teaching compiler techniques and LLVM specifically, not about teaching
    53 modern and sane software engineering principles. In practice, this means that
    54 we'll take a number of shortcuts to simplify the exposition. For example, the
    55 code leaks memory, uses global variables all over the place, doesn't use nice
    56 design patterns like
    57 visitors (http://en.wikipedia.org/wiki/Visitor_pattern), etc... but it
    58 is very simple. If you dig in and use the code as a basis for future projects,
    59 fixing these deficiencies shouldn't be hard.

    60
    61

    I've tried to put this tutorial together in a way that makes chapters easy to

    62 skip over if you are already familiar with or are uninterested in the various
    63 pieces. The structure of the tutorial is:
    64

    65
    66
    67
  • Chapter #1: Introduction to the Kaleidoscope
  • 68 language, and the definition of its Lexer - This shows where we are going
    69 and the basic functionality that we want it to do. In order to make this
    70 tutorial maximally understandable and hackable, we choose to implement
    71 everything in C++ instead of using lexer and parser generators. LLVM obviously
    72 works just fine with such tools, feel free to use one if you prefer.
    73
  • Chapter #2: Implementing a Parser and
  • 74 AST - With the lexer in place, we can talk about parsing techniques and
    75 basic AST construction. This tutorial describes recursive descent parsing and
    76 operator precedence parsing. Nothing in Chapters 1 or 2 is LLVM-specific,
    77 the code doesn't even link in LLVM at this point. :)
    78
  • Chapter #3: Code generation to LLVM IR -
  • 79 With the AST ready, we can show off how easy generation of LLVM IR really
    80 is.
    81
  • Chapter #4: Adding JIT and Optimizer
  • 82 Support - Because a lot of people are interested in using LLVM as a JIT,
    83 we'll dive right into it and show you the 3 lines it takes to add JIT support.
    84 LLVM is also useful in many other ways, but this is one simple and "sexy" way
    85 to show off its power. :)
    86
  • Chapter #5: Extending the Language: Control
  • 87 Flow - With the language up and running, we show how to extend it with
    88 control flow operations (if/then/else and a 'for' loop). This gives us a chance
    89 to talk about simple SSA construction and control flow.
    90
  • Chapter #6: Extending the Language:
  • 91 User-defined Operators - This is a silly but fun chapter that talks about
    92 extending the language to let the user program define their own arbitrary
    93 unary and binary operators (with assignable precedence!). This lets us build a
    94 significant piece of the "language" as library routines.
    95
  • Chapter #7: Extending the Language: Mutable
  • 96 Variables - This chapter talks about adding user-defined local variables
    97 along with an assignment operator. The interesting part about this is how
    98 easy and trivial it is to construct SSA form in LLVM: no, LLVM does not
    99 require your front-end to construct SSA form!
    100
  • Chapter #8: Conclusion and other useful LLVM
  • 101 tidbits - This chapter wraps up the series by talking about potential
    102 ways to extend the language, but also includes a bunch of pointers to info about
    103 "special topics" like adding garbage collection support, exceptions, debugging,
    104 support for "spaghetti stacks", and a bunch of other tips and tricks.
    105
    106
    107
    108

    By the end of the tutorial, we'll have written a bit less than 700

    109 non-comment, non-blank lines of code. With this small amount of code, we'll
    110 have built up a very reasonable compiler for a non-trivial language including
    111 a hand-written lexer, parser, AST, as well as code generation support with a JIT
    112 compiler. While other systems may have interesting "hello world" tutorials,
    113 I think the breadth of this tutorial is a great testament to the strengths of
    114 LLVM and why you should consider it if you're interested in language or compiler
    115 design.

    116
    117

    A note about this tutorial: we expect you to extend the language and play

    118 with it on your own. Take the code and go crazy hacking away at it, compilers
    119 don't need to be scary creatures - it can be a lot of fun to play with
    120 languages!

    121
    122
    123
    124
    125

    The Basic Language

    126
    127
    128
    129
    130

    This tutorial will be illustrated with a toy language that we'll call

    131 "Kaleidoscope" (derived
    132 from "meaning beautiful, form, and view").
    133 Kaleidoscope is a procedural language that allows you to define functions, use
    134 conditionals, math, etc. Over the course of the tutorial, we'll extend
    135 Kaleidoscope to support the if/then/else construct, a for loop, user defined
    136 operators, JIT compilation with a simple command line interface, etc.

    137
    138

    Because we want to keep things simple, the only datatype in Kaleidoscope is a

    139 64-bit floating point type (aka 'double' in C parlance). As such, all values
    140 are implicitly double precision and the language doesn't require type
    141 declarations. This gives the language a very nice and simple syntax. For
    142 example, the following simple example computes
    143 Fibonacci numbers (http://en.wikipedia.org/wiki/Fibonacci_number):

    144
    145
    146
    
                      
                    
    # Compute the x'th fibonacci number.
    def fib(x)
      if x < 3 then
        1
      else
        fib(x-1)+fib(x-2)

    # This expression will compute the 40th number.
    fib(40)
    156
    157
    158
    159

    We also allow Kaleidoscope to call into standard library functions (the LLVM

    160 JIT makes this completely trivial). This means that you can use the 'extern'
    161 keyword to define a function before you use it (this is also useful for mutually
    162 recursive functions). For example:

    163
    164
    165
    
                      
                    
    extern sin(arg);
    extern cos(arg);
    extern atan2(arg1 arg2);

    atan2(sin(.4), cos(42))
    171
    172
    173
    174

    A more interesting example is included in Chapter 6 where we write a little

    175 Kaleidoscope application that displays
    176 a Mandelbrot Set at various levels of magnification.

    177
    178

    Let's dive into the implementation of this language!

    179
    180
    181
    182
    183

    The Lexer

    184
    185
    186
    187
    188

    When it comes to implementing a language, the first thing needed is

    189 the ability to process a text file and recognize what it says. The traditional
    190 way to do this is to use a
    191 "lexer" (aka 'scanner'; see http://en.wikipedia.org/wiki/Lexical_analysis)
    192 to break the input up into "tokens". Each token returned by the lexer includes
    193 a token code and potentially some metadata (e.g. the numeric value of a number).
    194 First, we define the possibilities:
    195

    196
    197
    198
    
                      
                    
    // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
    // of these for known things.
    enum Token {
      tok_eof = -1,

      // commands
      tok_def = -2, tok_extern = -3,

      // primary
      tok_identifier = -4, tok_number = -5,
    };

    static std::string IdentifierStr;  // Filled in if tok_identifier
    static double NumVal;              // Filled in if tok_number
    213
    214
    215
    216

    Each token returned by our lexer will either be one of the Token enum values

    217 or it will be an 'unknown' character like '+', which is returned as its ASCII
    218 value. If the current token is an identifier, the IdentifierStr
    219 global variable holds the name of the identifier. If the current token is a
    220 numeric literal (like 1.0), NumVal holds its value. Note that we use
    221 global variables for simplicity, this is not the best choice for a real language
    222 implementation :).
    223

    224
    225

    The actual implementation of the lexer is a single function named

    226 gettok. The gettok function is called to return the next token
    227 from standard input. Its definition starts as:

    228
    229
    230
    
                      
                    
    /// gettok - Return the next token from standard input.
    static int gettok() {
      static int LastChar = ' ';

      // Skip any whitespace.
      while (isspace(LastChar))
        LastChar = getchar();
    238
    239
    240
    241

    242 gettok works by calling the C getchar() function to read
    243 characters one at a time from standard input. It eats them as it recognizes
    244 them and stores the last character read, but not processed, in LastChar. The
    245 first thing that it has to do is ignore whitespace between tokens. This is
    246 accomplished with the loop above.

    247
    248

    The next thing gettok needs to do is recognize identifiers and

    249 specific keywords like "def". Kaleidoscope does this with this simple loop:

    250
    251
    252
    
                      
                    
      if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
        IdentifierStr = LastChar;
        while (isalnum((LastChar = getchar())))
          IdentifierStr += LastChar;

        if (IdentifierStr == "def") return tok_def;
        if (IdentifierStr == "extern") return tok_extern;
        return tok_identifier;
      }
    262
    263
    264
    265

    Note that this code sets the 'IdentifierStr' global whenever it

    266 lexes an identifier. Also, since language keywords are matched by the same
    267 loop, we handle them here inline. Numeric values are similar:

    268
    269
    270
    
                      
                    
      if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
        std::string NumStr;
        do {
          NumStr += LastChar;
          LastChar = getchar();
        } while (isdigit(LastChar) || LastChar == '.');

        NumVal = strtod(NumStr.c_str(), 0);
        return tok_number;
      }
    281
    282
    283
    284

    This is all pretty straightforward code for processing input. When reading

    285 a numeric value from input, we use the C strtod function to convert it
    286 to a numeric value that we store in NumVal. Note that this isn't doing
    287 sufficient error checking: it will incorrectly read "1.23.45.67" and handle it as
    288 if you typed in "1.23". Feel free to extend it :). Next we handle comments:
    289

    290
    291
    292
    
                      
                    
      if (LastChar == '#') {
        // Comment until end of line.
        do LastChar = getchar();
        while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');

        if (LastChar != EOF)
          return gettok();
      }
    301
    302
    303
    304

    We handle comments by skipping to the end of the line and then return the

    305 next token. Finally, if the input doesn't match one of the above cases, it is
    306 either an operator character like '+' or the end of the file. These are handled
    307 with this code:

    308
    309
    310
    
                      
                    
      // Check for end of file. Don't eat the EOF.
      if (LastChar == EOF)
        return tok_eof;

      // Otherwise, just return the character as its ascii value.
      int ThisChar = LastChar;
      LastChar = getchar();
      return ThisChar;
    }
    320
    321
    322
    323

    With this, we have the complete lexer for the basic Kaleidoscope language

    324 (the full code listing for the Lexer is
    325 available in the next chapter of the tutorial).
    326 Next we'll build a simple parser that uses this to
    327 build an Abstract Syntax Tree. When we have that, we'll include a driver
    328 so that you can use the lexer and parser together.
    329

    330
    331 Next: Implementing a Parser and AST
    342 Chris Lattner
    343 The LLVM Compiler Infrastructure
    344 Last modified: $Date$
    345
    346
    347
    0 =================================================
    1 Kaleidoscope: Tutorial Introduction and the Lexer
    2 =================================================
    3
    4 .. contents::
    5 :local:
    6
    7 Written by `Chris Lattner `_
    8
    9 Tutorial Introduction
    10 =====================
    11
    12 Welcome to the "Implementing a language with LLVM" tutorial. This
    13 tutorial runs through the implementation of a simple language, showing
    14 how fun and easy it can be. This tutorial will get you up and started as
    15 well as help to build a framework you can extend to other languages. The
    16 code in this tutorial can also be used as a playground to hack on other
    17 LLVM-specific things.
    18
    19 The goal of this tutorial is to progressively unveil our language,
    20 describing how it is built up over time. This will let us cover a fairly
    21 broad range of language design and LLVM-specific usage issues, showing
    22 and explaining the code for it all along the way, without overwhelming
    23 you with tons of details up front.
    24
    25 It is useful to point out ahead of time that this tutorial is really
    26 about teaching compiler techniques and LLVM specifically, *not* about
    27 teaching modern and sane software engineering principles. In practice,
    28 this means that we'll take a number of shortcuts to simplify the
    29 exposition. For example, the code leaks memory, uses global variables
    30 all over the place, doesn't use nice design patterns like
    31 `visitors `_, etc... but
    32 it is very simple. If you dig in and use the code as a basis for future
    33 projects, fixing these deficiencies shouldn't be hard.
    34
    35 I've tried to put this tutorial together in a way that makes chapters
    36 easy to skip over if you are already familiar with or are uninterested
    37 in the various pieces. The structure of the tutorial is:
    38
    39 - `Chapter #1 <#language>`_: Introduction to the Kaleidoscope
    40 language, and the definition of its Lexer - This shows where we are
    41 going and the basic functionality that we want it to do. In order to
    42 make this tutorial maximally understandable and hackable, we choose
    43 to implement everything in C++ instead of using lexer and parser
    44 generators. LLVM obviously works just fine with such tools; feel free
    45 to use one if you prefer.
    46 - `Chapter #2 `_: Implementing a Parser and AST -
    47 With the lexer in place, we can talk about parsing techniques and
    48 basic AST construction. This tutorial describes recursive descent
    49 parsing and operator precedence parsing. Nothing in Chapters 1 or 2
    50 is LLVM-specific; the code doesn't even link in LLVM at this point.
    51 :)
    52 - `Chapter #3 `_: Code generation to LLVM IR - With
    53 the AST ready, we can show off how easy generation of LLVM IR really
    54 is.
    55 - `Chapter #4 `_: Adding JIT and Optimizer Support
    56 - Because a lot of people are interested in using LLVM as a JIT,
    57 we'll dive right into it and show you the 3 lines it takes to add JIT
    58 support. LLVM is also useful in many other ways, but this is one
    59 simple and "sexy" way to show off its power. :)
    60 - `Chapter #5 `_: Extending the Language: Control
    61 Flow - With the language up and running, we show how to extend it
    62 with control flow operations (if/then/else and a 'for' loop). This
    63 gives us a chance to talk about simple SSA construction and control
    64 flow.
    65 - `Chapter #6 `_: Extending the Language:
    66 User-defined Operators - This is a silly but fun chapter that talks
    67 about extending the language to let the user program define their own
    68 arbitrary unary and binary operators (with assignable precedence!).
    69 This lets us build a significant piece of the "language" as library
    70 routines.
    71 - `Chapter #7 `_: Extending the Language: Mutable
    72 Variables - This chapter talks about adding user-defined local
    73 variables along with an assignment operator. The interesting part
    74 about this is how easy and trivial it is to construct SSA form in
    75 LLVM: no, LLVM does *not* require your front-end to construct SSA
    76 form!
    77 - `Chapter #8 `_: Conclusion and other useful LLVM
    78 tidbits - This chapter wraps up the series by talking about
    79 potential ways to extend the language, but also includes a bunch of
    80 pointers to info about "special topics" like adding garbage
    81 collection support, exceptions, debugging, support for "spaghetti
    82 stacks", and a bunch of other tips and tricks.
    83
    84 By the end of the tutorial, we'll have written a bit less than 700
    85 non-comment, non-blank lines of code. With this small amount of
    86 code, we'll have built up a very reasonable compiler for a non-trivial
    87 language including a hand-written lexer, parser, AST, as well as code
    88 generation support with a JIT compiler. While other systems may have
    89 interesting "hello world" tutorials, I think the breadth of this
    90 tutorial is a great testament to the strengths of LLVM and why you
    91 should consider it if you're interested in language or compiler design.
    92
    93 A note about this tutorial: we expect you to extend the language and
    94 play with it on your own. Take the code and go crazy hacking away at it;
    95 compilers don't need to be scary creatures, and it can be a lot of fun to
    96 play with languages!
    97
    98 The Basic Language
    99 ==================
    100
    101 This tutorial will be illustrated with a toy language that we'll call
    102 "`Kaleidoscope `_" (derived
    103 from "meaning beautiful, form, and view"). Kaleidoscope is a procedural
    104 language that allows you to define functions, use conditionals, math,
    105 etc. Over the course of the tutorial, we'll extend Kaleidoscope to
    106 support the if/then/else construct, a for loop, user-defined operators,
    107 JIT compilation with a simple command line interface, etc.
    108
    109 Because we want to keep things simple, the only datatype in Kaleidoscope
    110 is a 64-bit floating point type (aka 'double' in C parlance). As such,
    111 all values are implicitly double precision and the language doesn't
    112 require type declarations. This gives the language a very nice and
    113 simple syntax. For example, the following simple example computes
    114 `Fibonacci numbers: `_
    115
    116 ::
    117
        # Compute the x'th fibonacci number.
        def fib(x)
          if x < 3 then
            1
          else
            fib(x-1)+fib(x-2)

        # This expression will compute the 40th number.
        fib(40)
    127
    128 We also allow Kaleidoscope to call into standard library functions (the
    129 LLVM JIT makes this completely trivial). This means that you can use the
    130 'extern' keyword to define a function before you use it (this is also
    131 useful for mutually recursive functions). For example:
    132
    133 ::
    134
        extern sin(arg);
        extern cos(arg);
        extern atan2(arg1 arg2);

        atan2(sin(.4), cos(42))
    140
    141 A more interesting example is included in Chapter 6 where we write a
    142 little Kaleidoscope application that `displays a Mandelbrot
    143 Set `_ at various levels of magnification.
    144
    145 Let's dive into the implementation of this language!
    146
    147 The Lexer
    148 =========
    149
    150 When it comes to implementing a language, the first thing needed is the
    151 ability to process a text file and recognize what it says. The
    152 traditional way to do this is to use a
    153 "`lexer `_" (aka
    154 'scanner') to break the input up into "tokens". Each token returned by
    155 the lexer includes a token code and potentially some metadata (e.g. the
    156 numeric value of a number). First, we define the possibilities:
    157
    158 .. code-block:: c++
    159
        // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
        // of these for known things.
        enum Token {
          tok_eof = -1,

          // commands
          tok_def = -2, tok_extern = -3,

          // primary
          tok_identifier = -4, tok_number = -5,
        };

        static std::string IdentifierStr;  // Filled in if tok_identifier
        static double NumVal;              // Filled in if tok_number
    174
    175 Each token returned by our lexer will either be one of the Token enum
    176 values or it will be an 'unknown' character like '+', which is returned
    177 as its ASCII value. If the current token is an identifier, the
    178 ``IdentifierStr`` global variable holds the name of the identifier. If
    179 the current token is a numeric literal (like 1.0), ``NumVal`` holds its
    180 value. Note that we use global variables for simplicity; this is not the
    181 best choice for a real language implementation :).
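
To make the token encoding concrete, here is a minimal sketch of a helper that turns a token code back into a readable description. The name `describeToken` is hypothetical and does not appear in the tutorial; the enum is repeated only so that the snippet compiles on its own.

```cpp
#include <string>

// Token codes as defined above: known tokens are negative, so they can
// never collide with the [0-255] range used for raw characters.
enum Token {
  tok_eof = -1,
  tok_def = -2, tok_extern = -3,
  tok_identifier = -4, tok_number = -5,
};

// describeToken - Hypothetical debugging helper (not part of the tutorial).
static std::string describeToken(int Tok) {
  switch (Tok) {
  case tok_eof:        return "eof";
  case tok_def:        return "def";
  case tok_extern:     return "extern";
  case tok_identifier: return "identifier";
  case tok_number:     return "number";
  default:
    // Anything else is an 'unknown' character returned as its ASCII value.
    return std::string("char '") + static_cast<char>(Tok) + "'";
  }
}
```

A caller could print `describeToken(gettok())` in a loop to watch the token stream while experimenting with the lexer.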
    182
    183 The actual implementation of the lexer is a single function named
    184 ``gettok``. The ``gettok`` function is called to return the next token
    185 from standard input. Its definition starts as:
    186
    187 .. code-block:: c++
    188
        /// gettok - Return the next token from standard input.
        static int gettok() {
          static int LastChar = ' ';

          // Skip any whitespace.
          while (isspace(LastChar))
            LastChar = getchar();
    196
    197 ``gettok`` works by calling the C ``getchar()`` function to read
    198 characters one at a time from standard input. It eats them as it
    199 recognizes them and stores the last character read, but not processed,
    200 in LastChar. The first thing that it has to do is ignore whitespace
    201 between tokens. This is accomplished with the loop above.
    202
    203 The next thing ``gettok`` needs to do is recognize identifiers and
    204 specific keywords like "def". Kaleidoscope does this with this simple
    205 loop:
    206
    207 .. code-block:: c++
    208
          if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
            IdentifierStr = LastChar;
            while (isalnum((LastChar = getchar())))
              IdentifierStr += LastChar;

            if (IdentifierStr == "def") return tok_def;
            if (IdentifierStr == "extern") return tok_extern;
            return tok_identifier;
          }
    218
    219 Note that this code sets the '``IdentifierStr``' global whenever it
    220 lexes an identifier. Also, since language keywords are matched by the
    221 same loop, we handle them here inline. Numeric values are similar:
    222
    223 .. code-block:: c++
    224
          if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
            std::string NumStr;
            do {
              NumStr += LastChar;
              LastChar = getchar();
            } while (isdigit(LastChar) || LastChar == '.');

            NumVal = strtod(NumStr.c_str(), 0);
            return tok_number;
          }
    235
    236 This is all pretty straightforward code for processing input. When
    237 reading a numeric value from input, we use the C ``strtod`` function to
    238 convert it to a numeric value that we store in ``NumVal``. Note that
    239 this isn't doing sufficient error checking: it will incorrectly read
    240 "1.23.45.67" and handle it as if you typed in "1.23". Feel free to
    241 extend it :). Next we handle comments:
    242
    243 .. code-block:: c++
    244
          if (LastChar == '#') {
            // Comment until end of line.
            do LastChar = getchar();
            while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');

            if (LastChar != EOF)
              return gettok();
          }
    253
    254 We handle comments by skipping to the end of the line and then returning
    255 the next token. Finally, if the input doesn't match one of the above
    256 cases, it is either an operator character like '+' or the end of the
    257 file. These are handled with this code:
    258
    259 .. code-block:: c++
    260
          // Check for end of file. Don't eat the EOF.
          if (LastChar == EOF)
            return tok_eof;

          // Otherwise, just return the character as its ascii value.
          int ThisChar = LastChar;
          LastChar = getchar();
          return ThisChar;
        }
    270
    271 With this, we have the complete lexer for the basic Kaleidoscope
    272 language (the `full code listing `_ for the Lexer
    273 is available in the `next chapter `_ of the tutorial).
    274 Next we'll `build a simple parser that uses this to build an Abstract
    275 Syntax Tree `_. When we have that, we'll include a
    276 driver so that you can use the lexer and parser together.
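
Putting the pieces together, the following is a minimal self-contained sketch of the lexer. To make it easy to exercise, it departs from the tutorial in a few ways that are assumptions of this sketch: the state lives in a hypothetical `Lexer` struct instead of globals, characters come from an in-memory `std::string` through a hypothetical `nextChar()` instead of `getchar()`, and the comment case loops rather than recursing. The token logic itself follows the snippets above.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdio>   // EOF
#include <cstdlib>  // std::strtod
#include <string>
#include <utility>

enum Token {
  tok_eof = -1,
  tok_def = -2, tok_extern = -3,
  tok_identifier = -4, tok_number = -5,
};

// Lexer - Hypothetical wrapper; the tutorial uses globals and getchar().
struct Lexer {
  std::string Input;          // Source text to tokenize (assumption).
  std::size_t Pos = 0;        // Index of the next unread character.
  int LastChar = ' ';         // Last character read but not yet processed.
  std::string IdentifierStr;  // Filled in if tok_identifier
  double NumVal = 0;          // Filled in if tok_number

  explicit Lexer(std::string Src) : Input(std::move(Src)) {}

  // Stand-in for getchar(): returns EOF once the string is exhausted.
  int nextChar() {
    return Pos < Input.size() ? static_cast<unsigned char>(Input[Pos++]) : EOF;
  }

  int gettok() {
    for (;;) {
      // Skip any whitespace.
      while (std::isspace(LastChar))
        LastChar = nextChar();

      if (std::isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
        IdentifierStr = static_cast<char>(LastChar);
        while (std::isalnum(LastChar = nextChar()))
          IdentifierStr += static_cast<char>(LastChar);
        if (IdentifierStr == "def") return tok_def;
        if (IdentifierStr == "extern") return tok_extern;
        return tok_identifier;
      }

      if (std::isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
        std::string NumStr;
        do {
          NumStr += static_cast<char>(LastChar);
          LastChar = nextChar();
        } while (std::isdigit(LastChar) || LastChar == '.');
        NumVal = std::strtod(NumStr.c_str(), nullptr);
        return tok_number;
      }

      if (LastChar == '#') { // Comment until end of line.
        do
          LastChar = nextChar();
        while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');
        if (LastChar != EOF)
          continue; // Lex the token that follows the comment.
      }

      // Check for end of file. Don't eat the EOF.
      if (LastChar == EOF)
        return tok_eof;

      // Otherwise, just return the character as its ASCII value.
      int ThisChar = LastChar;
      LastChar = nextChar();
      return ThisChar;
    }
  }
};
```

Constructing `Lexer Lex("def fib(x)")` and calling `Lex.gettok()` repeatedly yields one token per call until `tok_eof`. Keeping the state in a struct also makes it possible to run several lexers side by side, which the global-variable version cannot do.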
    277
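
As one way to take up the invitation to extend the number handling, the sketch below uses the end pointer that `strtod` reports to detect leftover characters, such as the `.45.67` in `1.23.45.67`. The name `parseNumber` is a hypothetical helper, not part of the tutorial, and how the lexer should report the error is left open.

```cpp
#include <cstdlib>
#include <string>

// parseNumber - Hypothetical helper: converts NumStr to a double and returns
// true only if strtod consumed the entire string.
static bool parseNumber(const std::string &NumStr, double &Out) {
  const char *Begin = NumStr.c_str();
  char *End = nullptr;
  double Val = std::strtod(Begin, &End);
  if (End == Begin || *End != '\0')
    return false; // Nothing parsed, or trailing junk such as ".45.67".
  Out = Val;
  return true;
}
```

The number branch of `gettok` could call this instead of `strtod` directly and return an error token of its choosing when the result is `false`.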
    278 `Next: Implementing a Parser and AST `_
    279
    +0
    -1231
    docs/tutorial/LangImpl2.html

    Kaleidoscope: Implementing a Parser and AST

  • Up to Tutorial Index
  • Chapter 2
    • Chapter 2 Introduction
    • The Abstract Syntax Tree (AST)
    • Parser Basics
    • Basic Expression Parsing
    • Binary Expression Parsing
    • Parsing the Rest
    • The Driver
    • Conclusions
    • Full Code Listing
  • Chapter 3: Code generation to LLVM IR

    Written by Chris Lattner

    35
    36
    37
    38

    Chapter 2 Introduction

    39
    40
    41
    42
    43

    Welcome to Chapter 2 of the "Implementing a language

    44 with LLVM" tutorial. This chapter shows you how to use the lexer, built in
    45 Chapter 1, to build a full
    46 parser (http://en.wikipedia.org/wiki/Parsing) for
    47 our Kaleidoscope language. Once we have a parser, we'll define and build an
    48 Abstract Syntax Tree (AST)
    49 (http://en.wikipedia.org/wiki/Abstract_syntax_tree).

    50
    51

    The parser we will build uses a combination of

    52 Recursive Descent Parsing
    53 (http://en.wikipedia.org/wiki/Recursive_descent_parser) and
    54 Operator-Precedence Parsing
    55 (http://en.wikipedia.org/wiki/Operator-precedence_parser) to parse the Kaleidoscope language (the latter for
    56 binary expressions and the former for everything else). Before we get to
    57 parsing though, let's talk about the output of the parser: the Abstract Syntax
    58 Tree.

    59
    60
    61
    62
    63

    The Abstract Syntax Tree (AST)

    64
    65
    66
    67
    68

    The AST for a program captures its behavior in such a way that it is easy for

    69 later stages of the compiler (e.g. code generation) to interpret. We basically
    70 want one object for each construct in the language, and the AST should closely
    71 model the language. In Kaleidoscope, we have expressions, a prototype, and a
    72 function object. We'll start with expressions first:

    73
    74
    75
    
                      
                    
    /// ExprAST - Base class for all expression nodes.
    class ExprAST {
    public:
      virtual ~ExprAST() {}
    };

    /// NumberExprAST - Expression class for numeric literals like "1.0".
    class NumberExprAST : public ExprAST {
      double Val;
    public:
      NumberExprAST(double val) : Val(val) {}
    };
    88
    89
    90
    91

    The code above shows the definition of the base ExprAST class and one

    92 subclass which we use for numeric literals. The important thing to note about
    93 this code is that the NumberExprAST class captures the numeric value of the
    94 literal as an instance variable. This allows later phases of the compiler to
    95 know what the stored numeric value is.

    96
    97

    Right now we only create the AST, so the node classes have no useful accessor

    98 methods. It would be very easy to add a virtual method to pretty-print the code,
    99 for example. Here are the other expression AST node definitions that we'll use
    100 in the basic form of the Kaleidoscope language:
    101

    102
    103
    104
    
                      
                    
    105 /// VariableExprAST - Expression class for referencing a variable, like "a".
    106 class VariableExprAST : public ExprAST {
    107 std::string Name;
    108 public:
    109 VariableExprAST(const std::string &name) : Name(name) {}
    110 };
    111
    112 /// BinaryExprAST - Expression class for a binary operator.
    113 class BinaryExprAST : public ExprAST {
    114 char Op;
    115 ExprAST *LHS, *RHS;
    116 public:
    117 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
    118 : Op(op), LHS(lhs), RHS(rhs) {}
    119 };
    120
    121 /// CallExprAST - Expression class for function calls.
    122 class CallExprAST : public ExprAST {
    123 std::string Callee;
    124 std::vector<ExprAST*> Args;
    125 public:
    126 CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
    127 : Callee(callee), Args(args) {}
    128 };
    129
    130
    131
    132

    This is all (intentionally) rather straightforward: variables capture the

    133 variable name, binary operators capture their opcode (e.g. '+'), and calls
    134 capture a function name as well as a list of any argument expressions. One thing
    135 that is nice about our AST is that it captures the language features without
    136 talking about the syntax of the language. Note that there is no discussion about
    137 precedence of binary operators, lexical structure, etc.
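
To make this point concrete, here is a minimal sketch (not part of the tutorial's code) that adds a hypothetical `str()` method to two of the node classes. The method name, the output format, and the destructor that frees the children are all assumptions made for illustration. The grouping in the printed result comes purely from the shape of the tree; no node stores a precedence or remembers whether the source had parentheses.

```cpp
#include <cassert>
#include <string>

// Hypothetical variant of the tutorial's node classes with an added
// str() method; the method name and output format are assumptions.
class ExprAST {
public:
  virtual ~ExprAST() {}
  virtual std::string str() const = 0;
};

class VariableExprAST : public ExprAST {
  std::string Name;
public:
  VariableExprAST(const std::string &name) : Name(name) {}
  std::string str() const { return Name; }
};

class BinaryExprAST : public ExprAST {
  char Op;
  ExprAST *LHS, *RHS;
public:
  BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
    : Op(op), LHS(lhs), RHS(rhs) {
    assert(lhs && rhs && "operands must be non-null");
  }
  // Assumption: this sketch owns and frees its children.
  ~BinaryExprAST() { delete LHS; delete RHS; }
  // Fully parenthesize: grouping is read off the tree shape alone.
  std::string str() const {
    return "(" + LHS->str() + " " + Op + " " + RHS->str() + ")";
  }
};

// Builds the tree for x+y*z by hand, the way a parser would.
std::string demoTree() {
  ExprAST *Mul = new BinaryExprAST('*', new VariableExprAST("y"),
                                   new VariableExprAST("z"));
  ExprAST *Add = new BinaryExprAST('+', new VariableExprAST("x"), Mul);
  std::string S = Add->str();
  delete Add;
  return S;
}
```

Calling `demoTree()` yields `(x + (y * z))`: the multiplication is nested under the addition because of how the nodes were linked, not because of anything recorded about syntax.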

    138
    139

    For our basic language, these are all of the expression nodes we'll define.

    140 Because it doesn't have conditional control flow, it isn't Turing-complete;
    141 we'll fix that in a later installment. The two things we need next are a way
    142 to talk about the interface to a function, and a way to talk about functions
    143 themselves:

    144
    145
    146
    
                      
                    
    147 /// PrototypeAST - This class represents the "prototype" for a function,
    148 /// which captures its name, and its argument names (thus implicitly the number
    149 /// of arguments the function takes).
    150 class PrototypeAST {
    151 std::string Name;
    152 std::vector<std::string> Args;
    153 public:
    154 PrototypeAST(const std::string &name, const std::vector<std::string> &args)
    155 : Name(name), Args(args) {}
    156 };
    157
    158 /// FunctionAST - This class represents a function definition itself.
    159 class FunctionAST {
    160 PrototypeAST *Proto;
    161 ExprAST *Body;
    162 public:
    163 FunctionAST(PrototypeAST *proto, ExprAST *body)
    164 : Proto(proto), Body(body) {}
    165 };
    166
    167
    168
    169

    In Kaleidoscope, functions are typed with just a count of their arguments.

    170 Since all values are double precision floating point, the type of each argument
    171 doesn't need to be stored anywhere. In a more aggressive and realistic
    172 language, the "ExprAST" class would probably have a type field.
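
As a small illustration of "typed with just a count", the sketch below adds hypothetical `getName()` and `getNumArgs()` accessors to `PrototypeAST` (the tutorial's class has no accessors at this stage) and a hypothetical helper `callMatches`. Under the assumption that every value is a double, checking a call against a prototype reduces to comparing argument counts.

```cpp
#include <cstddef>
#include <string>
#include <vector>

class PrototypeAST {
  std::string Name;
  std::vector<std::string> Args;
public:
  PrototypeAST(const std::string &name, const std::vector<std::string> &args)
    : Name(name), Args(args) {}
  // Hypothetical accessors, added only for this sketch.
  const std::string &getName() const { return Name; }
  std::size_t getNumArgs() const { return Args.size(); }
};

// With a single value type, a call agrees with a prototype exactly when
// the number of arguments matches.
bool callMatches(const PrototypeAST &Proto, std::size_t NumCallArgs) {
  return Proto.getNumArgs() == NumCallArgs;
}
```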

    173
    174

    With this scaffolding, we can now talk about parsing expressions and function

    175 bodies in Kaleidoscope.

    176
    177
    178
    179
    180

    Parser Basics

    181
    182
    183
    184
    185

    Now that we have an AST to build, we need to define the parser code to build

    186 it. The idea here is that we want to parse something like "x+y" (which is
    187 returned as three tokens by the lexer) into an AST that could be generated with
    188 calls like this:

    189
    190
    191
    
                      
                    
    192 ExprAST *X = new VariableExprAST("x");
    193 ExprAST *Y = new VariableExprAST("y");
    194 ExprAST *Result = new BinaryExprAST('+', X, Y);
    195
    196
    197
    198

    In order to do this, we'll start by defining some basic helper routines:

    199
    200
    201
    
                      
                    
    202 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
    203 /// token the parser is looking at. getNextToken reads another token from the
    204 /// lexer and updates CurTok with its results.
    205 static int CurTok;
    206 static int getNextToken() {
    207 return CurTok = gettok();
    208 }
    209
    210
    211
    212

    213 This implements a simple token buffer around the lexer. This allows
    214 us to look one token ahead at what the lexer is returning. Every function in
    215 our parser will assume that CurTok is the current token that needs to be
    216 parsed.
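
The following sketch shows the token buffer in isolation. The real `gettok()` reads from standard input; here it is replaced by a stand-in that replays a canned token sequence, and `CannedTokens`, `NextIndex`, and `countTokens` are hypothetical names introduced for the example. The point is the protocol: prime `CurTok` once, then every consumer inspects `CurTok` and calls `getNextToken()` to advance.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the lexer: replays a canned token sequence. In the
// tutorial, gettok() reads characters from standard input instead.
static std::vector<int> CannedTokens;
static std::size_t NextIndex = 0;
static const int tok_eof = -1;

static int gettok() {
  if (NextIndex < CannedTokens.size())
    return CannedTokens[NextIndex++];
  return tok_eof;
}

// Same shape as the tutorial's buffer: CurTok is the one token of
// look-ahead, and getNextToken() refills it from the lexer.
static int CurTok;
static int getNextToken() { return CurTok = gettok(); }

// Counts the tokens before end of input by priming the buffer and then
// advancing one token at a time.
int countTokens(const std::vector<int> &Toks) {
  CannedTokens = Toks;
  NextIndex = 0;
  int N = 0;
  getNextToken();  // prime CurTok with the first token
  while (CurTok != tok_eof) {
    assert(CurTok == Toks[N] && "CurTok is the token not yet consumed");
    ++N;
    getNextToken();
  }
  return N;
}
```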

    217
    218
    219
    
                      
                    
    220
    221 /// Error* - These are little helper functions for error handling.
    222 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str); return 0; }
    223 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
    224 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
    225
    226
    227
    228

    229 The Error routines are simple helper routines that our parser will use
    230 to handle errors. The error recovery in our parser will not be the best and
    231 is not particularly user-friendly, but it will be enough for our tutorial. These
    232 routines make it easier to handle errors in routines that have various return
    233 types: they always return null.
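
A rough sketch of how the null-return convention propagates through callers is shown below. The empty node types, the `ErrorCount` counter, and the `parseInner`/`parseOuter` routines are hypothetical stand-ins, not functions from the tutorial. The error is reported once at the point of failure, and every caller only has to test for null and pass it along.

```cpp
#include <cassert>
#include <cstdio>

// Empty stand-ins for the tutorial's node classes.
struct ExprAST { virtual ~ExprAST() {} };
struct PrototypeAST {};

static int ErrorCount = 0;  // hypothetical counter, for the demo only

ExprAST *Error(const char *Str) {
  assert(Str && "an error message is required");
  fprintf(stderr, "Error: %s\n", Str);
  ++ErrorCount;
  return 0;
}
// Same message path, but typed for routines that return a prototype.
PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }

// A routine that fails reports the problem once and returns null...
PrototypeAST *parseInner(bool Ok) {
  if (!Ok) return ErrorP("Expected function name in prototype");
  return new PrototypeAST();
}

// ...and each caller just checks for null and gives up in turn.
bool parseOuter(bool Ok) {
  PrototypeAST *P = parseInner(Ok);
  if (!P) return false;
  delete P;
  return true;
}
```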

    234
    235

    With these basic helper functions, we can implement the first

    236 piece of our grammar: numeric literals.

    237
    238
    239
    240
    241

    Basic Expression Parsing

    242
    243
    244
    245
    246

    We start with numeric literals, because they are the simplest to process.

    247 For each production in our grammar, we'll define a function which parses that
    248 production. For numeric literals, we have:
    249

    250
    251
    252
    
                      
                    
    253 /// numberexpr ::= number
    254 static ExprAST *ParseNumberExpr() {
    255 ExprAST *Result = new NumberExprAST(NumVal);
    256 getNextToken(); // consume the number
    257 return Result;
    258 }
    259
    260
    261
    262

    This routine is very simple: it expects to be called when the current token

    263 is a tok_number token. It takes the current number value, creates
    264 a NumberExprAST node, advances the lexer to the next token, and finally
    265 returns.

    266
    267

    There are some interesting aspects to this. The most important one is that

    268 this routine eats all of the tokens that correspond to the production and
    269 returns the lexer buffer with the next token (which is not part of the grammar
    270 production) ready to go. This is a fairly standard way to go for recursive
    271 descent parsers. For a better example, the parenthesis operator is defined like
    272 this:

    273
    274
    275
    
                      
                    
    276 /// parenexpr ::= '(' expression ')'
    277 static ExprAST *ParseParenExpr() {
    278 getNextToken(); // eat (.
    279 ExprAST *V = ParseExpression();
    280 if (!V) return 0;
    281
    282 if (CurTok != ')')
    283 return Error("expected ')'");
    284 getNextToken(); // eat ).
    285 return V;
    286 }
    287
    288
    289
    290

    This function illustrates a number of interesting things about the

    291 parser:

    292
    293

    294 1) It shows how we use the Error routines. When called, this function expects
    295 that the current token is a '(' token, but after parsing the subexpression, it
    296 is possible that there is no ')' waiting. For example, if the user types in
    297 "(4 x" instead of "(4)", the parser should emit an error. Because errors can
    298 occur, the parser needs a way to indicate that they happened: in our parser, we
    299 return null on an error.

    300
    301

    2) Another interesting aspect of this function is that it uses recursion by

    302 calling ParseExpression (we will soon see that ParseExpression can call
    303 ParseParenExpr). This is powerful because it allows us to handle
    304 recursive grammars, and keeps each production very simple. Note that
    305 parentheses do not cause construction of AST nodes themselves. While we could
    306 do it this way, the most important role of parentheses is to guide the parser
    307 and provide grouping. Once the parser constructs the AST, parentheses are not
    308 needed.
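
Both points can be tried out with a cut-down parser. This is only a sketch under simplifying assumptions: tokens are single characters read from a string instead of coming from the lexer, numbers are single digits, `ParseExpression` handles only numbers and parenthesized expressions, and `NodesCreated` and `countNodes` are hypothetical helpers added for the demonstration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdio>
#include <string>

struct NumberExprAST {
  double Val;
  NumberExprAST(double val) : Val(val) {}
};

static std::string Input;
static std::size_t Pos;
static int CurTok;
static int NodesCreated;

// One-character tokens; whitespace is skipped, end of input is EOF.
static int getNextToken() {
  while (Pos < Input.size() && isspace((unsigned char)Input[Pos])) ++Pos;
  return CurTok = (Pos < Input.size()) ? (unsigned char)Input[Pos++] : EOF;
}

static NumberExprAST *Error(const char *Str) {
  fprintf(stderr, "Error: %s\n", Str);
  return 0;
}

static NumberExprAST *ParseExpression();

// numberexpr ::= digit   (single digits only in this sketch)
static NumberExprAST *ParseNumberExpr() {
  assert(isdigit(CurTok));
  NumberExprAST *Result = new NumberExprAST(CurTok - '0');
  ++NodesCreated;
  getNextToken();  // consume the number
  return Result;
}

// parenexpr ::= '(' expression ')'
static NumberExprAST *ParseParenExpr() {
  assert(CurTok == '(');
  getNextToken();  // eat (
  NumberExprAST *V = ParseExpression();
  if (!V) return 0;
  if (CurTok != ')') { delete V; return Error("expected ')'"); }
  getNextToken();  // eat )
  return V;        // the parentheses themselves produce no node
}

// expression ::= numberexpr | parenexpr   (no binary operators here)
static NumberExprAST *ParseExpression() {
  if (CurTok == '(') return ParseParenExpr();
  if (CurTok != EOF && isdigit(CurTok)) return ParseNumberExpr();
  return Error("unknown token when expecting an expression");
}

// Parses Src and returns how many AST nodes were built, or -1 on error.
int countNodes(const std::string &Src) {
  Input = Src; Pos = 0; NodesCreated = 0;
  getNextToken();
  NumberExprAST *E = ParseExpression();
  if (!E) return -1;
  delete E;
  return NodesCreated;
}
```

With this sketch, `countNodes("((4))")` builds a single node no matter how many parentheses wrap the literal, while `countNodes("(4 x")` reports the missing `)` and returns -1.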

    309
    310

    The next simple production is for handling variable references and function

    311 calls:

    312
    313
    314
    
                      
                    
    315 /// identifierexpr
    316 /// ::= identifier
    317 /// ::= identifier '(' expression* ')'
    318 static ExprAST *ParseIdentifierExpr() {
    319 std::string IdName = IdentifierStr;
    320
    321 getNextToken(); // eat identifier.
    322
    323 if (CurTok != '(') // Simple variable ref.
    324 return new VariableExprAST(IdName);
    325
    326 // Call.
    327 getNextToken(); // eat (
    328 std::vector<ExprAST*> Args;
    329 if (CurTok != ')') {
    330 while (1) {
    331 ExprAST *Arg = ParseExpression();
    332 if (!Arg) return 0;
    333 Args.push_back(Arg);
    334
    335 if (CurTok == ')') break;
    336
    337 if (CurTok != ',')
    338 return Error("Expected ')' or ',' in argument list");
    339 getNextToken();
    340 }
    341 }
    342
    343 // Eat the ')'.
    344 getNextToken();
    345
    346 return new CallExprAST(IdName, Args);
    347 }
    348
    349
    350
    351

    This routine follows the same style as the other routines. (It expects to be

    352 called if the current token is a tok_identifier token). It also has
    353 recursion and error handling. One interesting aspect of this is that it uses
    354 look-ahead to determine if the current identifier is a standalone
    355 variable reference or if it is a function call expression. It handles this by
    356 checking to see if the token after the identifier is a '(' token, constructing
    357 either a VariableExprAST or CallExprAST node as appropriate.
    358

    359
    360

    Now that we have all of our simple expression-parsing logic in place, we can

    361 define a helper function to wrap it together into one entry point. We call this
    362 class of expressions "primary" expressions, for reasons that will become more
    363 clear later in the tutorial. In order to
    364 parse an arbitrary primary expression, we need to determine what sort of
    365 expression it is:

    366
    367
    368
    
                      
                    
    369 /// primary
    370 /// ::= identifierexpr
    371 /// ::= numberexpr
    372 /// ::= parenexpr
    373 static ExprAST *ParsePrimary() {
    374 switch (CurTok) {
    375 default: return Error("unknown token when expecting an expression");
    376 case tok_identifier: return ParseIdentifierExpr();
    377 case tok_number: return ParseNumberExpr();
    378 case '(': return ParseParenExpr();
    379 }
    380 }
    381
    382
    383
    384

    Now that you see the definition of this function, it is more obvious why we

    385 can assume the state of CurTok in the various functions. This uses look-ahead
    386 to determine which sort of expression is being inspected, and then parses it
    387 with a function call.

    388
    389

    Now that basic expressions are handled, we need to handle binary expressions.

    390 They are a bit more complex.

    391
    392
    393
    394
    395

    Binary Expression Parsing

    396
    397
    398
    399
    400

    Binary expressions are significantly harder to parse because they are often

    401 ambiguous. For example, when given the string "x+y*z", the parser can choose
    402 to parse it as either "(x+y)*z" or "x+(y*z)". With common definitions from
    403 mathematics, we expect the latter parse, because "*" (multiplication) has
    404 higher precedence than "+" (addition).

    405
    406

    There are many ways to handle this, but an elegant and efficient way is to

    407 use Operator-Precedence Parsing
    408 (http://en.wikipedia.org/wiki/Operator-precedence_parser).
    409 This parsing technique uses the precedence of binary operators to
    410 guide recursion. To start with, we need a table of precedences:

    411
    412
    413
    
                      
                    
    414 /// BinopPrecedence - This holds the precedence for each binary operator that is
    415 /// defined.
    416 static std::map<char, int> BinopPrecedence;
    417
    418 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
    419 static int GetTokPrecedence() {
    420 if (!isascii(CurTok))
    421 return -1;
    422
    423 // Make sure it's a declared binop.
    424 int TokPrec = BinopPrecedence[CurTok];
    425 if (TokPrec <= 0) return -1;
    426 return TokPrec;
    427 }
    428
    429 int main() {
    430 // Install standard binary operators.
    431 // 1 is lowest precedence.
    432 BinopPrecedence['<'] = 10;
    433 BinopPrecedence['+'] = 20;
    434 BinopPrecedence['-'] = 20;
    435 BinopPrecedence['*'] = 40; // highest.
    436 ...
    437 }
    438
    439
    440
    441

    For the basic form of Kaleidoscope, we will only support 4 binary operators

    442 (this can obviously be extended by you, our brave and intrepid reader). The
    443 GetTokPrecedence function returns the precedence for the current token,
    444 or -1 if the token is not a binary operator. Having a map makes it easy to add
    445 new operators and makes it clear that the algorithm doesn't depend on the
    446 specific operators involved, but it would be easy enough to eliminate the map
    447 and do the comparisons in the GetTokPrecedence function. (Or just use
    448 a fixed-size array).
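
The two alternatives mentioned above might look roughly like this. Both are sketches, not code from the tutorial: the function names are hypothetical, and the token is passed as a parameter (the tutorial's function reads the global `CurTok`) so that the functions can be exercised on their own.

```cpp
#include <cassert>

// Variant 1: comparisons done directly in the function, via a switch.
int PrecedenceBySwitch(int Tok) {
  switch (Tok) {
  case '<': return 10;
  case '+':
  case '-': return 20;
  case '*': return 40;  // highest
  default:  return -1;  // not a binary operator
  }
}

// Variant 2: a fixed-size array indexed by ASCII code. Entries start at
// zero (static storage), and zero means "not an operator".
static int PrecTable[128];

void InstallBinop(char Op, int Prec) {
  assert(Op >= 0 && Prec > 0);  // ASCII only; 1 is the lowest precedence
  PrecTable[(int)Op] = Prec;
}

int PrecedenceByTable(int Tok) {
  if (Tok < 0 || Tok >= 128) return -1;  // plays the role of the isascii check
  int Prec = PrecTable[Tok];
  return Prec <= 0 ? -1 : Prec;
}
```

The switch is the most compact form but fixes the operator set at compile time; the array keeps the "install operators at startup" style of the map version without the map.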

    449
    450

    With the helper above defined, we can now start parsing binary expressions.

    451 The basic idea of operator precedence parsing is to break down an expression
    452 with potentially ambiguous binary operators into pieces. Consider, for example,
    453 the expression "a+b+(c+d)*e*f+g". Operator precedence parsing considers this
    454 as a stream of primary expressions separated by binary operators. As such,
    455 it will first parse the leading primary expression "a", then it will see the
    456 pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g]. Note that because parentheses
    457 are primary expressions, the binary expression parser doesn't need to worry
    458 about nested subexpressions like (c+d) at all.
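
The decomposition into a leading primary plus [binop, primary] pairs can be sketched in a few lines. This is an illustration only, with hypothetical function names, and it assumes well-formed input with single-character variables and no whitespace. A parenthesized group is copied verbatim as one primary, which mirrors the observation that the binary expression parser never looks inside it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Reads one primary starting at Pos: either a single character (a
// variable) or a whole parenthesized group, copied verbatim.
static std::string readPrimary(const std::string &S, std::size_t &Pos) {
  assert(Pos < S.size());
  std::size_t Start = Pos;
  if (S[Pos] == '(') {
    int Depth = 0;
    do {
      if (S[Pos] == '(') ++Depth;
      else if (S[Pos] == ')') --Depth;
      ++Pos;
    } while (Depth > 0 && Pos < S.size());
  } else {
    ++Pos;
  }
  return S.substr(Start, Pos - Start);
}

// Splits an expression into the leading primary followed by
// [binop, primary] pairs.
std::string splitIntoPairs(const std::string &S) {
  std::size_t Pos = 0;
  std::string Out = readPrimary(S, Pos);
  while (Pos + 1 < S.size()) {
    char Op = S[Pos++];
    Out += std::string(" [") + Op + ", " + readPrimary(S, Pos) + "]";
  }
  return Out;
}
```

For the running example, `splitIntoPairs("a+b+(c+d)*e*f+g")` produces `a [+, b] [+, (c+d)] [*, e] [*, f] [+, g]`.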
    459

    460
    461

    462 To start, an expression is a primary expression potentially followed by a
    463 sequence of [binop,primaryexpr] pairs:

    464
    465
    466
    
                      
                    
    467 /// expression
    468 /// ::= primary binoprhs
    469 ///
    470 static ExprAST *ParseExpression() {
    471 ExprAST *LHS = ParsePrimary();
    472 if (!LHS) return 0;
    473
    474 return ParseBinOpRHS(0, LHS);
    475 }
    476
    477
    478
    479

    ParseBinOpRHS is the function that parses the sequence of pairs for

    480 us. It takes a precedence and a pointer to an expression for the part that has been
    481 parsed so far. Note that "x" is a perfectly valid expression. As such, "binoprhs" is
    482 allowed to be empty, in which case it returns the expression that is passed into
    483 it. In our example above, the code passes the expression for "a" into
    484 ParseBinOpRHS and the current token is "+".

    485
    486

    The precedence value passed into ParseBinOpRHS indicates the

    487 minimal operator precedence that the function is allowed to eat. For
    488 example, if the current pair stream is [+, x] and ParseBinOpRHS is
    489 passed in a precedence of 40, it will not consume any tokens (because the
    490 precedence of '+' is only 20). With this in mind, ParseBinOpRHS starts
    491 with:

    492
    493
    494
    
                      
                    
    495 /// binoprhs
    496 /// ::= ('+' primary)*
    497 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
    498 // If this is a binop, find its precedence.
    499 while (1) {
    500 int TokPrec = GetTokPrecedence();
    501
    502 // If this is a binop that binds at least as tightly as the current binop,
    503 // consume it, otherwise we are done.
    504 if (TokPrec < ExprPrec)
    505 return LHS;
    506
    507
    508
    509

    This code gets the precedence of the current token and checks to see if it is

    510 too low. Because we defined invalid tokens to have a precedence of -1, this
    511 check implicitly knows that the pair-stream ends when the token stream runs out
    512 of binary operators. If the token passes this check, we know that it is a binary
    513 operator and that it will be included in this expression:

    514
    515
    516
    
                      
                    
    517 // Okay, we know this is a binop.
    518 int BinOp = CurTok;
    519 getNextToken(); // eat binop
    520
    521 // Parse the primary expression after the binary operator.
    522 ExprAST *RHS = ParsePrimary();
    523 if (!RHS) return 0;
    524
    525
    526
    527

    As such, this code eats (and remembers) the binary operator and then parses

    528 the primary expression that follows. This builds up the whole pair, the first of
    529 which is [+, b] for the running example.

    530
    531

    Now that we have parsed the left-hand side of an expression and one pair of the

    532 RHS sequence, we have to decide which way the expression associates. In
    533 particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)".
    534 To determine this, we look ahead at "binop" to determine its precedence and
    535 compare it to BinOp's precedence (which is '+' in this case):

    536
    537
    538
    
                      
                    
    539 // If BinOp binds less tightly with RHS than the operator after RHS, let
    540 // the pending operator take RHS as its LHS.
    541 int NextPrec = GetTokPrecedence();
    542 if (TokPrec < NextPrec) {
    543
    544
    545
    546

    If the precedence of the binop to the right of "RHS" is lower or equal to the

    547 precedence of our current operator, then we know that the parentheses associate
    548 as "(a+b) binop ...". In our example, the current operator is "+" and the next
    549 operator is also "+", so they have the same precedence. In this case we'll
    550 create the AST node for "a+b", and then continue parsing:

    551
    552
    553
    
                      
                    
    554 ... if body omitted ...
    555 }
    556
    557 // Merge LHS/RHS.
    558 LHS = new BinaryExprAST(BinOp, LHS, RHS);
    559 } // loop around to the top of the while loop.
    560 }
    561
    562
    563
    564

    In our example above, this will turn "a+b+" into "(a+b)" and execute the next

    565 iteration of the loop, with "+" as the current token. The code above will eat,
    566 remember, and parse "(c+d)" as the primary expression, which makes the
    567 current pair equal to [+, (c+d)]. It will then evaluate the 'if' conditional above with
    568 "*" as the binop to the right of the primary. In this case, the precedence of "*" is
    569 higher than the precedence of "+" so the body of the if statement will be entered.

    570
    571

    The critical question left here is "how can the body of the if statement parse

    572 the right-hand side in full?" In particular, to build the AST correctly for our example,
    573 it needs to get all of "(c+d)*e*f" as the RHS expression variable. The code to
    574 do this is surprisingly simple (code from the above two blocks duplicated for
    575 context):

    576
    577
    578
    
                      
                    
    579 // If BinOp binds less tightly with RHS than the operator after RHS, let
    580 // the pending operator take RHS as its LHS.
    581 int NextPrec = GetTokPrecedence();
    582 if (TokPrec < NextPrec) {
    583 RHS = ParseBinOpRHS(TokPrec+1, RHS);
    584 if (RHS == 0) return 0;
    585 }
    586 // Merge LHS/RHS.
    587 LHS = new BinaryExprAST(BinOp, LHS, RHS);
    588 } // loop around to the top of the while loop.
    589 }
    590
    591
    592
    593

    At this point, we know that the binary operator to the RHS of our primary

    594 has higher precedence than the binop we are currently parsing. As such, we know
    595 that any sequence of pairs whose operators are all higher precedence than "+"
    596 should be parsed together and returned as "RHS". To do this, we recursively
    597 invoke the ParseBinOpRHS function specifying "TokPrec+1" as the minimum
    598 precedence required for it to continue. In our example above, this will cause
    599 it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS
    600 of the '+' expression.

    601
    602

    Finally, on the next iteration of the while loop, the "+g" piece is parsed

    603 and added to the AST. With this little bit of code (14 non-trivial lines), we
    604 correctly handle fully general binary expression parsing in a very elegant way.
    605 This was a whirlwind tour of this code, and it is somewhat subtle. I recommend
    606 running through it with a few tough examples to see how it works.
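
One way to run through such examples is with a condensed version of the algorithm that makes the grouping visible. The sketch below follows the structure of `ParseBinOpRHS` described above, under several simplifying assumptions: tokens are single characters taken from a string, primaries are single letters or parenthesized expressions, and instead of AST nodes it builds a fully parenthesized string. An empty string plays the role of the null pointer, and `parenthesize` is a hypothetical entry point.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

static std::string Input;
static std::size_t Pos;
static int CurTok;
static std::map<char, int> BinopPrecedence;

// One-character tokens; spaces are skipped, end of input is -1.
static int getNextToken() {
  while (Pos < Input.size() && Input[Pos] == ' ') ++Pos;
  return CurTok = (Pos < Input.size()) ? (unsigned char)Input[Pos++] : -1;
}

static int GetTokPrecedence() {
  if (CurTok < 0 || CurTok > 127) return -1;
  std::map<char, int>::const_iterator It = BinopPrecedence.find((char)CurTok);
  return It == BinopPrecedence.end() ? -1 : It->second;
}

static std::string ParseExpression();

// primary ::= letter | '(' expression ')'
static std::string ParsePrimary() {
  if (CurTok == '(') {
    getNextToken();  // eat (
    std::string V = ParseExpression();
    if (V.empty() || CurTok != ')') return "";
    getNextToken();  // eat )
    return V;
  }
  if (CurTok < 0 || !isalpha(CurTok)) return "";
  std::string Name(1, (char)CurTok);
  getNextToken();  // eat identifier
  return Name;
}

static std::string ParseBinOpRHS(int ExprPrec, std::string LHS) {
  assert(!LHS.empty());
  while (1) {
    int TokPrec = GetTokPrecedence();
    if (TokPrec < ExprPrec) return LHS;  // operator binds too loosely

    char BinOp = (char)CurTok;
    getNextToken();  // eat binop
    std::string RHS = ParsePrimary();
    if (RHS.empty()) return "";

    // If the next operator binds tighter, let it take RHS as its LHS.
    int NextPrec = GetTokPrecedence();
    if (TokPrec < NextPrec) {
      RHS = ParseBinOpRHS(TokPrec + 1, RHS);
      if (RHS.empty()) return "";
    }
    LHS = "(" + LHS + BinOp + RHS + ")";  // merge LHS/RHS
  }
}

static std::string ParseExpression() {
  std::string LHS = ParsePrimary();
  if (LHS.empty()) return "";
  return ParseBinOpRHS(0, LHS);
}

// Returns the fully parenthesized form of Src, or "" on a parse error.
std::string parenthesize(const std::string &Src) {
  if (BinopPrecedence.empty()) {
    BinopPrecedence['<'] = 10;
    BinopPrecedence['+'] = 20;
    BinopPrecedence['-'] = 20;
    BinopPrecedence['*'] = 40;
  }
  Input = Src;
  Pos = 0;
  getNextToken();
  return ParseExpression();
}
```

For the running example, `parenthesize("a+b+(c+d)*e*f+g")` gives `(((a+b)+(((c+d)*e)*f))+g)`: the two multiplications are gathered by the recursive call before being attached as the right-hand side of the second `+`, and `+g` is merged on the final loop iteration.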
    607

    608
    609

    This wraps up handling of expressions. At this point, we can point the

    610 parser at an arbitrary token stream and build an expression from it, stopping
    611 at the first token that is not part of the expression. Next up we need to
    612 handle function definitions, etc.

    613
    614
    615
    616
    617

    Parsing the Rest

    618
    619
    620
    621
    622

    623 The next thing missing is handling of function prototypes. In Kaleidoscope,
    624 these are used both for 'extern' function declarations as well as function body
    625 definitions. The code to do this is straightforward and not very interesting
    626 (once you've survived expressions):
    627

    628
    629
    630
    
                      
                    
    631 /// prototype
    632 /// ::= id '(' id* ')'
    633 static PrototypeAST *ParsePrototype() {
    634 if (CurTok != tok_identifier)
    635 return ErrorP("Expected function name in prototype");
    636
    637 std::string FnName = IdentifierStr;
    638 getNextToken();
    639
    640 if (CurTok != '(')
    641 return ErrorP("Expected '(' in prototype");
    642
    643 // Read the list of argument names.
    644 std::vector<std::string> ArgNames;
    645 while (getNextToken() == tok_identifier)
    646 ArgNames.push_back(IdentifierStr);
    647 if (CurTok != ')')
    648 return ErrorP("Expected ')' in prototype");
    649
    650 // success.
    651 getNextToken(); // eat ')'.
    652
    653 return new PrototypeAST(FnName, ArgNames);
    654 }
    655
    656
    657
    658

    Given this, a function definition is very simple, just a prototype plus

    659 an expression to implement the body:

    660
    661
    662
    
                      
                    
    663 /// definition ::= 'def' prototype expression
    664 static FunctionAST *ParseDefinition() {
    665 getNextToken(); // eat def.
    666 PrototypeAST *Proto = ParsePrototype();
    667 if (Proto == 0) return 0;
    668
    669 if (ExprAST *E = ParseExpression())
    670 return new FunctionAST(Proto, E);
    671 return 0;
    672 }
    673
    674
    675
    676

    In addition, we support 'extern' to declare functions like 'sin' and 'cos' as

    677 well as to support forward declaration of user functions. These 'extern's are just
    678 prototypes with no body:

    679
    680
    681
    
                      
                    
    682 /// external ::= 'extern' prototype
    683 static PrototypeAST *ParseExtern() {
    684 getNextToken(); // eat extern.
    685 return ParsePrototype();
    686 }
    687
    688
    689
    690

    Finally, we'll also let the user type in arbitrary top-level expressions and

    691 evaluate them on the fly. We will handle this by defining anonymous nullary
    692 (zero argument) functions for them:

    693
    694
    695
    
                      
                    
    696 /// toplevelexpr ::= expression
    697 static FunctionAST *ParseTopLevelExpr() {
    698 if (ExprAST *E = ParseExpression()) {
    699 // Make an anonymous proto.
    700 PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
    701 return new FunctionAST(Proto, E);
    702 }
    703 return 0;
    704 }
    705
    706
    707
    708

    Now that we have all the pieces, let's build a little driver that will let us

    709 actually execute this code we've built!

    710
    711
    712
    713
    714

    The Driver

    715
    716
    717
    718
    719

    The driver for this simply invokes all of the parsing pieces with a top-level

    720 dispatch loop. There isn't much interesting here, so I'll just include the
    721 top-level loop. See below for full code in the "Top-Level
    722 Parsing" section.

    723
    724
    725
    
                      
                    
    726 /// top ::= definition | external | expression | ';'
    727 static void MainLoop() {
    728 while (1) {
    729 fprintf(stderr, "ready> ");
    730 switch (CurTok) {
    731 case tok_eof: return;
    732 case ';': getNextToken(); break; // ignore top-level semicolons.
    733 case tok_def: HandleDefinition(); break;
    734 case tok_extern: HandleExtern(); break;
    735 default: HandleTopLevelExpression(); break;
    736 }
    737 }
    738 }
    739
    740
    741
    742

    The most interesting part of this is that we ignore top-level semicolons.

    743 Why is this, you ask? The basic reason is that if you type "4 + 5" at the
    744 command line, the parser doesn't know whether that is the end of what you will type
    745 or not. For example, on the next line you could type "def foo..." in which case
    746 4+5 is the end of a top-level expression. Alternatively you could type "* 6",
    747 which would continue the expression. Having top-level semicolons allows you to
    748 type "4+5;", and the parser will know you are done.

    749
    750
    751
    752
    753

    Conclusions

    754
    755
    756
    757
    758

    With just under 400 lines of commented code (240 lines of non-comment,

    759 non-blank code), we fully defined our minimal language, including a lexer,
    760 parser, and AST builder. With this done, the executable will validate
    761 Kaleidoscope code and tell us if it is grammatically invalid. For
    762 example, here is a sample interaction:

    763
    764
    765
    
                      
                    
    766 $ ./a.out
    767 ready> def foo(x y) x+foo(y, 4.0);
    768 Parsed a function definition.
    769 ready> def foo(x y) x+y y;
    770 Parsed a function definition.
    771 Parsed a top-level expr
    772 ready> def foo(x y) x+y );
    773 Parsed a function definition.
    774 Error: unknown token when expecting an expression
    775 ready> extern sin(a);
    776 ready> Parsed an extern
    777 ready> ^D
    778 $
    779
    780
    781
    782

    There is a lot of room for extension here. You can define new AST nodes,

    783 extend the language in many ways, etc. In the next
    784 installment, we will describe how to generate LLVM Intermediate
    785 Representation (IR) from the AST.

    786
    787
    788
    789
    790

    Full Code Listing

    791
    792
    793
    794
    795

    796 Here is the complete code listing for this and the previous chapter.
    797 Note that it is fully self-contained: you don't need LLVM or any external
    798 libraries at all for this. (Besides the C and C++ standard libraries, of
    799 course.) To build this, just compile with:

    800
    801
    802
    
                      
                    
    803 # Compile
    804 clang++ -g -O3 toy.cpp
    805 # Run
    806 ./a.out
    807
    808
    809
    810

    Here is the code:

    811
    812
    813
    
                      
                    
    814 #include <cstdio>
    815 #include <cstdlib>
    816 #include <string>
    817 #include <map>
    818 #include <vector>
    819
    820 //===----------------------------------------------------------------------===//
    821 // Lexer
    822 //===----------------------------------------------------------------------===//
    823
    824 // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
    825 // of these for known things.
    826 enum Token {
    827 tok_eof = -1,
    828
    829 // commands
    830 tok_def = -2, tok_extern = -3,
    831
    832 // primary
    833 tok_identifier = -4, tok_number = -5
    834 };
    835
    836 static std::string IdentifierStr; // Filled in if tok_identifier
    837 static double NumVal; // Filled in if tok_number
    838
    839 /// gettok - Return the next token from standard input.
    840 static int gettok() {
    841 static int LastChar = ' ';
    842
    843 // Skip any whitespace.
    844 while (isspace(LastChar))
    845 LastChar = getchar();
    846
    847 if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
    848 IdentifierStr = LastChar;
    849 while (isalnum((LastChar = getchar())))
    850 IdentifierStr += LastChar;
    851
    852 if (IdentifierStr == "def") return tok_def;
    853 if (IdentifierStr == "extern") return tok_extern;
    854 return tok_identifier;
    855 }
    856
    857 if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
    858 std::string NumStr;
    859 do {
    860 NumStr += LastChar;
    861 LastChar = getchar();
    862 } while (isdigit(LastChar) || LastChar == '.');
    863
    864 NumVal = strtod(NumStr.c_str(), 0);
    865 return tok_number;
    866 }
    867
    868 if (LastChar == '#') {
    869 // Comment until end of line.
    870 do LastChar = getchar();
    871 while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');
    872
    873 if (LastChar != EOF)
    874 return gettok();
    875 }
    876
    877 // Check for end of file. Don't eat the EOF.
    878 if (LastChar == EOF)
    879 return tok_eof;
    880
    881 // Otherwise, just return the character as its ascii value.
    882 int ThisChar = LastChar;
    883 LastChar = getchar();
    884 return ThisChar;
    885 }
    886
    887 //===----------------------------------------------------------------------===//
    888 // Abstract Syntax Tree (aka Parse Tree)
    889 //===----------------------------------------------------------------------===//
    890
    891 /// ExprAST - Base class for all expression nodes.
    892 class ExprAST {
    893 public:
    894 virtual ~ExprAST() {}
    895 };
    896
    897 /// NumberExprAST - Expression class for numeric literals like "1.0".
    898 class NumberExprAST : public ExprAST {
    899 double Val;
    900 public:
    901 NumberExprAST(double val) : Val(val) {}
    902 };
    903
    904 /// VariableExprAST - Expression class for referencing a variable, like "a".
    905 class VariableExprAST : public ExprAST {
    906 std::string Name;
    907 public:
    908 VariableExprAST(const std::string &name) : Name(name) {}
    909 };
    910
    911 /// BinaryExprAST - Expression class for a binary operator.
    912 class BinaryExprAST : public ExprAST {
    913 char Op;
    914 ExprAST *LHS, *RHS;
    915 public:
    916 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
    917 : Op(op), LHS(lhs), RHS(rhs) {}
    918 };
    919
    920 /// CallExprAST - Expression class for function calls.
    921 class CallExprAST : public ExprAST {
    922 std::string Callee;
    923 std::vector<ExprAST*> Args;
    924 public:
    925 CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
    926 : Callee(callee), Args(args) {}
    927 };
    928
    929 /// PrototypeAST - This class represents the "prototype" for a function,
    930 /// which captures its name, and its argument names (thus implicitly the number
    931 /// of arguments the function takes).
    932 class PrototypeAST {
    933 std::string Name;
    934 std::vector<std::string> Args;
    935 public:
    936 PrototypeAST(const std::string &name, const std::vector<std::string> &args)
    937 : Name(name), Args(args) {}
    938
    939 };
    940
    941 /// FunctionAST - This class represents a function definition itself.
    942 class FunctionAST {
    943 PrototypeAST *Proto;
    944 ExprAST *Body;
    945 public:
    946 FunctionAST(PrototypeAST *proto, ExprAST *body)
    947 : Proto(proto), Body(body) {}
    948
    949 };
    950
    951 //===----------------------------------------------------------------------===//
    952 // Parser
    953 //===----------------------------------------------------------------------===//
    954
    955 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
    956 /// token the parser is looking at. getNextToken reads another token from the
    957 /// lexer and updates CurTok with its results.
    958 static int CurTok;
    959 static int getNextToken() {
    960 return CurTok = gettok();
    961 }
    962
    963 /// BinopPrecedence - This holds the precedence for each binary operator that is
    964 /// defined.
    965 static std::map<char, int> BinopPrecedence;
    966
    967 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
    968 static int GetTokPrecedence() {
    969 if (!isascii(CurTok))
    970 return -1;
    971
    972 // Make sure it's a declared binop.
    973 int TokPrec = BinopPrecedence[CurTok];
    974 if (TokPrec <= 0) return -1;
    975 return TokPrec;
    976 }
    977
    978 /// Error* - These are little helper functions for error handling.
    979 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
    980 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
    981 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
    982
    983 static ExprAST *ParseExpression();
    984
    985 /// identifierexpr
    986 /// ::= identifier
    987 /// ::= identifier '(' expression* ')'
    988 static ExprAST *ParseIdentifierExpr() {
    989 std::string IdName = IdentifierStr;
    990
    991 getNextToken(); // eat identifier.
    992
    993 if (CurTok != '(') // Simple variable ref.
    994 return new VariableExprAST(IdName);
    995
    996 // Call.
    997 getNextToken(); // eat (
    998 std::vector<ExprAST*> Args;
    999 if (CurTok != ')') {
    1000 while (1) {
    1001 ExprAST *Arg = ParseExpression();
    1002 if (!Arg) return 0;
    1003 Args.push_back(Arg);
    1004
    1005 if (CurTok == ')') break;
    1006
    1007 if (CurTok != ',')
    1008 return Error("Expected ')' or ',' in argument list");
    1009 getNextToken();
    1010 }
    1011 }
    1012
    1013 // Eat the ')'.
    1014 getNextToken();
    1015
    1016 return new CallExprAST(IdName, Args);
    1017 }
    1018
    1019 /// numberexpr ::= number
    1020 static ExprAST *ParseNumberExpr() {
    1021 ExprAST *Result = new NumberExprAST(NumVal);
    1022 getNextToken(); // consume the number
    1023 return Result;
    1024 }
    1025
    1026 /// parenexpr ::= '(' expression ')'
    1027 static ExprAST *ParseParenExpr() {
    1028 getNextToken(); // eat (.
    1029 ExprAST *V = ParseExpression();
    1030 if (!V) return 0;
    1031
    1032 if (CurTok != ')')
    1033 return Error("expected ')'");
    1034 getNextToken(); // eat ).
    1035 return V;
    1036 }
    1037
    1038 /// primary
    1039 /// ::= identifierexpr
    1040 /// ::= numberexpr
    1041 /// ::= parenexpr
    1042 static ExprAST *ParsePrimary() {
    1043 switch (CurTok) {
    1044 default: return Error("unknown token when expecting an expression");
    1045 case tok_identifier: return ParseIdentifierExpr();
    1046 case tok_number: return ParseNumberExpr();
    1047 case '(': return ParseParenExpr();
    1048 }
    1049 }
    1050
    1051 /// binoprhs
    1052 /// ::= ('+' primary)*
    1053 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
    1054 // If this is a binop, find its precedence.
    1055 while (1) {
    1056 int TokPrec = GetTokPrecedence();
    1057
    1058 // If this is a binop that binds at least as tightly as the current binop,
    1059 // consume it, otherwise we are done.
    1060 if (TokPrec < ExprPrec)
    1061 return LHS;
    1062
    1063 // Okay, we know this is a binop.
    1064 int BinOp = CurTok;
    1065 getNextToken(); // eat binop
    1066
    1067 // Parse the primary expression after the binary operator.
    1068 ExprAST *RHS = ParsePrimary();
    1069 if (!RHS) return 0;
    1070
    1071 // If BinOp binds less tightly with RHS than the operator after RHS, let
    1072 // the pending operator take RHS as its LHS.
    1073 int NextPrec = GetTokPrecedence();
    1074 if (TokPrec < NextPrec) {
    1075 RHS = ParseBinOpRHS(TokPrec+1, RHS);
    1076 if (RHS == 0) return 0;
    1077 }
    1078
    1079 // Merge LHS/RHS.
    1080 LHS = new BinaryExprAST(BinOp, LHS, RHS);
    1081 }
    1082 }
    1083
    1084 /// expression
    1085 /// ::= primary binoprhs
    1086 ///
    1087 static ExprAST *ParseExpression() {
    1088 ExprAST *LHS = ParsePrimary();
    1089 if (!LHS) return 0;
    1090
    1091 return ParseBinOpRHS(0, LHS);
    1092 }
    1093
    1094 /// prototype
    1095 /// ::= id '(' id* ')'
    1096 static PrototypeAST *ParsePrototype() {
    1097 if (CurTok != tok_identifier)
    1098 return ErrorP("Expected function name in prototype");
    1099
    1100 std::string FnName = IdentifierStr;
    1101 getNextToken();
    1102
    1103 if (CurTok != '(')
    1104 return ErrorP("Expected '(' in prototype");
    1105
    1106 std::vector<std::string> ArgNames;
    1107 while (getNextToken() == tok_identifier)
    1108 ArgNames.push_back(IdentifierStr);
    1109 if (CurTok != ')')
    1110 return ErrorP("Expected ')' in prototype");
    1111
    1112 // success.
    1113 getNextToken(); // eat ')'.
    1114
    1115 return new PrototypeAST(FnName, ArgNames);
    1116 }
    1117
    1118 /// definition ::= 'def' prototype expression
    1119 static FunctionAST *ParseDefinition() {
    1120 getNextToken(); // eat def.
    1121 PrototypeAST *Proto = ParsePrototype();
    1122 if (Proto == 0) return 0;
    1123
    1124 if (ExprAST *E = ParseExpression())
    1125 return new FunctionAST(Proto, E);
    1126 return 0;
    1127 }
    1128
    1129 /// toplevelexpr ::= expression
    1130 static FunctionAST *ParseTopLevelExpr() {
    1131 if (ExprAST *E = ParseExpression()) {
    1132 // Make an anonymous proto.
    1133 PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
    1134 return new FunctionAST(Proto, E);
    1135 }
    1136 return 0;
    1137 }
    1138
    1139 /// external ::= 'extern' prototype
    1140 static PrototypeAST *ParseExtern() {
    1141 getNextToken(); // eat extern.
    1142 return ParsePrototype();
    1143 }
    1144
    1145 //===----------------------------------------------------------------------===//
    1146 // Top-Level parsing
    1147 //===----------------------------------------------------------------------===//
    1148
    1149 static void HandleDefinition() {
    1150 if (ParseDefinition()) {
    1151 fprintf(stderr, "Parsed a function definition.\n");
    1152 } else {
    1153 // Skip token for error recovery.
    1154 getNextToken();
    1155 }
    1156 }
    1157
    1158 static void HandleExtern() {
    1159 if (ParseExtern()) {
    1160 fprintf(stderr, "Parsed an extern\n");
    1161 } else {
    1162 // Skip token for error recovery.
    1163 getNextToken();
    1164 }
    1165 }
    1166
    1167 static void HandleTopLevelExpression() {
    1168 // Evaluate a top-level expression into an anonymous function.
    1169 if (ParseTopLevelExpr()) {
    1170 fprintf(stderr, "Parsed a top-level expr\n");
    1171 } else {
    1172 // Skip token for error recovery.
    1173 getNextToken();
    1174 }
    1175 }
    1176
    1177 /// top ::= definition | external | expression | ';'
    1178 static void MainLoop() {
    1179 while (1) {
    1180 fprintf(stderr, "ready> ");
    1181 switch (CurTok) {
    1182 case tok_eof: return;
    1183 case ';': getNextToken(); break; // ignore top-level semicolons.
    1184 case tok_def: HandleDefinition(); break;
    1185 case tok_extern: HandleExtern(); break;
    1186 default: HandleTopLevelExpression(); break;
    1187 }
    1188 }
    1189 }
    1190
    1191 //===----------------------------------------------------------------------===//
    1192 // Main driver code.
    1193 //===----------------------------------------------------------------------===//
    1194
    1195 int main() {
    1196 // Install standard binary operators.
    1197 // 1 is lowest precedence.
    1198 BinopPrecedence['<'] = 10;
    1199 BinopPrecedence['+'] = 20;
    1200 BinopPrecedence['-'] = 20;
    1201 BinopPrecedence['*'] = 40; // highest.
    1202
    1203 // Prime the first token.
    1204 fprintf(stderr, "ready> ");
    1205 getNextToken();
    1206
    1207 // Run the main "interpreter loop" now.
    1208 MainLoop();
    1209
    1210 return 0;
    1211 }
    1212
    1213
    1214 Next: Implementing Code Generation to LLVM IR
    1215
    1216
    1217
    1218
    1219
    1220
    1224
    1225 Chris Lattner
    1226 The LLVM Compiler Infrastructure
    1227 Last modified: $Date$
    1228
    1229
    1230
    0 ===========================================
    1 Kaleidoscope: Implementing a Parser and AST
    2 ===========================================
    3
    4 .. contents::
    5 :local:
    6
    7 Written by `Chris Lattner `_
    8
    9 Chapter 2 Introduction
    10 ======================
    11
    12 Welcome to Chapter 2 of the "`Implementing a language with
    13 LLVM `_" tutorial. This chapter shows you how to use the
    14 lexer, built in `Chapter 1 `_, to build a full
    15 `parser `_ for our Kaleidoscope
    16 language. Once we have a parser, we'll define and build an `Abstract
    17 Syntax Tree `_ (AST).
    18
    19 The parser we will build uses a combination of `Recursive Descent
    20 Parsing `_ and
    21 `Operator-Precedence
    22 Parsing `_ to
    23 parse the Kaleidoscope language (the latter for binary expressions and
    24 the former for everything else). Before we get to parsing though, let's
    25 talk about the output of the parser: the Abstract Syntax Tree.
    26
    27 The Abstract Syntax Tree (AST)
    28 ==============================
    29
    30 The AST for a program captures its behavior in such a way that it is
    31 easy for later stages of the compiler (e.g. code generation) to
    32 interpret. We basically want one object for each construct in the
    33 language, and the AST should closely model the language. In
    34 Kaleidoscope, we have expressions, a prototype, and a function object.
    35 We'll start with expressions first:
    36
    37 .. code-block:: c++
    38
    39 /// ExprAST - Base class for all expression nodes.
    40 class ExprAST {
    41 public:
    42 virtual ~ExprAST() {}
    43 };
    44
    45 /// NumberExprAST - Expression class for numeric literals like "1.0".
    46 class NumberExprAST : public ExprAST {
    47 double Val;
    48 public:
    49 NumberExprAST(double val) : Val(val) {}
    50 };
    51
    52 The code above shows the definition of the base ExprAST class and one
    53 subclass which we use for numeric literals. The important thing to note
    54 about this code is that the NumberExprAST class captures the numeric
    55 value of the literal as an instance variable. This allows later phases
    56 of the compiler to know what the stored numeric value is.
    57
    58 Right now we only create the AST, so there are no useful accessor
    59 methods on them. It would be very easy to add a virtual method to pretty
    60 print the code, for example. Here are the other expression AST node
    61 definitions that we'll use in the basic form of the Kaleidoscope
    62 language:
    63
    64 .. code-block:: c++
    65
    66 /// VariableExprAST - Expression class for referencing a variable, like "a".
    67 class VariableExprAST : public ExprAST {
    68 std::string Name;
    69 public:
    70 VariableExprAST(const std::string &name) : Name(name) {}
    71 };
    72
    73 /// BinaryExprAST - Expression class for a binary operator.
    74 class BinaryExprAST : public ExprAST {
    75 char Op;
    76 ExprAST *LHS, *RHS;
    77 public:
    78 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
    79 : Op(op), LHS(lhs), RHS(rhs) {}
    80 };
    81
    82 /// CallExprAST - Expression class for function calls.
    83 class CallExprAST : public ExprAST {
    84 std::string Callee;
    85 std::vector<ExprAST*> Args;
    86 public:
    87 CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
    88 : Callee(callee), Args(args) {}
    89 };
    90
    91 This is all (intentionally) rather straightforward: variables capture
    92 the variable name, binary operators capture their opcode (e.g. '+'), and
    93 calls capture a function name as well as a list of any argument
    94 expressions. One thing that is nice about our AST is that it captures
    95 the language features without talking about the syntax of the language.
    96 Note that there is no discussion about precedence of binary operators,
    97 lexical structure, etc.
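
To illustrate the earlier remark that a virtual pretty-printing method would be easy to add, here is a minimal sketch. It does not modify the tutorial's classes: ``PrintableExpr``, ``PrintableNumber``, ``PrintableVariable``, ``PrintableBinary`` and the ``str()`` method are all hypothetical names introduced only for this example, and the use of ``std::unique_ptr`` for ownership is an assumption rather than something the tutorial code does.

```cpp
#include <cassert>
#include <memory>
#include <sstream>
#include <string>

// Hypothetical mirror of ExprAST with one added virtual method.
struct PrintableExpr {
  virtual ~PrintableExpr() {}
  virtual std::string str() const = 0;
};

// Numeric literal: prints the stored value.
struct PrintableNumber : PrintableExpr {
  double Val;
  explicit PrintableNumber(double val) : Val(val) {}
  std::string str() const override {
    std::ostringstream OS;
    OS << Val;
    return OS.str();
  }
};

// Variable reference: prints the variable name.
struct PrintableVariable : PrintableExpr {
  std::string Name;
  explicit PrintableVariable(const std::string &name) : Name(name) {}
  std::string str() const override { return Name; }
};

// Binary operator: prints both operands around the opcode, fully
// parenthesized so the tree shape is visible. Takes ownership of both operands.
struct PrintableBinary : PrintableExpr {
  char Op;
  std::unique_ptr<PrintableExpr> LHS, RHS;
  PrintableBinary(char op, PrintableExpr *lhs, PrintableExpr *rhs)
      : Op(op), LHS(lhs), RHS(rhs) {
    assert(LHS && RHS && "binary node needs both operands");
  }
  std::string str() const override {
    return "(" + LHS->str() + " " + Op + " " + RHS->str() + ")";
  }
};
```

Building ``PrintableBinary('+', new PrintableVariable("x"), new PrintableNumber(1.5))`` and calling ``str()`` on it yields ``(x + 1.5)``. The printed form needs parentheses even though the tree itself stores none, which is the sense in which the AST captures structure without syntax.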
    98
    99 For our basic language, these are all of the expression nodes we'll
    100 define. Because it doesn't have conditional control flow, it isn't
    101 Turing-complete; we'll fix that in a later installment. The two things
    102 we need next are a way to talk about the interface to a function, and a
    103 way to talk about functions themselves:
    104
    105 .. code-block:: c++
    106
    107 /// PrototypeAST - This class represents the "prototype" for a function,
    108 /// which captures its name, and its argument names (thus implicitly the number
    109 /// of arguments the function takes).
    110 class PrototypeAST {
    111 std::string Name;
    112 std::vector<std::string> Args;
    113 public:
    114 PrototypeAST(const std::string &name, const std::vector<std::string> &args)
    115 : Name(name), Args(args) {}
    116 };
    117
    118 /// FunctionAST - This class represents a function definition itself.
    119 class FunctionAST {
    120 PrototypeAST *Proto;
    121 ExprAST *Body;
    122 public:
    123 FunctionAST(PrototypeAST *proto, ExprAST *body)
    124 : Proto(proto), Body(body) {}
    125 };
    126
    127 In Kaleidoscope, functions are typed with just a count of their
    128 arguments. Since all values are double precision floating point, the
    129 type of each argument doesn't need to be stored anywhere. In a more
    130 aggressive and realistic language, the "ExprAST" class would probably
    131 have a type field.
    132
    133 With this scaffolding, we can now talk about parsing expressions and
    134 function bodies in Kaleidoscope.
    135
    136 Parser Basics
    137 =============
    138
    139 Now that we have an AST to build, we need to define the parser code to
    140 build it. The idea here is that we want to parse something like "x+y"
    141 (which is returned as three tokens by the lexer) into an AST that could
    142 be generated with calls like this:
    143
    144 .. code-block:: c++
    145
    146 ExprAST *X = new VariableExprAST("x");
    147 ExprAST *Y = new VariableExprAST("y");
    148 ExprAST *Result = new BinaryExprAST('+', X, Y);
    149
    150 In order to do this, we'll start by defining some basic helper routines:
    151
    152 .. code-block:: c++
    153
    154 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
    155 /// token the parser is looking at. getNextToken reads another token from the
    156 /// lexer and updates CurTok with its results.
    157 static int CurTok;
    158 static int getNextToken() {
    159 return CurTok = gettok();
    160 }
    161
    162 This implements a simple token buffer around the lexer. This allows us
    163 to look one token ahead at what the lexer is returning. Every function
    164 in our parser will assume that CurTok is the current token that needs to
    165 be parsed.
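
The one-token buffer can be tried in isolation with a stand-in lexer. The sketch below is only an illustration under stated assumptions: ``SketchEOF``, ``sketchGettok``, ``SketchCurTok``, ``sketchGetNextToken`` and ``countTokens`` are hypothetical names, and the "lexer" simply hands out integers from a fixed list instead of reading standard input.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the lexer: hands out tokens from a fixed list and
// then reports end of input forever. SketchEOF plays the role of tok_eof.
enum { SketchEOF = -1 };
static std::vector<int> SketchTokens;
static std::size_t SketchPos = 0;

static int sketchGettok() {
  assert(SketchPos <= SketchTokens.size());
  return SketchPos < SketchTokens.size() ? SketchTokens[SketchPos++] : SketchEOF;
}

// Same shape as CurTok/getNextToken in the text.
static int SketchCurTok;
static int sketchGetNextToken() { return SketchCurTok = sketchGettok(); }

// Counts the tokens up to end of input using only the one-token buffer.
static int countTokens(const std::vector<int> &Tokens) {
  SketchTokens = Tokens;
  SketchPos = 0;
  int Count = 0;
  sketchGetNextToken(); // prime the buffer before looking at the current token
  while (SketchCurTok != SketchEOF) {
    ++Count;
    sketchGetNextToken(); // consume the token we just looked at
  }
  return Count;
}
```

The loop only ever inspects ``SketchCurTok`` and then advances, which is the same discipline every parsing routine in this chapter follows with ``CurTok``.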
    166
    167 .. code-block:: c++
    168
    169
    170 /// Error* - These are little helper functions for error handling.
    171 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
    172 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
    173 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
    174
    175 The ``Error`` routines are simple helper routines that our parser will
    176 use to handle errors. The error recovery in our parser will not be the
    177 best and is not particularly user-friendly, but it will be enough for our
    178 tutorial. These routines make it easier to handle errors in routines
    179 that have various return types: they always return null.
    180
    181 With these basic helper functions, we can implement the first piece of
    182 our grammar: numeric literals.
    183
    184 Basic Expression Parsing
    185 ========================
    186
    187 We start with numeric literals, because they are the simplest to
    188 process. For each production in our grammar, we'll define a function
    189 which parses that production. For numeric literals, we have:
    190
    191 .. code-block:: c++
    192
    193 /// numberexpr ::= number
    194 static ExprAST *ParseNumberExpr() {
    195 ExprAST *Result = new NumberExprAST(NumVal);
    196 getNextToken(); // consume the number
    197 return Result;
    198 }
    199
    200 This routine is very simple: it expects to be called when the current
    201 token is a ``tok_number`` token. It takes the current number value,
    202 creates a ``NumberExprAST`` node, advances the lexer to the next token,
    203 and finally returns.
    204
    205 There are some interesting aspects to this. The most important one is
    206 that this routine eats all of the tokens that correspond to the
    207 production and returns the lexer buffer with the next token (which is
    208 not part of the grammar production) ready to go. This is a fairly
    209 standard way to go for recursive descent parsers. For a better example,
    210 the parenthesis operator is defined like this:
    211
    212 .. code-block:: c++
    213
    214 /// parenexpr ::= '(' expression ')'
    215 static ExprAST *ParseParenExpr() {
    216 getNextToken(); // eat (.
    217 ExprAST *V = ParseExpression();
    218 if (!V) return 0;
    219
    220 if (CurTok != ')')
    221 return Error("expected ')'");
    222 getNextToken(); // eat ).
    223 return V;
    224 }
    225
    226 This function illustrates a number of interesting things about the
    227 parser:
    228
    229 1) It shows how we use the Error routines. When called, this function
    230 expects that the current token is a '(' token, but after parsing the
    231 subexpression, it is possible that there is no ')' waiting. For example,
    232 if the user types in "(4 x" instead of "(4)", the parser should emit an
    233 error. Because errors can occur, the parser needs a way to indicate that
    234 they happened: in our parser, we return null on an error.
    235
    236 2) Another interesting aspect of this function is that it uses recursion
    237 by calling ``ParseExpression`` (we will soon see that
    238 ``ParseExpression`` can call ``ParseParenExpr``). This is powerful
    239 because it allows us to handle recursive grammars, and keeps each
    240 production very simple. Note that parentheses do not cause construction
    241 of AST nodes themselves. While we could do it this way, the most
    242 important role of parentheses are to guide the parser and provide
    243 grouping. Once the parser constructs the AST, parentheses are not
    244 needed.
    245
    246 The next simple production is for handling variable references and
    247 function calls:
    248
    249 .. code-block:: c++
    250
    251 /// identifierexpr
    252 /// ::= identifier
    253 /// ::= identifier '(' expression* ')'
    254 static ExprAST *ParseIdentifierExpr() {
    255 std::string IdName = IdentifierStr;
    256
    257 getNextToken(); // eat identifier.
    258
    259 if (CurTok != '(') // Simple variable ref.
    260 return new VariableExprAST(IdName);
    261
    262 // Call.
    263 getNextToken(); // eat (
    264 std::vector<ExprAST*> Args;
    265 if (CurTok != ')') {
    266 while (1) {
    267 ExprAST *Arg = ParseExpression();
    268 if (!Arg) return 0;
    269 Args.push_back(Arg);
    270
    271 if (CurTok == ')') break;
    272
    273 if (CurTok != ',')
    274 return Error("Expected ')' or ',' in argument list");
    275 getNextToken();
    276 }
    277 }
    278
    279 // Eat the ')'.
    280 getNextToken();
    281
    282 return new CallExprAST(IdName, Args);
    283 }
    284
    285 This routine follows the same style as the other routines. (It expects
    286 to be called if the current token is a ``tok_identifier`` token). It
    287 also has recursion and error handling. One interesting aspect of this is
    288 that it uses *look-ahead* to determine if the current identifier is a
    289 stand alone variable reference or if it is a function call expression.
    290 It handles this by checking to see if the token after the identifier is
    291 a '(' token, constructing either a ``VariableExprAST`` or
    292 ``CallExprAST`` node as appropriate.
    293
    294 Now that we have all of our simple expression-parsing logic in place, we
    295 can define a helper function to wrap it together into one entry point.
    296 We call this class of expressions "primary" expressions, for reasons
    297 that will become more clear `later in the
    298 tutorial `_. In order to parse an arbitrary
    299 primary expression, we need to determine what sort of expression it is:
    300
    301 .. code-block:: c++
    302
    303 /// primary
    304 /// ::= identifierexpr
    305 /// ::= numberexpr
    306 /// ::= parenexpr
    307 static ExprAST *ParsePrimary() {
    308 switch (CurTok) {
    309 default: return Error("unknown token when expecting an expression");
    310 case tok_identifier: return ParseIdentifierExpr();
    311 case tok_number: return ParseNumberExpr();
    312 case '(': return ParseParenExpr();
    313 }
    314 }
    315
    316 Now that you see the definition of this function, it is more obvious why
    317 we can assume the state of CurTok in the various functions. This uses
    318 look-ahead to determine which sort of expression is being inspected, and
    319 then parses it with a function call.
    320
    321 Now that basic expressions are handled, we need to handle binary
    322 expressions. They are a bit more complex.
    323
    324 Binary Expression Parsing
    325 =========================
    326
    327 Binary expressions are significantly harder to parse because they are
    328 often ambiguous. For example, when given the string "x+y\*z", the parser
    329 can choose to parse it as either "(x+y)\*z" or "x+(y\*z)". With common
    330 definitions from mathematics, we expect the latter parse, because "\*"
    331 (multiplication) has higher *precedence* than "+" (addition).
    332
    333 There are many ways to handle this, but an elegant and efficient way is
    334 to use `Operator-Precedence
    335 Parsing `_.
    336 This parsing technique uses the precedence of binary operators to guide
    337 recursion. To start with, we need a table of precedences:
    338
    339 .. code-block:: c++
    340
    341 /// BinopPrecedence - This holds the precedence for each binary operator that is
    342 /// defined.
    343 static std::map<char, int> BinopPrecedence;
    344
    345 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
    346 static int GetTokPrecedence() {
    347 if (!isascii(CurTok))
    348 return -1;
    349
    350 // Make sure it's a declared binop.
    351 int TokPrec = BinopPrecedence[CurTok];
    352 if (TokPrec <= 0) return -1;
    353 return TokPrec;
    354 }
    355
    356 int main() {
    357 // Install standard binary operators.
    358 // 1 is lowest precedence.
    359 BinopPrecedence['<'] = 10;
    360 BinopPrecedence['+'] = 20;
    361 BinopPrecedence['-'] = 20;
    362 BinopPrecedence['*'] = 40; // highest.
    363 ...
    364 }
    365
    366 For the basic form of Kaleidoscope, we will only support 4 binary
    367 operators (this can obviously be extended by you, our brave and intrepid
    368 reader). The ``GetTokPrecedence`` function returns the precedence for
    369 the current token, or -1 if the token is not a binary operator. Having a
    370 map makes it easy to add new operators and makes it clear that the
    371 algorithm doesn't depend on the specific operators involved, but it
    372 would be easy enough to eliminate the map and do the comparisons in the
    373 ``GetTokPrecedence`` function. (Or just use a fixed-size array).
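
The fixed-size array alternative mentioned above could look roughly like the following. This is a sketch, not the tutorial's code: ``SketchPrecedence``, ``setSketchPrecedence``, ``installSketchOperators`` and ``sketchTokPrecedence`` are hypothetical names, and the lookup takes the token as a parameter instead of reading the global ``CurTok`` so that it can be exercised on its own.

```cpp
#include <cassert>

// Hypothetical array-based replacement for the BinopPrecedence map. Entries
// default to 0, which means "not a binary operator", as in the map version.
static int SketchPrecedence[128];

static void setSketchPrecedence(int Tok, int Prec) {
  assert(Tok >= 0 && Tok < 128 && "operator must be an ASCII character");
  SketchPrecedence[Tok] = Prec;
}

// Installs the same four operators with the same precedences as the text.
static void installSketchOperators() {
  setSketchPrecedence('<', 10);
  setSketchPrecedence('+', 20);
  setSketchPrecedence('-', 20);
  setSketchPrecedence('*', 40);
}

// Returns the precedence of Tok, or -1 if Tok is not a binary operator.
static int sketchTokPrecedence(int Tok) {
  if (Tok < 0 || Tok >= 128) // not an ASCII character, e.g. an end-of-file token
    return -1;
  int Prec = SketchPrecedence[Tok];
  return Prec <= 0 ? -1 : Prec;
}
```

One difference from the map version is that a lookup here never inserts anything, whereas ``operator[]`` on a ``std::map`` default-constructs an entry for every key it is asked about.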
    374
    375 With the helper above defined, we can now start parsing binary
    376 expressions. The basic idea of operator precedence parsing is to break
    377 down an expression with potentially ambiguous binary operators into
    378 pieces. Consider, for example, the expression "a+b+(c+d)\*e\*f+g".
    379 Operator precedence parsing considers this as a stream of primary
    380 expressions separated by binary operators. As such, it will first parse
    381 the leading primary expression "a", then it will see the pairs [+, b]
    382 [+, (c+d)] [\*, e] [\*, f] and [+, g]. Note that because parentheses are
    383 primary expressions, the binary expression parser doesn't need to worry
    384 about nested subexpressions like (c+d) at all.
    385
    386 To start, an expression is a primary expression potentially followed by
    387 a sequence of [binop,primaryexpr] pairs:
    388
    389 .. code-block:: c++
    390
    391 /// expression
    392 /// ::= primary binoprhs
    393 ///
    394 static ExprAST *ParseExpression() {
    395 ExprAST *LHS = ParsePrimary();
    396 if (!LHS) return 0;
    397
    398 return ParseBinOpRHS(0, LHS);
    399 }
    400
    401 ``ParseBinOpRHS`` is the function that parses the sequence of pairs for
    402 us. It takes a precedence and a pointer to an expression for the part
    403 that has been parsed so far. Note that "x" is a perfectly valid
    404 expression: As such, "binoprhs" is allowed to be empty, in which case it
    405 returns the expression that is passed into it. In our example above, the
    406 code passes the expression for "a" into ``ParseBinOpRHS`` and the
    407 current token is "+".
    408
    409 The precedence value passed into ``ParseBinOpRHS`` indicates the
    410 *minimal operator precedence* that the function is allowed to eat. For
    411 example, if the current pair stream is [+, x] and ``ParseBinOpRHS`` is
    412 passed in a precedence of 40, it will not consume any tokens (because
    413 the precedence of '+' is only 20). With this in mind, ``ParseBinOpRHS``
    414 starts with:
    415
    416 .. code-block:: c++
    417
    418 /// binoprhs
    419 /// ::= ('+' primary)*
    420 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
    421 // If this is a binop, find its precedence.
    422 while (1) {
    423 int TokPrec = GetTokPrecedence();
    424
    425 // If this is a binop that binds at least as tightly as the current binop,
    426 // consume it, otherwise we are done.
    427 if (TokPrec < ExprPrec)
    428 return LHS;
    429
    430 This code gets the precedence of the current token and checks to see if
    431 it is too low. Because we defined invalid tokens to have a precedence of
    432 -1, this check implicitly knows that the pair-stream ends when the token
    433 stream runs out of binary operators. If the token passes this check, we know
    434 that it is a binary operator and that it will be included in this
    435 expression:
    436
    437 .. code-block:: c++
    438
    439 // Okay, we know this is a binop.
    440 int BinOp = CurTok;
    441 getNextToken(); // eat binop
    442
    443 // Parse the primary expression after the binary operator.
    444 ExprAST *RHS = ParsePrimary();
    445 if (!RHS) return 0;
    446
    447 As such, this code eats (and remembers) the binary operator and then
    448 parses the primary expression that follows. This builds up the whole
    449 pair, the first of which is [+, b] for the running example.
    450
    451 Now that we parsed the left-hand side of an expression and one pair of
    452 the RHS sequence, we have to decide which way the expression associates.
    453 In particular, we could have "(a+b) binop unparsed" or "a + (b binop
    454 unparsed)". To determine this, we look ahead at "binop" to determine its
    455 precedence and compare it to BinOp's precedence (which is '+' in this
    456 case):
    457
    458 .. code-block:: c++
    459
    460 // If BinOp binds less tightly with RHS than the operator after RHS, let
    461 // the pending operator take RHS as its LHS.
    462 int NextPrec = GetTokPrecedence();
    463 if (TokPrec < NextPrec) {
    464
    465 If the precedence of the binop to the right of "RHS" is lower than or equal
    466 to the precedence of our current operator, then we know that the
    467 expression associates as "(a+b) binop ...". In our example, the current
    468 operator is "+" and the next operator is "+", so they have the
    469 same precedence. In this case we'll create the AST node for "a+b", and
    470 then continue parsing:
    471
    472 .. code-block:: c++
    473
    474 ... if body omitted ...
    475 }
    476
    477 // Merge LHS/RHS.
    478 LHS = new BinaryExprAST(BinOp, LHS, RHS);
    479 } // loop around to the top of the while loop.
    480 }
    481
    482 In our example above, this will turn "a+b+" into "(a+b)" and execute the
    483 next iteration of the loop, with "+" as the current token. The code
    484 above will eat and remember the "+", then parse "(c+d)" as the primary
    485 expression, which makes the current pair equal to [+, (c+d)]. It will then
    486 evaluate the 'if' conditional above with "\*" as the binop to the right of
    487 the primary. In this case, the precedence of "\*" is higher than the
    488 precedence of "+" so the body of the 'if' will be entered.
    489
    490 The critical question left here is "how can the body of the 'if' parse the
    491 right-hand side in full?" In particular, to build the AST correctly for
    492 our example, it needs to get all of "(c+d)\*e\*f" as the RHS expression
    493 variable. The code to do this is surprisingly simple (code from the
    494 above two blocks duplicated for context):
    495
    496 .. code-block:: c++
    497
    498 // If BinOp binds less tightly with RHS than the operator after RHS, let
    499 // the pending operator take RHS as its LHS.
    500 int NextPrec = GetTokPrecedence();
    501 if (TokPrec < NextPrec) {
    502 RHS = ParseBinOpRHS(TokPrec+1, RHS);
    503 if (RHS == 0) return 0;
    504 }
    505 // Merge LHS/RHS.
    506 LHS = new BinaryExprAST(BinOp, LHS, RHS);
    507 } // loop around to the top of the while loop.
    508 }
    509
    510 At this point, we know that the binary operator to the RHS of our
    511 primary has higher precedence than the binop we are currently parsing.
    512 As such, we know that any sequence of pairs whose operators are all
    513 higher precedence than "+" should be parsed together and returned as
    514 "RHS". To do this, we recursively invoke the ``ParseBinOpRHS`` function
    515 specifying "TokPrec+1" as the minimum precedence required for it to
    516 continue. In our example above, this will cause it to return the AST
    517 node for "(c+d)\*e\*f" as RHS, which is then set as the RHS of the '+'
    518 expression.
    519
    520 Finally, on the next iteration of the while loop, the "+g" piece is
    521 parsed and added to the AST. With this little bit of code (14
    522 non-trivial lines), we correctly handle fully general binary expression
    523 parsing in a very elegant way. This was a whirlwind tour of this code,
    524 and it is somewhat subtle. I recommend running through it with a few
    525 tough examples to see how it works.
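
One way to run through such examples is a small stand-alone harness. The following is a minimal sketch of the same operator-precedence loop, not the tutorial's parser: `MiniParser`, `groupExpression`, the single-letter identifiers and the use of a parenthesized string in place of AST nodes are all assumptions made for illustration. The precedence values copy the ones the full listing installs in `main`.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>

// Hypothetical stand-in for the tutorial's parser. Tokens are single
// characters, identifiers are single letters, and instead of AST nodes we
// build a fully parenthesized string so the grouping is visible.
struct MiniParser {
  std::string Src;
  std::string::size_type Pos = 0;
  std::map<char, int> Prec{{'<', 10}, {'+', 20}, {'-', 20}, {'*', 40}};

  // Current character after skipping spaces, or '\0' at end of input.
  char cur() {
    while (Pos < Src.size() && Src[Pos] == ' ') ++Pos;
    return Pos < Src.size() ? Src[Pos] : '\0';
  }

  // Precedence of the pending operator, or -1 if it is not a binop.
  int curPrec() {
    auto It = Prec.find(cur());
    return It == Prec.end() ? -1 : It->second;
  }

  // An empty string plays the role of the tutorial's null AST pointer.
  std::string parsePrimary() {
    char C = cur();
    if (C == '(') {
      ++Pos;  // eat '('
      std::string Inner = parseExpression();
      if (Inner.empty() || cur() != ')') return "";
      ++Pos;  // eat ')'
      return Inner;
    }
    if (std::isalpha(static_cast<unsigned char>(C))) {
      ++Pos;  // eat the identifier
      return std::string(1, C);
    }
    return "";
  }

  std::string parseBinOpRHS(int ExprPrec, std::string LHS) {
    assert(ExprPrec >= 0);
    while (true) {
      int TokPrec = curPrec();
      if (TokPrec < ExprPrec) return LHS;  // operator binds too loosely: done

      char BinOp = cur();
      ++Pos;  // eat binop
      std::string RHS = parsePrimary();
      if (RHS.empty()) return "";

      // If the next operator binds more tightly, let it take RHS as its LHS.
      if (TokPrec < curPrec()) {
        RHS = parseBinOpRHS(TokPrec + 1, RHS);
        if (RHS.empty()) return "";
      }
      LHS = "(" + LHS + BinOp + RHS + ")";  // merge LHS/RHS
    }
  }

  std::string parseExpression() {
    std::string LHS = parsePrimary();
    if (LHS.empty()) return "";
    return parseBinOpRHS(0, LHS);
  }
};

// Returns the grouping of Text, or an empty string on a parse error.
std::string groupExpression(const std::string &Text) {
  MiniParser P;
  P.Src = Text;
  return P.parseExpression();
}
```

Under these assumptions, `groupExpression("a+b+(c+d)*e*f+g")` yields `(((a+b)+(((c+d)*e)*f))+g)`, which is the grouping described in the walk-through above, and `groupExpression("a+b*c")` yields `(a+(b*c))`.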
    526
    527 This wraps up handling of expressions. At this point, we can point the
    528 parser at an arbitrary token stream and build an expression from it,
    529 stopping at the first token that is not part of the expression. Next up
    530 we need to handle function definitions, etc.
    531
    532 Parsing the Rest
    533 ================
    534
    535 The next thing missing is handling of function prototypes. In
    536 Kaleidoscope, these are used both for 'extern' function declarations as
    537 well as function body definitions. The code to do this is
    538 straight-forward and not very interesting (once you've survived
    539 expressions):
    540
    541 .. code-block:: c++
    542
    543 /// prototype
    544 /// ::= id '(' id* ')'
    545 static PrototypeAST *ParsePrototype() {
    546 if (CurTok != tok_identifier)
    547 return ErrorP("Expected function name in prototype");
    548
    549 std::string FnName = IdentifierStr;
    550 getNextToken();
    551
    552 if (CurTok != '(')
    553 return ErrorP("Expected '(' in prototype");
    554
    555 // Read the list of argument names.
    556 std::vector<std::string> ArgNames;
    557 while (getNextToken() == tok_identifier)
    558 ArgNames.push_back(IdentifierStr);
    559 if (CurTok != ')')
    560 return ErrorP("Expected ')' in prototype");
    561
    562 // success.
    563 getNextToken(); // eat ')'.
    564
    565 return new PrototypeAST(FnName, ArgNames);
    566 }
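
The grammar `id '(' id* ')'` separates argument names with whitespace only, which is easy to miss. The following is a minimal sketch that applies the same grammar to a string instead of the lexer's token stream. `ToyPrototype`, `parseToyPrototype` and the two helpers are hypothetical names, and unlike the real lexer this sketch does not treat `def` or `extern` as keywords.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical result type: Ok is true only if the whole prototype parsed.
struct ToyPrototype {
  bool Ok = false;
  std::string Name;
  std::vector<std::string> Args;
};

static void skipSpaces(const std::string &T, std::string::size_type &Pos) {
  while (Pos < T.size() && std::isspace(static_cast<unsigned char>(T[Pos])))
    ++Pos;
}

// Skips spaces, then reads one [a-zA-Z][a-zA-Z0-9]* identifier at Pos.
// Returns an empty string if no identifier starts there.
static std::string readIdentifier(const std::string &T,
                                  std::string::size_type &Pos) {
  assert(Pos <= T.size());
  skipSpaces(T, Pos);
  std::string Id;
  if (Pos < T.size() && std::isalpha(static_cast<unsigned char>(T[Pos])))
    while (Pos < T.size() && std::isalnum(static_cast<unsigned char>(T[Pos])))
      Id += T[Pos++];
  return Id;
}

// prototype ::= id '(' id* ')'
ToyPrototype parseToyPrototype(const std::string &Text) {
  ToyPrototype P;
  std::string::size_type Pos = 0;

  P.Name = readIdentifier(Text, Pos);
  if (P.Name.empty()) return P;  // expected function name

  skipSpaces(Text, Pos);
  if (Pos >= Text.size() || Text[Pos] != '(') return P;  // expected '('
  ++Pos;

  // Read argument names until something that is not an identifier appears.
  for (std::string Arg = readIdentifier(Text, Pos); !Arg.empty();
       Arg = readIdentifier(Text, Pos))
    P.Args.push_back(Arg);

  if (Pos >= Text.size() || Text[Pos] != ')') return P;  // expected ')'
  P.Ok = true;
  return P;
}
```

With this sketch, `parseToyPrototype("foo(x y)")` succeeds with two arguments, while `parseToyPrototype("foo(x, y)")` fails at the comma, mirroring the "Expected ')' in prototype" path in `ParsePrototype`.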
    567
    568 Given this, a function definition is very simple, just a prototype plus
    569 an expression to implement the body:
    570
    571 .. code-block:: c++
    572
    573 /// definition ::= 'def' prototype expression
    574 static FunctionAST *ParseDefinition() {
    575 getNextToken(); // eat def.
    576 PrototypeAST *Proto = ParsePrototype();
    577 if (Proto == 0) return 0;
    578
    579 if (ExprAST *E = ParseExpression())
    580 return new FunctionAST(Proto, E);
    581 return 0;
    582 }
    583
    584 In addition, we support 'extern' to declare functions like 'sin' and
    585 'cos' as well as to support forward declaration of user functions. These
    586 'extern's are just prototypes with no body:
    587
    588 .. code-block:: c++
    589
    590 /// external ::= 'extern' prototype
    591 static PrototypeAST *ParseExtern() {
    592 getNextToken(); // eat extern.
    593 return ParsePrototype();
    594 }
    595
    596 Finally, we'll also let the user type in arbitrary top-level expressions
    597 and evaluate them on the fly. We will handle this by defining anonymous
    598 nullary (zero argument) functions for them:
    599
    600 .. code-block:: c++
    601
    602 /// toplevelexpr ::= expression
    603 static FunctionAST *ParseTopLevelExpr() {
    604 if (ExprAST *E = ParseExpression()) {
    605 // Make an anonymous proto.
    606 PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
    607 return new FunctionAST(Proto, E);
    608 }
    609 return 0;
    610 }
    611
    612 Now that we have all the pieces, let's build a little driver that will
    613 let us actually *execute* this code we've built!
    614
    615 The Driver
    616 ==========
    617
    618 The driver for this simply invokes all of the parsing pieces with a
    619 top-level dispatch loop. There isn't much interesting here, so I'll just
    620 include the top-level loop. See `below <#code>`_ for full code in the
    621 "Top-Level Parsing" section.
    622
    623 .. code-block:: c++
    624
    625 /// top ::= definition | external | expression | ';'
    626 static void MainLoop() {
    627 while (1) {
    628 fprintf(stderr, "ready> ");
    629 switch (CurTok) {
    630 case tok_eof: return;
    631 case ';': getNextToken(); break; // ignore top-level semicolons.
    632 case tok_def: HandleDefinition(); break;
    633 case tok_extern: HandleExtern(); break;
    634 default: HandleTopLevelExpression(); break;
    635 }
    636 }
    637 }
    638
    639 The most interesting part of this is that we ignore top-level
    640 semicolons. Why is this, you ask? The basic reason is that if you type
    641 "4 + 5" at the command line, the parser doesn't know whether that is the
    642 end of what you will type or not. For example, on the next line you
    643 could type "def foo..." in which case 4+5 is the end of a top-level
    644 expression. Alternatively you could type "\* 6", which would continue
    645 the expression. Having top-level semicolons allows you to type "4+5;",
    646 and the parser will know you are done.
    647
    648 Conclusions
    649 ===========
    650
    651 With just under 400 lines of commented code (240 lines of non-comment,
    652 non-blank code), we fully defined our minimal language, including a
    653 lexer, parser, and AST builder. With this done, the executable will
    654 validate Kaleidoscope code and tell us if it is grammatically invalid.
    655 For example, here is a sample interaction:
    656
    657 .. code-block:: bash
    658
    659 $ ./a.out
    660 ready> def foo(x y) x+foo(y, 4.0);
    661 Parsed a function definition.
    662 ready> def foo(x y) x+y y;
    663 Parsed a function definition.
    664 Parsed a top-level expr
    665 ready> def foo(x y) x+y );
    666 Parsed a function definition.
    667 Error: unknown token when expecting an expression
    668 ready> extern sin(a);
    669 ready> Parsed an extern
    670 ready> ^D
    671 $
    672
    673 There is a lot of room for extension here. You can define new AST nodes,
    674 extend the language in many ways, etc. In the `next
    675 installment `_, we will describe how to generate LLVM
    676 Intermediate Representation (IR) from the AST.
    677
    678 Full Code Listing
    679 =================
    680
    681 Here is the complete code listing for this and the previous chapter.
    682 Note that it is fully self-contained: you don't need LLVM or any
    683 external libraries at all for this. (Besides the C and C++ standard
    684 libraries, of course.) To build this, just compile with:
    685
    686 .. code-block:: bash
    687
    688 # Compile
    689 clang++ -g -O3 toy.cpp
    690 # Run
    691 ./a.out
    692
    693 Here is the code:
    694
    695 .. code-block:: c++
    696
    697 #include <cstdio>
    698 #include <cstdlib>
    699 #include <string>
    700 #include <map>
    701 #include <vector>
    702
    703 //===----------------------------------------------------------------------===//
    704 // Lexer
    705 //===----------------------------------------------------------------------===//
    706
    707 // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
    708 // of these for known things.
    709 enum Token {
    710 tok_eof = -1,
    711
    712 // commands
    713 tok_def = -2, tok_extern = -3,
    714
    715 // primary
    716 tok_identifier = -4, tok_number = -5
    717 };
    718
    719 static std::string IdentifierStr; // Filled in if tok_identifier
    720 static double NumVal; // Filled in if tok_number
    721
    722 /// gettok - Return the next token from standard input.
    723 static int gettok() {
    724 static int LastChar = ' ';
    725
    726 // Skip any whitespace.
    727 while (isspace(LastChar))
    728 LastChar = getchar();
    729
    730 if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
    731 IdentifierStr = LastChar;
    732 while (isalnum((LastChar = getchar())))
    733 IdentifierStr += LastChar;
    734
    735 if (IdentifierStr == "def") return tok_def;
    736 if (IdentifierStr == "extern") return tok_extern;
    737 return tok_identifier;
    738 }
    739
    740 if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
    741 std::string NumStr;
    742 do {
    743 NumStr += LastChar;
    744 LastChar = getchar();
    745 } while (isdigit(LastChar) || LastChar == '.');
    746
    747 NumVal = strtod(NumStr.c_str(), 0);
    748 return tok_number;
    749 }
    750
    751 if (LastChar == '#') {
    752 // Comment until end of line.
    753 do LastChar = getchar();
    754 while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');
    755
    756 if (LastChar != EOF)
    757 return gettok();
    758 }
    759
    760 // Check for end of file. Don't eat the EOF.
    761 if (LastChar == EOF)
    762 return tok_eof;
    763
    764 // Otherwise, just return the character as its ascii value.
    765 int ThisChar = LastChar;
    766 LastChar = getchar();
    767 return ThisChar;
    768 }
    769
    770 //===----------------------------------------------------------------------===//
    771 // Abstract Syntax Tree (aka Parse Tree)
    772 //===----------------------------------------------------------------------===//
    773
    774 /// ExprAST - Base class for all expression nodes.
    775 class ExprAST {
    776 public:
    777 virtual ~ExprAST() {}
    778 };
    779
    780 /// NumberExprAST - Expression class for numeric literals like "1.0".
    781 class NumberExprAST : public ExprAST {
    782 double Val;
    783 public:
    784 NumberExprAST(double val) : Val(val) {}
    785 };
    786
    787 /// VariableExprAST - Expression class for referencing a variable, like "a".
    788 class VariableExprAST : public ExprAST {
    789 std::string Name;
    790 public:
    791 VariableExprAST(const std::string &name) : Name(name) {}
    792 };
    793
    794 /// BinaryExprAST - Expression class for a binary operator.
    795 class BinaryExprAST : public ExprAST {
    796 char Op;
    797 ExprAST *LHS, *RHS;
    798 public:
    799 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
    800 : Op(op), LHS(lhs), RHS(rhs) {}
    801 };
    802
    803 /// CallExprAST - Expression class for function calls.
    804 class CallExprAST : public ExprAST {
    805 std::string Callee;
    806 std::vector<ExprAST*> Args;
    807 public:
    808 CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
    809 : Callee(callee), Args(args) {}
    810 };
    811
    812 /// PrototypeAST - This class represents the "prototype" for a function,
    813 /// which captures its name, and its argument names (thus implicitly the number
    814 /// of arguments the function takes).
    815 class PrototypeAST {
    816 std::string Name;
    817 std::vector<std::string> Args;
    818 public:
    819 PrototypeAST(const std::string &name, const std::vector<std::string> &args)
    820 : Name(name), Args(args) {}
    821
    822 };
    823
    824 /// FunctionAST - This class represents a function definition itself.
    825 class FunctionAST {
    826 PrototypeAST *Proto;
    827 ExprAST *Body;
    828 public:
    829 FunctionAST(PrototypeAST *proto, ExprAST *body)
    830 : Proto(proto), Body(body) {}
    831
    832 };
    833
    834 //===----------------------------------------------------------------------===//
    835 // Parser
    836 //===----------------------------------------------------------------------===//
    837
    838 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
    839 /// token the parser is looking at. getNextToken reads another token from the
    840 /// lexer and updates CurTok with its results.
    841 static int CurTok;
    842 static int getNextToken() {
    843 return CurTok = gettok();
    844 }
    845
    846 /// BinopPrecedence - This holds the precedence for each binary operator that is
    847 /// defined.
    848 static std::map<char, int> BinopPrecedence;
    849
    850 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
    851 static int GetTokPrecedence() {
    852 if (!isascii(CurTok))
    853 return -1;
    854
    855 // Make sure it's a declared binop.
    856 int TokPrec = BinopPrecedence[CurTok];
    857 if (TokPrec <= 0) return -1;
    858 return TokPrec;
    859 }
    860
    861 /// Error* - These are little helper functions for error handling.
    862 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
    863 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
    864 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
    865
    866 static ExprAST *ParseExpression();
    867
    868 /// identifierexpr
    869 /// ::= identifier
    870 /// ::= identifier '(' expression* ')'
    871 static ExprAST *ParseIdentifierExpr() {
    872 std::string IdName = IdentifierStr;
    873
    874 getNextToken(); // eat identifier.
    875
    876 if (CurTok != '(') // Simple variable ref.
    877 return new VariableExprAST(IdName);
    878
    879 // Call.
    880 getNextToken(); // eat (
    881 std::vector<ExprAST*> Args;
    882 if (CurTok != ')') {
    883 while (1) {
    884 ExprAST *Arg = ParseExpression();
    885 if (!Arg) return 0;
    886 Args.push_back(Arg);
    887
    888 if (CurTok == ')') break;
    889
    890 if (CurTok != ',')
    891 return Error("Expected ')' or ',' in argument list");
    892 getNextToken();
    893 }
    894 }
    895
    896 // Eat the ')'.
    897 getNextToken();
    898
    899 return new CallExprAST(IdName, Args);
    900 }
    901
    902 /// numberexpr ::= number
    903 static ExprAST *ParseNumberExpr() {
    904 ExprAST *Result = new NumberExprAST(NumVal);
    905 getNextToken(); // consume the number
    906 return Result;
    907 }
    908
    909 /// parenexpr ::= '(' expression ')'
    910 static ExprAST *ParseParenExpr() {
    911 getNextToken(); // eat (.
    912 ExprAST *V = ParseExpression();
    913 if (!V) return 0;
    914
    915 if (CurTok != ')')
    916 return Error("expected ')'");
    917 getNextToken(); // eat ).
    918 return V;
    919 }
    920
    921 /// primary
    922 /// ::= identifierexpr
    923 /// ::= numberexpr
    924 /// ::= parenexpr
    925 static ExprAST *ParsePrimary() {
    926 switch (CurTok) {
    927 default: return Error("unknown token when expecting an expression");
    928 case tok_identifier: return ParseIdentifierExpr();
    929 case tok_number: return ParseNumberExpr();
    930 case '(': return ParseParenExpr();
    931 }
    932 }
    933
    934 /// binoprhs
    935 /// ::= ('+' primary)*
    936 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
    937 // If this is a binop, find its precedence.
    938 while (1) {
    939 int TokPrec = GetTokPrecedence();
    940
    941 // If this is a binop that binds at least as tightly as the current binop,
    942 // consume it, otherwise we are done.
    943 if (TokPrec < ExprPrec)
    944 return LHS;
    945
    946 // Okay, we know this is a binop.
    947 int BinOp = CurTok;
    948 getNextToken(); // eat binop
    949
    950 // Parse the primary expression after the binary operator.
    951 ExprAST *RHS = ParsePrimary();
    952 if (!RHS) return 0;
    953
    954 // If BinOp binds less tightly with RHS than the operator after RHS, let
    955 // the pending operator take RHS as its LHS.
    956 int NextPrec = GetTokPrecedence();
    957 if (TokPrec < NextPrec) {
    958 RHS = ParseBinOpRHS(TokPrec+1, RHS);
    959 if (RHS == 0) return 0;
    960 }
    961
    962 // Merge LHS/RHS.
    963 LHS = new BinaryExprAST(BinOp, LHS, RHS);
    964 }
    965 }
    966
    967 /// expression
    968 /// ::= primary binoprhs
    969 ///
    970 static ExprAST *ParseExpression() {
    971 ExprAST *LHS = ParsePrimary();
    972 if (!LHS) return 0;
    973
    974 return ParseBinOpRHS(0, LHS);
    975 }
    976
    977 /// prototype
    978 /// ::= id '(' id* ')'
    979 static PrototypeAST *ParsePrototype() {
    980 if (CurTok != tok_identifier)
    981 return ErrorP("Expected function name in prototype");
    982
    983 std::string FnName = IdentifierStr;
    984 getNextToken();
    985
    986 if (CurTok != '(')
    987 return ErrorP("Expected '(' in prototype");
    988
    989 std::vector<std::string> ArgNames;
    990 while (getNextToken() == tok_identifier)
    991 ArgNames.push_back(IdentifierStr);
    992 if (CurTok != ')')
    993 return ErrorP("Expected ')' in prototype");
    994
    995 // success.
    996 getNextToken(); // eat ')'.
    997
    998 return new PrototypeAST(FnName, ArgNames);
    999 }
    1000
    1001 /// definition ::= 'def' prototype expression
    1002 static FunctionAST *ParseDefinition() {
    1003 getNextToken(); // eat def.
    1004 PrototypeAST *Proto = ParsePrototype();
    1005 if (Proto == 0) return 0;
    1006
    1007 if (ExprAST *E = ParseExpression())
    1008 return new FunctionAST(Proto, E);
    1009 return 0;
    1010 }
    1011
    1012 /// toplevelexpr ::= expression
    1013 static FunctionAST *ParseTopLevelExpr() {
    1014 if (ExprAST *E = ParseExpression()) {
    1015 // Make an anonymous proto.
    1016 PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
    1017 return new FunctionAST(Proto, E);
    1018 }
    1019 return 0;
    1020 }
    1021
    1022 /// external ::= 'extern' prototype
    1023 static PrototypeAST *ParseExtern() {
    1024 getNextToken(); // eat extern.
    1025 return ParsePrototype();
    1026 }
    1027
    1028 //===----------------------------------------------------------------------===//
    1029 // Top-Level parsing
    1030 //===----------------------------------------------------------------------===//
    1031
    1032 static void HandleDefinition() {
    1033 if (ParseDefinition()) {
    1034 fprintf(stderr, "Parsed a function definition.\n");
    1035 } else {
    1036 // Skip token for error recovery.
    1037 getNextToken();
    1038 }
    1039 }
    1040
    1041 static void HandleExtern() {
    1042 if (ParseExtern()) {
    1043 fprintf(stderr, "Parsed an extern\n");
    1044 } else {
    1045 // Skip token for error recovery.
    1046 getNextToken();
    1047 }
    1048 }
    1049
    1050 static void HandleTopLevelExpression() {
    1051 // Evaluate a top-level expression into an anonymous function.
    1052 if (ParseTopLevelExpr()) {
    1053 fprintf(stderr, "Parsed a top-level expr\n");
    1054 } else {
    1055 // Skip token for error recovery.
    1056 getNextToken();
    1057 }
    1058 }
    1059
    1060 /// top ::= definition | external | expression | ';'
    1061 static void MainLoop() {
    1062 while (1) {
    1063 fprintf(stderr, "ready> ");
    1064 switch (CurTok) {
    1065 case tok_eof: return;
    1066 case ';': getNextToken(); break; // ignore top-level semicolons.
    1067 case tok_def: HandleDefinition(); break;
    1068 case tok_extern: HandleExtern(); break;
    1069 default: HandleTopLevelExpression(); break;
    1070 }
    1071 }
    1072 }
    1073
    1074 //===----------------------------------------------------------------------===//
    1075 // Main driver code.
    1076 //===----------------------------------------------------------------------===//
    1077
    1078 int main() {
    1079 // Install standard binary operators.
    1080 // 1 is lowest precedence.
    1081 BinopPrecedence['<'] = 10;
    1082 BinopPrecedence['+'] = 20;
    1083 BinopPrecedence['-'] = 20;
    1084 BinopPrecedence['*'] = 40; // highest.
    1085
    1086 // Prime the first token.
    1087 fprintf(stderr, "ready> ");
    1088 getNextToken();
    1089
    1090 // Run the main "interpreter loop" now.
    1091 MainLoop();
    1092
    1093 return 0;
    1094 }
    1095
    1096 `Next: Implementing Code Generation to LLVM IR <LangImpl3.html>`_
    1097
    +0
    -1268
    docs/tutorial/LangImpl3.html
    None
    5 Kaleidoscope: Implementing code generation to LLVM IR
    13 Kaleidoscope: Code generation to LLVM IR
    16 • Up to Tutorial Index
    17 • Chapter 3
    19   • Chapter 3 Introduction
    20   • Code Generation Setup
    21   • Expression Code Generation
    22   • Function Code Generation
    23   • Driver Changes and Closing Thoughts
    24   • Full Code Listing
    27 • Chapter 4: Adding JIT and Optimizer Support
    32 Written by Chris Lattner
    36 Chapter 3 Introduction

    37
    38
    39
    40
    41 Welcome to Chapter 3 of the "Implementing a language
    42 with LLVM" tutorial. This chapter shows you how to transform the
    43 Abstract Syntax Tree, built in Chapter 2, into LLVM IR.
    44 This will teach you a little bit about how LLVM does things, as well as
    45 demonstrate how easy it is to use. It's much more work to build a lexer and
    46 parser than it is to generate LLVM IR code. :)
    47

    48
    49 Please note: the code in this and later chapters requires LLVM 2.2 or
    50 later. LLVM 2.1 and earlier will not work with it. Also note that you need
    51 to use a version of this tutorial that matches your LLVM release: if you are
    52 using an official LLVM release, use the version of the documentation included
    53 with your release or on the llvm.org
    54 releases page.

    55
    56
    57
    58
    59 Code Generation Setup

    60
    61
    62
    63
    64

    65 In order to generate LLVM IR, we want some simple setup to get started. First
    66 we define virtual code generation (codegen) methods in each AST class:

    67
    68
    69
    
                      
                    
    70 /// ExprAST - Base class for all expression nodes.
    71 class ExprAST {
    72 public:
    73 virtual ~ExprAST() {}
    74 virtual Value *Codegen() = 0;
    75 };
    76
    77 /// NumberExprAST - Expression class for numeric literals like "1.0".
    78 class NumberExprAST : public ExprAST {
    79 double Val;
    80 public:
    81 NumberExprAST(double val) : Val(val) {}
    82 virtual Value *Codegen();
    83 };
    84 ...
    85
    86
    87
    88 The Codegen() method says to emit IR for that AST node along with all the things it
    89 depends on, and they all return an LLVM Value object.
    90 "Value" is the class used to represent a
    91 "Static Single Assignment (SSA) register" (http://en.wikipedia.org/wiki/Static_single_assignment_form)
    92 or "SSA value" in LLVM. The most distinct aspect
    93 of SSA values is that their value is computed as the related instruction
    94 executes, and it does not get a new value until (and if) the instruction
    95 re-executes. In other words, there is no way to "change" an SSA value. For
    96 more information, please read up on
    97 Static Single Assignment: the concepts are really quite natural
    98 once you grok them.

    99
    100 Note that instead of adding virtual methods to the ExprAST class hierarchy,
    101 it could also make sense to use a
    102 visitor pattern (http://en.wikipedia.org/wiki/Visitor_pattern) or some
    103 other way to model this. Again, this tutorial won't dwell on good software
    104 engineering practices: for our purposes, adding a virtual method is
    105 simplest.

    106
    107 The
    108 second thing we want is an "Error" method like we used for the parser, which will
    109 be used to report errors found during code generation (for example, use of an
    110 undeclared parameter):

    111
    112
    113
    
                      
                    
    114 Value *ErrorV(const char *Str) { Error(Str); return 0; }
    115
    116 static Module *TheModule;
    117 static IRBuilder<> Builder(getGlobalContext());
    118 static std::map<std::string, Value*> NamedValues;
    119
    120
    121
    122 The static variables will be used during code generation. TheModule
    123 is the LLVM construct that contains all of the functions and global variables in
    124 a chunk of code. In many ways, it is the top-level structure that the LLVM IR
    125 uses to contain code.

    126
    127 The Builder object is a helper object that makes it easy to generate
    128 LLVM instructions. Instances of the
    129 IRBuilder (http://llvm.org/doxygen/IRBuilder_8h-source.html)
    130 class template keep track of the current place to insert instructions and have
    131 methods to create new instructions.

    132
    133 The NamedValues map keeps track of which values are defined in the
    134 current scope and what their LLVM representation is. (In other words, it is a
    135 symbol table for the code). In this form of Kaleidoscope, the only things that
    136 can be referenced are function parameters. As such, function parameters will
    137 be in this map when generating code for their function body.

    138
    139

    140 With these basics in place, we can start talking about how to generate code for
    141 each expression. Note that this assumes that the Builder has been set
    142 up to generate code into something. For now, we'll assume that this
    143 has already been done, and we'll just use it to emit code.
    144

    145
    146
    147
    148
    149 Expression Code Generation

    150
    151
    152
    153
    154 Generating LLVM code for expression nodes is very straightforward: less
    155 than 45 lines of commented code for all four of our expression nodes. First
    156 we'll do numeric literals:

    157
    158
    159
    
                      
                    
    160 Value *NumberExprAST::Codegen() {
    161 return ConstantFP::get(getGlobalContext(), APFloat(Val));
    162 }
    163
    164
    165
    166 In the LLVM IR, numeric constants are represented with the
    167 ConstantFP class, which holds the numeric value in an APFloat
    168 internally (APFloat has the capability of holding floating point
    169 constants of Arbitrary Precision). This code basically just
    170 creates and returns a ConstantFP. Note that in the LLVM IR,
    171 constants are all uniqued together and shared. For this reason, the API
    172 uses the "foo::get(...)" idiom instead of "new foo(..)" or "foo::Create(..)".
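
The "uniqued and shared" idea behind a `get` factory can be illustrated without LLVM. The following is a minimal sketch of the pattern only: `ToyConstant` is a hypothetical class, keyed on plain `double` comparison, and it says nothing about how `ConstantFP` is actually implemented.

```cpp
#include <cassert>
#include <map>
#include <memory>

// Hypothetical constant class, a stand-in for illustration only.
class ToyConstant {
  double Val;
  // Private constructor: callers cannot write "new ToyConstant(...)".
  explicit ToyConstant(double V) : Val(V) {}

public:
  double value() const { return Val; }

  // Returns the single shared object for V, creating it on first use.
  static const ToyConstant *get(double V) {
    assert(V == V && "NaN keys would break the map ordering");
    static std::map<double, std::unique_ptr<ToyConstant>> Pool;
    std::unique_ptr<ToyConstant> &Slot = Pool[V];
    if (!Slot)
      Slot.reset(new ToyConstant(V));
    return Slot.get();
  }
};
```

Because every request for the same value returns the same object, `ToyConstant::get(1.0) == ToyConstant::get(1.0)` holds as a pointer comparison. That is the property a plain `new` could not provide, and it is why the factory is spelled `get` in this sketch.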

    173
    174
    175
    
                      
                    
    176 Value *VariableExprAST::Codegen() {
    177 // Look this variable up in the function.
    178 Value *V = NamedValues[Name];
    179 return V ? V : ErrorV("Unknown variable name");
    180 }
    181
    182
    183
    184 References to variables are also quite simple using LLVM. In the simple version
    185 of Kaleidoscope, we assume that the variable has already been emitted somewhere
    186 and its value is available. In practice, the only values that can be in the
    187 NamedValues map are function arguments. This
    188 code simply checks to see that the specified name is in the map (if not, an
    189 unknown variable is being referenced) and returns the value for it. In future
    190 chapters, we'll add support for loop induction
    191 variables in the symbol table, and for
    192 local variables.
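
The lookup described here can be sketched in plain C++. `ToyValue` and `ToyScope` below are hypothetical stand-ins, not LLVM classes. One difference from the tutorial code is deliberate: `std::map::operator[]` inserts a null entry when a name is missing, so this sketch uses `find`, which leaves the table unchanged on a failed lookup.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for llvm::Value.
struct ToyValue {
  std::string Label;
};

// Hypothetical symbol table mapping names to values in the current scope.
class ToyScope {
  std::map<std::string, ToyValue *> NamedValues;

public:
  void bind(const std::string &Name, ToyValue *V) {
    assert(V && "bind expects a non-null value");
    NamedValues[Name] = V;
  }

  // Returns nullptr for unknown names, without inserting a new entry.
  ToyValue *lookup(const std::string &Name) const {
    auto It = NamedValues.find(Name);
    return It == NamedValues.end() ? nullptr : It->second;
  }

  std::size_t size() const { return NamedValues.size(); }
};
```

After binding only `x`, a lookup of `y` returns `nullptr` and `size()` is still 1, so an unknown-variable error can be reported without growing the table.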

    193
    194
    195
    
                      
                    
    196 Value *BinaryExprAST::Codegen() {
    197 Value *L = LHS->Codegen();
    198 Value *R = RHS->Codegen();
    199 if (L == 0 || R == 0) return 0;
    200
    201 switch (Op) {
    202 case '+': return Builder.CreateFAdd(L, R, "addtmp");
    203 case '-': return Builder.CreateFSub(L, R, "subtmp");
    204 case '*': return Builder.CreateFMul(L, R, "multmp");
    205 case '<':
    206 L = Builder.CreateFCmpULT(L, R, "cmptmp");
    207 // Convert bool 0/1 to double 0.0 or 1.0
    208 return Builder.CreateUIToFP(L, Type::getDoubleTy(getGlobalContext()),
    209 "booltmp");
    210 default: return ErrorV("invalid binary operator");
    211 }
    212 }
    213
    214
    215
    216 Binary operators start to get more interesting. The basic idea here is that
    217 we recursively emit code for the left-hand side of the expression, then the
    218 right-hand side, then we compute the result of the binary expression. In this
    219 code, we do a simple switch on the opcode to create the right LLVM instruction.
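
The recursion order (left operand, right operand, then the operation itself) can be shown with a toy emitter that records text instead of building LLVM instructions. Everything in this sketch is hypothetical: `ToyEmitter`, `ToyExpr`, `ToyVar`, `ToyBinary` and the `%tN` temporary names are illustration only, and the emitted lines merely resemble IR.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical emitter: "instructions" are strings, and emit() returns the
// name of the temporary that holds the result.
struct ToyEmitter {
  std::vector<std::string> Lines;
  unsigned NextTmp = 0;

  std::string emit(const std::string &Opcode, const std::string &L,
                   const std::string &R) {
    assert(!L.empty() && !R.empty());
    std::string Dest = "%t" + std::to_string(NextTmp++);
    Lines.push_back(Dest + " = " + Opcode + " " + L + ", " + R);
    return Dest;
  }
};

struct ToyExpr {
  virtual ~ToyExpr() {}
  // Returns the name of the value holding the result, or "" on error.
  virtual std::string codegen(ToyEmitter &E) const = 0;
};

struct ToyVar : ToyExpr {
  std::string Name;
  explicit ToyVar(std::string N) : Name(std::move(N)) {}
  std::string codegen(ToyEmitter &) const override { return "%" + Name; }
};

struct ToyBinary : ToyExpr {
  char Op;
  std::unique_ptr<ToyExpr> LHS, RHS;
  ToyBinary(char O, std::unique_ptr<ToyExpr> L, std::unique_ptr<ToyExpr> R)
      : Op(O), LHS(std::move(L)), RHS(std::move(R)) {}

  std::string codegen(ToyEmitter &E) const override {
    // Operands first, left before right, then the operation itself.
    std::string L = LHS->codegen(E);
    std::string R = RHS->codegen(E);
    if (L.empty() || R.empty()) return "";
    switch (Op) {
    case '+': return E.emit("fadd", L, R);
    case '-': return E.emit("fsub", L, R);
    case '*': return E.emit("fmul", L, R);
    default:  return "";  // invalid binary operator
    }
  }
};
```

For the tree `a + (b * c)`, this sketch records `%t0 = fmul %b, %c` first and `%t1 = fadd %a, %t0` second: the multiply is emitted before the add because the add needs its right operand's result.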
    220

    221
    222 In the example above, the LLVM builder class is starting to show its value.
    223 IRBuilder knows where to insert the newly created instruction; all you have to
    224 do is specify what instruction to create (e.g. with CreateFAdd), which
    225 operands to use (L and R here) and optionally provide a name
    226 for the generated instruction.

    227
    228 One nice thing about LLVM is that the name is just a hint. For instance, if
    229 the code above emits multiple "addtmp" variables, LLVM will automatically
    230 provide each one with an increasing, unique numeric suffix. Local value names
    231 for instructions are purely optional, but they make it much easier to read the
    232 IR dumps.
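
The "name is just a hint" behaviour can be imitated with a small helper. This is a sketch of the idea only: `ToyNamer` is a hypothetical class, not an LLVM API, and the exact suffixes LLVM chooses may differ from the ones produced here.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical helper: hands out Hint if it is unused, otherwise Hint
// followed by the smallest positive counter that is still free.
class ToyNamer {
  std::set<std::string> Used;

public:
  std::string uniqueName(const std::string &Hint) {
    std::string Name = Hint;
    for (unsigned Suffix = 1; Used.count(Name); ++Suffix)
      Name = Hint + std::to_string(Suffix);
    assert(Used.count(Name) == 0);
    Used.insert(Name);
    return Name;
  }
};
```

Requesting `addtmp` three times from one `ToyNamer` gives `addtmp`, `addtmp1` and `addtmp2` in this sketch, so a caller never has to track which names are already taken.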

    233
    234 LLVM instructions are constrained by
    235 strict rules: for example, the left and right operands of
    236 an add instruction must have the same
    237 type, and the result type of the add must match the operand types. Because
    238 all values in Kaleidoscope are doubles, this makes for very simple code for add,
    239 sub and mul.

    240
    241 On the other hand, LLVM specifies that the
    242 fcmp instruction always returns an 'i1' value
    243 (a one bit integer). The problem with this is that Kaleidoscope wants the value to be a 0.0 or 1.0 value. In order to get these semantics, we combine the fcmp instruction with
    244 a uitofp instruction. This instruction
    245 converts its input integer into a floating point value by treating the input
    246 as an unsigned value. In contrast, if we used the
    247 sitofp instruction, the Kaleidoscope '<'
    248 operator would return 0.0 and -1.0, depending on the input value.
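
The difference between the two conversions comes down to how a one-bit pattern is read. The following is a minimal sketch in plain C++ that imitates both readings for small bit widths. The function names are hypothetical and no LLVM code is involved.

```cpp
#include <cassert>
#include <cstdint>

// Reads the low Bits bits of Raw as an unsigned integer, then converts to
// double (the reading the text attributes to uitofp).
double unsignedBitsToDouble(std::uint64_t Raw, unsigned Bits) {
  assert(Bits >= 1 && Bits <= 32);
  std::uint64_t Mask = (std::uint64_t(1) << Bits) - 1;
  return static_cast<double>(Raw & Mask);
}

// Reads the low Bits bits of Raw as a two's-complement signed integer, then
// converts to double (the reading the text attributes to sitofp).
double signedBitsToDouble(std::uint64_t Raw, unsigned Bits) {
  assert(Bits >= 1 && Bits <= 32);
  std::uint64_t Mask = (std::uint64_t(1) << Bits) - 1;
  std::uint64_t SignBit = std::uint64_t(1) << (Bits - 1);
  std::int64_t Value = static_cast<std::int64_t>(Raw & Mask);
  if (Value & static_cast<std::int64_t>(SignBit))
    Value -= static_cast<std::int64_t>(1) << Bits;  // top bit set: negative
  return static_cast<double>(Value);
}
```

For a one-bit "true", `unsignedBitsToDouble(1, 1)` is 1.0 while `signedBitsToDouble(1, 1)` is -1.0, because the only bit of a one-bit signed integer is its sign bit. Both functions return 0.0 for a one-bit "false".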

    249
    250
    251
    
                      
                    
    252 Value *CallExprAST::Codegen() {
    253 // Look up the name in the global module table.
    254 Function *CalleeF = TheModule->getFunction(Callee);
    255 if (CalleeF == 0)
    256 return ErrorV("Unknown function referenced");
    257
    258 // If argument mismatch error.
    259 if (CalleeF->arg_size() != Args.size())
    260 return ErrorV("Incorrect # arguments passed");
    261
    262 std::vector<Value*> ArgsV;
    263 for (unsigned i = 0, e = Args.size(); i != e; ++i) {
    264 ArgsV.push_back(Args[i]->Codegen());
    265 if (ArgsV.back() == 0) return 0;
    266 }
    267
    268 return Builder.CreateCall(CalleeF, ArgsV, "calltmp");
    269 }
    270
    271
    272
    273 Code generation for function calls is quite straightforward with LLVM. The
    274 code above initially does a function name lookup in the LLVM Module's symbol
    275 table. Recall that the LLVM Module is the container that holds all of the
    276 functions we are JIT'ing. By giving each function the same name as what the
    277 user specifies, we can use the LLVM symbol table to resolve function names for
    278 us.

    Once we have the function to call, we recursively codegen each argument that
    is to be passed in, and create an LLVM call instruction. Note that LLVM uses
    the native C calling conventions by default, allowing these calls to also
    call into standard library functions like "sin" and "cos", with no
    additional effort.
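
    The lookup and argument-count checks described above can be sketched independently of LLVM. In this sketch, FunctionInfo, SymbolTable and checkCall are hypothetical stand-ins for the Module's symbol table and are not LLVM API; only the two error strings come from the code above.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for an LLVM Function: only the arity matters here.
struct FunctionInfo {
  std::size_t NumArgs;
};

// Hypothetical stand-in for the Module's symbol table.
using SymbolTable = std::map<std::string, FunctionInfo>;

// Returns an empty string if the call is well-formed, otherwise the same
// error message that CallExprAST::Codegen reports.
std::string checkCall(const SymbolTable &Table, const std::string &Callee,
                      std::size_t NumArgsPassed) {
  auto It = Table.find(Callee);
  if (It == Table.end())
    return "Unknown function referenced";
  if (It->second.NumArgs != NumArgsPassed)
    return "Incorrect # arguments passed";
  return "";
}
```

    Keying the table on the user-visible name is the design choice that lets one lookup serve both user-defined functions and externs such as "sin".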

    This wraps up our handling of the four basic expressions that we have so far
    in Kaleidoscope. Feel free to go in and add some more. For example, by
    browsing the LLVM language reference you'll find several other interesting
    instructions that are really easy to plug into our basic framework.


    Function Code Generation


    Code generation for prototypes and functions must handle a number of
    details, which makes their code less beautiful than expression code
    generation, but allows us to illustrate some important points. First, let's
    talk about code generation for prototypes: they are used both for function
    bodies and external function declarations. The code starts with:

    Function *PrototypeAST::Codegen() {
      // Make the function type: double(double,double) etc.
      std::vector<Type*> Doubles(Args.size(),
                                 Type::getDoubleTy(getGlobalContext()));
      FunctionType *FT = FunctionType::get(Type::getDoubleTy(getGlobalContext()),
                                           Doubles, false);

      Function *F = Function::Create(FT, Function::ExternalLinkage, Name, TheModule);

    This code packs a lot of power into a few lines. Note first that this
    function returns a "Function*" instead of a "Value*". Because a "prototype"
    really talks about the external interface for a function (not the value
    computed by an expression), it makes sense for it to return the LLVM
    Function it corresponds to when codegen'd.

    The call to FunctionType::get creates the FunctionType that should be used
    for a given Prototype. Since all function arguments in Kaleidoscope are of
    type double, the first line creates a vector of "N" LLVM double types. It
    then uses the FunctionType::get method to create a function type that takes
    "N" doubles as arguments, returns one double as a result, and that is not
    vararg (the false parameter indicates this). Note that Types in LLVM are
    uniqued just like Constants are, so you don't "new" a type, you "get" it.
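
    To make "uniqued" concrete, here is a minimal sketch of the general technique in plain C++. MiniType is a hypothetical class, not LLVM's Type; it only illustrates why such objects are obtained through a "get" function rather than constructed directly.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical miniature of a uniqued type (not LLVM's Type class).
class MiniType {
public:
  // The only way to obtain a MiniType: returns the single shared instance
  // for a given name, creating it on first use.
  static const MiniType *get(const std::string &Name) {
    static std::map<std::string, std::unique_ptr<MiniType>> Pool;
    std::unique_ptr<MiniType> &Slot = Pool[Name];
    if (!Slot)
      Slot.reset(new MiniType(Name));
    return Slot.get();
  }

  const std::string &name() const { return Name; }

private:
  // Private constructor: callers cannot "new" a MiniType themselves.
  explicit MiniType(const std::string &N) : Name(N) {}
  std::string Name;
};
```

    Because every request for the same name yields the same object, two types can be compared for equality by comparing pointers.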

    The final line above actually creates the function that the prototype will
    correspond to. This indicates the type, linkage and name to use, as well as
    which module to insert into. "External linkage" means that the function may
    be defined outside the current module and/or that it is callable by
    functions outside the module. The Name passed in is the name the user
    specified: since "TheModule" is specified, this name is registered in
    "TheModule"'s symbol table, which is used by the function call code above.

      // If F conflicted, there was already something named 'Name'. If it has a
      // body, don't allow redefinition or reextern.
      if (F->getName() != Name) {
        // Delete the one we just made and get the existing one.
        F->eraseFromParent();
        F = TheModule->getFunction(Name);

    The Module symbol table works just like the Function symbol table when it
    comes to name conflicts: if a new function is created with a name that was
    previously added to the symbol table, the new function will get implicitly
    renamed when added to the Module. The code above exploits this fact to
    determine if there was a previous definition of this function.
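
    The "compare the name you asked for with the name you got" trick can be sketched in isolation. RenamingTable and conflicted are hypothetical names, and the numeric-suffix renaming scheme is an assumption made for illustration; the exact names LLVM picks are not specified here.

```cpp
#include <set>
#include <string>

// Hypothetical symbol table that never rejects an insertion: on a
// collision it picks a fresh name instead (the suffix scheme is assumed).
class RenamingTable {
public:
  // Inserts an entry and returns the name it actually received.
  std::string insert(const std::string &Requested) {
    std::string Actual = Requested;
    unsigned Suffix = 1;
    while (Names.count(Actual))
      Actual = Requested + std::to_string(Suffix++);
    Names.insert(Actual);
    return Actual;
  }

private:
  std::set<std::string> Names;
};

// A collision happened exactly when the assigned name differs from the
// requested one, which mirrors the F->getName() != Name check.
bool conflicted(const std::string &Requested, const std::string &Actual) {
  return Requested != Actual;
}
```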

    In Kaleidoscope, I choose to allow redefinitions of functions in two cases:
    first, we want to allow 'extern'ing a function more than once, as long as
    the prototypes for the externs match (since all arguments have the same
    type, we just have to check that the number of arguments matches). Second,
    we want to allow 'extern'ing a function and then defining a body for it.
    This is useful when defining mutually recursive functions.

    In order to implement this, the code above first checks to see if there is
    a collision on the name of the function. If so, it deletes the function we
    just created (by calling eraseFromParent) and then calls getFunction to get
    the existing function with the specified name. Note that many APIs in LLVM
    have "erase" forms and "remove" forms. The "remove" form unlinks the object
    from its parent (e.g. a Function from a Module) and returns it. The "erase"
    form unlinks the object and then deletes it.
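
    The "remove" versus "erase" distinction is an ownership convention, and a small sketch may help. Parent and Item below are hypothetical types written with standard containers; they are an analogy for the convention described above, not LLVM's implementation.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical child object owned by a parent container.
struct Item {
  std::string Name;
};

class Parent {
public:
  void add(std::unique_ptr<Item> I) {
    std::string Key = I->Name;
    Items[Key] = std::move(I);
  }

  // "remove" form: unlink the object and hand it back to the caller,
  // who now owns it. Returns null if no such object exists.
  std::unique_ptr<Item> remove(const std::string &Name) {
    auto It = Items.find(Name);
    if (It == Items.end())
      return nullptr;
    std::unique_ptr<Item> Out = std::move(It->second);
    Items.erase(It);
    return Out;
  }

  // "erase" form: unlink the object and destroy it.
  void erase(const std::string &Name) { Items.erase(Name); }

  bool contains(const std::string &Name) const {
    return Items.count(Name) != 0;
  }

private:
  std::map<std::string, std::unique_ptr<Item>> Items;
};
```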

        // If F already has a body, reject this.
        if (!F->empty()) {
          ErrorF("redefinition of function");
          return 0;
        }

        // If F took a different number of args, reject.
        if (F->arg_size() != Args.size()) {
          ErrorF("redefinition of function with different # args");
          return 0;
        }
      }

    In order to verify the logic above, we first check to see if the
    pre-existing function is "empty". In this case, empty means that it has no
    basic blocks in it, which means it has no body. If it has no body, it is a
    forward declaration. If it is not empty, it already has a full definition,
    and since we don't allow anything after a full definition of the function,
    the code rejects this case. If the previous reference to a function was an
    'extern', we simply verify that the number of arguments for that definition
    and this one match up. If not, we emit an error.
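
    The redefinition policy boils down to two checks on the pre-existing function. The sketch below restates it as a pure function; ExistingFunction, Redefinition and checkRedefinition are hypothetical names introduced only for this illustration.

```cpp
#include <cstddef>

// Hypothetical summary of what we need to know about the earlier function.
struct ExistingFunction {
  bool HasBody;         // false for a forward declaration ('extern')
  std::size_t NumArgs;  // number of arguments in the earlier prototype
};

enum class Redefinition { Allowed, HasBody, ArgCountMismatch };

// Mirrors the order of the checks in PrototypeAST::Codegen: a function
// that already has a body can never be redeclared or redefined, and a
// forward declaration may only be repeated with the same arity.
Redefinition checkRedefinition(const ExistingFunction &Old,
                               std::size_t NewNumArgs) {
  if (Old.HasBody)
    return Redefinition::HasBody;
  if (Old.NumArgs != NewNumArgs)
    return Redefinition::ArgCountMismatch;
  return Redefinition::Allowed;
}
```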

      // Set names for all arguments.
      unsigned Idx = 0;
      for (Function::arg_iterator AI = F->arg_begin(); Idx != Args.size();
           ++AI, ++Idx) {
        AI->setName(Args[Idx]);

        // Add arguments to variable symbol table.
        NamedValues[Args[Idx]] = AI;
      }
      return F;
    }

    The last bit of code for prototypes loops over all of the arguments in the
    function, setting the name of the LLVM Argument objects to match, and
    registering the arguments in the NamedValues map for future use by the
    VariableExprAST AST node. Once this is set up, it returns the Function
    object to the caller. Note that we don't check for conflicting argument
    names here (e.g. "extern foo(a b a)"). Doing so would be very
    straightforward with the mechanics we have already used above.
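
    As one possible way to add the missing check, duplicate argument names can be detected with a set before any names are registered. This is a sketch of one approach, not part of the tutorial's code; hasDuplicateArgName is a hypothetical helper.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: returns true if any argument name appears more
// than once, e.g. for the prototype "extern foo(a b a)".
bool hasDuplicateArgName(const std::vector<std::string> &Args) {
  std::set<std::string> Seen;
  for (const std::string &Arg : Args) {
    // insert() reports through .second whether the name was new.
    if (!Seen.insert(Arg).second)
      return true;
  }
  return false;
}
```

    A prototype codegen routine could call such a helper on Args and report an error before creating the function.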

    Function *FunctionAST::Codegen() {
      NamedValues.clear();

      Function *TheFunction = Proto->Codegen();
      if (TheFunction == 0)
        return 0;

    Code generation for function definitions starts out simply enough: we first
    clear out the NamedValues map to make sure that there isn't anything in it
    from the last function we compiled. We then codegen the prototype (Proto)
    and verify that it is ok. Code generation of the prototype ensures that
    there is an LLVM Function object that is ready to go for us.

      // Create a new basic block to start insertion into.
      BasicBlock *BB = BasicBlock::Create(getGlobalContext(), "entry", TheFunction);
      Builder.SetInsertPoint(BB);

      if (Value *RetVal = Body->Codegen()) {

    Now we get to the point where the Builder is set up. The first line creates
    a new basic block (named "entry"), which is inserted into TheFunction. The
    second line then tells the builder that new instructions should be inserted
    into the end of the new basic block. Basic blocks in LLVM are an important
    part of functions that define the Control Flow Graph
    (http://en.wikipedia.org/wiki/Control_flow_graph). Since we don't have any
    control flow, our functions will only contain one block at this point.
    We'll fix this in Chapter 5 :).

      if (Value *RetVal = Body->Codegen()) {
        // Finish off the function.
        Builder.CreateRet(RetVal);

        // Validate the generated code, checking for consistency.
        verifyFunction(*TheFunction);

        return TheFunction;
      }

    Once the insertion point is set up, we call the Codegen() method for the
    root expression of the function. If no error happens, this emits code to
    compute the expression into the entry block and returns the value that was
    computed. Assuming no error, we then create an LLVM ret instruction
    (../LangRef.html#i_ret), which completes the function. Once the function is
    built, we call verifyFunction, which is provided by LLVM. This function
    does a variety of consistency checks on the generated code, to determine if
    our compiler is doing everything right. Using this is important: it can
    catch a lot of bugs. Once the function is finished and validated, we return
    it.

      // Error reading body, remove function.
      TheFunction->eraseFromParent();
      return 0;
    }

    The only piece left here is handling of the error case. For simplicity, we
    handle this by merely deleting the function we produced with the
    eraseFromParent method. This allows the user to redefine a function that
    they incorrectly typed in before: if we didn't delete it, it would live in
    the symbol table, with a body, preventing future redefinition.

    This code does have a bug, though. Since PrototypeAST::Codegen can return a
    previously defined forward declaration, our code can actually delete a
    forward declaration. There are a number of ways to fix this bug; see what
    you can come up with! Here is a testcase:

    extern foo(a b);     # ok, defines foo.
    def foo(a b) c;      # error, 'c' is invalid.
    def bar() foo(1, 2); # error, unknown function "foo"

    Driver Changes and Closing Thoughts


    For now, code generation to LLVM doesn't really get us much, except that we
    can look at the pretty IR calls. The sample code inserts calls to Codegen
    into the "HandleDefinition", "HandleExtern", etc. functions, and then dumps
    out the LLVM IR. This gives a nice way to look at the LLVM IR for simple
    functions. For example:

    ready> 4+5;
    Read top-level expression:
    define double @0() {
    entry:
      ret double 9.000000e+00
    }

    Note how the parser turns the top-level expression into an anonymous
    function for us. This will be handy when we add JIT support in the next
    chapter. Also note that the code is very literally transcribed: no
    optimizations are being performed except simple constant folding done by
    IRBuilder. We will add optimizations explicitly in the next chapter.
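
    The constant folding mentioned above is why "4+5" shows up as a single constant rather than an fadd. The general idea can be sketched as follows; MiniValue and FoldingBuilder are hypothetical types that only imitate the behaviour, and they say nothing about how IRBuilder is implemented internally.

```cpp
#include <string>
#include <vector>

// Hypothetical miniature value: either a constant or the named result of
// an emitted instruction.
struct MiniValue {
  bool IsConstant;
  double Constant;   // meaningful when IsConstant is true
  std::string Name;  // meaningful when IsConstant is false
};

// Hypothetical builder that folds an add of two constants at build time
// instead of emitting an instruction for it.
class FoldingBuilder {
public:
  MiniValue createFAdd(const MiniValue &L, const MiniValue &R,
                       const std::string &Name) {
    if (L.IsConstant && R.IsConstant)
      return MiniValue{true, L.Constant + R.Constant, ""};
    Emitted.push_back("%" + Name + " = fadd " + describe(L) + ", " +
                      describe(R));
    return MiniValue{false, 0.0, Name};
  }

  const std::vector<std::string> &emitted() const { return Emitted; }

private:
  static std::string describe(const MiniValue &V) {
    return V.IsConstant ? std::to_string(V.Constant) : "%" + V.Name;
  }

  std::vector<std::string> Emitted;
};
```

    Adding two constants through this builder yields a constant and emits nothing, whereas an add involving a named value is recorded as an instruction.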

    ready> def foo(a b) a*a + 2*a*b + b*b;
    Read function definition:
    define double @foo(double %a, double %b) {
    entry:
      %multmp = fmul double %a, %a
      %multmp1 = fmul double 2.000000e+00, %a
      %multmp2 = fmul double %multmp1, %b
      %addtmp = fadd double %multmp, %multmp2
      %multmp3 = fmul double %b, %b
      %addtmp4 = fadd double %addtmp, %multmp3
      ret double %addtmp4
    }

    This shows some simple arithmetic. Notice the striking similarity to the
    LLVM builder calls that we use to create the instructions.

    ready> def bar(a) foo(a, 4.0) + bar(31337);
    Read function definition:
    define double @bar(double %a) {
    entry:
      %calltmp = call double @foo(double %a, double 4.000000e+00)
      %calltmp1 = call double @bar(double 3.133700e+04)
      %addtmp = fadd double %calltmp, %calltmp1
      ret double %addtmp
    }

    This shows some function calls. Note that this function will take a long
    time to execute if you call it, because it calls itself unconditionally. In
    the future we'll add conditional control flow to actually make recursion
    useful :).

    ready> extern cos(x);
    Read extern:
    declare double @cos(double)

    ready> cos(1.234);
    Read top-level expression:
    define double @1() {
    entry:
      %calltmp = call double @cos(double 1.234000e+00)
      ret double %calltmp
    }

    This shows an extern for the libm "cos" function, and a call to it.

    ready> ^D
    ; ModuleID = 'my cool jit'

    define double @0() {
    entry:
      ret double 9.000000e+00
    }

    define double @foo(double %a, double %b) {
    entry:
      %multmp = fmul double %a, %a
      %multmp1 = fmul double 2.000000e+00, %a
      %multmp2 = fmul double %multmp1, %b
      %addtmp = fadd double %multmp, %multmp2
      %multmp3 = fmul double %b, %b
      %addtmp4 = fadd double %addtmp, %multmp3
      ret double %addtmp4
    }

    define double @bar(double %a) {
    entry:
      %calltmp = call double @foo(double %a, double 4.000000e+00)
      %calltmp1 = call double @bar(double 3.133700e+04)
      %addtmp = fadd double %calltmp, %calltmp1
      ret double %addtmp
    }

    declare double @cos(double)

    define double @1() {
    entry:
      %calltmp = call double @cos(double 1.234000e+00)
      ret double %calltmp
    }

    When you quit the current demo, it dumps out the IR for the entire module
    generated. Here you can see the big picture with all the functions
    referencing each other.

    This wraps up the third chapter of the Kaleidoscope tutorial. Up next, we'll
    describe how to add JIT codegen and optimizer support to this so we can
    actually start running code!


    Full Code Listing