diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..9e37d03 --- /dev/null +++ b/.gitignore @@ -0,0 +1 @@ +site/build \ No newline at end of file diff --git a/site/Makefile b/site/Makefile new file mode 100644 index 0000000..d0c3cbf --- /dev/null +++ b/site/Makefile @@ -0,0 +1,20 @@ +# Minimal makefile for Sphinx documentation +# + +# You can set these variables from the command line, and also +# from the environment for the first two. +SPHINXOPTS ?= +SPHINXBUILD ?= sphinx-build +SOURCEDIR = source +BUILDDIR = build + +# Put it first so that "make" without argument is like "make help". +help: + @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) + +.PHONY: help Makefile + +# Catch-all target: route all unknown targets to Sphinx using the new +# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). +%: Makefile + @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) diff --git a/site/make.bat b/site/make.bat new file mode 100644 index 0000000..6fcf05b --- /dev/null +++ b/site/make.bat @@ -0,0 +1,35 @@ +@ECHO OFF + +pushd %~dp0 + +REM Command file for Sphinx documentation + +if "%SPHINXBUILD%" == "" ( + set SPHINXBUILD=sphinx-build +) +set SOURCEDIR=source +set BUILDDIR=build + +if "%1" == "" goto help + +%SPHINXBUILD% >NUL 2>NUL +if errorlevel 9009 ( + echo. + echo.The 'sphinx-build' command was not found. Make sure you have Sphinx + echo.installed, then set the SPHINXBUILD environment variable to point + echo.to the full path of the 'sphinx-build' executable. Alternatively you + echo.may add the Sphinx directory to PATH. + echo. + echo.If you don't have Sphinx installed, grab it from + echo.https://www.sphinx-doc.org/ + exit /b 1 +) + +%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% +goto end + +:help +%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% + +:end +popd diff --git a/site/source/abstract-syntax-tree.rst b/site/source/abstract-syntax-tree.rst new file mode 100644 index 0000000..4bf867b --- /dev/null +++ b/site/source/abstract-syntax-tree.rst @@ -0,0 +1,10 @@ +==================== +Abstract Syntax Tree +==================== + +TODO + +Example Implementation +====================== + +* See `AST in EZ Language `_. diff --git a/site/source/compiler-books.rst b/site/source/compiler-books.rst new file mode 100644 index 0000000..77fae24 --- /dev/null +++ b/site/source/compiler-books.rst @@ -0,0 +1,135 @@ +============== +Compiler Books +============== + +I own a bunch of compiler books that I have purchased over the years. + +Dragon Books +============ +I have 3 editions of these. + +* Principles of Compiler Design. Aho & Ullman, 1977. +* Compilers: Principles, Techniques and Tools. Aho, Sethi, Ullman, 1986. +* Compilers: Principles, Techniques and Tools, 2nd Ed. Aho, Lam, Sethi, Ullman, 2006. + +These books are criticised today because of the excessive focus on lexical analysis and parsing techniques. +While this is true, they do cover various aspects of a compiler backend such as intermediate representations and +optimization techniques including peephole optimization, data flow analysis, register allocation etc. +I found the description of the lattice in a data flow analysis quite accessible. + +The 2nd edition adopts a more mathematical presentation style, whereas the earlier editions present +algorithms using pseudo code. I think the 1986 edition is the best. 
+
+The dragon books are a bit dated in that newer techniques such as Static Single Assignment or Graph
+Coloring Register Allocation are not covered in any detail. I would even say that these books are not
+useful if your goal is to work with an SSA IR.
+
+For a different take on the 2nd edition, see `Review of the second addition of the "Dragon Book" `_.
+
+Engineering a Compiler, 2nd Ed. Cooper & Torczon. 2012.
+=======================================================
+This is a more modern version of the Dragon book. It is less focused on the lexical analysis / parsing
+phases, and covers the later phases of a compiler in more detail. The exposition is similar to the Dragon book, i.e. it describes
+techniques conceptually, and some algorithms are described in more detail using a form of pseudo code.
+
+It defines an intermediate language called ILOC, but this IR does not have support for function calls.
+
+In practice, I found this book helpful when implementing the Dominator algorithm and SSA transformation. However, it left out
+important parts in its coverage of SSA, which meant that the algorithms as described do not work. For instance:
+
+* The SSA construction algorithm inserts Phis in blocks where the original variable is dead (semi-pruned SSA). This then
+  causes the renaming phase to fail, as there is no available definition of the variable.
+* Liveness analysis does not cover SSA form and does not handle phis correctly.
+* Exiting out of SSA is described conceptually, but the algorithms are not described in detail.
+
+In practice, though, it is easy to recommend Engineering a Compiler over the Dragon books.
+
+Both this and the Dragon books describe ahead-of-time compilers and cover topics that are suited to procedural languages
+such as C or traditional Pascal or Fortran. They cover both front-end and back-end techniques; however, on the front-end
+side, interesting topics such as Object Orientation, Closures, Generics, and the semantic analysis of more complex languages
+such as Java are not covered.
+
+Modern Compiler Implementation in C. Appel. 1998. (Tiger book)
+==============================================================
+This book takes a hands-on, tutorial-like approach to describing how to implement both the front-end and back-end
+of a compiler, using a toy language called Tiger as an example. Algorithms are described in pseudo code.
+If I had to choose from the Dragon book, Engineering a Compiler, and this book, I would pick this one and
+Engineering a Compiler.
+
+It covers a lot of techniques, and usually presents algorithms in pseudo code form. I consulted this book
+when implementing SSA and SCCP, but the descriptions were not sufficiently comprehensive, so I had to
+consult other material too.
+
+This book covers functional languages and closures, as well as Object Oriented languages such as Java. Type inference is
+covered too.
+
+Crafting a Compiler. Fischer, LeBlanc, Cytron. 2010.
+====================================================
+The last couple of chapters are the most interesting - these focus on code generation and program optimization.
+
+The 2nd edition of the book (with Cytron as co-author) has a description of Static Single Assignment. However, the
+description is based on a statement-level IR, rather than one that uses Basic Blocks. Also, the algorithm for exiting
+SSA is not described.
+
+The 1st edition describes data flow analysis in more detail, but does not cover SSA.
+
+Apart from the final two chapters, the rest of the book is about parsing and semantic analysis.
+
+Building an Optimizing Compiler. Bob Morgan. 1998.
+==================================================
+I have the Kindle edition, which is very poor and hard to read. I wish I had a paper copy.
+
+This book is almost completely about the backend of the compiler. I consulted the description of SCCP and
+based my implementation at least in part on its descriptions. In particular, I found some discussion about how to
+exploit local knowledge in conditional branches to handle null checks, which was useful, and not discussed in
+other books.
+
+Advanced Compiler Design & Implementation. Muchnick. 1997.
+==========================================================
+I have the Kindle edition, which is very poor quality and hard to read.
+
+This book is mostly about the backend of a compiler, focusing on optimization.
+
+My impression is that this book describes many algorithms in detail. But when I tried to implement one of the
+simpler algorithms (18.1 Unreachable Code Elimination), I found that the description left out a
+part (No_Path) of the algorithm.
+
+It introduces the idea of multiple levels of intermediate representation - HIR, MIR and LIR.
+I guess this has influenced many compiler implementations.
+
+Its coverage of SSA is rudimentary - I guess it was written when SSA was still very new. Hence, if you are
+working with an SSA IR, you will need to consult other material.
+
+The Graph Coloring register allocation algorithm is presented in detail and is based on the paper by
+Preston Briggs.
+
+This book has a reputation for containing many errors, although I assume the latest printings have the errors
+fixed.
+
+Despite its faults, it is a must-have book if you want to learn about compiler construction.
+
+Retargetable C Compiler, A: Design and Implementation. Hanson & Fraser. 1995.
+=============================================================================
+Describes a production C compiler. Contains a detailed walkthrough of the actual compiler code.
+
+It is weak on theoretical aspects, and limited by the features of the compiler being described. The compiler
+implementation is a single-pass code generator, hence its optimizing capabilities are limited.
+There is no coverage of data flow analysis or SSA, as these weren't used by the implementation.
+
+In short, this describes an old-school C compiler that generates code fast, but lacks optimizations.
+
+Program Flow Analysis: Theory and Applications. Editors Muchnick, Jones. 1981.
+==============================================================================
+A collection of essays on program analysis, by various authors. This is pre-SSA, hence a bit
+dated.
+
+SSA-Based Compiler Design - various authors
+===========================================
+An online version of this book is available `here `_.
+This book is a collection of articles on various topics related to SSA. As such, it presents more
+recent knowledge regarding SSA construction, optimizations based on SSA, and finally destruction and
+register allocation. I will have more to say about this book as I use it.
+
+Other Book Reviews
+==================
+* `List of compiler books `_
diff --git a/site/source/conf.py b/site/source/conf.py
new file mode 100644
index 0000000..2a3e49e
--- /dev/null
+++ b/site/source/conf.py
@@ -0,0 +1,57 @@
+# Configuration file for the Sphinx documentation builder.
+#
+# This file only contains a selection of the most common options. For a full
+# list see the documentation:
+# https://www.sphinx-doc.org/en/master/usage/configuration.html
+
+# -- Path setup --------------------------------------------------------------
+
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use os.path.abspath to make it absolute, like shown here.
+#
+# import os
+# import sys
+# sys.path.insert(0, os.path.abspath('.'))
+
+
+# -- Project information -----------------------------------------------------
+
+project = 'CompilerProgramming'
+copyright = '2024, Dibyendu Majumdar'
+author = 'Dibyendu Majumdar'
+
+# The full version, including alpha/beta/rc tags
+release = '0.1'
+
+
+# -- General configuration ---------------------------------------------------
+
+# Add any Sphinx extension module names here, as strings. They can be
+# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
+# ones.
+extensions = [
+]
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ['_templates']
+
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+# This pattern also affects html_static_path and html_extra_path.
+exclude_patterns = []
+
+
+# -- Options for HTML output -------------------------------------------------
+
+# The theme to use for HTML and HTML Help pages.  See the documentation for
+# a list of builtin themes.
+#
+html_theme = 'agogo'
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ['_static']
+
+html_title = 'Compiler Programming'
\ No newline at end of file
diff --git a/site/source/ez-lang.rst b/site/source/ez-lang.rst
new file mode 100644
index 0000000..21edbfc
--- /dev/null
+++ b/site/source/ez-lang.rst
@@ -0,0 +1,369 @@
+The EeZee Programming Language
+==============================
+
+The EeZee programming language is a toy language with just enough features to allow
+experimenting with various compiler techniques.
+
+The base language is intentionally very small. Eventually there will be extended versions
+that allow functional and object-oriented paradigms.
+
+Language features
+-----------------
+* User defined functions
+* Integer type
+* User defined ``struct`` types
+* One-dimensional arrays
+* Basic control flow such as ``if`` and ``while`` statements
+
+Keywords
+--------
+The following are keywords in the language::
+
+    func var int struct if else while break continue return null
+
+Source Unit
+-----------
+
+The EeZee language does not have the concept of modules or imports. Each source file must be
+self-contained.
+
+There is no predefined ``main`` function in a source unit. The runtime should allow
+any defined function to be invoked by supplying appropriate arguments.
+
+Types
+-----
+
+The only primitive type in the language is the integer type ``Int``.
+The size of this type is unspecified; the default implementation uses 64-bit integers.
+
+There is no distinct boolean type: non-zero integer values evaluate as true, and zero evaluates as false.
+
+Users can define one-dimensional arrays and structs.
+
+Arrays and structs are implicitly reference types, i.e. instances of these types are
+allocated on the heap.
+
+The language does not specify whether the heap is garbage collected or manually managed; this is
+up to the implementation.
+
+A ``struct`` type is a named aggregate with one or more fields. Fields may be of any supported
+type. Struct types are nominal, i.e. each struct type is identified uniquely by its name.
+Multiple definitions of a struct type are not allowed.
+
+An array type is declared by enclosing the element type in brackets, i.e. ``[`` and ``]``.
+
+There is a ``Null`` type, with a predefined literal named ``null`` of this type.
+
+When declaring fields or variables of reference types, users may suffix the type name with ``?`` to
+indicate a ``Nullable`` type. ``Null`` is an implicit subtype of all ``Nullable`` types.
+
+Examples::
+
+    struct Tree {
+        var left: Tree?
+        var right: Tree?
+    }
+    struct Test {
+        var intArray: [Int]
+    }
+    struct TreeArray {
+        var array: [Tree?]?
+    }
+
+The language does not require forward declarations.
+
+Functions
+---------
+
+Users can declare functions; each function must have a unique name.
+
+Functions cannot be overloaded. Functions are not closures.
+
+Functions can accept one or more arguments and may optionally return a result.
+
+The ``func`` keyword introduces a function declaration.
+
+Examples::
+
+    func fib(n: Int)->Int {
+        var f1=1
+        var f2=1
+        var i=n
+        while( i>1 ){
+            var temp = f1+f2
+            f1=f2
+            f2=temp
+            i=i-1
+        }
+        return f2
+    }
+
+    func foo()->Int {
+        return fib(10)
+    }
+
+Literals
+--------
+
+The only literals are integer values and ``null``.
+
+Variables and Fields
+--------------------
+
+The ``var`` keyword is used to introduce a new variable in the current lexical scope,
+or to add a field to a struct.
+
+There are two forms of this:
+
+When introducing variables, you can supply an initializer; this removes the need to
+specify a type. Examples::
+
+    var i = 1
+    var j = foo()
+
+In this form the type of the variable is inferred from the initializer's type.
+
+The second form is more suited to declaring fields in a struct. In this form
+a type is required and an initializer cannot be supplied.
+
+Example::
+
+    struct T
+    {
+        var f: Int
+        var arry: [Int]
+    }
+
+Creating new instances of Arrays
+--------------------------------
+
+The ``new`` keyword is used to create array instances.
+
+It must be followed by an array type name, and optionally followed by an initializer.
+
+The array initializer must be a comma-separated list of values, enclosed in ``{`` and ``}``.
+
+The array is sized based on the number of values in the initializer.
+
+Alternatively, the array initializer may have a field named ``len`` that specifies the size of the
+array, and a field named ``value`` that specifies the initial value of each element.
+
+Examples::
+
+    var arry = new [Int] {1,2,3}
+    var arry2 = new [Int] {len=10, value=0}
+
+The second example creates an array with 10 elements, each initialized to 0.
+
+Creating new instances of structs
+---------------------------------
+
+The ``new`` keyword is used to create struct instances.
+
+It must be followed by the struct type name, and optionally followed by an initializer.
+
+The struct initializer must be a comma-separated list of field initializers, enclosed in ``{`` and ``}``.
+
+A field initializer has the form of a field name, followed by ``=``, followed by an expression.
+
+Examples::
+
+    var stats = new Stats { age=10, height=100 }
+
+
+Control Flow
+------------
+
+The language is block structured.
+
+A block is enclosed in ``{`` and ``}`` and introduces a lexical scope.
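+
+For example, a variable declared inside a block is not visible outside of it. The bare
+block statement used in this sketch is permitted by the grammar given at the end of this page::
+
+    func blocks()->Int {
+        var x = 1
+        {
+            var y = x + 1
+            x = y
+        }
+        return x
+    }
+
+The variable ``y`` goes out of scope when the inner block ends, while the assignment to
+``x`` remains visible in the enclosing scope.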
+ +The ``if`` statement allows branching based on a condition. The condition must be an +integer expression; a value of zero is ``false``, any other value is ``true``. + +The ``if`` statement can have an optional ``else`` branch. + +The only looping construct is the ``while`` statement; this executes the sub statement +as long as the supplied condition evaluates to a non zero value. + +The ``break`` statement exits a loop. + +The ``continue`` statement branches to the beginning of the loop. + +The ``return`` statement takes an expression if the function is meant to return a value. +It causes the currently executing function to terminate. + +Expressions +----------- + +Following table describes the available operators by their precedence (low to high): + ++------------+-----------------+----------+ +| Operator | Meaning | Type | +| | | | ++============+=================+==========+ +| ``||`` | logical or | Binary | ++------------+-----------------+----------+ +| ``&&`` | logical and | Binary | ++------------+-----------------+----------+ +| ``==`` | relational | Binary | +| ``!=`` | | | +| ``<`` | | | +| ``<=`` | | | +| ``>`` | | | +| ``>=`` | | | ++------------+-----------------+----------+ +| ``+`` | addition | Binary | +| ``-`` | | | ++------------+-----------------+----------+ +| ``*`` | multiplication | Binary | +| ``/`` | | | ++------------+-----------------+----------+ +| ``-`` | negate | Unary | +| ``!`` | | | ++------------+-----------------+----------+ +| ``(...)``, | function call, | Postfix | +| ``[]``, | array index, | | +| ``.`` ID | field access | | ++------------+-----------------+----------+ + + + +Grammar +------- + +The following grammar describes the language syntax:: + + program + : declaration+ EOF + ; + + declaration + : structDeclaration + | functionDeclaration + ; + + structDeclaration + : 'struct' IDENTIFIER '{' fields '}' + ; + + fields + : varDeclaration+ + ; + + varDeclaration + : 'var' IDENTIFIER ':' typeName ';'? + ; + + typeName + : nominalType + | arrayType + ; + + nominalType + : 'Int' + | IDENTIFIER ('?')? + ; + + arrayType + : '[' nominalType ']' ('?')? + ; + + functionDeclaration + : 'func' IDENTIFIER '(' parameters? ')' ('->' typeName)? block + ; + + parameters + : parameter (',' parameter)* + ; + + parameter + : IDENTIFIER ':' typeName + ; + + block + : '{' statement* '}' + ; + + statement + : 'if' '(' expression ')' statement + | 'if' '(' expression ')' statement 'else' statement + | 'while' '(' expression ')' statement + | postfixExpression '=' expression ';'? + | block + | 'break' ';'? + | 'continue' ';'? + | varDeclaration + | 'var' IDENTIFIER '=' expression ';'? + | 'return' orExpression? ';'? + | expression ';'? + ; + + expression + : orExpression + ; + + orExpression + : andExpression ('||' andExpression)* + ; + + andExpression + : relationalExpression ('&&' relationalExpression)* + ; + + relationalExpression + : additionExpression (('==' | '!='| '>'| '<'| '>='| '<=') additionExpression)* + ; + + additionExpression + : multiplicationExpression (('+' | '-') multiplicationExpression)* + ; + + multiplicationExpression + : unaryExpression (('*' | '/' ) unaryExpression)* + ; + + unaryExpression + : ('-' | '!') unaryExpression + | postfixExpression + ; + + postfixExpression + : primaryExpression (indexExpression | callExpression | fieldExpression)* + ; + + indexExpression + : '[' orExpression ']' + ; + + callExpression + : '(' arguments? ')' + ; + + arguments + : orExpression (',' orExpression)* + ; + + fieldExpression + : '.' 
IDENTIFIER
+        ;
+
+    primaryExpression
+        : INTEGER_LITERAL
+        | IDENTIFIER
+        | '(' orExpression ')'
+        | 'new' typeName initExpression
+        ;
+
+    initExpression
+        : '{' initializers? '}'
+        ;
+
+    initializers
+        : initializer (',' initializer)*
+        ;
+
+    initializer
+        : (IDENTIFIER '=')? orExpression
+        ;
+
diff --git a/site/source/index.rst b/site/source/index.rst
new file mode 100644
index 0000000..be92cb2
--- /dev/null
+++ b/site/source/index.rst
@@ -0,0 +1,119 @@
+================================
+Welcome to Compiler Programming!
+================================
+
+This site aims to bring together practical knowledge regarding the design and implementation of optimizing compilers
+and interpreters for programming languages.
+
+There are a number of books on compilers and interpreters; however, only a few of them are accompanied by
+source code that implements the topics covered by the book. See below for a list of useful
+learning projects that do include source code.
+
+In recent years, thanks to LLVM, new programming language design has become a fertile space. New language implementations
+tend to focus on the language front-end, leveraging LLVM as the back-end for code optimization and code generation.
+While this is beneficial if you only care about the language design aspects, it is unhelpful for the industry
+as a whole, because the back-end of an optimizing compiler is a very interesting component, with a rich history of
+algorithms and data structures, and is a subject worthy of study on its own.
+
+We will cover both front-end and back-end techniques. We will implement a small-scale language as a way
+to learn various techniques, see what the common challenges are, and how to address them.
+Language design not being our goal, we will keep the language as simple as possible so that it allows us to
+focus on important implementation issues.
+
+Initially we will start with a procedural language. Later we will add features such as closures from functional languages,
+and classes and objects from OOP languages. We will also look at advanced front-end techniques such as type inference and
+generics.
+
+The language will be statically typed to start with, because this allows us to investigate the traditional compiler
+optimization pipeline. Dynamically typed languages have their own interesting engineering problems.
+We will eventually look at gradual typing and dynamic typing.
+
+Implementation and Discussions
+==============================
+
+* The `EeZee programming language implementation `_ will serve as the playground for exploring various compilation
+  techniques.
+* This site is `maintained in github `_ too, and is generated using Sphinx.
+* We have a `Discussion Forum `_.
+
+Preliminaries
+=============
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Preliminaries
+
+   prelim-impl-lang
+   ez-lang
+
+Basic Front-End techniques
+==========================
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Parsing Techniques
+
+   lexical-analysis
+   syntax-analysis
+   abstract-syntax-tree
+   type-systems
+   semantic-analysis
+
+Basic Back-end techniques
+=========================
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Backend Basics
+
+   intermediate-representations
+
+Basic Optimization techniques
+=============================
+
+* Dominators and Control Flow Graph
+* Static Single Assignment
+* Data Flow Analysis, Type Lattices, Abstract Interpretation
+* Peephole Optimizations
+* Sea of Nodes Representation
+* Code generation and Register Allocation
+
+Language Tools
+==============
+
+* Debuggers
+* Language IDEs
+
+Advanced Front-end techniques
+=============================
+
+* Type inference
+* Classes and objects
+* Closures
+* Exception handling
+* Gradual typing
+* Generics
+
+
+Some Useful Learning Resources
+==============================
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Learning Resources
+
+   learning-resources
+
+Book Reviews
+============
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Reviews
+
+   compiler-books
+
+Compiler Jobs
+=============
+
+* A listing of `compiler, language and runtime teams `_ for people looking for compiler jobs.
\ No newline at end of file
diff --git a/site/source/intermediate-representations.rst b/site/source/intermediate-representations.rst
new file mode 100644
index 0000000..0bbf344
--- /dev/null
+++ b/site/source/intermediate-representations.rst
@@ -0,0 +1,208 @@
+============================
+Intermediate Representations
+============================
+
+An input program in the source language may go through many intermediate representations within
+a compiler before it is in a form ready for execution.
+
+One of the first such intermediate representations that we have seen is the
+Abstract Syntax Tree (AST), which is mainly concerned with the grammar of the source language.
+
+From the AST, we generate a different kind of intermediate representation, one that is more amenable
+to the manipulations required during optimization and execution. There are many such representations; we will
+limit ourselves to the following.
+
+* Stack-based IR
+* Register-based IR
+* Sea of Nodes IR
+
+Stack-Based IR
+==============
+
+The stack-based IR encodes stack operations as part of the intermediate representation. Let's look at a simple
+example::
+
+    func foo(n: Int)->Int {
+        return n+1;
+    }
+
+This produces::
+
+    L0:
+        load 0
+        pushi 1
+        addi
+        jump L1
+    L1:
+
+The stack-based IR is so called because many of the instructions in the IR push and pop values to/from an evaluation stack at
+runtime. Above, for example, we have the following instructions:
+
+* ``load 0`` - this pushes the value of the input parameter ``n`` to the stack. The ``0`` here identifies the location of the variable ``n``.
+* ``pushi 1`` - pushes the constant ``1`` to the stack.
+* ``addi`` - pops the two topmost values off the stack, computes their sum, and pushes the result to the stack.
+
+So at the end of the program we are left with the sum ``n+1`` on the stack, and this forms the return
+value of the function.
+
+In this IR, control flow can be represented either using labels and branching instructions, or by grouping
+instructions into basic blocks, and linking basic blocks through jump instructions. These two approaches are
+equivalent: you can think of a label as indicating the start of a basic block, and a jump as ending
+a basic block.
+
+The idea is that inside a basic block, instructions execute linearly one after the other.
+Each basic block ends with a branching instruction, something like a goto or a conditional jump.
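+
+In an implementation, a basic block needs little more than a label, its list of instructions,
+and its successors. Here is a minimal sketch in Java (the implementation language used elsewhere
+on this site); the class and field names are illustrative, not from any particular implementation::
+
+    import java.util.ArrayList;
+    import java.util.List;
+
+    class BasicBlock {
+        final String label;                                   // e.g. "L0"
+        final List<String> instructions = new ArrayList<>();  // e.g. "pushi 1", "addi"
+        final List<BasicBlock> successors = new ArrayList<>();
+
+        BasicBlock(String label) { this.label = label; }
+
+        // A block ends with a branch: an unconditional jump adds one
+        // successor, a conditional branch adds two.
+        void addSuccessor(BasicBlock target) { successors.add(target); }
+    }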
+
+Here is a simple example of input source code and the IR you might see::
+
+    func foo()->Int
+    {
+        return 1 == 1 && 2 == 2
+    }
+
+This results in IR that may look like this::
+
+    L0:
+        pushi 1
+        pushi 1
+        eq
+        cbr L2 L3
+    L2:
+        pushi 2
+        pushi 2
+        eq
+        jump L4
+    L3:
+        pushi 0
+        jump L4
+    L4:
+        jump L1
+    L1:
+
+Each basic block begins with a label, which is just the unique name of the block.
+
+* The ``jump`` instruction transfers control from one basic block to another.
+* The ``cbr`` instruction is the conditional branch. It consumes the topmost value from the stack,
+  and if this value is true (in this case, a non-zero value), then control is transferred
+  to the first block, else to the second block.
+* The ``eq`` instruction pops the two topmost values from the stack, compares them and pushes a result:
+  ``1`` for true or ``0`` for false.
+
+Advantages
+----------
+* The IR is compact to represent in stored form, as most instructions do not have operands.
+  This is one reason why many languages choose to encode their compiled code in
+  this form. Examples are Java, C# and WebAssembly.
+* The IR can be executed easily by an interpreter.
+* It is relatively easy to generate the IR from an AST.
+
+Disadvantages
+-------------
+* Not easy to implement optimizations.
+* For a reader, it is hard to trace values as they flow through instructions,
+  as it requires tracking them through a conceptual stack.
+* Harder to analyze the IR, although there are methods available to do so.
+
+Examples
+--------
+* `Example implementation in EeZee Programming Language `_.
+* `Java Specifications `_.
+* `Web Assembly Specifications `_.
+
+Register Based IR or Three-Address IR
+=====================================
+
+This intermediate representation uses named slots, called virtual registers, in instructions that
+reference values. Let's look at the same example we saw above::
+
+    func foo(n: Int)->Int {
+        return n+1;
+    }
+
+This produces::
+
+    L0:
+        %t1 = n+1
+        ret %t1
+        goto L1
+    L1:
+
+The instructions above are as follows:
+
+* ``%t1 = n+1`` - is a typical three-address instruction of the form ``result = value1 operator value2``. The name ``%t1``
+  refers to a temporary, whereas ``n`` refers to the input argument ``n``. Both of these names are virtual registers.
+* ``ret %t1`` - is the return instruction; in this instance it references the temporary.
+
+The virtual registers in the IR are so called because they do not map to real registers in the target physical machine.
+Instead, these are just named slots in the abstract machine responsible for executing the IR. Typically, the abstract machine
+will assign each virtual register a unique location in its stack frame. So we still end up using the function's
+stack frame, but the IR references locations within the stack frame directly using these virtual names, rather than implicitly
+through push and pop instructions. During optimization, some of the virtual registers will end up in real hardware registers.
+
+Control flow is represented the same way as for the stack IR. Revisiting the same source example from above, we get the following
+IR::
+
+    L0:
+        %t0 = 1==1
+        if %t0 goto L2 else goto L3
+    L2:
+        %t0 = 2==2
+        goto L4
+    L3:
+        %t0 = 0
+        goto L4
+    L4:
+        ret %t0
+        goto L1
+    L1:
+
+
+Advantages
+----------
+* Readability: the flow of values is easier to trace, whereas with a stack IR you need to conceptualize a stack somewhere,
+  and track values being pushed and popped.
+* Fewer instructions are needed compared to a stack IR.
+* The IR can be executed easily by an interpreter.
+* Most optimization algorithms can be applied to this form of IR.
+* The IR can represent Static Single Assignment (SSA) in a natural way.
+
+Disadvantages
+-------------
+* Each instruction has operands, hence representing the IR in serialized form takes more space.
+* Harder to generate the IR during compilation.
+
+Examples
+--------
+* `Example basic register IR in EeZee Programming Language `_.
+* `Example register IR including SSA form and optimizations in EeZee Programming Language `_.
+* `LLVM instruction set `_.
+* `Android Dalvik IR `_.
+
+Sea of Nodes IR
+===============
+The final example we will look at is known as the Sea of Nodes IR.
+
+This IR is quite different from the IRs we described above.
+
+The key features of this IR are:
+
+* Instructions are NOT organized into Basic Blocks - instead, instructions form a graph, where
+  each instruction has as its inputs the definitions it uses.
+* Instructions that produce data values are not directly bound to a Basic Block; instead they "float" around,
+  their order being defined purely in terms of the dependencies between the instructions.
+* Control flow is represented in a similar way, and control flows between control flow
+  instructions. Dependencies between data instructions and control instructions occur at a few well-defined
+  places.
+* The IR as described above cannot be readily executed, because to execute the IR, the instructions
+  must be scheduled; you can think of this as a process that puts the instructions into a traditional
+  Basic Block IR as described earlier.
+
+Describing the Sea of Nodes IR is quite involved. For now, I direct you to the `Simple project `_; this
+is an ongoing effort to explain the Sea of Nodes IR representation and how to implement it.
+
+Beyond how the IR is represented, the main benefits of the Sea of Nodes IR are that:
+
+* It is an SSA IR.
+* Various optimizations such as peephole optimizations, value numbering and common subexpression elimination,
+  and dead code elimination occur as the IR is built.
+* The SoN IR can generate optimized code quickly, making it suitable for Just-In-Time (JIT) compilers.
diff --git a/site/source/learning-resources.rst b/site/source/learning-resources.rst
new file mode 100644
index 0000000..8f92714
--- /dev/null
+++ b/site/source/learning-resources.rst
@@ -0,0 +1,50 @@
+==================
+Learning Resources
+==================
+
+Courses
+=======
+
+COMP 512: Advanced Compiler Construction - Rice University, K. Cooper
+----------------------------------------------------------------------
+* `COMP 512 Lectures `_. Nice bibliography of important papers related to optimization.
+
+CS 6120: Advanced Compilers: The Self-Guided Online Course
+----------------------------------------------------------
+
+* `CS 6120 `_
+* `BRIL `_
+* `github repo `_
+
+CS 618: Program Analysis
+------------------------
+* `CS 618 Video Lectures `_
+* `An Introduction to Program Analysis `_
+
+Static Program Analysis
+-----------------------
+* `Static Program Analysis `_
+* `TIP `_
+* `Static Program Analysis Part 1 - PLISS 2019 `_
+* `Static Program Analysis Part 2 - PLISS 2019 `_
+
+Papers and Implementations
+==========================
+
+Sea of Nodes
+------------
+* `From Quads to Graphs: An Intermediate Representation's Journey `_
+* `Combining Analyses, Combining Optimizations `_
+* `A Simple Graph-Based Intermediate Representation `_
+* `Global Code Motion Global Value Numbering `_
+* `Simple Sea of Nodes Implementation `_
+
+JikesRVM
+--------
+* `Dynamic Optimization through the use of Automatic Runtime Specialization `_
+* `Implementation in JikesRVM `_
+
+Others
+======
+
+* `Automatic Program Optimization, by Ron Cytron `_
diff --git a/site/source/lexical-analysis.rst b/site/source/lexical-analysis.rst
new file mode 100644
index 0000000..ab1d26f
--- /dev/null
+++ b/site/source/lexical-analysis.rst
@@ -0,0 +1,55 @@
+================
+Lexical Analysis
+================
+
+When compiling a program we need to recognize the words and punctuation that make up the vocabulary of the language.
+This part of the compiler is therefore known as "lexical" analysis.
+
+Usually a compiler is given one or more input programs, and the first thing it must do is read the program and
+figure out what lexical elements appear in the program.
+
+Typically, these lexical elements are known as tokens. So, for example, in the following snippet of code::
+
+    print('hi')
+
+We have a number of lexical elements / tokens:
+
+* ``print``
+* ``(``
+* ``'hi'``
+* ``)``
+
+There are many different ways to implement a "lexer" - the name we give to this component of the compiler.
+
+* We can write this code by hand. This involves scanning the input program character by character and
+  deciding what tokens appear in the program.
+* Or we can specify the lexical elements in a grammar and have a tool generate the code to process the input
+  program and give us the tokens that appear in the program.
+
+A lexical analyser can be designed to process input on demand, or it may be designed to translate the entire
+input source to a set of tokens at the very beginning.
+
+Considerations
+==============
+
+* Should comments in the input program be retained as tokens? Usually a lexer will discard comments, but in languages that
+  allow comments to be retained as documentation, the lexer must not discard them.
+* Should end-of-line markers be retained? Typically lexers drop all intermediate space, including line markers,
+  but if the language syntax depends on line markers then these may need to be retained.
+* Should tokens copy the input text, convert it to another form, or retain pointers into the input itself?
+  Retaining the original form of the lexical token may be important in some cases, for example if the lexer
+  is used in a code formatter.
+* How far can we peek ahead? During later stages of the compiler, depending on the complexity of the language grammar,
+  it may be necessary to allow the compiler to look ahead one or more tokens without consuming them.
+* Ancillary information regarding tokens, such as the line number and column number in the input source, is invaluable for
+  error reporting.
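+
+To make the on-demand style concrete, here is a highly simplified sketch of a hand-coded lexer
+in Java. The token kinds and method names are illustrative only; they are not taken from the EZ
+implementation referenced below::
+
+    enum TokenKind { IDENT, NUMBER, PUNCT, EOF }
+
+    class Token {
+        final TokenKind kind;
+        final String text;
+        final int line, col;   // ancillary information for error reporting
+        Token(TokenKind kind, String text, int line, int col) {
+            this.kind = kind; this.text = text; this.line = line; this.col = col;
+        }
+    }
+
+    class Lexer {
+        private final String src;
+        private int pos = 0, line = 1, col = 1;
+
+        Lexer(String src) { this.src = src; }
+
+        // Returns the next token on demand; the caller drives the lexer.
+        Token next() {
+            skipWhitespace();
+            if (pos >= src.length()) return new Token(TokenKind.EOF, "", line, col);
+            int startLine = line, startCol = col;
+            char c = src.charAt(pos);
+            if (Character.isLetter(c)) {
+                int start = pos;
+                while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) advance();
+                return new Token(TokenKind.IDENT, src.substring(start, pos), startLine, startCol);
+            }
+            if (Character.isDigit(c)) {
+                int start = pos;
+                while (pos < src.length() && Character.isDigit(src.charAt(pos))) advance();
+                return new Token(TokenKind.NUMBER, src.substring(start, pos), startLine, startCol);
+            }
+            advance();
+            return new Token(TokenKind.PUNCT, String.valueOf(c), startLine, startCol);
+        }
+
+        private void skipWhitespace() {
+            while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) advance();
+        }
+
+        private void advance() {
+            if (src.charAt(pos) == '\n') { line++; col = 1; } else { col++; }
+            pos++;
+        }
+    }
+
+A parser would then call ``next()`` repeatedly; a small buffer of tokens layered on top of this
+interface is enough to support the look-ahead discussed above.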
+
+Example Hand-Coded Implementation
+=================================
+
+The `Lexer `_ module in the EZ language
+implementation contains an example of a hand-coded lexical analyser written in Java. This implementation returns tokens
+on demand.
+
+Another example is the `Lua lexer `_.
+
diff --git a/site/source/prelim-impl-lang.rst b/site/source/prelim-impl-lang.rst
new file mode 100644
index 0000000..eef121c
--- /dev/null
+++ b/site/source/prelim-impl-lang.rst
@@ -0,0 +1,28 @@
+Compiler Implementation Language
+================================
+
+A compiler can be implemented in any language we choose. For a pedagogical project, it is more convenient
+to choose a language that is widely used, has garbage collection, and comes with excellent tools such
+as IDEs and debuggers.
+
+Production-quality compilers are often written in C, C++ or Rust. For us these languages are too difficult
+to work with.
+
+Lisp and Python appear to be popular languages in teaching projects. Lisp is not as widely used
+as we would like our implementation language to be, and dynamically typed languages such as Python are
+harder to work with as the project grows.
+
+Compared to C, C++ and Rust, the programming language D appears to be much more suitable for this project -
+from a technical standpoint, that is. It is a garbage-collected language that has less friction and is pleasant to
+work with. The main negatives are that it is not a popular language, and the tooling is not up to
+the standards of other languages.
+
+Go, Java, Kotlin, Swift and C# seem like good candidates. Java has some limitations that make it harder to write the
+memory-optimized code that is often necessary in a production compiler, but we don't care so much about that.
+
+I decided to use Java because it is the language I am most familiar with, has great tooling, and despite some
+shortcomings, is widely understood by developers around the world. My first choice would have been D if it were
+purely a question of technical preference.
+
+The use of Java biases the implementation towards using some Object Orientation; this is just a consequence of the
+most comfortable way of expressing some designs in Java.
diff --git a/site/source/semantic-analysis.rst b/site/source/semantic-analysis.rst
new file mode 100644
index 0000000..779c487
--- /dev/null
+++ b/site/source/semantic-analysis.rst
@@ -0,0 +1,7 @@
+=================
+Semantic Analysis
+=================
+
+TODO
+
+
diff --git a/site/source/syntax-analysis.rst b/site/source/syntax-analysis.rst
new file mode 100644
index 0000000..b14c1cb
--- /dev/null
+++ b/site/source/syntax-analysis.rst
@@ -0,0 +1,10 @@
+===============
+Syntax Analysis
+===============
+
+TODO
+
+Example Implementation
+======================
+
+See `EZ Language Parser `_.
diff --git a/site/source/type-systems.rst b/site/source/type-systems.rst
new file mode 100644
index 0000000..701f68c
--- /dev/null
+++ b/site/source/type-systems.rst
@@ -0,0 +1,10 @@
+============
+Type Systems
+============
+
+TODO
+
+Example Implementation
+======================
+
+See `Type System in EZ Language `_.