Worked on report.

4c34a823 · Taddeus Kroes · 586933e5 · 4c34a823
Commit 4c34a823 authored 13 years ago by Taddeus Kroes
--- a/report/report.tex
+++ b/report/report.tex
@@ -3,6 +3,10 @@
 \usepackage{amsmath,amsfonts,amssymb,booktabs,graphicx,listings,subfigure}
 \usepackage{float,hyperref}

+% Paragraph indentation
+\setlength{\parindent}{0pt}
+\setlength{\parskip}{1ex plus 0.5ex minus 0.2ex}
+
 \title{Peephole Optimizer}
 \author{Jayke Meijer (6049885), Richard Torenvliet (6138861), Tadde\"us Kroes
    (6054129)}
@@ -16,248 +20,301 @@

 \section{Introduction}

-The goal of the assignment is to implement the optimization stage of the
-compiler. To reach this goal the parser and the optimizer part of the compiler
-have to be implemented.
+The goal of the assignment is to implement the peephole optimization stage of
+xgcc cross compiler. This requires a MIPS Assembly parser to parse the output
+of the compiler. Also, an assembly writer is needed to write the optimized
+statements back to valid Assembly code for the assembler.

-The output of the xgcc cross compiler on a C program is our input. The output
-of the xgcc cross compiler is in the form of Assembly code, but not optimized.
-Our assignment includes a number of C programs. An important part of the
-assignment is parsing the data. Parsing the data is done with Lex and Yacc. The
-Lexer is a program that finds keywords that meets the regular expression
-provided in the Lexer. After the Lexer, the Yaccer takes over. Yacc can turn
-the keywords in to an action.
+The assignment provides a number of benchmarks written in C. The objective is
+to obtain a high speedup in number of cycles for these benchmarks.

-\section{Design}
+\section{Types of optimizations}

-There are two general types of optimizations of the assembly code, global
-optimizations and optimizations on a so-called basic block. These optimizations
-will be discussed separately
+There are two general types of optimizations on the assembly code: global
+optimizations and optimizations on so-called basic blocks. These optimizations
+will be discussed individually below.

 \subsection{Global optimizations}

 We only perform one global optimization, which is optimizing branch-jump
-statements. The unoptimized Assembly code contains sequences of code of the
-following structure:
+statements. The unoptimized Assembly code contains sequences of statements with
+the following structure:
 \begin{verbatim}
    beq ...,$Lx
    j $Ly
 $Lx:   ...
 \end{verbatim}
-This is inefficient, since there is a jump to a label that follows this code.
-It would be more efficient to replace the branch statement with a \texttt{bne}
-(the opposite case) to the label used in the jump statement. This way the jump
-statement can be eliminated, since the next label follows anyway. The same can
-of course be done for the opposite case, where a \texttt{bne} is changed into a
-\texttt{beq}.
+%This is inefficient, since there is a branch to a label that follows this code.
+In this code, it is more efficient to replace the branch statement with a
+\texttt{bne} (the opposite case) to the label used in the jump statement. This
+way, the jump statement can be eliminated since the label directly follows it.
+The same can be done for the opposite case, where a \texttt{bne} is changed
+into a \texttt{beq}.

 Since this optimization is done between two series of codes with jumps and
 labels, we can not perform this code during the basic block optimizations.

 \subsection{Basic Block Optimizations}

-Optimizations on basic blocks are a more important part of the optimizer.
-First, what is a basic block? A basic block is a sequence of statements
+Optimizations on basic blocks are a more extended part of the optimizer.
+
+First of all, what is a basic block? A basic block is a sequence of statements
 guaranteed to be executed in that order, and that order alone. This is the case
-for a piece of code not containing any branches or jumps.
+for a piece of code not containing any branches or jumps (except for the last
+statement).

-To create a basic block, you need to define what is the leader of a basic
-block. We call a statement a leader if it is either a jump/branch statement, or
-the target of such a statement. Then a basic block runs from one leader until
-the next leader.
+To divide the code into basic blocks, the ``leaders'' have to be found. A
+leading statement is a leader if it is either a jump or branch statement, or
+the target of such a statement. Each leader is the start of a new basic block.

-There are quite a few optimizations we perform on these basic blocks, so we
-will describe the types of optimizations here in stead of each optimization.
+There are five types of optimizations performed on basic blocks in our
+implementation. Each is described individually below.

-\subsubsection*{Standard peephole optimizations}
+\subsubsection{Standard peephole optimizations}

-These are optimizations that simply look for a certain statement or pattern of
+These are optimizations that look for a certain statement or pattern of
 statements, and optimize these. For example,
 \begin{verbatim}
 mov $regA,$regB
 instr $regA, $regA,...
 \end{verbatim}
-can be optimized into
+can be optimized to:
 \begin{verbatim}
 instr $regA, $regB,...
 \end{verbatim}
-since the register \texttt{\$regA} gets overwritten by the second instruction
-anyway, and the instruction can easily use \texttt{\$regB} in stead of
-\texttt{\$regA}. There are a few more of these cases, which are the same as
-those described on the practicum page
+\texttt{\$regA} should contain the same value as \texttt{\$regB} after the move
+statement, so \texttt{\$regB} can be used by \texttt{instr}.  Since
+\texttt{instr} overwrites \texttt{\$regA}, the move statement has not further
+effect after \texttt{instr} and can be removed.
+
+There are a few more of these cases, which are described on the practicum page
 \footnote{\url{http://staff.science.uva.nl/~andy/compiler/prac.html}} and in
 Appendix \ref{opt}.

-\subsubsection*{Common subexpression elimination}
+\subsubsection{Common subexpression elimination}

 A more advanced optimization is common subexpression elimination. This means
-that expensive operations as a multiplication or addition are performed only
-once and the result is then `copied' into variables where needed.
+that expensive operations like multiplications or additions are performed only
+once and the result is then `copied' into registers where needed.
 \begin{verbatim}
-
-addu	$2,$4,$3              addu = $t1, $4, $3
-...                        mov = $2, $t1
+addu $2,$4,$3              addu = $8, $4, $3  # $8 is free
+...                        mov = $2, $8
 ...                   ->   ...
 ...                        ...
-addu	$5,$4,$3              mov = $4, $t1
-
+addu $5,$4,$3              mov = $4, $8
 \end{verbatim}

-A standard method for doing this is the creation of a DAG or Directed Acyclic
-Graph. However, this requires a fairly advanced implementation. Our
-implementation is a slightly less fancy, but easier to implement.
-We search from the end of the block up for instructions that are eligible for
-CSE. If we find one, we check further up in the code for the same instruction,
-and add that to a temporary storage list. This is done until the beginning of
-the block or until one of the arguments of this expression is assigned.
+A standard method for doing this is usage of a DAG or Directed Acyclic Graph.
+However, this requires either the code to be in Static single
+assignment
+form\footnote{\url{http://en.wikipedia.org/wiki/Static\_single\_assignment\_form}},
+or an advanced liveness check. Our implementation contains a (partially tested)
+implementation of DAG creation, but this is not used in the final
+implementation.  However, our implementation does contain a simplified version
+of common subexpression elimination:
+
+The statement list of a block is traversed in reversed order, looking for
+instructions that are eligible for CSE (\texttt{addu}, for example). If such an
+instruction is found, it is marked and the rest of the statement list is
+traversed while marking all statements that are equal to the found instruction.
+If a statement assigns a register that is uses by the instruction, traversal
+stops.
+
+If more than one instruction have been marked, a new instruction is inserted
+above the first occurrence (the last occurrence in reversed order). This
+instruction performs the calculation and saves it in a free temporary register.
+Then, each occurrence is replaced by a \texttt{move} of the free register to
+its original destination register.
+
+This method is obviously less efficient method then the DAG.  However, since
+the basic blocks are generally not very large and the execution time of the
+optimizer is not a primary concern, this is not a large problem.
+
+\subsubsection{Constant folding}

-We now add the instruction above the first use, and write the result in a new
-variable. Then all occurrences of this expression can be replaced by a move of
-from new variable into the original destination variable of the instruction.
-
-This is a less efficient method then the DAG, but because the basic blocks are
-in general not very large and the execution time of the optimizer is not a
-primary concern, this is not a big problem.
-
-\subsubsection*{Fold constants}
 Constant folding is an optimization where the outcome of arithmetics are
-calculated at compile time. If a value x is assigned to a certain value, lets
-say 10, than all next occurences of \texttt{x} are replaced by 10 until a
-redefinition of x. Arithmetics in Assembly are always performed between two
-variables or a variable and a constant. If this is not the case the calculation
-is not possible. See \ref{opt} for an example. In other words until the current
-definition of \texttt{x} becomes dead. Therefore reaching definitions analysis
-is needed. Reaching definitions is a form of liveness analysis, we use the
-liveness analysis within a block and not between blocks.
-
-During the constant folding, so-called algebraic transformations are performed
-as well. Some expression can easily be replaced with more simple once if you
-look at what they are saying algebraically. An example is the statement
-$x = y + 0$, or in Assembly \texttt{addu \$1, \$2, 0}. This can easily be
-changed into $x = y$ or \texttt{move \$1, \$2}.
+calculated at compile time. If a register x is known to contain a constant
+value, all following uses of \texttt{x} can be replaced by that value until a
+redefinition of x.

-Another case is the multiplication with a power of two. This can be done way
-more efficiently by shifting left a number of times. An example:
-\texttt{mult \$regA, \$regB, 4    ->  sll  \$regA, \$regB, 2}. We perform this
-optimization for any multiplication with a power of two.
+Arithmetics in Assembly are always performed between two registers or a
+register and a constant. If the current value of all used registers is known,
+The expression can be executed at-compile-time and the instruction can be
+replaced by an immediate load of the result. See \ref{opt} for an example.

-There are a number of such cases, all of which are once again stated in
-appendix \ref{opt}.
+%In other words until the current definition of \texttt{x} becomes dead.
+%Therefore reaching definitions analysis is needed. Reaching definitions is a
+%form of liveness analysis, we use the liveness analysis within a block and not
+%between blocks.

-\subsubsection*{Copy propagation}
-
-Copy propagation `unpacks' a move instruction, by replacing its destination
-address with its source address in the code following the move instruction.
-
-This is not a direct optimization, but this does allow for a more effective
-dead code elimination.
-
-The code of the block is checked linearly. When a move operation is
-encountered, the source and destination address of this move are stored. When
-a normal operation with a source and a destination address are found, a number
-of checks are performed.
-
-The first check is whether the destination address is stored as a destination
-address of a move instruction. If so, this move instruction is no longer valid,
-so the optimizations can not be done. Otherwise, continue with the second
-check.
-
-In the second check, the source address is compared to the destination
-addresses of all still valid move operations. If these are the same, in the
-current operation the found source address is replaced with the source address
-of the move operation.
+During the constant folding, so-called algebraic transformations are performed
+as well. When calculations are performed using constants, some calculations can
+be replaced by a load- or move-instruction. An example is the statement
+$x = y + 0$, or in Assembly: \texttt{addu \$1, \$2, 0}. This can be replaced by
+$x = y$ or \texttt{move \$1, \$2}. A list of transformations that are performed
+can be found in appendix \ref{opt}.
+
+\subsubsection{Copy propagation}
+
+Copy propagation replaces usage of registers that have been assigned the value
+of another register earlier. In Assembly code, such an assignment is in the
+form of a \texttt{move} instruction.
+
+This is not a direct optimization, but is often does create dead code (the
+\texttt{move} statement) that can be eliminated.
+
+To perform copy propagation within the same basic block, the block is traversed
+until a \texttt{move x, y} instruction is encountered. For each of these ``copy
+statements'', the rest of the block is traversed while looking for usage of the
+\texttt{move}'s destination address \texttt{x}. These usages are replaced by
+usages of \texttt{y}, until either \texttt{x} or \texttt{y} is re-assigned.
+
+%Copy propagation `unpacks' a move instruction, by replacing its destination
+%address with its source address in the code following the move instruction.
+%
+%This is not a direct optimization, but this does allow for a more effective
+%dead code elimination.
+%
+%The code of the block is traversed linearly. If a move operation is
+%encountered, the source and destination address of this move are stored. If a
+%normal operation with a source and a destination address are found, a number of
+%checks are performed.
+%
+%The first check is whether the destination address is stored as a destination
+%address of a move instruction. If so, this move instruction is no longer valid,
+%so the optimizations can not be done. Otherwise, continue with the second
+%check.
+%
+%In the second check, the source address is compared to the destination
+%addresses of all still valid move operations. If these are the same, in the
+%current operation the found source address is replaced with the source address
+%of the move operation.

 An example would be the following:
 \begin{verbatim}
-move $regA, $regB           move $regA, $regB
-...                         ...
-Code not writing $regA, ->  ...
-$regB                       ...
-...                         ...
-addu $regC, $regA, ...      addu $regC, $regB, ...
+move $regA, $regB                    move $regA, $regB
+...                                  ...
+Code not writing $regA or $regB  ->  ...
+...                                  ...
+addu $regC, $regA, ...               addu $regC, $regB, ...
 \end{verbatim}
-This code shows that \texttt{\$regA} is replaced with \texttt{\$regB}. This
-way, the move instruction might have become useless, and it will then be
-removed by the dead code elimination.
-
-\subsection{Dead code elimination}
-
-The final optimization that is performed is dead code elimination. This means
-that when an instruction is executed, but the result is never used, that
-instruction can be removed.
-
-To be able to properly perform dead code elimination, we need to know whether a
-variable will be used, before it is overwritten again. If it does, we call the
-variable live, otherwise the variable is dead. The technique to find out if a
-variable is live is called liveness analysis. We implemented this for the
-entire code, by analysing each block, and using the variables that come in the
-block live as the variables that exit its predecessor live.
+\texttt{\$regA} is replaced with \texttt{\$regB}. Now, the move instruction
+might have become useless. If so, it will be removed by dead code elimination.
+
+To also replace usages in successors of the basic block, a Reaching Definitions
+analysis is used: If a \texttt{move}-statement is in the $REACH_{out}$ set of
+the block, it is used in one of the block's successors. To be able to replace a
+usage, the definition must me the only definition reaching the usage. To
+determine this, copy propagation defines a new dataflow problem that yields the
+$COPY_{in}$ and $COPY_{out}$ sets. the successor The definition is the only
+reaching definition if it is in the successor's $COPY_{in}$ set. If this is the
+case, the usage van be replaced by the destination address of the
+\texttt{move}-statement. \\
+Note: Though we implemented the algorithm as described above, we did not
+encounter any replacements between basic blocks while optimizing the provided
+benchmark scripts. This might mean that our implementation of the copy
+propagation dataflow problem is based on the lecture slides, which only briefly
+describe the algorithm.
+
+\subsubsection{Dead code elimination}
+
+The final optimization that is performed is dead code elimination. This removes
+statements of which the result is never used.
+
+To determine if a register is used from a certain point in the code, liveness
+analysis is used. A variable is ``live'' at a certain point in the code if it
+holds a value that may be needed in the future. Using the $LIVE_{out}$ set
+that is generated by the analysis, we can check if a register is dead after a
+certain point in a basic block. Each statement that assigns a register which
+is dead from that point on is removed.

 \section{Implementation}

-We decided to implement the optimization in Python. We chose this programming
+We decided to implement the optimizations in Python. We chose this programming
 language because Python is an easy language to manipulate strings, work
-object-oriented etc.
-It turns out that a Lex and Yacc are also available as a Python module,
-named PLY(Python Lex-Yacc). This allows us to use one language, Python, instead
-of two, i.e. C and Python. Also no debugging is needed in C, only in Python
-which makes our assignment more feasible.
-
-The program has three steps, parsing the Assembly code into a datastructure we
-can use, the so-called Intermediate Representation, performing optimizations on
-this IR and writing the IR back to Assembly.
+object-oriented etc..

-\subsection{Parsing}
+To implement the parser, we use a Python variant of Yacc and Lex named
+PLY(Python Lex-Yacc). By using this module instead of the regular C
+implementations of Yacc and Lex, we only use a single language in the entire
+project.

-The parsing is done with PLY, which allows us to perform Lex-Yacc tasks in
-Python by using a Lex-Yacc like syntax. This way there is no need to combine
-languages like we should do otherwise since Lex and Yacc are coupled with C.
+The program has three steps:
+\begin{enumerate}
+    \item Parsing the Assembly code to an Intermediate Representation (IR).
+    \item Performing optimizations on the IR.
+    \item Writing the IR back to Assembly code.
+\end{enumerate}

-The decision was made to not recognize exactly every possible instruction in
-the parser, but only if something is for example a command, a comment or a gcc
-directive. We then transform per line to an object called a Statement. A
-statement has a type, a name and optionally a list of arguments. These
-statements together form a statement list, which is placed in another object
-called a Block. In the beginning there is one block for the entire program, but
-after global optimizations this will be separated in several blocks that are
-the basic blocks.
+Our code is provided with this report, and is also available on GitHub: \\
+\url{https://github.com/taddeus/peephole}

-\subsection{Optimizations}
+\subsection{Structure}

-The optimizations are done in two different steps. First the global
-optimizations are performed, which are only the optimizations on branch-jump
-constructions. This is done repeatedly until there are no more changes.
+% TODO

-After all possible global optimizations are done, the program is separated into
-basic blocks. The algorithm to do this is described earlier, and means all
-jump and branch instructions are called leaders, as are their targets. A basic
-block then goes from leader to leader.
+\subsection{Parsing}

-After the division in basic blocks, optimizations are performed on each of
-these basic blocks. This is also done repeatedly, since some times several
-steps can be done to optimize something.
+The parser is implemented using PLY, which uses standard Lex-Yacc syntax in
+given function formats.
+
+The parser assumes that it is given valid Assembly code as input, so it does
+not validate whether, for example, command arguments are valid. This design
+decision was made because the optimizer uses the output of a compiler, which
+should produce valid Assembly code.
+
+The parser recognizes 4 types of ``statements'':
+\begin{itemize}
+    \item \textbf{comment} Line starting with a `\#'.
+    \item \textbf{directive} C-directive, used by the compiler. These are
+                             matched and treated in the same way as comments.
+    \item \textbf{command} Machine instruction, followed 0 to 3 arguments and
+                           optionally an inline comment.
+    \item \textbf{label} Line containing a \texttt{WORD} token, followed by a
+                         colon (`:').
+\end{itemize}
+
+Each statement is represented by a \texttt{Statement} object containing a type,
+a name, optionally a list of arguments and optionally a list of extra options
+(such as inline comments). The parsed list of statements forms a
+\texttt{Program} object, which is the return value of the parser.
+
+\subsection{Optimization loop}
+
+The optimizations are performed in a loop until no more changed are made. The
+optimization loop first performs global optimizations on the entire statement
+list of the program. Second, all dataflow analyses are performed (basic block
+creation, flow graph generation, liveness, reaching definitions, copy
+propagation). Finally, all basic block-level optimizations are executed. if
+either the global or one of the block optimizations yields a change in
+statements, another iteration is executed.

 \subsection{Writing}

-Once all the optimizations have been done, the IR needs to be rewritten into
-Assembly code. After this step the xgcc crosscompiler can make binary code from
-the generated Assembly code.
+Once all the optimizations have been done, the IR needs to be rewritten to
+Assembly code. After this step, the xgcc cross compiler can make binary code
+from the generated Assembly code.

 The writer expects a list of statements, so first the blocks have to be
 concatenated again into a list. After this is done, the list is passed on to
-the writer, which writes the instructions back to Assembly and saves the file
-so we can let xgcc compile it. The original statements can also written to a
-file, so differences in tabs, spaces and newlines do not show up when checking
-the differences between the optimized and non-optimized files.
+the writer, which writes the instructions back to Assembly and saves the file.
+We believe that the writer code is self-explanatory, so we will not discuss it
+in detail here.
+
+The writer has a slightly different output format than the xgcc compiler in
+some cases. Therefore, the main execution file has an option to also write the
+original statement list back to a files way, differences in tabs, spaces and
+newlines do not show up when checking the differences between optimized and
+non-optimized files.

 \subsection{Execution}

-To execute the optimizer, the following command can be given:\\
-\texttt{./main.py <original file> <optimized file> <rewritten original file>}\\
+To execute the optimizer, the following command can be given: \\
+\texttt{./main.py <original file> <optimized file> <rewritten original file>} \\
 There is also a script available that runs the optimizer and automatically
 starts the program \emph{meld}. In meld it is easy to visually compare the
-original file and the optimized file. The command to execute this script is:\\
-\texttt{./run <benchmark name (e.g. whet)>}\\
+original file and the optimized file. The command to execute this script is: \\
+\texttt{./run <benchmark name (e.g. whet)>}

 \section{Testing}

@@ -277,7 +334,7 @@ mistake in the program, not knowing where this bug is. Naturally, this means
 debugging is a lot easier.

 The unit tests can be run by executing \texttt{make test} in the root folder of
-the project. This does require the \texttt{textrunner} module.
+the project. This does require the \texttt{testrunner} module of Python.

 Also available is a coverage report. This report shows how much of the code has
 been unit tested. To make this report, the command \texttt{make coverage} can
@@ -297,15 +354,29 @@ somewhere in the code.
 The following results have been obtained:\\
 \begin{tabular}{|c|c|c|c|c|c|}
 \hline
-Benchmark & Original     & Optimized    & Original & Optimized & Performance \\
-        & Instructions & instructions   & cycles   & cycles    &  boost(cycles)\\
+Benchmark & Original     & Removed      & Original & Optimized & Performance \\
+          & Instructions & instructions & cycles   & cycles    & boost(cycles) \\
+\hline
+pi        &           94 &            2 &          &           &             \% \\
+acron     &          361 &           24 &          &           &             \% \\
+dhrystone &          752 &           52 &          &           &             \% \\
+whet      &          935 &           37 &          &           &             \% \\
+slalom    &         4177 &          227 &          &           &             \% \\
+clinpack  &         3523 &              &          &           &              \% \\
+\hline
+\end{tabular}
+
+\begin{tabular}{|c|c|c|c|c|c|}
+\hline
+Benchmark & Original     & Removed      & Original & Optimized & Performance \\
+          & Instructions & instructions & cycles   & cycles    & boost(cycles)\\
 \hline
 pi        &           94 &      2       &    1714468   &   1714362      &   0.006182676 \%       \\
 acron     &          361 &      19      &    4435687   &   4372825      &   1.417187462 \%       \\
-dhrystone &          752 &      36      &    2887710   &   2742720      &   5.020933542 \%       \\  
+dhrystone &          752 &      36      &    2887710   &   2742720      &   5.020933542 \%       \\
 whet      &          935 &      23      &    2864526   &   2840042      &   0.854731289 \%       \\
 slalom    &         4177 &      107     &    2879140   &   2876105      &   0.143480345 \%       \\
-clinpack  &         3523 &      49      &    1543746   &   1528406      &   1.353201887  \%       \\ 
+clinpack  &         3523 &      49      &    1543746   &   1528406      &   1.353201887  \%       \\
 \hline
 \end{tabular}

@@ -363,15 +434,14 @@ Code not writing $regB  ->  ...
 ...                         ...
 addu $regC, $regB, 4        move $regC, $regD

-
 # Constant folding
-li $regA, constA                ""       
-sw $regA, 16($fp)               ""
-li $regA, constB        ->      ""
-sw $regA, 20($fp)               ""	
-lw $regA, 16($fp)               "" 
-lw $regB, 20($fp)               ""
-addu $regA, $regA, $regA        $li regA, (constA + constB) at compile time
+li $2, 2                    $2 = 2
+sw $2, 16($fp)              16($fp) = 2
+li $2, 3                    $2 = 3
+sw $2, 20($fp)          ->  20($fp) = 3
+lw $2, 16($fp)              $2 = 16($fp) = 2
+lw $3, 20($fp)              $3 = 20($fp) = 3
+addu $2, $2, $3             change to "li $2, 0x00000005"

 # Copy propagation
 move $regA, $regB           move $regA, $regB