| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520 |
- \documentclass[10pt,a4paper]{article}
- \usepackage[latin1]{inputenc}
- \usepackage{amsmath,amsfonts,amssymb,booktabs,graphicx,listings,subfigure}
- \usepackage{float,hyperref}
- % Paragraph indentation
- \setlength{\parindent}{0pt}
- \setlength{\parskip}{1ex plus 0.5ex minus 0.2ex}
- \title{Peephole Optimizer}
- \author{Jayke Meijer (6049885), Richard Torenvliet (6138861), Tadde\"us Kroes
- (6054129)}
- \begin{document}
- \maketitle
- \tableofcontents
- \pagebreak
- \section{Introduction}
- The goal of the assignment is to implement the peephole optimization stage of
- xgcc cross compiler. This requires a MIPS Assembly parser to parse the output
- of the compiler. Also, an assembly writer is needed to write the optimized
- statements back to valid Assembly code for the assembler.
- The assignment provides a number of benchmarks written in C. The objective is
- to obtain a high speedup in number of cycles for these benchmarks.
- \section{Types of optimizations}
- There are two general types of optimizations on the assembly code: global
- optimizations and optimizations on so-called basic blocks. These optimizations
- will be discussed individually below.
- \subsection{Global optimizations}
- We only perform one global optimization, which is optimizing branch-jump
- statements. The unoptimized Assembly code contains sequences of statements with
- the following structure:
- \begin{verbatim}
- beq ...,$Lx
- j $Ly
- $Lx: ...
- \end{verbatim}
- %This is inefficient, since there is a branch to a label that follows this code.
- In this code, it is more efficient to replace the branch statement with a
- \texttt{bne} (the opposite case) to the label used in the jump statement. This
- way, the jump statement can be eliminated since the label directly follows it.
- The same can be done for the opposite case, where a \texttt{bne} is changed
- into a \texttt{beq}.
- Since this optimization is done between two series of codes with jumps and
- labels, we can not perform this code during the basic block optimizations.
- \subsection{Basic Block Optimizations}
- Optimizations on basic blocks are a more extended part of the optimizer.
- First of all, what is a basic block? A basic block is a sequence of statements
- guaranteed to be executed in that order, and that order alone. This is the case
- for a piece of code not containing any branches or jumps (except for the last
- statement).
- To divide the code into basic blocks, the ``leaders'' have to be found. A
- leading statement is a leader if it is either a jump or branch statement, or
- the target of such a statement. Each leader is the start of a new basic block.
- There are five types of optimizations performed on basic blocks in our
- implementation. Each is described individually below.
- \subsubsection{Standard peephole optimizations}
- These are optimizations that look for a certain statement or pattern of
- statements, and optimize these. For example,
- \begin{verbatim}
- mov $regA,$regB
- instr $regA, $regA,...
- \end{verbatim}
- can be optimized to:
- \begin{verbatim}
- instr $regA, $regB,...
- \end{verbatim}
- \texttt{\$regA} should contain the same value as \texttt{\$regB} after the move
- statement, so \texttt{\$regB} can be used by \texttt{instr}. Since
- \texttt{instr} overwrites \texttt{\$regA}, the move statement has not further
- effect after \texttt{instr} and can be removed.
- There are a few more of these cases, which are described on the practicum page
- \footnote{\url{http://staff.science.uva.nl/~andy/compiler/prac.html}} and in
- Appendix \ref{opt}.
- \subsubsection{Common subexpression elimination}
- A more advanced optimization is common subexpression elimination. This means
- that expensive operations like multiplications or additions are performed only
- once and the result is then `copied' into registers where needed.
- \begin{verbatim}
- addu $2,$4,$3 addu = $8, $4, $3 # $8 is free
- ... mov = $2, $8
- ... -> ...
- ... ...
- addu $5,$4,$3 mov = $4, $8
- \end{verbatim}
- A standard method for doing this is usage of a DAG or Directed Acyclic Graph.
- However, this requires either the code to be in Static single
- assignment
- form\footnote{\url{http://en.wikipedia.org/wiki/Static\_single\_assignment\_form}},
- or an advanced liveness check. Our implementation contains a (partially tested)
- implementation of DAG creation, but this is not used in the final
- implementation. However, our implementation does contain a simplified version
- of common subexpression elimination:
- The statement list of a block is traversed in reversed order, looking for
- instructions that are eligible for CSE (\texttt{addu}, for example). If such an
- instruction is found, it is marked and the rest of the statement list is
- traversed while marking all statements that are equal to the found instruction.
- If a statement assigns a register that is uses by the instruction, traversal
- stops.
- If more than one instruction have been marked, a new instruction is inserted
- above the first occurrence (the last occurrence in reversed order). This
- instruction performs the calculation and saves it in a free temporary register.
- Then, each occurrence is replaced by a \texttt{move} of the free register to
- its original destination register.
- This method is obviously less efficient method then the DAG. However, since
- the basic blocks are generally not very large and the execution time of the
- optimizer is not a primary concern, this is not a large problem.
- \subsubsection{Constant folding}
- Constant folding is an optimization where the outcome of arithmetics are
- calculated at compile time. If a register x is known to contain a constant
- value, all following uses of \texttt{x} can be replaced by that value until a
- redefinition of x.
- Arithmetics in Assembly are always performed between two registers or a
- register and a constant. If the current value of all used registers is known,
- The expression can be executed at-compile-time and the instruction can be
- replaced by an immediate load of the result. See \ref{opt} for an example.
- %In other words until the current definition of \texttt{x} becomes dead.
- %Therefore reaching definitions analysis is needed. Reaching definitions is a
- %form of liveness analysis, we use the liveness analysis within a block and not
- %between blocks.
- During the constant folding, so-called algebraic transformations are performed
- as well. When calculations are performed using constants, some calculations can
- be replaced by a load- or move-instruction. An example is the statement
- $x = y + 0$, or in Assembly: \texttt{addu \$1, \$2, 0}. This can be replaced by
- $x = y$ or \texttt{move \$1, \$2}. A list of transformations that are performed
- can be found in appendix \ref{opt}.
- \subsubsection{Copy propagation}
- Copy propagation replaces usage of registers that have been assigned the value
- of another register earlier. In Assembly code, such an assignment is in the
- form of a \texttt{move} instruction.
- This is not a direct optimization, but is often does create dead code (the
- \texttt{move} statement) that can be eliminated.
- To perform copy propagation within the same basic block, the block is traversed
- until a \texttt{move x, y} instruction is encountered. For each of these ``copy
- statements'', the rest of the block is traversed while looking for usage of the
- \texttt{move}'s destination address \texttt{x}. These usages are replaced by
- usages of \texttt{y}, until either \texttt{x} or \texttt{y} is re-assigned.
- %Copy propagation `unpacks' a move instruction, by replacing its destination
- %address with its source address in the code following the move instruction.
- %
- %This is not a direct optimization, but this does allow for a more effective
- %dead code elimination.
- %
- %The code of the block is traversed linearly. If a move operation is
- %encountered, the source and destination address of this move are stored. If a
- %normal operation with a source and a destination address are found, a number of
- %checks are performed.
- %
- %The first check is whether the destination address is stored as a destination
- %address of a move instruction. If so, this move instruction is no longer valid,
- %so the optimizations can not be done. Otherwise, continue with the second
- %check.
- %
- %In the second check, the source address is compared to the destination
- %addresses of all still valid move operations. If these are the same, in the
- %current operation the found source address is replaced with the source address
- %of the move operation.
- An example would be the following:
- \begin{verbatim}
- move $regA, $regB move $regA, $regB
- ... ...
- Code not writing $regA or $regB -> ...
- ... ...
- addu $regC, $regA, ... addu $regC, $regB, ...
- \end{verbatim}
- \texttt{\$regA} is replaced with \texttt{\$regB}. Now, the move instruction
- might have become useless. If so, it will be removed by dead code elimination.
- To also replace usages in successors of the basic block, a Reaching Definitions
- analysis is used: If a \texttt{move}-statement is in the $REACH_{out}$ set of
- the block, it is used in one of the block's successors. To be able to replace a
- usage, the definition must me the only definition reaching the usage. To
- determine this, copy propagation defines a new dataflow problem that yields the
- $COPY_{in}$ and $COPY_{out}$ sets. the successor The definition is the only
- reaching definition if it is in the successor's $COPY_{in}$ set. If this is the
- case, the usage van be replaced by the destination address of the
- \texttt{move}-statement. \\
- Note: Though we implemented the algorithm as described above, we did not
- encounter any replacements between basic blocks while optimizing the provided
- benchmark scripts. This might mean that our implementation of the copy
- propagation dataflow problem is based on the lecture slides, which only briefly
- describe the algorithm.
- \subsubsection{Dead code elimination}
- The final optimization that is performed is dead code elimination. This removes
- statements of which the result is never used.
- To determine if a register is used from a certain point in the code, liveness
- analysis is used. A variable is ``live'' at a certain point in the code if it
- holds a value that may be needed in the future. Using the $LIVE_{out}$ set
- that is generated by the analysis, we can check if a register is dead after a
- certain point in a basic block. Each statement that assigns a register which
- is dead from that point on is removed.
- \section{Implementation}
- We decided to implement the optimizations in Python. We chose this programming
- language because Python is an easy language to manipulate strings, work
- object-oriented etc..
- To implement the parser, we use a Python variant of Yacc and Lex named
- PLY(Python Lex-Yacc). By using this module instead of the regular C
- implementations of Yacc and Lex, we only use a single language in the entire
- project.
- The program has three steps:
- \begin{enumerate}
- \item Parsing the Assembly code to an Intermediate Representation (IR).
- \item Performing optimizations on the IR.
- \item Writing the IR back to Assembly code.
- \end{enumerate}
- Our code is provided with this report, and is also available on GitHub: \\
- \url{https://github.com/taddeus/peephole}
- \subsection{Structure}
- The folder structure of out implementation is as follows:
- \begin{itemize}
- \item \texttt{.}
- \begin{itemize}
- \item \texttt{benchmarks/} \\
- All benchmark files and their assembly versions (both unoptimized and
- optimized).
- \item \texttt{src/}
- \begin{itemize}
- \item \texttt{copy\_propagation.py} \\
- \texttt{liveness.py} \\
- \texttt{reaching\_definitions.py} \\
- Implementation of specific dataflow problems.
- \item \texttt{dag.py} \\
- DAG creation (unused).
- \item \texttt{dataflow.py} \\
- Common dataflow functions such as basic block-division and flow
- graph generation. Contains definition of \texttt{BasicBlock} class.
- \item \texttt{dominator.py} \\
- Dominator tree creation (unused, untested).
- \item \texttt{optimize\_advanced.py} \\
- Advanced optimizations on basic block-level: Common subexpression
- elimination, constant folding, copy propagation and dead code
- elimination.
- \item \texttt{optimize\_redundancies.py} \\
- Straight-forward replacement functions for statement sequences on
- global- and basic block-level.
- \item \texttt{parser.py} \\
- Assembly parser.
- \item \texttt{program.py} \\
- Definition of \texttt{Program} class.
- \item \texttt{statement.py} \\
- Definition of \texttt{Statement} and \texttt{Block} classes.
- \item \texttt{writer.py} \\
- Assembly writer.
- \end{itemize}
- \item \texttt{tests/} \\
- Unit tests.
- \item \texttt{main.py} \\
- Runs the optimizer.
- \item \texttt{run} \\
- Runs the optimizer and compares original statement list to the
- optimized one.
- \end{itemize}
- \end{itemize}
- \subsection{Parsing}
- The parser is implemented using PLY, which uses standard Lex-Yacc syntax in
- given function formats.
- The parser assumes that it is given valid Assembly code as input, so it does
- not validate whether, for example, command arguments are valid. This design
- decision was made because the optimizer uses the output of a compiler, which
- should produce valid Assembly code.
- The parser recognizes 4 types of ``statements'':
- \begin{itemize}
- \item \textbf{comment} Line starting with a `\#'.
- \item \textbf{directive} C-directive, used by the compiler. These are
- matched and treated in the same way as comments.
- \item \textbf{command} Machine instruction, followed 0 to 3 arguments and
- optionally an inline comment.
- \item \textbf{label} Line containing a \texttt{WORD} token, followed by a
- colon (`:').
- \end{itemize}
- Each statement is represented by a \texttt{Statement} object containing a type,
- a name, optionally a list of arguments and optionally a list of extra options
- (such as inline comments). The parsed list of statements forms a
- \texttt{Program} object, which is the return value of the parser.
- \subsection{Optimization loop}
- The optimizations are performed in a loop until no more changed are made. The
- optimization loop first performs global optimizations on the entire statement
- list of the program. Second, all dataflow analyses are performed (basic block
- creation, flow graph generation, liveness, reaching definitions, copy
- propagation). Finally, all basic block-level optimizations are executed. if
- either the global or one of the block optimizations yields a change in
- statements, another iteration is executed.
- \subsection{Writing}
- Once all the optimizations have been done, the IR needs to be rewritten to
- Assembly code. After this step, the xgcc cross compiler can make binary code
- from the generated Assembly code.
- The writer expects a list of statements, so first the blocks have to be
- concatenated again into a list. After this is done, the list is passed on to
- the writer, which writes the instructions back to Assembly and saves the file.
- We believe that the writer code is self-explanatory, so we will not discuss it
- in detail here.
- The writer has a slightly different output format than the xgcc compiler in
- some cases. Therefore, the main execution file has an option to also write the
- original statement list back to a file, differences in tabs, spaces and
- newlines do not show up when checking the differences between optimized and
- non-optimized files.
- \subsection{Execution}
- The main execution file (\emph{main.py}) is located in the root directory. It
- takes the file to be optimized as last argument, optionally preceded by one or
- more of the following options:
- \begin{itemize}
- \item \texttt{-o OUT\_FILE}
- Location to save the optimized statement list.
- \item \texttt{-i SOURCE\_OUT\_FILE}
- Location to save the original statement list.
- \item \texttt{-v VERBOSE\_LEVEL}
- The optimizer's verbose level (default 1). \\
- Possible verbose levels are: \\
- 0: No command-line output, no comments in output file. \\
- 1: Results in command-line output, inline optimization comments in
- output file. \\
- 2: Results and debug statements in command-line output, inline
- optimization comments in output file. \\
- \end{itemize}
- The root directory also contains an executable script: \texttt{run}. This
- script calls \texttt{main.py} and automatically starts the program
- \texttt{meld}. In \texttt{meld} it is easy to visually compare the original
- file and the optimized file. The usage of this script is as follows: \\
- \texttt{./run <benchmark name (e.g. whet)> [ <verbose level> ]}
- \section{Testing}
- Of course, it has to be guaranteed that the optimized code still functions
- exactly the same as the none-optimized code. To do this, testing is an
- important part of out program. We have two stages of testing. The first stage
- is unit testing. The second stage is to test whether the compiled code has
- exactly the same output.
- \subsection{Unit testing}
- For almost every piece of important code, unit tests are available. Unit tests
- give the possibility to check whether each small part of the program, for
- instance each small function, is performing as expected. This way bugs are
- found early and very exactly. Otherwise, one would only see that there is a
- mistake in the program, not knowing where this bug is. Naturally, this means
- debugging is a lot easier.
- The unit tests can be run by executing \texttt{make test} in the root folder of
- the project. This does require the \texttt{testrunner} module of Python.
- Also available is a coverage report. This report shows how much of the code has
- been unit tested. To make this report, the command \texttt{make coverage} can
- be run in the root folder. The report is than added as a folder \emph{coverage}
- in which a \emph{index.html} can be used to see the entire report.
- \subsection{Output comparison}
- In order to check whether the optimization does not change the functionality of
- the program, the output of the provided benchmark programs was compared to
- their output after optimization. If any of these outputs was not equal to the
- original output, either our optimizations are to aggressive, or there is a bug
- somewhere in the code.
- \section{Results}
- We have executed the optimizer on each of the benchmark files and compared the
- number of execution cycles of the optimized versions vs. the original versions.
- The results are displayed in the following table: \\
- \\
- \begin{tabular}{|c|c|c|c|c|c|}
- \hline
- Benchmark & Original & Removed & Original & Optimized & Performance \\
- & Instructions & instructions & cycles & cycles & boost(cycles) \\
- \hline
- pi & 94 & 2 & 1714468 & 1714260 & 0.012\% \\
- acron & 361 & 24 & 4435687 & 4314540 & 2.73\% \\
- dhrystone & 752 & 49 & 2887710 & 2852653 & 1.21\% \\
- whet & 935 & 33 & 2864526 & 2828487 & 1.26\% \\
- slalom & 4177 & 226 & 2879140 & 2870375 & 0.304\% \\
- clinpack & 3523 & 231 & 1543746 & 1457479 & 5.59\% \\
- \hline
- \end{tabular}
- \pagebreak
- \appendix
- \section{List of all optimizations}
- \label{opt}
- \textbf{Global optimizations}
- \begin{verbatim}
- beq ...,$Lx bne ...,$Ly
- j $Ly -> $Lx: ...
- $Lx: ...
- bne ...,$Lx beq ...,$Ly
- j $Ly -> $Lx: ...
- $Lx: ...
- j $Lx -> $Lx: ...
- $Lx: ...
- \end{verbatim}
- \textbf{Standard basic block optimizations}
- \begin{verbatim}
- mov $regA,$regA -> --- // remove it
- mov $regA,$regB -> instr $regA, $regB,...
- instr $regA, $regA,...
- instr $regA,... instr $4,...
- mov [$4-$7], $regA -> jal XXX
- jal XXX
- sw $regA,XXX -> sw $regA, XXX
- ld $regA,XXX
- shift $regA,$regA,0 -> --- // remove it
- add $regA,$regA,X -> lw ...,X($regA)
- lw ...,0($regA)
- \end{verbatim}
- \textbf{Advanced basic block optimizations}
- \begin{verbatim}
- # Common subexpression elimination
- addu $regA, $regB, 4 addu $regD, $regB, 4
- ... move $regA, $regD
- Code not writing $regB -> ...
- ... ...
- addu $regC, $regB, 4 move $regC, $regD
- # Constant folding
- li $2, 2 $2 = 2
- sw $2, 16($fp) 16($fp) = 2
- li $2, 3 $2 = 3
- sw $2, 20($fp) -> 20($fp) = 3
- lw $2, 16($fp) $2 = 16($fp) = 2
- lw $3, 20($fp) $3 = 20($fp) = 3
- addu $2, $2, $3 change to "li $2, 0x00000005"
- # Copy propagation
- move $regA, $regB move $regA, $regB
- ... ...
- Code not writing $regA, -> ...
- $regB ...
- ... ...
- addu $regC, $regA, ... addu $regC, $regB, ...
- # Algebraic transformations
- addu $regA, $regB, 0 -> move $regA, $regB
- subu $regA, $regB, 0 -> move $regA, $regB
- mult $regA, $regB, 1 -> move $regA, $regB
- mult $regA, $regB, 0 -> li $regA, 0
- mult $regA, $regB, 2 -> sll $regA, $regB, 1
- \end{verbatim}
- \end{document}
|