\documentclass[10pt,a4paper]{article}

\usepackage[latin1]{inputenc}
\usepackage{amsmath,amsfonts,amssymb,booktabs,graphicx,listings,subfigure}
\usepackage{float,hyperref}

\title{Peephole Optimizer}
\author{Jayke Meijer (6049885), Richard Torenvliet (6138861), Tadde\"us Kroes
(6054129)}

\begin{document}

\maketitle
\tableofcontents
\pagebreak

\section{Introduction}

The goal of this assignment is to implement the optimization stage of a
compiler. To reach this goal, both a parser and an optimizer have to be
implemented.

The input of our optimizer is the output of the xgcc cross compiler for a C
program, which is unoptimized Assembly code. The assignment includes a number
of C programs. An important part of the assignment is parsing this Assembly
code, which is done with Lex and Yacc. The lexer splits the input into tokens
by matching it against the regular expressions provided in the lexer
specification. After the lexer, the parser generated by Yacc takes over: it
matches sequences of tokens against grammar rules and executes the action
attached to each rule.

\section{Design}

There are two general types of optimizations of the Assembly code: global
optimizations and optimizations on a so-called basic block. These optimizations
are discussed separately.

\subsection{Global optimizations}

We perform only one global optimization, which is optimizing branch-jump
statements. The unoptimized Assembly code contains sequences of the following
structure:

\begin{verbatim}
beq ...,$Lx
j $Ly
$Lx: ...
\end{verbatim}

This is inefficient, since the branch only serves to skip the jump: its target
label directly follows the jump. It is more efficient to replace the branch
statement with a \texttt{bne} (the opposite case) to the label used in the jump
statement. The jump statement can then be eliminated, since execution falls
through to the label \texttt{\$Lx} anyway. The same can of course be done for
the opposite case, where a \texttt{bne} is changed into a \texttt{beq}.

Since this optimization works on a sequence of code that contains jumps and
labels, and therefore crosses basic block boundaries, we cannot perform it
during the basic block optimizations.

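The rewrite described above can be sketched in Python as follows. This is a
minimal illustration only: the tuple format \texttt{(kind, name, args)} and
the function name \texttt{optimize\_branch\_jump} are assumptions made for this
example, and branch delay slots are ignored.

```python
# Hypothetical statement format: (kind, name, args), e.g.
# ("command", "beq", ["$2", "$3", "$L1"]) or ("label", "$L1", []).
INVERSE = {"beq": "bne", "bne": "beq"}


def optimize_branch_jump(statements):
    """Rewrite 'beq ...,$Lx / j $Ly / $Lx:' into 'bne ...,$Ly / $Lx:'."""
    result = []
    i = 0
    while i < len(statements):
        window = statements[i:i + 3]
        if len(window) == 3:
            branch, jump, label = window
            if (branch[0] == "command" and branch[1] in INVERSE
                    and jump[0] == "command" and jump[1] == "j"
                    and label[0] == "label"
                    and branch[2][-1] == label[1]):
                # Invert the branch and let it target the jump's label.
                args = branch[2][:-1] + [jump[2][0]]
                result.append(("command", INVERSE[branch[1]], args))
                i += 2  # drop the jump; the label is copied in the next step
                continue
        result.append(statements[i])
        i += 1
    return result
```

The label itself is kept, because other jumps may still target it.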
\subsection{Basic Block Optimizations}

Optimizations on basic blocks are a more important part of the optimizer.

First, what is a basic block? A basic block is a sequence of statements
guaranteed to be executed in that order, and that order alone. This is the case
for a piece of code that contains no branches or jumps, except possibly as its
last statement, and that is never jumped into halfway.

To divide the code into basic blocks, we need to define the leaders: the
statements at which a basic block starts. We call a statement a leader if it is
the first statement of the program, the target of a jump/branch statement, or
the statement directly following a jump/branch statement. A basic block then
runs from one leader up to, but not including, the next leader.

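Splitting a statement list into basic blocks can be sketched as below, using
the textbook notion of leaders (first statement, jump targets, and statements
that follow a jump). The tuple format \texttt{(kind, name, args)}, the small
\texttt{JUMPS} set and both function names are assumptions made for this
example.

```python
# Hypothetical statement format: (kind, name, args).
JUMPS = {"j", "beq", "bne"}  # assumed subset of jump/branch instructions


def find_leaders(statements):
    """Return the sorted indices at which a basic block starts."""
    leaders = {0} if statements else set()
    targets = {args[-1] for kind, name, args in statements
               if kind == "command" and name in JUMPS}
    for i, (kind, name, args) in enumerate(statements):
        if kind == "label" and name in targets:
            leaders.add(i)  # target of a jump or branch
        if kind == "command" and name in JUMPS and i + 1 < len(statements):
            leaders.add(i + 1)  # statement following a jump or branch
    return sorted(leaders)


def split_blocks(statements):
    """Cut the statement list at every leader."""
    bounds = find_leaders(statements) + [len(statements)]
    return [statements[a:b] for a, b in zip(bounds, bounds[1:])]
```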
There are quite a few optimizations we perform on these basic blocks, so we
describe the types of optimizations here instead of each individual
optimization.

\subsubsection*{Standard peephole optimizations}

These are optimizations that simply look for a certain statement or pattern of
statements, and optimize these. For example,

\begin{verbatim}
mov $regA,$regB
instr $regA, $regA,...
\end{verbatim}

can be optimized into

\begin{verbatim}
instr $regA, $regB,...
\end{verbatim}

since the register \texttt{\$regA} gets overwritten by the second instruction
anyway, and the instruction can just as well read \texttt{\$regB} instead of
\texttt{\$regA}. There are a few more of these cases, which are the same as
those described on the practicum page\footnote{\url{http://staff.science.uva.nl/~andy/compiler/prac.html}}
and in Appendix \ref{opt}.

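The pattern above could be matched with a small sliding window over the block,
as in the following sketch. The \texttt{(name, args)} pair format and the
function name \texttt{fold\_move\_into\_next} are assumptions made for this
example; the sketch only fires when the second instruction overwrites the
destination of the move.

```python
# Hypothetical statement format: (name, args) with the destination first.
def fold_move_into_next(block):
    """Merge 'mov $a,$b / instr $a,$a,...' into 'instr $a,$b,...'."""
    result = []
    i = 0
    while i < len(block):
        name, args = block[i]
        if name in ("mov", "move") and i + 1 < len(block):
            dest, src = args
            next_name, next_args = block[i + 1]
            if (len(next_args) >= 2 and next_args[0] == dest
                    and next_args[1] == dest
                    # Be conservative with operands such as 0($a).
                    and all(a == dest or dest not in a
                            for a in next_args[1:])):
                sources = [src if a == dest else a for a in next_args[1:]]
                result.append((next_name, [dest] + sources))
                i += 2  # both statements are replaced by the merged one
                continue
        result.append(block[i])
        i += 1
    return result
```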
\subsubsection*{Common subexpression elimination}

A more advanced optimization is common subexpression elimination (CSE). This
means that expensive operations such as a multiplication or addition are
performed only once, and the result is then `copied' into the variables where
it is needed.

\begin{verbatim}
addu $2,$4,$3              addu $t1, $4, $3
...                        mov $2, $t1
...                   ->   ...
...                        ...
addu $5,$4,$3              mov $5, $t1
\end{verbatim}

A standard method for doing this is the creation of a DAG, or Directed Acyclic
Graph. However, this requires a fairly advanced implementation. Our
implementation is slightly less fancy, but easier to implement.

We search from the end of the block upwards for instructions that are eligible
for CSE. If we find one, we check further up in the code for the same
instruction, and add each occurrence to a temporary storage list. This
continues until the beginning of the block is reached or until one of the
arguments of the expression is assigned.

We then insert the instruction above its first occurrence, writing the result
to a new variable. All occurrences of the expression can then be replaced by a
move from the new variable into the original destination variable of the
instruction.

This is a less efficient method than the DAG, but because the basic blocks are
in general not very large and the execution time of the optimizer is not a
primary concern, this is not a big problem.

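A rough sketch of this backward search is given below. It is not meant as the
definitive implementation: the \texttt{(name, args)} pair format, the
\texttt{CSE\_OPS} set, the function name \texttt{cse\_once} and the temporary
register \texttt{\$t1} are assumptions for this example, and the first
argument of every statement is conservatively treated as its destination.

```python
# Hypothetical statement format: (name, args) with the destination first.
CSE_OPS = {"addu", "subu", "mult"}  # assumed set of eligible operations


def cse_once(block, temp="$t1"):
    """Eliminate one common subexpression in place; return True if changed.

    `temp` is assumed to be a register that is unused in the block.
    """
    for last in range(len(block) - 1, -1, -1):
        name, args = block[last]
        if name not in CSE_OPS:
            continue
        operands = args[1:]
        matches = [last]
        for i in range(last - 1, -1, -1):
            other_name, other_args = block[i]
            if other_args and other_args[0] in operands:
                break  # an operand is assigned here: stop searching
            if other_name == name and other_args[1:] == operands:
                matches.append(i)
        if len(matches) > 1:
            first = matches[-1]  # the topmost occurrence
            for i in matches:
                block[i] = ("move", [block[i][1][0], temp])
            block.insert(first, (name, [temp] + operands))
            return True
    return False
```

Calling \texttt{cse\_once} repeatedly, with a fresh temporary each time, until
it returns \texttt{False} would handle several expressions in one block.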
\subsubsection*{Fold constants}

Constant folding is an optimization in which the outcome of arithmetic is
calculated at compile time. If a variable \texttt{x} is assigned a constant
value, let's say 10, then all subsequent occurrences of \texttt{x} are replaced
by 10 until \texttt{x} is redefined, in other words until the current
definition of \texttt{x} becomes dead. Arithmetic in Assembly is always
performed between two variables or between a variable and a constant. If the
values of the operands are not all known at compile time, the calculation
cannot be folded. See Appendix \ref{opt} for an example.

To determine how far a definition reaches, a reaching definitions analysis is
needed. Reaching definitions analysis is a data-flow analysis closely related
to liveness analysis; we apply it within a block only, not between blocks.

During the constant folding, so-called algebraic transformations are performed
as well. Some expressions can easily be replaced with simpler ones if you look
at what they are saying algebraically. An example is the statement
$x = y + 0$, or in Assembly \texttt{addu \$1, \$2, 0}. This can easily be
changed into $x = y$ or \texttt{move \$1, \$2}.

Another case is multiplication by a power of two. This can be done much more
efficiently by shifting left. An example:
\texttt{mult \$regA, \$regB, 4 -> sll \$regA, \$regB, 2}. We perform this
optimization for any multiplication by a power of two.

There are a number of such cases, all of which are once again listed in
Appendix \ref{opt}.

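The algebraic transformations can be expressed as a function that maps one
statement to a simpler one, as in the sketch below. The \texttt{(name, args)}
pair format and the function name \texttt{simplify} are assumptions for this
example; the three-operand \texttt{mult} follows the notation used in this
report.

```python
# Hypothetical statement format: (name, args) with the destination first.
def simplify(statement):
    """Return an algebraically simpler statement, or the original one."""
    name, args = statement
    if len(args) != 3:
        return statement
    dest, src, last = args
    try:
        value = int(last)
    except ValueError:
        return statement  # last operand is a register, not a constant
    if name in ("addu", "subu") and value == 0:
        return ("move", [dest, src])  # x = y + 0  ->  x = y
    if name == "mult":
        if value == 0:
            return ("li", [dest, "0"])
        if value == 1:
            return ("move", [dest, src])
        if value > 1 and (value & (value - 1)) == 0:
            # Power of two: shift left by log2(value).
            return ("sll", [dest, src, str(value.bit_length() - 1)])
    return statement
```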
\subsubsection*{Copy propagation}

Copy propagation `unpacks' a move instruction, by replacing its destination
address with its source address in the code following the move instruction.

This is not a direct optimization, but it does allow for more effective dead
code elimination.

The code of the block is scanned linearly. When a move operation is
encountered, the source and destination address of this move are stored. When
a normal operation with a source and a destination address is found, a number
of checks are performed.

The first check is whether the destination address is stored as a destination
address of a move instruction. If so, this move instruction is no longer valid,
so the optimization cannot be applied. Otherwise, we continue with the second
check.

In the second check, the source address is compared to the destination
addresses of all still valid move operations. If these are the same, the source
address in the current operation is replaced with the source address of the
move operation.

An example would be the following:

\begin{verbatim}
move $regA, $regB                    move $regA, $regB
...                                  ...
Code not writing $regA,        ->    ...
$regB                                ...
...                                  ...
addu $regC, $regA, ...               addu $regC, $regB, ...
\end{verbatim}

This code shows that \texttt{\$regA} is replaced with \texttt{\$regB}. This
way, the move instruction might have become useless, in which case it is
removed by the dead code elimination.

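A linear scan of this kind might look as follows. This sketch deviates
slightly from the two checks above: it first substitutes the sources of a
statement and only then invalidates the moves whose source or destination is
overwritten, which matches the `code not writing \texttt{\$regA},
\texttt{\$regB}' condition of the example. The \texttt{(name, args)} pair
format and the function name \texttt{propagate\_copies} are assumptions.

```python
# Hypothetical statement format: (name, args) with the destination first.
def propagate_copies(block):
    """Replace reads of a move's destination by the move's source."""
    copies = {}  # destination -> source of every still valid move
    result = []
    for name, args in block:
        if not args:
            result.append((name, args))
            continue
        dest = args[0]
        sources = [copies.get(a, a) for a in args[1:]]
        # Writing dest invalidates every move that involves it.
        copies = {d: s for d, s in copies.items() if dest not in (d, s)}
        if name == "move" and sources[0] != dest:
            copies[dest] = sources[0]
        result.append((name, [dest] + sources))
    return result
```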
\subsection{Dead code elimination}

The final optimization that is performed is dead code elimination. This means
that when an instruction is executed, but the result is never used, that
instruction can be removed.

To be able to properly perform dead code elimination, we need to know whether a
variable will be used before it is overwritten again. If it is, we call the
variable live; otherwise the variable is dead. The technique to find out if a
variable is live is called liveness analysis. We implemented this for the
entire program by analysing each block, where the variables that are live at
the entry of a block are used as the variables that are live at the exit of its
predecessors.

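Within a single block, liveness can be computed by walking backwards from the
set of variables that are live at the exit of the block. The sketch below
assumes that every statement is a side-effect free register operation in
\texttt{(name, args)} format with the destination first; stores, calls and
branches would need extra care. The function name
\texttt{eliminate\_dead\_code} is hypothetical.

```python
# Hypothetical statement format: (name, args) with the destination first.
def eliminate_dead_code(block, live_out):
    """Drop statements whose result is dead, given the live-out set."""
    live = set(live_out)
    kept = []
    for name, args in reversed(block):
        dest, sources = args[0], args[1:]
        if dest not in live:
            continue  # the result is never used: remove the statement
        live.discard(dest)  # dest is defined here, so dead above this point
        live.update(a for a in sources if a.startswith("$"))
        kept.append((name, args))
    kept.reverse()
    return kept
```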
\section{Implementation}

We decided to implement the optimizer in Python. We chose this programming
language because Python makes it easy to manipulate strings, work in an
object-oriented way, etc.

It turns out that Lex and Yacc are also available as a Python module, named
PLY (Python Lex-Yacc). This allows us to use one language, Python, instead of
two, i.e. C and Python. Also, no debugging is needed in C, only in Python,
which makes our assignment more feasible.

The program has three steps: parsing the Assembly code into a data structure we
can use, the so-called Intermediate Representation (IR), performing
optimizations on this IR, and writing the IR back to Assembly.

\subsection{Parsing}

The parsing is done with PLY, which allows us to perform Lex-Yacc tasks in
Python by using a Lex-Yacc-like syntax. This way there is no need to combine
languages, as we would otherwise have to do, since Lex and Yacc are coupled
with C.

The decision was made not to recognize every possible instruction exactly in
the parser, but only to distinguish whether a line is, for example, a command,
a comment or a gcc directive. We then transform each line into an object called
a Statement. A statement has a type, a name and optionally a list of arguments.
These statements together form a statement list, which is placed in another
object called a Block. In the beginning there is one block for the entire
program, but after the global optimizations this is split into several blocks,
which are the basic blocks.

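The two objects could be modelled roughly as below. This is a sketch under
assumptions: the field names, the \texttt{parse\_line} helper and its simple
string checks are hypothetical and merely stand in for the PLY-based lexer and
parser.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    stype: str                      # "command", "label", "comment", ...
    name: str
    args: list = field(default_factory=list)


@dataclass
class Block:
    statements: list = field(default_factory=list)


def parse_line(line):
    """Classify one non-empty line without knowing every instruction."""
    text = line.strip()
    if text.startswith("#"):
        return Statement("comment", text[1:].strip())
    if text.endswith(":"):
        return Statement("label", text[:-1])
    if text.startswith("."):
        return Statement("directive", text)
    parts = text.split(None, 1)  # instruction name, then the operands
    args = [a.strip() for a in parts[1].split(",")] if len(parts) > 1 else []
    return Statement("command", parts[0], args)
```

A whole program then starts out as a single
\texttt{Block([parse\_line(l) for l in lines])}.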
\subsection{Optimizations}

The optimizations are done in two different steps. First the global
optimizations are performed, which are only the optimizations on branch-jump
constructions. This is done repeatedly until there are no more changes.

After all possible global optimizations are done, the program is separated into
basic blocks. The algorithm to do this was described earlier: the leaders are
determined from the jump and branch instructions and their targets, and a basic
block then runs from one leader to the next.

After the division into basic blocks, optimizations are performed on each of
these basic blocks. This is also done repeatedly, since sometimes one
optimization makes another one possible.

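Repeating the optimizations until nothing changes is a fixed-point loop, which
could be sketched as follows. The function \texttt{optimize\_until\_stable}
and the two toy passes are hypothetical and only demonstrate how one pass can
enable another; the loop assumes that the passes eventually stop making
changes.

```python
# Hypothetical statement format: (name, args) with the destination first.
def optimize_until_stable(block, passes):
    """Apply every pass repeatedly until a full round changes nothing."""
    changed = True
    while changed:
        changed = False
        for optimize in passes:
            new_block = optimize(block)
            if new_block != block:
                block = new_block
                changed = True
    return block


# Two toy passes, only here to demonstrate the loop.
def add_zero_to_move(block):
    return [("move", args[:2]) if name == "addu" and args[2:] == ["0"]
            else (name, args) for name, args in block]


def remove_self_moves(block):
    return [(name, args) for name, args in block
            if not (name == "move" and args[0] == args[1])]
```

Here \texttt{addu \$2, \$2, 0} first becomes \texttt{move \$2, \$2}, which a
later round removes entirely.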
\subsection{Writing}

Once all the optimizations have been done, the IR needs to be written back to
Assembly code. After this step the xgcc cross compiler can generate binary code
from the resulting Assembly code.

The writer expects a list of statements, so first the blocks have to be
concatenated again into a single list. This list is then passed on to the
writer, which writes the instructions back to Assembly and saves the file so we
can let xgcc compile it. The original statements can also be written to a file,
so that differences in tabs, spaces and newlines do not show up when checking
the differences between the optimized and non-optimized files.

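Concatenating the blocks and formatting the statements could be done along
these lines. The tuple format \texttt{(kind, name, args)}, the function names
and the exact tab and comma layout are assumptions for this sketch, not the
layout xgcc itself produces.

```python
# Hypothetical statement format: (kind, name, args).
def format_statement(statement):
    """Turn one statement back into a line of Assembly."""
    kind, name, args = statement
    if kind == "label":
        return name + ":"
    if kind == "comment":
        return "#" + name
    if kind == "directive":
        return "\t" + name
    return "\t" + name + ("\t" + ",".join(args) if args else "")


def write_blocks(blocks):
    """Concatenate all blocks and return the Assembly text."""
    statements = [s for block in blocks for s in block]
    return "\n".join(format_statement(s) for s in statements) + "\n"
```

Writing both the original and the optimized statements through the same
function keeps whitespace differences out of the comparison.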
\subsection{Execution}

To execute the optimizer, the following command can be given:\\
\texttt{./main.py <original file> <optimized file> <rewritten original file>}\\
There is also a script available that runs the optimizer and automatically
starts the program \emph{meld}. In meld it is easy to visually compare the
original file and the optimized file. The command to execute this script is:\\
\texttt{./run <benchmark name (e.g. whet)>}

\section{Testing}

Of course, it has to be guaranteed that the optimized code still functions
exactly the same as the non-optimized code. Testing is therefore an important
part of our program. We have two stages of testing. The first stage is unit
testing. The second stage is to test whether the compiled code produces exactly
the same output.

\subsection{Unit testing}

For almost every piece of important code, unit tests are available. Unit tests
make it possible to check whether each small part of the program, for instance
each small function, performs as expected. This way bugs are found early and
pinpointed precisely. Otherwise, one would only see that there is a mistake
somewhere in the program, without knowing where the bug is. Naturally, this
makes debugging a lot easier.

The unit tests can be run by executing \texttt{make test} in the root folder of
the project. This does require the \texttt{textrunner} module.

Also available is a coverage report. This report shows how much of the code has
been unit tested. To generate this report, the command \texttt{make coverage}
can be run in the root folder. The report is then placed in a folder
\emph{coverage}, in which \emph{index.html} can be opened to see the entire
report.

\subsection{Output comparison}

In order to check that the optimization does not change the functioning of the
program, the output of the provided benchmark programs has to be compared to
the output after optimization. If any of these outputs is not equal to the
original output, either our optimizations are too aggressive, or there is a bug
somewhere in the code.

\section{Results}

The following results have been obtained:\\
\begin{tabular}{|c|c|c|c|c|c|}
\hline
Benchmark & Original & Optimized & Original & Optimized & Performance \\
 & instructions & instructions & cycles & cycles & boost (cycles) \\
\hline
pi & 94 & & & & \\
acron & 361 & & & & \\
dhrystone & 752 & & & & \\
whet & 935 & & & & \\
slalom & 4177 & & & & \\
clinpack & 3523 & & & & \\
\hline
\end{tabular}

\pagebreak

\appendix

\section{List of all optimizations}
\label{opt}

\textbf{Global optimizations}

\begin{verbatim}
beq ...,$Lx                 bne ...,$Ly
j $Ly               ->      $Lx: ...
$Lx: ...

bne ...,$Lx                 beq ...,$Ly
j $Ly               ->      $Lx: ...
$Lx: ...
\end{verbatim}

\textbf{Standard basic block optimizations}

\begin{verbatim}
mov $regA,$regA             ->  --- // remove it

mov $regA,$regB             ->  instr $regA, $regB,...
instr $regA, $regA,...

instr $regA,...                 instr $4,...
mov [$4-$7], $regA          ->  jal XXX
jal XXX

sw $regA,XXX                ->  sw $regA, XXX
ld $regA,XXX

shift $regA,$regA,0         ->  --- // remove it

add $regA,$regA,X           ->  lw ...,X($regA)
lw ...,0($regA)
\end{verbatim}

\textbf{Advanced basic block optimizations}

\begin{verbatim}
# Common subexpression elimination
addu $regA, $regB, 4            addu $regD, $regB, 4
...                             move $regA, $regD
Code not writing $regB     ->   ...
...                             ...
addu $regC, $regB, 4            move $regC, $regD

# Constant folding
li $regA, constA                ""
sw $regA, 16($fp)               ""
li $regA, constB           ->   ""
sw $regA, 20($fp)               ""
lw $regA, 16($fp)               ""
lw $regB, 20($fp)               ""
addu $regA, $regA, $regB        li $regA, (constA + constB) at compile time

# Copy propagation
move $regA, $regB               move $regA, $regB
...                             ...
Code not writing $regA,    ->   ...
$regB                           ...
...                             ...
addu $regC, $regA, ...          addu $regC, $regB, ...

# Algebraic transformations
addu $regA, $regB, 0       ->   move $regA, $regB
subu $regA, $regB, 0       ->   move $regA, $regB
mult $regA, $regB, 1       ->   move $regA, $regB
mult $regA, $regB, 0       ->   li $regA, 0
mult $regA, $regB, 2       ->   sll $regA, $regB, 1
\end{verbatim}

\end{document}