\documentclass[10pt,a4paper]{article}
\usepackage[latin1]{inputenc}
\usepackage{amsmath,amsfonts,amssymb,booktabs,graphicx,listings,subfigure}
\usepackage{float,hyperref}

% Paragraph indentation
\setlength{\parindent}{0pt}
\setlength{\parskip}{1ex plus 0.5ex minus 0.2ex}

\title{Peephole Optimizer}
\author{Jayke Meijer (6049885), Richard Torenvliet (6138861), Tadde\"us Kroes
(6054129)}

\begin{document}

\maketitle
\tableofcontents
\pagebreak
\section{Introduction}

The goal of this assignment is to implement the peephole optimization stage of
the xgcc cross compiler. This requires a MIPS Assembly parser to parse the
output of the compiler. An Assembly writer is also needed to write the
optimized statements back to valid Assembly code for the assembler.

The assignment provides a number of benchmarks written in C. The objective is
to obtain a high speedup in the number of execution cycles for these
benchmarks.
\section{Types of optimizations}

There are two general types of optimizations on the assembly code: global
optimizations and optimizations on so-called basic blocks. These optimizations
will be discussed individually below.
\subsection{Global optimizations}

We only perform one global optimization, which is optimizing branch-jump
statements. The unoptimized Assembly code contains sequences of statements
with the following structure:
\begin{verbatim}
beq ...,$Lx
j $Ly
$Lx: ...
\end{verbatim}
Here, it is more efficient to replace the branch statement with a \texttt{bne}
(the opposite case) that targets the label used in the jump statement. This
way, the jump statement can be eliminated: when the inverted condition fails,
execution simply falls through to the label \texttt{\$Lx} that directly
follows. The same can be done for the opposite case, where a \texttt{bne} is
changed into a \texttt{beq}.

Since this optimization spans the jumps and labels that separate basic blocks,
we cannot perform it during the basic block optimizations.
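As an illustration, this rewrite could be implemented as in the following
sketch, assuming statements are simple \texttt{(name, args)} tuples and labels
appear as \texttt{('label', [name])} statements; this representation is a
simplification, not our actual IR:
\begin{verbatim}
# Minimal sketch of the branch-jump optimization on (name, args) tuples.
# The INVERSE mapping and this representation are illustrative only.
INVERSE = {'beq': 'bne', 'bne': 'beq'}

def optimize_branch_jump(statements):
    """Rewrite 'beq ...,$Lx; j $Ly; $Lx:' into 'bne ...,$Ly; $Lx:'."""
    result = []
    i = 0
    while i < len(statements):
        s = statements[i]
        if (i + 2 < len(statements)
                and s[0] in INVERSE
                and statements[i + 1][0] == 'j'
                and statements[i + 2] == ('label', [s[1][-1]])):
            # Invert the branch, retarget it to the jump's label and
            # drop the now-redundant jump statement.
            result.append((INVERSE[s[0]], s[1][:-1] + statements[i + 1][1]))
            result.append(statements[i + 2])
            i += 3
        else:
            result.append(s)
            i += 1
    return result
\end{verbatim}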
\subsection{Basic Block Optimizations}

Optimizations on basic blocks form a more extensive part of the optimizer.

First of all, what is a basic block? A basic block is a sequence of statements
guaranteed to be executed in that order, and that order alone. This is the
case for a piece of code not containing any branches or jumps (except for the
last statement).

To divide the code into basic blocks, the ``leaders'' have to be found. A
statement is a leader if it is the target of a jump or branch statement, or if
it directly follows such a statement. Each leader is the start of a new basic
block.
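The block division itself could look like the following sketch, using the same
illustrative tuple representation as above:
\begin{verbatim}
# Sketch of basic block division on illustrative (name, args) tuples;
# this is not our actual BasicBlock code.
JUMPS = {'j', 'jal', 'beq', 'bne'}

def find_leaders(statements):
    """Return the indices of all statements that start a new basic block."""
    targets = {s[1][-1] for s in statements if s[0] in JUMPS}
    leaders = {0}
    for i, s in enumerate(statements):
        if s[0] == 'label' and s[1][0] in targets:
            leaders.add(i)          # target of a jump or branch
        if s[0] in JUMPS and i + 1 < len(statements):
            leaders.add(i + 1)      # statement directly after a jump
    return sorted(leaders)

def split_blocks(statements):
    """Split the statement list into basic blocks at the leaders."""
    leaders = find_leaders(statements)
    bounds = leaders + [len(statements)]
    return [statements[a:b] for a, b in zip(bounds, bounds[1:])]
\end{verbatim}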
There are five types of optimizations performed on basic blocks in our
implementation. Each is described individually below.
\subsubsection{Standard peephole optimizations}

These are optimizations that look for a certain statement or pattern of
statements, and optimize these. For example,
\begin{verbatim}
mov $regA,$regB
instr $regA, $regA,...
\end{verbatim}
can be optimized to:
\begin{verbatim}
instr $regA, $regB,...
\end{verbatim}
\texttt{\$regA} contains the same value as \texttt{\$regB} after the move
statement, so \texttt{\$regB} can be used by \texttt{instr} instead. Since
\texttt{instr} overwrites \texttt{\$regA}, the move statement has no further
effect after \texttt{instr} and can be removed.

There are a few more of these cases, which are described on the course
page\footnote{\url{http://staff.science.uva.nl/~andy/compiler/prac.html}} and
in Appendix \ref{opt}.
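Matching one such pattern could, as an illustration, look like this sketch
(same tuple representation as in the sketches above):
\begin{verbatim}
# Sketch of the mov/instr peephole pattern above, on illustrative
# (name, args) tuples.
def fold_move_into_instr(statements):
    """Rewrite 'move $a,$b; instr $a,$a,...' into 'instr $a,$b,...'."""
    result = []
    i = 0
    while i < len(statements):
        s = statements[i]
        if (s[0] == 'move' and i + 1 < len(statements)
                and statements[i + 1][1][:2] == [s[1][0], s[1][0]]):
            dest, src = s[1]
            nxt = statements[i + 1]
            # instr overwrites dest, so the move itself becomes dead.
            result.append((nxt[0], [dest, src] + nxt[1][2:]))
            i += 2
        else:
            result.append(s)
            i += 1
    return result
\end{verbatim}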
\subsubsection{Common subexpression elimination}

A more advanced optimization is common subexpression elimination. This means
that expensive operations like multiplications or additions are performed
only once, after which the result is `copied' into the registers where it is
needed:
\begin{verbatim}
addu $2,$4,$3        addu $8,$4,$3    # $8 is free
...                  move $2,$8
...              ->  ...
...                  ...
addu $5,$4,$3        move $5,$8
\end{verbatim}
A standard method for doing this is the usage of a DAG, or Directed Acyclic
Graph. However, this requires either the code to be in Static Single
Assignment
form\footnote{\url{http://en.wikipedia.org/wiki/Static_single_assignment_form}},
or an advanced liveness check. Our implementation contains a (partially
tested) implementation of DAG creation, but this is not used in the final
implementation. However, our implementation does contain a simplified version
of common subexpression elimination:

The statement list of a block is traversed in reversed order, looking for
instructions that are eligible for CSE (\texttt{addu}, for example). If such
an instruction is found, it is marked, and the rest of the statement list is
traversed while marking all statements that are equal to the found
instruction. If a statement assigns a register that is used by the
instruction, traversal stops.

If more than one instruction has been marked, a new instruction is inserted
above the first occurrence (the last occurrence in reversed order). This
instruction performs the calculation and saves it in a free temporary
register. Then, each occurrence is replaced by a \texttt{move} of the free
register to its original destination register.
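In condensed form, this traversal could look like the following sketch (on
the illustrative \texttt{(name, args)} tuples used earlier;
\texttt{find\_free\_register} is a hypothetical helper):
\begin{verbatim}
# Sketch of the simplified CSE pass described above, using illustrative
# (name, args) tuples; find_free_register is a hypothetical helper.
CSE_CANDIDATES = {'addu'}

def eliminate_common_subexpression(block, find_free_register):
    statements = block[::-1]              # traverse in reversed order
    for i, s in enumerate(statements):
        if s[0] not in CSE_CANDIDATES:
            continue
        expr = (s[0], s[1][1:])           # operator and source operands
        marked = [i]
        for j in range(i + 1, len(statements)):
            t = statements[j]
            if (t[0], t[1][1:]) == expr:
                marked.append(j)          # equal expression: mark it
            elif t[1] and t[1][0] in s[1][1:]:
                break                     # an operand is assigned: stop
        if len(marked) > 1:
            tmp = find_free_register()
            # Insert the calculation above the first occurrence (the
            # last occurrence in reversed order)...
            statements.insert(marked[-1] + 1, (s[0], [tmp] + s[1][1:]))
            # ...and replace every occurrence by a move from tmp.
            for k in marked:
                statements[k] = ('move', [statements[k][1][0], tmp])
            return statements[::-1]       # one rewrite per call, for brevity
    return block
\end{verbatim}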
This method is obviously less efficient than the DAG approach. However, since
the basic blocks are generally not very large and the execution time of the
optimizer is not a primary concern, this is not a large problem.
\subsubsection{Constant folding}

Constant folding is an optimization where the outcome of arithmetic
instructions is calculated at compile time. If a register \texttt{x} is known
to contain a constant value, all following uses of \texttt{x} can be replaced
by that value, until a redefinition of \texttt{x}.

Arithmetic in Assembly is always performed between two registers, or a
register and a constant. If the current value of all used registers is known,
the expression can be evaluated at compile time and the instruction can be
replaced by an immediate load of the result. See Appendix \ref{opt} for an
example.
During constant folding, so-called algebraic transformations are performed as
well. When calculations are performed using constants, some calculations can
be replaced by a load- or move-instruction. An example is the statement
$x = y + 0$, or in Assembly: \texttt{addu \$1, \$2, 0}. This can be replaced
by $x = y$, or \texttt{move \$1, \$2}. A list of transformations that are
performed can be found in Appendix \ref{opt}.
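A minimal sketch of the folding step, assuming known constant register values
are tracked in a dictionary (illustrative representation and a tiny
instruction subset, not our actual code):
\begin{verbatim}
# Sketch of constant folding on illustrative (name, args) tuples. Known
# constant register values are tracked in the 'constants' dictionary.
FOLDABLE = {'addu': lambda a, b: (a + b) & 0xffffffff,
            'subu': lambda a, b: (a - b) & 0xffffffff}

def fold_constants(block):
    constants = {}
    result = []
    for name, args in block:
        if name == 'li':
            constants[args[0]] = int(args[1], 0)   # register := constant
            result.append((name, args))
        elif name in FOLDABLE:
            ops = [constants.get(a, a) for a in args[1:]]
            ops = [int(v, 0) if isinstance(v, str) and not v.startswith('$')
                   else v for v in ops]
            if all(isinstance(v, int) for v in ops):
                value = FOLDABLE[name](*ops)
                constants[args[0]] = value
                # All operands known: replace by an immediate load.
                result.append(('li', [args[0], hex(value)]))
            else:
                constants.pop(args[0], None)
                result.append((name, args))
        else:
            if args:
                constants.pop(args[0], None)   # conservatively forget dest
            result.append((name, args))
    return result
\end{verbatim}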
\subsubsection{Copy propagation}

Copy propagation replaces usages of registers that have been assigned the
value of another register earlier. In Assembly code, such an assignment is in
the form of a \texttt{move} instruction.

This is not a direct optimization, but it often does create dead code (the
\texttt{move} statement) that can then be eliminated.

To perform copy propagation within a basic block, the block is traversed
until a \texttt{move x, y} instruction is encountered. For each of these
``copy statements'', the rest of the block is traversed while looking for
usages of the \texttt{move}'s destination address \texttt{x}. These usages
are replaced by usages of \texttt{y}, until either \texttt{x} or \texttt{y}
is re-assigned.
An example would be the following:
\begin{verbatim}
move $regA, $regB                     move $regA, $regB
...                                   ...
Code not writing $regA or $regB   ->  ...
...                                   ...
addu $regC, $regA, ...                addu $regC, $regB, ...
\end{verbatim}
\texttt{\$regA} is replaced with \texttt{\$regB}. Now, the move instruction
might have become useless. If so, it will be removed by dead code elimination.
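Within a single block, this replacement loop could look like the following
sketch (again on the illustrative \texttt{(name, args)} tuples, where
\texttt{args[0]} is a statement's destination register):
\begin{verbatim}
# Sketch of intra-block copy propagation on illustrative (name, args)
# tuples, where args[0] is the destination register of a statement.
def propagate_copies(block):
    for i, (name, args) in enumerate(block):
        if name != 'move':
            continue
        dest, src = args
        for j in range(i + 1, len(block)):
            jname, jargs = block[j]
            # Replace usages of dest by src in the source operands.
            block[j] = (jname, jargs[:1] +
                        [src if a == dest else a for a in jargs[1:]])
            if block[j][1] and block[j][1][0] in (dest, src):
                break   # dest or src is re-assigned: stop propagating
    return block
\end{verbatim}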
To also replace usages in successors of the basic block, a Reaching
Definitions analysis is used: if a \texttt{move}-statement is in the
$REACH_{out}$ set of the block, it may be used in one of the block's
successors. To be able to replace a usage, the definition must be the only
definition reaching that usage. To determine this, copy propagation defines a
new dataflow problem that yields the $COPY_{in}$ and $COPY_{out}$ sets. The
definition is the only reaching definition if it is in the successor's
$COPY_{in}$ set. If this is the case, the usage can be replaced by the source
address of the \texttt{move}-statement.
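For reference, the standard dataflow equations for this problem (in the usual
compiler-course notation; the $gen$ and $kill$ sets are the textbook ones,
not literal names from our code) are:
\begin{align*}
COPY_{in}(B) &= \bigcap_{P \in pred(B)} COPY_{out}(P)\\
COPY_{out}(B) &= gen(B) \cup (COPY_{in}(B) - kill(B))
\end{align*}
Here, $gen(B)$ contains the copy statements in $B$ whose source and
destination registers are not redefined later in $B$, and $kill(B)$ contains
the copy statements invalidated by a redefinition in $B$.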
Note: though we implemented the algorithm as described above, we did not
encounter any replacements between basic blocks while optimizing the provided
benchmark scripts. This might be because our implementation of the copy
propagation dataflow problem is based on the lecture slides, which only
briefly describe the algorithm.
\subsubsection{Dead code elimination}

The final optimization that is performed is dead code elimination. This
removes statements of which the result is never used.

To determine if a register is used from a certain point in the code, liveness
analysis is used. A variable is ``live'' at a certain point in the code if it
holds a value that may be needed in the future. Using the $LIVE_{out}$ set
that is generated by the analysis, we can check if a register is dead after a
certain point in a basic block. Each statement that assigns a register which
is dead from that point on is removed.
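A sketch of this removal, assuming a hypothetical helper
\texttt{live\_after(block, i)} that returns the set of registers live directly
after statement \texttt{i} (standing in for the real liveness analysis):
\begin{verbatim}
# Sketch of dead code elimination; live_after(block, i) is a hypothetical
# helper standing in for the real liveness analysis.
SIDE_EFFECT_FREE = {'move', 'li', 'addu', 'subu'}

def eliminate_dead_code(block, live_after):
    result = []
    for i, (name, args) in enumerate(block):
        if (name in SIDE_EFFECT_FREE and args
                and args[0] not in live_after(block, i)):
            continue   # result is never used: drop the statement
        result.append((name, args))
    return result
\end{verbatim}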
\section{Implementation}

We decided to implement the optimizations in Python. We chose this programming
language because Python makes it easy to manipulate strings and to work
object-oriented.

To implement the parser, we use a Python variant of Yacc and Lex named PLY
(Python Lex-Yacc). By using this module instead of the regular C
implementations of Yacc and Lex, we only use a single language in the entire
project.
The program has three steps:
\begin{enumerate}
\item Parsing the Assembly code to an Intermediate Representation (IR).
\item Performing optimizations on the IR.
\item Writing the IR back to Assembly code.
\end{enumerate}

Our code is provided with this report, and is also available on GitHub: \\
\url{https://github.com/taddeus/peephole}
\subsection{Structure}

The folder structure of our implementation is as follows:
\begin{itemize}
\item \texttt{.}
    \begin{itemize}
    \item \texttt{benchmarks/} \\
        All benchmark files and their Assembly versions (both unoptimized
        and optimized).
    \item \texttt{src/}
        \begin{itemize}
        \item \texttt{copy\_propagation.py} \\
              \texttt{liveness.py} \\
              \texttt{reaching\_definitions.py} \\
              Implementations of specific dataflow problems.
        \item \texttt{dag.py} \\
              DAG creation (unused).
        \item \texttt{dataflow.py} \\
              Common dataflow functions such as basic block division and
              flow graph generation. Contains the definition of the
              \texttt{BasicBlock} class.
        \item \texttt{dominator.py} \\
              Dominator tree creation (unused, untested).
        \item \texttt{optimize\_advanced.py} \\
              Advanced optimizations on basic block level: common
              subexpression elimination, constant folding, copy propagation
              and dead code elimination.
        \item \texttt{optimize\_redundancies.py} \\
              Straightforward replacement functions for statement sequences
              on global and basic block level.
        \item \texttt{parser.py} \\
              Assembly parser.
        \item \texttt{program.py} \\
              Definition of the \texttt{Program} class.
        \item \texttt{statement.py} \\
              Definitions of the \texttt{Statement} and \texttt{Block}
              classes.
        \item \texttt{writer.py} \\
              Assembly writer.
        \end{itemize}
    \item \texttt{tests/} \\
        Unit tests.
    \item \texttt{main.py} \\
        Runs the optimizer.
    \item \texttt{run} \\
        Runs the optimizer and compares the original statement list to the
        optimized one.
    \end{itemize}
\end{itemize}
\subsection{Parsing}

The parser is implemented using PLY, which uses standard Lex-Yacc syntax in
given function formats.

The parser assumes that it is given valid Assembly code as input, so it does
not validate whether, for example, command arguments are valid. This design
decision was made because the optimizer uses the output of a compiler, which
should produce valid Assembly code.

The parser recognizes four types of ``statements'':
\begin{itemize}
\item \textbf{comment} Line starting with a `\#'.
\item \textbf{directive} Compiler directive. These are matched and treated
      in the same way as comments.
\item \textbf{command} Machine instruction, followed by 0 to 3 arguments and
      optionally an inline comment.
\item \textbf{label} Line containing a \texttt{WORD} token, followed by a
      colon (`:').
\end{itemize}
Each statement is represented by a \texttt{Statement} object containing a
type, a name, optionally a list of arguments and optionally a list of extra
options (such as inline comments). The parsed list of statements forms a
\texttt{Program} object, which is the return value of the parser.
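To give an impression of the PLY syntax, the following is a generic lexer
fragment; the token set and rules are simplified illustrations and do not
correspond one-to-one to our \texttt{parser.py}:
\begin{verbatim}
# Illustrative PLY lexer fragment; the tokens and rules are simplified
# and do not correspond one-to-one to our actual parser.py.
import ply.lex as lex

tokens = ('COMMENT', 'WORD', 'COLON', 'COMMA')

t_COLON = r':'
t_COMMA = r','
t_WORD = r'[a-zA-Z_$.][a-zA-Z0-9_$.]*'
t_ignore = ' \t'

def t_COMMENT(t):
    r'\#.*'
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('addu $2, $4, $3  # example')
for token in lexer:
    print(token.type, token.value)
\end{verbatim}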
\subsection{Optimization loop}

The optimizations are performed in a loop until no more changes are made. The
optimization loop first performs global optimizations on the entire statement
list of the program. Second, all dataflow analyses are performed (basic block
creation, flow graph generation, liveness, reaching definitions, copy
propagation). Finally, all basic block-level optimizations are executed. If
either the global or one of the block optimizations yields a change in
statements, another iteration is executed.
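In outline, the loop boils down to the following sketch (the function names
here are illustrative and do not necessarily match our actual code):
\begin{verbatim}
# Sketch of the main optimization loop; all function names here are
# illustrative and do not necessarily match our actual code.
def optimize(program):
    changed = True
    while changed:
        # 1. Global optimizations on the whole statement list.
        changed = optimize_global(program.statements)
        # 2. Dataflow analyses.
        blocks = split_blocks(program.statements)
        generate_flow_graph(blocks)
        analyze_liveness(blocks)
        analyze_reaching_definitions(blocks)
        analyze_copies(blocks)
        # 3. Basic block-level optimizations.
        for block in blocks:
            changed |= optimize_block(block)
        program.statements = concatenate(blocks)
\end{verbatim}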
\subsection{Writing}

Once all the optimizations have been done, the IR needs to be rewritten to
Assembly code. After this step, the xgcc cross compiler can make binary code
from the generated Assembly code.

The writer expects a list of statements, so first the blocks have to be
concatenated again into a single list. After this is done, the list is passed
on to the writer, which writes the instructions back to Assembly and saves
the file. We believe that the writer code is self-explanatory, so we will not
discuss it in detail here.

The writer has a slightly different output format than the xgcc compiler in
some cases. Therefore, the main execution file has an option to also write
the original statement list back to a file, so that differences in tabs,
spaces and newlines do not show up when checking the differences between the
optimized and non-optimized files.
\subsection{Execution}

The main execution file (\emph{main.py}) is located in the root directory. It
takes the file to be optimized as its last argument, optionally preceded by
one or more of the following options:
\begin{itemize}
\item \texttt{-o OUT\_FILE} \\
      Location to save the optimized statement list.
\item \texttt{-i SOURCE\_OUT\_FILE} \\
      Location to save the original statement list.
\item \texttt{-v VERBOSE\_LEVEL} \\
      The optimizer's verbose level (default 1).
      Possible verbose levels are: \\
      0: No command-line output, no comments in output file. \\
      1: Results in command-line output, inline optimization comments in
      output file. \\
      2: Results and debug statements in command-line output, inline
      optimization comments in output file.
\end{itemize}

The root directory also contains an executable script: \texttt{run}. This
script calls \texttt{main.py} and automatically starts the program
\texttt{meld}, in which it is easy to visually compare the original file and
the optimized file. The usage of this script is as follows: \\
\texttt{./run <benchmark name (e.g. whet)> [ <verbose level> ]}
\section{Testing}

Of course, it has to be guaranteed that the optimized code still functions
exactly the same as the non-optimized code. Therefore, testing is an
important part of our project. We have two stages of testing. The first stage
is unit testing. The second stage is to test whether the compiled code still
produces exactly the same output.

\subsection{Unit testing}

For almost every important piece of code, unit tests are available. Unit
tests make it possible to check whether each small part of the program, for
instance each function, performs as expected. This way, bugs are found early
and located precisely. Otherwise, one would only see that there is a mistake
somewhere in the program, without knowing where. Naturally, this makes
debugging a lot easier.
The unit tests can be run by executing \texttt{make test} in the root folder
of the project. This does require the \texttt{testrunner} module of Python.

Also available is a coverage report, which shows how much of the code is
covered by the unit tests. To create this report, the command
\texttt{make coverage} can be run in the root folder. The report is then
generated in a folder \emph{coverage}, in which \emph{index.html} can be
opened to see the entire report.
\subsection{Output comparison}

In order to check that the optimizations do not change the functionality of
the program, the output of the provided benchmark programs was compared to
their output after optimization. If any of these outputs is not equal to the
original output, either our optimizations are too aggressive, or there is a
bug somewhere in the code.
\section{Results}

We have executed the optimizer on each of the benchmark files and compared
the number of execution cycles of the optimized versions to that of the
original versions. The results are displayed in the following tables:

\begin{tabular}{|c|c|c|c|c|c|}
\hline
Benchmark & Original & Removed & Original & Optimized & Performance \\
 & instructions & instructions & cycles & cycles & boost \\
\hline
pi & 94 & 2 & 1714468 & 1714260 & 0.012\% \\
acron & 361 & 24 & 4435687 & 4314540 & 2.73\% \\
dhrystone & 752 & 49 & 2887710 & 2852653 & 1.21\% \\
whet & 935 & 33 & 2864526 & 2828487 & 1.26\% \\
slalom & 4177 & 226 & 2879140 & 2870375 & 0.304\% \\
clinpack & 3523 & 231 & 1543746 & 1457479 & 5.59\% \\
\hline
\end{tabular}

\begin{tabular}{|c|c|c|c|c|c|}
\hline
Benchmark & Original & Removed & Original & Optimized & Performance \\
 & instructions & instructions & cycles & cycles & boost \\
\hline
pi & 94 & 2 & 1714468 & 1714362 & 0.006\% \\
acron & 361 & 19 & 4435687 & 4372825 & 1.42\% \\
dhrystone & 752 & 36 & 2887710 & 2742720 & 5.02\% \\
whet & 935 & 23 & 2864526 & 2840042 & 0.855\% \\
slalom & 4177 & 107 & 2879140 & 2876105 & 0.143\% \\
clinpack & 3523 & 49 & 1543746 & 1528406 & 1.35\% \\
\hline
\end{tabular}
\pagebreak
\appendix
\section{List of all optimizations}
\label{opt}

\textbf{Global optimizations}
\begin{verbatim}
beq ...,$Lx          bne ...,$Ly
j $Ly            ->  $Lx: ...
$Lx: ...

bne ...,$Lx          beq ...,$Ly
j $Ly            ->  $Lx: ...
$Lx: ...

j $Lx            ->  $Lx: ...
$Lx: ...
\end{verbatim}
\textbf{Standard basic block optimizations}
\begin{verbatim}
mov $regA,$regA        ->  ---  // remove it

mov $regA,$regB        ->  instr $regA,$regB,...
instr $regA,$regA,...

instr $regA,...        ->  instr $4,...
mov [$4-$7],$regA          jal XXX
jal XXX

sw $regA,XXX           ->  sw $regA,XXX
ld $regA,XXX

shift $regA,$regA,0    ->  ---  // remove it

add $regA,$regA,X      ->  lw ...,X($regA)
lw ...,0($regA)
\end{verbatim}
\textbf{Advanced basic block optimizations}
\begin{verbatim}
# Common subexpression elimination
addu $regA, $regB, 4        addu $regD, $regB, 4
...                         move $regA, $regD
Code not writing $regB  ->  ...
...                         ...
addu $regC, $regB, 4        move $regC, $regD

# Constant folding
li $2, 2                    $2 = 2
sw $2, 16($fp)              16($fp) = 2
li $2, 3                    $2 = 3
sw $2, 20($fp)          ->  20($fp) = 3
lw $2, 16($fp)              $2 = 16($fp) = 2
lw $3, 20($fp)              $3 = 20($fp) = 3
addu $2, $2, $3             change to "li $2, 0x00000005"

# Copy propagation
move $regA, $regB           move $regA, $regB
...                         ...
Code not writing $regA,  -> ...
$regB                       ...
...                         ...
addu $regC, $regA, ...      addu $regC, $regB, ...

# Algebraic transformations
addu $regA, $regB, 0   ->  move $regA, $regB
subu $regA, $regB, 0   ->  move $regA, $regB
mult $regA, $regB, 1   ->  move $regA, $regB
mult $regA, $regB, 0   ->  li $regA, 0
mult $regA, $regB, 2   ->  sll $regA, $regB, 1
\end{verbatim}
\end{document}