ModSim: Completed assignment 1, questions 1 and 2.

parent 840334b8
FLAGS=-Wall -Wextra -std=c99 -pedantic
FLAGS=-Wall -Wextra -std=c99 -pedantic -O0 -lm
TYPES=float double LD
OPS=ADD DIV MULT SQRT
all: fp speed
all: fp speed report.pdf
speed: speed.o
gcc $(FLAGS) -o $@ $^
%.pdf: %.tex
pdflatex $^
pdflatex $^
speed: speed.c
for t in $(TYPES); do \
for o in $(OPS); do \
sed "s#{TYPE}#$$t#" $^ | sed "s#{OP}#$$o#" > speed.$$t.$$o.c; \
gcc $(FLAGS) -o speed.$$t.$$o speed.$$t.$$o.c; \
rm speed.$$t.$$o.c; \
done; \
done;
touch $@
fp: floating_point.o
gcc $(FLAGS) -o $@ $^
%.o: %.c
gcc $(FLAGS) -o $@ -c $^
%.s: %.c
gcc $(FLAGS) -S -o $@ $^
clean:
rm -vf *.o *.i *.s fp speed speed.*.* floating_point
for f in ./speed.[dfL]*; do
echo -n $f' ';
sleep 1;
sudo nice -n -20 time -f %U $f;
done
@@ -9,5 +9,29 @@ int main(void) {
PRINT_SIZE(double);
PRINT_SIZE(long double);
/*
* C = 0.f; op = '+'; e = 1.4012984643248171e-45
* C = 0.f; op = '-'; e = 1.4012984643248171e-45
* C = 1.f; op = '+'; e = 1.0842021724855044e-19
* C = 1.f; op = '-'; e = 5.4210108624275222e-20
* C = -1.f; op = '+'; e = 5.4210108624275222e-20
* C = -1.f; op = '-'; e = 1.0842021724855044e-19
*/
//float e, old;
//for(e = 1.f; 1.f - e != 1.f; old = e, e /= 2);
//printf("epsilon: %e\n", old);
//printf("epsilon: %.80f\n", old);
// 0.1f = 0x3dcccccd = '0 01111011 10011001100110011001101'
float e = 1.f;
printf("our epsilon: %.12e\n", e);
printf("f range: [%e, %e]\n", FLT_MIN, FLT_MAX);
printf("d range: [%e, %e]\n", DBL_MIN, DBL_MAX);
printf("ld range: [%Le, %Le]\n", LDBL_MIN, LDBL_MAX);
printf("f epsilon: %e\n", FLT_EPSILON);
printf("d epsilon: %e\n", DBL_EPSILON);
printf("ld epsilon: %Le\n", LDBL_EPSILON);
return 0;
}
\documentclass[10pt,a4paper]{article}
\usepackage{float}
\title{ModSim assignment 1: Floating point arithmetic}
\author{Tadde\"us Kroes (6054129) \and Sander van Veen (6167969)}
\begin{document}
\maketitle
\section{Representation} % {{{
\label{sec:Representation}
We wrote a small C program to determine the properties of floating point numbers
(\texttt{float}, \texttt{double} and \texttt{long double}) on our working
machine\footnote{Machine info...}. To determine the size of the various data
types, we used the \texttt{sizeof} operator. The range of the mentioned data
types can be derived from the constants in \texttt{float.h}, like
\texttt{FLT\_MIN} and \texttt{FLT\_MAX}. The same header also defines the
machine precision (epsilon) of each data type. \\
\\
The values we found are summarized in the table below:
\begin{table}[H]
\begin{tabular}{l|lll}
Data type & Bytes & Range & Epsilon \\
\hline
\texttt{float} & 4 & $[1.175494 \cdot 10^{-38}, 3.402823 \cdot 10^{38}]$
& $1.192093 \cdot 10^{-7}$ \\
\texttt{double} & 8 & $[2.225074 \cdot 10^{-308}, 1.797693 \cdot 10^{308}]$
& $2.220446 \cdot 10^{-16}$ \\
\texttt{long double} & 12 & $[3.362103 \cdot 10^{-4932}, 1.189731 \cdot 10^{4932}]$
& $1.084202 \cdot 10^{-19}$ \\
\end{tabular}
\caption{Floating point characteristics.}
\end{table}
We will explain the $\epsilon$ we found for the precision of the \texttt{float}
data type. First, we state that $1 + \epsilon$ is the smallest representable
number greater than one (thus $1 + \epsilon \neq 1$). Given the representation
as defined in the lecture slides, we know that the stored 8-bit exponent of $1$
is $E = 01111111_2 = 127_{10}$, so $e = E - bias = 127 - 127 = 0$. The mantissa
bits are all zero except for the ``hidden bit'', which is 1. This gives the
exact number $1 \cdot 2^0 = 1.0$. The next number above one can be made by
setting the least significant mantissa bit, $b_{23}$, to `1'. If we apply the
given formula, we get the following decimal value:
$$ (-1)^{sign}(1 + \sum_{i=1}^{23} \ b_{i}2^{-i} )\cdot 2^{(E-127)}
= 1(1 + 1 \cdot 2^{-23}) \cdot 2^0 = 1.000000119209 = 1 + \epsilon $$
We noticed that the precision of numbers between $-1$ and $1$ is much higher, as
we will show later in this report. We expected the precision to be the same as
the $\epsilon$ which we calculated above: for the smallest numbers the stored
exponent is $00000000_2$, which would give $e = 0 - bias = -127$, and there is
no more hidden bit, but since $2^{-126} = 2 \cdot 2^{-127}$ the precision should
be the same. We think that the higher precision is due to extra precision in the
floating point registers of our computer. Note that only numbers with a stored
exponent of $00000000_2$ are ``denormalized'', not all numbers between $-1$
and $1$.
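
As an illustration (a sketch that follows from the formula above, not a result
from the lecture slides): for a normalized \texttt{float} with unbiased exponent
$e$ (the stored exponent minus the bias of 127), flipping the least significant
mantissa bit changes the value by
$$ 2^{-23} \cdot 2^{e} = 2^{e - 23} $$
so the gap between neighbouring numbers is $2^{-23}$ in $[1, 2)$ where $e = 0$,
$2^{-24}$ in $[0.5, 1)$ where $e = -1$, and $2^{-149}$ just above
\texttt{FLT\_MIN} $= 2^{-126}$ where $e = -126$. Under this formula the absolute
spacing shrinks as numbers approach zero, while the relative spacing stays
roughly constant.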
% }}}
\section{Calculation speed} % {{{
\label{sec:Calculation speed}
We created one base source file; the executable benchmark files are generated
from it by the Makefile, which substitutes the \texttt{\{TYPE\}} and
\texttt{\{OP\}} placeholders. The benchmark can be started using
\texttt{./benchmark.bash}.
\begin{table}[H]
\begin{tabular}{l|ll}
Type & Operator & Million ops/sec \\
\hline
\texttt{float} & ADD & 311 \\
\texttt{double} & ADD & 296 \\
\texttt{long double} & ADD & 235 \\
\texttt{float} & DIV & 213 \\
\texttt{double} & DIV & 213 \\
\texttt{long double} & DIV & 190 \\
\texttt{float} & MULT & 9.57 \\
\texttt{double} & MULT & 9.58 \\
\texttt{long double} & MULT & 12.8 \\
\texttt{float} & SQRT & 190 \\
\texttt{double} & SQRT & 222 \\
\texttt{long double} & SQRT & 121 \\
\end{tabular}
\caption{Calculation speed of various mathematical operations.}
\end{table}
\noindent \textbf{Observations}
\begin{itemize}
\item We see that when the data type has a larger storage size, the addition
operation takes increasingly longer.
\item Division and multiplication performance are the same for the data
types \texttt{float} and \texttt{double}. For the \texttt{long double} data
type, division is slower, whereas multiplication is slightly faster in our
measurements.
\item We notice that the square root operation is slower for the
\texttt{float} than for the \texttt{double} data type. Therefore, we think
that the \texttt{sqrt} function of glibc is optimised for the
\texttt{double} data type.
\end{itemize}
% }}}
\end{document}
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#define ADD(a, b) (a += b)
#define DIV(a, b) (a /= b)
#define MULT(a, b) (a *= b)
// Macro expansion is on purpose here to suppress the `unused var b' warning.
#define SQRT(a, b) a = sqrt(a); b = b
#define LD long double
int main(void) {
int i;
for(i=0; i < 1e9; i++);
printf("i = %d\n", i);
int i, max = (int) 1e9;
{TYPE} a = 1.60654, b = 3.1285341;
for(i=0; i < max; i++)
{OP}(a, b);
return 0;
}