XGC Technology

sales@xgc.com
Fax +44 1483 433312


GCC ERC32 Key Characteristics


For GCC-ERC32 Version 1.1
Prepared for ESTEC, contract 11935/NL/JG
Version 1.0


Introduction

The performance of a compilation system can be summed up by several "key characteristics", which enable users to make meaningful comparisons between competing products.

The four key characteristics we are interested in are:

  • Code size, in bytes per source line.
  • Compilation speed, in source lines per minute.
  • Execution time of the generated code.
  • Execution time of critical library functions, such as those in the POSIX Threads library.

Which Benchmarks

As far as we are aware there are no standard benchmark programs designed to measure the size of an executable program. This may be because for the majority of programmers the size of a program is not important. And for application programs that run on an operating system, it is far from clear what the size of a program is. Should it include the size of the operating system, or maybe just the sharable libraries?

For embedded systems, the meaning of size is much more tightly defined. But we still have this issue of the run-time system, and any library functions that are linked in. We believe that a programmer will be interested in the total size of the application program, run-time system and any library functions that are linked in.

The most commonly used benchmark programs for embedded-systems timing are the following:

  • ackermann—which measures function call time.
  • whetstone—which is a scientific benchmark designed to measure floating point performance.
  • sieve—which measures integer and loop performance.
  • dhrystone—which measures integer performance.

All of these are available in a number of languages including C and Ada, and have been widely used over many years. We therefore have a wide basis for comparison.

We have found the Ada source for three of these in the Ada PIWG tests; the sieve benchmark was originally published in C in Byte Magazine. We have rewritten ackermann and whetstone in C for this project.

The standard C version of the dhrystone benchmark is quite different from the original Ada version, so we have not used it.

Methods Used

The method used to measure some characteristic of a program often affects the results, and in every case, we were careful to select a method that would give meaningful and repeatable results. Many benchmarking methods are flawed, and based on our experience in this area, we constructed tests to give statistically significant and useful results.

Compilation Speed

To measure the speed of compilation we took several "typical" programs, compiled them with and without optimization and measured the total CPU time using the UNIX command ‘time’.

For example, to time the compilation of whetstone, we entered the following command:

bash$ time erc-elf-gcc -O2 -c whetstone.c
real    5.3
user    4.8
sys     0.3

We then repeated the command to check if having the compiler already loaded in the operating system’s cache made an appreciable difference to the compilation time. For the measurements we made, this was not the case.

Code Size

For code size measurements, we used the compiler options "-O2 -c", which mean optimize at level 2 and do not link. Level 2 includes most of the optimizations we need but excludes inlining.

We then used the object code dump utility erc-elf-objdump to print the size of each segment. The total code size was then arrived at by summing the size of all ‘.text’ sections, and any read-only data sections.
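The summation can be scripted. The following sketch parses a section listing of the kind produced by objdump with the -h option; the listing here is a captured sample standing in for real erc-elf-objdump output, and its sizes are illustrative only.

```shell
#!/bin/bash
# Sum the sizes of the '.text' and read-only data sections from an
# 'objdump -h' style section listing.  The listing below is a sample
# standing in for 'erc-elf-objdump -h whetstone.o'.
cat > sections.txt <<'EOF'
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         000012a0  00000000  00000000  00000040  2**2
  1 .data         00000120  000012a0  000012a0  000012e0  2**2
  2 .rodata       00000150  000013c0  000013c0  00001400  2**2
EOF

# Field 2 is the section name and field 3 its size in hex; keep only
# code and read-only data, and print the decimal total.
total=0
while read -r idx name size rest; do
  case "$name" in
    .text|.rodata*) total=$(( total + 16#$size )) ;;
  esac
done < sections.txt
echo "$total"
```

Real objdump output interleaves a flags line after each section line; those lines do not match the patterns above and are ignored.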

Execution Time

Measuring execution time is considerably more error prone than measuring compilation time. In general, you cannot simply 'time' a high-level language statement, because the introduction of timing statements changes the execution time of the statement being timed. In addition, it is not usual to have a clock resolution fine enough to time single high-level language statements.

Therefore, we try to measure the execution time of a large number of statements, then divide the total time by the number of statements. The statements may be written one after the other, or in a loop that is executed a number of times.

Unfortunately, both of these techniques introduce redundant expressions, and unless the source code is modified to defeat optimizations, the redundant expressions will be eliminated from the executable code, and the time measurements will be wrong. For the purposes of this key-characteristics report, we restrict the kind of statement we measure to those that cannot be eliminated by optimization. Moreover, in most cases the statement is a call to a function in another file.

In the loop technique, the loop overhead must be accounted for. One method, known as the dual loop method, begins with a dummy loop that has an empty body but iterates the same number of times as the test loop. Whether this gives accurate timings depends on the compiler, and in our experience the results are poor.

A better method, based on the dual loop method, is to combine both of the above. That is we use two loops but both loops contain test statements. One loop contains more test statements than the other, and we can measure the time for the additional statement by timing both loops and computing the difference.

Results

This section presents the measurements we made using GCC-ERC32 Version 1.1.

The host computer used for this work was a Sun SPARC Station 5 running Solaris 2.5. The target computer was the Saab Ericsson Space DEM32 development system, which has a Temic ERC32 Revision CBA chipset running at 10MHz.

In all tests we compiled the test program with the optimization option "-O2".

Compilation Speed

Today, even entry-level workstations are powerful enough to compile a small program in almost no time at all. This was not always the case, and in the past, the number of lines per minute that a compiler could manage was an essential compiler characteristic. For comparison, early Ada compilers managed as little as 50 lines per minute, rising to 500 lines per minute as optimization technology improved.

To get a range of measurements, we compiled a null program, t0, the largest benchmark program, whetstone, and the math library test program testmath. The results are for a SPARC Station 5 running Solaris 2.5.

Note that whetstone and testmath include the header files <report.h> and <time.h>, and testmath also includes the header <math.h>. The lines in these included files are not counted in the total lines of code. The times do not include linking.

Table 1: Speed of Compilation

Source file   Lines of code   Options   Time in seconds (user + sys)   Lines per minute
t0                  3         -c                    0.4                       450
t0                  3         -c -O2                0.5                       360
whetstone         491         -c                    2.0                      9503
whetstone         491         -c -O2                5.3                      8927
testmath         3819         -c                   11.5                     18629
testmath         3819         -c -O2                                        12453

The results show that optimization is expensive in terms of compilation time. In the case of 'testmath', the time to compile with optimization is more than four times that with optimization off.

Code Size

In a previous report on the key characteristics of the MIL-STD-1750 compiler, we compared the size of the generated code with that from the SPARC compiler. Now we are able to take what is substantially the SPARC compiler and compare the size with results we got for the MIL-STD-1750.

Table 2: Generated Code Sizes

Test        erc-coff (bytes)   Source lines of code   Bytes per source line
ackermann        576                  10                      5.8
sieve            184                  23                      8.0
whetstone       5040                 491                     10.3

Code Size Comparisons

In order to compare the size of the code generated by GCC-ERC and other GCC compilers, we compiled a number of C language files where we could be sure that the results would be comparable. We therefore deliberately avoided statements that would result in in-line code with one compiler, and with a run-time system call in another. We checked all the generated code to make sure that the comparisons were valid.

These are the compilers we used:

  • erc-coff version 1.1
  • m1750-coff version 1.1
  • m68k-coff version 2.8.1

The code sizes reported in Table 3 are the sum of all code and constant data sections.

Table 3: Code Size Comparisons

Test        erc-coff (bytes)   m1750-coff (bytes)   m68k-coff (bytes)
ackermann         76                  58                  66
sieve            184                 108                 112
whetstone       5405                2804                5248

Figure 1 shows the results from Table 3 as a graph, where the ERC32 values are shown as 1.0, and where the others are scaled accordingly.

Figure 1: Code Size Comparisons

Benchmark Results

Here are the timing results for the three benchmark programs ackermann, sieve and whetstone. For comparison, we present results for a Motorola MVME133 board, which has a MC68020 running at 12 MHz, and a floating point coprocessor.

Table 4: Benchmark Results

 

Benchmark    DEM32 result    MVME133 result   Time ratio
ackermann    1.13 seconds    1.2 seconds      1.1
sieve        0.29 seconds    1.1 seconds      3.8
whetstone    3125 KWIPS      476 KWIPS        6.57

The poor performance of the ERC32 on the Ackermann benchmark is due to the register window mechanism. According to the statistics from the simulator, almost all function calls result in a window overflow. Otherwise the ERC32 compares well with a CISC computer running at approximately the same clock speed.

In the following table we present results for a simulated MA31750 computer, and calculate the speed up in moving from the MA31750 to the ERC32.

Table 5: Benchmark Results Comparison

Benchmark    MA31750 result (simulated 10MHz clock)   DEM32 result (10MHz clock)   Time ratio
ackermann    0.98 seconds                             1.13 seconds                 0.87
sieve        0.65 seconds                             0.29 seconds                 2.5
whetstone    1428 KWIPS                               3125 KWIPS                   2.19

Results for the Math Library (libm)

The library libm is the ANSI standard math library. In GCC-ERC32 the math library is written as a number of C language files, some of which have assembly language inserts. Hardware floating point arithmetic is used where possible. In most cases it is extended precision floating point that is used because in the C language, floating point operations default to the type ‘double’ and all the functions in libm take double arguments and return double results. We ran the libm execution time tests on the DEM32 target.

Table 6: Execution Times for libm Functions

Function call            Time in microseconds
acos (0.5)               34
asin (0.5)               32
atan (0.5)               36
atan2 (1.0, 2.0)         53
ceil (1.5)               12
cos (1.0)                55
cosh (1.0)               67
exp (1.0)                58
fabs (1.0)               0
floor (1.5)              8
fmod (100.0, 3.0)        19
frexp (1024.0, &itmp)    5
ldexp (1.0, 10)          5
log (100.0)              40
log10 (100.0)            42
modf (1.5, &xtmp)        14
pow (10.0, 10.0)         128
sin (1.0)                54
sinh (1.0)               tbs
sqrt (2.0)               4
tan (1.0)                57
tanh (1.0)               73

Results for POSIX Threads

For POSIX Threads, we used some of the benchmark programs included with the MIT distribution and some that were entirely new. These make simple measurements of a number of characteristics, including the time for a context switch, the time to lock and unlock a mutex, and so on.

Table 7: Results for POSIX Threads

Measurement                         Result in Microseconds
Thread creation                     205
Initialize Mutex                    3
Lock and unlock                     12
Interrupt latency                   28
Interrupt latency using cond_wait   178
Interrupt latency using intwait     257
pthread_cond_signal                 4

Accuracy of Delays

We measured the accuracy of delay statements using the simulator and then again on the DEM32 target computer. The test program began by measuring the actual delay for a requested delay of one second, then measured smaller delays down to 100 microseconds, using as many iterations as necessary to achieve a total delay time of about two seconds per test case.

The kink in the graph near a requested delay of 10 ms is an artifact of the benchmarking process and could be eliminated by randomizing the start time of each measured delay.

The results are shown in Figure 2.

Figure 2: Measured delay v Requested delay

Figure 2 shows the measured delay for each of several calls of nanosleep. The graph clearly shows the lower limit of 1/100 second, which is the interrupt period of our real-time clock.

Conclusion

The target computer used for the benchmark tests was the Saab Ericsson Space DEM32 board. This has the ERC32 chipset from Temic and runs at 10MHz.

Compilation Speed

The speed of compilation is very good. Most small files compile in about one second, and big files compile in less than a minute. As expected, the rate of compilation for large files is much greater than for small files.

Code Size

The size of the generated code is bigger than for other computers we have looked at. This is mainly due to the lack of instructions to operate on memory. For example, to increment a memory location requires 16 bytes of code on the ERC32, and only 4 bytes on the 1750. Also the minimum instruction size is 32 bits.

For the examples we compiled, the size of the executable image is dominated by the size of the interrupt vector. This accounts for 4K bytes of read-only memory. 

The measurements we made suggest an expansion ratio of about 8 bytes of code per line of source. Of course, the statistical significance of this result is quite low, and the actual values for an application program will be dependent on programming style and many other factors.

We compared the size of a number of example programs with the sizes from GCC-1750 and the GCC compiler for the Motorola MC68020. In all tests GCC-ERC generates more instructions and larger executable images. For a typical application program, the increase will not be a problem.

Execution Time

Comparing the execution time on the 10MHz ERC32 with a simulation of a 10MHz MA31750, the ERC32 appears to be about twice as fast. The exact ratio will depend on instruction mix, with function calls taking much longer on the ERC32 and floating point instructions much faster.