Loop unrolling


Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, which is an approach known as space–time tradeoff. The transformation can be undertaken manually by the programmer or by an optimizing compiler. On modern processors, loop unrolling is often counterproductive, as the increased code size can cause more cache misses; cf. Duff's device.[1]

The goal of loop unwinding is to increase a program's speed by reducing or eliminating instructions that control the loop, such as pointer arithmetic and "end of loop" tests on each iteration, as well as by reducing branch penalties and hiding latencies such as the delay in reading data from memory. To eliminate this computational overhead, loops can be re-written as a repeated sequence of similar independent statements.[4]

Loop unrolling is also part of certain formal verification techniques, in particular bounded model checking.[5]

Advantages

The overhead in "tight" loops often consists of instructions to increment a pointer or index to the next element in an array (pointer arithmetic), as well as "end of loop" tests. If an optimizing compiler or assembler is able to pre-calculate offsets to each individually referenced array variable, these can be built into the machine code instructions directly, therefore requiring no additional arithmetic operations at run time.
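
For illustration (the function sum4, the array a, and the unroll factor of four are invented for this sketch rather than taken from the text), a summation loop unrolled by hand shows how the fixed parts of each offset can be folded into the addressing of the individual accesses:

 #include <stddef.h>

 /* Sum an array whose length is assumed to be a multiple of 4. */
 long sum4(const int *a, size_t n)
 {
     long total = 0;
     for (size_t i = 0; i < n; i += 4) {
         /* The +1, +2 and +3 parts of the offsets become constant
            displacements that a compiler can encode directly in the
            load instructions, so only one index update and one
            end-of-loop test remain for every four elements. */
         total += a[i];
         total += a[i + 1];
         total += a[i + 2];
         total += a[i + 3];
     }
     return total;
 }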

Optimizing compilers will sometimes perform the unrolling automatically, or upon request.

Disadvantages

  • Increased Code Size: Unrolling increases the number of instructions, leading to larger program binaries.
    • Higher Storage Requirements: The expanded code takes up more memory, which can be problematic for microcontrollers or embedded systems with limited storage.
    • Instruction Cache Pressure: The unrolled loop consumes more space in the instruction cache. If it exceeds the cache size, frequent cache misses can occur, which can cause severe performance degradation due to costly memory accesses.
  • Reduced Code Readability: If loop unrolling is done manually instead of by an optimizing compiler, the code can become harder to understand and maintain.
  • Conflict with Function Inlining: When the loop body contains function calls, unrolling may prevent inlining due to excessive code expansion, leading to a trade-off between these two optimizations.
  • Increased Register Pressure: Depending on the CPU architecture (for example, with in-order superscalar execution), unrolling may require additional registers to store temporary variables across iterations, limiting register reuse (see the sketch after this list).[7]
  • Branch Prediction: Modern CPUs use branch prediction to try to guess which way a branch will go. If the prediction is correct, the CPU can continue executing instructions without waiting for the branch to resolve. However, if the prediction is incorrect, the CPU has to flush the pipeline and start executing the correct instructions, which can be a performance penalty. Loop unrolling can increase the number of branches in the code, which could lead to more branch mispredictions and lower performance.[8]
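
As a rough sketch of the register-pressure point above (the names dot4, a, b and the choice of four accumulators are illustrative only), an unrolled dot product keeps several partial sums live at once, each of which must occupy a register for the entire loop:

 #include <stddef.h>

 /* Dot product unrolled by 4; n is assumed to be a multiple of 4. */
 double dot4(const double *a, const double *b, size_t n)
 {
     /* Four independent accumulators must all stay live across the loop,
        roughly quadrupling the registers needed for the running sums. */
     double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
     for (size_t i = 0; i < n; i += 4) {
         s0 += a[i]     * b[i];
         s1 += a[i + 1] * b[i + 1];
         s2 += a[i + 2] * b[i + 2];
         s3 += a[i + 3] * b[i + 3];
     }
     return (s0 + s1) + (s2 + s3);
 }

Keeping the partial sums separate also shortens the dependency chain between additions, which is often why this form is chosen despite the extra registers it ties up.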

Static/manual loop unrolling

Manual (or static) loop unrolling involves the programmer analyzing the loop and interpreting the iterations into a sequence of instructions which will reduce the loop overhead. This is in contrast to dynamic unrolling which is accomplished by the compiler.

Simple manual example in C

A procedure in a computer program is to delete 100 items from a collection. This is normally accomplished by means of a for-loop which calls the function delete(item_number). If this part of the program is to be optimized, and the overhead of the loop requires significant resources compared to those for the delete(x) function, unwinding can be used to speed it up.

Normal loop:

 int x;
 for (x = 0; x < 100; x++)
 {
     delete(x);
 }

After loop unrolling:

 int x;
 for (x = 0; x < 100; x += 5)
 {
     delete(x);
     delete(x + 1);
     delete(x + 2);
     delete(x + 3);
     delete(x + 4);
 }

As a result of this modification, the new program has to make only 20 iterations, instead of 100. Afterwards, only 20% of the jumps and conditional branches need to be taken, which represents, over many iterations, a potentially significant decrease in the loop administration overhead. To produce the optimal benefit, no variables should be specified in the unrolled code that require pointer arithmetic. This usually requires "base plus offset" addressing, rather than indexed referencing.

On the other hand, this manual loop unrolling expands the source code size from 3 lines to 7 lines that have to be produced, checked, and debugged, and the compiler may have to allocate more registers to store variables in the expanded loop iteration. In addition, the loop control variables and number of operations inside the unrolled loop structure have to be chosen carefully so that the result is indeed the same as in the original code (assuming this is a later optimization on already working code). For example, consider the implications if the iteration count were not divisible by 5, as illustrated below. The manual amendments required also become somewhat more complicated if the test conditions are variables. See also Duff's device.
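
One common remedy when the iteration count is not a multiple of the unroll factor is a short clean-up loop for the leftover items; the following sketch (reusing the hypothetical delete(x) routine from the example above, with an arbitrary count n) shows one such arrangement:

 extern void delete(int item_number);  /* routine from the example above */

 void delete_all(int n)
 {
     int x = 0;
     /* Main unrolled loop handles complete groups of 5. */
     for (; x + 5 <= n; x += 5)
     {
         delete(x);
         delete(x + 1);
         delete(x + 2);
         delete(x + 3);
         delete(x + 4);
     }
     /* Clean-up loop handles the 0 to 4 remaining items. */
     for (; x < n; x++)
     {
         delete(x);
     }
 }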

Early complexity

In the simple case, the loop control is merely an administrative overhead that arranges the productive statements. The loop itself contributes nothing to the results desired, merely saving the programmer the tedium of replicating the code a hundred times, which could have been done by a pre-processor generating the replications, or by a text editor. Similarly, if-statements and other flow control statements could be replaced by code replication, except that code bloat can result. Computer programs easily track the combinations, but programmers find this repetition boring and make mistakes. Consider:

Normal loop:

for i := 1:8 do
    if i mod 2 = 0 then do_even_stuff(i)
                   else do_odd_stuff(i);
    next i;

After loop unrolling:

do_odd_stuff(1); do_even_stuff(2);
do_odd_stuff(3); do_even_stuff(4);
do_odd_stuff(5); do_even_stuff(6);
do_odd_stuff(7); do_even_stuff(8);

But of course, the code performed need not be the invocation of a procedure, and this next example involves the index variable in computation:

Normal loop:

x(1) := 1;
for i := 2:9 do
    x(i) := x(i - 1) * i;
    print i, x(i);
    next i;

After loop unrolling:

x(1) := 1;
x(2) := x(1) * 2; print 2, x(2);
x(3) := x(2) * 3; print 3, x(3);
x(4) := x(3) * 4; print 4, x(4);
... etc.

which, if compiled, might produce a lot of code (print statements being notorious), but further optimization is possible. This example makes reference only to x(i) and x(i - 1) in the loop (the latter only to develop the new value x(i)); therefore, given that there is no later reference to the array x developed here, its usages could be replaced by a simple variable. Such a change would, however, mean a simple variable whose value is changed, whereas if staying with the array, the compiler's analysis might note that the array's values are constant, each derived from a previous constant, and therefore carry the constant values forward so that the code becomes

print 2, 2;
print 3, 6;
print 4, 24;
...etc.
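
For comparison, a C rendering of the same loop is shown below (the array name fact and the surrounding main are invented for this sketch); a modern optimizer may fully unroll it and propagate the constant products, reducing it to the sequence of print statements above:

 #include <stdio.h>

 int main(void)
 {
     int fact[10];
     fact[1] = 1;
     for (int i = 2; i <= 9; i++)
     {
         fact[i] = fact[i - 1] * i;       /* each value is a compile-time constant */
         printf("%d %d\n", i, fact[i]);
     }
     /* After unrolling and constant propagation, the loop can reduce to
        printf("2 2\n"); printf("3 6\n"); printf("4 24\n"); and so on,
        with the array removed entirely. */
     return 0;
 }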

In general, the content of a loop might be large, involving intricate array indexing. These cases are probably best left to optimizing compilers to unroll. Replicating innermost loops might allow many possible optimisations yet yield only a small gain unless n is large.

Unrolling WHILE loops

Consider a pseudocode WHILE loop similar to the following:

Normal loop:

WHILE (condition) DO
    action
ENDWHILE

After loop unrolling:

WHILE (condition) DO
    action
    IF NOT(condition) THEN GOTO break
    action
    IF NOT(condition) THEN GOTO break
    action
ENDWHILE
LABEL break:

Unrolled & "tweaked" loop:

IF (condition) THEN
    REPEAT
        action
        IF NOT(condition) THEN GOTO break
        action
        IF NOT(condition) THEN GOTO break
        action
    WHILE (condition)
LABEL break:

In this case, unrolling is faster because the ENDWHILE (a jump to the start of the loop) will be executed 66% less often.

Even better, the "tweaked" pseudocode example, a transformation that may be performed automatically by some optimizing compilers, eliminates unconditional jumps altogether.
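
A C sketch may make this concrete (more_items() and process() are hypothetical stand-ins for the condition and action above); the tweaked form tests the condition once up front and then uses a do-while, so the only remaining backward branch is conditional:

 /* Hypothetical condition and body standing in for the pseudocode above. */
 extern int more_items(void);
 extern void process(void);

 void run_tweaked(void)
 {
     if (more_items())
     {
         do
         {
             process();
             if (!more_items())
                 break;
             process();
             if (!more_items())
                 break;
             process();
         } while (more_items());
     }
 }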

Dynamic unrolling

Since the benefits of loop unrolling frequently depend on the size of an array, which may often not be known until run time, JIT compilers (for example) can determine whether to invoke a "standard" loop sequence or instead generate a (relatively short) sequence of individual instructions for each element. This flexibility is one of the advantages of just-in-time techniques versus static or manual optimization in the context of loop unrolling. In this situation, the savings are often still useful even for relatively small values of n, requiring only a small (if any) overall increase in program size (code that might be included just once, as part of a standard library).
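
As a loose illustration of such a run-time decision (copy_item(), copy_all(), and the special-cased count of four are all invented for this sketch), a routine can branch to a straight-line unrolled sequence for a common small size and fall back to an ordinary loop otherwise:

 extern void copy_item(int i);   /* hypothetical per-element operation */

 void copy_all(int n)
 {
     if (n == 4)
     {
         /* Straight-line sequence chosen at run time for a common small size. */
         copy_item(0);
         copy_item(1);
         copy_item(2);
         copy_item(3);
     }
     else
     {
         /* General case: fall back to an ordinary loop. */
         for (int i = 0; i < n; i++)
             copy_item(i);
     }
 }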

Assembly language programmers (including optimizing compiler writers) are also able to benefit from the technique of dynamic loop unrolling, using a method similar to that used for efficient branch tables. Here, the advantage is greatest where the maximum offset of any referenced field in a particular array is less than the maximum offset that can be specified in a machine instruction (which will be flagged by the assembler if exceeded).

Assembler example (IBM/360 or Z/Architecture)