Intel ARCHITECTURE IA-32 User Manual



General Optimization Guidelines


Use movapd as an alternative; it writes all 128 bits. Even though this instruction has a longer latency, the μops for movapd use a different execution port, and this port is more likely to be free. The change can impact performance. There may be exceptional cases where the latency matters more than the dependence or the execution port.

Assembly/Compiler Coding Rule 54. (M impact, ML generality) Avoid introducing dependences with partial floating point register writes, e.g. from the movsd xmmreg1, xmmreg2 instruction. Use the movapd xmmreg1, xmmreg2 instruction instead.
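The difference between a partial and a full register write can be illustrated with SSE2 intrinsics. This is a minimal sketch, assuming an x86 target with SSE2 and a compiler that provides `<emmintrin.h>`; the function names are hypothetical, and the compiler is free to pick different instructions than the ones named in the comments. It shows only the data semantics: the partial copy keeps the destination's upper half (so its result depends on the old destination value), while the full copy does not.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Partial write: result = { src[0], dst[1] }.
   The upper 64 bits of dst survive, so the result depends on the
   previous value of dst. This mirrors movsd xmmreg1, xmmreg2. */
__m128d copy_low_merge(__m128d dst, __m128d src) {
    return _mm_move_sd(dst, src);
}

/* Full write: all 128 bits come from src, with no dependence on dst.
   This mirrors movapd xmmreg1, xmmreg2. */
__m128d copy_full(__m128d dst, __m128d src) {
    (void)dst;  /* the old destination value is irrelevant */
    return src;
}

/* Hypothetical helpers that extract the low and high doubles. */
double lane0(__m128d v) { return _mm_cvtsd_f64(v); }
double lane1(__m128d v) { return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v)); }
```

When only the low double of the result is consumed afterwards, both functions are interchangeable, which is the situation in which the rule suggests preferring the full-width copy.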

The movsd xmmreg, mem instruction writes all 128 bits and breaks a dependence.

The load form of movupd (from memory) performs two 64-bit loads, but requires additional μops to adjust the address and combine the loads into a single register. The same functionality can be obtained using movsd xmmreg1, mem; movsd xmmreg2, mem+8; unpcklpd xmmreg1, xmmreg2, which uses fewer μops and can be packed into the trace cache more effectively. The latter alternative has been found to provide several percent of performance improvement in some cases. Its encoding requires more instruction bytes, but this is seldom an issue for the Pentium 4 processor. The store version of movupd is complex and slow, so much so that the sequence with two movsd and an unpckhpd should always be used.

Assembly/Compiler Coding Rule 55. (ML impact, L generality) Instead of using movupd xmmreg1, mem for an unaligned 128-bit load, use movsd xmmreg1, mem; movsd xmmreg2, mem+8; unpcklpd xmmreg1, xmmreg2. If the additional register is not available, then use movsd xmmreg1, mem; movhpd xmmreg1, mem+8.
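Both load sequences in this rule can be written with SSE2 intrinsics. The following is a sketch only, assuming an x86 target with SSE2 and `<emmintrin.h>`; the function names are hypothetical, and the intrinsics describe the data movement rather than guarantee the exact instructions the compiler emits.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Two-register variant:
   movsd xmmreg1, mem; movsd xmmreg2, mem+8; unpcklpd xmmreg1, xmmreg2 */
__m128d load2_unpack(const double *p) {
    __m128d lo = _mm_load_sd(p);      /* { p[0], 0.0 } */
    __m128d hi = _mm_load_sd(p + 1);  /* { p[1], 0.0 } */
    return _mm_unpacklo_pd(lo, hi);   /* { p[0], p[1] } */
}

/* One-register variant:
   movsd xmmreg1, mem; movhpd xmmreg1, mem+8 */
__m128d load2_movhpd(const double *p) {
    __m128d v = _mm_load_sd(p);       /* { p[0], 0.0 } */
    return _mm_loadh_pd(v, p + 1);    /* { p[0], p[1] } */
}
```

Either function can stand in for `_mm_loadu_pd(p)` when `p` is not known to be 16-byte aligned; both return the same two doubles.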

Assembly/Compiler Coding Rule 56. (M impact, ML generality) Instead of using movupd mem, xmmreg1 for a store, use movsd mem, xmmreg1; unpckhpd xmmreg1, xmmreg1; movsd mem+8, xmmreg1.
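The store sequence in this rule can be sketched the same way. The example assumes an x86 target with SSE2 and `<emmintrin.h>`; `store2_split` is a hypothetical name, and the compiler may select different instructions than those named in the comments.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Split an unaligned 128-bit store into two 64-bit stores. */
void store2_split(double *p, __m128d v) {
    _mm_store_sd(p, v);            /* movsd mem, xmmreg1: write low double */
    v = _mm_unpackhi_pd(v, v);     /* unpckhpd xmmreg1, xmmreg1: high -> low */
    _mm_store_sd(p + 1, v);        /* movsd mem+8, xmmreg1: write old high double */
}
```

As in the assembly sequence, the register is overwritten by the unpack step; here `v` is passed by value, so the caller's copy is unaffected.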
