X86: haddps: Another patch from Vince Weaver

Information
Submitter:	Lisa Hsu
Repository:	gem5
Branch:
Bugs:
Depends On:
Reviewers
Groups:	Default
People:	ali, gblack, nate, stever

Description

X86: haddps: Another patch from Vince Weaver

Testing Done

Since there have been no objections, I'm going to commit this.

src/arch/x86/isa/insts/simd128/floating_point/arithmetic/horizontal_addition.py (Diff revision 1)

ext is no longer set to a raw bit vector that selects per-instruction features like this since, as you can see, it's pretty opaque just looking at it. The maddf ext=1 becomes ext=Scalar. For msrli and mslli, ext=0 is the default and can be dropped, which leaves the ops as SIMD. Since they're already operating at the full width of the fp register type (a double), the value is especially redundant.
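To illustrate why a named value reads better than a raw bit vector, here is a minimal Python sketch. It is not gem5 code: the helper `ext_argument` and the name table are hypothetical, and the only value taken from the comment above is that `Scalar` corresponds to the old `ext=1` and that `0` is the default.

```python
# Toy illustration only; not the gem5 microcode assembler.
# Assumption: Scalar corresponds to the raw value 1, as the comment above implies.
Scalar = 1
DEFAULT_EXT = 0  # the default; leaves the op as SIMD

# Hypothetical lookup table from raw values to readable names.
EXT_NAMES = {Scalar: "Scalar"}


def ext_argument(ext=DEFAULT_EXT):
    """Return the text of the ext argument, or '' when it can be dropped."""
    if ext == DEFAULT_EXT:
        return ""  # the default does not need to be spelled out
    return "ext=" + EXT_NAMES.get(ext, str(ext))
```

With this helper, `ext_argument(1)` yields a self-describing argument, while the default case yields nothing to write at all.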

Marc Orr May 11, 2012, 10:29 a.m.

I've addressed this in a newly submitted patch (since I can't update this one).

src/arch/x86/isa/insts/simd128/floating_point/arithmetic/horizontal_addition.py (Diff revision 1)

This implementation is a bit inefficient, although not terribly so. You have to be careful since the two operands may be the same register and you don't want to overwrite something you still need. But, for instance, the maddf one line above, this shift of ufp4, and the maddf on line 60 could all update xmmh directly, since by that point all "high" halves of the xmm registers have been read and no faults can happen. The moves that read out xmmlm could be moved higher, and xmml could also be updated directly.

I think it -may- also be possible to do something clever and cut down the number of microops shifting things around to pack and unpack the results. I may have also suspected this was true when I wrote the much simpler 64-bit-wide version of this instruction below this one, where the components are whole registers and can be indexed directly, but then didn't come up with anything and punted for later.
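The aliasing concern raised above (both operands being the same register) can be sketched in plain Python. This is a toy model, not gem5 microcode: registers are modeled as four-element lists with lane 0 as the lowest, and the function names are made up for illustration. The lane layout follows the architectural definition of HADDPS.

```python
def haddps_reference(dest, src):
    """Return the HADDPS result for two 4-lane vectors (lane 0 is lowest)."""
    return [dest[0] + dest[1], dest[2] + dest[3],
            src[0] + src[1], src[2] + src[3]]


def haddps_in_place_naive(regs, dest, src):
    """Update the destination lane by lane; wrong when dest and src alias."""
    d = regs[dest]
    s = regs[src]  # the same list object as d when dest == src
    d[0] = d[0] + d[1]
    d[1] = d[2] + d[3]
    d[2] = s[0] + s[1]  # reads already-overwritten lanes if s is d
    d[3] = s[2] + s[3]


def haddps_in_place_safe(regs, dest, src):
    """Read every input into temporaries first, then write the destination."""
    d = regs[dest]
    s = regs[src]
    result = haddps_reference(d, s)  # all reads happen before any write
    d[:] = result
```

When `dest` and `src` name different registers the naive version happens to work, which is why the hazard is easy to miss; it only shows up for something like `haddps xmm0, xmm0`.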

Marc Orr May 11, 2012, 10:29 a.m.

In a new patch (since I can't update this one), I removed a few redundant loads from each macroop. I didn't quite achieve the optimization that you are suggesting though.

src/arch/x86/isa/insts/simd128/floating_point/arithmetic/horizontal_addition.py (Diff revision 1)

This microop is changing architecturally visible state and effectively committing to completing the op before all the possibly faulting ops have executed, specifically the following loads. There are 8 microcode fp registers so you can just use the others and leave ufp3 around until the end.
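The ordering problem described above can be shown with a small Python model. It is only a sketch under stated assumptions: registers are plain floats in a dict, `load` and `PageFault` are hypothetical stand-ins for a faulting memory access, and the arithmetic is simplified rather than the real haddps sequence. The point is only the order of writes relative to loads.

```python
class PageFault(Exception):
    """Hypothetical stand-in for a fault raised by a load microop."""


def load(memory, addr):
    """Return memory[addr], or fault if the address is unmapped."""
    if addr not in memory:
        raise PageFault(addr)
    return memory[addr]


def early_commit(regs, memory, addrs):
    """Writes an architectural register before the loads have completed."""
    regs["xmml"] = regs["xmml"] + regs["xmmh"]  # visible state changes here
    lo = load(memory, addrs[0])
    hi = load(memory, addrs[1])  # a fault here leaves xmml already modified
    regs["xmmh"] = lo + hi


def late_commit(regs, memory, addrs):
    """Keeps intermediate results in a temporary until no fault can occur."""
    ufp3 = regs["xmml"] + regs["xmmh"]  # microcode temporary, not visible
    lo = load(memory, addrs[0])
    hi = load(memory, addrs[1])  # a fault here leaves regs untouched
    regs["xmml"] = ufp3
    regs["xmmh"] = lo + hi
```

Both versions produce the same result when no load faults. They differ only when the second load faults: the early version leaves a half-updated register behind, so the instruction cannot simply be restarted.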

Marc Orr May 11, 2012, 10:29 a.m.

I've addressed this in a newly submitted patch (since I can't update this one).

src/arch/x86/isa/insts/simd128/floating_point/arithmetic/horizontal_addition.py (Diff revision 1)

As above, this can't happen before the loads.

Marc Orr May 11, 2012, 10:29 a.m.

I've addressed this in a newly submitted patch (since I can't update this one).


This change has been marked as submitted.

Status: Closed (submitted)