x86 ISA: Implement the sse3 haddps instruction.
Review Request #1189 - Created May 11, 2012 and submitted
| Information | |
|---|---|
| Marc Orr | |
| gem5 | |
| default | |
| Reviewers | |
| Default | |
Changeset 8981:7742f432fc1d --------------------------- x86 ISA: Implement the sse3 haddps instruction. This patch is a revised version of Vince Weaver's patch from 592.
Wrote a little program that uses haddps. All 3 haddps versions were tested (XMM_XMM, XMM_M, and XMM_P).
Posted (May 13, 2012, 12:25 p.m.)
Thanks for stepping up and taking a shot at this. I saw some possible improvements to your implementation and, after play with it a bit, came up with this:
shuffle ufp1, xmml, xmmh, ext=((0 << 0) | (2 << 2)), size=4
shuffle ufp2, xmml, xmmh, ext=((1 << 0) | (3 << 2)), size=4
shuffle ufp3, xmmlm, xmmhm, ext=((0 << 0) | (2 << 2)), size=4
shuffle ufp4, xmmlm, xmmhm, ext=((1 << 0) | (3 << 2)), size=4
maddf xmml, ufp1, ufp2, size=4
maddf xmmh, ufp3, ufp4, size=4
The memory versions follow naturally. It works/should work by moving the input values to the position they'll be in the answer with the "shuffle" microop, and then adding them together in parallel. I've verified that this compiles but haven't functionally tested it. Could you please use your test program to do that?
Also, the HADDPS_XMM_P version is basically the same as HADDPS_XMM_M, it just uses RIP relative addressing for the memory operand. The microcode for those typically read the RIP into microcode register t7 and then use the riprel address computation shorthand but are otherwise the same as the normal memory version. That addressing mode is only available in 64 bit mode, and to make sure you're using the version you want (RIP relative versus regular) you may have to encode the instruction manually.
Review request changed
Updated (May 14, 2012, 11:23 a.m.)
Description: |
|
||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Diff: |
Revision 2 (+38 -2) |
Review request changed
Updated (May 14, 2012, 11:27 a.m.)
Testing Done: |
|
|---|
Posted (May 14, 2012, 1:16 p.m.)
-
It would be better to put these loads earlier for two reasons. First, it helps hide their latency by allowing other things to run while they wait for their results. Second, it makes it easier to put the ufp-s in an order (say, ascending) so that it's easier to spot a typo. I'm glad to see this apparently worked, though.
Review request changed
Updated (May 15, 2012, 10:43 a.m.)
Description: |
|
||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Diff: |
Revision 3 (+39 -2) |
Looks good. The only thing I'd change is to remove the space between the two groups of shuffle microops to be consistent with the register only one and to not reuse ufp1 and ufp2. There micro floating point registers go up to ufp7, and by spreading it out we'll make things easier for a dumber pipeline that isn't good at resolving register dependencies. I'll make those two changes and get this in the repository. Thanks!
