x86 ISA: Implement the sse3 haddps instruction.

Information
Submitter:	Marc Orr
Repository:	gem5
Branch:	default
Bugs:
Depends On:
Reviewers
Groups:	Default
People:

Description

Changeset 8981:7742f432fc1d
---------------------------
x86 ISA: Implement the sse3 haddps instruction.

This patch is a revised version of Vince Weaver's patch from 592.

Testing Done

Wrote a little program that uses haddps. All 3 haddps versions were tested (XMM_XMM, XMM_M, and XMM_P).

Issue Summary

Description	From	Last Updated	Status
It would be better to put these loads earlier for two reasons. First, it helps hide their latency by allowing ...	Gabe Black	May 15, 2012, 10:44 a.m.	Resolved

Thanks for stepping up and taking a shot at this. I saw some possible improvements to your implementation and, after playing with it a bit, came up with this:


    shuffle ufp1, xmml, xmmh, ext=((0 << 0) | (2 << 2)), size=4
    shuffle ufp2, xmml, xmmh, ext=((1 << 0) | (3 << 2)), size=4
    shuffle ufp3, xmmlm, xmmhm, ext=((0 << 0) | (2 << 2)), size=4
    shuffle ufp4, xmmlm, xmmhm, ext=((1 << 0) | (3 << 2)), size=4

    maddf xmml, ufp1, ufp2, size=4
    maddf xmmh, ufp3, ufp4, size=4


The memory versions follow naturally. It works (or should work) by using the "shuffle" microop to move each input value to the position it will occupy in the answer, and then adding the values together in parallel. I've verified that this compiles but haven't functionally tested it. Could you please use your test program to do that?
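To make the shuffle-then-add idea concrete, here is a minimal reference model in Python. It is only a sketch for checking expected results, not gem5 code: the helper names `shuffle`, `maddf`, and `haddps` are hypothetical stand-ins, the reading of `ext` (bits [1:0] select the first result element, bits [3:2] the second, out of the four floats in the two source halves) is an assumption based on the microcode above, and Python floats are doubles, so single-precision rounding is not modeled.

```python
def shuffle(lo, hi, ext):
    """Assumed model of the shuffle microop with size=4.

    lo and hi each hold two floats. Bits [1:0] of ext pick the source
    element for result element 0, bits [3:2] pick it for result element 1.
    """
    src = list(lo) + list(hi)
    return (src[ext & 0x3], src[(ext >> 2) & 0x3])


def maddf(a, b):
    """Element-wise add of two 2-float halves (stand-in for maddf, size=4)."""
    return (a[0] + b[0], a[1] + b[1])


def haddps(dest, src):
    """Horizontal add of packed singles: dest and src are 4-float sequences."""
    xmml, xmmh = dest[:2], dest[2:]    # low/high halves of the destination
    xmmlm, xmmhm = src[:2], src[2:]    # low/high halves of the source

    even = (0 << 0) | (2 << 2)         # select elements 0 and 2
    odd = (1 << 0) | (3 << 2)          # select elements 1 and 3

    ufp1 = shuffle(xmml, xmmh, even)
    ufp2 = shuffle(xmml, xmmh, odd)
    ufp3 = shuffle(xmmlm, xmmhm, even)
    ufp4 = shuffle(xmmlm, xmmhm, odd)

    # New low half comes from the destination pairs, new high half from
    # the source pairs.
    return maddf(ufp1, ufp2) + maddf(ufp3, ufp4)


if __name__ == "__main__":
    print(haddps((1.0, 2.0, 3.0, 4.0), (10.0, 20.0, 30.0, 40.0)))
```

A test program can compare the values it reads back from the simulated instruction against this model for a handful of inputs.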

Also, the HADDPS_XMM_P version is basically the same as HADDPS_XMM_M; it just uses RIP-relative addressing for the memory operand. The microcode for those typically reads the RIP into microcode register t7 and then uses the riprel address computation shorthand, but is otherwise the same as the normal memory version. That addressing mode is only available in 64-bit mode, and to make sure you're using the version you want (RIP-relative versus regular) you may have to encode the instruction manually.
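For manual encoding, the sketch below builds the raw bytes for the two memory forms in 64-bit mode. It assumes the standard HADDPS encoding (F2 0F 7C /r) and the usual ModRM rules: mod=00 with rm=101 means RIP-relative disp32, while an absolute disp32 needs a SIB byte with no base and no index. The function names are hypothetical, and registers are limited to xmm0 through xmm7 so that no REX prefix is needed.

```python
import struct

HADDPS_OPCODE = bytes([0xF2, 0x0F, 0x7C])  # haddps xmm, xmm/m128


def modrm(mod, reg, rm):
    """Pack the three ModRM fields into one byte."""
    return (mod << 6) | (reg << 3) | rm


def _check_xmm(xmm):
    if not 0 <= xmm <= 7:
        raise ValueError("only xmm0-xmm7 are handled (no REX prefix)")


def haddps_rip_relative(xmm, disp32):
    """haddps xmmN, [rip + disp32]: mod=00, rm=101 in 64-bit mode."""
    _check_xmm(xmm)
    return (HADDPS_OPCODE
            + bytes([modrm(0b00, xmm, 0b101)])
            + struct.pack("<i", disp32))


def haddps_absolute(xmm, addr32):
    """haddps xmmN, [addr32]: mod=00, rm=100 selects a SIB byte;
    SIB index=100 (none) and base=101 (disp32 only) give a plain address."""
    _check_xmm(xmm)
    sib = (0b00 << 6) | (0b100 << 3) | 0b101  # 0x25
    return (HADDPS_OPCODE
            + bytes([modrm(0b00, xmm, 0b100), sib])
            + struct.pack("<i", addr32))


if __name__ == "__main__":
    print(haddps_rip_relative(0, 0x10).hex())
    print(haddps_absolute(1, 0x1000).hex())
```

The resulting bytes could be emitted from a test program with the assembler's `.byte` directive, which avoids relying on the assembler to pick one addressing form over the other.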

Gabe Black May 13, 2012, 12:33 p.m. (May 13, 2012, 12:33 p.m.)

And actually, if you wanted to write up little test programs for other instructions we're missing or even go as far as putting them into a new regression, that would be great. It would be a great tool for you (or anyone else) to get and keep those instructions working, and it would be nice to fill in some of the gaps in our x86 instruction support.

Marc Orr May 14, 2012, 11:28 a.m. (May 14, 2012, 11:28 a.m.)

In the near term, I could certainly put the test program that I have for haddps somewhere in the repo or make it into a regression if that would be useful.

Description:

-	Changeset 8981:bd580154c720
+	Changeset 8981:463aba906774

		x86 ISA: Implement the sse3 haddps instruction.

		This patch is a revised version of Vince Weaver's patch from 592.

Diff:

Revision 2 (+38 -2)


	src/arch/x86/isa/decoder/two_byte_opcodes.isa
	src/arch/x86/isa/insts/simd128/floating_point/arithmetic/horizontal_addition.py

Testing Done:

-	Wrote a little program that uses haddps. I was able to test both the XMM_XMM version and the XMM_M version. I don't understand what the XMM_P version is so I was not able to test it.
+	Wrote a little program that uses haddps. All 3 haddps versions were tested (XMM_XMM, XMM_M, and XMM_P).

src/arch/x86/isa/insts/simd128/floating_point/arithmetic/horizontal_addition.py (Diff revision 2)

It would be better to put these loads earlier for two reasons. First, it helps hide their latency by allowing other things to run while they wait for their results. Second, it makes it easier to put the ufp-s in an order (say, ascending) so that it's easier to spot a typo. I'm glad to see this apparently worked, though.


Description:

-	Changeset 8981:463aba906774
+	Changeset 8981:7742f432fc1d

		x86 ISA: Implement the sse3 haddps instruction.

		This patch is a revised version of Vince Weaver's patch from 592.

Diff:

Revision 3 (+39 -2)


	src/arch/x86/isa/decoder/two_byte_opcodes.isa
	src/arch/x86/isa/insts/simd128/floating_point/arithmetic/horizontal_addition.py

Looks good. The only things I'd change are to remove the space between the two groups of shuffle microops, to be consistent with the register-only version, and to not reuse ufp1 and ufp2. The micro floating point registers go up to ufp7, and by spreading the values out we'll make things easier for a dumber pipeline that isn't good at resolving register dependencies. I'll make those two changes and get this in the repository. Thanks!


This change has been marked as submitted.


Status: Closed (submitted)