O3 LSQ: Implement TSO

Information
Submitter:	Nilay Vaish
Repository:	gem5
Branch:	default
Bugs:
Depends On:
Reviewers
Groups:	Default
People:

Description

Changeset 8702:9a5651e7bd5b
---------------------------
O3 LSQ: Implement TSO
This patch makes O3's LSQ maintain total order between stores. Essentially
only the store at the head of the store buffer is allowed to be in flight.
Only after that store completes, the next store is issued to the memory
system.

Testing Done

Does this really implement a post-retirement store buffer, or is this just a pre-retirement store queue?  I don't really know anything about this code, but based on my brief scan the store queue looks like it tracks pre-retirement stores.  If so, that is very different from a post-retirement store buffer.  Furthermore, you need to be careful about whether the head of the pre-retirement store queue is speculative or non-speculative.

Nilay Vaish Nov. 15, 2011, 7:25 a.m. (Nov. 15, 2011, 7:25 a.m.)

Brad, the store queue maintains both speculative and committed stores.
My understanding is that once the flag canWB is set to true, the instruction
is free to issue the store to memory (that is, the instruction has
committed).

The loop in which I added the check for a store in flight is the one used
for issuing stores to memory. Since the loop condition also checks that
canWB is true, only non-speculative stores are issued to the memory system.
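
A minimal sketch of the write-back loop described above, for readers who do not have the diff open. It is not the gem5 code: StoreEntry, StoreQueueSketch, and every method name are hypothetical, and only the canWB, needsTSO, and storeInFlight names come from this thread. It assumes that a store becomes eligible once canWB is set and that, with TSO enabled, at most one store may be outstanding.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical, simplified stand-in for one store queue entry.
struct StoreEntry {
    int id;
    bool canWB;  // set once the store has committed (non-speculative)
};

// Hypothetical model of the store write-back loop; not the real LSQUnit.
class StoreQueueSketch {
  public:
    explicit StoreQueueSketch(bool tso) : needsTSO(tso) {}

    void insert(int id) { queue.push_back({id, false}); }

    // Pretend every queued store has committed.
    void commitAll() { for (auto &e : queue) e.canWB = true; }

    // Issue committed stores from the head of the queue.
    // Returns the number of stores sent to memory in this call.
    std::size_t writebackStores() {
        std::size_t issued = 0;
        while (!queue.empty() && queue.front().canWB &&
               !(needsTSO && storeInFlight)) {
            queue.pop_front();
            ++issued;
            if (needsTSO)
                storeInFlight = true;  // block younger stores until completion
        }
        return issued;
    }

    // Called when the memory system acknowledges the outstanding store.
    void completeStore() {
        if (needsTSO) {
            assert(storeInFlight);
            storeInFlight = false;
        }
    }

  private:
    const bool needsTSO;
    bool storeInFlight = false;
    std::deque<StoreEntry> queue;
};
```

With needsTSO set, each call to writebackStores() issues at most one store until completeStore() is called; with it cleared, all committed stores at the head are issued back to back.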

src/cpu/o3/lsq_unit_impl.hh (Diff revision 1)

How about we have a parameter to the CPU and have a check that it's set for x86.

Nilay Vaish Nov. 15, 2011, 8:23 a.m. (Nov. 15, 2011, 8:23 a.m.)

Or we can have it as a trait of an isa just like we have one flag for
memory access alignment.

Ali Saidi Nov. 15, 2011, 1:51 p.m. (Nov. 15, 2011, 1:51 p.m.)

But you might want to see how performance changes if you enable it on an isa with a weaker memory model.

Gabe Black Nov. 15, 2011, 9:13 p.m. (Nov. 15, 2011, 9:13 p.m.)

Having #ifdefs is usually bad, and making this depend on x86 is probably not right. It's really whether we have or don't have a certain restriction on how memory accesses are handled, and x86 just happens to be where it's needed, right? Moving this out of the preprocessor would be great. Sometimes it's really annoying to do that, but in the long run it's a good idea.

src/cpu/o3/lsq_unit_impl.hh (Diff revision 1)

Similarly.

src/cpu/o3/lsq_unit_impl.hh (Diff revision 1)

Again.

Diff:

Revision 2 (+25 -1)

	src/arch/alpha/isa_traits.hh
	src/arch/arm/isa_traits.hh
	src/arch/mips/isa_traits.hh
	src/arch/power/isa_traits.hh
	src/arch/sparc/isa_traits.hh
	src/arch/x86/isa_traits.hh
	src/cpu/o3/lsq_unit.hh
	src/cpu/o3/lsq_unit_impl.hh

I have added a trait hasTSO to each ISA. It is set to true only for x86 and
false for the rest.

Diff:

Revision 3 (+25 -1)

	src/arch/alpha/isa_traits.hh
	src/arch/arm/isa_traits.hh
	src/arch/mips/isa_traits.hh
	src/arch/power/isa_traits.hh
	src/arch/sparc/isa_traits.hh
	src/arch/x86/isa_traits.hh
	src/cpu/o3/lsq_unit.hh
	src/cpu/o3/lsq_unit_impl.hh

src/arch/alpha/isa_traits.hh (Diff revision 3)

Guys, I don't think this should be enabled/disabled through the ISA traits.

If it is, anyone doing architectural exploration on the impact of memory models will have to recompile the code or generate different binaries.

Instead, this should be a parameter to the CPU model. The CPU model can then choose to implement relaxed ordering, TSO, or whatever flavor someone wants.

I don't think there is anything that explicitly ties an ISA to a memory model in all cases, so we would be creating a dependency we would regret later (for instance, SPARC can have a relaxed model or TSO, right?)

And IMO, it will be cleaner to just give the CPU a "memory_model" parameter that is an Enum. The LSQ can then choose what to do with an if statement, and when someone reads the code it will be obvious what's taking place and why.

What do people think about that?
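
To make the proposal concrete, here is a rough sketch of an enum-valued parameter driving the LSQ decision. Everything in it is an assumption for illustration: MemoryModel, parseMemoryModel, mayIssueStore, and the "relaxed"/"tso" strings are hypothetical names, not existing gem5 identifiers.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical set of models the CPU parameter could select.
enum class MemoryModel { Relaxed, TSO };

// Hypothetical mapping from a config string to the enum.
MemoryModel parseMemoryModel(const std::string &name)
{
    if (name == "relaxed") return MemoryModel::Relaxed;
    if (name == "tso") return MemoryModel::TSO;
    throw std::invalid_argument("unknown memory model: " + name);
}

// The LSQ decides per model whether another store may be sent to memory.
bool mayIssueStore(MemoryModel model, bool storeInFlight)
{
    switch (model) {
      case MemoryModel::TSO:
        return !storeInFlight;  // one store outstanding at a time
      case MemoryModel::Relaxed:
        return true;            // no ordering constraint between stores
    }
    return true;
}
```

Because the model is a runtime value, one binary can be run with either setting.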

Gabe Black Jan. 7, 2012, 6:14 a.m. (Jan. 7, 2012, 6:14 a.m.)

I agree with your conclusion and your reasoning. By extension, it could be implemented either as an enum or as a collection of bools, depending on whether the models can cleanly be broken down into a set of selectable behaviors like doing stores one at a time. Then we could avoid things like having if (a || b || f || g) do_foo(); where a, b, f, and g all have some well defined property requiring foo.

Steve Reinhardt Jan. 9, 2012, 4:21 a.m. (Jan. 9, 2012, 4:21 a.m.)

I agree that someone might want to enable TSO on ARM, or enable SC on x86, or might even come up with a weakly consistent variant of x86.  But anyone that's serious about this will probably have a set of specific models they want to compare with, and will be considering the impact of the cache coherence protocol as well, so it's not a trivial thing like increasing the cache size.  And how many people are really going to do this?

I think what's going on here is clear enough, such that if someone really wanted to do this study they could add a CPU model parameter to override TheISA::HasTSO if they wanted to (or if they weren't that ambitious at coding, they could just change the constant and deal with having two binaries).

So sure, ideally it would be great if the CPU model supported all possible consistency models in a cleanly factored way such that each independent behavioral aspect has its own flag, with the default for each ISA being the most aggressive model that that ISA supports, but with parameters exposed to override it with a stronger model or even a weaker model, but making sure to print out adequate warnings in the latter case.  If someone actually does this I hope they contribute it back.  But I think it's pretty serious overkill to make Nilay do this just to get x86 TSO support in so that x86 O3 MP works correctly.
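
For reference, the "ideal" scheme described above (per-ISA default, optional override, warning when the override is weaker) could look roughly like the sketch below. It is only an illustration under stated assumptions: MemoryModel, ResolvedModel, and resolveMemoryModel are hypothetical names, the three models are ranked by a simple integer strength, and std::optional requires C++17.

```cpp
#include <iostream>
#include <optional>

// Hypothetical models, ordered from weakest to strongest so they compare.
enum class MemoryModel { Relaxed = 0, TSO = 1, SC = 2 };

struct ResolvedModel {
    MemoryModel model;   // model the CPU will actually enforce
    bool weakerThanIsa;  // true if the override relaxes the ISA default
};

// Pick the ISA default unless the user supplied an override.
// Warn when the override is weaker than what the ISA requires.
ResolvedModel resolveMemoryModel(MemoryModel isaDefault,
                                 std::optional<MemoryModel> userOverride)
{
    if (!userOverride)
        return {isaDefault, false};

    const bool weaker =
        static_cast<int>(*userOverride) < static_cast<int>(isaDefault);
    if (weaker) {
        std::cerr << "warn: selected memory model is weaker than the "
                     "ISA default; software may not run correctly\n";
    }
    return {*userOverride, weaker};
}
```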

Korey Sewell Jan. 9, 2012, 4:51 a.m. (Jan. 9, 2012, 4:51 a.m.)

Another "are we naively generalizing for the sake of generalizing?" question that we are so adept at debating :)

Although I think it's pretty easy to add the memory model option to the CPU, the point about the "dependencies" between a proposed memory model and the coherence protocol study is a good one. That probably makes the usefulness of a general parameter minimal if it's just used in isolation. The "clean" way you (Steve) describe of parameter overrides and warnings would truly be overkill for something that no one would immediately find useful.

For the case where there are clearly only two options and a third is far away (e.g. branch delay slots or not, TSO or not, etc.), then I could see how this would be viable. If you get to the point where you have more than two options you are setting/unsetting, then appropriate parameters should be in place.

Nilay, I won't stand in the way of your change, it seems reasonable that as long as gem5 is currently "relaxed" or "TSO" that it's not that big of a deal to do this the way you propose.

Korey Sewell Jan. 9, 2012, 3:41 a.m. (Jan. 9, 2012, 3:41 a.m.)

Brad, 
would you be against enabling TSO through a CPU parameter option rather than an ISA characteristic?

Ali Saidi Jan. 9, 2012, 8:20 a.m. (Jan. 9, 2012, 8:20 a.m.)

I'm neutral on this, either way is fine, but I think the name HasTSO isn't ideal. An ISA is TSO, but I don't know that it has TSO.

Ali

Diff:

Revision 4 (+18 -1)

	src/cpu/o3/O3CPU.py
	src/cpu/o3/cpu.hh
	src/cpu/o3/cpu.cc
	src/cpu/o3/lsq_unit.hh
	src/cpu/o3/lsq_unit_impl.hh

Description:

	+	Changeset 8702:9a5651e7bd5b
	+
		O3 LSQ: Implement TSO
		This patch makes O3's LSQ maintain total order between stores. Essentially
		only the store at the head of the store buffer is allowed to be in flight.
		Only after that store completes, the next store is issued to the memory
		system.

Diff:

Revision 5 (+18 -1)

	src/cpu/o3/O3CPU.py
	src/cpu/o3/cpu.hh
	src/cpu/o3/cpu.cc
	src/cpu/o3/lsq_unit.hh
	src/cpu/o3/lsq_unit_impl.hh

Looks good... thanks, Nilay.

src/cpu/o3/O3CPU.py (Diff revision 5)

Seems like we should do something like:

  needsTSO = Param.Bool(buildEnv['TARGET_ISA'] == 'x86', "Memory model")

Korey Sewell Jan. 13, 2012, 11:31 a.m. (Jan. 13, 2012, 11:31 a.m.)

Also, this description should be changed from memory model to "Enforce Total Store Ordering (TSO)".

And thanks Nilay for taking the time to get this done. Your work is appreciated!

src/cpu/o3/lsq_unit_impl.hh (Diff revision 5)

Here,
we should use a member variable inside the LSQUnit instead of the pointer reference to the CPU's member variable.

That way we get:
if (needsTSO)
   storeInFlight=false

vs.

if (cpu->needsTSO)
   storeInFlight=false

Nilay Vaish Jan. 13, 2012, 1:21 p.m. (Jan. 13, 2012, 1:21 p.m.)
Why does this matter?

Korey Sewell Jan. 13, 2012, 1:49 p.m. (Jan. 13, 2012, 1:49 p.m.)

In terms of overall runtime of the simulator, it probably won't matter considering all the other things with larger overheads (e.g. event scheduling, memory allocation, etc.)

But isolating just this code snippet, the "cpu->" part of that condition adds the unnecessary overhead of dereferencing that pointer to get to the value of "needsTSO".

It caught my eye because you are doing that 3 times per store:
- once to check if a store's in flight
- once to mark a store in flight
- once to unmark a store in flight

Ideally, things that are actually changing values are worth the pointer reference (e.g. the interruptPending flag) whereas things that are generally constant parameters (e.g. needsTSO) aren't worth it.

Again, it's not going to be the biggest deal in terms of overall impact and I'm definitely nitpicking. Just something to consider as you go forward with these O3 patches...
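
A small sketch of what the suggestion amounts to: copy the constant parameter into the LSQ unit once, at construction, and test the local copy on the three per-store paths. CpuSketch, LsqUnitSketch, and the method names are hypothetical stand-ins rather than the real O3 classes; only needsTSO and storeInFlight are taken from the comments above.

```cpp
// Hypothetical stand-in for the CPU object that owns the parameter.
struct CpuSketch {
    bool needsTSO;
};

// Hypothetical stand-in for the LSQ unit; caches the parameter locally.
class LsqUnitSketch {
  public:
    explicit LsqUnitSketch(const CpuSketch &cpu)
        : needsTSO(cpu.needsTSO)  // read through the CPU exactly once
    {}

    // The three per-store checks now use the local copy.
    bool canIssueStore() const { return !(needsTSO && storeInFlight); }
    void markStoreIssued() { if (needsTSO) storeInFlight = true; }
    void markStoreCompleted() { if (needsTSO) storeInFlight = false; }

  private:
    const bool needsTSO;        // constant for the lifetime of the unit
    bool storeInFlight = false; // changes as stores issue and complete
};
```

This only works because the parameter never changes after construction; a value that can change at runtime would still need to be read through the CPU.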

src/cpu/o3/lsq_unit_impl.hh (Diff revision 5)

Here,
we should use a member variable inside the LSQUnit instead of the pointer reference to the CPU's member variable.

That way we get:
if (needsTSO)
   storeInFlight=false

vs.

if (cpu->needsTSO)
   storeInFlight=false

(same for line 773)

src/cpu/o3/lsq_unit_impl.hh (Diff revision 5)

same as above.

This change has been marked as submitted.

Status: Closed (submitted)