ruby: Fixed pipeline squashes caused by aliased requests
Review Request #2787 - Created May 11, 2015
| Information | |
|---|---|
| Submitter | Tony Gutierrez |
| Repository | gem5 |
| Branch | default |
| Reviewers (Groups) | Default |
Changeset 10887:1e05089bc991
---------------------------
ruby: Fixed pipeline squashes caused by aliased requests

This patch was created by Bihn Pham during his internship at AMD.

This patch fixes a very significant performance bug when using the O3 CPU model and Ruby. The issue was that Ruby returned false when it received a request to an address that already had an outstanding request, or when the memory was blocked. As a result, O3 unnecessarily squashed the pipeline and re-executed instructions.

This fix merges readRequestTable and writeRequestTable in the Sequencer into a single request table that keeps track of all requests and allows multiple outstanding requests to the same address. This prevents O3 from squashing the pipeline.
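To make the shape of the change concrete, here is a toy sketch of the single-table idea (ToySequencer, lineAddr, the 64-byte line size, and the return values are all invented for illustration; this is not the actual gem5 code): a multimap keyed by line address accepts aliased requests instead of reporting them back to the core, and a response completes every request queued on that line.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <unordered_map>

// Illustrative stand-ins only; not the real gem5 types.
using Addr = uint64_t;
enum class RequestStatus { Ready, Aliased };

struct SequencerRequest {
    Addr addr;
    bool isWrite;
};

// A toy sequencer with one table for both reads and writes. A multimap
// keyed by line address permits several outstanding requests to the
// same line, so an aliased request no longer bounces back to the core
// (which is what caused the O3 squashes).
class ToySequencer {
  public:
    RequestStatus insertRequest(const SequencerRequest &req) {
        m_requestTable.emplace(lineAddr(req.addr), req);
        return RequestStatus::Ready;  // never signal aliasing to the core
    }

    // On a memory response, complete every request queued on that line.
    std::size_t callback(Addr addr) {
        auto range = m_requestTable.equal_range(lineAddr(addr));
        std::size_t completed = std::distance(range.first, range.second);
        m_requestTable.erase(range.first, range.second);
        return completed;
    }

  private:
    static Addr lineAddr(Addr a) { return a & ~Addr(63); }  // 64 B lines

    std::unordered_multimap<Addr, SequencerRequest> m_requestTable;
};
```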
Issue Summary
From experience with the O3 CPU, this is a VERY important change for simulated CPU performance. I appreciate the effort to finally fix this.
It would be nice for the Ruby and gem5-classic memory hierarchies to provide the same access interface, but I think the consistency implications of this patch need to be discussed.
I'm worried that this patch seems likely to upset consistency models for cores that may have relied on Ruby to block aliased memory accesses. Specifically, if a core was blocking multiple outstanding accesses to a single line as a way to enforce consistency for data in that line (e.g., TSO), and those accesses can now be issued to Ruby concurrently, it seems it would now be the responsibility of the sequencer, and maybe even the coherence protocol, to ensure that those accesses remain ordered as required.
Given the behavior of the O3 CPU, perhaps the classic memory hierarchy allows multiple outstanding accesses to a single line. However, it handles transient coherence states with atomic coherence updates, which makes it much easier to guarantee access ordering to a single line, so it is not clear to me that it exposes the same interface as this patch provides.
Are you sure that all Ruby-working CPU cores and existing protocols still enforce correct consistency?
I have asked this question before, when Steve posted this patch several months ago. I am going to ask it again: Is it all right to buffer requests in the Sequencer? Do we know of CPU designs that do so? What problems do we face when we push requests for the same address through to the cache controllers?
-
src/mem/ruby/system/Sequencer.cc (Diff revision 1) -
Whitespace on this line looks too wide.
Diff: Revision 2 (+260 -269)
On 5/19 Nilay wrote: "I need some reference on this. I talked to Prof. Wood about it and he said that he is not aware of any CPUs that do this."
Is the claim that CPUs do not use complicated buffering to properly implement memory models? If so, I don't see how someone could agree with that statement. For instance, see the Butler et al. IEEE Micro paper from 2011: there are three different L/S units that can generate requests to the L1 I and D caches. Do you not think there is complicated buffering to support this?
On 5/19 Nilay also wrote: "Note that the L0/L1 controller would serve those requests from the same cache block. So requests would not be sent out beyond the L0/L1 controller. And as much as I understand the protocols currently in gem5, if we completely remove the aliasing support from sequencer, the L0/L1 controllers would either merge or block aliased requests."
I would be very careful with trying such a solution. Most SLICC protocols rely on stalling or recycling to deal with conflicting requests. A lot of stalls and recycles can significantly impact performance.
In general, we need to get past the expectation that contributors are going to completely re-implement their entire approach when posting a patch. What you are suggesting goes well beyond this patch.
I was buffering these comments, because this change has large implications and we still haven't discussed the differences between Ruby and classic memory interfaces. The patch submitter should be able to address questions like the ones I've raised so far (I don't feel any clarification has been provided yet).
Once again, I did more digging: It appears that classic caches completely model MSHR queuing and write buffering, and they block requests from the core when these are full. This indicates that they accept multiple concurrent accesses to a single line (it would be nice to get confirmation on that from a classic user). Thus, it makes sense to allow the Ruby sequencer to accept multiple outstanding accesses to a single line concurrently to be consistent with classic.
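For reference, the classic-cache behavior described above could be modeled roughly as follows (a toy sketch under my reading of the classic caches; ToyMSHRQueue and its interface are invented, not the actual classic-cache code): misses to an already-outstanding line are queued as extra targets on that line's MSHR, and the core is only blocked when the MSHR queue itself is full.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Addr = uint64_t;

// Toy MSHR: one entry per outstanding line, with a target list of
// the core requests waiting on that line.
struct ToyMSHR {
    Addr lineAddr;
    std::vector<Addr> targets;  // request addresses queued on this line
};

class ToyMSHRQueue {
  public:
    explicit ToyMSHRQueue(std::size_t numEntries) : m_capacity(numEntries) {}

    // Returns false (block the core) only when a new line is needed and
    // all MSHRs are in use; an aliased miss just adds another target.
    bool allocate(Addr reqAddr) {
        Addr line = reqAddr & ~Addr(63);  // 64 B lines
        for (auto &mshr : m_mshrs) {
            if (mshr.lineAddr == line) {
                mshr.targets.push_back(reqAddr);
                return true;
            }
        }
        if (m_mshrs.size() == m_capacity)
            return false;  // structural hazard: MSHR queue full
        m_mshrs.push_back({line, {reqAddr}});
        return true;
    }

    std::size_t outstandingLines() const { return m_mshrs.size(); }

  private:
    std::size_t m_capacity;
    std::vector<ToyMSHR> m_mshrs;
};
```

The key point for this discussion is that the queuing, and the resource limit that bounds it, both live in the cache, not in a shim in front of it.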
Overall, I have significant complaints about this patch, because it introduces buffering into the sequencer, which has the effect of implementing MSHR queuing that should probably be in the caches to be consistent with classic. This breaks the sequencer's thin shim abstraction between cores and cache controllers, appears to break no-RFO L1 caches, and side-steps cache resource constraints. Detailed comments are below.
-
src/mem/ruby/system/Sequencer.hh (Diff revision 1) -
Minor: This declaration introduces yet another member variable naming convention (m_UpperCamel). Can you please change the name of this table so it at least matches some other variables? m_lowerCamel looks most common, and gem5's style guideline suggests it should be lowerCamel (i.e. without 'm_').
(I understand a lot of Ruby code is bad about this, but we can keep from making it worse)
-
src/mem/ruby/system/Sequencer.cc (Diff revision 1) -
Minor: Please follow the local_variable naming convention that is already used in this function. These should be aliased_it and aliased_it_end.
-
src/mem/ruby/system/Sequencer.cc (Diff revision 1) -
Minor: Ibid
-
src/mem/ruby/system/Sequencer.cc (Diff revision 1) -
Can you please clarify, since it hasn't been stated explicitly yet: Current L0/L1 cache controllers cannot handle multiple outstanding accesses to a single line? If so, can you elaborate on why? What breaks?
-
src/mem/ruby/system/Sequencer.cc (Diff revision 1) -
This is a symptom of never sending the aliased request to the controller (at which point its initialRequestTime would have been set appropriately). I feel that all requests to the Sequencer need to be sent through to the controller, which would fix this.
-
src/mem/ruby/system/Sequencer.cc (Diff revision 1) -
The Sequencer should not be handling stores in this manner. This effectively coalesces and completes concurrent stores to a single line (in zero time) without their data ever being sent to the cache. In other words this effectively hoists MSHRs either up from the top-level cache controller (or pulls them down from the LSQ?). By doing this, repeated stores are never subject to resource constraints like cache porting or available MSHRs in the controller, and cache/MSHR access stats are incorrect.
You need to send requests through to the cache controller.
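The stats problem can be illustrated with a toy model (all names here are invented; this is a sketch of the concern, not of either codebase): if the sequencer completes aliased stores itself, the controller only ever sees one access per line, so cache-access and MSHR stats undercount; if every request is forwarded, the counts are right.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

using Addr = uint64_t;

// Counts how many requests actually reach the cache controller.
struct ToyController {
    std::size_t accesses = 0;
    void handleRequest(Addr) { ++accesses; }
};

// Models the patch's behavior: an aliased store is completed in the
// sequencer in zero time and never reaches the controller.
class CoalescingSequencer {
  public:
    explicit CoalescingSequencer(ToyController &c) : m_ctrl(c) {}
    void store(Addr a) {
        Addr line = a & ~Addr(63);
        if (m_outstanding.count(line))
            return;  // "completed" without touching the cache
        m_outstanding.insert(line);
        m_ctrl.handleRequest(a);
    }
  private:
    ToyController &m_ctrl;
    std::unordered_set<Addr> m_outstanding;
};

// Models the behavior argued for here: every request goes through.
class ForwardingSequencer {
  public:
    explicit ForwardingSequencer(ToyController &c) : m_ctrl(c) {}
    void store(Addr a) { m_ctrl.handleRequest(a); }
  private:
    ToyController &m_ctrl;
};
```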
-
src/mem/ruby/system/Sequencer.cc (Diff revision 1) -
This breaks the no-RFO property of some L1s (and is particularly relevant for GPUs). For instance, this would fail with gem5-gpu's VI_hammer protocol, because the cache line is not returned on writes. You can't assume that the current data will be available when you've queued loads behind a store to a line. This is a major signal that MSHR queuing should not happen in the sequencer (which must be protocol-independent).
You need to send requests through to the cache controller.
-
src/mem/ruby/system/Sequencer.cc (Diff revision 1) -
There is a similar issue for decoalescing loads here in the Sequencer (also in zero time!). Specifically, by queuing accesses to a line in the Sequencer, the cache is never made aware of the data in the line that is needed for each of the queued accesses. This means that it does not know what portion of the line should be returned to the core, and this again dodges potential resource constraints like the amount of data that can be pulled from the line per cycle to respond to each outstanding access. Can the data from a cache line really be fanned out to multiple separate requests in the same cycle when queuing MSHRs (note: with strided cache access from O3 CPU, this could be 16, 32 or more concurrent accesses in a single cycle)?
You really need to send requests through to the cache controller.
Diff: Revision 3 (+307 -266)
Thanks for updating this and apologies for the delayed review.
Besides the comments below, this looks a lot more agreeable.
-
src/mem/ruby/system/Sequencer.cc (Diff revision 3) -
Can you adjust this comment to be consistent with the check (i.e. 'check for any outstanding requests to the line')?
-
src/mem/ruby/system/Sequencer.cc (Diff revision 3) -
I'm still not convinced that this should be allowed, since this assumes the L1 cache is RFO. My preference is that the read request be issued to the caches instead (mirroring the readCallback). If the caches are RFO, then issuing reads here should complete in the next cycle, so this shouldn't affect performance much.
Hi guys,
I'm not sure of the status of or plans for this patch, but I wanted to test it out and provide some feedback. I've merged and tested this with the current gem5 head (11061). First, there are a number of things broken with this patch, and if we're still interested in checking it in, it needs plenty of work. It took a fair amount of debugging before I was able to run with it.
Second, after merging, I microbenchmarked a few common memory access patterns. The O3 CPU with Ruby certainly performs better than older versions of gem5 (I was testing changeset 10238). It appears that prior to this patch, the O3 CPU has been modified to fix the memory access squashing caused by Ruby sequencer blocking (not sure which changes fixed that), so the execution time of the microbenchmarks is now comparable between Ruby and classic without this patch.
Further, I found that this patch actually introduces many confusing issues and can reduce performance by up to 60%. It was very difficult to track down why performance suffered: by coalescing requests in the sequencer, the number of cache accesses changes, so the first problem was figuring out what an 'appropriate' change in the number of cache accesses might be.

After coming to an acceptable conclusion on that, I then found that the sequencer's max_outstanding_requests needs to be configured appropriately to manage sequencer coalescing well. Specifically, if max_outstanding_requests is less than the LSQ depth, the sequencer won't block the LSQ from issuing accesses to the same line, but it will block when it is full. This reduces the MLP exposed to the caches compared to when the LSQ is blocked on outstanding lines and forced to expose accesses to separate lines.

Setting max_outstanding_requests greater than the LSQ depth fixes this, but this performance bug indicates that coalescing in the sequencer introduces more non-trivial configuration to get reasonable MLP.
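The MLP collapse can be sketched with a toy model (the function, the access pattern, and the numbers are all illustrative, not measurements from gem5): the LSQ issues in order, a coalescing sequencer admits requests until max_outstanding_requests is reached, and the MLP seen by the caches is the number of distinct lines in flight.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

using Addr = uint64_t;

// How many distinct 64 B lines reach the caches when the sequencer
// admits LSQ entries in order, up to a cap of maxOutstanding requests.
std::size_t distinctLinesInFlight(const std::vector<Addr> &lsq,
                                  std::size_t maxOutstanding)
{
    std::unordered_set<Addr> lines;
    std::size_t admitted = std::min(lsq.size(), maxOutstanding);
    for (std::size_t i = 0; i < admitted; ++i)
        lines.insert(lsq[i] & ~Addr(63));
    return lines.size();
}
```

With a small cap, the cap fills with aliased accesses to one line before later accesses to other lines can issue, which matches the need to set the cap above the LSQ depth.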
Overall, I feel that this patch needs to be dropped: It does not appear to be necessary for performance, and it actually introduces problems with performance and debugging due to the cache access effects.
My comment seems to have gotten lost in all the different threads going on...bad S/N. Anyways, here it is:
I am of the opinion that we should probably 1) do read/write combining in the core LSQ before sending out a packet, and 2) combine MSHR targets in the L1 before propagating a miss downwards. I am not sure why we would ever do it in the Sequencer. Am I missing something?
This solution would also translate very well between both Ruby and the classic memory system.
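Suggestion 1) could look something like this toy sketch (ToyStoreBuffer, ToyPacket, and the byte-mask encoding are invented for illustration; it also assumes stores do not cross a line boundary): stores to the same line are merged in the LSQ into a single outgoing packet with a combined byte-enable mask before anything is sent to memory.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Addr = uint64_t;

// Toy packet: one per 64 B line, with a byte-enable mask.
struct ToyPacket {
    Addr lineAddr;
    uint64_t byteMask = 0;  // bit i set => byte i of the line is written
};

// Toy LSQ write combining: stores to the same line are merged into one
// pending packet, so the memory system sees one request per line.
class ToyStoreBuffer {
  public:
    // Assumes addr+size stays within one 64 B line.
    void store(Addr addr, unsigned size) {
        Addr line = addr & ~Addr(63);
        unsigned offset = addr & 63;
        uint64_t mask =
            ((size < 64 ? (uint64_t(1) << size) : 0) - 1) << offset;
        for (auto &pkt : m_pending) {
            if (pkt.lineAddr == line) {
                pkt.byteMask |= mask;  // combine with the pending packet
                return;
            }
        }
        m_pending.push_back({line, mask});
    }

    const std::vector<ToyPacket> &pending() const { return m_pending; }

  private:
    std::vector<ToyPacket> m_pending;
};
```

Because the combining happens before a packet is formed, the same LSQ-side logic would work in front of either Ruby or the classic caches, which is the portability point above.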
