ruby: Fixed pipeline squashes caused by aliased requests
Review Request #2787 - Created May 11, 2015
| Information | |
|---|---|
| Submitter | Tony Gutierrez |
| Repository | gem5 |
| Branch | default |
| Reviewers (Groups) | Default |
Changeset 10887:1e05089bc991
---------------------------
ruby: Fixed pipeline squashes caused by aliased requests

This patch was created by Bihn Pham during his internship at AMD.

This patch fixes a very significant performance bug when using the O3 CPU model and Ruby. The issue was that Ruby returned false when it received a request to an address that already had an outstanding request, or when the memory was blocked. As a result, O3 unnecessarily squashed the pipeline and re-executed instructions.

This fix merges readRequestTable and writeRequestTable in the Sequencer into a single request table that keeps track of all requests and allows multiple outstanding requests to the same address. This prevents O3 from squashing the pipeline.
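To make the shape of the change concrete, here is a toy sketch of the single-table idea (ToySequencer, lineAddr, the 64-byte line size, and the return values are all invented for illustration; this is not the actual gem5 code): a multimap keyed by line address accepts aliased requests instead of reporting them back to the core, and a response completes every request queued on that line.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <unordered_map>

// Illustrative stand-ins only; not the real gem5 types.
using Addr = uint64_t;
enum class RequestStatus { Ready, Aliased };

struct SequencerRequest {
    Addr addr;
    bool isWrite;
};

// A toy sequencer with one table for both reads and writes. A multimap
// keyed by line address permits several outstanding requests to the
// same line, so an aliased request no longer bounces back to the core
// (which is what caused the O3 squashes).
class ToySequencer {
  public:
    RequestStatus insertRequest(const SequencerRequest &req) {
        m_requestTable.emplace(lineAddr(req.addr), req);
        return RequestStatus::Ready;  // never signal aliasing to the core
    }

    // On a memory response, complete every request queued on that line.
    std::size_t callback(Addr addr) {
        auto range = m_requestTable.equal_range(lineAddr(addr));
        std::size_t completed = std::distance(range.first, range.second);
        m_requestTable.erase(range.first, range.second);
        return completed;
    }

  private:
    static Addr lineAddr(Addr a) { return a & ~Addr(63); }  // 64 B lines

    std::unordered_multimap<Addr, SequencerRequest> m_requestTable;
};
```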
Issue Summary
From experience with the O3 CPU, this is a VERY important change for simulated CPU performance. I appreciate the effort to finally fix this.
It would be nice for the Ruby and gem5-classic memory hierarchies to provide the same access interface, but I think the consistency implications of this patch need to be discussed.
I'm worried that this patch seems likely to upset consistency models for cores that may have relied on Ruby to block aliased memory accesses. Specifically, if a core was blocking multiple outstanding accesses to a single line as a way to enforce consistency for data in that line (e.g., TSO), and those accesses can now be issued to Ruby concurrently, it seems it would now be the responsibility of the sequencer, and maybe even the coherence protocol, to ensure that those accesses remain ordered as required.
Given the behavior of the O3 CPU, perhaps the classic memory hierarchy allows multiple outstanding accesses to a single line. However, it handles transient coherence states with atomic coherence updates, which makes it much easier to guarantee access ordering to a single line, so it is not clear to me that it exposes the same interface as this patch provides.
Are you sure that all Ruby-working CPU cores and existing protocols still enforce correct consistency?
I have asked this question before, when Steve posted this patch several months ago. I am going to ask it again: Is it all right to buffer requests in the Sequencer? Do we know of CPU designs that do so? What problems do we face when we push requests for the same address through to the cache controllers?
-
src/mem/ruby/system/Sequencer.cc (Diff revision 1) -
Whitespace on this line looks too wide.
Diff: Revision 2 (+260 -269)
On 5/19 Nilay wrote: "I need some reference on this. I talked to Prof. Wood about it and he said that he is not aware of any CPUs that do this."
Is the claim that CPUs do not use complicated buffering to properly implement memory models? If so, I don't see how someone could agree with that statement. For instance, see the Butler et al. IEEE Micro paper from 2011: there are three different L/S units that can generate requests to the L1 I and D caches. Do you not think there is complicated buffering to support this?
On 5/19 Nilay also wrote: "Note that the L0/L1 controller would serve those requests from the same cache block. So requests would not be sent out beyond the L0/L1 controller. And as much as I understand the protocols currently in gem5, if we completely remove the aliasing support from sequencer, the L0/L1 controllers would either merge or block aliased requests."
I would be very careful with trying such a solution. Most SLICC protocols rely on stalling or recycling to deal with conflicting requests. A lot of stalls and recycles can significantly impact performance.
In general, we need to get past the expectation that contributors are going to completely re-implement their entire approach when posting a patch. What you are suggesting goes well beyond this patch.
I was buffering these comments, because this change has large implications and we still haven't discussed the differences between Ruby and classic memory interfaces. The patch submitter should be able to address questions like the ones I've raised so far (I don't feel any clarification has been provided yet).
Once again, I did more digging: It appears that classic caches completely model MSHR queuing and write buffering, and they block requests from the core when these are full. This indicates that they accept multiple concurrent accesses to a single line (it would be nice to get confirmation on that from a classic user). Thus, it makes sense to allow the Ruby sequencer to accept multiple outstanding accesses to a single line concurrently to be consistent with classic.
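For reference, the classic-cache behavior described above could be modeled roughly as follows (a toy sketch under my reading of the classic caches; ToyMSHRQueue and its interface are invented, not the actual classic-cache code): misses to an already-outstanding line are queued as extra targets on that line's MSHR, and the core is only blocked when the MSHR queue itself is full.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Addr = uint64_t;

// Toy MSHR: one entry per outstanding line, with a target list of
// the core requests waiting on that line.
struct ToyMSHR {
    Addr lineAddr;
    std::vector<Addr> targets;  // request addresses queued on this line
};

class ToyMSHRQueue {
  public:
    explicit ToyMSHRQueue(std::size_t numEntries) : m_capacity(numEntries) {}

    // Returns false (block the core) only when a new line is needed and
    // all MSHRs are in use; an aliased miss just adds another target.
    bool allocate(Addr reqAddr) {
        Addr line = reqAddr & ~Addr(63);  // 64 B lines
        for (auto &mshr : m_mshrs) {
            if (mshr.lineAddr == line) {
                mshr.targets.push_back(reqAddr);
                return true;
            }
        }
        if (m_mshrs.size() == m_capacity)
            return false;  // structural hazard: MSHR queue full
        m_mshrs.push_back({line, {reqAddr}});
        return true;
    }

    std::size_t outstandingLines() const { return m_mshrs.size(); }

  private:
    std::size_t m_capacity;
    std::vector<ToyMSHR> m_mshrs;
};
```

The key point for this discussion is that the queuing, and the resource limit that bounds it, both live in the cache, not in a shim in front of it.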
Overall, I have significant complaints about this patch, because it introduces buffering into the sequencer, which has the effect of implementing MSHR queuing that should probably be in the caches to be consistent with classic. This breaks the sequencer's thin shim abstraction between cores and cache controllers, appears to break no-RFO L1 caches, and side-steps cache resource constraints. Detailed comments are below.
-
src/mem/ruby/system/Sequencer.hh (Diff revision 1) -
Minor: This declaration introduces yet another member variable naming convention (m_UpperCamel). Can you please change the name of this table so it at least matches some other variables? m_lowerCamel looks most common, and gem5's style guideline suggests it should be lowerCamel (i.e. without 'm_').
(I understand a lot of Ruby code is bad about this, but we can keep from making it worse)
-
src/mem/ruby/system/Sequencer.cc (Diff revision 1) -
Minor: Please follow the local_variable naming convention that is already used in this function. These should be aliased_it and aliased_it_end.
-
src/mem/ruby/system/Sequencer.cc (Diff revision 1) -
Minor: Ibid
-
src/mem/ruby/system/Sequencer.cc (Diff revision 1) -
Can you please clarify, since it hasn't been stated explicitly yet: Current L0/L1 cache controllers cannot handle multiple outstanding accesses to a single line? If so, can you elaborate on why? What breaks?
-
src/mem/ruby/system/Sequencer.cc (Diff revision 1) -
This is a symptom of never sending the aliased request to the controller (at which point its initialRequestTime would have been set appropriately). I feel that all requests to the Sequencer need to be sent through to the controller, which would fix this.
-
src/mem/ruby/system/Sequencer.cc (Diff revision 1) -
The Sequencer should not be handling stores in this manner. This effectively coalesces and completes concurrent stores to a single line (in zero time) without their data ever being sent to the cache. In other words this effectively hoists MSHRs either up from the top-level cache controller (or pulls them down from the LSQ?). By doing this, repeated stores are never subject to resource constraints like cache porting or available MSHRs in the controller, and cache/MSHR access stats are incorrect.
You need to send requests through to the cache controller.
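The stats problem can be illustrated with a toy model (all names here are invented; this is a sketch of the concern, not of either codebase): if the sequencer completes aliased stores itself, the controller only ever sees one access per line, so cache-access and MSHR stats undercount; if every request is forwarded, the counts are right.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

using Addr = uint64_t;

// Counts how many requests actually reach the cache controller.
struct ToyController {
    std::size_t accesses = 0;
    void handleRequest(Addr) { ++accesses; }
};

// Models the patch's behavior: an aliased store is completed in the
// sequencer in zero time and never reaches the controller.
class CoalescingSequencer {
  public:
    explicit CoalescingSequencer(ToyController &c) : m_ctrl(c) {}
    void store(Addr a) {
        Addr line = a & ~Addr(63);
        if (m_outstanding.count(line))
            return;  // "completed" without touching the cache
        m_outstanding.insert(line);
        m_ctrl.handleRequest(a);
    }
  private:
    ToyController &m_ctrl;
    std::unordered_set<Addr> m_outstanding;
};

// Models the behavior argued for here: every request goes through.
class ForwardingSequencer {
  public:
    explicit ForwardingSequencer(ToyController &c) : m_ctrl(c) {}
    void store(Addr a) { m_ctrl.handleRequest(a); }
  private:
    ToyController &m_ctrl;
};
```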
-
src/mem/ruby/system/Sequencer.cc (Diff revision 1) -
This breaks the no-RFO property of some L1s (and is particularly relevant for GPUs). For instance, this would fail with gem5-gpu's VI_hammer protocol, because the cache line is not returned on writes. You can't assume that the current data will be available when you've queued loads behind a store to a line. This is a major signal that MSHR queuing should not happen in the sequencer (which must be protocol-independent).
You need to send requests through to the cache controller.
-
src/mem/ruby/system/Sequencer.cc (Diff revision 1) -
There is a similar issue for decoalescing loads here in the Sequencer (also in zero time!). Specifically, by queuing accesses to a line in the Sequencer, the cache is never made aware of the data in the line that is needed for each of the queued accesses. This means that it does not know what portion of the line should be returned to the core, and this again dodges potential resource constraints like the amount of data that can be pulled from the line per cycle to respond to each outstanding access. Can the data from a cache line really be fanned out to multiple separate requests in the same cycle when queuing MSHRs (note: with strided cache access from O3 CPU, this could be 16, 32 or more concurrent accesses in a single cycle)?
You really need to send requests through to the cache controller.
Diff: Revision 3 (+307 -266)
Thanks for updating this and apologies for the delayed review.
Besides the comments below, this looks a lot more agreeable.
-
src/mem/ruby/system/Sequencer.cc (Diff revision 3) -
Can you adjust this comment to be consistent with the check (i.e. 'check for any outstanding requests to the line')?
-
src/mem/ruby/system/Sequencer.cc (Diff revision 3) -
I'm still not convinced that this should be allowed, since this assumes the L1 cache is RFO. My preference is that the read request be issued to the caches instead (mirroring the readCallback). If the caches are RFO, then issuing reads here should complete in the next cycle, so this shouldn't affect performance much.
Hi guys,
I'm not sure of the status of or plans for this patch, but I wanted to test it out and provide some feedback. I've merged and tested this with the current gem5 head (11061). First, there are a number of things broken with this patch, and if we're still interested in checking it in, it needs plenty of work. It took a fair amount of debugging before I was able to run with it.
Second, after merging, I microbenchmarked a few common memory access patterns. The O3 CPU with Ruby certainly performs better than older versions of gem5 (I was testing changeset 10238). It appears that prior to this patch, the O3 CPU has been modified to fix the memory access squashing caused by Ruby sequencer blocking (not sure which changes fixed that), so the execution time of the microbenchmarks is now comparable between Ruby and classic without this patch.
Further, I found that this patch actually introduces many confusing issues and can reduce performance by up to 60%. It was very difficult to track down why performance suffered: by coalescing requests in the sequencer, the number of cache accesses changes, so the first problem was figuring out what an 'appropriate' change in the number of cache accesses might be.

After coming to an acceptable conclusion on that, I then found that the sequencer's max_outstanding_requests needs to be configured appropriately to manage sequencer coalescing well. Specifically, if max_outstanding_requests is less than the LSQ depth, the sequencer won't block the LSQ from issuing accesses to the same line, but it will block when it is full. This reduces the MLP exposed to the caches compared to when the LSQ is blocked on outstanding lines and forced to expose accesses to separate lines.

Setting max_outstanding_requests greater than the LSQ depth fixes this, but this performance bug indicates that coalescing in the sequencer introduces more non-trivial configuration to get reasonable MLP.
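The MLP collapse can be sketched with a toy model (the function, the access pattern, and the numbers are all illustrative, not measurements from gem5): the LSQ issues in order, a coalescing sequencer admits requests until max_outstanding_requests is reached, and the MLP seen by the caches is the number of distinct lines in flight.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

using Addr = uint64_t;

// How many distinct 64 B lines reach the caches when the sequencer
// admits LSQ entries in order, up to a cap of maxOutstanding requests.
std::size_t distinctLinesInFlight(const std::vector<Addr> &lsq,
                                  std::size_t maxOutstanding)
{
    std::unordered_set<Addr> lines;
    std::size_t admitted = std::min(lsq.size(), maxOutstanding);
    for (std::size_t i = 0; i < admitted; ++i)
        lines.insert(lsq[i] & ~Addr(63));
    return lines.size();
}
```

With a small cap, the cap fills with aliased accesses to one line before later accesses to other lines can issue, which matches the need to set the cap above the LSQ depth.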
Overall, I feel that this patch needs to be dropped: It does not appear to be necessary for performance, and it actually introduces problems with performance and debugging due to the cache access effects.
My comment seems to have gotten lost in all the different threads going on...bad S/N. Anyways, here it is:
I am of the opinion that we should probably 1) do read/write combining in the core LSQ before sending out a packet, and 2) combine MSHR targets in the L1 before propagating a miss downwards. I am not sure why we would ever do it in the Sequencer. Am I missing something?
This solution would also translate very well between both Ruby and the classic memory system.
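Suggestion 1) could look something like this toy sketch (ToyStoreBuffer, ToyPacket, and the byte-mask encoding are invented for illustration; it also assumes stores do not cross a line boundary): stores to the same line are merged in the LSQ into a single outgoing packet with a combined byte-enable mask before anything is sent to memory.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Addr = uint64_t;

// Toy packet: one per 64 B line, with a byte-enable mask.
struct ToyPacket {
    Addr lineAddr;
    uint64_t byteMask = 0;  // bit i set => byte i of the line is written
};

// Toy LSQ write combining: stores to the same line are merged into one
// pending packet, so the memory system sees one request per line.
class ToyStoreBuffer {
  public:
    // Assumes addr+size stays within one 64 B line.
    void store(Addr addr, unsigned size) {
        Addr line = addr & ~Addr(63);
        unsigned offset = addr & 63;
        uint64_t mask =
            ((size < 64 ? (uint64_t(1) << size) : 0) - 1) << offset;
        for (auto &pkt : m_pending) {
            if (pkt.lineAddr == line) {
                pkt.byteMask |= mask;  // combine with the pending packet
                return;
            }
        }
        m_pending.push_back({line, mask});
    }

    const std::vector<ToyPacket> &pending() const { return m_pending; }

  private:
    std::vector<ToyPacket> m_pending;
};
```

Because the combining happens before a packet is formed, the same LSQ-side logic would work in front of either Ruby or the classic caches, which is the portability point above.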
