Review Board 2.0.15


ruby: Fixed pipeline squashes caused by aliased requests

Review Request #2787 - Created May 11, 2015 and updated

Information
Submitter: Tony Gutierrez
Repository: gem5
Branch: default
Reviewers: Default
Changeset 10887:1e05089bc991
---------------------------
ruby: Fixed pipeline squashes caused by aliased requests

This patch was created by Bihn Pham during his internship at AMD.

This patch fixes a very significant performance bug when using the O3 CPU model
and Ruby. The issue was that Ruby returned false when it received a request to the
same address that already has an outstanding request, or when the memory is
blocked. As a result, O3 unnecessarily squashed the pipeline and re-executed
instructions. This fix merges readRequestTable and writeRequestTable in the
Sequencer into a single request table that keeps track of all requests and
allows multiple outstanding requests to the same address. This prevents O3
from squashing the pipeline.

   

Issue Summary

Issues: 11 total, 11 open, 0 resolved, 0 dropped
Description | From | Last Updated | Status
Whitespace on this line looks too wide. Joel Hestness May 21, 2015, 12:52 p.m. Open
Minor: This declaration introduces yet another member variable naming convention (m_UpperCamel). Can you please change the name of this table ... Joel Hestness May 29, 2015, 8:01 a.m. Open
Minor: Please follow the local_variable naming convention that is already used in this function. These should be aliased_it and aliased_it_end Joel Hestness May 29, 2015, 8:01 a.m. Open
Minor: Ibid Joel Hestness May 29, 2015, 8:01 a.m. Open
Can you please clarify, since it hasn't been stated explicitly yet: Current L0/L1 cache controllers cannot handle multiple outstanding accesses ... Joel Hestness May 29, 2015, 8:01 a.m. Open
This is a symptom of never sending the aliased request to the controller (at which point it's initialRequestTime would have ... Joel Hestness May 29, 2015, 8:01 a.m. Open
The Sequencer should not be handling stores in this manner. This effectively coalesces and completes concurrent stores to a single ... Joel Hestness May 29, 2015, 8:01 a.m. Open
This breaks the no-RFO property of some L1s (and is particularly relevant for GPUs). For instance, this would fail with ... Joel Hestness May 29, 2015, 8:01 a.m. Open
There is a similar issue for decoalescing loads here in the Sequencer (also in zero time!). Specifically, by queuing accesses ... Joel Hestness May 29, 2015, 8:01 a.m. Open
Can you adjust this comment to be consistent with the check (i.e. 'check for any outstanding requests to the line')? Joel Hestness July 31, 2015, 4:13 p.m. Open
I'm still not convinced that this should be allowed, since this assumes the L1 cache is RFO. My preference is ... Joel Hestness July 31, 2015, 4:13 p.m. Open
Review request changed
Updated (July 22, 2015, 11:15 a.m.)

Description:

Changeset updated from 10817:d010e6a8e783 to 10887:1e05089bc991

Diff:

Revision 3 (+307 -266)


Posted (July 31, 2015, 4:13 p.m.)

Thanks for updating this and apologies for the delayed review.

Besides the comments below, this looks a lot more agreeable.

  1. We are just hours away from checking our code in, so we won't be able to address your comments. We did put a lot of effort into adding the coalesce_reqs flag as you requested in your last review.

src/mem/ruby/system/Sequencer.cc (Diff revision 3)

Can you adjust this comment to be consistent with the check (i.e. 'check for any outstanding requests to the line')?

  1. We'll have to get it next time we check in patches. Or you're welcome to update yourself.

src/mem/ruby/system/Sequencer.cc (Diff revision 3)

I'm still not convinced that this should be allowed, since this assumes the L1 cache is RFO. My preference is that the read request be issued to the caches instead (mirroring the readCallback). If the caches are RFO, then issuing reads here should complete in the next cycle, so this shouldn't affect performance much.

  1. There is a lot of code in the Sequencer that assumes RfO (for example, the blocking of calls to support RMW).

  2. This is not cool. First, I added this request in my original review here, so it should have been addressed previously.

    Second, your reasoning here for not addressing my request doesn't hold water: RMW operations are atomics that require ownership prior to performing the operation. Otherwise, the results would be nonsense. Even if there is other code in the Sequencer that assumes RFO, we don't need to perpetuate these assumptions.

Posted (Aug. 30, 2015, 11:01 a.m.)

Hi guys,

I'm not sure of the status of or plans for this patch, but I wanted to test it out and provide some feedback. I've merged and tested this with the current gem5 head (11061). First, there are a number of things broken with this patch, and if we're still interested in checking it in, it needs plenty of work. It took a fair amount of debugging before I was able to run with it.

Second, after merging, I microbenchmarked a few common memory access patterns. The O3 CPU with Ruby certainly performs better than older versions of gem5 (I was testing changeset 10238). It appears that prior to this patch, the O3 CPU has been modified to fix the memory access squashing caused by Ruby sequencer blocking (not sure which changes fixed that), so the execution time of the microbenchmarks is now comparable between Ruby and classic without this patch.

Further, I found this patch actually introduces many confusing issues and can reduce performance by up to 60%. It was very difficult to track down why performance suffered: By coalescing requests in the sequencer, the number of cache accesses changes, so the first problem was figuring out what an 'appropriate' change in number of cache accesses might be. After coming to an ok conclusion on that, I then found that the sequencer max_outstanding_requests needs to be configured appropriately to manage sequencer coalescing well. Specifically, if the max_outstanding_requests is less than the LSQ depth, the sequencer won't block the LSQ from issuing accesses to the same line, but will block when it is full. This reduces the MLP exposed to the caches compared to when the LSQ is blocked on outstanding lines and forced to expose accesses to separate lines. Setting the max_outstanding_requests greater than the LSQ depth fixes this, but this performance bug indicates that the coalescing in the sequencer introduces more non-trivial configuration to get reasonable MLP.
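The configuration interplay described above boils down to a simple invariant (the function name below is hypothetical, not an actual gem5 parameter check): whenever the sequencer's limit is below the LSQ depth, the sequencer fills and blocks before the LSQ does, hiding MLP from the caches, so max_outstanding_requests should be at least the LSQ depth.

```cpp
// Illustrative check of the configuration pitfall described above.
// If max_outstanding_requests < LSQ depth, the sequencer blocks first
// and the caches see less MLP than the LSQ could otherwise expose.
inline bool sequencerHidesMLP(int maxOutstandingRequests, int lsqDepth) {
    return maxOutstandingRequests < lsqDepth;
}
```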

Overall, I feel that this patch needs to be dropped: It does not appear to be necessary for performance, and it actually introduces problems with performance and debugging due to the cache access effects.

  1. We still hope to check it in. The last time we ran our tests, this patch proved to be very important for O3 performance. Even if the squashing behavior is fixed, I think we saw benefits from avoiding unnecessary mandatory queue latency for aliased requests. We will rerun our tests to confirm.

    I'm surprised you encountered issues with the latest version of the patch. I personally put a lot of effort into adding the disable feature that you requested and I tested the enable/disable bit with the CPU models and Ruby. It worked for me.

    The issue you point out with max_outstanding_requests seems to be a problem with its prior default value. Why do you think increasing the value is a "non-trivial configuration"? Shouldn't we have max_outstanding_requests equal to the LSQ size?

  2. I am actually in favor of the idea of coalescing requests. This is something that the classic memory system does as well.
    I have also read papers on MSHRs maintaining primary and secondary requests. I would like us to ultimately
    merge this patch. I think we should split this patch into two parts:

    a. The first patch will combine the read and the write request tables maintained by the sequencer. I think Joel
    has no objections to that part, so it should be possible to merge that patch at any time.

    b. the second part will provide each entry in the combined request table with a list of packets instead of a
    single packet. This part of the patch is what we need to discuss.

    -- I think we should drop the max outstanding requests variable altogether. Since both of you agree
    that the LSQ size should decide how many requests should be pending, there is no need for the sequencer to
    maintain its own limit.

    -- We also need to decide on how the statistical variables are updated. With this patch, the requests buffered
    in the sequencer will not update the stats maintained by the cache controller. This becomes a problem
    when one tries to compare the loads and stores issued by the cpu and the loads and stores seen by the cache
    controller.
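Part (b) might look roughly like the following sketch (the types and names are hypothetical, not code from any of the patches under review): each entry in the combined table keeps a queue of packets; the front packet is the primary request, and when it completes, the next secondary becomes the new primary and would be reissued to the cache controller rather than being fulfilled in one go by the sequencer.

```cpp
#include <cstdint>
#include <deque>

// Hypothetical sketch of a per-line entry holding a list of packets.
struct Packet {
    uint64_t addr;
    bool isWrite;
};

struct LineEntry {
    std::deque<Packet> packets;  // front() is the primary request

    void add(const Packet &pkt) { packets.push_back(pkt); }

    // Called when the primary completes: drop it and report whether a
    // former secondary is now the primary and must be (re)issued to
    // the cache controller.
    bool completePrimary() {
        packets.pop_front();
        return !packets.empty();
    }
};
```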

  3. I appreciate the concerns with statistics. At one point, this patch actually included statistics for merged requests so that one could do the math to verify that no requests were missing. Those stats were removed when we dropped ruby.stats and they were never added back. Would you be ok with adding counts in the Sequencer/RubyPort rather than trying to update the controller stats?

    I will try to add those stats, but can we keep this patch as one rather than split it up? It has been a real pain to maintain and there are a lot of interdependent parts to it. Splitting it up would essentially require re-implementing it. We'll be lucky if we can find someone who has the time to add the stats as you requested.

  4. From Brad: I'm surprised you encountered issues with the latest version of the patch. I personally put a lot of effort into adding the disable feature that you requested and I tested the enable/disable bit with the CPU models and Ruby. It worked for me.

    Most problems with this patch as-is are merging issues with the current head, where a handful of sequencer things have changed (e.g. LLSC and L1/sequencer latency). Other problems include list iteration, dead variables that should be stripped out, and the coalesce disable feature. I could open these as issues for revision, but there are already a ton of open issues on this patch.

    From Brad: The issue you point out with max_outstanding_requests seems to be a problem with its prior default value. Why do you think increasing the value is a "non-trivial configuration"? Shouldn't we have max_outstanding_requests equal to the LSQ size?

    The configuration is non-trivial in that there is nothing connecting these parameters (LSQ depth with sequencer max_outstanding_requests), and performance debugging this issue requires substantial understanding of both the O3 CPU and Ruby. It took me roughly 2 hours of digging to understand it, and I know both of these pieces of code.

    From Nilay: I think we should drop the max outstanding request variable all together. As both of you agree the LSQ size should decide how many requests should be pending, there is no need for the sequencer to maintain its own limit.

    I disagree with removing max_outstanding_requests: Currently there are a number of places in Ruby without flow control, so removing max_outstanding_requests could bloat various unbounded buffers and cause plenty of other problems.

    While Nilay accurately captures my feelings about the specific changes, my overall opinion is that this patch introduces too many hard questions and too much uncertainty about performance correctness due to the CPU<->sequencer interface change. I apologize for being blunt, but given that AMD may be too busy to add stats (i.e. just a few lines of code), I don't trust that this patch will be adequately performance validated before commit. Further, I don't feel it should be the community's responsibility to fix any issues that it may introduce.

    If we're going to decide to push these changes, can we at least figure out a way for the patch to be shepherded to commit (e.g. someone volunteer to pick up the ball and carry it through adequate revision and testing)?

  5. I have posted a set of five patches that break this patch up. The last patch is different
    from what AMD is doing in this patch. Instead of fulfilling all the secondary requests in one
    go, the first secondary request (and hence now the primary request) is issued to the
    cache controller after the original primary request has completed.

    On max_outstanding_requests: I think we should not do flow control in the Sequencer. Instead,
    we should abort() when the number of requests in the sequencer exceeds max_outstanding_requests.
    This would force protocols / networks / testers / cores to do flow control.
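The fail-fast policy suggested above could be as simple as the following sketch (illustrative, not gem5 code; the function name is hypothetical). A caller would invoke abort(), or gem5's fatal(), when the check fails, so that protocols, networks, testers, and cores are forced to implement their own flow control:

```cpp
// Illustrative sketch of treating sequencer overflow as a modeling
// error instead of throttling in the sequencer. Returns false when
// the upstream component failed to apply flow control, at which point
// the caller would abort().
inline bool withinOutstandingLimit(int numOutstanding, int maxOutstanding) {
    return numOutstanding <= maxOutstanding;
}
```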

  6. Joel, I thought we took care of the prior open issues with this patch. I understand your concern with merging a patch that touches a lot of performance critical code. We've had too many of those types of merges as well. I feel like I've personally put a lot of effort into trying to address your concerns. So far, once I address your concerns, you seem to just have more concerns. Please list out your current "hard questions" and "open issues". I will try again to address them as much as I can. I just ask you not to add more items to the list, and please keep in mind that we have very specific, cycle-by-cycle tests that compare the Ruby+O3 interface to real hardware and strongly justify this patch.

    Nilay, you mention 5 patches, but I only noticed 3 patches that seem to be part of this patch: http://reviews.gem5.org/r/3096/, http://reviews.gem5.org/r/3098/, http://reviews.gem5.org/r/3099/. What are the other 2 patches? Do they implement the disable coalescing feature?

    I believe that an important part of this patch is ensuring that merged requests are completed in back-to-back cycles. If we require that all requests must go across the mandatory queue to the L1 controller, I believe that adds unnecessary delay. We will rerun our tests to confirm.

    We absolutely should not force protocols to implement flow control. In particular, the tester is designed to stress the system in ways that are not practical with realistic flow control. Flow control is and should be an option for those who want more accuracy.

  7. Nilay, you mention 5 patches, but I only noticed 3 patches that seem to be part of this patch: http://reviews.gem5.org/r/3096/,
    http://reviews.gem5.org/r/3098/, http://reviews.gem5.org/r/3099/. What are the other 2 patches? Do they implement the disable coalescing
    feature?

    394 and 3095. 3099 implements the disable feature. But now I think we should not have the disable feature at all, since the controller is going to receive all the requests.

    I believe that an important part of this patch is ensuring that merged requests are completed in back-to-back cycles. If we require that all requests must go across the mandatory queue to the L1 controller, I believe that adds unnecessary delay. We will rerun our tests to confirm.

    The main problem, as I understand, is that aliased requests were nacked at the sequencer.
    This would not happen any longer. So you should not see pipeline squashes. I do not think we can
    assume that the l1 controller would be ok with not seeing some of the requests. So I am not in favor
    of fulfilling multiple requests in one go from the sequencer. I would be fine with fulfilling
    multiple requests together in the l1 controller.

  8. Hi guys. Can we organize a bit here? There are a lot of loose ends in this review, so it's really difficult to know how to help move this forward.

    @Nilay: I appreciate splitting this patch into orthogonal parts, since I feel it would move things along. Unfortunately, though, it's not clear whether we should continue reviewing this patch or your set. One tricky thing about reviewing your patches is that we may end up needing you to do performance validation unless Brad/AMD is willing to take your patches and merge them into their queue. Can you state your preferences on your involvement in these changes? Which patches do you prefer we review?

    @Brad: I'm sorry I do not yet feel comfortable with this patch. I was under the impression that this patch was going to change O3 CPU pipeline squashing behavior, but my tests show that current mainline O3 squashing behavior is far better than older code (e.g. chgset 10238). My microbenchmarking shows that the performance differences between this patch and current mainline appear to be trivial (when the sim'd system is appropriately configured). I'd prefer to know how this change is being validated.

    Beyond performance validation, Nilay's new patches make it difficult to know what we should be reviewing. I would assume that your preference would be to finish fixing up this patch and you/AMD validates it, rather than trying to mix in Nilay's patches, which may be hard to merge. Can you let us know your preference for how to proceed with these, so we can know where to spend effort reviewing? If you'd prefer continuing with this review over Nilay's, can you please close/drop issues that you feel have been resolved?

Posted (Sept. 12, 2015, 6:54 a.m.)

My comment seems to have gotten lost in all the different threads going on... bad S/N. Anyway, here it is:

I am of the opinion that we should probably 1) do read/write combining in the core LSQ before sending out a packet, and 2) do combining of MSHR targets in the L1 before propagating a miss downwards. I am not sure why we would ever do it in the Sequencer. Am I missing something?

This solution would also translate very well between both Ruby and the classic memory system.

  1. Hi Andreas. I think we all agree with you about where coalescing should happen. It appears that (1) is available from particular cores (e.g. the O3 CPU). The problem currently is that getting (2) in Ruby would require very non-trivial modification to the L1 controllers (to do combining) in each of the existing protocols (7 in gem5 + at least 2 not yet in gem5). To avoid all this protocol modification, this patch is AMD's proposal to do L1 MSHR-like combining within the sequencer. This proposed solution should be viewed as a stopgap on the road to MSHR combining in the L1 controllers.

  2. I see. Thanks for the clarification.

    I am fairly convinced the O3 CPU is not doing any coalescing at the moment. Are you sure? If not, I'd say that's the place to start. Coalescing in the LSQ is probably as important as coalescing of MSHR targets, if not more so.

  3. Sorry. Inspecting the O3 LSQ code, I believe you are correct that it does not do any coalescing.