Review Board 2.0.15

ruby: fix MESI_Three_Level protocol

Review Request #3441 - Created April 4, 2016 and updated

Brandon Potter
Changeset 11443:34fcc3f9f957
ruby: fix MESI_Three_Level protocol


Issue Summary

1 1 0 0
Posted (April 13, 2016, 11:17 a.m.)

Before this patch this protocol was broken? From the changes, it doesn't even look like it would have compiled. Is this true? In one way I'm shocked this happend. In another I'm not surprised at all... :(

  1. It doesn't make it through the slicc-C++ code generation in its current state. "Exception: Error: Duplicate machine name: MachineType:L0Cache:"

    The public regression tester really needs to be updated to cover all of the supported protocols.

  2. Would you be willing to add a regression for this? If not, I'll look into it, maybe after this burst of activity dies down.

  3. I want to remain noncommittal on it, if that's OK for now?

    Definitely, it needs to be done and I'd like to take a crack at it. However, I need finish tasks in my work queue first. Multi-instance regressions need to be added too to ensure that no one inadvertently breaks this current work. If you do not mind waiting for a few weeks, I can probably add the MESI_Three_Level regression at the same time that I add in the multi-instance regressions.

Posted (April 13, 2016, 12:08 p.m.)

I'm not clear how this change fixes the exception noted above. Can you elaborate?

Also, the diff here doesn't apply cleanly to the current mainline head. Is there another patch that needs to be applied before this?

  1. This change fixes the L0Cache enumeration problem mentioned above:

    All of the patches are posted to ReviewBoard. This review request has a description of the patches as they're applied in my repository (applied in-order top to bottom):

  2. Maybe you could split this out into a couple of patches. One to fix the current bug and one to add in the new features? Or is part of the bugfix?

  3. The patch is the bugfix that allows MESI_Three_Level to compile. Below, both Joel and I confirm that 3440 will allow MESI_Three_Level to compile and run correctly. Sorry for the confusion, I should have made it clearer in the changeset message for 3440 that 3440 fixed the build problem.

    This patch (which is not 3440 :P) is required because of either 3433, 3436, or 3437. I post below with a complete explanation of why that's the case and subsequently add that reasoning to the changeset message for this patch.

configs/ruby/ (Diff revision 1)

This change connects the L0 caches to the network rather than connecting them directly to the L1 caches. Since both are private to the connected core, it seems superfluous to run their activity through the network; They only communicate with each other through these MessageBuffers.

Is there a bigger need for this that I'm not seeing?

  1. The simple network works fine as it is without this change. However, running Garnet causes a problem because the network wants the message buffers associated with the L0 controllers. It requires the message buffer during GarnetNetwork_d::init(), otherwise it fails out on the following assert:

    gem5.opt: build/MESI_Three_Level/mem/ruby/network/garnet/fixed-pipeline/ virtual void GarnetNetwork_d::init(): Assertion `it != m_toNetQueues.end()\' failed.

    We exposed the controller to the network to appease Garnet. There might be another way to solve this problem, but this is my solution. You said that it was "superfluous" so I'm gathering that it doesn't affect correctness, right?

    If you fix the enumeration problem, you can probably confirm that Garnet does not work on the tip of the public repo. I do not think that my other changes affect it (although I am not %100 sure; were quite a few changes in my patches).

  2. Follow up, the Garnet issue is a result of my patches. It's related to something done in the patches which precede it (probably the component mapping patches). If you apply L0Cache-enumeration patch on the public repository by itself, then Garnet works just fine.

    I'll have to figure out what it is in the component mapping changes (or a different patch) that required this due to the message buffer issue.

  3. I take responsibility for this change. I am the one that suggested it to Brandon. It seemed far easier and elegant than hacking a customized fix in Garnet for this scenario. There is precedent in Ruby to have private L1 to private L2 connections go through the network rather than use direct messagebuffers. I'm not sure if any current gem5 protocols do this prior to Brandon's patch, but there certainly were several GEMS protocols that did so.

  4. So, I'm not sure I understand this completely. I think I see how this change is just to hack around a problem with the interface between Ruby controllers and Garnet...

    Will all network models know that these L0 and L1 caches are localized together and just move messages from one buffer to the other, or is it possible that communication between the caches might traverse the network (e.g. multiple hops, depending on the network topology)? If it's the latter, I feel that this is an inappropriate solution, because it's not modeling real system behavior.

  5. I just applied , compiled MESI_Three_Level, and ran with Garnet fixed and flexible networks. I'm not able to trigger the toNetQueues assertion failure you list above. Can you please provide a command line that fails?