Review Board 2.0.15

ruby: pass in block size to ENTRY objects with block size

Review Request #3428 - Created April 4, 2016

Brandon Potter
Changeset 11430:ad5965016b8e
ruby: pass in block size to ENTRY objects with block size

Posted (April 6, 2016, 2:46 p.m.)

For the classic memory system, all caches before the point of coherency have the same line size, and it is a property of the system. Is this change departing from this assumption for Ruby? What is the bigger picture here?

  1. TL;DR: Never use global variables, static class members, or static variable declarations in function bodies. If you do, someone will come knocking on your office door in the future; it might be me.

    The bigger picture is the desire to run two or more "systems" inside the same gem5 process; ideally the systems could be independent from one another, i.e., completely different coherence protocols, topologies, system block sizes, etc. (Without Ruby, it was already possible to run multiple simulated systems in the same process. We only realized that Ruby had this problem when we started trying to enable it in our simulations.) Being able to run two simulated systems together in the same process is important if you want them to communicate with one another through mutually shared objects. I am aware of the dist-gem5 changeset (1640dd68b0a4); it is a nice addition and could supplant this whole effort by allowing us to create two separate gem5 processes, each with a single system invocation, and then communicate through the dist-gem5 interface. (Some folks here at AMD already use dist-gem5 for their research.) However, there's no reason why we shouldn't also be able to achieve a multi-instance run without having to create different processes. With a single process, we could theoretically take advantage of shared memory with multi-threading (on the host), with different real hardware threads servicing gem5 event queues. (Preliminary work on multi-threading and multiple event queues has already been done and checked in. We have internal patches that we would eventually like to push to the public tree to extend it.) Ideally, dist-gem5 and this work could even be complementary. We might run a multi-threaded gem5 process with multiple simulated systems coexisting in the process, and then scale that out to multiple processes on different host nodes which communicate with one another through dist-gem5; in this way, we could harness a cluster and use the threading within nodes.

    Ruby problems: The issues are related to global variables and static class members. Since these variables are visible to the entire gem5 process, they are shared by each independent simulated system even though that's not what we want. For example, the deal breaker (which caused the initial failed assertions) was how the component mappings were determined after the Python configuration. There were global counters tracking how many times the controller classes had been instantiated, in order to figure out how many controllers (cache, directory, etc.) were being created. This information was used to determine how to route messages between the controllers. With multiple simulated systems being created, the extra class instantiations were throwing things off and tripping assertions, because the code was seeing too many controller registrations.
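    A minimal sketch of the problem described above and the shape of the fix (the names here are hypothetical, not the actual gem5 classes): a static counter is shared by every system in the process, whereas moving the registration state into a per-system object keeps each system's count correct.

    ```cpp
    #include <cassert>

    // Anti-pattern: a process-wide static counter. Every simulated
    // system increments the same counter, so a second system inflates
    // the first one's controller count.
    struct ControllerStatic {
        static int s_count;  // shared across ALL systems in the process
        ControllerStatic() { ++s_count; }
    };
    int ControllerStatic::s_count = 0;

    // Fix sketch: registration state lives in a per-system object that
    // each controller is handed explicitly at construction.
    struct SystemRegistry {
        int count = 0;
        void registerController() { ++count; }
    };

    struct ControllerScoped {
        explicit ControllerScoped(SystemRegistry &reg)
        {
            reg.registerController();  // counts only within this system
        }
    };
    ```

    With two registries, each system sees only its own controllers; the static version would have counted all of them together.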

    Block size: I agree with what you're saying that it should be the same within a system; the alternative does not make sense to me. (I'm going to subject you to a bit of Ruby background to try to explain why the block size changes look the way that they do.) There are System objects and RubySystem objects, and there is a 1:1 correlation between the two. Essentially, they both serve as an aggregation point for all system-wide attributes. The only thing that's special about the RubySystem object is that it contains Ruby-specific parameters. So you might ask, "Why not place the block size (or other attributes) in RubySystem and access it (them) indirectly through that object?" I tried that with a set of patches last year; it was the most obvious initial solution. The patches were not well received -- partly because of the code organization/quality, partly because of Ruby-specific eccentricities, and partly because of the following conundrum: the block size is needed by everything. It is needed by controller objects that are created independently in Python configuration scripts, by dynamically created entry objects that exist independently and are owned by those controllers, and by messages; reaching all of them would mean passing a parameter through deeply nested function parameter lists. Should all of those objects be exposed to a centralized object like RubySystem?
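    The two options being weighed can be sketched like this (hypothetical names, not the real gem5 classes): a controller either keeps a handle to a central RubySystem-style object and asks it for the block size, or simply receives the block size directly at construction.

    ```cpp
    #include <cassert>

    // Stand-in for a central per-system aggregation object.
    struct RubySystem {
        int blockSize;
    };

    // Option A: indirect access through a RubySystem handle. Every
    // lookup goes through an extra pointer dereference, and the handle
    // must be threaded through to every object that needs it.
    struct ControllerIndirect {
        RubySystem *rs;
        explicit ControllerIndirect(RubySystem *s) : rs(s) {}
        int blockSize() const { return rs->blockSize; }
    };

    // Option B (this patch series' direction): pass the block size in
    // directly; the controller keeps its own copy.
    struct ControllerDirect {
        int m_blockSize;
        explicit ControllerDirect(int blockSize) : m_blockSize(blockSize) {}
        int blockSize() const { return m_blockSize; }
    };
    ```

    Both give each system its own block size; the trade-off is the extra dereference and coupling of Option A versus the wider constructor signatures of Option B.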

    Why is it necessary to pass the block size around like this? I still don't get it. Type some more!: It is necessary because the controllers call functions that rely on knowing the block size. (For instance, the makeLineAddress function needs to know the block size to figure out the base cache line address; makeLineAddress is used everywhere.) Each controller needs either a RubySystem handle to access it indirectly or a block size provided directly as a parameter. You might say, "Well, the indirect access still doesn't seem like a big deal with the controllers!" Correct. RubySystem objects are accessed indirectly through several "top level" objects already; that's the point. The problem is that passing the RubySystem pointer around to the other objects that I mentioned is nasty (and Nilay claims inefficient, due to the extra dereferences on so many of the accesses). The entries are an incredible pain to deal with; they are templated types that may or may not have DataBlks or WriteMasks. (The DataBlk class is essentially a vector that holds data inside messages, TBEs (MSHRs), caches, and other assorted fun objects. WriteMasks were introduced with the recent AMD GPU patches as a way of copying only portions of a DataBlk. Both rely on the block size to know how to size the two objects. Previously, they both just used the global in their constructors so that they were always sized appropriately; the global went away, so... fun.) Not only are the entries templated types, but it was previously not possible to pass the data block in the constructor, so there's a patch in this list that makes that possible. Otherwise, we'd have to create the DataBlk or WriteMask as an empty vector and then initialize it after the entry was created... which is what we do with messages... fun! Messages do not have a constructor available to them directly within the SLICC language. They are created as part of the "enqueue" keyword, and their fields are initialized afterwards. (It might be possible to pass in the data block implicitly by modifying "enqueue", but I didn't bother with it.) So there, that's why.
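    A compact sketch of the two dependencies described above (hedged: these are not the real gem5 signatures, just the shape of them once the global is gone): a makeLineAddress-style helper has to be told the block size explicitly, and a DataBlk-style container can be sized at construction when the block size is passed in.

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <vector>

    // Without a global block size, the line-address helper takes the
    // block size explicitly (here as log2(block size) bits) and masks
    // off the block-offset bits.
    inline uint64_t makeLineAddress(uint64_t addr, unsigned blockSizeBits)
    {
        return addr & ~((uint64_t{1} << blockSizeBits) - 1);
    }

    // Likewise, a DataBlk-style container can be sized correctly at
    // construction when the block size arrives as a parameter, instead
    // of reading a process-wide global in its constructor.
    struct DataBlock {
        std::vector<uint8_t> data;
        explicit DataBlock(int blockSize) : data(blockSize, 0) {}
    };
    ```

    For a 64-byte line (6 offset bits), makeLineAddress(0x12345, 6) yields the line base 0x12340.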

    Reality: Currently, we can run two or more simulated systems in the same process with almost identical configurations; pretty much everything stays the same except the workloads/kernels which are provided to each system. For now, this is fine for us here at AMD, but I can imagine that someone else will want heterogeneous systems. One problem is writing a configuration script which allows us to specify two completely different systems in a clean way. Right now, we go through the process of making one system and then duplicate that system. Another problem is how the PROTOCOL is specified at build time; I have not put much thought into solving this yet, but it seems like a hard problem. Maybe we can make a meta-protocol that encompasses all of the existing ones? Seems crazy. Finally, I don't think that I have completely solved the problem of statics/globals. The other day I found that the number of virtual networks is still a global definition somewhere in the code. If the PROTOCOL issue gets solved, that will need to be changed as well. There are probably also other things that I am overlooking.

  2. Ok. Could you perhaps add a short note in the patch summary explaining that this is to allow multiple systems with different cache line sizes?

  3. If the only benefit of this set of patches is to allow multiple systems with different block sizes in the same gem5 process, then I don't think they are worth the pain they will cause to others. As it is, anyone with a SLICC protocol will have a significant headache updating it for these patches. Additionally, they ask the programmer to write more code that is easy to mess up (e.g., every out_msg must call DataBlk.alloc(...)).

    I agree that static variables are bad... but I don't see how getting rid of them is worth this headache.

    On another note: How does the classic memory system deal with this? Maybe you could look there to see how the block size is parameterized. However, I have a feeling a lot of this is because Ruby has so much baggage that even seemingly simple things like removing static variables cause a ripple effect.

  4. Jason, I completely understand the reluctance to merge with this change. However, we all agree this is an improvement. The static variables have been a major issue for a while. It will be great when we robustly support multi-node Ruby simulations within a single gem5 process. A lot of our research at AMD depends on it (our batch system only supports single-process simulations), and I suspect other researchers will benefit as well.

    As Brandon points out, this is the second time he's implemented this change. Nilay did not like the first implementation and we all agreed that Nilay's proposal was better. Thus Brandon spent months implementing and debugging this new approach. He's done all the work on the public protocols (and AMD's internal ones). It is not that hard to follow his example.

  5. Allowing multiple systems with different block sizes in the same gem5 process is not the only benefit. The overall goal here is to allow Ruby systems to be created with completely different configurations in the same gem5 process. There shouldn't be any constraint on how the independent systems are configured. Now, I admit that these changes are not all that's needed to get us there, but this is a major step in that direction.

    If DataBlk.alloc() is an unacceptable problem with messages, help me figure out how to supply the block size implicitly through the enqueue keyword associated with messages. I'd hinted above that it might be possible, but I didn't go through the trouble of figuring out how to do it. If your argument is that the extra DataBlk.alloc() calls are error prone or too burdensome, let's fix the problem by hiding them inside of enqueue.
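    The idea floated here can be sketched as follows (all names hypothetical; this is not the real SLICC-generated code): if the code generated for "enqueue" allocated the message's DataBlk itself before running the protocol body, protocol writers would never have to remember the explicit alloc call.

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <vector>

    // Stand-ins for the message's data payload and the message itself.
    struct DataBlock {
        std::vector<uint8_t> data;
        void alloc(int blockSize) { data.assign(blockSize, 0); }
    };

    struct Message {
        DataBlock dataBlk;
    };

    // What an enqueue expansion could look like: the generated wrapper
    // allocs the DataBlk once, implicitly, then runs the protocol body
    // that fills in the remaining message fields.
    template <typename Body>
    Message enqueueWithAlloc(int blockSize, Body body)
    {
        Message msg;
        msg.dataBlk.alloc(blockSize);  // done here, not in every protocol
        body(msg);
        return msg;
    }
    ```

    The protocol body then only assigns fields, and the easy-to-forget alloc lives in exactly one place.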

    I'm not thrilled about how extensive these changes are either; I empathize with anyone who has a SLICC protocol that needs to merge with these changes. Understand that Nilay and I went through the process of figuring out how to fix this problem, and then I fixed and tested it with all of the publicly available protocols. That's after my initial attempt, which was rejected in favor of this set of patches; the first attempt wasn't a trivial effort either. If it's really too much to ask, publish your protocol and I will make the changes for you. Again, if that's still unacceptable, spin your own patches to fix this problem and I will review all of them for you. After seeing my first set of patches, Nilay wrote this version, so there's precedent for coming up with your own solution.

    For anyone creating subsequent protocols, I doubt that they'd object to these changes. People tend to copy/paste sections of the protocols anyway; at least, it appears that way from inspecting protocol code. They will be copy/pasting the extra parameters without ever realizing that it was once unnecessary.

    As an aside, I really do appreciate the reviews. I hope that I'm not coming off as argumentative or ungrateful, but I refuse to be put into a position where I need to maintain these patches internally. They fix an outstanding problem (that should have never existed in the first place), and it's near impossible to maintain them for all of the public protocols. We can make the change here once to rectify the old problem and move on. There must be a compromise here that everyone can live with.