ruby: Fix block_on behavior

Information
Submitter:	Joel Hestness
Repository:	gem5
Branch:	default
Bugs:
Depends On:
Reviewers
Groups:	Default
People:

Description

Changeset 11169:0a83a8d08c9e
---------------------------
ruby: Fix block_on behavior

Ruby's controller block_on behavior aimed to block MessageBuffer requests into
SLICC controllers when a Locked_RMW was in flight. Unfortunately, this
functionality only partially works: When non-Locked_RMW memory accesses are
issued to the sequencer to an address with an in-flight Locked_RMW, the
sequencer may pass those accesses through to the controller. At the controller,
a number of incorrect activities can occur depending on the protocol. In
MOESI_hammer, for example, an intermediate IFETCH will cause an L1D to L2
transfer, which cannot be serviced, because the block_on functionality blocks
the trigger queue, resulting in a deadlock. Further, if an intermediate store
arrives (e.g. from a separate SMT thread), the sequencer allows the request
through to the controller, and the atomicity of the Locked_RMW may be broken.

To avoid these problems, disallow the Sequencer from passing any memory
accesses to the controller besides Locked_RMW_Write when a Locked_RMW is in
flight.
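
The gating rule described above can be sketched as a small standalone model. This is not the patch itself: `ToySequencer`, `ReqType`, and `Status` are hypothetical stand-ins for Ruby's Sequencer, RubyRequestType, and RequestStatus, and the lock is released when the Locked_RMW_Write is issued rather than when it completes, which is a simplification.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

// Hypothetical, simplified stand-ins for Ruby's request types and statuses.
enum class ReqType { LD, ST, IFETCH, Locked_RMW_Read, Locked_RMW_Write };
enum class Status { Issued, Aliased };

// Toy model of the gating rule: while a Locked_RMW is in flight on a cache
// line, only the matching Locked_RMW_Write may pass to the controller.
class ToySequencer {
  public:
    explicit ToySequencer(std::uint64_t lineBytes) : lineBytes_(lineBytes) {
        // The masking below assumes a power-of-two line size.
        assert(lineBytes != 0 && (lineBytes & (lineBytes - 1)) == 0);
    }

    Status request(std::uint64_t addr, ReqType type) {
        const std::uint64_t line = addr & ~(lineBytes_ - 1);
        const bool locked = lockedLines_.count(line) != 0;

        // Reject everything except the closing write while the line is locked.
        if (locked && type != ReqType::Locked_RMW_Write)
            return Status::Aliased;

        if (type == ReqType::Locked_RMW_Read)
            lockedLines_.insert(line);   // the RMW is now in flight
        else if (type == ReqType::Locked_RMW_Write)
            lockedLines_.erase(line);    // simplification: unlock at issue time

        return Status::Issued;
    }

  private:
    std::uint64_t lineBytes_;
    std::unordered_set<std::uint64_t> lockedLines_;
};
```

In this model an IFETCH or store that lands on a locked line is turned away with `Status::Aliased` instead of reaching the controller, while accesses to other lines are unaffected.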

Testing Done

Considered many other potential solutions on the gem5-gpu email list, which seem unlikely to function as desired:
https://groups.google.com/forum/#!topic/gem5-gpu-dev/RQv4SxIKv7g

Found a reproducible version of the IFETCH bug with gem5 11139:bd894d2bdd7c (using the xsave disable patch in this email thread: http://comments.gmane.org/gmane.comp.emulators.m5.devel/24558).

Reproducible command line: ../build/X86_MOESI_hammer/gem5.opt --outdir=$outdir ../configs/example/fs.py --ruby --cpu-type=detailed --caches --num-cpus=4 --script ../configs/boot/hack_back_ckpt.rcS --kernel ../../full_system_files/binaries/x86_64-vmlinux-2.6.28.4-smp

Verified that the patch fixes the reproducible bug and tested that booting Linux works with O3CPU and numerous other system configurations.

Issue Summary

Description	From	Last Updated	Status
I prefer this check being done just before insertRequest() is called.	Nilay Vaish	April 14, 2016, 12:15 p.m.	Resolved
There's no reason that I can see why this should be a std::map. The mandatory_q_ptr that is passed in is ...	Brandon Potter	Oct. 15, 2015, 11:56 a.m.	Dropped
Can you add a comment to the code here (or somewhere in this function) that describes how the RequestStatus_Aliased actually ...	Brandon Potter	April 15, 2016, 10:34 a.m.	Resolved
Can you add a comment here that the 1st level controller is blocked for all message buffers except the mandatory ...	Brandon Potter	April 15, 2016, 10:34 a.m.	Resolved

src/mem/ruby/system/Sequencer.cc (Diff revision 1)

I prefer this check being done just before insertRequest() is called.

Joel Hestness Oct. 10, 2015, 2:58 p.m.

I considered your request, but I don't understand why this check should be moved. To check whether the line is blocked requires making a line address out of the request address, which happens just before this, and this check is consistent with other checks in insertRequest. Namely, this checks if there is an outstanding operation for the line. Sequencer::makeRequest() just does the translation of the packet to a RubyRequest, so returning from there with RequestStatus_Aliased is inconsistent with the functionality of that method. Can you clarify why you think it should be moved?

Nilay Vaish Oct. 11, 2015, 10:34 a.m.

insertRequest() is meant for deciding whether to put this request into either of the read / write tables. And I think that's what it should be doing and nothing more. It should not be the responsibility of insertRequest() to check if there is some RMW operation going on at the address.

> I considered your request, but I don't understand why this check should be moved. To check whether the line is blocked requires making a line address out of the request address, which happens just before this,

Making the line address is just one operation, other than the fact that we use a global pointer to get the block size.

> and this check is consistent with other checks in insertRequest.

The only check in insertRequest() is which table to put the request into.

> Namely, this checks if there is an outstanding operation for the line. Sequencer::makeRequest() just does the translation of the packet to a RubyRequest, so returning from there with RequestStatus_Aliased is inconsistent with the functionality of that method. Can you clarify why you think it should be moved?

If you are worried about the return type, add a new one.

Joel Hestness Oct. 11, 2015, 3:44 p.m.

> insertRequest() is meant for deciding whether to put this request into either of the read / write tables. And I think that's what it should be doing and nothing more. It should not be the responsibility of the insertRequest() to check if there is some RMW operation going on the address.

Sorry, I'm still not clear about what you're arguing. This new isBlocked check is also checking whether to put the request into the read or write tables. It does so based on what accesses are currently in progress (i.e. an in-flight Locked_RMW), so it is consistent with the checks already in insertRequest. As I said, makeRequest only does translation from Packets to RubyRequests, so it has very different functionality than this new isBlocked check. To me, it doesn't make sense to move the isBlocked check into makeRequest.

Also of note, the isBlocked condition should only evaluate to true very infrequently, so there isn't any need to optimize the check's location for performance.

Brandon Potter Oct. 14, 2015, 10:15 a.m.

Joel, if you decide to keep this inside insertRequest() then you might move it up two lines so that it is called before the default_entry is created. It's really minor, but it doesn't make sense to create the default_entry before the check.

src/mem/ruby/system/Sequencer.cc (Diff revision 1)

This is the place I am asking the check be moved to.

src/mem/ruby/system/Sequencer.cc (Diff revision 1)

There's no reason that I can see why this should be a std::map. The mandatory_q_ptr that is passed in is not used by anything at all. (The only checks now are on the address and there is no action taken on the mandatory_q. Maybe I am missing something?)

Can we change the tracking structure to reflect that this is really a vector of addresses for the memory controller that we should be checking instead of a map? Actually, there's only a single address with this new modification since the request will be nack'd back to the requester.

Brandon Potter Oct. 15, 2015, 11:45 a.m.

Sorry, disregard that last part of the issue where I was suggesting that there only needs to be a single address which can be blocked on. We'd still need a vector or something equivalent to track multiple addresses to block on.

Change Summary:

Move default_entry creation after check per Brandon's suggestion.

Description:

- Changeset 11157:65d87638830f
+ Changeset 11169:0a83a8d08c9e

Diff:

Revision 2 (+17)

	src/mem/ruby/slicc_interface/AbstractController.hh
	src/mem/ruby/slicc_interface/AbstractController.cc
	src/mem/ruby/system/Sequencer.cc

Ship It!

src/mem/ruby/system/Sequencer.cc (Diff revision 2)

Can you add a comment to the code here (or somewhere in this function) that describes how the RequestStatus_Aliased actually fixes the problem?

I looked through the code and eventually figured it out, but it took some time and was a bit convoluted.

This return goes back up into makeRequest with Aliased, which eventually becomes visible in RubyPort::MemSlavePort::recvTimingReq. That method re-enqueues the port for a retry and returns false to the owner of the port (the CPU), which denotes that the request failed. The owner CPU is expected to wait for the RubyPort to trigger the retry, which happens when the port becomes available again as RubyPort executes RubyPort::ruby_hit_callback.

The confusing thing was that the O3CPU model has separate ports for instruction and data, which is what allows this to work with the IFETCH. The MemSlavePort corresponding to the i-cache is what gets stalled, but the MemSlavePort associated with the d-cache is still open and allows the RMW_Write to proceed.

I don't know what a good summary would look like, but it would be nice if there was something here to hint at what's happening.
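
As a possible starting point for such a summary, the retry flow described above can be sketched as a toy model. `ToyRubyPort` and its integer port IDs are hypothetical and only mirror the behaviour described in this comment; the real RubyPort code differs in detail.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy model of the retry flow: a rejected request parks the sending port on
// a retry list, and a later hit callback tells every parked port to retry.
class ToyRubyPort {
  public:
    // Returns true if the request was accepted. On rejection (for example
    // an Aliased status from the sequencer) the port is queued for a retry.
    bool recvTimingReq(int portId, bool sequencerAccepts) {
        assert(portId >= 0);
        if (sequencerAccepts)
            return true;
        if (!waiting(portId))
            retryList_.push_back(portId);
        return false;
    }

    // Models the hit callback: hands back the ports that should now retry
    // and clears the retry list.
    std::vector<int> hitCallback() {
        std::vector<int> woken;
        woken.swap(retryList_);
        return woken;
    }

    bool waiting(int portId) const {
        return std::find(retryList_.begin(), retryList_.end(), portId)
               != retryList_.end();
    }

  private:
    std::vector<int> retryList_;
};
```

With one ID for the i-cache port and another for the d-cache port, a rejected IFETCH stalls only the i-cache port, so the d-cache port can still deliver the Locked_RMW_Write whose completion later wakes the stalled port.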

src/mem/ruby/system/Sequencer.cc (Diff revision 2)

Can you add a comment here that the 1st level controller is blocked for all message buffers except the mandatory queue?

The name of the function, blockOnQueue, can be misleading depending on how the reader interprets it; one might just glance at this code and think that the mandatory queue gets blocked when in reality it's all of the other message buffers.
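
The semantics in question can be illustrated with a minimal sketch, assuming the behaviour described in this comment. `ToyController` and the buffer names are hypothetical; the real controller tracks MessageBuffer pointers rather than strings.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Toy model of the blocking semantics: blockOnQueue(line, q) records q as
// the only buffer still allowed to deliver messages for that line. Every
// other buffer is held back until unblock(line) is called.
class ToyController {
  public:
    void blockOnQueue(std::uint64_t line, const std::string &allowedBuffer) {
        blocked_[line] = allowedBuffer;
    }

    void unblock(std::uint64_t line) {
        // Assumption of this sketch: unblock is only called on blocked lines.
        assert(blocked_.count(line) == 1);
        blocked_.erase(line);
    }

    // True if a message for 'line' at the head of 'buffer' may be processed.
    bool canDequeue(std::uint64_t line, const std::string &buffer) const {
        auto it = blocked_.find(line);
        return it == blocked_.end() || it->second == buffer;
    }

  private:
    std::map<std::uint64_t, std::string> blocked_;
};
```

Read this way, the queue passed to blockOnQueue is the one that stays open, which is the opposite of what the name suggests at first glance.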

Thank you Joel for fixing this bug and posting this patch.  Those of us at AMD have encountered this bug as well and some of us are already using a version of this patch.  It would be great if you could check this in soon so that we do not have to maintain your fix locally.  Please check in this patch as soon as you can.

Thanks!

This change has been marked as submitted.

Status: Closed (submitted)