mem: implement x86 locked accesses in timing-mode classic cache

Information
Submitter:	Steve Reinhardt
Repository:	gem5
Branch:	default
Bugs:
Depends On:
Reviewers
Groups:	Default
People:

Description

Changeset 11444:8a1419dbbfa6
---------------------------
mem: implement x86 locked accesses in timing-mode classic cache

Add LockedRMW(Read|Write)(Req|Resp) commands.  In timing mode,
use a combination of clearing permission bits and leaving
an MSHR in place to prevent accesses & snoops from touching
a locked block between the read and write parts of an locked
RMW sequence.

Testing Done

Issue Summary

Description	From	Last Updated	Status
the changed order of the write buffer allocation and dealing with the writebacks is important?	Andreas Hansson	May 6, 2015, 3:21 p.m.	Dropped
Since I just spent a few good hours hunting memory allocation bugs... where will these get deallocated again? It is ...	Stephan Diestelhorst	May 6, 2015, 3:21 p.m.	Resolved
Again.. I'd like to see a comment of where these guys (or the originals) are deallocated.	Stephan Diestelhorst	May 6, 2015, 3:21 p.m.	Resolved
Why don't you break and drop the early_exit?	Stephan Diestelhorst	May 6, 2015, 3:38 p.m.	Dropped
This should really be: if (!mshr.hasTargets()) and all the stuff below has too deep an (or two) indentation level, IMHO. ...	Stephan Diestelhorst	April 8, 2015, 1:43 p.m.	Open
later on we test pkt->isLockedRMW and then read or write. Could we not do the same here rather and avoid ...	Andreas Hansson	May 7, 2015, 7:32 a.m.	Resolved
status should be 0 even, or could it still have the dirty flag set? the readable flag is essentially valid, ...	Andreas Hansson	May 7, 2015, 12:55 a.m.	Open
Why is this MSHR needed? It was faked, was it not?	Andreas Hansson	May 7, 2015, 12:55 a.m.	Open
Is it safe to call this here and ignore the return value? It would be nice if we do not ...	Andreas Hansson	May 7, 2015, 7:59 a.m.	Dropped
potentially still dirty, or always came from "Exclusive", i.e. Readable \| Writeable?	Andreas Hansson	May 7, 2015, 12:55 a.m.	Open
Do we need the actual LockedRMW... or could if just be a flag on the existing read and write req/resp?	Andreas Hansson	May 7, 2015, 8:02 a.m.	Dropped

src/mem/cache/cache_impl.hh (Diff revision 1)

the changed order of the write buffer allocation and dealing with the writebacks is important?

Show all issues

Steve Reinhardt March 18, 2015, 3:30 p.m. (March 18, 2015, 3:30 p.m.)

Good question. I reordered the two blocks only so I could put the tempBlock clause inside of 'if (!early_exit)' and leave the writebacks code outside. It's necessary to do the writebacks before exiting the function, since that's a local variable and otherwise the writebacks would be lost. In contrast, tempBlock is part of the cache object so it will stick around, and while I haven't thought it through carefully, I assume there could be cases where you're doing a locked RMW in tempBlock, so you don't want to kick it out prematurely.

Looking at handleFill(), it seems unlikely (dare I say impossible) that you would have both writebacks and the use of tempBlock at the same time, since the only time you use tempBlock is if you can't use any existing cache block frame, and if you had bothered to do a writeback then by definition you would be clearing out an existing cache block frame. So by that reasoning the order shouldn't matter.

Great to see this functionality added. I am not 100% convinced of the way it is done, but perhaps there is no "neater" way of doing it. Especially the new creation of requests and packets is making me nervous. Who is deleting these?

Also, how does this interact with existing MSHRs? Targets and deferred targets? Similarly, now we fake an MSHR even if the block is present in the cache. Would it make sense to flush/downgrade somehow and then treat this like a miss?

src/mem/cache/cache_impl.hh (Diff revision 1)

Spelling mistake.

Get the idea, but I am burnt when it comes to debugging request + packet deep copies.

src/mem/cache/cache_impl.hh (Diff revision 1)

Since I just spent a few good hours hunting memory allocation bugs... where will these get deallocated again? It is not obvious, so maybe chuck in a comment, here?

Show all issues

src/mem/cache/cache_impl.hh (Diff revision 1)

Again.. I'd like to see a comment of where these guys (or the originals) are deallocated.

Show all issues

src/mem/cache/cache_impl.hh (Diff revision 1)

Why don't you break and drop the early_exit?

Show all issues

Steve Reinhardt May 6, 2015, 3:56 p.m. (May 6, 2015, 3:56 p.m.)

[Hmm, I had a comment here but reviewboard silently dropped it when I posted.]

Basically break doesn't work here because you still need to exit the code after the if/else-if/else-if... sequence, but before the break. That is, the set of lines starting with:
tgt_pkt->makeTimingResponse();

src/mem/cache/cache_impl.hh (Diff revision 1)

This should really be:
if (!mshr.hasTargets())
and all the stuff below has too deep an (or two) indentation level, IMHO.  I seriously consider goto an appropriate way to improve readability here.

Show all issues

Steve Reinhardt May 6, 2015, 3:53 p.m. (May 6, 2015, 3:53 p.m.)

Changing the test to 'if (!mshr.hasTargets())' is equivalent (I believe) but it's not clear to me that it's better... the only reason that condition could be true would be because early_exit is false, and early_exit is the more meaningful cause, I'd say.

I could use a goto... as I mentioned above, I seriously considered it, but rejected it since it wouldn't completely replace early_exit. I'd have no problem with doing it here, but others might, so I was doing it as an if-block to avoid the controversy. I will post another version of the patch with a goto for comments on this specifically.

Change Summary:

Updated to tip of tree. Fixed memory leak and spelling error.

Note that I had to remove the recently added 'const' modifier from the pkt field of the MSHR::Target object as part of the update. As mentioned in the comment, we really need to use the existing Packet object for the response and substitute in a dummy packet, since the former has the allocated data pointer for the response data. I tried the option of creating a new packet for the response, but there's no clean way to move an dynamically allocated data pointer from an existing packet to a new one. We could add methods to Packet to do this, but since it's all just to avoid removing 'const' from the pkt field, it seemed easier just to remove 'const'. If anyone has a better idea on how to handle this, I'm listening.

Diff:

Revision 2 (+178 -45)

Show changes

	src/mem/packet.hh
	src/mem/packet.cc
	src/mem/cache/cache_impl.hh
	src/mem/cache/mshr.hh

Change Summary:

This is an alternative version of the patch that uses a goto to narrow the scope of the early_exit flag and reduce indentation levels. I'm not necessarily replacing the prior patch with this one; I really just want to post two alternatives, and this seemed easier than creating a whole new patch to review. You probably want to focus just on the diff between this revision and the previous one.

Diff:

Revision 3 (+149 -14)

Show changes

	src/mem/packet.hh
	src/mem/packet.cc
	src/mem/cache/cache_impl.hh
	src/mem/cache/mshr.hh

I vote for the non-goto verison.

Is there really no way we can achieve this without the extra packets going back and forth? It seems the shortcircuiting we do with packets we should be able to do without them if we tapped in one level below the recvTimingReq/recvTimingResp.

Steve Reinhardt May 7, 2015, 8:34 a.m. (May 7, 2015, 8:34 a.m.)

Thanks for the review! I'm not sure what you mean by "one level below" though.

There are lots of ways this can be done; I spent a lot of time thinking about how to do this before I started on it. The key is that you have to check whether a block is locked when snoops or other requests come in, defer those snoops and other accesses in some kind of list during the RMW window, then process that list of deferred snoops/requests when the RMW completes.

Given that we already have this entire mechanism already in place with the MSHRs, it seems obvious to me that we should leverage it. Faking the packets is a little weird but allows us to make the most use of existing code paths and minimizes changes. This is less than 60 net lines of non-comment code added to cache_impl.hh, which I think is pretty good for a feature that I initially expected would be much more complex.

I'm sure there are ways we could do this without the fake packets, at the cost of some significant restructuring (e.g., taking the 'while (mshr->hasTargets())' loop out of recvTimingResp() and making it callable from multiple places) and probably more special-casing in the existing code paths. For example, we could have a flag in MSHR::Target that indicates that the block is locked, and leave the packet pointer null, but then we'd probably end up adding null pointer checks in several places. That was more than I really wanted to bite off though. I suppose I could give it a shot but it might be a while until I have time.

src/mem/cache/cache_impl.hh (Diff revision 3)

later on we test pkt->isLockedRMW and then read or write. Could we not do the same here rather and avoid looking at the actual command, i.e.
pkt->isLockedRMW() && pkt->isWrite()

Show all issues

Steve Reinhardt May 7, 2015, 8:34 a.m. (May 7, 2015, 8:34 a.m.)
```
Sure.
```

src/mem/cache/cache_impl.hh (Diff revision 3)

status should be 0 even, or could it still have the dirty flag set?

the readable flag is essentially valid, so here we invent a new state

Readable = 0
Dirty = ?
Writeable = 0

Still the block is "valid".

This scares me a bit. Partly because the binary encoding of cache states is not that easy to follow, and partly because we now have MOESI + some new state.

Show all issues

Steve Reinhardt May 7, 2015, 8:34 a.m. (May 7, 2015, 8:34 a.m.)

It could still have the dirty bit set, though practically speaking that's moot, since it will be set by the write part of the RMW regardless. There are other bits like BlkHWPrefetched and BlkSecure that we probably don't want to mess with though.

It is obviously not a MOESI state, but that's OK, because coherence protocol operations will never observe this state, as all snoops are also being deferred. Its only purpose is to make all CPU-side accesses (read or write) go down the miss path, where we can defer them until the RMW completes. There is precedent for this; we also clear the Readable bit on valid blocks in the window after a write miss (while the upgrade request is outstanding); see the comment around line 788 of this file.

src/mem/cache/cache_impl.hh (Diff revision 3)

Why is this MSHR needed? It was faked, was it not?

Show all issues

Steve Reinhardt May 7, 2015, 8:34 a.m. (May 7, 2015, 8:34 a.m.)

Not necessarily... if the read part of the RMW was an actual cache miss, this is still the non-fake MSHR that was allocated at that time. The initial target of that MSHR will have a fake packet at this point, but there could be real deferred targets waiting behind it.

src/mem/cache/cache_impl.hh (Diff revision 3)

Is it safe to call this here and ignore the return value?
It would be nice if we do not assume that the response is always accepted.
Perhaps schedule an event and put it in a queue rather?

Show all issues

Steve Reinhardt May 7, 2015, 8:34 a.m. (May 7, 2015, 8:34 a.m.)

recvTimingResp() doesn't have a return value... responses are always accepted.

Andreas Hansson Jan. 1, 2016, 6:57 a.m. (Jan. 1, 2016, 6:57 a.m.)

Actually responses are not always accepted directly. They are always guarnateed to be sunk without dependencies, but there are modules that use flow control on responses (simply for rate regulation reasons).

src/mem/cache/cache_impl.hh (Diff revision 3)

potentially still dirty, or always came from "Exclusive", i.e. Readable | Writeable?

Show all issues

Steve Reinhardt May 7, 2015, 8:34 a.m. (May 7, 2015, 8:34 a.m.)
```
BlkDirty may have been set in handleFill()
```

src/mem/packet.hh (Diff revision 3)

Do we need the actual LockedRMW... or could if just be a flag on the existing read and write req/resp?

Show all issues

Steve Reinhardt May 7, 2015, 8:34 a.m. (May 7, 2015, 8:34 a.m.)

The way MemCmd works, you need a new command for every combination of flags (it's a dense encoding not a sparse encoding, so you look up the flags via the commandInfo[] array). Confusing since it's different from how Request flags are handled, I know.

Change Summary:

This revision doesn't address any reviewer concerns, but it does update the old implementation to work with the latest tree (for those who are still downloading this manually). It may also fix a bug; there was a bug in one of our internal versions but I'm not sure if that bug made it out onto reviewboard or not.

Diff:

Revision 4 (+192 -59)

Show changes

	src/mem/packet.hh
	src/mem/packet.cc
	src/mem/cache/cache.cc
	src/mem/cache/mshr.hh

Can you please update it?
It neither applies cleanly nor does it compile afterwards.

Steve Reinhardt April 14, 2016, 10:43 p.m. (April 14, 2016, 10:43 p.m.)
```
Done.
```

Change Summary:

Updated again to work with latest head (rev 11443:df24b9af42c7)

Description:

~		Changeset 10745:9b84d1b570e3
	~	Changeset 11444:8a1419dbbfa6

		mem: implement x86 locked accesses in timing-mode classic cache

		Add LockedRMW(Read\|Write)(Req\|Resp) commands. In timing mode,
		use a combination of clearing permission bits and leaving
		an MSHR in place to prevent accesses & snoops from touching
		a locked block between the read and write parts of an locked
		RMW sequence.

Diff:

Revision 5 (+192 -59)

Show changes

	src/mem/packet.hh
	src/mem/packet.cc
	src/mem/cache/cache.cc
	src/mem/cache/mshr.hh

Is there anything holding this up from being shipped?

Andreas Hansson Jan. 5, 2017, 8:43 a.m. (Jan. 5, 2017, 8:43 a.m.)

In my view the patch needs two things:

1) Some thought around the design. I am still hoping there is a less invasive way of accommodating the functionality, possibly with some changes to the existing functions of the cache.

2) A way to test it. This should preferably include both synthetic and real use-cases. The memtester and memchecker may be a good starting point for the synthetic part.

It would be great if someone could dig into these issues.

Steve Reinhardt Jan. 5, 2017, 11:26 a.m. (Jan. 5, 2017, 11:26 a.m.)

It's been quite a while since I've looked at this... in fact I was going to say that there might be a memory leak, but looking back at the patch history I see I fixed that already in May 2015 :).

It's pretty easy to come up with a simple test case, just using C++11 atomic ops or pthread mutexes. Unfortunately the ones I had were on my AMD work system that I had to give back, and it looks like I missed them when I went to copy off useful non-proprietary stuff before I left. I know I didn't spend much time writing tests though, so they should be easy to recreate.

As far as being less invasive, I think my reply to Andreas's comment of May 7, 2015 is still relevant: given the current structure of the cache code, I think the existing solution is really not that invasive. There isn't really that much code being added here; most of it is comments. Generating a synthetic request to allow allocating an MSHR so that we can accumulate the deferred operations that happen while the lock is held is a little strange, but it allows us to reuse the bulk of the code in recvTimingResp() with just a few changes.

If the cache code were heavily restructured, we could probably layer it in such a way that the fake packets would not be necessary, and thus the invasiveness of the locked RMW support could be reduced. However, I started down this path a long time ago (shortly after his comment), and after putting a fair amount of work into the restructuring, I still didn't even have the code working at all, so I gave up. Thus I think the net invasiveness of restructuring the cache to reduce the amount of change required specifically for this feature is much much larger than the invasiveness of this patch by itself. The end result would probably be better code, but it's a huge amount of work that I don't see anyone signing up for right now, so I think it's better to put this change in as is for now, and put "restructuring the cache" on the to-do list.

Nikos Nikoleris Jan. 6, 2017, 7:31 a.m. (Jan. 6, 2017, 7:31 a.m.)

I agree, the implementation of locked instructions wouldn't be trivial with the current structure of the cache. It would be nice if we didn't have to abuse the MSHR in that way, it makes it hard to reason about memory leaks, and it can't solve the problem for atomic and functional accesses.
Wouldn't it be easier to implement this in the same way as far atomics? If I understand correctly the semantics of locked instructions, it would be significantly simpler if we could perform these operations with a single RMW packet rather than two separate, a read and then a write. Wouldn't the AtomicOpFunctor in the packet class be sufficient?
Tony would that work for you?

Steve Reinhardt Jan. 6, 2017, 10:07 a.m. (Jan. 6, 2017, 10:07 a.m.)

Thanks for the input, Nikos. While it would be a lot simpler to do the atomic ops in the cache using functors, x86 lets you put a LOCK prefix in front of a fairly long list of instructions, and then you have to combine that with all the different data sizes that each instruction supports. Plus it would be a pretty bi disruption to the whole CPU ISA definition to create and plumb those packets through to the memory interface, and it would make the locked instructions copmletely different from their unlocked counterparts. These are all the reasons we didn't do it this way to begin with, though in retrospect it's not as infeasible as perhaps it seemed up front; just complicated and a little ugly. Not necessarily worse than the current patch.

Actually though there is no problem with atomic and functional accesses... it's easy to make them atomic, as you just issue the read and the write within the same event tick/process() method (i.e., without going back to the event queue). It's only timing accesses that are hard with this interface.

Nikos Nikoleris Jan. 9, 2017, 2:20 a.m. (Jan. 9, 2017, 2:20 a.m.)

I thought that the AtomicOpFunctor solves exactly this issue. While decoding the instruction you pass a function pointer which operates on the data in the memory system. Does the memory system need to be aware of the specifics of the operation? If I understand correctly the cache will only call the function pointer and it won't need to decode the instruction or know anything about it for that matter. I am not really familiar with the LOCK prefix but do we need to support something more than arithmetic operations?
It would be great if this was not necessarily tied to the x86 LOCK prefix but a generic mechanism to support far atomics across different architectures.

Tony Gutierrez Jan. 16, 2017, 10:03 a.m. (Jan. 16, 2017, 10:03 a.m.)

The AtomicOpFunctors are supposed to allow various different atomic operations to be sent to the memory or cache controllers in a simple way, without needing to specify all the possible different atomic op types, and provide their functionality, inside of the request. I think it is a very nice mechanism for handling atomics, but ARM had previously stated their desire to remove them altogether. Is this no longer the case, Nikos?

Nikos Nikoleris Jan. 16, 2017, 2:02 p.m. (Jan. 16, 2017, 2:02 p.m.)

Tony, I only wanted to point out that all the complexity in Steve's implementation was due to the split of the operation in two packets. Implementing the lock prefix with a single packet in the form of an atomic would simplify things a lot.

I took the AtomicOpFunctor as given as it is already commited. I am missing the context here as I haven't followed the discussion on the AtomicOpFunctor but if there are thoughts to change the interface I would imagine that the mechanism/functionality would be pretty much the same. As for the overall idea, I am pretty sure that we are interested in a mechanism for (far)-atomic operations in gem5.

Steve Reinhardt Jan. 16, 2017, 4:05 p.m. (Jan. 16, 2017, 4:05 p.m.)

[I thought I had replied to Nikos's earlier comment, but didn't see it in the email thread... apparently I did not click "publish" when I was done. Fortunately the draft was still in reviewboard. So the text below is actually a reply to Nikos's comment of Jan 9.]

Nikos, you are correct about how AtomicOpFunctor operates. The function is opaque to the memory system and is not tied to any particular ISA feature. My point was not about the feature itself, but that a substantial number of functor instances would have to be defined to cover all the possible instructions and data types that can take the LOCK prefix, and that modifying the x86 ISA support to use those functors in place of the current microcode definitions would also be a substantial amount of work.

Two other things I have thought of/noticed since my previous comment:
- AtomicOpFunctor is apparently only defined for Ruby (I believe because it's only currently used by the AMD GPU model, which doesn't work with classic caches anyway). So to use this solution, another prerequisite would be to port AtomicOpFunctor support to the classic cache. That's probably something that should be done anyway, but it is yet more work that no one is signed up to do. There is a similar feature in the classic cache that supports only compare and swap (see Cache::cmpAdnSwap()), so I think the right approach would be to generalize this path to accept AtomicOpFunctors and use that to replace the existing compare-and-swap support.
- More critically, x86 requires atomic operations to work on accesses that cross cache-line boundaries. This works with the current patch (I believe, not sure I explicitly tested this) because the CPU will load and lock both lines into the cache, do the operation, then unlock both lines after they are updated. It's not clear to me how to make this work with AtomicOpFunctor, since the whole cache interface assumes single-cache-line accesses. (All x86 unaligned operations are managed by the CPU, not by the cache.) This could be the ultimate reason why we want to stick with the current approach rather than moving to AtomicOpFunctor.

Tony Gutierrez Jan. 18, 2017, 8:42 a.m. (Jan. 18, 2017, 8:42 a.m.)

Hi Nikos,

To give some context we had a lot of resistance when we first introduced the AtomicOpFunctors because, if I am remembering correctly, it was deemed too opaque, and that it would cause issues when connecting gem5 to TLM. We had the owner of that patch here at AMD try to refactor the code to not need to rely on the AtomicOpFunctor, but this was more difficult than we originally thought and it was never completed. The AtomicOpFunctor seems like a more natural solution to the problem, however, so if there is a way we can keep them that would be preferred. Otherwise we need to add support for every possible atomic type to the request, as well as space for the additional atomic data and some specifiers to mark which type of atmoic operation the request is carryning. Not too long ago Andreas mentioned that ARM were workign on something to replace them internally as well, but I haven't heard the status of that.

You have a pending review.

Review Board 2.0.15

Screenshots

Files

Issue Summary

Change Summary:

Diff:

Change Summary:

Diff:

Change Summary:

Diff:

Change Summary:

Description:

Changeset 10745:9b84d1b570e3

Changeset 11444:8a1419dbbfa6

Diff: