O3CPU: Revive cachePorts per-cycle dcache access limit

Information
Submitter:	Erik Tomusk
Repository:	gem5
Branch:	default
Bugs:
Depends On:
Reviewers
Groups:	Default
People:

Description

Changeset 9722:7026fe0f45b4
---------------------------
O3CPU: Revive cachePorts per-cycle dcache access limit
This is a stop-gap patch to place a limit on the number of dcache requests the
LSQUnit sends each cycle. Currently, the LSQUnit will send any number of
requests, leading to unrealistic dcache usage. Note that there is an LSQUnit
for each hardware thread, so the cachePorts limit is enforced on a per-thread
basis.

What this patch does NOT do:
*Limit icache accesses
*Limit dcache accesses from sources other than the LSQUnit (e.g. accesses from
L2)

I'd like to refactor the second half of LSQUnit<Impl>::read(), as it's very
messy. It would be helpful to get feedback on whether what it does is
functionally correct before I do.

It would also be helpful if someone who understands split memory accesses
could check if that bit of code is correct, since I don't know how to test
it.

Testing Done

When cachePorts is set to 200 (the old value), this patch passes
ARM/tests/fast/long with the exception that the regression complains about
the new statistic.

Issue Summary

Description	From	Last Updated	Status
Can this just made to be unsigned so <0 isn't possible?	Ali Saidi	June 4, 2013, 10:52 p.m.	Resolved
Data ports?	Andreas Hansson	June 6, 2013, 3:10 a.m.	Open
const unsigned int?	Andreas Hansson	June 6, 2013, 3:10 a.m.	Open
== rather than >=?	Andreas Hansson	June 6, 2013, 3:10 a.m.	Open
When is this ever decremented? How do we link a decrement of this with getting things going again (or do ...	Andreas Hansson	June 6, 2013, 3:10 a.m.	Open
\n at the end?	Andreas Hansson	June 6, 2013, 3:10 a.m.	Open
Is the port not also "used" on a blocked request that is waiting for a retry?	Andreas Hansson	June 6, 2013, 3:10 a.m.	Open

Thanks for this Erik,

My only issue with the way this is handled, is if the CPU runs out of cache ports, I'm pretty sure it will squash the entire pipeline and start re-fetching, which doesn't seem right to me. Anyone else?

Thanks,
Ali

Erik Tomusk June 4, 2013, 11:13 p.m. (June 4, 2013, 11:13 p.m.)

Looking at a trace, I think you're right--when the cache blocks, DefaultIEW<Impl>::executeInsts() calls squashDueToMemBlocked(). I don't quite understand why instructions should get squashed on a blocked cache, but I think it makes sense to do the same thing for both a blocked cache and when the ports limit is reached.

Steve Reinhardt June 6, 2013, 1:09 a.m. (June 6, 2013, 1:09 a.m.)

I haven't looked at the code in detail, but it seems like there are two different situations here:
1. When the LSQ knows ahead of time that it can't send any (more) accesses to the dcache
2. When the LSQ finds out after sending an access to the dcache that the access won't succeed

In case #1, there's no reason for any squashing, you just don't send the access to the cache in the first place.  This should apply both to the port limits (if you have 2 ports, just send 2 accesses, don't try to send a 3rd) and to the case where the cache is already blocked at the beginning of a cycle and you're waiting for a retry.

Case #2 is the only one where you have a squash, I think, and I believe the only scenario where it applies is when you first find out that the cache is blocked... since you find out by trying to send an access and getting a 'false' response.

src/cpu/o3/lsq_unit_impl.hh (Diff revision 1)

Can this just made to be unsigned so <0 isn't possible?

Show all issues

Description:

~		Changeset 9722:b5b229a8d919
	~	Changeset 9722:7026fe0f45b4

		O3CPU: Revive cachePorts per-cycle dcache access limit
		This is a stop-gap patch to place a limit on the number of dcache requests the
		LSQUnit sends each cycle. Currently, the LSQUnit will send any number of
		requests, leading to unrealistic dcache usage. Note that there is an LSQUnit
		for each hardware thread, so the cachePorts limit is enforced on a per-thread
		basis.

		What this patch does NOT do:
		Limit icache accesses
		Limit dcache accesses from sources other than the LSQUnit (e.g. accesses from
		L2)

		I'd like to refactor the second half of LSQUnit<Impl>::read(), as it's very
		messy. It would be helpful to get feedback on whether what it does is
		functionally correct before I do.
-
-		It would also be helpful if someone who understands split memory accesses
-		could check if that bit of code is correct, since I don't know how to test
-		it.

Diff:

Revision 2 (+52 -19)

Show changes

	src/cpu/o3/O3CPU.py
	src/cpu/o3/iew_impl.hh
	src/cpu/o3/lsq_unit.hh
	src/cpu/o3/lsq_unit_impl.hh

Description:

		Changeset 9722:7026fe0f45b4

		O3CPU: Revive cachePorts per-cycle dcache access limit
		This is a stop-gap patch to place a limit on the number of dcache requests the
		LSQUnit sends each cycle. Currently, the LSQUnit will send any number of
		requests, leading to unrealistic dcache usage. Note that there is an LSQUnit
		for each hardware thread, so the cachePorts limit is enforced on a per-thread
		basis.

		What this patch does NOT do:
		Limit icache accesses
		Limit dcache accesses from sources other than the LSQUnit (e.g. accesses from
		L2)

		I'd like to refactor the second half of LSQUnit<Impl>::read(), as it's very
		messy. It would be helpful to get feedback on whether what it does is
		functionally correct before I do.
	+
	+	It would also be helpful if someone who understands split memory accesses
	+	could check if that bit of code is correct, since I don't know how to test
	+	it.

src/cpu/o3/lsq_unit.hh (Diff revision 2)

This implies that both halves of an unaligned access have to go on the same cycle... is my interpretation accurate, and if so, does this restriction make sense?  And if it does, why don't we check up front that we have two free ports before sending the first half?

Offhand, I don't see any reason for this restriction.  I can't find anything in the official docs, but some googling indicates that unaligned memory accesses are not guaranteed to be atomic, and sending both halves in the same cycle doesn't guarantee atomicity anyway (maybe if both halves are in the same cache line, but definitely not otherwise).

Andreas Hansson June 6, 2013, 2:56 a.m. (June 6, 2013, 2:56 a.m.)

It would in fact be much better if we sent it as two separate requests (from the memory system point of view). I am strongly in favour of not attempting to send two things in 0 time.

Steve Reinhardt June 6, 2013, 3:23 a.m. (June 6, 2013, 3:23 a.m.)

It is sent as two separate requests; I don't believe the memory system can tell that they're part of a single unaligned access.  I agree that this is good.

My interpretation of this code is that it is constraining these two requests to be sent on the same cycle, but this is rational in that they are sent over separate cache ports.  The constraint doesn't make sense to me though, particularly since this implies that an x86 core with only one cache port would probably deadlock if it issued an unaligned access.

Ali Saidi June 6, 2013, 4:19 a.m. (June 6, 2013, 4:19 a.m.)

I agree too. I think the reason the code is this way was unaligned support was hacked in. My impression is that most of the reason for the problem is that at decode time you might not know the address that is being loaded/stored so with the model we have of generating instructions it's not clear if one or two instructions should be created.

Erik Tomusk June 11, 2013, 2:49 a.m. (June 11, 2013, 2:49 a.m.)

The problem is that there isn't an elegant way of enforcing a cache ports limit here without refactoring the code that handles split accesses.

Is there consensus on whether split accesses are currently handled correctly or not? If not, is there a consensus on how they should be handled? If so, should those changes also be part of this patch?

Ali Saidi July 22, 2013, 8:38 p.m. (July 22, 2013, 8:38 p.m.)

Sorry for the long delay, this one seemed to slip through...

I don't know what the consensus is or how these sort of split accesses should be handled. I guess it's no worse than the previous code, so I doesn't need to be part of this patch. 

My biggest concern is I can't really get my head around why the CPU needs to squash a bunch of instructions when one of the cache ports blocks. Did you ever look at Steve's comment above. Is it the case that the squashing only happens in the second case?


Sorry again about forgetting about this patch for so long and thanks for posting it,

Ali

Erik Tomusk Nov. 26, 2013, 6:06 a.m. (Nov. 26, 2013, 6:06 a.m.)

Sorry for the even longer delay. Things just kept coming up...

Mitch Hayenga gave a good explanation to squashing on a blocked cache here: http://www.mail-archive.com/gem5-users@gem5.org/msg08525.html . Basically, a blocked cache breaks the instruction schedule. It seems reasonable that the same would apply when ports run out.

Given this squashing behavior, I'm starting to doubt whether it's even possible to implement some sort of cache ports limit. Any data-intensive application would end up with a very low IPC from all the squashing. There must be a way this is done in the real world. Does anyone know what that is?

-Erik

Erik Tomusk Nov. 26, 2013, 6:14 a.m. (Nov. 26, 2013, 6:14 a.m.)

[empty comment--bad email settings made last comment bounce back from gem5-dev]

Just a random thought. This patch is a good step in the right direction, but why don't we simply use a vector master port for the D side and cycle through them round robin (or pick which ever is free)?

Erik Tomusk June 11, 2013, 2:49 a.m. (June 11, 2013, 2:49 a.m.)

Wouldn't that mean that for every tick we'd have to keep track of not just how many ports have been used, but also which ones? And we might have to include some code next to every dcache access to do a search for unused ports.

src/cpu/o3/O3CPU.py (Diff revision 2)

Data ports?

Show all issues

Erik Tomusk June 11, 2013, 2:48 a.m. (June 11, 2013, 2:48 a.m.)

I'm not sure either is better. "Cache ports" gets used in the literature, but also means something different in the gem5 context.

src/cpu/o3/lsq_unit.hh (Diff revision 2)

const unsigned int?

Show all issues

src/cpu/o3/lsq_unit.hh (Diff revision 2)

== rather than >=?

Show all issues

src/cpu/o3/lsq_unit.hh (Diff revision 2)

When is this ever decremented?

How do we link a decrement of this with getting things going again (or do we simply keep on trying and fail until the number is reduced)?

Show all issues

Erik Tomusk June 11, 2013, 2:48 a.m. (June 11, 2013, 2:48 a.m.)
```
It's reset in tick().
```

src/cpu/o3/lsq_unit_impl.hh (Diff revision 2)

\n at the end?

Show all issues

src/cpu/o3/lsq_unit_impl.hh (Diff revision 2)

Is the port not also "used" on a blocked request that is waiting for a retry?

Show all issues

Erik Tomusk June 11, 2013, 2:48 a.m. (June 11, 2013, 2:48 a.m.)

Practically, I don't think it matters--if the cache is blocked in a given cycle, nothing's going in or out regardless of the number of free ports. In terms of real hardware, I think it depends on what caused the cache to block. This patch tries to model the fact that there are only so many physical wires going into each memory cell, and once they've carried some data in a cycle, there's no more time to carry more. If the cache is blocked because of some buffering somewhere, then I guess the answer is no.

You have a pending review.

Review Board 2.0.15

Screenshots

Files

Issue Summary

Description:

Changeset 9722:b5b229a8d919

Changeset 9722:7026fe0f45b4

Diff:

Description:

Changeset 9722:7026fe0f45b4