Review Board 2.0.15


mem: MSHR livelock bug fix

Review Request #2784 - Created May 11, 2015, last updated July 31, 2015

Information
Submitter: Tony Gutierrez
Repository: gem5
Branch: default
Reviewers
Groups: Default

Changeset 10841:b015addd7b9d
---------------------------
mem: Ruby response timing

This patch ensures that Ruby responses to the CPU core are not unnecessarily
delayed. The original code delays Ruby responses by a tick, causing the core
to receive them a cycle later, rather than in the same cycle. Hence, the
throughput of back-to-back stores that hit in the L1 is reduced by
half because the O3 must wait for the acknowledgement of a prior store before
issuing the next store. This patch eliminates the performance bug.

This patch was created by Bihn Pham during his internship at AMD.
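The timing effect described in the patch can be illustrated with a small discrete-event sketch. This is plain Python, not gem5 code: the event queue, the 1-tick L1 hit latency, and the rule that the core issues the next store only once the previous ACK is processed are all simplifying assumptions made for illustration.

```python
# Toy discrete-event model (not gem5) of back-to-back dependent stores.
# Each store hits in the L1 with an assumed 1-tick latency; the ACK may
# carry an extra tick of delay, as the pre-patch Ruby code added. The
# core issues the next store only when the previous ACK is processed.
import heapq

def last_store_tick(num_stores, extra_ack_delay):
    events = []          # min-heap of (tick, seq, action)
    seq = 0
    issued = 0
    tick = 0

    def schedule(when, action):
        nonlocal seq
        heapq.heappush(events, (when, seq, action))
        seq += 1

    def issue_store(now):
        nonlocal issued
        issued += 1
        if issued < num_stores:
            # ACK after the 1-tick hit latency plus the extra delay;
            # the next store issues when that ACK is processed.
            schedule(now + 1 + extra_ack_delay, issue_store)

    schedule(0, issue_store)
    while events:
        tick, _, action = heapq.heappop(events)
        action(tick)
    return tick          # tick at which the final store issued
```

With `extra_ack_delay=1` (the old behaviour) each store costs 2 ticks; with `extra_ack_delay=0` (the patched behaviour) it costs 1, which is the halved-throughput effect the description refers to.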


   
Review request changed
Updated (July 31, 2015, 8:17 a.m.)

Description:

  ~ Changeset 10841:b015addd7b9d
  ~ Changeset 10841:b015addd7b9d
  + ---------------------------
  + mem: Ruby response timing

  ~ mem: MSHR livelock bug fix
  ~ This patch ensures that Ruby responses to the CPU core are not unnecessarily
  + delayed. The original code delays Ruby responses by a tick, causing the core
  + to receive them a cycle later, rather than in the same cycle. Hence, the
  + throughput of back-to-back stores that hit in the L1 is reduced by
  + half because the O3 must wait for the acknowledgement of a prior store before
  + issuing the next store. This patch eliminates the performance bug.

    This patch was created by Bihn Pham during his internship at AMD.

  - This bug fix prevents a case in which a prefetcher uses up all remaining MSHR
  - entries before demand requests get a chance to, causing a livelock.
  - This happens because events scheduled at curTick() + 1 are evaluated in the
  - next cycle, not in the current cycle.

  - A specific case that caused this livelock situation is the following:
  - There are back-to-back stores and the second store cannot be sent to the cache
  - until the first store receives an ACK. When the ACK is scheduled at curTick() +
  - 1, meaning that the ACK is to be sent in the next cycle, there is an open MSHR
  - entry in the current cycle. A prefetcher grabs the entry by issuing a prefetch
  - request in the current cycle before the second store gets a chance to issue in
  - the next cycle. The second store stalls because the MSHR is already full by
  - that time.
Posted (July 31, 2015, 9:37 a.m.)

This is dangerous.

The whole idea is that we do not send things in 0 time (infinite throughput). Admittedly the +1 is a poor-man's version of a delta-delay, but I fear this interacts with a lot of things. What is the impact on (classic) cache performance, the other CPUs, etc.?
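For reference, a delta-delay orders an event after the current one without advancing simulated time, which is what the +1 tick approximates. A minimal sketch in plain Python, in the spirit of SystemC/VHDL delta cycles; `DeltaQueue` and its methods are hypothetical illustrations, not gem5's actual event queue API.

```python
import heapq

class DeltaQueue:
    """Event queue ordered by (tick, delta): an event scheduled "now"
    runs at the same simulated tick but in the next delta phase, so it
    is ordered after the current event without consuming a tick."""

    def __init__(self):
        self._heap = []
        self._seq = 0            # tie-breaker for stable ordering
        self.now = (0, 0)        # current (tick, delta)

    def schedule(self, action, tick=None):
        # tick=None means "now + one delta": same tick, next delta.
        when = (tick, 0) if tick is not None else (self.now[0], self.now[1] + 1)
        heapq.heappush(self._heap, (when, self._seq, action))
        self._seq += 1

    def run(self):
        while self._heap:
            when, _, action = heapq.heappop(self._heap)
            self.now = when
            action(self)

def respond(eq):
    log.append(("resp", eq.now))

def request(eq):
    log.append(("req", eq.now))
    eq.schedule(respond)         # "now": same tick, next delta

# A request at tick 5 schedules its response "now": the response is
# delivered in the same simulated tick (5), just one delta later.
log = []
q = DeltaQueue()
q.schedule(request, tick=5)
q.run()
# log: req at (5, 0), resp at (5, 1) -- same tick, still ordered
```

Unlike the +1-tick workaround, the response here costs zero simulated time, so throughput is not halved, yet causality between the two events is preserved.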