O3CPU: Revive cachePorts per-cycle dcache access limit

Review Request #1872 - Created May 16, 2013 and updated

Information
Submitter: Erik Tomusk
Repository: gem5
Branch: default
Reviewers
Groups: Default
Changeset 9722:7026fe0f45b4
---------------------------
O3CPU: Revive cachePorts per-cycle dcache access limit
This is a stop-gap patch to place a limit on the number of dcache requests the
LSQUnit sends each cycle. Currently, the LSQUnit will send any number of
requests, leading to unrealistic dcache usage. Note that there is an LSQUnit
for each hardware thread, so the cachePorts limit is enforced on a per-thread
basis.

What this patch does NOT do:
* Limit icache accesses
* Limit dcache accesses from sources other than the LSQUnit (e.g. accesses from
  L2)

I'd like to refactor the second half of LSQUnit<Impl>::read(), as it's very
messy. It would be helpful to get feedback on whether what it does is
functionally correct before I do.

It would also be helpful if someone who understands split memory accesses
could check if that bit of code is correct, since I don't know how to test
it.
When cachePorts is set to 200 (the old value), this patch passes
ARM/tests/fast/long with the exception that the regression complains about
the new statistic.

Review request changed
Updated (June 4, 2013, 11:13 p.m.)

Description: the following paragraph was added.

  + It would also be helpful if someone who understands split memory accesses
  + could check if that bit of code is correct, since I don't know how to test
  + it.

Posted (June 6, 2013, 2:49 a.m.)



  
src/cpu/o3/lsq_unit.hh (Diff revision 2)
 
 
This implies that both halves of an unaligned access have to go on the same cycle... is my interpretation accurate, and if so, does this restriction make sense?  And if it does, why don't we check up front that we have two free ports before sending the first half?

Offhand, I don't see any reason for this restriction.  I can't find anything in the official docs, but some googling indicates that unaligned memory accesses are not guaranteed to be atomic, and sending both halves in the same cycle doesn't guarantee atomicity anyway (maybe if both halves are in the same cache line, but definitely not otherwise).
  1. It would in fact be much better if we sent it as two separate requests (from the memory system point of view). I am strongly in favour of not attempting to send two things in 0 time.
  2. It is sent as two separate requests; I don't believe the memory system can tell that they're part of a single unaligned access.  I agree that this is good.
    
    My interpretation of this code is that it is constraining these two requests to be sent on the same cycle, but this is rational in that they are sent over separate cache ports.  The constraint doesn't make sense to me though, particularly since this implies that an x86 core with only one cache port would probably deadlock if it issued an unaligned access.
  3. I agree too. I think the code is this way because unaligned support was hacked in. My impression is that most of the problem comes from not necessarily knowing, at decode time, the address being loaded or stored, so with the model we have of generating instructions it's not clear whether one or two instructions should be created.
  4. The problem is that there isn't an elegant way of enforcing a cache ports limit here without refactoring the code that handles split accesses.
    
    Is there consensus on whether split accesses are currently handled correctly or not? If not, is there a consensus on how they should be handled? If so, should those changes also be part of this patch?
  5. Sorry for the long delay, this one seemed to slip through...
    
    I don't know what the consensus is or how this sort of split access should be handled. I guess it's no worse than the previous code, so it doesn't need to be part of this patch.
    
    My biggest concern is that I can't really get my head around why the CPU needs to squash a bunch of instructions when one of the cache ports blocks. Did you ever look at Steve's comment above? Is it the case that the squashing only happens in the second case?
    
    
    Sorry again about forgetting about this patch for so long and thanks for posting it,
    
    Ali
    
  6. Sorry for the even longer delay. Things just kept coming up...
    
    Mitch Hayenga gave a good explanation of squashing on a blocked cache here: http://www.mail-archive.com/gem5-users@gem5.org/msg08525.html . Basically, a blocked cache breaks the instruction schedule. It seems reasonable that the same would apply when ports run out.
    
    Given this squashing behavior, I'm starting to doubt whether it's even possible to implement some sort of cache ports limit. Any data-intensive application would end up with a very low IPC from all the squashing. There must be a way this is done in the real world. Does anyone know what that is?
    
    -Erik
  7. [empty comment--bad email settings made last comment bounce back from gem5-dev]
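
The up-front check raised in the first comment of this thread (confirm that two ports are free before sending the first half of a split access) could look roughly like the sketch below. This is not the patch's code: `PortState`, `reservePorts`, and the field names are hypothetical, and it assumes the current constraint that both halves go out in the same cycle.

```cpp
#include <cassert>

// Hypothetical per-cycle port state; not the data structure used in the patch.
struct PortState
{
    unsigned cachePorts;  // ports available per cycle
    unsigned usedPorts;   // ports already consumed this cycle
};

// Reserve every port an access needs in one step. An aligned access needs one
// port, a split (unaligned) access needs two. Nothing is reserved on failure,
// so the first half of a split access is never sent without room for the
// second half.
inline bool reservePorts(PortState &s, bool is_split)
{
    const unsigned needed = is_split ? 2u : 1u;
    assert(s.usedPorts <= s.cachePorts);
    if (s.cachePorts - s.usedPorts < needed)
        return false;  // caller retries in a later cycle
    s.usedPorts += needed;
    return true;
}
```

With `cachePorts` set to 1, `reservePorts(s, true)` can never succeed, which is the single-port deadlock concern from reply 2 in concrete form.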
Posted (June 6, 2013, 3:10 a.m.)
Just a random thought. This patch is a good step in the right direction, but why don't we simply use a vector master port for the D side and cycle through the ports round robin (or pick whichever is free)?
  1. Wouldn't that mean that for every tick we'd have to keep track of not just how many ports have been used, but also which ones? And we might have to include some code next to every dcache access to do a search for unused ports.
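
To make the bookkeeping in question concrete, here is a minimal sketch of a round-robin picker that tracks which ports have been used in the current cycle. `PortPicker`, `pick()`, and `tick()` are hypothetical names for illustration only; they are not gem5's port API, and the sketch ignores how a real vector master port would be wired up.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical round-robin picker over a vector of data ports.
class PortPicker
{
  public:
    explicit PortPicker(std::size_t num_ports) : busy(num_ports, false)
    {
        assert(num_ports > 0);
    }

    // Start of cycle: every port is free again.
    void tick() { busy.assign(busy.size(), false); }

    // Returns the index of a free port, searching from the port after the
    // last one picked, or -1 if every port has been used this cycle.
    int pick()
    {
        const std::size_t n = busy.size();
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t idx = (next + i) % n;
            if (!busy[idx]) {
                busy[idx] = true;
                next = (idx + 1) % n;
                return static_cast<int>(idx);
            }
        }
        return -1;
    }

  private:
    std::vector<bool> busy;  // ports that have carried a request this cycle
    std::size_t next = 0;    // index where the next search starts
};
```

The search is linear in the number of ports on every access, which is the per-access cost the reply above is worried about; a plain counter avoids it at the price of not knowing which port was used.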
src/cpu/o3/O3CPU.py (Diff revision 2)
 
 
Data ports?
  1. I'm not sure either is better. "Cache ports" gets used in the literature, but also means something different in the gem5 context.
src/cpu/o3/lsq_unit.hh (Diff revision 2)
 
 
const unsigned int?
src/cpu/o3/lsq_unit.hh (Diff revision 2)
 
 
== rather than >=?
src/cpu/o3/lsq_unit.hh (Diff revision 2)
 
 
When is this ever decremented?

How do we link a decrement of this with getting things going again (or do we simply keep on trying and fail until the number is reduced)?
  1. It's reset in tick().
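
The reset-per-cycle scheme described in this reply can be sketched as follows. The class and member names are hypothetical stand-ins rather than the identifiers in the diff; the sketch only shows that the counter is never decremented, it is cleared once per cycle and a rejected request simply retries after the next reset.

```cpp
#include <cassert>

// Hypothetical stand-in for the LSQUnit's per-cycle port bookkeeping.
class PortBudget
{
  public:
    explicit PortBudget(unsigned cache_ports) : cachePorts(cache_ports) {}

    // Called once at the start of every cycle: all ports become free again.
    void tick() { usedPorts = 0; }

    // Called for each dcache request. Returns false once the per-cycle
    // budget is exhausted; the caller has to retry in a later cycle.
    bool tryUsePort()
    {
        assert(usedPorts <= cachePorts);
        if (usedPorts >= cachePorts)
            return false;
        ++usedPorts;
        return true;
    }

    unsigned used() const { return usedPorts; }

  private:
    const unsigned cachePorts;  // per-thread limit on requests per cycle
    unsigned usedPorts = 0;     // requests already sent this cycle
};
```

Because `usedPorts` only ever grows by one and is checked before each increment, `==` and `>=` behave the same here; `>=` is just the more defensive comparison.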
src/cpu/o3/lsq_unit_impl.hh (Diff revision 2)
 
 
\n at the end?
src/cpu/o3/lsq_unit_impl.hh (Diff revision 2)
 
 
Is the port not also "used" on a blocked request that is waiting for a retry?
  1. Practically, I don't think it matters--if the cache is blocked in a given cycle, nothing's going in or out regardless of the number of free ports. In terms of real hardware, I think it depends on what caused the cache to block. This patch tries to model the fact that there are only so many physical wires going into each memory cell, and once they've carried some data in a cycle, there's no more time to carry more. If the cache is blocked because of some buffering somewhere, then I guess the answer is no.