BranchPred: Take the branch predictor out of O3CPU and make it a stand-alone

Information
Submitter:	Timothy Jones
Repository:	gem5
Branch:
Bugs:
Depends On:
Reviewers
Groups:	Default
People:

Description

BranchPred: Take the branch predictor out of O3CPU and make it a stand-alone
SimObject.  This then allows the same branch predictor to be shared amongst
several CPUs.

This patch is unfinished.  I would like to take the branch predictor out of
the inorder CPU as well, but want comments on whether this is the best
approach to take first.

Testing Done

Overall I'm in favor of what you're doing here... I think most of my comments are more about general cleanup of the bpred object unrelated to your relocation of it.  But I've never really looked closely at the code before and it seems like this is a good opportunity to clean it up.

As far as I can tell, the templating is only used to provide a MaxThreads value.  I'd like to get rid of the template entirely (and the _impl.hh function and the empty base class), just pass max threads in as a parameter, and dynamically allocate the couple of arrays that depend on it.

src/cpu/o3/fetch_impl.hh (Diff revision 1)

Why do we need to clear the state here?  Do we need this function at all?

src/cpu/o3/fetch_impl.hh (Diff revision 1)

I don't see the point in this function either... I'd just get rid of the empty function and this call.

src/cpu/pred/bpred_unit.hh (Diff revision 1)

It doesn't look like this function or BTBLookup or BTBUpdate get called anywhere... let's get rid of them.  Even if they do get called, I don't see the point of wrapping the indirection.

src/cpu/pred/builder.cc (Diff revision 1)

I think this function should go in bpred_unit.cc and this file should go away.

OK, looking a little closer I see that Impl is also used to get the DynInst type... but many of the methods called on DynInst (isUncondControl(), isCall(), isReturn()) are also StaticInst methods, and some of the others (like readPC()) are just used for DPRINTFs (and how is readPC() different from the PC that's passed in?).  The only exception I see is seqNum.

I would suggest passing in a StaticInstPtr instead of a DynInstPtr, and add another argument for seqNum. This makes the predictor much more independent of the CPU model.  Then I'm pretty sure you can get rid of the template.

I actually got started down this path looking at how you had to add an include of static_inst.hh for the warm() method, and wondering how the bpred got away without knowing what a static inst was prior to this...

Timothy Jones July 11, 2010, 4:15 a.m. (July 11, 2010, 4:15 a.m.)

Ok, I can have a go at simpifying the branch predictor like this.  I don't think that should be too much work and would probably help when it comes to integration with the in-order CPU too.

I'll also get rid of those unnecessary calls and things you've mentioned above.

src/cpu/pred/bpred_unit_impl.hh (Diff revision 1)

Shouldnt this *not* be hardcoded to 16, but instead the true RAS size?

Timothy Jones July 11, 2010, 4:09 a.m. (July 11, 2010, 4:09 a.m.)
```
Seems reasonable to me, I'll fix this.
```

src/cpu/pred/bpred_unit_impl.hh (Diff revision 1)

What happens if the RAS is full? is it right to just overwrite the next entry?

Timothy Jones July 11, 2010, 4:10 a.m. (July 11, 2010, 4:10 a.m.)

Presumably that's what a real processor would do, otherwise the depth of the call stack would be limited?

src/cpu/pred/bpred_unit_impl.hh (Diff revision 1)

If you miss on a BTB lookup and you've predicted taken, should you not instead stall (or have the option to stall) the CPU until branch resolution???

(I've got a patch for this option in another tree if people think that's useful)

Timothy Jones July 11, 2010, 4:11 a.m. (July 11, 2010, 4:11 a.m.)

It would be a good thing to have the option of doing that, in my opinion.

src/cpu/pred/bpred_unit_impl.hh (Diff revision 1)

I'm not a fan of this localBP vs tournamentBP comparison we do everytime we need to lookup or update the predictor...

Why isnt there a base predictor class that the localBP and the tournamentBP derive from?

Eventually if someone wanted to substitute the predictor or implement something else (e.g. no predictor at all and always predict false), all you would need to do is pass that as a SimObject(?) to the branch prediction unit and it would be plug-n-play.

Timothy Jones July 11, 2010, 4:12 a.m. (July 11, 2010, 4:12 a.m.)

I agree, and I can have a go at making the branch predictor like that, if you like?

src/cpu/pred/bpred_unit_impl.hh (Diff revision 1)

What if there is no valid entry at the top here? (the call stack is greater than RAS size)...

In the later comment, I discuss options for this...

src/cpu/pred/bpred_unit_impl.hh (Diff revision 1)

But isnt the depth of a stack limited by the size of the RAS anyway?

Say we have 16 entries, then the call stack is limited by 16 no?

I have 2 concerns here (and for surrounding RAS code):
(1) If the RAS does overflow and we want to just keep overwriting, then are we always overwriting the to the right entry?

(2) If there is no valid RAS entry (in the case of the call stack being greater than the RAS size), then the branch prediction needs to either return false or simply stall until resolution (similar to the BTB concern I had)

Steve Reinhardt July 11, 2010, 6:45 a.m. (July 11, 2010, 6:45 a.m.)

Hi Korey... this is the way RAS's typically work.  If your call stack is deeper than your RAS size, you obviously want to keep the most recent entries rather than the older ones, so you just let it wrap around an overwrite the oldest entries.  Once you pop off all the valid entries, you will get bad predictions from re-using the overwritten slots, but the conventional wisdom is that the penalty of doing that (both in terms of the frequency of it happening and the cost of each occurrence) is low enough that it's not worth trying to track when it happens.  The conventional wisdom may or may not be correct, but it is what it is...

Diff:

Revision 2 (+729 -1034)

Show changes

	src/cpu/o3/O3CPU.py
	src/cpu/o3/SConscript
	src/cpu/o3/bpred_unit.hh
	src/cpu/o3/bpred_unit.cc
	src/cpu/o3/bpred_unit_impl.hh
	src/cpu/o3/cpu_builder.cc
	src/cpu/o3/cpu_policy.hh
	src/cpu/o3/fetch.hh
	12 more

With this new revision I have removed the template from the branch predictor and fixed it to use StaticInstPtr instead of a dynamic one.  I've removed the functions that were never called and made a couple of the other fixes that were mentioned in the reviews.  I've also taken the local and tournament predictors and made them inherited classes of the predictor that override key functions.

If everyone is happy with this, I'll work on removing the code from the in-order CPU and making it use this predictor instead.

Hi Tim,

Sorry for the long delay on this... this looks great to me, with one minor issue: I see you changed the lookup/update method names to BPLookup/BPUpdate etc.  The lower-case names are the ones we want to keep though, since they follow the M5 style (http://m5sim.org/wiki/index.php/Coding_Style#Naming); the BPFoo names were violating the style to begin with.

I think there were two different interfaces, an external one to the CPU and a more internal one, with different names, and since you are getting rid of that layer of indirection you had to pick one, and you just happened to pick the worse one instead of the better one :-).

Steve

Timothy Jones July 29, 2010, 2:30 a.m. (July 29, 2010, 2:30 a.m.)

Ah, typical!  I'll fix this when I've integrated the in-order changes too.

You have a pending review.

Review Board 2.0.15

This change has been marked as submitted.

Screenshots

Files

Diff:

Status: Closed (submitted)