cpu: Add TraceCPU to playback elastic traces

Information
Submitter:	Curtis Dunham
Repository:	gem5
Branch:	default
Bugs:
Depends On:
Reviewers
Groups:	Default
People:

Description

This patch defines a TraceCPU that replays trace generated using the elastic
trace probe attached to the O3 CPU model. The elastic trace is an execution
trace with data dependencies and ordering dependencies annoted to it. It also
replays fixed timestamp instruction fetch trace that is also generated by the
elastic trace probe.

The TraceCPU inherits from BaseCPU as a result of which some methods need
to be defined. It has two port subclasses inherited from MasterPort for
instruction and data ports. It issues the memory requests deducing the
timing from the trace and without performing real execution of micro-ops.
As soon as the last dependency for an instruction is complete,
its computational delay, also provided in the input trace is added. The
dependency-free nodes are maintained in a list, called 'ReadyList',
ordered by ready time. Instructions which depend on load stall until the
responses for read requests are received thus achieving elastic replay. If
the dependency is not found when adding a new node, it is assumed complete.
Thus, if this node is found to be completely dependency-free its issue time is
calculated and it is added to the ready list immediately. This is encapsulated
in the subclass ElasticDataGen.

If ready nodes are issued in an unconstrained way there can be more nodes
outstanding which results in divergence in timing compared to the O3CPU.
Therefore, the Trace CPU also models hardware resources. A sub-class to model
hardware resources is added which contains the maximum sizes of load buffer,
store buffer and ROB. If resources are not available, the node is not issued.
The 'depFreeQueue' structure holds nodes that are pending issue.

Modeling the ROB size in the Trace CPU as a resource limitation is arguably the
most important parameter of all resources. The ROB occupancy is estimated using
the newly added field 'robNum'. We need to use ROB number as sequence number is
at times much higher due to squashing and trace replay is focused on correct
path modeling.

A map called 'inFlightNodes' is added to track nodes that are not only in
the readyList but also load nodes that are executed (and thus removed from
readyList) but are not complete. ReadyList handles what and when to execute
next node while the inFlightNodes is used for resource modelling. The oldest
ROB number is updated when any node occupies the ROB or when an entry in the
ROB is released. The ROB occupancy is equal to the difference in the ROB number
of the newly dependency-free node and the oldest ROB number in flight.

If no node dependends on a non load/store node then there is no reason to track
it in the dependency graph. We filter out such nodes but count them and add a
weight field to the subsequent node that we do include in the trace. The weight
field is used to model ROB occupancy during replay.

The depFreeQueue is chosen to be FIFO so that child nodes which are in
program order get pushed into it in that order and thus issued in the in
program order, like in the O3CPU. This is also why the dependents is made a
sequential container, std::set to std::vector. We only check head of the
depFreeQueue as nodes are issued in order and blocking on head models that
better than looping the entire queue. An alternative choice would be to inspect
top N pending nodes where N is the issue-width. This is left for future as the
timing correlation looks good as it is.

At the start of an execution event, first we attempt to issue such pending
nodes by checking if appropriate resources have become available. If yes, we
compute the execute tick with respect to the time then. Then we proceed to
complete nodes from the readyList.

When a read response is received, sometimes a dependency on it that was
supposed to be released when it was issued is still not released. This occurs
because the dependent gets added to the graph after the read was sent. So the
check is made less strict and the dependency is marked complete on read
response instead of insisting that it should have been removed on read sent.

There is a check for requests spanning two cache lines as this condition
triggers an assert fail in the L1 cache. If it does then truncate the size
to access only until the end of that line and ignore the remainder.
Strictly-ordered requests are skipped and the dependencies on such requests
are handled by simply marking them complete immediately.

The simulated seconds can be calculated as the difference between the
final_tick stat and the tickOffset stat. A CountedExitEvent that contains
a static int belonging to the Trace CPU class as a down counter is used to
implement multi Trace CPU simulation exit.

Testing Done

Issue Summary

Description	From	Last Updated	Status
Return type: const std::string &	Nilay Vaish	Nov. 5, 2015, 2:38 p.m.	Resolved
Stats::Formula instead of Stats::Scalar.	Nilay Vaish	Nov. 5, 2015, 1:46 p.m.	Resolved
Would it make sense to use either a heap or an orderded map? Assuming that you only do insertions and ...	Nilay Vaish	Dec. 6, 2015, 10:52 a.m.	Dropped
Why call a function for stores and handle loads in place when both need same amount of processing?	Nilay Vaish	Nov. 5, 2015, 1:48 p.m.	Resolved
I read the code on how curr and next elements are being set and used. If my understanding is correct, ...	Nilay Vaish	Nov. 5, 2015, 1:49 p.m.	Resolved
I think the two functions should behave in the same way. That is, both of them should return false in ...	Nilay Vaish	Nov. 5, 2015, 1:49 p.m.	Resolved
Reference	Nilay Vaish	Nov. 20, 2015, 6:36 a.m.	Resolved
const ref	Nilay Vaish	Nov. 20, 2015, 6:36 a.m.	Resolved
nullptr	Nilay Vaish	Nov. 20, 2015, 6:36 a.m.	Resolved
BaseCPU::init();	Nilay Vaish	Nov. 20, 2015, 6:36 a.m.	Resolved
Why is every cpu creating and scheduling this event? We should need only one. I am not sure this would ...	Nilay Vaish	Nov. 6, 2015, 6:20 a.m.	Dropped
Brackets around DTRACE call.	Nilay Vaish	Nov. 20, 2015, 6:36 a.m.	Resolved
nullptr	Nilay Vaish	Nov. 20, 2015, 6:36 a.m.	Resolved
I don't think this is correct.	Nilay Vaish	Dec. 4, 2015, 2:43 a.m.	Resolved
space around /	Stephan Diestelhorst	Nov. 20, 2015, 6:35 a.m.	Resolved
I think trace.read(&currElement) would make it more obvious that currElement is an output operand here. (Other options would be currElement ...	Stephan Diestelhorst	Nov. 20, 2015, 6:36 a.m.	Resolved

I am yet to read TraceCPU.cc file.

src/cpu/trace/trace_cpu.hh (Diff revision 1)

I would either return 0 or a -ve value.

Radhika Jagtap Nov. 5, 2015, 11:04 a.m. (Nov. 5, 2015, 11:04 a.m.)

The patch will be updated soon, just adding the comments now. Fixed this.

src/cpu/trace/trace_cpu.hh (Diff revision 1)

So you plan to inherit from TraceCPU.

src/cpu/trace/trace_cpu.hh (Diff revision 1)

Return type: const std::string &

Show all issues

Radhika Jagtap Nov. 5, 2015, 11:04 a.m. (Nov. 5, 2015, 11:04 a.m.)
```
Fixing.
```

src/cpu/trace/trace_cpu.hh (Diff revision 1)

Stats::Formula instead of Stats::Scalar.

Show all issues

Radhika Jagtap Nov. 5, 2015, 11:04 a.m. (Nov. 5, 2015, 11:04 a.m.)
```
Good suggestion, fixed.
```

src/cpu/trace/trace_cpu.cc (Diff revision 1)

Would it make sense to use either a heap or an orderded map? Assuming that you only do insertions and deletions, and the trace is long running, so on average there are n nodes in the list, and you make m insertions and m deletions over the trace length and m >> n.
Then, the list would require O(mn) time for execution, while heap or ordered map would require O(m log n) time. Had you been using a vector instead of list, then cache hits due to contiguity can help, but both list and map/heap would typically miss in the cache, so I don't seen why we should be using a list.

Show all issues

Nilay Vaish Sept. 20, 2015, 11:10 a.m. (Sept. 20, 2015, 11:10 a.m.)

If I was writing this code, I would probably measure the time with vector, list, map and heap on some long running trace and see which one works the best. Just from the complexity analysis, I would probably go for a heap.

Radhika Jagtap Nov. 5, 2015, 11:05 a.m. (Nov. 5, 2015, 11:05 a.m.)

Not fixing this at this time. Thanks for the suggestion though - a definite to do the next time I update this.

Nilay Vaish Nov. 6, 2015, 6:28 a.m. (Nov. 6, 2015, 6:28 a.m.)

Mail me all the patches related to trace cpu with the information on the order
in which these have to applied. Also mail a trace that runs for about 10 minutes
on the TraceCPU. I'll test the performance of current implementation with one
using a heap. I'll report this to you by next Friday.

Radhika Jagtap Nov. 20, 2015, 1:58 a.m. (Nov. 20, 2015, 1:58 a.m.)

Thanks for the offer Nilay. I'll look into this later. The base functionality is a lot of code so I'd appreciate it being committed and as we are actively using it, I'll look into improving performance.

src/cpu/trace/trace_cpu.cc (Diff revision 1)

Strange name.

src/cpu/trace/trace_cpu.cc (Diff revision 1)

Why call a function for stores and handle loads in place when both need same amount of processing?

Show all issues

Radhika Jagtap Nov. 5, 2015, 11:05 a.m. (Nov. 5, 2015, 11:05 a.m.)

I factored out releaseStoreBuffer() because it is not only used here but also in completeMemAccess().

src/cpu/trace/trace_cpu.cc (Diff revision 1)

I read the code on how curr and next elements are being set and used. If my understanding is correct, I think we need only one of the variables. I the only line which makes use of nextElement is line 1023. Instead, we can read the trace into currElement all the time. Record the tick value before calling nextExecute() on line 1022 and use the recorded tick value in line 1023. This will avoid all the coping from next to curr.

Show all issues

Radhika Jagtap Nov. 5, 2015, 11:05 a.m. (Nov. 5, 2015, 11:05 a.m.)

Well spotted, re-wrote this and cleanup a bit more around this. Thanks!

src/cpu/trace/trace_cpu.cc (Diff revision 1)

I would really prefer if we have braces here. Had it been just the line: retryPkt = pkt, I would not have asked for braces.

Radhika Jagtap Nov. 5, 2015, 11:05 a.m. (Nov. 5, 2015, 11:05 a.m.)
```
Fixing.
```

src/cpu/trace/trace_cpu.cc (Diff revision 1)

I think the two functions should behave in the same way. That is, both of them should return false in case no dependency was found.  The function removeDepOnInst() should check the return value from removeRegDep().

Show all issues

Radhika Jagtap Nov. 5, 2015, 11:05 a.m. (Nov. 5, 2015, 11:05 a.m.)
```
Okay, minor issue but did fix it.
```

Diff:

Revision 2 (+2605)

Show changes

	src/cpu/trace/SConscript
	src/cpu/trace/TraceCPU.py
	src/cpu/trace/trace_cpu.hh
	src/cpu/trace/trace_cpu.cc

src/cpu/trace/trace_cpu.hh (Diff revision 2)

Reference

Show all issues

Radhika Jagtap Nov. 20, 2015, 1:56 a.m. (Nov. 20, 2015, 1:56 a.m.)
```
Done.
```

src/cpu/trace/trace_cpu.hh (Diff revision 2)

Why not Tick as return type?

Radhika Jagtap Nov. 20, 2015, 1:56 a.m. (Nov. 20, 2015, 1:56 a.m.)

Same as in elastic trace probe - I chose an int type for the delay instead of uint because I want to assert that it is positive.

src/cpu/trace/trace_cpu.hh (Diff revision 2)

const ref

Show all issues

Radhika Jagtap Nov. 20, 2015, 1:56 a.m. (Nov. 20, 2015, 1:56 a.m.)
```
Done.
```

src/cpu/trace/trace_cpu.hh (Diff revision 2)

I think we should just pass the number of cpus as a parameter to the constructor.

src/cpu/trace/trace_cpu.cc (Diff revision 2)

nullptr

Show all issues

Radhika Jagtap Nov. 20, 2015, 1:56 a.m. (Nov. 20, 2015, 1:56 a.m.)
```
Done.
```

src/cpu/trace/trace_cpu.cc (Diff revision 2)

BaseCPU::init();

Show all issues

Radhika Jagtap Nov. 20, 2015, 1:56 a.m. (Nov. 20, 2015, 1:56 a.m.)
```
Done.
```

src/cpu/trace/trace_cpu.cc (Diff revision 2)

Why is every cpu creating and scheduling this event? We should need only one. I am not sure this would work if we have multiple such events.

Show all issues

src/cpu/trace/trace_cpu.cc (Diff revision 2)

There is no dependency here, but I would like this function call to happen first.

Radhika Jagtap Nov. 20, 2015, 1:56 a.m. (Nov. 20, 2015, 1:56 a.m.)
```
Done.
```

src/cpu/trace/trace_cpu.cc (Diff revision 2)

Brackets around DTRACE call.

Show all issues

Radhika Jagtap Nov. 20, 2015, 1:56 a.m. (Nov. 20, 2015, 1:56 a.m.)
```
Done.
```

src/cpu/trace/trace_cpu.cc (Diff revision 2)

nullptr

Show all issues

Radhika Jagtap Nov. 20, 2015, 1:56 a.m. (Nov. 20, 2015, 1:56 a.m.)
```
Done.
```

src/cpu/trace/trace_cpu.cc (Diff revision 2)

I don't think this is correct.

Show all issues

Stephan Diestelhorst Nov. 6, 2015, 2:15 a.m. (Nov. 6, 2015, 2:15 a.m.)

Radhika, maybe add a comment that nextExecute() will update currElement?

Nilay Vaish Nov. 6, 2015, 6:24 a.m. (Nov. 6, 2015, 6:24 a.m.)

delta >= 0 && currElement.tick < last_tick is possible.

I think the code should be:

  assert(currElement.tick >= last_tick);

  delta = currElement.tick - last_tick;

Radhika Jagtap Nov. 20, 2015, 1:56 a.m. (Nov. 20, 2015, 1:56 a.m.)

"delta >= 0 && currElement.tick < last_tick is possible." How?
Timestamps in the traces are increasing. When nextExecute is called and there is a valid element read from the file, currElement.tick is updated to a value >= last_tick and nextExecute call returns true. If end of file is reached, nextExecute returns false.

Nilay Vaish Nov. 24, 2015, 1:42 p.m. (Nov. 24, 2015, 1:42 p.m.)

I am quoting from the page: http://en.cppreference.com/w/cpp/language/operator_arithmetic
Unsigned integer arithmetic is always performed modulo 2^n where n is the number of bits in that particular integer. E.g. for unsigned int, adding one to UINT_MAX gives 0, and subtracting one from 0 gives UINT_MAX.

Here is an example that in my understanding would lead to delta >= 0 and currElement.tick < last_tick.

currElement.tick = 0, last_tick = 2^63+1.

delta = currElement.tick - last_tick;

Then, delta = (-2^63-1) % 2^64 = 2^63-1. 
I would prefer that delta is also unsigned type and we assert that currElement.tick >= last_tick.

Radhika Jagtap Dec. 1, 2015, 9:56 a.m. (Dec. 1, 2015, 9:56 a.m.)
```
Ok, I'm fixing this too.
```

src/cpu/trace/trace_cpu.cc (Diff revision 2)

What's happening here?

Radhika Jagtap Nov. 20, 2015, 1:56 a.m. (Nov. 20, 2015, 1:56 a.m.)
```
Assigning a dummy value to the data field of the request.
```

src/cpu/trace/trace_cpu.cc (Diff revision 2)

These N events all decrement the same /reference/ to the per-class static counter. So having multiple events sharing a single counter is exactly the right design here.

Nilay Vaish Nov. 6, 2015, 6:22 a.m. (Nov. 6, 2015, 6:22 a.m.)
```
Got it.
```

src/cpu/trace/trace_cpu.cc (Diff revision 2)

space around /

Show all issues

Radhika Jagtap Nov. 20, 2015, 1:59 a.m. (Nov. 20, 2015, 1:59 a.m.)
```
Done.
```

src/cpu/trace/trace_cpu.cc (Diff revision 2)

I think

trace.read(&currElement)

would make it more obvious that currElement is an output operand here.  (Other options would be currElement = trace.read(), but then checking for failure needs extra handling.)

Show all issues