x86: Fix SMT support (zeroReg, TLBs)

Review Request #1281 - Created June 29, 2012 and updated

Information
Andrea Pellegrini
gem5
default
Reviewers
Default
Changeset 9074:8a6f47da502a
---------------------------
x86: Fix SMT support (zeroReg, TLBs)


-----------------------------------------------------------------------------------------------------------------------
- Fix the zeroReg problem

There was an issue with the rename logic, which would assign a "previous" physical register to the ZeroReg architectural register in x86 (which, BTW, I don't believe exists in x86). This caused problems for instructions squashed in threads with an ID other than 0, sometimes allowing non-mispredicted instructions to read a non-zero value from the zeroReg.

* changed cpu/o3/rename_map.cc
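
As an illustration of the idea only (not the actual code in cpu/o3/rename_map.cc), here is a minimal Python sketch of a rename map that pins the zero register to one physical register, so that renaming it never produces a distinct "previous" register that a squash could restore. The class name `ToyRenameMap` and all of its methods are hypothetical.

```python
class ToyRenameMap:
    """Toy per-thread rename map; hypothetical, not gem5's implementation."""

    def __init__(self, num_arch_regs, zero_reg, free_list):
        self.zero_reg = zero_reg
        self.free_list = list(free_list)
        # The zero register gets one dedicated physical register that is
        # never freed or reassigned.
        self.zero_phys = self.free_list.pop(0)
        self.map = {}
        for arch in range(num_arch_regs):
            if arch == zero_reg:
                self.map[arch] = self.zero_phys
            else:
                self.map[arch] = self.free_list.pop(0)

    def rename(self, arch_reg):
        """Return (new_phys, prev_phys) for a destination register."""
        if arch_reg == self.zero_reg:
            # No new allocation and no distinct "previous" register, so a
            # squash has nothing to roll back for the zero register.
            return self.zero_phys, self.zero_phys
        prev = self.map[arch_reg]
        new = self.free_list.pop(0)
        self.map[arch_reg] = new
        return new, prev

    def squash(self, arch_reg, new_phys, prev_phys):
        """Undo the rename of a squashed instruction."""
        if arch_reg == self.zero_reg:
            return  # the zero register mapping is never changed
        self.map[arch_reg] = prev_phys
        self.free_list.insert(0, new_phys)
```

In this sketch each hardware thread would own one `ToyRenameMap`; the point is only that the zero register takes the same path regardless of thread ID.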

-----------------------------------------------------------------------------------------------------------------------
- Replicated the pre-decoders

There was an issue with the pre-decoders, as a user already pointed out on the mailing list.

* The new fetch stage has one decoder per thread, which might have fixed it.

-----------------------------------------------------------------------------------------------------------------------
- Replicated the TLBs (both ITLB and DTLB)

This seems to be the simplest solution for x86, and it mimics design choices made for the Hyper-Threading technology in the P4.
Reference: "Hyper-Threading Technology Architecture and Microarchitecture" by Marr et al.
http://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01.pdf

* Added the number of threads to src/arch/x86/X86TLB.py.
  If there is a better way to get this information, please let me know.

* Added thread information to the page table:
  I am assigning the tid to the asn; the asn field was already there but unused. I am not sure whether this will work in FS mode, or whether it is correct in this model, but it seems to be the simplest and most logical thing to do.
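
To make the replication concrete, here is a small Python sketch of a single TLB object that keeps one independent entry table per hardware thread. It is a toy model under stated assumptions: `ReplicatedTLB`, its parameters, and the oldest-first eviction are hypothetical and do not reflect the x86 TLB code in gem5.

```python
class ReplicatedTLB:
    """Toy TLB holding one private entry table per hardware thread."""

    def __init__(self, num_threads, entries_per_thread):
        self.entries_per_thread = entries_per_thread
        # One independent table per thread: no entry is ever shared.
        self.tables = [dict() for _ in range(num_threads)]

    def insert(self, tid, vpn, ppn):
        table = self.tables[tid]
        if vpn not in table and len(table) >= self.entries_per_thread:
            # Evict the oldest inserted entry (dicts keep insertion order).
            table.pop(next(iter(table)))
        table[vpn] = ppn

    def lookup(self, tid, vpn):
        """Return the physical page number, or None on a miss."""
        return self.tables[tid].get(vpn)
```

A translation installed by thread 0 is invisible to thread 1 here, which is the isolation the replication provides and also the source of the FS-mode concern about threads that move.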

-----------------------------------------------------------------------------------------------------------------------
- Changed the exit_group syscall so that the program now exits only when all the threads have terminated

Is there a better way to handle it?
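
One way to picture the behavior described above is a tracker that ends the simulation only after the last running thread has called exit. This is a hedged Python sketch; `ExitTracker` and `on_exit_group` are hypothetical names, not part of the syscall emulation code.

```python
class ExitTracker:
    """Toy bookkeeping: end the simulation when no thread is left running."""

    def __init__(self, num_threads):
        self.running = set(range(num_threads))

    def on_exit_group(self, tid):
        """Record that thread tid exited; return True if simulation should end."""
        self.running.discard(tid)
        return len(self.running) == 0
```

With four threads, the first three calls return False and only the fourth returns True.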

-----------------------------------------------------------------------------------------------------------------------
- Changed the se.py script to support SMT.
It seems to work for both single- and multi-threaded workloads.
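
The test command below passes several binaries to `-c` separated by `;`. As a rough sketch of how such a string could be turned into one workload per hardware thread, consider the following Python helper. `split_workloads` is a hypothetical function written for illustration; it is not taken from se.py.

```python
def split_workloads(cmd, num_cpus=1):
    """Split a ';'-separated command string into one workload per thread.

    Returns (workloads, num_threads). Assumption for this sketch: with a
    single CPU, every workload becomes one hardware thread of that CPU.
    """
    workloads = [w.strip() for w in cmd.split(';') if w.strip()]
    num_threads = len(workloads) if num_cpus == 1 else 1
    return workloads, num_threads
```

For the four-binary command used in the test, this sketch would yield four workloads and four threads on the single CPU.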

-----------------------------------------------------------------------------------------------------------------------

Test:

* Are there any regression tests for x86 SMT?

* Simple 4-threaded workload:

Andreas-MacBook-Air:smt apellegr$ ../build/X86/m5.debug ../configs/example/se.py --maxinsts=500000000 --cpu-type=detailed --caches -n 1 -c '/Users/apellegr/Research/svnrepo/viperII/gem5-viper/tests/test-progs/loop/bin/x86/linux/loop;/Users/apellegr/Research/svnrepo/viperII/gem5-viper/tests/test-progs/loop/bin/x86/linux/loop;/Users/apellegr/Research/svnrepo/viperII/gem5-viper/tests/test-progs/loop/bin/x86/linux/loop;/Users/apellegr/Research/svnrepo/viperII/gem5-viper/tests/test-progs/hello/bin/x86/linux/hello'
gem5 Simulator System.  http://gem5.org
gem5 is copyrighted software; use the --copyright option for details.

gem5 compiled Jun 28 2012 17:49:19
gem5 started Jun 28 2012 17:50:40
gem5 executing on Andreas-MacBook-Air.local
command line: ../build/X86/m5.debug ../configs/example/se.py --maxinsts=500000000 --cpu-type=detailed --caches -n 1 -c /Users/apellegr/Research/svnrepo/viperII/gem5-viper/tests/test-progs/loop/bin/x86/linux/loop;/Users/apellegr/Research/svnrepo/viperII/gem5-viper/tests/test-progs/loop/bin/x86/linux/loop;/Users/apellegr/Research/svnrepo/viperII/gem5-viper/tests/test-progs/loop/bin/x86/linux/loop;/Users/apellegr/Research/svnrepo/viperII/gem5-viper/tests/test-progs/hello/bin/x86/linux/hello
/Users/apellegr/Research/svnrepo/viperII/gem5-viper/tests/test-progs/loop/bin/x86/linux/loop
/Users/apellegr/Research/svnrepo/viperII/gem5-viper/tests/test-progs/loop/bin/x86/linux/loop
/Users/apellegr/Research/svnrepo/viperII/gem5-viper/tests/test-progs/loop/bin/x86/linux/loop
/Users/apellegr/Research/svnrepo/viperII/gem5-viper/tests/test-progs/hello/bin/x86/linux/hello
Global frequency set at 1000000000000 ticks per second
0: system.remote_gdb.listener: listening for remote gdb #0 on port 7002
0: system.remote_gdb.listener: listening for remote gdb #1 on port 7003
0: system.remote_gdb.listener: listening for remote gdb #2 on port 7004
0: system.remote_gdb.listener: listening for remote gdb #3 on port 7005
**** REAL SIMULATION ****
info: Entering event queue @ 0.  Starting simulation...
warn: instruction 'fnstcw_Mw' unimplemented
warn: instruction 'fldcw_Mw' unimplemented
Hello world!
Done!
Done!
Done!
hack: be nice to actually delete the event here
Exiting @ tick 37617000 because target called exit()

Posted (Jan. 18, 2013, 6:56 a.m.)
Ali, Steve, can you take a look at this patch?

In my opinion, all the changes being made to the x86 TLB are pointless.
If we have to replicate everything within the TLB for each thread,
why not create completely separate TLB structures? The CPU can just
make the translation call to the TLB corresponding to the thread that
needs the translation.
  1. I agree, if we're going to replicate all the state in the TLB, we might as well just instantiate multiple TLB objects.
    
    I also don't understand why se.py needs to change; doesn't it already support multithreaded jobs?
    
    I think the zeroreg fix could be committed as is though (I didn't look at it closely, but if it's a real bug fix it should just go in).
    
  2. I'm not sure about the best way to handle the TLB issue. I don't really know how it's done on real cores, but I doubt that for every thread you get another N TLB entries. I suppose it's OK if you make each of them smaller. Similarly, I'm not sure that you want multiple page table walkers.
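
The two suggestions in this thread (separate TLB objects per thread, each smaller than the original) can be sketched together in Python. This is only an illustration under assumptions: `SimpleTLB` and `make_thread_tlbs` are hypothetical names, and the even split of a fixed entry budget is one possible policy, not something the reviewers specified.

```python
class SimpleTLB:
    """Toy single-thread TLB with a fixed number of entries."""

    def __init__(self, size):
        self.size = size
        self.entries = {}

    def insert(self, vpn, ppn):
        if vpn not in self.entries and len(self.entries) >= self.size:
            # Evict the oldest inserted entry (dicts keep insertion order).
            self.entries.pop(next(iter(self.entries)))
        self.entries[vpn] = ppn

    def lookup(self, vpn):
        """Return the physical page number, or None on a miss."""
        return self.entries.get(vpn)


def make_thread_tlbs(total_entries, num_threads):
    """Create one TLB object per thread, splitting a fixed entry budget."""
    per_thread = total_entries // num_threads
    if per_thread == 0:
        raise ValueError("fewer TLB entries than threads")
    return [SimpleTLB(per_thread) for _ in range(num_threads)]
```

The CPU side would then index the list by thread ID, for example `tlbs[tid].lookup(vpn)`, and the TLB class itself would need no knowledge of threads.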
Posted (Jan. 21, 2013, 8:15 a.m.)
Hi Andrea,

Thanks for the work. I'm a bit concerned about the TLB changes: while they might work for SE mode, I think they're definitely wrong for FS mode, where threads/processes move between hardware threads and would not have any valid TLB state. I think what we want is to associate the hardware thread context with any cached state in the TLB about the currently running thread (e.g. thread context, running ASN, etc.), but we probably don't want to, in essence, build two TLBs in one.

Ali

configs/example/se.py (Diff revision 5)
 
 
Why does this code need to be duplicated? Should it be below as well? It seems like we should only need to check for multiprocesses in one place.
src/arch/x86/linux/syscalls.cc (Diff revision 5)
 
 
Do you really want to change this? exit_group is supposed to end all threads, while exit is supposed to end the process only if they're all done. Are we getting this wrong in some way?
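
The distinction drawn in this comment can be modeled with a toy thread-group table in Python. This is a sketch of the semantics as described here, not of gem5's syscall emulation; `ToyThreads` and its methods are hypothetical.

```python
class ToyThreads:
    """Toy model of threads grouped into processes (thread groups)."""

    def __init__(self, tgid_of):
        # tgid_of maps each live thread id to its thread-group id.
        self.tgid_of = dict(tgid_of)

    def exit(self, tid):
        """exit: only the calling thread terminates."""
        self.tgid_of.pop(tid)

    def exit_group(self, tid):
        """exit_group: every thread in the caller's group terminates."""
        tgid = self.tgid_of[tid]
        self.tgid_of = {t: g for t, g in self.tgid_of.items() if g != tgid}

    def all_done(self):
        """True once no thread of any group is left."""
        return not self.tgid_of
```

Under this model, an exit_group from one workload removes only that workload's threads, and the simulation as a whole is finished only once `all_done()` is true.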
src/arch/x86/pagetable_walker.cc (Diff revision 5)
 
 
This is the hardware thread id, not the ASN. It seems like you want the ASN, because with the thread id all the TLB entries would miss if the threads switched CPUs. This might work for SE mode, but it's going to be wrong for FS.
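
The difference between the two tagging schemes can be shown with a toy tagged TLB in Python. The sketch assumes a process with a hypothetical ASN of 7 that first runs on hardware thread 0 and then moves to hardware thread 1; `TaggedTLB` and the values used are made up for illustration.

```python
class TaggedTLB:
    """Toy TLB whose entries are keyed by (tag, virtual page number)."""

    def __init__(self):
        self.entries = {}

    def insert(self, tag, vpn, ppn):
        self.entries[(tag, vpn)] = ppn

    def lookup(self, tag, vpn):
        """Return the physical page number, or None on a miss."""
        return self.entries.get((tag, vpn))


# The process (ASN 7) fills the TLB while running on hardware thread 0.
by_tid = TaggedTLB()
by_asn = TaggedTLB()
by_tid.insert(0, 0x10, 0xA0)   # tag is the hardware thread id
by_asn.insert(7, 0x10, 0xA0)   # tag is the ASN

# The process then moves to hardware thread 1 and repeats the access.
tid_hit = by_tid.lookup(1, 0x10) is not None   # tag changed with the move
asn_hit = by_asn.lookup(7, 0x10) is not None   # tag follows the process
```

After the move, the thread-id-tagged lookup misses while the ASN-tagged lookup still hits, which is the FS-mode problem raised in this comment.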
src/cpu/o3/rename_map.cc (Diff revision 5)
 
 
I think we should commit this