Review Board 2.0.15


ruby: Fix protocol deadlock with RMW operations

Review Request #2983 - Created July 21, 2015 and updated

Information
Lena Olson
gem5
default
Reviewers
Default
Changeset 10926:97dc170a3064
---------------------------
ruby: Fix protocol deadlock with RMW operations

Previously, if there was a memory operation between the read part and the write
part of a RMW transaction, a deadlock could occur. There was logic to allow the
write part of the RMW transaction to bypass the blocked controller, but it also
allowed other operations to bypass as well. Thus, only the write parts of a RMW
operation can bypass. This patch adds a new function to the message class which
controls whether a block can bypass the blocked controller.

It is possible that an IFETCH is issued between the read and write parts of the
RMW transaction.

Tested with the ruby random tester as well as running some multiprogrammed workloads, including ones where this deadlock manifested. The multicore tests I have are for x86, so if those with more knowledge of other ISAs could comment that would be helpful.

Many thanks to Jason Power for help in tracking this down and fixing it.

Issue Summary

1 1 0 0
Posted (Aug. 14, 2015, 1:03 p.m.)



  
src/mem/slicc/ast/PeekStatementAST.py (Diff revision 1)
 
 

This is a very subtle change to already complex code, so I'm not sure I understand how it works. The patch description indicates that bad things could happen if controller activity occurred between the read and write of a RMW, but it doesn't really say what the bad things are. This code appears to only block intermediate activity when the write part of the RMW gets to the head of a controller queue.

Can you clarify what was causing the deadlock, and why this works to fix it?

  1. Joel, consider the following operations happen on some address A:

    RMW Read Address A
    Some other mem op on Address A
    RMW Write Address A

    Currently, on seeing a RMW Read, the sequencer asks the controller to block
    any operation on address A other than those coming from the mandatory queue.
    This is to prevent losing coherence permissions on address A. My guess is that
    RMW operations in x86 have to always succeed, unlink LL/SC which I believe can
    fail. I think the expectation here is that no operation other than RMW Write
    on the same address would be issued from the manadatory queue. This implies
    the sequencer expects that the processing core would not issue any operation
    other than RMW Write. But Lena must have observed an instruction fetch request
    to the same address between RMW Read and RMW write.

    Now, I think the proposed fix is a character away from correctness. The condition should be:

     if (m_is_blocking &&
        (m_block_map.count(in_msg_ptr->m_$address_field) == 1) &&
        ((m_block_map[in_msg_ptr->m_$address_field] != &$qcode) ||
        (!in_msg_ptr->bypassBlocked()))) {
    

    I am actually not in favor of fixing the problem the way this patch does it.
    As I said above, the controller expects that the mandatory queue will not
    issue any operation on the address other than RMW Write. I think we should
    change the sequencer and put in a check.

  2. Hi Nilay. We've discussed this patch pretty thoroughly on the gem5-gpu list (here: https://groups.google.com/forum/#!topic/gem5-gpu-dev/RQv4SxIKv7g ). Based on some testing, the patch in this review request probably doesn't work correctly, even if you change the condition as you suggest. I've posted a preliminary fix on the thread linked here, but I'm not sure that is a good way to do it either. I'd be happy to hear your take on our gem5-gpu discussion.

  3. I saw your patch. It is essentially what I had in mind when I said that the check should
    go in the sequencer. But I would prefer the check is made at line 577 just before insertRequest()
    is called, rather than at 177 as is being done currently in your patch. And mark isBlocked as const.

  4. Posted here: http://reviews.gem5.org/r/3149/