1. SUMMARY

We have implemented a Cache Simulator for analyzing how different Snooping-Based Cache Coherence Protocols - MSI, MESI, MOSI, MOESI, Dragonfly, and Competitive Snooping; perform under various workloads. Given any program, we can use our simulator to compare the performance of various protocols, based on number of Bus Transactions, Memory Requests, Memory Write-Backs and Cache-to-Cache Transfers.

2. BACKGROUND

We have studied about different snooping based Cache Coherence Protocols in class. Whenever a processor wants to read or write something, it tries to use its own cache to avoid having to go to the memory each time (as it’s very slow). But, when we have multiple processors, we need to synchronize the caches, so that all processors have a coherent view of memory. For this, one approach is to use a snooping cache, where each cache monitors the memory reads and writes done by other caches and takes some action based on those requests. MSI is one simple choice but there are other protocols too which offer different kinds of benefits under specific workloads - MESI, MOSI, MOESI, Dragonfly, etc. With this project we basically aim to study and demonstrate the advantages of each protocol over the others based on some real-world, and some synthetic memory traces. We have also implemented a split-bus memory access, so that we can mimic memory transactions similar to how they happen in real machines, instead of using an atomic bus. This gives us a clear understanding of the performance of each snooping based protocol. The inputs for analysis have been derived from programs that are concurrent in nature. We created two kinds of programs - one based on locks, so that the read/writes are in a FIFO manner; and other based on random (aka wild) access to the memory, which neither maintain consistency of data nor the order of transactions. We also designed some artificial test inputs to be able to clearly differentiate the performance of each snooping based protocol.

2.1. Snooping based cache protocols

The protocols that we have implemented are:
Write-Invalidate Protocols
1. MSI
2. MESI
3. MOSI
4. MOESI

Write-Update Protocols
1. Dragonfly (write-back update)

Hybrid Protocol
1. Competitive Snooping(***) - While studying about various snooping based protocol, we found this hybrid protocol intriguing. As it tries to combine the benefit of both write-invalidate and write-update protocols. It works by maintaining a counter for each cache line. On a snooping update it’s decremented and on a processor access it’s incremented. If on a snoop this counter reaches zero, instead of updating we invalidate the line.

(***) This was not a part of our initial plan for the project, but since we are running ahead of our current schedule, we will try our best to get this in before the final deadline. Some preliminary results have been updated now.

[!] UPDATE
We were able to implement this protocol in time, and have added the analysis in the results section of this report.

All of these protocols use write-allocate write-back caching.

Following are the State Transition Diagrams for above protocols.
The basic MSI protocol with the Modified, Shared and Invalid states. alt text
MESI protocol adds ‘Exclusive’ state to MSI that reduces bus transactions caused by writes to cachelines that exist in a single cache. alt text
MOSI protocol adds ‘Owner’ state to MSI to reduce writebacks caused by reads from other processors. alt text
MOESI protocol combines the benefits of MESI and MOSI. alt text
Dragon protocol is a write-update protocol which on a write to cacheline, instead of invalidating the cacheline on other caches, sends an update message. alt text

3. APPROACH:

We use Intel’s pintool to instrument programs and generate memory traces for them. For this, we wrote our own pintool, which for each instruction, checks whether it has a memory access or not. If it has a memory access, then at runtime we record the following information:
1. Whether the access is a read or a write
2. The memory address for the load/store
3. The thread-ID for the thread which performed the load/store

This is saved in a file in the following format :

[Thread-ID] [R/W] [Memory_Address]

Example:

0 R 0x1024
0 W 0x1024
1 R 0x1088
1 W 0x1088
2 R 0x1152

For improved analysis of a program, we have written another pintool which allows demarcating the code which we want to analyze using the simulator. This is to help in the scenario, where a program is very large and there are only specific parts of the program that we want to analyze. We do this by calling two dummy functions: dummy_instr_start() and dummy_instr_end(). The pintool recognizes calls to these functions, so when it sees a dummy_instr_start(), it starts recording memory accesses and on dummy_instr_end(), it stops recording them.
Once we have the memory trace generated by one of the pintools, we pass it to our Cache Simulator.

3.2. Cache Simulator Design:

alt text
Our Cache Simulator allows varying the following parameters:

-c Number of Processor Cores/Caches
-s Cache Size (in MB)
-a Set Associativity
-p Cache Coherence Protocol to use (MSI/MESI/MOSI/MOESI/Dragon)
-t The trace file to use

At initialization, we create caches for each processor, and we create two pThreads for each cache. One thread is responsible for handling requests from the processor and initiating bus transactions if necessary (processor transactions), while the other thread is responsible for responding to bus transactions (snooped bus transactions). For accesses to memory, we have simulated a split-transaction bus. We create a pThread to act as the memory controller. We have also added a small delay(of the order of couple ms), so that each memory request has some considerable latency associated with it. This also means that while a memory request is pending, another cache might request the same cacheline. This causes some interesting correctness issues that we will discuss later.

The high level workflow of our simulator:

3.3. Metrics Used for Comparison

  1. Number of Snooping Bus Transactions - The number of bus transactions that the protocol implementation requires for the program.
  2. Number of Memory Requests - The number of cacheline loads that have to be served by main memory (and not another cache).
  3. Number of Memory Write-Backs - The number of times any cacheline has to be written back to main memory.
  4. Number of Cache to Cache Transfers - The number of cacheline loads that are served by another cache.


    Ideally, we want as few bus transactions, memory requests, memory write-backs, or cache to cache transfers as possible because all of them will have some amount of latency associated with them. If it is a choice between a memory request and a cache to cache transfer, we always prefer the latter, because a memory request will have much higher latency compared to getting data from another cache.

3.4. Interesting Synchronization and Implementation Issues:

As described earlier, we create separate threads to simulate various parts of the cache. To maintain and ensure correctness, there are certain issues we need to take care of:

3.4.1. Split Transaction Bus

Having a split transaction bus means that when there is a request from one cache pending, another cache can request load of the same cacheline. In this case, the memory does not need to service the same request twice. It just needs to provide the data for the cacheline to both the caches, or in other words, both the caches need to be listening for the memory’s servicing of the same request. For implementing this, we create a separate condition variable for each request. When a cache needs something serviced from memory, it first checks whether the access already exists in the request table. If it already exists, instead of creating another memory request in the table, it simply waits on the condition variable of the existing request. When the memory worker is done servicing the request, it performs a broadcast on the condition variable associated with the request, so that all caches that were waiting for that request can continue with their work.

3.4.2. Cache Worker and Snooping Worker

We have two workers (pThreads) for each cache. One for servicing load/store requests from the processor and one for servicing for snooping bus transactions. Both these threads require concurrent access to the same cache, and hence the accessed need to be synchronized. Our implementation uses a single copy of the cache with pthread mutexes for synchronization.

3.4.3. Snooping Bus

The current implementation of the snooping bus uses pthread mutices and condition variables. The order in which caches gain access to the bus is completely dependent on the implementation of pthread mutex and condition variables. Based on our implementation, we can easily enforce a specific ordering on the arbitration, but we decided to use the simple default implementation. To explain in brief, we have a mutex for getting access to the snooping bus for broadcasting actions. Another mutex is required for other snooping caches to respond to this action. This also involves the snooping caches informing the cache worker whether the cacheline would be in the exclusive or shared state, or if the cache worker does not have the cacheline, whether the snooping cache can provide the data for the cacheline. A snooping caches can also NACK the transaction, if its own cache has a conflicting pending transaction that is waiting to be serviced by the split transaction bus.

4. PROTOCOLS’ LOW LEVEL DESIGN CHOICES:

5. TEST HARNESS

To test if our cache implementation is working properly, we designed some artificial traces. For these traces, we already knew the values of the metrics that would be generated by running our program. So, this made it possible to check that the our cache implementation is working as expected. The scripts to generate these traces can be found in the “../418CacheSim/traces/” directory. The scripts are:

MSIvMESI.py
MSIvMOSI.py
MOESIvsDragon.py

For benchmarking and comparisons between the protocols, we wrote a few test programs - wild_add, lock_add, wild_fill_bucket, lock_fill_bucket. We also used our code from 15-618 assignments :- assignment 1 (mandelbrot) and assignment 3 (pagerank and bfs topdown). Additionally, we used tests from the Splash 2 application suite. Except for Splash 2, all tests were run on ghc-42 with 16 threads. The caches were setup to have a size of 1MB with a set-associativity of 8. As for Splash2, it was run with 4 threads on our own machines. (The Splash2 tests required additional programs which were not available on GHC machines). The program sizes were purposely chosen to be small so that the memory traces do not get too large. The results from the tests are shown below.

6. GRAPHS AND ANALYSIS

  1. MSI vs MESI: This tests out an artificial trace that Read and Writes to unique addresses on separate cache lines. This means that there is no requirement of sending out write-invalidation or write-update messages. From the results, we can see that MSI and MOSI perform worse compared to others because they do not have an Exclusive state. This means that on a write, they need to send out unnecessary write-invalidate messages on the bus. (All memory writebacks here are due to cache evictions.)
    alt text
  2. MSI vs MOSI: This test has 16 threads where all of them are reading and writing to the same address. In a normal MSI or MESI protocol, this would require writebacks on each invalidation. MOSI and MOESI on the other hand, just get the cache to transfer the data to the new owner, avoiding the need for memory writebacks.
    alt text
  3. MOESI vs Dragon: This test has 16 threads, with one of them constantly writing to an address, while the others keep reading from the same address. Write-invalidate protocols would require invalidations on each write. The Dragon write-update protocol instead of invalidation sends an update to the other caches, so that they do no need get a cache miss on each and every read. The results clearly show the low number of bus transactions required by Dragon.
    alt text
  4. Lock Add: In this test all the threads try to add ‘1’ to a global counter using locks. We see lower number of memory writebacks in MOSI and MOESI because of the presence of the owner state.
    alt text
  5. Wild Add: In this test all the threads try to add ‘1’ to a global counter without using locks. Here, the exclusive state seems to be helping. Probably the read on the global variable and subsequent write to it brings out the benefit in MESI and MOESI protocols.
    alt text
  6. Lock Fill Bucket: This tests makes buckets for numbers, so that it can count how many times each number is present in the array. This is done using locks. The dragon protocol is performing much worse here compared to others. This is probably because updates to the buckets are not always needed by other processors, hence the updates on the writes do not help.
    alt text
  7. Wild Fill Bucket: This tests makes buckets for numbers, so that it can count how many times each number is present in the array. Locks are not used in this case. The results are similar to the lock fill bucket case.
    alt text
  8. Mandelbrot: This is the mandelbrot from Assignment 1.
    alt text
  9. Pagerank: This is the Pagerank from Assignment 3, run on tiny.graph. The score updates in one iteration are read in the next iteration. This is probably also the reason why Dragon protocol is showing good results compared to the write-invalidate protocols.
    alt text
  10. BFS: This is the BFS top-down approach from from Assignment 3 run on grid_100x100.
    alt text
  11. Ocean: This is the splash-2 ocean run with 4 processors and 18x18 grid. Write-update protocols seem to perform noticeably better.
    alt text

Competitive Snooping

This is a hybrid kind of protocol, a cross between write-invalidate and write-update. We take Dragon as a base and associate an update counter with each cacheline. When, a processor accesses a cacheline, the counter is incremented and when it receives an udpate from another processor, the counter is decremented. When, the counter reaches zero, instead of updating the cacheline, we invalidate it. This basically means that the processor does not access the cacheline very often and instead of updating it, invalidating it might be better.
Result From our experiments, we did not see any noticeable improvements compared to others. The numbers were similar to Dragon.

7. PLATFORM CHOICE

We used C++ as the programming language for creating both the pintools and the cache simulator.

8. RESULTS

Interesting Conclusions

Adding ‘E’ state:

To measure the performance differences caused by adding the Exclusive state to the protocols, we can look at the differences in metrics in MSI vs MESI and MOSI vs MOESI. The main benefit of the Exclusive state is in reducing the number of snooping bus transactions required. If we consider a parallel program where each thread works on a chunk of an array and updates only that chunk, or if we assume a sequential program that has a single thread, then in these cases, there will be a lot of cases where a program is reading a cacheline and updating it. In MSI, this would translate to first loading the cacheline using a BusRd moving to the S state, and then performing a BusRdX and moving to the M state. This requires two snooping bus transactions. In the case of MESI, this can be done in a single transaction. The cache would perform a BusRd moving to the E state and then since no other cache has the cacheline, there is no need of a BusRdX transaction to move to the M state. It can just silently change the status of the cacheline to Modified.
This gives a significant boost in programs which access and modify unique memory addresses.
From the tests and memory traces that we generated, adding the Exclusive state reduced the number of snooping bus transactions by 5-6%. (These tests also include other memory accesses, so this is a considerable reduction).

Adding ‘O’ state :

To measure the performance differences caused by adding the Owner state to the protocols, we can look at the differences in metrics in MSI vs MOSI and MESI vs MOESI. The intention in adding the Owner state is to reduce the number of memory write backs required by the protocol especially on cache invalidations. To understand how the Owner state helps, we can imagine a parallel program where one thread writes to a memory address, while the other threads simply observe or read the value at the memory address. In the MSI protocol case, whenever the thread writes to the memory address, it would move to the Modified state and when the other threads would want to read from that address, it would have to flush the data to memory and move to the Shared state. This would result in a memory writeback each time. On the other hand, if we consider the MOSI protocol, instead of having to flush the contents to memory each time, the writer thread would move to the Owner state instead of invalidating the cacheline. To ensure correctness, whenever a cacheline in the Owner or Modified state is evicted from the cache, the memory writeback needs to be performed.
From the tests and memory traces that we generated, adding the Exclusive state reduced the number of snooping bus transactions by 8-10%. (Again, these tests also include other memory accesses, so this is a considerable reduction)

Write-Invalidation vs Write-Update

Since we have implemented both write invalidation and write update protocols, our simulator can also tell whether for a given program or memory trace, write invalidation protocols will be better or write update.
For a write-invalidation protocol, when a processor writes to a memory location, other processor caches need to invalidate that cacheline. In a write-update protocol, instead of invalidating the cachelines, it sends the updated cacheline to the other caches. Therefore, in cases where the other processors will need to read those values in the future, write-update performs well, but if the other processors are not going to be needing those values, then the updates are not going to be of any use, and will just cause extra bus transactions. Therefore, the effects of the protocol would be completely dependent.
From our tests, we saw lesser number of bus transactions with Dragon for page_rank, mandelbrot and Splash2-Ocean. In pagerank, the score updated in one iteration is used by the other threads in the next iteration. This would explain why updating rather than invalidating reduced the number of bus transactions.

9. REFERENCES

  1. http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2010/8a_sk?ref=driverlayer.com/image
  2. Suleman, Linda Bigelow Veynu Narasiman Aater. “An Evaluation of Snoop-Based Cache Coherence Protocols.”
  3. en.wikipedia.org
  4. 15-618 Course Slides
  5. https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool
  6. https://software.intel.com/sites/landingpage/pintool/docs/65163/Pin/html/index.html
  7. Grahn, Håkan, Per Stenström, and Michel Dubois. “Implementation and evaluation of update-based cache protocols under relaxed memory consistency models.” Future Generation Computer Systems 11.3 (1995): 247-271.

10. LIST OF WORK BY EACH STUDENT

Equal work was performed by both project members.