

University of Jordan Student name: Yousef Yaseen Simulation Report: Cache Optimization in Chip Multi-Processor

# Cache Optimization in Chip Multi-Processor

Yousef Yaseen

**Abstract:** In CMP systems based on SMP organization, cache is much more important than before because performance promoted by multiprocessor is easily degraded by memory latency in shared symmetric multiprocessors. Cache design strategies in uniprocessor systems are implemented in multiprocessor simulator by executing parallel programs for different number of processors. In this simulation we found Full associative, Dragon and LFU having more beneficial impact on system performance

## **1. Design Options:**

- 1- To simulate a code with different mapping protocols (Direct, Set associative and full associative) then find the protocol leads to largest hits rate we used the following configuration: MESI coherence protocol, LRU arbitration, 16 bits word wide, 32 words by block, 1024 blocks in main memory, 64 bytes block, 64Kb main memory, 8 blocks in cache, 512 bytes cache size, LRU replacement policy and one cache level.
- 2- To simulate a code with different coherence protocols (MSI, MESI and Dragon) then find the protocol leads to largest hits rate.
- 3- To simulate a code with different bus arbitration protocols (Random, LRU and LFU) then find the protocol leads to largest hits rate.

### 2. Simulator:

SMPcache 2.0 is chose as the simulator. Parameters that can be designed in the simulator are: cache coherence protocols, policies of bus arbitration, mapping, replacement policies, cache size (blocks in cache), number of cache sets, number of words by block (memory block size) and word wide.

Being a MIMD (Multiple Instruction stream, Multiple Data stream) system, CMP requires parallel programs executed in multiprocessors to take full advance of SMP architecture's computation capacity.

#### 3. Workload:

Table 1 shows three parallel programs for two, four and eight processors respectively, which represent several typical programs in real application. We keep the problem size constant in every simulation configuration. To be specific, 40,000 memory traces of FFT parallel program was executed in our simulation project.

| Name    | References | Language | Description                                                                               |
|---------|------------|----------|-------------------------------------------------------------------------------------------|
| FFT     | 7,451,717  | Fortran  | Parallel application that simulates the fluid dynamics with FFT                           |
| Simple  | 27,030,092 | Fortran  | Parallel version of the SIMPLE application                                                |
| Weather | 31,764,036 | Fortran  | Parallel version of the WEATHER<br>application, which is used for weather<br>forecasting. |

Table1 Details of parallel benchmarks

# 4. Simulation Results:

### 4.1 Mapping protocol and performance evaluation

Figure 1 illustrates the change in hit rate as mapping protocol is changed for two different multiprocessor configurations. Simulation results show that despite of processors amount the Full Associative has more beneficial impact on system performance, since it reduces the frequency of costly cache misses. And it's better than 4-way Set Associative which has better results than Direct protocol.



## 4.2 Coherence protocol and performance evaluation

Figure 2 illustrates the change in hit rate as coherence protocol is changed for two different multiprocessor configurations. Simulation results show that despite of processors amount the Dragon has more beneficial impact on system performance, since it reduces the frequency of costly cache misses. And it's better than MSI and MESI.



#### 4.3 Bus arbitration and performance evaluation

Figure 3 illustrates the change in hit rate as bus arbitration is changed for two different multiprocessor configurations. Simulation results show that despite of processors amount the LFU has more beneficial impact on system performance, since it reduces the frequency of costly cache misses. And it's better than LRU that gave better results than Random.



# 5. Conclusion:

In this simulation we found Full associative, Dragon and LFU having more beneficial impact on system performance. As the number of processors is increased, total miss rates rise for that communication among processors increase, which leads to more coherence misses.