COMPUTER ORGANIZATION (ELE408) LAB 2

The CPU Core and Memory Hierarchy of the ColdFire MCF5208

1. Pre-lab Report

· Read Chapters 3, 5, 6, and 8 of the MCF5208 Reference Manual.

· Read this lab handout carefully. Take notes and prepare the programs required in this handout.

· Study the SBCTools and review the dBUG command set.

2. Objectives of this Lab

The purpose of this lab is to gain first-hand experience with the CPU core and the memory hierarchy architecture of the ColdFire processor. You will apply the basic concepts of computer architecture that you have learned in the lectures to the lab experiments. In particular, the concepts of memory hierarchy and cache design are the main focus of this lab.

Specifically, you will learn and exercise the basics of

2.1. Cache design including mapping, replacement, configuration, and performance impact.

2.2. How to use programmable high-speed SRAM to control the execution time of a program.

2.3. Techniques for reducing power consumption.

3. Basics

The growing speed gap between memories and processing elements makes memory system design a crucial challenge for computer designers. High-performance computers need a sophisticated memory hierarchy to bridge the speed disparity between processors and main memory. Even a large register file is usually too small to hold the working set of a program, and it requires extra effort in software to manage. Highly interleaved memory may provide enough bandwidth, but only if the memory is extremely fast and the number of interleaved memory modules is very large. In addition, the interconnection network between processors and memories adds delay to each memory access.

High-speed on-chip cache memories have been used successfully in almost every processor today. A cache memory bridges the speed gap between the CPU and the off-chip memory. Cache memories are generally transparent to programmers and controlled by hardware. For embedded processors, programmable SRAM is often used to selectively hold important program code and data that need fast access.

The MCF5208 processor has both an on-chip cache and on-chip SRAM for speeding up memory accesses. The cache memory can be configured as an instruction cache, a data cache, or a mixed cache by writing appropriate values to the cache control register (CACR) and the access control register (ACR). The SRAM is a user-configurable high-speed memory that allows fast program execution. In this lab, you will configure both high-speed memories and observe the performance differences between configurations.

4. Experiment Procedures

This lab consists of two parts: 1) experiments on cache memory, and 2) experiments on SRAM.

4.1. Cache Memory Experiments

Preparation:

Ø Design an assembly program to configure the cache memory:

o Enable/disable the entire cache

o Enable instruction cache and disable data cache

o Enable data cache and disable instruction cache

o Enable both instruction and data cache: mixed cache

Ø Design an assembly program to measure execution times

Ø Design a C program to do intensive matrix computations based on the algorithm provided.

Hint: you need to configure both CACR and ACR to make the cache work.
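
As a starting point for the four configurations listed above, the following C sketch builds a candidate CACR value for each one. The bit positions (CENB, CINV, DISI, DISD) are assumptions and must be checked against the cache chapter of the Reference Manual before use; the enum and function names are hypothetical. The sketch does not cover the ACR settings, and on the target the value would still have to be loaded into CACR with a movec instruction in supervisor mode.

```c
#include <stdint.h>

/* ASSUMED CACR bit positions: verify each one in the Reference Manual. */
#define CACR_CENB (UINT32_C(1) << 31) /* cache enable                  */
#define CACR_CINV (UINT32_C(1) << 24) /* invalidate the whole cache    */
#define CACR_DISI (UINT32_C(1) << 23) /* disable instruction caching   */
#define CACR_DISD (UINT32_C(1) << 22) /* disable data caching          */

/* Hypothetical names for the four configurations in the handout. */
typedef enum {
    CACHE_OFF,
    CACHE_INSTR_ONLY,
    CACHE_DATA_ONLY,
    CACHE_BOTH
} cache_mode_t;

/* Return the value to write to CACR for the requested mode.
 * CINV is set in every case so that stale lines are discarded
 * whenever the configuration changes. */
uint32_t cacr_value(cache_mode_t mode)
{
    switch (mode) {
    case CACHE_INSTR_ONLY:
        return CACR_CENB | CACR_CINV | CACR_DISD;
    case CACHE_DATA_ONLY:
        return CACR_CENB | CACR_CINV | CACR_DISI;
    case CACHE_BOTH:
        return CACR_CENB | CACR_CINV;
    case CACHE_OFF:
    default:
        return CACR_CINV; /* CENB clear: cache disabled */
    }
}
```

Keeping the value computation separate from the register write makes it possible to check the bit patterns on a host machine before running anything on the board.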

In the Lab:

Ø Run your program for each of the 4 configurations and measure the execution time. You may change the program size, data size, or algorithm structure to observe the effect on performance. Are there any differences in your measurements? Why?

Ø Can you change the data size or program structure to improve the cache performance? Any ideas or suggestions are welcome. Please put your thoughts and your results in your lab report.
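
The measurement step can be organized as in the following C sketch, which times repeated runs of a workload and reports the average. It is only an illustration of the structure: it assumes the standard library clock() is available, which may not hold on the target board, where a hardware timer read from your assembly routine would take its place. The names time_runs, sample_work, and work_result are hypothetical.

```c
#include <time.h>

/* Result of the sample workload; volatile so the loop is not optimized away. */
volatile unsigned long work_result;

/* Hypothetical workload: sums the integers 1..n, where arg points to n. */
void sample_work(void *arg)
{
    unsigned long n = *(unsigned long *)arg;
    unsigned long sum = 0;
    for (unsigned long i = 1; i <= n; i++)
        sum += i;
    work_result = sum;
}

/* Run work(arg) `repeats` times and return the average time per run
 * in seconds. Returns 0.0 if repeats is not positive. */
double time_runs(void (*work)(void *), void *arg, int repeats)
{
    if (repeats <= 0)
        return 0.0;

    clock_t start = clock();
    for (int r = 0; r < repeats; r++)
        work(arg);
    clock_t end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC / repeats;
}
```

Averaging over several runs reduces the effect of timer resolution. Note that with the cache enabled the first run also warms the cache, so the first run and later runs may be worth reporting separately.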

4.2. SRAM Experiments

Preparation:

Ø Design an assembly program to configure the on-chip SRAM:

o Mask/enable the SRAM

o Set different power modes

Ø Design a program to initialize the SRAM with critical program code and data

Ø Design a C program of your choice that can be run on the computer
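
One way to approach the SRAM initialization step is sketched below in C: critical data is copied into the SRAM region with a bounds check, and the program then works through the returned pointer. This is a sketch under assumptions. The function name sram_load is hypothetical, and the SRAM base address and size are passed in as parameters because they depend on how you program the SRAM base address register; take the real values from the Reference Manual.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy `len` bytes from `src` into the SRAM region at `offset`.
 * sram_base and sram_size describe the region (assumed to be
 * already enabled and mapped). Returns a pointer to the copy
 * inside SRAM, or NULL if the data would not fit. */
void *sram_load(void *sram_base, size_t sram_size, size_t offset,
                const void *src, size_t len)
{
    if (offset > sram_size || len > sram_size - offset)
        return NULL; /* would run past the end of SRAM */

    return memcpy((uint8_t *)sram_base + offset, src, len);
}
```

On a host machine, an ordinary array can stand in for the SRAM region to test the logic. On the board, sram_base would be the address at which the SRAM has been mapped.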

In the Lab:

Ø Run your program with all your program and data in SRAM

Ø Run your program again with your program in the SRAM but data in DRAM

Ø Run your program with data in SRAM and program in DRAM

Ø Run your program in low-power mode

Ø Measure execution time of each run and compare them.

5. Lab Report Requirements

In your lab report, you should discuss your designs and the trade-offs between performance and power. Explanation and interpretation of your results are very important. The lab will be graded based on your report and discussions. The total mark for the report is 100 points.

Ø Prelab report: 20 points

Ø Successful experiment: 50 points

Ø Results analysis, interpretation, and discussions on your design and engineering constraints: 30 points

In the following items, the number inside each bracket “[]” indicates the points you will earn for a satisfactory report and discussion.

In your discussion and explanations of your results, you should consider the following constraints:

Ø What knowledge of mathematics, science and engineering have you applied in this lab and what tools have you used in this lab? [5]

Ø Economic Constraint (performance/cost ratios) [5]

Ø Manufacturability, Modularity and Expandability Constraints, Environmental Constraint (power consumption) [5]

Ø Sustainability: Is your design and implementation sustainable? [5]

Ø What is the potential impact of your design on real time applications? [5]

Ø How would you utilize the memory hierarchy available to you to design real time applications? [5]

For each of the above programs, hand in the debugged source code with comments; the machine code is not necessary. Be very specific in your comments: explain what you are doing and why you are doing it.

6. Reference C Program for Matrix Multiplication

To better utilize cache memory, many computer systems provide a set of cache-optimized libraries for common computational tasks.

The following example gives a Blocked Matrix Multiplication Algorithm that attempts to maximize cache hit ratio by increasing the reuse rate of small blocks of data. The block size should be chosen in such a way that it can be completely stored in the cache. An application programmer should select the block size based on the available cache size.
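
As a first-order illustration of the block-size rule above, the following C sketch checks whether a block of given dimensions fits in a cache of a given size, and finds the largest square block that does. The function names are hypothetical, and the cache size must come from the Reference Manual; the figures in the usage note are assumptions for illustration only.

```c
#include <stddef.h>

/* Return 1 if a b1 x b2 block of elem_size-byte elements
 * fits in a cache of cache_bytes bytes, 0 otherwise. */
int block_fits(size_t b1, size_t b2, size_t elem_size, size_t cache_bytes)
{
    return b1 * b2 * elem_size <= cache_bytes;
}

/* Return the largest b such that a b x b block of elem_size-byte
 * elements fits in cache_bytes. Returns 0 if elem_size is 0. */
size_t max_square_block(size_t elem_size, size_t cache_bytes)
{
    size_t b = 0;

    if (elem_size == 0)
        return 0;

    while ((b + 1) * (b + 1) * elem_size <= cache_bytes)
        b++;
    return b;
}
```

For example, with an assumed 8192-byte cache and 4-byte elements, max_square_block(4, 8192) returns 45, since 45 × 45 × 4 = 8100 bytes. In practice a smaller block is usually chosen, because the cache must also hold the elements of the other matrices that are touched while the block is reused.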

    j := 1;
    for jj = 1 to N₂/B₂ do begin                      /* jj-loop */
        for kk = 1 to N/B₁ do                         /* kk-loop */
            for i = 1 to N₁ do                        /* i-loop */
                for k = B₁(kk-1) + 1 to kkB₁ do       /* k-loop */
                begin
                    put X[i,k] in a register;
                    for ii = j to j + B₂ - 1 do       /* ii-loop */
                        Z[i,ii] := Z[i,ii] + X[i,k] * Y[k,ii];
                end;
        j := j + B₂;
    end jj;

For simplicity, we assume that B₁ divides N and B₂ divides N₂.
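
The pseudocode above can be sketched in C as follows. This is one possible translation, not the reference program itself: it uses 0-based indices, row-major arrays of int32_t, and hypothetical function names. X is taken to be N₁ × N, Y to be N × N₂, and Z to be N₁ × N₂, which is inferred from the index ranges in the pseudocode. A plain triple-loop version is included so that results can be checked.

```c
#include <stddef.h>
#include <stdint.h>

/* Blocked multiply: Z (n1 x n2) += X (n1 x n) * Y (n x n2).
 * All matrices are row-major. Assumes b1 divides n and b2 divides n2.
 * Z must be initialized by the caller (e.g. to zero). */
void matmul_blocked(size_t n1, size_t n, size_t n2,
                    size_t b1, size_t b2,
                    const int32_t *x, const int32_t *y, int32_t *z)
{
    for (size_t jj = 0; jj < n2; jj += b2) {              /* jj-loop */
        for (size_t kk = 0; kk < n; kk += b1) {           /* kk-loop */
            for (size_t i = 0; i < n1; i++) {             /* i-loop  */
                for (size_t k = kk; k < kk + b1; k++) {   /* k-loop  */
                    /* X[i,k] is reused across the whole ii-loop */
                    const int32_t xik = x[i * n + k];
                    for (size_t ii = jj; ii < jj + b2; ii++) /* ii-loop */
                        z[i * n2 + ii] += xik * y[k * n2 + ii];
                }
            }
        }
    }
}

/* Unblocked reference version with the same interface (no block sizes). */
void matmul_naive(size_t n1, size_t n, size_t n2,
                  const int32_t *x, const int32_t *y, int32_t *z)
{
    for (size_t i = 0; i < n1; i++)
        for (size_t j = 0; j < n2; j++)
            for (size_t k = 0; k < n; k++)
                z[i * n2 + j] += x[i * n + k] * y[k * n2 + j];
}
```

For each (jj, kk) pair, the same B₁ × B₂ block of Y is reused for every row i, which is why that block should fit in the cache. Running both functions on the same inputs and comparing Z is a simple correctness check before timing the blocked version.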