

&

*Augustus K. Uht* Dept. of Electrical and Computer Engineering



Alireza Khalafi David Morano Marcos de Alba David Kaeli Dept. of Electrical and Computer Engineering



Euro-Par August 28, 2002

## Acknowledgements

Work supported by:

- U.S. National Science Foundation
- URI Office of the Provost
- Intel
- Mentor Graphics
- Xilinx
- Ministry of Education, Culture and Sports of Spain (D. Kaeli)

#### Outline

- 1. Closely Related Work
- 2. Needs and Lavo Solutions
- 3. High-Level Architecture and Microarchitecture
- 4. Time-Tag Example
- 5. Resource-Flow Execution
- 6. High-IPC Multipath Method Example
- 7. Experiments
- 8. Summary



# **Closely Related Work**

- Riseman & Foster (1972), Lam & Wilson (1992) and others (<u>un</u>constrained resources): *much ILP in General Purpose code:* > x100
- But: *little IPC realized in real machines:* ~ 1-2
- Segmented IQ's ISCA2002, etc.: don't scale, in dispatch stage, PE's not distributed; we predate.
- Tomasulo '67 elegant, but doesn't scale
- Limited register lifetime Sohi et al '92
   One key to Levo scalability
- Warp machine Cleary et al '95 *time-tags* 
  - Basic idea good, but used floating-point tags

URI - LEV - NEU - Euro-Par Aug. 28, 2002

## Needs and LEVO Solutions

- 1. Cheap and scalable dependency detection & operand linking
  - → *time-tags* (small): link & order operand usage.
- 2. Little cycle-time impact & scalability
  → constant length *segmented* or *spanning buses*
- 3. Simple execution algorithm
   → *resource-flow* execution: Instructions flow to PE's, executed <u>regardless</u> of dependencies.
- 4. High IPC → hardware predication &
   *Disjoint Eager Execution (DEE)* smart multipath
- 5. Legacy code  $\rightarrow$  ISA independent, no compiler assist



Processing Elements (PEs) are distributed among AS's.



## Active Station (AS)

• LSTT (Last-Snarfed Time Tag) is key to operand linking



| <u>Instruction</u><br><u>Number</u>             |                                                      | <u>Instruction</u><br><u>Result Time</u><br>(ResTT) |                      | <u>Last Snarfed</u><br><u>Time Tag</u><br>(LSTT).<br><u>In Active Station</u> . |  |  |  |
|-------------------------------------------------|------------------------------------------------------|-----------------------------------------------------|----------------------|---------------------------------------------------------------------------------|--|--|--|
| 1.                                              | R4 = 1                                               | 1                                                   | R4 = 1               | _                                                                               |  |  |  |
| 5.                                              | R4 = 2                                               | 5                                                   | R4 = 2               | _                                                                               |  |  |  |
| 9.                                              | R3 = R4                                              | 9                                                   | R3 = R4              | 1, then $5$                                                                     |  |  |  |
|                                                 | Sequential<br>Execution<br>(at end,<br>R3 holds '2') | Execution<br>(at end, $-R3 = 1$ , LSTT = 1          |                      |                                                                                 |  |  |  |
| Case $2 \prec LSTT$ is set to and stays at '5'; |                                                      |                                                     |                      |                                                                                 |  |  |  |
|                                                 | d by I9.)                                            |                                                     |                      |                                                                                 |  |  |  |
| (i) <u>Pr</u>                                   | ogram Code                                           | (i                                                  | ii) <u>With Time</u> | Tags                                                                            |  |  |  |



Example







Example

#### **Resource-Flow Execution**

#### What it is:

*Execute everything, then clean up.* (Example of this: in last set of slides, if: I1, I5, I9 all execute in first cycle, then either Case 1 or 2.)

Or, more precisely:

Execute any instruction regardless of the presence of its operands or predicates, <u>resources permitting</u>, then apply programmatic constraints to obtain correct execution.

# High-IPC Methods

- *Hardware predication*:
  - Predicates generated with hardware
  - Branch domains determined with hardware
- *D-paths*: multipath execution based on DEE
   *Not-predicted path* of some branches executed justin-case; has lower priority for resources











# Experimental Methodology

- Trace-driven simulator used
- MIPS-1 ISA binaries simulated
- Five SPECint95 and SPECint2000 benchmarks simulated
- L1 D-cache: 1 cycle hit, 10 cycles miss
- L1 I-cache, L2, memory: perfect (100% hit)
- *Baseline Machine (BM)*: bound by true dependencies, no time-tagging, no resource flow, no D-paths.
- *BM-CM*: baseline with Conventional Memory

#### Experiments

- Varying machine configuration: *s(SG's/column) a(M-path AS's /SG) c(columns)*  [*c* is # M-path columns, is also # D-path columns when present]
- *CM vs. PM* (Perfect Memory: 100% L1 hit)
- *BL*: baseline no resource flow, no D-paths
   *vs. RF*: w/resource flow but no D-paths
   *vs. D*: w/resource flow and D-paths

#### Raw IPC vs. Configuration

| Machine | SGsper | ASs per | Columns | gzip  | gap   | parser | bzip  | go    |
|---------|--------|---------|---------|-------|-------|--------|-------|-------|
| Config  | Column | SG      |         | BL-CM | BL-CM | BL-CM  | BL-CM | BL-CM |
|         |        |         |         | IPC   | IPC   | IPC    | IPC   | IPC   |
| s4a4c4  | 4      | 4       | 4       | 2.3   | 2.4   | 1.8    | 1.9   | 1.7   |
| s8a4c4  | 8      | 4       | 4       | 2.8   | 3.5   | 2.4    | 2.5   | 2.4   |
| s8a4c8  | 8      | 4       | 8       | 2.9   | 3.9   | 2.5    | 2.7   | 2.5   |
| s8a8c8  | 8      | 8       | 8       | 4.1   | 4.4   | 2.7    | 2.9   | 2.7   |
| s16a8c4 | 16     | 8       | 4       | 3.1   | 3.9   | 2.5    | 2.6   | 2.5   |
| s8a4c16 | 8      | 4       | 16      | 3.1   | 4.2   | 2.5    | 2.5   | 2.5   |

Levo machine configurations and BL-CM IPC values for the 5 benchmarks. s = SGs per column, a = ASs per SG and c = Columns.



## Speedups vs. Config. & Machine Type



#### Summary



- New execution core
- Novel techniques for scalability with low cycle time
- Time-Tags & Resource Flow Execution are wins
- High-IPC, & more there:
  - D-CM with branch oracle: about 50% more IPC
- Conventional memory IPC close to perfect memory
- D-paths quite effective at improving performance

#### **Relevant Web Sites**

Levo links:

www.ele.uri.edu/~uht

Or: www.levo.org

Levo visualization (direct): ovel.ele.uri.edu:8080

