ELE 548: Computer Architecture

ELE 548 Computer Architecture Project - Spring 2020

Project Description

The project should explore some original research in computer architecture, validate (or debunk) an idea in a published paper, explore a published idea in more detail (e.g., examine the energy efficiency of a new technique), apply a published idea in a different context or for different workloads, or propose and evaluate your own idea. I expect projects to involve building or extending simulators, analytic models, or benchmarking/testing on real hardware.

You will be graded on how well you define your problem, survey previous work, design and conduct experiments, and present your results. The goal is to shoot for a conference paper, like the ones in your paper list.

The key is to have well-defined pieces in your project and to demonstrate that some of them work completely. For example, you may have made changes to the compiler flow, so not all benchmarks may run. But maybe you can build a solid analytical model to make your claims.
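As a rough illustration of what a small analytical model might look like, the sketch below uses the standard average memory access time (AMAT) formula to estimate the benefit of a technique that lowers the miss rate. The function names and all parameter values are hypothetical and chosen only for illustration; a real project would take them from measurements or simulation.

```python
# Hypothetical example: all parameter values below are made up for illustration.

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles: hit_time + miss_rate * miss_penalty."""
    return hit_time + miss_rate * miss_penalty


def speedup(baseline_cycles, improved_cycles):
    """Ratio of baseline to improved access time (greater than 1 means faster)."""
    return baseline_cycles / improved_cycles


# Baseline cache versus a cache whose miss rate drops from 5% to 4%.
baseline = amat(hit_time=1.0, miss_rate=0.05, miss_penalty=100.0)
improved = amat(hit_time=1.0, miss_rate=0.04, miss_penalty=100.0)
print(baseline, improved, speedup(baseline, improved))
```

A model like this does not replace experiments, but it can support a claim when only part of the benchmark suite runs.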

Logistics

+	Projects may be done individually or in groups of two.
+	Start thinking about projects as soon as possible.
+	Project plan is due Feb 25.
+	Project proposals due by March 17. Short 15-minute talk also included.
+	Project progress reports due on April 14. 15-min talk also included.
+	Project final presentations: April 28. Final project presentations will be 20-minute talks.
+	Final project reports are due on April 30.
+	Final reports must be in 2-column IEEE conference format, 8 to 10 pages.
+	I will meet with each project team once a week starting March 17 (tentative).

Project ideas

You are encouraged to come up with your own topic. Ideally, the topic will be related to your current research interests. For example, if you have an interest in compilers, then code scheduling for instruction level parallelism might be a good topic. If you are interested in VLSI design, a project related to pipeline clocking or low power architecture would be good. If you are interested in databases, quantifying the architectural characteristics of database workloads, and comparing them with characteristics of other workloads (e.g., SPEC) might be good. Some simulators and benchmark programs (e.g., SPEC2006, 2017) will be made available for carrying out simulation studies. Please reach out if you need any help finding a topic.

To help you with topics, a list of example projects follows.

+	Propose a new data prefetching algorithm and implement it on the data prefetching competition framework.
+	Implement and compare a few recent cache replacement policies, or propose a new one.
+	Using old register values to predict addresses of subsequent memory accesses. This allows the pipeline to do the cache access early in the pipeline, avoiding load-use stalls.
+	By looking for phases in applications where fewer physical registers may suffice we can cut down the amount of energy consumed by the register file.
+	Attempt to quantify how much of processor performance gain in the past decade has come from faster clocks and how much from ILP.
+	Cache enhancements, including victim caches, stream buffers, hash addressing, and others.
+	Implement and compare two recent prefetching schemes, or propose a new one.
+	architectural support of operating systems (e.g., user-level traps for lightweight threads)
+	prefetching methods (hardware and/or software) and their impact on performance
+	architectural characteristics of database workloads
+	cache behavior of networking (or other) applications or algorithms, with modification to exploit caches and memory hierarchies
+	Implement a new branch prediction mechanism and compare with recent ones.
+	Code analysis of 2-3 SPEC CPU 2017 integer benchmarks to identify performance bottlenecks (poor branch behavior, poor cache behavior), and propose methods for improvement.
+	Study problematic data structures in SPEC applications in terms of branch or cache behavior, and propose new data structures that can perform better - pure software approach.
+	Analyze coarse-grain memory access patterns, e.g., function-level, and study mechanisms to exploit these patterns.
+	Implement a traversal-aware memory allocator to improve caching.
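For the cache-related ideas above, a first experiment can be as small as a trace-driven model. The following is a minimal sketch, not taken from any of the listed frameworks: it assumes a set-associative cache with LRU replacement and a trace given as a plain list of byte addresses. The function name, parameters, and the toy trace are hypothetical.

```python
from collections import OrderedDict


def simulate(trace, num_sets, ways, block_bytes=64):
    """Return the miss rate of a set-associative LRU cache over a list of byte addresses."""
    # One OrderedDict per set; insertion order tracks recency (oldest entry first).
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        block = addr // block_bytes
        cache_set = sets[block % num_sets]
        if block in cache_set:
            cache_set.move_to_end(block)        # hit: mark as most recently used
        else:
            misses += 1
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)   # evict the least recently used block
            cache_set[block] = True
    return misses / len(trace) if trace else 0.0


# Toy trace: two blocks that map to the same set, accessed alternately.
toy_trace = [0, 256] * 4
print(simulate(toy_trace, num_sets=4, ways=1))  # direct-mapped: the blocks conflict
print(simulate(toy_trace, num_sets=4, ways=2))  # 2-way: both blocks fit in the set
```

Swapping the eviction line for a different victim choice is one simple way to start comparing replacement policies before moving to a full simulator.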

Tools

+	Championship Simulator (complete) - https://github.com/ChampSim/ChampSim
+	Gem5 simulator - http://gem5.org/Main_Page
+	3rd Data Prefetching Competition Framework (2019) - https://dpc3.compas.cs.stonybrook.edu
+	5th Branch Prediction Competition Framework - https://www.jilp.org/cbp2016/
+	2nd Cache Replacement Competition Framework (2017) - https://crc2.ece.tamu.edu
+	1st Value Prediction Competition Framework (2018) - https://www.microarch.org/cvp1/
+	URI Z4000 FPGA Computer - http://www.ele.uri.edu/~simoneau/z4800/doc/html/

Project plan

Submit a project plan by Feb 25 with the following:

+	Project partner name unless individual
+	List of topics you have considered for the project
+	Optional: More details and an abstract if you have one
+	Optional (highly recommended): Meet me at least once between Feb 19 and March 5 (proposal deadline) to discuss your project.

Proposal

Proposals may be turned in earlier to get earlier feedback. Come and talk with me during my office hours or by appointment. Proposals should be about two pages long. I will accept only PDF documents. They should include:

+	Abstract
+	A description of your topic
+	Statement of why the topic is important
+	Your key insight
+	Outline of the design you are proposing (if applicable)
+	Methodology of evaluation
+	Metrics of evaluation
+	References to related work. The paper list has pointers to many papers. See also the following conference web pages: ISCA, MICRO, HPCA, PACT, ASPLOS. Important journals in architecture include IEEE Micro, IEEE Computer, TACO, IEEE Transactions on Computers, ACM Transactions on Computer Systems.

Progress report

Progress reports should include a revised version of the proposal (if necessary). Primarily I am looking for a one- to two-page document describing accomplishments so far and a weekly plan of work for the remainder of the semester. Concentrate on describing sub-tasks completed, rather than the tasks started.

Your report should include the following:

+	Which parts of the design specification are complete
+	Design implementation status. For example, if you are modifying a simulator, which modifications are done
+	Outline of planned experiments. This can evolve and change, but I want you to have thought of an initial plan.
+	Finally, whatever questions/comments I marked on the proposal (or we have discussed during our project meetings) must be addressed. Include a response to the questions.

Presentation

We will schedule a mini-conference. Each group will give a 20-minute talk followed by 5 minutes of questions from the class. Your talk should focus on the highlights of the project rather than the gory details of everything you did. Motivate the problem, describe your key insight, summarize your results, and point to future work.

Final report

Final reports should consist of an abstract, body, and optional appendices, much like a conference paper. Submit 2-column formatted PDFs around 8 pages in length excluding appendices (if you really need more than 8 pages, you may use at most 2 extra pages). This report should strive for the same quality of presentation as the papers you have read during the semester, and should be well-organized, clearly written, and should stand alone (meaning that a reasonably well-informed reader will not have to refer to additional materials to understand what you have done). Please format the report neatly and check for typos, spelling, and grammatical correctness.

Here is a recommended report structure:

+	Title and Abstract: these should capture the main contributions of your project and summarize your results/findings.
+	Introduction and Motivation: introduce the topical area, identify the key problem or problems, explain their importance, and provide a brief overview of the solution (or study, or survey; whichever is relevant) that you are presenting.
+	Prior Work: summarize relevant related work, with appropriate citations
+	Main Concepts/Contributions: One or more sections that explain your main contributions and the details of operation of any algorithms, structures, or simulation studies that you proposed, implemented, and/or evaluated as part of your project. Block diagrams, flow charts, and detailed examples can be very helpful in presenting such concepts.
+	Results and Discussion: Provide and justify any quantitative results, preferably using graphs (vs. tables).
+	Conclusions and Future Work: Summarize your findings, identify remaining open problems or issues, and suggest the next steps for this project.
+	References: Properly formatted references to relevant prior papers that are cited in the body of the report.
+	Statement of Work: The project report must also include a statement of work that identifies the contributions of each individual on the team. This statement of work must reflect a team consensus and must be signed by all team members. I recommend that you structure this statement as a table with a row for each project milestone, a column for each team member, and the percentage contribution of each team member to each milestone in the entries in the table.
+	Finally, please provide a TAR or ZIP file of any source code (software and/or hardware description language) that you have written as part of this project. This file should be emailed to sendag@uri.edu. Since the course project is graded primarily based on effort, not results, I will need to access your source files to gauge the amount of effort you have expended on the project.

Paper list - A few papers from this list and others will be assigned for reading in class. You should be able to find copies on Google. If you cannot, let me know.

Advanced Microarchitecture

1. J. E. Smith and A. R. Pleszkun. Implementing Precise Interrupts in Pipelined Processors, IEEE Trans. on Computers, May 1988.

2. Kenneth C. Yeager. The MIPS R10000 Superscalar Microprocessor, IEEE Micro, April 1996.

3. D. Papworth. Tuning the Pentium Pro Architecture, IEEE Micro, April 1996.

4. Robert E. Kessler. The Alpha 21264 Microprocessor, IEEE Micro, March/April 1999, (Vol. 19, No. 2), pp. 24-36.

5. Simcha Gochman, Ronny Ronen, Ittai Anati, Ariel Berkovits, Tsvika Kurts, Alon Naveh, Ali Saeed, Zeev Sperber, Robert C. Valentine, The Intel(R) Pentium(R) M Processor: Microarchitecture and Performance, Intel Technology Journal, May 2003.

6. Timothy J. Slegel, et al., IBM's S/390 G5 Microprocessor, IEEE Micro, Mar/Apr 1999.

7. Borch et al., Loose Loops Sink Chips, HPCA 2002.

8. Hartstein and Puzak, Optimum Power/Performance Pipeline Depth, MICRO 2003.

9. B. Sprunt. Pentium 4 performance-monitoring features, IEEE Micro, July 2002.

10. Jerry Huck, Dale Morris, Jonathan Ross, Allan Knies, Hans Mulder, Rumi Zahir, Introducing the IA-64 Architecture, IEEE Micro Sep/Oct 2000, pp 12-23.

11. Srikanth Srinivasan, Ravi Rajwar, Haitham Akkary, Amit Gandhi, and Mike Upton, Continual Flow Pipelines, in Proceedings of ASPLOS 2004, October 2004.

12. I. Kim and M. Lipasti, Understanding Scheduling Replay Schemes, in Proceedings of the 10th International Symposium on High-performance Computer Architecture (HPCA-10), February 2004.

13. Ahmed S. Al-Zawawi, Vimal K. Reddy, Eric Rotenberg, Haitham H. Akkary, Transparent Control Independence, in Proceedings of ISCA-34, 2007.

14. T. Sha, M. Martin, A. Roth, NoSQ: Store-Load Communication without a Store Queue, in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006.

15. Pierre Salverda, Craig B. Zilles: Fundamental performance constraints in horizontal fusion of in-order cores. HPCA 2008: 252-263.

16. Andrew Hilton, Santosh Nagarakatte, Amir Roth, iCFP: Tolerating All-Level Cache Misses in In-Order Processors, Proceedings of HPCA 2009.

17. Loh, G. H., Xie, Y., and Black, B. 2007. Processor Design in 3D Die-Stacking Technologies. IEEE Micro 27, 3 (May. 2007), 31-48.

18. Alvin R. Lebeck, Jinson Koppanalil, Tong Li, Jaidev Patwardhan, and Eric Rotenberg. 2002. A large, fast instruction window for tolerating cache misses. In Proceedings of the 29th annual international symposium on Computer architecture (ISCA '02).

19. Onur Mutlu, Jared Stark, Chris Wilkerson, and Yale N. Patt, Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors, Proceedings of the 9th International Symposium on High-Performance Computer Architecture (HPCA), pages 129-140, Anaheim, CA, February 2003.

20. MARTINEZ, J. F., RENAU, J., HUANG, M., PRVULOVIC, M., AND TORRELLAS, J. 2002. Checkpointed early resource recycling in out-of-order microprocessors. In Proceedings of the 35th International Symposium on Microarchitecture, 2002. 3-14.

21. CRISTAL, A., ORTEGA, D., LLOSA, J., AND VALERO, M. Out-of-order commit processors. In Proceedings of the 10th International Symposium on High-Performance Computer Architecture, 2004, 48-59.

22. CRISTAL, A., ORTEGA, D., LLOSA, J., AND VALERO, M. Kilo-instruction processors. In Proceedings of the 5th International Symposium on High-Performance Computing, 2003. 10-25.

23. Adrian Cristal, Oliverio J. Santana, Mateo Valero, and José F. Martínez. 2004. Toward kilo-instruction processors. ACM Trans. Archit. Code Optim. 1, 4 (December 2004), 389-417.

24. Pericas, M.; Cristal, A.; Gonzalez, R.; Jimenez, D.A.; Valero, M., A decoupled KILO-instruction processor, in Proceedings of the Twelfth International Symposium on High-Performance Computer Architecture, pp. 53-64, 11-15 Feb. 2006.

Caches

1. https://crc2.ece.tamu.edu - website: 2nd Cache Replacement Championship, 2017.

2. Hawkeye Cache Replacement: Leveraging Belady's Algorithm for Improved Cache Replacement, A. Jain and Calvin Lin, 2nd Cache Replacement Championship, 2017.

3. SHiP++: Enhancing Signature-Based Hit Predictor for Improved Cache Performance, V. Young, C. Chou, A. Jaleel, M. Qureshi, 2nd Cache Replacement Championship, 2017.

4. ReD: A Policy Based on Reuse Detection for a Demanding Block Selection in Last-Level Caches, J. Diaz et al., 2nd Cache Replacement Championship, 2017.

5. Divide-and-conquer: a bubble replacement for low level caches, Chuanjun Zhang and Bing Xue, In Proceedings of the 23rd international conference on Supercomputing (ICS '09). ACM, New York, NY, USA, 80-89.

6. ACM-W Athena Award Lecture, Mary Jane Irwin: Shared Caches in Multicores: The Good, The Bad, and The Ugly, ISCA 2010. Slides only.

7. A Dueling Segmented LRU Replacement Algorithm with Adaptive Bypassing, H. Gao and C. Wilkerson, JWAC-1 Workshop 2010.

8. Map-based Adaptive Insertion Policy, Y. Ishii, M. Inaba, and K. Hiraki, JWAC-1, 2010.

9. Achieving Non-Inclusive Cache Performance With Inclusive Caches - Temporal Locality Aware (TLA) Cache Management Policies, Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon C. Steely Jr, and Joel Emer. In International Symposium on Microarchitecture (MICRO), Atlanta, Georgia, December 2010.

10. High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP), Aamer Jaleel, Kevin Theobald, Simon C. Steely Jr, and Joel Emer. In International Symposium on Computer Architecture (ISCA), Saint-Malo, France, June 2010

11. Adaptive Spill-Receive for Robust High-performance Caching in CMPs, Moinuddin K. Qureshi, HPCA 2009.

12. Adaptive Insertion Policies for High-Performance Caching, Moinuddin K. Qureshi, Aamer Jaleel, Yale N. Patt, Simon C. Steely Jr., and Joel Emer. In Proceedings of the 34th International Symposium on Computer Architecture (ISCA), San Diego, CA, June 2007.

13. A Case for MLP-Aware Cache Replacement, Moinuddin K. Qureshi, Daniel Lynch, Onur Mutlu, and Yale N. Patt, in the International Symposium on Computer Architecture (ISCA) 2006.

14. The V-Way Cache: Demand Based Associativity via Global Replacement, Moinuddin K. Qureshi, David Thompson, and Yale N. Patt, in the International Symposium on Computer Architecture (ISCA) 2005.

15. Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger, Architecting Phase Change Memory as a Scalable DRAM Alternative, Proceedings of the 36th International Symposium on Computer Architecture (ISCA), pages 2-13, Austin, TX, June 2009.

16. Norman P. Jouppi. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers, ISCA 1990.

17. Emre Ozer, Resit Sendag, and David Gregg, Multiple-Valued Caches for Power-Efficient Embedded Systems, IEEE International Symposium on Multiple-Valued Logic (ISMVL-35), May 2005.

18. David H. Albonesi, Selective Cache Ways: On-demand Cache Resource Allocation, MICRO 1999.

19. Resit Sendag, Ayse Yilmazer, Joshua J. Yi, and Augustus K. Uht, Quantifying and Reducing the Effects of Wrong-Path Memory References in Cache-Coherent Multiprocessor Systems, IEEE International Parallel and Distributed Processing Symposium, April 2006.

20. Changkyu Kim, Doug Burger, and Stephen W. Keckler. An Adaptive, Non-Uniform Cache Structure for Wire-Dominated On-Chip Caches, ASPLOS 2002.

21. S. Bansal and D. S. Modha. CAR: Clock with Adaptive Replacement, In FAST, 2004.

22. Basu et al. Scavenger: A New Last Level Cache Architecture with Global Block Priority. In Micro-40, 2007.

23. L. A. Belady. A study of replacement algorithms for a virtual-storage computer. In IBM Systems Journal, pages 78-101, 1966.

24. M. Chaudhuri. Pseudo-LIFO: The Foundation of a New Family of Replacement Policies for Last-level Caches, In Micro, 2009.

25. F. J. Corbato, A paging experiment with the Multics system, In Honor of P. M. Morse, pp. 217-228, MIT Press, 1969.

26. Jaleel, et al. Adaptive Insertion Policies for Managing Shared Caches. In PACT, 2008.

27. S. Jiang and X. Zhang, LIRS: An efficient low inter-reference recency set replacement policy to improve buffer cache performance, in Proc. ACM SIGMETRICS Conf., 2002.

28. T. Johnson and D. Shasha, 2Q: A low overhead high performance buffer management replacement algorithm, in VLDB Conf., 1994.

29. S. Kaxiras et al. Cache decay: exploiting generational behavior to reduce cache leakage power. In ISCA-28, 2001.

30. A. Lai, C. Fide, and B. Falsafi. Dead-block prediction & dead-block correlating prefetchers. In ISCA-28, 2001.

31. D. Lee et al. LRFU: A spectrum of policies that subsumes the least recently used and least frequently used policies, IEEE Trans. Computers, vol. 50, no. 12, pp. 1352-1360, 2001.

32. A. Seznec, A case for two-way skewed-associative cache, Proceedings of the 20th International Symposium on Computer Architecture (IEEE-ACM), San Diego, May 1993.

33. W. Lin et al. Predicting last-touch references under optimal replacement. Technical Report CSE-TR-447-02, U. of Michigan, 2002.

34. H. Liu et al. Cache Bursts: A New Approach for Eliminating Dead Blocks and Increasing Cache Efficiency. In Micro-41, 2008.

35. G. Loh. Extending the Effectiveness of 3D-Stacked DRAM Caches with an Adaptive Multi-Queue Policy. In Micro, 2009.

36. C.-K. Luk et al. Pin: building customized program analysis tools with dynamic instrumentation. In PLDI, pages 190-200, 2005.

37. N. Megiddo and D. S. Modha, ARC: A self-tuning, low overhead replacement cache, in FAST, 2003.

38. E. J. O'Neil et al. The LRU-K page replacement algorithm for database disk buffering, in Proc. ACM SIGMOD Conf., pp. 297-306, 1993.

39. M. Qureshi, A. Jaleel, Y. Patt, S. Steely, J. Emer. Adaptive Insertion Policies for High Performance Caching. In ISCA-34, 2007.

40. K. Rajan and G. Ramaswamy. Emulating Optimal Replacement with a Shepherd Cache. In Micro-40, 2007.

41. J. T. Robinson and M. V. Devarakonda, Data cache management using frequency-based replacement, in SIGMETRICS Conf, 1990.

42. R. Sugumar and S. Abraham, Efficient simulation of caches under optimal replacement with applications to miss characterization, in SIGMETRICS, 1993.

43. Y. Xie, G. Loh. PIPP: Promotion/Insertion Pseudo-Partitioning of Multi-Core Shared Caches. In ISCA-36, 2009

44. Y. Zhou and J. F. Philbin, The multi-queue replacement algorithm for second level buffer caches, in USENIX Annual Tech. Conf, 2001.

45. Jun Yang, Youtao Zhang, Rajiv Gupta, Frequent Value Compression in Data Caches, 33rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '00), p. 258, 2000.

46. Resit Sendag, Pengfei Chuang, and David J Lilja, Address Correlation: Exceeding the limits of Locality, IEEE Computer Architecture Letters, Volume 2, May 2003.

Branch Prediction

1. 5th Branch Prediction Competition Framework - https://www.jilp.org/cbp2016/

2. TAGE-SC-L Branch Predictors Again, Andre Seznec, 5th Branch Prediction Competition, 2016.

3. An Alternative TAGE-like Conditional Branch Predictor, Pierre Michaud. ACM Trans. Archit. Code Optim., 2018.

4. A Seznec, The L-TAGE predictor, Journal of Instruction Level Parallelism, May 2007.

5. Hongliang Gao and Huiyang Zhou, PMPM: Prediction by combining Multiple Partial Matches, Second Championship Branch Prediction Competition, 2006.

6. D. Jimenez, Piecewise Linear Branch Prediction, Int. Symposium on Computer Architecture, 2005.

7. D. Jimenez, Reconsidering Complex Branch Predictors, International Symposium on High-Performance Computer Architecture, 2003.

8. D. Jimenez and C. Lin, Dynamic Branch Prediction with Perceptrons, International Symposium on High Performance Computer Architecture, 2001.

9. J. E. Smith. A study of branch prediction strategies. In Proceedings of the 8th Annual International Symposium on Computer Architecture, pages 135-48, May 1981.

10. T.-Y. Yeh and Y. N. Patt. Alternative implementations of two-level adaptive branch prediction. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 124-34, May 1992.

11. S. McFarling. Combining branch predictors. Tech. Note TN-36, DEC WRL, June 1993.

12. Ayose Falcon, Jared Stark, Alex Ramirez, Konrad Lai, and Mateo Valero. Prophet/Critic Hybrid Branch Prediction. In Proceedings of the 31st annual international symposium on Computer architecture (ISCA '04). IEEE Computer Society, Washington, DC, 2004.

13. Kevin Skadron, Margaret Martonosi, Douglas W. Clark: A Taxonomy of Branch Mispredictions, and Alloyed Prediction as a Robust Solution to Wrong-History Mispredictions. IEEE PACT 2000: 199-206.

14. C.-C. Lee, I.-C. K. Chen, and T. N. Mudge. The bi-mode branch predictor. In Proceedings of the 30th Annual International Symposium on Microarchitecture, pages 4-13, Dec. 1997.

15. M. Evers, S. J. Patel, R. S. Chappell, and Y. N. Patt. An analysis of correlation and predictability: What makes two-level branch predictors work. In Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 52-61, June 1998.

16. Zhijian Lu, John Lach, Mircea R. Stan, Kevin Skadron: Alloyed Branch History: Combining Global and Local Branch History for Robust Performance. International Journal of Parallel Programming 31(2): 137-177 (2003).

17. A. Seznec, S. Felix, V. Krishnan, Y. Sazeides, Design trade-offs on the EV8 branch predictor, Proceedings of the 29th International Symposium on Computer Architecture (IEEE-ACM), Anchorage, May 2002.

18. A. Seznec and A. Fraboulet, Effective ahead pipelining of instruction block address generation, In Proceedings of the 30th Annual International Symposium on Computer Architecture, 2003.

19. E. Jacobsen, E. Rotenberg, and J. Smith, Assigning Confidence to Conditional Branch Predictions, International Symposium on Microarchitecture, 1996.

20. R. Sendag, Joshua J. Yi, and Peng-fei Chuang, Branch Misprediction Prediction: Complementary Branch Predictors, IEEE Computer Architecture Letters, Dec. 2007.

21. H. Gao, et al. Address-Branch Correlation: A Novel Locality for Long-Latency Hard-to-Predict Branches. HPCA-14, 2008.

22. T. Heil, Z. Smith, J. E. Smith. Improving Branch Predictors by Correlating on Data Values. MICRO-32, 1999.

23. Muawya Al-Otoom, Elliott Forbes, and Eric Rotenberg. EXACT: explicit dynamic-branch prediction with active updates. In Proceedings of the 7th ACM international conference on Computing frontiers (CF '10). ACM, New York, NY, 165-176, May 2010.

24. Celal Ozturk and Resit Sendag, An analysis of hard-to-predict branches, IEEE Symposium on Performance Analysis of Systems and Software (ISPASS-2010), White Plains, NY, March 2010.

Prefetching

1. 3rd Data Prefetching Competition Framework (2019) - https://dpc3.compas.cs.stonybrook.edu

2. Array Tracking Prefetcher for Indirect Accesses, M. Cavus, R. Sendag and J. Yi, IEEE International Conference on Computer Design (ICCD), Orlando, FL, USA, Oct 7-10, 2018.

3. IMP: indirect memory prefetcher, Xiangyao Yu, Christopher J. Hughes, Nadathur Satish, and Srinivas Devadas, In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). 2015.

4. A Best-Offset Prefetcher, P. Michaud, 2nd Data Prefetching Competition, 2015.

5. Towards Bandwidth-Efficient Prefetching with Slim AMPM, Vinson Young, Ajit Krisshna, 2nd Data Prefetching Competition, 2015.

6. Prefetching on time and when it works, I. Karsli, M. Cavus and R. Sendag, 2nd Data Prefetching Competition, 2015.

7. Jaekyu Lee, Hyesoon Kim, and Richard Vuduc, When Prefetching works, when it does not and Why? ACM Transactions on Architecture and Code Optimization, 2012.

8. Thomas F. Wenisch, Michael Ferdman, Anastassia Ailamaki, Babak Falsafi and Andreas Moshovos. Practical Off-chip Meta-data for Temporal Memory Streaming. Proc. of the 15th International Symposium on High Performance Computer Architecture (HPCA), Feb. 2009.

9. Stephen Somogyi, Thomas F. Wenisch, Anastassia Ailamaki, Babak Falsafi. Spatio-Temporal Memory Streaming. Proc. of the 36th International Symposium on Computer Architecture (ISCA), Jun. 2009.

10. Thomas F. Wenisch, Stephen Somogyi, Nikolaos Hardavellas, Jangwoo Kim, Anastassia Ailamaki, and Babak Falsafi. Temporal Streaming of Shared Memory. Proc. of the 32nd Annual ACM/IEEE International Symposium on Computer Architecture (ISCA), Jun 2005.

11. Eiman Ebrahimi, Onur Mutlu, Chang Joo Lee, and Yale N. Patt, Coordinated Control of Multiple Prefetchers in Multi-Core Systems, Proceedings of the 42nd International Symposium on Microarchitecture (MICRO), New York, NY, December 2009.

12. Eiman Ebrahimi, Onur Mutlu, and Yale N. Patt, Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems, Proceedings of the 15th International Symposium on High-Performance Computer Architecture (HPCA), pages 7-17, Raleigh, NC, February 2009.

13. Onur Mutlu, Hyesoon Kim, and Yale N. Patt, Address-Value Delta (AVD) Prediction: Increasing the Effectiveness of Runahead Execution by Exploiting Regular Memory Allocation Patterns, Proceedings of the 38th International Symposium on Microarchitecture (MICRO), pages 233-244, Barcelona, Spain, November 2005.

14. Chang Joo Lee, Veynu Narasiman, Onur Mutlu, and Yale N. Patt, Improving Memory Bank-Level Parallelism in the Presence of Prefetching, Proceedings of the 42nd International Symposium on Microarchitecture (MICRO), New York, NY, December 2009.

15. J. D. Collins, S. Sair, B. Calder, and D.M. Tullsen. Pointer cache assisted prefetching. In MICRO-35, 2002.

16. R. Cooksey, S. Jourdan, and D. Grunwald. A stateless, content-directed data prefetching mechanism. In ASPLOS-X, 2002.

17. K. J. Nesbit and J. E. Smith. Data cache prefetching using a global history buffer. In HPCA-10, 2004.

18. D. Joseph and D. Grunwald. Prefetching using Markov predictors. In ISCA-24, 1997.

19. A. Roth, A. Moshovos, and G. S. Sohi. Dependence based prefetching for linked data structures. In ASPLOS-8, 1998.

20. A. Roth and G. S. Sohi. Effective jump-pointer prefetching for linked data structures. In ISCA-26, 1999.

21. Z. Wang et al. Guided region prefetching: a cooperative hardware/software approach. In ISCA-30, 2003.

22. Norman P. Jouppi. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers, ISCA 1990.

Simulation

1. Zhangxi Tan, Andrew Waterman, Henry Cook, Sarah Bird, Krste Asanović, and David Patterson. 2010. A case for FAME: FPGA architecture model execution. In Proceedings of the 37th annual international symposium on Computer architecture (ISCA '10).

2. James C. Hoe, Doug Burger, Joel S. Emer, Derek Chiou, Resit Sendag, Joshua J. Yi: The Future of Architectural Simulation. IEEE Micro 30(3): 8-18 (2010)

3. Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, Eric Johnson, Jebediah Keefe and Hari Angepat. FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators. Proceedings of MICRO, December 2007.

4. Sharookh Daruwalla, Resit Sendag, Joshua J. Yi: Adaptive simulation sampling using an Autoregressive framework. IC-SAMOS 2009: 59-66

5. Joshua J. Yi, Resit Sendag, David J. Lilja, Douglas M. Hawkins: Speed versus Accuracy Trade-Offs in Microarchitectural Simulations. IEEE Trans. Computers 56(11): 1549-1563 (2007)

6. Joshua J. Yi, Resit Sendag, Lieven Eeckhout, Ajay Joshi, David J. Lilja, Lizy Kurian John: Evaluating Benchmark Subsetting Approaches. IISWC 2006: 93-104.

7. Joshua J. Yi, Sreekumar V. Kodakara, Resit Sendag, David J. Lilja, Douglas M. Hawkins: Characterizing and Comparing Prevailing Simulation Techniques. HPCA 2005: 266-277

8. R. Wunderlich, T. Wenisch, B. Falsafi, and J. Hoe. SMARTS - Accelerating Microarchitecture Simulation via Rigorous Statistical Sampling. Proc. of the 30th International Symposium on Computer Architecture (ISCA), (short paper) Jun. 2003.

9. T. Austin, E. Larson, and D. Ernst. SimpleScalar: An Infrastructure for Computer System Modeling. IEEE Computer, 35(2):59-67, Feb. 2002.

10. F. Bellard. QEMU, a Fast and Portable Dynamic Translator. In USENIX 2005 Annual Technical Conference, FREENIX Track, pages 41-46, 2005.

11. C. J. Mauer, M. D. Hill, and D. A. Wood. Full-system timing-first simulation. In SIGMETRICS '02: Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pages 108-116, New York, NY, USA, 2002.

12. T. F. Wenisch, R. E. Wunderlich, M. Ferdman, A. Ailamaki, B. Falsafi, and J. C. Hoe. SimFlex: Statistical Sampling of Computer Architecture Simulation. IEEE Micro, 26(4):18-31, July/August 2006.

13. M. T. Yourst. PTLSim: A Cycle Accurate Full System x86-64 Microarchitectural Simulator. In Proceedings of ISPASS, Jan. 2007.

14. T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing large scale program behavior. In ASPLOS-X: Proceedings of the 10th international conference on Architectural support for programming languages and operating systems, pages 45-57. 2002.

15. D. Patterson, Arvind, K. Asanovic, D. Chiou, J. C. Hoe, C. Kozyrakis, S.-L. Lu, M. Oskin, J. Rabaey, and J. Wawrzynek. RAMP: Research Accelerator for Multiple Processors. In Proceedings of Hot Chips 18, Palo Alto, CA, Aug. 2006.

16. Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and James E. Smith, A Top-Down Approach to Architecting CPI Component Performance Counters, IEEE Micro, Special Issue on Top Picks from 2006 Microarchitecture Conferences, Vol 27, No 1, pp. 84-93.

17. Ajay M. Joshi, Lieven Eeckhout, Robert H. Bell, Jr., and Lizy K. John, Performance Cloning: A Technique for Disseminating Proprietary Applications as Benchmarks, IISWC 2006, pp. 105-115.

18. Stijn Eyerman and Lieven Eeckhout, Per-Thread Cycle Accounting, IEEE Micro, Special Issue on Top Picks from 2009 Microarchitecture Conferences, Vol 30, No 1, pp. 71-80, Jan/Feb 2010.

19. Stijn Eyerman and Lieven Eeckhout, Modeling Critical Sections in Amdahl's Law and its Implications for Multicore Design, Proceedings of ISCA, pp. 362-370, June 2010.