Proceedings of MICRO'95: 28th Annual IEEE/ACM International Symposium on Microarchitecture
Abstract

Untolerated load instruction latencies often have a significant impact on overall program performance. As one means of mitigating this effect, we present an aggressive hardware-based mechanism that provides effective support for reducing the latency of load instructions. Through the judicious use of instruction predecode, base register caching, and fast address calculation, it becomes possible to complete load instructions up to two cycles earlier than in traditional pipeline designs. For a pipeline with one-cycle data cache access, this results in what we term a zero-cycle load. A zero-cycle load produces a result before reaching the execute stage of the pipeline, allowing subsequent dependent instructions to issue unfettered by load dependencies. Programs executing on processors with support for zero-cycle loads experience significantly fewer pipeline stalls due to load instructions and increased overall performance. We present two pipeline designs supporting zero-cycle loads: one for pipelines with a single stage of instruction decode, and another for pipelines with multiple decode stages. We evaluate these designs in a number of contexts: with and without software support, in-order vs. out-of-order issue, and on architectures with many and few registers. We find that our approach is quite effective at reducing the impact of load latency, even more so on architectures with in-order issue and few registers.
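The fast address calculation mentioned in the abstract exploits the fact that load offsets are typically small, so adding the offset rarely carries out of the low-order (cache-index) bits of the base register. A speculative address can therefore be formed early from a narrow add, with a full-width add performed in parallel to verify it. The sketch below is only an illustration of that idea, not the paper's actual circuit; the 12-bit index width and 32-bit address width are assumptions for the example.

```python
def fast_address(base, offset, index_bits=12, addr_bits=32):
    """Illustrative sketch of speculative (fast) address calculation.

    Speculation: add only within the low index_bits and splice the
    result onto the unmodified upper bits of the base register.
    Verification: compare against the full-width add, which in
    hardware would complete later.  index_bits/addr_bits are
    assumed parameters, not values from the paper.
    """
    mask = (1 << index_bits) - 1
    addr_mask = (1 << addr_bits) - 1

    # Speculative address: upper bits of base, plus a narrow low-order add.
    spec = ((base & ~mask) | ((base + offset) & mask)) & addr_mask

    # Full-width add used to confirm (or squash) the speculation.
    full = (base + offset) & addr_mask

    return spec, spec == full  # (speculative address, speculation-correct?)


# No carry out of the low 12 bits: speculation succeeds.
print(fast_address(0x1000, 4))    # (0x1004, True)

# Carry into the upper bits: speculation fails and must be squashed.
print(fast_address(0x1FFC, 8))    # (0x1004, False) -- full address is 0x2004
```

When the check fails, the pipeline would cancel the speculative cache access and replay the load with the correct address, so the common (carry-free) case gets the early access while correctness is preserved.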