Stencil-based kernels constitute the core of many scientific applications on block-structured grids. These calculations form the basis for a wide range of scientific applications, from simple Jacobi iterations to complex multigrid and block-structured adaptive PDE solvers. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. For my Ph.D. dissertation research, I propose to develop an automatic system that generates highly efficient, platform-adapted implementations of stencil kernels. In practice, performance is a complex function of many factors, including compiler technology, machine architecture, instruction scheduling, and memory access behavior. However, through the use of performance models and search, we can generate very good, if not optimal, stencil code. This tuned code can be over twice as fast as untuned code.
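For readers unfamiliar with the term, the following is a minimal sketch of the kind of kernel the abstract refers to: an untuned 5-point Jacobi sweep on a 2D grid. The function names (`make_grid`, `jacobi_sweep`, `jacobi`) and the boundary setup are assumptions made for illustration; they are not taken from the proposal, and the code shows only the reference computation, not the proposed tuning system.

```python
def make_grid(n):
    """Build an n x n grid: top boundary row held at 1.0, everything else 0.0."""
    grid = [[0.0] * n for _ in range(n)]
    grid[0] = [1.0] * n
    return grid


def jacobi_sweep(grid):
    """Return a new grid after one 5-point Jacobi sweep.

    Each interior cell becomes the average of its four neighbours in the
    old grid; boundary cells are copied unchanged.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new


def jacobi(grid, sweeps):
    """Apply the sweep repeatedly, always reading from the previous iterate."""
    for _ in range(sweeps):
        grid = jacobi_sweep(grid)
    return grid
```

Every interior update reads four neighbouring values and performs only a handful of arithmetic operations, which is why such kernels tend to be limited by memory traffic rather than by computation.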
ABSTRACT: In this work we investigate the impact of evolving memory system features, such as large on-chip caches, automatic prefetch, and the growing distance to main memory, on 3D stencil computations. These calculations form the basis for a wide range of scientific applications from simple Jacobi iterations to complex multigrid and block-structured adaptive PDE solvers. First we develop a simple benchmark to evaluate the effectiveness of prefetching in cache-based memory systems. Next we present a small parameterized probe and validate its use as a proxy for general stencil computations on three modern microprocessors. We then derive an analytical memory cost model for quantifying cache-blocking behavior and demonstrate its effectiveness in predicting the stencil-computation performance.
Overall results demonstrate that recent trends in memory system organization have reduced the efficacy of traditional cache-blocking optimizations.
Full-text · Conference Paper · Jan 2005
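The cache-blocking optimization mentioned in the abstract above can be sketched as follows, assuming a 7-point stencil on an n x n x n grid stored as nested lists. The blocking scheme (tiling the two innermost dimensions and streaming through the outermost one) and all names (`point`, `sweep_naive`, `sweep_blocked`, `bj`, `bk`) are illustrative assumptions, not the benchmark or probe used in the paper.

```python
def point(src, i, j, k):
    """7-point stencil: average of the six face neighbours of (i, j, k)."""
    return (src[i - 1][j][k] + src[i + 1][j][k]
            + src[i][j - 1][k] + src[i][j + 1][k]
            + src[i][j][k - 1] + src[i][j][k + 1]) / 6.0


def sweep_naive(src, dst, n):
    """Reference sweep: plain triple loop over the interior."""
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                dst[i][j][k] = point(src, i, j, k)


def sweep_blocked(src, dst, n, bj, bk):
    """Same sweep, visiting the interior in bj x bk tiles of the (j, k) plane.

    For each tile the code streams through every i, so the neighbouring
    planes of a tile are reused while they are still likely to be cached.
    """
    for jj in range(1, n - 1, bj):
        j_end = min(jj + bj, n - 1)
        for kk in range(1, n - 1, bk):
            k_end = min(kk + bk, n - 1)
            for i in range(1, n - 1):
                for j in range(jj, j_end):
                    for k in range(kk, k_end):
                        dst[i][j][k] = point(src, i, j, k)


def make_cube(n):
    """Deterministic test data (hypothetical values, chosen only for checking)."""
    return [[[float((7 * i + 3 * j + k) % 5) for k in range(n)]
             for j in range(n)] for i in range(n)]
```

Both sweeps compute identical values because a Jacobi sweep reads only from `src`; blocking changes the traversal order, not the result. Pure Python will not show the cache effect itself; the sketch is meant only to make the loop structure concrete.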
ABSTRACT: We present a cache oblivious algorithm for stencil computations, which arise for example in finite-difference methods. Our algorithm applies to arbitrary stencils in n-dimensional spaces. On an ideal cache of size Z, our algorithm saves a factor of Θ(Z^{1/n}) cache misses compared to a naive algorithm, and it exploits temporal locality optimally throughout the entire memory hierarchy.
Conference Paper · Jan 2005
Matteo Frigo, Volker Strumpen
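As a rough illustration of the cache-oblivious idea in the abstract above, the sketch below applies a recursive space-time decomposition to a 3-point stencil in one space dimension. It cuts the space-time region into trapezoids, splitting wide regions in space along a line of slope -1 and tall regions in time. The 3-point averaging kernel, the fixed end cells, and all names (`heat_cache_oblivious`, `heat_naive`, `walk`, `kernel`) are assumptions for illustration; this is a simplified one-dimensional case, not the general n-dimensional algorithm.

```python
def heat_naive(initial, steps):
    """Reference: sweep the whole array once per time step."""
    cur = initial[:]
    for _ in range(steps):
        nxt = cur[:]
        for x in range(1, len(cur) - 1):
            nxt[x] = (cur[x - 1] + cur[x] + cur[x + 1]) / 3.0
        cur = nxt
    return cur


def heat_cache_oblivious(initial, steps):
    """Same result, computed by recursive trapezoid decomposition."""
    n = len(initial)
    # Two time levels; the fixed end cells hold the same value in both.
    u = [initial[:], initial[:]]

    def kernel(t, x):
        cur, nxt = u[t % 2], u[(t + 1) % 2]
        nxt[x] = (cur[x - 1] + cur[x] + cur[x + 1]) / 3.0

    def walk(t0, t1, x0, dx0, x1, dx1):
        # Trapezoid: t0 <= t < t1, x0 + dx0*(t - t0) <= x < x1 + dx1*(t - t0).
        dt = t1 - t0
        if dt == 1:
            for x in range(x0, x1):
                kernel(t0, x)
        elif dt > 1:
            if 2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt:
                # Space cut: the left piece never depends on the right piece.
                xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) // 4
                walk(t0, t1, x0, dx0, xm, -1)
                walk(t0, t1, xm, -1, x1, dx1)
            else:
                # Time cut: finish the lower half before the upper half.
                s = dt // 2
                walk(t0, t0 + s, x0, dx0, x1, dx1)
                walk(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1)

    walk(0, steps, 1, 0, n - 1, 0)
    return u[steps % 2]
```

No cache size appears anywhere in the code: the recursion keeps shrinking the working set until it fits in whatever cache levels exist, which is the sense in which the approach is cache oblivious. Each point is still computed from the same three inputs with the same expression, so the result matches the naive sweep.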
ABSTRACT: Performance of stencil computations can be significantly improved through smart implementations that improve memory locality, increase computation reuse, or parallelize the computation. Unfortunately, efficient implementations are hard to obtain because they often involve non-traditional transformations, which means that they cannot be produced by optimizing the reference stencil with a compiler. In fact, many stencils are produced by code generators that were tediously handcrafted. In this paper, we show how stencil implementations can be produced with sketching. Sketching is a software synthesis approach where the programmer develops a partial implementation, called a sketch, and a separate specification of the desired functionality given by a reference (unoptimized) stencil.
The synthesizer then completes the sketch to behave like the specification, filling in code fragments that are difficult to develop manually. Existing sketching systems work only for small finite programs, i.e. programs that can be represented as small Boolean circuits. In this paper, we develop a sketching synthesizer that works for stencil computations, a large class of programs that, unlike circuits, have unbounded inputs and outputs, as well as an unbounded number of computations. The key contribution is a reduction algorithm that turns a stencil into a circuit, allowing us to synthesize stencils using an existing sketching synthesizer.
Full-text · Conference Paper · Jun 2007