Please use this identifier to cite or link to this item: https://doi.org/10.1109/IPDPS.2012.36
Title: Highly efficient performance portable tracking of evolving surfaces
Authors: Yu W.
Franchetti F.
Hoe J.C.
Chen T. 
Keywords: level set
parallel computing
performance tuning
scientific applications
Issue Date: 2012
Citation: Yu W., Franchetti F., Hoe J.C., Chen T. (2012). Highly efficient performance portable tracking of evolving surfaces. Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012 : 296-307. ScholarBank@NUS Repository. https://doi.org/10.1109/IPDPS.2012.36
Abstract: In this paper we present a framework to obtain highly efficient implementations for the narrow band level set method on commercial off-the-shelf (COTS) multicore CPU systems with a cache-based memory hierarchy such as Intel Xeon and Atom processors. The narrow-band level set algorithm tracks wave-fronts in discretized volumes (for instance, explosion shock waves), and is computationally very demanding. At the core of our optimization framework is a novel projection-based approach to enhance data locality and enable reuse for sparse surfaces in dense discretized volumes. The method reduces stencil operations on sparse and changing sets of pixels belonging to an evolving surface into dense stencil operations on meta-pixels in a lower-dimensional projection of the pixel space. These meta-pixels are then amenable to standard techniques like time tiling. However, the complexity introduced by ever-changing meta-pixels requires us to revisit and adapt all other necessary optimizations. We apply adapted versions of SIMDization, multi-threading, DAG scheduling for basic tiles, and specialization through code generation to extract maximum performance. The system is implemented as highly parameterized code skeleton that is auto-tuned and uses program generation. We evaluated our framework on a dual-socket 2.8 GHz Xeon 5560 and a 1.6 GHz Atom N270. Our single-core performance reaches 26% - 35% of the machine peak on the Xeon, and 12% - 20% on the Atom across a range of image sizes. We see up to 6.5x speedup on 8 cores of the dual-socket Xeon. For cache-resident sizes our code outperforms the best available third-party code (C pre-compiled into a DLL) by about 10x and for the largest out-of-cache sizes the speedup approaches around 200x. Experiments fully explain the high speedup numbers.
Source Title: Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012
URI: http://scholarbank.nus.edu.sg/handle/10635/146129
ISBN: 9780769546759
DOI: 10.1109/IPDPS.2012.36
Appears in Collections:Staff Publications

Show full item record
Files in This Item:
There are no files associated with this item.

Google ScholarTM

Check

Altmetric


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.