Please use this identifier to cite or link to this item:
|Title:||PiPA: Pipelined profiling and analysis on multicore systems|
Parallel cache simulation
|Source:||Zhao, Q.,Cutcutache, I.,Wong, W.-F. (2010). PiPA: Pipelined profiling and analysis on multicore systems. Transactions on Architecture and Code Optimization 7 (3). ScholarBank@NUS Repository. https://doi.org/10.1145/1880037.1880038|
|Abstract:||Profiling and online analysis are important tasks in program understanding and feedback-directed optimization. However, fine-grained profiling and online analysis tend to seriously slow down the application. To cope with the slowdown, one may have to terminate the process early or resort to sampling. The former tends to distort the result because of warm-up effects. The latter runs the risk of missing important effects because sampling was turned off during the time that these effects appeared. A promising approach is to make use of the parallel processing capabilities of the now ubiquitous multicore processors to speed up the profiling and analysis process. In this article, we present Pipelined Profiling and Analysis (PiPA), which is a novel technique for parallelizing dynamic program profiling and analysis by taking advantage of multicore systems. In essence, the application under examination is profiled using a dynamic instrumentation tool. Optimized instrumentation code outputs the profile information in a succinct format, that we call the REP format, to buffers. This lightweight trace compression minimizes the processing overhead impinged on the application whenever a buffer is full. Another thread recovers the required information from the REP buffer. The recovered full profile is then divided up and passed to multiple threads for further analysis. To achieve the best performance, the entire system has to be well-balanced. We have implemented prototypes of PiPA using two dynamic instrumentation systems, namely DynamoRIO and Pin, thereby demonstrating its portability. Our experiments show that PiPA is able to speed up the overall profiling and analysis tasks significantly. Compared to the more than 100× slowdown of Cachegrind and the 32× slowdown of Pin dcache, we achieved a mere 10.2× slowdown on an 8-core system. In this paper, we will also describe the insights we gained in obtaining the balance needed for PiPA to perform optimally. © 2010 ACM.|
|Source Title:||Transactions on Architecture and Code Optimization|
|Appears in Collections:||Staff Publications|
Show full item record
Files in This Item:
There are no files associated with this item.
checked on Dec 5, 2017
WEB OF SCIENCETM
checked on Nov 2, 2017
checked on Dec 9, 2017
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.