Trevor Erik Carlson

Email Address
dcstec@nus.edu.sg


Organizational Units
COMPUTING (faculty)

Publication Search Results

Now showing 1 - 6 of 6
  • Publication
    Maximizing Limited Resources: a Limit-Based Study and Taxonomy of Out-of-Order Commit
    (2019) Alipour, M; Carlson, T.E; Black-Schaffer, D; Kaxiras, S; DEPARTMENT OF COMPUTER SCIENCE
    Out-of-order execution is essential for high performance, general-purpose computation, as it can find and execute useful work instead of stalling. However, it is typically limited by the requirement of visibly sequential, atomic instruction execution—in other words, in-order instruction commit. While in-order commit has a number of advantages, such as providing precise interrupts and avoiding complications with the memory consistency model, it requires the core to hold on to resources (reorder buffer entries, load/store queue entries, physical registers) until they are released in program order. In contrast, out-of-order commit can release some resources much earlier, yielding improved performance and/or lower resource requirements. Non-speculative out-of-order commit is limited in terms of correctness by the conditions described in the work of Bell and Lipasti (2004). In this paper we revisit out-of-order commit by examining the potential performance benefits of lifting these conditions one by one and in combination, for both non-speculative and speculative out-of-order commit. While correctly handling recovery for all out-of-order commit conditions currently requires complex tracking and expensive checkpointing, this work aims to demonstrate the potential for selective, speculative out-of-order commit using an oracle implementation without speculative rollback costs. 
Through this analysis of the potential of out-of-order commit, we learn that: a) there is significant untapped potential for aggressive variants of out-of-order commit; b) it is important to optimize the out-of-order commit depth for a balanced design, as smaller cores benefit from reduced depth while larger cores continue to benefit from deeper designs; c) a focus on implementing only a subset of the out-of-order commit conditions could lead to efficient implementations; d) the benefits of out-of-order commit increase with higher memory latency and in conjunction with prefetching; e) out-of-order commit exposes additional parallelism in the memory hierarchy. © 2018, The Author(s).
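The resource-release contrast above can be illustrated with a toy model (a hypothetical sketch for intuition only, not the paper's simulation infrastructure): instruction i dispatches at cycle i, and we compare how many reorder-buffer entries stay live when completed entries are released in order versus immediately.

```python
def live_entries(completion, cycle, ooo):
    """Number of ROB entries held at `cycle` for a toy trace where
    instruction i dispatches at cycle i and completes at completion[i]."""
    live = 0
    blocked = False  # in-order commit: an older incomplete instruction
                     # pins every younger entry in place
    for i, done in enumerate(completion):
        if i > cycle:          # not yet dispatched
            break
        if done > cycle:
            blocked = True
        if (ooo and done > cycle) or (not ooo and blocked):
            live += 1
    return live

def peak_occupancy(completion, ooo):
    """Peak number of live ROB entries over the whole toy trace."""
    return max(live_entries(completion, c, ooo)
               for c in range(max(completion) + 1))
```

With a long-latency load at the head of the trace (e.g. completion times [10, 1, 1, 1]), in-order commit holds all four entries behind the load, while out-of-order commit releases the three short instructions early, which is exactly the resource pressure the paper's study quantifies.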
  • Publication
    Leaking Control Flow Information via the Hardware Prefetcher
    (2021-09-02) Chen, Yun; Pei, Lingfeng; Carlson, Trevor E; Dr Trevor Erik Carlson; DEPARTMENT OF COMPUTER SCIENCE
Modern processor designs use a variety of microarchitectural methods to achieve high performance. Unfortunately, new side-channels have often been uncovered that exploit these enhanced designs. One area that has received little attention from a security perspective is the processor's hardware prefetcher, a critical component used to mitigate DRAM latency in today's systems. Prefetchers, like branch predictors, hold critical state related to the execution of the application, and have the potential to leak secret information. But up to now, there has not been a demonstration of a generic prefetcher side-channel that could be actively exploited in today's hardware. In this paper, we present AfterImage, a new side-channel that exploits the Intel Instruction Pointer-based stride prefetcher. We observe that, when the execution of the processor switches between different private domains, the prefetcher trained by one domain can be triggered in another. To the best of our knowledge, this work is the first to publicly demonstrate a methodology that is both algorithm-agnostic and also able to leak kernel data into userspace. AfterImage is different from previous works, as it leaks data on the non-speculative path of execution. Because of this, a large class of work that has focused on protecting transient, branch-outcome-based data will be unable to block this side-channel. By reverse-engineering the IP-stride prefetcher in modern Intel processors, we have successfully developed three variants of AfterImage to leak control flow information across code regions, processes, and the user-kernel boundary. We find a high level of accuracy in leaking information with our methodology (from 91% up to 99%), and propose two mitigation techniques to block this side-channel, one of which can be used on hardware systems today.
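A functional model of an IP-stride prefetcher helps make the attack surface concrete (a textbook simplification; the entry format and confidence policy here are assumptions, not the reverse-engineered Intel details): each table entry, indexed by the load's instruction pointer, remembers the last address and stride, and a repeated stride triggers a prefetch. Because this table persists across domain switches, training done by a victim can trigger observable prefetches for an attacker.

```python
class IPStridePrefetcher:
    """Illustrative IP-stride prefetcher model (simplified)."""

    def __init__(self):
        self.table = {}  # ip -> (last_addr, stride, confidence)

    def access(self, ip, addr):
        """Train on one load; return the prefetched address, or None."""
        last, stride, conf = self.table.get(ip, (None, 0, 0))
        if last is not None:
            new_stride = addr - last
            # confidence grows only when the same stride repeats
            conf = conf + 1 if new_stride == stride else 0
            stride = new_stride
        self.table[ip] = (addr, stride, conf)
        # issue a prefetch once the stride has been confirmed
        return addr + stride if conf >= 1 and stride != 0 else None
```

After two accesses establish a stride of 64, a third access prompts a prefetch of the next line; in the side-channel setting, it is this state surviving a switch between private domains that leaks control-flow information.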
  • Publication
    Power-performance tradeoffs in data center servers: DVFS, CPU pinning, horizontal, and vertical scaling
    (Elsevier BV, 2018-04-01) Krzywda, J; Ali-Eldin, A; Carlson, TE; Östberg, PO; Elmroth, E; Dr Trevor Erik Carlson; DEPARTMENT OF COMPUTER SCIENCE
Dynamic Voltage and Frequency Scaling (DVFS), CPU pinning, horizontal scaling, and vertical scaling are four techniques that have been proposed as actuators to control the performance and energy consumption of data center servers. This work investigates the utility of these four actuators, and quantifies the power-performance tradeoffs associated with them. Using replicas of the German Wikipedia running on our local testbed, we perform a set of experiments to quantify the influence of DVFS, vertical and horizontal scaling, and CPU pinning on end-to-end response time (average and tail), throughput, and power consumption with different workloads. Results of the experiments show that DVFS rarely reduces the power consumption of underloaded servers by more than 5%, but it can be used to limit the maximal power consumption of a saturated server by up to 20% (at a cost of performance degradation). CPU pinning reduces the power consumption of an underloaded server (by up to 7%) at the cost of performance degradation, which can be limited by choosing an appropriate CPU pinning scheme. Horizontal and vertical scaling improve both the average and tail response time, but the improvement is not proportional to the amount of resources added. The load balancing strategy has a significant impact on the tail response time of horizontally scaled applications.
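A back-of-the-envelope model shows why DVFS helps most on saturated rather than underloaded servers (this is the textbook CV²f scaling relation for illustration, not the paper's measured numbers): DVFS lowers voltage along with frequency, so dynamic switching power falls roughly cubically with frequency, while static leakage and the power of an already mostly-idle server are largely unaffected.

```python
def dynamic_power(cap, voltage, freq):
    """Dynamic CMOS switching power: P = C * V^2 * f (textbook model)."""
    return cap * voltage ** 2 * freq

# A 20% DVFS step scales both voltage and frequency to 0.8x nominal,
# so dynamic power drops to roughly 0.8^3 = 51.2% of nominal.
nominal = dynamic_power(cap=1.0, voltage=1.0, freq=1.0)
scaled = dynamic_power(cap=1.0, voltage=0.8, freq=0.8)
ratio = scaled / nominal  # ~0.512
```

This cubic sensitivity only applies to the dynamic fraction of power, which is why the measured savings on underloaded servers (dominated by static and idle power) stay below 5% while a saturated server can shed up to 20%.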
  • Publication
    Ultra-Fast CGRA Scheduling to Enable Run Time, Programmable CGRAs
    (2021) Lee, Jinho; Carlson, Trevor Erik; Dr Trevor Erik Carlson; DEPARTMENT OF COMPUTER SCIENCE
Coarse-Grained Reconfigurable Arrays (CGRAs) can offer both energy efficiency and high throughput for embedded systems today. However, one limitation of CGRAs is the extremely long mapping time, which can take many hours to complete for a typical workload. This extended mapping time, coupled with the typical use of a fixed CGRA program configuration, significantly limits potential use cases and hinders the ability to achieve the required performance and efficiency targets. To overcome these limitations, we propose a new, low-complexity CGRA mapping algorithm that compiles applications in milliseconds instead of hours. This is achieved by the use of key instruction placement guidelines which enable speedups of up to 800,000× while maintaining comparable kernel performance. This result allows, for the first time, the ability to dynamically reconfigure CGRA accelerators to adapt to the scenario at hand, be it an important phase of an application, or a user-generated query or request. Overall, this compiler solution could lay the foundation for improved system throughput and efficiency.
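To give a feel for what a low-complexity mapping pass looks like, here is a hypothetical greedy placement sketch (the paper's actual placement guidelines are not reproduced here): operations of a dataflow graph are visited in topological order and each is placed on the free processing element (PE) closest to its producers, a single linear pass rather than the hours-long search of conventional mappers.

```python
from itertools import product

def greedy_map(ops, deps, rows=2, cols=2):
    """Greedily place a dataflow graph onto a rows x cols CGRA grid.

    ops:  list of op names in topological order.
    deps: dict mapping an op to the list of its producer ops.
    Returns a dict op -> (row, col)."""
    placement = {}
    free = set(product(range(rows), range(cols)))
    for op in ops:
        producers = [placement[p] for p in deps.get(op, []) if p in placement]

        def cost(pe):
            # Manhattan distance to producers approximates routing cost
            return sum(abs(pe[0] - r) + abs(pe[1] - c) for r, c in producers)

        best = min(free, key=cost)  # one greedy choice per op: O(ops * PEs)
        placement[op] = best
        free.remove(best)
    return placement
```

A heuristic like this trades some mapping quality for near-instant compile times, which is what makes run-time reconfiguration of the CGRA plausible in the first place.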
  • Publication
    Efficient Instruction Scheduling using Real-time Load Delay Tracking
    (2021-09-07) Diavastos, Andreas; Carlson, Trevor E; Dr Trevor Erik Carlson; DEPARTMENT OF COMPUTER SCIENCE
Many hardware structures in today's high-performance out-of-order processors do not scale in an efficient way. To address this, different solutions have been proposed that build execution schedules in an energy-efficient manner. Issue time prediction processors are one such solution that use data-flow dependencies and predefined instruction latencies to predict issue times of repeated instructions. In this work, we aim to improve their accuracy, and consequently their performance, in an energy-efficient way. We accomplish this by taking advantage of two key observations. First, memory accesses often take longer to complete than the static, predefined access latency that is used to describe these systems. Second, we find that these memory access delays often repeat across iterations of the same code. This, in turn, allows us to predict the arrival time of these accesses. In this work, we introduce a new processor microarchitecture that replaces a complex reservation-station-based scheduler with an efficient, scalable alternative. Our proposed scheduling technique tracks real-time delays of loads to accurately predict instruction issue times, and uses a reordering mechanism to prioritize instructions based on that prediction, achieving close-to-out-of-order processor performance. To accomplish this in an energy-efficient manner we introduce: (1) an instruction delay learning mechanism that monitors repeated load instructions and learns their latest delay, (2) an issue time predictor that uses learned delays and data-flow dependencies to predict instruction issue times, and (3) priority queues that reorder instructions based on their issue time prediction. Together, our processor achieves 86.2% of the performance of a traditional out-of-order processor, higher than previous efficient scheduler proposals, while still consuming 30% less power.
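The interaction of components (1) and (2) can be sketched as follows (a simplified illustration with assumed table shapes and latencies, not the paper's exact microarchitecture): a small table remembers the last observed delay of each repeated load, and the predictor computes an instruction's issue time as the moment its last operand arrives, substituting the learned real delay for the static load latency.

```python
FIXED_LATENCY = {"alu": 1, "load": 3}  # static latencies (load = cache hit)

class IssueTimePredictor:
    def __init__(self):
        self.learned_delay = {}  # load PC -> last observed real delay

    def observe(self, pc, actual_delay):
        """Delay learning: record the latest real-time delay of a load."""
        self.learned_delay[pc] = actual_delay

    def predict(self, instr, producer_ready_times):
        """instr: (pc, kind). producer_ready_times: predicted completion
        times of its producers. Returns (issue_time, completion_time)."""
        pc, kind = instr
        issue = max(producer_ready_times, default=0)  # wait for operands
        latency = FIXED_LATENCY[kind]
        if kind == "load":
            # prefer the learned real delay over the static latency
            latency = self.learned_delay.get(pc, latency)
        return issue, issue + latency
```

Predicted completion times then feed the priority queues of component (3), which issue instructions in predicted-issue-time order rather than through an associative reservation-station search.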
  • Publication
    Elasticlave: An Efficient Memory Model for Enclaves.
    (2020) Yu, Zhijingcheng; Shinde, Shweta; Carlson, Trevor E; Saxena, Prateek; Dr Trevor Erik Carlson; DEPARTMENT OF COMPUTER SCIENCE
Trusted-execution environments (TEEs), like Intel SGX, isolate user-space applications into secure enclaves without trusting the OS. Thus, TEEs reduce the trusted computing base, but add one to two orders of magnitude of slowdown. The performance cost stems from a strict memory model, which we call the spatial isolation model, where enclaves cannot share memory regions with each other. In this work, we present Elasticlave---a new TEE memory model that allows enclaves to selectively and temporarily share memory with other enclaves and the OS. Elasticlave eliminates the need for expensive data copy operations, while offering the same level of application-desired security as is possible with the spatial model. We prototype the Elasticlave design on an RTL-designed cycle-level RISC-V core and observe 1 to 2 orders of magnitude performance improvements over the spatial model implemented with the same processor configuration. Elasticlave has a small TCB. We find that its performance characteristics and hardware area footprint scale well with the number of shared memory regions it is configured to support.
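The contrast between the two memory models can be expressed as a toy model (purely illustrative; the names and operations here are not the Elasticlave API): under strict spatial isolation the only way to pass data between enclaves is a full copy through untrusted memory, whereas a selectively shared region is simply made accessible to the other party in place, with no data movement.

```python
class Region:
    """A toy enclave memory region with an explicit accessor set."""

    def __init__(self, owner, data):
        self.owner = owner
        self.data = data
        self.accessors = {owner}

    def share(self, other):
        """Elastic-style sharing: grant access in place, zero-copy."""
        self.accessors.add(other)

def spatial_share(region, other):
    """Spatial-isolation sharing: the only option is an O(n) copy
    into a region the other enclave can access."""
    return Region(other, list(region.data))
```

In the zero-copy case the same backing data is reached by both parties, while the spatial path duplicates every byte per transfer, which is where the one-to-two-orders-of-magnitude gap measured in the paper comes from.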