High-Performance Graphics 2020
Browsing High-Performance Graphics 2020 by Issue Date
Item Compacted CPU/GPU Data Compression via Modified Virtual Address Translation (ACM, 2020) Seiler, Larry; Lin, Daqi; Yuksel, Cem; Yuksel, Cem and Membarth, Richard and Zordan, Victor
We propose a method to reduce the footprint of compressed data by using modified virtual address translation to permit random access to the data. This extends our prior work on using page translation to perform automatic decompression and deswizzling upon accesses to fixed-rate lossy or lossless compressed data. Our compaction method allows a virtual address space the size of the uncompressed data to be used to efficiently access variable-size blocks of compressed data. Compression and decompression take place between the first and second level caches, which allows fast access to uncompressed data in the first level cache and provides data compaction at all other levels of the memory hierarchy. This improves performance and reduces power relative to compressed but uncompacted data. An important property of our method is that compression, decompression, and reallocation are automatically managed by the new hardware without operating system intervention and without storing compression data in the page tables. As a result, although some changes are required in the page manager, it does not need to know the specific compression algorithm and can use a single memory allocation unit size. We tested our method with two sample CPU algorithms. When performing depth buffer occlusion tests, our method reduces the memory footprint by 3.1x. When rendering into textures, our method reduces the footprint by 1.69x before rendering and 1.63x after.
In both cases, the power and cycle time are better than for uncompacted compressed data, and significantly better than for accessing uncompressed data.

Item Concurrent Binary Trees (with application to longest edge bisection) (ACM, 2020) Dupuy, Jonathan; Yuksel, Cem and Membarth, Richard and Zordan, Victor
We introduce the concurrent binary tree (CBT), a novel concurrent representation to build and update arbitrary binary trees in parallel. Fundamentally, our representation consists of a binary heap, i.e., a 1D array, that explicitly stores the sum-reduction tree of a bitfield. In this bitfield, each one-valued bit represents a leaf node of the binary tree encoded by the CBT, which we locate algorithmically using a binary search over the sum-reduction. We show that this construction allows dispatching down to one thread per leaf node and that, in turn, these threads can safely split and/or remove nodes concurrently via simple bitwise operations over the bitfield. The practical benefit of CBTs lies in their ability to accelerate binary-tree-based algorithms with parallel processors. To support this claim, we leverage our representation to accelerate a longest-edge-bisection-based algorithm that computes and renders adaptive geometry for large-scale terrains entirely on the GPU. For this specific algorithm, the CBT accelerates processing speed linearly with the number of processors.

Item Using Hardware Ray Transforms to Accelerate Ray/Primitive Intersections for Long, Thin Primitive Types (ACM, 2020) Wald, Ingo; Morrical, Nate; Zellmann, Stefan; Ma, Lei; Usher, Will; Huang, Tiejun; Pascucci, Valerio; Yuksel, Cem and Membarth, Richard and Zordan, Victor
With the recent addition of hardware ray tracing capabilities, GPUs have become incredibly efficient at ray tracing both triangular geometry and instances thereof.
However, the bounding volume hierarchies that current ray tracing hardware relies on are known to struggle with long, thin primitives like cylinders and curves, because the axis-aligned bounding boxes that these hierarchies rely on cannot tightly bound such primitives. In this paper, we evaluate the use of RTX ray tracing capabilities to accelerate these primitives by tricking the GPU's instancing units into executing a hardware-accelerated oriented bounding box (OBB) rejection test before calling the user's intersection program. We show that this can be done with minimal changes to the intersection programs and demonstrate speedups of up to 5.9× on a variety of data sets.

Item Efficient Adaptive Deferred Shading with Hardware Scatter Tiles (ACM, 2020) Mallett, Ian; Yuksel, Cem; Seiler, Larry; Yuksel, Cem and Membarth, Richard and Zordan, Victor
Adaptive shading is an effective mechanism for reducing the number of shaded pixels to a subset of the image resolution with minimal impact on final rendering quality. We present a new scheduling method based on on-chip tiles that, along with relatively minor modifications to the GPU architecture, provides efficient hardware support. As compared to software implementations on current hardware using compute shaders, our approach dramatically reduces memory bandwidth requirements, thereby significantly improving performance and energy use.
We also introduce the concept of a fragment pre-shader for programmatically controlling when a fragment shader is invoked, and describe advanced techniques for utilizing our approach to further reduce the number of shaded pixels via temporal filtering, or to adjust rendering quality to maintain stable framerates.

Item Hardware-Accelerated Dual-Split Trees (ACM, 2020) Lin, Daqi; Vasiou, Elena; Yuksel, Cem; Kopta, Daniel; Brunvand, Erik; Yuksel, Cem and Membarth, Richard and Zordan, Victor
Bounding volume hierarchies (BVH) are the most widely used acceleration structures for ray tracing due to their high construction and traversal performance. However, the bounding planes shared between parent and children bounding boxes are an inherent storage redundancy that limits further improvement in performance due to the memory cost of reading these redundant planes. Dual-split trees can create identical space partitioning as BVHs, but in a compact form using less memory by eliminating the redundancies of the BVH structure representation. This reduction in memory storage and data movement translates to faster ray traversal and better energy efficiency. Yet, the performance benefits of dual-split trees are undermined by the processing required to extract the necessary information from their compact representation. This involves bit manipulations and branching instructions which are inefficient in software. We introduce hardware acceleration for dual-split trees and show that the performance advantages over BVHs are emphasized in a hardware ray tracing context that can take advantage of such acceleration. We provide details on how the operations needed for decoding dual-split tree nodes can be implemented in hardware and present experiments in a number of scenes with different sizes using path tracing.
In our experiments, we have observed up to 31% reduction in render time and 38% energy saving using dual-split trees as compared to binary BVHs representing identical space partitioning.

Item Neural Denoising for Path Tracing of Medical Volumetric Data (ACM, 2020) Hofmann, Nikolai; Martschinke, Jana; Engel, Klaus; Stamminger, Marc; Yuksel, Cem and Membarth, Richard and Zordan, Victor
In this paper, we transfer machine learning techniques previously applied to denoising surface-only Monte Carlo renderings to path-traced visualizations of medical volumetric data. In the domain of medical imaging, path-traced videos turned out to be an efficient means to visualize and understand internal structures, in particular for less experienced viewers such as students or patients. However, the computational demands for the rendering of high-quality path-traced videos are very high due to the large number of samples necessary for each pixel. To accelerate the process, we present a learning-based technique for denoising path-traced videos of volumetric data by increasing the sample count per pixel, both through spatial filtering (integrating neighboring samples) and temporal filtering (reusing samples over time). Our approach uses a set of additional features and a loss function both specifically designed for the volumetric case. Furthermore, we present a novel network architecture tailored for our purpose, and introduce reprojection of samples to improve temporal stability and reuse samples over frames. As a result, we achieve good image quality even from severely undersampled input images, as visible in the teaser image.

Item Post-Render Warp with Late Input Sampling Improves Aiming Under High Latency Conditions (ACM, 2020) Kim, Joohwan; Knowles, Pyarelal; Spjut, Josef; Boudaoud, Ben; McGuire, Morgan; Yuksel, Cem and Membarth, Richard and Zordan, Victor
End-to-end latency in remote-rendering systems can reduce user task performance.
This notably includes aiming tasks on game streaming services, which are presently below the standards of competitive first-person desktop gaming. We evaluate the latency-induced penalty on task completion time in a controlled environment and show that it can be significantly mitigated by adopting and modifying image and simulation-warping techniques from virtual reality, eliminating up to 80% of the penalty from 80 ms of added latency. This has the potential to enable remote rendering for esports and increase the effectiveness of remote-rendered content creation and robotic teleoperation. We provide full experimental methodology, analysis, implementation details, and source code.

Item Quadratic Approximation of Cubic Curves (ACM, 2020) Truong, Nghia; Yuksel, Cem; Seiler, Larry; Yuksel, Cem and Membarth, Richard and Zordan, Victor
We present a simple degree reduction technique for piecewise cubic polynomial splines, converting them into piecewise quadratic splines that maintain the parameterization and C1 continuity. Our method forms identical tangent directions at the interpolated data points of the piecewise cubic spline by replacing each cubic piece with a pair of quadratic pieces. The resulting representation can lead to substantial performance improvements for rendering geometrically complex spline models like hair and fiber-level cloth. Such models are typically represented using cubic splines that are C1-continuous, a property that is preserved with our degree reduction. Therefore, our method can also be considered a new quadratic curve construction approach for high-performance rendering. We prove that it is possible to construct a pair of quadratic curves with C1 continuity that passes through any desired point on the input cubic curve.
Moreover, we prove that when the pair of quadratic pieces corresponding to a cubic piece have equal parametric lengths, they join exactly at the parametric center of the cubic piece, and the deviation in positions due to degree reduction is minimized.

Item FLIP: A Difference Evaluator for Alternating Images (ACM, 2020) Andersson, Pontus; Nilsson, Jim; Akenine-Möller, Tomas; Oskarsson, Magnus; Åström, Kalle; Fairchild, Mark D.; Yuksel, Cem and Membarth, Richard and Zordan, Victor
Image quality measures are becoming increasingly important in the field of computer graphics. For example, there is currently a major focus on generating photorealistic images in real time by combining path tracing with denoising, for which such quality assessment is integral. We present FLIP, which is a difference evaluator with a particular focus on the differences between rendered images and corresponding ground truths. Our algorithm produces a map that approximates the difference perceived by humans when alternating between two images. FLIP is a combination of modified existing building blocks, and the net result is surprisingly powerful. We have compared our work against a wide range of existing image difference algorithms and we have visually inspected over a thousand image pairs that were either retrieved from image databases or generated in-house. We also present results of a user study which indicate that our method performs substantially better, on average, than the other algorithms. To facilitate the use of FLIP, we provide source code in C++, MATLAB, NumPy/SciPy, and PyTorch.

Item Sub-triangle opacity masks for faster ray tracing of transparent objects (ACM, 2020) Gruen, Holger; Benthin, Carsten; Woop, Sven; Yuksel, Cem and Membarth, Richard and Zordan, Victor
We propose an easy and simple-to-integrate approach to accelerate ray tracing of alpha-tested transparent geometry with a focus on Microsoft® DirectX® or Vulkan® ray tracing extensions.
Pre-computed bit masks are used to quickly determine fully transparent and fully opaque regions of triangles, thereby skipping the more expensive alpha-test operation. These bit masks allow us to skip up to 86% of all transparency tests, yielding up to 40% speedup in a proof-of-concept DirectX® software-only implementation.

Item Generalized Light Portals (ACM, 2020) Ogaki, Shinji; Yuksel, Cem and Membarth, Richard and Zordan, Victor
Light portals are useful for accelerating the convergence of Monte Carlo path tracing when rendering interiors. However, they are generally limited to flat polygonal shapes. In this paper, we introduce a new concept that allows existing polygon meshes with arbitrary shaders in a scene to be used as generalized light portals. We also present an efficient sampling method that takes into account the pixel values of the environment map and ray guiding two-dimensional textures that are typically opacity or transparency maps. This novel sampling strategy can be combined with other sampling techniques by using multiple importance sampling.

Item High-Performance Image Filters via Sparse Approximations (ACM, 2020) Schuster, Kersten; Trettner, Philip; Kobbelt, Leif; Yuksel, Cem and Membarth, Richard and Zordan, Victor
We present a numerical optimization method to find highly efficient (sparse) approximations for convolutional image filters. Using a modified parallel tempering approach, we solve a constrained optimization that maximizes approximation quality while strictly staying within a user-prescribed performance budget. The results are multi-pass filters where each pass computes a weighted sum of bilinearly interpolated sparse image samples, exploiting hardware acceleration on the GPU. We systematically decompose the target filter into a series of sparse convolutions, trying to find good trade-offs between approximation quality and performance.
Since our sparse filters are linear and translation-invariant, they do not exhibit the aliasing and temporal coherence issues that often appear in filters working on image pyramids. We show several applications, ranging from simple Gaussian or box blurs to the emulation of sophisticated Bokeh effects with user-provided masks. Our filters achieve high performance as well as high quality, often providing significant speed-up at acceptable quality even for separable filters. The optimized filters can be baked into shaders and used as a drop-in replacement for filtering tasks in image processing or rendering pipelines.
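
As a rough illustration of why a weighted sum of interpolated samples can stand in for a denser filter, the sketch below shows the well-known 1D version of the idea: two adjacent filter taps are folded into a single linearly interpolated sample placed at a fractional offset. This is not the optimization method from the abstract above, only the basic sampling trick it builds on. All function names (`sample_linear`, `merge_taps`, `dense_filter`, `sparse_filter`) are hypothetical, and the sketch assumes an even number of taps whose pairwise sums are nonzero.

```python
def sample_linear(signal, x):
    """Linearly interpolate `signal` at fractional position x (clamped to the borders)."""
    x = min(max(x, 0.0), len(signal) - 1.0)
    i = min(int(x), len(signal) - 2)
    t = x - i
    return (1.0 - t) * signal[i] + t * signal[i + 1]


def merge_taps(w0, w1):
    """Fold two adjacent taps (at offsets 0 and 1) into one fractional tap.

    Returns (fractional offset, combined weight). Assumes w0 + w1 != 0.
    """
    total = w0 + w1
    return w1 / total, total


def dense_filter(signal, start, weights):
    """Reference filter: weights[k] is applied to signal[start + k]."""
    return sum(w * signal[start + k] for k, w in enumerate(weights))


def sparse_filter(signal, start, weights):
    """Same filter using half as many samples, each linearly interpolated."""
    acc = 0.0
    for k in range(0, len(weights), 2):
        offset, weight = merge_taps(weights[k], weights[k + 1])
        acc += weight * sample_linear(signal, start + k + offset)
    return acc


if __name__ == "__main__":
    signal = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0]
    weights = [0.1, 0.4, 0.4, 0.1]  # illustrative 4-tap kernel
    print(dense_filter(signal, 1, weights), sparse_filter(signal, 1, weights))
```

In this sketch the two results agree up to floating-point rounding, because a linear fetch at offset `w1 / (w0 + w1)` scaled by `w0 + w1` reproduces `w0 * a + w1 * b` exactly. On a GPU the same role is played by the texture unit's bilinear filtering, which is the hardware acceleration the abstract refers to.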