EGGH08: SIGGRAPH/Eurographics Workshop on Graphics Hardware 2008
All-Pairs Shortest-Paths for Large Graphs on the GPU (The Eurographics Association, 2008)
Katz, Gary J.; Kider Jr., Joseph T.; David Luebke and John Owens
The all-pairs shortest-path problem is an intricate part in numerous practical applications. We describe a shared memory cache efficient GPU implementation to solve transitive closure and the all-pairs shortest-path problem on directed graphs for large datasets. The proposed algorithmic design utilizes the resources available on the NVIDIA G80 GPU architecture using the CUDA API. Our solution generalizes to handle graph sizes that are inherently larger than the DRAM memory available on the GPU. Experiments demonstrate that our method is able to significantly speed up the processing of large graphs, making our method applicable for bioinformatics, internet node traffic, social networking, and routing problems.

Coherent Layer Peeling for Transparent High-Depth-Complexity Scenes (The Eurographics Association, 2008)
Carr, Nathan; Mech, Radomir; Miller, Gavin; David Luebke and John Owens
We present two new image space techniques for efficient rendering of transparent surfaces that exploit partial ordering in the scene geometry. The first technique, called hybrid layer peeling, combines unordered meshes with ordered meshes in an efficient way, and is ideal for scenes such as volumes with embedded transparent meshes. The second technique, called coherent layer peeling, efficiently detects and renders correctly sorted fragment sequences for a given pixel in one iteration, allowing for a smaller number of passes than traditional layer peeling for typical scenes. Although more expensive than hybrid layer peeling by a constant factor, coherent layer peeling applies to a broader class of scenes, including single meshes or collections of meshes. Coherent layer peeling does not require costly clipping or perfect sorting. However, the performance of the algorithm depends on the degree to which the data is sorted.
At best, when the data is perfectly sorted, the algorithm renders a correct result in a single iteration. At worst, when the data is sorted in reverse order, the algorithm mimics the performance of layer peeling but with a higher cost per iteration. We conclude with a discussion of a modified form of coherent layer peeling designed for an idealized rasterization architecture that would match layer peeling in the worst case, while still exploiting correctly sorted sequences when they are present.

DHTC: An Effective DXTC-based HDR Texture Compression Scheme (The Eurographics Association, 2008)
Sun, Wen; Lu, Yan; Wu, Feng; Li, Shipeng; David Luebke and John Owens
In this paper, we propose a simple yet effective scheme (named DHTC) for HDR (high dynamic range) texture compression based on the popular LDR (low dynamic range) texture compression scheme - S3TC/DXTC. In the proposed scheme, the original HDR texture is first pre-processed with adaptive color transform and local dynamic range reduction. Then a color distribution linearization process and a joint-channel texture coding process are applied iteratively to generate the compressed HDR texture at 8 bpp. These techniques lead to near-lossless visual quality comparable to and even better than the state-of-the-art HDR texture compression schemes. Since DHTC is built upon the ubiquitous DXTC, a dedicated DHTC hardware design needs only a moderate extension of current hardware. Furthermore, the proposed DHTC format is suitable not only for HDR textures but also for LDR textures with alpha channels.
We believe this paper provides a unique solution to meet all the practical requirements for HDR and LDR texture compression.

Floating-Point Buffer Compression in a Unified Codec Architecture (The Eurographics Association, 2008)
Ström, Jacob; Wennersten, Per; Rasmusson, Jim; Hasselgren, Jon; Munkberg, Jacob; Clarberg, Petrik; Akenine-Möller, Tomas; David Luebke and John Owens
This paper presents what we believe are the first (public) algorithms for floating-point (fp) color and fp depth buffer compression. The depth codec is also available in an integer version. The codecs are harmonized, meaning that they share basic technology, making it easier to use the same hardware unit for both types of compression. We further suggest using these codecs in a unified codec architecture, meaning that compression/decompression units previously only used for color- and depth buffer compression can be used also during texture accesses. Finally, we investigate the bandwidth implications of using this in a unified cache architecture. The proposed fp16 color buffer codec compresses data down to 40% of the original, and the fp16 depth codec allows compression down to 4.5 bpp, compared to 5.3 for the state-of-the-art int24 depth compression method. If used in a unified codec and cache architecture, bandwidth reductions of about 50% are possible, which is significant.

GPU Accelerated Pathfinding (The Eurographics Association, 2008)
Bleiweiss, Avi; David Luebke and John Owens
In the past few years the graphics programmable processor (GPU) has evolved into an increasingly convincing computational resource for non-graphics applications. The GPU is especially well suited to address problem sets expressed as data parallel computation with the same program executed on many data elements concurrently.
In pursuing a scalable navigation planning approach for many thousands of agents in crowded game scenes, developers became more attracted to decomposable movement algorithms that lend themselves to explicit parallelism. Pathfinding is one key computational intelligence action in games that is typified by intense search over sparse graph data structures. This paper describes an efficient GPU implementation of parallel global pathfinding using the CUDA programming environment, and demonstrates GPU performance scale advantage in executing an inherently irregular and divergent algorithm.

A Hardware Processing Unit for Point Sets (The Eurographics Association, 2008)
Heinzle, Simon; Guennebaud, Gaël; Botsch, Mario; Gross, Markus; David Luebke and John Owens
We present a hardware architecture and processing unit for point sampled data. Our design is focused on fundamental and computationally expensive operations on point sets including k-nearest neighbors search, moving least squares approximation, and others. Our architecture includes a configurable processing module allowing users to implement custom operators and to run them directly on the chip. A key component of our design is the spatial search unit based on a kd-tree performing both kNN and εN searches. It utilizes stack recursions and features a novel advanced caching mechanism allowing direct reuse of previously computed neighborhoods for spatially coherent queries. In our FPGA prototype, both modules are multi-threaded, exploit full hardware parallelism, and utilize a fixed-function data path and control logic for maximum throughput and minimum chip surface.
A detailed analysis demonstrates the performance and versatility of our design.

An Improved Shading Cache for Modern GPUs (The Eurographics Association, 2008)
Sitthi-amorn, Pitchaya; Lawrence, Jason; Yang, Lei; Sander, Pedro V.; Nehab, Diego; David Luebke and John Owens
Several recently proposed techniques based on the principle of data reprojection allow reusing shading information generated in one frame to accelerate the calculation of the shading in the following frame. This strategy can significantly reduce the average rendering cost for many important real-time effects at an acceptable level of approximation error. This paper analyzes the overhead associated with incorporating temporal data reprojection on modern GPUs. Based on this analysis, we propose an alternative algorithm to those previously described in the literature and measure its efficiency for multiple scenes and hardware platforms.

Non-Uniform Fractional Tessellation (The Eurographics Association, 2008)
Munkberg, Jacob; Hasselgren, Jon; Akenine-Möller, Tomas; David Luebke and John Owens
We present a technique that modifies the tessellator in current graphics hardware so that the result is a more uniformly distributed tessellation in screen space. For increased flexibility, vertex tessellation weights are introduced. Our results show that the tessellation quality is improved at a moderate cost.

On Dynamic Load Balancing on Graphics Processors (The Eurographics Association, 2008)
Cederman, Daniel; Tsigas, Philippas; David Luebke and John Owens
To get maximum performance on the many-core graphics processors it is important to have an even balance of the workload so that all processing units contribute equally to the task at hand. This can be hard to achieve when the cost of a task is not known beforehand and when new sub-tasks are created dynamically during execution.
With the recent advent of scatter operations and atomic hardware primitives it is now possible to bring some of the more elaborate dynamic load balancing schemes from the conventional SMP systems domain to the graphics processor domain. We have compared four different dynamic load balancing methods to see which one is most suited to the highly parallel world of graphics processors. Three of these methods were lock-free and one was lock-based. We evaluated them on the task of creating an octree partitioning of a set of particles. The experiments showed that synchronization can be very expensive and that new methods that take more advantage of the graphics processors' features and capabilities might be required. They also showed that lock-free methods achieve better performance than blocking and that they can be made to scale with increased numbers of processing units.

Total Recall: A Debugging Framework for GPUs (The Eurographics Association, 2008)
Sharif, Ahmad; Lee, Hsien-Hsin S.; David Luebke and John Owens
GPUs have transformed from simple fixed-function processors to powerful, programmable stream processors and are continuing to evolve. Programming these massively parallel GPUs, however, is very different from programming a sequential CPU. Lack of native support for debugging coupled with the parallelism in the GPU makes program development for the GPU a non-trivial task. As GPU programs grow in complexity because of scaling in maximum allowed program size and increased demand in terms of realism, debugging GPU code is becoming a major time sink for content developers. In addition to more complex shaders, applications are using multi-pass effects in order to create more convincing reality. In this paper, we present a debugging framework that can be employed to debug complex code running on the GPU in an efficient manner. By observing the API calls of the application that are made to the 3D runtime, the framework can keep track of the program's state in memory.
Upon the programmer's request, it is able to capture and deterministically replay the stream of instructions that caused the final write to a pixel of interest. This execution stream includes writes to intermediate render targets and spans across shader boundaries. The stream of instructions can then be replayed on the CPU via emulation and the programmer can debug the straight-line code with ease. We also present a hardware-friendly scheme that can be used to accelerate the debugging process for long-chain multi-pass effects.

Tracy: A Debugger and System Analyzer for Cross-Platform Graphics Development (The Eurographics Association, 2008)
Kyöstilä, Sami; Kangas, Kari J.; Pulli, Kari; David Luebke and John Owens
We describe Tracy, an offline graphics debugging and system analysis toolkit for cross-platform system and application development in mobile graphics. Tracy operates by recording graphics function calls and argument data of unmodified applications into a trace file for offline playback, debugging, and performance analysis. In addition, traces can be edited and converted into platform-independent C files. We pay special attention to real-time performance; our trace compression mechanism allows interactive use of applications even when tracing long, multi-thousand-frame traces in real mobile hardware. We describe the use of the toolkit through real-world use cases such as debugging a visual error or a performance problem in an application, analyzing the application quality, and benchmarking a graphics engine.