EGPGV12: Eurographics Symposium on Parallel Graphics and Visualization
Item: Auto Splats: Dynamic Point Cloud Visualization on the GPU (The Eurographics Association, 2012)
Authors: Preiner, Reinhold; Jeschke, Stefan; Wimmer, Michael
Editors: Hank Childs, Torsten Kuhlen, and Fabio Marton
Abstract: Capturing real-world objects with laser-scanning technology has become an everyday task. Recently, the acquisition of dynamic scenes at interactive frame rates has become feasible. A high-quality visualization of the resulting point cloud stream would require a per-frame reconstruction of object surfaces. Unfortunately, reconstruction computations are still too time-consuming to be applied interactively. In this paper we present a local surface reconstruction and visualization technique that provides interactive feedback for reasonably sized point clouds, while achieving high image quality. Our method is performed entirely on the GPU and in screen space, exploiting the efficiency of the common rasterization pipeline. The approach is very general, as no assumption is made about point connectivity or sampling density. This naturally allows combining the outputs of multiple scanners in a single visualization, which is useful for many virtual and augmented reality applications.

Item: Dynamic Scheduling for Large-Scale Distributed-Memory Ray Tracing (The Eurographics Association, 2012)
Authors: Navrátil, Paul A.; Fussell, Donald S.; Lin, Calvin; Childs, Hank
Editors: Hank Childs, Torsten Kuhlen, and Fabio Marton
Abstract: Ray tracing is an attractive technique for visualizing scientific data because it can produce high-quality images that faithfully represent physically based phenomena. Its embarrassingly parallel reputation makes it a natural candidate for visualizing large data sets on distributed-memory clusters, especially for machines without specialized graphics hardware. Unfortunately, the traditional recursive ray tracing algorithm is exceptionally memory-inefficient on large data, especially when using a shading model that generates incoherent secondary rays. As visualization moves through the petascale to the exascale, disk and memory efficiency will become increasingly important for performance, and traditional methods are inadequate. This paper presents a dynamic ray scheduling algorithm that effectively manages both ray state and data accesses. Our algorithm can render datasets that are larger than aggregate system memory, which existing statically scheduled ray tracers cannot render. For example, using 1024 cores of a supercomputing cluster, our unoptimized algorithm ray traces a 650 GB dataset from an N-body simulation with shadows and reflections at about 1100 seconds per frame. For smaller problems that fit in aggregate memory but are larger than typical shared memory, our algorithm is competitive with the best static scheduling algorithm.
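The queue-based routing at the heart of such a scheduler can be pictured with a small sketch. The C++ below is a minimal illustration of binning rays by the data domain they traverse next and choosing which domain to process; the types and the fullest-queue policy are our own illustrative assumptions, not the paper's actual heuristics or code.

```cpp
// Illustrative sketch of domain-based ray queueing (not the authors' code):
// rays are binned by the data domain they enter next; a domain's queue is
// processed when its data is resident, otherwise the rays can be shipped to
// the node that owns (or will load) that domain.
#include <cstdint>
#include <map>
#include <vector>

struct Ray { float org[3], dir[3]; std::uint32_t pixel; };

class RayScheduler {
public:
    // Bin a ray into the queue of the domain it traverses next.
    void enqueue(int domain, const Ray& r) { queues_[domain].push_back(r); }

    // Pick the next domain to process: here, simply the fullest queue,
    // which amortizes the cost of loading that domain's data over the
    // largest number of pending rays.
    int next_domain() const {
        int best = -1; std::size_t most = 0;
        for (const auto& [dom, q] : queues_)
            if (q.size() > most) { most = q.size(); best = dom; }
        return best;
    }

    // Hand the queued rays for a domain to the tracer and clear the bin.
    std::vector<Ray> take(int domain) {
        std::vector<Ray> rays = std::move(queues_[domain]);
        queues_.erase(domain);
        return rays;
    }
private:
    std::map<int, std::vector<Ray>> queues_;
};
```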
Item: EAVL: The Extreme-scale Analysis and Visualization Library (The Eurographics Association, 2012)
Authors: Meredith, Jeremy S.; Ahern, Sean; Pugmire, Dave; Sisneros, Robert
Editors: Hank Childs, Torsten Kuhlen, and Fabio Marton
Abstract: Analysis and visualization of the data generated by scientific simulation codes is a key step in enabling science from computation. However, a number of challenges lie along the current hardware and software paths to scientific discovery. First, only advanced parallelism techniques can take full advantage of the unprecedented scale of coming machines. In addition, as computational improvements outpace those of I/O, more data will be discarded and I/O-heavy analysis will suffer. Furthermore, the limited memory environment, particularly in the context of in situ analysis, which can sidestep some I/O limitations, will require efficiency in both algorithms and infrastructure. Finally, advanced simulation codes with complex data models require commensurate data models in analysis tools. However, community visualization and analysis tools designed for parallelism and large data fall short in a number of these areas. In this paper, we describe EAVL, a new library with infrastructure and algorithms designed to address these critical needs for current and future generations of scientific software and hardware. We show results from EAVL demonstrating the strengths of its robust data model, advanced parallelism, and efficiency.

Item: Explicit Cache Management for Volume Ray-Casting on Parallel Architectures (The Eurographics Association, 2012)
Authors: Jönsson, Daniel; Ganestam, Per; Doggett, Michael; Ynnerman, Anders; Ropinski, Timo
Editors: Hank Childs, Torsten Kuhlen, and Fabio Marton
Abstract: A major challenge when designing general-purpose graphics hardware is to allow efficient access to texture data. Although different rendering paradigms vary with respect to their data access patterns, there is no flexibility in the data caching provided by the graphics architecture. In this paper we focus on volume ray-casting and show the benefits of algorithm-aware data caching. Our Marching Caches method exploits inter-ray coherence and thus utilizes the memory layout of the highly parallel processors by allowing them to share data through a cache that marches along with the ray front. By exploiting Marching Caches we can apply higher-order reconstruction and enhancement filters to generate more accurate and enriched renderings with improved rendering performance. We have tested our Marching Caches with seven different filters, e.g., Catmull-Rom, B-spline, and ambient occlusion projection, and show that a fourfold speedup can be achieved compared to using the caching implicitly provided by the graphics hardware, and that the memory bandwidth to global memory can be reduced by orders of magnitude. Throughout the paper, we introduce the Marching Cache concept, provide implementation details, and discuss the performance and memory bandwidth impact of different filters.

Item: Fast Collision Culling in Large-Scale Environments Using GPU Mapping Function (The Eurographics Association, 2012)
Authors: Avril, Quentin; Gouranton, Valérie; Arnaldi, Bruno
Editors: Hank Childs, Torsten Kuhlen, and Fabio Marton
Abstract: This paper presents a novel and efficient GPU-based parallel algorithm for culling non-colliding object pairs in very large-scale dynamic simulations. It can cull more than 100K objects in less than 25 ms. It is designed for many-core GPUs and fully exploits multi-threaded capabilities and data parallelism. To take advantage of the high number of cores, a new mapping function is defined that enables GPU threads to determine the object pair to compute without any global memory access. These optimized GPU kernel functions use the thread indexes and turn them into a unique pair of objects to test. A square-root approximation technique based on Newton's method enables the threads to perform only a few atomic operations. A first characterization of the approximation errors is presented, enabling incorrect computations to be fixed. The I/O GPU streams are optimized using binary masks. The implementation is evaluated on large-scale dynamic rigid-body simulations, and the speedup over other recently proposed CPU- and GPU-based techniques is highlighted. The comparison shows that our system is, in most cases, faster than previous approaches.
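The thread-index-to-pair mapping the abstract describes can be made concrete. Below is an illustrative C++ sketch (not the paper's kernel) of the closed-form inversion that lets a thread recover a unique object pair (i, j) from a flat index k with no memory lookup; the paper computes the square root with a few Newton iterations and then corrects the characterized error cases, which we stand in for here with std::sqrt and a simple rounding guard.

```cpp
// Map a flat thread index k to a unique unordered pair (i, j) with i > j,
// so each GPU thread can derive its object pair from its index alone.
// Pairs are enumerated as (1,0), (2,0), (2,1), (3,0), ...,
// i.e. k = i*(i-1)/2 + j with 0 <= j < i; inverting the quadratic gives i.
#include <cmath>
#include <cstdio>

void index_to_pair(unsigned k, unsigned& i, unsigned& j) {
    i = (unsigned)((1.0 + std::sqrt(1.0 + 8.0 * (double)k)) / 2.0);
    j = k - i * (i - 1) / 2;
    // Guard against floating-point rounding at pair boundaries (the paper
    // instead characterizes and fixes the errors of its fast Newton-based
    // square-root approximation).
    if (j >= i) { ++i; j = k - i * (i - 1) / 2; }
}

int main() {
    unsigned i, j;
    for (unsigned k = 0; k < 6; ++k) {   // 4 objects yield 6 pairs
        index_to_pair(k, i, j);
        std::printf("k=%u -> (%u,%u)\n", k, i, j);
    }
}
```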
Item: GLuRay: Enhanced Ray Tracing in Existing Scientific Visualization Applications using OpenGL Interception (The Eurographics Association, 2012)
Authors: Brownlee, Carson; Fogal, Thomas; Hansen, Charles D.
Editors: Hank Childs, Torsten Kuhlen, and Fabio Marton
Abstract: Ray tracing in scientific visualization allows for substantial gains in performance and rendering quality with large-scale polygonal datasets compared to brute-force rasterization; however, implementing new rendering architectures in existing tools is often costly and time-consuming. This paper presents a library, GLuRay, which intercepts OpenGL calls from many common visualization applications and renders them with the CPU ray tracer Manta, without modification to the underlying visualization tool. Rendering polygonal models such as isosurfaces can be done identically to an OpenGL implementation using the provided material and camera properties, or superior rendering can be achieved using enhanced settings such as dielectric materials or pinhole cameras with depth-of-field effects. Comparative benchmarks were conducted on the Texas Advanced Computing Center's Longhorn cluster using the popular visualization packages ParaView, VisIt, EnSight, and VAPOR. Through the parallel rendering package ParaView, scaling up to 64 nodes is demonstrated. Our tests show that using OpenGL interception to accelerate and enhance visualization programs provides a viable enhancement to existing tools, with little overhead and no code modification, while allowing the creation of publication-quality renderings using advanced effects and greatly improving large-scale software rendering performance within tools that scientists are currently using.
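The interception mechanism itself fits in a few lines. This is a hedged sketch of how a shared library can hook OpenGL entry points on Linux via LD_PRELOAD, in the spirit of GLuRay but not its actual source; a real interceptor would capture geometry and state for the ray tracer rather than just logging.

```cpp
// Minimal OpenGL call interception via LD_PRELOAD (illustrative only).
// Build:  g++ -shared -fPIC -o libintercept.so intercept.cpp -ldl
// Run:    LD_PRELOAD=./libintercept.so your_visualization_app
#include <GL/gl.h>
#include <dlfcn.h>
#include <cstdio>

extern "C" void glDrawElements(GLenum mode, GLsizei count,
                               GLenum type, const void* indices) {
    // A real interceptor would hand the geometry to the ray tracer here;
    // this sketch just logs the call and forwards it to the real driver.
    std::fprintf(stderr, "intercepted glDrawElements(count=%d)\n", (int)count);

    using Fn = void (*)(GLenum, GLsizei, GLenum, const void*);
    static Fn real = (Fn)dlsym(RTLD_NEXT, "glDrawElements");
    if (real) real(mode, count, type, indices);
}
```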
Item: HyperFlow: A Heterogeneous Dataflow Architecture (The Eurographics Association, 2012)
Authors: Vo, Huy T.; Osmari, Daniel K.; Comba, João; Lindstrom, Peter; Silva, Cláudio T.
Editors: Hank Childs, Torsten Kuhlen, and Fabio Marton
Abstract: We propose a dataflow architecture, called HyperFlow, whose supporting infrastructure creates an abstraction layer over computation resources and naturally exposes heterogeneous computation to dataflow processing. To test our system and demonstrate its efficiency, we evaluate it on a set of synthetic and real-world applications. First, we designed a general suite of micro-benchmarks that captures the main parallel pipeline structures and allows evaluation of HyperFlow under different stress conditions. We then demonstrate the potential of our system with relevant applications in visualization. Implementations in HyperFlow are shown to outperform hand-tuned code while still providing high scalability on different platforms.

Item: Light Propagation Maps on Parallel Graphics Architectures (The Eurographics Association, 2012)
Authors: Gruson, Adrien; Patil, Ajit Hakke; Cozot, Remi; Bouatouch, Kadi; Pattanaik, Sumanta
Editors: Hank Childs, Torsten Kuhlen, and Fabio Marton
Abstract: Light passing through a participating medium such as smoke can be scattered or absorbed at every point in the medium. To accurately render such a medium we must compute the radiance resulting from these physical effects at every point inside it, as modeled by the radiative transfer equation. Computing the radiance at any point inside a participating medium amounts to numerically solving this radiative transfer equation. The Discrete Ordinate Method (DOM) is a widely used solution method, but it is computationally intensive. Fattal [Fat09] proposed Light Propagation Maps (LPM) to expedite the DOM computation. In this paper we propose a streaming-based parallelization of LPM that runs on SIMD graphics hardware. Our method is fast and scalable: we report more than a 20x speed improvement over Fattal's original method. Using our approach we are able to render 64x64x64 dynamic volumes with multiple scattering of light at interactive speed under complex lighting, and are able to render volumes of any size independent of GPU memory capacity.
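For reference, the radiative transfer equation the abstract refers to can be written in its standard steady-state form (notation ours, not taken from the paper):

```latex
% Steady-state radiative transfer equation: directional change of radiance L
% at point x in direction omega, with extinction sigma_t = sigma_a + sigma_s,
% scattering coefficient sigma_s, phase function p, and emission Q.
(\omega \cdot \nabla)\, L(\mathbf{x}, \omega) =
    -\,\sigma_t(\mathbf{x})\, L(\mathbf{x}, \omega)
    + \sigma_s(\mathbf{x}) \int_{S^2} p(\omega, \omega')\,
      L(\mathbf{x}, \omega')\, \mathrm{d}\omega'
    + Q(\mathbf{x}, \omega)
```

DOM discretizes the directional integral over a fixed set of ordinates and solves the resulting coupled equations, which is what makes it computationally intensive and what LPM (and this parallelization of it) accelerates.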
Item: Load-Balanced Multi-GPU Ambient Occlusion for Direct Volume Rendering (The Eurographics Association, 2012)
Authors: Ancel, Alexandre; Dischler, Jean-Michel; Mongenet, Catherine
Editors: Hank Childs, Torsten Kuhlen, and Fabio Marton
Abstract: Ambient occlusion techniques were introduced to improve data comprehension by bringing soft fading shadows to the visualization of 3D datasets. They attenuate light by considering the occlusion resulting from the presence of neighboring structures. Nevertheless, they often come with an important precomputation cost, which prevents their use in interactive applications based on transfer function editing. This paper explores parallel solutions for reaching interactive frame rates with a multi-GPU setup. Our method distributes the data to the different devices for computation. We use bricking and load balancing to optimize computation time. We also introduce two repartition schemes: a static one, which divides the dataset into as many blocks as there are GPUs, and a dynamic one, which divides the dataset into smaller blocks and distributes them in a producer-consumer fashion. Results on an 8-GPU architecture show significant speedups compared to a single-GPU setup.

Item: Multi-GPU Image-based Visual Hull Rendering (The Eurographics Association, 2012)
Authors: Hauswiesner, Stefan; Khlebnikov, Rostislav; Steinberger, Markus; Straka, Matthias; Reitmayr, Gerhard
Editors: Hank Childs, Torsten Kuhlen, and Fabio Marton
Abstract: Many virtual mirror and telepresence applications require novel viewpoint synthesis with little latency to user motion. Image-based visual hull (IBVH) rendering is capable of rendering arbitrary views from segmented images without an explicit intermediate data representation, such as a mesh or a voxel grid. By computing depth images directly from the silhouette images, it usually outperforms indirect methods. GPU-hardware-accelerated implementations exist, but due to the lack of an intermediate representation no multi-GPU parallel strategies and implementations are currently available. This paper suggests three ways to parallelize the IBVH pipeline and maps them to the sorting classification that is often applied to conventional parallel rendering systems. In addition to sort-first parallelization, we suggest a novel sort-last formulation that regards cameras as scene objects. We enhance this method's performance with a block-based encoding of the rendering results. For interactive systems with hard real-time constraints, we combine the algorithm with a multi-frame rate (MFR) system. We suggest a combination of forward and backward image warping to improve the visual quality of the MFR rendering. We observed the runtime behavior of the suggested methods and assessed how their performance scales with respect to input and output resolutions and the number of GPUs. By using additional GPUs, we reduced rendering times by up to 60%. Multi-frame rate viewing can even be ten times faster.

Item: Parallel Rendering on Hybrid Multi-GPU Clusters (The Eurographics Association, 2012)
Authors: Eilemann, Stefan; Bilgili, Ahmet; Abdellah, Marwan; Hernando, Juan; Makhinya, Maxim; Pajarola, Renato; Schürmann, Felix
Editors: Hank Childs, Torsten Kuhlen, and Fabio Marton
Abstract: Achieving efficient scalable parallel rendering for interactive visualization applications on medium-sized graphics clusters remains a challenging problem. Frame rates of up to 60 Hz require a carefully designed and fine-tuned parallel rendering implementation that fits all required operations into the 16 ms time budget available for each rendered frame. Furthermore, modern commodity hardware increasingly embraces a NUMA architecture, where multiple processor sockets each have their locally attached memory and where auxiliary devices such as GPUs and network interfaces are directly attached to one of the processors. Such so-called fat NUMA processing and graphics nodes are increasingly used to build cost-effective hybrid shared/distributed-memory visualization clusters. In this paper we present a thorough analysis of the asynchronous parallelization of the rendering stages, and we derive and implement important optimizations to achieve highly interactive frame rates on such hybrid multi-GPU clusters. We use both a benchmark program and a real-world scientific application for visualizing, navigating, and interacting with simulations of cortical neuron circuit models.

Item: PISTON: A Portable Cross-Platform Framework for Data-Parallel Visualization Operators (The Eurographics Association, 2012)
Authors: Lo, Li-ta; Sewell, Christopher; Ahrens, James
Editors: Hank Childs, Torsten Kuhlen, and Fabio Marton
Abstract: Due to the wide variety of current and next-generation supercomputing architectures, the development of high-performance parallel visualization and analysis operators frequently requires rewriting the underlying algorithms for many different platforms. To facilitate portability, we have devised a framework for creating such operators that employs the data-parallel programming model. By writing the operators using only data-parallel primitives (such as scans, transforms, stream compactions, etc.), the same code may be compiled to multiple targets using architecture-specific backend implementations of these primitives. Specifically, we make use of and extend NVIDIA's Thrust library, which provides CUDA and OpenMP backends. Using this framework, we have implemented isosurface, cut-surface, and threshold operators, and have achieved good parallel performance on two different architectures (multi-core CPUs and NVIDIA GPUs) using the exact same operator code. We have applied these operators to several large, real scientific data sets, and have released a beta version of our code base as open source.
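The data-parallel style PISTON advocates is easy to illustrate with Thrust itself. The sketch below builds the core of a threshold operator from primitives only (a sequence plus a stream compaction); it is our own minimal example, not PISTON code, but the same source compiles against Thrust's CUDA or OpenMP backends.

```cpp
// Threshold built purely from data-parallel primitives: generate cell
// indices, then stream-compact those whose value lies in [lo, hi].
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/sequence.h>

struct InRange {
    float lo, hi;
    __host__ __device__ bool operator()(float v) const {
        return v >= lo && v <= hi;
    }
};

// Return the indices of all cells whose value passes the threshold.
thrust::device_vector<int> threshold_indices(
        const thrust::device_vector<float>& values, float lo, float hi) {
    thrust::device_vector<int> ids(values.size());
    thrust::sequence(ids.begin(), ids.end());            // 0, 1, 2, ...
    thrust::device_vector<int> kept(values.size());
    auto end = thrust::copy_if(ids.begin(), ids.end(),   // input indices
                               values.begin(),           // stencil: cell values
                               kept.begin(), InRange{lo, hi});
    kept.resize(end - kept.begin());                     // stream compaction
    return kept;
}
```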
Item: Polygonization of Implicit Surfaces on Multi-Core Architectures with SIMD Instructions (The Eurographics Association, 2012)
Authors: Shirazian, Pourya; Wyvill, Brian; Duprat, Jean-Luc
Editors: Hank Childs, Torsten Kuhlen, and Fabio Marton
Abstract: In this research we tackle the problem of rendering complex models created from implicit primitives, blending operators, affine transformations, and constructive solid geometry in a design environment that organizes all of these in a scene-graph data structure called the BlobTree. We propose a fast, scalable, parallel polygonization algorithm for BlobTrees that takes advantage of multi-core processors and the SIMD optimization techniques available on modern architectures. Efficiency is achieved through the use of spatial data structures and SIMD optimizations for BlobTree traversal and for the computation of mesh vertices and other attributes. Our solution delivers interactive visualization for modeling systems based on the BlobTree scene graph.

Item: Shift-Based Parallel Image Compositing on InfiniBand™ Fat-Trees (The Eurographics Association, 2012)
Authors: Cavin, Xavier; Demengeon, Olivier
Editors: Hank Childs, Torsten Kuhlen, and Fabio Marton
Abstract: Parallel image compositing has been widely studied over the past 20 years, as it is one of the most crucial elements in the implementation of a scalable parallel rendering system. Many algorithms have been proposed and implemented on a large variety of supercomputers. Among existing supercomputers, InfiniBand™ (IB) PC clusters, with their associated fat-tree topology, are clearly becoming the dominant architecture, as they provide the scalability, high bandwidth, and low latency required by the most demanding parallel applications. Surprisingly, very few efforts have been devoted to the implementation and performance evaluation of parallel image compositing algorithms on this kind of architecture. We propose in this paper a new parallel image compositing algorithm, called Shift-Based, relying on a well-known communication pattern called shift permutation. Indeed, shift permutation is one of the possible ways to obtain the maximum cross-bisectional bandwidth provided by an IB fat-tree cluster. We show that our Shift-Based algorithm scales to any number of processing nodes (with peak performance on specific counts), allows overlapping communications with computations, and exhibits contention-free network communications. This is demonstrated by compositing very high resolution images at interactive frame rates.
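While the paper's full algorithm is more involved, the shift permutation it builds on is simple to sketch with MPI: at every step each rank sends to a fixed-offset neighbor, a pattern that can route through a fat tree without contention. The ring schedule and blend below are illustrative assumptions showing the communication pattern only, not the paper's actual compositing schedule.

```cpp
// Sketch of a shift-permutation compositing schedule: each rank repeatedly
// passes its accumulated RGBA tile one position around the ring and blends
// the tile it receives. Every step is a shift permutation i -> i+1 (mod n).
#include <mpi.h>
#include <vector>

// Premultiplied-alpha "over" blend for an RGBA float tile (illustrative).
void blend_over(std::vector<float>& dst, const std::vector<float>& src) {
    for (std::size_t p = 0; p + 3 < dst.size(); p += 4) {
        const float t = 1.0f - dst[p + 3];          // 1 - dst alpha
        for (int c = 0; c < 4; ++c) dst[p + c] += t * src[p + c];
    }
}

void shift_composite(std::vector<float>& tile, MPI_Comm comm) {
    int rank, n;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &n);
    const int to   = (rank + 1) % n;                // shift permutation
    const int from = (rank + n - 1) % n;
    std::vector<float> incoming(tile.size());
    for (int step = 0; step < n - 1; ++step) {
        // All ranks send and receive simultaneously; on a fat tree this
        // permutation can use the full cross-bisectional bandwidth.
        MPI_Sendrecv(tile.data(), (int)tile.size(), MPI_FLOAT, to, 0,
                     incoming.data(), (int)incoming.size(), MPI_FLOAT, from, 0,
                     comm, MPI_STATUS_IGNORE);
        blend_over(tile, incoming);
    }
}
```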
Item: A Study of Ray Tracing Large-scale Scientific Data in Two Widely Used Parallel Visualization Applications (The Eurographics Association, 2012)
Authors: Brownlee, Carson; Patchett, John; Lo, Li-Ta; DeMarle, David; Mitchell, Christopher; Ahrens, James; Hansen, Charles D.
Editors: Hank Childs, Torsten Kuhlen, and Fabio Marton
Abstract: Large-scale analysis and visualization is becoming increasingly important as supercomputers and their simulations produce larger and larger data. These large data sizes are pushing the limits of traditional rendering algorithms and tools, motivating a study that explores these limits and their possible resolutions through alternative rendering algorithms. To better understand real-world performance with large data, this paper presents a detailed timing study on a large cluster with the widely used visualization tools ParaView and VisIt. The software ray tracer Manta was integrated into these programs to show that improved performance can be attained with software ray tracing on a distributed-memory, GPU-enabled parallel visualization resource. Using the Texas Advanced Computing Center's Longhorn cluster, which has multi-core CPUs and GPUs, with large-scale polygonal data, we find multi-core CPU ray tracing to be significantly faster than both software rasterization and hardware-accelerated rasterization in existing scientific visualization tools.

Item: Time-constrained Animation Rendering on Desktop Grids (The Eurographics Association, 2012)
Authors: Aggarwal, Vibhor; Debattista, Kurt; Bashford-Rogers, Thomas; Chalmers, Alan
Editors: Hank Childs, Torsten Kuhlen, and Fabio Marton
Abstract: The computationally intensive nature of high-fidelity rendering has led to a dependence on parallel infrastructures for generating animations. However, such infrastructure is expensive, restricting easy access to high-fidelity animations to organisations that can afford such resources. A desktop grid formed by aggregating idle resources in an institution is an inexpensive alternative, but it is inherently unreliable due to the non-dedicated nature of the architecture. A naive approach to employing desktop grids for rendering animations could lead to inconsistencies in the quality of the rendered animation as the available computational performance fluctuates. Hence, fault-tolerant algorithms are required to utilise a desktop grid efficiently. This paper presents a novel fault-tolerant rendering algorithm for generating high-fidelity animations within a user-defined time constraint. Time-constrained computation provides an elegant way of harnessing desktop grids, as otherwise the makespan cannot be guaranteed. The algorithm uses multi-dimensional quasi-random sampling for load balancing, aimed at achieving the best visual quality across the whole animation even in the presence of faults. The results show that the presented algorithm is largely insensitive to temporal variations in the computational power of a desktop grid, making it suitable for deadline-driven production environments.
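The quasi-random load-balancing idea can be pictured with a toy scheduler. The sketch below orders per-frame rendering passes by a base-2 radical inverse so that any prefix of the schedule covers the animation's frames nearly uniformly; when the time budget expires, quality degrades evenly rather than leaving later frames unrendered. The sequence choice and structure are our illustrative assumptions, not the paper's exact sampler.

```cpp
// Toy time-constrained scheduler: issue rendering passes in low-discrepancy
// order across frames, so any prefix of the schedule is spread evenly.
#include <cstdio>
#include <vector>

// Van der Corput radical inverse in base b, in [0, 1).
double radical_inverse(unsigned i, unsigned b = 2) {
    double f = 1.0, r = 0.0;
    while (i > 0) { f /= b; r += f * (i % b); i /= b; }
    return r;
}

int main() {
    const unsigned frames = 8, passes = 4;
    std::vector<unsigned> done(frames, 0);
    // Stopping this loop early (a missed deadline) still leaves the
    // completed passes distributed nearly uniformly over all frames.
    for (unsigned t = 0; t < frames * passes; ++t) {
        unsigned frame = (unsigned)(radical_inverse(t) * frames);
        ++done[frame];                  // render one more pass of 'frame'
    }
    for (unsigned f = 0; f < frames; ++f)
        std::printf("frame %u: %u passes\n", f, done[f]);
}
```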