EGPGV10: Eurographics Symposium on Parallel Graphics and Visualization

PaTraCo: A Framework Enabling the Transparent and Efficient Programming of Heterogeneous Compute Networks
(The Eurographics Association, 2010) Frey, Steffen; Ertl, Thomas. Editors: James Ahrens, Kurt Debattista, and Renato Pajarola.
We propose PaTraCo (Parallel Transparent Computation), a framework for developing parallel applications for single-host or ad-hoc compute network environments incorporating a multitude of different kinds of compute devices, including graphics cards. It supports both task parallelism and data parallelism, and is designed for algorithms that can be decomposed into passes. The provided API supports the user in structuring the program accordingly. Only application-specific parts need to be implemented, using a set of base classes. Multiple compute kernel implementations can be provided per pass, one for each device class (e.g. CPU, GPU, CELL). The scheduler, which is based on the critical path method, determines prior to the actual computation which implementation to execute on which device in order to minimize the overall runtime, considering device speed, availability, and transfer cost. This procedure has the additional advantage that data can be transferred to a compute device before the actual need for it arises, so network transfers can often be executed in parallel with computation. Overall, this results in reduced device idling times (if any) and more efficient device utilization. Thread setup and communication, network data transfers, and scheduling are handled transparently to the user. PaTraCo monitors the execution in order to update the cost estimates that are used by the scheduler and to provide the user with visual analysis.
We evaluate the framework by means of an interactive distributed volume renderer.

Load-Balanced Isosurfacing on Multi-GPU Clusters
(The Eurographics Association, 2010) Martin, Steven; Shen, Han-Wei; McCormick, Patrick. Editors: James Ahrens, Kurt Debattista, and Renato Pajarola.
Isosurface extraction is a common technique applied in scientific visualization. Increasing sizes of volumes over which isosurfacing is to be applied, combined with increasingly hierarchical parallel architectures, present challenges for efficiently distributing isosurfacing workloads. We propose a technique that, with a modest amount of preprocessing, efficiently distributes isosurfacing load to GPU compute resources within a cluster. Load uniformity is maximized over a set of user-defined isovalues, enabling improved scalability over naive, non-data-centric work distribution approaches.

Towards a Software Transactional Memory for Graphics Processors
(The Eurographics Association, 2010) Cederman, Daniel; Tsigas, Philippas; Chaudhry, Muhammad Tayyab. Editors: James Ahrens, Kurt Debattista, and Renato Pajarola.
The introduction of general purpose computing on many-core graphics processor systems, and the general shift in the industry towards parallelism, has created a demand for ease of parallelization. Software transactional memory (STM) simplifies development of concurrent code by allowing the programmer to mark sections of code to be executed concurrently and atomically in an optimistic manner. In contrast to locks, STMs are easy to compose and do not suffer from deadlocks. We have designed and implemented two STMs for graphics processors, one blocking and one non-blocking.
The design issues involved in these two STMs are described and explained in the paper, together with experimental results comparing their performance.

Cross-Node Occlusion in Sort-Last Volume Rendering
(The Eurographics Association, 2010) Marchesin, Stéphane; Ma, Kwan-Liu. Editors: James Ahrens, Kurt Debattista, and Renato Pajarola.
In the field of parallel volume rendering, occlusion is a concept which is already widely exploited in order to improve performance. However, when one moves to larger datasets the use of parallelism becomes a necessity, and in that context, exploiting occlusion to speed up volume rendering is not straightforward. In this paper, we propose and detail a new scheme in which the processors exchange occlusion information so as to speed up the rendering by discarding invisible areas. Our pipeline uses full floating point accuracy for all the intermediate stages, allowing the production of high quality pictures. We further show comprehensive performance results using this pipeline with multiple datasets and demonstrate that cross-processor occlusion can improve the performance of parallel volume rendering.

Streamed Ray Tracing of Single Rays on the Cell Processor
(The Eurographics Association, 2010) Bingel, Florian; Hinkenjann, Andre. Editors: James Ahrens, Kurt Debattista, and Renato Pajarola.
In this paper we present an approach to efficiently trace single rays on the Cell Processor, instead of using ray packets. To benefit from the performance of this processor, a data structure is chosen which allows traversal without excessive accesses to main memory. Together with careful optimization for SIMD processing, a performance comparable to a packet-based ray tracer running on the same hardware is achieved.
In special cases, when the coherency of the traced rays gets very low, it even outperforms the packet-based approach.

Asynchronous Parallel Reliefboard Computation for Scene Object Approximation
(The Eurographics Association, 2010) Süß, Tim; Jähn, Claudius; Fischer, Matthias. Editors: James Ahrens, Kurt Debattista, and Renato Pajarola.
We present a parallel algorithm for the rendering of complex three-dimensional scenes. The algorithm runs across heterogeneous architectures of PC clusters consisting of a visualization node, equipped with a powerful graphics adapter, and cluster nodes requiring only weaker graphics capabilities. The visualization node renders a mixture of scene objects and simplified meshes (Reliefboards). The cluster nodes assist the visualization node by asynchronously computing Reliefboards, which are used to replace and render distant parts of the scene. Our algorithm is capable of gaining significant speedups even if the cluster's nodes provide only weak graphics adapters. We trade off the number of cluster nodes against the scene objects' image quality.

Fast Compositing for Cluster-Parallel Rendering
(The Eurographics Association, 2010) Makhinya, Maxim; Eilemann, Stefan; Pajarola, Renato. Editors: James Ahrens, Kurt Debattista, and Renato Pajarola.
The image compositing stages in cluster-parallel rendering, which gather and combine partial rendering results into a final display frame, are fundamentally limited by node-to-node image throughput. Therefore, efficient image coding, compression, and transmission must be considered to minimize that bottleneck. This paper studies the different performance-limiting factors such as image representation, region-of-interest detection, and fast image compression.
Additionally, we show improved compositing performance using lossy YUV subsampling, and we propose a novel fast region-of-interest detection algorithm that can improve sort-last parallel rendering in particular.

Scalable Parallel Out-of-core Terrain Rendering
(The Eurographics Association, 2010) Goswami, Prashant; Makhinya, Maxim; Bösch, Jonas; Pajarola, Renato. Editors: James Ahrens, Kurt Debattista, and Renato Pajarola.
In this paper, we introduce a novel out-of-core parallel and scalable technique for rendering massive terrain datasets. The parallel rendering task decomposition is implemented on top of an existing terrain renderer using an open source framework for cluster-parallel rendering. Our approach achieves parallel rendering by dividing the rendering task in either a sort-last (database) or sort-first (screen domain) manner, and presents an optimal method for implicit load balancing in the former mode. The efficiency of our approach is validated using massive elevation models.

Cache-Efficient Parallel Isosurface Extraction for Shared Cache Multicores
(The Eurographics Association, 2010) Tchiboukdjian, Marc; Danjean, Vincent; Raffin, Bruno. Editors: James Ahrens, Kurt Debattista, and Renato Pajarola.
This paper proposes to revisit isosurface extraction algorithms, taking into consideration two specific aspects of recent multicore architectures: their intrinsic parallelism associated with the presence of multiple computing cores, and their cache hierarchy, which often includes private caches as well as caches shared between all cores. Taking advantage of these shared caches requires adapting the parallelization scheme to make the cores collaborate on cache usage rather than compete for it, which can impair performance. We propose to have cores working on independent but close data sets that can all fit in the shared cache.
We propose two shared-cache-aware parallel isosurface algorithms, one based on marching tetrahedra and one using a min-max tree as acceleration data structure. We theoretically prove that in both cases the number of cache misses is the same as for the sequential algorithm for the same cache size. The algorithms are based on the FastCOL cache-oblivious data layout for irregular meshes. The CO layout also makes it possible to build a very compact min-max tree that leads to a reduced number of cache misses. Experiments confirm the interest of these shared-cache-aware isosurface algorithms, with the performance gain increasing as the ratio of shared cache size to core count decreases.

Accelerating and Benchmarking Radix-k Image Compositing at Large Scale
(The Eurographics Association, 2010) Kendall, Wesley; Peterka, Tom; Huang, Jian; Shen, Han-Wei; Ross, Robert. Editors: James Ahrens, Kurt Debattista, and Renato Pajarola.
Radix-k was introduced in 2009 as a configurable image compositing algorithm. The ability to tune it by selecting k-values allows it to benefit more from pixel reduction and compression optimizations than its predecessors. This paper describes such optimizations in Radix-k, analyzes their effects, and demonstrates improved performance and scalability. In addition to bounding and run-length encoding pixels, k-value selection and load balance are regulated at run-time. Performance is systematically analyzed for an array of process counts, image sizes, and HPC and graphics clusters. Analyses are performed using compositing of synthetic images and also in the context of a complete volume renderer and scientific data. We demonstrate increased performance over binary swap and show that 64 megapixels can be composited in 0.08 seconds, or 12.5 frames per second, at 32K processes.

Ray Tracing Dynamic Scenes with Shadows on the GPU
(The Eurographics Association, 2010) Guntury, Sashidhar; Narayanan, P. J. Editors: James Ahrens, Kurt Debattista, and Renato Pajarola.
In this paper, we present fast ray tracing of dynamic scenes with primary and shadow rays. We present a GPU-friendly strategy to bring coherency to shadow rays, based on previous work on grids as acceleration structures. We introduce indirect mapping of threads to rays to improve the performance of ray tracing on the GPU for the traversal and intersection steps. We also construct a light frustum in a spherical space for shadow rays. A grid structure is constructed each frame for the light frustum and traversed coherently. This involves careful mapping of the primary ray information to the light space and balancing the workload of the threads. Using the fine-grained parallelism of the GPU, we reorder the shadow rays to make them coherent and assign multiple thread blocks to each cell to balance the workload. Spherical mapping is key to handling light sources placed anywhere in the scene, by reducing the triangle count and improving performance in shadow checking. In addition, it allows us to introduce spotlights in ray tracing. In practice, we attain interactive performance for moderately large models which change dynamically in the scene.

MPI-hybrid Parallelism for Volume Rendering on Large, Multi-core Systems
(The Eurographics Association, 2010) Howison, Mark; Bethel, E. Wes; Childs, Hank. Editors: James Ahrens, Kurt Debattista, and Renato Pajarola.
This work studies the performance and scalability characteristics of hybrid parallel programming and execution as applied to raycasting volume rendering, a staple visualization algorithm, on a large, multi-core platform. Historically, the Message Passing Interface (MPI) has become the de-facto standard for parallel programming and execution on modern parallel systems.
As the computing industry trends towards multi-core processors, with four- and six-core chips common today and 128-core chips coming soon, we wish to better understand how algorithmic and parallel programming choices impact performance and scalability on large, distributed-memory multi-core systems. Our findings indicate that the hybrid-parallel implementation, at levels of concurrency ranging from 1,728 to 216,000, performs better, uses a smaller absolute memory footprint, and consumes less communication bandwidth than the traditional, MPI-only implementation.

Self-Scheduled Parallel Isosurfacing using Distributed Span Space on Cell
(The Eurographics Association, 2010) Caruso, Michael R.; Newman, Timothy S. Editors: James Ahrens, Kurt Debattista, and Renato Pajarola.
A method designed for fast isosurfacing on Cell platforms is introduced. It makes good use of limited amounts of local memory by exploiting a block-based span space. Exploitation goes beyond the usual steps of avoiding span space tiles whose range does not contain the isovalue. In particular, the method keeps resident in local memories most span space information, in addition to the parts of the volume most likely to be examined if multiple isovalues are explored. The method also performs distributed self-scheduling of isosurfacing work among the Cell's Synergistic Processing Units (SPUs) without explicit centralized computation of workload or assignment of work. Results are also presented for trials on the Playstation-3, including comparison to another fast, parallel isosurfacing method (which is faster than prior reported parallel methods on Cell).

Multi-Frame Rate Volume Rendering
(The Eurographics Association, 2010) Hauswiesner, Stefan; Kalkofen, Denis; Schmalstieg, Dieter. Editors: James Ahrens, Kurt Debattista, and Renato Pajarola.
This paper presents multi-frame rate volume rendering, an asynchronous approach to parallel volume rendering.
The workload is distributed over multiple GPUs in such a way that the main display device can provide high frame rates and low latency to user input, while one or multiple backend GPUs asynchronously provide new views. The latency artifacts inherent to such a solution are minimized by forward image warping. Volume rendering, especially in medical applications, often involves the visualization of transparent objects. Former multi-frame rate rendering systems addressed this poorly, because an intermediate representation consisting of a single surface lacks the ability to preserve motion parallax. The combination of volume raycasting with feature peeling yields an image-based representation that is simultaneously suitable for high quality reconstruction and for fast rendering of transparent datasets. Moreover, novel methods for trading excess speed for visual quality are introduced, and strategies for balancing quality versus speed during runtime are described. A performance evaluation section provides details on possible application scenarios.

Parallel View-Dependent Refinement of Compact Progressive Meshes
(The Eurographics Association, 2010) Derzapf, Evgenij; Menzel, Nicolas; Guthe, Michael. Editors: James Ahrens, Kurt Debattista, and Renato Pajarola.
The complexity of polygonal models still grows faster than the ability of the graphics hardware to render them in real time. A common way to deal with such models is to use multiple levels of detail (LODs). These can be static, with the advantage that the simplification can be performed without regard to real-time constraints and the rendering algorithm simply chooses which LODs to render at runtime. Static LODs, however, suffer from sudden mesh transitions (popping artifacts) when the levels are too different. Dynamic or view-dependent LODs solve this problem by allowing for a continuous and smooth refinement.
Unfortunately, they become computationally too expensive when the number of vertices is high, because refinement operations have to be computed for every vertex. In this paper, we address this problem by introducing a compact data structure for progressive meshes optimized for parallel processing and low memory consumption on the GPU. We also present an efficient LOD adaption algorithm resulting in an adaption time almost equal to the rendering time of the adapted mesh.