EGPGV11: Eurographics Symposium on Parallel Graphics and Visualization
Browsing EGPGV11: Eurographics Symposium on Parallel Graphics and Visualization by Title
Now showing 1 - 14 of 14
Item Cross-Segment Load Balancing in Parallel Rendering (The Eurographics Association, 2011)
Erol, Fatih; Eilemann, Stefan; Pajarola, Renato; Torsten Kuhlen and Renato Pajarola and Kun Zhou
With faster graphics hardware comes the possibility to realize even more complicated applications that require more detailed data and provide better presentation. The processors keep being challenged with larger amounts of data and higher resolution outputs, requiring more research in the parallel/distributed rendering domain. Optimizing resource usage to improve throughput is one important topic, which we address in this article for multi-display applications, using the Equalizer parallel rendering framework. This paper introduces and analyzes cross-segment load balancing, which efficiently assigns all available shared graphics resources to all display output segments with dynamic task partitioning to improve performance in parallel rendering.

Item Data-Parallel Mesh Connected Components Labeling and Analysis (The Eurographics Association, 2011)
Harrison, Cyrus; Childs, Hank; Gaither, Kelly P.; Torsten Kuhlen and Renato Pajarola and Kun Zhou
We present a data-parallel algorithm for identifying and labeling the connected sub-meshes within a domain-decomposed 3D mesh. The identification task is challenging in a distributed-memory parallel setting because connectivity is transitive and the cells composing each sub-mesh may span many or all processors. Our algorithm employs a multi-stage application of the Union-find algorithm and a spatial partitioning scheme to efficiently merge information across processors and produce a global labeling of connected sub-meshes. Marking each vertex with its corresponding sub-mesh label allows us to isolate mesh features based on topology, enabling new analysis capabilities. We briefly discuss two specific applications of the algorithm and present results from a weak scaling study.
We demonstrate the algorithm at concurrency levels up to 2197 cores and analyze meshes containing up to 68 billion cells.

Item Distributed OpenGL Rendering in Network Bandwidth Constrained Environments (The Eurographics Association, 2011)
Neal, Braden; Hunkin, Paul; McGregor, Antony; Torsten Kuhlen and Renato Pajarola and Kun Zhou
Display walls made from multiple monitors are often used when very high resolution images are required. To utilise a display wall, rendering information must be sent to each computer that the monitors are connected to. The network is often the performance bottleneck for demanding applications, like high performance 3D animations. This paper introduces ClusterGL, a distribution library for OpenGL applications. ClusterGL reduces network traffic by using compression, frame differencing and multi-cast. Existing applications can use ClusterGL without recompilation. Benchmarks show that, for most applications, ClusterGL outperforms other systems that support unmodified OpenGL applications, including Chromium and BroadcastGL. The difference is larger for more complex scene geometries and when there are more display machines. For example, when rendering OpenArena, ClusterGL outperforms Chromium by over 300% on the Symphony display wall at The University of Waikato, New Zealand. This display has 20 monitors supported by five computers connected by gigabit Ethernet, with a full resolution of over 35 megapixels. ClusterGL is freely available via Google Code.

Item Efficient I/O for Parallel Visualization (The Eurographics Association, 2011)
Fogal, Thomas; Krüger, Jens; Torsten Kuhlen and Renato Pajarola and Kun Zhou
While additional cores and newer architectures, such as those provided by GPU clusters, steadily increase available compute power, memory and disk access have not kept pace, and most believe this trend will continue. It is therefore of critical importance that we design systems and algorithms which make effective use of off-processor storage.
This work details our experiences using parallel file systems, details performance using current systems and software, and suggests a new API which has greater potential for increased scalability.

Item GPU Algorithms for Diamond-based Multiresolution Terrain Processing (The Eurographics Association, 2011)
Yalçin, M. Adil; Weiss, Kenneth; Floriani, Leila De; Torsten Kuhlen and Renato Pajarola and Kun Zhou
We present parallel algorithms for processing, extracting and rendering adaptively sampled regular terrain datasets represented as a multiresolution model defined by a super-square-based diamond hierarchy. This model represents a terrain as a nested triangle mesh generated through a series of longest edge bisections and encoded in an implicit hierarchical structure, which clusters triangles into diamonds and diamonds into super-squares. We decompose the problem into three parallel algorithms for performing: generation of the diamond hierarchy from a regularly distributed terrain dataset, selective refinement on the diamond hierarchy and generation of the corresponding crack-free triangle mesh for processing and rendering. We avoid the data transfer bottleneck common to previous approaches by processing all data entirely on the GPU. We demonstrate that this parallel approach can be successfully applied to interactive terrain visualization with a high tessellation quality on commodity GPUs.

Item Interactive Particle Tracing in Time-Varying Tetrahedral Grids (The Eurographics Association, 2011)
Bußler, Michael; Rick, Tobias; Kelle-Emden, Andreas; Hentschel, Bernd; Kuhlen, Torsten; Torsten Kuhlen and Renato Pajarola and Kun Zhou
Particle tracing methods are a fundamental class of techniques for vector field visualization. Specifically, interactive particle advection allows the user to rapidly gain an intuitive understanding of flow structures. Yet, it poses challenges in terms of computational cost and memory bandwidth.
This is particularly true if the underlying data is time-dependent and represented by a series of unstructured meshes. In this paper, we propose a novel approach which maps the aforementioned computations to modern many-core compute devices in order to achieve parallel, interactive particle advection. The problem of cell location on unstructured tetrahedral meshes is addressed by a two-phase search scheme which is performed entirely on the compute device. In order to cope with limited device memory, the use of data reduction techniques is proposed. A CUDA implementation of the proposed algorithm is evaluated on the basis of one synthetic and two real-world data sets. This particularly includes an assessment of the effects of data reduction on the advection process' accuracy and its performance.

Item Load Balancing Utilizing Data Redundancy in Distributed Volume Rendering (The Eurographics Association, 2011)
Frey, Steffen; Ertl, Thomas; Torsten Kuhlen and Renato Pajarola and Kun Zhou
In interactive volume rendering, the cost for rendering a certain block of the volume strongly varies with dynamically changing parameters (most notably the camera position and orientation). In distributed environments wherein each compute device renders one block, this potentially causes severe load imbalance. Balancing the load usually induces costly data transfers, causing critical rendering delays. In cases in which the sum of memory of all devices substantially exceeds the size of the data set, transfers can be reduced by storing data redundantly. We propose to partition the volume into many equally sized bricks and redundantly save them on different compute devices with the goal of being able to achieve evenly balanced load without any data transfers. The bricks assigned to a device are widely scattered throughout the volume.
This minimizes the dependency on the view parameters, as the distribution of relatively cheap and expensive bricks stays roughly the same for most camera configurations. This again enables our fast and simple scheduler to evenly balance the load in almost any situation. In scenarios in which only very few bricks constitute the majority of the overall cost, a brick can also be partitioned further and rendered by multiple devices.

Item Optimal Multi-Image Processing Streaming Framework on Parallel Heterogeneous Systems (The Eurographics Association, 2011)
Ha, Linh K.; Krüger, Jens; Comba, Joao; Joshi, Sarang; Silva, Cláudio T.; Torsten Kuhlen and Renato Pajarola and Kun Zhou
Atlas construction is an important technique in medical image analysis that plays a central role in understanding the variability of brain anatomy. The construction often requires applying image processing operations to multiple images (often hundreds of volumetric datasets), which is challenging in computational power as well as memory requirements. In this paper we introduce MIP, a Multi-Image Processing streaming framework to harness the processing power of heterogeneous CPU/GPU systems. In MIP we introduce specially designed streaming algorithms and data structures that provide an optimal solution for out-of-core multi-image processing problems both in terms of memory usage and computational efficiency. MIP makes use of the asynchronous execution mechanism supported by parallel heterogeneous systems to efficiently hide the inherent latency of the processing pipeline of out-of-core approaches. Consequently, with computationally intensive problems, the MIP out-of-core solution could achieve the same performance as the in-core solution.
We demonstrate the efficiency of the MIP framework on synthetic and real datasets.

Item Parallel Computational Steering and Analysis for HPC Applications using a ParaView Interface and the HDF5 DSM Virtual File Driver (The Eurographics Association, 2011)
Biddiscombe, John; Soumagne, Jerome; Oger, Guillaume; Guibert, David; Piccinali, Jean-Guillaume; Torsten Kuhlen and Renato Pajarola and Kun Zhou
We present a framework for interfacing an arbitrary HPC simulation code with an interactive ParaView session using the HDF5 parallel IO library as the API. The implementation allows a flexible combination of parallel simulation, concurrent parallel analysis and GUI client, all of which may be on the same or separate machines. Data transfer between the simulation and the ParaView server takes place using a virtual file driver for HDF5 that bypasses the disk entirely and instead communicates directly between the coupled applications in parallel. The simulation and ParaView tasks run as separate MPI jobs and may therefore use different core counts and/or hardware configurations/platforms, making it possible to carefully tailor the amount of resources dedicated to each part of the workload. The coupled applications write and read datasets to the shared virtual HDF5 file layer, which allows the user to read data representing any aspect of the simulation and modify it using ParaView pipelines, then write it back, to be reread by the simulation (or vice versa). This allows not only simple parameter changes, but complete remeshing of grids, or operations involving regeneration of field values over the entire domain, to be carried out. To avoid the problem of manually customizing the GUI for each application that is to be steered, we make use of XML templates that describe outputs from the simulation, inputs back to it, and what user interactions are permitted on the controlled elements.
This XML is used to generate GUI and 3D controls for manipulation of the simulation without requiring explicit knowledge of the underlying model.

Item Parallel Gradient Domain Processing of Massive Images (The Eurographics Association, 2011)
Philip, Sujin; Summa, Brian; Bremer, Peer-Timo; Pascucci, Valerio; Torsten Kuhlen and Renato Pajarola and Kun Zhou
Gradient domain processing remains a particularly computationally expensive technique even for relatively small images. When images become massive in size, gigapixel or terapixel, these problems become particularly troublesome and the best serial techniques take on the order of hours or days to compute a solution. In this paper, we provide a simple framework for parallel gradient domain processing. Specifically, we provide a parallel out-of-core method for the seamless stitching of gigapixel panoramas in a parallel MPI environment. Unlike existing techniques, the framework provides a straightforward implementation, maintains strict control over the required/allocated resources, and makes no assumptions on the speed of convergence to an acceptable image. Furthermore, the approach shows good weak/strong scaling from several to hundreds of cores and runs on a variety of hardware.

Item Parallel In Situ Coupling of Simulation with a Fully Featured Visualization System (The Eurographics Association, 2011)
Whitlock, Brad; Favre, Jean M.; Meredith, Jeremy S.; Torsten Kuhlen and Renato Pajarola and Kun Zhou
There is a widening gap between compute performance and the ability to store computation results. Complex scientific codes are the most affected since they must save massive files containing meshes and fields for offline analysis. Time and storage costs instead dictate that data analysis and visualization be combined with the simulations themselves, being done in situ so data are transformed to a manageable size before they are stored.
Earlier approaches to in situ processing involved combining specific visualization algorithms into the simulation code, limiting flexibility. We introduce a new library which instead allows a fully-featured visualization tool, VisIt, to request data as needed from the simulation and apply visualization algorithms in situ with minimal modification to the application code.

Item A Preview and Exploratory Technique for Large-Scale Scientific Simulations (The Eurographics Association, 2011)
Tikhonova, Anna; Yu, Hongfeng; Correa, Carlos D.; Chen, Jacqueline H.; Ma, Kwan-Liu; Torsten Kuhlen and Renato Pajarola and Kun Zhou
Successful in-situ and remote visualization solutions must have minimal storage requirements and account for only a small percentage of supercomputing time. One solution that meets these requirements is to store a compact intermediate representation of the data, instead of a 3D volume itself. Recent work explores the use of attenuation functions as a data representation that summarizes the distribution of attenuation along the rays. This representation goes beyond conventional static images and allows users to dynamically explore their data, for example, to change color and opacity parameters, without accessing the original 3D data. The computation and storage costs of this method may still be prohibitively expensive for large and time-varying data sets, thus limiting its applicability in real-world scenarios. In this paper, we present an efficient algorithm for computing attenuation functions in parallel. We exploit the fact that the distribution of attenuation can be constructed recursively from a hierarchy of blocks or intervals of the data, which is a highly parallelizable process. We have developed a library of routines that can be used in a distance visualization scenario or can be called directly from a simulation code to generate explorable images in-situ.
Through a number of examples, we demonstrate the application of this work to large-scale scientific simulations in a real-world parallel environment with thousands of processors. We also explore various compression methods for reducing the size of the RAF. Finally, we present a method for computing an alternative RAF representation, which more closely encodes the actual distribution of samples along a ray, using kernel density estimation.

Item Real-Time Ray Tracer for Visualizing Massive Models on a Cluster (The Eurographics Association, 2011)
Ize, Thiago; Brownlee, Carson; Hansen, Charles D.; Torsten Kuhlen and Renato Pajarola and Kun Zhou
We present a state-of-the-art read-only distributed shared memory (DSM) ray tracer capable of fully utilizing modern cluster hardware to render massive out-of-core polygonal models at real-time frame rates. Achieving this required adapting a state-of-the-art packetized BVH acceleration structure for use with DSM and modifying the mesh and BVH data layouts to minimize communication costs. Furthermore, several design decisions and optimizations were made to take advantage of InfiniBand interconnects and multi-core machines.

Item Revisiting Parallel Rendering for Shared Memory Machines (The Eurographics Association, 2011)
Nouanesengsy, Boonthanome; Ahrens, James; Woodring, Jonathan; Shen, Han-Wei; Torsten Kuhlen and Renato Pajarola and Kun Zhou
Increasing the core count of CPUs to increase computational performance has been a significant trend for the better part of a decade. This has led to an unprecedented availability of large shared memory machines. Programming paradigms and systems are shifting to take advantage of this architectural change, so that intra-node parallelism can be fully utilized. Algorithms designed for parallel execution on distributed systems will also need to be modified to scale in these new shared and hybrid memory systems.
In this paper, we reinvestigate parallel rendering algorithms with the goal of finding one that achieves favorable performance in this new environment. We test and analyze various methods, including sort-first, sort-last, and a hybrid scheme, to find an optimal parallel algorithm that maximizes shared memory performance.