Academic English Paper (Full Version)

巡山小妖精
February 7, 2021, 21:24
English Academic Paper Assignment



Hybrid Parallel Programming on GPU Clusters



Abstract: NVIDIA's CUDA is a general-purpose, scalable parallel programming model for writing highly parallel applications. It provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores; scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes. In this paper, we propose a hybrid parallel programming approach that combines CUDA and MPI and partitions loop iterations according to the number of C1060 GPU nodes in a GPU cluster consisting of one C1060 and one S1070. Loop iterations assigned to one MPI process are processed in parallel by CUDA on the processor cores of the same computational node.



Keywords: CUDA, GPU, MPI, OpenMP, hybrid, parallel programming



I. INTRODUCTION


NVIDIA's CUDA [1, 16] is a general-purpose, scalable parallel programming model for writing highly parallel applications. It provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores; scientists throughout industry and academia are already using CUDA [1, 16] to achieve dramatic speedups on production and research codes.



NVIDIA builds its CUDA chips from hundreds of cores, and here we use the computing devices that NVIDIA provides for parallel computing. This paper proposes a solution that not only simplifies the use of hardware acceleration in conventional general-purpose applications but also keeps the application code portable. We propose a parallel programming approach using hybrid CUDA, OpenMP, and MPI [3] programming, which partitions loop iterations according to the performance weighting of the multi-core [4] nodes in a cluster. Because iterations assigned to one MPI process are processed in parallel by OpenMP threads run by the processor cores in the same computational node, the number of loop iterations allocated to one computational node at each scheduling step depends on the number of processor cores in that node.
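The partitioning idea described above can be sketched in plain C. This is a minimal illustration under stated assumptions, not the scheme evaluated in this paper: the function name `partition_iterations`, the use of core counts as weights, and the rule of handing leftover iterations to the lowest-numbered nodes are all hypothetical choices made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: split total_iters loop iterations among n_nodes
 * in proportion to each node's weight (for example, its core count).
 * chunk[i] receives the number of iterations for node i. */
void partition_iterations(long total_iters, const int *weights,
                          size_t n_nodes, long *chunk)
{
    long weight_sum = 0;
    long assigned = 0;
    size_t i;

    assert(total_iters >= 0 && n_nodes > 0);
    for (i = 0; i < n_nodes; i++) {
        assert(weights[i] > 0);
        weight_sum += weights[i];
    }

    /* Proportional share, rounded down. */
    for (i = 0; i < n_nodes; i++) {
        chunk[i] = total_iters * weights[i] / weight_sum;
        assigned += chunk[i];
    }

    /* Assumption: hand out the rounding remainder one iteration at a
     * time, starting from node 0, so every iteration is assigned. */
    for (i = 0; assigned < total_iters; i = (i + 1) % n_nodes) {
        chunk[i]++;
        assigned++;
    }
}
```

With weights {4, 2, 2} and 1000 iterations, for instance, the three nodes receive 500, 250, and 250 iterations. In an MPI program each rank could call such a helper to find its own chunk before running its part of the loop.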



We propose a general approach that uses performance functions to estimate performance weights for each node. To verify the proposed approach, a heterogeneous cluster and a homogeneous cluster were built. In our implementation, the master node also participates in computation, whereas in previous schemes only slave nodes do computation work. Empirical results show that, in both heterogeneous and homogeneous cluster environments, the proposed approach improved performance over all previous schemes.



The rest of this paper is organized as follows. In Section 2, we introduce several typical and well-known self-scheduling schemes and a well-known benchmark used to analyze computer system performance. In Section 3, we define our model and describe our approach. Our system configuration is then specified in Section 4, and experimental results for three types of application programs are presented. Concluding remarks and future work are given in Section 5.



II. BACKGROUND REVIEW



A. History of GPU and CUDA



In the past, parallel computing across multiple CPUs required more than one computer. Early display chips did not need to perform much computation. As games and graphics became more demanding and 3D became necessary, 3D accelerator cards appeared. Display processing gradually moved onto separate chips, which eventually became processors comparable to the CPU: the GPU. We know that GPU computing can produce the answers we want, but why choose the GPU? A comparison of current CPUs and GPUs shows the following. First, a CPU currently has at most eight cores, whereas a GPU has grown to 260 cores. From the core count alone, we can see that the GPU suits highly parallel programs; although each core runs at a relatively low frequency, we believe that a large amount of parallel computing power can make up for the weakness of a single core. Second, the GPU has its own on-board memory. Comparing CPU access to main memory with GPU access to its own memory, we find that GPU memory access is about 10 times faster, a difference of roughly 90 GB/s. This is a striking gap, and it means that computations requiring access to large amounts of data can benefit substantially from the GPU.



The CPU uses advanced flow control, such as branch prediction or delayed branches, and a large cache to reduce memory access latency. The GPU has a relatively small cache and simple flow control, so it instead uses a large number of threads to hide memory latency. For example, suppose one GPU memory access takes 5 seconds; if 100 threads access memory simultaneously, the total time is still 5 seconds. Suppose instead that a single CPU memory access takes less time, but 100 threads accessing memory one after another take 10 seconds in total. GPU parallel processing can therefore hide memory latency even when an individual access is slower than on the CPU. The GPU is designed such that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated by Figure 1.
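The latency-hiding arithmetic above can be written out as a small C sketch. It is an idealized model only: the function names are hypothetical, and the 100 ms CPU access latency used in the usage note is an assumption chosen to be consistent with the 10-second total, since the text does not state a value.

```c
#include <assert.h>

/* Time for n_threads memory accesses when all of them are in flight at
 * once, so their latencies overlap (the idealized GPU case). */
long overlapped_time_ms(long latency_ms, int n_threads)
{
    assert(latency_ms >= 0 && n_threads >= 0);
    return n_threads > 0 ? latency_ms : 0;
}

/* Time for n_threads memory accesses issued one after another
 * (the idealized CPU case). */
long serialized_time_ms(long latency_ms, int n_threads)
{
    assert(latency_ms >= 0 && n_threads >= 0);
    return latency_ms * n_threads;
}
```

Under these assumptions, `overlapped_time_ms(5000, 100)` gives 5000 ms, while `serialized_time_ms(100, 100)` gives 10000 ms: the slower but overlapped accesses finish first.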



Therefore, we exploit the GPU's advantage in arithmetic logic and use NVIDIA's many cores to carry out large amounts of computation, writing programs for those cores with the parallel programming API that NVIDIA provides.




Must we use the interface provided by NVIDIA for GPU computing? Not necessarily. The options include NVIDIA CUDA, ATI CTM, and OpenCL (Open Computing Language), proposed by Apple. CUDA is one of the earliest of these and currently the most widely used, but it supports only NVIDIA's own graphics cards; as a result, almost all GPU computing at this stage is done on NVIDIA cards. ATI developed its own language, CTM, and Apple proposed OpenCL, which is supported by both NVIDIA and ATI; ATI has since abandoned CTM in favor of another language. Because GPUs were originally used for graphics, they usually supported only single-precision floating-point operations. In science, precision is a very important indicator, so the computing graphics cards introduced this year support double-precision floating-point operations.



B. CUDA Programming



CUDA (an acronym for Compute Unified Device Architecture) is a parallel computing [2] architecture developed by NVIDIA. CUDA is the computing engine in NVIDIA graphics processing units (GPUs) that is accessible to software developers through industry-standard programming languages. The CUDA software stack is composed of several layers, as illustrated in Figure 2: a hardware driver, an application programming interface (API) and its runtime, and two higher-level mathematical libraries of common usage, CUFFT [17] and CUBLAS [18]. The hardware has been designed to support lightweight driver and runtime layers, resulting in high performance. The CUDA architecture supports a range of computational interfaces, including OpenGL [9] and DirectCompute. The challenge is to develop software whose parallelism scales with the growing number of processor cores. CUDA's parallel programming model is designed to overcome this challenge while maintaining a low learning curve for programmers familiar with standard programming languages such as C. At its core are three key abstractions (a hierarchy of thread groups, shared memories, and barrier synchronization) that are simply exposed to the programmer as a minimal set of language extensions.



These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. They guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel, and then into finer pieces that can be solved cooperatively in parallel. Such a decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables transparent scalability, since each sub-problem can be scheduled to be solved on any of the available processor cores. A compiled CUDA program can therefore execute on any number of processor cores, and only the runtime system needs to know the physical processor count.



C. CUDA Processing flow



The CUDA processing flow is illustrated in Figure 3 [16]. First, data are copied from main memory to GPU memory. Second, the CPU instructs the GPU to start processing. Third, the GPU executes the work in parallel on each core. Finally, the result is copied from GPU memory back to main memory.
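The four steps of this flow can be mimicked with a host-only C sketch. This is an emulation for illustration, not CUDA code: a separate heap buffer stands in for GPU memory, an ordinary loop stands in for the parallel threads, and the names `scale_element` and `run_emulated_flow` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for a GPU kernel: the work done by one emulated thread. */
static void scale_element(float *device_data, size_t thread_id, float factor)
{
    device_data[thread_id] *= factor;
}

/* Host-only emulation of the four-step flow. Returns 0 on success and
 * -1 if the stand-in "GPU memory" cannot be allocated. */
int run_emulated_flow(float *host_data, size_t n, float factor)
{
    size_t bytes = n * sizeof(float);
    size_t tid;
    float *device_data;

    assert(host_data != NULL && n > 0);

    /* A separate heap buffer plays the role of GPU memory. */
    device_data = (float *)malloc(bytes);
    if (device_data == NULL)
        return -1;

    /* Step 1: copy the input from main memory to "GPU memory". */
    memcpy(device_data, host_data, bytes);

    /* Steps 2 and 3: the CPU starts the work, and every element is
     * handled by its own emulated thread. A real GPU would run these
     * in parallel; this loop runs them one after another. */
    for (tid = 0; tid < n; tid++)
        scale_element(device_data, tid, factor);

    /* Step 4: copy the result from "GPU memory" back to main memory. */
    memcpy(host_data, device_data, bytes);

    free(device_data);
    return 0;
}
```

Calling `run_emulated_flow(v, 4, 2.0f)` on an array holding 1, 2, 3, 4 leaves 2, 4, 6, 8 in it. In an actual CUDA program, steps 1 and 4 would typically be `cudaMemcpy` calls in the host-to-device and device-to-host directions, and steps 2 and 3 would be a kernel launch.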



III. SYSTEM HARDWARE



A. Tesla C1060 GPU Computing Processor



The NVIDIA Tesla C1060 transforms a workstation into a high-performance computer that outperforms a small cluster. This gives technical professionals a dedicated computing resource at their deskside that is much faster and more energy-efficient than a shared cluster in the data center. The NVIDIA Tesla C1060 computing processor board, which has 240 cores, is a PCI Express form factor computing add-in card based on the NVIDIA Tesla T10 graphics processing unit (GPU). This board is targeted as a high-performance computing (HPC) solution for PCI Express systems. The Tesla C1060 [15] is capable of 933 GFLOP/s [13]

