Academic English Paper (Complete Version)


Posted February 7, 2021 (author: 东北玩泥巴歌词)


English Academic Paper Assignment

Hybrid Parallel Programming on GPU Clusters

Abstract: NVIDIA's CUDA is a general-purpose, scalable parallel programming model for writing highly parallel applications. It provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores; scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes. In this paper, we propose a hybrid parallel programming approach that combines CUDA and MPI and partitions loop iterations according to the number of C1060 GPU nodes in a GPU cluster consisting of one C1060 and one S1070. Loop iterations assigned to one MPI process are processed in parallel by CUDA, run by the processor cores in the same computational node.

Keywords: CUDA, GPU, MPI, OpenMP, hybrid, parallel programming

I. INTRODUCTION

NVIDIA's CUDA [1, 16] is a general-purpose, scalable parallel programming model for writing highly parallel applications. It provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores; scientists throughout industry and academia are already using CUDA [1, 16] to achieve dramatic speedups on production and research codes.

NVIDIA builds its CUDA-capable chips from hundreds of cores, and here we use the computing hardware that NVIDIA provides for parallel computing. This paper proposes a solution that not only simplifies the use of hardware acceleration in conventional general-purpose applications but also keeps the application code portable. We propose a parallel programming approach using hybrid CUDA, OpenMP, and MPI [3] programming, which partitions loop iterations according to the performance weighting of the multi-core [4] nodes in a cluster. Because iterations assigned to one MPI process are processed in parallel by OpenMP threads run by the processor cores in the same computational node, the number of loop iterations allocated to one computational node at each scheduling step depends on the number of processor cores in that node.
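The idea of allocating iterations in proportion to each node's core count can be sketched as follows. This is a minimal illustration, not the scheme evaluated in the paper: it performs a single static split rather than repeated scheduling steps, it uses the raw core count as the performance weight, and the function name `partition_by_weight` and the example core counts are assumptions made for this sketch. The code is host-only and compiles as an ordinary CUDA C++ source file.

```cuda
#include <cstdio>
#include <vector>

// Hypothetical helper: split the iteration range [0, total) into contiguous
// chunks whose sizes are proportional to each node's weight (assumed here to
// be its processor-core count). Chunk i is [bounds[i], bounds[i + 1]).
std::vector<long> partition_by_weight(long total, const std::vector<int>& weights) {
    long weight_sum = 0;
    for (int w : weights) weight_sum += w;

    std::vector<long> bounds(weights.size() + 1, 0);
    long prefix = 0;
    for (size_t i = 0; i < weights.size(); ++i) {
        prefix += weights[i];
        // Rounding the cumulative share (not each chunk) keeps the chunks
        // contiguous and makes the final bound equal to total exactly.
        bounds[i + 1] = total * prefix / weight_sum;
    }
    return bounds;
}

int main() {
    // Assumed example: three nodes with 2, 4 and 8 cores share 1000 iterations.
    std::vector<int> cores = {2, 4, 8};
    std::vector<long> bounds = partition_by_weight(1000, cores);
    for (size_t rank = 0; rank < cores.size(); ++rank) {
        // In an MPI program, each rank would process only its own chunk.
        std::printf("rank %zu: iterations [%ld, %ld)\n",
                    rank, bounds[rank], bounds[rank + 1]);
    }
    return 0;
}
```

With these assumed weights the three ranks receive 142, 286, and 572 iterations respectively, so the node with the most cores gets the largest share.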

In this paper, we propose a general approach that uses performance functions to estimate performance weights for each node. To verify the proposed approach, a heterogeneous cluster and a homogeneous cluster were built. In our implementation, the master node also participates in computation, whereas in previous schemes only slave nodes do computation work. Empirical results show that in both heterogeneous and homogeneous cluster environments, the proposed approach improved performance over all previous schemes.

The rest of this paper is organized as follows. In Section 2, we introduce several typical and well-known self-scheduling schemes, and a famous benchmark used to analyze computer system performance. In Section 3, we define our model and describe our approach. Our system configuration is then specified in Section 4, and experimental results for three types of application program are presented. Concluding remarks and future work are given in Section 5.

II. BACKGROUND REVIEW

A. History of GPU and CUDA

In the past, parallel computing required more than one computer, each with its own CPU. Early display chips did not need to do much computation. As games, and later 3D graphics, demanded more, 3D accelerator cards appeared, display processing moved onto separate chips, and those chips eventually became processors comparable to a CPU: the GPU. We know that GPU computing can produce the answers we want, but why choose the GPU? Consider a comparison of current CPUs and GPUs. First, a CPU currently has at most eight cores, whereas a GPU has grown to 260 cores. From the core count alone we can see that a GPU can run many parallel programs; although each core runs at a relatively low frequency, we believe that the large number of parallel cores compensates for the weaker individual core. Second, the GPU has its own on-board memory. Comparing CPU access to main memory with GPU access to its own memory, we find that the GPU is faster by about 10 times, a difference of about 90 GB/s. This is quite an alarming gap, and it means that computations that need to access large amounts of data can benefit substantially from the GPU.

The CPU uses advanced flow control, such as branch prediction or delayed branching, and a large cache to reduce memory access latency. The GPU's cache is relatively small and its flow control is simple, so the GPU instead uses a large number of computing units to cover up memory latency. For example, suppose one GPU memory access takes some number of seconds; if 100 threads access memory simultaneously, the total time is 5 seconds. Suppose one CPU memory access takes some number of seconds; if 100 threads access memory, the total time is 10 seconds. GPU parallel processing can therefore hide memory latency even when a single memory access is slower than on the CPU. The GPU is designed such that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated by Figure 1.

Therefore, we take advantage of the GPU's strength in arithmetic logic: we use the many cores that NVIDIA provides to carry out large amounts of computation, writing programs for those cores with the parallel programming API supplied by NVIDIA Corporation.

Must we use the GPU computing platform provided by NVIDIA Corporation? Not really. We can use NVIDIA CUDA, ATI CTM, or OpenCL (Open Computing Language), which was proposed by Apple. CUDA is one of the earliest of these and the one most people use at this stage, but it only supports NVIDIA's own graphics cards; consequently, almost all graphics cards used for GPU computing at this stage are from NVIDIA. ATI developed its own language, CTM, and Apple proposed OpenCL, which is supported by both NVIDIA and ATI; ATI has since given up CTM in favor of another language. Because of how GPUs were used previously, they usually supported only single-precision floating-point operations, and in science precision is a very important indicator. Therefore, the computing graphics cards introduced this year support double-precision floating-point operations.

B. CUDA Programming

CUDA (an acronym for Compute Unified Device Architecture) is a parallel computing [2] architecture developed by NVIDIA. CUDA is the computing engine in NVIDIA graphics processing units (GPUs) that is accessible to software developers through industry-standard programming languages. The CUDA software stack is composed of several layers, as illustrated in Figure 2: a hardware driver, an application programming interface (API) and its runtime, and two higher-level mathematical libraries of common usage, CUFFT [17] and CUBLAS [18]. The hardware has been designed to support lightweight driver and runtime layers, resulting in high performance. The CUDA architecture supports a range of computational interfaces, including OpenGL [9] and DirectCompute. CUDA's parallel programming model is designed to overcome the challenge of scaling application parallelism across a growing number of processor cores while maintaining a low learning curve for programmers familiar with standard programming languages such as C. At its core are three key abstractions (a hierarchy of thread groups, shared memories, and barrier synchronization) that are simply exposed to the programmer as a minimal set of language extensions.

These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. They guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel, and then into finer pieces that can be solved cooperatively in parallel. Such a decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables transparent scalability, since each sub-problem can be scheduled to be solved on any of the available processor cores: a compiled CUDA program can therefore execute on any number of processor cores, and only the runtime system needs to know the physical processor count.
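The three abstractions can be seen together in a small reduction kernel. The following is a minimal sketch rather than code from the paper: the names `block_sum`, `gpu_sum`, and `THREADS_PER_BLOCK` are chosen for this illustration, error checking is omitted for brevity, and the tree reduction assumes the block size is a power of two. Each thread block is an independent coarse sub-problem, while the threads inside a block cooperate through shared memory and barriers.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256  // assumed block size; must be a power of two here

// Each block sums its own slice of the input. Blocks never communicate with
// each other, so the runtime may schedule them on any available cores.
__global__ void block_sum(const float* in, float* block_results, int n) {
    __shared__ float partial[THREADS_PER_BLOCK];      // shared memory, one copy per block

    int tid = threadIdx.x;                            // position within the block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // position within the grid

    partial[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                                  // barrier: all loads are visible

    // Tree reduction: threads of one block cooperate on the sub-problem.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();                              // barrier after every step
    }
    if (tid == 0) block_results[blockIdx.x] = partial[0];
}

// Hypothetical host wrapper: sums n floats (n > 0) using the kernel above.
float gpu_sum(const float* h_in, int n) {
    const int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, THREADS_PER_BLOCK>>>(d_in, d_out, n);

    std::vector<float> h_out(blocks);
    cudaMemcpy(h_out.data(), d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    float total = 0.0f;
    for (float v : h_out) total += v;  // combine the per-block results on the host
    return total;
}

int main() {
    std::vector<float> ones(1000, 1.0f);
    std::printf("sum = %.1f\n", gpu_sum(ones.data(), 1000));
    return 0;
}
```

The launch configuration only states how many blocks exist; nothing in the kernel depends on how many physical cores run them, which is the transparent scalability described above.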

C. CUDA Processing flow

As Figure 3 [16] illustrates, the CUDA processing flow has four steps. First, data is copied from main memory to GPU memory. Second, the CPU instructs the GPU to start processing. Third, the GPU executes the work in parallel on each core. Finally, the result is copied from GPU memory back to main memory.
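The four steps map directly onto CUDA runtime calls. The sketch below is illustrative only: the kernel `scale`, the wrapper `scale_on_gpu`, the helper `check`, and the array size are assumptions for this example, and a real application would replace the kernel with its own computation.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Illustrative kernel: each thread multiplies one element by a factor.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Hypothetical wrapper: scales n host floats in place using the GPU.
void scale_on_gpu(float* h_data, int n, float factor) {
    const size_t bytes = n * sizeof(float);
    float* d_data = nullptr;
    check(cudaMalloc(&d_data, bytes), "cudaMalloc");

    // Step 1: copy data from main memory to GPU memory.
    check(cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice), "copy to device");

    // Step 2: the CPU instructs the GPU to start processing (kernel launch).
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_data, factor, n);
    check(cudaGetLastError(), "kernel launch");

    // Step 3: the GPU executes the kernel in parallel; wait for it to finish.
    check(cudaDeviceSynchronize(), "kernel execution");

    // Step 4: copy the result from GPU memory back to main memory.
    check(cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost), "copy to host");

    check(cudaFree(d_data), "cudaFree");
}

int main() {
    std::vector<float> data(1 << 20, 1.0f);
    scale_on_gpu(data.data(), static_cast<int>(data.size()), 2.0f);
    std::printf("data[0] = %.1f\n", data[0]);
    return 0;
}
```

The kernel launch in step 2 returns immediately on the host, which is why the sketch synchronizes explicitly before reporting execution errors; the blocking copy in step 4 would also wait for the kernel to finish.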

III. SYSTEM HARDWARE

Tesla C1060 GPU Computing Processor

The NVIDIA® Tesla™ C1060 transforms a workstation into a high-performance computer that outperforms a small cluster. This gives technical professionals a dedicated computing resource at their desk-side that is much faster and more energy-efficient than a shared cluster in the data center. The NVIDIA® Tesla™ C1060 computing processor board, which consists of 240 cores, is a PCI Express form factor computing add-in card based on the NVIDIA Tesla T10 graphics processing unit (GPU). This board is targeted as a high-performance computing (HPC) solution for PCI Express systems. The Tesla C1060 [15] is capable of 933 GFLOP/s [13]
