Academic English Paper (Full Version)
English Academic Paper Assignment
Hybrid Parallel Programming on GPU Clusters
Abstract: Nowadays, NVIDIA's CUDA is a general-purpose, scalable parallel programming model for writing highly parallel applications. It provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores: scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes. In this paper, we propose a hybrid parallel programming approach that combines CUDA and MPI programming and partitions loop iterations according to the number of C1060 GPU nodes in a GPU cluster consisting of one C1060 and one S1070. Loop iterations assigned to one MPI process are processed in parallel by CUDA, run by the processor cores in the same computational node.
Keywords: CUDA, GPU, MPI, OpenMP, hybrid, parallel programming
I. INTRODUCTION
Nowadays, NVIDIA's CUDA [1, 16] is a general-purpose, scalable parallel programming model for writing highly parallel applications. It provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores: scientists throughout industry and academia are already using CUDA [1, 16] to achieve dramatic speedups on production and research codes.
NVIDIA builds its CUDA-capable chips from hundreds of cores, and here we use the computing equipment that NVIDIA provides for parallel computing. This paper proposes a solution that not only simplifies the use of hardware acceleration in conventional general-purpose applications, but also keeps the application code portable. In this paper, we propose a parallel programming approach using hybrid CUDA, OpenMP and MPI [3] programming, which partitions loop iterations according to the performance weighting of the multi-core [4] nodes in a cluster.
Because iterations assigned to one MPI process are processed in parallel by OpenMP threads run by the processor cores in the same computational node, the number of loop iterations allocated to one computational node at each scheduling step depends on the number of processor cores in that node.
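The idea of giving each node a share of the loop proportional to its core count can be sketched as follows. This is a minimal illustration rather than the scheduling scheme of this paper: it performs a single static split, whereas the approach described here allocates iterations at each scheduling step. The function name `partition_iterations` and the example core counts are hypothetical, and the weights are assumed to be positive integers.

```python
def partition_iterations(total_iterations, weights):
    """Split range(total_iterations) into contiguous chunks proportional to weights.

    `weights` holds one positive integer per node (for example, its core count).
    Returns one half-open (start, stop) pair per node, in the same order.
    """
    if total_iterations < 0:
        raise ValueError("total_iterations must be non-negative")
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be a non-empty list of positive integers")

    total_weight = sum(weights)
    bounds = []
    start = 0
    cumulative = 0
    for w in weights:
        cumulative += w
        # Floor of the cumulative share. The last node always stops at
        # total_iterations, so every iteration is assigned exactly once.
        stop = total_iterations * cumulative // total_weight
        bounds.append((start, stop))
        start = stop
    return bounds


# Hypothetical cluster: three nodes with 2, 4, and 2 processor cores.
chunks = partition_iterations(1000, [2, 4, 2])
print(chunks)
```

In a real MPI program, each process would look up its own pair by rank and loop only over that range; the cumulative-floor rule is used here so that rounding never drops or duplicates an iteration.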
In this paper, we propose a general approach that uses performance functions to estimate performance weights for each node. To verify the proposed approach, a heterogeneous cluster and a homogeneous cluster were built. In our implementation, the master node also participates in computation, whereas in previous schemes only slave nodes do computation work. Empirical results show that in both heterogeneous and homogeneous cluster environments, the proposed approach improved performance over all previous schemes.
The rest of this paper is organized as follows. In Section 2, we introduce several typical and well-known self-scheduling schemes, and a well-known benchmark used to analyze computer system performance. In Section 3, we define our model and describe our approach. Our system configuration is then specified in Section 4, and experimental results for three types of application programs are presented. Concluding remarks and future work are given in Section 5.
II. BACKGROUND REVIEW
A. History of GPU and CUDA
In the past, parallel computing required more than one computer with multiple CPUs. Early display chips did not need to perform much computation. As games and graphics began to demand 3D, 3D accelerator cards appeared, display processing gradually moved onto separate chips, and eventually those chips became processors similar to CPUs in their own right: the GPU. We know that GPU computing can be used to get the answers we want, but why choose the GPU? Consider a comparison of current CPUs and GPUs. First, CPUs currently have at most eight cores, whereas GPUs have grown to 260 cores. From the core count alone, we can see that the GPU is suited to running a large number of parallel programs; although each GPU core runs at a relatively low frequency, we believe that the combined computing power of so many parallel cores is not weaker than that of a single powerful core. Next, the GPU has its own memory. Comparing CPU access to main memory with GPU access to GPU memory, we find that GPU memory access is about 10 times faster, a difference of roughly 90 GB/s. This is a striking gap, and it means that computations that must access large amounts of data can benefit considerably from the GPU.
The CPU uses advanced flow control, such as branch prediction or delayed branches, together with a large cache to reduce memory access latency. The GPU has a relatively small cache and only simple flow control, so its method is to use a large number of computing threads to cover up memory latency. For example, suppose one GPU memory access takes 5 seconds; if 100 threads access memory simultaneously, the total time is still 5 seconds. Suppose instead that a single CPU memory access is faster, but the accesses of 100 threads take 10 seconds in total. GPU parallel processing can therefore hide memory latency even when a single memory access is slower than on the CPU. The GPU is designed such that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated by Figure 1.
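The latency-hiding arithmetic above can be made concrete with a small idealized model. It assumes that simultaneous accesses overlap perfectly (no bandwidth limit), which real hardware only approximates; the function names are hypothetical, and the 5-second latency and 100 threads are the illustrative figures from the text.

```python
def overlapped_time(latency, num_threads):
    """All accesses are in flight at once, so the total is a single latency."""
    return latency if num_threads > 0 else 0


def serialized_time(latency, num_threads):
    """Accesses happen one after another, so the latencies add up."""
    return latency * num_threads


gpu_latency = 5   # seconds per access, the illustrative figure from the text
threads = 100

hidden = overlapped_time(gpu_latency, threads)    # latency paid once
exposed = serialized_time(gpu_latency, threads)   # latency paid 100 times
print(hidden, exposed)
```

Under this model the overlapped total stays at 5 seconds regardless of the thread count, while the same accesses issued one at a time would take 500 seconds, which is why a long per-access latency matters little once enough threads are in flight.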
Therefore, we take advantage of the GPU's strength in arithmetic logic and use the many cores that NVIDIA makes available to carry out a large amount of computation for us. We write programs for these cores and use the parallel programming API provided by NVIDIA Corporation to carry out the large number of operations.
Must we use the form of GPU computing provided by NVIDIA Corporation? Not really. We can use NVIDIA CUDA, ATI CTM, or OpenCL (Open Computing Language), which was proposed by Apple. CUDA is one of the earliest of these and the one most people use at this stage, but it supports only NVIDIA's own graphics cards; as a result, almost all graphics cards currently used for GPU computing are NVIDIA's. ATI also developed its own language, CTM, and Apple proposed OpenCL, which is supported by both NVIDIA and ATI; ATI has since given up CTM in favor of another language. Because of how GPUs were used previously, they usually supported only single-precision floating-point operations. In science, precision is a very important indicator, so the computing graphics cards introduced this year support double-precision floating-point operations.
B. CUDA Programming
CUDA (an acronym for Compute Unified Device Architecture) is a parallel computing [2] architecture developed by NVIDIA. CUDA is the computing engine in NVIDIA graphics processing units (GPUs) that is accessible to software developers through industry-standard programming languages. The CUDA software stack is composed of several layers, as illustrated in Figure 2: a hardware driver, an application programming interface (API) and its runtime, and two higher-level mathematical libraries of common usage, CUFFT [17] and CUBLAS [18]. The hardware has been designed to support lightweight driver and runtime layers, resulting in high performance. The CUDA architecture supports a range of computational interfaces, including OpenGL [9] and DirectCompute.
CUDA's parallel programming model is designed to overcome this challenge while maintaining a low learning curve for programmers familiar with standard programming languages such as C. At its core are three key abstractions (a hierarchy of thread groups, shared memories, and barrier synchronization) that are simply exposed to the programmer as a minimal set of language extensions.
These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. They guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel, and then into finer pieces that can be solved cooperatively in parallel. Such a decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables transparent scalability, since each sub-problem can be scheduled to be solved on any of the available processor cores: a compiled CUDA program can therefore execute on any number of processor cores, and only the runtime system needs to know the physical processor count.
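The two-level decomposition described above can be illustrated with a sequential Python model of a block-wise sum. This is a sketch of the idea only, not CUDA code: the names `block_sum` and `grid_sum` are hypothetical, a Python list stands in for a block's shared memory, and each pass of the `while` loop stands in for one step between two barrier synchronizations.

```python
def block_sum(shared):
    """Tree reduction inside one block; `shared` models the block's shared memory."""
    stride = 1
    while stride < len(shared):
        # One pass models one step between barriers: each "thread" i adds
        # its partner's value, and all threads finish before the next step.
        # Writes go to multiples of 2*stride and reads come from i + stride,
        # so no thread reads a slot that another thread writes in this step.
        for i in range(0, len(shared) - stride, 2 * stride):
            shared[i] += shared[i + stride]
        stride *= 2
    return shared[0] if shared else 0


def grid_sum(data, block_dim):
    """Coarse level: independent blocks. Fine level: cooperative block_sum."""
    partials = []
    for block_start in range(0, len(data), block_dim):
        # Each block copies its slice into its own "shared memory".
        shared = list(data[block_start:block_start + block_dim])
        partials.append(block_sum(shared))
    # The per-block partial results are independent, so the blocks could
    # have run in any order or on any number of cores.
    return sum(partials)


total = grid_sum(list(range(1, 11)), block_dim=4)
print(total)
```

Because no block depends on another, the outer loop could be distributed over however many cores are available, which is the transparent scalability the text describes.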
C. CUDA Processing Flow
The CUDA processing flow is illustrated in Figure 3 [16]. First, data is copied from main memory to GPU memory. Second, the CPU instructs the GPU to start processing. Third, the GPU executes the work in parallel on each core. Finally, the result is copied from GPU memory back to main memory.
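The four steps can be traced in a plain Python model. This is only a sketch of the data movement, not the CUDA API: separate lists stand in for main memory and GPU memory, the per-element loop stands in for threads that a real GPU would run in parallel, and the names `run_on_device` and `square_kernel` are hypothetical.

```python
def run_on_device(host_input, kernel):
    """Model the four-step flow: copy in, launch, execute, copy out."""
    # Step 1: copy data from main memory to (modeled) GPU memory.
    device_input = list(host_input)
    device_output = [None] * len(device_input)

    # Step 2: the CPU instructs the GPU to run the kernel.
    # Step 3: the kernel runs once per element. This loop is sequential;
    # on a real GPU these invocations would execute in parallel.
    for thread_id in range(len(device_input)):
        kernel(thread_id, device_input, device_output)

    # Step 4: copy the result from GPU memory back to main memory.
    return list(device_output)


def square_kernel(thread_id, src, dst):
    """Each thread handles exactly one element, selected by its thread id."""
    dst[thread_id] = src[thread_id] * src[thread_id]


host_data = [1, 2, 3, 4]
result = run_on_device(host_data, square_kernel)
print(result)
```

The host list is never modified by the kernel, which mirrors the fact that the GPU works on its own copy of the data and the CPU sees results only after the final copy back.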
III. SYSTEM HARDWARE
A. Tesla C1060 GPU Computing Processor
The NVIDIA® Tesla™ C1060 transforms a workstation into a high-performance computer that outperforms a small cluster. This gives technical professionals a dedicated computing resource at their desk-side that is much faster and more energy-efficient than a shared cluster in the data center.
The NVIDIA® Tesla™ C1060 computing processor board, which consists of 240 cores, is a PCI Express form factor computing add-in card based on the NVIDIA Tesla T10 graphics processing unit (GPU). This board is targeted as a high-performance computing (HPC) solution for PCI Express systems. The Tesla C1060 [15] is capable of 933 GFLOP/s [13]