Two Fundamental Hardware Techniques Used To Increase Performance
- Parallelism
- Pipelining
Parallelism
- Multiple copies of hardware unit used
- All copies can operate simultaneously
- Occurs at many levels of architecture
- Term parallel computer applied when parallelism dominates entire architecture
Characterizations Of Parallelism
- Microscopic vs. macroscopic
- Symmetric vs. asymmetric
- Fine-grain vs. coarse-grain
- Explicit vs. implicit
Types Of Parallel Architectures
Name – Meaning
- SISD – Single Instruction, Single Data stream
- SIMD – Single Instruction, Multiple Data streams
- MIMD – Multiple Instructions, Multiple Data streams
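To make the SISD/SIMD distinction concrete, here is a small hypothetical sketch (my own illustration, not taken from the sources listed below): the same vector-scaling computation written as a sequential SISD-style loop and as a data-parallel CUDA kernel, where a single instruction stream is applied to many data elements at once (NVIDIA's SIMT model, a close relative of SIMD). The function names are invented for this example.

// Hypothetical illustration: the same vector scaling written two ways.

// SISD style: one instruction stream, one data stream.
void scale_sisd(float *a, float s, int n) {
    for (int i = 0; i < n; ++i)
        a[i] = a[i] * s;          // one element per loop iteration
}

// SIMD/SIMT style (CUDA C): one kernel, many threads,
// each thread applies the same operation to a different element.
__global__ void scale_simd(float *a, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = a[i] * s;          // all threads execute this in parallel
}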
Distributed Processing
Distributed processing is processing in which more than one processor takes part in completing a transaction. In other words, the work is distributed across two or more machines, and the cooperating processes do not necessarily run at the same time.
The word distributed in
terms such as "distributed system", "distributed
programming", and "distributed algorithm" originally
referred to computer networks where individual computers were physically
distributed within some geographical area. The terms are nowadays used in a
much wider sense, even referring to autonomous processes that run on the same physical
computer and interact with each other by message passing. While there is
no single definition of a distributed system, the
following defining properties are commonly used:
- There are several autonomous computational entities, each of which has its own local memory
- The entities communicate with each other by message passing (a minimal sketch of this appears below).
Here, the computational entities are called computers or nodes. A distributed system may have a
common goal, such as solving a large computational problem. Alternatively,
each computer may have its own user with individual needs, and the purpose of
the distributed system is to coordinate the use of shared resources or provide
communication services to the users.
Other typical properties of
distributed systems include the following:
- The system has to tolerate failures in individual computers.
- The structure of the system (network topology, network latency, number of computers) is not known in advance, the system may consist of different kinds of computers and network links, and the system may change during the execution of a distributed program.
- Each computer has only a limited, incomplete view of the system. Each computer may know only one part of the input.
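As a minimal illustration of the message-passing property mentioned above, the hypothetical C sketch below runs two autonomous processes on one machine, each with its own memory, that interact only by sending bytes through a pipe. A real distributed system would use a network transport such as sockets, but the shape of the interaction is the same.

/* Hypothetical sketch: two processes with separate memory
 * communicating only by message passing (here, a pipe). */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    int fd[2];
    pipe(fd);                          /* fd[0]: read end, fd[1]: write end */

    if (fork() == 0) {                 /* child: one autonomous "node" */
        close(fd[1]);
        char buf[64] = {0};
        read(fd[0], buf, sizeof(buf) - 1);
        printf("node B received: %s\n", buf);
        return 0;
    }

    /* parent: the other "node" */
    close(fd[0]);
    const char *msg = "work request #1";
    write(fd[1], msg, strlen(msg) + 1);
    close(fd[1]);
    wait(NULL);                        /* wait for the child node to finish */
    return 0;
}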
A parallel computer is a collection of processing elements that cooperate to solve large problems fast. It extends the notion of “computer architecture” to support communication and cooperation:
• OLD: Instruction Set Architecture
• NEW: Communication Architecture
The communication architecture defines:
• Critical abstractions, boundaries, and primitives (interfaces)
• Organizational structures that implement the interfaces (hw or sw)
Compilers, libraries and the OS are important bridges between them.
Communication Architecture = User/System Interface + Implementation
User/System Interface:
• Comm. primitives exposed to user-level by hw and system-level sw
Implementation:
• Organizational structures that implement the primitives: hw or OS
• How optimized are they? How integrated into processing node?
• Structure of network
Goals:
• Performance
• Broad applicability
• Programmability
• Scalability
• Low Cost
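To make the split between the user/system interface and its implementation more tangible, here is a small hypothetical C sketch (my own illustration, not from the lecture slides): the user-level interface exposes only send and receive primitives, while the organizational structure behind them (shared memory, OS sockets, or dedicated hardware) stays hidden behind the function pointers. The comm_if type and the loopback mailbox are invented for this example.

/* Hypothetical sketch of a user/system communication interface.
 * The user sees only the primitives; the implementation behind
 * them (shared memory, OS sockets, NIC hardware) can vary. */
#include <stddef.h>
#include <string.h>
#include <stdio.h>

typedef struct comm_if {
    int (*send)(int dest, const void *buf, size_t len);
    int (*recv)(int src, void *buf, size_t len);
} comm_if;

/* One possible implementation: a trivial in-process "loopback"
 * mailbox, standing in for a real shared-memory or network layer. */
static char mailbox[256];
static size_t mailbox_len;

static int loop_send(int dest, const void *buf, size_t len) {
    (void)dest;
    if (len > sizeof(mailbox)) return -1;
    memcpy(mailbox, buf, len);
    mailbox_len = len;
    return 0;
}

static int loop_recv(int src, void *buf, size_t len) {
    (void)src;
    size_t n = mailbox_len < len ? mailbox_len : len;
    memcpy(buf, mailbox, n);
    return (int)n;
}

int main(void) {
    comm_if net = { loop_send, loop_recv };   /* bind the interface to one implementation */
    net.send(1, "hello", 6);
    char out[16];
    net.recv(0, out, sizeof(out));
    printf("received: %s\n", out);
    return 0;
}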
Introduction to thread programming
Many experimental operating systems, and some commercial ones, have recently included support for concurrent programming. The most popular mechanism for this is some provision for allowing multiple lightweight “threads” within a single address space, used from within a single program. Programming with threads introduces new difficulties even for experienced programmers. Concurrent programming has techniques and pitfalls that do not occur in sequential programming. Many of the techniques are obvious, but some are obvious only with hindsight. Some of the pitfalls are comfortable (for example, deadlock is a pleasant sort of bug—your program stops with all the evidence intact), but some take the form of insidious performance penalties.
The purpose of this
paper is to give you an introduction to the programming techniques that work
well with threads, and to warn you about techniques or interactions that work
out badly. It should provide the experienced sequential programmer with enough hints
to be able to build a substantial multi-threaded program that works—correctly, efficiently,
and with a minimum of surprises.
Having “multiple
threads” in a program means that at any instant the program has multiple points
of execution, one in each of its threads. The programmer can mostly view the
threads as executing simultaneously, as if the computer were endowed with as
many processors as there are threads. The programmer is required to decide when
and where to create multiple threads, or to accept such decisions made for him
by implementers of existing library packages or runtime systems. Additionally,
the programmer must occasionally be aware that the computer might not in fact
execute all his threads simultaneously.
Having the threads
execute within a “single address space” means that the computer’s addressing
hardware is configured so as to permit the threads to read and write the same memory
locations. In a high-level language, this usually corresponds to the fact that
the off-stack (global) variables are shared among all the threads of the
program. Each thread executes on a separate call stack with its own separate
local variables. The programmer is responsible for using the synchronization
mechanisms of the thread facility to ensure that the shared memory is accessed
in a manner that will give the correct answer.
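Since the paragraph above is about shared off-stack variables and the need for synchronization, here is a minimal hypothetical sketch using POSIX threads (the paper's own examples use a different thread facility): two threads increment a shared counter, and a mutex is what makes the final value correct.

/* Hypothetical sketch using POSIX threads: a shared (off-stack)
 * counter is protected by a mutex so concurrent updates stay correct. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                         /* shared among all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; ++i) {
        pthread_mutex_lock(&lock);               /* synchronize access to shared memory */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);     /* thread creation is cheap ("lightweight") */
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);          /* 200000 with the mutex in place */
    return 0;
}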
Thread facilities
are always advertised as being “lightweight”. This means that thread creation, existence,
destruction and synchronization primitives are cheap enough that the programmer
will use them for all his concurrency needs. Please be aware that I am
presenting you with a selective, biased and idiosyncratic collection of
techniques. Selective, because an exhaustive survey would be premature, and would
be too exhausting to serve as an introduction—I will be discussing only the
most important thread primitives, omitting features such as per-thread context
information. Biased, because I present examples, problems and solutions in the
context of one particular thread facility.
Introduction to CUDA programming
What is CUDA?
* CUDA Architecture
— Expose general-purpose GPU computing as a first-class capability
— Retain traditional DirectX/OpenGL graphics performance
* CUDA C
— Based on industry-standard C
— A handful of language extensions to allow heterogeneous programs
— Straightforward APIs to manage devices, memory, etc.
* This talk will introduce you to CUDA C
Introduction to CUDA C
* What will you learn today?
— Start from “Hello, World!” (a minimal sketch appears after this list)
— Write and launch CUDA C kernels
— Manage GPU memory
— Run parallel kernels in CUDA C
— Parallel communication and synchronization
— Race conditions and atomic operations
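As a first taste of those topics, here is a minimal hypothetical CUDA C sketch of the “Hello, World!” step: an empty kernel is launched on the device and the host prints the greeting. The kernel name and launch configuration are my own choices, not taken from the slides.

// Minimal hypothetical sketch: "Hello, World!" plus an (empty) kernel launch.
#include <stdio.h>

__global__ void mykernel(void) {
    // Device code: this trivial kernel does nothing yet.
}

int main(void) {
    mykernel<<<1, 1>>>();        // launch 1 block of 1 thread on the device
    cudaDeviceSynchronize();     // wait for the device to finish
    printf("Hello, World!\n");   // host code
    return 0;
}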
CUDA C Prerequisites
· You (probably) need experience with C or C++
· You do not need any GPU experience
· You do not need any graphics experience
· You do not need any parallel programming experience
CUDA C: The Basics
[Figure omitted: the host (CPU) and the device (GPU); not to scale]
· Terminology
· Host – The CPU and its memory (host memory)
· Device – The GPU and its memory (device memory)
Memory Management
Host and device memory are distinct entities
· Device pointers point to GPU memory
— May be passed to and from host code
— May not be dereferenced from host code
· Host pointers point to CPU memory
— May be passed to and from device code
— May not be dereferenced from device code
Basic CUDA API for dealing with device memory
· cudaMalloc(), cudaFree(), cudaMemcpy()
· Similar to their C equivalents, malloc(), free(), memcpy()
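Putting cudaMalloc(), cudaMemcpy() and cudaFree() together, the hypothetical sketch below copies one integer to the device, doubles it in a kernel, and copies the result back; the kernel name is invented and error checking is omitted for brevity.

// Hypothetical sketch: basic device-memory management with
// cudaMalloc(), cudaMemcpy() and cudaFree().
#include <stdio.h>

__global__ void doubleit(int *d) {
    *d = *d * 2;                                   // runs on the device
}

int main(void) {
    int h = 21;                                    // host copy of the value
    int *d;                                        // device pointer (never dereferenced on the host)

    cudaMalloc((void **)&d, sizeof(int));          // allocate device memory
    cudaMemcpy(d, &h, sizeof(int), cudaMemcpyHostToDevice);

    doubleit<<<1, 1>>>(d);                         // launch the kernel on the device pointer

    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);                                   // release device memory

    printf("result = %d\n", h);                    // prints 42
    return 0;
}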
source:
http://www.eecs.wsu.edu/~hauser/teaching/Arch-F07/handouts/Chapter17.pdf
http://www.ask.com/question/what-is-distributed-processing
http://www.cis.upenn.edu/~lee/07cis505/Lec/lec-ch1-DistSys-v4.pdf
http://www.cs.cmu.edu/afs/cs/academic/class/15740-f03/www/lectures/lect08.4up.pdf
https://birrell.org/andrew/papers/035-Threads.pdf
http://www.nvidia.com/content/GTC-2010/pdfs/2131_GTC2010.pdf
http://www.csse.monash.edu.au/~rdp/research/Papers/Parallelism_in_a_computer_architecture_to_support_orientation_changes_in_virtual_reality.pdf