Fortran DVM - Language for Portable Parallel Programs Development
N.A. Konovalov, V.A. Krukov, S.N. Michailov, A.A. Pogrebtsov
Keldysh Institute of Applied Mathematics
Russian Academy of Sciences
E-mail: krukov@keldysh.ru
This report introduces a new language model that allows one to express functional and data parallelism in scientific and engineering applications on massively parallel computers.
The model combines the major advantages of the PCF Fortran and HPF models and is designed for the development of portable applications that can be efficiently executed on distributed and shared memory computers. The model supports the development of libraries of standard parallel programs and the construction of parallel applications from parallel modules. The model also provides means for dynamic load balancing.
Key Words: Portable Parallel Program, Massively Parallel Computers, Distributed Memory, Shared Memory, Load Balancing.
Table of contents
2. The Main ideas of the FDVM Language
3. The Description of FDVM Model
3.1 The Abstract Parallel Machine (APM) Design
3.2 Development of a Program for APM
3.3 Declaration of Data Mapping
3.4 Managing Access to Non-Local Data
3.5 The Specification of Mapping of the APM onto Virtual Parallel Machine (VPM)
4. Comparison of FDVM Model with Major Existing Models
4.1 Message Passing Models
4.2 Shared Memory Models
4.3 High Performance Fortran Model
The Keldysh Institute of Applied Mathematics is developing the programming language Fortran DVM (Distributed Virtual Memory) and related tools as a part of the RAMPA project [Kr93]. The main goal of this project is to automate the development of portable scientific and engineering applications for massively parallel computers with distributed memory.
The development of application programs for distributed systems encounters a few obstacles. Portability is the major one. We would like to emphasize two aspects of this problem:
The main goal of the development of the Fortran DVM language (FDVM) is to provide portability of parallel programs. Besides that, the FDVM language must support the development of efficient programs for massively parallel computers, whose performance is determined mainly by load balancing and by the overheads of interprocessor communication and synchronization.
The concept of the FDVM language is based on the following main ideas:
* The execution of a parallel program must correspond to its potential parallelism, explicitly declared by the user by means of special directives or parallel constructs. Even when enhanced with a specification of the data distribution, automatic parallelization has a principal defect: there is no explicit and clear model of parallel execution. This makes it impossible to provide the user with convenient tools for analyzing the execution and improving the performance of a program.
We believe that distributed shared memory computers are going to become the most popular among massively parallel systems. Therefore, to describe the potential parallelism of a program it is natural to use the means suggested in the PCF Fortran standard [PCF90]. These means embody extensive experience of using multiprocessor computers with shared memory. Certainly, it is necessary to adapt these means to distributed memory computers.
* In the case of non-uniform memory computers (among them are distributed systems with logically shared but physically distributed memory) it is necessary to provide the user with means to determine the data mapping and to distribute the computations among the processors. The means suggested in the HPF language [HPF93] are suitable for this purpose.
* In the case of computers with non-uniform memory it is also imperative to provide the user with means to control access to virtual memory. Such means were suggested and successfully used to control virtual memory in Fortran programs for the BESM-6 computer [Kon77].
* Finally, it is necessary to provide the user with means for dynamic load balancing.
For the sake of simplicity, let us suppose that programming in Fortran DVM is done in several stages, as follows.
The description of the abstract parallel machine (APM) is necessary for the user to control the mapping of computations onto processors. In the case of a distributed system, the APM is also needed to control data location.
Since PCF Fortran is geared to uniform multiprocessor computers with shared memory, the user is only able to determine the number of processors for the execution of specific parts of a program.
In the case of distributed systems, the mapping of computations onto processors demands more careful planning and cannot be performed as dynamically as it can on shared memory computers.
The abstract parallel machine is introduced to give the user an opportunity to specify thoroughly the internal potential parallelism of a program. Accordingly, the APM is a hierarchy of abstract parallel subsystems, each of which is a multidimensional array of subsystems of the next level of the hierarchy. The simultaneous existence of different variants of the description of each subsystem is allowed.
The APM may be described statically or constructed dynamically during the execution of a program. Dynamic construction is necessary for working with dynamic arrays, as well as for developing a library of standard parallel programs, which are stored as object modules and do not require special preliminary configuration.
To develop a program, the following model of its execution is used.
At the beginning of program execution there is only one branch (control flow), which is started on the APM from the first program statement. After entering a parallel construct (a parallel loop or a group of parallel sections) the branch is divided into several parallel branches, all of which are executed on specified APM subsystems. The branching is carried out according to the specified variant of representation of the current subsystem as an array of subsystems of the next level of the hierarchy.
On exit from the parallel construct all branches join into the single branch that existed prior to entering the parallel construct. At this moment all shared variables updated by the parallel branches become available to all processors executing this branch.
To avoid formal data dependences between parallel branches, which are an obstacle to independent execution, some variables (and specially described common blocks) may be declared as private. In this case an (uninitialized) copy of these variables is created in each parallel branch, and that copy is used independently of the others. Declaring data private also reduces the scope of data availability and eliminates needless work to maintain data consistency.
A mode of access to shared data in parallel branches must be specified. The following modes are provided:
In the last case, explicit synchronization of parallel branches should be used. Synchronization is available in several forms: events, arithmetic sequence synchronizers, and critical sections (as in PCF Fortran).
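As a rough illustration of this execution model, the fragment below is written in a PCF-Fortran-like notation (the keyword spellings follow the PCF/X3H5 proposals only approximately, and the FDVM notation itself had not yet been fixed at the time of writing, so this is a sketch rather than compilable code). A single branch reaches the parallel construct and splits into branches that divide the loop iterations among themselves; each branch has private copies of I and T, and updates of the shared accumulator S are protected by a critical section. On exit the branches join and the updated value of S becomes visible.

C     Sketch only: PCF-Fortran-like notation, approximate keywords.
      PROGRAM BRANCH
      REAL A(1000), S, T
      INTEGER I
      DO 5 I = 1, 1000
        A(I) = 1.0
    5 CONTINUE
      S = 0.0
C     A single branch executes up to this point.
      PARALLEL
      PRIVATE I, T
C     The branch splits; the iterations are divided among branches.
      PDO I = 1, 1000
        T = A(I) * A(I)
C       Updates of the shared variable S are synchronized explicitly.
        CRITICAL SECTION
          S = S + T
        END CRITICAL SECTION
      END PDO
C     On exit the branches join; the updated S is visible here.
      END PARALLEL
      PRINT *, S
      END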
Now let us determine the meaning of the phrase "the branch is executed on the subsystem". These words can be interpreted in two ways, depending on which APM architecture the user needs.
In the case of a shared memory machine the sequential parts of the program (before entering a parallel construct and after exiting it) are executed on the base processor of the subsystem. In the case of a distributed memory machine each processor of the subsystem executes the same sequence of statements in order to have its own copy of the variables.
In the case of a distributed system the user must determine the data location in the local memory of the APM processors. This is done by aligning the arrays and mapping them onto APM subsystems (ALIGN in HPF). Collapsed mapping (when many elements of an array may be mapped to the same subsystem) and replicated mapping (when one element of an array may be mapped to many subsystems) are also allowed.
Data replication is used when it is more efficient to repeat a calculation on different processors than to transfer the data between them. By default, all data not mapped by the user are duplicated on all processors of the subsystem on which the branch owning these data (the data are private to this branch) is executed.
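As an illustration of these mapping facilities, the fragment below uses HPF directives [HPF93], to which the FDVM means are analogous (the FDVM directives themselves were not yet fixed, and the array names and sizes are purely illustrative). Since the directives are comments, the fragment remains a valid serial Fortran program.

      PROGRAM MAPPING
      REAL A(100,100), B(100,100), D(100,100), C(100)
!HPF$ PROCESSORS P(4,4)
C     Block-distribute A over a 4 x 4 arrangement of processors.
!HPF$ DISTRIBUTE A(BLOCK,BLOCK) ONTO P
C     Ordinary alignment: B(I,J) is placed together with A(I,J).
!HPF$ ALIGN B(I,J) WITH A(I,J)
C     Collapsed mapping: the whole row D(I,:) is placed with A(I,1),
C     i.e. many elements are mapped to one location.
!HPF$ ALIGN D(I,*) WITH A(I,1)
C     Replicated mapping: C(J) is copied to every owner of column J
C     of A, i.e. one element is mapped to many locations.
!HPF$ ALIGN C(J) WITH A(*,J)
      END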
The user can indicate that, for some arrays, access to virtual memory should be performed as bulk preliminary transfers (after the required values have been computed by the corresponding processors) rather than as element-wise transfers (on request from individual processors).
In the case of pipelined computation, the programmer must find a balance between the number of transfers and synchronization events on the one hand and the throughput of the pipeline on the other.
The user can also specify the buffering mode of non-local variables in processor memory, which makes it possible to reduce interprocessor data exchange during repeated accesses.
The Virtual Parallel Machine (VPM) is the machine that is provided to the task by the architecture and the underlying system software. The structure and the number of processors of this machine should be close to those of the real parallel computer or a part of it. For a distributed computer, an MPI machine [MPI93] could be such an example.
The VPM, like the APM, may be represented as a hierarchy of subsystems. In this case each subsystem should be precisely defined (the numbers of all its virtual processors). A subsystem must include only processors that belong to its mother subsystem.
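For example, on an MPI machine [MPI93] a VPM subsystem can be represented by a communicator derived from the communicator of its mother subsystem, so that a subsystem contains only processors of that mother subsystem. The fragment below is a minimal sketch using standard MPI-1 calls; the particular division into two subsystems is chosen arbitrarily for illustration.

C     Sketch: a VPM subsystem as an MPI communicator derived from the
C     mother subsystem (here MPI_COMM_WORLD); the split is arbitrary.
      PROGRAM VPMSUB
      INCLUDE 'mpif.h'
      INTEGER IERR, RANK, NPROCS, COLOR, SUBCOM, SUBRNK
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROCS, IERR)
C     Divide the mother machine into two subsystems of processors.
      COLOR = 0
      IF (RANK .GE. NPROCS/2) COLOR = 1
      CALL MPI_COMM_SPLIT(MPI_COMM_WORLD, COLOR, RANK, SUBCOM, IERR)
C     Each processor now also has a rank within its subsystem.
      CALL MPI_COMM_RANK(SUBCOM, SUBRNK, IERR)
      PRINT *, 'rank', RANK, 'subsystem', COLOR, 'local', SUBRNK
      CALL MPI_COMM_FREE(SUBCOM, IERR)
      CALL MPI_FINALIZE(IERR)
      END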
Mapping is a correspondence between the APM and the VPM in which each subsystem of the APM is mapped onto a subsystem or a processor of the VPM. We suggest providing three ways of performing such mapping.
This way of mapping, as well as the means for using allocatable assumed-shape arrays and the means for dynamic creation of the APM and VPM, is necessary for dynamic load balancing.
Let us compare the FDVM model with the other models that can be used on distributed systems.
Currently these are the most widely used models. In these models the user specifies the precise mapping of data and computations onto the processors and arranges the cooperation of the processors by message passing. Although this approach seems to grant maximum efficiency of program execution, this is not always the case: many factors affect the efficiency of a program, and in general their optimal treatment at the application level is impossible. There are no internal reduction functions in most of these models, including PVM [Gei93], though such functions exist in MPI and EXPRESS [EXP88].
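For instance, a global sum, which has to be programmed by hand in PVM, is available in MPI as a built-in reduction. The fragment below is a minimal sketch; the contributed values and the choice of the root process are arbitrary.

C     Sketch: a global sum using the built-in reduction of MPI.
      PROGRAM REDSUM
      INCLUDE 'mpif.h'
      INTEGER IERR, RANK
      DOUBLE PRECISION PART, TOTAL
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
C     Each process contributes its partial result.
      PART = DBLE(RANK + 1)
C     The library combines the contributions on process 0; no
C     hand-written message passing is needed for this operation.
      CALL MPI_REDUCE(PART, TOTAL, 1, MPI_DOUBLE_PRECISION,
     &                MPI_SUM, 0, MPI_COMM_WORLD, IERR)
      IF (RANK .EQ. 0) PRINT *, 'global sum =', TOTAL
      CALL MPI_FINALIZE(IERR)
      END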
Although these models are quite convenient for expressing functional parallelism, in the case of data parallelism there are serious problems due to the lack of a global address and name space. The main defect of these models is the strict tie of a program to the hardware architecture, including the number of processors and their links. The development of libraries of standard programs is practically impossible.
The unquestionable advantage of these models is the absence of any need to develop special compilers and the possibility of using them with conventional languages extended by library functions.
These models use a global address space. PCF Fortran and Fortran S [Bod93] are the most characteristic of them.
The PCF Fortran standard notes that this model is suitable not only for shared memory computers but also for distributed memory computers. However, this is true mostly for systems such as the CRAY T3D and CONVEX SPP1000 (distributed shared memory computers), which provide access to non-local data nearly as fast as to local data. The use of the model on the majority of distributed systems is practically impossible due to the lack of means to control the mapping of data and computations.
The Fortran S model provides the user with means of mapping computations onto processors. Unfortunately, the use of software-simulated virtual memory in this model, without means to control it, inevitably reduces the efficiency of serious applications.
An advantage of these models, as of the FDVM model, is relatively simple compilation.
This model uses a global name space and user control over the mapping of data onto the local memories of the processors. The model supports data parallelism but does not support functional parallelism.
There are also two principal differences between the concepts of the Fortran DVM and High Performance Fortran languages.
First, a user of FDVM can fully control the distribution of computations among processors, while a user of HPF is not even supposed to know how a particular compiler will perform such distribution.
Second, a user of FDVM can fully control the transfer of data between processors and data buffering. A user of HPF has to rely on the "smartness" of the compiler in these matters.
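The following HPF fragment illustrates this division of responsibilities: the user distributes the data and asserts the independence of the loop iterations, while the placement of each iteration and the accompanying data transfers are decided entirely by the compiler. The array sizes and distribution are, of course, only illustrative.

      PROGRAM HPFLOOP
      REAL A(1000), B(1000)
      INTEGER I
!HPF$ DISTRIBUTE A(BLOCK)
!HPF$ ALIGN B(I) WITH A(I)
      DO 5 I = 1, 1000
         B(I) = REAL(I)
    5 CONTINUE
C     The user only asserts independence of the iterations; where each
C     iteration runs and what is communicated is up to the compiler.
!HPF$ INDEPENDENT
      DO 10 I = 1, 1000
         A(I) = 2.0 * B(I)
   10 CONTINUE
      END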
The main effect of the differences in approach mentioned above is that, at the cost of some additional effort, the user of FDVM is always able to make a program as efficient as it could be if conventional means of parallelization and message passing were used. Yet this does not involve any decrease in portability.
The following means of tuning the performance of FDVM programs, which are absent in HPF, are provided:
Means of processor load balancing:
Means of optimizing interprocessor communication:
The differences in approach between FDVM and HPF have led to different views on the role of compile-time optimizations.
The lack of an optimizing compiler for FDVM does not prevent the user from achieving the required efficiency, although the use of such a compiler might in certain cases free the user from providing the above annotations.
We believe that the success of the FDVM language depends on the successful development of the FDVM model. On the one hand, this model must meet the portability and efficiency requirements; on the other hand, it must be clear to an application programmer and relatively easy to implement. For this reason the developers focused on the design of the model, which was significantly modified three times during 1993-1994.
Presently the design of the run-time system (LIB-DVM) for this model is close to completion. This run-time system is the backbone of the implementation of the FDVM language.
The next stage of FDVM language development will be the manual programming of several applications (probably using a simple experimental compiler-preprocessor) in Fortran 77 or C extended with LIB-DVM facilities.
Only after the completion of the above stage will the formal FDVM language be described. If by that time means for functional parallelism and dynamic load balancing have been introduced into the HPF language, this will be reflected in the FDVM language.
This work is supported in part by a grant from the Russian Foundation for Fundamental Research, No. 93-012-628.