Organizations increasingly rely on diverse computer systems to perform a variety of knowledge-based tasks. This presents technical issues of interoperability and integration, as well as philosophical issues of how cooperation and interaction between computational entities are to be realized.
Cooperating systems are systems that work together towards a common end. These concepts of cooperation must be realized in technically sound system architectures that place a uniform meta-layer between the knowledge sources and the rest of the system. The layer consists of a family of interpreters, one for each knowledge source, together with meta-knowledge.
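As a purely illustrative sketch of this layering (the architecture described in the text builds on Prolog's meta-level facilities, not on Python classes), the code below keeps meta-knowledge about which knowledge source is competent for which kind of query and dispatches each query to that source's own interpreter; all names are hypothetical.

```python
# Minimal sketch of a uniform meta-layer over several knowledge sources.
# Hypothetical names; the architecture in the text is Prolog-based.

class KnowledgeSource:
    """A knowledge source paired with its own (trivial) interpreter."""
    def __init__(self, name, facts):
        self.name = name
        self.facts = facts                  # the source's private knowledge base

    def interpret(self, query):
        """Each source answers queries with its own interpreter."""
        return [f for f in self.facts if query in f]


class MetaLayer:
    """Meta-knowledge: which source is competent for which topic."""
    def __init__(self):
        self.sources = {}
        self.competence = {}                # topic -> source name

    def register(self, source, topics):
        self.sources[source.name] = source
        for t in topics:
            self.competence[t] = source.name

    def ask(self, topic, query):
        """Route the query to the interpreter of the competent source."""
        name = self.competence.get(topic)
        if name is None:
            return []                       # no source claims competence
        return self.sources[name].interpret(query)


if __name__ == "__main__":
    meta = MetaLayer()
    meta.register(KnowledgeSource("finance", ["budget approved", "budget frozen"]),
                  topics=["budget"])
    meta.register(KnowledgeSource("logistics", ["shipment delayed"]),
                  topics=["shipment"])
    print(meta.ask("budget", "budget"))     # answered by the finance source
```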
A system architecture to integrate and control diverse knowledge sources is presented. The architecture is based on the meta-level properties of the logic programming language Prolog. Knowledge-based systems play an important role in any up-to-date arsenal of decision support tools.
The tremendous growth of computer communications infrastructure has made distributed computing a viable option, and often a necessity in geographically distributed organizations. It has become clear that to take knowledge-based systems to their next useful level, it is necessary to get independent knowledge-based systems to work together, much as we put together ad hoc work groups in our organizations to tackle complex problems.
Researchers investigating autonomous agents, distributed computation, and cooperating systems will find fresh ideas and new perspectives on well-established problems. Modern heterogeneous systems consist of host cores coupled with acceleration units that provide massive compute capabilities within limited power budgets.
This complex scenario poses a key challenge: how do we optimize data movement between the host core and the accelerators from a holistic, system-level perspective? My research focuses on addressing this question. Data movement optimizations can be explored in two flavors: (1) maximizing locality by keeping data close to its compute, and (2) moving the computation itself close to the data.
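One way to see the trade-off between the two flavors is a back-of-the-envelope cost comparison: shipping the data to a fast accelerator versus running the computation on a slower engine that already sits next to the data. The sketch below is a toy cost model with invented bandwidth and throughput figures, not a description of any real system.

```python
# Toy cost model: move data to the accelerator vs. compute near the data.
# All parameters are illustrative assumptions.

def offload_time(bytes_moved, link_gbps, accel_gflops, flops):
    """Copy data over the host-accelerator link, then compute remotely."""
    transfer = bytes_moved / (link_gbps * 1e9 / 8)   # seconds to move the data
    compute = flops / (accel_gflops * 1e9)           # seconds to compute on the accelerator
    return transfer + compute

def near_data_time(near_gflops, flops):
    """Run the (slower) compute engine next to the data; no bulk copy."""
    return flops / (near_gflops * 1e9)

if __name__ == "__main__":
    flops = 2e9                  # work in floating-point operations
    data = 4e9                   # bytes touched
    t_offload = offload_time(data, link_gbps=16, accel_gflops=5000, flops=flops)
    t_near = near_data_time(near_gflops=200, flops=flops)
    better = "move compute to data" if t_near < t_offload else "move data to compute"
    print(f"offload: {t_offload:.3f}s  near-data: {t_near:.3f}s  -> {better}")
```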
Such optimizations depend fundamentally on the applications as well as on their interaction with the underlying architectures. Exploring the associated tradeoffs first and foremost requires an accurate modeling infrastructure. To that end, I first propose a systematic simulator calibration methodology that provides a faithful baseline for accurately modeling the targeted system architectures.
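The calibration idea can be pictured as a simple parameter fit: measure a few microbenchmarks on real hardware, then tune the simulator's knobs until simulated timings match the measurements. The sketch below is a generic illustration of that loop with a toy analytic "simulator" and a single latency parameter; it is not the author's actual methodology or any particular simulator's interface.

```python
# Illustrative calibration loop: tune one simulator parameter (memory latency)
# so that simulated runtimes match measured runtimes on microbenchmarks.
# The "simulator" here is a toy analytic model; real calibration would drive
# a cycle-level simulator instead.

MEASURED = {            # microbenchmark -> measured runtime on hardware (ms)
    "pointer_chase": 12.0,
    "stream_copy":   3.5,
}
WORK = {                # microbenchmark -> (memory accesses, misses per access)
    "pointer_chase": (1_000_000, 1.0),
    "stream_copy":   (1_000_000, 0.1),
}

def simulate(latency_ns, bench):
    """Toy model: runtime is dominated by miss_count * memory latency."""
    accesses, miss_rate = WORK[bench]
    return accesses * miss_rate * latency_ns * 1e-6   # milliseconds

def calibrate(candidates):
    """Pick the latency value that minimizes total relative error."""
    def error(lat):
        return sum(abs(simulate(lat, b) - t) / t for b, t in MEASURED.items())
    return min(candidates, key=error)

if __name__ == "__main__":
    best = calibrate(candidates=range(5, 101))        # try 5..100 ns
    for b in MEASURED:
        print(b, "simulated", round(simulate(best, b), 2), "measured", MEASURED[b])
    print("calibrated memory latency:", best, "ns")
```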
In this section, we identify the efforts needed to devise suitable algorithms. The following issues must be considered by the designer: (1) the types of machines available and their inherent computing characteristics, (2) alternate solutions to the various subproblems of the application, and (3) the costs of performing the communication over the network. Computations in HC can be classified into two types. Metacomputing refers to heterogeneity at the module level; computations in this class fall into the category of coarse-grained heterogeneity. Instructions belonging to a particular class of parallelism are grouped to form a module, and each module is then executed on a suitable parallel machine. Mixed-mode computing, in contrast, refers to fine-grained heterogeneity, in which almost every alternate parallel instruction belongs to a different class of parallel computation. Programs exhibiting this type of heterogeneity are not suitable for execution on a suite of heterogeneous machines, because the communication overhead due to the frequent exchange of information between machines can become a bottleneck; they are better served by a single machine that can switch between SIMD and MIMD modes, improving its performance over a machine operating in SIMD or MIMD mode alone.
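To make the coarse-grained versus fine-grained distinction concrete, here is a small illustrative sketch (not taken from the article): it groups consecutive instruction blocks of the same parallelism class into modules, and flags code whose class changes so often that inter-machine communication would dominate.

```python
# Illustrative only: group instruction blocks by parallelism class into modules,
# and decide whether the program looks coarse-grained (metacomputing) or
# fine-grained (better suited to a single mixed-mode machine).
from itertools import groupby

def to_modules(blocks):
    """blocks: list of (parallelism_class, instruction) pairs, in program order.
    Consecutive blocks of the same class become one module."""
    return [(cls, [ins for _, ins in grp])
            for cls, grp in groupby(blocks, key=lambda b: b[0])]

def classify(blocks, switch_threshold=0.5):
    """If the parallelism class switches on most block boundaries, treat the
    code as fine-grained (mixed-mode); otherwise as coarse-grained."""
    modules = to_modules(blocks)
    switch_rate = (len(modules) - 1) / max(len(blocks) - 1, 1)
    kind = ("mixed-mode (fine-grained)" if switch_rate > switch_threshold
            else "metacomputing (coarse-grained)")
    return kind, modules

if __name__ == "__main__":
    program = [("SIMD", "i0"), ("SIMD", "i1"), ("SIMD", "i2"),
               ("MIMD", "i3"), ("MIMD", "i4"), ("vector", "i5")]
    kind, modules = classify(program)
    print(kind)
    for cls, instrs in modules:
        print(f"  module ({cls}): {instrs}")
```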
Figure 4 illustrates a compiler-directed approach. The various types of parallelism present in the application are identified, and all communication and computation requirements of the application are preserved in an intermediate specification of the code. The architecture of each machine in the environment is modeled in a system representation, which captures the interconnections of the architecture. Among the four components of this approach are a tool that lets users specify topologies and a mapping module that matches the problem specification to the system representation; note that analytical benchmarking results are used in partitioning and mapping. Because only one algorithm is required for a given application, the approach also aids the design of portable software.

Code-type profiling. Fast parallel execution of code in a heterogeneous computing environment requires identifying and profiling the embedded parallelism. Traditional program profiling involves testing a program, assumed to consist of several modules, by executing it on suitable test data; the profiler monitors the execution and gathers statistics, including the execution time of each program module. Code-type profiling, as introduced by Freund, additionally gathers information such as the types of parallelism present in the various modules and their estimated execution times. Code types that can be identified include vector, SIMD, MIMD, and special-purpose computations (such as the fast Fourier transform).

Analytical benchmarking. This test measures how well the available machines execute a given code type, that is, the efficiency of a given parallel machine on various types of computation. Like code-type profiling, it is an off-line process, and it is more rigorous than traditional benchmarking.

Partitioning and mapping. Problems that occur in these areas of a homogeneous parallel environment have been widely studied. The partitioning problem can be divided into two subproblems. Parallelism detection determines the parallelism present in a given program; clustering combines several operations into a program module and thus partitions the application into several modules. These two subproblems interact and impose additional constraints on clustering. Mapping allocates program modules to machines. Informally, in homogeneous environments the mapping problem can be defined as assigning program modules to processors so that the communication cost is minimized; in an HC environment, however, other objectives must be considered as well, and the mapping may have to be performed at runtime for load-balancing purposes. Mapping in HC can be viewed conceptually at two levels: system-level mapping assigns each module to one or more machines in the environment, while machine-level mapping assigns portions of a module to individual processors within a machine. Figure 5 illustrates a Cluster-M-based heuristic mapping methodology.
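As a deliberately simplified illustration of system-level mapping (not the Cluster-M methodology of Figure 5), the sketch below greedily assigns each module to the machine that minimizes its estimated execution time plus the communication cost with already-placed modules; the timing and communication numbers stand in for code-type profiling and analytical benchmarking results.

```python
# Illustrative greedy system-level mapping: assign modules to machines so that
# estimated execution time plus inter-machine communication stays low.
# All numbers are made up; a real mapper would use profiling/benchmarking data.

exec_time = {            # module -> {machine: estimated execution time}
    "fft":    {"vector": 2.0, "simd": 5.0, "mimd": 9.0},
    "search": {"vector": 8.0, "simd": 6.0, "mimd": 3.0},
    "reduce": {"vector": 4.0, "simd": 3.0, "mimd": 4.0},
}
comm = {("fft", "search"): 2.5, ("search", "reduce"): 1.0}   # data exchanged
LINK_COST = 1.0          # cost per unit of data when modules sit on different machines

def comm_cost(module, machine, placed):
    """Communication paid if this module lands away from its partners."""
    cost = 0.0
    for (a, b), volume in comm.items():
        other = b if a == module else a if b == module else None
        if other in placed and placed[other] != machine:
            cost += volume * LINK_COST
    return cost

def greedy_map(modules):
    placed = {}
    for m in modules:                       # place modules in the given order
        placed[m] = min(exec_time[m],
                        key=lambda mc: exec_time[m][mc] + comm_cost(m, mc, placed))
    return placed

if __name__ == "__main__":
    mapping = greedy_map(["fft", "search", "reduce"])
    print(mapping)   # e.g. {'fft': 'vector', 'search': 'mimd', 'reduce': 'simd'}
```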
Machine selection. An interesting problem appears in the design of HC: choosing the most appropriate suite of heterogeneous machines for a given collection of application tasks subject to a given constraint. Various approaches to optimal and approximate selection of a machine configuration for executing an application task have been proposed. Optimal selection theory (OST) assumes that machines matching the given set of code types are available and that the application code is decomposed into equal-sized modules. Augmented optimal selection theory (AOST) incorporates the performance of code segments on nonoptimal machine choices, using results obtained by analytical benchmarking; in this approach, a program module most suitable for one type of machine may be assigned to another type of machine. In the formulation of OST and AOST, it has been assumed that the execution of all program modules of a given application code is totally ordered in time. In reality, however, different execution interdependencies can exist among program modules. Also, parallelism can be present inside a module, resulting in further decomposition of program modules. Furthermore, the effect of the different mappings available on different machines for a program module has not been considered in the formulation of these selection theories. Heterogeneous optimal selection theory (HOST) addresses these limitations: it incorporates the effect of the various mapping techniques available on different machines for executing a program module, as well as the dependencies among modules.
In the formulation of HOST, an application code is assumed to consist of subtasks; each subtask contains a collection of program modules, and each program module is further decomposed into blocks of parallel instructions, called code blocks. To find an optimal set of machines, the program modules must be assigned to the machines so that the total execution time is minimal while the total cost of the machines employed in the solution does not exceed an upper bound. The scheme can also solve the dual problem, that is, finding a least expensive set of machines to solve a given application subject to a maximal execution time constraint. This scheme is applicable to all of the above selection theories. Its accuracy, however, depends upon the method used to assign estimated execution times; the estimated execution time corresponding to the assignment under consideration can be obtained by using code-type profiling and/or by analyzing the algorithms. Iqbal also shows that for applications in which the program modules communicate in a restrictive manner, one can find exact algorithms for selecting an optimal set of machines; if, however, the program modules communicate in an arbitrary fashion, the selection problem is NP-complete.
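The optimization just described can be made concrete with a small, deliberately brute-force sketch: choose the subset of machine types that minimizes total estimated execution time while keeping total machine cost under a bound. All numbers are invented, modules are assumed to run one after another, and the exhaustive search is only viable for tiny inputs, which is consistent with the NP-completeness of the general problem.

```python
# Illustrative brute-force machine selection: minimize total execution time
# subject to an upper bound on machine cost. Exponential in the number of
# machine types, so only usable for tiny examples.
from itertools import combinations

MACHINE_COST = {"vector": 6, "simd": 4, "mimd": 5}
# module -> {machine type: estimated execution time from profiling/benchmarking}
MODULE_TIME = {
    "fft":    {"vector": 2.0, "simd": 5.0, "mimd": 9.0},
    "search": {"vector": 8.0, "simd": 6.0, "mimd": 3.0},
    "reduce": {"vector": 4.0, "simd": 3.0, "mimd": 4.0},
}

def total_time(machine_set):
    """Each module runs on the best machine in the chosen set; modules are
    assumed to execute one after another (totally ordered in time)."""
    return sum(min(times[m] for m in machine_set) for times in MODULE_TIME.values())

def select_machines(cost_bound):
    best = None
    types = list(MACHINE_COST)
    for r in range(1, len(types) + 1):
        for subset in combinations(types, r):
            if sum(MACHINE_COST[m] for m in subset) > cost_bound:
                continue                      # violates the cost constraint
            cand = (total_time(subset), subset)
            if best is None or cand < best:
                best = cand
    return best

if __name__ == "__main__":
    time, machines = select_machines(cost_bound=10)
    print("chosen machines:", machines, "estimated time:", time)
```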
Scheduling. High-level scheduling, also called job scheduling, selects a subset of all submitted jobs competing for the available resources. Intermediate-level scheduling responds to fluctuations in the system load by temporarily suspending and resuming processes. Low-level scheduling determines the next ready process to be assigned to a processor for a certain duration. Policies such as shortest-job-first and shortest-remaining-time can be employed at each level of scheduling. While all three levels of scheduling can reside in each machine in an HC environment, a fourth level is needed to perform scheduling at the system level. This system-level scheduler maintains a balanced system-wide workload by monitoring the progress of all program modules; in addition, it needs to know the different module types and the available machine types in the environment, since modules may have to be reassigned when the system configuration changes.
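As a toy illustration of that fourth, system-wide level (not an algorithm from the article), the following sketch monitors per-module progress and reassigns a lagging module from the busiest machine to the idlest one; the progress model and the rebalancing rule are invented for the example.

```python
# Illustrative system-level scheduler: monitor module progress across machines
# and reassign lagging modules to keep the system-wide workload balanced.

modules = {   # module -> {"machine": current machine, "progress": fraction done}
    "fft":    {"machine": "vector", "progress": 0.9},
    "search": {"machine": "mimd",   "progress": 0.2},
    "reduce": {"machine": "mimd",   "progress": 0.3},
}

def load(machine):
    """Remaining work currently assigned to a machine."""
    return sum(1.0 - m["progress"] for m in modules.values()
               if m["machine"] == machine)

def rebalance(machines, imbalance=0.5):
    """Move the least-finished module off the busiest machine if the spread
    between busiest and idlest machine exceeds the imbalance threshold."""
    busiest = max(machines, key=load)
    idlest = min(machines, key=load)
    if load(busiest) - load(idlest) <= imbalance:
        return None                           # workload already balanced enough
    candidates = [n for n, m in modules.items() if m["machine"] == busiest]
    victim = min(candidates, key=lambda n: modules[n]["progress"])
    modules[victim]["machine"] = idlest       # reassign the lagging module
    return victim, busiest, idlest

if __name__ == "__main__":
    move = rebalance(["vector", "simd", "mimd"])
    print("reassigned:", move)                # e.g. ('search', 'mimd', 'simd')
```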
Synchronization. This process enforces proper sequencing and supervises interprocess cooperation. It refers to three distinct but related problems, among them the serialization of concurrent accesses to shared objects by multiple processes.
Heterogeneous System Architecture: A New Compute Platform Infrastructure presents a next-generation hardware platform, and associated software, that allows processors of different types to work efficiently and cooperatively in shared memory from a single source program.
HSA also defines a virtual ISA for parallel routines, or kernels, which is vendor- and ISA-independent, thus enabling single-source programs to execute across any HSA-compliant heterogeneous processor, from those used in smartphones to supercomputers. The book begins with an overview of the evolution of heterogeneous parallel processing, the problems associated with it, and how they are overcome with HSA. Later chapters provide a deeper perspective on topics such as the runtime, the memory model, queuing, context switching, the Architected Queuing Language, simulators, and tool chains.
Contributing authors are HSA Foundation members who are experts from both academia and industry. Some of these distinguished authors are listed here in alphabetical order: Yeh-Ching Chung, Benedict R.
The book provides clear and concise explanations of key HSA concepts and fundamentals by expert HSA Specification contributors; explains how performance-bound programming algorithms and application types can be significantly optimized by utilizing HSA hardware and software features; presents HSA simply, clearly, and concisely, without requiring the reader to work through the detailed HSA Specification documents; and demonstrates the mapping of processing resources from CPUs to the many other heterogeneous processors that comply with the HSA Specifications.
The transitions to multicore processors, GPU computing, and cloud computing are not separate trends, but aspects of a single trend: mainstream computers, from desktops to smartphones, are being permanently transformed into heterogeneous supercomputer clusters. The reader will gain an organic perspective of modern heterogeneous systems and their future evolution. Most emerging applications in imaging and machine learning must perform immense amounts of computation while holding to strict limits on energy and power.
To meet these goals, architects are building increasingly specialized compute engines tailored for these specific tasks. The resulting computer systems are heterogeneous, containing multiple processing cores with wildly different execution models. Unfortunately, the cost of producing this specialized hardware—and the software to control it—is astronomical. Moreover, the task of porting algorithms to these heterogeneous machines typically requires that the algorithm be partitioned across the machine and rewritten for each specific architecture, which is time consuming and prone to error.
Over the last several years, the authors have approached this problem using domain-specific languages (DSLs): high-level programming languages customized for specific domains, such as database manipulation, machine learning, or image processing.
By giving up generality, these languages are able to provide high-level abstractions to the developer while producing high performance output.
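To give a flavor of the approach, here is a toy, pure-Python sketch, not Halide or Darkroom: the algorithm is written as point-wise stages over pixel coordinates, and a separate "schedule" decision controls whether an intermediate stage is recomputed on demand or cached like a materialized buffer. Real DSLs compile such descriptions to efficient CPU, GPU, or FPGA implementations.

```python
# Toy sketch of the DSL idea: the algorithm is a composition of pure functions
# over pixel coordinates; the "schedule" decides how stages are evaluated.
from functools import lru_cache

W, H = 6, 4
def clamp(v, lo, hi): return max(lo, min(v, hi))

def input_image(x, y):                      # stand-in for a loaded image
    return (x * 7 + y * 13) % 256

def blur_x(src):
    """Horizontal 3-tap box blur, expressed point-wise."""
    return lambda x, y: (src(clamp(x - 1, 0, W - 1), y) + src(x, y)
                         + src(clamp(x + 1, 0, W - 1), y)) // 3

def blur_y(src):
    """Vertical 3-tap box blur, expressed point-wise."""
    return lambda x, y: (src(x, clamp(y - 1, 0, H - 1)) + src(x, y)
                         + src(x, clamp(y + 1, 0, H - 1))) // 3

def realize(stage):
    """'Schedule' 1: evaluate the whole pipeline per output pixel (fused)."""
    return [[stage(x, y) for x in range(W)] for y in range(H)]

def materialized(stage):
    """'Schedule' 2: cache the stage so producer values are computed once,
    like materializing an intermediate buffer."""
    return lru_cache(maxsize=None)(stage)

if __name__ == "__main__":
    fused = realize(blur_y(blur_x(input_image)))
    staged = realize(blur_y(materialized(blur_x(input_image))))
    print(fused == staged)                  # same algorithm, different schedules
```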
The purpose of this book is to spur the adoption and the creation of domain-specific languages, especially for the task of creating hardware designs. In the first chapter, a short historical journey explains the forces driving computer architecture today. Chapter 2 describes the various methods for producing designs for accelerators, outlining the push for more abstraction and the tools that enable designers to work at a higher conceptual level.
From there, Chapter 3 provides a brief introduction to image processing algorithms and hardware design patterns for implementing them. Chapters 4 and 5 describe and compare Darkroom and Halide, two domain-specific languages created for image processing that produce high-performance designs for both FPGAs and CPUs from the same source code, enabling rapid design cycles and quick porting of algorithms. The final section describes how the DSL approach also simplifies the problem of interfacing between application code and the accelerator by generating the driver stack in addition to the accelerator configuration.
This book should serve as a useful introduction to domain-specialized computing for computer architecture students and as a primer on domain-specific languages and image processing hardware for those with more experience in the field. Computer architects are beginning to embrace heterogeneous systems as an effective method to utilize increases in transistor densities for executing a diverse range of workloads under varying performance and energy constraints.
As heterogeneous systems become more ubiquitous, architects will need to develop novel CPU scheduling techniques capable of exploiting the diversity of computational resources. In recognizing hardware diversity, state-of-the-art heterogeneous schedulers are able to produce significant performance improvements over their predecessors and enable more flexible system designs.
Nearly all of these schedulers, however, are unable to efficiently identify the mapping schemes that will result in the highest system performance. The most obvious heterogeneity is the existence of computing nodes of different capabilities, but other heterogeneity factors exist as well.
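As a minimal sketch of the kind of mapping decision such a scheduler faces (not any published policy), the code below hands the scarce big cores to the tasks that gain the most from them, using invented per-task runtimes in place of profiled or counter-derived estimates.

```python
# Illustrative mapping for a heterogeneous CPU (big/little cores): give the
# scarce big cores to the tasks that benefit from them the most.

TASKS = {           # task -> (runtime on a little core, runtime on a big core)
    "render":  (10.0, 4.0),
    "decode":  (6.0,  5.0),
    "network": (3.0,  2.8),
    "physics": (9.0,  3.5),
}
NUM_BIG_CORES = 2

def map_tasks():
    # benefit of a big core = time saved compared with running on a little core
    by_benefit = sorted(TASKS, key=lambda t: TASKS[t][0] - TASKS[t][1], reverse=True)
    placement = {}
    for i, task in enumerate(by_benefit):
        placement[task] = "big" if i < NUM_BIG_CORES else "little"
    return placement

if __name__ == "__main__":
    for task, core in map_tasks().items():
        little, big = TASKS[task]
        print(f"{task:8s} -> {core:6s} (little {little}s, big {big}s)")
```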
Heterogeneous Computing with OpenCL 2.0. Abstract: With the advent of new commodity depth sensors, point cloud data processing plays an increasingly important role in object recognition and perception.