Introduction

Overview

The Ensemble Toolkit (EnTK) is a Python framework for developing and executing applications composed of multiple sets of tasks, also known as ensembles. It was originally developed with ensemble-based applications in mind. As our understanding of the variety of workflows in scientific applications improved, we realized that our approach needed to be more generic. Although ensemble-based applications remain our motivation, from EnTK 0.6 onwards any application whose task workflow can be expressed as a Directed Acyclic Graph (DAG) is supported.

The Ensemble Toolkit has the following unique features: (i) abstractions that enable the expression of various task graphs, (ii) abstraction of resource management and task execution, (iii) fault tolerance as a first-order concern, and (iv) well-established runtime capabilities that enable efficient and dynamic usage of grid resources and supercomputers.

We now discuss the high-level design of Ensemble Toolkit to show how an application is created and executed.

Design

Figure 1: High level design of Ensemble Toolkit

Ensemble Toolkit consists of several components that serve different purposes. Three of them, Pipeline, Stage and Task, are user-level components: the user works with them directly to create the application by describing its task graph. We will soon look at how they are used to create an application.
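To make the three user-level components concrete, here is a minimal sketch of how a task graph could be described with them. The classes below are illustrative stand-ins written for this page, not the classes shipped with Ensemble Toolkit, and the sketch assumes that tasks within a Stage are independent of each other and that the Stages of a Pipeline run in order. All names, executables and arguments are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    """One unit of work: an executable plus its arguments (stand-in class)."""
    name: str
    executable: str
    arguments: List[str] = field(default_factory=list)


@dataclass
class Stage:
    """A group of tasks assumed to have no dependencies among them."""
    name: str
    tasks: List[Task] = field(default_factory=list)

    def add_task(self, task: Task) -> None:
        self.tasks.append(task)


@dataclass
class Pipeline:
    """An ordered list of stages; a stage is assumed to start after the previous one ends."""
    name: str
    stages: List[Stage] = field(default_factory=list)

    def add_stage(self, stage: Stage) -> None:
        self.stages.append(stage)


def build_example() -> Pipeline:
    """Build a two-stage pipeline: three simulation tasks, then one analysis task."""
    pipeline = Pipeline(name="sim-then-analyze")

    simulate = Stage(name="simulate")
    for i in range(3):
        simulate.add_task(Task(f"sim.{i}", "/bin/echo", [f"simulating {i}"]))

    analyze = Stage(name="analyze")
    analyze.add_task(Task("ana.0", "/bin/echo", ["analyzing"]))

    pipeline.add_stage(simulate)
    pipeline.add_stage(analyze)
    return pipeline


example = build_example()
stage_sizes = [len(stage.tasks) for stage in example.stages]
```

In this sketch the task graph is implied by the nesting: the analysis task depends on all three simulation tasks only because its Stage comes second in the Pipeline.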

The Application Manager is an internal component that takes the workflow described by the user and converts it into a set of workloads, i.e., tasks with no dependencies. It does so by parsing the workflow and identifying, at runtime, the tasks that have equivalent dependencies or none. The Application Manager also accepts the description of the resource request to be created (resource label, walltime, CPUs, GPUs, user credentials).
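The idea of turning a workflow into workloads can be illustrated with a small, self-contained sketch. It is not the Application Manager's actual implementation: it assumes the workflow is given as a plain dictionary mapping each task name to the names of the tasks it depends on, and it treats every workload as finished before computing the next one. The function names and the example workflow are hypothetical.

```python
from typing import Dict, List, Set


def next_workload(dependencies: Dict[str, Set[str]], done: Set[str]) -> List[str]:
    """Return the tasks that are not done and whose dependencies are all done."""
    return sorted(
        task
        for task, deps in dependencies.items()
        if task not in done and deps <= done
    )


def workloads(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Split a DAG into successive workloads of mutually independent tasks."""
    done: Set[str] = set()
    result: List[List[str]] = []
    while len(done) < len(dependencies):
        ready = next_workload(dependencies, done)
        if not ready:
            # Nothing can run although tasks remain: the graph is not a valid DAG.
            raise ValueError("cycle or unknown dependency in workflow")
        result.append(ready)
        done.update(ready)  # pretend the whole workload ran to completion
    return result


# Hypothetical workflow: two simulations feed one analysis, then a report.
workflow = {
    "sim.0": set(),
    "sim.1": set(),
    "analysis": {"sim.0", "sim.1"},
    "report": {"analysis"},
}
plan = workloads(workflow)
```

Here `plan` holds three workloads: both simulations, then the analysis, then the report. A real manager would compute the next workload as tasks finish at runtime rather than planning everything up front.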

The Execution Manager is the last component in Ensemble Toolkit. It accepts the workloads prepared by the Application Manager and executes them on the specified resource using a runtime system (RTS). Internally, it consists of two subcomponents, ResourceManager and TaskManager, which are responsible for the allocation, management, and deallocation of resources and for the execution management of tasks, respectively. The Execution Manager is currently configured to use RADICAL Pilot (RP) as the runtime system, but it can be extended to other RTSs.
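The split between resource handling, task handling and a replaceable runtime system can be sketched as follows. This is an architectural illustration only: the `RuntimeSystem` interface, its method names and the thread-based `LocalRTS` are assumptions made for this example and do not reflect the interfaces of Ensemble Toolkit or RADICAL Pilot.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


class RuntimeSystem(ABC):
    """Hypothetical interface that an RTS plug-in would implement."""

    @abstractmethod
    def acquire(self, resource_desc: Dict[str, int]) -> None: ...

    @abstractmethod
    def execute(self, workload: List[Callable[[], object]]) -> List[object]: ...

    @abstractmethod
    def release(self) -> None: ...


class LocalRTS(RuntimeSystem):
    """Toy RTS that runs tasks in-process on a thread pool."""

    def __init__(self) -> None:
        self._pool: Optional[ThreadPoolExecutor] = None

    def acquire(self, resource_desc: Dict[str, int]) -> None:
        self._pool = ThreadPoolExecutor(max_workers=resource_desc["cpus"])

    def execute(self, workload: List[Callable[[], object]]) -> List[object]:
        if self._pool is None:
            raise RuntimeError("no resources acquired")
        futures = [self._pool.submit(task) for task in workload]
        return [future.result() for future in futures]  # keeps submission order

    def release(self) -> None:
        if self._pool is not None:
            self._pool.shutdown(wait=True)
            self._pool = None


class ResourceManager:
    """Allocates and deallocates resources through the RTS."""

    def __init__(self, rts: RuntimeSystem, resource_desc: Dict[str, int]) -> None:
        self.rts, self.resource_desc = rts, resource_desc

    def allocate(self) -> None:
        self.rts.acquire(self.resource_desc)

    def deallocate(self) -> None:
        self.rts.release()


class TaskManager:
    """Hands one workload at a time to the RTS."""

    def __init__(self, rts: RuntimeSystem) -> None:
        self.rts = rts

    def run(self, workload: List[Callable[[], object]]) -> List[object]:
        return self.rts.execute(workload)


class ExecutionManager:
    """Runs workloads in order on resources that are released afterwards."""

    def __init__(self, rts: RuntimeSystem, resource_desc: Dict[str, int]) -> None:
        self.resources = ResourceManager(rts, resource_desc)
        self.tasks = TaskManager(rts)

    def run(self, workloads: List[List[Callable[[], object]]]) -> List[List[object]]:
        self.resources.allocate()
        try:
            return [self.tasks.run(workload) for workload in workloads]
        finally:
            self.resources.deallocate()


results = ExecutionManager(LocalRTS(), {"cpus": 2}).run(
    [[lambda: 1 + 1, lambda: 2 * 3], [lambda: "done"]]
)
```

Because the two managers only talk to the abstract interface, supporting another runtime system in this sketch would mean writing one more `RuntimeSystem` subclass and leaving the managers untouched.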

Ensemble Toolkit uses a runtime system as a framework to execute tasks on high performance computing (HPC) platforms. The runtime system is expected to manage the direct interactions with the various software and hardware layers of the HPC platforms, including their heterogeneities.

More details about how EnTK is designed and implemented can be found here.

Dependencies

Ensemble Toolkit uses RADICAL Pilot (RP) as the runtime system. RP currently targets only a specific set of high performance computing (HPC) systems (see here). Support for more HPC systems can be added either by contacting the developers of RP/EnTK or by users themselves, following this page.

EnTK also has profiling capabilities and uses Pandas dataframes to store the data. Users can use these dataframes to wrangle the data or directly plot the required fields.

Dependencies such as RP and Pandas are automatically installed when installing EnTK.

Five steps to create an application

  1. Use the Pipeline, Stage and Task components to create the workflow.

  2. Create an Application Manager (Amgr) object with required parameters/configurations.

  3. Describe the resource request to be created. Assign the resource request description and the workflow to the Amgr.

  4. Run the Application Manager.

  5. Sit back and relax!
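Step 3 above amounts to filling in a small description of the resources to request. The sketch below shows one way such a description could be checked before it is handed to the Application Manager. The key names (`resource`, `walltime`, `cpus`, `gpus`) follow the fields listed in the design section, but which of them are required, their types, the default for `gpus`, the unit of `walltime` (assumed to be minutes) and the label `local.localhost` are all assumptions of this example; user credentials are left out. The helper `validate_resource_desc` is hypothetical and not part of Ensemble Toolkit.

```python
from typing import Any, Dict

# Assumed key names and types; consult the EnTK documentation for the real ones.
REQUIRED_KEYS = {"resource": str, "walltime": int, "cpus": int}
OPTIONAL_KEYS = {"gpus": int}


def validate_resource_desc(desc: Dict[str, Any]) -> Dict[str, Any]:
    """Check a resource description and return a copy with defaults filled in."""
    allowed = {**REQUIRED_KEYS, **OPTIONAL_KEYS}

    missing = sorted(set(REQUIRED_KEYS) - set(desc))
    if missing:
        raise ValueError(f"missing required keys: {missing}")

    unknown = sorted(set(desc) - set(allowed))
    if unknown:
        raise ValueError(f"unknown keys: {unknown}")

    for key, value in desc.items():
        if not isinstance(value, allowed[key]):
            raise TypeError(f"{key!r} must be of type {allowed[key].__name__}")

    checked = dict(desc)
    checked.setdefault("gpus", 0)  # assume no GPUs unless requested
    return checked


# Hypothetical request: 4 CPUs on a local resource for 10 (assumed) minutes.
resource_desc = validate_resource_desc(
    {"resource": "local.localhost", "walltime": 10, "cpus": 4}
)
```

Checking the description early means a typo in a key name fails immediately on the user's machine instead of after the request has been submitted.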

Jump ahead to take a look at the step-by-step instructions for an example script here.

Intended users

Ensemble Toolkit is completely Python based and requires familiarity with the Python language.

Our primary focus is to support domain scientists and enable them to execute their applications at scale on various HPC platforms. Some of our users include:

User Groups                      Domain
-------------------------------  ---------------------------------
University of Colorado, Denver   Biochemistry/Biophysics
Penn State University            Climate Science
Princeton University             Seismology
University College London        Biochemistry/Biophysics, Medicine
Rice University                  Biochemistry/Biophysics
Stony Brook University           Polar Science
Northern Arizona University      Polar Science
Oak Ridge National Laboratory    Biochemistry/Biophysics