.. _uguide_add_data: *********** Adding Data *********** Data management is one of the important elements of any application. In this section, we will take a look at how we can manage data between tasks. .. note:: The reader is assumed to be familiar with the :ref:`PST Model ` and to have read through the :ref:`introduction` of Ensemble Toolkit. .. note:: This chapter assumes that you have successfully installed Ensemble Toolkit, if not see :ref:`Installation`. You can download the complete code discussed in this section :download:`here <../../../examples/user_guide/add_data.py>` or find it in your virtualenv under ``share/radical.entk/user_guide/scripts``. In the following example, we will create a Pipeline of two Stages, each with one Task. In the first stage, we will create a file of size 1MB. In the next stage, we will perform a character count on the same file and write to another file. Finally, we will bring that output to the current location. Since we already know how to create this workflow, we simply present the code snippet concerned with the data movement. The task in the second stage needs to perform two data movements: a) copy the data created in the first stage to its current directory and b) bring the output back to the directory where the script is executed. Below we present the statements that perform these operations. .. literalinclude:: ../../../examples/user_guide/add_data.py :language: python :lines: 46-53 :dedent: 4 Intermediate data are accessible through a unique identification formed by `$Pipeline_%s_Stage_%s_Task_%s/{filename}` where `%s` is replaced by entity name of pipeline, stage and task. If name is not given, `.uid` is available to locate files across tasks. :ref:`task_api` has more information about the use of API. Let's take a look at the complete code in the example. You can generate a more verbose output by setting the environment variable ``RADICAL_LOG_LVL=DEBUG``. .. literalinclude:: ../../../examples/user_guide/add_data.py Handling data from login node ============================= The following example shows how to configure a task to fetch data at runtime, when compute nodes do not have internet access. .. code-block:: python t = Task() t.name = 'download-task' t.executable = '/usr/bin/ssh' t.arguments = ["@", "'bash -l /path/to/download_script.sh '"] t.download_output_data = ['STDOUT', 'STDERR'] t.cpu_reqs = {'cpu_processes': 1, 'cpu_process_type': None, 'cpu_threads': 1, 'cpu_thread_type': None} .. note:: ``bash -l`` makes the shell act as if it had been directly invoked by logging in. .. note:: Verify that the command ``ssh localhost hostname`` works on the login node without a password prompt. If you are asked for a password, please create an ssh keypair with the command ``ssh-keygen``. That should create two keys in ``~/.ssh``, named ``id_rsa`` and ``id_rsa.pub``. Now, execute the following commands: .. code-block:: bash cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys chmod 0600 $HOME/.ssh/authorized_keys