Wednesday, January 26, 2011

DataStage OSH Script


The IBM InfoSphere DataStage and QualityStage Designer client creates IBM InfoSphere DataStage jobs that are compiled into parallel job flows, and reusable components that execute on the parallel Information Server engine. It allows you to use familiar graphical point-and-click techniques to develop job flows for extracting, cleansing, transforming, integrating, and loading data into target files, target systems, or packaged applications.

The Designer generates all the code. It generates the OSH (Orchestrate SHell Script) and C++ code for any Transformer stages used.
Briefly, the Designer performs the following tasks:
* Validates link requirements, mandatory stage options, transformer logic, etc.
* Generates the OSH representation of data flows and stages (representations of framework “operators”).
* Generates transform code for each Transformer stage, which is then compiled into C++ and then into corresponding native operators.
* Reusable BuildOp stages can be compiled using the Designer GUI or from the command line.
Here is a brief primer on the OSH:
* Comment blocks introduce each operator; their order is determined by the order in which stages were added to the canvas.
* OSH uses the familiar syntax of the UNIX shell: operator name, schema, operator options (“-name value” format), inputs (indicated by n<, where n is the input number), and outputs (indicated by n>, where n is the output number).
* For every operator, input and/or output data sets are numbered sequentially starting from zero.
* Virtual data sets (in-memory native representations of data links) are generated to connect operators.
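
To make the primer concrete, here is a rough sketch of what the generated OSH for a single Row Generator stage might look like (the stage name, link name, and schema are hypothetical, and the exact layout varies by release):

    #################################################################
    #### STAGE: Row_Generator_0
    ## Operator
    generator
    ## Operator options
    -schema record
    (
      customer_id:int32;
      customer_name:string[max=30];
    )
    -records 10
    ## General options
    [ident('Row_Generator_0')]
    ## Outputs
    0> [] 'Row_Generator_0:lnk_gen.v'
    ;

The comment block names the stage, the operator is generator, the options use the “-name value” format, and the single output 0> is written to the virtual data set 'Row_Generator_0:lnk_gen.v', whose name is built from the stage and link names.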

Framework (Information Server Engine) terms and DataStage terms have equivalency. The GUI frequently uses terms from both paradigms. Runtime messages use framework terminology because the framework engine is where execution occurs. The following list shows the equivalency between framework and DataStage terms:
* Schema corresponds to table definition
* Property corresponds to format
* Type corresponds to SQL type and length
* Virtual data set corresponds to link
* Record/field corresponds to row/column
* Operator corresponds to stage
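
As a small illustration of the schema and type equivalences above, a table definition column defined in the Designer as CUSTOMER_NAME, VarChar(30), nullable would surface in the generated OSH schema roughly as follows (a sketch; the exact type mapping depends on the stage and its format settings):

    record (
      CUSTOMER_NAME: nullable string[max=30];
    )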

Note: The actual execution order of operators is dictated by their input/output designators, not by their placement on the diagram. Data sets connect the OSH operators. These are “virtual data sets”, that is, in-memory data flows. Link names are used in data set names, so it is good practice to give the links meaningful names.
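
For example, a downstream Peek stage consuming the link from the earlier sketch would be generated with a matching input designator (again a hypothetical fragment):

    #################################################################
    #### STAGE: Peek_1
    ## Operator
    peek
    ## General options
    [ident('Peek_1')]
    ## Inputs
    0< 'Row_Generator_0:lnk_gen.v'
    ;

It is the shared data set name in the 0> and 0< designators, not the position of the stages on the canvas, that wires the two operators together.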

Saturday, January 15, 2011

DataStage Modules


The DataStage Client components:

Administrator :- Administers DataStage projects, manages global settings, and interacts with the system. Administrator is used to specify general server defaults, add and delete projects, and set up project properties, and it provides a command interface to the DataStage repository.
With DataStage Administrator, users can set job monitoring limits, user privileges, job scheduling options, and parallel job defaults.
Designer :- used to create DataStage jobs, which are compiled into executable programs. It is a graphical, user-friendly application that applies the visual data flow method to develop job flows for extracting, cleansing, transforming, integrating, and loading data. It is the module mainly used by DataStage developers.
Manager :- the main interface to the DataStage Repository, which allows browsing and editing it. It displays table and file layouts, routines, transforms, and jobs defined in the project. It is mainly used to store and manage reusable metadata.
Director :- manages running, validating, scheduling, and monitoring of DataStage jobs. It is mainly used by operators and testers.

Datastage Administrator view and project properties


Datastage Designer view with a job sequence


Datastage Manager view

Sunday, January 2, 2011

DataStage Execution Flow


When you execute a job, the generated OSH and the contents of the configuration file ($APT_CONFIG_FILE) are used to compose a “score”. This is similar to a SQL query optimization plan.
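
For reference, $APT_CONFIG_FILE points to a parallel configuration file that declares the logical nodes and their resources. A minimal two-node configuration might look like the following sketch (the host name and paths are placeholders):

    {
      node "node1"
      {
        fastname "etl_server"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
      }
      node "node2"
      {
        fastname "etl_server"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
      }
    }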

At runtime, IBM InfoSphere DataStage identifies the degree of parallelism and node assignments for each operator, and inserts sorts and partitioners as needed to ensure correct results. It also defines the connection topology (virtual data sets/links) between adjacent operators/stages, and inserts buffer operators to prevent deadlocks (for example, in fork-joins). It also defines the number of actual OS processes. Multiple operators/stages are combined within a single OS process as appropriate, to improve performance and optimize resource requirements.

The job score is used to fork processes with communication interconnects for data, messages, and control. Processing begins after the job score and processes are created. Job processing ends when the last row of data is processed by the final operator, a fatal error is encountered by any operator, or the job is halted by DataStage job control or human intervention such as DataStage Director STOP.

Job scores are divided into two sections — data sets (partitioning and collecting) and operators (node/operator mapping). Both sections identify sequential or parallel processing.
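
A dumped score (see the note on $APT_DUMP_SCORE below) shows both sections. The exact wording differs between releases, so treat the following as a schematic rather than verbatim output:

    main_program: This step has 1 dataset:
    ds0: {op0[1p] (sequential Row_Generator_0)
          eAny->eCollectAny
          op1[2p] (parallel Peek_1)}
    It has 2 operators:
    op0[1p] {(sequential Row_Generator_0)
        on nodes (
          node1[op0,p0]
        )}
    op1[2p] {(parallel Peek_1)
        on nodes (
          node1[op1,p0]
          node2[op1,p1]
        )}
    It runs 3 processes on 2 nodes.

The ds entries describe how each virtual data set is partitioned and collected between its producer and consumer; the op entries show the degree of parallelism of each operator and the nodes it runs on.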


The execution (orchestra) manages control and message flow across processes and consists of the conductor node and one or more processing nodes as shown in Figure 1-6. Actual data flows from player to player — the conductor and section leader are only used to control process execution through control and message channels.

Conductor is the initial framework process. It creates the Section Leader (SL) processes (one per node), consolidates messages to the DataStage log, and manages orderly shutdown. The Conductor node has the start-up process. The Conductor also communicates with the players.

Note: You can direct the score to a job log by setting $APT_DUMP_SCORE. To identify the Score dump, look for “main program: This step....”.
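
The variable can be added as a job or project environment variable in Administrator/Designer (typically set to True), or exported before invoking a job from the command line, for example:

    export APT_DUMP_SCORE=1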

Section Leader is a process that forks player processes (one per stage) and manages up/down communications. SLs communicate between the conductor and player processes only. For a given parallel configuration file, one section leader will be started for each logical node.

Players are the actual processes associated with the stages. A player sends its stderr and stdout to the SL, establishes connections to other players for data flow, and cleans up on completion. Each player has to be able to communicate with every other player. There are separate communication channels (pathways) for control, errors, messages, and data. The data channel does not go through the section leader or conductor, as this would limit scalability.

Data flows directly from upstream operator to downstream operator.
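
Putting the pieces together, a run on a two-node configuration produces a process tree roughly like the following (a schematic, not an actual process listing; the operator names are hypothetical):

    conductor                      one per job: startup, log consolidation, orderly shutdown
      section leader (node1)       one per logical node
        player (Row_Generator_0)   one per operator running on that node
        player (Peek_1)
      section leader (node2)
        player (Peek_1)

Control and messages travel up and down this tree, while data moves directly between players over separate channels.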