Sunday, June 5, 2011

The tsort operator

Posted by Venkat ♥ Duvvuri, 7:07 AM | 6 comments


WebSphere DataStage provides a sort operator, tsort, that you can use to sort the records of a data set. The tsort operator can run as either a sequential or a parallel operator, and its execution mode determines its action:

  • Sequential mode: The tsort operator executes on a single processing node to sort an entire data set.

  • Parallel mode: The tsort operator executes on multiple processing nodes; on completion, the records are sorted within each partition, but not across partitions.
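
To make the distinction concrete, here is a minimal Python sketch (not DataStage or osh code) contrasting the two modes: a single global sort of the whole data set versus independent sorts within each partition.

    # Python sketch (illustrative only): the two tsort execution modes.
    records = [("smith", 3), ("jones", 1), ("brown", 2), ("adams", 4)]

    # Sequential mode: a single process sorts the entire data set.
    sequential = sorted(records, key=lambda rec: rec[0])

    # Parallel mode: each partition is sorted independently, so records
    # are ordered within a partition but not across partitions.
    partitions = [records[0::2], records[1::2]]   # round-robin split
    parallel = [sorted(part, key=lambda rec: rec[0]) for part in partitions]

    print(sequential)   # one fully sorted data set
    print(parallel)     # one sorted list per partition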

Sunday, April 17, 2011

How to generate Sequence Numbers in DataStage using @INROWNUM and @PARTITIONNUM System Variables

Posted by Venkat ♥ Duvvuri, 1:12 AM | No comments


This is one of the basic requirements in DataStage: generate sequence numbers and assign them to a required output field (e.g. 1, 2, 3, …). If you are using the import osh operator (through a stage such as the Sequential File stage) to read external data, you can use its -recordNumberField parameter instead.

Solution: Generate Row Number Field with DataStage Transformer Stage

There are a number of different ways to solve this problem. Here I will share the easiest: generate and assign sequence numbers in a DataStage parallel Transformer stage, using the following output derivation.

@PARTITIONNUM + @NUMPARTITIONS * (@INROWNUM - 1) + 1

Note: This logic produces a continuous, gap-free sequence only if your data is evenly balanced, that is, an equal number of rows flows through each partition.

The derivation uses three DataStage system variables:

  • @INROWNUM – Contains the row number within the current partition; within each partition it starts at 1 and increments: 1, 2, 3, …

  • @NUMPARTITIONS – Contains the total number of partitions (1, 2, 3, …) the stage is running on.

  • @PARTITIONNUM – Contains the number (0, 1, 2, …) of the partition that is processing the current row.
To understand the derivation expression better, let’s see an example.

Example-1: Generating Sequence Numbers Using 2-Node Config File

2-NODE CONFIG FILE

Sample Field | @INROWNUM | @PARTITIONNUM | @PARTITIONNUM + @NUMPARTITIONS * (@INROWNUM - 1) + 1
A | 1 | 0 | 1
B | 1 | 1 | 2
C | 2 | 0 | 3
D | 2 | 1 | 4
E | 3 | 0 | 5
G | 3 | 1 | 6
H | 4 | 0 | 7
I | 4 | 1 | 8
J | 5 | 0 | 9
K | 5 | 1 | 10
L | 6 | 0 | 11
M | 6 | 1 | 12
N | 7 | 0 | 13
O | 7 | 1 | 14
P | 8 | 0 | 15
Q | 8 | 1 | 16
R | 9 | 0 | 17
S | 9 | 1 | 18
T | 10 | 0 | 19
U | 10 | 1 | 20
V | 11 | 0 | 21

Example-2: Generating Sequence Numbers Using 4-Node Config File

4-NODE CONFIG FILE

Sample Field | @INROWNUM | @PARTITIONNUM | @PARTITIONNUM + @NUMPARTITIONS * (@INROWNUM - 1) + 1
A | 1 | 0 | 1
B | 1 | 1 | 2
C | 1 | 2 | 3
D | 1 | 3 | 4
E | 2 | 0 | 5
G | 2 | 1 | 6
H | 2 | 2 | 7
I | 2 | 3 | 8
J | 3 | 0 | 9
K | 3 | 1 | 10
L | 3 | 2 | 11
M | 3 | 3 | 12
N | 4 | 0 | 13
O | 4 | 1 | 14
P | 4 | 2 | 15
Q | 4 | 3 | 16
R | 5 | 0 | 17
S | 5 | 1 | 18
T | 5 | 2 | 19
U | 5 | 3 | 20
V | 6 | 0 | 21
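
You can verify the derivation outside DataStage. The following minimal Python sketch simulates @INROWNUM, @PARTITIONNUM, and @NUMPARTITIONS for a round-robin partitioned stream and reproduces the numbers in the tables above (the function name is just for illustration).

    # Python sketch: simulate the Transformer derivation
    # @PARTITIONNUM + @NUMPARTITIONS * (@INROWNUM - 1) + 1
    # over a round-robin partitioned input stream.
    def sequence_numbers(rows, num_partitions):
        inrownum = [0] * num_partitions          # per-partition row counters
        for i, row in enumerate(rows):
            partitionnum = i % num_partitions    # round-robin partitioning
            inrownum[partitionnum] += 1          # @INROWNUM starts at 1
            seq = partitionnum + num_partitions * (inrownum[partitionnum] - 1) + 1
            yield row, inrownum[partitionnum], partitionnum, seq

    # Sample fields from the 2-node table above.
    for row in sequence_numbers("ABCDEGHIJKLMNOPQRSTUV", num_partitions=2):
        print(*row)   # e.g. A 1 0 1, B 1 1 2, C 2 0 3, ...

Changing num_partitions to 4 reproduces the 4-node table.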

Wednesday, March 2, 2011

New features & changes in IBM InfoSphere Information Server 8.5

Posted by Venkat ♥ Duvvuri, 7:09 PM | 9 comments

New features and changes were introduced in IBM® InfoSphere™ Information Server, Version 8.5, along with documentation updates. They are described in the following sections.

Table of contents

InfoSphere Information Server, Version 8.5, new features and changes:

  • Suite and product module changes

  • IBM InfoSphere Business Glossary

  • IBM InfoSphere DataStage

Tuesday, February 8, 2011

DataStage Parallel Processing

Posted by Venkat ♥ Duvvuri, 8:30 AM | No comments

The following figure represents one of the simplest jobs you could have: a data source,
a Transformer (conversion) stage, and a data target. The links between the stages represent the flow of data into or out of a stage. In a parallel job, each stage normally (but not always) corresponds to a process, and you can run multiple instances of each process on the available processors in your system.


A parallel DataStage job incorporates two basic types of parallel processing: pipeline and partition parallelism. Both methods are used at runtime by the Information Server engine to execute the simple job shown in the figure above. To the DataStage developer the job looks the same on the Designer canvas, but you can optimize it through advanced stage properties.

Pipeline parallelism
In the following example, all stages run concurrently, even in a single-node configuration. As data is read from the Oracle source, it is passed to the Transformer stage for transformation and then on to the DB2 target. Instead of waiting for all of the source data to be read, the downstream stages begin work as soon as the source starts producing rows. This is called pipeline parallelism: all three stages in the example operate simultaneously, regardless of the degree of parallelism in the configuration file. The Information Server engine always executes jobs with pipeline parallelism.

If you ran the example job on a system with multiple processors, the stage reading would start on one processor and start filling a pipeline with the data it had read. The transformer stage would start running as soon as there was data in the pipeline, process it and start filling another pipeline. The stage writing the transformed data to the target database would similarly start writing as soon as there was data available. Thus all three stages are operating simultaneously.
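
As an analogy, here is a minimal Python sketch (generators standing in for stages and links; the stage names are illustrative) showing how a downstream stage can begin consuming rows before the upstream stage has finished producing them.

    # Python sketch: pipeline parallelism modeled with generators.
    # Rows flow through one at a time, so the "target" starts writing
    # before the "source" has finished reading.
    def read_source():                # stand-in for the Oracle source stage
        for i in range(3):
            print(f"source: read row {i}")
            yield i

    def transform(rows):              # stand-in for the Transformer stage
        for row in rows:
            yield row * 10

    def write_target(rows):           # stand-in for the DB2 target stage
        for row in rows:
            print(f"target: wrote {row}")

    write_target(transform(read_source()))
    # Output interleaves "read" and "wrote" lines: each row is written
    # as soon as it is transformed, without waiting for the full read.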


Partition parallelism
When large volumes of data are involved, you can use the power of parallel processing to best advantage by partitioning the data into a number of separate sets, with each partition handled by a separate instance of the job stages. Partition parallelism is accomplished at runtime, rather than through the manual decomposition that traditional systems require.

The DataStage developer only needs to specify the algorithm used to partition the data, not the degree of parallelism or where the job will execute. Using partition parallelism, the same job is effectively run simultaneously by several processors, each handling a separate subset of the total data. At the end of the job, the partitions can be collected back together and written to a single data source. This is shown in the following figure.


Attention: You do not need multiple processors to run in parallel. A single processor is capable of running multiple concurrent processes.
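
The idea can be sketched in Python as well (a hedged illustration, not how the engine is actually implemented): the data set is split into partitions, and each partition is handled by its own worker process.

    # Python sketch: partition parallelism with a process pool.
    from concurrent.futures import ProcessPoolExecutor

    def process_partition(partition):     # stand-in for one stage instance
        return [value * 10 for value in partition]

    if __name__ == "__main__":
        data = list(range(20))
        num_partitions = 4
        # Round-robin partitioning of the input data.
        partitions = [data[p::num_partitions] for p in range(num_partitions)]

        with ProcessPoolExecutor(max_workers=num_partitions) as pool:
            results = list(pool.map(process_partition, partitions))

        # Collect the partitions back into a single output.
        collected = [row for part in results for row in part]
        print(collected)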


Combining pipeline and partition parallelism
The Information Server engine combines pipeline and partition parallel processing to achieve even greater performance gains. In this scenario you would have stages processing partitioned data and filling pipelines so the next one could start on that partition before the previous one had finished. This is shown in the following figure.


In some circumstances you might want to re-partition your data between stages. This could happen, for example, when you want to group the data differently. Suppose that you have initially processed the data based on customer last name, but now want to process it grouped by zip code. You will have to re-partition so that all customers sharing the same zip code end up in the same partition. DataStage allows you to re-partition between stages as and when necessary. With the Information Server engine, re-partitioning happens in memory between stages, instead of being written to disk.
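
As a final illustration, here is a minimal Python sketch of hash re-partitioning (the field names are made up): rows originally spread across partitions by last name are redistributed so that all rows with the same zip code land in the same partition.

    # Python sketch: re-partitioning on a new key (zip code). A hash of
    # the key decides the target partition, so identical zip codes
    # always end up together.
    from zlib import crc32

    rows = [
        {"last_name": "adams", "zip": "10001"},
        {"last_name": "brown", "zip": "60601"},
        {"last_name": "adams", "zip": "60601"},
        {"last_name": "smith", "zip": "10001"},
    ]

    num_partitions = 2
    repartitioned = [[] for _ in range(num_partitions)]
    for row in rows:
        target = crc32(row["zip"].encode()) % num_partitions
        repartitioned[target].append(row)

    for p, partition in enumerate(repartitioned):
        print(p, partition)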