Sunday, June 5, 2011

The tsort operator

Posted by Venkat ♥ Duvvuri 7:07 AM, under | 7 comments


The tsort operator

WebSphere DataStage provides the sort operator, tsort that you can use to sort the records of a data set. The tsort operator can run as either a sequential or a parallel operator. The execution mode of the tsort operator determines its action:

  • Sequential mode: The tsort operator executes on a single processing node to sort an entire data set. On completion, the records of the data set are sorted completely.
  • Parallel mode: The tsort operator executes on multiple processing nodes in your system. On completion, the records within each partition of a data set are sorted. This type of sort is called a partition sort.
The following figure shows the difference between these two operating modes:

The tsort operator on the left side of this figure runs sequentially to completely sort the records of a data set. Typically, you use the tsort operator sequentially when you need a total sort of a data set, or when you create a sequential job.

The tsort operator on the right side of this figure executes in parallel and sorts records within each partition. Remember that a parallel operator executes on multiple nodes in a system, where each node receives a partition of a data set. A parallel tsort operator outputs multiple partitions, where the records are sorted within the partition.

Typically, you use a parallel tsort operator as part of a series of operators that requires sorted partitions. For example, you can combine a sort operator with an operator that removes duplicate records from a data set. After the partitions of a data set are sorted, duplicate records in a partition are adjacent.

To perform a parallel sort, you insert a partitioner in front of the tsort operator. This lets you control how the records of a data set are partitioned before the sort. For example, you could hash partition records by a name field, so that all records whose name field begins with the same one-, two-, or three-letter sequence are assigned to the same partition. See "Example: Using a Parallel tsort Operator" for more information.

If you combine a parallel sort with a sort merge collector, you can perform a total sort. A totally sorted data set output is completely ordered, meaning the records in each partition of the output data set are ordered, and the partitions themselves are ordered. See "Performing a Total Sort" for more information.

Example: using a parallel tsort operator

A parallel tsort operator runs on multiple processing nodes in your system to sort the records within each partition of a data set. The default execution mode of the operator is parallel, using the any partitioning method. However, you typically use a partitioning operator with the tsort operator to set an explicit partitioning method.

Choose a partitioning method that is correct for the sorting operation. For example, assume that you are sorting records in a data set based on the last name field of each record. If you randomly allocate records into any partition, records with similar last names are not guaranteed to be in the same partition and are not, therefore, processed by the same node. Similar records can be sorted by an operator only if they are in the same partition of the data set.

A better method of partitioning data in this case is to hash the records by the first five or six characters of the last name. All records containing similar names would be in the same partition and, therefore, would be processed by the same node. The tsort operator could then compare the entire last names, first names, addresses, or other information in the records, to determine the sorting order of the records.

For example:

record ( fname:string[30]; lname:string[30]; )
... | modify -spec "lname_hash:string[6] =   substring[0,6](lname)"
      | hash -key lname_hash | tsort -key lname | ...

WebSphere DataStage supplies a hash partitioner operator that allows you to hash records by one or more fields. See "The hash Partitioner" for more information on the hash operator. You can also use any one of the supplied WebSphere DataStage partitioning methods.

The following example is a modification of the previous example, "Example: Using a Sequential tsort Operator" , to execute the tsort operator in parallel using a hash partitioner operator. In this example, the hash operator partitions records using the integer field a, the primary sorting key. Thus all records containing the same value for field a will be assigned to the same partition. The figure below shows this example:

Shown below is the osh command corresponding to this example:

$ osh " hash -key a -key e < unSortedDS.ds | 
        tsort -key a -key e > sortedDS.ds"

Performing a total sort

The previous section showed an example of a parallel sorter using the hash partitioner to sort the partitions of a data set. When you use a hash partitioner, all records containing the same hash key values are in the same partition and are therefore processed and sorted by the same node.

In contrast to a partition sort, you can also use the psort operator to perform a total sort.

The following figure shows a total sort performed on a data set with two partitions:

In this example, the partitions of the output data set contain sorted records, and the partitions themselves are sorted. As you can see in this example, a total sort requires that all similar and duplicate records are located in the same partition of the data set. Similarity is based on the key fields in a record.

Because each partition of a data set is sorted by a single processing node, another requirement must also be met before a total sort can occur. Not only must similar records be in the same partition, but the partitions themselves should be approximately equal in size so that no one node becomes a processing bottleneck. To meet these two requirements, you use the range partitioner on the input data set before performing the actual sort.

The range partitioner guarantees that all records with the same key fields are in the same partition, but it does more. The range partitioner also calculates partition boundaries, based on the sorting keys, in order to evenly distribute records to the partitions. All records with sorting keys values between the partition boundaries are assigned to the same partition so that the partitions are ordered and that the partitions are approximately the same size. See the next section for more information on using the range partitioning operator.

You need to perform a total sort only when your job requires a completely ordered data set, either as output or as a preprocessing step. For example, you might want to perform a total sort on a mailing list so that the list is sorted by zip code, then by last name within zip code.

For most jobs, a partition sort is the correct sorting component. For example, if you are sorting a data set as a preprocessing step to remove duplicate records, you use a hash partitioner with a partition sort to partition the data set by the sorting key fields; a total sort is unnecessary. A hash partitioner assigns records with the same sorting key fields to the same partition, then sorts the partition. A second operator can compare adjacent records in a partition to determine if they have the same key fields to remove any duplicates.



Hello Sandy,

Thanks a lot for your feedback. It's my pleasure to inform you that if required you can refer the same article on your blog.../

Thanks n Regards
Venkat Duvvuri

Thanks for providing the information on  DataStage Online training. Online training have the benefits of being convenient, flexible and on your own time.

Thanks for your wonderful information which helped us to join Datastage online training

Thank you very much for your good information.
As I noted in the ... given information is very use full to every student who ever want to learn about
Excellent DataStage Online Training

I am robert..Good work buddy... keep continuing... it shows yours commitment towards datastage field...

Post a Comment