Tuesday, December 21, 2010

Data Transformations

Posted by Venkat ♥ Duvvuri 7:49 AM, under | No comments

Data transformation and movement is the process by which source data is selected, converted, and mapped to the format required by targeted systems. The process manipulates data to bring it into compliance with business, domain, and integrity rules and with other data in the target environment. Transformation can take some of the following forms:

Aggregation
Consolidating or summarizing data values into a single value. Collecting daily sales data to be aggregated to the weekly level is a common example of aggregation.

Basic conversion
Ensuring that data types are correctly converted and mapped from source to target columns.

Cleansing
Resolving inconsistencies and fixing the anomalies in source data.

Derivation
Transforming data from multiple sources by using a complex business rule or algorithm.

Enrichment
Combining data from internal or external sources to provide additional meaning to the data.

Normalizing
Reducing the amount of redundant and potentially duplicated data.
Combining
The process of combining data from multiple sources via parallel Lookup, Join, or Merge operations.

Pivoting
Converting records in an input stream to many records in the appropriate table in the data warehouse or data mart.

Sorting
Grouping related records and sequencing data based on data or string values.

Tuesday, December 7, 2010

DataStage Stages and Jobs

Posted by Venkat ♥ Duvvuri 9:48 PM, under | 1 comment

An IBM InfoSphere DataStage job consists of individual stages linked together which describe the flow of data from a data source to a data target. A stage usually has at least one data input and/or one data output. However, some stages can accept more than one data input, and output to more than one stage. Each stage has a set of predefined and editable properties that tell it how to perform or process data. Properties might include the file name for the Sequential File stage, the columns to sort, the transformations to perform, and the database table name for the DB2 stage. These properties are viewed or edited using stage editors. Stages are added to a job and linked together using the Designer. Figure shows some of the stages and their iconic representations.

Stages and links can be grouped in a shared container. Instances of the shared container can then be reused in different parallel jobs. You can also define a local container within a job — this groups stages and links into a single unit, but can only be used within the job in which it is defined. The different types of jobs have different stage types. The stages that are available in the Designer depend on the type of job that is currently open in the Designer. Parallel Job stages are organized into different groups on the Designer palette as follows:
General includes stages such as Container and Link. Data Quality includes stages such as Investigate, Standardize, Reference Match, and Survive.

Database includes stages such as Classic Federation, DB2 UDB, DB2 UDB/Enterprise, Oracle, Sybase, SQL Server®, Teradata, Distributed Transaction, and ODBC.
Development/Debug includes stages such as Peek, Sample, Head, Tail, and Row Generator.
File includes stages such as Complex Flat File, Data Set, Lookup File Set, and Sequential File.
Processing includes stages such as Aggregator, Copy, FTP, Funnel, Join, Lookup, Merge, Remove Duplicates, Slowly Changing Dimension, Surrogate Key Generator, Sort, and Transformer
Real Time includes stages such as Web Services Transformer, WebSphere MQ, and Web Services Client.
Restructure includes stages such as Column Export and Column Import.