Tuesday, December 21, 2010

Data Transformations

Posted by Venkat ♥ Duvvuri 7:49 AM, under DataStage EE | No comments

Data transformation and movement is the process by which source data is selected, converted, and mapped to the format required by targeted systems. The process manipulates data to bring it into compliance with business, domain, and integrity rules and with other data in the target environment. Transformation can take some of the following forms:

Aggregation
Consolidating or summarizing data values into a single value. Collecting daily sales data to be aggregated to the weekly level is a common example of aggregation.

Basic conversion
Ensuring that data types are correctly converted and mapped from source to target columns.

Cleansing
Resolving inconsistencies and fixing the anomalies in source data.

Derivation
Transforming data from multiple sources by using a complex business rule or algorithm.

Enrichment
Combining data from internal or external sources to provide additional meaning to the data.

Normalizing
Reducing the amount of redundant and potentially duplicated data.
Combining
The process of combining data from multiple sources via parallel Lookup, Join, or Merge operations.

Pivoting
Converting records in an input stream to many records in the appropriate table in the data warehouse or data mart.

Sorting
Grouping related records and sequencing data based on data or string values.

Tuesday, December 7, 2010

DataStage Stages and Jobs

Posted by Venkat ♥ Duvvuri 9:48 PM, under DataStage EE | 1 comment

An IBM InfoSphere DataStage job consists of individual stages linked together which describe the flow of data from a data source to a data target. A stage usually has at least one data input and/or one data output. However, some stages can accept more than one data input, and output to more than one stage. Each stage has a set of predefined and editable properties that tell it how to perform or process data. Properties might include the file name for the Sequential File stage, the columns to sort, the transformations to perform, and the database table name for the DB2 stage. These properties are viewed or edited using stage editors. Stages are added to a job and linked together using the Designer. Figure shows some of the stages and their iconic representations.

Stages and links can be grouped in a shared container. Instances of the shared container can then be reused in different parallel jobs. You can also define a local container within a job — this groups stages and links into a single unit, but can only be used within the job in which it is defined. The different types of jobs have different stage types. The stages that are available in the Designer depend on the type of job that is currently open in the Designer. Parallel Job stages are organized into different groups on the Designer palette as follows:
General includes stages such as Container and Link. Data Quality includes stages such as Investigate, Standardize, Reference Match, and Survive.

Database includes stages such as Classic Federation, DB2 UDB, DB2 UDB/Enterprise, Oracle, Sybase, SQL Server®, Teradata, Distributed Transaction, and ODBC.
Development/Debug includes stages such as Peek, Sample, Head, Tail, and Row Generator.
File includes stages such as Complex Flat File, Data Set, Lookup File Set, and Sequential File.
Processing includes stages such as Aggregator, Copy, FTP, Funnel, Join, Lookup, Merge, Remove Duplicates, Slowly Changing Dimension, Surrogate Key Generator, Sort, and Transformer
Real Time includes stages such as Web Services Transformer, WebSphere MQ, and Web Services Client.
Restructure includes stages such as Column Export and Column Import.

Tuesday, November 23, 2010

DataStage Functions

Posted by Venkat ♥ Duvvuri 8:45 AM, under DataStage EE | No comments

Bottom
In its simplest form, IBM InfoSphere DataStage performs data transformation and movement from source systems to target systems in batch and in real time. The data sources might include indexed files, sequential files, relational databases, archives, external data sources, enterprise applications, and message queues.

DataStage manages data that arrives and data that is received on a periodic or scheduled basis. It enables companies to solve large-scale business problems with high-performance processing of massive data volumes. By leveraging the parallel processing capabilities of multiprocessor hardware platforms, DataStage can scale to satisfy the demands of ever-growing data volumes, stringent real-time requirements, and ever-shrinking batch windows.

Leveraging the combined suite of IBM Information Server, DataStage can simplify the development of authoritative master data by showing where and how information is stored across source systems. DataStage can also consolidate disparate data into a single, reliable record, cleanses and standardizes information, removes duplicates, and links records together across systems. This master record can be loaded into operational data stores, data warehouses, or master data applications such as IBM MDM using IBM InfoSphere DataStage.
IBM InfoSphere DataStage delivers four core capabilities:

* Connectivity to a wide range of mainframe, legacy, and enterprise applications, databases, file formats, and external information sources.

* Prebuilt library of more than 300 functions including data validation rules and very complex transformations.

* Maximum throughput using a parallel, high-performance processing architecture.

* Enterprise-class capabilities for development, deployment, maintenance, and high-availability. It leverages metadata for analysis and maintenance. It also operates in batch, real time, or as a Web service.
IBM InfoSphere DataStage enables an integral part of the information integration process.

Functions used in IBM InfoSphere DataStage and QualityStage

The functions that are valid in IBM® InfoSphere® DataStage® and QualityStage are also valid in IBM Information Server
FastTrack, The following is a list of functions that are generally used when defining a column derivation in a Transformer stage.

Date and time functions
Logical functions
Mathematical functions
Null handling functions
Number functions
Raw functions
String functions
Vector function
Type conversion functions
Type casting functions
Utility functions

Top | Next

Date and time functions

Bottom
The following table lists the functions that are available in the Date and Time category (Square brackets indicate an argument is optional):

Name	Description	Arguments	Output
DateFromDaysSince	Returns a date by adding an integer to a baseline date	number (int32) [baseline date]	date
DateFromJulianDay	Returns a date from the given julian date	juliandate (uint32)	date
DaysSinceFromDate	Returns the number of days from source date to the given date	source_date given_date	days since (int32)
HoursFromTime	Returns the hour portion of a time	time	hours (int8)
JulianDayFromDate	Returns julian day from the given date	date	julian date (int32)
MicroSecondsFromTime	Returns the microsecond portion from a time	time	microseconds (int32)
MinutesFromTime	Returns the minute portion from a time	time	minutes (int8)
MonthDayFromDate	Returns the day of the month given the date	date	day (int8)
MonthFromDate	Returns the month number given the date	date	month number (int8)
NextWeekdayFromDate	Returns the date of the specified day of the week soonest after the source date	source date day of week (string)	date
PreviousWeekdayFromDate	Returns the date of the specified day of the week most recent before the source date	source date day of week (string)	date
SecondsFromTime	Returns the second portion from a time	time	seconds (dfloat)
SecondsSinceFromTimestamp	Returns the number of seconds between two timestamps	timestamp base timestamp	seconds (dfloat)
TimeDate	Returns the system time and date as a formatted string	-	system time and date (string)
TimeFromMidnightSeconds	Returns the time given the number of seconds since midnight	seconds (dfloat)	time
TimestampFromDateTime	Returns a timestamp form the given date and time	date time	timestamp
TimestampFromSecondsSince	Returns the timestamp from the number of seconds from the base timestamp	seconds (dfloat) [base timestamp]	timestamp
TimestampFromTimet	Returns a timestamp from the given unix time_t value	timet (int32)	timestamp
TimetFromTimestamp	Returns a unix time_t value from the given timestamp	timestamp	timet (int32)
WeekdayFromDate	Returns the day number of the week from the given date. Origin day optionally specifies the day regarded as the first in the week and is Sunday by default	date [origin day]	day (int8)
YeardayFromDate	Returns the day number in the year from the given date	date	day (int16)
YearFromDate	Returns the year from the given date	date	year (int16)
YearweekFromDate	Returns the week number in the year from the given date	date	week (int16)

Date, Time, and Timestamp functions that specify dates, times, or timestamps in the argument use strings with specific formats: For a date, the format is %yyyy-%mm-%dd For a time, the format is %hh:%nn:%ss, or, if extended to include microseconds, %hh:%nn:%ss.x where x gives the number of decimal places seconds is given to. For a timestamp the format is %yyyy-%mm-%dd %hh:%nn:%ss, or, if extended to include microseconds, %yyyy-%mm-%dd %hh:%nn:%ss.x where x gives the number of decimal places seconds is given to. This applies to the arguments date, baseline date, given date, time, timestamp, and base timestamp. Functions that have days of week in the argument take a string specifying the day of the week, this applies to day of week and origin day.

Top | Next

Logical Functions

Previous | Bottom
The following table lists the functions available in the Logical category (square brackets indicate an argument is optional):

Name	Description	Arguments	Output
Not	Returns the complement of the logical value of an expression	expression	Complement (int8)
BitAnd	Returns the bitwise AND of the two integer arguments	number 1 (uint64) number 2 (uint64)	number (uint64)
BitOr	Returns the bitwise OR of the two integer arguments	number 1 (uint64) number 2 (uint64)	number (uint64)
BitXOr	Returns the bitwise Exclusive OR of the two integer arguments	number 1 (uint64) number 2 (uint64)	number (uint64)
BitExpand	Returns a string containing the binary representation in "1"s and "0"s of the given integer	number (uint64)	string
BitCompress	Returns the integer made from the string argument, which contains a binary representation of "1"s and "0"s.	number (string)	number (uint64)
SetBit	Returns an integer with specific bits set to a specific state, where origfield is the input value to perform the action on, bitlist is a string containing a list of comma separated bit numbers to set the state of, and bitstate is either 1 or 0, indicating which state to set those bits.	origfield (uint64) bitlist (string) bitstate (uint8)	number (uint64)

Top | Next

Mathematical Functions

Previous | Bottom
The following table lists the functions available in the Mathematical category (square brackets indicate an argument is optional):

Name	Description	Arguments	Output
Abs	Absolute value of any numeric expression	number (int32)	result (dfloat)
Acos	Calculates the trigonometric arc-cosine of an expression	number (dfloat)	result (dfloat)
Asin	Calculates the trigonometric arc-sine of an expression	number (dfloat)	result (dfloat)
Atan	Calculates the trigonometric arc-tangent of an expression	number (dfloat)	result (dfloat)
Ceil	Calculates the smallest dfloat value greater than or equal to the given decimal value	number (decimal)	result (dfloat)
Cos	Calculates the trigonometric cosine of an expression	number (dfloat)	result (dfloat)
Cosh	Calculates the hyperbolic cosine of an expression	number (dfloat)	result (dfloat)
Div	Outputs the whole part of the real division of two real numbers (dividend, divisor)	dividend (dfloat) divisor (dfloat)	result (dfloat)
Exp	Calculates the result of base 'e' raised to the power designated by the value of the expression	number (dfloat)	result (dfloat)
Fabs	Calculates the absolute value of the given value	number (dfloat)	result (dfloat)
Floor	Calculates the largest dfloat value less than or equal to the given decimal value	number (decimal)	result (dfloat)
Ldexp	Calculates a number from an exponent and mantissa	mantissa (dfloat) exponent (int32)	result (dfloat)
Llabs	Returns the absolute value of the given integer	number (uint64)	result (int64)
Ln	Calculates the natural logarithm of an expression in base 'e'	number (dfloat)	result (dfloat)
Log10	Returns the log to the base 10 of the given value	number (dfloat)	result (dfloat)
Max	Returns the greater of the two argument values	number 1 (int32) number 2(int32)	result (int32)
Min	Returns the lower of the two argument values	number 1 (int32) number 2 (int32)	result (int32)
Mod	Calculates the modulo (the remainder) of two expressions (dividend, divisor)	dividend (int32) divisor (int32)	result (int32)
Neg	Negate a number	number (dfloat)	result (dfloat)
Pwr	Calculates the value of an expression when raised to a specified power (expression, power)	expression (dfloat) power (dfloat)	result (dfloat)
Rand	Return a psuedo random integer between 0 and 2³²-1	-	result (uint32)
Random	Returns a random number between 0 2³²-1	-	result (uint32)
Sin	Calculates the trigonometric sine of an angle	number (dfloat)	result (dfloat)
Sinh	Calculates the hyperbolic sine of an expression	number (dfloat)	result (dfloat)
Sqrt	Calculates the square root of a number	number (dfloat)	result (dfloat)
Tan	Calculates the trigonometric tangent of an angle	number (dfloat)	result (dfloat)
Tanh	Calculates the hyperbolic tangent of an expression	number (dfloat)	result (dfloat)

Top | Next

Null handling functions

Previous | Bottom
The following table lists the functions available in the Null Handling category (square brackets indicate an argument is optional):

Name	Description	Arguments	Output
IsNotNull	Returns true when an expression does not evaluate to the null value	any	true/false (int8)
IsNull	Returns true when an expression evaluates to the null value	any	true/false (int8)
MakeNull	Change an in-band null to out of band null	any (column) string (string)	-
NullToEmpty	Returns an empty string if input column is null, otherwise returns the input column value	input column	input column value or empty string
NullToZero	Returns zero if input column is null, otherwise returns the input column value	input column	input column value or zero
NullToValue	Returns specified value if input column is null, otherwise returns the input column value	input column, value	input column value or value
SetNull	Assign a null value to the target column	-	-

Hint: true = 1 false = 0

Top | Next

Number functions

Previous | Bottom
The following table lists the functions available in the Number category (square brackets indicate an argument is optional):

Name	Description	Arguments	Output
MantissaFromDecimal	Returns the mantissa from the given decimal	number (decimal)	result (dfloat)
MantissaFromDFloat	Returns the mantissa from the given dfloat	number (dfloat)	result (dfloat)

Top | Next

Raw functions

Previous | Bottom
The following table lists the functions available in the Raw category (square brackets indicate an argument is optional):

Name	Description	Arguments	Output
RawLength	Returns the length of a raw string	input string (raw)	Result (int32)

Top | Next

String functions

Previous | Bottom
The following table lists the functions available in the String category (square brackets indicate an argument is optional):

Name	Description	Arguments	Output
AlNum	Return whether the given string consists of alphanumeric characters	string (string)	true/false (int8)
Alpha	Returns 1 if string is purely alphabetic	string (string)	result (int8)
CompactWhiteSpace	Return the string after reducing all consective whitespace to a single space	string (string)	result (string)
Compare	Compares two strings for sorting	string1 (string) string2 (string) [justification (L or R)]	result (int8)
ComparNoCase	Case insensitive comparison of two strings	string1 (string) string2 (string)	result (int8)
ComparNum	Compare the first n characters of the two strings	string1 (string) string2 (string) length (int16)	result (int8)
CompareNumNoCase	Caseless comparison of the first n characters of the two strings	string1 (string) string2 (string) length (int16)	result (int8)
Convert	Converts specified characters in a string to designated replacement characters	fromlist (string) tolist (string) expression (string)	result (string)
Count	Count number of times a substring occurs in a string	string (string) substring (string)	result (int32)
Dcount	Count number of delimited fields in a string	string (string) delimiter (string)	result (int32)
DownCase	Change all uppercase letters in a string to lowercase	string (string)	result (string)
DQuote	Enclose a string in double quotation marks	string (string)	result (string)
Field	Return 1 or more delimited substrings	string (string) delimiter (string) occurrence (int32) [number (int32)]	result (string)
Index	Find starting character position of substring	string (string) substring (string) occurrence (int32)	result (int32)
Left	Leftmost n characters of string	string (string) number (int32)	result (string)
Len	Length of string in characters	string (string)	result (int32)
Num	Return 1 if string can be converted to a number	string (string)	result (int8)
PadString	Return the string padded with the optional pad character and optional length	string (string) padlength (int32)	result (string)
Right	Rightmost n characters of string	string (string) number (int32)	result (string)
Soundex	Returns a string which identifies a set of words that are (roughly) phonetically alike based on the standard, open algorithm for SOUNDEX evaluation	string (string)	result (string)
Space	Return a string of N space characters	length (int32)	result (string)
Squote	Enclose a string in single quotation marks	string (string)	result (string)
Str	Repeat a string	string (string) repeats (int32)	result (string)
StripWhiteSpace	Return the string after stripping all whitespace from it	string (string)	result (string)
Trim	Remove all leading and trailing spaces and tabs plus reduce internal occurrences to one	string (string) [stripchar (string)] [options (string)]	result (string)
TrimB	Remove all trailing spaces and tabs	string (string)	result (string)
TrimF	Remove all leading spaces and tabs	string (string)	result (string)
Trim Leading Trailing	Returns a string with leading and trailing whitespace removed	string (string)	result (string)
Upcase	Change all lowercase letters in a string to uppercase	string (string)	result (string)

Hint: true = 1 false = 0
Possible options for the Trim function are:
* L Removes leading occurrences of character.
* T Removes trailing occurrences of character.
* B Removes leading and trailing occurrences of character.
* R Removes leading and trailing occurrences of character, and reduces multiple occurrences to a single occurrence.
* A Removes all occurrences of character.
* F Removes leading spaces and tabs.
* E Removes trailing spaces and tabs.
* D Removes leading and trailing spaces and tabs, and reduces multiple spaces and tabs to single ones.

Top | Next

Vector function

Previous | Bottom
The following function can be used within expressions to access an element in a vector column. The vector index starts at 0.

Name	Description	Arguments	Output
ElementAt	Accesses an element of a vector	input column index (int)	element of vector

This can be used as part of, or the whole of an expression. For example, an expression to add 1 to the third element of an vector input column 'InLink.col1' would be: ElementAt(InLink.col1, 2) + 1

Top | Next

Type Conversion Functions

Previous | Bottom
The following table lists the functions available in the Type Conversion category (square brackets indicate an argument is optional):

Name	Description	Arguments	Output
DateToString	Return the string representation of the given date	date [format (string)]	result (string)
DecimalToDecimal	Returns the given decimal in decimal representation with specified precision and scale	decimal (decimal) [rtype (string)] [packedflag (int8)]	result (decimal)
DecimalToDFloat	Returns the given decimal in dfloat representation	number (decimal) ["fix_zero"]	result (dfloat)
DecimalToString	Return the string representation of the given decimal	number (decimal) ["fix_zero"]	result (string)
DfloatToDecimal	Returns the given dfloat in decimal representation	number (dfloat) [rtype (string)]	result (decimal)
DfloatToStringNoExp	Returns the given dfloat in its string representation with no exponent, using the specified scale	number (dfloat) scale (string)	result (string)
IsValid	Return whether the given string is valid for the given type. Valid types are "date", "decimal", "dfloat", "sfloat", "int8", "uint8", "int16", "uint16", "int32", "uint32", "int64", "uint64", "raw", "string", "time", "timestamp". "ustring"	type (string) format (string)	result (int8)
StringToDate	Returns a date from the given string in the given format	date (string) format (string)	date
StringToDecimal	Returns the given string in decimal representation	string (string) [rtype (string)]	result (decimal)
StringToRaw	Returns a string in raw representation	string (string)	result (raw)
StringToTime	Returns a time representation of the given string	string (string) [format (string)]	time
StringToTimestamp	Returns a timestamp representation of the given string	string (string) [format (string)]	timestamp
TimestampToDate	Returns a date from the given timestamp	timestamp	date
TimestampToString	Return the string representation of the given timestamp	timestamp [format (string)]	result (string)
TimestampToTime	Returns the time from a given timestamp	timestamp	time
TimeToString	Return the string representation of the given time	time [format (string)]	result (string)
StringToUstring	Returns a ustring from the given string, optionally using the specified map (otherwise uses project default)	string (string) [,mapname (string)]	result (ustring)
UstringToString	Returns a string from the given ustring, optionally using the specified map (otherwise uses project default)	string(ustring) [,mapname (string)]	result (string)

Rtype: The rtype argument is a string, and should contain one of the following:
* ceil: Round the source field toward positive infinity. E.g, 1.4 -> 2, -1.6 -> -1.
* floor: Round the source field toward negative infinity. E.g, 1.6 -> 1, -1.4 -> -2.
* round_inf: Round or truncate the source field toward the nearest representable value, breaking ties by rounding positive values toward positive infinity and negative values toward negative infinity. E.g, 1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2.
* trunc_zero. Discard any fractional digits to the right of the rightmost fractional digit supported in the destination, regardless of sign. For example, if the destination is an integer, all fractional digits are truncated. If the destination is another decimal with a smaller scale, round or truncate to the scale size of the destination decimal. E.g, 1.6 -> 1, -1.6 -> -1.
The default is trunc_zero.
Format string: Date, Time, and Timestamp functions that take a format string (e.g., timetostring(time, stringformat)) need to have the date format specified. The format strings are described in Date and time formats. Where your dates, times, or timestamps convert to or from ustrings, InfoSphere DataStage will pick this up automatically. In these cases the separators in your format string (for example, `:' or `-') can themselves be Unicode characters.
fix_zero: By default decimal numbers comprising all zeros are treated as invalid. If the string fix_zero is specified as a second argument, then all zero decimal values are regarded as valid.

Top | Next

Type "casting" functions

Previous | Bottom
There is a special class of type conversion function to help you when performing mathematical calculations using numeric fields. For example, if you have a calculation using an output column of type float derived from an input column of type integer in a Parallel Transformer stage the result will be derived as an integer regardless of its float type. If you want a non-integral result for a calculation using integral operands, you can use the following functions (which act in a similar way as casting in C) to cast the integer operands into non-integral operands:

Name	Description	Arguments	Output
AsDouble	Treat the given number as a double	number (number)	number (double)
AsFloat	Treat the given number as a float	number (number)	number (float)
AsInteger	Treat the given number as an integer	number (number)	number (int)

Top

Utility functions

Previous | Bottom
The following table lists the functions available in the Utility category (square brackets indicate an argument is optional):

Name	Description	Arguments	Output
GetEnvironment	Returns the value of the given environment variable	environment variable (string)	result (string)
NextSKChain	Returns the value of the surrogate key column for the next record in the chain, or `value` for the newest record	`value` (number)	result (int64)
NextSurrogateKey	Returns the value of the next surrogate key	None	result (int64)
PrevSKChain	Returns the value of the surrogate key column for the previous record in the chain, or `value` for the first record	`value` (number)	result (int64)

Top

Thursday, November 11, 2010

DataStage FAQs and Best Practices

Posted by Venkat ♥ Duvvuri 8:39 PM, under FAQ's and Best Practices | 5 comments

1. General Datastage issues

1.1. What are the ways to execute datastage jobs?

A job can be run using a few different methods:
* from Datastage Director (menu Job -> Run now...)
* from command line using a dsjob command
* Datastage routine can run a job (DsRunJob command)
* by a job sequencer

1.2. How to invoke a Datastage shell command?

Datastage shell commands can be invoked from :
* Datastage administrator (projects tab -> Command)
* Telnet client connected to the datastage server

1.3. How to stop a job when its status is running?

To stop a running job go to DataStage Director and click the stop button (or Job -> Stop from menu). If it doesn't help go to Job -> Cleanup Resources, select a process with holds a lock and click Logout
If it still doesn't help go to the datastage shell and invoke the following command: ds.tools.It will open an administration panel. Go to 4.Administer processes/locks , then try invoking one of the clear locks commands (options 7-10).

1.4. How to run and schedule a job from command line?

To run a job from command line use a dsjob command
Command Syntax: dsjob [-file | [-server ][-user ][-password ]] []
The command can be placed in a batch file and run in a system scheduler.

1.5. How to release a lock held by jobs?

Go to the datastage shell and invoke the following command: ds.tools
It will open an administration panel. Go to 4.Administer processes/locks , then try invoking one of the clear locks commands (options 7-10).

1.6. User privileges for the default DataStage roles?

The role privileges are:
* DataStage Developer - user with full access to all areas of a DataStage project
* DataStage Operator - has privileges to run and manage deployed DataStage jobs
* -none- - no permission to log on to DataStage

1.7. What is a command to analyze hashed file?

There are two ways to analyze a hashed file. Both should be invoked from the datastage command shell. These are:
* FILE.STAT command
* ANALYZE.FILE command

1.8. Is it possible to run two versions of datastage on the same pc?

Yes, even though different versions of Datastage use different system dll libraries.
To dynamically switch between Datastage versions install and run DataStage Multi-Client Manager. That application can unregister and register system libraries used by Datastage.

1.9. How to send notifications from Datastage as a text message (sms) to a cell phone

There is a few possible methods of sending sms messages from Datastage. However, there is no easy way to do this directly from Datastage and all methods described below will require some effort.
The easiest way of doing that from the Datastage standpoint is to configure an SMTP (email) server as a mobile phone gateway. In that case, a Notification Activity can be used to send message with a job log and any desired details. DSSendMail Before-job or After-job subroutine can be also used to send sms messages.
If configured properly, the recipients email address will have the following format: 600123456@oursmsgateway.com
If there is no possibility of configuring a mail server to send text messages, you can to work it around by using an external application run directly from the operational system. There is a whole bunch of unix scripts and applications to send sms messages.
In that solution, you will need to create a batch script which will take care of sending messages and invoke it from Datastage using ExecDOS or ExecSh subroutines passing the required parameters (like phone number and message body).
Please keep in mind that all these solutions may require a contact to the local cellphone provider first and, depending on the country, it may not be free of charge and in some cases the provider may not support the capability at all.

2. Datastage development and job design

2.1. Error in Link collector - Stage does not support in-process active-to-active inputs or outputs

To get rid of the error just go to the Job Properties -> Performance and select Enable row buffer.
Then select Inter process which will let the link collector run correctly.
Buffer size set to 128Kb should be fine, however it's a good idea to increase the timeout.

2.2. What is the DataStage equivalent to like option in ORACLE

The following statement in Oracle:
select * from ARTICLES where article_name like '%WHT080%';
Can be written in DataStage (for example as the constraint expression):
incol.empname matches '...WHT080...'

2.3. what is the difference between logging text and final text message in terminator stage

Every stage has a 'Logging Text' area on their General tab which logs an informational message when the stage is triggered or started.
* Informational - is a green line, DSLogInfo() type message.
* The Final Warning Text - the red fatal, the message which is included in the sequence abort message

2.4. Error in STPstage - SOURCE Procedures must have an output link

The error appears in Stored Procedure (STP) stage when there are no stages going out of that stage.To get rid of it go to 'stage properties' -> 'Procedure type' and select Transform

2.5. How to invoke an Oracle PLSQL stored procedure from a server job

To run a pl/sql procedure from Datastage a Stored Procedure (STP) stage can be used.
However it needs a flow of at least one record to run.
It can be designed in the following way:
* source odbc stage which fetches one record from the database and maps it to one column - for example: select sysdate from dual
* A transformer which passes that record through. If required, add pl/sql procedure parameters as columns on the right-hand side of tranformer's mapping
* Put Stored Procedure (STP) stage as a destination. Fill in connection parameters, type in the procedure name and select Transform as procedure type. In the input tab select 'execute procedure for each row' (it will be run once).

2.6. Is it possible to run a server job in parallel?

Yes, even server jobs can be run in parallel.
To do that go to 'Job properties' -> General and check the Allow Multiple Instance button.
The job can now be run simultaneously from one or many sequence jobs. When it happens datastage will create new entries in Director and new job will be named with automatically generated suffix (for example second instance of a job named JOB_0100 will be named JOB_0100.JOB_0100_2). It can be deleted at any time and will be automatically recreated by datastage on the next run.

2.7. Error in STPstage - STDPROC property required for stage xxx

The error appears in Stored Procedure (STP) stage when the 'Procedure name' field is empty. It occurs even if the Procedure call syntax is filled in correctly.
To get rid of error fill in the 'Procedure name' field.

2.8. Datastage routine to open a text file with error catching

Note! work dir and file1 are parameters passed to the routine.
* open file1
OPENSEQ work_dir : '\' : file1 TO H.FILE1 THEN
CALL DSLogInfo("******************** File " : file1 : " opened successfully", "JobControl")
END ELSE
CALL DSLogInfo("Unable to open file", "JobControl")
ABORT

2.9. Datastage routine which reads the first line from a text file

Note! work dir and file1 are parameters passed to the routine.

* open file1
OPENSEQ work_dir : '\' : file1 TO H.FILE1 THEN
CALL DSLogInfo("******************** File " : file1 : " opened successfully", "JobControl")
END ELSE
CALL DSLogInfo("Unable to open file", "JobControl")
ABORT
END

READSEQ FILE1.RECORD FROM H.FILE1 ELSE
Call DSLogWarn("******************** File is empty", "JobControl")
END

firstline = Trim(FILE1.RECORD[1,32]," ","A") ******* will read the first 32 chars
Call DSLogInfo("******************** Record read: " : firstline, "JobControl")
CLOSESEQ H.FILE1

END

2.10. How to test a datastage routine or transform?

To test a datastage routine or transform go to the Datastage Manager.
Navigate to Routines, select a routine you want to test and open it. First compile it and then click 'Test...' which will open a new window. Enter test parameters in the left-hand side column and click run all to see the results.
Datastage will remember all the test arguments during future tests.

2.11. When hashed files should be used? What are the benefits or using them?

Hashed files are the best way to store data for lookups. They're very fast when looking up the key-value pairs.
Hashed files are especially useful if they store information with data dictionaries (customer details, countries, exchange rates). Stored this way it can be spread across the project and accessed from different jobs.

2.12. How to construct a container and deconstruct it or switch between local and shared?

To construct a container go to Datastage designer, select the stages that would be included in the container and from the main menu select Edit -> Construct Container and choose between local and shared.
Local will be only visible in the current job, and share can be re-used. Shared containers can be viewed and edited in Datastage Manager under 'Routines' menu.
Local Datastage containers can be converted at any time to shared containers in datastage designer by right clicking on the container and selecting 'Convert to Shared'. In the same way it can be converted back to local.

2.13. Corresponding datastage data types to ORACLE types?

Most of the datastage variable types map very well to oracle types. The biggest problem is to map correctly oracle NUMBER(x,y) format.
The best way to do that in Datastage is to convert oracle NUMBER format to Datastage Decimal type and to fill in Length and Scale column accordingly.
There are no problems with string mappings: oracle Varchar2 maps to datastage Varchar, and oracle char to datastage char.

2.14. How to adjust commit interval when loading data to the database?

In earlier versions of datastage the commit interval could be set up in:
General -> Transaction size (in version 7.x it's obsolete)
Starting from Datastage 7.x it can be set up in properties of ODBC or ORACLE stage in Transaction handling -> Rows per transaction.

2.15. What is the use of INROWNUM and OUTROWNUM datastage variables?

@INROWNUM and @OUTROWNUM are internal datastage variables which do the following:
* @INROWNUM counts incoming rows to a transformer in a datastage job
* @OUTROWNUM counts oucoming rows from a transformer in a datastage job
These variables can be used to generate sequences, primary keys, id's, numbering rows and also for debugging and error tracing.
They play similiar role as sequences in Oracle.
If set to 0 the commit will be issued at the end of a successfull transaction.

2.16. Datastage trim function cuts out more characters than expected

By deafult datastage trim function will work this way:
Trim(" a b c d ") will return "a b c d" while in many other programming/scripting languages "a b c d" result would be expected.
That is beacuse by default an R parameter is assumed which is R - Removes leading and trailing occurrences of character, and reduces multiple occurrences to a single occurrence.
To get the "a b c d" as a result use the trim function in the following way: Trim(" a b c d "," ","B")

2.17. Database update actions in ORACLE stage

The destination table can be updated using various Update actions in Oracle stage. Be aware of the fact that it's crucial to select the key columns properly as it will determine which column will appear in the WHERE part of the SQL statement. Available actions:
* Clear the table then insert rows - deletes the contents of the table (DELETE statement) and adds new rows (INSERT).
* Truncate the table then insert rows - deletes the contents of the table (TRUNCATE statement) and adds new rows (INSERT).
* Insert rows without clearing - only adds new rows (INSERT statement).
* Delete existing rows only - deletes matched rows (issues only the DELETE statement).
* Replace existing rows completely - deletes the existing rows (DELETE statement), then adds new rows (INSERT).
* Update existing rows only - updates existing rows (UPDATE statement).
* Update existing rows or insert new rows - updates existing data rows (UPDATE) or adds new rows (INSERT). An UPDATE is issued first and if succeeds the INSERT is ommited.
* Insert new rows or update existing rows - adds new rows (INSERT) or updates existing rows (UPDATE). An INSERT is issued first and if succeeds the UPDATE is ommited.
* User-defined SQL - the data is written using a user-defined SQL statement.
* User-defined SQL file - the data is written using a user-defined SQL statement from a file.

2.18. Use and examples of ICONV and OCONV functions?

ICONV and OCONV functions are quite often used to handle data in Datastage.
ICONV converts a string to an internal storage format and OCONV converts an expression to an output format.
Syntax:
Iconv (string, conversion code)
Oconv(expression, conversion )

Some useful iconv and oconv examples:
Iconv("10/14/06", "D2/") = 14167
Oconv(14167, "D-E") = "14-10-2006"

2.19. ERROR 81021 Calling subroutine DSR_RECORD ACTION=2

Error message:

DataStage Repository Interface:
Error calling subroutine: DSR_RECORD (Action=2);
check DataStage is set up correctly in project
Development (Internal Error (81021))

Datastage system help gives the following error desription:
SYS.HELP. 081021
MESSAGE.. dsrpc: Error writing to Pipe.

The problem appears when a job sequence is used and it contains many stages (usually more than 10) and very often when a network connection is slow.

Basically the cause of a problem is a failure between DataStage client and the server communication.

The solution to the issue is:
# Do not log in to Datastage Designer using 'Omit' option on a login screen. Type in explicitly username and password and a job should compile successfully.
# execute the DS.REINDEX ALL command from the Datastage shell - if the above does not help

Oconv(14167, "D DMY[,A,]") = "14 OCTOBER 2006"
Oconv(12003005, "MD2$,") = "$120,030.05"

That expression formats a number and rounds it to 2 decimal places:
Oconv(L01.TURNOVER_VALUE*100,"MD2")

Iconv and oconv can be combined in one expression to reformat date format easily:
Oconv(Iconv("10/14/06", "D2/"),"D-E") = "14-10-2006"

2.20. How to check Datastage internal error descriptions

# To check the description of a number go to the datastage shell (from administrator or telnet to the server machine) and invoke the following command: SELECT * FROM SYS.MESSAGE WHERE @ID='081021'; - where in that case the number 081021 is an error number

The command will produce a brief error description which probably will not be helpful in resolving an issue but can be a good starting point for further analysis.

2.21. Error timeout waiting for mutex

The error message usually looks like follows:
... ds_ipcgetnext() - timeout waiting for mutex

There may be several reasons for the error and thus solutions to get rid of it.
The error usually appears when using Link Collector, Link Partitioner and Interprocess (IPC) stages. It may also appear when doing a lookup with the use of a hash file or if a job is very complex, with the use of many transformers.

There are a few things to consider to work around the problem:
- increase the buffer size (up to to 1024K) and the Timeout value in the Job properties (on the Performance tab).
- ensure that the key columns in active stages or hashed files are composed of allowed characters – get rid of nulls and try to avoid language specific chars which may cause the problem.
- try to simplify the job as much as possible (especially if it’s very complex). Consider splitting it into two or three smaller jobs, review fetches and lookups and try to optimize them (especially have a look at the SQL statements).

2.22. ERROR 30107 Subroutine failed to complete successfully

Error message:
Error calling subroutine:
DSR_RECORD (Action=2); or *DataStage*DSR_SELECT (Action=7);
check DataStage is set up correctly in project Development
(Subroutine failed to complete successfully(30107))

Datastage system help gives the following error desription:
SYS.HELP. 930107
MESSAGE.. DataStage/SQL: Illegal placement of parameter markers

The problem appears when a project is moved from one project to another (for example when deploying a project from a development environment to production).
The solution to the issue is:
# Rebuild the repository index by executing the DS.REINDEX ALL command from the Datastage shell

2.23. Datastage Designer hangs when editing job activity properties

The appears when running Datastage Designer under Windows XP after installing patches or the Service Pack 2 for Windows.
After opening a job sequence and navigating to the job activity properties window the application freezes and the only way to close it is from the Windows Task Manager.

The solution of the problem is very simple. Just Download and install the “XP SP2 patch” for the Datastage client.
It can be found on the IBM client support site (need to log in):
https://www.ascential.com/eservice/public/welcome.do

Go to the software updates section and select an appropriate patch from the Recommended DataStage patches section.
Sometimes users face problems when trying to log in (for example when the license doesn’t cover the IBM Active Support), then it may be necessary to contact the IBM support which can be reached at WDISupport@us.ibm.com

2.24. Can Datastage use Excel files as a data input?

# Microsoft Excel spreadsheets can be used as a data input in Datastage. Basically there are two possible approaches available: Access Excel file via ODBC - this approach requires creating an ODBC connection to the Excel file on a Datastage server machine and use an ODBC stage in Datastage. The main disadvantage is that it is impossible to do this on an Unix machine. On Datastage servers operating in Windows it can be set up here:
Control Panel -> Administrative Tools -> Data Sources (ODBC) -> User DSN -> Add -> Driver do Microsoft Excel (.xls) -> Provide a Data source name -> Select the workbook -> OK
# Save Excel file as CSV - save data from an excel spreadsheet to a CSV text file and use a sequential stage in Datastage to read the data.

Venkat ♥ Duvvuri

Tuesday, December 21, 2010

Data Transformations

Tuesday, December 7, 2010

DataStage Stages and Jobs

Tuesday, November 23, 2010

DataStage Functions

Functions used in IBM InfoSphere DataStage and QualityStage

Date and time functions

Logical Functions

Mathematical Functions

Null handling functions

Number functions

Raw functions

String functions

Vector function

Type Conversion Functions

Type "casting" functions

Utility functions

Thursday, November 11, 2010

DataStage FAQs and Best Practices

Categories

Labels

Blog Archive

Blog Archive

Popular Posts