Tuesday, December 21, 2010

Data Transformations

Posted by Venkat ♥ Duvvuri 7:49 AM, under | No comments

Data transformation and movement is the process by which source data is selected, converted, and mapped to the format required by targeted systems. The process manipulates data to bring it into compliance with business, domain, and integrity rules and with other data in the target environment. Transformation can take some of the following forms:

Aggregation
Consolidating or summarizing data values into a single value. Collecting daily sales data to be aggregated to the weekly level is a common example of aggregation.

Basic conversion
Ensuring that data types are correctly converted and mapped from source to target columns.

Cleansing
Resolving inconsistencies and fixing the anomalies in source data.

Derivation
Transforming data from multiple sources by using a complex business rule or algorithm.

Enrichment
Combining data from internal or external sources to provide additional meaning to the data.

Normalizing
Reducing the amount of redundant and potentially duplicated data.

Combining
The process of combining data from multiple sources via parallel Lookup, Join, or Merge operations.

Pivoting
Converting records in an input stream to many records in the appropriate table in the data warehouse or data mart.

Sorting
Grouping related records and sequencing data based on data or string values.
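
As a small illustration, some of these forms can be expressed directly as Transformer derivations (the link and column names used here, such as InLink.cust_name, are hypothetical):
Upcase(Trim(InLink.cust_name)) - cleanses and standardizes a name column
StringToDate(Trim(InLink.order_dt), "%yyyy-%mm-%dd") - performs a basic conversion from a character column to a date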

Tuesday, December 7, 2010

DataStage Stages and Jobs

Posted by Venkat ♥ Duvvuri 9:48 PM, under | 1 comment

An IBM InfoSphere DataStage job consists of individual stages linked together, which describe the flow of data from a data source to a data target. A stage usually has at least one data input and/or one data output. However, some stages can accept more than one data input, and output to more than one stage. Each stage has a set of predefined and editable properties that tell it how to perform or process the data. Properties might include the file name for the Sequential File stage, the columns to sort, the transformations to perform, and the database table name for the DB2 stage. These properties are viewed or edited using stage editors. Stages are added to a job and linked together using the Designer.

Stages and links can be grouped in a shared container. Instances of the shared container can then be reused in different parallel jobs. You can also define a local container within a job — this groups stages and links into a single unit, but can only be used within the job in which it is defined. The different types of jobs have different stage types. The stages that are available in the Designer depend on the type of job that is currently open in the Designer. Parallel Job stages are organized into different groups on the Designer palette as follows:
General includes stages such as Container and Link.
Data Quality includes stages such as Investigate, Standardize, Reference Match, and Survive.

Database includes stages such as Classic Federation, DB2 UDB, DB2 UDB/Enterprise, Oracle, Sybase, SQL Server®, Teradata, Distributed Transaction, and ODBC.
Development/Debug includes stages such as Peek, Sample, Head, Tail, and Row Generator.
File includes stages such as Complex Flat File, Data Set, Lookup File Set, and Sequential File.
Processing includes stages such as Aggregator, Copy, FTP, Funnel, Join, Lookup, Merge, Remove Duplicates, Slowly Changing Dimension, Surrogate Key Generator, Sort, and Transformer.
Real Time includes stages such as Web Services Transformer, WebSphere MQ, and Web Services Client.
Restructure includes stages such as Column Export and Column Import.

Tuesday, November 23, 2010

DataStage Functions

Posted by Venkat ♥ Duvvuri 8:45 AM, under | No comments

In its simplest form, IBM InfoSphere DataStage performs data transformation and movement from source systems to target systems in batch and in real time. The data sources might include indexed files, sequential files, relational databases, archives, external data sources, enterprise applications, and message queues.

DataStage manages data that arrives in real time as well as data that is received on a periodic or scheduled basis. It enables companies to solve large-scale business problems with high-performance processing of massive data volumes. By leveraging the parallel processing capabilities of multiprocessor hardware platforms, DataStage can scale to satisfy the demands of ever-growing data volumes, stringent real-time requirements, and ever-shrinking batch windows.

Leveraging the combined suite of IBM Information Server, DataStage can simplify the development of authoritative master data by showing where and how information is stored across source systems. DataStage can also consolidate disparate data into a single, reliable record, cleanse and standardize information, remove duplicates, and link records together across systems. This master record can be loaded into operational data stores, data warehouses, or master data applications such as IBM MDM using IBM InfoSphere DataStage.
IBM InfoSphere DataStage delivers four core capabilities:

* Connectivity to a wide range of mainframe, legacy, and enterprise applications, databases, file formats, and external information sources.

* Prebuilt library of more than 300 functions including data validation rules and very complex transformations.

* Maximum throughput using a parallel, high-performance processing architecture.

* Enterprise-class capabilities for development, deployment, maintenance, and high-availability. It leverages metadata for analysis and maintenance. It also operates in batch, real time, or as a Web service.

IBM InfoSphere DataStage is an integral part of the information integration process.

Functions used in IBM InfoSphere DataStage and QualityStage


The functions that are valid in IBM® InfoSphere® DataStage® and QualityStage are also valid in IBM Information Server FastTrack. The following is a list of functions that are generally used when defining a column derivation in a Transformer stage.

Date and time functions


The following table lists the functions that are available in the Date and Time category (Square brackets indicate an argument is optional):
Name | Description | Arguments | Output
DateFromDaysSince | Returns a date by adding an integer to a baseline date | number (int32), [baseline date] | date
DateFromJulianDay | Returns a date from the given Julian day | juliandate (uint32) | date
DaysSinceFromDate | Returns the number of days from the source date to the given date | source_date, given_date | days since (int32)
HoursFromTime | Returns the hour portion of a time | time | hours (int8)
JulianDayFromDate | Returns the Julian day from the given date | date | julian day (int32)
MicroSecondsFromTime | Returns the microsecond portion from a time | time | microseconds (int32)
MinutesFromTime | Returns the minute portion from a time | time | minutes (int8)
MonthDayFromDate | Returns the day of the month given the date | date | day (int8)
MonthFromDate | Returns the month number given the date | date | month number (int8)
NextWeekdayFromDate | Returns the date of the specified day of the week soonest after the source date | source date, day of week (string) | date
PreviousWeekdayFromDate | Returns the date of the specified day of the week most recent before the source date | source date, day of week (string) | date
SecondsFromTime | Returns the second portion from a time | time | seconds (dfloat)
SecondsSinceFromTimestamp | Returns the number of seconds between two timestamps | timestamp, base timestamp | seconds (dfloat)
TimeDate | Returns the system time and date as a formatted string | - | system time and date (string)
TimeFromMidnightSeconds | Returns the time given the number of seconds since midnight | seconds (dfloat) | time
TimestampFromDateTime | Returns a timestamp from the given date and time | date, time | timestamp
TimestampFromSecondsSince | Returns the timestamp from the number of seconds from the base timestamp | seconds (dfloat), [base timestamp] | timestamp
TimestampFromTimet | Returns a timestamp from the given UNIX time_t value | timet (int32) | timestamp
TimetFromTimestamp | Returns a UNIX time_t value from the given timestamp | timestamp | timet (int32)
WeekdayFromDate | Returns the day number of the week from the given date; origin day optionally specifies the day regarded as the first in the week (Sunday by default) | date, [origin day] | day (int8)
YeardayFromDate | Returns the day number in the year from the given date | date | day (int16)
YearFromDate | Returns the year from the given date | date | year (int16)
YearweekFromDate | Returns the week number in the year from the given date | date | week (int16)
Date, Time, and Timestamp functions that specify dates, times, or timestamps in their arguments use strings with specific formats:
* For a date, the format is %yyyy-%mm-%dd.
* For a time, the format is %hh:%nn:%ss, or, if extended to include microseconds, %hh:%nn:%ss.x where x gives the number of decimal places to which seconds is given.
* For a timestamp, the format is %yyyy-%mm-%dd %hh:%nn:%ss, or, if extended to include microseconds, %yyyy-%mm-%dd %hh:%nn:%ss.x where x gives the number of decimal places to which seconds is given.
This applies to the arguments date, baseline date, given date, time, timestamp, and base timestamp. Functions that have days of the week in their arguments take a string specifying the day of the week; this applies to day of week and origin day.
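
For example, assuming a hypothetical input column InLink.order_date, derivations such as the following could be used:
MonthFromDate(InLink.order_date) - the month number of the order date
DaysSinceFromDate("2010-01-01", InLink.order_date) - days elapsed between 1 January 2010 and the order date
TimestampFromDateTime(InLink.order_date, "00:00:00") - the order date as a midnight timestamp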

Logical Functions


The following table lists the functions available in the Logical category (square brackets indicate an argument is optional):
Name | Description | Arguments | Output
Not | Returns the complement of the logical value of an expression | expression | complement (int8)
BitAnd | Returns the bitwise AND of the two integer arguments | number 1 (uint64), number 2 (uint64) | number (uint64)
BitOr | Returns the bitwise OR of the two integer arguments | number 1 (uint64), number 2 (uint64) | number (uint64)
BitXOr | Returns the bitwise Exclusive OR of the two integer arguments | number 1 (uint64), number 2 (uint64) | number (uint64)
BitExpand | Returns a string containing the binary representation in "1"s and "0"s of the given integer | number (uint64) | string
BitCompress | Returns the integer made from the string argument, which contains a binary representation of "1"s and "0"s | number (string) | number (uint64)
SetBit | Returns an integer with specific bits set to a specific state, where origfield is the input value to perform the action on, bitlist is a string containing a comma-separated list of bit numbers to set the state of, and bitstate is either 1 or 0, indicating which state to set those bits to | origfield (uint64), bitlist (string), bitstate (uint8) | number (uint64)
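
A rough illustration of the bit functions in a derivation, assuming a hypothetical integer column InLink.flags:
BitAnd(InLink.flags, 8) - masks out the bit with value 8
SetBit(InLink.flags, "1,3", 1) - sets bits 1 and 3 to 1
BitExpand(InLink.flags) - the flags as a string of "1"s and "0"s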

Mathematical Functions


The following table lists the functions available in the Mathematical category (square brackets indicate an argument is optional):
Name | Description | Arguments | Output
Abs | Absolute value of any numeric expression | number (int32) | result (dfloat)
Acos | Calculates the trigonometric arc-cosine of an expression | number (dfloat) | result (dfloat)
Asin | Calculates the trigonometric arc-sine of an expression | number (dfloat) | result (dfloat)
Atan | Calculates the trigonometric arc-tangent of an expression | number (dfloat) | result (dfloat)
Ceil | Calculates the smallest dfloat value greater than or equal to the given decimal value | number (decimal) | result (dfloat)
Cos | Calculates the trigonometric cosine of an expression | number (dfloat) | result (dfloat)
Cosh | Calculates the hyperbolic cosine of an expression | number (dfloat) | result (dfloat)
Div | Outputs the whole part of the real division of two real numbers (dividend, divisor) | dividend (dfloat), divisor (dfloat) | result (dfloat)
Exp | Calculates the result of base 'e' raised to the power designated by the value of the expression | number (dfloat) | result (dfloat)
Fabs | Calculates the absolute value of the given value | number (dfloat) | result (dfloat)
Floor | Calculates the largest dfloat value less than or equal to the given decimal value | number (decimal) | result (dfloat)
Ldexp | Calculates a number from an exponent and mantissa | mantissa (dfloat), exponent (int32) | result (dfloat)
Llabs | Returns the absolute value of the given integer | number (uint64) | result (int64)
Ln | Calculates the natural logarithm of an expression in base 'e' | number (dfloat) | result (dfloat)
Log10 | Returns the log to the base 10 of the given value | number (dfloat) | result (dfloat)
Max | Returns the greater of the two argument values | number 1 (int32), number 2 (int32) | result (int32)
Min | Returns the lower of the two argument values | number 1 (int32), number 2 (int32) | result (int32)
Mod | Calculates the modulo (the remainder) of two expressions (dividend, divisor) | dividend (int32), divisor (int32) | result (int32)
Neg | Negates a number | number (dfloat) | result (dfloat)
Pwr | Calculates the value of an expression when raised to a specified power (expression, power) | expression (dfloat), power (dfloat) | result (dfloat)
Rand | Returns a pseudo-random integer between 0 and 2^32-1 | - | result (uint32)
Random | Returns a random number between 0 and 2^32-1 | - | result (uint32)
Sin | Calculates the trigonometric sine of an angle | number (dfloat) | result (dfloat)
Sinh | Calculates the hyperbolic sine of an expression | number (dfloat) | result (dfloat)
Sqrt | Calculates the square root of a number | number (dfloat) | result (dfloat)
Tan | Calculates the trigonometric tangent of an angle | number (dfloat) | result (dfloat)
Tanh | Calculates the hyperbolic tangent of an expression | number (dfloat) | result (dfloat)
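
For example, with hypothetical input columns, derivations such as the following could be used:
Mod(InLink.row_id, 10) - the remainder of row_id divided by 10
Div(InLink.amount, 100) - the whole part of amount divided by 100
Sqrt(InLink.area) - the square root of area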


Null handling functions


The following table lists the functions available in the Null Handling category (square brackets indicate an argument is optional):
Name | Description | Arguments | Output
IsNotNull | Returns true when an expression does not evaluate to the null value | any | true/false (int8)
IsNull | Returns true when an expression evaluates to the null value | any | true/false (int8)
MakeNull | Changes an in-band null to an out-of-band null | any (column), string (string) | -
NullToEmpty | Returns an empty string if the input column is null, otherwise returns the input column value | input column | input column value or empty string
NullToZero | Returns zero if the input column is null, otherwise returns the input column value | input column | input column value or zero
NullToValue | Returns the specified value if the input column is null, otherwise returns the input column value | input column, value | input column value or value
SetNull | Assigns a null value to the target column | - | -
Hint: true = 1, false = 0
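
A short sketch of typical null handling in derivations, assuming hypothetical input columns:
NullToZero(InLink.qty) - replaces a null quantity with 0
NullToValue(InLink.discount, 0.05) - replaces a null discount with 0.05
If IsNull(InLink.region) Then "UNKNOWN" Else InLink.region - substitutes a default string for a null region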

Number functions


The following table lists the functions available in the Number category (square brackets indicate an argument is optional):
Name | Description | Arguments | Output
MantissaFromDecimal | Returns the mantissa from the given decimal | number (decimal) | result (dfloat)
MantissaFromDFloat | Returns the mantissa from the given dfloat | number (dfloat) | result (dfloat)

Raw functions


The following table lists the functions available in the Raw category (square brackets indicate an argument is optional):
Name | Description | Arguments | Output
RawLength | Returns the length of a raw string | input string (raw) | result (int32)

String functions


The following table lists the functions available in the String category (square brackets indicate an argument is optional):
Name | Description | Arguments | Output
AlNum | Returns whether the given string consists of alphanumeric characters | string (string) | true/false (int8)
Alpha | Returns 1 if the string is purely alphabetic | string (string) | result (int8)
CompactWhiteSpace | Returns the string after reducing all consecutive whitespace to a single space | string (string) | result (string)
Compare | Compares two strings for sorting | string1 (string), string2 (string), [justification (L or R)] | result (int8)
CompareNoCase | Case-insensitive comparison of two strings | string1 (string), string2 (string) | result (int8)
CompareNum | Compares the first n characters of the two strings | string1 (string), string2 (string), length (int16) | result (int8)
CompareNumNoCase | Caseless comparison of the first n characters of the two strings | string1 (string), string2 (string), length (int16) | result (int8)
Convert | Converts specified characters in a string to designated replacement characters | fromlist (string), tolist (string), expression (string) | result (string)
Count | Counts the number of times a substring occurs in a string | string (string), substring (string) | result (int32)
Dcount | Counts the number of delimited fields in a string | string (string), delimiter (string) | result (int32)
DownCase | Changes all uppercase letters in a string to lowercase | string (string) | result (string)
DQuote | Encloses a string in double quotation marks | string (string) | result (string)
Field | Returns one or more delimited substrings | string (string), delimiter (string), occurrence (int32), [number (int32)] | result (string)
Index | Finds the starting character position of a substring | string (string), substring (string), occurrence (int32) | result (int32)
Left | Returns the leftmost n characters of a string | string (string), number (int32) | result (string)
Len | Returns the length of a string in characters | string (string) | result (int32)
Num | Returns 1 if the string can be converted to a number | string (string) | result (int8)
PadString | Returns the string padded with the optional pad character and optional length | string (string), padlength (int32) | result (string)
Right | Returns the rightmost n characters of a string | string (string), number (int32) | result (string)
Soundex | Returns a string which identifies a set of words that are (roughly) phonetically alike, based on the standard, open algorithm for SOUNDEX evaluation | string (string) | result (string)
Space | Returns a string of N space characters | length (int32) | result (string)
Squote | Encloses a string in single quotation marks | string (string) | result (string)
Str | Repeats a string | string (string), repeats (int32) | result (string)
StripWhiteSpace | Returns the string after stripping all whitespace from it | string (string) | result (string)
Trim | Removes all leading and trailing spaces and tabs, plus reduces internal occurrences to one | string (string), [stripchar (string)], [options (string)] | result (string)
TrimB | Removes all trailing spaces and tabs | string (string) | result (string)
TrimF | Removes all leading spaces and tabs | string (string) | result (string)
TrimLeadingTrailing | Returns a string with leading and trailing whitespace removed | string (string) | result (string)
Upcase | Changes all lowercase letters in a string to uppercase | string (string) | result (string)
Hint: true = 1, false = 0
Possible options for the Trim function are:
* L Removes leading occurrences of character.
* T Removes trailing occurrences of character.
* B Removes leading and trailing occurrences of character.
* R Removes leading and trailing occurrences of character, and reduces multiple occurrences to a single occurrence.
* A Removes all occurrences of character.
* F Removes leading spaces and tabs.
* E Removes trailing spaces and tabs.
* D Removes leading and trailing spaces and tabs, and reduces multiple spaces and tabs to single ones.
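
A few illustrative derivations using these string functions (the column names are hypothetical):
Field(InLink.address, ",", 2) - the second comma-delimited part of the address
Dcount(InLink.address, ",") - the number of comma-delimited parts
Trim(InLink.code, "0", "L") - the code with leading zeros removed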

Vector function


The following function can be used within expressions to access an element in a vector column. The vector index starts at 0.
Name | Description | Arguments | Output
ElementAt | Accesses an element of a vector | input column, index (int) | element of vector
This can be used as part of, or the whole of, an expression. For example, an expression to add 1 to the third element of a vector input column 'InLink.col1' would be: ElementAt(InLink.col1, 2) + 1

Type Conversion Functions


The following table lists the functions available in the Type Conversion category (square brackets indicate an argument is optional):
Name | Description | Arguments | Output
DateToString | Returns the string representation of the given date | date, [format (string)] | result (string)
DecimalToDecimal | Returns the given decimal in decimal representation with specified precision and scale | decimal (decimal), [rtype (string)], [packedflag (int8)] | result (decimal)
DecimalToDFloat | Returns the given decimal in dfloat representation | number (decimal), ["fix_zero"] | result (dfloat)
DecimalToString | Returns the string representation of the given decimal | number (decimal), ["fix_zero"] | result (string)
DfloatToDecimal | Returns the given dfloat in decimal representation | number (dfloat), [rtype (string)] | result (decimal)
DfloatToStringNoExp | Returns the given dfloat in its string representation with no exponent, using the specified scale | number (dfloat), scale (string) | result (string)
IsValid | Returns whether the given string is valid for the given type. Valid types are "date", "decimal", "dfloat", "sfloat", "int8", "uint8", "int16", "uint16", "int32", "uint32", "int64", "uint64", "raw", "string", "time", "timestamp", "ustring" | type (string), format (string) | result (int8)
StringToDate | Returns a date from the given string in the given format | date (string), format (string) | date
StringToDecimal | Returns the given string in decimal representation | string (string), [rtype (string)] | result (decimal)
StringToRaw | Returns a string in raw representation | string (string) | result (raw)
StringToTime | Returns a time representation of the given string | string (string), [format (string)] | time
StringToTimestamp | Returns a timestamp representation of the given string | string (string), [format (string)] | timestamp
TimestampToDate | Returns a date from the given timestamp | timestamp | date
TimestampToString | Returns the string representation of the given timestamp | timestamp, [format (string)] | result (string)
TimestampToTime | Returns the time from a given timestamp | timestamp | time
TimeToString | Returns the string representation of the given time | time, [format (string)] | result (string)
StringToUstring | Returns a ustring from the given string, optionally using the specified map (otherwise uses the project default) | string (string), [mapname (string)] | result (ustring)
UstringToString | Returns a string from the given ustring, optionally using the specified map (otherwise uses the project default) | string (ustring), [mapname (string)] | result (string)
Rtype: The rtype argument is a string, and should contain one of the following:
* ceil: Round the source field toward positive infinity. For example, 1.4 -> 2, -1.6 -> -1.
* floor: Round the source field toward negative infinity. For example, 1.6 -> 1, -1.4 -> -2.
* round_inf: Round or truncate the source field toward the nearest representable value, breaking ties by rounding positive values toward positive infinity and negative values toward negative infinity. For example, 1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2.
* trunc_zero: Discard any fractional digits to the right of the rightmost fractional digit supported in the destination, regardless of sign. For example, if the destination is an integer, all fractional digits are truncated. If the destination is another decimal with a smaller scale, round or truncate to the scale size of the destination decimal. For example, 1.6 -> 1, -1.6 -> -1.
The default is trunc_zero.
Format string: Date, Time, and Timestamp functions that take a format string (for example, TimeToString(time, stringformat)) need to have the date format specified. The format strings are described in Date and time formats. Where your dates, times, or timestamps convert to or from ustrings, InfoSphere DataStage will pick this up automatically. In these cases the separators in your format string (for example, ':' or '-') can themselves be Unicode characters.
fix_zero: By default, decimal numbers comprising all zeros are treated as invalid. If the string fix_zero is specified as a second argument, then all-zero decimal values are regarded as valid.
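
For example, assuming hypothetical input columns, these notes translate into derivations such as:
StringToTimestamp(InLink.created, "%yyyy-%mm-%dd %hh:%nn:%ss") - parses a character timestamp
DfloatToDecimal(InLink.price, "round_inf") - converts to decimal, rounding to the nearest representable value
DecimalToString(InLink.amount, "fix_zero") - converts to string, treating all-zero decimals as valid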

Type "casting" functions


There is a special class of type conversion functions to help you when performing mathematical calculations using numeric fields. For example, if you have a calculation using an output column of type float derived from an input column of type integer in a Parallel Transformer stage, the result will be derived as an integer regardless of its float type. If you want a non-integral result for a calculation using integral operands, you can use the following functions (which act in a similar way to casting in C) to cast the integer operands into non-integral operands:
Name | Description | Arguments | Output
AsDouble | Treat the given number as a double | number (number) | number (double)
AsFloat | Treat the given number as a float | number (number) | number (float)
AsInteger | Treat the given number as an integer | number (number) | number (int)
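
For example, with hypothetical integer columns qty_sold and num_days, the derivation below returns a non-integral average instead of the truncated integer result that the uncast division would produce:
AsFloat(InLink.qty_sold) / AsFloat(InLink.num_days)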

Utility functions


The following table lists the functions available in the Utility category (square brackets indicate an argument is optional):
Name | Description | Arguments | Output
GetEnvironment | Returns the value of the given environment variable | environment variable (string) | result (string)
NextSKChain | Returns the value of the surrogate key column for the next record in the chain, or value for the newest record | value (number) | result (int64)
NextSurrogateKey | Returns the value of the next surrogate key | None | result (int64)
PrevSKChain | Returns the value of the surrogate key column for the previous record in the chain, or value for the first record | value (number) | result (int64)
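
For example, GetEnvironment("APT_CONFIG_FILE") returns the value of the APT_CONFIG_FILE environment variable (the variable name here is just an example), and NextSurrogateKey() can be used as the derivation of a surrogate key column, assuming the Transformer's surrogate key source has been set up.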



Thursday, November 11, 2010

DataStage FAQs and Best Practices

Posted by Venkat ♥ Duvvuri 8:39 PM, under | 5 comments

1. General Datastage issues

1.1. What are the ways to execute datastage jobs?

A job can be run using a few different methods:
* from Datastage Director (menu Job -> Run now...)
* from command line using a dsjob command
* a Datastage routine can run a job (DSRunJob function)
* by a job sequencer

1.2. How to invoke a Datastage shell command?

Datastage shell commands can be invoked from :
* Datastage administrator (projects tab -> Command)
* Telnet client connected to the datastage server

1.3. How to stop a job when its status is running?

To stop a running job go to DataStage Director and click the stop button (or Job -> Stop from the menu). If it doesn't help, go to Job -> Cleanup Resources, select the process which holds a lock and click Logout.
If it still doesn't help, go to the datastage shell and invoke the following command: ds.tools. It will open an administration panel. Go to 4. Administer processes/locks, then try invoking one of the clear locks commands (options 7-10).

1.4. How to run and schedule a job from command line?

To run a job from command line use a dsjob command
Command Syntax: dsjob [-file <file> <server> | [-server <server>][-user <user>][-password <password>]] <primary command> [<arguments>]
The command can be placed in a batch file and run in a system scheduler.

1.5. How to release a lock held by jobs?

Go to the datastage shell and invoke the following command: ds.tools
It will open an administration panel. Go to 4. Administer processes/locks, then try invoking one of the clear locks commands (options 7-10).

1.6. User privileges for the default DataStage roles?

The role privileges are:
* DataStage Developer - user with full access to all areas of a DataStage project
* DataStage Operator - has privileges to run and manage deployed DataStage jobs
* -none- - no permission to log on to DataStage

1.7. What is a command to analyze hashed file?

There are two ways to analyze a hashed file. Both should be invoked from the datastage command shell. These are:
* FILE.STAT command
* ANALYZE.FILE command

1.8. Is it possible to run two versions of datastage on the same pc?

Yes, even though different versions of Datastage use different system dll libraries.
To dynamically switch between Datastage versions install and run DataStage Multi-Client Manager. That application can unregister and register system libraries used by Datastage.


1.9. How to send notifications from Datastage as a text message (sms) to a cell phone

There are a few possible methods of sending sms messages from Datastage. However, there is no easy way to do this directly from Datastage and all methods described below will require some effort.
The easiest way of doing that from the Datastage standpoint is to configure an SMTP (email) server as a mobile phone gateway. In that case, a Notification Activity can be used to send a message with a job log and any desired details. The DSSendMail before-job or after-job subroutine can also be used to send sms messages.
If configured properly, the recipient's email address will have the following format: 600123456@oursmsgateway.com
If there is no possibility of configuring a mail server to send text messages, you can work around it by using an external application run directly from the operating system. There are plenty of Unix scripts and applications for sending sms messages.
In that solution, you will need to create a batch script which will take care of sending messages and invoke it from Datastage using the ExecDOS or ExecSh subroutines, passing the required parameters (like the phone number and message body).
Please keep in mind that all these solutions may require contacting the local cellphone provider first and, depending on the country, it may not be free of charge and in some cases the provider may not support the capability at all.

2. Datastage development and job design

2.1. Error in Link collector - Stage does not support in-process active-to-active inputs or outputs

To get rid of the error just go to the Job Properties -> Performance and select Enable row buffer.
Then select Inter process which will let the link collector run correctly.
Buffer size set to 128Kb should be fine, however it's a good idea to increase the timeout.

2.2. What is the DataStage equivalent to like option in ORACLE

The following statement in Oracle:
select * from ARTICLES where article_name like '%WHT080%';
Can be written in DataStage (for example as the constraint expression):
incol.empname matches '...WHT080...'


2.3. What is the difference between logging text and final text message in the terminator stage?

Every stage has a 'Logging Text' area on its General tab which logs an informational message when the stage is triggered or started.
* Informational - a green line, a DSLogInfo() type message.
* Final Warning Text - the red fatal message, which is included in the sequence abort message.

2.4. Error in STPstage - SOURCE Procedures must have an output link

The error appears in the Stored Procedure (STP) stage when there are no output links going out of that stage. To get rid of it go to 'stage properties' -> 'Procedure type' and select Transform.

2.5. How to invoke an Oracle PLSQL stored procedure from a server job

To run a pl/sql procedure from Datastage a Stored Procedure (STP) stage can be used.
However it needs a flow of at least one record to run.
It can be designed in the following way:
* source odbc stage which fetches one record from the database and maps it to one column - for example: select sysdate from dual
* A transformer which passes that record through. If required, add pl/sql procedure parameters as columns on the right-hand side of the transformer's mapping
* Put Stored Procedure (STP) stage as a destination. Fill in connection parameters, type in the procedure name and select Transform as procedure type. In the input tab select 'execute procedure for each row' (it will be run once).

2.6. Is it possible to run a server job in parallel?

Yes, even server jobs can be run in parallel.
To do that go to 'Job properties' -> General and check the Allow Multiple Instance button.
The job can now be run simultaneously from one or many sequence jobs. When that happens, datastage will create new entries in Director and the new jobs will be named with an automatically generated suffix (for example, the second instance of a job named JOB_0100 will be named JOB_0100.JOB_0100_2). They can be deleted at any time and will be automatically recreated by datastage on the next run.


2.7. Error in STPstage - STDPROC property required for stage xxx

The error appears in Stored Procedure (STP) stage when the 'Procedure name' field is empty. It occurs even if the Procedure call syntax is filled in correctly.
To get rid of error fill in the 'Procedure name' field.

2.8. Datastage routine to open a text file with error catching

Note! work_dir and file1 are parameters passed to the routine.
* open file1
OPENSEQ work_dir : '\' : file1 TO H.FILE1 THEN
CALL DSLogInfo("******************** File " : file1 : " opened successfully", "JobControl")
END ELSE
CALL DSLogInfo("Unable to open file", "JobControl")
ABORT
END

2.9. Datastage routine which reads the first line from a text file

Note! work_dir and file1 are parameters passed to the routine.

* open file1
OPENSEQ work_dir : '\' : file1 TO H.FILE1 THEN
CALL DSLogInfo("******************** File " : file1 : " opened successfully", "JobControl")
END ELSE
CALL DSLogInfo("Unable to open file", "JobControl")
ABORT
END

READSEQ FILE1.RECORD FROM H.FILE1 ELSE
Call DSLogWarn("******************** File is empty", "JobControl")
END

firstline = Trim(FILE1.RECORD[1,32]," ","A") ;* read the first 32 chars
Call DSLogInfo("******************** Record read: " : firstline, "JobControl")
CLOSESEQ H.FILE1

END

2.10. How to test a datastage routine or transform?

To test a datastage routine or transform go to the Datastage Manager.
Navigate to Routines, select a routine you want to test and open it. First compile it and then click 'Test...' which will open a new window. Enter test parameters in the left-hand side column and click run all to see the results.
Datastage will remember all the test arguments during future tests.

2.11. When hashed files should be used? What are the benefits or using them?

Hashed files are the best way to store data for lookups. They're very fast when looking up key-value pairs.
Hashed files are especially useful for storing dictionary-like reference data (customer details, countries, exchange rates). Stored this way, the data can be shared across the project and accessed from different jobs.


2.12. How to construct a container and deconstruct it or switch between local and shared?

To construct a container go to Datastage designer, select the stages that would be included in the container and from the main menu select Edit -> Construct Container and choose between local and shared.
A local container will only be visible in the current job, while a shared container can be re-used. Shared containers can be viewed and edited in Datastage Manager under the 'Routines' menu.
Local Datastage containers can be converted at any time to shared containers in datastage designer by right clicking on the container and selecting 'Convert to Shared'. In the same way it can be converted back to local.

2.13. Corresponding datastage data types to ORACLE types?

Most of the datastage variable types map very well to oracle types. The biggest problem is mapping the oracle NUMBER(x,y) format correctly.
The best way to do that in Datastage is to convert oracle NUMBER format to Datastage Decimal type and to fill in Length and Scale column accordingly.
There are no problems with string mappings: oracle Varchar2 maps to datastage Varchar, and oracle char to datastage char.

2.14. How to adjust commit interval when loading data to the database?

In earlier versions of datastage the commit interval could be set up in:
General -> Transaction size (in version 7.x it's obsolete)
Starting from Datastage 7.x it can be set up in the properties of the ODBC or ORACLE stage in Transaction handling -> Rows per transaction. If Rows per transaction is set to 0, the commit will be issued at the end of a successful transaction.

2.15. What is the use of INROWNUM and OUTROWNUM datastage variables?

@INROWNUM and @OUTROWNUM are internal datastage variables which do the following:
* @INROWNUM counts incoming rows to a transformer in a datastage job
* @OUTROWNUM counts outgoing rows from a transformer in a datastage job
These variables can be used to generate sequences, primary keys, IDs, row numbers, and also for debugging and error tracing.
They play a similar role as sequences in Oracle.
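For example (a sketch only), a derivation of @OUTROWNUM can be used to number the rows written to an output link, and a constraint such as @INROWNUM <= 100 can be used during debugging to process only the first 100 input rows.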

2.16. Datastage trim function cuts out more characters than expected

By default the datastage Trim function works this way:
Trim(" a b c d ") will return "a b c d" with the internal whitespace reduced to single spaces, while in many other programming/scripting languages only the leading and trailing spaces would be removed.
That is because by default an R parameter is assumed, which means: removes leading and trailing occurrences of the character, and reduces multiple occurrences to a single occurrence.
To remove only the leading and trailing spaces, use the trim function in the following way: Trim(" a b c d ", " ", "B")

2.17. Database update actions in ORACLE stage

The destination table can be updated using various Update actions in Oracle stage. Be aware of the fact that it's crucial to select the key columns properly as it will determine which column will appear in the WHERE part of the SQL statement. Available actions:
* Clear the table then insert rows - deletes the contents of the table (DELETE statement) and adds new rows (INSERT).
* Truncate the table then insert rows - deletes the contents of the table (TRUNCATE statement) and adds new rows (INSERT).
* Insert rows without clearing - only adds new rows (INSERT statement).
* Delete existing rows only - deletes matched rows (issues only the DELETE statement).
* Replace existing rows completely - deletes the existing rows (DELETE statement), then adds new rows (INSERT).
* Update existing rows only - updates existing rows (UPDATE statement).
* Update existing rows or insert new rows - updates existing data rows (UPDATE) or adds new rows (INSERT). An UPDATE is issued first and if it succeeds the INSERT is omitted.
* Insert new rows or update existing rows - adds new rows (INSERT) or updates existing rows (UPDATE). An INSERT is issued first and if it succeeds the UPDATE is omitted.
* User-defined SQL - the data is written using a user-defined SQL statement.
* User-defined SQL file - the data is written using a user-defined SQL statement from a file.

2.18. Use and examples of ICONV and OCONV functions?

ICONV and OCONV functions are quite often used to handle data in Datastage.
ICONV converts a string to an internal storage format and OCONV converts an expression to an output format.
Syntax:
Iconv (string, conversion code)
Oconv (expression, conversion code)

Some useful Iconv and Oconv examples:
Iconv("10/14/06", "D2/") = 14167
Oconv(14167, "D-E") = "14-10-2006"
Oconv(14167, "D DMY[,A,]") = "14 OCTOBER 2006"
Oconv(12003005, "MD2$,") = "$120,030.05"

This expression formats a number and rounds it to 2 decimal places:
Oconv(L01.TURNOVER_VALUE*100, "MD2")

Iconv and Oconv can be combined in one expression to reformat a date format easily:
Oconv(Iconv("10/14/06", "D2/"), "D-E") = "14-10-2006"

2.19. ERROR 81021 Calling subroutine DSR_RECORD ACTION=2

Error message:

DataStage Repository Interface:
Error calling subroutine: DSR_RECORD (Action=2);
check DataStage is set up correctly in project
Development (Internal Error (81021))

Datastage system help gives the following error description:
SYS.HELP. 081021
MESSAGE.. dsrpc: Error writing to Pipe.


The problem appears when a job sequence is used and it contains many stages (usually more than 10) and very often when a network connection is slow.

Basically the cause of the problem is a communication failure between the DataStage client and the server.

The solution to the issue is:
# Do not log in to Datastage Designer using the 'Omit' option on the login screen. Explicitly type in the username and password and the job should compile successfully.
# execute the DS.REINDEX ALL command from the Datastage shell - if the above does not help



2.20. How to check Datastage internal error descriptions


# To check the description of a number go to the datastage shell (from administrator or telnet to the server machine) and invoke the following command: SELECT * FROM SYS.MESSAGE WHERE @ID='081021'; - where in that case the number 081021 is an error number

The command will produce a brief error description which probably will not be helpful in resolving an issue but can be a good starting point for further analysis.

2.21. Error timeout waiting for mutex


The error message usually looks like follows:
... ds_ipcgetnext() - timeout waiting for mutex

There may be several reasons for the error and thus solutions to get rid of it.
The error usually appears when using Link Collector, Link Partitioner and Interprocess (IPC) stages. It may also appear when doing a lookup with the use of a hash file or if a job is very complex, with the use of many transformers.

There are a few things to consider to work around the problem:
- increase the buffer size (up to 1024K) and the Timeout value in the Job properties (on the Performance tab).
- ensure that the key columns in active stages or hashed files are composed of allowed characters – get rid of nulls and try to avoid language specific chars which may cause the problem.
- try to simplify the job as much as possible (especially if it’s very complex). Consider splitting it into two or three smaller jobs, review fetches and lookups and try to optimize them (especially have a look at the SQL statements).

2.22. ERROR 30107 Subroutine failed to complete successfully

Error message:
Error calling subroutine:
DSR_RECORD (Action=2); or *DataStage*DSR_SELECT (Action=7);
check DataStage is set up correctly in project Development
(Subroutine failed to complete successfully(30107))

Datastage system help gives the following error description:
SYS.HELP. 930107
MESSAGE.. DataStage/SQL: Illegal placement of parameter markers

The problem appears when a job is moved from one project to another (for example when deploying a project from a development environment to production).
The solution to the issue is:
# Rebuild the repository index by executing the DS.REINDEX ALL command from the Datastage shell

2.23. Datastage Designer hangs when editing job activity properties

The problem appears when running Datastage Designer under Windows XP after installing patches or Service Pack 2 for Windows.
After opening a job sequence and navigating to the job activity properties window the application freezes and the only way to close it is from the Windows Task Manager.

The solution to the problem is very simple. Just download and install the “XP SP2 patch” for the Datastage client.
It can be found on the IBM client support site (need to log in):
https://www.ascential.com/eservice/public/welcome.do

Go to the software updates section and select an appropriate patch from the Recommended DataStage patches section.
Sometimes users face problems when trying to log in (for example when the license doesn’t cover the IBM Active Support), then it may be necessary to contact the IBM support which can be reached at WDISupport@us.ibm.com

2.24. Can Datastage use Excel files as a data input?

# Microsoft Excel spreadsheets can be used as a data input in Datastage. Basically there are two possible approaches available: Access the Excel file via ODBC - this approach requires creating an ODBC connection to the Excel file on the Datastage server machine and using an ODBC stage in Datastage. The main disadvantage is that it is impossible to do this on a Unix machine. On Datastage servers operating in Windows it can be set up here:
Control Panel -> Administrative Tools -> Data Sources (ODBC) -> User DSN -> Add -> Driver do Microsoft Excel (.xls) -> Provide a Data source name -> Select the workbook -> OK
# Save Excel file as CSV - save data from an excel spreadsheet to a CSV text file and use a sequential stage in Datastage to read the data.

Wednesday, October 27, 2010

Interview Questions

Posted by Venkat ♥ Duvvuri 8:38 AM, under | 8 comments

1. How to remove duplicate rows?
We have various ways to do the same.
ServerJobs
-Using Hash File.
-Using Stage variable
Detailed description as below: we have to define two stage variables, for example stgV1 and stgV2.
Derivation...............Stage Variable
-----------------------------------------
stgV1......................stgV2
InputKeyColumn(Id).....stgV1

We have to write constraints like the ones below:
if stgV1 <> stgV2 -------------> this stream produces unique records
if stgV1 = stgV2 -------------> this stream produces duplicate records
(A sketch of this stage-variable approach is given at the end of this answer.)
Parallel Jobs
-Using Stage Variables
-Using Remove Duplicates Stage
-Using Perform sort option
[specifically, you need to select the Unique option] under the Partitioning tab
-Using SortStage

The duplicates can be eliminated by loading the corresponding data into a hash file. Specify the columns on which you want to eliminate duplicates as the keys of the hash file.
You can also delete duplicate records at the source itself using a user-defined query instead of the table read option, you can use the Remove Duplicates stage, or you can use a hash file as the source, based on the hash key.
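
A minimal sketch of the stage-variable approach described above, assuming the input link InLink is sorted on the key column Id and the stage variables are defined in this order (svIsDup above svPrevId), so that svPrevId still holds the previous row's key when svIsDup is evaluated:
svIsDup.......If InLink.Id = svPrevId Then 1 Else 0
svPrevId......InLink.Id
Constraint on the unique output link: svIsDup = 0
Constraint on the duplicates output link: svIsDup = 1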


2. What are the Repository Tables in Data Stage and what are they?
A data warehouse is a repository (centralized as well as distributed) of data, able to answer any ad hoc, analytical, historical or complex queries. Metadata is data about data. Examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions, and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries, and navigation services. Metadata includes things like the name, length, valid values, and description of a data element. Metadata is stored in a data dictionary and repository. It insulates the data warehouse from changes in the schema of operational systems. In Data Stage I/O and Transfer, under the Interface tab (Input, Output and Transfer pages) you will have 4 tabs, and the last one is Build; under that you can find the TABLE NAME. The Data Stage client components are:
Administrator - administers Data Stage projects and conducts housekeeping on the server
Designer - creates Data Stage jobs that are compiled into executable programs
Director - used to run and monitor the Data Stage jobs
Manager - allows you to view and edit the contents of the repository

3. Differentiate Database data and Data warehouse data?
By Database, one means OLTP (On Line Transaction Processing). This can be the source systems or the ODS (Operational Data Store), which contains the transactional data.
Database data is only for transactional purposes, while data warehouse data is only for analytical purposes, for decision making.
A database contains detailed data; a data warehouse contains summary data.

4. If a Data Stage job aborts after say 1000 records, how to continue the job from the 1000th record after fixing the error?
By specifying checkpointing in the job sequence properties. If we restart the job, it will start by skipping the records up to the failed record. This option is available in the 7.5 edition.

5. What is merge and how it can be done please explain with simple example taking 2 tables.......?
Merge is used to join two tables. It takes the key columns and sorts them in ascending or descending order. Let us consider two tables, i.e. Emp and Dept. If we want to join these two tables, we have DeptNo as a common key, so we can give that column name as the key, sort DeptNo in ascending order, and join the two tables.

6. It is possible to call one job in another job in server jobs?
Yes, we can call a job from another job. In fact 'calling' doesn't sound quite right, because you attach/add the other job through the job properties. You can attach zero or more jobs.
Steps will be Edit --> Job Properties --> Job Control
Click on Add Job and select the desired job.

7. What are the environment variables in data stage? Give some examples?
These are variables used at the project or job level. We can use them to configure the job, i.e. we can associate the configuration file (without this you cannot run your job), or increase the sequential or dataset read/write buffer.
ex: $APT_CONFIG_FILE
Like above we have so many environment variables. Please go to job properties and click on "add environment variable" to see most of the environment variables.

8. What happens if RCP is disable ?
Runtime column propagation (RCP): If RCP is enabled for any job, and specifically for those stages whose output connects to the shared container input, then Meta data will be propagated at run time, so there is no need to map it at design time.
If RCP is disabled for the job, in such a case OSH has to perform import and export operations every time the job runs, and the processing time of the job also increases.

9. How can I extract data from DB2 (on IBM iSeries) to the data warehouse via Data stage as the ETL tool? I mean do I first need to use ODBC to create connectivity and use an adapter for the extraction and transformation of data?
You would need to install ODBC drivers to connect to the DB2 instance (they do not come with the regular drivers that we try to install; use the CD provided for the DB2 installation, which has the ODBC drivers to connect to DB2) and then try it out.

10. What is the meaning of "Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate the unnecessary records even getting in before joins are made"?
This means: try to improve performance by avoiding the use of constraints wherever possible and instead filtering while selecting the data itself, using a where clause. This improves performance.

11. What does a config File in parallel extender consist of?
Config file consists of the following.
a) Number of Processes or Nodes.
b) Actual Disk Storage Location.

The APT configuration file holds the resource disk, node pool, scratch and node information. It contains how many nodes are given to run the jobs, because based on those nodes data stage will create processes at the back end while running the jobs. The resource disk is the place where the job data will be loaded, and the scratch information is useful whenever we use lookups in the jobs.
APT Configuration file is the configuration file which defines the nodes for the specific project and actual disk storage.

12. Can we use shared container as lookup in data stage server jobs?
I am using Data Stage 7.5 on UNIX. We can use a shared container more than one time in a job. Is there any limit to its use? In my job I used the shared container in 6 flows, but at any time only 2 flows are working - can you please share info on this?
No, we can't use a container as a lookup.
A container is used to reduce a complex view to a simple view of stages or jobs, so there will be only stages available inside and not a source table or anything.

13. How to find the number of rows in a sequential file?
Using Row Count System variable
Use the aggregator stage to use the count record facility

14. What is the difference between validated ok and compiled in data stage?
When we say "Validating a Job", we are talking about running the Job in the "check only" mode. The following checks are made:
- Connections are made to the data sources or data warehouse.
- SQL SELECT statements are prepared.
- Files are opened. Intermediate files in Hashed File, Universe, or ODBC stages that use the local data source are created, if they do not already exist.

15. What are the differences between the data stage 7.0 and 7.5 in server jobs?
There are lots of differences: many new stages are available in DS 7.5, for example the CDC stage, the Stored Procedure stage, etc.

16. Is it possible to run parallel jobs in server jobs?
No, it is not possible to run Parallel jobs in server jobs. But Server jobs can be executed in Parallel jobs

17. What is DS Administrator used for - did u use it?
The Administrator enables you to set up Data Stage users, control the purging of the Repository, and, if National Language Support (NLS) is enabled, install and manage maps and locales.

18. What is Modulus and Splitting in Dynamic Hashed File?
The modulus size can be increased by contacting your Unix Admin.
In a Hashed File, the size of the file keeps changing randomly.
If the size of the file increases it is called as "Modulus".
If the size of the file decreases it is called as "Splitting".
The modulus size can be increased or decreased by contacting your Data Stage Admin.

19. How can ETL access excel file to Data mart?
Open the ODBC Data Source Administrator found in the control panel/administrative tools. Under the system DSN tab, add the Driver to Microsoft Excel. Then you will be able to access the XLS file from Data stage.

20.What is the exact difference between Join, Merge and Lookup Stage?
The three stages differ mainly in the memory they use. Data Stage doesn't know how large your data is, so it cannot make an informed choice whether to combine data using a Join stage or a Lookup stage. Here's how to decide which to use:
If the reference datasets are big enough to cause trouble, use a join. A join does a high-speed sort on the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all highly optimized and sequential. Once the sort is over, the join processing is very fast and never involves paging or other I/O.
Unlike Join stages and Lookup stages, the Merge stage allows you to specify several reject links, as many as there are update input links.
The three stages differs with each other with respect to
1. Input Requirements
2. Treatment of unmatched records
3. Memory Usage
Join requires less memory usage
Lookup requires more memory usage and
Merge requires less memory usage

21. What about System variables?
Data Stage provides a set of variables containing useful system information that you can access from a transform or routine. System variables are read-only.
@DATE The internal date when the program started. See the Date function.
@DAY The day of the month extracted from the value in @DATE.
@FALSE The compiler replaces the value with 0.
@FM A field mark, Char(254).
@IM An item mark, Char(255).
@INROWNUM Input row counter. For use in constraints and derivations in Transformer stages.
@OUTROWNUM Output row counter (per link). For use in derivations in Transformer stages.
@LOGNAME The user login name.
@MONTH The current month extracted from the value in @DATE.
@NULL The null value.
@NULL.STR The internal representation of the null value, Char(128).
@PATH The pathname of the current Data Stage project.
@SCHEMA The schema name of the current Data Stage project.
@SM A sub value mark (a delimiter used in Universe files), Char(252).
@SYSTEM.RETURN.CODE Status codes returned by system processes or commands.
@TIME The internal time when the program started. See the Time function.
@TM A text mark (a delimiter used in Universe files), Char(251).
@TRUE The compiler replaces the value with 1.
@USERNO The user number.
@VM A value mark (a delimiter used in Universe files), Char(253).
@WHO The name of the current Data Stage project directory.
@YEAR The current year extracted from @DATE.
REJECTED Can be used in the constraint expression of a Transformer stage of an output link. REJECTED is initially TRUE, but is set to FALSE whenever an output link is successfully written.

22. Importance of Surrogate Key in Data warehousing?
The concept of a surrogate key comes into play when there is a slowly changing dimension in a table. In such a condition there is a need for a key by which we can identify the changes made in the dimensions. These slowly changing dimensions can be of three types, namely SCD1, SCD2 and SCD3. Surrogate keys are system-generated keys; mainly they are just sequences of numbers, but they can also be alphanumeric values.

23. How to handle the rejected rows in data stage?
We can handle rejected rows in two ways, with the help of constraints in a Transformer: 1) by turning on the Reject Row cell in the constraints area of the Transformer properties, or 2) by using REJECTED in the expression editor of the constraint. Create a hash file as temporary storage for rejected rows, create a link and use it as one of the outputs of the transformer, and apply either of the two steps above to that link. All the rows which are rejected by all the constraints will go to the hash file.
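A minimal sketch, assuming a Transformer with one main output link, one reject link and a hypothetical input column InLink.amount:
Constraint on the main output link: IsNotNull(InLink.amount) And Num(InLink.amount)
Constraint on the reject link: tick the Reject Row check box, or use the REJECTED variable as the constraint expression, so that rows failing the other constraints are written there.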

24. How can you do incremental load in data stage?
Incremental load means daily load. Whenever you are selecting data from the source, select the records which are loaded or updated between the timestamp of the last successful load and today's load start date and time. For this you have to pass parameters for those two dates. Store the last run date and time in a file, read it through the job parameters, and set the second argument to the current date and time.

25. How do we do the automation of data stage jobs?
We can call Data stage Batch Job from Command prompt using 'dsjob'. We can also pass all the parameters from command prompt. Then call this shell script in any of the market available schedulers.
The 2nd option is schedule these jobs using Data Stage director.

26. Can anyone tell me how to extract data from more than 1 heterogeneous Sources? Mean, example 1 sequential file, Sybase, Oracle in a single Job?
Yes, you can extract data from two or more heterogeneous sources in DataStage within a single job by using the Transformer stage: simply link each source stage into the Transformer where the streams are brought together.

27. What is merge? And how to use merge?
Merge is a stage that is available in both parallel and server jobs.
The Merge stage is used to join two tables (server/parallel) or two tables/datasets (parallel). Merge requires that the master table/dataset and the update table/dataset be sorted. Merge is performed on a key field, and the key field is mandatory in both the master and the update dataset/table. Merge combines the two tables, but master and update records are merged only when both have the same values in the merge key columns.

28. What's the difference between Data stage Developers and Data stage Designers. What are the skills required for this?
A DataStage developer is the one who codes the jobs. A DataStage designer is the one who designs the jobs; that is, he deals with the blueprints and decides which stages are required to develop the code.

29.What is the OCI? And how to use the ETL Tools?
OCI does not mean Orabulk data. It actually uses the Oracle Call Interface of Oracle to load the data; it is more or less the lowest level of Oracle being used for loading data.

30.What are the different types of lookups in data stage?
There are two lookup-related stages: the Lookup stage and the Lookup File Set stage.
Lookup: references another stage or a database to get the data from it and passes it on to the next stage.
Lookup File Set: it allows you to create a lookup file set or reference one for a lookup. The stage can have a single input link or a single output link. The output link must be a reference link. The stage can be configured to execute in parallel or sequential mode when used with an input link. When creating lookup file sets, one file will be created for each partition. The individual files are referenced by a single descriptor file, which by convention has a standard suffix.
There are two types of lookups
1. Normal look up
2. Sparse look up
Normal look up: to perform this lookup, the reference data is loaded into memory first and then the lookup is performed, due to which it takes more execution time.

Sparse look up: the SQL query is fired directly on the database for the related record, due to which execution is faster than a normal lookup. For a single job we can use only one sparse lookup against a file, whereas any number of normal lookups can be used. But once a file is set to do a sparse lookup, it will not accept a normal lookup after that, so make sure to place all your normal lookups first and the sparse lookup last.

31. How to implement type2 slowly changing dimension in data stage? Give me with example?
A slowly changing dimension is a common problem in data warehousing. For example: there exists a customer called Lisa in a company ABC, and she lives in New York. Later she moves to Florida, and the company must modify her address now. In general there are 3 ways to solve this problem.
Type 1: The new record replaces the original record, no trace of the old record at all, Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two different people.
Type 3: The original record is modified to reflect the changes.
In Type 1 the new record overwrites the existing one, which means no history is maintained; the history of where the person stayed last is lost. It is simple to use.
In Type 2 a new record is added, so both the original and the new record will be present, and the new record gets its own primary key. The advantage of Type 2 is that historical information is maintained, but the size of the dimension table grows, and storage and performance can become a concern.
Type2 should only be used if it is necessary for the data warehouse to track the historical changes.
In Type 3 there will be two columns, one to indicate the original value and the other to indicate the current value. For example, a new column will be added which shows the original address as New York and the current address as Florida. This helps in keeping some part of the history, and the table size is not increased. One problem is that when the customer moves from Florida to Texas, the New York information is lost. So Type 3 should only be used if the changes will occur only a finite number of times.

32.What is DS Manager used for - did u use it?
The Manager is a graphical tool that enables you to view and manage the contents of the Data Stage Repository

33. What are other Performance tunings you have done in your last project to increase the performance of slowly running jobs?
Minimize the usage of the Transformer (instead use Copy, Modify, Filter, or Row Generator).
Use SQL code while extracting the data.
Handle the nulls
Minimize the warnings
Reduce the number of lookups in a job design
Use not more than 20 stages in a job.
Use an IPC stage between two passive stages; it reduces processing time.
Drop indexes before data loading and recreate after loading data into tables
Generally we cannot avoid lookups if the requirement makes them compulsory.
There is no hard limit on the number of stages, such as 20 or 30, but we can break the job into smaller jobs and then use Data Set stages to pass the data between them.
The IPC stage is provided in server jobs, not in parallel jobs.
Check the write cache of the hash file. If the same hash file is used for lookup as well as the target, disable this option.
If the hash file is used only for lookup then "enable Preload to memory". This will improve the performance. Also, check the order of execution of the routines.
Don't use more than 7 lookups in the same transformer; introduce new transformers if it exceeds 7 lookups.
Use Preload to memory option in the hash file output.
Use Write to cache in the hash file input.
Write into the error tables only after all the transformer stages.
Reduce the width of the input record - remove the columns that you would not use.
Cache the hash files you are reading from and writing into. Make sure your cache is big enough to hold the hash files.
Use ANALYZE.FILE or HASH.HELP to determine the optimal settings for your hash files.
This would also minimize overflow on the hash file.
If possible, break the input into multiple threads and run multiple instances of the job.
Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server using Hash/Sequential files for optimum performance also for data recovery in case job aborts.
Tuned the OCI stage for 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.
Tuned the project tunables in the Administrator for better performance.
Used sorted data for Aggregator.
Sorted the data as much as possible in DB and reduced the use of DS-Sort for better performance of jobs
Removed the data not used from the source as early as possible in the job.
Worked with DB-admin to create appropriate Indexes on tables for better performance of DS queries
Converted some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs.
If an input file has an excessive number of rows and can be split-up then use standard logic to run jobs in parallel.
Before writing a routine or a transform, make sure that the required functionality is not already available in one of the standard routines supplied in the SDK or DS utilities categories.
Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros but if it is inline code then the overhead will be minimal.
Try to have the constraints in the 'Selection' criteria of the source queries themselves. This eliminates unnecessary records before they even get into the joins.
Tuning should occur on a job-by-job basis.
Use the power of DBMS.
Try not to use a sort stage when you can use an ORDER BY clause in the database.

34. Is it true that using a constraint to filter a record set is much slower than performing a SELECT ... WHERE ...?
Yes; push the filtering into the database query wherever possible. Also, make every attempt to use the bulk loader for your particular database: bulk loaders are generally faster than using ODBC or OLE.

35. What is the order of execution done internally in the transformer with the stage editor having input links on the left hand side and output links?
Stage variables first, then constraints, and then the column derivations or expressions.

36. What is the difference between In-process and Inter-process?
In-process
You can improve the performance of most Data Stage jobs by turning in-process row buffering on and recompiling the job. This allows connected active stages to pass data via buffers rather than row by row.
Note: You cannot use in-process row-buffering if your job uses COMMON blocks in transform functions to pass data between stages. This is not recommended practice, and it is advisable to redesign your job to use row buffering rather than COMMON blocks.

Inter-process
Use this if you are running server jobs on an SMP parallel system. This enables the job to run using a separate process for each active stage, which will run simultaneously on a separate processor.
Note: You cannot use inter-process row-buffering if your job uses COMMON blocks in transform functions to pass data between stages. This is not recommended practice, and it is advisable to redesign your job to use row buffering rather than COMMON blocks.

37.What is DS Designer used for - did u use it?
You use the Designer to build jobs by creating a visual design that models the flow and transformation of data from the data source through to the target warehouse. The Designer graphical interface lets you select stage icons, drop them onto the Designer work area, and add links.

38. How to handle Date conventions in Data stage? Convert a mm/dd/yyyy format to yyyy-dd-mm?
Function to convert mm/dd/yyyy format to yyyy-dd-mm is
Oconv(Iconv(FieldName, "D/MDY[2,2,4]"), "D-YDM[4,2,2]")
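To make the behaviour concrete (the literal value is just an example):

   Iconv("12/21/2010", "D/MDY[2,2,4]")                          ;* parse mm/dd/yyyy into the internal date format
   Oconv(Iconv("12/21/2010", "D/MDY[2,2,4]"), "D-YDM[4,2,2]")   ;* format it back out as 2010-21-12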

39. If data is partitioned in your job on key 1 and then you aggregate on key 2, what issues could arise?
If the data is partitioned on key 1 but aggregated on key 2, rows that share the same value of key 2 can end up in different partitions, so each partition produces only partial aggregates and the results can be incorrect; an extra repartition (and sort) on key 2 is needed before the Aggregator, which also adds execution time.

40. What is purpose of using the key and difference between Surrogate keys and natural key?
We use keys to provide relationships between the entities (Tables). By using primary and foreign key relationship, we can maintain integrity of the data.
The natural key is the one coming from the OLTP system.
The surrogate key is the artificial key which we are going to create in the target DW. We can use these surrogate keys instead of using natural key. In the SCD2 scenarios surrogate keys play a major role.
Natural key: a key that is formed of attributes that already exist in the real world, e.g. an SSN.

Surrogate key: it can be thought of as a replacement of the natural key that has no business meaning. E.g. sequence in oracle.
I think there is a bug in the data stage 8.0.1 surrogate key generator. It seems to be very much inconsistent.

41. Is it possible to move data from an Oracle warehouse to an SAP warehouse using the DataStage tool?
We can use the DataStage Extract Pack for SAP R/3 and the DataStage Load Pack for SAP BW to transfer the data from Oracle to the SAP warehouse. These plug-in packs are available with DataStage version 7.5.

42. How to kill the job in data stage?
By using the Cleanup Resources option in the Director, or by killing the respective process ID.

43. How we can call the routine in data stage job? Explain with steps?
Routines are used for implementing business logic. There are two types: 1) before subroutines and 2) after subroutines. Steps: double-click the Transformer stage, right-click on any one of the mapping fields and select the [DS Routines] option; within the edit window provide the business logic and select either of the options (Before / After Subroutine).

44. What are the most important aspects that a beginner must consider while doing his first DS project?
He should be good at data warehousing concepts and should be familiar with all the stages.

45. How I can convert Server Jobs into Parallel Jobs?
I have never tried doing this; however, I have some information which will help you save a lot of time. You can convert your server job into a server shared container. The server shared container can then be used in parallel jobs as a shared container.
Apart from that, I don't think a direct conversion is possible.
Using the IPC stage or the Link Partitioner/Link Collector stages, you can incorporate a degree of parallelism into server jobs.
My requirement is like this:
Here is the codification suggested:
SALE_HEADER_XXXXX_YYYYMMDD.PSV
SALE_LINE_XXXXX_YYYYMMDD.PSV

XXXXX = LVM sequence to ensure uniqueness and continuity of file exchanges
Caution: there will be an increment to implement.
YYYYMMDD = LVM date of file creation

COMPRESSION AND DELIVERY TO: SALE_HEADER_XXXXX_YYYYMMDD.ZIP AND SALE_LINE_XXXXX_YYYYMMDD.ZIP
If we run that job the target file names are like this
sale_header_1_20060206 & sale_line_1_20060206.

If we run next time means the target files we like this sale_header_2_20060206 & sale_line_2_20060206.
If we run the same in next day means the target files we want like this
sale_header_3_20060306 & sale_line_3_20060306.
i.e., whenever we run the same job, the target file names should automatically change to
filename_(previous number + 1)_currentdate.
This can be done by using a UNIX script:
1. Keep the target file name as a constant name, xxx.psv.
2. Once the job has completed, invoke the UNIX script through an after-job routine (ExecSH).
3. The script should get the number used in the previous file and increment it by 1, then move the file from xxx.psv to filename_(previous number + 1)_currentdate.psv and delete xxx.psv. This is the easiest way to implement it (a rough sketch of the same idea follows below).
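Here is a rough sketch of that idea as an after-job routine written in DataStage BASIC; the counter file, directory and file names are all invented for the illustration, and an equivalent shell script would work just as well:

   * Hypothetical after-job routine: rename xxx.psv to SALE_HEADER_n_yyyymmdd.PSV
   CounterFile = '/data/out/sale_counter.txt'      ;* invented control file holding the last sequence number
   NewSeq = 1
   OpenSeq CounterFile To F Then
      ReadSeq LastSeq From F Then NewSeq = LastSeq + 1
      CloseSeq F
   End Else
      Null                                         ;* first run: no counter file yet
   End
   DateStamp = @YEAR : Fmt(@MONTH, 'R%2') : Fmt(@DAY, 'R%2')   ;* e.g. 20060206
   NewName = 'SALE_HEADER_' : NewSeq : '_' : DateStamp : '.PSV'
   Cmd  = 'echo ' : NewSeq : ' > ' : CounterFile                ;* remember the number for the next run
   Cmd := ' && mv /data/out/xxx.psv /data/out/' : NewName
   Call DSExecute('UNIX', Cmd, Output, RetCode)
   If RetCode <> 0 Then Call DSLogWarn('Rename failed: ' : Output, 'RenameOutput')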

46.What is the difference between symmetrically parallel processing, massively parallel processing?
Symmetric Multiprocessing (SMP) - some hardware resources are shared by the processors; the processors communicate via shared memory and run a single operating system.
Cluster or Massively Parallel Processing (MPP) - known as shared-nothing, in which each processor has exclusive access to its hardware resources. Cluster systems can be physically dispersed. Each processor runs its own operating system, and the processors communicate via a high-speed network.

47. What is the purpose of exception activity in data stage 7.5?
The activities that follow the Exception Handler activity are executed whenever an unhandled error occurs while the job sequence is running.

48. Where we use link partitioner in data stage job? explain with example?
We use Link Partitioner in Data Stage Server Jobs. The Link Partitioner stage is an active stage which takes one input and allows you to distribute partitioned rows to up to 64 output links.
For example, if the employee salary records of an organization need to be split across different targets by salary range, that is a case where you could go for the Link Partitioner stage.

49. How to parameterize a field in a sequential file? I am using Data stage as ETL Tool, Sequential file as source?
We cannot parameterize a particular field in a sequential file; instead, we can parameterize the source file name in the Sequential File stage.

50. How to drop the index before loading data in target and how to rebuild it in data stage?
This can be achieved with the "Direct Load" option of the SQL*Loader utility.
If we are using an OCI or database stage, we can use the Before/After SQL properties in the stage properties (drop the indexes in the before-SQL and rebuild them in the after-SQL).

51.If the size of the Hash file exceeds 2GB..What happens? Does it overwrite the current rows?
It overwrites the file

52. Can we retain the removed duplicate values...If yes, how can we do that?
Is it possible for two users to access the same job at the same time in DataStage?
No, it is not possible for two users to access the same job at the same time; DS will produce the error "Job is accessed by other user".
If one user has the job open and a second user tries to open the same job, the second user gets the error "Job (job name) is being accessed by another user".
--In Data stage 8 it is possible to open the job as Read-Only. In order to do some modification in the read-only job, take a copy of the job as modification cannot be done in a read only job. This feature is not available in Data stage 7.5.

53. What is job control? How it is developed? Explain with steps?
Controlling Data stage jobs through some other Data stage jobs. Ex: Consider two Jobs XXX and YYY. The Job YYY can be executed from Job XXX by using Data stage macros in Routines.
To execute one job from another job, the following steps need to be followed in the routine:
1. Attach the job using the DSAttachJob function.
2. Run the other job using the DSRunJob function.
3. Stop the job, if required, using the DSStopJob function (a sketch of these calls follows below).
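A minimal sketch of such a control routine in DataStage BASIC; the job name and parameter are invented for the illustration:

   * Hypothetical control routine: run job 'LoadCustomers' from another job and check its status.
   hJob = DSAttachJob('LoadCustomers', DSJ.ERRFATAL)
   ErrCode = DSSetParam(hJob, 'LoadDate', Oconv(@DATE, 'D-YMD[4,2,2]'))
   ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
   ErrCode = DSWaitForJob(hJob)                    ;* block until the called job finishes
   Status  = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
   If Status <> DSJS.RUNOK Then Call DSLogWarn('LoadCustomers finished with status ' : Status, 'ControlRoutine')
   ErrCode = DSDetachJob(hJob)

DSStopJob would only be called if the controlling logic decided to abort the called job early.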

54. Where actually the flat files store? What is the path?
Flat files store the data on disk; the path can be given in the General tab of the Sequential File stage.

55. Will DataStage consider the second constraint in the Transformer once the first constraint is satisfied (if link ordering is given)?
Answer: Yes.

56. What is a project? Specify its various components?
You always enter Data Stage through a Data Stage project. When you start a Data Stage client you are prompted to connect to a project. Each project contains:
Data Stage jobs:
Built-in components: These are predefined components used in a job.
User-defined components: These are customized components created using the Data Stage Manager or Data Stage Designer.

57. Briefly describe the various client components?
There are four client components
Data Stage Designer: A design interface used to create Data Stage applications (known as jobs). Each job specifies the data sources, the transforms required, and the destination of the data. Jobs are compiled to create executables that are scheduled by the Director and run by the Server.
Data Stage Director: A user interface used to validate, schedule, run, and monitor Data Stage jobs.
Data Stage Manager: A user interface used to view and edit the contents of the Repository.
Data Stage Administrator: A user interface used to configure Data Stage projects and users.

58. What are the Steps involved in development of a job in Data Stage?
The steps required are:
Select the data source stage depending upon the sources for ex: flat file, database, xml etc
Select the required stages for the transformation logic, such as Transformer, Link Collector, Link Partitioner, Aggregator, Merge, etc.
Select the final target stage where you want to load the data, whether it is a data warehouse, data mart, ODS, or staging area.

59. What are constraints and derivation?
Explain the process of taking backup in Data Stage?
What are the different types of lookups available in Data Stage?
Constraints are used to check for a condition and filter the data. Example: Cust_Id<>0 is set as a constraint, meaning only those records meeting this condition will be processed further.
Derivation is a method of deriving the fields, for example if you need to get some SUM, AVG etc.

60. How does Data Stage handle the user security?
We have to create users in the Administrator and give the necessary privileges to those users.

61. What is the meaning of file extender in DataStage server jobs? Can we run one DataStage job from another job, and where is that file data stored?
File extending means adding columns or records to an already existing file in DataStage. And yes, we can run one DataStage job from another job in DataStage.

62. What is the difference between DRS and ODBC stage?
To answer your question, the DRS stage should be faster than the ODBC stage as it uses native database connectivity. You will need to install and configure the required database clients on your DataStage server for it to work.

The Dynamic Relational Stage was leveraged for PeopleSoft, to allow a job to run on any of the supported databases. It supports ODBC connections too. Read more about it in the plug-in documentation.
ODBC uses the ODBC driver for a particular database; DRS is a stage that tries to make it seamless to switch from one database to another. It uses the native connectivity for the chosen target.

63. How to use rank & update strategy in data stage?
Don't mix Informatica with DataStage.
In DataStage, we don't have such kinds of stages.

64. What is the max capacity of Hash file in Data Stage?
Take a look at the uvconfig file:
# 64BIT_FILES - This sets the default mode used to
# create static hashed and dynamic files.
# A value of 0 results in the creation of 32-bit
# files. 32-bit files have a maximum file size of
# 2 gigabytes. A value of 1 results in the creation
# of 64-bit files (ONLY valid on 64-bit capable platforms).
# The maximum file size for 64-bit
# files is system dependent. The default behavior
# may be overridden by keywords on certain commands.
64BIT_FILES 0

65. How to implement type2 slowly changing dimensions in data stage? Explain with example?
We can handle SCD in the following ways
Type 1: just use 'Insert rows Else Update rows' or 'Update rows Else Insert rows' in the update action of the target.
Type 2: use the following steps:
a) Use one hash file to look up the target.
b) Take 3 instances of the target.
c) Give different conditions depending on the process.
d) Give different update actions in the target.
e) Use system variables like Sysdate and Null (a sketch of the constraint logic for step c follows below).
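To make step (c) concrete, here is a minimal sketch of the kind of constraints that could sit on the links feeding the three target instances in a server job; the link and column names (DSLink1 for the stream input, HashLkp for the hash file reference, ADDRESS as the tracked attribute) are invented for the illustration:

   Insert brand-new customer:         HashLkp.NOTFOUND
   Expire the existing row (update):  Not(HashLkp.NOTFOUND) And (DSLink1.ADDRESS <> HashLkp.ADDRESS)
   Insert the new version row:        Not(HashLkp.NOTFOUND) And (DSLink1.ADDRESS <> HashLkp.ADDRESS)

The first link would use an insert-only update action, the second an update-only action that closes the old row, and the third an insert of the new row with a fresh surrogate key.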

66. What is merge and how it can be done please explain with simple example taking 2 tables .......?
Merge is used to join two tables. It takes the key columns and sorts them in ascending or descending order. Let us consider two tables, Emp and Dept. If we want to join these two tables, we have Deptno as a common key, so we can give that column as the key, sort Deptno in ascending order, and join the two tables.

67. How can ETL excel file to Data mart?
Open the ODBC Data Source Administrator found under Control Panel > Administrative Tools. Under the System DSN tab, add a data source using the Microsoft Excel driver.
Then you'll be able to access the XLS file from DataStage.

76. How can we improve the performance of Data Stage?
Turn on in-process row buffering and tune the transaction size.

77. What is data set? and what is file set?
File set: it allows you to read data from or write data to a file set. The stage can have a single input link, a single output link, and a single rejects link, and it only executes in parallel mode. The data files and the file that lists them are called a file set. This capability is useful because some operating systems impose a 2 GB limit on the size of a file and you need to distribute files among nodes to prevent overruns.
Data set: data sets are used to carry the data within parallel jobs, much as ODBC is used in server jobs.

78. How the hash file is doing lookup in server jobs? How is it comparing the key values?
A hashed file is used for two purposes: 1. to remove duplicate records, and 2. as a reference for lookups. The hashed file contains three parts: each record has a hashed key, a key header, and a data portion. By using the hashing algorithm and the key value, the lookup is faster.

79. What are orabulk and bcp stages?
ORABULK is used to load bulk data into a single table of a target Oracle database.
BCP is used to load bulk data into a single table in Microsoft SQL Server or Sybase.

80. How is DataStage 4.0 functionally different from the Enterprise Edition now? What are the exact changes?
There are a lot of changes in DS EE, for example the CDC stage, the Stored Procedure stage, etc.

81. Data Stage from staging to the MDW is only running at 1 row per second! What do we do to remedy this?
I am assuming that there are too many stages, which is causing the problem, and providing the solution on that basis.
In general, if you have too many stages (especially Transformers and hash lookups), there is a lot of overhead and performance degrades drastically. I would suggest you write a query instead of doing several lookups. It may seem embarrassing to have a tool and still write a query, but at times that is best.
If there are too many lookups being done, ensure that you have appropriate indexes while querying. If you do not want to write the query and use intermediate stages, ensure that you eliminate data properly between stages so that data volumes do not cause overhead. So there might be a re-ordering of stages needed for good performance.
Other things in general that could be looked in:
1) For massive transactions, set the hashing size and buffer size to appropriate values so that as much as possible is performed in memory and there is no I/O overhead to disk.
2) Enable row buffering and set appropriate size for row buffering
3) It is important to use appropriate objects between stages for performance
What is the User Variables activity, and when and where is it used? Give a real example.
By using the User Variables activity we can create variables in a job sequence; these variables are then available to all the activities in that sequence.
This activity is usually placed at the start of the job sequence.

82. How can I specify a filter command for processing data while defining sequential file output data?
We have something called after-job and before-job subroutines, with which we can execute UNIX commands.
Here we can use the sort command or another filter command.

83. What validations do you perform after creating jobs in the Designer? What are the different types of errors you faced during loading and how did you solve them?
Check the parameters.
Also check whether the input files exist, whether the input tables exist, and likewise the usernames, data source names, passwords, and so on.

84. What is the difference between Data stage and Data stage TX?
It is a tricky question to answer, but one thing is clear: DataStage TX is not an ETL tool, and it is not a new version of DataStage 7.5.
TX is used for ODS sources.
DS TX falls into the EAI category: it helps in connecting various front ends to different back ends, has built-in plug-ins for many different sources and targets, and in short can make your application platform independent. It can also parse data.

85. If you are running your Parallel Job on 4 node configuration file and you have 10 stages on the canvas, how many processes does data stage creates?
Answer is 40 [considering number of stages and the player processes]
--You have 10 stages and each stage can be partitioned and run on 4 nodes which makes total number of processes generated are 40
Note: It may vary based on the operator combinability.

86. Does Enterprise Edition only add the parallel processing for better performance?
Are any stages/transformations available in the enterprise edition only?
DataStage Standard Edition was previously called DataStage and DataStage Server Edition. DataStage Enterprise Edition was originally called Orchestrate, and then renamed to Parallel Extender when purchased by Ascential.
DataStage Enterprise: server jobs, sequence jobs, parallel jobs. The Enterprise Edition offers parallel processing features for scalable, high-volume solutions. Designed originally for UNIX, it now supports Windows, Linux and UNIX System Services on mainframes.
DataStage Enterprise MVS: server jobs, sequence jobs, parallel jobs, MVS jobs. MVS jobs are designed using an alternative set of stages that are generated into COBOL/JCL code and transferred to a mainframe to be compiled and run. Jobs are developed on a UNIX or Windows server and transferred to the mainframe to be compiled and run.
The first two versions share the same Designer interface but have a different set of design stages depending on the type of job you are working on. Parallel jobs have parallel stages but also accept some server stages via a container. Server jobs only accept server stages; MVS jobs only accept MVS stages. There are some stages that are common to all types (such as aggregation), but they tend to have different fields and options within the stage.

87. How can you implement Complex Jobs in data stage?
Complex design means having more joins and more lookups; such a job design is called a complex job. We can easily implement any complex design in DataStage by following simple tips, in terms of increasing performance as well. There is no hard limit on the number of stages in a job, but for better performance use at most about 20 stages in each job; if it exceeds 20 stages, go for another job. Use not more than 7 lookups per Transformer; otherwise include one more Transformer.

88. How can you implement slowly changing dimensions in DataStage? Also, can you join a flat file and a database table in DataStage? How?
Yes, we can do it in an indirect way. First create a job which populates the data from the database into a sequential file, named for example Seq_First1. Take the flat file which you have and use a Merge stage to join the two files. You have various join types in the Merge stage, such as Pure Inner Join, Left Outer Join and Right Outer Join, and you can use whichever suits your requirements.

89. How to implement routines in data stage?
There are three kinds of routines in DataStage:
1. Server routines, which are used in server jobs and are written in the BASIC language.
2. Parallel routines, which are used in parallel jobs and are written in C/C++.
3. Mainframe routines, which are used in mainframe jobs.

90. What is difference between server jobs & parallel jobs?
Server jobs: These are available if you have installed Data Stage Server. They run on the Data Stage Server, connecting to other data sources as necessary.
Parallel jobs: These are only available if you have installed Enterprise Edition. These run on Data Stage servers that are SMP, MPP, or cluster systems. They can also run on a separate z/OS (USS) machine if required.
In a server job sequence we can run the jobs sequentially only, that is, one after another.
But in a parallel sequence we can run them in parallel if there is no dependency between the jobs.

91. How do we use the NLS function in DataStage? What are the advantages of NLS? Where can we use it? Explain briefly.
By using NLS function we can do the following
- Process the data in a wide range of languages
- Use Local formats for dates, times and money
- Sort the data according to the local rules
If NLS is installed, various extra features appear in the product.
For Server jobs, NLS is implemented in Data Stage Server engine
For Parallel jobs, NLS is implemented using the ICU library.


92. What happens if the output of a hash file is connected to a Transformer? What error does it throw?
If the hash file output is connected to a Transformer stage, the hash file is treated as a lookup file when there is a primary link to the same Transformer stage; if there is no primary link, the hash file link is treated as the primary link itself. You can implement SCD in a server job by using this lookup functionality. This will not return any error code.

93. What is version Control?
Version Control stores different versions of DS jobs, lets you run different versions of the same job,
revert to a previous version of a job, and view version histories.

94. What is ' inserting for update ' in data stage?
I think 'insert for update' means that the updated value is inserted as a new row in order to maintain history.

95. How can we pass parameters to job by using file?
You can do this by passing the parameter values from a UNIX file and then calling the execution of the DataStage job; the DS job has the parameters defined, and their values are supplied from UNIX (one way to apply such a file is sketched below).
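A minimal sketch in DataStage BASIC, assuming the file simply holds name=value pairs; the file path and job name are invented for the illustration:

   * Hypothetical fragment: read name=value pairs from a file and apply them before running a job.
   ParamFile = '/tmp/job_params.txt'               ;* invented path
   hJob = DSAttachJob('TargetJob', DSJ.ERRFATAL)   ;* invented job name
   OpenSeq ParamFile To F Then
      Loop
         ReadSeq Line From F Else Exit             ;* leave the loop at end of file
         ErrCode = DSSetParam(hJob, Field(Line, '=', 1), Field(Line, '=', 2))
      Repeat
      CloseSeq F
   End Else
      Call DSLogWarn('Parameter file not found: ' : ParamFile, 'SetParamsFromFile')
   End
   ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)

Alternatively, a shell script can read the same file and pass each pair to dsjob with -param, as in the automation example earlier.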

96. Where does a UNIX script invoked by DataStage execute: on the client machine or on the server?
DataStage jobs are executed on the server machine only. There is nothing stored or executed on the client machine.

97. What are the default nodes for the DataStage parallel edition?
Actually the number of nodes depends on the number of processors in your system. If your system has two processors, we get two nodes by default.

98. How do you merge two files in DS?
Either use a copy/concatenate command as a before-job subroutine if the metadata of the two files is the same, or create a job to concatenate the two files into one if the metadata is different.

99. What is DS Director used for - did u use it?
DataStage Director is the GUI used to run, validate, monitor and schedule DataStage server jobs.

100. What are types of Hashed File?
Hashed File is classified broadly into 2 types.
a) Static - Sub divided into 17 types based on Primary Key Pattern.
b) Dynamic - sub divided into 2 types
i) Generic
ii) Specific.
The default hashed file is dynamic, Type 30.

101. How do you pass filename as the parameter for a job?
1. Go to Data Stage Administrator->Projects->Properties->Environment->UserDefined. Here you can see a grid, where you can enter your parameter name and the corresponding the path of the file.
2. Go to the stage Tab of the job, select the NLS tab, click on the "Use Job Parameter" and select the parameter name which you have given in the above. The selected parameter name appears in the text box beside the "Use Job Parameter" button. Copy the parameter name from the text box and use it in your job. Keep the project default in the text box.

102. How can we create Containers?
There are two types of containers
1. Local Container
2. Shared Container
Local container is available for that particular Job only.
Where as Shared Containers can be used anywhere in the project.
Local container:
Step 1: Select the stages required.
Step 2: Edit > Construct Container > Local
Shared Container:
Step 1: Select the stages required.
Step 2: Edit > Construct Container > Shared
Shared containers are stored in the SharedContainers branch of the Tree Structure

103. How can we improve the performance of Data Stage jobs?
Performance and tuning of DS jobs:
1. Establish Baselines
2. Avoid the Use of only one flow for tuning/performance testing
3. Work in increments
4. Evaluate data skew
5. Isolate and solve
6. Distribute file systems to eliminate bottlenecks
7. Do not involve the RDBMS in initial testing
8. Understand and evaluate the tuning knobs available.

125. What are the Job parameters?
These parameters are used to change run-time values of the job without editing the design.
Edit > Job Parameters
In the Parameters tab we can define the name, prompt, type and default value.

104. What are all the third party tools used in Data Stage?
AutoSys, TNG, and Event Coordinator are some of the third-party tools that I know of and have worked with.

105. How can we implement Lookup in Data Stage Server jobs?
The DB2 stage can be used for lookups.
In the Enterprise Edition, the Lookup stage can be used for doing lookups.
We can implement lookups in server jobs using Hash files and Transformer stage. Lookup stages are available in parallel jobs only.

106. What is the utility you use to schedule the jobs on a UNIX server other than using Ascential Director?
"AUTOSYS": Through autosys u can automate the job by invoking the shell script written to schedule the data stage jobs.

107. What are Sequencers?
A sequencer allows you to synchronize the control flow of multiple activities in a job sequence. It can have multiple input triggers as well as multiple output triggers. The sequencer operates in two modes. In ALL mode, all of the inputs to the sequencer must be TRUE for any of the sequencer outputs to fire. In ANY mode, output triggers can be fired if any of the sequencer inputs are TRUE.

108. Explain the differences between Oracle8i/9i?
Multiprocessing, and more database support for dimensional modeling.

109. What are XML files, how do you read data from XML files, and which stage is to be used?
In the palette there are Real Time stages such as XML Input, XML Output and XML Transformer.

110. How do you pass the parameter to the job sequence if the job is running at night?
Two ways
1. Set the default values of the parameters in the job sequence and map these parameters to the job.
2. Run the job in the sequence using the dsjob utility, where we can specify the values to be taken for each parameter.

111. How do you remove duplicates without using remove duplicate stage?
In the target, make the column a key column and run the job.
Alternatively, by using the Sort stage we can remove duplicates by setting the Allow Duplicates option to false.

112. What is the flow of loading data into fact & dimensional tables?
Here is the sequence of loading a data warehouse.
1. The source data is first loaded into the staging area, where data cleansing takes place.
2. The data from the staging area is then loaded into dimensions/lookups.
3. Finally, the fact tables are loaded from the corresponding source tables in the staging area.

114. Functionality of Link Partioner and Link Collector?
Server jobs mainly execute in a sequential fashion; the IPC stage, as well as the Link Partitioner and Link Collector stages, simulate a parallel mode of execution in server jobs running on a single CPU. The Link Partitioner receives data on a single input link and diverts it to a maximum of 64 output links, each processed with the same metadata. The Link Collector collects the data from up to 64 input links, merges it into a single data flow and loads it to the target. Both are active stages, and the design and mode of execution of server jobs has to be decided by the designer.
Link Partitioner : It actually splits data into various partitions or data flows using various
partition methods .
Link Collector : It collects the data coming from partitions, merges it into a single data flow and loads to target.
Containers : Usage and Types?
Container is a collection of stages used for the purpose of Reusability. There are 2 types of Containers.
a) Local Container: Job Specific
b) Shared Container: used in any job within a project.
There are two types of shared container:
1. Server shared container, used in server jobs (it can also be used in parallel jobs).
2. Parallel shared container, used in parallel jobs. You can also include server shared containers in parallel jobs as a way of incorporating server job functionality into a parallel stage (for example, you could use one to make a server plug-in stage available to a parallel job).

115. What are Stage Variables, Derivations and Constants?
Stage Variable - an intermediate processing variable that retains its value while rows are processed and doesn't pass the value into a target column.
Derivation - Expression that specifies value to be passed on to the target column.
Constant - Conditions that are either true or false that specifies flow of data with a link.
In the DataStage Transformer stage you specify constraints (not constants) on output links to steer the output in the desired manner.

116. How to improve the performance of hash file?
You can improve performance of hashed file by
1. Preloading the hash file into memory - this can be done by enabling the preload options in the hash file output stage.
2. Write caching options - data is written into a cache before being flushed to disk. You can enable this to ensure that hash file rows are written in order onto the cache before being flushed to disk, instead of in the order in which individual rows arrive.
3. Preallocating - estimating the approximate size of the hash file so that the file does not need to be split too often after write operations.

117. What does separation option in static hash-file mean?
The different hashing algorithms are designed to distribute records evenly among the groups of the file based on characters and their position in the record ids.
When a hashed file is created, Separation and Modulo respectively specifies the group buffer size and the number of buffers allocated for a file. When a Static Hash file is created, DATASTAGE creates a file that contains the number of groups specified by modulo.
Size of hash file = modulo (number of groups) * separation (group buffer size)

118. How to remove duplicates in server job?
1) Use a hashed file stage or
2) If you use sort command in UNIX(before job sub-routine), you can reject duplicated records using -u parameter or
3) using a Sort stage

119. If I add a new environment variable in Windows, how can I access it in Data Stage?
You can view all the environment variables in the Designer: check them in the job properties. You can add and access the environment variables from the job properties.

120. There are three different types of user-created stages available for PX.
What are they? Which would you use? What are the disadvantage for using each type?
These are the three different stages:
i) Custom ii) Build iii) Wrapped

121. I want to process 3 files in sequentially one by one, how can I do that? While processing the files it should fetch files automatically?
If the metadata for all the files is the same, then create a job having the file name as a parameter, then use the same job in a routine and call the job with a different file name each time; or you can create a sequence to reuse the job (a sketch of the routine approach follows below).
In DataStage 8.0.1 there is a Loop option in sequence jobs through which the same purpose can be achieved.
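Here is a rough sketch of the routine approach in DataStage BASIC; the job name, parameter name and file names are invented for the illustration:

   * Hypothetical control routine: run the same job once per file, one after another.
   FileList = 'file1.txt' : @FM : 'file2.txt' : @FM : 'file3.txt'
   For I = 1 To Dcount(FileList, @FM)
      hJob = DSAttachJob('LoadOneFile', DSJ.ERRFATAL)
      ErrCode = DSSetParam(hJob, 'SourceFile', Field(FileList, @FM, I))
      ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
      ErrCode = DSWaitForJob(hJob)
      ErrCode = DSDetachJob(hJob)
   Next I

In version 8 and later, the Loop activities in a sequence job achieve the same thing without hand-written code.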

122. What is difference between data stage and informatica?
There are good articles on these differences which help to get an idea. Basically it depends on what you are trying to accomplish.

123. What are the requirements for your ETL tool? Do you have large sequential files (1 million rows, for example) that need to be compared every day versus yesterday?
If so, then ask how each vendor would do that. Think about what process they are going to use. Are they requiring you to load yesterday's file into a table and do lookups?
If so, RUN!! Are they doing a match/merge routine that knows how to process this in sequential files? Then maybe they are the right one. It all depends on what you need the ETL to do. If you are small enough in your data sets, then either would probably be OK.

124. What's the difference between an operational data store (ODS) and a data warehouse?
A data warehouse is a decision-support database for organizational needs. It is a subject-oriented, non-volatile, integrated and time-variant collection of data.
An ODS (operational data store) is an integrated collection of related, current information. It contains at most around 90 days of information.

125. What is iconv and oconv functions?
Iconv() converts a string to the internal storage format; Oconv() converts an expression to an output format.

126. What will you in a situation where somebody wants to send you a file and use that file as an input or reference and then run job?
A. Under Windows: use the 'Wait For File' activity in the sequencer and then run the job. Maybe you can schedule the sequencer around the time the file is expected to arrive.
B. Under UNIX: poll for the file; once the file has arrived, start the job or sequencer, depending on the file (a rough polling sketch follows below).
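A rough sketch of the polling, written as a DataStage BASIC before-job routine; the trigger file path and the 60-second interval are assumptions for the illustration:

   * Hypothetical before-job routine: wait until a trigger file appears, then let the job continue.
   TriggerFile = '/data/in/trigger.done'           ;* invented path
   Found = @FALSE
   Loop
      OpenSeq TriggerFile To F Then
         Found = @TRUE
         CloseSeq F
      End Else
         Null
      End
   Until Found Do
      Sleep 60                                     ;* wait a minute before checking again
   Repeat
   ErrorCode = 0                                   ;* before/after-routine convention: 0 lets the job run

In practice you would also add a time-out so the routine cannot wait forever.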

127. Did you work in UNIX environment?
Yes. Sometimes you need to write UNIX programs that run in the background, like batch programs, because DataStage may need to invoke batch processing every 24 hours. So UNIX knowledge is a must, so that we can run the UNIX programs in the background at intervals of minutes or hours.

128. What is Hash file stage and what is it used for?
A hashed file stage stores data keyed on a hashed key and is mainly used as a reference for lookups in server jobs. We can also use the hashed file stage to avoid/remove duplicate rows by specifying the hash key on a particular field.

129. Did you Parameterize the job or hard-coded the values in the jobs?
Always parameterize the job. The values either come from the job properties or from a 'parameter manager', a third-party tool. There is no way you should hard-code parameters in your jobs. The most often parameterized variables in a job are: DB DSN name, username, password, and the dates with respect to which the data is to be looked up.

130. Suppose there are a million records: did you use OCI? If not, which stage do you prefer?
Use Orabulk.

131. What are environment variables? What is the use of this?
Basically, environment variables are predefined variables that we can use while creating a DS job. They can be set either at the project level or at the job level; once a specific variable is set, it is available to the project/job.
We can also define new environment variables; for that we go to the DataStage Administrator.
