IBM InfoSphere DataStage

Latest information:
Governance updates
Data Quality updates
Integration updates
Administration and management updates
Deprecated features

Governance updates

Information Governance Catalog New
InfoSphere Information Governance Catalog
Information Server Enterprise Search
Information Server Governance Monitor

Information Governance Catalog New

New in 11.7

This version of Information Governance Catalog has a new user interface which improves user experience.
You can search for assets in your entire enterprise by using enhanced search that takes into account factors like text match, related assets, ratings and comments, modification date, quality score, and usage.
You can browse the assets in your catalog and narrow down the results by using filters.
You can use producers, which are applications that collect relevant data on systems like Db2®, Hive, Hadoop Distributed File System (HDFS), Oracle, or Teradata, to improve the quality of search results.
You can use unstructured data sources to enrich your data with information that has no clear structure - emails, instant messages, word-processing documents, or audio or video files.
You can view a graphical depiction of relationships between assets in the graph explorer.
You can add comments and ratings to assets.
On the home page, you can display the most important information such as assets with highest rates, your collections, or information assets statistics.
Administrators can create custom profiles to display customized details page of assets.

Information Server Enterprise Search

New in 11.7

Information Server Enterprise Search is a stand-alone application which enables you to explore data in your enterprise.
You can search for assets in your entire enterprise by using enhanced search that takes into account factors like text match, related assets, ratings and comments, modification date, quality score, and usage.
You can use producers, which are applications that collect relevant data on systems like Db2®, Hive, Hadoop Distributed File System (HDFS), Oracle, or Teradata, to improve the quality of search results.
When you view basic information about an asset in the search results, you can open the details page in the application of origin such as Information Governance Catalog.
You can view a graphical depiction of relationships between assets in the graph explorer.
You can add comments and ratings to assets.

InfoSphere Information Governance Catalog

New in 11.7

You can discover new asset family - unstructured data sources. These assets are synchronized from IBM Stored IQ, and represent data with no clear structure. New asset types are instances, volumes, infosets and filters.
You can export your own asset types that you added to Information Governance Catalog by using REST API bundles. You can later reimport such assets and merge them with the existing ones.
When using REST API, two new properties are displayed in the basic information about an asset: asset group and class name.
When searching for assets by using REST API, you can specify context, such as the name of a host, to narrow down search results.
When searching for assets by using REST API, you can use new parameter IncludeHistory to specify whether the asset history (if applicable) is included in the details of returned assets.
You can create, edit and delete custom attributes by using REST API.

Information Server Governance Monitor

New in 11.7

Information Server Governance Monitor is a tool where you can monitor the status and health of the data in your enterprise.
On the Curation Dashboard, you can see whether the data in your enterprise is cataloged, classified and governed.
On the Quality Dashboard, you can review the overall quality of the data in your enterprise, including scoring and quality dimensions.

Data Quality updates

InfoSphere Information Analyzer

InfoSphere Information Analyzer

New in 11.7

You can create and run automation rules to automate the process of applying and running rule definitions and quality dimensions.
Terms can be automatically assigned to data sets and columns when you use automated discovery to import data or when you run a column analysis.
You can run automated discovery to import and analyze new data sets from a data connection. With one click you can register data sets and add metadata from the data connection to a default workspace, run column and quality analysis, and assign business terms to imported assets.
You can create new data quality dimensions in order to detect if a row, column, or cell contains a particular data quality problem.
When you run column analysis, a single frequency distribution table can be generated for each project rather than each column in the analysis.
You can use different credentials for running analysis jobs than the ones you use to import data by using by using InfoSphere Metadata Asset Manager. You can set the credentials for a specific project.
When you run quality and column analysis, you can update settings for a specific analysis. For example, you can run the analysis with a data sample, specify how you want analysis statistics generated and stored, or apply advanced engine settings.
You can run stored procedures for SQL Server and Oracle to get the analysis results from Information Analysis Database.
You can export the results of the overlap analysis.
New data class type 'Script' is introduced. By using this script classifier, you can classify your data by creating a custom script snippet.

Integration updates

DataStage Flow Designer
InfoSphere DataStage
Connectivity
InfoSphere Information Server on Hadoop

DataStage Flow Designer

New in 11.7

You can use DataStage Flow Designer, a web-based tool, to create, edit, load, and run DataStage jobs.
You can search for jobs by using built-in search.
Metadata is automatically propagated to subsequent stages in a flow.
When compilation errors occur, all of them are highlighted and a hover over each stage provides details, which makes it easier to correct the errors.
If you are new to the product, you can take a quick tour to learn how to use its features.

InfoSphere DataStage

New in 11.7

By using DataStage Product Insights, you can connect your InfoSphere DataStage installation to IBM Cloud Product Insights. Such connection gives you ability to review your installation details and metrics like CPU usage, memory usage, active jobs, jobs that failed, and jobs that completed.
Data Masking stage supports Optim Data Privacy Providers version 11.3.0.5.

Connectivity

New in 11.7

HBase connector is supported. You can use HBase connector to connect to tables stored in the HBase database and perform the following operations:
- Read data from or write data to HBase database.
- Read data in parallel mode.
- Use HBase table as a lookup table in sparse or normal mode.
- Kerberos keytab locality is supported.
Hive connector supports the following features:
- Modulus partition mode and minimum maximum partition mode during the read operation are supported.
- Kerberos keytab locality is supported.
- Connector supports connection to Hive on Amazon EMR.
Kafka connector supports the following features:
- Continuous mode, where incoming topic messages are consumed without stopping the connector.
- Transactions, where a number of Kafka messages is fetched within a single transaction. After record count is reached, an end of wave marker is sent to the output link.
- TLS connection to Kafka.
- Kerberos keytab locality is supported.
IBM MQ version 9 is supported.
IBM InfoSphere Data Replication CDC Replication version 11.3 is supported.
SQL Server Operator supports SQL Server Native Client 11.
Sybase Operator supports unichar and univarchar datatypes in Sybase Adaptive Server Enterprise.
Amazon S3 connector supports connecting by using a HTTP proxy server.
File connector supports the following features:
- Native HDFS FileSystem mode is supported.
- You can import metadata from the ORC files.
- New data types are supported for reading and writing the Parquet formatted files: Date / Time and Timestamp.
JDBC connector is certified to connect to MongoDB and Amazon Redshift.

InfoSphere Information Server on Hadoop

New in 11.7

The following Hadoop distributions are supported: MapR 5.2.2, Cloudera CDH 5.13, and Hortonworks HDP 2.6.2.
Parallel engine configuration files on YARN can include nodes which are not part of Hadoop cluster, so Hadoop jobs can access relational databases outside the cluster.
Parallel jobs on YARN can use Hadoop Shuffle space for scratch files.
Parallel jobs give clearer messages when they are preempted by the YARN resource manager.
Ambari scripts have been enhanced to improve administration.

Administration and management updates

Managing metadata

Managing metadata

New in 11.7

You can run automated discovery by using command line to discover the database schema available for a given database or the files and folders for a given file system.

Deprecated features

Features deprecated in 11.7

Using InfoSphere Data Click to move data in IBM InfoSphere BigInsights is no longer supported. In previous releases, InfoSphere Data Click was used to copy selected database tables, data files, data file folders, and Amazon S3 buckets from the catalog to a target distributed file system, such as a Hadoop Distributed File System (HDFS) in IBM InfoSphere BigInsights.
The following asset types are deprecated in Information Governance Catalog:
- Machine Profiles
- Blueprint Director
- CDC Mapping Document
- Warehouse Mapping Document
- Information Server Reports
- External Assets
Data policy asset type is deprecated in Information Analyzer.

By default the oraread operator is set to run in sequential mode. To enable parallel mode on oraread , you have to add the partition table property. This is set in the Source properties section. Then you specify the table name of this property. If more than one table is being used in the select statement, then specify only one of the tables used.

You can then specify a Partitioning Algorithm to use when partitioning the table across nodes.

It is recommend that you add the environment APT_ORAREAD_PARALLEL_ALGORITHM to the job and set its value to ROWID_HASH

APT_ORAREAD_PARALLEL_ALGORITHM

This environment variable is used to determine which partitioning algorithm to use during Oracle Enterprise parallel read operations. The algorithm defines how the stage divides the input dataset into subsets so that each parallel instance of the stage reads one subset of the data. The environment variable can be set to one of the following values:

ROUND_ROBIN - The operator divides the rows from the input dataset in a round-robin fashion using modulus function applied on the row identifier (ROWID) values of the rows within the storage blocks in which they reside

ROWID_HASH - The operator divides the rows from the input dataset in an approximately random fashion using modulus function applied on the hash codes calculated from the rowid values of the rows (recommended)

ROWID_RANGE - The operator divides the rows from the input dataset by taking into account the physical collocation of rows in the table segment and splitting the overall range of rowid values into sub-ranges. This is the default option and is a preferred option for use.

If the environment variable is not defined or is set to a value other than the values listed above, the ROWID_RANGE value is used by default.

If the ROWID_RANGE option is selected (either explicitly or implicitly) and the Oracle user does not have access to DBA_EXTENTS dictionary view, or the target table is an index-organized table (IOT) or a view, then the stage cannot use ROWID_RANGE algorithm and it automatically switches at runtime to using ROWID_HASH partitioning algorithm instead.

Article Source: IBM Support Guide

Venkat ♥ Duvvuri

Thursday, December 21, 2017

New features and changes in InfoSphere Information Server Version 11.7

Governance updates

Information Governance Catalog New

Information Server Enterprise Search

InfoSphere Information Governance Catalog

Information Server Governance Monitor

Data Quality updates

InfoSphere Information Analyzer

Integration updates

DataStage Flow Designer

InfoSphere DataStage

Connectivity

InfoSphere Information Server on Hadoop

Administration and management updates

Managing metadata

Deprecated features

Saturday, August 30, 2014

How do I make the Oracle EE Stage Read Operator Run in Parallel?

Saturday, May 10, 2014

DataStage Scenario on Finding Unique Distance from Source and Destination

Sunday, July 7, 2013

Performance Optimization technique to handle variable length data in Data Set

Categories

Labels

Blog Archive

Blog Archive

Popular Posts