Using IBM InfoSphere BigInsights to accelerate big data time-to-value

IBM Software

Thought Leadership White Paper

December 2013

Using IBM InfoSphere BigInsights to

accelerate big data time-to-value

Using IBM InfoSphere BigInsights to accelerate big data time-to-value

Contents

2 IBM InfoSphere BigInsights 2.1 overview

3 Accelerating deployments by tapping into Hadoop

community innovation

3 Leveraging existing SQL skills and solutions

4 Enabling user-driven analytics and data provisioning

5 IBM BigSheets

6 InfoSphere BigInsights Web Console

7 Analytics accelerators

10 Leveraging in-motion and at-rest analytics

10 Integration with popular modeling and predictive

analytics solutions

10 Conclusion

IBM InfoSphere BigInsights 2.1 overview

IBM® InfoSphere® BigInsights™ 2.1 is an Apache Hadoop-

based, hardware-agnostic software platform that provides new

ways of using diverse and large-scale data collections. This white

paper describes commonly used capabilities in InfoSphere

BigInsights 2.1 that allow organizations to cost-effectively ana-

lyze a wide variety and large volume of data to gain insights that

were not previously possible.

InfoSphere BigInsights 2.1 is focused on providing enterprises

with the capabilities they need to meet critical business require-

ments while maintaining compatibility with the Hadoop project.

InfoSphere BigInsights 2.1 includes a variety of IBM technolo-

gies that enhance and extend the value of open-source Hadoop

software to facilitate faster time-to-value, including application

accelerators, analytical facilities, development tools, platform

improvements and enterprise software integration. While

InfoSphere BigInsights offers a wide range of capabilities that

extend beyond the Hadoop functionality, IBM has taken an opt-

in approach: you can use the IBM extensions to Hadoop based

on your needs rather than being forced to use the extensions that

come with InfoSphere BigInsights 2.1.

To help you initiate big data projects quickly, InfoSphere

BigInsights 2.1 offers a number of enhancements, including a

collection of popular open-source and IBM technologies that

can be grouped into the following categories:



Accelerating deployments by tapping into Hadoop

community innovation



Leveraging existing SQL skills and solutions



Enabling user-driven analytics and data provisioning



Supporting human-oriented information discovery and

topic generation



Leveraging in-motion and at-rest analytics



Integrating with popular modeling and predictive

analytics solutions

IBM Software

InfoSphere BigInsights: A highly compatible platform

Third-party applications, partner solutions and custom

development projects that are compatible with the following

InfoSphere BigInsights support versions are expected to work

without changes beyond updating data locations.

• Apache Hadoop (1.1.1), a 64-bit Linux version of the

IBM SDK for Java 6, and Java

• Avro (1.7.2), a data serialization system

• Chukwa (0.5.0), a data collection system for monitoring large

distributed file systems

• Fair Scheduler, for basic management of job submission

• Flume (1.3.0), a distributed, reliable and highly available

service for efficiently moving large amounts of data

around a cluster

• HBase (0.94.3), a non-relational distributed database

written in Java

• HCatalog (0.4.0), a table and storage management

service for Hadoop

• Hive (0.9.0), a data warehouse infrastructure that facilitates

both data extraction, transformation and loading (ETL),

and the analysis of large data sets that are stored in the

Hadoop Distributed File System (HDFS)

• IBM InfoSphere BigInsights Jaql, a query language designed

for JavaScript Object Notation (JSON), primarily used to

analyze large-scale semi-structured data

• Lucene (3.3.0), a high-performance, full-featured text

search engine library written entirely in Java

• Oozie (3.2.0), a workflow coordination manager

• Orchestrator, an advanced MapReduce job control system

that uses a JSON format to describe job graphs and the

relationships between them

• Pig (0.10.0), a platform for analyzing large data sets,

consisting of a high-level language for expressing data

analysis programs and an infrastructure for evaluating

those programs

• Sqoop (1.4.2), a tool that imports information from

structured databases and related Hadoop systems

into Hadoop clusters

• ZooKeeper (3.4.5), a centralized service for maintaining

configuration information that provides distributed

synchronization and group services

This paper describes how these enhancements help extend the

value of open-source Hadoop with the capabilities organizations

need to cost-effectively support emerging big data workloads.

Accelerating deployments by tapping into

Hadoop community innovation

The IBM commitment to the Hadoop open-source software

components in InfoSphere BigInsights 2.1 helps facilitate third-

party interoperability and supports the ongoing development of

new features and functionality. Organizations with existing

MapReduce, Hive, Pig and Sqoop projects can leverage that

work on InfoSphere BigInsights 2.1 if the version levels are all

compatible and directory structures are mirrored.

Leveraging existing SQL skills and

solutions

Legacy applications depend on SQL to access stored data, and

SQL is the de facto language used to query structured data;

as a result, most organizations have deep and abundant SQL

skills. IBM customers have been asking for ways to leverage their

SQL skills with Hadoop to lower the barrier to getting started

with Hadoop and to facilitate interoperability with existing

SQL-oriented tools and applications. IBM is enabling customers

to do exactly that with the introduction of IBM Big SQL, a data

warehouse system for Hadoop that is used to summarize, query

and analyze data that is stored in InfoSphere BigInsights 2.1.

Big SQL uses JDBC or ODBC drivers to access data that is

stored in InfoSphere BigInsights in the same way that users

access databases from their enterprise applications. You can use

the Big SQL server to execute standard SQL queries, and to

execute multiple queries concurrently (see Figure 1).

Using IBM InfoSphere BigInsights to accelerate big data time-to-value

Big SQL provides support for large ad hoc queries by using

MapReduce parallelism and point queries, which are low-latency

queries that return information quickly to reduce response time

and provide improved access to data. The Big SQL server is

multi-threaded, so scalability is limited only by the performance

and number of CPUs on the computer that runs the server. If

you want to issue larger queries, you can increase the hardware

performance of the server computer that Big SQL runs on, or

chain multiple Big SQL servers together to increase throughput.

Big SQL enables anyone with existing SQL skills to be immedi-

ately productive, thereby minimizing project timelines and low-

ering the financial commitment to a given project. With Big

SQL, all your data is SQL-accessible, allowing you to choose the

storage format best suited for your application.

Enabling user-driven analytics and data

provisioning

To gain new insights and improve business results, you need an

environment that is well suited to exploring and discovering data

relationships and correlations. With the right technology, you

can extend the value of your data warehouse by bringing in new

types of data and driving new types of analysis. One of the most

common deployment patterns for InfoSphere BigInsights is

known as a Data Exploration Zone. The Data Exploration Zone

provides an environment with the capabilities you need to ana-

lyze information in its raw form—whether it is structured or

unstructured data—with tools such as text analytics, data mining,

entity analytics and machine learning. You could use the data in

this zone for exploratory analytics, or send it to a data warehouse

for deeper analytics, providing greater flexibility in how you

work with and analyze data.

InfoSphere BigInsights

Application

SQL language

JDBC/ODBC driver

Data sources

Hive tables HBase tables CSV ﬁles

JDBC/ODBC server

Big SQL engine

Figure 1. An overview of IBM Big SQL.

IBM Software

To deliver the correct information as rapidly as possible, the data

warehouse supporting these systems must be optimized for the

right balance of analytics performance and operational query

throughput. InfoSphere BigInsights 2.1 provides multiple user-

driven capabilities for creating new data collections from raw

data sets, expanding existing data sets from traditional relational

sources, and performing ad hoc analytics against the data with-

out requiring assistance from IT.

Data can be cleansed, transformed, integrated and aggregated

to prepare it for utilizations. Once prepared, data can be

published to common constructs such as Hive, or can be

staged for use by an external solution such as IBM PureData™

System for Analytics.

IBM BigSheets

IBM BigSheets is a browser-based analytic tool designed to

break large amounts of data into consumable, situation-specific

business contexts. Easily accessed from the InfoSphere

BigInsights console, BigSheets can collect data from multiple

sources, including web crawling, data import/export, data

sampling, social media data collection and analysis, machine

data processing and analysis, ad hoc queries and more

(see Figure 2). BigSheets can also be used with data that is

loaded through other means such as Flume or IBM InfoSphere

Information Server.

Figure 2. Overview of IBM BigSheets.

Using IBM InfoSphere BigInsights to accelerate big data time-to-value

Once InfoSphere BigInsights collects the data, BigSheets users

load the data of interest into a master workbook. From there,

BigSheets allows you to format and explore the data by building

sheets (which resemble spreadsheets) in workbooks that are based

on the master workbook. You can combine columns from differ-

ent workbooks, run formulas and filter data. These manipula-

tions form the basis of your analysis.

BigSheets generates and executes the code necessary to perform

all of the data manipulation work automatically, allowing you to

work in a visual paradigm rather than working at a lower script-

ing or Java level. You can also combine data with text analytics

functions within InfoSphere BigInsights to filter and manipulate

data and drill further into information to derive valuable insights

out of raw data.

After refining the data and running the analytics, you can apply

visualizations such as tag clouds, bar charts, maps and pie charts.

These visualizations provide a consumable output for your data

that highlights relationships and distills insights from previously

disconnected data.

InfoSphere BigInsights Web Console

A role-based Welcome page within InfoSphere BigInsights 2.1

dynamically populates data collections and jobs for each user.

The software includes applications that can be used for complet-

ing various data management tasks. These preinstalled applica-

tions have the same properties as applications you create and can

be used as jumping-off points for big data projects. Users can

create and share new jobs as they are developed, making the

Web Console an increasingly powerful starting point for work-

ing with InfoSphere BigInsights. The following list represents

just some of the applications provided out of the box:



Ad hoc Hive Query: Use the Ad hoc Hive Query application to

create your own customized Hive queries to analyze your data.



Ad hoc Jaql Query: Use the Ad hoc Jaql Query application to

create your own customized Jaql queries to analyze your data.



Ad hoc Pig Query: Use the Ad hoc Pig Query application to

create your own customized Pig queries to analyze your data.



Ad hoc R Script: This application is used to run an R script.

Because Oozie assigns the R script to run on a less-used

cluster node, R script must be installed on all nodes of your

cluster. The R script reads from and writes to files on local

rather than HDFS directories. The Ad hoc R Script applica-

tion, therefore, can copy input files into required local

directories and move output files to HDFS directories.



BoardReader: The BoardReader application searches for,

locates and displays information from multiple web sources,

such as online forums, message boards, blogs, news sources

and videos.



Data Download: This application is used to download data

from the IBM developerWorks® resource for developers.

After you agree to the developerWorks terms and conditions,

you can then select a sample data set from the data set

drop-down list or access some other data set by entering

a URL.



Data Sampling: Given a large data set and parameters, this

application generates a representative data sample. The

application samples the input data by using a uniform random

sample (without replacement). The application outputs

the results to a file whose format is the same format as the

input file.



Data Subset: The Data Subset application is used to create a

subset of your complete data. You can then improve perfor-

mance by analyzing the data subset by structure, content and

file format.



Database Export: This application writes data from files on the

HDFS to a table in the relational database management

system, and uses a Java program to export the data that is

stored on the HDFS into a table in the database. The input

data is stored in files on DFS. The format of the input can be

CSV or JSON.

IBM Software



Database Import: This application loads data from a relational

database management system into a file on the HDFS. It uses

a Java program to import data from a database and write it to

a file on the HDFS. You can specify the SQL (Select) query to

define the data that is to be imported from the database. The

data that is retrieved from the database is then written out to a

file on the HDFS in CSV or JSON format.



Distributed Copy: By using a MapReduce job, you can copy

data from a remote source to the HDFS or from the HDFS

to a remote source. Use the Distributed Copy application to

copy files and directories from one source to another.



HBase: The HBase application enables you to export rows of

data from an HBase table through the InfoSphere BigInsights

console. You can export rows of data from an HBase table as a

JSON file. The application requires parameters to export the

data; it does not accept HBase queries from you.



Web Crawler: The Web Crawler application is an automated

program that methodically tracks Internet pages and collects

data. It also compares the size and contents of a file against the

version of that file stored in InfoSphere BigInsights.



Web REST Import: This application fetches content from a

specified URL location and stores the content in a specified

HDFS directory.

These applications can be modified and then published to

specific users or a class of users based on their security rights to

provide them with multiple options for starting their projects.

Analytics accelerators

IBM offers several analytic accelerators that greatly reduce the

time-to-value of your big data applications. These accelerators

provide business logic, data processing capabilities and visualiza-

tion for specific use cases. By using accelerators, you can apply

advanced analytics to help integrate and manage the variety,

velocity and volume of data that constantly enters your organiza-

tion. Accelerators also create a development environment for

building new custom analytic applications that are tailored to the

specific needs of your organization.

Two accelerators are shipped with InfoSphere BigInsights:

IBM Accelerator for Machine Data Analytics and

IBM Accelerator for Social Data Analytics. Two accelerators are

shipped with InfoSphere Streams: IBM Accelerator for Social

Data Analytics and IBM Accelerator for Telecommunications

Event Data Analytics. These accelerators cover many common

use cases and are easily extended for enterprise-specific needs.

IBM Accelerator for Social Data Analytics

Data from social media forums contains valuable information

about user preferences. However, accessing and working with

that information requires large-scale import, configuration and

analysis capabilities. The IBM Accelerator for Social Data

Analytics, which has a built-in understanding of how to

work with social data sources, extracts relevant information

from tweets, boards and blogs and then builds social profiles

of users based on specific use cases and industries. A typical

workflow consists of importing your data files and then

configuring, indexing and analyzing the data.

With IBM Accelerator for Social Data Analytics, you can:



Import and analyze social media data, identifying user

characteristics such as gender, locations, names and hobbies



Develop comprehensive user profiles across messages

and sources



Associate profiles with expressions of sentiment, buzz, intent

and ownership around brands, products and companies

Using IBM InfoSphere BigInsights to accelerate big data time-to-value

The IBM Accelerator for Social Data Analytics is commonly

used to collect data to enrich customer analytics—and unlike

many social listening tools, can be used to help identify social

activity down to a known customer.

IBM Accelerator for Machine Data Analytics

The IBM Accelerator for Machine Data Analytics can ingest,

parse and extract a variety of machine data from sources such as

machine data files, log files, smart devices and telemetry, and

help process that data in minutes instead of days or weeks. It

helps organizations gain insights into operations, customer expe-

riences, transactions and behavior that may identify infrastruc-

ture issues and changes in customer preferences, or trap events

that can drive systems of engagement. Many IBM customers use

the IBM Accelerator for Machine Data Analytics to proactively

boost operational efficiency, troubleshoot problems, investigate

security incidents and monitor end-to-end infrastructure to

avoid service degradation or outages.

A typical introductory workflow consists of organizing and

importing batches of data, and then extracting, indexing, search-

ing, transforming and analyzing the data. With IBM Accelerator

for Machine Data Analytics, you can:



Search within and across multiple machine data entries based

on a text search, faceted search or a timeline-based search to

find events



Enrich the context of machine data by adding and extracting

log types into the existing repository



Link and correlate events across systems



Uncover patterns

InfoSphere BigInsights Text Analytics

InfoSphere BigInsights Text Analytics is a powerful and

declarative information extraction system that excels at creating

structured information from text inputs, allowing users to

gain actionable insights from the underlying text data. The

InfoSphere BigInsights Text Analytics module was custom-

designed to leverage a Hadoop-oriented processing model.

It is extremely fast and capable of handling large amounts of

unstructured information very quickly compared to traditional

text analytics approaches. The InfoSphere BigInsights Text

Analytics module is also declarative, meaning it can be readily

adapted to your specific analytics needs using a SQL-like

method that is simply not possible with conventional text tools.

This helps lower costs as well as providing a level of comfort

that is unique in the Apache Hadoop space.

Text Analytics is included in the Eclipse development environ-

ment as part of the InfoSphere BigInsights Text Analytics

Workflow perspective. Text Analytics Eclipse development

tools can be used to develop and test extractors in Eclipse.

Once you have identified and selected the extractor you would

like to use, you can publish it into the InfoSphere BigInsights

Console as an application, which an administrator can then

deploy and make available for consumption by users across

InfoSphere BigInsights.

The Welcome page of the InfoSphere BigInsights Console

includes information about how to enable your Eclipse environ-

ment for developing applications using InfoSphere BigInsights.

Once the extractor is published and the application is deployed

in the InfoSphere BigInsights console, it can then be run as a

BigSheets function or as part of a workflow.

IBM Software

Results from a Text Analytics application can be exported to

BigSheets, Dashboard and other InfoSphere BigInsights

components for further analysis (see Figure 3).

IBM InfoSphere Data Explorer

A key part of your data exploration activities is allowing for

human-driven exploration and rapid understanding of the

information you have on hand. It facilitates end-user-driven

creation of topics of interest and automatic discovery of related

information, and lets users quickly build and deploy interactive

web applications that are commonly used in customer insight

and customer service environments. InfoSphere Data Explorer

Engine servers can receive data in real time from a cluster of

InfoSphere BigInsights servers or InfoSphere Streams servers.

InfoSphere Data Explorer can also push relevant data to users

of information applications and also enables federated access

to other IBM products.

Figure 3. Overview of the InfoSphere BigInsights Text Analytics Workflow.

Using IBM InfoSphere BigInsights to accelerate big data time-to-value

Leveraging in-motion and at-rest

analytics

Organizations are increasingly deploying analytics and applica-

tions that span in-motion (or real-time) and at-rest use cases.

The generation of insight that spans in-motion and at-rest use

cases requires data and analytics to flow across both types of

environments. InfoSphere BigInsights 2.1 accelerates the ability

to span in-motion and at-rest information handling by leverag-

ing IBM InfoSphere Streams.

InfoSphere Streams is a high-performance computing platform

that allows user-developed applications to rapidly ingest, analyze

and correlate information as it arrives from thousands of real-

time sources. For streaming data, InfoSphere Streams can con-

tinuously analyze massive amounts of data with very low latency,

enabling you to quickly react to trends and events as they unfold.

Programmers can instruct InfoSphere Streams to write data as

needed to InfoSphere BigInsights for deep analysis of trends

over time. The results of this analysis can be captured and fed

back to InfoSphere Streams to fine-tune application logic and

actions. To reduce deployment time and costs, InfoSphere

Streams applications map naturally to distributed InfoSphere

BigInsights stores.

Integration with popular modeling and

predictive analytics solutions

IBM customers have been asking for easy ways to use a variety of

predictive modeling solutions with Hadoop-based discovery

zones to quicken time-to-value and better utilize existing solu-

tions and skills. InfoSphere BigInsights meets this request by

supporting the most common modeling and analytics packages

available, including SAS, IBM SPSS® and R. Users familiar

with these modeling environments can use data in InfoSphere

BigInsights for their analysis. This allows users to continue to

work in a familiar environment while gaining access to new,

richer information than previously possible.

Each analytics package has different levels of Hadoop support

within the InfoSphere BigInsights environment. SAS and other

environments mainly use BigInsights as a data-sourcing environ-

ment, allowing you to utilize information in structures such as

Hive. Some packages, such as SPSS Catalyst for InfoSphere

BigInsights, allow for model development and execution to be

done directly on the InfoSphere BigInsights platform.

SPSS Catalyst enhances analytical productivity and shortens

time-to-value by helping to automate portions of data prepara-

tion, automatically interpreting results and presenting analyses in

interactive visuals with clear, concise summaries. SPSS Analytic

Catalyst running with InfoSphere BigInsights allows automated

key driver identification with sophisticated algorithms, automatic

testing and regression-based techniques. SPSS Analytic Catalyst

also provides interactive visuals and plain language summaries of

predictive analytics findings that show insights at a glance, along

with supporting explanations and statistical details.

Conclusion

InfoSphere BigInsights 2.1 provides a unique set of capabilities

that combine the innovation from the Apache Hadoop ecosys-

tem with robust support for traditional skill sets and already-

installed tools. The ability to leverage existing skills and tools

through open-source capabilities helps drive lower total cost of

ownership and faster time-to-value.

Notes

For more information

To learn more about InfoSphere BigInsights and the InfoSphere

BigInsights for Hadoop Quick Start Edition, please contact

your IBM representative or IBM Business Partner, or visit

the following websites:



ibm.com/software/data/infosphere/biginsights



ibm.com/infosphere/quickstart

About the author

Tom Deutsch (@thomasdeutsch) serves as a Program Director

for the IBM Big Data team. He played a formative role in the

transition of Hadoop-based technology from IBM Research to

the IBM Software Group, and he continues to be involved with

IBM Research big data activities as well as transitions from

IBM Research to commercial products. Deutsch created the

InfoSphere BigInsights Hadoop-based product and spent several

years helping customers with Hadoop, InfoSphere BigInsights

and InfoSphere Streams technologies, including identifying

architecture fit, developing business strategies and managing

early-stage projects across more than 200 customer engage-

ments. With more than 20 years in the industry and as a

veteran of two startups, Deutsch is an expert on the technical,

strategic and business information management issues facing

the enterprise today.

IBM Corporation

Software Group

Route 100

Somers, NY 10589

Produced in the United States of America

December 2013

IBM, the IBM logo, ibm.com, BigInsights, developerWorks, InfoSphere,

PureData, and SPSS are trademarks of International Business Machines

Corp., registered in many jurisdictions worldwide. Other product and service

names might be trademarks of IBM or other companies. A current list of

IBM trademarks is available on the web at “Copyright and trademark

information” at

ibm.com/legal/copytrade.shtml

Java and all Java-based trademarks and logos are trademarks or registered

trademarks of Oracle and/or its affiliates.

Linux is a registered trademark of Linus Torvalds in the United States,

other countries, or both.

This document is current as of the initial date of publication and may be

changed by IBM at any time. Not all offerings are available in every country

in which IBM operates.

It is the user’s responsibility to evaluate and verify the operation of any other

products or programs with IBM products and programs.

THE INFORMATION IN THIS DOCUMENT IS PROVIDED

“AS IS” WITHOUT ANY WARRANTY, EXPRESS OR

IMPLIED, INCLUDING WITHOUT ANY WARRANTIES

OF MERCHANTABILITY, FITNESS FOR A PARTICULAR

PURPOSE AND ANY WARRANTY OR CONDITION OF

NON-INFRINGEMENT. IBM products are warranted according to the

terms and conditions of the agreements under which they are provided.

This paper does not cover all of the more than 30 capabilities that

InfoSphere BigInsights contributes beyond what is available with open-

source Hadoop distributions. For comprehensive product documentation

on all InfoSphere BigInsights 2.1 features, please see the Product Center

documentation at http://pic.dhe.ibm.com/infocenter/bigins/v2r1/index.jsp?

topic=%2Fcom.ibm.swg.im.infosphere.biginsights.tut.doc%2Fdoc%

2Ftut_Introduction.html

IMW14684-USEN-01

Please Recycle