IBM Software
InfoSphere BigInsights: A highly compatible platform
Third-party applications, partner solutions and custom
development projects that are compatible with the following
component versions supported by InfoSphere BigInsights are
expected to work without changes beyond updating data locations.
• Apache Hadoop (1.1.1), with a 64-bit Linux version of the
IBM SDK for Java 6
• Avro (1.7.2), a data serialization system
• Chukwa (0.5.0), a data collection system for monitoring large
distributed file systems
• Fair Scheduler, for basic management of job submission
• Flume (1.3.0), a distributed, reliable and highly available
service for efficiently moving large amounts of data
around a cluster
• HBase (0.94.3), a non-relational distributed database
written in Java
• HCatalog (0.4.0), a table and storage management
service for Hadoop
• Hive (0.9.0), a data warehouse infrastructure that facilitates
both data extraction, transformation and loading (ETL),
and the analysis of large data sets that are stored in the
Hadoop Distributed File System (HDFS)
• IBM InfoSphere BigInsights Jaql, a query language designed
for JavaScript Object Notation (JSON), primarily used to
analyze large-scale semi-structured data
• Lucene (3.3.0), a high-performance, full-featured text
search engine library written entirely in Java
• Oozie (3.2.0), a workflow coordination manager
• Orchestrator, an advanced MapReduce job control system
that uses a JSON format to describe job graphs and the
relationships between them
• Pig (0.10.0), a platform for analyzing large data sets,
consisting of a high-level language for expressing data
analysis programs and an infrastructure for evaluating
those programs
• Sqoop (1.4.2), a tool that imports information from
structured databases and related Hadoop systems
into Hadoop clusters
• ZooKeeper (3.4.5), a centralized service for maintaining
configuration information and providing distributed
synchronization and group services
This paper describes how these enhancements help extend the
value of open-source Hadoop with the capabilities organizations
need to cost-effectively support emerging big data workloads.
Accelerating deployments by tapping into Hadoop community innovation
The IBM commitment to the Hadoop open-source software
components in InfoSphere BigInsights 2.1 helps facilitate
third-party interoperability and supports the ongoing
development of new features and functionality. Organizations
with existing MapReduce, Hive, Pig and Sqoop projects can reuse
that work on InfoSphere BigInsights 2.1, provided that all
version levels are compatible and the directory structures are
mirrored.
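
The version-level condition above can be checked mechanically before a migration. The following is a minimal sketch, not an IBM tool: the version numbers come from the component list earlier in this paper, while the function name `find_version_mismatches`, the sample project and the decision to treat "compatible" as an exact version match are assumptions made for illustration.

```python
# Component versions listed earlier in this paper for InfoSphere BigInsights 2.1.
SUPPORTED_VERSIONS = {
    "hadoop": "1.1.1",
    "avro": "1.7.2",
    "chukwa": "0.5.0",
    "flume": "1.3.0",
    "hbase": "0.94.3",
    "hcatalog": "0.4.0",
    "hive": "0.9.0",
    "lucene": "3.3.0",
    "oozie": "3.2.0",
    "pig": "0.10.0",
    "sqoop": "1.4.2",
    "zookeeper": "3.4.5",
}


def find_version_mismatches(project_versions):
    """Return {component: (project_version, supported_version)} per mismatch.

    Assumption: "compatible" is treated as an exact version match, which is
    stricter than the paper requires. Components that are not in the list
    are reported with a supported version of None.
    """
    mismatches = {}
    for component, version in project_versions.items():
        supported = SUPPORTED_VERSIONS.get(component.lower())
        if supported != version:
            mismatches[component] = (version, supported)
    return mismatches


# Hypothetical existing project: Hive and Sqoop match, Pig does not.
existing_project = {"hive": "0.9.0", "pig": "0.9.2", "sqoop": "1.4.2"}
print(find_version_mismatches(existing_project))
# prints {'pig': ('0.9.2', '0.10.0')}
```

An empty result means that every listed component matches. A real check would also need to confirm that directory structures are mirrored, which this sketch does not attempt.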
Leveraging existing SQL skills and solutions
Legacy applications depend on SQL to access stored data, and
SQL is the de facto language used to query structured data;
as a result, most organizations have deep and abundant SQL
skills. IBM customers have been asking for ways to leverage their
SQL skills with Hadoop to lower the barrier to getting started
with Hadoop and to facilitate interoperability with existing
SQL-oriented tools and applications. IBM is enabling customers
to do exactly that with the introduction of IBM Big SQL, a data
warehouse system for Hadoop that is used to summarize, query
and analyze data that is stored in InfoSphere BigInsights 2.1.
Applications connect to Big SQL through JDBC or ODBC drivers
and access data that is stored in InfoSphere BigInsights in the
same way that they access databases from enterprise
applications. You can use the Big SQL server to execute
standard SQL queries, and to execute multiple queries
concurrently (see Figure 1).