(Big Data).

Meetup
Hadoop Cluster with Big SQL
Luis Reina
Information Management
IBM Software Group
@luisrei
© 2014 IBM Corporation
What is a Meetup?
Big Data Meetup (Madrid)
Big Data Developers in Madrid
http://www.meetup.com/Big-Data-Developers-in-Madrid
AGENDA
09:30-09:45 Introduction
09:45-10:45 What is Hadoop?
10:45-11:15 Coffee break
11:15-11:30 The ecosystem around Hadoop
11:30-11:45 IBM BigInsights
11:45-12:15 Big SQL
12:15-12:30 IBM Bluemix (Hadoop Cloud)
12:30-13:30 Big SQL lab
INTRODUCTION
Considerations about Big Data
Multiple definitions of Big Data
Big Data is fashionable: everybody talks about “Big Data”
Big Data is the problem
Tools and development work turn the threat into an opportunity.
Servers are cheap and the software is open source, but the effort and cost of analyzing Big Data should not be underestimated.
What is Big Data, and what is not?
The boundary is not 100% clear
TRADITIONAL DATA
Relational databases
OLTP transactional data
ERP data
…
BIG DATA
Social network data (Twitter, Facebook)
IT and web logs
Sensor data
…
What is Big Data, and what is not?
DATA that, because of its:
VOLUME,
VELOCITY, or
VARIETY (format)
is DIFFICULT or IMPRACTICAL to ANALYZE with TRADITIONAL means.
What is the solution for tackling Big Data?
Classification of Big Data:
Data at rest:
– The data being analyzed is already stored.
– Examples: log information, Facebook, Twitter, etc.
– Solution: Hadoop (open source).
Data in motion:
– The data is analyzed in flight, in real time, as it is generated, without waiting for it to be stored.
– Examples: sensors, fraud information, etc.
– Solution: IBM InfoSphere Streams.
What is Hadoop?
Hadoop is a development framework and a runtime environment
A development framework and runtime environment for building applications that can process large volumes of data (Big Data).
The resulting applications are batch-oriented and read-intensive.
Based on technology from Google.
It is open source (free): Apache Hadoop
http://hadoop.apache.org/
Hadoop is not a database management system
It is a development and execution framework, not a DBMS.
It is not meant to replace existing data warehouses.
Applications use the CPU and disks of cheap “commodity” computers.
Applications run on a cluster of many machines working in parallel.
Machines can be added without changing the applications, the way data is loaded, or the data formats.
If a machine “breaks”, another one takes over its work.
Origin and Evolution of Hadoop
[Timeline figure. Phases: early research; open source dev momentum; initial success stories; commercialization. Milestones shown: MapReduce and GFS papers published; Apache open source MapReduce & HDFS projects created; a 4,000-node Hadoop cluster is run; the Terabyte sort benchmark is won; SQL support for Hadoop is launched; CDH3 is released; BigInsights is announced.]
Hadoop has 2 key components
File system: HDFS
– Where Hadoop stores the data.
– Uses local disks but works as one large file system spanning multiple nodes.
Map/Reduce
– Algorithm for processing the data in the cluster.
– Two steps: MAP and REDUCE.
– Divide and conquer
HDFS is a file system for the cluster
HDFS = Hadoop Distributed File System
HDFS is a file system that stores the data to be analyzed.
It is a single distributed file system.
The data is spread across the whole cluster.
Each node in the cluster holds a “chunk” of the data.
These “chunks” are called blocks and are 64 MB by default.
HDFS
HDFS assumes that a node can fail, so it replicates the data on multiple nodes.
3 copies by default.
There is no SAN or NAS; the nodes only have local disks.
Nodes can talk to each other to rebalance and move data if necessary.
One node (the NameNode) keeps track of who holds what (metadata), i.e. which node stores which data.
Applications do not need to worry about where the data is located.
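The block and replication ideas from the two HDFS slides can be pictured with a small Python sketch. It is only an illustration: the function names and the round-robin placement are assumptions made here, and real HDFS placement is rack-aware and more involved. The 64 MB block size and the 3 copies are the defaults quoted on the slides.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size, as quoted on the slide
REPLICATION = 3                # default number of copies, as quoted on the slide


def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a toy NameNode-style map: block index -> nodes holding a replica.

    Round-robin placement is an assumption for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = math.ceil(file_size / block_size)
    metadata = {}
    for block in range(num_blocks):
        # Consecutive nodes (modulo cluster size) hold the replicas of this block.
        metadata[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return metadata


def readable_after_failure(metadata, failed_node):
    """The file stays readable if every block has a replica on a surviving node."""
    return all(any(n != failed_node for n in holders) for holders in metadata.values())


nodes = ["node1", "node2", "node3", "node4"]
metadata = place_blocks(200 * 1024 * 1024, nodes)  # a 200 MB file
print(len(metadata), "blocks; block 0 is on", metadata[0])
print("readable if node2 fails:", readable_after_failure(metadata, "node2"))
```

The dictionary plays the role of the NameNode metadata: the application asks for a file, and only this map knows which node holds which block.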
HDFS DEMO: VMware
VMware: lets me run one or more virtual machines (with different operating systems) inside my native operating system.
– Run Linux inside Windows.
– Run Windows inside Linux.
– Run several operating systems at once.
HDFS DEMO
POSIX (Portable Operating System Interface) file systems
HDFS is not POSIX
Hadoop shell
– hadoop dfs -<command>
E.g.:
hadoop dfs -ls
hadoop dfs -mkdir …
hadoop dfs -put …
hadoop dfs -get
Apache Hadoop web interface
– HDFS NameNode: http://<hostname>:50070
BigInsights web interface
– http://<hostname>:8080
What is Map/Reduce?
Algorithm for analyzing the data.
The starting point is that the data has already been distributed across the cluster (HDFS).
The program that analyzes this data uses the Map/Reduce algorithm. These programs are called jobs, and they are divided into tasks of type Map and Reduce.
Step 1: Map task
Converts the data into tuples: (key, value)
Step 2: Reduce task
Reduces the number of tuples generated by Map (e.g. by aggregating)
MapReduce example
Count the number of occurrences of each word
Input data:
Hola Mundo Adios Mundo
Hola Meetup
Map process (parallel)
Map 1 emits:
<Hola, 1>
<Mundo, 1>
<Adios, 1>
<Mundo, 1>
Map 2 emits:
<Hola, 1>
<Meetup, 1>
Reduce process
Reduce (final output):
<Adios, 1>
<Meetup, 1>
<Hola, 2>
<Mundo, 2>
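The word-count example can be reproduced on a single machine with a few lines of Python. This is a sketch of the idea only, not Hadoop code: the function names are made up here, and the explicit shuffle step stands in for the grouping that the framework performs between Map and Reduce.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input line into (key, value) tuples, one per word."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single tuple."""
    return key, sum(values)


# The two input splits from the slide; on a cluster each would go to its own mapper.
splits = ["Hola Mundo Adios Mundo", "Hola Meetup"]
mapped = [pair for line in splits for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

The result matches the final output on the slide: Hola and Mundo appear twice, Adios and Meetup once.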
How a Hadoop application runs
MapReduce application:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map: emit (word, 1) for every token of the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text val, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(val.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  // Reduce: add up the counts received for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> val, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : val) {
        sum += v.get();
      }
      // The slide cuts the listing off here; the closing lines follow the
      // standard Apache Hadoop WordCount example.
      result.set(sum);
      context.write(key, result);
    }
  }
}
Execution on the Hadoop data nodes:
1. Map phase: the MAP tasks are launched on the cluster
2. Shuffle
3. Reduce phase: returns a single set of results
How to create Hadoop applications (Map/Reduce jobs), from hardest to easiest:
• Map/Reduce development in Java: difficult, very complex
• Pig: higher-level open source language; standard
• Hive: open source language, similar to SQL
• Jaql: language similar to Pig, with more functionality
• BigSheets-style tools: browser/spreadsheet interface; no development required; easy
Map/Reduce DEMO
Word count program:
hadoop jar hadoop-examples.jar wordcount <in-dir> <out-dir>
Example:
hadoop jar hadoop-examples.jar wordcount /tmp/datos /salida
Apache Hadoop web interface
– http://<hostname>:50030
BigInsights web interface
– http://<hostname>:8080
Hadoop Ecosystem
A rich ecosystem has grown around the idea of Hadoop
The ecosystem enriches Hadoop.
It includes tools such as:
– Flume: loads data into Hadoop (HDFS).
– HBase: database on top of HDFS.
– Oozie: workflow control.
– Lucene: indexer and search engine on HDFS.
– Jaql, Pig, Hive: high-level languages that generate Map/Reduce jobs.
– Text Analytics: text analysis using Map/Reduce.
– ZooKeeper: coordinator.
– …
Example ecosystem tools
[Stack diagram, top to bottom:]
BigIndex (full text search), SystemT (text analytics), Jaql (Big Data shell), Hive (data warehouse SQL), Pig (ETL), HBase (interactive storage), Flume (data collection), Oozie (workflow), ZooKeeper (coordination), Web Console (management)
MapReduce (distributed computation framework)
HDFS (or GPFS) (distributed file system)
Components in blue boxes are only available with the IBM BigInsights product.
Hive
Query language for accessing Hadoop.
Language “similar” to SQL: Hive Query Language (HQL).
Generates Map/Reduce jobs.
MPP parallelism.
Has a data catalog (Hive Catalog).
One use case for Hadoop is “Data Warehouse Augmentation”.
Hive table and query:
CREATE EXTERNAL TABLE Tabla_Hive
(
Nombre STRING,
Direccion STRING,
Edad INT
)
COMMENT 'Tabla de Ejemplo'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/datos/tabla_hive/';
SELECT * FROM Tabla_Hive;
Hive has limitations
It looks like standard SQL, but it is not.
Not SQL92.
It is not designed for “online” queries.
Batch processing for queries and data loads.
Does not support inserts and updates.
Limited data types (e.g. no varchar or decimal).
No subquery support.
Limited join syntax support.
Limited JDBC/ODBC driver.
HBASE
Open source NoSQL database.
Based on Google's Bigtable paper [2006].
Handles large volumes.
• 2011: 1,000-node cluster with petabytes of data (3x)
Fault tolerant, horizontally scalable, high performance.
Uses the Hadoop file system (HDFS) to store the data.
Enables real time:
Modifications and queries (Hadoop itself is batch and read-oriented)
Gives Hadoop a way to do modifications and real-time access.
Limitations of HBASE
It is not a relational database.
The language is not SQL (scan, get, put, etc.)
No secondary indexes.
No transactions across multiple rows.
No query optimizer.
Uses a lot of space.
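To give a feel for what working with put, get and scan (instead of SQL) implies, here is a toy in-memory table in Python. It is not the HBase client API; the class and its methods are hypothetical stand-ins that only mimic the access pattern: rows are reachable by row key or by a key range, and nothing else is indexed.

```python
import bisect


class ToyTable:
    """In-memory stand-in for a row-key-ordered table (not the HBase API)."""

    def __init__(self):
        self._keys = []  # row keys, kept sorted
        self._rows = {}  # row key -> {column: value}

    def put(self, row_key, column, value):
        """Insert or overwrite one cell of one row."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Fetch a single row by its key."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start, stop):
        """Yield (row_key, columns) for start <= row_key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, dict(self._rows[key])


table = ToyTable()
table.put("user#002", "name", "Ana")
table.put("user#001", "name", "Luis")
table.put("user#001", "city", "Madrid")
table.put("order#001", "total", "10")

luis = table.get("user#001")
users = [key for key, _ in table.scan("user#", "user#~")]
print(luis, users)
```

Finding every row whose city is Madrid would require scanning the whole table, which is the practical meaning of having no secondary indexes and no query optimizer.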
BIG SQL
Native SQL: ANSI SQL 92+
JDBC/ODBC drivers.
Uses MapReduce for parallelism.
Direct access for fast queries.
Many data sources:
– HBase
– CSV and delimited files
– JSON files
– Hive tables
– …
[Diagram: Application → SQL → JDBC / ODBC Driver → JDBC / ODBC Server → Big SQL Engine → Data Sources (Hive tables, HBase tables, CSV files)]
BigInsights
What is IBM BigInsights?
A product based on Hadoop.
It makes Hadoop “enterprise ready” by adding elements such as:
• Administration.
• Security.
• GPFS file system.
• Advanced analytics capabilities from IBM Research.
• Workflow.
• Provisioning.
• Ease of use (BigSheets).
It integrates with existing IBM databases and warehouses: DB2, InfoSphere Warehouse, Smart Analytics, and Netezza.
IBM Significantly Enhances Hadoop
• Scalable
– New nodes can be added on the fly.
• Affordable
– Massively parallel computing on commodity servers
• Flexible
– Hadoop is schema-less, and can absorb any type of data.
• Fault Tolerant
– Through MapReduce software framework
IBM Innovation
• Performance & reliability
– Adaptive MapReduce, Compression, Indexing, Flexible Scheduler
• Analytic Accelerators
• Productivity Accelerators
– Web-based UIs
– Tools to leverage existing skills
– End-user visualization
• Enterprise Integration
– To extend & enrich your information supply chain.
BigInsights Enterprise Edition
[Architecture diagram. Layers: Analytics and discovery; Infrastructure; Connectivity and Integration; Administrative and development tools; Optional IBM and partner offerings. A legend distinguishes IBM components from Open Source ones.]
Components shown: text processing engine and library, Accelerator for social data analysis, Accelerator for machine data analysis, BigSheets, Big R, “Apps” (Web Crawler, Boardreader, Distrib file copy, DB export, DB import, Machine learning, Data processing, Ad hoc query, ...), integrated installer, text compression, Adaptive MapReduce, enhanced security, flexible scheduler, indexing, Big SQL, Jaql, Pig, Oozie, HBase, Hive, Lucene, ZooKeeper, HCatalog, MapReduce, HDFS, GPFS-FPO, JDBC, Flume, Sqoop, Streams, Data Explorer, DB2, Netezza, Guardium, DataStage, Cognos BI.
Web console
• Monitor cluster health, jobs, etc.
• Add / remove nodes
• Start / stop services
• Inspect job status
• Inspect workflow status
• Deploy applications
• Launch apps / jobs
• Work with distrib file system
• Work with spreadsheet interface
• Support REST-based API
• Create / view alerts
• ...
Eclipse tools
• Text analytics
• MapReduce programming
• Jaql, Hive, Pig development
• BigSheets plug-in development
• Oozie workflow generation
BigInsights Web Console
Manages BigInsights
– Inspect the system
– Add/remove nodes
– Start/stop services
– Run and monitor jobs (applications)
– Browse the file system
– …
Launches applications
– Spreadsheet-style analysis tool.
– Prebuilt applications (supplied by IBM or developed by the user)
Publishes applications
Running applications from the Web Console
Development with Eclipse
Eclipse-based development tools
For Jaql, Hive, Java MapReduce, Text Analytics
Quickly drag and drop to create new applications
Application Accelerators
Quickly build, deploy custom applications in high-value areas
IBM Accelerator for Social Data Analytics
• B2C businesses
• Sample applications: Customer acquisition / retention, Customer
Segmentation or Micro Segmentation, Marketing Campaign Optimization,
Lead generation, Brand Management or Surveillance
• Ships with BigInsights v2 and Streams v3
IBM Accelerator for Machine Data Analytics
• Cross-industry: manufacturing, oil & gas, energy and utility, healthcare,
travel and transportation, CPG, Retail, etc.
• Operational efficiency monitoring, security incident investigation, proactive maintenance, troubleshooting, outage prevention, efficiency tracking, etc.
• Ships with BigInsights v2
IBM Accelerator for Telco Event Data Analytics
• Telcos
• Campaign management, real-time promotion, fraud detection, service assurance and network monitoring
• Ships with Streams v3, but works with BigInsights or PureData for Analytics (a.k.a. Netezza)
Big Sheets: Collection Sample
Spreadsheet-like structures defined by user
Based on data accessible through BigInsights Web console – e.g., file system
data, output from Web crawl, etc.
Big Sheets: Collection Operations
Work with built-in “sheets” editor
Add / delete columns
Filter data
Specify formulas to compute new
values using spreadsheet-style syntax
Apply built-in or custom macro
functions….
What is Text Analytics?
High Performance and Scalable rule based Information Extraction Engine.
Distill structured information from unstructured data
– Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test, and refine rules.
Developer tools, an easy to use text analytics language, and a set of extractors for fast adoption.
Multilingual support, including support for DBCS languages.
Developed at IBM Research since 2004: System T
Embedded in several IBM products
– InfoSphere Warehouse
– InfoSphere Streams
– Lotus Notes
– Cognos Consumer Insights
BigInsights is the first time IBM opens up the Text Analytics Engine technology for customization and development
Annotator Query Language (AQL)
Language to create rules for Text Analytics.
SQL Like Language.
Fully declarative text analytics language.
Once compiled, it produces an AOG plan that is run against the data.
No “black boxes” or modules that can’t be customized.
Tooling for easy customization because you are abstracted from the programmatic details.
Competing solutions make use of locked-up black-box modules that cannot be customized, which restricts flexibility and makes them difficult to optimize for performance.
create view AmountWithUnit as
extract pattern <N.match> <U.match>
as match
from Number N, Unit U;
BigInsights Text Analytics Components
Eclipse Tools
– Develop and maintain extractors in AQL
Pre-compiled extractor library
– Western languages: Named Entities (person, organization, location, phone, URL, email, date/time) and financial events (merger, acquisition, company earnings)
– Chinese/Japanese: Named Entities (Person, Organization, Location)
Jaql Text Analytics module
– Execute extractors on the cluster from Jaql
Text Analytics Java API
– Invoke Text Analytics directly from your application
[Diagram: AQL Language → Optimizer → Compiled Plan; Input Document → BigInsights Cluster → Extracted objects]
Text Analytics: Simple Example
World Cup 2010 Highlights
Football World Cup 2010, one team distinguished well from the rest winning the final. Early in the second half, Netherlands’ striker, Arjen Robben, had a chance to score, but the awesome keeper for Spain, Iker Casillas made the save. Winner superiority was reflected when Winger Andres Iniesta scored for Spain for the win.
Extracted:
Arjen Robben | Striker | Netherlands
Iker Casillas | Keeper | Spain
Andres Iniesta | Winger | Spain
Large Scale Indexing, Faceted Search
Designed to improve text searches over big data
Indexing characteristics
– Based on Apache Lucene
– Parallel index
• Index operation is run in parallel, but the index is stored in one physical index.
– Distributed index
• Index is too large to be contained in one physical index
• Index is distributed into shards, representing one logical index
• Each query is evaluated against all shards
Faceted search: categorization, drill down
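The distributed-index idea (several shards that together form one logical index, with every query sent to all of them) can be sketched in Python. This is a toy inverted index, not Lucene or BigIndex; the function names and the modulo sharding rule are assumptions chosen for brevity.

```python
from collections import defaultdict


def build_shards(docs, num_shards):
    """Build one small inverted index (term -> doc ids) per shard."""
    shards = [defaultdict(set) for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shard = shards[doc_id % num_shards]  # assumed sharding rule
        for term in text.lower().split():
            shard[term].add(doc_id)
    return shards


def search(shards, term):
    """Evaluate the query against every shard and merge the partial results."""
    hits = set()
    for shard in shards:
        hits |= shard.get(term.lower(), set())
    return sorted(hits)


docs = {
    1: "big data cluster",
    2: "hadoop cluster",
    3: "big sql",
    4: "text search",
}
shards = build_shards(docs, num_shards=2)
print(search(shards, "cluster"))
print(search(shards, "big"))
```

Documents 1 and 2 live in different shards here, so the query for "cluster" only finds both because it is evaluated against all shards and the results are merged.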
Adaptive MapReduce Performance
Fully compatible with Hadoop jobs
Tests out to be 20 – 50% faster
Self-adjusts based on the nature of the Hadoop job.
Broadcast join example
– Large startup cost (100sec) vs.
– Imperfect balance between maps
Adaptive Mappers to the rescue
– Split size matters much less
– Default size (64MB) performs much better
BigInsights LZO Compression
IBM LZO compression:
– Fast, flexible compression.
– Splittable, whereas GNU LZO is not.
– Similar to GNU-based LZO compression, but no index file needed.
– Fixed-size compression blocks are created automatically.
[Figure: original source vs. compressed representation in fixed-size blocks]
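Why fixed-size compression blocks make a file splittable can be shown with a short Python sketch. It uses zlib from the standard library as a stand-in, since LZO is not part of it, and it is not IBM's implementation: the helper names and the 1 KB block size are assumptions for illustration.

```python
import zlib


def compress_blocks(data, block_size):
    """Compress each fixed-size slice of the input independently."""
    return [
        zlib.compress(data[start:start + block_size])
        for start in range(0, len(data), block_size)
    ]


def read_block(blocks, index):
    """Decompress one block without touching any of the others."""
    return zlib.decompress(blocks[index])


data = b"".join(b"line %d\n" % i for i in range(1000))
blocks = compress_blocks(data, block_size=1024)

# A mapper assigned block 3 can start there directly.
middle = read_block(blocks, 3)
restored = b"".join(read_block(blocks, i) for i in range(len(blocks)))
print(len(blocks), "blocks,", len(middle), "bytes in block 3")
```

Because every block decompresses on its own, separate map tasks can each take a range of blocks. A file compressed as one single stream would have to be read from the beginning by one task.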
Comparison of GPFS and HDFS
File system | GPFS | HDFS
Robustness | No single point of failure | NameNode vulnerability
Data integrity | High | Possibility of data loss
Scalability | Thousands of nodes | Thousands of nodes
POSIX compliance | Full | Limited
Data management | Security, backups, replication | Limited
MapReduce performance | Good | Good
Traditional application performance | Good | Poor performance on random reads and writes
Integration
[Diagram: BigInsights integration with Netezza, DB2 (DB2 LUW, IW with DPF), Streams, DataStage, and other DBMSs. Labels: sample UDFs to submit BigInsights jobs and consume results; JDBC; Jaql read/write.]
Integration with DB2
More Information about BigInsights (1/3)
BigInsights wiki with links, demos, forums, etc.
http://www.ibm.com/developerworks/wiki/biginsights/
More Information about BigInsights (2/3)
In the Cloud
– Via RightScale, or directly on Amazon, Rackspace, IBM Smart Enterprise
Cloud, or on private clouds.
– Pay only for the resources used.
In the Classroom
– Via IBM Education
– Online at www.bigdatauniversity.com
On Your Cluster
– Download Basic Edition from ibm.com.
With the BigInsights Community
– Technical portal at http://tinyurl.com/biginsights
– Links to demos, papers, forum, downloads, etc.
More Information about BigInsights (3/3)
Get Educated
– IBM Big Data: ibm.com/bigdata
– IBMBigDataHub.com
– BigDataUniversity.com
– IBV study on big data
– Books / analyst papers
Schedule a Big Data Workshop
– Free of charge
– Best practices
– Industry use cases
– Business uses
– Business value assessment
BigSQL
IBM, Inventor of Databases
1960s: Navigational DBMS
– IMS (hierarchical)
1970s-1980s: Relational DBMS
– SQL
– System R, System Z, DB2
1990s: Data Warehouse
– Dimensional model, ETL, MDM
Today: Big Data
– Big SQL
[Photo: Ted Codd]
SQL for Hadoop: Why?
Data warehouse augmentation is the leading Hadoop use case
1. Pre-Processing Hub
2. Query-able Archive
3. Exploratory Analysis
[Diagram labels: Streams (real-time processing); BigInsights (landing zone for all data); Information Integration; Data Warehouse; can combine with unstructured information]
MapReduce is difficult
– MapReduce Java API is tedious and requires programming expertise
– Unfamiliar languages (e.g. Pig) also require special skills
SQL support opens the data to a much wider audience
– Familiar, widely known syntax
– Common catalog for identifying data and structure
– Declarative: clear separation of the what (the data you’re after) vs. the how (processing)
Big SQL: Native SQL Access to Hadoop
Native SQL access to data stored in BigInsights
– ANSI SQL 92+
– Standard syntax support (joins, data types, …)
Real JDBC/ODBC drivers
– Prepared statements
– Cancel support
– Database metadata API support
– Secure socket connections (SSL)
Optimization
– Leveraging MapReduce parallelism or…
– Direct access for low-latency queries
Varied data sources
– HBase (including secondary indexes)
– CSV, delimited files, sequence files
– JSON
– Hive tables
[Diagram: Application → SQL → JDBC / ODBC Driver → JDBC / ODBC Server → Big SQL Engine → Data Sources (Hive tables, HBase tables, CSV files) in BigInsights]
Architecture
Big SQL shares catalogs with Hive via the Hive metastore
– Each can query the other's tables
SQL engine analyzes incoming queries
– Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
– Rewrites the query if necessary for improved performance
– Determines the appropriate storage handler for the data
– Produces the execution plan
– Executes and coordinates the query
[Diagram: Application → SQL Language → JDBC / ODBC Driver → Network Protocol → Big SQL Server (SQL Engine; Storage Handlers for delimited files, sequence files, HBase, RDBMS, ...). BigInsights cluster: head nodes hosting the Big SQL Server, Job Tracker, Name Node and Hive Metastore; compute nodes each running a Task Tracker, Data Node and Region Server.]
Standard BI Tools
Cognos BI server can push down many computations to BigInsights
– Big SQL directs this processing to happen on BigInsights instead of the Cognos BI Server
Faster response times
– Increased opportunity for query processing to occur closer to the data
Free from the limitations of Hive (latency, SQL language support)
[Diagram: Cognos BI Server (Explore & Analyze, Report & Act) → SQL interface via JDBC → InfoSphere BigInsights: Application (Map-Reduce), Storage (HBase, HDFS)]
Standard Tool: SQuirreL SQL
Using existing SQL tooling against Big Data
Support for “standard” authentication
(not supported by Hive, but supported by Big SQL)
Standard Tool: Eclipse
Using existing SQL tooling against Big Data
Same setup as for existing SQL sources
Support for “standard” authentication
Big SQL from the BigInsights Console
In Quick Links, select to run Big SQL queries from the console
Type in query, or cut and paste from SQL script. Hit Run.
BigSQL: Tablas
BigSQL supports create table and many data types including
varchar, decimals, etc.
CREATE TABLE TPCH.CUSTOMER ( C_CUSTKEY INTEGER, C_NAME VARCHAR(25),
C_ADDRESS VARCHAR(40), C_NATIONKEY INTEGER, C_PHONE CHAR(15),
C_ACCTBAL FLOAT, C_MKTSEGMENT CHAR(10), C_COMMENT VARCHAR(117) )
row format delimited fields terminated by '|'
stored as textfile
WITH HINTS(accessMode='local');
Hive does not support data types like varchar and decimal, so the same statement is not accepted there:
CREATE TABLE TPCH.CUSTOMER ( C_CUSTKEY INTEGER, C_NAME VARCHAR(25),
C_ADDRESS VARCHAR(40), C_NATIONKEY INTEGER, C_PHONE CHAR(15),
C_ACCTBAL FLOAT, C_MKTSEGMENT CHAR(10), C_COMMENT VARCHAR(117) )
row format delimited fields terminated by '|'
stored as textfile
WITH HINTS(accessMode='local');
LOCATION and EXTERNAL
LOCATION keyword
– Allows explicit data placement
– Specifies a directory containing table data
– All files in directory are assumed to be data
– If not provided then the table directory is created in the Hive warehouse dir
create table user
(
user_id int not null,
fname varchar(20) not null,
lname varchar(30) not null
)
...
location '/users/bob/tables/user';
EXTERNAL keyword
– Big SQL does not manage the data
– Requires LOCATION keyword
– Data is assumed to already exist
– Dropping the table leaves the original data intact
create external table user
(
user_id int not null,
fname varchar(20) not null,
lname varchar(30) not null
)
...
location '/users/bob/tables/user';
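The practical difference between a managed table and an EXTERNAL one shows up when the table is dropped. The Python sketch below mimics that behaviour with ordinary directories. `ToyCatalog` is a hypothetical stand-in, not Big SQL or Hive code, and deleting the directory of a non-external table on drop is an assumption made here to contrast with the "leaves the original data intact" rule from the slide.

```python
import os
import shutil
import tempfile


class ToyCatalog:
    """Hypothetical catalog: remembers where each table's data lives."""

    def __init__(self, warehouse_dir):
        self.warehouse_dir = warehouse_dir
        self.tables = {}  # table name -> (location, is_external)

    def create_table(self, name, location=None, external=False):
        if external and location is None:
            raise ValueError("EXTERNAL requires LOCATION")
        if location is None:
            # No LOCATION: the table directory goes into the warehouse dir.
            location = os.path.join(self.warehouse_dir, name)
        os.makedirs(location, exist_ok=True)
        self.tables[name] = (location, external)
        return location

    def drop_table(self, name):
        location, external = self.tables.pop(name)
        if not external:
            shutil.rmtree(location)  # managed data goes away with the table


root = tempfile.mkdtemp()
catalog = ToyCatalog(os.path.join(root, "warehouse"))

managed_dir = catalog.create_table("managed_user")

# Data that already exists outside the warehouse, as EXTERNAL assumes.
data_dir = os.path.join(root, "users", "bob", "tables", "user")
os.makedirs(data_dir)
data_file = os.path.join(data_dir, "part-0.txt")
with open(data_file, "w") as handle:
    handle.write("1\tBob\tSmith\n")
catalog.create_table("external_user", location=data_dir, external=True)

catalog.drop_table("managed_user")
catalog.drop_table("external_user")
managed_exists = os.path.exists(managed_dir)
external_exists = os.path.exists(data_file)
print(managed_exists, external_exists)
shutil.rmtree(root)  # clean up the temporary directory
```

After both drops, only the externally located file is still on disk, which is why EXTERNAL suits data that other tools also read.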
System Catalog Tables
Big SQL maintains a number of system tables
– These are virtual tables that do not live in the Hive catalogs
Catalog tables
– These are views over the Hive catalogs
– Live in the syscat schema
Name | Description
tables | Contains all tables and the schema in which they reside
columns | Details all table columns
schemas | Lists all defined schemas
indexcolumns | Lists all defined indexes
Big SQL – Create Schema
CREATE SCHEMA example with more clauses:
CREATE SCHEMA
IF NOT EXISTS
sales
COMMENT 'This schema is for sales team'
LOCATION '/user/sales'
WITH DBPROPERTIES
(
'owner' = 'John Doe' ,
'alternateContactInfo' = 'Mary Doe'
);
DBPROPERTIES is a set of user defined properties
Tip: When trying Big SQL on shared cluster, create a personal
schema for all your tables etc to avoid interfering with others’ work.
Data Types
Big SQL supports the following data types:
tinyint, smallint, int[eger], bigint, boolean, float, double, real, timestamp, string, varchar(len), char(len), binary, binary(len), varbinary(len)
With the following caveats:
– tinyint is an alias for smallint
– real is an alias for float
– char(len) is an alias for varchar(len)
– binary(len) is an alias for varbinary(len)
– string is treated like varchar(32768), binary like binary(32768)
SQL Functions
Wide variety of built-in functions
– Numeric: abs, ceil, floor, ln, log10, mod, power, sqrt, sign, width_bucket
– Trigonometric: cos, sin, tan, acos, asin, atan, cosh, sinh, tanh
– Date: _add_days, _add_months, _add_years, localtimestamp, _age, _day_of_week, _day_of_year, _week_of_year, _days_between, _months_between, _years_between, _ymdint_between, _first_of_month, _last_of_month, extract
– String: char_length, trim, octet_length, upper, lower, substring, position, index, translate
Aggregations
Standard aggregates:
max, min, sum, count, var_samp, stdev, stdev_samp, stdev_pop, var_pop, percentile, corr, covar_samp, covar_pop, regr_avgx, regr_avgy, regr_count, regr_slope, regr_intercept, regr_r2, regr_sxx, regr_syy
Windowed aggregates:
rank, ntile, percentile_cont, first_value, lead, dense_rank, tertial, percentile_disc, last_value, ratio_to_report, percentile_rank, cume_dist, lag, nth_value
Not all aggregates are currently fully parallelizable
– However, filtering, sorting, and grouping of the data to feed aggregation is still parallel
– Where possible we will work on providing inexact but parallelizable variants
Joins
BigSQL supports both standard (implicit) and ANSI join syntax
select ...
from tpch.orders, tpch.lineitem
where o_orderkey = l_orderkey

select ...
from tpch.orders join tpch.lineitem
on o_orderkey = l_orderkey
Hive supports joins via ANSI join syntax only, so of the two statements above it accepts only the second.
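That the two join spellings describe the same result can be checked on any SQL engine. The sketch below uses Python's built-in sqlite3 module as a stand-in (it is neither Big SQL nor Hive), with two tiny made-up tables that borrow the column names from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders   (o_orderkey INTEGER, o_custkey INTEGER);
    CREATE TABLE lineitem (l_orderkey INTEGER, l_quantity INTEGER);
    INSERT INTO orders   VALUES (1, 10), (2, 20), (3, 30);
    INSERT INTO lineitem VALUES (1, 5), (1, 7), (2, 1);
""")

# Implicit ("standard") join: tables in FROM, join condition in WHERE.
implicit = conn.execute("""
    SELECT o_orderkey, l_quantity
    FROM orders, lineitem
    WHERE o_orderkey = l_orderkey
    ORDER BY o_orderkey, l_quantity
""").fetchall()

# ANSI join: explicit JOIN ... ON.
ansi = conn.execute("""
    SELECT o_orderkey, l_quantity
    FROM orders JOIN lineitem
    ON o_orderkey = l_orderkey
    ORDER BY o_orderkey, l_quantity
""").fetchall()

print(implicit == ansi, ansi)
conn.close()
```

Both queries return the same three rows; order 3 has no line items and drops out of the inner join either way.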
SQL Support – Subqueries
BigSQL supports subqueries
select c1,
(select count(*) from t2)
from t1
...

select c1
from t1
where c2 > (select ...)
Hive does not support subqueries such as the two above.
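The two subquery shapes on the slide (a scalar subquery in the select list and a subquery in the WHERE clause) can be tried out with Python's sqlite3 module as a stand-in engine. The table contents are made up, and since the slide elides the second subquery, the `avg(c2)` used below is only a hypothetical example of what could go there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (c1 TEXT, c2 INTEGER);
    CREATE TABLE t2 (c2 INTEGER);
    INSERT INTO t1 VALUES ('a', 1), ('b', 5), ('c', 9);
    INSERT INTO t2 VALUES (2), (4), (6);
""")

# Scalar subquery in the select list: the same count is attached to every row.
scalar = conn.execute("""
    SELECT c1, (SELECT count(*) FROM t2)
    FROM t1
    ORDER BY c1
""").fetchall()

# Subquery in the WHERE clause (the avg() subquery is a hypothetical fill-in).
filtered = conn.execute("""
    SELECT c1
    FROM t1
    WHERE c2 > (SELECT avg(c2) FROM t2)
    ORDER BY c1
""").fetchall()

print(scalar, filtered)
conn.close()
```

Without subquery support, each inner query would have to be run first and its result pasted into the outer query by hand or by the application.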
SQL Support – Aggregates
BigSQL supports windowed aggregates
select *
from (select rank() over (order by age asc) as my_rank,
empno,
name,
age
from employee2) as t
where my_rank <= 4;
Hive does not support windowed aggregates, so it does not accept the query above.
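What `rank() over (order by age asc)` computes can be spelled out in plain Python. This is a sketch of the ranking rule only (ties share a rank and the following rank is skipped), with made-up employee rows standing in for the employee2 table; `rank_by` is a hypothetical helper, not part of any SQL engine.

```python
def rank_by(rows, key):
    """Return (rank, row) pairs using SQL RANK() semantics."""
    ordered = sorted(rows, key=key)
    ranked = []
    for position, row in enumerate(ordered, start=1):
        if ranked and key(ranked[-1][1]) == key(row):
            rank = ranked[-1][0]  # tie: reuse the previous rank
        else:
            rank = position       # otherwise the rank is the 1-based position
        ranked.append((rank, row))
    return ranked


# (empno, name, age): illustrative data, not from the presentation.
employees = [
    (1, "Ana", 31),
    (2, "Luis", 25),
    (3, "Eva", 25),
    (4, "Juan", 40),
    (5, "Sara", 31),
    (6, "Pablo", 52),
]

# Equivalent of: where my_rank <= 4
youngest = [(rank, row) for rank, row in rank_by(employees, key=lambda e: e[2]) if rank <= 4]
for rank, (empno, name, age) in youngest:
    print(rank, empno, name, age)
```

The two 25-year-olds share rank 1 and the two 31-year-olds share rank 3, so the filter keeps four rows even though no row has rank 2 or 4.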
Client Drivers
Big SQL provides standards-compliant native drivers
– Details on driver features and usage will be covered later
Type 4 JDBC 3.0 Driver
– Prepared statements
• Including result set and parameter marker metadata
– Cancel support
– Database metadata API support
• Look up tables, columns, types, etc.
– Secure socket connections (SSL)
ODBC Driver
– Same feature set as JDBC driver
– Supported platforms: x86 Linux 64-bit, Windows 32-bit and 64-bit
"Point queries"
MapReduce incurs measurable overhead for the sake of resiliency
– Each mapper/reducer may involve JVM startup/shutdown
– Intermediate data is written to disk so partial failures can restart just the failed portion of
the query
– Job scheduling overhead
– Overhead can be as high as 20-30 seconds per job
For small data sets or certain data sources (e.g. HBase) MapReduce may be
unnecessary
Big SQL provides the ability to run queries entirely in the server, providing
milliseconds response time
– Automatically chosen for very simple queries:
SELECT c1, c2 FROM T1
– Can be provided as a query hint:
SELECT c1 FROM t1 /*+ accessmode='local' +*/ WHERE c2 > 10
– Or session setting:
set force local on;
SELECT c1 FROM t1 WHERE c2 > 10;
Managing Big SQL Server – Command Line
Start or stop Big SQL server from UNIX command line
$BIGSQL_HOME/bin/bigsql –help       # for more options
$BIGSQL_HOME/bin/bigsql level       # prints bigsql-server level
$BIGSQL_HOME/bin/bigsql clean       # cleans up after improper stop
$BIGSQL_HOME/bin/bigsql forcestop   # try this if “bigsql stop” does not stop
JSqsh – Big SQL’s CLI
JSqsh (“jay-skwish” – Java SQL Shell)
– Open source command line JDBC client (http://jsqsh.wiki.sourceforge.net)
– Works with any JDBC driver, not just Big SQL
It can be started with
$BIGSQL_HOME/bin/jsqsh
$ $BIGSQL_HOME/bin/jsqsh --driver=bigsql --user=biadmin --password=biadmin
JSqsh Release 1.5-ibm, Copyright (C) 2007-2013, Scott C. Gray
Type \help for available help topics. Using JLine.
[localhost][biadmin] 1> select * from syscat.tables;
+------------+--------------+
| schemaname | tablename    |
+------------+--------------+
| syscat     | columns      |
| syscat     | tables       |
| syscat     | schemas      |
| syscat     | indexcolumns |
| system     | dual         |
| system     | integers     |
+------------+--------------+
JSqsh Quick-Start – Help
The \help command displays available help topics
1> \help
+----------+----------------------------------------------+
| Category | Description                                  |
+----------+----------------------------------------------+
| commands | Help on all avaiable commands                |
| vars     | Help on all avaiable configuration variables |
| topics   | General help topics for jsqsh                |
+----------+----------------------------------------------+
– \help commands – Lists all available commands
– \help vars – Lists all available configuration variables
– \help topics – Lists general help topics
\help can be run with any command or variable name for details
1> \help help
SYNOPSIS
\help [[topics|vars|commands] | item]
DESCRIPTION
Displays help for a jsqsh command. If no arguments are provided,
\help provides a list of available categories of help: topics, vars
(variables), or commands. Running \help with one of those category...