Meetup: Hadoop Cluster with BigSQL
Luis Reina, Information Management, IBM Software Group, @luisrei
© 2014 IBM Corporation

What is a Meetup?

Meetup Big Data (Madrid)
Big Data Developers in Madrid: http://www.meetup.com/Big-Data-Developers-in-Madrid

AGENDA
09:30-09:45  Introduction
09:45-10:45  What is Hadoop?
10:45-11:15  Coffee
11:15-11:30  The ecosystem around Hadoop
11:30-11:45  IBM BigInsights
11:45-12:15  BigSQL
12:15-12:30  IBM Bluemix (Hadoop Cloud)
12:30-13:30  Big SQL lab

INTRODUCTION

Considerations about Big Data
– There are multiple definitions of Big Data.
– Big Data is fashionable: everyone talks about "Big Data".
– Big Data is the problem.
– Tools and development work turn the threat into an opportunity.
– Servers are cheap and the software is open source, but the effort and cost of analyzing Big Data should not be underestimated.

What is Big Data, and what is not?
The boundary is not 100% clear.
TRADITIONAL DATA
– Relational databases
– OLTP transactional data
– ERP data
– ...
BIG DATA
– Social network data (Twitter, Facebook)
– IT and web logs
– Sensor data
– ...

What is Big Data, and what is not?
DATA that, because of its VOLUME, VELOCITY or VARIETY (format), is DIFFICULT or IMPRACTICAL to ANALYZE with TRADITIONAL means.

What is the solution for tackling Big Data?
Classification of Big Data:
Data at rest:
– The data being analyzed is stored.
– Examples: log information, Facebook, Twitter, etc.
– Solution: Hadoop (open source).
Data in motion:
– Data is analyzed in flight, in real time, as it is generated, without waiting to store it.
– Examples: sensors, fraud information, etc.
– Solution: IBM InfoSphere Streams.

What is Hadoop?
Hadoop is a development framework and runtime environment
– A development framework and runtime environment for building applications that can process large volumes of data (Big Data).
– The resulting applications are batch-oriented and read-intensive.
– Based on Google technology.
– It is open source (free): Apache Hadoop, http://hadoop.apache.org/

Hadoop is not a database management system
– It is a development and execution framework, not a DBMS.
– It does not aim to replace existing data warehouses.
– Applications use the CPU and disk of cheap "commodity" computers.
– Applications run on a cluster of many machines working in parallel.
– Machines can be added without changing the applications, the way data is loaded, or the data formats.
– If a machine "breaks", another one does its work.

Origin and evolution of Hadoop
[Timeline figure. Phases: early research, open source development momentum, initial success stories, commercialization. Milestones shown: MapReduce and GFS papers published; Apache open source MapReduce & HDFS projects created; 4,000-node Hadoop cluster runs; terabyte sort benchmark won; SQL support for Hadoop launched; CDH3 released; BigInsights announced.]

Hadoop has 2 key components
File system: HDFS
– Where Hadoop stores the data.
– Uses local disks but works as one large file system across multiple nodes.
Map/Reduce
– Algorithm for processing the data in the cluster.
– Two steps: MAP and REDUCE.
– Divide and conquer.

HDFS is a file system for the cluster
HDFS = Hadoop Distributed File System
– HDFS is a file system for storing the data to be analyzed.
– It is a single distributed file system.
– The data is spread across the whole cluster.
– Each node of the cluster holds a "chunk" of the data.
– These chunks are called blocks and are 64 MB by default.
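
To make the block idea concrete, here is a minimal sketch of the arithmetic behind cutting a file into blocks. It is plain Java, not the HDFS API: the class and method names are hypothetical, and only the 64 MB default comes from the slide.

```java
// Illustrative arithmetic only; this does not call Hadoop or HDFS.
final class HdfsBlockSketch {
    // Default block size quoted on the slide: 64 MB.
    static final long DEFAULT_BLOCK_SIZE = 64L * 1024 * 1024;

    // Number of blocks needed to hold a file (ceiling division).
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes < 0 || blockSizeBytes <= 0) {
            throw new IllegalArgumentException(
                "file size must be >= 0 and block size must be > 0");
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Size of the final block, which only holds the leftover bytes.
    static long lastBlockSize(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes == 0) {
            return 0;
        }
        long remainder = fileSizeBytes % blockSizeBytes;
        return remainder == 0 ? blockSizeBytes : remainder;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // A 1 GB file fills exactly 16 blocks of 64 MB.
        System.out.println(blockCount(1024 * mb, DEFAULT_BLOCK_SIZE));
        // A 200 MB file needs 4 blocks; the last one holds only 8 MB.
        System.out.println(blockCount(200 * mb, DEFAULT_BLOCK_SIZE));
        System.out.println(lastBlockSize(200 * mb, DEFAULT_BLOCK_SIZE) / mb);
    }
}
```

Each of those blocks is what ends up stored on a node's local disk, so a larger file simply means more blocks spread over more nodes.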

HDFS
– HDFS assumes that a node can fail, so it replicates the data on multiple nodes: 3 copies by default.
– There is no SAN or NAS; the nodes only have local disks.
– Nodes can talk to each other to rebalance and move data if necessary.
– One node (the NameNode) keeps the information about who has what (metadata), i.e. which node holds which data.
– Applications do not have to worry about where the data is located.

DEMO HDFS: VMware
VMware lets me run one or more virtual machines (with different operating systems) inside my native operating system.
– Run Linux inside Windows.
– Run Windows inside Linux.
– Run several operating systems at the same time.

DEMO HDFS
– POSIX (Portable Operating System Interface) file system: HDFS is not POSIX.
– Hadoop shell: hadoop dfs -<command>
  Examples: hadoop dfs -ls, hadoop dfs -mkdir ..., hadoop dfs -put ..., hadoop dfs -get
– Apache Hadoop web interface, HDFS NameNode: http://<hostname>:50070
– BigInsights web interface: http://<hostname>:8080

What is Map/Reduce?
– An algorithm for analyzing the data.
– The starting point is that the data has already been distributed across the cluster (HDFS).
– The program that analyzes the data uses the Map/Reduce algorithm.
– These programs are called jobs, and they are divided into tasks of type Map and Reduce.
– Step 1, Map task: converts the data into tuples (key, value).
– Step 2, Reduce task: reduces the number of tuples generated by Map (e.g.
by aggregating).

MapReduce example
Count the number of occurrences of each word.
Input data: "Hola Mundo Adios Mundo" and "Hola Meetup"
Map process (parallel):
– Map 1 emits: <Hola, 1> <Mundo, 1> <Adios, 1> <Mundo, 1>
– Map 2 emits: <Hola, 1> <Meetup, 1>
Reduce process (final output): <Adios, 1> <Meetup, 1> <Hola, 2> <Mundo, 2>

How a Hadoop application runs
MapReduce application:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // The slide crops the right and bottom edges of this code; the missing
    // parts are completed following the standard Apache Hadoop WordCount
    // example. The job driver (main) is not shown on the slide.
    public class WordCount {

      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emits <word, 1> for every token of the input line.
        public void map(Object key, Text val, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(val.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        // Adds up all the counts emitted for one word.
        public void reduce(Text key, Iterable<IntWritable> val, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : val) {
            sum += v.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }
    }

Execution on the Hadoop data nodes:
1. Map phase: the MAP tasks are launched onto the cluster.
2. Shuffle.
3. Reduce phase: returns a single set of results.

How do you create Hadoop applications?
(Map/Reduce jobs)
Options, ordered from difficult to easy:
• Map/Reduce development in Java: very complex.
• Pig: a higher-level open source language; standard.
• Hive: an open source language similar to SQL.
• Jaql: a language similar to Pig, with more functionality.
• BigSheets-style tools: browser/spreadsheet interface; no development required.

DEMO Map/Reduce
Word count program: hadoop jar hadoop-examples.jar wordcount <in-dir> <out-dir>
Example: hadoop jar hadoop-examples.jar wordcount /tmp/datos /salida
– Apache Hadoop web interface: http://<hostname>:50030
– BigInsights web interface: http://<hostname>:8080

Hadoop ecosystem

A rich ecosystem exists around the idea of Hadoop
The ecosystem enriches Hadoop. It includes tools such as:
– Flume: loads data into Hadoop (HDFS).
– HBase: a database on top of HDFS.
– Oozie: workflow control.
– Lucene: indexer and search engine over HDFS.
– Jaql, Pig, Hive: high-level languages that generate Map/Reduce jobs.
– Text Analytics: text analysis using Map/Reduce.
– ZooKeeper: coordinator.
– ...

Examples of ecosystem tools
[Stack diagram, top to bottom:]
– Web Console (management)
– ZooKeeper (coordination), Oozie (workflow), Flume (data collection), HBase (interactive storage), Pig (ETL), Hive (data warehouse SQL), Jaql (Big Data shell), SystemT (text analytics), BigIndex (full text search)
– MapReduce (distributed computation framework)
– HDFS or GPFS (distributed file system)
Blue boxes: components only available with the IBM BigInsights product.

Hive
– A query language for accessing Hadoop.
– A language "similar" to SQL: Hive Query Language (HQL).
– Generates Map/Reduce jobs; MPP parallelism.
– Has a data catalog (Hive Catalog).
– One Hadoop use case is "Data Warehouse Augmentation".
Hive table and query:

    CREATE EXTERNAL TABLE Tabla_Hive (
      Nombre STRING,
      Direccion STRING,
      Edad INT
    )
    COMMENT 'Example table'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/datos/tabla_hive/';

    SELECT * FROM Tabla_Hive;

Hive has limitations
– It resembles standard SQL but is not standard SQL; it is not SQL-92.
– It is not designed for "online" queries; queries and data loads are batch processes.
– It does not support INSERT and UPDATE.
– Its data types are limited (e.g. no varchar or decimal).
– It does not support subqueries.
– Its join syntax support is limited.
– Its JDBC/ODBC driver is limited.

HBase
– An open source NoSQL database.
– Based on Google's Bigtable paper [2006].
– Handles large volumes: in 2011, a 1,000-node cluster with petabytes of data (3x).
– Fault tolerant, horizontally scalable, high performance.
– Uses the Hadoop file system (HDFS) to store the data.
– Allows real-time modifications and queries (Hadoop itself is batch and read-oriented).
– Gives Hadoop a way to make modifications and to work in real time.

Limitations of HBase
– It is not a relational database.
– Its language is not SQL (scan, get, put, etc.).
– It does not allow secondary indexes.
– It does not allow transactions across several rows.
– It has no query optimizer.
– It consumes a lot of space.

BIG SQL
– Native SQL: ANSI SQL 92+.
– JDBC/ODBC drivers.
– Uses MapReduce for parallelism.
– Direct access for fast queries.
– Many data sources: HBase; CSV and delimited files; JSON files; Hive tables; ...
[Diagram: Application → SQL → JDBC/ODBC driver → JDBC/ODBC server → Big SQL engine → data sources (Hive tables, HBase tables, CSV files)]

BigInsights

What is IBM BigInsights?
– A product based on Hadoop.
– It improves Hadoop to make it "enterprise ready" by adding elements such as:
  • Administration.
  • Security.
  • The GPFS file system.
  • Advanced analytics capabilities from IBM Research.
  • Workflow.
  • Provisioning.
  • Ease of use (BigSheets).
– It integrates with IBM's existing databases and warehouses: DB2, InfoSphere Warehouse, Smart Analytics and Netezza.

IBM Significantly Enhances Hadoop
• Scalable – New nodes can be added on the fly.
• Affordable – Massively parallel computing on commodity servers.
• Flexible – Hadoop is schema-less, and can absorb any type of data.
• Fault tolerant – Through the MapReduce software framework.
IBM Innovation
• Performance & reliability – Adaptive MapReduce, compression, indexing, flexible scheduler.
• Analytic accelerators.
• Productivity accelerators – Web-based UIs; tools to leverage existing skills; end-user visualization.
• Enterprise integration – To extend & enrich your information supply chain.

BigInsights Enterprise Edition
[Component diagram; grouping reconstructed from the flattened slide. Legend: open source, IBM, optional IBM and partner offerings.]
– Analytics and discovery: text processing engine and library, accelerator for social data analysis, accelerator for machine data analysis, BigSheets, Big R, "Apps" (web crawler, Boardreader, distributed file copy, DB export, DB import, ad hoc query, machine learning, data processing, ...).
– Infrastructure: integrated installer, text compression, Adaptive MapReduce, enhanced security, flexible scheduler, indexing, Big SQL, Jaql, Pig, Oozie, HBase, Hive, Lucene, ZooKeeper, HCatalog, MapReduce, HDFS, GPFS-FPO.
– Connectivity and integration: JDBC, Flume, Sqoop, Streams, Data Explorer, DB2, Netezza, Guardium, DataStage.
– Administrative and development tools:
Web console
• Monitor cluster health, jobs, etc.
• Add / remove nodes
• Start / stop services
• Inspect job status
• Inspect workflow status
• Deploy applications
• Launch apps / jobs
• Work with the distributed file system
• Work with the spreadsheet interface
• Support for a REST-based API
• Create / view alerts
• ...
Eclipse tools
• Text analytics
• MapReduce programming
• Jaql, Hive, Pig development
• BigSheets plug-in development
• Oozie workflow generation
Cognos BI

BigInsights Web Console
Manages BigInsights:
– Inspect the system.
– Add/remove nodes.
– Start/stop services.
– Run and monitor jobs (applications).
– Browse the file system.
– ...
Launches applications:
– An analysis tool in spreadsheet format.
– Prebuilt applications (supplied by IBM or developed by the user).
Publishes applications.

Running applications from the Web Console

Development with Eclipse
Eclipse-based development tools for Jaql, Hive, Java MapReduce and Text Analytics.
Quickly drag and drop to create new applications.

Application Accelerators
Quickly build and deploy custom applications in high-value areas.
IBM Accelerator for Social Data Analytics
• B2C businesses.
• Sample applications: customer acquisition / retention, customer segmentation or micro-segmentation, marketing campaign optimization, lead generation, brand management or surveillance.
• Ships with BigInsights v2 and Streams v3.
IBM Accelerator for Machine Data Analytics
• Cross-industry: manufacturing, oil & gas, energy and utility, healthcare, travel and transportation, CPG, retail, etc.
• Operational efficiency monitoring, security incident investigation,
proactive maintenance, troubleshooting, outage prevention, efficiency tracking, etc.
• Ships with BigInsights v2.
IBM Accelerator for Telco Event Data Analytics
• Telcos.
• Campaign management, real-time promotion, fraud detection, service assurance and network monitoring.
• Ships with Streams v3, but works with BigInsights or PureData System for Analytics (a.k.a. Netezza).

BigSheets: Collection Sample
– Spreadsheet-like structures defined by the user.
– Based on data accessible through the BigInsights web console, e.g. file system data, output from a web crawl, etc.

BigSheets: Collection Operations
– Work with the built-in "sheets" editor.
– Add / delete columns.
– Filter data.
– Specify formulas to compute new values using spreadsheet-style syntax.
– Apply built-in or custom macro functions.

What is Text Analytics?
– A high-performance, scalable, rule-based information extraction engine.
– Distills structured information from unstructured data.
– Rich annotator library that supports multiple languages.
– Provides sophisticated tooling to help build, test, and refine rules.
– Developer tools, an easy-to-use text analytics language, and a set of extractors for fast adoption.
– Multilingual support, including support for DBCS languages.
– Developed at IBM Research since 2004: SystemT.
– Embedded in several IBM products: InfoSphere Warehouse, InfoSphere Streams, Lotus Notes, Cognos Consumer Insights.
– BigInsights is the first time IBM opens up the Text Analytics engine technology for customization and development.

Annotator Query Language (AQL)
– A language for creating Text Analytics rules.
– SQL-like language.
– Fully declarative text analytics language.
– Once compiled, it produces an AOG plan that runs over the data.
– No "black boxes" or modules that can't be customized.
– Tooling makes customization easy because you are abstracted from the programmatic details.
– Competing solutions rely on locked-up black-box modules that cannot be customized, which restricts flexibility and makes them difficult to optimize for performance.

    create view AmountWithUnit as
      extract pattern <N.match> <U.match> as match
      from Number N, Unit U;

BigInsights Text Analytics Components
– Eclipse tools: develop and maintain extractors in AQL.
– AQL language, optimizer and compiled plan.
– Pre-compiled extractor library:
  • Western languages: named entities (person, organization, location, phone, URL, email, date/time) and financial events (merger, acquisition, company earnings).
  • Chinese/Japanese: named entities (person, organization, location).
– Jaql Text Analytics module: execute extractors on the cluster from Jaql.
– Text Analytics Java API: invoke Text Analytics directly from your application.
[Diagram: input document → extractors running on the BigInsights cluster → extracted objects]

Text Analytics: Simple Example
Input text: "Football World Cup 2010, one team distinguished well from the rest winning the final. Early in the second half, Netherlands' striker, Arjen Robben, had a chance to score, but the awesome keeper for Spain, Iker Casillas made the save. Winner superiority was reflected when Winger Andres Iniesta scored for Spain for the win."
Extracted "World Cup 2010 Highlights":
Arjen Robben   | Striker | Netherlands
Iker Casillas  | Keeper  | Spain
Andres Iniesta | Winger  | Spain

Large Scale Indexing, Faceted Search
– Designed to improve text searches over big data.
– Indexing characteristics: based on Apache Lucene.
– Parallel index
• The index operation is run in parallel, but the index is stored in one physical index.
– Distributed index
• The index is too large to be contained in one physical index.
• The index is distributed into shards, representing one logical index.
• Each query is evaluated against all shards.
– Faceted search: categorization, drill down.

Adaptive MapReduce
Performance
– Fully compatible with Hadoop jobs.
– Tests out to be 20–50% faster.
– Self-adjusts based on the nature of the Hadoop job.
Broadcast join example
– Large startup cost (100 sec) vs. imperfect balance between maps.
Adaptive Mappers to the rescue
– Split size matters much less.
– The default size (64 MB) performs much better.

BigInsights LZO Compression
IBM LZO compression:
– Fast, flexible compression.
– Splittable, whereas GNU LZO is not.
– Similar to GNU-based LZO compression, but no index file is needed.
– Fixed-size compression blocks are created automatically.
[Figure: original source vs. compressed representation made of fixed-size blocks]

Comparison of GPFS and HDFS
File system                          | GPFS                           | HDFS
Robustness                           | No single point of failure     | NameNode vulnerability
Data integrity                       | High                           | Possibility of data loss
Scalability                          | Thousands of nodes             | Thousands of nodes
POSIX compliance                     | Full                           | Limited
Data management                      | Security, backups, replication | Limited
MapReduce performance                | Good                           | Good
Traditional application performance  | Good                           | Poor on random reads and writes

Integration
[Diagram: BigInsights connected to Netezza and DB2 (sample UDFs to submit BigInsights jobs and consume results; JDBC), to Streams, to DataStage, to DB2 LUW / InfoSphere Warehouse with DPF and Netezza (Jaql read/write), and to other DBMSs (JDBC; Jaql read/write).]

Integration with DB2 (two slides)

More information about BigInsights (1/3)
BigInsights wiki with links, demos, forums, etc.
http://www.ibm.com/developerworks/wiki/biginsights/

More information about BigInsights (2/3)
In the cloud
– Via RightScale, or directly on Amazon, Rackspace, IBM Smart Enterprise Cloud, or on private clouds.
– Pay only for the resources used.
In the classroom
– Via IBM Education.
– Online at www.bigdatauniversity.com
On your cluster
– Download Basic Edition from ibm.com.
With the BigInsights community
– Technical portal at http://tinyurl.com/biginsights
– Links to demos, papers, forum, downloads, etc.

More information about BigInsights (3/3)
Get educated
– IBM Big Data: ibm.com/bigdata
– IBMBigDataHub.com
– BigDataUniversity.com
– IBV study on big data
– Books / analyst papers
Schedule a Big Data workshop
– Free of charge
– Best practices
– Industry use cases
– Business uses
– Business value assessment

BigSQL

IBM, inventor of databases
– 1960s: Navigational DBMS – IMS (hierarchical)
– 1970s-1980s: Relational DBMS – SQL – System R, System Z, DB2
– 1990s: Data warehouse – Dimensional model, ETL, MDM
– Today: Big Data – Big SQL
[Photo: Ted Codd]

SQL for Hadoop: why?
Data warehouse augmentation is the leading Hadoop use case.
[Diagram of three augmentation patterns: (1) pre-processing hub, (2) query-able archive, (3) exploratory analysis. Labels: Streams real-time processing; BigInsights as landing zone for all data; information integration; data warehouse; can combine with unstructured information.]
MapReduce is difficult
– The MapReduce Java API is tedious and requires programming expertise.
– Unfamiliar languages (e.g. Pig) also require special skills.
SQL support opens the data to a much wider audience
– Familiar, widely known syntax.
– Common catalog for identifying data and structure.
– Declarative: clear separation of the what (the data you're after) vs.
the how (processing).

Big SQL: native SQL access to Hadoop
Native SQL access to data stored in BigInsights
– ANSI SQL 92+
– Standard syntax support (joins, data types, ...)
Real JDBC/ODBC drivers
– Prepared statements
– Cancel support
– Database metadata API support
– Secure socket connections (SSL)
Optimization
– Leveraging MapReduce parallelism, or...
– Direct access for low-latency queries
Varied data sources
– HBase (including secondary indexes)
– CSV, delimited files, sequence files
– JSON
– Hive tables
[Diagram: Application → SQL → JDBC/ODBC driver → JDBC/ODBC server → Big SQL engine → data sources (Hive tables, HBase tables, CSV files) inside BigInsights]

Architecture
Big SQL shares catalogs with Hive via the Hive metastore
– Each can query the other's tables.
The SQL engine analyzes incoming queries
– Separates the portion(s) to execute at the server from the portion(s) to execute on the cluster.
– Rewrites the query if necessary for improved performance.
– Determines the appropriate storage handler for the data.
– Produces the execution plan.
– Executes and coordinates the query.
[Diagram: Application → SQL language → JDBC/ODBC driver → network protocol → Big SQL server (SQL engine; storage handlers for delimited files, sequence files, HBase, RDBMS, ...). BigInsights cluster: head nodes (Big SQL server, Job Tracker, Name Node, Hive metastore) and compute nodes (Task Tracker, Data Node, Region Server).]

Standard BI tools
The Cognos BI server can push down many computations to BigInsights
– Big SQL directs this processing to happen on BigInsights instead of on the Cognos BI server.
Faster response times
– Increased opportunity for query processing to occur closer to the data.
Free from the limitations of Hive (latency, SQL language support).
[Diagram: Cognos BI server (explore & analyze, report & act) → SQL interface via JDBC → InfoSphere BigInsights: application (MapReduce), storage (HBase, HDFS)]
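
Because Big SQL is reached through a standard JDBC driver, a client can use the plain java.sql API. The following is a minimal sketch under stated assumptions, not the product's documented usage: the URL scheme `jdbc:bigsql://host:port/schema`, the port 7052 and the class name are assumptions to verify against the driver documentation, and the `employee2` table and `biadmin` user are borrowed from examples elsewhere in this deck. Only `DriverManager`, `PreparedStatement` and `ResultSet` are standard Java.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical client; the Big SQL driver jar must be on the classpath to connect.
final class BigSqlJdbcSketch {

    // ASSUMPTION: URL layout for the Big SQL driver; check your driver docs.
    static String jdbcUrl(String host, int port, String schema) {
        return "jdbc:bigsql://" + host + ":" + port + "/" + schema;
    }

    // Counts employees at or above a minimum age using a prepared statement.
    static int countEmployees(String url, String user, String password, int minAge)
            throws SQLException {
        String sql = "SELECT COUNT(*) FROM employee2 WHERE age >= ?";
        try (Connection con = DriverManager.getConnection(url, user, password);
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, minAge);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getInt(1) : 0;
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        // ASSUMPTION: 7052 as the server port and "default" as the schema.
        String url = jdbcUrl("localhost", 7052, "default");
        if (args.length < 2) {
            // Without credentials, only show the URL that would be used.
            System.out.println("Would connect to " + url);
            return;
        }
        System.out.println(countEmployees(url, args[0], args[1], 30));
    }
}
```

The prepared statement with a parameter marker is the kind of feature the "real JDBC/ODBC drivers" bullet refers to; the same code shape would work against any other JDBC source by changing only the URL.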

Standard tool: SQuirreL SQL
– Using existing SQL tooling against Big Data.
– Support for "standard" authentication (not supported for Hive, but supported by Big SQL).

Standard tool: Eclipse
– Using existing SQL tooling against Big Data.
– Same setup as for existing SQL sources.
– Support for "standard" authentication.

BigSQL from the BigInsights console
– In Quick Links, select the option to run Big SQL queries from the console.
– Type in the query, or cut and paste it from a SQL script.
– Hit Run.

BigSQL: Tables
BigSQL supports CREATE TABLE and many data types, including varchar, decimal, etc.

    CREATE TABLE TPCH.CUSTOMER (
      C_CUSTKEY    INTEGER,
      C_NAME       VARCHAR(25),
      C_ADDRESS    VARCHAR(40),
      C_NATIONKEY  INTEGER,
      C_PHONE      CHAR(15),
      C_ACCTBAL    FLOAT,
      C_MKTSEGMENT CHAR(10),
      C_COMMENT    VARCHAR(117)
    )
    row format delimited fields terminated by '|'
    stored as textfile
    WITH HINTS(accessMode='local');

Hive does not support data types like varchar and decimal, so it rejects this same statement.

LOCATION and EXTERNAL
LOCATION keyword
– Allows explicit data placement.
– Specifies a directory containing the table data.
– All files in the directory are assumed to be data.
– If it is not provided, the table directory is created in the Hive warehouse directory.

    create table user (
      user_id int not null,
      fname varchar(20) not null,
      lname varchar(30) not null
    )
    ...
    location '/users/bob/tables/user';

EXTERNAL keyword
– Big SQL does not manage the data.
– Requires the LOCATION keyword.
– The data is assumed to already exist.
– Dropping the table leaves the original data intact.

    create external table user (
      user_id int not null,
      fname varchar(20) not null,
      lname varchar(30) not null
    )
    ...
    location '/users/bob/tables/user';

System Catalog Tables
Big SQL maintains a number of system tables
– These are virtual tables that do not live in the Hive catalogs.
Catalog tables
– These are views over the Hive catalogs.
– They live in the syscat schema.
Name         | Description
tables       | Contains all tables and the schema in which they reside
columns      | Details all table columns
schemas      | Lists all defined schemas
indexcolumns | Lists all defined indexes

Big SQL – Create Schema
CREATE SCHEMA example with more clauses:

    CREATE SCHEMA IF NOT EXISTS sales
      COMMENT 'This schema is for sales team'
      LOCATION '/user/sales'
      WITH DBPROPERTIES (
        'owner' = 'John Doe',
        'alternateContactInfo' = 'Mary Doe'
      );

DBPROPERTIES is a set of user-defined properties.
Tip: when trying Big SQL on a shared cluster, create a personal schema for all your tables etc. to avoid interfering with others' work.
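
One way to follow that tip consistently is to generate the personal-schema statement from the user name. This is only a sketch: the `sandbox_<user>` naming convention, the location layout and the class itself are hypothetical, and the clauses used (IF NOT EXISTS, COMMENT, LOCATION) are the ones shown in the CREATE SCHEMA example above.

```java
import java.util.Locale;

// Hypothetical helper that builds a personal CREATE SCHEMA statement.
final class PersonalSchemaSketch {

    // Returns DDL for a per-user schema; rejects names that are not plain identifiers.
    static String createSchemaDdl(String user) {
        if (user == null || !user.matches("[A-Za-z][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("user must be a simple identifier");
        }
        String name = user.toLowerCase(Locale.ROOT);
        // ASSUMPTION: naming convention and directory layout are illustrative only.
        String schema = "sandbox_" + name;
        return "CREATE SCHEMA IF NOT EXISTS " + schema
            + " COMMENT 'Personal schema for " + name + "'"
            + " LOCATION '/user/" + name + "/" + schema + "'";
    }

    public static void main(String[] args) {
        // Prints the statement so it can be pasted into a SQL client.
        System.out.println(createSchemaDdl("alice") + ";");
    }
}
```

Validating the name before building the string keeps quotes and other SQL syntax out of the generated statement.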

Data types
Big SQL supports the following data types:
tinyint, smallint, int[eger], bigint, boolean, float, double, real, timestamp, string, varchar(len), char(len), binary, binary(len), varbinary(len)
With the following caveats:
– tinyint is an alias for smallint
– real is an alias for float
– char(len) is an alias for varchar(len)
– binary(len) is an alias for varbinary(len)
– string is treated like varchar(32768), binary like binary(32768)

SQL functions
Wide variety of built-in functions
– Numeric: abs, ceil, floor, ln, log10, mod, power, sqrt, sign, width_bucket
– Trigonometric: cos, sin, tan, acos, asin, atan, cosh, sinh, tanh
– Date: _add_days, _add_months, _add_years, localtimestamp, _age, _day_of_week, _day_of_year, _week_of_year, _days_between, _months_between, _years_between, _ymdint_between, _first_of_month, _last_of_month, extract
– String: char_length, trim, octet_length, upper, lower, substring, position, index, translate

Aggregates
Standard aggregates:
max, min, sum, count, var_samp, stdev, stdev_samp, stdev_pop, var_pop, percentile, corr, covar_samp, covar_pop, regr_avgx, regr_avgy, regr_count, regr_slope, regr_intercept, regr_r2, regr_sxx, regr_syy
Windowed aggregates:
rank, ntile, percentile_cont, first_value, lead, dense_rank, tertial, percentile_disc, last_value, ratio_to_report, percentile_rank, cume_dist, lag, nth_value
Not all aggregates are currently fully parallelizable
– However, filtering, sorting, and grouping of the data to feed the aggregation is still parallel.
– Where possible we will work on providing inexact but parallelizable variants.

Joins
BigSQL supports both standard and ANSI join syntax:

    select ... from tpch.orders, tpch.lineitem where o_orderkey = l_orderkey
    select ... from tpch.orders join tpch.lineitem on o_orderkey = l_orderkey

Hive supports joins via ANSI join syntax only:

    -- implicit (comma) join: not accepted by Hive
    select ... from tpch.orders, tpch.lineitem where o_orderkey = l_orderkey
    -- ANSI join: accepted by Hive
    select ...
    from tpch.orders join tpch.lineitem on o_orderkey = l_orderkey

SQL Support – Subqueries
BigSQL supports subqueries:

    select c1, (select count(*) from t2) from t1 ...
    select c1 from t1 where c2 > (select ...)

Hive does not support subqueries, so neither of the two statements above runs there.

SQL Support – Aggregates
BigSQL supports windowed aggregates:

    select *
    from (select rank() over (order by age asc) as my_rank,
                 empno, name, age
          from employee2) as t
    where my_rank <= 4;

Hive does not support windowed aggregates, so the statement above does not run there.

Client Drivers
Big SQL provides standards-compliant native drivers
– Details on driver features and usage will be covered later.
Type 4 JDBC 3.0 driver
– Prepared statements
  • Including result set and parameter marker metadata
– Cancel support
– Database metadata API support
  • Look up tables, columns, types, etc.
– Secure socket connections (SSL)
ODBC driver
– Same feature set as the JDBC driver.
– Supported platforms: x86 Linux 64 bit, Windows 32 and 64 bit.

"Point queries"
MapReduce incurs measurable overhead for the sake of resiliency
– Each mapper/reducer may involve JVM startup/shutdown.
– Intermediate data is written to disk so partial failures can restart just the failed portion of the query.
– Job scheduling overhead.
– Overhead can be as high as 20-30 seconds per job.
For small data sets or certain data sources (e.g.
HBase) MapReduce may be unnecessary.
Big SQL provides the ability to run queries entirely in the server, providing millisecond response times
– Automatically chosen for very simple queries:
    SELECT c1, c2 FROM T1
– Can be provided as a query hint:
    SELECT c1 FROM t1 /*+ accessmode='local' +*/ WHERE c2 > 10
– Or as a session setting:
    set force local on;
    SELECT c1 FROM t1 WHERE c2 > 10;

Managing Big SQL Server – Command Line
Start or stop the Big SQL server from the UNIX command line:

    $BIGSQL_HOME/bin/bigsql –help      # for more options
    $BIGSQL_HOME/bin/bigsql level      # prints bigsql-server level
    $BIGSQL_HOME/bin/bigsql clean      # cleans up after improper stop
    $BIGSQL_HOME/bin/bigsql forcestop  # try this if "bigsql stop" does not stop

JSqsh – Big SQL's CLI
JSqsh ("jay-skwish", Java SQL Shell)
– Open source command line JDBC client (http://jsqsh.wiki.sourceforge.net)
– Works with any JDBC driver, not just Big SQL.
It can be started with $BIGSQL_HOME/bin/jsqsh:

    $ $BIGSQL_HOME/bin/jsqsh --driver=bigsql --user=biadmin --password=biadmin
    JSqsh Release 1.5-ibm, Copyright (C) 2007-2013, Scott C. Gray
    Type \help for available help topics. Using JLine.
    [localhost][biadmin] 1> select * from syscat.tables;
    +------------+--------------+
    | schemaname | tablename    |
    +------------+--------------+
    | syscat     | columns      |
    | syscat     | tables       |
    | syscat     | schemas      |
    | syscat     | indexcolumns |
    | system     | dual         |
    | system     | integers     |
    +------------+--------------+

JSqsh Quick-Start – Help
The \help command displays the available help topics:

    1> \help
    +----------+----------------------------------------------+
    | Category | Description                                  |
    +----------+----------------------------------------------+
    | commands | Help on all avaiable commands                |
    | vars     | Help on all avaiable configuration variables |
    | topics   | General help topics for jsqsh                |
    +----------+----------------------------------------------+

– \help commands: lists all available commands.
– \help vars: lists all available configuration variables.
– \help topics: lists general help topics.
\help can be run with any command or variable name for details:

    1> \help help
    SYNOPSIS
        \help [[topics|vars|commands] | item]
    DESCRIPTION
        Displays help for a jsqsh command. If no arguments are provided,
        \help provides a list of available categories of help: topics,
        vars (variables), or commands. Running \help with one of those
        category...