14赞

Archival and Analytics

作者：黄晓敏3023 | 2021-09-07 05:07

ArchivalandAnalytics

May 16, 2014

By Severalnines

We won’t bore you with buzzwords like volume, velocity and variety. This post is for MySQL users who want to get their hands dirty with Hadoop, so roll up your sleeves and prepare for work. Why would you ever want to move MySQL data into Hadoop? One good reason is archival and analytics. You might not want to delete old data, but rather move it into Hadoop and make it available for further analysis at a later stage.

In this post, we are going to deploy a Hadoop Cluster and export data in bulk from a Galera Cluster usingApache Sqoop. Sqoop is a well-proven approach for bulk data loading from a relational database into Hadoop File System. There is alsoHadoop Applieravailable fromMySQL labs, which works by retrieving INSERT queries from MySQL master binlog and writing them into a file in HDFS in real-time (yes, it applies INSERTs only).

We will useApache Ambarito deploy Hadoop (HDP 2.1) on three servers. We have a clustered Wordpress site running on Galera, and for the purpose of this blog, we will export some user data to Hadoop for archiving purposes. The database name is wordpress, we will use Sqoop to import the data to a Hive table running on HDFS. The following diagram illustrates our setup:

The ClusterControl node has been installed with an HAproxy instance to load balance Galera connections and listen on port 33306.

Prerequisites

All hosts are running CentOS 6.5 with firewall and SElinux turned off. All servers’ time are using NTP server and synced with each other. Hostname must be FQDN or define your hosts across all nodes in /etc/hosts file. Each host has been configured with the following host definitions:

192.168.0.100		clustercontrol haproxy mysql192.168.0.101		mysql1 galera1192.168.0.102		mysql2 galera2192.168.0.103		mysql3 galera3192.168.0.111		hadoop1 hadoop1.cluster.com192.168.0.112		hadoop2 hadoop2.cluster.com192.168.0.113		hadoop3 hadoop3.cluster.com

Create an SSH key and configure passwordless SSH on hadoop1 to other Hadoop nodes to automate the deployment by Ambari Server. In hadoop1, run following commands as root:

$ ssh-keygen -t rsa # press Enter for all prompts$ ssh-copy-id -i ~/.ssh/id_rsa hadoop1.cluster.com$ ssh-copy-id -i ~/.ssh/id_rsa hadoop2.cluster.com$ ssh-copy-id -i ~/.ssh/id_rsa hadoop3.cluster.com

On all Hadoop hosts, install and configure NTP:

$ yum install ntp -y$ chkconfig ntp on$ service ntpd on$ ntpdate -u se.pool.ntp.org

Deploying Hadoop

1. Install Ambari Server on one of the Hadoop nodes (we chose hadoop1.cluster.com), this will help us deploy the Hadoop cluster. Configure Ambari repository for CentOS 6 and start the installation:

$ cd /etc/yum.repos.d$ wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.5.1/ambari.repo$ yum -y install ambari-server

2. Setup and start ambari-server:

$ ambari-server setup # accept all default values on prompt$ ambari-server start

Give Ambari a few minutes to bootstrap before accessing the web interface at port 8080.

3. Open a web browser and navigate to http://hadoop1.cluster.com:8080. Login with username and password ‘admin’. This is the Ambari dashboard, it will guide us through the deployment. Assign a cluster name and clickNext.

4. At theSelect Stackstep, choose HDP2.1:

5. Specify all Hadoop hosts in theTarget Hostsfields. Upload the SSH key that we have generated in the Prerequisites section during passwordless SSH setup and clickRegister and Confirm:

6. This page will confirm that Ambari has located the correct hosts for your Hadoop cluster. Ambri will check those hosts to make sure they have the correct directories, packages, and processes to continue the install. ClickNextto proceed.

7. If you have enough resources, just go ahead and install all services:

8. InAssign Masterpage, we let Ambari choose the configuration for us before clickingNext:

9. InAssign Slaves and Clients page, we’ll enable all clients and slaves on each of our Hadoop hosts:

10. Hive, Oozie and Nagios might requires further input like database password and administrator email. Specify the needed information accordingly and clickNext.

11. You will be able to review your configuration selection before clickingDeploy to start the deployment:

WhenSuccessfully installed and started the servicesappears, chooseNext. On the summary page, chooseComplete. Hadoop installation and deployment is now complete. Verify that all services are running correctly:

We can now proceed to import some data from our Galera cluster as described in the next section.

Importing MySQL Data using Sqoop to Hive

Before importing any MySQL data, we need to create a target table in Hive. This table will have a similar definition as the source table in MySQL as we are importing all columns at the same time. Here is MySQL’s CREATE TABLE statement:

CREATE TABLE `wp_users` (`ID` bigint(20) unsigned NOT NULL AUTO_INCREMENT,`user_login` varchar(60) NOT NULL DEFAULT '',`user_pass` varchar(64) NOT NULL DEFAULT '',`user_nicename` varchar(50) NOT NULL DEFAULT '',`user_email` varchar(100) NOT NULL DEFAULT '',`user_url` varchar(100) NOT NULL DEFAULT '',`user_registered` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',`user_activation_key` varchar(60) NOT NULL DEFAULT '',`user_status` int(11) NOT NULL DEFAULT '0',`display_name` varchar(250) NOT NULL DEFAULT '',PRIMARY KEY (`ID`),KEY `user_login_key` (`user_login`),KEY `user_nicename` (`user_nicename`)) ENGINE=InnoDB AUTO_INCREMENT=5864 DEFAULT CHARSET=utf8

SSH into any Hadoop node (since we installed Hadoop clients on all nodes) and switch to hdfs user:

$ su - hdfs

Enter into Hive console:

$ hive

Create a Hive database and table, similar to our MySQL table (Hive does not support DATETIME data type, so we are going to replace it with TIMESTAMP):

hive> CREATE SCHEMA wordpress;hive> SHOW DATABASES;OKdefaultwordpresshive> USE wordpress;hive> CREATE EXTERNAL TABLE IF NOT EXISTS users (ID BIGINT,user_login VARCHAR(60),user_pass VARCHAR(64),user_nicename VARCHAR(50),user_email VARCHAR(100),user_url VARCHAR(100),user_registered TIMESTAMP,user_activation_key VARCHAR(60),user_status INT,display_name VARCHAR(250));hive> exit;

Now we can start to import the wp_users MySQL table into Hive’s users table, connecting to MySQL nodes through HAproxy (port 33306):

$ sqoop import /--connect jdbc:mysql://192.168.0.100:33306/wordpress /--username=wordpress /--password=password /--table=wp_users /--hive-import /--hive-table=wordpress.users /--target-dir=wp_users_import /--direct

We can track the import progress from the Sqoop output :

..INFO mapreduce.ImportJobBase: Beginning import of wp_users..INFO mapreduce.Job: Job job_1400142750135_0020 completed successfully..OKTime taken: 10.035 secondsLoading data to table wordpress.usersTable wordpress.users stats: [numFiles=5, numRows=0, totalSize=240814, rawDataSize=0]OKTime taken: 3.666 seconds

You should see that an HDFS directory wp_users_import has been created (as specified in --target-dir in the Sqoop command) and we can browse its files using the following commands:

$ hdfs dfs -ls$ hdfs dfs -ls wp_users_import$ hdfs dfs -cat wp_users_import/part-m-00000 | more

Now let’s check our imported data inside Hive:

$ hive -e 'SELECT * FROM wordpress.users LIMIT 10'Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.propertiesOK2	admin	$P$BzaV8cFzeGpBODLqCmWp3uOtc5dVRb.	admin	my@email.com		2014-05-15 12:53:12		0	admin5	SteveJones	$P$BciftXXIPbAhaWuO4bFb4LVUN24qay0	SteveJones	demouser2@54.254.93.50		2014-05-15 12:57:59		0	Steve8	JanetGarrett	$P$BEp8IY1zvvrIdtPzDiU9D/br.FtzFa1	JanetGarrett	demouser3@54.254.93.50		2014-05-15 12:57:59		0	Janet11	AnnWalker	$P$B1wix5Xn/15o06BWyHa.r/cZ0rwUWQ/	AnnWalker	demouser4@54.254.93.50		2014-05-15 12:57:59		0	Ann14	DeborahFields	$P$B5PouJkJdfAucdz9p8NaKtS9WoKJu01	DeborahFields	demouser5@54.254.93.50		2014-05-15 12:57:59		0	Deborah17	ChristopherMitchell	$P$Bi/VWI1W4iP7h9mC0SXd4f.kKWnilH/	ChristopherMitchell	demouser6@54.254.93.50		2014-05-15 12:57:59		0	Christopher20	HenryHolmes	$P$BrPHv/ZHb7IBYzFpKgauBl/2WPZAC81	HenryHolmes	demouser7@54.254.93.50		2014-05-15 12:58:00		0	Henry23	DavidWard	$P$BVYg0SFTihdXwDhushveet4n2Eitxp1	DavidWard	demouser8@54.254.93.50		2014-05-15 12:58:00		0	David26	WilliamMurray	$P$Bc8FmkMadsQZCsW4L5Vo8Xax2ex8we.	WilliamMurray	demouser9@54.254.93.50		2014-05-15 12:58:00		0	William29	KellyHarris	$P$Bc85yvlxvWQ4XxkeAgJRugOqm6S6au.	KellyHarris	demouser10@54.254.93.50		2014-05-15 12:58:00		0	KellyTime taken: 16.282 seconds, Fetched: 10 row(s)

Nice! Now we can see that our data exists both in Galera and Hadoop. You can also use --query option in Sqoop to filter the data that you want to export to Hadoop using an SQL query. This is a basic example of how we can start to leverage Hadoop for archival and analytics. Welcome to big data!

References

Sqoop User Guide (v1.4.2) http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html
Hortonworks Data Platform Documentation http://docs.hortonworks.com/HDPDocuments/Ambari-1.5.1.0/bk_using_Ambari_book/content/index.html

推荐阅读

程序员
从命令提示符读取行？

如何解决《从命令提示符读取行？》经验，为你挑选了1个好方法。 ... [详细]
程序员
应检查方法返回值？

如何解决《应检查方法返回值？》经验，为你挑选了1个好方法。 ... [详细]
程序员
指定cython输出文件

如何解决《指定cython输出文件》经验，为你挑选了2个好方法。 ... [详细]
程序员
使用javascript图表可视化数据 - 动态

如何解决《使用javascript图表可视化数据-动态》经验，为你挑选了1个好方法。 ... [详细]
程序员
使用R将多个文件从多个文件夹复制到单个文件夹

如何解决《使用R将多个文件从多个文件夹复制到单个文件夹》经验，为你挑选了1个好方法。 ... [详细]
程序员
使用trampolining进行递归溢出

如何解决《使用trampolining进行递归溢出》经验，为你挑选了0个好方法。 ... [详细]
程序员
PostgreSQL查询聚合和RIGHT JOIN不过滤

如何解决《PostgreSQL查询聚合和RIGHTJOIN不过滤》经验，为你挑选了1个好方法。 ... [详细]
程序员
在Swift 2.0中正确实现cellForRowAtIndexPath

如何解决《在Swift2.0中正确实现cellForRowAtIndexPath》经验，为你挑选了1个好方法。 ... [详细]
程序员
编译输出中的TypeScript依赖项未按正确顺序解析

如何解决《编译输出中的TypeScript依赖项未按正确顺序解析》经验，为你挑选了1个好方法。 ... [详细]
程序员
64位平台的效率:指针与32位数组索引

如何解决《64位平台的效率:指针与32位数组索引》经验，为你挑选了3个好方法。 ... [详细]
程序员
访问数据和连接字符串？

如何解决《访问数据和连接字符串？》经验，为你挑选了1个好方法。 ... [详细]
程序员
根据方案设置默认模型值

如何解决《根据方案设置默认模型值》经验，为你挑选了0个好方法。 ... [详细]
程序员
是否可以将套接字映射到虚拟内存？

如何解决《是否可以将套接字映射到虚拟内存？》经验，为你挑选了1个好方法。 ... [详细]
程序员
从Javascript中的Kendo网格中的列名获取列索引

如何解决《从Javascript中的Kendo网格中的列名获取列索引》经验，为你挑选了1个好方法。 ... [详细]
程序员
基于系统时间(DAY)触发

如何解决《基于系统时间(DAY)触发》经验，为你挑选了1个好方法。 ... [详细]
程序员
WP rest api jwt auth

如何解决《WPrestapijwtauth》经验，为你挑选了1个好方法。 ... [详细]
程序员
Ionic/Cordova:如何强制应用程序在开始时刷新,即使它在后台？

如何解决《Ionic/Cordova:如何强制应用程序在开始时刷新,即使它在后台？》经验，为你挑选了1个好方法。 ... [详细]
程序员
Azure - BlobStore SAS uri命令执行失败.

如何解决《Azure-BlobStoreSASuri命令执行失败.》经验，为你挑选了1个好方法。 ... [详细]
程序员
如何在django admin中实现搜索

如何解决《如何在djangoadmin中实现搜索》经验，为你挑选了1个好方法。 ... [详细]
程序员
无法找到相对于目录"web/static/js"的预设"es2015"

如何解决《无法找到相对于目录"web/static/js"的预设"es2015"》经验，为你挑选了2个好方法。 ... [详细]

黄晓敏3023

这个屌丝很懒，什么也没留下！

关注作者

Tags | 热门标签

RankList | 热门文章