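Big Data Environment Setup

OS: CentOS 7

Remote connection tool: MobaXterm

I. Virtual Machines

Virtual machine configuration

Download and install VMware Workstation, and download the CentOS 7 ISO.

Create a new virtual machine:

1. Click Next.
2. Choose to install the operating system later, then click Next.
3. Select the operating system, then click Next.
4. Set the name and location, then click Next.
5. Click Next.
6. Click Finish.

Right-click the new virtual machine, open the virtual machine settings, and under CD/DVD select the ISO image file.

Power on the virtual machine and step through the installer:

1. Select the language and click Continue.
2. Click Installation Destination, then click Done.
3. Under Software Selection, keep Minimal Install.
4. Click Begin Installation.
5. Set the root password (zh**j**123 in this example).
6. When the installation finishes, reboot.
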

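Configure the network. In VMware, open Edit -> Virtual Network Editor, select VMnet8, and uncheck "Use local DHCP service to distribute IP address to VMs".

Fill in the subnet IP; any private subnet will do (this guide uses 192.168.147.0).

Click NAT Settings. The gateway IP is already filled in automatically; make a note of it.

In VM -> Settings -> Network Adapter, set the network connection to Custom and select VMnet8.

On Windows, open Network Connections.

Open the properties of VMnet8, then open Internet Protocol Version 4.

Fill in the IP address and subnet mask. The IP address must be in the same subnet as the one configured above, i.e. 192.168.147.*.

The default gateway can be left empty, or set to the gateway noted above (192.168.147.2 in this guide).
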

Log in to the virtual machine.

Go to the /etc/sysconfig/network-scripts directory and edit ifcfg-ens33:

vi /etc/sysconfig/network-scripts/ifcfg-ens33

Change the configuration to:

TYPE=Ethernet
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=ens33
UUID=aae5b9e2-96b2-416f-a009-f8e0c041edca
DEVICE=ens33
ONBOOT=yes
IPADDR=192.168.147.8
NETMASK=255.255.255.0
GATEWAY=192.168.147.2
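DNS1=192.168.147.2
DNS2=8.8.8.8

BOOTPROTO=static sets the interface boot protocol to static addressing.

ONBOOT=yes brings the interface up at boot, so that it can also be controlled through the network service managed by systemctl.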
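
The three nodes differ only in their IP address, so the static part of the file can be generated. The following is a minimal sketch, assuming the interface is called ens33 and the gateway doubles as the first DNS server, as in the configuration above. The function name render_static_ifcfg is made up for this example, and the sketch only prints the lines; it does not write to /etc/sysconfig/network-scripts.

```shell
#!/bin/bash
# Print the static-addressing lines of an ifcfg file for one interface.
# $1 = device name, $2 = IPv4 address, $3 = gateway (also used as DNS1 here)
render_static_ifcfg() {
    local device=$1 ip=$2 gateway=$3
    cat <<EOF
TYPE=Ethernet
BOOTPROTO=static
NAME=${device}
DEVICE=${device}
ONBOOT=yes
IPADDR=${ip}
NETMASK=255.255.255.0
GATEWAY=${gateway}
DNS1=${gateway}
EOF
}

# Example: the three cluster addresses used in this guide.
for ip in 192.168.147.8 192.168.147.9 192.168.147.10; do
    echo "# --- ${ip} ---"
    render_static_ifcfg ens33 "${ip}" 192.168.147.2
done
```

Redirecting the output of a single call into the ifcfg file on the matching node, followed by a network restart, would apply it; review the result first, because the sketch omits the IPv6 and UUID lines of the full file.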

Restart the network service:

systemctl restart network

Test connectivity:

[root@localhost network-scripts]# ping www.baidu.com
PING www.wshifen.com (104.193.88.77) 56(84) bytes of data.
64 bytes from 104.193.88.77 (104.193.88.77): icmp_seq=2 ttl=128 time=256 ms
64 bytes from 104.193.88.77 (104.193.88.77): icmp_seq=3 ttl=128 time=321 ms

 

 

Note:

The configuration above is correct. If pinging the VMware virtual machine from Windows still fails, disable the VMnet8 adapter on Windows and then enable it again. After that the ping should succeed.

Clone two more hosts named bigdata2 and bigdata3, with the IPs 192.168.147.9 and 192.168.147.10. In the clone wizard, click Next on each page until the clone is created.


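II. Alibaba Cloud

2.1 Alibaba Cloud preparation

1. Three ECS instances.

2. If needed, purchase an Elastic IP and bind it.

3. If needed, purchase additional cloud disks.

Mounting a data disk

A second cloud disk purchased on Alibaba Cloud is not mounted automatically; it has to be mounted manually.

(1) List the disks to find the SSD cloud disk

sudo fdisk -l

The system has recognized the SSD as /dev/vdb.

(2) Format the cloud disk (this erases any data on it)

sudo mkfs.ext4 /dev/vdb

(3) Mount it

sudo mount /dev/vdb  /opt

This mounts the cloud disk on the /opt directory.

(4) Configure automatic mounting at boot

Edit /etc/fstab and append the following line at the end of the file:

/dev/vdb   /opt ext4    defaults    0  0

Afterwards, df -hl shows that the second disk has been mounted.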
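
Appending to /etc/fstab by hand is easy to repeat by accident, which leaves duplicate entries. The following is a minimal sketch of an idempotent append, assuming the same device, mount point, and filesystem as above. The function name ensure_fstab_entry is made up for this example, and the demonstration runs against a scratch file rather than the real /etc/fstab.

```shell
#!/bin/bash
# Append an fstab entry only if the device is not listed yet.
# $1 = fstab file, $2 = device, $3 = mount point, $4 = filesystem type
ensure_fstab_entry() {
    local fstab=$1 device=$2 mountpoint=$3 fstype=$4
    # Match the device as the first whitespace-delimited field; commented lines do not match.
    if awk -v dev="$device" '$1 == dev { found = 1 } END { exit !found }' "$fstab"; then
        echo "entry for ${device} already present"
    else
        printf '%s\t%s\t%s\tdefaults\t0 0\n' "$device" "$mountpoint" "$fstype" >> "$fstab"
        echo "added entry for ${device}"
    fi
}

# Demonstration on a scratch file; the second call must not add a second line.
demo_fstab=$(mktemp)
ensure_fstab_entry "$demo_fstab" /dev/vdb /opt ext4
ensure_fstab_entry "$demo_fstab" /dev/vdb /opt ext4
cat "$demo_fstab"
rm -f "$demo_fstab"
```

On a real host the first argument would be /etc/fstab and the call would need root privileges.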


If the system disk of an instance that is already in use runs out of space, expand the system disk.

Expanding the system disk of an Alibaba Cloud ECS server

yum install cloud-utils-growpart

growpart /dev/vda 1

resize2fs /dev/vda1


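III. Preparation

Disable the firewall

CentOS 7 uses firewalld by default, not iptables.

 systemctl stop firewalld.service
 systemctl mask firewalld.service

Disable SELinux (all nodes)

 vim /etc/selinux/config

 Set SELINUX=disabled (takes effect after a reboot)

Change the host names

Name the hosts node01, node02, and node03.

Taking node01 as an example: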

[root@node01 ~]# hostnamectl set-hostname node01
[root@node01 ~]# cat /etc/hostname
node01

The host name has been changed; log in again for it to show in the prompt.

Edit the /etc/hosts file:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.147.8 node01
192.168.147.9 node02
192.168.147.10 node03
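
Because the node addresses are consecutive, the cluster lines of /etc/hosts can be generated instead of typed. The following is a minimal sketch, assuming the 192.168.147.x addressing used in this guide; the function name gen_hosts_entries is made up for this example, and the sketch only prints the lines.

```shell
#!/bin/bash
# Print "IP hostname" lines for consecutive addresses.
# $1 = network prefix, $2 = first host number, remaining args = host names
gen_hosts_entries() {
    local prefix=$1 start=$2
    shift 2
    local n=$start name
    for name in "$@"; do
        printf '%s.%d %s\n' "$prefix" "$n" "$name"
        n=$((n + 1))
    done
}

gen_hosts_entries 192.168.147 8 node01 node02 node03
```

The same three lines have to end up in /etc/hosts on every node, so the output could be appended there on each host once it has been checked.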

 

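Configure passwordless SSH login

Generate the private and public keys:

ssh-keygen  -t rsa

The -t option specifies the type of key to create. Possible values include "rsa1" (SSH-1), "rsa" (SSH-2), and "dsa" (SSH-2). The key pair is stored in ~/.ssh under the user's home directory.

Copy the public key to every machine that should accept passwordless login:

ssh-copy-id node01
ssh-copy-id node02
ssh-copy-id node03

A few useful helper scripts

xsync, written on top of rsync: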

#!/bin/bash
# Exit immediately if no argument was given
pcount=$#
if ((pcount == 0)); then
        echo "no args..."
        exit 1
fi

# File name of the path to distribute
p1=$1
fname=$(basename "$p1")
echo "fname=$fname"
# Absolute path of the parent directory
pdir=$(cd -P "$(dirname "$p1")" && pwd)
echo "pdir=$pdir"
# Current user name
user=$(whoami)
# Copy to every node
for host in node01 node02 node03; do
        echo "================== $host =================="
        echo "$pdir/$fname -> $user@$host:$pdir"
        rsync -rvl "$pdir/$fname" "$user@$host:$pdir"
done
# Note: adjust the host list in the for loop to match your own host names.

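xcall.sh

#!/bin/bash
# Run the given command on every node

for host in node01 node02 node03
do
    echo "------------ $host -------------------"
    ssh "$host" "$*"
done

Before running the script above, append the environment variables from /etc/profile to ~/.bashrc on every node. Otherwise commands executed over ssh fail, because a non-login shell does not read /etc/profile.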

[root@node01 bigdata]# cat /etc/profile >> ~/.bashrc
[root@node02 bigdata]# cat /etc/profile >> ~/.bashrc
[root@node03 bigdata]# cat /etc/profile >> ~/.bashrc

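Create the /bigdata directory on every node.

JDK configuration

Download the JDK; this guide uses JDK 8: https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html

The download requires an Oracle account.

Upload the JDK to the /bigdata directory on every node.

Extract it: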

tar -zxvf jdk-8u241-linux-x64.tar.gz

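If the owner and group of the extracted files are not root, change them.

Linux assigns separate access permissions to a file's owner, to users in the owner's group, and to all other users.

1. chgrp: change the group of a file

Syntax:

chgrp [-R] group file

2. chown: change the owner of a file; it can also change the group at the same time

Syntax:

chown [-R] owner file
chown [-R] owner:group file

Create a symbolic link:

ln -s /bigdata/jdk1.8.0_241/ /usr/local/jdk

Configure the environment variables:

vi /etc/profile

Append at the end:

export JAVA_HOME=/usr/local/jdk
export PATH=$PATH:${JAVA_HOME}/bin

Reload the configuration file:

source /etc/profile

Check the Java version: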

[root@node03 bigdata]# java -version
java version "1.8.0_241"
Java(TM) SE Runtime Environment (build 1.8.0_241-b07)
Java HotSpot(TM) 64-Bit Server VM (build 25.241-b07, mixed mode)

The installation succeeded.

Install MySQL

MySQL installation (covered in a separate note)

Install Maven

http://maven.apache.org/download.cgi

Download and extract:

tar -zxvf apache-maven-3.6.3-bin.tar.gz

Create a symbolic link:

ln -s /bigdata/apache-maven-3.6.3 /usr/local/maven

Add to /etc/profile:

export M2_HOME=/usr/local/maven
export PATH=$PATH:$M2_HOME/bin

Install Git

yum install git

 

 

IV. Installing Cloudera Manager 6.3.1

JDK location

JAVA_HOME must be /usr/java/<java-version>.

Install the third-party dependencies on all three nodes:

yum install bind-utils psmisc cyrus-sasl-plain cyrus-sasl-gssapi fuse portmap fuse-libs /lib/lsb/init-functions httpd mod_ssl openssl-devel python-psycopg2 MySQL-python libxslt

Configure the repository

Version: 6.3.1

RHEL 7 compatible repository: https://archive.cloudera.com/cm6/6.3.1/redhat7/yum/ (repo file: cloudera-manager.repo)

Download the cloudera-manager.repo file and put it in the /etc/yum.repos.d/ directory on the Cloudera Manager Server node.

[root@node01 ~]# cat /etc/yum.repos.d/cloudera-manager.repo
[cloudera-manager]
name=Cloudera Manager 6.3.1
baseurl=https://archive.cloudera.com/cm6/6.3.1/redhat7/yum/
gpgkey=https://archive.cloudera.com/cm6/6.3.1/redhat7/yum/RPM-GPG-KEY-cloudera
gpgcheck=1
enabled=1
autorefresh=0

Install Cloudera Manager Server

yum install cloudera-manager-daemons cloudera-manager-agent cloudera-manager-server

If the download is too slow, fetch the RPM packages from https://archive.cloudera.com/cm6/6.3.1/redhat7/yum/RPMS/x86_64/, upload them to the server, and install them locally:

 rpm -ivh cloudera-manager-agent-6.3.1-1466458.el7.x86_64.rpm cloudera-manager-daemons-6.3.1-1466458.el7.x86_64.rpm cloudera-manager-server-6.3.1-1466458.el7.x86_64.rpm

After installation:

[root@node01 cm]# ll /opt/cloudera/
total 16
drwxr-xr-x 27 cloudera-scm cloudera-scm 4096 Mar  3 19:36 cm
drwxr-xr-x  8 root         root         4096 Mar  3 19:36 cm-agent
drwxr-xr-x  2 cloudera-scm cloudera-scm 4096 Sep 25 16:34 csd
drwxr-xr-x  2 cloudera-scm cloudera-scm 4096 Sep 25 16:34 parcel-repo

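On all nodes, point the agent at the Cloudera Manager Server by setting the following in /etc/cloudera-scm-agent/config.ini:

server_host=node01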
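
Editing the setting on three nodes by hand is repetitive. The following is a minimal sketch of doing it with sed, assuming the agent configuration file is /etc/cloudera-scm-agent/config.ini and already contains a server_host= line. The function name set_server_host is made up for this example, and the demonstration uses a scratch file that only imitates the relevant lines.

```shell
#!/bin/bash
# Set server_host in an agent config file, replacing the existing value.
# $1 = path to config.ini, $2 = Cloudera Manager Server host name
set_server_host() {
    local config=$1 host=$2
    sed -i "s/^server_host=.*/server_host=${host}/" "$config"
}

# Demonstration on a scratch file, not the real agent configuration.
demo_config=$(mktemp)
printf '[General]\nserver_host=localhost\nserver_port=7182\n' > "$demo_config"
set_server_host "$demo_config" node01
grep '^server_host=' "$demo_config"
rm -f "$demo_config"
```

Combined with xcall.sh, the same sed command could be run on all nodes in one go; the agent has to be restarted afterwards to pick up the change.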

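Configure the database

Install MySQL.

Change the root password and configure access privileges.

Move the InnoDB log files

With MySQL stopped, move the old InnoDB log files /var/lib/mysql/ib_logfile0 and /var/lib/mysql/ib_logfile1 out of /var/lib/mysql/ to a backup location of your choice:

[root@node01 ~]# mv /var/lib/mysql/ib_logfile0 /bigdata
[root@node01 ~]# mv /var/lib/mysql/ib_logfile1 /bigdata

Update the my.cnf file

By default the file is /etc/my.cnf:

[root@node01 etc]# mv my.cnf my.cnf.bak
[root@node01 etc]# vi my.cnf

Configuration recommended by Cloudera: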

[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
transaction-isolation = READ-COMMITTED
# Disabling symbolic-links is recommended to prevent assorted security risks;
# to do so, uncomment this line:
symbolic-links = 0

key_buffer_size = 32M
max_allowed_packet = 32M
thread_stack = 256K
thread_cache_size = 64
query_cache_limit = 8M
query_cache_size = 64M
query_cache_type = 1

max_connections = 550
#expire_logs_days = 10
#max_binlog_size = 100M

#log_bin should be on a disk with enough free space.
#Replace '/var/lib/mysql/mysql_binary_log' with an appropriate path for your
#system and chown the specified folder to the mysql user.
log_bin=/var/lib/mysql/mysql_binary_log

#In later versions of MySQL, if you enable the binary log and do not set
#a server_id, MySQL will not start. The server_id must be unique within
#the replicating group.
server_id=1

binlog_format = mixed

read_buffer_size = 2M
read_rnd_buffer_size = 16M
sort_buffer_size = 8M
join_buffer_size = 8M

# InnoDB settings
innodb_file_per_table = 1
innodb_flush_log_at_trx_commit  = 2
innodb_log_buffer_size = 64M
innodb_buffer_pool_size = 4G
innodb_thread_concurrency = 8
innodb_flush_method = O_DIRECT
innodb_log_file_size = 512M

[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

sql_mode=STRICT_ALL_TABLES

Make sure MySQL starts at boot:

systemctl enable mysqld

Start MySQL:

systemctl start mysqld

Install the JDBC driver

Download:

wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.46.tar.gz

Extract:

tar zxvf mysql-connector-java-5.1.46.tar.gz

Copy the driver to the /usr/share/java/ directory and rename it; create the directory first if it does not exist:

[root@node01 etc]# mkdir -p /usr/share/java/
[root@node01 etc]# cd mysql-connector-java-5.1.46
[root@node01 mysql-connector-java-5.1.46]# cp mysql-connector-java-5.1.46-bin.jar /usr/share/java/mysql-connector-java.jar

Configure MySQL databases for the CM components

The following components each need a database: Cloudera Manager Server, Oozie Server, Sqoop Server, Activity Monitor, Reports Manager, Hive Metastore Server, Hue Server, Sentry Server, Cloudera Navigator Audit Server, and Cloudera Navigator Metadata Server.

Service                             Database   User
Cloudera Manager Server             scm        scm
Activity Monitor                    amon       amon
Reports Manager                     rman       rman
Hue                                 hue        hue
Hive Metastore Server               metastore  hive
Sentry Server                       sentry     sentry
Cloudera Navigator Audit Server     nav        nav
Cloudera Navigator Metadata Server  navms      navms
Oozie                               oozie      oozie

Log in to MySQL and enter the password:

mysql -u root -p

Create databases for each service deployed in the cluster using the following commands. You can use any value you want for the <database>, <user>, and <password> parameters. The Databases for Cloudera Software table above lists the default names provided in the Cloudera Manager configuration settings, but you are not required to use them.

Configure all databases to use the utf8 character set.

Include the character set for each database when you run the CREATE DATABASE statements described below.

CREATE DATABASE <database> DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;

Grant privileges:

GRANT ALL ON <database>.* TO '<user>'@'%' IDENTIFIED BY '<password>';

Example session:

mysql> CREATE DATABASE amon DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
Query OK, 1 row affected (0.00 sec)

mysql> CREATE DATABASE scm DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
Query OK, 1 row affected (0.00 sec)

mysql> CREATE DATABASE hive DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
Query OK, 1 row affected (0.00 sec)

mysql> CREATE DATABASE oozie DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
Query OK, 1 row affected (0.00 sec)

mysql> CREATE DATABASE hue DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
Query OK, 1 row affected (0.01 sec)

mysql> CREATE DATABASE rman DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
Query OK, 1 row affected (0.00 sec)

mysql> CREATE DATABASE sentry DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
Query OK, 1 row affected (0.01 sec)

mysql> CREATE DATABASE nav DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
Query OK, 1 row affected (0.00 sec)

mysql> CREATE DATABASE metastore DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
Query OK, 1 row affected (0.00 sec)

mysql> GRANT ALL ON scm.* TO 'scm'@'%' IDENTIFIED BY '@Zhaojie123';
Query OK, 0 rows affected, 1 warning (0.01 sec)

mysql> GRANT ALL ON amon.* TO 'amon'@'%' IDENTIFIED BY '@Zhaojie123';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> GRANT ALL ON hive.* TO 'hive'@'%' IDENTIFIED BY '@Zhaojie123';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> GRANT ALL ON oozie.* TO 'oozie'@'%' IDENTIFIED BY '@Zhaojie123';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> GRANT ALL ON hue.* TO 'hue'@'%' IDENTIFIED BY '@Zhaojie123';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> GRANT ALL ON rman.* TO 'rman'@'%' IDENTIFIED BY '@Zhaojie123';
Query OK, 0 rows affected, 1 warning (0.01 sec)

mysql> GRANT ALL ON metastore.* TO 'metastore'@'%' IDENTIFIED BY '@Zhaojie123';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> GRANT ALL ON nav.* TO 'nav'@'%' IDENTIFIED BY '@Zhaojie123';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> GRANT ALL ON navms.* TO 'navms'@'%' IDENTIFIED BY '@Zhaojie123';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> GRANT ALL ON sentry.* TO 'sentry'@'%' IDENTIFIED BY '@Zhaojie123';
Query OK, 0 rows affected, 1 warning (0.01 sec)

flush privileges;

Record the values you enter for database names, usernames, and passwords. The Cloudera Manager installation wizard requires this information to correctly connect to these databases.
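
Typing one CREATE DATABASE and one GRANT per service invites mismatches between database and user names. The following is a minimal sketch that generates the statements from database:user pairs, assuming the default names listed earlier and the GRANT ... IDENTIFIED BY form used in this guide (MySQL 8.0 no longer accepts that form and needs a separate CREATE USER). The function name gen_cm_sql and the password ChangeMe are placeholders made up for this example.

```shell
#!/bin/bash
# Print CREATE DATABASE and GRANT statements for "database:user" pairs.
# $1 = password; remaining args = database:user pairs
gen_cm_sql() {
    local password=$1 pair db user
    shift
    for pair in "$@"; do
        db=${pair%%:*}
        user=${pair##*:}
        echo "CREATE DATABASE ${db} DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;"
        echo "GRANT ALL ON ${db}.* TO '${user}'@'%' IDENTIFIED BY '${password}';"
    done
    echo "FLUSH PRIVILEGES;"
}

# Default database:user pairs; the password is a placeholder.
gen_cm_sql 'ChangeMe' scm:scm amon:amon rman:rman hue:hue metastore:hive \
    sentry:sentry nav:nav navms:navms oozie:oozie
```

The output can be reviewed and then piped into mysql -u root -p. Note that the pair metastore:hive grants the hive user access to the metastore database, which is the default pairing for the Hive Metastore Server.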

Set up the Cloudera Manager database

Use the script that ships with CM:

/opt/cloudera/cm/schema/scm_prepare_database.sh <databaseType> <databaseName> <databaseUser>

Example:

[root@node01 cm]# /opt/cloudera/cm/schema/scm_prepare_database.sh mysql scm scm
Enter SCM password:
JAVA_HOME=/usr/local/jdk
Verifying that we can write to /etc/cloudera-scm-server
Creating SCM configuration file in /etc/cloudera-scm-server
Executing:  /usr/local/jdk/bin/java -cp /usr/share/java/mysql-connector-java.jar:/usr/share/java/oracle-connector-java.jar:/usr/share/java/postgresql-connector-java.jar:/opt/cloudera/cm/schema/../lib/* com.cloudera.enterprise.dbutil.DbCommandExecutor /etc/cloudera-scm-server/db.properties com.cloudera.cmf.db.
Tue Mar 03 19:46:36 CST 2020 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
2020-03-03 19:46:36,866 [main] INFO  com.cloudera.enterprise.dbutil.DbCommandExecutor  - Successfully connected to database.
All done, your SCM database is configured correctly!

 

On the master node, check the generated database settings:

vim /etc/cloudera-scm-server/db.properties
com.cloudera.cmf.db.type=mysql
com.cloudera.cmf.db.host=node01
com.cloudera.cmf.db.name=scm
com.cloudera.cmf.db.user=scm
com.cloudera.cmf.db.setupType=EXTERNAL
com.cloudera.cmf.db.password=@Z

 

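Prepare the parcels: copy the CDH files to /opt/cloudera/parcel-repo on the master node, then rename the .sha1 file to .sha: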

[root@node01 parcel-repo]# pwd
/opt/cloudera/parcel-repo
[root@node01 parcel-repo]# ll
total 2035084
-rw-r--r-- 1 root root 2083878000 Mar  3 21:27 CDH-6.3.1-1.cdh6.3.1.p0.1470567-el7.parcel
-rw-r--r-- 1 root root         40 Mar  3 21:15 CDH-6.3.1-1.cdh6.3.1.p0.1470567-el7.parcel.sha1
-rw-r--r-- 1 root root      33887 Mar  3 21:15 manifest.json
[root@node01 parcel-repo]# mv CDH-6.3.1-1.cdh6.3.1.p0.1470567-el7.parcel.sha1 CDH-6.3.1-1.cdh6.3.1.p0.1470567-el7.parcel.sha
[root@node01 parcel-repo]# ll
total 2035084
-rw-r--r-- 1 root root 2083878000 Mar  3 21:27 CDH-6.3.1-1.cdh6.3.1.p0.1470567-el7.parcel
-rw-r--r-- 1 root root         40 Mar  3 21:15 CDH-6.3.1-1.cdh6.3.1.p0.1470567-el7.parcel.sha
-rw-r--r-- 1 root root      33887 Mar  3 21:15 manifest.json
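
A 2 GB parcel can be damaged in transit, so it is worth comparing it with its hash file before starting the server. The following is a minimal sketch, assuming the .sha file holds the SHA-1 hash of the parcel (the 40-byte size in the listing above is consistent with that). The function name verify_parcel is made up for this example, and the demonstration uses a small scratch file in place of the real parcel.

```shell
#!/bin/bash
# Compare the SHA-1 of a parcel with the hash stored in its .sha file.
# $1 = path to the parcel; the hash file is expected at "$1.sha"
verify_parcel() {
    local parcel=$1 expected actual
    # The hash file may hold just the hash or "hash  filename"; keep the first field.
    expected=$(awk '{ print $1; exit }' "${parcel}.sha")
    actual=$(sha1sum "$parcel" | awk '{ print $1 }')
    if [ "$expected" = "$actual" ]; then
        echo "OK: ${parcel}"
    else
        echo "MISMATCH: ${parcel}" >&2
        return 1
    fi
}

# Demonstration with a scratch file standing in for the real parcel.
demo_dir=$(mktemp -d)
echo "parcel payload" > "${demo_dir}/demo.parcel"
sha1sum "${demo_dir}/demo.parcel" | awk '{ print $1 }' > "${demo_dir}/demo.parcel.sha"
verify_parcel "${demo_dir}/demo.parcel"
rm -rf "$demo_dir"
```

On the master node the argument would be the full path of the parcel under /opt/cloudera/parcel-repo.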

 

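Start the services

On the master node:

systemctl start cloudera-scm-server
systemctl start cloudera-scm-agent

On the worker nodes:

systemctl start cloudera-scm-agent

Open ip:7180 in a browser and log in; the user name and password are both admin.

Then work through the installation wizard:

1. Click Continue.
2. Accept the license agreement and click Continue.
3. Select the edition and click Continue.
4. On the cluster installation welcome page, click Continue and name the cluster.
5. Click Continue and select the hosts to manage.
6. Select the CDH version.
7. Wait for the cluster installation. If the parcel download is slow, download it manually from https://archive.cloudera.com/cdh6/6.3.2/parcels/ (pick the directory that matches your CDH version; this guide installs 6.3.1).
8. Let the wizard inspect the network and hosts, and keep clicking Continue.
9. For services, select only HDFS, YARN, and ZooKeeper for now.
10. Assign roles.
11. Keep clicking Continue until the wizard finishes.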

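Configure Hadoop to support LZO

Differences between LzoCodec and LzopCodec

The two compression codecs differ as follows:
    1. LzoCodec is faster than LzopCodec. LzopCodec adds extra information such as signature bytes and a header for compatibility with the lzop program.
    2. With LzoCodec as the reduce output codec, the result files have the extension ".lzo_deflate" and cannot be read by lzop. With LzopCodec as the reduce output codec, the result files have the extension ".lzo" and can be read by lzop.
    3. LzoCodec results (.lzo_deflate files) cannot be indexed by the lzo index job "DistributedLzoIndexer".
    4. ".lzo_deflate" files cannot be used as MapReduce input, whereas ".lzo" files support all of the above.
        In summary: use LzoCodec for the intermediate map output and LzopCodec for the reduce output.

Also: org.apache.hadoop.io.compress.LzoCodec and com.hadoop.compression.lzo.LzoCodec do the same thing; both ship in the source package and both produce lzo_deflate files.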
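
The recommendation above (LzoCodec for map output, LzopCodec for reduce output) maps onto four standard MapReduce properties. The following is a minimal sketch that prints them as -D options; the function name lzo_job_opts is made up for this example, and it assumes the LZO codecs have already been installed and registered as described in the next steps.

```shell
#!/bin/bash
# Print -D options that select LzoCodec for map output and LzopCodec for job output.
lzo_job_opts() {
    printf '%s\n' \
        "-Dmapreduce.map.output.compress=true" \
        "-Dmapreduce.map.output.compress.codec=com.hadoop.compression.lzo.LzoCodec" \
        "-Dmapreduce.output.fileoutputformat.compress=true" \
        "-Dmapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec"
}

# Show the options, one per line.
lzo_job_opts
```

For a job whose driver parses generic options (for example through ToolRunner), the options could be passed as hadoop jar my-job.jar MyJob $(lzo_job_opts) input output, where my-job.jar and MyJob are hypothetical names.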

Installing LZO from an online parcel
Download locations (replace 6.x.y with your version):

CDH6: https://archive.cloudera.com/gplextras6/6.x.y/parcels/
CDH5: https://archive.cloudera.com/gplextras5/parcels/5.x.y/

1. In the CDH Parcel configuration, under "Remote Parcel Repository URLs", click "+" and add the address:

    CDH6: https://archive.cloudera.com/gplextras6/6.0.1/parcels/
    CDH5: http://archive.cloudera.com/gplextras/parcels/latest/

Offline alternatives:

Download the parcel and place it in the /opt/cloudera/parcel-repo directory,

or

set up an httpd server, change the parcel URL to point at it, and then install as for a remote repository.

2. Return to the parcel list. After a few seconds a new entry appears: GPLEXTRAS (CDH6) or HADOOP_LZO (CDH5).

Download -- Distribute -- Activate

3. After LZO is installed, open the HDFS configuration, find "Compression Codecs", click "+", and add:

com.hadoop.compression.lzo.LzoCodec
com.hadoop.compression.lzo.LzopCodec

4. In the YARN configuration, find "MR Application Classpath" (mapreduce.application.classpath) and add:

/opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/*

5. Restart to deploy the stale configuration.

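Add Sqoop

Add the Sqoop service through the Add Service wizard and click Continue until it finishes.

Installing Spark

Choose Add Service and add Spark.

After the service has been added, configure it on the nodes.

All three nodes must be configured.

Change to the directory:

cd /opt/cloudera/parcels/CDH/lib/spark/conf

Add the Java path:

vi spark-env.sh

Append at the end:

export JAVA_HOME=/usr/local/jdk

Create the slaves file and add the worker nodes:

node02
node03

Delete the symbolic link work:

rm -r work

Change the port to avoid a conflict with YARN:

vi spark-defaults.conf

 spark.shuffle.service.port=7337 can be changed to 7338

On startup the following errors appear: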

[root@node01 sbin]# ./start-all.sh
WARNING: User-defined SPARK_HOME (/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark) overrides detected (/opt/cloudera/parcels/CDH/lib/spark).
WARNING: Running start-master.sh from user-defined location.
/opt/cloudera/parcels/CDH/lib/spark/bin/load-spark-env.sh: line 77: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/bin/start-master.sh: No such file or directory
WARNING: User-defined SPARK_HOME (/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark) overrides detected (/opt/cloudera/parcels/CDH/lib/spark).
WARNING: Running start-slaves.sh from user-defined location.
/opt/cloudera/parcels/CDH/lib/spark/bin/load-spark-env.sh: line 77: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/bin/start-slaves.sh: No such file or directory

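Copy the scripts from the sbin directory to the bin directory and distribute them:

[root@node01 bin]# xsync start-slave.sh
[root@node01 bin]# xsync start-master.sh

Startup now succeeds.

Check with the jps command: node01 runs a Master, and node02 and node03 each run a Worker.

Open the shell: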

[root@node01 bin]# spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/03/04 13:22:07 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
Spark context Web UI available at http://node01:4040
Spark context available as 'sc' (master = yarn, app id = application_1583295431127_0001).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.3.1
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_241)
Type in expressions to have them evaluated.
Type :help for more information.

scala> var h =1
h: Int = 1

scala> h + 3
res1: Int = 4


scala> :quit

 

 

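Only changes made in the web UI persist; changes made directly in the files are reverted when CDH restarts.

Installing Flink

A Flink build compiled by the author:

Link: https://pan.baidu.com/s/1lIqeBtNpj0wR-Q8KAEAIsg
Extraction code: 89wi

1. Environment
JDK 1.8, CentOS 7.6, Maven 3.2.5, Scala 2.12

2. Source and CDH versions
Flink 1.10.0, CDH 6.3.1 (Hadoop 3.0.0)

Source download: https://flink.apache.org/downloads.html

Rebuilding Flink

Edit the Maven configuration file:

vi settings.xml

Configure the Maven mirrors: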

<mirrors>
        <mirror>
                <id>alimaven</id>
                <mirrorOf>central</mirrorOf>
                <name>aliyun maven</name>
                <url>http://maven.aliyun.com/nexus/content/repositories/central/</url>
        </mirror>
        <mirror>
                <id>alimaven-public</id>
                <name>aliyun maven</name>
                <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
                <mirrorOf>central</mirrorOf>
        </mirror>
        <mirror>
                <id>central</id>
                <name>Maven Repository Switchboard</name>
                <url>http://repo1.maven.org/maven2/</url>
                <mirrorOf>central</mirrorOf>
        </mirror>
        <mirror>
                <id>repo2</id>
                <mirrorOf>central</mirrorOf>
                <name>Human Readable Name for this Mirror.</name>
                <url>http://repo2.maven.org/maven2/</url>
        </mirror>
        <mirror>
                <id>ibiblio</id>
                <mirrorOf>central</mirrorOf>
                <name>Human Readable Name for this Mirror.</name>
                <url>http://mirrors.ibiblio.org/pub/mirrors/maven2/</url>
        </mirror>
        <mirror>
                <id>jboss-public-repository-group</id>
                <mirrorOf>central</mirrorOf>
                <name>JBoss Public Repository Group</name>
                <url>http://repository.jboss.org/nexus/content/groups/public</url>
        </mirror>
        <mirror>
                <id>google-maven-central</id>
                <name>Google Maven Central</name>
                <url>https://maven-central.storage.googleapis.com</url>
                <mirrorOf>central</mirrorOf>
        </mirror>
        <mirror>
                <id>maven.net.cn</id>
                <name>oneof the central mirrors in china</name>
                <url>http://maven.net.cn/content/groups/public/</url>
                <mirrorOf>central</mirrorOf>
        </mirror>
  </mirrors>

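Download the flink-shaded source that Flink depends on
Each Flink version uses a particular flink-shaded version; Flink 1.10 uses 10.0.

https://mirrors.tuna.tsinghua.edu.cn/apache/flink/flink-shaded-10.0/flink-shaded-10.0-src.tgz

After extracting, add the following to pom.xml inside the <profiles> element: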

<profile>
        <id>vendor-repos</id>
        <activation>
                <property>
                        <name>vendor-repos</name>
                </property>
        </activation>
        <!-- Add vendor maven repositories -->
        <repositories>
                <!-- Cloudera -->
                <repository>
                        <id>cloudera-releases</id>
                        <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
                        <releases>
                                <enabled>true</enabled>
                        </releases>
                        <snapshots>
                                <enabled>false</enabled>
                        </snapshots>
                </repository>
                <!-- Hortonworks -->
                <repository>
                        <id>HDPReleases</id>
                        <name>HDP Releases</name>
                        <url>https://repo.hortonworks.com/content/repositories/releases/</url>
                        <snapshots><enabled>false</enabled></snapshots>
                        <releases><enabled>true</enabled></releases>
                </repository>
                <repository>
                        <id>HortonworksJettyHadoop</id>
                        <name>HDP Jetty</name>
                        <url>https://repo.hortonworks.com/content/repositories/jetty-hadoop</url>
                        <snapshots><enabled>false</enabled></snapshots>
                        <releases><enabled>true</enabled></releases>
                </repository>
                <!-- MapR -->
                <repository>
                        <id>mapr-releases</id>
                        <url>https://repository.mapr.com/maven/</url>
                        <snapshots><enabled>false</enabled></snapshots>
                        <releases><enabled>true</enabled></releases>
                </repository>
        </repositories>
</profile>

Run the following command in the flink-shaded directory to build it:

mvn -T2C clean install -DskipTests -Pvendor-repos -Dhadoop.version=3.0.0-cdh6.3.1 -Dscala-2.12 -Drat.skip=true

 

Download the Flink source: https://mirrors.aliyun.com/apache/flink/flink-1.10.0/

Extract it, change into the directory, and edit the file:

[root@node02 ~]# cd /bigdata/
[root@node02 bigdata]# cd flink
[root@node02 flink]# cd flink-1.10.0
[root@node02 flink-1.10.0]# cd flink-runtime-web/
[root@node02 flink-runtime-web]# ll
total 24
-rw-r--r-- 1 501 games 8726 Mar  7 23:31 pom.xml
-rw-r--r-- 1 501 games 3505 Feb  8 02:18 README.md
drwxr-xr-x 4 501 games 4096 Feb  8 02:18 src
drwxr-xr-x 3 501 games 4096 Mar  7 23:19 web-dashboard
[root@node02 flink-runtime-web]# vi pom.xml

Add download mirrors located in China; otherwise the build is likely to fail:

<execution>
    <id>install node and npm</id>
    <goals>
        <goal>install-node-and-npm</goal>
    </goals>
    <configuration> 

            <nodeDownloadRoot>http://npm.taobao.org/mirrors/node/</nodeDownloadRoot>

            <npmDownloadRoot>http://npm.taobao.org/mirrors/npm/</npmDownloadRoot>

        <nodeVersion>v10.9.0</nodeVersion>
    </configuration>
</execution>

Run the following command in the extracted Flink source directory to build Flink:

mvn clean install -DskipTests -Dfast -Drat.skip=true -Dhadoop.version=3.0.0-cdh6.3.1 -Pvendor-repos -Dinclude-hadoop -Dscala-2.12 -T2C

Then take the flink-1.10.0 binary package from the build output.
Location:

flink-1.10.0/flink-dist/target/flink-1.10.0-bin

 

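Flink on YARN mode

Configure the environment variables on all three nodes:

export HADOOP_HOME=/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567
export HADOOP_CONF_DIR=/etc/hadoop/conf
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Source the profile file so that the variables take effect.

If Spark is installed on the machine, its worker port 8081 conflicts with Flink's web port, so change the Flink port.

On one node, open the configuration file in the conf directory under the Flink directory:

vi flink-conf.yaml

Set:

rest.port: 8082

Then add or modify the following in the same file: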

high-availability: zookeeper
high-availability.storageDir: hdfs://node01:8020/flink_yarn_ha
high-availability.zookeeper.path.root: /flink-yarn
high-availability.zookeeper.quorum: node01:2181,node02:2181,node03:2181
yarn.application-attempts: 10
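
The settings above are flat "key: value" lines, so they can be applied with a small helper that replaces an existing key or appends a missing one. The following is a minimal sketch; the function name set_conf_key is made up for this example, it assumes keys start at the beginning of the line without indentation, and the demonstration uses a scratch file rather than the real flink-conf.yaml.

```shell
#!/bin/bash
# Set "key: value" in a flat YAML file: replace the line if the key exists, else append.
# $1 = file, $2 = key, $3 = value
set_conf_key() {
    local file=$1 key=$2 value=$3 tmp
    tmp=$(mktemp)
    awk -v k="$key" -v v="$value" '
        # A line belongs to the key only when it starts with exactly "key:".
        index($0, k ":") == 1 { print k ": " v; done = 1; next }
        { print }
        END { if (!done) print k ": " v }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Demonstration on a scratch file.
demo_conf=$(mktemp)
echo "rest.port: 8081" > "$demo_conf"
set_conf_key "$demo_conf" rest.port 8082
set_conf_key "$demo_conf" high-availability zookeeper
set_conf_key "$demo_conf" yarn.application-attempts 10
cat "$demo_conf"
rm -f "$demo_conf"
```

Matching on "key:" rather than on the bare key keeps high-availability from overwriting high-availability.storageDir.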

Distribute Flink to all nodes:

xsync flink-1.10.0

Create the HA directory on HDFS

Run the following command on node01:

hdfs dfs -mkdir -p /flink_yarn_ha

Create a test file:

vim wordcount.txt

with the following content:

hello world

flink hadoop

hive spark

Create an input directory on HDFS and upload the file:

hdfs dfs -mkdir -p /flink_input

hdfs dfs -put wordcount.txt  /flink_input

Test:

[root@node01 flink-1.10.0]# bin/flink run -m yarn-cluster ./examples/batch/WordCount.jar -input hdfs://node01:8020/flink_input -output hdfs://node01:8020/out_result1/out_count.txt  -yn 2 -yjm 1024 -ytm 1024

View the output (the path must match the -output path used above):

hdfs dfs -cat hdfs://node01:8020/out_result1/out_count.txt

 

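Kafka

Download the parcel from http://archive.cloudera.com/kafka/parcels/4.0.0/

Distribute and activate it.

Add the service and assign the broker role to all three nodes; nothing else needs to be configured.

Optionally adjust Java Heap Size of Broker.

Create a topic: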

/opt/cloudera/parcels/KAFKA-4.0.0-1.4.0.0.p0.1/bin/kafka-topics --zookeeper node01:2181,node02:2181,node03:2181 --create --replication-factor 1 --partitions 1 --topic test

List the topics:

 /opt/cloudera/parcels/KAFKA-4.0.0-1.4.0.0.p0.1/bin/kafka-topics --zookeeper node01:2181 --list

Produce messages:

/opt/cloudera/parcels/KAFKA-4.0.0-1.4.0.0.p0.1/bin/kafka-console-producer --broker-list node01:9092 --topic test

Consume messages:

/opt/cloudera/parcels/KAFKA-4.0.0-1.4.0.0.p0.1/bin/kafka-console-consumer --bootstrap-server node01:9092 --topic test

 

 

V. Native Installation

https://archive.apache.org/dist/

Hadoop 2.8.5

Hive 2.3.6 

HBase 2.1.8

Flume 

Sqoop

 

Kafka 

Storm 

spark 2.4.6

Flink

 

 

Zookeeper

https://www.cnblogs.com/aidata/p/12441506.html#_label1_2

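Three nodes

Cluster plan
Deploy ZooKeeper on three nodes: node01, node02, and node03. The commands below were recorded on hosts named hadoop101, hadoop102, and hadoop103; substitute your own host names.
Extract and install
(1) Extract the ZooKeeper package to the /opt/module/ directory

[root@hadoop101 software]$ tar -zxvf zookeeper-3.4.10.tar.gz -C /opt/module/

(2) Sync the /opt/module/zookeeper-3.4.10 directory to hadoop102 and hadoop103

[root@hadoop101 module]$ xsync zookeeper-3.4.10/

Configure the server ID
(1) Create zkData in the /opt/module/zookeeper-3.4.10/ directory

[root@hadoop101 zookeeper-3.4.10]$ mkdir -p zkData

(2) Create a file named myid in the /opt/module/zookeeper-3.4.10/zkData directory

[root@hadoop101 zkData]$ touch myid

Create the myid file on Linux itself; a file created in Notepad++ may end up with a garbled encoding.
(3) Edit the myid file

[root@hadoop101 zkData]$ vi myid

Put the number that matches this server's entry into the file: 1
(4) Copy the configured file to the other machines

[root@hadoop101 zkData]$ xsync myid

Then change the content of the myid file to 2 on hadoop102 and to 3 on hadoop103.
Configure the zoo.cfg file
(1) Rename zoo_sample.cfg in the /opt/module/zookeeper-3.4.10/conf directory to zoo.cfg

[root@hadoop101 conf]$ mv zoo_sample.cfg zoo.cfg

(2) Open the zoo.cfg file

[root@hadoop101 conf]$ vim zoo.cfg

Change the data directory setting:

dataDir=/opt/module/zookeeper-3.4.10/zkData

Add the following configuration:

#######################cluster##########################
server.1=hadoop101:2888:3888
server.2=hadoop102:2888:3888
server.3=hadoop103:2888:3888

(3) Sync the zoo.cfg configuration file

[root@hadoop101 conf]$ xsync zoo.cfg

(4) Meaning of the parameters
server.A=B:C:D
A is a number identifying which server this is.
In cluster mode a file named myid is placed in the dataDir directory. It contains the value of A; at startup ZooKeeper reads the file and compares its value with the entries in zoo.cfg to determine which server it is.
B is the address (IP or host name) of the server.
C is the port this server uses to exchange information with the cluster's Leader.
D is the port the servers use to talk to each other during leader election, for example when the current Leader fails and a new one has to be chosen.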
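
The server.A=B:C:D lines and the myid values must agree, so deriving both from one ordered host list avoids mismatches. The following is a minimal sketch, assuming the ports 2888 and 3888 used above and IDs assigned in list order starting at 1. The function names zk_server_lines and zk_myid_for are made up for this example.

```shell
#!/bin/bash
# Print server.N=host:2888:3888 lines, numbering hosts from 1 in argument order.
zk_server_lines() {
    local id=1 host
    for host in "$@"; do
        echo "server.${id}=${host}:2888:3888"
        id=$((id + 1))
    done
}

# Print the myid value for one host, i.e. its 1-based position in the host list.
# $1 = host to look up; remaining args = host list. Returns 1 if the host is not listed.
zk_myid_for() {
    local target=$1 id=1 host
    shift
    for host in "$@"; do
        if [ "$host" = "$target" ]; then
            echo "$id"
            return 0
        fi
        id=$((id + 1))
    done
    return 1
}

zk_server_lines node01 node02 node03
echo "myid for node02: $(zk_myid_for node02 node01 node02 node03)"
```

The first function produces the block to append to zoo.cfg; the second gives the number to write into the myid file on each host.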
Cluster operation
(1) Start ZooKeeper on each node

[root@hadoop101 zookeeper-3.4.10]$ bin/zkServer.sh start
[root@hadoop102 zookeeper-3.4.10]$ bin/zkServer.sh start
[root@hadoop103 zookeeper-3.4.10]$ bin/zkServer.sh start

(2) Check the status

[root@hadoop101 zookeeper-3.4.10]# bin/zkServer.sh status
JMX enabled by default
Using config: /opt/module/zookeeper-3.4.10/bin/../conf/zoo.cfg
Mode: follower
[root@hadoop102 zookeeper-3.4.10]# bin/zkServer.sh status
JMX enabled by default
Using config: /opt/module/zookeeper-3.4.10/bin/../conf/zoo.cfg
Mode: leader
[root@hadoop103 zookeeper-3.4.5]# bin/zkServer.sh status
JMX enabled by default
Using config: /opt/module/zookeeper-3.4.10/bin/../conf/zoo.cfg
Mode: follower

The id must be unique within the cluster, and its value must be between 1 and 255.
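
A node whose myid is out of range or missing from zoo.cfg will not join the quorum, so a quick check before starting can save debugging time. The following is a minimal sketch that validates the 1 to 255 range stated above and looks for a matching server.N line. The function name check_myid is made up for this example, and the demonstration runs on scratch files.

```shell
#!/bin/bash
# Check that a myid value is an integer in 1..255 and has a server.N entry in zoo.cfg.
# $1 = myid file, $2 = zoo.cfg file
check_myid() {
    local id
    id=$(tr -d '[:space:]' < "$1")
    case $id in
        ''|*[!0-9]*) echo "myid is not a number: '${id}'" >&2; return 1 ;;
    esac
    if [ "$id" -lt 1 ] || [ "$id" -gt 255 ]; then
        echo "myid out of range: ${id}" >&2
        return 1
    fi
    if ! grep -q "^server\.${id}=" "$2"; then
        echo "no server.${id} entry in $2" >&2
        return 1
    fi
    echo "myid ${id} OK"
}

# Demonstration on scratch files.
demo_dir=$(mktemp -d)
printf 'server.1=node01:2888:3888\nserver.2=node02:2888:3888\n' > "${demo_dir}/zoo.cfg"
echo 2 > "${demo_dir}/myid"
check_myid "${demo_dir}/myid" "${demo_dir}/zoo.cfg"
rm -rf "$demo_dir"
```

On a real node the arguments would be the myid file under dataDir and the zoo.cfg in the conf directory.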

Common service commands

1. Start the ZK service: bin/zkServer.sh start

2. Check the ZK service status: bin/zkServer.sh status

3. Stop the ZK service: bin/zkServer.sh stop

4. Restart the ZK service: bin/zkServer.sh restart

5. Connect to a server: zkCli.sh -server 127.0.0.1:2181

Cluster monitoring

If the following error appears:

[myid:1] - WARN  [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled):QuorumCnxManager@685] - Cannot open channel to 3 at election address k8s-node3/10.0.2.15:17888
java.net.ConnectException: Connection refused (Connection refused)
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:606)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:656)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:713)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:741)
        at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:910)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1229)

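Fix: in zoo.cfg, use 0.0.0.0 for the local node's own entry. On hadoop101, for example:

server.1=0.0.0.0:2888:3888
server.2=hadoop102:2888:3888
server.3=hadoop103:2888:3888

Do the same on the other nodes: on each node, replace the host name in that node's own entry with the IP 0.0.0.0.

Reason: https://stackoverflow.com/questions/30940981/zookeeper-error-cannot-open-channel-to-x-at-election-address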

How have defined the ip of the local server in each node? If you have given the public ip, then the listener would have failed to connect to the port. You must specify 0.0.0.0 for the current node

server.1=0.0.0.0:2888:3888
server.2=192.168.10.10:2888:3888
server.3=192.168.2.1:2888:3888

This change must be performed at the other nodes too.
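Because the local entry differs on every node, the server lines can be generated per node instead of edited by hand. This is a sketch under the assumptions that ids follow the order of the host list and that ports 2888 and 3888 are used as above; `zk_server_lines` is a hypothetical helper.

```shell
#!/bin/bash
# Print the server.N lines for one node's zoo.cfg.
# $1 = hostname of the local node; remaining args = all hosts, in id order.
zk_server_lines() {
    local self=$1 id=1 host
    shift
    for host in "$@"; do
        if [ "$host" = "$self" ]; then
            echo "server.$id=0.0.0.0:2888:3888"   # local node: bind on all interfaces
        else
            echo "server.$id=$host:2888:3888"
        fi
        id=$((id + 1))
    done
}

zk_server_lines hadoop101 hadoop101 hadoop102 hadoop103
```

Calling it with hadoop102 as the first argument produces the variant for the second node, and so on.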

 

Installation script

#!/bin/bash

echo "==================== Installing zookeeper ===================="
echo "==================== Downloading zookeeper ===================="
#wget https://mirrors.tuna.tsinghua.edu.cn/apache/zookeeper/zookeeper-3.5.8/apache-zookeeper-3.5.8-bin.tar.gz
#tar -zxvf apache-zookeeper-3.5.8-bin.tar.gz
#xsync apache-zookeeper-3.5.8-bin/

# Loop over the nodes; myid starts at 1 to match server.1..server.3 below (valid ids are 1 to 255)
i=1
for host in node01 node02 node03; do
        echo ==================$host==================
        ssh $host "mkdir -p /bigdata/apache-zookeeper-3.5.8-bin/zkData"
        ssh $host "touch /bigdata/apache-zookeeper-3.5.8-bin/zkData/myid"
        ssh $host "echo $i > /bigdata/apache-zookeeper-3.5.8-bin/zkData/myid"
        ssh $host "cp /bigdata/apache-zookeeper-3.5.8-bin/conf/zoo_sample.cfg /bigdata/apache-zookeeper-3.5.8-bin/conf/zoo.cfg"
        ssh $host 'sed -i "s#^dataDir=.*#dataDir=/bigdata/apache-zookeeper-3.5.8-bin/zkData#" /bigdata/apache-zookeeper-3.5.8-bin/conf/zoo.cfg'
        ssh $host 'echo "server.1=node01:2888:3888" >> /bigdata/apache-zookeeper-3.5.8-bin/conf/zoo.cfg'
        ssh $host 'echo "server.2=node02:2888:3888" >> /bigdata/apache-zookeeper-3.5.8-bin/conf/zoo.cfg'
        ssh $host 'echo "server.3=node03:2888:3888" >> /bigdata/apache-zookeeper-3.5.8-bin/conf/zoo.cfg'

         let 'i+=1'

done

 

Start script

#!/bin/bash
# bash is required: the for((...)) loop below is not POSIX sh syntax

# Loop over the nodes
for((host=1; host<=3; host++)); do
        echo ==================k8s-node$host==================
        ssh root@k8s-node$host "source /etc/profile;/opt/module/apache-zookeeper-3.5.7-bin/bin/zkServer.sh start"
done

Change the hostnames and directory to match your own environment.

Stop all nodes

#!/bin/bash

# Loop over the nodes
for((host=1; host<=3; host++)); do
        echo ==================k8s-node$host==================
        ssh root@k8s-node$host "source /etc/profile;/opt/module/apache-zookeeper-3.5.7-bin/bin/zkServer.sh stop"
done

Check the status of all nodes

#!/bin/bash

# Loop over the nodes
for((host=1; host<=3; host++)); do
        echo ==================k8s-node$host==================
        ssh root@k8s-node$host "source /etc/profile;/opt/module/apache-zookeeper-3.5.7-bin/bin/zkServer.sh status"
done

 Combined into a single script

#! /bin/bash

case $1 in
"start"){
    for host in node01 node02 node03; do
        ssh $host "/bigdata/apache-zookeeper-3.5.8-bin/bin/zkServer.sh start"
    done
};;
"stop"){
    for host in node01 node02 node03; do
        ssh $host "/bigdata/apache-zookeeper-3.5.8-bin/bin/zkServer.sh stop"
    done
};;
"status"){
    for host in node01 node02 node03; do
        ssh $host "/bigdata/apache-zookeeper-3.5.8-bin/bin/zkServer.sh status"
    done
};;
esac

 

 

 

mysql

 

Hadoop 

Configuring HDFS

core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- Set the HDFS nameservice (logical namespace) to ns -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://ns</value>
    </property>
    <!-- Hadoop temp directory. The default is /tmp/hadoop-${user.name}, which is unsafe because /tmp is typically cleared on reboot -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/hadoop/hdpdata/</value>
        <description>The hdpdata directory must be created manually</description>
    </property>
    <!-- ZooKeeper addresses -->
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>node01:2181,node02:2181,node03:2181</value>
        <description>ZooKeeper addresses, separated by commas</description>
    </property>
</configuration>

hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- NameNode HA settings -->
    <property>
        <name>dfs.nameservices</name>
        <value>ns</value>
        <description>The HDFS nameservice is ns; it must match the value used in core-site.xml</description>
    </property>
    <property>
        <name>dfs.ha.namenodes.ns</name>
        <value>nn1,nn2</value>
        <description>The ns nameservice has two NameNodes; nn1 and nn2 are logical names and can be chosen freely</description>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.ns.nn1</name>
        <value>node01:9000</value>
        <description>RPC address of nn1</description>
    </property>
    <property>
        <name>dfs.namenode.http-address.ns.nn1</name>
        <value>node01:50070</value>
        <description>HTTP address of nn1</description>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.ns.nn2</name>
        <value>node02:9000</value>
        <description>RPC address of nn2</description>
    </property>
    <property>
        <name>dfs.namenode.http-address.ns.nn2</name>
        <value>node02:50070</value>
        <description>HTTP address of nn2</description>
    </property>
    <!-- JournalNode settings -->
    <property>
        <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://node01:8485;node02:8485;node03:8485/ns</value>
    </property>
    <property>
        <name>dfs.journalnode.edits.dir</name>
        <value>/usr/local/hadoop/journaldata</value>
        <description>Local disk location where the JournalNode stores its data</description>
    </property>
    <!-- NameNode HA active/standby failover settings -->
    <property>
        <name>dfs.ha.automatic-failover.enabled</name>
        <value>true</value>
        <description>Enable automatic NameNode failover</description>
    </property>
    <property>
        <name>dfs.client.failover.proxy.provider.ns</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
        <description>Class that HDFS clients use to locate the active NameNode after a failover</description>
    </property>
    <property>
        <name>dfs.ha.fencing.methods</name>
        <value>
            sshfence
            shell(/bin/true)
        </value>
        <description>Fencing methods, one per line. sshfence is tried first; if it fails, shell(/bin/true) runs, and /bin/true always returns 0 (success)</description>
    </property>
    <property>
        <name>dfs.ha.fencing.ssh.private-key-files</name>
        <value>/root/.ssh/id_rsa</value>
        <description>sshfence requires passwordless SSH login</description>
    </property>
    <property>
        <name>dfs.ha.fencing.ssh.connect-timeout</name>
        <value>30000</value>
        <description>Connect timeout for sshfence, in milliseconds</description>
    </property>
    <!-- DFS file properties -->
    <property>
        <name>dfs.replication</name>
        <value>3</value>
        <description>Default number of block replicas, set to 3 here. A test environment can lower it to 1; production should always keep at least 3 replicas</description>
    </property>

    <property>
        <name>dfs.block.size</name>
        <value>134217728</value>
        <description>Block size of 128 MB</description>
    </property>

</configuration>
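Since dfs.nameservices has to agree with fs.defaultFS in core-site.xml, a quick consistency check can catch a mismatch before the cluster is formatted. This is a rough sketch, not a real XML parser: it assumes each `<value>` sits on the line directly after its `<name>`, as in the files above, and `get_prop` and `check_nameservice` are hypothetical helper names.

```shell
#!/bin/bash
# Print the <value> that follows <name>$1</name> in Hadoop-style XML file $2.
# Assumes <name> and <value> are on consecutive lines.
get_prop() {
    grep -A1 "<name>$1</name>" "$2" | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

# Succeed if fs.defaultFS in file $1 equals hdfs://<dfs.nameservices from file $2>.
check_nameservice() {
    local fs ns
    fs=$(get_prop fs.defaultFS "$1")
    ns=$(get_prop dfs.nameservices "$2")
    [ -n "$ns" ] && [ "$fs" = "hdfs://$ns" ]
}

# Demo on throwaway files that contain only the two relevant properties.
tmp=$(mktemp -d)
printf '<property>\n<name>fs.defaultFS</name>\n<value>hdfs://ns</value>\n</property>\n' > "$tmp/core-site.xml"
printf '<property>\n<name>dfs.nameservices</name>\n<value>ns</value>\n</property>\n' > "$tmp/hdfs-site.xml"
if check_nameservice "$tmp/core-site.xml" "$tmp/hdfs-site.xml"; then
    echo "nameservice matches"
fi
```

Against the real files the call would be `check_nameservice /usr/local/hadoop/etc/hadoop/core-site.xml /usr/local/hadoop/etc/hadoop/hdfs-site.xml`.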

 

Configuring YARN

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
        <description>Run MapReduce on YARN</description>
    </property>
    <!-- JobHistory server settings -->
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>node02:10020</value>
        <description>JobHistory server address (host:port)</description>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>node02:19888</value>
        <description>JobHistory server web UI address (host:port)</description>
    </property>
</configuration>

yarn-site.xml

<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>
    <!-- Enable ResourceManager HA -->
    <property>
        <name>yarn.resourcemanager.ha.enabled</name>
        <value>true</value>
    </property>
    <!-- Cluster id of the RM: the logical id shared by a group of HA ResourceManagers -->
    <property>
        <name>yarn.resourcemanager.cluster-id</name>
        <value>yarn-ha</value>
    </property>
    <!-- Names of the RMs; these can be chosen freely -->
    <property>
        <name>yarn.resourcemanager.ha.rm-ids</name>
        <value>rm1,rm2</value>
    </property>
    <!-- Address of each RM -->
    <property>
        <name>yarn.resourcemanager.hostname.rm1</name>
        <value>node01</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address.rm1</name>
        <value>${yarn.resourcemanager.hostname.rm1}:8088</value>
        <description>Host and port for HTTP access</description>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname.rm2</name>
        <value>node02</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address.rm2</name>
        <value>${yarn.resourcemanager.hostname.rm2}:8088</value>
    </property>
    <!-- ZooKeeper cluster addresses -->
    <property>
        <name>yarn.resourcemanager.zk-address</name>
        <value>node01:2181,node02:2181,node03:2181</value>
    </property>
    <!-- Auxiliary service run on the NodeManager; it must be set to mapreduce_shuffle for MapReduce jobs to run -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Enable log aggregation -->
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <!-- HDFS directory for aggregated logs -->
    <property>
        <name>yarn.nodemanager.remote-app-log-dir</name>
        <value>/data/hadoop/yarn-logs</value>
    </property>
    <!-- Keep aggregated logs for 3 days; the value is in seconds -->
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>259200</value>
    </property>
</configuration>
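The retention value is given in seconds, so 3 days becomes 3 × 24 × 60 × 60. A small sketch for deriving the number when a different retention period is wanted; `days_to_seconds` is a hypothetical helper.

```shell
#!/bin/bash
# Convert a number of days into seconds.
days_to_seconds() {
    echo $(( $1 * 24 * 60 * 60 ))
}

days_to_seconds 3   # prints 259200, the value used for yarn.log-aggregation.retain-seconds
```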

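Create the hdpdata directory under /usr/local/hadoop

cd /usr/local/hadoop
mkdir hdpdata

 

Edit the slaves file under /usr/local/hadoop/etc/hadoop

This file lists the hosts on which the DataNode and NodeManager daemons are started.

Add the hostnames of those nodes to the slaves file:

node02
node03

 

Copy the hadoop directory to every node

 

Starting the cluster

(Follow the startup order strictly.)

Start the JournalNodes (run on each of node01, node02 and node03):

/usr/local/hadoop/sbin/hadoop-daemon.sh start journalnode

Run jps to verify: node01, node02 and node03 should each show a new JournalNode process.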


Format HDFS
Run on node01:

hdfs namenode -format

After a successful format, a dfs directory is created under the path set by hadoop.tmp.dir in core-site.xml. Copy that directory to the same path on node02:

scp -r /usr/local/hadoop/hdpdata root@node02:/usr/local/hadoop

 

Format the ZKFC state in ZooKeeper from node01:

hdfs zkfc -formatZK

On success the log contains:
INFO ha.ActiveStandbyElector: Successfully created /hadoop-ha/ns in ZK

Start HDFS on node01 (the relative sbin/ paths below assume the current directory is /usr/local/hadoop):

sbin/start-dfs.sh

 

Start YARN on node02:

sbin/start-yarn.sh

Start a second ResourceManager on node01 as the standby:

sbin/yarn-daemon.sh start resourcemanager

 

Start the JobHistoryServer on node02:

sbin/mr-jobhistory-daemon.sh start historyserver

Once it has started, node02 shows an additional JobHistoryServer process.

Hadoop is now installed and running.
HDFS web UI
NameNode (active): http://node01:50070
NameNode (standby): http://node02:50070
ResourceManager web UI
ResourceManager: http://node02:8088
JobHistory web UI
JobHistoryServer: http://node02:19888

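Verifying the cluster

 Verify that HDFS works and that NameNode HA fails over. First upload a file to HDFS:

hadoop fs -put /usr/local/hadoop/README.txt /

On the active node, stop the active NameNode manually:

sbin/hadoop-daemon.sh stop namenode

Check on HTTP port 50070 whether the standby NameNode has switched to active.
Then restart the NameNode that was stopped in the previous step:

sbin/hadoop-daemon.sh start namenode

Verify ResourceManager HA
Stop the ResourceManager on node02 manually:

sbin/yarn-daemon.sh stop resourcemanager

Open the ResourceManager on node01 through HTTP port 8088 and check its state.
Then restart the ResourceManager on node02:

sbin/yarn-daemon.sh start resourcemanager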
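Besides the web UI, the NameNode state can be read on the command line with `hdfs haadmin -getServiceState nn1`, which prints active or standby. The sketch below only counts states passed as arguments, so it runs without a cluster; `count_active` is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: count how many of the given HA states are "active".
count_active() {
    local n=0 state
    for state in "$@"; do
        if [ "$state" = "active" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# On a live cluster the arguments could come from the real CLI, for example:
#   count_active "$(hdfs haadmin -getServiceState nn1)" "$(hdfs haadmin -getServiceState nn2)"
count_active standby active   # a healthy HA pair has exactly one active NameNode
```

A result other than 1 after a failover test suggests that the switch did not complete.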

 

 

Installation script

#!/bin/bash
# Assumes /usr/local/hadoop already links to /bigdata/hadoop-2.8.5 (see the commented-out ln below)

tar -zxvf /bigdata/downloads/hadoop-2.8.5.tar.gz -C /bigdata
\cp /bigdata/downloads/yarn-site.xml /usr/local/hadoop/etc/hadoop/
\cp /bigdata/downloads/mapred-site.xml /usr/local/hadoop/etc/hadoop/
\cp /bigdata/downloads/hdfs-site.xml /usr/local/hadoop/etc/hadoop/
\cp /bigdata/downloads/core-site.xml /usr/local/hadoop/etc/hadoop/
cat /dev/null > /usr/local/hadoop/etc/hadoop/slaves
echo "node02" >> /usr/local/hadoop/etc/hadoop/slaves
echo "node03" >> /usr/local/hadoop/etc/hadoop/slaves

xsync /bigdata/hadoop-2.8.5
# Append environment variables
echo 'export HADOOP_HOME=/usr/local/hadoop' >> /etc/profile
echo 'export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop' >> /etc/profile
echo 'export YARN_HOME=$HADOOP_HOME' >> /etc/profile
echo 'export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop' >> /etc/profile
echo 'export PATH=$PATH:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin' >> /etc/profile

xsync /etc/profile

# Loop over the nodes
for host in node01 node02 node03; do
        echo ==================$host==================
        # Create the symlink
        #ssh $host "ln -s /bigdata/hadoop-2.8.5 /usr/local/hadoop"
        # Note: this only affects that one ssh session; new login shells read /etc/profile on their own
        ssh $host "source /etc/profile"
done
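The script above appends its export lines to /etc/profile on every run, so running it twice leaves duplicates. One way around that, sketched here with a hypothetical `append_once` helper and demonstrated on a temporary file rather than the real /etc/profile:

```shell
#!/bin/bash
# Hypothetical helper: append line $1 to file $2 unless an identical line is already there.
append_once() {
    grep -qxF -- "$1" "$2" 2>/dev/null || printf '%s\n' "$1" >> "$2"
}

# Demo on a temporary file.
profile=$(mktemp)
append_once 'export HADOOP_HOME=/usr/local/hadoop' "$profile"
append_once 'export HADOOP_HOME=/usr/local/hadoop' "$profile"   # second call adds nothing
cat "$profile"
```

`grep -x` matches whole lines and `-F` treats the text literally, so the `$` characters in export lines are not interpreted as regular-expression syntax.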

Format and start the cluster for the first time

#!/bin/bash

for host in node01 node02 node03; do
        echo ==================$host==================
        # Start the journalnode
        ssh $host "/usr/local/hadoop/sbin/hadoop-daemon.sh start journalnode"
done

/usr/local/hadoop/bin/hdfs namenode -format
scp -r /usr/local/hadoop/hdpdata root@node02:/usr/local/hadoop
/usr/local/hadoop/bin/hdfs zkfc -formatZK
/usr/local/hadoop/sbin/start-dfs.sh
ssh node02 "/usr/local/hadoop/sbin/start-yarn.sh"
/usr/local/hadoop/sbin/yarn-daemon.sh start resourcemanager
ssh node02 "/usr/local/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver"

 

Hive

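Here MySQL runs in Docker; adjust hive-site.xml to match your own host.

1. Create the HDFS warehouse directory

  hadoop fs -mkdir -p /user/hive/warehouse

2. Give all users write permission on the warehouse directory

hadoop fs -chmod a+w /user/hive/warehouse

3. Open up permissions on the HDFS /tmp directory

hadoop fs -chmod -R 777 /tmp

5. Extract the Hive package into the /bigdata/ installation directory

tar -zxvf apache-hive-1.2.2-bin.tar.gz -C /bigdata

6. Create a symlink

 ln -s /bigdata/apache-hive-1.2.2-bin /usr/local/hive

7. Set environment variables

 vim /etc/profile

    Add the following:

 export HIVE_HOME=/usr/local/hive

export PATH=$PATH:${HIVE_HOME}/bin

8. Reload the profile so the environment variables take effect

source /etc/profile

9. Upload hive-site.xml to the hive/conf directory; it holds the connection settings for the MySQL database that stores the metadata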

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://192.168.10.100:3307/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hive1234</value>
    </property>
</configuration>

 

10. Copy the MySQL driver jar into ${HIVE_HOME}/lib

11. Log in to MySQL and create the hive user

    Log in to MySQL: mysql -u root -p

    Create the user: create user 'hive'@'%' identified by 'hive1234';

    Query the user table to confirm the user was created: select user,host from mysql.user;

    Grant privileges to the user: grant all privileges on *.* to 'hive'@'%';

    Refresh privileges: flush privileges;

12. Start Hive

    /usr/local/hive/bin/hive

 

Script

MySQL is assumed to be configured already.

hiveInstall.sh

#!/bin/bash
hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -chmod a+w /user/hive/warehouse
hadoop fs -chmod -R 777 /tmp
tar -zxvf /bigdata/apache-hive-2.3.6-bin.tar.gz -C /bigdata
ln -s /bigdata/apache-hive-2.3.6-bin /usr/local/hive
echo 'export HIVE_HOME=/usr/local/hive' >> /etc/profile
echo 'export PATH=$PATH:${HIVE_HOME}/bin' >> /etc/profile
source /etc/profile

\cp /bigdata/downloads/hive-site.xml /usr/local/hive/conf/
\cp /bigdata/downloads/mysql-connector-java-5.1.47.jar /usr/local/hive/lib

If the script sets environment variables, run it with source or . :

. hiveInstall.sh
or
source hiveInstall.sh

Otherwise, running it as

./hiveInstall.sh

executes it in a child shell.

The source /etc/profile inside the script then takes effect only in that child shell. When the script finishes and control returns to the current shell, the environment variables are not set there.
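The difference is easy to reproduce with a throwaway script. In this sketch the variable name DEMO_HOME and the temporary script are illustrative stand-ins for hiveInstall.sh and its exports.

```shell
#!/bin/bash
# Throwaway script that sets a variable, standing in for hiveInstall.sh.
script=$(mktemp)
echo 'export DEMO_HOME=/usr/local/demo' > "$script"

unset DEMO_HOME
bash "$script"                                  # runs in a child shell, like ./hiveInstall.sh
echo "after child shell: '${DEMO_HOME:-}'"      # still empty

. "$script"                                     # runs in the current shell
echo "after sourcing:    '${DEMO_HOME:-}'"      # now /usr/local/demo
```

A child process cannot change the environment of its parent, which is why only the sourced run leaves the variable set.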

Initialize Hive, which creates the metadata tables in MySQL

schematool -dbType mysql -initSchema

Start Hive

 /usr/local/hive/bin/hive

https://www.cnblogs.com/aidata/p/11571111.html#_label3

 

Hbase

In the conf directory:

Configure hbase-env.sh

Set the JDK path: export JAVA_HOME=/usr/local/jdk

Use the external ZooKeeper instead of the one managed by HBase: export HBASE_MANAGES_ZK=false

Configure hbase-site.xml

<configuration>
    <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/usr/local/zookeeper/data</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://node02:9000/user/hbase</value>
    </property>
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>node01:2181,node02:2181,node03:2181</value>
    </property>
</configuration>

 

Configure regionservers

node02
node03

Create a new file named backup-masters

node02

Go into the lib directory and copy the jar from client-facing-thirdparty into lib:

cp client-facing-thirdparty/htrace-core-3.1.0-incubating.jar .

 

Installation script

#!/bin/bash

tar -zxvf /bigdata/downloads/hbase-2.1.8-bin.tar.gz -C /bigdata
# Loop over the nodes
for host in node01 node02 node03; do
        echo ==================$host==================
        # Create the symlink
        ssh $host "ln -s /bigdata/hbase-2.1.8 /usr/local/hbase"

done
# Overwrite the configuration file
\cp /bigdata/downloads/hbase-site.xml /usr/local/hbase/conf
# Configure regionservers
cat /dev/null > /usr/local/hbase/conf/regionservers
echo "node02" >> /usr/local/hbase/conf/regionservers
echo "node03" >> /usr/local/hbase/conf/regionservers
# Create backup-masters
touch /usr/local/hbase/conf/backup-masters
echo "node02" >> /usr/local/hbase/conf/backup-masters
\cp /usr/local/hbase/lib/client-facing-thirdparty/htrace-core-3.1.0-incubating.jar /usr/local/hbase/lib

# Sync the extracted directory (the same path the symlink above points to)
xsync /bigdata/hbase-2.1.8

 

Startup

From the bin directory:

./start-hbase.sh
./hbase shell

 

 

Kafka

1. Cluster plan
Deploy on three machines: node01, node02 and node03
2. Download the Kafka package
Download from http://kafka.apache.org/downloads and choose kafka_2.11-0.10.2.1.tgz
3. Install Kafka
Upload the package to one of the machines, node01, and extract it into /bigdata

tar -zxvf kafka_2.11-0.10.2.1.tgz -C /bigdata

Create a symlink

ln -s /bigdata/kafka_2.11-0.10.2.1 /usr/local/kafka

4. Add to the environment variables: vim /etc/profile
Add:

export KAFKA_HOME=/usr/local/kafka
export PATH=$PATH:${KAFKA_HOME}/bin

Reload the environment variables: source /etc/profile
5. Edit the configuration file

cd /usr/local/kafka/config
vim server.properties

6. Create the kafka-logs directory in /usr/local/kafka

mkdir /usr/local/kafka/kafka-logs

7. Use scp to copy the configured Kafka installation to node02 and node03

scp -r /bigdata/kafka_2.11-0.10.2.1 root@node02:/bigdata/
scp -r /bigdata/kafka_2.11-0.10.2.1 root@node03:/bigdata/

8. Edit server.properties on node02 and node03; the full file is listed below
8.1 Changes in server.properties on node02

broker.id=1
host.name=node02

8.2 Changes in server.properties on node03

broker.id=2
host.name=node03

9. Start Kafka on each of node01, node02 and node03
cd /usr/local/kafka
With the -daemon option, Kafka starts as a daemon process

bin/kafka-server-start.sh -daemon config/server.properties

10. Log directory
By default, logs are written to the logs directory created under the Kafka installation path
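In step 8 the broker.id is simply the node number minus one (node01 is 0, node02 is 1, node03 is 2). A sketch of that mapping, assuming hostnames of the form nodeNN; `broker_id_for` is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: map node01 -> 0, node02 -> 1, node03 -> 2, ...
broker_id_for() {
    local n=${1#node}          # strip the "node" prefix, leaving e.g. "02"
    echo $(( 10#$n - 1 ))      # 10# forces base 10 so that "08" is not read as octal
}

for host in node01 node02 node03; do
    echo "$host: broker.id=$(broker_id_for "$host")"
done
```

Any other scheme works as well, provided every broker ends up with a different id.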

 

server.properties

############################# Server Basics #############################

# Each broker's id must be unique; give every broker a different id
broker.id=0

# Port to listen on
port=9092

# Address to listen on
host.name=node01

# Allow topics to be deleted
delete.topic.enable=true


# The number of threads handling network requests
num.network.threads=3

# The number of threads doing disk I/O
num.io.threads=8

# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=102400

# The receive buffer (SO_RCVBUF) used by the socket server
socket.receive.buffer.bytes=102400

# The maximum size of a request that the socket server will accept (protection against OOM)
socket.request.max.bytes=104857600


############################# Log Basics #############################

# Data storage path; the default is under /tmp and should be changed
log.dirs=/usr/local/kafka/kafka-logs

# Default number of partitions for a newly created topic
num.partitions=1

# The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
# This value is recommended to be increased for installations with data dirs located in RAID array.
num.recovery.threads.per.data.dir=1

############################# Log Flush Policy #############################

# Messages are immediately written to the filesystem but by default we only fsync() to sync
# the OS cache lazily. The following configurations control the flush of data to disk.
# There are a few important trade-offs here:
#    1. Durability: Unflushed data may be lost if you are not using replication.
#    2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
#    3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to excessive seeks.
# The settings below allow one to configure the flush policy to flush data after a period of time or
# every N messages (or both). This can be done globally and overridden on a per-topic basis.

# The number of messages to accept before forcing a flush of data to disk
#log.flush.interval.messages=10000

# The maximum amount of time a message can sit in a log before we force a flush
#log.flush.interval.ms=1000

############################# Log Retention Policy #############################

# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.

# Data retention time in hours; 168 hours = 7 days (the default)
log.retention.hours=168

# A size-based retention policy for logs. Segments are pruned from the log as long as the remaining
# segments don't drop below log.retention.bytes. Functions independently of log.retention.hours.
#log.retention.bytes=1073741824

# The maximum size of a log segment file. When this size is reached a new log segment will be created.
log.segment.bytes=1073741824

# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies
log.retention.check.interval.ms=300000

############################# Zookeeper #############################

# ZooKeeper addresses, separated by commas
zookeeper.connect=node01:2181,node02:2181,node03:2181

# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=6000

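To connect to the Kafka cluster from elsewhere on the internal network, for example from IDEA on Windows to Kafka in the virtual machine, add:

listeners=PLAINTEXT://192.168.10.108:9092
advertised.listeners=PLAINTEXT://192.168.10.108:9092

Access over the public internet needs further configuration.

listeners is the address Kafka actually binds to.

advertised.listeners is the listener address exposed to clients and published to ZooKeeper; if it is not set, the value of listeners is used.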

 

Start Kafka on each of the three nodes

 bin/kafka-server-start.sh -daemon config/server.properties

Create a topic

bin/kafka-topics.sh --create --zookeeper node01:2181 --topic topic1 --replication-factor 2 --partitions 2

Show topic details

bin/kafka-topics.sh --describe --zookeeper node01:2181 --topic topic1

List the topics that have been created in Kafka

bin/kafka-topics.sh --list --zookeeper node01:2181

Delete a topic:

bin/kafka-topics.sh --delete --zookeeper node01:2181 --topic topic1

Add partitions

bin/kafka-topics.sh --alter --zookeeper node01:2181 --topic topic1 --partitions 3
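Two Kafka rules apply to the create and alter commands above: the replication factor cannot exceed the number of brokers, and the partition count of an existing topic can only be increased. A sketch of a pre-check that encodes both rules without contacting Kafka; `check_topic_change` is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical pre-check for topic settings.
# $1 = replication factor, $2 = number of brokers,
# $3 = current partition count, $4 = requested partition count
check_topic_change() {
    if [ "$1" -gt "$2" ]; then
        echo "replication factor $1 exceeds $2 brokers"
        return 1
    fi
    if [ "$4" -le "$3" ]; then
        echo "partitions can only be increased ($3 -> $4)"
        return 1
    fi
    echo "ok"
}

check_topic_change 2 3 2 3   # replication factor 2 on 3 brokers, partitions 2 -> 3
```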

 

 

Producer

bin/kafka-console-producer.sh --broker-list node01:9092,node02:9092,node03:9092 --topic topic1

Consumer

bin/kafka-console-consumer.sh --bootstrap-server node01:9092 --from-beginning --topic topic1

 

 

Installation script

#!/bin/bash

tar -zxvf /bigdata/downloads/kafka_2.12-2.2.1.tgz -C /bigdata
# Loop over the nodes
for host in node01 node02 node03; do
        echo ==================$host==================
        # Create the symlink
        ssh $host "ln -s /bigdata/kafka_2.12-2.2.1 /usr/local/kafka"
        ssh $host 'echo "export KAFKA_HOME=/usr/local/kafka" >> /etc/profile'
        ssh $host "echo 'export PATH=\$PATH:\${KAFKA_HOME}/bin' >> /etc/profile"
        #ssh $host 'source /etc/profile' # no effect: it only applies to that ssh session
done
## Overwrite the configuration file
\cp /bigdata/downloads/server.properties /usr/local/kafka/config
#
mkdir -p /usr/local/kafka/kafka-logs
#xsync /bigdata/kafka_2.12-2.2.1
## Set broker.id and host.name on each node
m=0
for host in node01 node02 node03; do
        echo ==================$host==================
        ssh $host "sed -i 's#^broker.id=.*#broker.id=$m#' /usr/local/kafka/config/server.properties"
        ssh $host "sed -i 's#^host.name=.*#host.name=$host#' /usr/local/kafka/config/server.properties"
        let 'm+=1'
done

 

Flume

 Download

Extract

flume-env.sh

export JAVA_HOME=/usr/local/jdk

 

Sqoop

 

 

Spark

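  • Download or upload the Spark package on all nodes, extract it and create a symlink
  • Configure spark-env.sh in the Spark installation directory on all nodes
  • Configure slaves
  • Configure spark-defaults.conf
  • Configure environment variables on all nodes

spark-env.sh

[root@node01 conf]# mv spark-env.sh.template spark-env.sh
[root@node01 conf]# vi spark-env.sh

Add: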

export JAVA_HOME=/usr/local/jdk
#export SCALA_HOME=/software/scala-2.11.8
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
# Memory allocated to the Spark daemons themselves (master, worker, history server)
#export SPARK_DAEMON_MEMORY=512m
# The next line is the Spark high-availability setting. It is required on the masters for master HA and on the slaves for slave HA; setting it on all nodes is recommended.
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=node01:2181,node02:2181,node03:2181 -Dspark.deploy.zookeeper.dir=/spark"

# With Spark HA enabled, the next line should stay commented out (the HA master nodes are passed with --master when an application is submitted)
#export SPARK_MASTER_IP=master01
#export SPARK_WORKER_MEMORY=1500m
#export SPARK_EXECUTOR_MEMORY=100m

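-Dspark.deploy.recoveryMode=ZOOKEEPER    # The cluster state is maintained in ZooKeeper, and recovery of that state also goes through ZooKeeper. In other words, ZooKeeper provides Spark HA: if the active Master fails, a standby Master reads the cluster state from ZooKeeper and recovers the state of all Workers, Drivers and Applications before it becomes active.
-Dspark.deploy.zookeeper.url=potter2:2181,potter3:2181,potter4:2181,potter5:2181    # List every machine that runs ZooKeeper and could become the active Master (this example comes from a four-machine cluster, so four hosts are listed; the cluster in this guide uses node01:2181,node02:2181,node03:2181).
-Dspark.deploy.zookeeper.dir=/spark    # The ZooKeeper directory where Spark keeps its metadata, including the run state of Spark jobs.
ZooKeeper holds all state of the Spark cluster: the information for every Worker, every Application and every Driver.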

slaves

[root@node03 conf]# mv slaves.template slaves
[root@node03 conf]# vi slaves

Delete localhost and add all three nodes:

node01
node02
node03

 

Configure environment variables

vi /etc/profile

Add:

export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin

source /etc/profile

 

Configure spark-defaults.conf

Spark runs in local mode by default

Change the following entry:

spark.master                     spark://node01:7077,node02:7077,node03:7077
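With several masters, the value is a single spark:// URL that lists every host:port pair, separated by commas. A sketch that builds it from a host list, assuming port 7077 as in the entry above; `spark_master_url` is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: join master hosts into one spark:// URL (7077 is the port used above).
spark_master_url() {
    local url="" host
    for host in "$@"; do
        url="${url:+$url,}$host:7077"   # prepend a comma only when url is non-empty
    done
    echo "spark://$url"
}

spark_master_url node01 node02 node03
```

The same string can be passed to `--master` when an application is submitted.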

 

All of the steps above must be carried out on every node.

 

Startup

Start ZooKeeper

Start Hadoop

On one node:

/usr/local/spark/sbin/start-all.sh

On the other two nodes, start a master separately for high availability:

/usr/local/spark/sbin/start-master.sh

The spark-shell command starts an interactive shell


 

Web UI

node01:8080

node02:8080 

node03:8080

If port 8080 is already in use, Spark by default tries the next port (8081, and so on)

 

Installation script

#!/bin/bash

tar -zxvf /bigdata/downloads/spark-2.4.6-bin-hadoop2.7.tgz -C /bigdata
# Loop over the nodes
for host in node01 node02 node03; do
        echo ==================$host==================
        # Create the symlink
        ssh $host "ln -s /bigdata/spark-2.4.6-bin-hadoop2.7 /usr/local/spark"
        ssh $host "echo 'export SPARK_HOME=/usr/local/spark' >> /etc/profile"
        ssh $host "echo 'export PATH=\$PATH:\$SPARK_HOME/bin' >> /etc/profile"
done
mv /usr/local/spark/conf/spark-env.sh.template /usr/local/spark/conf/spark-env.sh
echo "export JAVA_HOME=/usr/local/jdk" >> /usr/local/spark/conf/spark-env.sh
echo "export HADOOP_HOME=/usr/local/hadoop" >> /usr/local/spark/conf/spark-env.sh
echo "export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop" >> /usr/local/spark/conf/spark-env.sh
echo 'export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=node01:2181,node02:2181,node03:2181 -Dspark.deploy.zookeeper.dir=/spark"' >> /usr/local/spark/conf/spark-env.sh
mv /usr/local/spark/conf/slaves.template /usr/local/spark/conf/slaves
cat /dev/null > /usr/local/spark/conf/slaves
echo "node01" >> /usr/local/spark/conf/slaves
echo "node02" >> /usr/local/spark/conf/slaves
echo "node03" >> /usr/local/spark/conf/slaves
mv /usr/local/spark/conf/spark-defaults.conf.template /usr/local/spark/conf/spark-defaults.conf
echo "spark.master spark://node01:7077,node02:7077,node03:7077" >> /usr/local/spark/conf/spark-defaults.conf

xsync /bigdata/spark-2.4.6-bin-hadoop2.7

https://www.cnblogs.com/aidata/p/11453991.html#_label0

Flink

Download: https://flink.apache.org/downloads.html

flink-1.10.1-bin-scala_2.12

flink-shaded-hadoop-2-uber-2.8.3-10.0.jar

 

Extract

[root@node01 software]# tar -zxvf flink-1.10.1-bin-scala_2.12.tgz -C /bigdata/application/

 

Configure environment variables and create a symlink

ln -s /bigdata/application/flink-1.10.1 /usr/local/flink

 

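Put the Hadoop jar from the official site, flink-shaded-hadoop-2-uber-2.8.3-10.0.jar, into the lib directory

 

Edit flink-conf.yaml

jobmanager.rpc.address: set to the IP address of your master node
taskmanager.heap.mb: total memory available to each TaskManager
taskmanager.numberOfTaskSlots: number of task slots per TaskManager (typically the number of CPUs available on each machine)
parallelism.default: default parallelism for each job
taskmanager.tmp.dirs: temporary directories
jobmanager.heap.mb: maximum memory the JobManager JVM may allocate
jobmanager.rpc.port: 6123
jobmanager.web.port: 8081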

################################################################################
#  Licensed to the Apache Software Foundation (ASF) under one
#  or more contributor license agreements.  See the NOTICE file
#  distributed with this work for additional information
#  regarding copyright ownership.  The ASF licenses this file
#  to you under the Apache License, Version 2.0 (the
#  "License"); you may not use this file except in compliance
#  with the License.  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
# limitations under the License.
################################################################################


#==============================================================================
# Common
#==============================================================================

# The external address of the host on which the JobManager runs and can be
# reached by the TaskManagers and any clients which want to connect. This setting
# is only used in Standalone mode and may be overwritten on the JobManager side
# by specifying the --host <hostname> parameter of the bin/jobmanager.sh executable.
# In high availability mode, if you use the bin/start-cluster.sh script and setup
# the conf/masters file, this will be taken care of automatically. Yarn/Mesos
# automatically configure the host name based on the hostname of the node where the
# JobManager runs.

jobmanager.rpc.address: node03

# The RPC port where the JobManager is reachable.

jobmanager.rpc.port: 6123


# The heap size for the JobManager JVM

jobmanager.heap.size: 1024m


# The heap size for the TaskManager JVM

taskmanager.heap.size: 1024m


# The number of task slots that each TaskManager offers. Each slot runs one parallel pipeline.

taskmanager.numberOfTaskSlots: 2

# The parallelism used for programs that did not specify any other parallelism.

parallelism.default: 2

# The default file system scheme and authority.
# 
# By default file paths without scheme are interpreted relative to the local
# root file system 'file:///'. Use this to override the default and interpret
# relative paths relative to a different file system,
# for example 'hdfs://mynamenode:12345'
#
fs.default-scheme: hdfs://ns/

#==============================================================================
# High Availability
#==============================================================================

# The high-availability mode. Possible options are 'NONE' or 'zookeeper'.
#
high-availability: zookeeper

# The path where metadata for master recovery is persisted. While ZooKeeper stores
# the small ground truth for checkpoint and leader election, this location stores
# the larger objects, like persisted dataflow graphs.
# 
# Must be a durable file system that is accessible from all nodes
# (like HDFS, S3, Ceph, nfs, ...) 
#
high-availability.storageDir: hdfs://ns/flink/ha/



# The list of ZooKeeper quorum peers that coordinate the high-availability
# setup. This must be a list of the form:
# "host1:clientPort,host2:clientPort,..." (default clientPort: 2181)
#
high-availability.zookeeper.quorum: node01:2181,node02:2181,node03:2181
high-availability.zookeeper.path.root: /flink

# ACL options are based on https://zookeeper.apache.org/doc/r3.1.2/zookeeperProgrammers.html#sc_BuiltinACLSchemes
# It can be either "creator" (ZOO_CREATE_ALL_ACL) or "open" (ZOO_OPEN_ACL_UNSAFE)
# The default value is "open" and it can be changed to "creator" if ZK security is enabled
#
# high-availability.zookeeper.client.acl: open

#==============================================================================
# Fault tolerance and checkpointing
#==============================================================================

# The backend that will be used to store operator state checkpoints if
# checkpointing is enabled.
#
# Supported backends are 'jobmanager', 'filesystem', 'rocksdb', or the
# <class-name-of-factory>.
#
state.backend: filesystem

# Directory for checkpoints filesystem, when using any of the default bundled
# state backends.
#
state.checkpoints.dir: hdfs://ns/flink-checkpoints

# Default target directory for savepoints, optional.
#
state.savepoints.dir: hdfs://ns/flink-checkpoints

# Flag to enable/disable incremental checkpoints for backends that
# support incremental checkpoints (like the RocksDB state backend). 
#
# state.backend.incremental: false

#==============================================================================
# Rest & web frontend
#==============================================================================

# The port to which the REST client connects to. If rest.bind-port has
# not been specified, then the server will bind to this port as well.
#
rest.port: 8081

# The address to which the REST client will connect to
#
#rest.address: 0.0.0.0

# Port range for the REST and web server to bind to.
#
#rest.bind-port: 8080-8090

# The address that the REST & web server binds to
#
#rest.bind-address: 0.0.0.0

# Flag to specify whether job submission is enabled from the web-based
# runtime monitor. Uncomment to disable.

web.submit.enable: true

#==============================================================================
# Advanced
#==============================================================================

# Override the directories for temporary files. If not specified, the
# system-specific Java temporary directory (java.io.tmpdir property) is taken.
#
# For framework setups on Yarn or Mesos, Flink will automatically pick up the
# containers' temp directories without any need for configuration.
#
# Add a delimited list for multiple directories, using the system directory
# delimiter (colon ':' on unix) or a comma, e.g.:
#     /data1/tmp:/data2/tmp:/data3/tmp
#
# Note: Each directory entry is read from and written to by a different I/O
# thread. You can include the same directory multiple times in order to create
# multiple I/O threads against that directory. This is for example relevant for
# high-throughput RAIDs.
#
# io.tmp.dirs: /tmp

# Specify whether TaskManager's managed memory should be allocated when starting
# up (true) or when memory is requested.
#
# We recommend to set this value to 'true' only in setups for pure batch
# processing (DataSet API). Streaming setups currently do not use the TaskManager's
# managed memory: The 'rocksdb' state backend uses RocksDB's own memory management,
# while the 'memory' and 'filesystem' backends explicitly keep data as objects
# to save on serialization cost.
#
# taskmanager.memory.preallocate: false

# The classloading resolve order. Possible values are 'child-first' (Flink's default)
# and 'parent-first' (Java's default).
#
# Child first classloading allows users to use different dependency/library
# versions in their application than those in the classpath. Switching back
# to 'parent-first' may help with debugging dependency issues.
#
# classloader.resolve-order: child-first

# The amount of memory going to the network stack. These numbers usually need 
# no tuning. Adjusting them may be necessary in case of an "Insufficient number
# of network buffers" error. The default min is 64MB, the default max is 1GB.
# 
# taskmanager.network.memory.fraction: 0.1
# taskmanager.network.memory.min: 64mb
# taskmanager.network.memory.max: 1gb

#==============================================================================
# Flink Cluster Security Configuration
#==============================================================================

# Kerberos authentication for various components - Hadoop, ZooKeeper, and connectors -
# may be enabled in four steps:
# 1. configure the local krb5.conf file
# 2. provide Kerberos credentials (either a keytab or a ticket cache w/ kinit)
# 3. make the credentials available to various JAAS login contexts
# 4. configure the connector to use JAAS/SASL

# The below configure how Kerberos credentials are provided. A keytab will be used instead of
# a ticket cache if the keytab path and principal are set.

# security.kerberos.login.use-ticket-cache: true
# security.kerberos.login.keytab: /path/to/kerberos/keytab
# security.kerberos.login.principal: flink-user

# The configuration below defines which JAAS login contexts

# security.kerberos.login.contexts: Client,KafkaClient

#==============================================================================
# ZK Security Configuration
#==============================================================================

# Below configurations are applicable if ZK ensemble is configured for security

# Override below configuration to provide custom ZK service name if configured
# zookeeper.sasl.service-name: zookeeper

# The configuration below must match one of the values set in "security.kerberos.login.contexts"
# zookeeper.sasl.login-context-name: Client

#==============================================================================
# HistoryServer
#==============================================================================

# The HistoryServer is started and stopped via bin/historyserver.sh (start|stop)

# Directory to upload completed jobs to. Add this directory to the list of
# monitored directories of the HistoryServer as well (see below).
#jobmanager.archive.fs.dir: hdfs:///completed-jobs/

# The address under which the web-based HistoryServer listens.
#historyserver.web.address: 0.0.0.0

# The port under which the web-based HistoryServer listens.
historyserver.web.port: 8082

# Comma separated list of directories to monitor for completed jobs.
#historyserver.archive.fs.dir: hdfs:///completed-jobs/

# Interval in milliseconds for refreshing the monitored directories.
#historyserver.archive.fs.refresh-interval: 10000

yarn.application-attempts: 10
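
Most of the file above is commented-out defaults, so it can help to print only the settings that matter for this setup. The following is a minimal sketch, not part of Flink: `flink_conf_get` is a hypothetical helper that assumes flat `key: value` lines, and it runs against a throwaway sample containing a few lines copied from the file above rather than against /usr/local/flink/conf/flink-conf.yaml.

```shell
# Hypothetical helper: print the value of one key from a flink-conf.yaml style file.
# Assumes flat "key: value" lines, as in the file above.
flink_conf_get() {
    local key="$1" file="$2"
    # Skip comment lines, match the exact key, and print everything after "key: ".
    awk -v k="$key" '
        /^[ \t]*#/ { next }
        index($0, k ":") == 1 { sub(/^[^:]*:[ \t]*/, ""); print; exit }
    ' "$file"
}

# Throwaway sample with a few lines taken from the file above.
sample_conf="$(mktemp)"
cat > "$sample_conf" <<'EOF'
# The RPC port where the JobManager is reachable.
jobmanager.rpc.address: node03
jobmanager.rpc.port: 6123
high-availability: zookeeper
high-availability.storageDir: hdfs://ns/flink/ha/
#rest.bind-port: 8080-8090
rest.port: 8081
EOF

for key in jobmanager.rpc.address high-availability high-availability.storageDir rest.port; do
    printf '%s = %s\n' "$key" "$(flink_conf_get "$key" "$sample_conf")"
done
```

To check the real file, pass its path as the second argument instead of the sample.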

 

Edit the masters file

node03:8086
node01:8086

 

 

Edit the slaves file

node01
node02
node03
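
Each line of the masters file names a JobManager host and its web UI port, and each line of the slaves file names one TaskManager host. As a rough sketch of how the two files are laid out, the snippet below writes them into a temporary directory (a stand-in for /usr/local/flink/conf) and lists the web UI addresses they imply. `masters_to_urls` is a hypothetical helper, not a Flink tool.

```shell
# Hypothetical helper: turn each "host:port" line of a masters file into a web UI URL.
masters_to_urls() {
    while IFS=: read -r host port; do
        # Skip blank lines and entries without a port.
        [ -n "$host" ] && [ -n "$port" ] && echo "http://$host:$port"
    done < "$1"
    return 0
}

conf_dir=$(mktemp -d)   # stands in for /usr/local/flink/conf
printf '%s\n' node03:8086 node01:8086 > "$conf_dir/masters"
printf '%s\n' node01 node02 node03 > "$conf_dir/slaves"

echo "JobManager web UIs:"
masters_to_urls "$conf_dir/masters"
echo "TaskManager hosts: $(tr '\n' ' ' < "$conf_dir/slaves")"
```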

 

Edit the zoo.cfg file

################################################################################
#  Licensed to the Apache Software Foundation (ASF) under one
#  or more contributor license agreements.  See the NOTICE file
#  distributed with this work for additional information
#  regarding copyright ownership.  The ASF licenses this file
#  to you under the Apache License, Version 2.0 (the
#  "License"); you may not use this file except in compliance
#  with the License.  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
# limitations under the License.
################################################################################

# The number of milliseconds of each tick
tickTime=2000

# The number of ticks that the initial  synchronization phase can take
initLimit=10

# The number of ticks that can pass between  sending a request and getting an acknowledgement
syncLimit=5

# The directory where the snapshot is stored.
# dataDir=/tmp/zookeeper

# The port at which the clients will connect
clientPort=2181

# ZooKeeper quorum peers
server.1=node01:2888:3888
server.2=node02:2888:3888
server.3=node03:2888:3888
# server.2=host:peer-port:leader-port
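
The server entries here name the same three hosts as high-availability.zookeeper.quorum in flink-conf.yaml, and clientPort matches the port used there. One way to cross-check the two files is sketched below. `quorum_from_zoo_cfg` is a hypothetical helper that builds the quorum string from a zoo.cfg style file; the sample file repeats the relevant lines from above.

```shell
# Throwaway copy of the relevant zoo.cfg lines from above.
zoo_cfg=$(mktemp)
cat > "$zoo_cfg" <<'EOF'
clientPort=2181
server.1=node01:2888:3888
server.2=node02:2888:3888
server.3=node03:2888:3888
# server.2=host:peer-port:leader-port
EOF

# Hypothetical helper: build "host1:clientPort,host2:clientPort,..." from zoo.cfg.
quorum_from_zoo_cfg() {
    awk -F'[=:]' '
        /^clientPort=/      { port = $2 }        # client port shared by all peers
        /^server\.[0-9]+=/  { hosts[++n] = $2 }  # host name of each quorum peer
        END {
            for (i = 1; i <= n; i++)
                printf "%s%s:%s", (i > 1 ? "," : ""), hosts[i], port
            print ""
        }
    ' "$1"
}

# Compare this line with high-availability.zookeeper.quorum in flink-conf.yaml.
quorum_from_zoo_cfg "$zoo_cfg"
```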

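Copy the configured flink directory to every node, then set the environment variables and create the symbolic link on each of them

Start

Start the cluster with start-cluster.sh in the bin directory

Open node03:8086 in a browser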
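
Before opening the browser, it can be useful to confirm that the web UI ports from the masters file accept connections. This is a minimal sketch that assumes bash (for its /dev/tcp pseudo-device) and the coreutils `timeout` command; `port_open` is a hypothetical helper, and the host names are the ones used in this guide.

```shell
# Hypothetical helper: succeed if a TCP connection to host:port opens within 2 seconds.
port_open() {
    timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

for target in node03:8086 node01:8086; do
    host="${target%:*}"   # part before the colon
    port="${target#*:}"   # part after the colon
    if port_open "$host" "$port"; then
        echo "$target is reachable"
    else
        echo "$target is not reachable"
    fi
done
```

An open port only shows that something is listening there; the web page itself still has to be checked in the browser.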

 

Installation script

#!/bin/bash

# Extract the Flink archive into /bigdata
tar -zxvf /bigdata/downloads/flink-1.10.1-bin-scala_2.12.tgz -C /bigdata

# On every node: create the symbolic link and add the environment variables
for host in node01 node02 node03; do
        echo "================== $host =================="
        # Create the symbolic link
        ssh $host "ln -s /bigdata/flink-1.10.1 /usr/local/flink"
        # Append FLINK_HOME and the bin directory to /etc/profile
        ssh $host "echo 'export FLINK_HOME=/usr/local/flink' >> /etc/profile"
        ssh $host "echo 'export PATH=\$PATH:\$FLINK_HOME/bin' >> /etc/profile"
done
# Copy the Hadoop jar and the prepared flink-conf.yaml (\cp bypasses any cp alias)
\cp /bigdata/downloads/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar /usr/local/flink/lib
\cp /bigdata/downloads/flink-conf.yaml /usr/local/flink/conf
# Empty the masters and slaves files, then write the entries
cat /dev/null > /usr/local/flink/conf/masters
cat /dev/null > /usr/local/flink/conf/slaves
echo "node01" >> /usr/local/flink/conf/slaves
echo "node02" >> /usr/local/flink/conf/slaves
echo "node03" >> /usr/local/flink/conf/slaves
echo "node03:8086" >> /usr/local/flink/conf/masters
echo "node01:8086" >> /usr/local/flink/conf/masters
# Copy the prepared zoo.cfg
\cp /bigdata/downloads/zoo.cfg /usr/local/flink/conf

# Distribute the configured directory to the other nodes
xsync /bigdata/flink-1.10.1
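
A script of this kind is often run more than once while a cluster is being set up, and the export lines would then pile up in /etc/profile. The following is a sketch, under stated assumptions, of how the symbolic link and profile steps could be made safe to repeat. `append_once` is a hypothetical helper, and the demo uses a temporary directory in place of /bigdata, /usr/local and /etc/profile.

```shell
# Hypothetical helper: append a line to a file only if that exact line is not there yet.
append_once() {
    local line="$1" file="$2"
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

demo_dir=$(mktemp -d)
profile="$demo_dir/profile"          # stands in for /etc/profile
mkdir -p "$demo_dir/flink-1.10.1"    # stands in for /bigdata/flink-1.10.1

# Run the same steps twice to show that the second run changes nothing.
for run in 1 2; do
    # -f replaces an existing link, -n avoids descending into the old link target.
    ln -sfn "$demo_dir/flink-1.10.1" "$demo_dir/flink"
    append_once 'export FLINK_HOME=/usr/local/flink' "$profile"
    append_once 'export PATH=$PATH:$FLINK_HOME/bin' "$profile"
done

cat "$profile"
```

The same two commands could be sent to each node through ssh, as the installation script does.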

 

 ClickHouse

Download the RPM packages from http://repo.red-soft.biz/repos/clickhouse/stable/el7/

The packages were saved into the downloads directory

# Dependencies that may be needed
rpm -ivh downloads/libtool-ltdl-2.4.2-21.el7_2.x86_64.rpm
rpm -ivh downloads/unixODBC-2.3.1-11.el7.x86_64.rpm
yum install libicu.x86_64

# Install the server and the related packages
rpm -ivh downloads/clickhouse-server-common-1.1.54236-4.el7.x86_64.rpm
rpm -ivh downloads/clickhouse-server-1.1.54236-4.el7.x86_64.rpm
rpm -ivh downloads/clickhouse-debuginfo-1.1.54236-4.el7.x86_64.rpm
rpm -ivh downloads/clickhouse-client-1.1.54236-4.el7.x86_64.rpm
rpm -ivh downloads/clickhouse-compressor-1.1.54236-4.el7.x86_64.rpm

# Configuration directory of clickhouse-server
cd /etc/clickhouse-server/

In config.xml, set the listen address (<listen_host>)
Allow remote connections
    <!-- Listen specified host. use :: (wildcard IPv6 address), if you want to accept connections both with IPv4 and IPv6 from everywhere. -->
    <!-- <listen_host>::</listen_host> -->
    <listen_host>0.0.0.0</listen_host>
The port can be changed
<tcp_port>9006</tcp_port>
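
After editing config.xml, the active values can be read back to confirm that the commented-out entry was not picked by mistake. This is a minimal sketch: `xml_value` is a hypothetical helper that only handles a tag and its value on a single line, and it runs against a small sample file rather than /etc/clickhouse-server/config.xml.

```shell
# Hypothetical helper: print the text inside <tag>...</tag> when the element sits on its own line.
# Lines that start with an XML comment are not matched.
xml_value() {
    sed -n "s|^[[:space:]]*<$1>\(.*\)</$1>[[:space:]]*\$|\1|p" "$2"
}

# Throwaway sample repeating the lines shown above.
sample_xml=$(mktemp)
cat > "$sample_xml" <<'EOF'
    <!-- <listen_host>::</listen_host> -->
    <listen_host>0.0.0.0</listen_host>
    <tcp_port>9006</tcp_port>
EOF

echo "listen_host = $(xml_value listen_host "$sample_xml")"
echo "tcp_port    = $(xml_value tcp_port "$sample_xml")"
```

The tcp_port value is the one to pass to clickhouse-client with --port.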
 
     

In users.xml, set the allowed client addresses (<networks><ip>)
Allow connections from any address
<networks incl="networks" replace="replace">
   <ip>::/0</ip>
</networks>

Start the service

clickhouse-server --config-file=/etc/clickhouse-server/config.xml

Connect with the client

clickhouse-client --host=192.168.10.108  --port=9006

Simple operations

show tables;
select 1;

Stop the ClickHouse service: find the process ID, then end that process with kill

ps -aux|grep clickhouse-server

Run the service in the background

nohup clickhouse-server --config-file=/etc/clickhouse-server/config.xml >/dev/null 2>&1 &
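
When the server is started in the background, recording its process ID at start time makes it easier to stop later than searching the process list. The sketch below shows one way to do that. It is not a ClickHouse feature: `stop_by_pid_file` is a hypothetical helper, the PID file location is arbitrary, and `sleep 300` stands in for the clickhouse-server command so the example can run anywhere.

```shell
pid_file=$(mktemp)   # arbitrary location for the recorded process ID

# Stand-in for:
#   nohup clickhouse-server --config-file=/etc/clickhouse-server/config.xml >/dev/null 2>&1 &
nohup sleep 300 >/dev/null 2>&1 &
echo $! > "$pid_file"   # $! is the PID of the last background command

# Hypothetical helper: send SIGTERM to the recorded PID and wait up to 5 seconds for it to exit.
stop_by_pid_file() {
    local pid i
    pid=$(cat "$1")
    kill "$pid" 2>/dev/null
    for i in 1 2 3 4 5; do
        # kill -0 sends no signal, it only tests whether the process still exists.
        kill -0 "$pid" 2>/dev/null || { rm -f "$1"; return 0; }
        sleep 1
    done
    return 1
}

stop_by_pid_file "$pid_file" && echo "stopped"
```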

 

 

 

 

 

 

Source: https://www.cnblogs.com/aidata/p/12343715.html






Article link: https://www.javaxia.com/server/125540.html
