3 Distributed Computing and Resource Scheduling
Overview of Distributed Computing
What is distributed computing?
In the broad sense, computation means drawing conclusions from data, and producing those conclusions is the purpose of computing.
Distributed computing means carrying out that processing in a distributed fashion across many machines to obtain the result:
- The data is too large for a single computer to process on its own.
- Many machines working together win by sheer numbers over a single machine.
Distributed computing patterns
Scatter-gather pattern (MapReduce)
- The data is partitioned, and each of several servers processes its own portion.
- The partial results are then gathered on one host and aggregated.
Central-scheduling, staged-execution pattern (Spark, Flink)
Intermediate results are exchanged between machines during the computation:
- One node acts as the central scheduler and manager.
- The computation is divided into several stages.
- The scheduler arranges each machine's execution and the data exchanges between stages.
MapReduce
MapReduce is one of the components of Hadoop.
It exposes two programming interfaces:
- Map: scatters the data, processing each partition independently.
- Reduce: gathers and aggregates the partial results.
Execution principle
The input is split into pieces; Map tasks process the splits in parallel and emit key-value pairs; the framework then shuffles pairs with the same key to the same Reduce task, which aggregates them into the final output (a local analogy is sketched below).
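As a rough analogy built from ordinary shell tools (not Hadoop itself; input.txt is a placeholder file name), word counting follows the same map → shuffle → reduce flow:

# "map": emit one (word, 1) pair per word
# "shuffle": sorting brings identical words together
# "reduce": sum the counts for each word
cat input.txt \
  | tr -s ' ' '\n' \
  | sed 's/$/\t1/' \
  | sort \
  | awk -F'\t' '{count[$1] += $2} END {for (w in count) print w, count[w]}'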
Tip
Hive is a SQL computing framework built on top of MapReduce.
YARN
YARN works hand in hand with MapReduce: MapReduce jobs run on YARN, which performs the resource scheduling.
YARN schedules the resources of the distributed servers in a unified way across the whole cluster, so resources are used according to a plan and under management, which improves efficiency.
Scheduling MapReduce under YARN
YARN architecture
- ResourceManager: the resource scheduler for the entire cluster.
- NodeManager: the resource manager for a single server.
YARN containers
- Resources are reserved up front, then allocated to tasks.
- Computation is virtualized: tasks run encapsulated inside containers managed by the NodeManager (see the sketch below).
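Once the daemons are running (the startup steps appear later in this section), the capacity a NodeManager offers to containers can be inspected with the yarn CLI; <nodeId> below is a placeholder copied from the -list output:

yarn node -list              # NodeManagers registered with the ResourceManager
yarn node -status <nodeId>   # one node's memory/vCore capacity and running containers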
YARN auxiliary roles
- ProxyServer: a web proxy that improves the security of network access to the web UIs.
- JobHistoryServer: collects container logs from across the cluster and makes job history viewable in a browser.
Configuring MapReduce and Deploying YARN
Files to modify
Cluster plan (inferred from the configuration below): node1 runs the ResourceManager, ProxyServer, and JobHistoryServer; node1, node2, and node3 each run a NodeManager.
MapReduce configuration
Add the following to mapred-env.sh (in $HADOOP_HOME/etc/hadoop):
export JAVA_HOME=/export/server/jdk              # JDK installation path
export HADOOP_JOB_HISTORYSERVER_HEAPSIZE=1000    # JobHistoryServer heap size (MB)
export HADOOP_MAPRED_ROOT_LOGGER=INFO,RFA        # log level and rolling-file appender
Add the following to mapred-site.xml (in $HADOOP_HOME/etc/hadoop):
<configuration>
  <!-- Run MapReduce jobs on YARN -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <!-- JobHistoryServer RPC address -->
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>node1:10020</value>
  </property>
  <!-- JobHistoryServer web UI address -->
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>node1:19888</value>
  </property>
  <!-- HDFS path for history data of jobs still in progress -->
  <property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>/data/mr-history/tmp</value>
  </property>
  <!-- HDFS path for history data of completed jobs -->
  <property>
    <name>mapreduce.jobhistory.done-dir</name>
    <value>/data/mr-history/done</value>
  </property>
  <!-- Make HADOOP_MAPRED_HOME visible to the ApplicationMaster and to map/reduce tasks -->
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
</configuration>
YARN configuration
Add the following to yarn-env.sh (in $HADOOP_HOME/etc/hadoop):
export JAVA_HOME=/export/server/jdk              # JDK installation path
export HADOOP_HOME=/export/server/hadoop         # Hadoop installation path
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop   # configuration directory
export HADOOP_LOG_DIR=$HADOOP_HOME/logs          # log directory
Add the following to yarn-site.xml (in $HADOOP_HOME/etc/hadoop):
<configuration>
  <!-- Host that runs the ResourceManager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node1</value>
  </property>
  <!-- Local directory for NodeManager intermediate data -->
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/data/nm-local</value>
  </property>
  <!-- Local directory for NodeManager log files -->
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/data/nm-log</value>
  </property>
  <!-- Auxiliary service required by the MapReduce shuffle -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- URL of the log server (the JobHistoryServer web UI) -->
  <property>
    <name>yarn.log.server.url</name>
    <value>http://node1:19888/jobhistory/logs</value>
  </property>
  <!-- Web application proxy address -->
  <property>
    <name>yarn.web-proxy.address</name>
    <value>node1:8089</value>
  </property>
  <!-- Aggregate container logs into HDFS -->
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <!-- HDFS directory that receives the aggregated logs -->
  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/tmp/logs</value>
  </property>
  <!-- Use the Fair Scheduler -->
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>
</configuration>
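With both sets of files edited on node1, a typical next step (a sketch, assuming the same install path on every node, passwordless SSH for the hadoop user, and Hadoop's bin/sbin directories on PATH) is to distribute the configuration and start the daemons:

# Copy the modified configuration files to the other nodes
cd /export/server/hadoop/etc/hadoop
scp mapred-env.sh mapred-site.xml yarn-env.sh yarn-site.xml node2:/export/server/hadoop/etc/hadoop/
scp mapred-env.sh mapred-site.xml yarn-env.sh yarn-site.xml node3:/export/server/hadoop/etc/hadoop/

# Start the ResourceManager and NodeManagers (in Hadoop 3, start-yarn.sh
# also launches the ProxyServer because yarn.web-proxy.address is configured)
start-yarn.sh

# Start the JobHistoryServer on node1
mapred --daemon start historyserver

# Verify the running Java processes on each node
jps

After startup, the ResourceManager web UI is served on its default port 8088 (http://node1:8088), and job history, per the configuration above, at http://node1:19888.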
Running Computations
Submitting MapReduce jobs to YARN
Hadoop ships with example MapReduce programs:
- wordcount: counts word occurrences
- pi: estimates the value of pi
Running the word-count example
- Put the text to be counted into the HDFS file system (an end-to-end sketch follows below).
- Run the following command to count the words in those files:
[hadoop@node1 ~]$ hadoop jar /export/server/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.5.jar wordcount hdfs://node1:8020/input/ hdfs://node1:8020/output/hl
Tip
Here /input/ means that all files under that directory are included in the count.
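For completeness, an end-to-end sketch (words.txt is a placeholder file name; note that the output directory must not already exist when the job is submitted):

# Create the input directory in HDFS and upload a text file
hdfs dfs -mkdir -p /input
hdfs dfs -put words.txt /input/

# After the job finishes, inspect the result
# (reducer output lands in part-r-* files under the output directory)
hdfs dfs -ls /output/hl
hdfs dfs -cat /output/hl/part-r-00000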
Running the pi example
The first argument (3) is the number of map tasks and the second (10000) is the number of samples each map generates:
hadoop jar /export/server/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.5.jar pi 3 10000
Result:
2023-06-18 14:06:45,427 INFO mapreduce.Job: map 0% reduce 0%
2023-06-18 14:07:01,534 INFO mapreduce.Job: map 67% reduce 0%
2023-06-18 14:09:57,239 INFO mapreduce.Job: map 78% reduce 0%
2023-06-18 14:09:58,243 INFO mapreduce.Job: map 89% reduce 0%
2023-06-18 14:10:13,294 INFO mapreduce.Job: map 89% reduce 22%
2023-06-18 14:10:16,305 INFO mapreduce.Job: map 100% reduce 22%
2023-06-18 14:10:17,309 INFO mapreduce.Job: map 100% reduce 100%
2023-06-18 14:10:17,314 INFO mapreduce.Job: Job job_1687019957893_0008 completed successfully
2023-06-18 14:10:17,457 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=72
FILE: Number of bytes written=1109437
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=783
HDFS: Number of bytes written=215
HDFS: Number of read operations=17
HDFS: Number of large read operations=0
HDFS: Number of write operations=3
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=3
Launched reduce tasks=1
Data-local map tasks=3
Total time spent by all maps in occupied slots (ms)=589338
Total time spent by all reduces in occupied slots (ms)=17633
Total time spent by all map tasks (ms)=589338
Total time spent by all reduce tasks (ms)=17633
Total vcore-milliseconds taken by all map tasks=589338
Total vcore-milliseconds taken by all reduce tasks=17633
Total megabyte-milliseconds taken by all map tasks=603482112
Total megabyte-milliseconds taken by all reduce tasks=18056192
Map-Reduce Framework
Map input records=3
Map output records=6
Map output bytes=54
Map output materialized bytes=84
Input split bytes=429
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=84
Reduce input records=6
Reduce output records=0
Spilled Records=12
Shuffled Maps =3
Failed Shuffles=0
Merged Map outputs=3
GC time elapsed (ms)=669
CPU time spent (ms)=590390
Physical memory (bytes) snapshot=1938477056
Virtual memory (bytes) snapshot=12350963712
Total committed heap usage (bytes)=1853358080
Peak Map Physical memory (bytes)=625938432
Peak Map Virtual memory (bytes)=3091316736
Peak Reduce Physical memory (bytes)=286851072
Peak Reduce Virtual memory (bytes)=3089543168
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=354
File Output Format Counters
Bytes Written=97
Job Finished in 220.245 seconds
Estimated value of Pi is 3.14159264764749259810
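Because log aggregation is enabled in yarn-site.xml, the container logs of a finished job can also be fetched from the command line; the application ID is the job ID shown above with the job_ prefix replaced by application_:

# Pull the aggregated container logs for the pi job above
yarn logs -applicationId application_1687019957893_0008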