Spark Core Terminology and Architecture

mac, 2022-10-06

Official documentation

http://spark.apache.org/docs/latest/cluster-overview.html

Components

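A Spark application runs as an independent set of processes on the cluster, coordinated by the SparkContext object in your main program (called the driver program).

To run on a cluster, the SparkContext can connect to several types of cluster managers (the standalone cluster manager, Mesos, or YARN).

The cluster manager allocates resources to the application. Once the connection is made and resources are granted, Spark launches executors on nodes in the cluster.

On YARN, this means containers are started on the NodeManagers and each executor runs inside a container. (On YARN, an executor always runs in a container.)

Executors are processes that run computations and store data for your application. Next, the SparkContext sends your application code to the executors. Finally, the SparkContext sends tasks to the executors to run.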

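Each application gets its own executor processes, which stay up for the lifetime of the whole application and run tasks in multiple threads. This isolates applications from one another on both the scheduling side and the execution side. It also means that data cannot be shared across applications, because different applications run in different JVMs, unless the shared data is written to an external storage system. P.S. There is now a fairly good framework that addresses this problem.

Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes and these can communicate with each other, it is relatively easy to run an application under any of the modes.

The driver program must listen for and accept incoming connections from its executors. The driver must therefore be network addressable from the worker nodes.

Because the driver schedules tasks on the cluster, it should run close to the worker nodes, preferably on the same local area network. If you want to send requests to the cluster remotely, it is better to open an RPC to the driver and have it submit operations from nearby than to run the driver far from the worker nodes.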

Glossary

Application

User program built on Spark. Consists of a driver program and executors on the cluster.

That is, an application built on Spark, consisting of one driver and multiple executors.

Application jar

A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries, however, these will be added at runtime.

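A jar that contains the user's Spark application, optionally bundled with the application's dependencies. The user's jar should not include the Hadoop or Spark libraries; these are added at runtime.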

Driver program

The process running the main() function of the application and creating the SparkContext

The process that runs the application's main method and creates the SparkContext.

So every application has a driver.

Cluster manager

An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN)

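An external service used to acquire resources when running on a cluster.

In other words, the mode is selected by pointing the application at a different external service through its parameters.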

Deploy mode

Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.

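Distinguishes where the driver program runs. In cluster mode, the framework launches the driver inside the cluster. In client mode, the submitter launches the driver outside the cluster.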

Taking YARN as an example:

In cluster mode, the driver runs in a container.

In client mode, the driver runs locally on the machine that submitted the application.
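The cluster manager and the deploy mode are both chosen at submission time. As a minimal sketch, the helper below assembles a `spark-submit` command line from those two choices. `--master`, `--deploy-mode` and `--class` are standard `spark-submit` options, while the helper itself, the class name `com.example.WordCount` and the jar `wordcount.jar` are hypothetical placeholders introduced only for illustration.

```python
import shlex


def build_spark_submit(master, deploy_mode, main_class, app_jar, app_args=()):
    """Assemble a spark-submit command line as a list of arguments."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", master,            # which cluster manager to contact
        "--deploy-mode", deploy_mode,  # where the driver process runs
        "--class", main_class,         # class whose main() becomes the driver
        app_jar,                       # the application jar
        *app_args,                     # arguments passed through to main()
    ]


# Hypothetical application: the class name and jar path are placeholders.
cluster_cmd = build_spark_submit(
    "yarn", "cluster", "com.example.WordCount", "wordcount.jar", ["input.txt"]
)
client_cmd = build_spark_submit(
    "yarn", "client", "com.example.WordCount", "wordcount.jar", ["input.txt"]
)
print(shlex.join(cluster_cmd))
print(shlex.join(client_cmd))
```

The two commands differ only in the value after `--deploy-mode`, which matches the description above: the same application and the same cluster manager, with only the location of the driver changing.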

Worker node

Any node that can run application code in the cluster

A node that runs application code is called a worker node.

On YARN, these are the cluster's NodeManagers.

Executor

A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.

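A process launched on a worker node for an application. It runs tasks and caches data in memory or on disk. Each application has its own executors.

Cached data ultimately lives in the executors, each of which corresponds to a YARN container.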

Task

A unit of work that will be sent to one executor

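The unit of work that gets sent to an executor.

Once a job is running, an executor holds both tasks and cached data; the work reaches the executor in the form of tasks.

An RDD is made up of multiple partitions, and each partition corresponds to one task. In other words, structurally an RDD consists of multiple partitions, and at run time each partition is processed by one task.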
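The partition-to-task relationship can be sketched without Spark at all. The toy model below is an assumption-laden illustration, not Spark's implementation: a plain Python list stands in for an RDD, a thread pool stands in for one multi-threaded executor, and the names `split_into_partitions` and `run_stage` are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous chunks (toy partitioner)."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def run_stage(partitions, func, threads=2):
    """Run one task per partition on a thread pool standing in for an executor."""
    def task(partition):
        return [func(x) for x in partition]

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map returns results in the same order as the partitions.
        return list(pool.map(task, partitions))


partitions = split_into_partitions(list(range(10)), 4)
results = run_stage(partitions, lambda x: x * x)
print(len(partitions), "partitions ->", len(results), "task results")
```

Four partitions yield four task results, one per partition, regardless of how many threads the stand-in executor uses to run them.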

Job

A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.

A parallel computation made up of multiple tasks; each job corresponds to one Spark action.

In other words, every action triggers a job.
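The rule that a job is spawned per action can be illustrated with a toy lazy dataset. This is only a sketch under stated assumptions: `LazyDataset` is a hypothetical class written for this example and is not part of Spark, although its `map`, `filter`, `collect` and `count` methods are named after the corresponding RDD operations.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    jobs_run = 0  # number of jobs triggered in this toy "application"

    def __init__(self, data, ops=()):
        self._data = list(data)
        self._ops = tuple(ops)

    def map(self, func):
        # Transformation: returns a new dataset, nothing is computed yet.
        return LazyDataset(self._data, self._ops + (("map", func),))

    def filter(self, pred):
        # Transformation: also lazy.
        return LazyDataset(self._data, self._ops + (("filter", pred),))

    def _run_job(self):
        # Every action goes through here, so one action means one job.
        LazyDataset.jobs_run += 1
        out = self._data
        for kind, func in self._ops:
            if kind == "map":
                out = [func(x) for x in out]
            else:
                out = [x for x in out if func(x)]
        return out

    def collect(self):
        # Action: returns all elements.
        return self._run_job()

    def count(self):
        # Action: returns the number of elements.
        return len(self._run_job())


ds = LazyDataset(range(6)).map(lambda x: x * 2).filter(lambda x: x % 4 == 0)
jobs_before_action = LazyDataset.jobs_run
collected = ds.collect()
counted = ds.count()
print(jobs_before_action, collected, counted, LazyDataset.jobs_run)
```

Chaining `map` and `filter` triggers nothing; the job counter only moves when `collect` or `count` is called, once per call.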

Stage

Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.

Each job is split into smaller sets of tasks, and these sets are called stages. They resemble the map and reduce stages in MapReduce.

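Summary

An application consists of 1 to n jobs.

A job consists of one or more stages.

A stage consists of one or more tasks.

Within a stage, tasks and partitions correspond one to one.

Each shuffle operator splits the job into another stage.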
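The hierarchy in this summary can be worked through with a small toy model. The sketch below is a simplification rather than Spark's scheduler: a job is assumed to be a flat chain of operator names, the set `SHUFFLE_OPS` lists a few RDD operators that typically cause a shuffle, each shuffle operator is placed at the start of the new stage, and the partition counts are made-up numbers.

```python
# A few RDD operators that typically introduce a shuffle (not exhaustive).
SHUFFLE_OPS = frozenset({"reduceByKey", "groupByKey", "join", "repartition"})


def split_into_stages(operators):
    """Cut an ordered operator chain into stages at each shuffle operator."""
    stages = [[]]
    for op in operators:
        # Start a new stage at a shuffle, unless the current one is still empty.
        if op in SHUFFLE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


def count_tasks(stages, partitions_per_stage):
    """Total tasks in a job: one task per partition in every stage."""
    if len(stages) != len(partitions_per_stage):
        raise ValueError("need one partition count per stage")
    return sum(partitions_per_stage)


# Hypothetical word-count style job: 4 input partitions, 2 after the shuffle.
job = ["textFile", "flatMap", "map", "reduceByKey", "map"]
stages = split_into_stages(job)
total_tasks = count_tasks(stages, [4, 2])
print(len(stages), "stages,", total_tasks, "tasks")
```

With one shuffle operator the job falls into two stages, and with 4 and 2 partitions respectively it runs 6 tasks in total.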

Core terms
Application *****: a driver program, executors on the cluster.
Application jar
Driver program *****: main, sc
Cluster manager
Deploy mode
YARN: RM, NM (container)
cluster: the Driver runs in a container
client: the Driver runs locally on the machine you submit from
Does the client have to be inside the cluster? (gateway)
Worker node
Executor *****: a process that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors. A: executor 1, 2, 3; B: executor 1, 2, 3
Task *****: a unit of work that will be sent to one executor. RDD: partitions == task
Job *****: action ==> job
Stage
