Using Prometheus to Help Solve Performance Problems: Getting Started

mac2026-02-09 16

What is Prometheus, and what does it do?

https://prometheus.io/

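It is an open-source monitoring system: it charts data, lets you define your own query rules, and ships client libraries for many languages.

In other words, it runs jobs that scrape data from its own clients (so your project has to integrate its client library, which is intrusive, mind you) and renders the results as charts. Don't panic, let's dive straight in.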

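How do you use Prometheus? Don't be timid, just work through the official getting-started tutorial once.

Download the archive and unpack it: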

https://prometheus.io/download/

Or use Docker:

https://hub.docker.com/u/prom

I shamelessly downloaded the Windows build.

https://github.com/prometheus/prometheus/releases/download/v2.13.1/prometheus-2.13.1.windows-amd64.tar.gz

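Shall we start it up first?

Double-click prometheus.exe. Windows warns that the file may be risky; click "More info" and choose "Run anyway".

Or run it from the command line. This thing starts up really fast.

Out of habit, skim the startup log. Here is what it tells us: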

level=info ts=2019-11-01T02:14:28.256Z caller=main.go:296 msg="no time or size retention was set so using the default time retention" duration=15d

By default, data is retained for 15 days.

level=info ts=2019-11-01T02:14:28.257Z caller=main.go:332 msg="Starting Prometheus" version="(version=2.13.1, branch=HEAD, revision=6f92ce56053866194ae5937012c1bec40f1dd1d9)"

level=info ts=2019-11-01T02:14:28.258Z caller=main.go:333 build_context="(go=go1.13.1, user=root@88e419aa1676, date=20191017-13:31:33)"

level=info ts=2019-11-01T02:14:28.258Z caller=main.go:334 host_details=(windows)

level=info ts=2019-11-01T02:14:28.258Z caller=main.go:335 fd_limits=N/A

level=info ts=2019-11-01T02:14:28.258Z caller=main.go:336 vm_limits=N/A

level=info ts=2019-11-01T02:14:28.261Z caller=main.go:657 msg="Starting TSDB ..."

level=info ts=2019-11-01T02:14:28.261Z caller=web.go:450 component=web msg="Start listening for connections" address=0.0.0.0:9090

level=info ts=2019-11-01T02:14:28.264Z caller=head.go:514 component=tsdb msg="replaying WAL, this may take awhile"

level=info ts=2019-11-01T02:14:28.281Z caller=head.go:562 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0

level=info ts=2019-11-01T02:14:28.282Z caller=main.go:672 fs_type=unknown

level=info ts=2019-11-01T02:14:28.282Z caller=main.go:673 msg="TSDB started"

It started a TSDB (time-series database), which is what most metrics systems use these days. It indexes data by point in time and by time range.
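To make "indexed by point in time and by time range" concrete, here is a toy sketch in Python. It is not how the Prometheus TSDB is implemented; `TinySeries` and its methods are hypothetical names invented for this illustration, and the sample values are made up. It only shows the two lookups a time-series store has to answer.

```python
from bisect import bisect_left, bisect_right


class TinySeries:
    """Toy in-memory series: samples kept sorted by timestamp."""

    def __init__(self):
        self.timestamps = []  # sorted Unix timestamps (seconds)
        self.values = []      # values[i] belongs to timestamps[i]

    def append(self, ts, value):
        # Assumption: samples arrive in time order, as periodic scrapes do.
        if self.timestamps and ts <= self.timestamps[-1]:
            raise ValueError("samples must be appended in increasing time order")
        self.timestamps.append(ts)
        self.values.append(value)

    def at(self, ts):
        """Most recent sample at or before ts, or None if there is none."""
        i = bisect_right(self.timestamps, ts)
        if i == 0:
            return None
        return self.timestamps[i - 1], self.values[i - 1]

    def between(self, start, end):
        """All samples with start <= timestamp <= end."""
        lo = bisect_left(self.timestamps, start)
        hi = bisect_right(self.timestamps, end)
        return list(zip(self.timestamps[lo:hi], self.values[lo:hi]))


series = TinySeries()
for ts, value in [(0, 1.0), (15, 2.0), (30, 4.0), (45, 7.0)]:
    series.append(ts, value)

print(series.at(20))           # latest sample at or before t=20
print(series.between(15, 30))  # samples in the closed range [15, 30]
```

Both lookups are binary searches over the sorted timestamps, which is the reason time-ordered storage suits this kind of data.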

level=info ts=2019-11-01T02:14:28.283Z caller=main.go:743 msg="Loading configuration file" filename=prometheus.yml

It loaded a configuration file, prometheus.yml. Config files are familiar territory, so we will surely be tinkering with this one shortly.

level=info ts=2019-11-01T02:14:28.295Z caller=main.go:771 msg="Completed loading of configuration file" filename=prometheus.yml

level=info ts=2019-11-01T02:14:28.295Z caller=main.go:626 msg="Server is ready to receive web requests."

So let's open the configuration file and take a look. It isn't very complicated.

# my global config (global settings)
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration (alerting)
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
# (rule files are listed here)
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
# (its own metrics for now; this can be extended to other apps)
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

The startup log already showed the listen address (address=0.0.0.0:9090), and the same port appears here as the scrape target. Paste localhost:9090 into a browser. Once the page loads, the server is up, so let's move on to how to use it.

So how do we use it? By editing the configuration file, of course.

Prometheus scrapes data over HTTP, and it even scrapes its own metrics.
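What gets scraped is plain text served at /metrics. The sketch below parses a small, made-up sample of that text format using only the Python standard library. The parser is deliberately simplified (it assumes label values contain no escaped quotes and ignores optional timestamps), and `parse_metrics`, `scrape`, and the sample metric values are names and data invented for this illustration.

```python
import re
from urllib.request import urlopen

# Simplified patterns: metric name, optional {labels}, then a value.
# Assumption: label values contain no escaped double quotes.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def scrape(url="http://localhost:9090/metrics"):
    """Fetch and parse one target; needs a running Prometheus or exporter."""
    with urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))


# Made-up sample data in the text exposition format.
EXAMPLE = """\
# HELP http_requests_total Total HTTP requests (made-up sample data).
# TYPE http_requests_total counter
http_requests_total{code="200",handler="/metrics"} 42
http_requests_total{code="500",handler="/metrics"} 3
process_open_fds 17
"""

for name, labels, value in parse_metrics(EXAMPLE):
    print(name, labels, value)
```

With the server from this walkthrough running, calling `scrape()` should return the same kind of tuples for Prometheus's own metrics.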

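So let's scrape data from an app other than Prometheus itself.

Open prometheus.yml:

Add the following under global. It is optional and does not affect our small demo; as the comment says, these labels are attached when communicating with external systems.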

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

Then add a new job named example-random under scrape_configs.

See, one more job pulling data.

scrape_configs:
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'example-random'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'

      - targets: ['localhost:8082']
        labels:
          group: 'canary'
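It may help to see which labels each target ends up with. The sketch below mirrors the example-random job as plain Python data and expands it into one label set per target: the custom group label, plus the job and instance labels that Prometheus attaches to scraped series. `expand_targets` is a hypothetical helper written for this illustration, not part of Prometheus.

```python
def expand_targets(job_name, static_configs):
    """Return one label dict per target, the way each scraped series is tagged."""
    result = []
    for group in static_configs:
        for address in group["targets"]:
            labels = dict(group.get("labels", {}))  # custom labels, e.g. group
            labels["job"] = job_name                # taken from job_name
            labels["instance"] = address            # defaults to the target address
            result.append(labels)
    return result


# Mirrors the example-random job from the configuration above.
static_configs = [
    {"targets": ["localhost:8080", "localhost:8081"], "labels": {"group": "production"}},
    {"targets": ["localhost:8082"], "labels": {"group": "canary"}},
]

for labels in expand_targets("example-random", static_configs):
    print(labels)
```

These labels are what make it possible to filter or aggregate by group, job, or instance later on.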

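The official example is written in Go. Let's get a taste of it first; later, when we get to performance debugging, we will use Spring Boot as the example.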

Ensure you have the Go compiler installed and have a working Go build environment (with correct GOPATH) set up.

# Fetch the client library code and compile example.
git clone https://github.com/prometheus/client_golang.git
cd client_golang/examples/random
go get -d
go build

# Start 3 example targets in separate terminals:
./random -listen-address=:8080
./random -listen-address=:8081
./random -listen-address=:8082
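If no Go toolchain is at hand, a target can be faked with the Python standard library, since all a target needs to do is serve text at /metrics. This is only a stand-in sketch and not the official example: the metric name `demo_random_value`, the helper names, and the port are assumptions made here, and it exposes a single gauge rather than the rpc_durations_seconds metrics used in the rest of this post.

```python
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metric name, chosen for this sketch only.
METRIC = "demo_random_value"


def render_metrics(value):
    """Render one gauge in the Prometheus text exposition format."""
    return (
        f"# HELP {METRIC} A random value, for demo purposes.\n"
        f"# TYPE {METRIC} gauge\n"
        f"{METRIC} {value}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(random.random()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Block forever, serving /metrics on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


# Show what a scrape would receive; call serve(8080) to actually listen.
print(render_metrics(0.5), end="")
```

Calling `serve(8080)` in a separate terminal gives Prometheus something to scrape at localhost:8080.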

That's it. On the page at localhost:9090, pick a metric name from the dropdown next to the Execute button and you can view its data.
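The same data can also be fetched without the web page, through the HTTP API's instant-query endpoint (/api/v1/query). A minimal sketch with the standard library follows; `build_query_url` and `instant_query` are hypothetical helper names, and the base URL assumes the local server from this walkthrough.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(expr, base="http://localhost:9090"):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


def instant_query(expr, base="http://localhost:9090"):
    """Run the query; needs a Prometheus server listening at `base`."""
    with urlopen(build_query_url(expr, base), timeout=5) as response:
        payload = json.load(response)
    return payload["data"]["result"]


# Only builds the URL, so this runs without a server.
print(build_query_url('rpc_durations_seconds_count{group="canary"}'))
```

With the server and the example targets running, `instant_query("rpc_durations_seconds_count")` should return one entry per matching series.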

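But usually we want aggregates such as counts and averages, and looking at raw series is not enough. Computing these ad hoc in production is a poor idea, because evaluating them at query time costs time and resources. It is better to define them in advance, and this is where rules come in.

Create prometheus.rules.yml in the same directory as prometheus.yml.

Add the following: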

groups:
- name: example
  rules:
  - record: job_service:rpc_durations_seconds_count:avg_rate5m
    expr: avg(rate(rpc_durations_seconds_count[5m])) by (job, service)
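To see roughly what this expression computes, here is a simplified sketch in Python over made-up counter samples. It is not PromQL's actual algorithm: `simple_rate` ignores counter resets and the extrapolation that rate() applies at the window edges, and both function names are hypothetical. The idea is just "per-second increase over the window, then averaged per (job, service) group".

```python
from collections import defaultdict


def simple_rate(samples):
    """Per-second increase between the first and last sample in the window.

    Simplification: ignores counter resets and the extrapolation that
    PromQL's rate() applies at the window edges.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def avg_rate_by(series, keys):
    """Average simple_rate over series sharing the same values for `keys`."""
    groups = defaultdict(list)
    for labels, samples in series:
        group_key = tuple(labels[k] for k in keys)
        groups[group_key].append(simple_rate(samples))
    return {key: sum(rates) / len(rates) for key, rates in groups.items()}


# Made-up counter samples as (timestamp_seconds, value) over a 5-minute window.
series = [
    ({"job": "example-random", "service": "uniform", "instance": "localhost:8080"},
     [(0, 0), (300, 3000)]),    # 10 per second
    ({"job": "example-random", "service": "uniform", "instance": "localhost:8081"},
     [(0, 0), (300, 6000)]),    # 20 per second
    ({"job": "example-random", "service": "normal", "instance": "localhost:8080"},
     [(0, 100), (300, 1600)]),  # 5 per second
]

for key, value in avg_rate_by(series, ("job", "service")).items():
    print(key, value)
```

The instance label disappears from the result because the grouping keeps only job and service, which is what the "by (job, service)" clause asks for.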

Add this to prometheus.yml. If you saved the rules file somewhere else, adjust the path accordingly.

rule_files:
  - 'prometheus.rules.yml'

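Restart, open localhost:9090 as before, and enter job_service:rpc_durations_seconds_count:avg_rate5m

Execute it, and take a look at both the Graph and Console tabs.

In fact, the Status menu has Configuration, Rules, and Targets pages. Just open them and click through; there is no need to type anything by hand.

Alright, that counts as getting started.

The next post will cover how to use Prometheus to help solve performance problems.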
