Prometheus Monitoring for Cloud Servers in Practice

Author: 小梦
Published: 2026-03-05
Source: Original

📈 Introduction: Without Monitoring, You Are Running Your Server Blind

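What worries you most once an application is deployed to a cloud server? Most likely a midnight complaint that "the site is down" while you know nothing about the server's health: is the CPU saturated, is memory leaking, is the disk full? If you get an early warning and can locate the cause quickly when a failure occurs, you can keep the damage to a minimum. That is the value of a monitoring system.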

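Prometheus is the de facto monitoring standard of the cloud-native era. Its multi-dimensional data model, flexible query language (PromQL), and dependable alerting make it a go-to choice for monitoring cloud servers. This article walks through building a complete Prometheus monitoring stack on a cloud server from scratch, covering metric collection, alerting, and visualization, so that you can keep a finger on the server's pulse.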

🔍 1. Prometheus Core Architecture and Principles

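Prometheus uses a pull model: each monitored target exposes an HTTP endpoint, and the Prometheus server scrapes metrics from it at a regular interval. Compared with a push model, this keeps the scrape frequency under the server's control and turns a failed scrape into a direct signal that a target is unhealthy. The main components are: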

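Prometheus Server: the core service. It scrapes metrics, stores them in its time-series database (TSDB), and serves PromQL queries.
Exporter: translates the metrics of a system or service into a format Prometheus understands, for example Node Exporter (host metrics) or MySQL Exporter.
Alertmanager: handles the alerts produced by alerting rules, applies grouping, inhibition, and silencing, and sends them to receivers such as email, DingTalk, or Slack.
Grafana: a visualization tool that queries Prometheus and renders the results as dashboards.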

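Prometheus supports four metric types: Counter (only ever increases), Gauge (can go up or down), Histogram, and Summary. In practice, most metrics exposed by Node Exporter are Counters or Gauges: node_cpu_seconds_total is a Counter, for example, while node_memory_MemAvailable_bytes is a Gauge.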
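
To see which metric types an exporter actually exposes, the `# TYPE` lines of its text output can be tallied. The following is a minimal sketch: the function name `count_metric_types` and the inline sample (including its values) are illustrative assumptions, not part of any Prometheus tooling.

```shell
# Summarize metric types from Prometheus text-format metrics read on stdin.
# Each metric family is announced by a line of the form: "# TYPE <name> <type>".
count_metric_types() {
  awk '$1 == "#" && $2 == "TYPE" { count[$4]++ }
       END { for (t in count) print t, count[t] }' | sort
}

# Small made-up sample so the function can be tried without a running exporter.
sample='# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 2.1e+09
# TYPE node_load1 gauge
node_load1 0.42'

printf '%s\n' "$sample" | count_metric_types
# prints:
# counter 1
# gauge 2
```

Against a live exporter the same function can be fed from its metrics endpoint, for example `curl -s http://localhost:9100/metrics | count_metric_types`.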

🚀 2. Hands-On Deployment: Prometheus Server and Node Exporter

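This walkthrough uses a cloud server running Ubuntu 20.04 LTS and installs Prometheus v2.45 and Node Exporter v1.6 from the release binaries. Docker or a package manager would also work, but the binary install makes each step explicit.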

1. Download and start Prometheus Server

# Download Prometheus v2.45.0
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
cd prometheus-2.45.0.linux-amd64

# Create a system user, then install the binaries and console assets
sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo cp prometheus promtool /usr/local/bin/
sudo cp -r consoles console_libraries /etc/prometheus/

# Install the bundled default config as /etc/prometheus/prometheus.yml; targets are added later
sudo cp prometheus.yml /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus

# Create the systemd unit file /etc/systemd/system/prometheus.service, then reload systemd
sudo systemctl daemon-reload
# Start the service and enable it at boot
sudo systemctl start prometheus
sudo systemctl enable prometheus
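
The steps above reference a systemd unit file without listing its contents, and the service cannot start until that file exists. The following is a minimal sketch, assuming the binary, config, and data paths used in the commands above; the `Restart=on-failure` policy and the network ordering lines are assumptions you may want to adjust. It writes the unit to the current directory first so it can be reviewed before installation.

```shell
# Write a minimal unit file to the current directory for review.
# Paths match the install steps above; the restart policy is an assumption.
cat > prometheus.service <<'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# To install it (requires root), copy it into place and reload systemd:
#   sudo install -m 644 prometheus.service /etc/systemd/system/prometheus.service
#   sudo systemctl daemon-reload
```

Once the unit is installed, `sudo systemctl status prometheus` shows whether the service started cleanly.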

2. Deploy Node Exporter

# Download Node Exporter v1.6.0
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz
tar xvf node_exporter-1.6.0.linux-amd64.tar.gz
sudo cp node_exporter-1.6.0.linux-amd64/node_exporter /usr/local/bin/
sudo useradd --no-create-home --shell /bin/false node_exporter

# Create the systemd unit file /etc/systemd/system/node_exporter.service, then reload systemd
sudo systemctl daemon-reload
# Start the service and enable it at boot
sudo systemctl start node_exporter
sudo systemctl enable node_exporter

# Verify the metrics endpoint
curl http://localhost:9100/metrics
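
As with Prometheus, the unit file for Node Exporter is mentioned above but not listed. A minimal sketch, assuming the binary path and user created in the steps above; the restart policy is again an assumption.

```shell
# Write a minimal Node Exporter unit file to the current directory for review.
cat > node_exporter.service <<'EOF'
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# To install it (requires root), copy it into place and reload systemd:
#   sudo install -m 644 node_exporter.service /etc/systemd/system/node_exporter.service
#   sudo systemctl daemon-reload
```

The exporter runs as an unprivileged user here because it only needs to read system statistics.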

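Node Exporter listens on port 9100 by default and exposes a large set of host metrics covering CPU, memory, disk, network, and more. Next, Prometheus needs to be told about this target.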

🎯 3. Configuring Scrape Targets and Key Metrics

Edit the Prometheus configuration file /etc/prometheus/prometheus.yml and add a Node Exporter job under scrape_configs:

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']   # Node Exporter on this host

After restarting Prometheus, the target should show state UP on the Targets page (http://<server-ip>:9090/targets). You can now query some key metrics, for example:

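Metric	Description	Type
node_cpu_seconds_total	Cumulative CPU time in seconds; used to derive utilization	Counter
node_memory_MemAvailable_bytes	Available memory in bytes	Gauge
node_filesystem_avail_bytes	Available filesystem space in bytes	Gauge
node_network_receive_bytes_total	Cumulative bytes received over the network	Counter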

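From these metrics you can derive CPU, memory, and disk utilization. CPU utilization, for example, is: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100). Here rate() gives the per-second increase of the idle counter over the last 5 minutes, which is the fraction of time each core spent idle; averaging across cores and subtracting the percentage from 100 yields the busy percentage.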
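
The arithmetic behind that expression can be checked by hand for a single core. The sketch below is a simplification: the function name `cpu_usage_pct` and the sample values are hypothetical, and the real rate() also handles counter resets and extrapolates to the window edges, which this does not.

```shell
# cpu_usage_pct IDLE_START IDLE_END INTERVAL_SECONDS
# Approximates 100 - rate(idle) * 100 for one core: the idle counter grows by at
# most 1 second per second, so its per-second increase is the idle fraction.
cpu_usage_pct() {
  awk -v s="$1" -v e="$2" -v t="$3" \
    'BEGIN { printf "%.1f\n", 100 - ((e - s) / t) * 100 }'
}

# Hypothetical samples: the idle counter went from 1000 s to 1240 s in a 300 s window,
# so the core was idle 80% of the time.
cpu_usage_pct 1000 1240 300   # prints 20.0
```

A core whose idle counter does not move during the window comes out at 100.0, and one that is idle for the whole window comes out at 0.0.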

⚠️ 4. Alerting: Catch Problems Early

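The point of monitoring is to notify someone promptly when something goes wrong. Prometheus evaluates the alerting rules you define and sends the alerts that fire to Alertmanager.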

1. Define the alerting rules file

Create the file /etc/prometheus/alert_rules.yml with the following rules:

groups:
  - name: host_alerts
    rules:
      - alert: HighCPULoad
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        annotations:
          summary: "CPU usage above 80%"
          description: "{{ $labels.instance }} CPU usage has reached {{ $value }}%"
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 5m
        annotations:
          summary: "Less than 10% disk space left on the root partition"
          description: "{{ $labels.instance }} has only {{ $value }}% space available"

Then reference the file in prometheus.yml:

rule_files:
  - "alert_rules.yml"
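
A typo in either YAML file stops Prometheus from starting, so it is worth validating both before a restart. The helper below is a sketch, assuming promtool was copied to /usr/local/bin as in the install steps and the file paths used above; the function name `check_and_restart_prometheus` is hypothetical.

```shell
# Validate the rules file and the main config with promtool;
# restart Prometheus only if both checks pass.
check_and_restart_prometheus() {
  promtool check rules /etc/prometheus/alert_rules.yml &&
  promtool check config /etc/prometheus/prometheus.yml &&
  sudo systemctl restart prometheus
}
```

Run `check_and_restart_prometheus` after each edit. Once Prometheus is back up, the loaded rules and their current state appear on the Alerts page at http://<server-ip>:9090/alerts.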

2. Deploy Alertmanager

Download Alertmanager and configure a receiver (WeCom, email, and so on). Below is a simple email configuration example:

global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alert@example.com'
  smtp_auth_username: 'alert@example.com'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname']
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: 'admin@example.com'

Once Alertmanager is running, add its address to Prometheus's prometheus.yml and restart Prometheus for the change to take effect.
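
The stanza that points Prometheus at Alertmanager is not spelled out above. A minimal sketch, assuming Alertmanager runs on the same host on its default port 9093; the staging file name `alerting-snippet.yml` is illustrative and only holds the text until you merge it into the top level of prometheus.yml.

```shell
# Stage the alerting stanza in a local file (the file name is illustrative).
cat > alerting-snippet.yml <<'EOF'
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']   # assumes Alertmanager on this host, default port
EOF

# Show the stanza so it can be pasted into /etc/prometheus/prometheus.yml.
cat alerting-snippet.yml
```

The `alerting` key sits at the same indentation level as `scrape_configs` and `rule_files`. If Alertmanager runs elsewhere, replace `localhost:9093` with its host and port.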

📊 5. Visualization: Let the Data Speak with Grafana

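The data is now there, but querying it from the command line is not very intuitive. Grafana pairs naturally with Prometheus and offers a rich set of charts and dashboards.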

1. Install Grafana

sudo apt-get install -y software-properties-common
# Import the repository signing key before adding the repository
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
sudo apt-get update
sudo apt-get install -y grafana
sudo systemctl start grafana-server
sudo systemctl enable grafana-server

Open http://<server-ip>:3000 and log in with the default credentials admin/admin; Grafana asks you to set a new password on first login.

2. Add the Prometheus data source and import a dashboard

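In Grafana, add a Prometheus data source with the URL http://localhost:9090. Then import the widely used Node Exporter Full dashboard (ID 1860) to get a full server monitoring view covering CPU, memory, disk, network, processes, and more.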
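
If you would rather keep the data source in a file than click through the UI, Grafana can load it from a provisioning file at startup. The following is a sketch, assuming the default provisioning directory of the deb package (/etc/grafana/provisioning/datasources/); the file name `prometheus-datasource.yaml` and the data source name are illustrative.

```shell
# Stage a data source provisioning file locally (the file name is illustrative).
cat > prometheus-datasource.yaml <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090   # same URL as entered in the UI
    isDefault: true
EOF

# To apply it (requires root), copy it into the provisioning directory and restart Grafana:
#   sudo cp prometheus-datasource.yaml /etc/grafana/provisioning/datasources/
#   sudo systemctl restart grafana-server
```

A provisioned data source is recreated on every start, which makes the setup reproducible on a fresh server.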

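The figure below shows the resulting Grafana dashboard, which turns dry metrics into readable curves and numbers and makes day-to-day operations considerably more efficient.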

✅ Summary: Monitoring Is Where It All Starts

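We now have a complete Prometheus monitoring stack running on a cloud server, with automatic collection of host metrics, alert notifications, and dashboards. This setup has the following advantages: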

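**Free and open source**, with an active community and a rich ecosystem;
**Easy to extend**: more Exporters can be added at any time to monitor databases and middleware;
**Reliable**: with the pull model and local storage, the monitoring system itself is unlikely to become a bottleneck;
**Data-driven decisions**: historical trends help you scale up or optimize code ahead of time.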

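Of course, building a monitoring system is never a one-off job. As the business changes you will need to keep tuning alert thresholds, adjusting scrape intervals, and watching storage capacity. Later on you can introduce service discovery (for example Consul) to pick up new servers automatically, or integrate Prometheus with Kubernetes to monitor containerized workloads.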

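📌 Action tip: Do not aim for perfection in one go. Start with basic metrics, get the monitoring system running first, and enrich it step by step. Remember that alert accuracy matters more than alert volume: only by cutting false alarms will the team come to trust the monitoring system.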
