Practical Python Tools: A Hands-On Prometheus Client Tutorial, from Beginner to Advanced

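Prometheus is an open-source monitoring and alerting system, and the prometheus_client library is the core tool for connecting Python applications to it: it lets developers define and expose metrics from within a Python program. It works by instantiating metric objects of various types in code, updating them as the program runs, and exposing the collected values over an HTTP endpoint that the Prometheus server scrapes on a schedule. The library is released under the Apache License 2.0. Its strengths are that it is lightweight, easy to use, supports several metric types, and fits the rest of the Prometheus ecosystem; its limitations are that advanced features depend on Prometheus server-side configuration and that it has no built-in data persistence.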

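I. Core basics of the prometheus_client library

1.1 What the library is for

prometheus_client is the official client library for connecting Python applications to the Prometheus monitoring system. It is used to instrument a Python program with metrics: business metrics (API request volume, completed orders), system metrics (CPU utilization, memory usage), and custom metrics (function execution time, failed task counts). These metrics are exposed in a standardized format so that Prometheus can scrape, store, and analyze them, which enables real-time monitoring and alerting for the application.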

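1.2 How it works

Metric definition: the developer creates metric instances of the appropriate type (such as Counter or Gauge) in Python code and can attach labels to distinguish different dimensions of the data.
Data collection: while the program runs, it updates the values by calling methods on the metric instances (for example, a Counter's inc() method).
Metric exposure: the HTTP server provided by the library exposes all metric data in the Prometheus text format on the port passed to start_http_server (the function has no default port; the examples in this tutorial use 8000).
Scraping by Prometheus: the Prometheus server pulls metric data from the exposed endpoint at the configured interval and stores it in its time-series database for later querying and visualization.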
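
The define, update, and expose steps can be sketched without starting a server. This is a minimal illustration, assuming a private CollectorRegistry and a hypothetical metric named demo_requests_total (neither appears in the tutorial's own examples); generate_latest() renders the same text that the HTTP endpoint would serve to a scrape.

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# A private registry keeps this sketch isolated from the global default registry.
registry = CollectorRegistry()

# Step 1: define a metric with one label (the metric name is illustrative).
demo_requests = Counter(
    'demo_requests_total',
    'Requests handled by the demo',
    labelnames=['method'],
    registry=registry,
)

# Step 2: update the metric while the program runs.
demo_requests.labels(method='GET').inc()
demo_requests.labels(method='GET').inc()

# Step 3: render the text exposition format that a scrape would receive.
exposition = generate_latest(registry).decode('utf-8')
print(exposition)
```

The printed text contains HELP and TYPE comment lines followed by one sample line per label combination.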

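1.3 Strengths and limitations

| Aspect | Strengths | Limitations |
| --- | --- | --- |
| Ease of use | Simple API that beginners can pick up quickly; supports the common metric types | Advanced scenarios (such as distributed tracing) require other tools |
| Compatibility | Fits the Prometheus ecosystem; supports Python 3 (the minimum version depends on the release) | No built-in persistence; metric values live in process memory until Prometheus scrapes them |
| Extensibility | Supports custom collectors; labels enable multi-dimensional monitoring | Poor metric naming or label design can make the number of time series balloon |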
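
To make the label-design caveat concrete: every distinct combination of label values becomes its own time series, so a label that carries raw URLs or user IDs grows without bound. The sketch below assumes a hypothetical helper, normalize_endpoint, that collapses numeric path segments before the value is used as a label; it is one possible convention, not something the library provides.

```python
import re


def normalize_endpoint(path: str) -> str:
    """Replace purely numeric path segments with a fixed placeholder."""
    return re.sub(r'/\d+(?=/|$)', '/:id', path)


# Four raw request paths collapse into three label values.
raw_paths = ['/user/1', '/user/2', '/user/42/orders', '/order']
label_values = sorted({normalize_endpoint(p) for p in raw_paths})
print(label_values)
```

With normalization, the number of series depends on the number of routes rather than the number of users.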

1.4 License

prometheus_client is released under the Apache License 2.0.

II. Installing prometheus_client and preparing the environment

2.1 Installation

prometheus_client is published on PyPI and can be installed with pip on all mainstream platforms (Windows, Linux, macOS).

Open a terminal and run:

pip install prometheus-client

After installation, verify it with:

pip show prometheus-client

If the terminal prints the library's version, author, and other metadata, the installation succeeded.
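
The same check can be done from Python with the standard library. This is a small sketch; installed_version is a hypothetical helper name, and it returns None instead of raising when the distribution is not installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints a version string when the library is installed, otherwise None.
print(installed_version('prometheus-client'))
```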

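2.2 Dependencies

Python version: a supported Python 3 release. Older versions of the library ran on Python 3.6, but recent releases require a newer interpreter, so check the PyPI page for the version you install.
Dependencies: no required third-party packages; the library relies only on the Python standard library (http.server, threading, and so on).
Runtime environments: plain Python scripts, Django/Flask web applications, Celery task queues, and similar.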

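III. Core metric types in prometheus_client, with hands-on examples

prometheus_client provides four core metric types, each suited to different monitoring scenarios; choose the one that matches what you are measuring.

3.1 Counter: a monotonically increasing metric

Counter is the most commonly used metric type. It is for values that only ever go up, such as the number of API requests, failed tasks, or errors. Its core method is inc(), which adds 1; inc(n) adds n instead, where n must not be negative.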
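
The increment rules can be checked in a few lines. This sketch assumes a private registry and a hypothetical metric name; it reads the value back with CollectorRegistry.get_sample_value(), which the library intends for tests rather than production code.

```python
from prometheus_client import CollectorRegistry, Counter

registry = CollectorRegistry()
failed_jobs = Counter(
    'sketch_failed_jobs_total',
    'Failed jobs (illustrative metric)',
    registry=registry,
)

failed_jobs.inc()    # adds 1
failed_jobs.inc(3)   # adds 3

# A negative increment is rejected with ValueError, so a Counter never decreases.
try:
    failed_jobs.inc(-1)
    rejected = False
except ValueError:
    rejected = True

total = registry.get_sample_value('sketch_failed_jobs_total')
print(total, rejected)
```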

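Case study: counting API requests

The code below implements a simple HTTP endpoint, uses a Counter to count how many times it is called, and exposes the metric for Prometheus to scrape.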

from prometheus_client import Counter, start_http_server
from http.server import BaseHTTPRequestHandler, HTTPServer

# 1. Define the Counter metric
# Parameters:
# name: metric name; must follow Prometheus naming rules (letters, digits, underscores)
# documentation: help text that describes the metric
# labelnames: optional list of label names for splitting the data by dimension
request_counter = Counter(
    'api_requests_total',
    'Total number of API requests',
    labelnames=['method', 'endpoint']
)

# 2. Define the HTTP request handler
class SimpleAPIHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # 2.1 Dispatch on the request path
        if self.path == '/hello':
            # 2.2 Update the Counter: method is GET, endpoint is /hello
            request_counter.labels(method='GET', endpoint='/hello').inc()
            # 2.3 Build the response
            self.send_response(200)
            self.send_header('Content-type', 'text/html')
            self.end_headers()
            self.wfile.write(b"Hello, Prometheus!")
        else:
            # 2.4 Handle unknown paths
            self.send_response(404)
            self.end_headers()
            self.wfile.write(b"404 Not Found")

if __name__ == '__main__':
    # 3. Start the metrics server
    # start_http_server starts an HTTP server on the given port to expose metrics;
    # pick any free port (8000 here)
    start_http_server(8000)
    print("Prometheus metrics server running on port 8000...")

    # 4. Start the API server
    server_address = ('', 8080)
    httpd = HTTPServer(server_address, SimpleAPIHandler)
    print("API server running on port 8080...")
    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        pass
    httpd.server_close()

Running and verifying the code

Run the code above. The terminal prints:
Prometheus metrics server running on port 8000...
API server running on port 8080...
Open http://localhost:8080/hello in a browser and refresh several times to simulate requests.
Open http://localhost:8000 to see the exposed metrics. The api_requests_total value increases with each request to the endpoint, in this format:
# HELP api_requests_total Total number of API requests
# TYPE api_requests_total counter
api_requests_total{endpoint="/hello",method="GET"} 5.0
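
Scraped text like this can also be checked programmatically. The sketch below feeds the sample output to the parser that ships with the library (prometheus_client.parser); find_sample is a hypothetical helper written for this example, and in a live check the text would come from an HTTP GET against the metrics port.

```python
from prometheus_client.parser import text_string_to_metric_families

# The sample exposition text from the verification step.
scraped_text = '''# HELP api_requests_total Total number of API requests
# TYPE api_requests_total counter
api_requests_total{endpoint="/hello",method="GET"} 5.0
'''


def find_sample(text, name, labels):
    """Return the value of the first sample matching name and labels, or None."""
    for family in text_string_to_metric_families(text):
        for sample in family.samples:
            if sample.name == name and sample.labels == labels:
                return sample.value
    return None


hello_requests = find_sample(
    scraped_text,
    'api_requests_total',
    {'endpoint': '/hello', 'method': 'GET'},
)
print(hello_requests)
```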

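3.2 Gauge: a metric that can go up and down

Gauge is for values that can both increase and decrease, such as memory usage, CPU utilization, the number of online users, or queue length. Its main methods are:

inc(): add 1 (inc(n) adds n)
dec(): subtract 1 (dec(n) subtracts n)
set(n): set the value to n
set_to_current_time(): set the value to the current Unix timestamp
set_function(f): report the return value of f() each time the metric is scraped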
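
A short sketch of Gauge in use, assuming a private registry and hypothetical metric names. Besides inc(), dec(), and set(), it shows two helpers the class provides: track_inprogress(), which increments on entry and decrements on exit, and set_function(), which evaluates a callback at collection time.

```python
from prometheus_client import CollectorRegistry, Gauge

registry = CollectorRegistry()
queue_length = Gauge('sketch_queue_length', 'Items waiting (illustrative)', registry=registry)
in_progress = Gauge('sketch_jobs_in_progress', 'Jobs running (illustrative)', registry=registry)
cache_size = Gauge('sketch_cache_entries', 'Cache entries (illustrative)', registry=registry)

# set / inc / dec: 10 + 5 - 2 leaves the gauge at 13.
queue_length.set(10)
queue_length.inc(5)
queue_length.dec(2)

# track_inprogress: the gauge is 1 inside the block and back to 0 afterwards.
with in_progress.track_inprogress():
    during = registry.get_sample_value('sketch_jobs_in_progress')
after = registry.get_sample_value('sketch_jobs_in_progress')

# set_function: the value is computed from the callback whenever it is read.
cache = {'a': 1, 'b': 2}
cache_size.set_function(lambda: len(cache))
```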

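Case study: monitoring system memory usage

The code below uses the psutil library to read system memory usage and exposes it to Prometheus through Gauge metrics.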

from prometheus_client import Gauge, start_http_server
import psutil
import time

# 1. Define a Gauge for the system memory usage percentage
memory_usage_gauge = Gauge(
    'system_memory_usage_percent',
    'System memory usage percentage'
)

# 2. Define a Gauge for the available system memory (in MB)
available_memory_gauge = Gauge(
    'system_available_memory_mb',
    'System available memory in megabytes'
)

# 3. Update the memory metrics in a loop
def update_memory_metrics():
    while True:
        # 3.1 Read the system memory information
        memory_info = psutil.virtual_memory()
        # 3.2 Update the memory usage percentage
        memory_usage_gauge.set(memory_info.percent)
        # 3.3 Update the available memory (converted to MB)
        available_memory = memory_info.available / 1024 / 1024
        available_memory_gauge.set(available_memory)
        # 3.4 Refresh every 10 seconds
        time.sleep(10)

if __name__ == '__main__':
    # 4. Start the metrics server
    start_http_server(8000)
    print("Metrics server running on port 8000...")
    # 5. Run the update loop in the main thread (it blocks forever)
    update_memory_metrics()

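Code notes

psutil is imported first (install it beforehand with pip install psutil); it provides system hardware information.
Two Gauge metrics are defined, one for the memory usage percentage and one for available memory.
update_memory_metrics reads the memory information in a loop and calls set() to update the metric values.
After starting the script, open http://localhost:8000 to see the current memory metrics.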

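3.3 Histogram: measuring the distribution of values

Histogram records how observed values are distributed, for example API response times or function execution times. It sorts observations into buckets and counts how many fall into each one, and it also records the sum and the total count of all observations.

The key parameter is buckets, which defines the upper bounds of the buckets. The default bounds are [0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5, 10], plus an implicit +Inf bucket.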
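
How observations land in buckets is easiest to see with a few hand-picked values. This sketch assumes a private registry, a hypothetical metric name, and three custom bounds; the exported bucket counts are cumulative, and the library adds the +Inf bucket on its own.

```python
from prometheus_client import CollectorRegistry, Histogram

registry = CollectorRegistry()
latency = Histogram(
    'sketch_latency_seconds',
    'Latency (illustrative metric)',
    buckets=[0.1, 0.5, 1.0],
    registry=registry,
)

# Four observations: one per finite bucket, plus one above the largest bound.
for seconds in (0.05, 0.3, 0.7, 2.0):
    latency.observe(seconds)


def bucket(le):
    """Read the cumulative count of the bucket with the given le label."""
    return registry.get_sample_value('sketch_latency_seconds_bucket', {'le': le})


cumulative = {le: bucket(le) for le in ('0.1', '0.5', '1.0', '+Inf')}
print(cumulative)
```

Each bucket count includes every observation at or below its bound, so the +Inf bucket always equals the total count.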

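Case study: measuring the distribution of function execution time

The code below uses a Histogram to record the distribution of the execution time of process_task and exposes the metric.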

from prometheus_client import Histogram, start_http_server
import time
import random

# 1. Define the Histogram metric
# buckets: custom bucket upper bounds, in seconds
task_duration_histogram = Histogram(
    'task_process_duration_seconds',
    'Distribution of task processing duration',
    buckets=[0.1, 0.2, 0.5, 1.0, 2.0]
)

# 2. Define the function to monitor
@task_duration_histogram.time()
def process_task():
    """Simulated task handler with a random duration."""
    duration = random.uniform(0.05, 2.5)
    time.sleep(duration)
    return f"Task completed in {duration:.2f} seconds"

# 3. Simulate repeated task execution
def run_tasks():
    while True:
        process_task()
        time.sleep(1)

if __name__ == '__main__':
    # 4. Start the metrics server
    start_http_server(8000)
    print("Metrics server running on port 8000...")
    # 5. Run the tasks
    run_tasks()

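Code notes

The @task_duration_histogram.time() decorator automatically measures the decorated function's execution time and records it in the Histogram.
process_task uses random.uniform() to simulate a random duration between 0.05 and 2.5 seconds.
After starting the script, open http://localhost:8000 to see the three parts of the Histogram:
- task_process_duration_seconds_bucket{le="0.1"}: the number of tasks that took at most 0.1 seconds (buckets are cumulative, so each le bucket also includes everything counted in the smaller ones)
- task_process_duration_seconds_sum: the total duration of all tasks
- task_process_duration_seconds_count: the total number of tasks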
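
Percentiles are normally derived from these cumulative buckets on the Prometheus side with histogram_quantile(). The function below is a simplified, standard-library sketch of the linear interpolation idea behind that query, not the server's exact implementation; estimate_quantile and its input format are assumptions made for this example.

```python
import math


def estimate_quantile(q, cumulative_buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    The pairs must be sorted by bound, contain at least two entries,
    and end with the +Inf bucket.
    """
    total = cumulative_buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in cumulative_buckets:
        if count >= rank:
            if math.isinf(bound):
                # Nothing is known above the largest finite bound, so return it.
                return cumulative_buckets[-2][0]
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


buckets = [(0.1, 1), (0.5, 2), (1.0, 3), (math.inf, 4)]
median = estimate_quantile(0.5, buckets)
print(median)
```

The estimate can only be as precise as the bucket boundaries, which is why bounds should be chosen around the latencies that matter for alerting.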

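3.4 Summary: tracking the count and sum of observations

Summary is similar to Histogram in that it records observations, but it needs no predefined buckets. In some other Prometheus client libraries a Summary can also compute quantiles (such as the median, P95, or P99) on the client side. The Python client does not do this: its Summary exposes only the number of observations (_count) and their sum (_sum), from which an average can be derived. For P50/P95/P99 of API response times in Python, use a Histogram and compute the percentiles in PromQL with histogram_quantile().

Case study: tracking API response time

The code below uses a Summary to record the response time of an HTTP endpoint.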

from prometheus_client import Summary, start_http_server
from http.server import BaseHTTPRequestHandler, HTTPServer
import functools
import time
import random

# 1. Define the Summary metric
# The Python client's Summary takes no quantiles argument; it records only
# the number of observations (_count) and their sum (_sum)
request_duration_summary = Summary(
    'api_request_duration_seconds',
    'API request duration distribution'
)

# 2. Decorator that measures a function's execution time
def measure_time(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.perf_counter()
        result = func(*args, **kwargs)
        duration = time.perf_counter() - start_time
        # Record the duration in the Summary
        request_duration_summary.observe(duration)
        return result
    return wrapper

# 3. Define the HTTP request handler
class APIHandler(BaseHTTPRequestHandler):
    @measure_time
    def do_GET(self):
        if self.path == '/data':
            # Simulate data processing time
            time.sleep(random.uniform(0.01, 0.5))
            self.send_response(200)
            self.send_header('Content-type', 'application/json')
            self.end_headers()
            self.wfile.write(b'{"status": "success", "data": "hello world"}')
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == '__main__':
    # 4. Start the metrics server
    start_http_server(8000)
    print("Metrics server running on port 8000...")
    # 5. Start the API server
    server = HTTPServer(('', 8080), APIHandler)
    print("API server running on port 8080...")
    server.serve_forever()

Code notes

The Summary is defined with only a name and help text. The Python client does not compute quantiles, so percentiles such as the median, P95, or P99 are not exported; record the data in a Histogram when you need them.
The custom measure_time decorator measures the function's execution time and calls observe() to record it in the Summary.
After requesting http://localhost:8080/data several times, open http://localhost:8000 to see the Summary's count and sum (the values here are illustrative; a _created timestamp series may also appear, depending on the client version), for example:
# HELP api_request_duration_seconds API request duration distribution
# TYPE api_request_duration_seconds summary
api_request_duration_seconds_count 50.0
api_request_duration_seconds_sum 12.34
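
The Python client's Summary exports a count and a sum, and the usual thing to derive from them is an average. The sketch below assumes a private registry and a hypothetical metric name, and uses the built-in time() decorator in place of a hand-written one.

```python
import time

from prometheus_client import CollectorRegistry, Summary

registry = CollectorRegistry()
handler_seconds = Summary(
    'sketch_handler_seconds',
    'Handler duration (illustrative metric)',
    registry=registry,
)


@handler_seconds.time()
def handle():
    """Stand-in for real work; each call adds one timed observation."""
    time.sleep(0.01)


handle()
handle()
handler_seconds.observe(0.5)  # a manual observation, in seconds

count = registry.get_sample_value('sketch_handler_seconds_count')
total = registry.get_sample_value('sketch_handler_seconds_sum')
average = total / count
print(count, average)
```

In PromQL the equivalent average over a window is usually written as rate(..._sum[5m]) / rate(..._count[5m]).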

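IV. Integrating prometheus_client with web frameworks

In real projects, Python web applications (Flask, Django, and so on) are the main monitoring target. The following sections show how to integrate prometheus_client with Flask and with Django.

4.1 Integrating with Flask

Flask is a lightweight web framework. Integration takes two steps: define the metrics, then register an endpoint that exposes them.

Case study: monitoring a Flask application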

from flask import Flask, g, request
from prometheus_client import Counter, Gauge, generate_latest, CONTENT_TYPE_LATEST
import time
import random

app = Flask(__name__)

# 1. Define the metrics
# 1.1 Request counter
flask_request_counter = Counter(
    'flask_requests_total',
    'Total number of Flask requests',
    labelnames=['endpoint', 'method', 'status_code']
)

# 1.2 Request duration gauge (holds the most recent duration per endpoint)
flask_request_duration_gauge = Gauge(
    'flask_request_duration_seconds',
    'Flask request duration',
    labelnames=['endpoint']
)

# 2. Request hooks that record the metrics
@app.before_request
def before_request():
    g.start_time = time.time()

@app.after_request
def after_request(response):
    # Compute the request duration
    duration = time.time() - g.start_time
    # Update the duration metric
    flask_request_duration_gauge.labels(endpoint=request.endpoint).set(duration)
    # Update the request counter
    flask_request_counter.labels(
        endpoint=request.endpoint,
        method=request.method,
        status_code=response.status_code
    ).inc()
    return response

# 3. Define the business endpoints
@app.route('/user/<int:user_id>')
def get_user(user_id):
    # Simulate database query time
    time.sleep(random.uniform(0.02, 0.2))
    return {"user_id": user_id, "name": "test_user", "age": 20}

@app.route('/order')
def get_order():
    # Simulate request processing time
    time.sleep(random.uniform(0.05, 0.3))
    return {"order_id": "123456", "amount": 99.9}

# 4. Expose the Prometheus metrics endpoint
@app.route('/metrics')
def metrics():
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

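Code notes

The before_request and after_request hooks record the duration and the request count around each request. Because the duration is stored in a Gauge, only the most recent request's duration per endpoint is visible; a Histogram is the better choice when the latency distribution matters.
The /metrics route uses generate_latest() to render the metrics in the format Prometheus expects.
After starting the Flask application, request http://localhost:5000/user/1 and http://localhost:5000/order, then open http://localhost:5000/metrics to see the metrics.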
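
An alternative to a hand-written /metrics route is the WSGI application returned by make_wsgi_app(), which can be mounted next to any WSGI framework. The sketch below assumes a private registry and a hypothetical metric, and calls the WSGI app directly with a synthetic request from the standard library's wsgiref instead of starting Flask; call_wsgi is a helper written for this example.

```python
from wsgiref.util import setup_testing_defaults

from prometheus_client import CollectorRegistry, Counter, make_wsgi_app

registry = CollectorRegistry()
page_views = Counter(
    'sketch_page_views_total',
    'Page views (illustrative metric)',
    registry=registry,
)
page_views.inc()

# A standalone WSGI application that serves the registry's metrics.
metrics_app = make_wsgi_app(registry=registry)


def call_wsgi(app, path='/'):
    """Invoke a WSGI app once with a synthetic GET request."""
    environ = {}
    setup_testing_defaults(environ)
    environ['PATH_INFO'] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured['status'] = status
        captured['headers'] = dict(headers)

    body = b''.join(app(environ, start_response))
    return captured['status'], captured['headers'], body


status, headers, body = call_wsgi(metrics_app, '/metrics')
print(status)
```

In a Flask project this app is typically mounted with Werkzeug's DispatcherMiddleware, along the lines of app.wsgi_app = DispatcherMiddleware(app.wsgi_app, {'/metrics': make_wsgi_app()}); treat that line as a pointer to check against the library documentation rather than tested code.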

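4.2 Integrating with Django

Django is a full-stack web framework. Integrating prometheus_client uses a middleware and a view function.

Step 1: define the metrics

Define the metrics in utils/metrics.py in the Django project: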

from prometheus_client import Counter, Gauge

# Request counter
django_request_counter = Counter(
    'django_requests_total',
    'Total number of Django requests',
    labelnames=['view', 'method', 'status_code']
)

# Request duration gauge (holds the most recent duration per view)
django_request_duration_gauge = Gauge(
    'django_request_duration_seconds',
    'Django request duration',
    labelnames=['view']
)

Step 2: write the middleware

Write a middleware in middleware.py that records the request metrics:

import time
from django.utils.deprecation import MiddlewareMixin
from utils.metrics import django_request_counter, django_request_duration_gauge

class PrometheusMetricsMiddleware(MiddlewareMixin):
    def process_request(self, request):
        request._start_time = time.time()
        return None

    def process_response(self, request, response):
        if hasattr(request, '_start_time'):
            duration = time.time() - request._start_time
            # Get the view name
            view_name = request.resolver_match.view_name if request.resolver_match else 'unknown'
            # Update the metrics
            django_request_duration_gauge.labels(view=view_name).set(duration)
            django_request_counter.labels(
                view=view_name,
                method=request.method,
                status_code=response.status_code
            ).inc()
        return response

Step 3: register the middleware and the metrics view

Register the middleware in the project's settings.py:

MIDDLEWARE = [
    # Other middleware...
    'middleware.PrometheusMetricsMiddleware',
]

Define the view that exposes the metrics in views.py:

from django.http import HttpResponse
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST
from django.views.decorators.csrf import csrf_exempt

@csrf_exempt
def metrics(request):
    return HttpResponse(generate_latest(), content_type=CONTENT_TYPE_LATEST)

Register the URLs in urls.py:

from django.urls import path
from .views import metrics, get_user  # get_user is assumed to be a business view defined in the same views.py

urlpatterns = [
    path('metrics/', metrics),
    path('user/<int:user_id>/', get_user),
]

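Code notes

The middleware's process_request and process_response methods measure the duration around each request.
The /metrics route exposes the metric data.
After starting the Django application, call a business endpoint and then open /metrics to see the monitoring data.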
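
The same idea can be written as a new-style Django middleware, which is a plain callable class and needs no framework import. This is a sketch under stated assumptions: TimingMiddleware, the metric name, and the private registry are hypothetical, a Histogram replaces the Gauge so that the latency distribution is kept, and the demonstration uses stand-in request and response objects rather than a running Django project.

```python
import time
from types import SimpleNamespace

from prometheus_client import CollectorRegistry, Histogram

registry = CollectorRegistry()
REQUEST_SECONDS = Histogram(
    'sketch_django_request_seconds',
    'Request duration by view (illustrative metric)',
    labelnames=['view', 'method', 'status_code'],
    registry=registry,
)


class TimingMiddleware:
    """New-style middleware: built with get_response, called once per request."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        start = time.perf_counter()
        response = self.get_response(request)
        # resolver_match is filled in by Django once the URL has been resolved.
        match = getattr(request, 'resolver_match', None)
        view = match.view_name if match else 'unknown'
        REQUEST_SECONDS.labels(
            view=view,
            method=request.method,
            status_code=str(response.status_code),
        ).observe(time.perf_counter() - start)
        return response


# Stand-in objects that mimic only the attributes the middleware reads.
fake_request = SimpleNamespace(
    method='GET',
    resolver_match=SimpleNamespace(view_name='user-detail'),
)
middleware = TimingMiddleware(lambda request: SimpleNamespace(status_code=200))
fake_response = middleware(fake_request)
```

In a real project the class would be listed in MIDDLEWARE by its dotted path, exactly like the mixin-based version.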

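V. A combined real-world example: monitoring e-commerce orders

This section uses an e-commerce order system to show prometheus_client applied to a realistic business scenario. The metrics cover the number of orders created, the payment success rate, and the order processing time.

5.1 Requirements

Count the total number of orders created, split by PC and mobile.
Track the payment success rate (paid orders / created orders).
Track the distribution of order processing time.

5.2 Implementation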

from prometheus_client import Counter, Gauge, Histogram, start_http_server
import time
import random
import threading

# 1. Define the business metrics
# 1.1 Order creation counter
order_create_counter = Counter(
    'order_create_total',
    'Total number of created orders',
    labelnames=['platform']  # platform: pc/mobile
)

# 1.2 Order payment counter
order_pay_counter = Counter(
    'order_pay_total',
    'Total number of paid orders',
    labelnames=['platform']
)

# 1.3 Payment success rate gauge
order_pay_success_rate_gauge = Gauge(
    'order_pay_success_rate',
    'Order payment success rate',
    labelnames=['platform']
)

# 1.4 Order processing time histogram
order_process_duration_histogram = Histogram(
    'order_process_duration_seconds',
    'Distribution of order processing duration',
    buckets=[0.1, 0.3, 0.5, 1.0]
)

# 2. Simulated order creation
@order_process_duration_histogram.time()
def create_order(platform):
    """Create an order and return its ID."""
    # Simulate order processing time
    time.sleep(random.uniform(0.05, 0.8))
    order_id = f"ORD{int(time.time() * 1000)}{random.randint(100, 999)}"
    # Update the order creation counter
    order_create_counter.labels(platform=platform).inc()
    print(f"Created order {order_id} on {platform} platform")
    return order_id

# 3. Simulated order payment
def pay_order(platform, order_id):
    """Pay for an order with a simulated success rate."""
    pay_success = random.random() > 0.2  # 80% payment success rate
    if pay_success:
        order_pay_counter.labels(platform=platform).inc()
        print(f"Order {order_id} paid successfully")
    else:
        print(f"Order {order_id} payment failed")
    return pay_success

# 4. Compute the payment success rate
def calculate_pay_success_rate():
    while True:
        for platform in ['pc', 'mobile']:
            # Read the created and paid counts.
            # Note: _value is a private attribute of the client and may change
            # between versions; the ratio can also be computed in PromQL from
            # the two counters instead.
            create_count = order_create_counter.labels(platform=platform)._value.get()
            pay_count = order_pay_counter.labels(platform=platform)._value.get()
            # Compute the success rate
            if create_count > 0:
                success_rate = pay_count / create_count
                order_pay_success_rate_gauge.labels(platform=platform).set(success_rate)
        time.sleep(10)

# 5. Simulate business traffic
def run_business():
    platforms = ['pc', 'mobile']
    while True:
        platform = random.choice(platforms)
        order_id = create_order(platform)
        # Simulate the delay before payment
        time.sleep(random.uniform(1, 3))
        pay_order(platform, order_id)
        time.sleep(1)

if __name__ == '__main__':
    # Start the metrics server
    start_http_server(8000)
    print("Metrics server running on port 8000...")

    # Start the success rate calculation thread
    rate_thread = threading.Thread(target=calculate_pay_success_rate, daemon=True)
    rate_thread.start()

    # Start the business thread
    business_thread = threading.Thread(target=run_business, daemon=True)
    business_thread.start()

    # Keep the main thread alive
    while True:
        time.sleep(1)

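Code notes

Four business metrics are defined, covering order creation, payment, success rate, and processing time.
create_order uses the Histogram decorator to record the processing time automatically and also updates the order creation counter.
calculate_pay_success_rate runs in its own thread, computes the payment success rate every 10 seconds, and updates the Gauge.
After starting the script, open http://localhost:8000 to see all the business metrics. They can be used in Prometheus dashboards, for example:
- Use order_create_total to follow the order creation trend per platform
- Use order_pay_success_rate to monitor the payment success rate and trigger an alert when it drops below a threshold
- Use order_process_duration_seconds to analyze the distribution of order processing time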
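
The counting and ratio logic can be verified in isolation, without sleeps, threads, or private attributes. This sketch assumes a private registry, hypothetical metric names, and two helper functions written for the example (record_order and success_rate); it reads values with get_sample_value(), which the library intends for tests.

```python
from prometheus_client import CollectorRegistry, Counter

registry = CollectorRegistry()
orders_created = Counter(
    'sketch_order_create_total',
    'Created orders (illustrative metric)',
    labelnames=['platform'],
    registry=registry,
)
orders_paid = Counter(
    'sketch_order_pay_total',
    'Paid orders (illustrative metric)',
    labelnames=['platform'],
    registry=registry,
)


def record_order(platform, paid):
    """Count one created order and, if it was paid, one paid order."""
    orders_created.labels(platform=platform).inc()
    if paid:
        orders_paid.labels(platform=platform).inc()


def success_rate(platform):
    """Return paid/created for a platform, or None if nothing was created."""
    labels = {'platform': platform}
    created = registry.get_sample_value('sketch_order_create_total', labels) or 0.0
    paid = registry.get_sample_value('sketch_order_pay_total', labels) or 0.0
    return paid / created if created else None


# Three of four PC orders are paid; the single mobile order is not.
for paid in (True, True, True, False):
    record_order('pc', paid)
record_order('mobile', False)

print(success_rate('pc'), success_rate('mobile'))
```

In production the ratio is more commonly computed on the Prometheus side from the two counters, which avoids keeping a derived Gauge in the application.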

VI. Resources

PyPI: https://pypi.org/project/prometheus-client
GitHub: https://github.com/prometheus/client_python
Official documentation: https://prometheus.github.io/client_python/
