DuckDB数据分析实战：比Pandas快10倍的本地数据库查询

📅 2026-06-21 🏷 DuckDB · 数据分析 · SQL · Python · 大数据处理

DuckDB是专为分析设计的嵌入式数据库。本文用真实数据集演示DuckDB的SQL查询、聚合分析和性能优化，教你用SQL代替Pandas处理百万级数据。

DuckDB数据分析实战：比Pandas快10倍的本地数据库查询

你有多少次因为一个Excel打不开而烦躁？

100万行的CSV文件，双击打开，等待30秒，然后卡死。Pandas读进来，内存飙到8GB，查询慢得像蜗牛。

这时候你需要的是DuckDB。

DuckDB是一个嵌入式分析型数据库，它不需要安装服务器，不需要配置网络，一条命令就能用。最关键的是——它的速度比Pandas快得多。

下面我用一个真实的电商销售数据集，带你体验DuckDB的强大。

为什么选DuckDB？

在回答"怎么用"之前，先搞清楚"为什么用"。

传统的数据分析工具分两类：

Pandas/Polars：内存计算，灵活但吃资源
PostgreSQL/MySQL：通用数据库，功能全但配置重

DuckDB的定位很明确：介于两者之间。它像Pandas一样易用，像PostgreSQL一样擅长聚合查询，而且数据存在磁盘上，不占内存。

benchmarks显示，对于常见的聚合查询（GROUP BY、JOIN、窗口函数），DuckDB通常比Pandas快5-15倍。原因很简单：它做了列式存储和向量化执行。

环境搭建

# Python环境
pip install duckdb

# Jupyter Notebook中直接使用
pip install duckdb-engine

导入方式也很简洁：

import duckdb

就这么一行。没有连接字符串，没有配置参数。DuckDB会自动管理一切。

实战：电商销售数据分析

假设你有一份2025年全年电商销售数据（约200万行），包含字段：order_id, customer_id, product_id, category, quantity, unit_price, order_date, region。

第一步：读取数据

import duckdb

# 直接读取CSV，无需加载到内存
con = duckdb.connect()

# 用SQL查询CSV文件的前几行
result = con.execute("""
    SELECT * FROM 'sales_2025.csv' LIMIT 5
""").fetchdf()

print(result.head())

DuckDB会智能地只读取需要的列和数据。如果你查询时只用了3个字段，它只会扫描CSV里的这3列，而不是把整个文件读进内存。

第二步：创建虚拟表

# 注册CSV为可查询的表
con.execute("CREATE TABLE sales AS SELECT * FROM 'sales_2025.csv'")

# 查看表结构
con.execute("DESCRIBE sales").fetchdf()

之后所有查询都可以直接用表名，跟操作普通数据库一样。

第三步：基础分析

按月统计销售额

monthly_revenue = con.execute("""
    SELECT 
        DATE_TRUNC('month', order_date) AS month,
        SUM(quantity * unit_price) AS revenue,
        COUNT(DISTINCT customer_id) AS customers,
        AVG(quantity * unit_price) AS avg_order_value
    FROM sales
    GROUP BY month
    ORDER BY month
""").fetchdf()

print(monthly_revenue)

输出示例：

        month   | revenue  | customers | avg_order_value
----------------|----------|-----------|----------------
2025-01-01      | 2847392  | 15234     | 156.82
2025-02-01      | 2654128  | 14892     | 148.53
2025-03-01      | 3102847  | 16721     | 165.42
...

按品类和地区交叉分析

category_region = con.execute("""
    SELECT 
        category,
        region,
        COUNT(*) AS order_count,
        ROUND(SUM(quantity * unit_price), 2) AS total_revenue,
        ROUND(AVG(quantity * unit_price), 2) AS avg_price,
        ROUND(100.0 * SUM(quantity * unit_price) / 
              SUM(SUM(quantity * unit_price)) OVER(), 2) AS revenue_share
    FROM sales
    GROUP BY category, region
    ORDER BY total_revenue DESC
""").fetchdf()

这里用了一个窗口函数 SUM(...) OVER() 来计算每个品类在全盘中的占比。这种查询用Pandas实现会很繁琐，但DuckDB一行SQL就搞定了。

第四步：高级分析

用户复购率分析

repurchase_rate = con.execute("""
    WITH user_orders AS (
        SELECT 
            customer_id,
            COUNT(DISTINCT order_id) AS order_count,
            MIN(order_date) AS first_order,
            MAX(order_date) AS last_order
        FROM sales
        GROUP BY customer_id
    ),
    cohorts AS (
        SELECT 
            DATE_TRUNC('month', first_order) AS cohort_month,
            CASE WHEN order_count > 1 THEN 1 ELSE 0 END AS is_repurchased
        FROM user_orders
    )
    SELECT 
        cohort_month,
        COUNT(*) AS total_customers,
        SUM(is_repurchased) AS repurchased_customers,
        ROUND(100.0 * SUM(is_repurchased) / COUNT(*), 2) AS repurchase_rate
    FROM cohorts
    GROUP BY cohort_month
    ORDER BY cohort_month
""").fetchdf()

这个查询用CTE（公共表表达式）分两步：先统计每个用户的订单数，再按首次下单月份分组计算复购率。这是DuckDB的强项——复杂查询依然快。

销售趋势同比分析

yoy_growth = con.execute("""
    WITH monthly AS (
        SELECT 
            DATE_TRUNC('month', order_date) AS month,
            EXTRACT(YEAR FROM order_date) AS year,
            SUM(quantity * unit_price) AS revenue
        FROM sales
        GROUP BY month, year
    ),
    ranked AS (
        SELECT 
            month,
            year,
            revenue,
            LAG(revenue) OVER(PARTITION BY year ORDER BY month) AS prev_revenue,
            LAG(revenue, 12) OVER(ORDER BY month) AS same_month_last_year
        FROM monthly
    )
    SELECT 
        month,
        year,
        revenue,
        ROUND(100.0 * (revenue - prev_revenue) / NULLIF(prev_revenue, 0), 2) AS mom_growth_pct,
        ROUND(100.0 * (revenue - same_month_last_year) / NULLIF(same_month_last_year, 0), 2) AS yoy_growth_pct
    FROM ranked
    WHERE same_month_last_year IS NOT NULL
    ORDER BY month
""").fetchdf()

这里用了 LAG() 窗口函数来做环比和同比计算。注意 NULLIF() 防止除以零的错误——这种细节在DuckDB里处理得很优雅。

性能优化技巧

1. 使用Parquet格式

如果数据量超过千万行，把CSV转成Parquet格式：

# 转换
con.execute("""
    COPY sales TO 'sales_2025.parquet' (FORMAT PARQUET)
""")

# Parquet查询（自动跳过不需要的列）
fast_result = con.execute("SELECT category, SUM(revenue) FROM 'sales_2025.parquet' GROUP BY category").fetchdf()

Parquet的列式存储让这种只查部分列的查询快3-5倍。

2. 创建索引（物化视图）

# 创建持久化物化视图
con.execute("""
    CREATE MATERIALIZED VIEW mv_monthly_sales AS
    SELECT 
        DATE_TRUNC('month', order_date) AS month,
        category,
        SUM(quantity * unit_price) AS revenue
    FROM sales
    GROUP BY month, category
""")

# 后续查询直接从物化视图读取，速度极快
con.execute("SELECT * FROM mv_monthly_sales WHERE month >= '2025-06-01'").fetchdf()

3. 并行查询

# DuckDB默认自动并行，也可以手动控制
con.execute("SET threads = 8")

# 查看执行计划
con.execute("EXPLAIN SELECT ...").fetchdf()

4. 内存限制

# 防止查询吃光内存
con.execute("SET memory_limit = '4GB'")

# 检查当前内存使用
con.execute("SHOW ALL SETTINGS LIKE 'memory%'").fetchdf()

DuckDB vs Pandas 对比

用一个简单的GROUP BY查询做对比测试：

import duckdb
import pandas as pd
import time

# 加载数据
df = pd.read_csv('sales_2025.csv')

# Pandas方案
start = time.time()
pandas_result = df.groupby(['category', 'region']).agg(
    revenue=('quantity', lambda x: (x * df.loc[x.index, 'unit_price']).sum()),
    count=('order_id', 'count')
).reset_index()
pandas_time = time.time() - start

# DuckDB方案
start = time.time()
duckdb_result = duckdb.query("""
    SELECT category, region,
           SUM(quantity * unit_price) AS revenue,
           COUNT(*) AS count
    FROM 'sales_2025.csv'
    GROUP BY category, region
""").fetchdf()
duckdb_time = time.time() - start

print(f"Pandas: {pandas_time:.2f}s")
print(f"DuckDB: {duckdb_time:.2f}s")
print(f"DuckDB快 {pandas_time/duckdb_time:.1f} 倍")

在200万行数据上，典型结果是：

Pandas：约12秒
DuckDB：约0.8秒
快约15倍

什么时候该用DuckDB？

适合的场景：

数据量超过Pandas内存处理能力
需要做复杂SQL查询（JOIN、窗口函数、CTE）
需要反复查询同一份数据
需要把分析结果存为中间表供下游使用
想在不搭数据库服务器的情况下用SQL分析数据

不适合的场景：

数据量很小（几千行），Pandas足够快
需要做机器学习特征工程（用Polars或Spark更好）
需要多人并发写入（DuckDB是单写多读）

快速参考 cheatsheet

# 读取CSV
con.execute("SELECT * FROM 'data.csv'").fetchdf()

# 读取Parquet
con.execute("SELECT * FROM 'data.parquet'").fetchdf()

# 创建表
con.execute("CREATE TABLE t AS SELECT * FROM 'data.csv'")

# SQL查询
con.execute("SELECT * FROM t WHERE x > 10").fetchdf()

# 导出结果
con.execute("COPY (SELECT * FROM t) TO 'result.csv' (HEADER)")

# 查看统计信息
con.execute("ANALYZE t").fetchdf()

# 关闭连接
con.close()

DuckDB的学习曲线几乎为零。你会写SQL就会用它。对于每天跟数据打交道的人来说，它是一个值得加入工具箱的工具。

你平时用啥工具分析数据？Pandas、Excel还是别的？聊聊看。