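LlamaIndex + Qwen RAG: A Hands-On Guide
Version: LlamaIndex v0.14.x (latest stable release as of 2026)
Audience: LLM application developers
Core models: Qwen (Qwen2.5 / Qwen3 / DashScope API)
Scope: a complete practical guide, from environment setup to a production-grade RAG system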
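1. Framework Overview and Version Notes
1.1 What is LlamaIndex?
LlamaIndex is a Python framework designed for RAG (retrieval-augmented generation). Its core ideas are:
- Data indexing: turn data in any format (PDFs, web pages, databases) into index structures an LLM can consume
- Retrieval as a service: a complete retrieval abstraction layer, from simple vector search to multi-hop reasoning
- Model-agnostic: through the global Settings object you can switch between OpenAI, Claude, Qwen, and other models
1.2 Version History (Key Breaking Changes)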
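| Version | Key change | Impact |
|---|---|---|
| 0.9.x and earlier | Monolithic package | from llama_index import ... |
| 0.10+ | Modular restructuring; global Settings introduced | Core package plus separate integration packages that must be installed individually; configure via Settings.llm = ... (ServiceContext deprecated) |
| 0.11+ | ServiceContext removed | Settings is the only global configuration mechanism |
| 0.14.x | Current release | Stable API; async streaming and agent orchestration supported |
⚠️ Important: from 0.10 onward, the llama-index core package no longer contains model integration code. To use HuggingFace models or a specific API, you must install the corresponding integration package.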
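Because integrations live in separate packages, a missing sub-package only shows up as an ImportError at import time. The helper below is a minimal sketch, using only the standard library, that reports which modules cannot be found before the application starts. The function name `missing_modules` is illustrative and not part of LlamaIndex.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names in module_names that cannot be located for import."""
    missing = []
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # find_spec raises when a parent package (e.g. llama_index) is absent
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    wanted = [
        "llama_index.core",
        "llama_index.llms.openai_like",
        "llama_index.embeddings.huggingface",
    ]
    for name in missing_modules(wanted):
        print(f"missing: {name}")
```

Running this at startup gives a clearer message than a stack trace from deep inside the application.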
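2. Environment Setup and Installation
2.1 Installing Dependencies
# Core framework (required)
pip install llama-index
# Qwen running locally (via HuggingFace)
pip install llama-index-llms-huggingface
pip install llama-index-embeddings-huggingface
# Qwen via API (DashScope / Alibaba Cloud Bailian)
pip install llama-index-llms-openai-like
pip install llama-index-embeddings-dashscope  # needed for DashScopeEmbedding (Section 5, Option A)
# Document parsing
pip install llama-index-readers-web   # web page reader
pip install llama-index-readers-file  # file readers (PDF, Word, etc.)
# Vector databases (needed in production)
pip install llama-index-vector-stores-chroma  # ChromaDB
pip install llama-index-vector-stores-qdrant  # Qdrant
# Other tools (quote the version specifier so the shell does not treat ">" as a redirect)
pip install "transformers>=4.37.0"
pip install torch
pip install sentence-transformers
2.2 Environment Variables
# Option 1: local HuggingFace model (download it beforehand)
export HF_ENDPOINT=https://hf-mirror.com  # mirror for faster downloads in mainland China
export CUDA_VISIBLE_DEVICES=0
# Option 2: DashScope API (recommended; no local GPU needed)
export DASHSCOPE_API_KEY="sk-xxxxxxxxxxxxxxxx"
# Option 3: Alibaba Cloud Bailian
export ALIBABA_API_KEY="sk-xxxxxxxxxxxxxxxx"
3. Core Concepts at a Glance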
Before writing code, you need to understand LlamaIndex's five core abstractions:
┌─────────────────────────────────────────────────────────────┐
│ LlamaIndex core abstractions                                │
├─────────────────────────────────────────────────────────────┤
│ Document    → raw document object (text + metadata)         │
│     ↓                                                       │
│ Node        → document chunk, the smallest retrieval unit   │
│     ↓                                                       │
│ Index       → index (VectorStoreIndex / TreeIndex)          │
│     ↓                                                       │
│ Retriever   → recalls relevant Nodes from the Index         │
│     ↓                                                       │
│ QueryEngine → Retriever + LLM answer synthesis              │
└─────────────────────────────────────────────────────────────┘
Key insight: QueryEngine = Retriever (recall) + ResponseSynthesizer (generation). You can control each of the two stages independently.
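To make the data flow concrete, here is a toy, dependency-free sketch of the same five stages. It is not how LlamaIndex is implemented: the class names (`ToyDocument`, `ToyNode`, `ToyIndex`), the bag-of-words "embedding", and the string-joining "synthesis" are all simplifications assumed for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class ToyNode:
    text: str
    metadata: dict


def split(doc, size=40):
    """Document -> Nodes: fixed-size character chunks."""
    return [ToyNode(doc.text[i:i + size], doc.metadata)
            for i in range(0, len(doc.text), size)]


def embed(text):
    """Toy embedding: bag-of-words counts instead of a dense vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class ToyIndex:
    """Index: stores one (embedding, node) pair per node."""

    def __init__(self, nodes):
        self.entries = [(embed(n.text), n) for n in nodes]

    def retrieve(self, query, top_k=2):
        """Retriever: rank nodes by cosine similarity to the query."""
        q = embed(query)
        scored = [(cosine(q, e), n) for e, n in self.entries]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


def answer(index, query):
    """QueryEngine: retrieve, then 'synthesize' (a real engine calls an LLM here)."""
    return " | ".join(node.text for _, node in index.retrieve(query))
```

A quick usage example: build `ToyIndex` from `split(doc)` over a few `ToyDocument` objects, then call `answer(index, "vector database")` to see which chunks are recalled.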
4. Basic RAG: The Full Pipeline in Five Steps
Step 1: Configure global Settings (the standard approach since 0.10)
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter
# LlamaIndex components read their defaults from Settings
Settings.chunk_size = 1024    # chunk size
Settings.chunk_overlap = 200  # overlap between adjacent chunks (keeps context across boundaries)
Settings.transformations = [SentenceSplitter(chunk_size=1024, chunk_overlap=200)]
Step 2: Load documents
from llama_index.core import SimpleDirectoryReader
# Recognizes PDF, TXT, DOCX, MD, and other formats automatically
documents = SimpleDirectoryReader(
    input_dir="./data",
    required_exts=[".pdf", ".txt", ".md"],  # filter by file type
    recursive=True,                         # descend into subdirectories
).load_data()
print(f"✅ Loaded {len(documents)} documents")
print(f"📄 Preview of the first document: {documents[0].text[:200]}...")
Step 3: Build the vector index
from llama_index.core import VectorStoreIndex
# One call does it all: chunking → embedding → storage
index = VectorStoreIndex.from_documents(
    documents,
    show_progress=True,  # show a progress bar (useful for large files)
)
Step 4: Create a query engine
query_engine = index.as_query_engine(
    similarity_top_k=5,       # recall the top 5 relevant chunks
    response_mode="compact",  # synthesis strategy: compact / tree_summarize / refine
)
Step 5: Run a query
response = query_engine.query("What are the main points of this document?")
print(response)
# Inspect the cited sources (traceability is a core benefit of RAG)
for node in response.source_nodes:
    print(f"📌 Source: {node.metadata.get('file_name', 'unknown')} | Score: {node.score:.4f}")
    print(f"📝 Chunk: {node.text[:150]}...")
    print("-" * 50)
5. Integrating Qwen Models
Qwen can be integrated with LlamaIndex in three common ways. Choose based on your hardware and use case:
Option A: DashScope API (recommended, nothing to operate)
Best for: quick validation, production deployment, environments without a GPU
import os
from llama_index.core import Settings
from llama_index.llms.openai_like import OpenAILike
from llama_index.embeddings.dashscope import DashScopeEmbedding
# LLM: call DashScope through its OpenAI-compatible endpoint
Settings.llm = OpenAILike(
    model="qwen-turbo",  # options: qwen-turbo, qwen-plus, qwen-max
    api_base="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    context_window=8192,
    is_chat_model=True,
    is_function_calling_model=False,  # function calling can stay off for plain RAG
)
# Embedding: DashScope's embedding service (supports Chinese)
Settings.embed_model = DashScopeEmbedding(
    model_name="text-embedding-v2",  # or text-embedding-v1
    api_key=os.getenv("DASHSCOPE_API_KEY"),
)
Option B: Local HuggingFace loading (GPU required)
Best for: strict data-privacy requirements, offline environments, existing GPU servers
import torch
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# ========== Qwen prompt template (required; without it the output format breaks) ==========
def messages_to_prompt(messages):
    """Convert a list of messages to Qwen's ChatML format."""
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|im_start|>system\n{message.content}<|im_end|>\n"
        elif message.role == "user":
            prompt += f"<|im_start|>user\n{message.content}<|im_end|>\n"
        elif message.role == "assistant":
            prompt += f"<|im_start|>assistant\n{message.content}<|im_end|>\n"
    # Add a default system prompt if none was supplied
    if not prompt.startswith("<|im_start|>system"):
        prompt = "<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n" + prompt
    prompt += "<|im_start|>assistant\n"
    return prompt
def completion_to_prompt(completion):
    """Wrap a plain-text completion request in Qwen's ChatML format."""
    return f"<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n{completion}<|im_end|>\n<|im_start|>assistant\n"
# ========== Configure the Qwen LLM ==========
Settings.llm = HuggingFaceLLM(
    model_name="Qwen/Qwen2.5-7B-Instruct",  # or Qwen/Qwen3-8B, Qwen/Qwen2.5-14B-Instruct
    tokenizer_name="Qwen/Qwen2.5-7B-Instruct",
    context_window=32768,  # Qwen2.5 supports a 32K context
    max_new_tokens=2048,
    generate_kwargs={
        "temperature": 0.7,
        "top_p": 0.95,
        "top_k": 50,
        "do_sample": True,
    },
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="auto",  # spread the model across available GPUs
    model_kwargs={"torch_dtype": torch.float16},  # half precision to save GPU memory
)
# ========== Configure the embedding model ==========
# For Chinese documents, bge-m3 or bge-large-zh-v1.5 is strongly recommended
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-m3",  # multilingual, performs well on Chinese
    device="cuda",             # or "cpu"
    trust_remote_code=True,
)
# ========== Chunking configuration ==========
Settings.transformations = [SentenceSplitter(chunk_size=1024, chunk_overlap=200)]
Option C: Local deployment with Ollama (convenient for development and debugging)
Best for: local development, rapid prototyping
# Install Ollama first, then pull the models
ollama pull qwen2.5:7b
ollama pull nomic-embed-text  # or shaw/dmeta-embedding-zh for Chinese
# Python side (requires: pip install llama-index-llms-ollama llama-index-embeddings-ollama)
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core import Settings
Settings.llm = Ollama(model="qwen2.5:7b", request_timeout=120.0)
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
6. Data Loading and Document Parsing
6.1 Loading Local Files
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import PDFReader, DocxReader
# Basic usage: formats are detected automatically
docs = SimpleDirectoryReader("./data").load_data()
# Advanced usage: custom parsers plus metadata injection
reader = SimpleDirectoryReader(
    input_dir="./data",
    file_extractor={
        ".pdf": PDFReader(),
        ".docx": DocxReader(),
    },
    filename_as_id=True,  # use the file name as the document ID for easier tracing
)
docs = reader.load_data()
# Inject metadata manually (important for filtering later)
for doc in docs:
    doc.metadata.update({
        "category": "Technical Docs",
        "author": "Engineering Team",
        "version": "v2.0",
    })
6.2 Loading Web Pages
from llama_index.readers.web import SimpleWebPageReader
# Simple web page reader (converts HTML to plain text)
urls = [
    "https://qwen.readthedocs.io/en/latest/",
    "https://help.aliyun.com/document_detail/xxx.html",
]
docs = SimpleWebPageReader(html_to_text=True).load_data(urls)
# Advanced: use BeautifulSoup directly to keep only the main content
from bs4 import BeautifulSoup
import requests
def custom_scraper(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail early on HTTP errors
    soup = BeautifulSoup(response.text, "html.parser")
    # Extract the main body and drop navigation bars and similar noise
    main_content = soup.find("main") or soup.find("article") or soup.find("body")
    return main_content.get_text(separator="\n", strip=True) if main_content else ""
6.3 Loading from Databases and APIs
from llama_index.core import Document
# Load from a database (example: MySQL)
# import pymysql
# conn = pymysql.connect(...)
# cursor = conn.cursor()
# cursor.execute("SELECT title, content FROM knowledge_base")
# rows = cursor.fetchall()
# docs = [Document(text=row[1], metadata={"title": row[0]}) for row in rows]
# Load from CSV
import pandas as pd
df = pd.read_csv("faq.csv")
docs = [
    Document(
        text=f"Question: {row['question']}\nAnswer: {row['answer']}",
        metadata={"source": "FAQ", "category": row["category"]},
    )
    for _, row in df.iterrows()
]
7. Index Construction and Vector Storage
7.1 In-Memory Index (development and debugging)
from llama_index.core import VectorStoreIndex
# Uses the in-memory SimpleVectorStore by default; data is lost on restart
index = VectorStoreIndex.from_documents(documents)
7.2 Persisting to Local Disk
from llama_index.core import StorageContext, load_index_from_storage
# Save (creates a ./storage directory containing the docstore, index_store, and vector_store)
index.storage_context.persist(persist_dir="./storage")
# Load
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
7.3 Production Vector Database: ChromaDB
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext, VectorStoreIndex
# Create the Chroma client
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("knowledge_base")
# Wrap it as a LlamaIndex VectorStore
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Build the index (vectors are written to ChromaDB)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    show_progress=True,
)
# Later runs can load the index without re-embedding
# (from_vector_store builds its own storage context from the vector store)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
7.4 Production Vector Database: Qdrant (recommended)
import qdrant_client
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
client = qdrant_client.QdrantClient(path="./qdrant_data")  # local mode
# client = qdrant_client.QdrantClient(host="localhost", port=6333)  # server mode
# The collection's vector size must match the embedding model (bge-m3 produces 1024 dimensions).
# It is taken from the vectors at insert time, so re-create the collection if you switch models.
vector_store = QdrantVectorStore(
    client=client,
    collection_name="rag_collection",
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
8. Retrieval Strategies and Query Engines
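8.1 Basic Similarity Retrieval
query_engine = index.as_query_engine(
    similarity_top_k=5,       # number of chunks to recall
    response_mode="compact",  # synthesis strategy
    streaming=False,          # whether to stream the output
)
8.2 Filtered Retrieval (metadata filters)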
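Conceptually, a metadata filter narrows the candidate set before similarity ranking is applied. The sketch below shows that idea in plain Python. It is an illustration only: real vector stores usually push the filter down into the database query, and the helper names `matches` and `filtered_top_k` are made up for this example.

```python
def matches(metadata, filters):
    """True when every key/value pair in filters equals the node's metadata."""
    return all(metadata.get(key) == value for key, value in filters.items())


def filtered_top_k(nodes, filters, top_k):
    """nodes: list of (score, metadata, text). Filter first, then rank by score."""
    kept = [node for node in nodes if matches(node[1], filters)]
    return sorted(kept, key=lambda node: node[0], reverse=True)[:top_k]
```

Note that a high-scoring chunk is dropped if its metadata does not match, which is why consistent metadata at ingestion time matters.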
from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter
# Retrieve only content whose category is "Technical Docs"
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="category", value="Technical Docs")]
)
query_engine = index.as_query_engine(
    similarity_top_k=5,
    filters=filters,  # pass the filter conditions
)
8.3 Custom Retriever + Rerank (essential for advanced RAG)
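The pattern in this section is two-stage: a cheap scorer recalls a wide shortlist, then a more precise (and more expensive) scorer reorders only that shortlist. The sketch below shows the control flow with toy scorers. Word overlap stands in for vector similarity and bigram overlap stands in for a cross-encoder; both are assumptions for illustration, not what the real models compute.

```python
def word_overlap(query, text):
    """Cheap stage-1 score: number of shared words."""
    return len(set(query.split()) & set(text.split()))


def bigram_overlap(query, text):
    """Stage-2 score: number of shared adjacent word pairs (order-sensitive)."""
    def bigrams(s):
        words = s.split()
        return set(zip(words, words[1:]))
    return len(bigrams(query) & bigrams(text))


def retrieve_then_rerank(query, candidates, cheap_score, precise_score,
                         recall_k=20, top_n=5):
    """Wide recall with cheap_score, then reorder the shortlist with precise_score."""
    shortlist = sorted(candidates, key=lambda c: cheap_score(query, c),
                       reverse=True)[:recall_k]
    return sorted(shortlist, key=lambda c: precise_score(query, c),
                  reverse=True)[:top_n]
```

The expensive scorer runs on at most `recall_k` items, which is what keeps a cross-encoder affordable at query time.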
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.core.query_engine import RetrieverQueryEngine
# 1. Wide recall: fetch the top 20 first
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=20,
)
# 2. Precise ranking: a cross-encoder reranker keeps the top 5
reranker = SentenceTransformerRerank(
    model="BAAI/bge-reranker-v2-m3",  # reranker with Chinese support
    top_n=5,
)
# 3. Assemble the custom query engine
query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    node_postprocessors=[reranker],
    response_mode="compact",
)
# Reranking often improves retrieval precision (gains of roughly 10-25% are commonly quoted; measure on your own data)
8.4 Hybrid Retrieval: Vector + Keyword
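Hybrid retrieval has to merge two ranked lists whose scores are not comparable (cosine similarity versus BM25). Reciprocal rank fusion (RRF) sidesteps this by using only ranks. The sketch below is a plain-Python illustration. The constant `k=60` is a commonly used default and should be treated as an assumption here, not as a statement about LlamaIndex internals.

```python
def reciprocal_rank_fusion(rankings, k=60, top_k=5):
    """Fuse several ranked lists of document IDs.

    Each document receives 1 / (k + rank) from every list it appears in;
    documents are then ordered by their summed score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


vector_hits = ["a", "b", "c"]
bm25_hits = ["b", "d", "a"]
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
```

Documents that appear near the top of both lists ("b" and "a" here) outrank documents that appear in only one list.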
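# Requires: pip install llama-index-retrievers-bm25
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever
# Create the two retrievers
vector_retriever = index.as_retriever(similarity_top_k=10)
bm25_retriever = BM25Retriever.from_defaults(index=index, similarity_top_k=10)
# Fused retrieval (RRF algorithm)
retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=5,
    num_queries=1,             # number of queries to generate (1 = no query expansion)
    mode="reciprocal_rerank",  # RRF fusion
)
# Build the engine from the retriever directly; index.as_query_engine() would create its own retriever
query_engine = RetrieverQueryEngine.from_args(retriever)
8.5 Streaming Output (better user experience)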
query_engine = index.as_query_engine(streaming=True)
streaming_response = query_engine.query("Explain how RAG works")
# Print token by token (similar to ChatGPT's typing effect)
for token in streaming_response.response_gen:
    print(token, end="", flush=True)
9. Advanced RAG Techniques
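9.1 Auto-Merging Retriever
This addresses the problem of context being cut at chunk boundaries: when enough of the retrieved chunks belong to the same parent chunk, they are merged automatically into that larger parent context.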
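The merging rule itself is simple and can be sketched without any library. The function below is an illustrative approximation, not LlamaIndex's code: the name `auto_merge`, the dictionary-based parent/child maps, and the 0.5 ratio threshold are assumptions chosen for the example.

```python
from collections import defaultdict


def auto_merge(retrieved_ids, parent_of, children_of, ratio_threshold=0.5):
    """Replace retrieved leaves with their parent when enough siblings were hit.

    retrieved_ids: leaf IDs returned by the vector retriever
    parent_of:     maps leaf ID -> parent ID
    children_of:   maps parent ID -> list of all its leaf IDs
    """
    hits_by_parent = defaultdict(list)
    for node_id in retrieved_ids:
        hits_by_parent[parent_of[node_id]].append(node_id)

    merged = []
    for parent_id, hits in hits_by_parent.items():
        if len(hits) / len(children_of[parent_id]) > ratio_threshold:
            merged.append(parent_id)  # enough siblings: return the larger parent chunk
        else:
            merged.extend(hits)       # otherwise keep the individual leaves
    return merged
```

For example, if three of parent A's four leaves are retrieved but only one of parent B's four, the result contains A as a whole plus the single leaf from B.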
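from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core.storage.docstore import SimpleDocumentStore
# 1. Build the hierarchy: large chunks contain smaller ones
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512],  # level 1: 2048, level 2: 512
)
nodes = node_parser.get_nodes_from_documents(documents)
# 2. Store all nodes (parents and leaves) in a docstore
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)
storage_context = StorageContext.from_defaults(docstore=docstore)
# 3. Build the vector index from the leaf nodes (512) only
#    (relationship keys are NodeRelationship enums, so use the helper instead of a string lookup)
leaf_nodes = get_leaf_nodes(nodes)
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)
# 4. Auto-merging retriever: it needs the storage context to look up parent nodes
retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=10),
    storage_context,
    verbose=True,
)
query_engine = RetrieverQueryEngine.from_args(retriever)
9.2 Multi-Document Agent (routing + tool calling)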
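When the knowledge base contains several kinds of documents (API docs, tutorials, FAQs), let an agent choose the data source automatically:
import asyncio
from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.agent.workflow import ReActAgent
from llama_index.core.tools import QueryEngineTool, ToolMetadata
# Create a separate query engine for each kind of document
api_index = VectorStoreIndex.from_documents(api_docs)
tutorial_index = VectorStoreIndex.from_documents(tutorial_docs)
api_tool = QueryEngineTool(
    query_engine=api_index.as_query_engine(),
    metadata=ToolMetadata(
        name="api_docs",
        description="Detailed parameters, return values, and error codes for the API endpoints.",
    ),
)
tutorial_tool = QueryEngineTool(
    query_engine=tutorial_index.as_query_engine(),
    metadata=ToolMetadata(
        name="tutorials",
        description="Beginner-oriented tutorials and best-practice guides.",
    ),
)
# Create a ReAct agent. ReAct drives tool use through prompting, so native function calling is not required.
# This uses the workflow-based agent API; the older ReActAgent.from_tools(...) belongs to the legacy agent classes.
agent = ReActAgent(
    tools=[api_tool, tutorial_tool],
    llm=Settings.llm,
    system_prompt="You are a technical documentation assistant. Pick the most suitable tool to answer the user's question.",
)
async def main():
    response = await agent.run(user_msg="How do I call the user authentication endpoint?")
    print(response)
asyncio.run(main())
9.3 Query Transformation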
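Query transformation rewrites the query before retrieval. The example here uses HyDE (Hypothetical Document Embeddings): the LLM first drafts a hypothetical answer, and that draft is embedded and used for retrieval, which can improve recall when the question and the documents are phrased differently. (Decomposing a complex question into sub-questions is a separate transformation and is not shown here.)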
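The sketch below illustrates why HyDE can help, using a toy bag-of-words embedding and a stubbed generator instead of a real LLM. Everything in it (`hyde_retrieve`, `embed`, the `generate` callback) is hypothetical scaffolding for the example, not the library's implementation.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def hyde_retrieve(query, corpus, generate, top_k=1, include_original=True):
    """Search with the embedding of a generated hypothetical answer.

    generate: callable that maps the query to a hypothetical answer
              (an LLM call in practice, a stub in this sketch).
    """
    texts = [generate(query)]
    if include_original:
        texts.append(query)
    # Summing the count vectors points in the same direction as their mean
    query_vec = Counter()
    for text in texts:
        query_vec.update(embed(text))
    ranked = sorted(corpus, key=lambda doc: cosine(query_vec, embed(doc)),
                    reverse=True)
    return ranked[:top_k]
```

With a question such as "what are the advantages of the model", the raw query shares almost no vocabulary with a relevant document, while a drafted answer that mentions the expected terms does, so the relevant document moves to the top.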
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine
# HyDE (Hypothetical Document Embeddings): generate a hypothetical answer, then retrieve with it
hyde_transform = HyDEQueryTransform(include_original=True)
hyde_query_engine = TransformQueryEngine(query_engine, hyde_transform)
response = hyde_query_engine.query("What are the strengths of the Qwen models?")
10. Production Deployment Essentials
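10.1 Performance Optimization Checklist
| Item | Development | Production |
|---|---|---|
| Embedding model | CPU / GPU | GPU or a dedicated embedding service |
| Vector storage | In memory / local files | Qdrant / Milvus / Weaviate |
| Index persistence | persist() | Native persistence of the vector database |
| Concurrency | Single-threaded | Async aquery() + connection pooling |
| Chunking | Fixed length | SemanticSplitterNodeParser (semantic chunking) |
| Retrieval strategy | Top-K similarity | Hybrid + Rerank |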
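For the concurrency row, the usual pattern is to fan out async queries while capping how many are in flight, so the LLM or embedding backend is not overwhelmed. The sketch below uses only asyncio. `fake_aquery` is a hypothetical stand-in for `query_engine.aquery`, and the limit of 4 is an arbitrary example value.

```python
import asyncio


async def run_queries(aquery, questions, max_concurrency=4):
    """Run aquery over all questions with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(question):
        async with semaphore:
            return await aquery(question)

    # gather returns results in the same order as the input questions
    return await asyncio.gather(*(guarded(q) for q in questions))


async def fake_aquery(question):
    """Hypothetical stand-in for query_engine.aquery (simulates I/O latency)."""
    await asyncio.sleep(0.01)
    return f"answer to: {question}"


if __name__ == "__main__":
    print(asyncio.run(run_queries(fake_aquery, ["q1", "q2", "q3"])))
```

In a real service, pass `query_engine.aquery` in place of `fake_aquery` and tune `max_concurrency` to the rate limits of the backend.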
10.2 Async API (FastAPI integration example)
from fastapi import FastAPI
from llama_index.core import StorageContext, load_index_from_storage
app = FastAPI()
# Initialize the index once, globally (loaded at startup)
index = None
# Note: newer FastAPI versions prefer lifespan handlers over on_event
@app.on_event("startup")
async def load_index():
    global index
    storage_context = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage_context)
@app.post("/query")
async def query(question: str):
    query_engine = index.as_query_engine()
    # Use the async interface to avoid blocking the event loop
    response = await query_engine.aquery(question)
    return {
        "answer": str(response),
        "sources": [
            {"file": n.metadata.get("file_name"), "score": n.score}
            for n in response.source_nodes
        ],
    }
10.3 Monitoring and Evaluation
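from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler
# Add a debug handler to trace the time spent in each step
debug_handler = LlamaDebugHandler(print_trace_on_end=True)
Settings.callback_manager = CallbackManager([debug_handler])
# After a query finishes, the handler prints a trace of the recorded events
# (retrieval, LLM calls, synthesis) with their durations.
# Token usage is not part of this trace; add a TokenCountingHandler for that.
response = query_engine.query("Test question")
11. Complete Runnable Code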
11.1 Minimal Working Example (DashScope API + local files)
"""
Minimal working RAG system
Dependencies: pip install llama-index llama-index-llms-openai-like llama-index-embeddings-dashscope
"""
import os
from llama_index.core import (
    VectorStoreIndex, SimpleDirectoryReader, Settings, StorageContext, load_index_from_storage
)
from llama_index.llms.openai_like import OpenAILike
from llama_index.embeddings.dashscope import DashScopeEmbedding
# ========== Configuration ==========
Settings.llm = OpenAILike(
    model="qwen-turbo",
    api_base="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    context_window=8192,
    is_chat_model=True,
)
Settings.embed_model = DashScopeEmbedding(
    model_name="text-embedding-v2",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
)
# ========== Build or load the index ==========
if os.path.exists("./storage"):
    storage_context = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage_context)
else:
    documents = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist("./storage")
# ========== Query ==========
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("Describe the key features of the Qwen models")
print(response)
11.2 Full Production-Grade Example (local Qwen + ChromaDB + Rerank)
"""
Production-grade RAG system
Dependencies:
pip install llama-index llama-index-llms-huggingface llama-index-embeddings-huggingface
pip install llama-index-vector-stores-chroma chromadb
pip install transformers torch sentence-transformers
"""
import torch
from llama_index.core import (
    Settings, VectorStoreIndex, SimpleDirectoryReader, StorageContext
)
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.core.query_engine import RetrieverQueryEngine
import chromadb
# ========== 1. Qwen prompt template ==========
def messages_to_prompt(messages):
    """Convert a list of messages to Qwen's ChatML format."""
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|im_start|>system\n{message.content}<|im_end|>\n"
        elif message.role == "user":
            prompt += f"<|im_start|>user\n{message.content}<|im_end|>\n"
        elif message.role == "assistant":
            prompt += f"<|im_start|>assistant\n{message.content}<|im_end|>\n"
    if not prompt.startswith("<|im_start|>system"):
        prompt = "<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n" + prompt
    prompt += "<|im_start|>assistant\n"
    return prompt
# ========== 2. Global configuration ==========
Settings.llm = HuggingFaceLLM(
    model_name="Qwen/Qwen2.5-7B-Instruct",
    tokenizer_name="Qwen/Qwen2.5-7B-Instruct",
    context_window=32768,
    max_new_tokens=2048,
    generate_kwargs={"temperature": 0.7, "top_p": 0.95, "do_sample": True},
    messages_to_prompt=messages_to_prompt,
    device_map="auto",
    model_kwargs={"torch_dtype": torch.float16},
)
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-m3",
    device="cuda",
    trust_remote_code=True,
)
Settings.transformations = [SentenceSplitter(chunk_size=1024, chunk_overlap=200)]
# ========== 3. ChromaDB vector store ==========
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("rag_prod")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# ========== 4. Load or build the index ==========
if chroma_collection.count() > 0:
    # The collection already holds vectors: reuse them without re-embedding
    index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
else:
    documents = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
# ========== 5. Advanced retrieval: wide recall + rerank ==========
retriever = VectorIndexRetriever(index=index, similarity_top_k=20)
reranker = SentenceTransformerRerank(model="BAAI/bge-reranker-v2-m3", top_n=5)
query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    node_postprocessors=[reranker],
    response_mode="compact",
)
# ========== 6. Run a query ==========
response = query_engine.query("What are Qwen's strengths in Chinese language understanding?")
print("\n=== Answer ===")
print(response)
print("\n=== Sources ===")
for node in response.source_nodes:
    print(f"[{node.score:.4f}] {node.metadata.get('file_name', 'unknown')}: {node.text[:120]}...")
Appendix: Troubleshooting
Q1: ImportError: cannot import name 'ServiceContext' from 'llama_index'
Cause: you are running version 0.10 or later, but the code uses the old 0.9.x style.
Fix: replace ServiceContext with Settings:
# Old style (deprecated)
from llama_index import ServiceContext
service_context = ServiceContext.from_defaults(...)
# New style
from llama_index.core import Settings
Settings.llm = ...
Settings.embed_model = ...
Q2: Qwen outputs garbage or repeated special tokens
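Cause: the messages_to_prompt template is not configured correctly, so the ChatML format is malformed.
Fix: follow the official template exactly (see Section 5, Option B) and make sure every <|im_start|> tag has its matching <|im_end|> tag.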
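A quick way to catch template mistakes is to build the prompt with a small helper and check that the tags balance. The sketch below is a standalone illustration with hypothetical names (`to_chatml`, `chatml_is_balanced`); it takes plain `(role, content)` tuples and does not depend on LlamaIndex message objects.

```python
def to_chatml(messages, default_system="You are a helpful assistant."):
    """Build a ChatML prompt from a list of (role, content) tuples."""
    messages = list(messages)
    if not messages or messages[0][0] != "system":
        messages.insert(0, ("system", default_system))
    parts = [f"<|im_start|>{role}\n{content}<|im_end|>\n" for role, content in messages]
    parts.append("<|im_start|>assistant\n")  # open turn for the model to complete
    return "".join(parts)


def chatml_is_balanced(prompt):
    """Every turn must be closed except the final, open assistant turn."""
    return prompt.count("<|im_start|>") == prompt.count("<|im_end|>") + 1


prompt = to_chatml([("user", "hi")])
print(prompt)
print(chatml_is_balanced(prompt))
```

A stray character between the message content and `<|im_end|>`, or a missing closing tag, is enough to make the model echo special tokens, so a check like this is cheap insurance in tests.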
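Q3: CUDA out of memory during embedding
Fix (either change is usually enough on its own):
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-m3",
    device="cpu",        # option 1: run embeddings on the CPU (slower, but frees GPU memory)
    embed_batch_size=1,  # option 2: stay on the GPU and reduce the batch size instead
)
Q4: Poor retrieval quality on Chinese text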
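Fix:
- Use an embedding model tuned for Chinese: BAAI/bge-m3 or BAAI/bge-large-zh-v1.5
- Use a reranker with Chinese support: BAAI/bge-reranker-v2-m3
- Adjust the chunk size: chunk_size=512~1024 is a reasonable range for Chinese documents (Chinese text is information-dense)
- Increase the overlap: chunk_overlap=100~200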
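To see how chunk size and overlap interact, here is a character-based sliding-window chunker. It is a simplified sketch: the function name `chunk_text` is hypothetical, and unlike SentenceSplitter it ignores sentence boundaries and counts characters, not tokens.

```python
def chunk_text(text, chunk_size=512, chunk_overlap=100):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
```

A larger overlap produces more chunks (and more embedding calls) for the same text, so the overlap is a trade-off between context continuity and indexing cost.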
Q5: How do I upgrade LlamaIndex?
pip install --upgrade llama-index
pip install --upgrade llama-index-core
# Then upgrade the integration packages one by one
pip install --upgrade llama-index-llms-huggingface
pip install --upgrade llama-index-embeddings-huggingface
Document version: v1.0
Compatible with: llama-index-core >= 0.14.0, Qwen2.5/3
Last verified: 2026-05-13