Source Code Analysis of BaseLoader and Document

BaseLoader

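LangChain is designed so that the many different data sources in the Source stage can be read and invoked in one uniform form throughout the rest of the pipeline.

This also answers a common question: why do document loaders such as PDF loaders and TextLoader all load data through load(), and why is the result always read through .page_content and .metadata?

Every document loader integrated into LangChain inherits from BaseLoader. BaseLoader provides a public method named load that loads data from whatever data source the loader is configured for and returns all of it as Document objects.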

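BaseLoader class analysis

The BaseLoader class defines how documents are loaded from different data sources, and every loader built for a specific data source must inherit from it. BaseLoader asks little of a concrete loader: at minimum, it must implement lazy_load. The inherited load method is built on top of lazy_load and, as the docstring below states, should not be overridden.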

class BaseLoader(ABC):  # noqa: B024
    """Interface for document loader.

    Implementations should implement the lazy-loading method using generators to avoid
    loading all documents into memory at once.

    `load` is provided just for user convenience and should not be overridden.
    """

    # Sub-classes should not implement this method directly. Instead, they
    # should implement the lazy load method.
    def load(self) -> list[Document]:
        """Load data into `Document` objects.

        Returns:
            The documents.
        """
        return list(self.lazy_load())

    async def aload(self) -> list[Document]:
        """Load data into `Document` objects.

        Returns:
            The documents.
        """
        return [document async for document in self.alazy_load()]

    def load_and_split(
        self, text_splitter: TextSplitter | None = None
    ) -> list[Document]:
        """Load `Document` and split into chunks. Chunks are returned as `Document`.

        !!! danger

            Do not override this method. It should be considered to be deprecated!

        Args:
            text_splitter: `TextSplitter` instance to use for splitting documents.

                Defaults to `RecursiveCharacterTextSplitter`.

        Raises:
            ImportError: If `langchain-text-splitters` is not installed and no
                `text_splitter` is provided.

        Returns:
            List of `Document` objects.
        """
        if text_splitter is None:
            if not _HAS_TEXT_SPLITTERS:
                msg = (
                    "Unable to import from langchain_text_splitters. Please specify "
                    "text_splitter or install langchain_text_splitters with "
                    "`pip install -U langchain-text-splitters`."
                )
                raise ImportError(msg)

            text_splitter_: TextSplitter = RecursiveCharacterTextSplitter()
        else:
            text_splitter_ = text_splitter
        docs = self.load()
        return text_splitter_.split_documents(docs)

    # Attention: This method will be upgraded into an abstractmethod once it's
    #            implemented in all the existing subclasses.
    def lazy_load(self) -> Iterator[Document]:
        """A lazy loader for `Document`.

        Yields:
            The `Document` objects.
        """
        if type(self).load != BaseLoader.load:
            return iter(self.load())
        msg = f"{self.__class__.__name__} does not implement lazy_load()"
        raise NotImplementedError(msg)

    async def alazy_load(self) -> AsyncIterator[Document]:
        """A lazy loader for `Document`.

        Yields:
            The `Document` objects.
        """
        iterator = await run_in_executor(None, self.lazy_load)
        done = object()
        while True:
            doc = await run_in_executor(None, next, iterator, done)
            if doc is done:
                break
            yield doc  # type: ignore[misc]

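load method

load(self) -> list[Document]
Purpose: loads all data and returns it as a list.
Implementation: subclasses should not override it. It calls self.lazy_load() and converts the resulting iterator into a list (list(...)).
Use case: a convenience method for users, suited to small data sets or cases where everything really does need to be loaded at once.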
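
The relationship between load and lazy_load can be sketched without LangChain installed. In the sketch below, Doc and MiniLoader are simplified stand-ins (assumptions made for illustration, not the real Document and BaseLoader classes), and LinesLoader is a hypothetical loader that reads from an in-memory string.

```python
from collections.abc import Iterator
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Simplified stand-in for LangChain's Document (not the real class)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


class MiniLoader:
    """Stand-in base class that mirrors only BaseLoader.load()."""

    def lazy_load(self) -> Iterator[Doc]:
        raise NotImplementedError

    def load(self) -> list[Doc]:
        # Same one-liner as BaseLoader.load: drain the lazy iterator into a list.
        return list(self.lazy_load())


class LinesLoader(MiniLoader):
    """Hypothetical loader that yields one Doc per line of an in-memory string."""

    def __init__(self, text: str, source: str) -> None:
        self.text = text
        self.source = source

    def lazy_load(self) -> Iterator[Doc]:
        for number, line in enumerate(self.text.splitlines(), start=1):
            yield Doc(page_content=line, metadata={"source": self.source, "line": number})


loader = LinesLoader("alpha\nbeta\ngamma", source="memory")
docs = loader.load()              # eager: the full list is built
first = next(loader.lazy_load())  # lazy: only the first Doc is built
```

The subclass writes only lazy_load, and load comes for free, which is why every loader exposes the same load() call.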

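aload method

aload(self) -> list[Document]
Purpose: the asynchronous version of load. It loads all data in one call from an async/await context.
Implementation logic:
It calls self.alazy_load() internally.
It uses the list comprehension [document async for document in ...] to iterate asynchronously and collect every result.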
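
As a rough illustration of that async list comprehension, the sketch below uses only the standard library. alazy_numbers is a hypothetical async generator standing in for alazy_load, and it yields integers instead of Document objects to keep the example short.

```python
import asyncio
from collections.abc import AsyncIterator


async def alazy_numbers(limit: int) -> AsyncIterator[int]:
    """Hypothetical async generator standing in for alazy_load()."""
    for value in range(limit):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield value


async def aload_numbers(limit: int) -> list[int]:
    # Same shape as BaseLoader.aload: an async list comprehension.
    return [value async for value in alazy_numbers(limit)]


collected = asyncio.run(aload_numbers(3))
```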

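load_and_split

load_and_split(self, text_splitter: TextSplitter | None = None) -> list[Document]
Purpose: loads the documents and immediately splits them into chunks, returning the chunks as a list of Document objects. The docstring says this method should be considered deprecated and must not be overridden.
Implementation logic:
Check the splitter: if the caller passes no text_splitter, the method falls back to a default RecursiveCharacterTextSplitter. If langchain-text-splitters is not installed (the _HAS_TEXT_SPLITTERS flag is false), it raises ImportError.
Load the data: it calls self.load(). Note that this loads all data into memory at once.
Split: it calls the splitter's split_documents method on the loaded documents.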
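
The three steps can be sketched with stand-ins. Everything here is hypothetical except the split_documents method name, which comes from the source: Doc replaces Document, FixedWidthSplitter replaces a real TextSplitter, and the toy default of width 4 replaces RecursiveCharacterTextSplitter.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Simplified stand-in for LangChain's Document (not the real class)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


class FixedWidthSplitter:
    """Hypothetical splitter that cuts text into fixed-width chunks."""

    def __init__(self, width: int) -> None:
        self.width = width

    def split_documents(self, docs: list[Doc]) -> list[Doc]:
        chunks = []
        for doc in docs:
            text = doc.page_content
            for start in range(0, len(text), self.width):
                # Each chunk carries a copy of its parent's metadata.
                chunks.append(Doc(text[start:start + self.width], dict(doc.metadata)))
        return chunks


def load_and_split(load, text_splitter=None) -> list[Doc]:
    # Step 1: pick the splitter (a toy default is assumed here).
    splitter = text_splitter if text_splitter is not None else FixedWidthSplitter(4)
    # Step 2: load everything into memory. Step 3: split.
    return splitter.split_documents(load())


chunks = load_and_split(lambda: [Doc("abcdefghij", {"source": "memory"})])
```

Because step 2 materializes the full list before splitting, this path gives up the memory benefit of lazy loading.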

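lazy_load

lazy_load(self) -> Iterator[Document]
Purpose: the core method. It loads documents lazily (as a stream), returning an iterator that produces one Document at a time instead of loading everything up front. Subclasses typically implement it as a generator.
Implementation logic:
Compatibility check: the code first checks whether the subclass overrides the older load method (if type(self).load != BaseLoader.load).
If it does (legacy style), the result of load() is wrapped in an iterator and returned as a backward-compatible fallback.
If it does not (the subclass overrides neither load nor lazy_load), NotImplementedError is raised.
Developer guide:
✅ Should implement: this is the one method you mainly need to care about when writing a custom loader.
How to write it: implement this method in your subclass and use the yield keyword to produce Document objects one at a time.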

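def lazy_load(self):
    # Stream the file line by line; the with-block closes the file handle.
    with open(self.file_path, encoding="utf-8") as f:
        for line in f:
            yield Document(page_content=line)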
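
The compatibility check described above can be exercised in isolation. This is a minimal sketch: MiniLoader copies only the load and lazy_load bodies from the source, the three subclasses are hypothetical, and plain strings take the place of Document objects.

```python
from collections.abc import Iterator


class MiniLoader:
    """Stand-in that copies the load/lazy_load relationship of BaseLoader."""

    def load(self) -> list[str]:
        return list(self.lazy_load())

    def lazy_load(self) -> Iterator[str]:
        # A subclass that overrides load() is treated as legacy code.
        if type(self).load != MiniLoader.load:
            return iter(self.load())
        msg = f"{self.__class__.__name__} does not implement lazy_load()"
        raise NotImplementedError(msg)


class LegacyLoader(MiniLoader):
    """Old style: overrides load() only."""

    def load(self) -> list[str]:
        return ["a", "b"]


class ModernLoader(MiniLoader):
    """Recommended style: overrides lazy_load() as a generator."""

    def lazy_load(self) -> Iterator[str]:
        yield "x"
        yield "y"


class EmptyLoader(MiniLoader):
    """Overrides neither method."""


legacy_items = list(LegacyLoader().lazy_load())  # served by the fallback
modern_items = ModernLoader().load()             # served by the generator

try:
    EmptyLoader().lazy_load()
    error = None
except NotImplementedError as exc:
    error = str(exc)
```

Both styles end up usable through either method, and only the loader that implements nothing fails.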

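alazy_load

alazy_load(self) -> AsyncIterator[Document]
Purpose: the asynchronous version of lazy_load. It loads documents as an asynchronous stream.
Implementation logic:
Thread-pool bridge: it uses run_in_executor to call the synchronous self.lazy_load in a thread pool and obtain the iterator.
Asynchronous iteration: inside the loop, each next() call that fetches the following document is also awaited through run_in_executor.
End-of-iteration check: a sentinel object, done = object(), is passed as the default to next(). When next() returns done, the iterator is exhausted and the loop stops.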
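
The sentinel pattern can be reproduced with the standard library alone. This sketch assumes asyncio's own loop.run_in_executor in place of LangChain's run_in_executor helper, and slow_source is a hypothetical blocking generator standing in for lazy_load.

```python
import asyncio
from collections.abc import AsyncIterator, Iterator


def slow_source() -> Iterator[str]:
    """Hypothetical blocking generator standing in for lazy_load()."""
    yield "doc-1"
    yield "doc-2"


async def alazy(source) -> AsyncIterator[str]:
    loop = asyncio.get_running_loop()
    # Create the iterator in a worker thread, as alazy_load does.
    iterator = await loop.run_in_executor(None, source)
    done = object()  # unique sentinel, compared by identity below
    while True:
        # next(iterator, done) returns the sentinel instead of raising StopIteration.
        item = await loop.run_in_executor(None, next, iterator, done)
        if item is done:
            break
        yield item


async def collect() -> list[str]:
    return [item async for item in alazy(slow_source)]


streamed = asyncio.run(collect())
```

Passing a default to next() keeps StopIteration from crossing the thread boundary, which matters because asyncio futures cannot carry a StopIteration exception.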

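Document class analysis

The Document class is the core data unit of the LangChain framework.
In the LangChain ecosystem, Document is the bridge that connects loading, text splitting, vector stores, and retrieval. It is not just a piece of text but a structured wrapper around text content plus metadata.

Code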

class Document(BaseMedia):
    page_content: str
    """String text."""

    type: Literal["Document"] = "Document"

    def __init__(self, page_content: str, **kwargs: Any) -> None:
        # my-py is complaining that page_content is not defined on the base class.
        # Here, we're relying on pydantic base class to handle the validation.
        super().__init__(page_content=page_content, **kwargs)  # type: ignore[call-arg,unused-ignore]

    @classmethod
    def is_lc_serializable(cls) -> bool:
        """Return `True` as this class is serializable."""
        return True

    @classmethod
    def get_lc_namespace(cls) -> list[str]:
        """Get the namespace of the LangChain object.

        Returns:
            `["langchain", "schema", "document"]`
        """
        return ["langchain", "schema", "document"]

    def __str__(self) -> str:
        """Override `__str__` to restrict it to page_content and metadata.

        Returns:
            A string representation of the `Document`.
        """
        # The format matches pydantic format for __str__.
        #
        # The purpose of this change is to make sure that user code that feeds
        # Document objects directly into prompts remains unchanged due to the addition
        # of the id field (or any other fields in the future).
        #
        # This override will likely be removed in the future in favor of a more general
        # solution of formatting content directly inside the prompts.
        if self.metadata:
            return f"page_content='{self.page_content}' metadata={self.metadata}"
        return f"page_content='{self.page_content}'"

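Core attributes

page_content: str
Meaning: the actual text content of the document.
Role: this is the field that is mainly processed when you compute embeddings, run keyword search, or have an LLM read the document.
Example: "Hello, world!" or a full paragraph of an article.

metadata: dict (inherited from BaseMedia)
Meaning: the field is not defined in the code excerpt because it is inherited from BaseMedia, but it is essential to Document. It is a dictionary that stores auxiliary information about the text.
Role:
   Provenance: records where the text came from (for example source: "file.pdf", url: "https://...").
   Filtering: enables metadata filtering at retrieval time (for example, only documents with year=2024).
   Context: records page numbers, section titles, authors, and similar details.
Example: {"source": "https://example.com", "page": 1}

type: Literal["Document"] = "Document"
Meaning: a type-identifier field.
Role: mainly used for serialization and deserialization (such as JSON export and import) or for routing decisions inside the framework. It explicitly marks the object as a Document rather than a HumanMessage or some other type.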
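
To make the filtering role of metadata concrete, here is a small sketch. Doc is a simplified stand-in for Document, and filter_by_metadata is a hypothetical helper written for this example; real vector stores expose their own filter syntax.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Simplified stand-in for LangChain's Document (not the real class)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


corpus = [
    Doc("Quarterly report", {"source": "report.pdf", "year": 2024, "page": 1}),
    Doc("Old memo", {"source": "memo.txt", "year": 2019}),
    Doc("Release notes", {"source": "https://example.com", "year": 2024}),
]


def filter_by_metadata(docs: list[Doc], **conditions) -> list[Doc]:
    """Keep documents whose metadata matches every key/value pair given."""
    return [
        doc
        for doc in docs
        if all(doc.metadata.get(key) == value for key, value in conditions.items())
    ]


recent = filter_by_metadata(corpus, year=2024)
sources = [doc.metadata["source"] for doc in recent]  # provenance of each hit
```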

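Method walkthrough

__init__(self, page_content: str, **kwargs: Any) -> None
Purpose: the constructor, which initializes a Document instance.
Details:
It requires page_content, which may be passed positionally or by keyword.
**kwargs: accepts any other arguments, which are handled by Pydantic (the mechanism underlying the parent class BaseMedia). The most common use is passing the metadata dictionary through kwargs.
Parent call: super().__init__(page_content=page_content, **kwargs) hands the data to Pydantic for validation and assignment.

Usage example:
# Correct usage
doc = Document(page_content="Some content", metadata={"source": "test.txt"})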

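is_lc_serializable(cls) -> bool
Purpose: tells LangChain's runtime (for example LangServe or the save/load utilities) that this class can be serialized.
Return value: always True.
Significance: Document objects can be converted to JSON for network transfer or persistent storage without losing their structure.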

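get_lc_namespace(cls) -> list[str]
Purpose: returns the namespace path of this object within the LangChain ecosystem.
Return value: ["langchain", "schema", "document"].
Significance: during cross-language exchange (for example between TypeScript and Python) or version migration, this path helps the framework locate the right class definition so that deserialization can find the matching class.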

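__str__(self) -> str
Purpose: defines the string produced when you call str() on a Document or print it directly.
Design rationale (important):
Restricted output: it shows only page_content and metadata.
Other fields ignored: it deliberately leaves out id and any internal fields that may be added later.
Reason: the code comment explains that this preserves backward compatibility. Many users place Document objects directly into prompt templates (for example prompt.format(docs=docs)). If __str__ included internal fields that change over time, such as an auto-generated ID, the prompt text would change in uncontrolled ways, which could affect the stability of LLM output or the cache hit rate.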
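
The effect is easy to see in a small sketch. Doc below is a dataclass stand-in (an assumption, since the real Document is a Pydantic model) whose __str__ copies the branching from the source shown earlier.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Doc:
    """Simplified stand-in for LangChain's Document (not the real class)."""

    page_content: str
    metadata: dict = field(default_factory=dict)
    id: Optional[str] = None  # extra field that __str__ deliberately leaves out

    def __str__(self) -> str:
        # Same branching as Document.__str__ in the source.
        if self.metadata:
            return f"page_content='{self.page_content}' metadata={self.metadata}"
        return f"page_content='{self.page_content}'"


with_meta = Doc("Hello", {"source": "a.txt"}, id="123")
without_meta = Doc("Hello", id="456")

# Formatting a document into a prompt goes through __str__, so the id never appears.
prompt = f"Answer using this context: {with_meta}"
```

Two documents with the same content and metadata but different ids therefore produce identical prompt text.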

Key point

Key takeaway: Document = text (page_content) + background information (metadata). It is meant for knowledge-base retrieval, not for conversation messages.