DocArray: A Python Library That Simplifies Data Handling and Neural Network Integration

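1. Python's Reach Across Domains and an Introduction to DocArray

Python's clean, readable syntax and rich feature set have made it one of the most popular programming languages today. In web development, frameworks such as Django and Flask help developers build sites quickly. In data analysis and data science, libraries such as NumPy and Pandas provide powerful data processing. In machine learning and artificial intelligence, frameworks such as TensorFlow and PyTorch drive a wide range of intelligent applications. For desktop automation and scraping scripts, Selenium and Requests make automation and data collection straightforward. Python also plays an important role in finance and quantitative trading, and its gentle learning curve makes it a favorite in education and research.

DocArray is one of the libraries to come out of this ecosystem. It offers a convenient way to handle data and exchange it with neural networks, which helps developers get such tasks done more efficiently.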

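2. DocArray: Uses, How It Works, Pros and Cons, and License

DocArray is a library for processing, serializing, and transmitting nested data structures, and it is particularly well suited to working with neural networks. Its main uses are: as a multimodal data structure for storing and processing images, text, audio, and other data types; as an input/output format for neural networks, so that data can be passed easily between models; and as a basis for similarity search, which can be used to build search applications.

DocArray is built around the concept of a Document. Each Document holds a set of attributes, which may be simple values (such as text or a URI) or nested structures (such as child Documents). The API is designed to keep these operations simple and intuitive.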
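
To make the idea of a nested Document concrete, here is a minimal sketch in plain Python. It does not use DocArray and is not how the library is implemented; `MiniDoc` and its `walk` method are hypothetical names introduced only for illustration. It shows a record with content, metadata, and child records, plus a recursive traversal.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Tuple


@dataclass
class MiniDoc:
    """Hypothetical stand-in for a document: content, metadata, child documents."""
    text: str = ""
    tags: Dict[str, Any] = field(default_factory=dict)
    chunks: List["MiniDoc"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "MiniDoc"]]:
        """Yield (depth, document) pairs for this document and all nested chunks."""
        yield depth, self
        for chunk in self.chunks:
            yield from chunk.walk(depth + 1)


# Build a three-level structure: report -> section -> paragraph
root = MiniDoc(text="report", tags={"lang": "en"})
section = MiniDoc(text="section 1")
section.chunks.append(MiniDoc(text="paragraph 1.1"))
root.chunks.append(section)

outline = [(depth, doc.text) for depth, doc in root.walk()]
print(outline)
```

The same shape (a node with attributes and a list of children) is what lets a single Document describe, say, an article together with its paragraphs.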

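DocArray has clear strengths. It offers a unified data interface across many data types, which speeds up development; it serializes and transmits data efficiently; and its support for nesting makes it flexible enough to represent complex data relationships. It also has drawbacks. For simple data structures it can be more machinery than the task needs, and an in-memory DocumentArray may become a bottleneck on very large datasets. DocArray is released under the Apache License 2.0.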

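3. Using DocArray

3.1 Installing DocArray

DocArray is installed with pip. The examples in this article use the Document and DocumentArray classes from the legacy API (releases before 0.30). Releases from 0.30 onward replaced them with a new, incompatible API, so pin the version:

pip install "docarray<0.30"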

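3.2 Creating and Working with Documents

The Document class is the core of DocArray. The following sections show how to create and manipulate Documents.

First, import the required classes: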

from docarray import Document, DocumentArray

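3.2.1 Creating a Simple Document

A simple Document can carry text, tags, and other information:

# Create a Document that contains text
doc = Document(text='Hello, DocArray!')

# Attach tags
doc.tags = {'category': 'example', 'importance': 'high'}

# Print the Document
print(doc)

This example creates a Document with the text "Hello, DocArray!" and attaches tags that record its category and importance.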

3.2.2 Creating a Document with a Nested Structure

DocArray supports nesting, so a Document can contain several child Documents:

# Create the main Document
main_doc = Document(text='This is a main document')

# Create the child Documents
sub_doc1 = Document(text='This is sub-document 1', tags={'type': 'text'})
sub_doc2 = Document(text='This is sub-document 2', tags={'type': 'text'})

# Add the child Documents to the main Document's chunks attribute
main_doc.chunks.append(sub_doc1)
main_doc.chunks.append(sub_doc2)

# Print the main Document
print(main_doc)

Here we create one main Document and two child Documents, then add the children to the main Document's chunks attribute to form a nested structure.

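3.2.3 Working with Document Attributes

A Document's attributes are easy to read and modify:

# Create a Document
doc = Document(text='Original text')

# Read the text attribute
print(f"Original text: {doc.text}")

# Modify the text attribute
doc.text = 'Modified text'
print(f"Modified text: {doc.text}")

# Store a custom field in tags so that it is serialized with the Document
doc.tags['custom_attribute'] = 'This is a custom attribute'
print(f"Custom attribute: {doc.tags['custom_attribute']}")

This example creates a Document, reads and modifies its text attribute, and stores a custom field in tags, which is where metadata outside the built-in attributes belongs.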

3.3 Using DocumentArray

A DocumentArray is a collection of Documents that supports efficient batch operations.

3.3.1 Creating a DocumentArray

A DocumentArray can be created in several ways:

# Option 1: create from a list
docs1 = DocumentArray([
    Document(text='Document 1'),
    Document(text='Document 2'),
    Document(text='Document 3')
])

# Option 2: append one Document at a time
docs2 = DocumentArray()
docs2.append(Document(text='Document A'))
docs2.append(Document(text='Document B'))

# Print the DocumentArrays
print(f"docs1: {docs1}")
print(f"docs2: {docs2}")

This shows two ways to create a DocumentArray: directly from a list of Documents, or by appending Documents one by one.

3.3.2 Working with a DocumentArray

DocumentArray provides a rich set of operations:

# Create a DocumentArray
docs = DocumentArray([
    Document(text='Hello'),
    Document(text='World'),
    Document(text='DocArray')
])

# Access a single Document
print(f"First document: {docs[0]}")

# Access a slice
print(f"Sliced documents: {docs[1:3]}")

# Append a new Document
docs.append(Document(text='New document'))
print(f"Updated documents: {docs}")

# Filter the DocumentArray: keep Documents whose text matches a regular expression
filtered_docs = docs.find({'text': {'$regex': 'document'}})
print(f"Filtered documents: {filtered_docs}")

This example shows how to access a single Document or a slice of a DocumentArray, how to append a new Document, and how to filter a DocumentArray with the find method.
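
Conceptually, a text filter of this kind tests each Document's text against a pattern and keeps the ones that pass. The sketch below reproduces that idea with the standard library only. The function name `regex_filter` is hypothetical, and the search-anywhere, case-sensitive semantics are an assumption of this sketch rather than a statement about DocArray's internals.

```python
import re

texts = ["Hello", "World", "DocArray", "New document"]


def regex_filter(items, pattern):
    """Keep the items in which the regular expression matches anywhere."""
    compiled = re.compile(pattern)
    return [item for item in items if compiled.search(item)]


# Case-sensitive by default: "DocArray" does not contain lowercase "document"
case_sensitive = regex_filter(texts, "document")

# The inline flag (?i) makes the pattern case-insensitive
case_insensitive = regex_filter(texts, "(?i)doc")

print(case_sensitive)
print(case_insensitive)
```

Note that case matters: a pattern written in lowercase will skip capitalized text unless the pattern itself asks for case-insensitive matching.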

3.4 Serializing and Storing Data

DocArray can serialize data to several formats for storage and transmission.

3.4.1 Serializing to JSON

from docarray import Document, DocumentArray

# Create a DocumentArray
docs = DocumentArray([
    Document(text='Hello'),
    Document(text='World')
])

# Serialize to JSON
json_data = docs.to_json()
print(f"JSON data: {json_data}")

# Deserialize from JSON
loaded_docs = DocumentArray.from_json(json_data)
print(f"Loaded documents: {loaded_docs}")

Here we serialize a DocumentArray to a JSON string and then deserialize that string back into a DocumentArray.
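
The property that matters in a serialization round trip is that nothing is lost, including nested children. The following sketch checks that property on plain dictionaries with the standard `json` module. The dictionary layout (`text`, `tags`, `chunks`) is a simplified stand-in chosen for this illustration, not DocArray's actual JSON schema.

```python
import json

# Simplified stand-ins for two documents, the second with a nested child
docs = [
    {"text": "Hello", "tags": {"lang": "en"}, "chunks": []},
    {
        "text": "World",
        "tags": {},
        "chunks": [{"text": "nested", "tags": {}, "chunks": []}],
    },
]

# Serialize to a JSON string, then parse it back
json_data = json.dumps(docs)
restored = json.loads(json_data)

# The round trip preserves both top-level and nested content
print(restored == docs)
print(restored[1]["chunks"][0]["text"])
```

JSON is convenient because it is human-readable and language-neutral. For large numeric payloads such as embeddings, a binary format is usually more compact.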

3.4.2 Saving to a File

from docarray import Document, DocumentArray

# Create a DocumentArray
docs = DocumentArray([
    Document(text='Hello'),
    Document(text='World')
])

# Save to a binary file
docs.save_binary('docs.bin')

# Load from the binary file
loaded_docs = DocumentArray.load_binary('docs.bin')
print(f"Loaded documents: {loaded_docs}")

This example shows how to save a DocumentArray to a binary file and load it back.

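3.5 Integrating with Neural Networks

DocArray integrates easily with neural network frameworks. The example below uses image data.

3.5.1 Processing Image Data

from docarray import Document

# Create a Document that points to an image (placeholder URL; replace with a real image)
img_doc = Document(uri='https://example.com/image.jpg')

# Load the image content into the tensor attribute
img_doc.load_uri_to_image_tensor()

# Show the image shape
print(f"Image tensor shape: {img_doc.tensor.shape}")

# Preprocess: normalize, then move the channel axis from last to first
img_doc.set_image_tensor_normalization()
img_doc.set_image_tensor_channel_axis(-1, 0)

# The image tensor can now be fed to a neural network,
# for example a pretrained torchvision model
import torch
from torchvision import models

# Load a pretrained model (torchvision >= 0.13; older releases use pretrained=True)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

# Prepare the input as a float32 tensor
input_tensor = torch.as_tensor(img_doc.tensor, dtype=torch.float32)

# Run inference on a batch of one image
with torch.no_grad():
    output = model(input_tensor.unsqueeze(0))

# Inspect the output
print(f"Model output shape: {output.shape}")

This example creates a Document from an image URI, loads the image content, preprocesses it, and feeds the resulting tensor to a pretrained ResNet model for inference.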
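
The two preprocessing steps, per-channel normalization and moving the channel axis to the front, can be spelled out on a tiny image in plain Python. This is an illustrative sketch, not DocArray's implementation. The helper names `normalize_hwc` and `hwc_to_chw` are hypothetical, and the mean and standard deviation are the ImageNet statistics that also appear in the `transforms.Normalize` call in the image search example later in this article.

```python
MEAN = (0.485, 0.456, 0.406)  # ImageNet per-channel means (R, G, B)
STD = (0.229, 0.224, 0.225)   # ImageNet per-channel standard deviations


def normalize_hwc(image):
    """Scale 0-255 pixel values to 0-1, then standardize each channel."""
    return [
        [
            [(value / 255.0 - MEAN[c]) / STD[c] for c, value in enumerate(pixel)]
            for pixel in row
        ]
        for row in image
    ]


def hwc_to_chw(image):
    """Reorder height x width x channel data to channel x height x width."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    return [
        [[image[y][x][c] for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]


# A hypothetical 2 x 2 RGB image in height x width x channel layout
image = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]

chw = hwc_to_chw(normalize_hwc(image))
shape = (len(chw), len(chw[0]), len(chw[0][0]))
print(shape)
```

PyTorch convolutional models expect the channel-first layout, which is why the axis move comes before the tensor is handed to the model.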

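3.5.2 Processing Multimodal Data

DocArray also handles multimodal data, such as a Document that contains both an image and text:

from docarray import Document

# Create a multimodal Document (placeholder URL; replace with a real image)
multi_modal_doc = Document(
    text='A beautiful landscape',
    uri='https://example.com/landscape.jpg'
)

# Load the image content
multi_modal_doc.load_uri_to_image_tensor()

# The text and the image can now be processed separately,
# for example with BERT for the text and ResNet for the image,
# and the features of the two modalities can then be fused

Here we create a multimodal Document that holds both text and an image. The two parts can be processed separately and their features fused afterwards.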
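
One simple way to fuse features from two modalities is late fusion by concatenation: scale each feature vector to unit length so that neither modality dominates, then join them. The sketch below assumes this particular strategy, which the article does not prescribe. The function names and the short example vectors are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return list(vector)
    return [v / norm for v in vector]


def fuse(text_features, image_features):
    """Late fusion: normalize each modality, then concatenate."""
    return l2_normalize(text_features) + l2_normalize(image_features)


text_features = [3.0, 4.0]          # hypothetical text embedding
image_features = [0.0, 5.0, 12.0]   # hypothetical image embedding

fused = fuse(text_features, image_features)
print(len(fused))
```

The fused vector has as many dimensions as the two inputs combined, so every Document that is compared against it must be fused in the same way.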

4. DocArray in Practice

4.1 An Image Search Application

The following image search application shows DocArray in a realistic setting.

import torch
from torchvision import models, transforms
from docarray import Document, DocumentArray
from PIL import Image
import os

# Load a pretrained model (torchvision >= 0.13; older releases use pretrained=True)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
# Drop the final fully connected layer so the model outputs features
feature_extractor = torch.nn.Sequential(*list(model.children())[:-1])
feature_extractor.eval()

# Image preprocessing
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Build the image database
def build_image_database(image_dir):
    """Build the image database: extract a feature vector per image and store it."""
    image_database = DocumentArray()

    # Walk through the image directory
    for filename in os.listdir(image_dir):
        if filename.lower().endswith(('.jpg', '.jpeg', '.png')):
            file_path = os.path.join(image_dir, filename)

            # Create the Document
            doc = Document(uri=file_path)

            # Load the image
            img = Image.open(file_path).convert('RGB')

            # Preprocess the image and add a batch dimension
            img_tensor = preprocess(img)
            img_tensor = img_tensor.unsqueeze(0)

            # Extract features
            with torch.no_grad():
                features = feature_extractor(img_tensor)
                features = features.squeeze().flatten()

            # Store the features as the Document's embedding
            doc.embedding = features.numpy()

            # Add the Document to the database
            image_database.append(doc)

    return image_database

# Run an image search
def image_search(query_image_path, image_database, top_k=5):
    """Run an image search and return the top_k most similar images."""
    # Create the query Document
    query_doc = Document(uri=query_image_path)

    # Load the query image
    query_img = Image.open(query_image_path).convert('RGB')

    # Preprocess the query image and add a batch dimension
    query_tensor = preprocess(query_img)
    query_tensor = query_tensor.unsqueeze(0)

    # Extract the query image's features
    with torch.no_grad():
        query_features = feature_extractor(query_tensor)
        query_features = query_features.squeeze().flatten()

    # Set the query Document's embedding
    query_doc.embedding = query_features.numpy()

    # Match the query against the database; results are stored in query_doc.matches
    query_doc.match(image_database, limit=top_k)

    return query_doc.matches

# Usage example
if __name__ == "__main__":
    # Assume a directory of images
    image_dir = "path/to/your/images"

    # Build the image database
    print("Building image database...")
    image_db = build_image_database(image_dir)

    # Save the database
    image_db.save_binary("image_database.bin")
    print("Image database saved.")

    # Load the database
    loaded_db = DocumentArray.load_binary("image_database.bin")
    print("Image database loaded.")

    # Run a search
    query_image = "path/to/query/image.jpg"
    print(f"Searching for similar images to: {query_image}")
    results = image_search(query_image, loaded_db)

    # Print the search results (the 'cosine' score is a distance: lower means more similar)
    print("Search results:")
    for idx, match in enumerate(results):
        print(f"{idx+1}. {match.uri}, cosine distance: {match.scores['cosine'].value}")

This image search example shows what DocArray can do. We first extract image features with a pretrained ResNet model and store them in a DocumentArray that serves as the image database. For a query image, we extract its features in the same way, match them against the features in the database, and return the most similar images.
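
The matching step is a nearest-neighbor search over embeddings. The sketch below shows a brute-force version based on cosine distance, written with the standard library only. It illustrates the idea and is not DocArray's implementation. The function names, file names, and two-dimensional vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, database, k):
    """Return the k (name, distance) pairs closest to the query."""
    scored = [(name, cosine_distance(query, vec)) for name, vec in database.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hypothetical embeddings, kept two-dimensional for readability
database = {
    "a.jpg": [1.0, 0.0],
    "b.jpg": [0.0, 1.0],
    "c.jpg": [1.0, 1.0],
}

results = top_k([1.0, 0.1], database, k=2)
print([name for name, _ in results])
```

Because every query is compared against every stored vector, the cost grows linearly with the size of the database. That is acceptable for small collections, while larger ones usually call for an approximate nearest-neighbor index.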

4.2 A Multimodal Question-Answering System

The next example is a multimodal question-answering system that shows how DocArray handles several data types.

from docarray import Document, DocumentArray
from transformers import AutoTokenizer, AutoModel
from torchvision import models
import torch

# Load the text encoder
text_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
text_model = AutoModel.from_pretrained('bert-base-uncased')
text_model.eval()

# Load the image encoder: a ResNet-18 without its final classification layer
# (torchvision >= 0.13; older releases use pretrained=True)
image_model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
image_model = torch.nn.Sequential(*list(image_model.children())[:-1])
image_model.eval()

# Text encoding function
def encode_text(text):
    """Encode text as a vector."""
    inputs = text_tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
    with torch.no_grad():
        outputs = text_model(**inputs)
    # Use the output at the [CLS] token as the text representation
    return outputs.last_hidden_state[:, 0, :].numpy().flatten()

# Image encoding function
def encode_image(image_tensor):
    """Encode an image tensor of shape (3, H, W) as a vector."""
    with torch.no_grad():
        features = image_model(image_tensor.unsqueeze(0))
        features = features.squeeze().flatten()
    return features.numpy()

# Build the multimodal knowledge base
def build_knowledge_base():
    """Build a multimodal knowledge base that holds text and image entries."""
    knowledge_base = DocumentArray()

    # Add the text entries
    text_knowledge = [
        "Python is a popular programming language.",
        "Machine learning is a subfield of artificial intelligence.",
        "Deep learning uses neural networks with many layers.",
        "Natural language processing deals with text understanding.",
        "Computer vision is about understanding visual information."
    ]

    for text in text_knowledge:
        doc = Document(text=text, tags={'modality': 'text'})
        doc.embedding = encode_text(text)
        knowledge_base.append(doc)

    # Add the image entries (simplified; a real application would load actual images)
    image_descriptions = [
        "A cat sitting on a chair",
        "A dog running in a park",
        "A bird flying in the sky",
        "A flower in a garden",
        "A car driving on a road"
    ]

    for desc in image_descriptions:
        # Random stand-in for an image tensor
        dummy_image_tensor = torch.rand(3, 224, 224)
        doc = Document(text=desc, tags={'modality': 'image'})
        doc.embedding = encode_image(dummy_image_tensor)
        knowledge_base.append(doc)

    return knowledge_base

# Multimodal question-answering function
def multimodal_qa(query, knowledge_base, is_text_query=True, top_k=3):
    """Return the top_k knowledge-base entries closest to the query."""
    # Encode the query
    if is_text_query:
        query_embedding = encode_text(query)
        modality = 'text'
    else:
        # Simplified: assume the query is already an image tensor
        query_embedding = encode_image(query)
        modality = 'image'

    # BERT (768-d) and ResNet-18 (512-d) embeddings live in different spaces,
    # so compare the query only with entries of the same modality
    candidates = DocumentArray(
        [doc for doc in knowledge_base if doc.tags.get('modality') == modality]
    )

    # Create the query Document
    query_doc = Document(embedding=query_embedding)

    # Find the nearest entries; results are stored in query_doc.matches
    query_doc.match(candidates, limit=top_k)

    return query_doc.matches

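# Usage example
if __name__ == "__main__":
    # Build the knowledge base
    print("Building knowledge base...")
    kb = build_knowledge_base()
    print(f"Knowledge base built with {len(kb)} items.")

    # Text query example (the 'cosine' score is a distance: lower means more similar)
    text_query = "What is machine learning?"
    print(f"\nText query: {text_query}")
    text_results = multimodal_qa(text_query, kb)

    print("Text query results:")
    for idx, match in enumerate(text_results):
        print(f"{idx+1}. {match.text}, cosine distance: {match.scores['cosine'].value:.4f}")

    # Image query example (simplified: a random tensor stands in for a real image)
    print("\nImage query example (simplified):")
    dummy_image_query = torch.rand(3, 224, 224)
    image_results = multimodal_qa(dummy_image_query, kb, is_text_query=False)

    print("Image query results:")
    for idx, match in enumerate(image_results):
        print(f"{idx+1}. {match.text}, cosine distance: {match.scores['cosine'].value:.4f}")

This multimodal question-answering example shows how flexibly DocArray handles different kinds of data. We encode text with a BERT model and images with a ResNet model, and store the resulting embeddings as Documents. A query, whether text or image, is encoded with the matching encoder and used to retrieve the most relevant entries from the knowledge base. Because the image entries here are random tensors, the image results only demonstrate the mechanics.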

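5. DocArray on PyPI, GitHub, and the Official Documentation

PyPI: https://pypi.org/project/docarray
GitHub: https://github.com/jina-ai/docarray
Official documentation: https://docarray.jina.ai

These resources cover DocArray's features in more detail, track its latest developments, and suggest further use cases and techniques.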
