Question

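I have a data set of books and authors with a many-to-many relationship.

There are about 10^6 books and 10^5 authors, with an average of 10 authors per book.

I need to perform a series of operations on the data set, such as counting the number of books by each author, or deleting all books by a certain author.

What would be a good data structure that allows fast handling?

I'm hoping there is some ready-made module that provides methods along the lines of: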

obj.books.add(book1)

# linking
obj.books[n].author = author1
obj.authors[m].author = book1

# deleting
obj.remove(author1) # should automatically remove all links to the books by author1, but not the linked books

I should clarify that I'd prefer not to use a database for this, but to do it all in memory.


Answer 1

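sqlite3 (or any other good relational DB, but sqlite comes with Python and is handier for such a reasonably small data set) seems the right approach for your task. If you'd rather not learn SQL, SQLAlchemy is a popular "wrapper" over relational DBs, so to speak, that lets you work with them at any of several different abstraction levels of your choice.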

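And doing it "all in memory" is no problem at all (it's silly, mind you, since you'll needlessly pay the overhead of reading in all the data from somewhere on each and every run of your program, while keeping the DB in a disk file would save you that overhead, but that's a different issue ;-). Just open your sqlite database as ':memory:' and there you are: a fresh, new relational DB living entirely in memory (for the duration of your process only), with no disk involved at all. So, why not? ;-)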
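As a minimal sketch of the ':memory:' idea, assuming only the standard-library sqlite3 module (the one-column Books table here is purely illustrative):

```python
import sqlite3

# ":memory:" asks sqlite3 for a private database that lives only in RAM
# and disappears when the connection is closed.
conn = sqlite3.connect(":memory:")

# Illustrative table; the real schema would have more fields.
conn.execute("CREATE TABLE Books (ID INTEGER PRIMARY KEY, Title TEXT)")
conn.execute("INSERT INTO Books (Title) VALUES (?)", ("my first book",))

book_count = conn.execute("SELECT COUNT(*) FROM Books").fetchone()[0]
print(book_count)
```

Nothing is written to disk here, so the data has to be loaded again on every run, which is the overhead mentioned above.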

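Personally, I'd use SQL directly for this task. It gives me excellent control over exactly what's going on, and makes it easy to add or remove indices to tweak performance, etc. You'd use three tables: a Books table (primary key ID, other fields such as Title, etc.), an Authors table (primary key ID, other fields such as Name, etc.), and a "many-to-many relationship table", say BookAuthors, with just two fields, BookID and AuthorID, and one record per author-book connection.

The two fields of the BookAuthors table are what are known as "foreign keys", referring respectively to the ID fields of Books and Authors, and you can define them with ON DELETE CASCADE so that records referring to a deleted book or author are automatically deleted in turn. This is one example of the high semantic level at which even "bare" SQL lets you work, which no other existing data structure comes close to matching.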
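The three-table layout could be sketched roughly as follows. This is an illustration under assumptions rather than the answerer's exact schema: the index name, the sample rows, and the composite primary key are made up for the example. Note that SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been issued on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite ignores foreign-key constraints unless enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Books   (ID INTEGER PRIMARY KEY, Title TEXT);
CREATE TABLE Authors (ID INTEGER PRIMARY KEY, Name  TEXT);
CREATE TABLE BookAuthors (
    BookID   INTEGER NOT NULL REFERENCES Books(ID)   ON DELETE CASCADE,
    AuthorID INTEGER NOT NULL REFERENCES Authors(ID) ON DELETE CASCADE,
    PRIMARY KEY (BookID, AuthorID)
);
-- Hypothetical index name; speeds up per-author lookups.
CREATE INDEX BookAuthorsByAuthor ON BookAuthors (AuthorID);
""")

# Made-up sample data.
conn.executemany("INSERT INTO Books (ID, Title) VALUES (?, ?)",
                 [(1, "Book A"), (2, "Book B")])
conn.executemany("INSERT INTO Authors (ID, Name) VALUES (?, ?)",
                 [(1, "Author X"), (2, "Author Y")])
conn.executemany("INSERT INTO BookAuthors (BookID, AuthorID) VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1)])
conn.commit()

# Count the books of each author.
counts = dict(conn.execute("""
    SELECT a.Name, COUNT(ba.BookID)
    FROM Authors a LEFT JOIN BookAuthors ba ON ba.AuthorID = a.ID
    GROUP BY a.ID
"""))

# Deleting an author cascades to the link rows but leaves the books alone.
conn.execute("DELETE FROM Authors WHERE ID = ?", (1,))
conn.commit()
remaining_links = conn.execute(
    "SELECT BookID, AuthorID FROM BookAuthors").fetchall()
remaining_books = conn.execute("SELECT COUNT(*) FROM Books").fetchone()[0]
```

After the delete, only the link between Book A and Author Y is left, and both books are still present.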

Answer 2

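"I'm hoping there is some ready-made module that provides methods along the lines of:"

Since that actually works, what more do you need?

You have a Book and an Author class definition. You also have a Book-Author association for the relationships. The methods required to manage add/change/delete are only a few lines of code.

Create big old dictionaries of Authors, Books and Author-Book association objects.

Use shelve to store it all.

Done.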
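One way to read this suggestion, as a rough sketch only: the `Library` class, its method names, and the use of (author_id, book_id) pairs as the association objects are all assumptions made for the illustration, and persistence via shelve is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Book:
    title: str


@dataclass(frozen=True)
class Author:
    name: str


class Library:
    """Hypothetical container: plain dictionaries plus a set of association pairs."""

    def __init__(self):
        self.books = {}     # book_id -> Book
        self.authors = {}   # author_id -> Author
        self.links = set()  # (author_id, book_id) association pairs

    def link(self, author_id, book_id):
        self.links.add((author_id, book_id))

    def books_by(self, author_id):
        # Linear scan over all links; fine for a sketch, slow at 10^7 links.
        return [self.books[b] for a, b in self.links if a == author_id]

    def remove_author(self, author_id):
        # Drop the author and every association, but keep the books.
        del self.authors[author_id]
        self.links = {(a, b) for a, b in self.links if a != author_id}


lib = Library()
lib.authors[1] = Author("Author X")
lib.books[1] = Book("Book A")
lib.books[2] = Book("Book B")
lib.link(1, 1)
lib.link(1, 2)

count_before = len(lib.books_by(1))
lib.remove_author(1)
```

With the data set sizes from the question, a per-author index would probably be needed in place of the linear scans over `links`.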

Answer 3

I'd just use pandas for this. It can handle many-to-many relationships, and counting and deleting are very simple. For example:

import pandas as pd

# Set up the dataframe with books and authors.
df = pd.DataFrame(columns=['author', 'book'])
df.loc[0] = ['John Smith', 'Programming in Python']
df.loc[1] = ['John Doe', 'Programming in Python']
df.loc[2] = ['John Smith', 'Programming in Pandas']
df.loc[3] = ['John Doe', 'Programming in Numpy']
df.loc[4] = ['Jane Doe', 'Programming in Numpy']

# Find all books by John Smith
print(list(df['John Smith' == df['author']]['book'].values))
# Result: ['Programming in Python', 'Programming in Pandas']
# Use the len function to count the number of books.

# Find all authors for 'Programming in Numpy'
print(list(df['Programming in Numpy' == df['book']]['author'].values))
# Result: ['John Doe', 'Jane Doe']

# To drop the John Doe's from the dataframe:
df = df.drop(df['John Doe' == df['author']].index)

Answer 4

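Well, if you don't want to persist your data and want a pure Python solution, I don't think you need any 3rd-party library or external database. You would be quicker doing it like this:

Assign a unique ID to every book and author (using, for example, a counter)
Manage the ID-to-object mappings in their respective dictionaries
Build the many-to-many relationship by managing 2 one-to-many associations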

    from typing import Dict, List
    
    # equivalent to tables
    books: Dict[int, str] = {}
    authors: Dict[int, str] = {}
    # equivalent to a many-to-many relationship
    book_to_author_map: Dict[int, List[int]] = {}
    author_to_book_map: Dict[int, List[int]] = {}
    
    # your database objects
    books[0] = 'my first book'
    books[1] = 'my second book'
    books[2] = 'my third book'
    authors[0] = 'my first author'
    authors[1] = 'my second author'
    authors[2] = 'my third author'
    
    book_to_author_map[0] = [0]
    book_to_author_map[1] = [1, 2]
    book_to_author_map[2] = [0, 2]
    
    author_to_book_map[0] = [0, 2]
    author_to_book_map[1] = [1]
    author_to_book_map[2] = [1, 2]
    
    # operations on your "database"
    
    # add a book 3 and associate it to author 0
    books[3] = 'my fourth book'
    book_to_author_map[3] = []
    book_to_author_map[3].append(0)
    author_to_book_map[0].append(3)
    
    # remove book 1 from author 2
    book_to_author_map[1].remove(2)
    author_to_book_map[2].remove(1)
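The two operations named in the question, counting books per author and removing an author together with all of its links, could be layered on the same maps roughly as follows. This is a sketch under assumptions: the helper names `count_books_per_author` and `remove_author` are made up, and it uses sets instead of the lists above so that unlinking does not have to scan a list.

```python
from typing import Dict, Set

# Same sample data as above, with sets instead of lists (an assumption of this sketch).
books: Dict[int, str] = {0: "my first book", 1: "my second book", 2: "my third book"}
authors: Dict[int, str] = {0: "my first author", 1: "my second author", 2: "my third author"}
book_to_author_map: Dict[int, Set[int]] = {0: {0}, 1: {1, 2}, 2: {0, 2}}
author_to_book_map: Dict[int, Set[int]] = {0: {0, 2}, 1: {1}, 2: {1, 2}}


def count_books_per_author() -> Dict[int, int]:
    """Return author_id -> number of linked books."""
    return {author_id: len(book_ids)
            for author_id, book_ids in author_to_book_map.items()}


def remove_author(author_id: int) -> None:
    """Delete an author and all of its links; the books themselves stay."""
    for book_id in author_to_book_map.pop(author_id, set()):
        book_to_author_map[book_id].discard(author_id)
    authors.pop(author_id, None)


counts = count_books_per_author()
remove_author(2)
```

Both maps have to be updated together on every change, which is the price of getting fast lookups in both directions.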

Many-to-many data structure in Python

4 answers: