Question

Say I have the following variables and their corresponding values, which together represent a record.

name = 'abc'
age = 23
weight = 60
height = 174

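Note that a value can be of any type (string, integer, float, a reference to any other object, etc.).

There will be many records (at least > 100,000). Each record is unique when all four of these variables (actually their values) are taken together. In other words, no two records have all 4 values the same.

I am trying to find an efficient data structure in Python that will let me store and retrieve records based on any one of these variables in log(n) time complexity.

For example: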

def retrieve(name=None, age=None, weight=None, height=None):
    if name is not None and age is None and weight is None and height is None:
        pass  # get all records with the given name
    if name is None and age is not None and weight is None and height is None:
        pass  # get all records with the given age
    ....
    return records

retrieve should be called as follows:

retrieve(name='abc')

The above should return [{name:'abc', age:23, weight:50, height:175}, {name:'abc', age:28, weight:55, height:170}, etc]

retrieve(age=23)

The above should return [{name:'abc', age:23, weight:50, height:175}, {name:'def', age:23, weight:65, height:180}, etc]

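Also, I may need to add one or two more variables to this record in the future. For example, sex = 'm'. So the retrieve function must be scalable.

In short: is there a data structure in Python that allows storing a record with n columns (name, age, sex, weight, height, etc.) and retrieving records based on any one of the columns in logarithmic (or ideally constant, O(1) look-up time) complexity?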

Answer 1

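There isn't a single data structure built into Python that does everything you want, but it is fairly easy to combine the ones it does have to achieve your goals, and to do so fairly efficiently.

For example, say your input is the following data in a comma-separated-value file named employees.csv, with the field names defined in the first line: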

name,age,weight,height
Bob Barker,25,175,6ft 2in
Ted Kingston,28,163,5ft 10in
Mary Manson,27,140,5ft 6in
Sue Sommers,27,132,5ft 8in
Alice Toklas,24,124,5ft 6in

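The following is working code that shows how to read this data and store it in a list of records, and how to automatically create separate look-up tables for finding the records associated with the values contained in each record's fields.

The records are instances of a class created by namedtuple, which is very memory efficient because each instance lacks the __dict__ attribute that class instances normally contain. Using them also makes it possible to access each field with dot syntax, like record.fieldname.

The look-up tables are defaultdict(list) instances, which provide dictionary-like O(1) look-up times on average and also allow multiple entries to be associated with each key. The look-up key is the value of the field being searched for, and the data associated with it is a list of the integer indices of the Person records in the records list that contain that value, so the tables all stay relatively small.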
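As a minimal sketch of that idea, separate from the full class further down: the records here are made up for illustration, and `people` and `by_age` are hypothetical names for the record list and one look-up table.

```python
from collections import defaultdict, namedtuple

Person = namedtuple('Person', ['name', 'age'])

# Illustrative records only; not taken from employees.csv.
people = [Person('Ann', 30), Person('Bo', 25), Person('Cy', 30)]

# Map each age to the list of indices of the records that have it.
by_age = defaultdict(list)
for index, person in enumerate(people):
    by_age[person.age].append(index)

# .get() avoids creating an empty entry for a value that is absent.
matches = [people[i] for i in by_age.get(30, [])]
print([p.name for p in matches])  # ['Ann', 'Cy']
```

Storing indices rather than copies of the records keeps each table small, at the cost of one extra list access per match.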

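Note that the code for the class is completely data-driven: it contains no hard-coded field names, which are instead taken from the first row of the csv input file when it is read. Of course, when using an instance, all retrieve() method calls must provide a valid field name.

Update

Modified to not create a look-up table for every unique value of every field when the data file is first read. The retrieve() method now creates them "lazily", only when one is needed (and saves/caches the result for future use). Also modified to work in Python 2.7+, including 3.x.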

from collections import defaultdict, namedtuple
import csv

class DataBase(object):
    def __init__(self, csv_filename, recordname):
        # Read data from csv format file into a list of namedtuples.
        with open(csv_filename, 'r') as inputfile:
            csv_reader = csv.reader(inputfile, delimiter=',')
            self.fields = next(csv_reader)  # Read header row.
            self.Record = namedtuple(recordname, self.fields)
            self.records = [self.Record(*row) for row in csv_reader]
            self.valid_fieldnames = set(self.fields)

        # Create an empty table of lookup tables for each field name that maps
        # each unique field value to a list of record-list indices of the ones
        # that contain it.
        self.lookup_tables = {}

    def retrieve(self, **kwargs):
        """ Fetch a tuple of records with a field name with the value supplied
            as a keyword arg (the tuple is empty if there aren't any).
        """
        if len(kwargs) != 1: raise ValueError(
            'Exactly one fieldname keyword argument required for retrieve function '
            '(%s specified)' % ', '.join([repr(k) for k in kwargs.keys()]))
        field, value = kwargs.popitem()  # Keyword arg's name and value.
        if field not in self.valid_fieldnames:
            raise ValueError('keyword arg "%s" isn\'t a valid field name' % field)
        if field not in self.lookup_tables:  # Need to create a lookup table?
            lookup_table = self.lookup_tables[field] = defaultdict(list)
            for index, record in enumerate(self.records):
                field_value = getattr(record, field)
                lookup_table[field_value].append(index)
        # Return (possibly empty) sequence of matching records.
        return tuple(self.records[index]
                        for index in self.lookup_tables[field].get(value, []))

if __name__ == '__main__':
    empdb = DataBase('employees.csv', 'Person')

    print("retrieve(name='Ted Kingston'): {}".format(empdb.retrieve(name='Ted Kingston')))
    print("retrieve(age='27'): {}".format(empdb.retrieve(age='27')))
    print("retrieve(weight='150'): {}".format(empdb.retrieve(weight='150')))
    try:
        print("retrieve(hight='5ft 6in'): {}".format(empdb.retrieve(hight='5ft 6in')))
    except ValueError as e:
        print("ValueError('{}') raised as expected".format(e))
    else:
        raise type('NoExceptionError', (Exception,), {})(
            'No exception raised from "retrieve(hight=\'5ft 6in\')" call.')

Output:

retrieve(name='Ted Kingston'): (Person(name='Ted Kingston', age='28', weight='163', height='5ft 10in'),)
retrieve(age='27'): (Person(name='Mary Manson', age='27', weight='140', height='5ft 6in'),
                     Person(name='Sue Sommers', age='27', weight='132', height='5ft 8in'))
retrieve(weight='150'): ()
ValueError('keyword arg "hight" isn't a valid field name') raised as expected
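Because csv.reader yields every field as a string, the look-ups above have to use age='27' rather than age=27. If numeric keys are wanted, one possible approach is to convert the fields while reading. The following is only a sketch under that assumption: the `converters` mapping is a hypothetical addition that is not part of the class above, and an in-memory string stands in for employees.csv (Python 3).

```python
import csv
import io
from collections import namedtuple

# Stand-in for employees.csv so the sketch is self-contained.
data = io.StringIO("name,age,weight,height\nBob Barker,25,175,6ft 2in\n")

# Hypothetical mapping from field name to a conversion function;
# fields that are not listed stay as strings.
converters = {'age': int, 'weight': int}

reader = csv.reader(data)
fields = next(reader)  # Header row.
Person = namedtuple('Person', fields)
records = [
    Person(*(converters.get(field, str)(value)
             for field, value in zip(fields, row)))
    for row in reader
]
print(records[0])
```

With the values converted this way, a look-up table built from these records would be keyed by integers for age and weight.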

Answer 2

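Is there a data structure in Python that allows storing a record with n columns (name, age, sex, weight, height, etc.) and retrieving records based on any (one) of the columns in logarithmic (or ideally constant, O(1) look-up time) complexity?

No, there is none. But you could try to implement one on the basis of one dictionary per value dimension, as long as your values are hashable, of course. If you implement a custom class for your records, each dictionary will contain references to the same objects, which saves some memory.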
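A minimal sketch of what that could look like, assuming the four fields from the question. The `Record` and `RecordStore` names and their methods are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

class Record(object):
    """One record; __slots__ avoids a per-instance __dict__."""
    __slots__ = ('name', 'age', 'weight', 'height')

    def __init__(self, name, age, weight, height):
        self.name = name
        self.age = age
        self.weight = weight
        self.height = height

class RecordStore(object):
    def __init__(self):
        # One dictionary per field, mapping a value to the records holding it.
        self.indexes = {field: defaultdict(list) for field in Record.__slots__}

    def add(self, record):
        # Every index stores a reference to the same Record object.
        for field, index in self.indexes.items():
            index[getattr(record, field)].append(record)

    def retrieve(self, field, value):
        # .get() keeps absent values from creating empty entries.
        return list(self.indexes[field].get(value, []))

store = RecordStore()
store.add(Record('abc', 23, 60, 174))
store.add(Record('def', 23, 65, 180))
print([r.name for r in store.retrieve('age', 23)])  # ['abc', 'def']
```

Adding a field such as sex would only require extending `__slots__` and `__init__`, since the indexes are derived from the field names.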

Answer 3

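Given http://wiki.python.org/moin/TimeComplexity how about this:

Have a dictionary for every column you're interested in: AGE, NAME, etc.
Have the keys of those dictionaries (AGE, NAME) be the possible values for the given column (35 or "m").
Have a list of lists representing the values for one "collection", e.g. VALUES = [ [35, "m"], ...]
Have the values of the column dictionaries (AGE, NAME) be lists of indices into the VALUES list.
Have a dictionary that maps each column name to its index within the lists in VALUES, so that you know the first column is age and the second is sex (you could avoid that and store each record as a dictionary instead, but dictionaries have a large memory footprint, and with over 100K objects this may or may not be a problem).

Then the retrieve function could look like this: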

def retrieve(column_name, column_value):
    if column_name == "age":
        return [VALUES[index] for index in AGE[column_value]]
    elif column_name == "sex":  # repeat for other "columns"
        return [VALUES[index] for index in SEX[column_value]]

Then, this is what you get:

VALUES = [[35, "m"], [20, "f"]]
AGE = {35:[0], 20:[1]}
SEX = {"m":[0], "f":[1]}
KEYS = ["age", "sex"]

retrieve("age", 35)
# [[35, 'm']]

If you want a dictionary, you can do the following:

[dict(zip(KEYS, values)) for values in retrieve("age", 35)]
# [{'age': 35, 'sex': 'm'}]

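But again, dictionaries are a little heavy on the memory side, so if you can work with lists of values it might be better.

Both dictionary and list retrieval are O(1) on average (the worst case for a dictionary is O(n)), so this should be pretty fast. Maintaining it will be a bit of a pain, but not that much. To "write", you just append to the VALUES list and then append the new index in VALUES to each of the dictionaries.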
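The "write" step just described could be sketched as follows. The `insert` function and the `COLUMNS` list are hypothetical names introduced here; the data is the same small example used earlier in this answer.

```python
VALUES = [[35, "m"], [20, "f"]]
AGE = {35: [0], 20: [1]}
SEX = {"m": [0], "f": [1]}
# Column dictionaries, in the same order as the entries of each row.
COLUMNS = [AGE, SEX]

def insert(row):
    """Append row to VALUES and record its index in every column dictionary."""
    index = len(VALUES)
    VALUES.append(row)
    for column, value in zip(COLUMNS, row):
        # setdefault creates the index list the first time a value is seen.
        column.setdefault(value, []).append(index)
    return index

insert([35, "f"])
print(AGE[35])   # [0, 2]
print(SEX["f"])  # [1, 2]
```

Deleting a row is the harder part to maintain, since removing an entry from VALUES would shift every later index.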

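Of course, the best thing would be to benchmark your actual implementation and look for potential improvements, but hopefully this makes sense and will get you going :)

EDIT:

Please note that, as @moooeeeep said, this only works if your values are hashable and can therefore be used as dictionary keys.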

Answer 4

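You could achieve logarithmic time complexity in a relational database by using indexes (O(log(n)**k) with single-column indexes). Retrieving data is then just a matter of constructing the appropriate SQL: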

names = {'name', 'age', 'weight', 'height'}

def retrieve(c, **params):
    if not (params and names.issuperset(params)):
        raise ValueError(params)
    where = ' and '.join(map('{0}=:{0}'.format, params))
    return c.execute('select * from records where ' + where, params)

Example:

import sqlite3

c = sqlite3.connect(':memory:')
c.row_factory = sqlite3.Row # to provide key access

# create table
c.execute("""create table records
             (name text, age integer, weight real, height real)""")

# insert data
records = (('abc', 23, 60, 174+i) for i in range(2))
c.executemany('insert into records VALUES (?,?,?,?)', records)

# create indexes
for name in names:
    c.execute("create index idx_{0} on records ({0})".format(name))

try:
    retrieve(c, naame='abc')
except ValueError:
    pass
else:
    assert 0

for record in retrieve(c, name='abc', weight=60):
    print(record['height'])

Output:

174.0
175.0
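To check that a look-up really goes through an index rather than a full table scan, SQLite's EXPLAIN QUERY PLAN statement can be used. The following is a small self-contained sketch with the same table layout and a single index; the exact wording of the plan text varies between SQLite versions, so no particular output is shown.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("create table records "
             "(name text, age integer, weight real, height real)")
conn.execute("create index idx_name on records (name)")
conn.execute("insert into records values ('abc', 23, 60, 174)")

# The last column of each EXPLAIN QUERY PLAN row is a human-readable
# description of that step of the plan.
plan = conn.execute(
    "explain query plan select * from records where name = ?", ('abc',)
).fetchall()
details = [row[-1] for row in plan]
for detail in details:
    print(detail)
```

If the index is being used, the description mentions idx_name; a plan that only scans the table means the query is linear in the number of rows.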

What is the best data structure for storing a set of four (or more) values?
