Question

Imagine we have a large file with rows like the following:

ID     value     string
1      105       abc 
1      98        edg
1      100       aoafsk
2      160       oemd
2      150       adsf 
...

Assume the file is named file.txt and is tab-delimited.

I want to keep the row with the highest value for each ID. The expected output is:

ID     value     string
1      105       abc 
2      160       oemd
...

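How can I read the data in chunks and process it? If I read the data in chunks, how can I make sure that the records for each ID are complete at the end of each chunk?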

Answer 1

Keep track of the data in a dictionary of this format:

data = {
    ID: [value, 'string'],
}

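As you read each line from the file, check whether that ID already exists in the dict. If it does not, add it; if it does, and the current value is larger than the stored one, replace the entry in the dict.

At the end, your dict should hold the largest value for every ID.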

# init to empty dict
data = {}

# open the input file
with open('file.txt', 'r') as fp:

    # skip the header line (ID, value, string)
    next(fp, None)

    # read each line
    for line in fp:

        # skip blank lines
        if not line.strip():
            continue

        # grab ID, value, string
        item_id, item_value, item_string = line.split()

        # convert ID and value to integers
        item_id = int(item_id)
        item_value = int(item_value)

        # if ID is not in the dict at all, or if the value we just read
        # is bigger, use the current values
        if item_id not in data or item_value > data[item_id][0]:
            data[item_id] = [item_value, item_string]

# print the row with the largest value for each ID
for item_id in data:
    print(item_id, data[item_id][0], data[item_id][1])

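Dictionaries in Python versions before 3.7 do not guarantee any particular ordering of their contents, so when you read the data back out of the dict at the end of the program, it may not be in the same order as the original file (i.e. you might see ID 2 first, followed by ID 1).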

If this matters to you, you can use an OrderedDict, which preserves the original insertion order of its elements.
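A minimal sketch of the OrderedDict variant, assuming the same three whitespace-separated columns and input lines with no header row. The helper name `max_per_id` and the sample rows are illustrative, not part of the original answer.

```python
from collections import OrderedDict


def max_per_id(lines):
    """Return an OrderedDict mapping ID -> [value, string], in first-seen order."""
    best = OrderedDict()
    for line in lines:
        item_id, item_value, item_string = line.split()
        item_id, item_value = int(item_id), int(item_value)
        # Overwriting an existing key keeps its original position in the OrderedDict
        if item_id not in best or item_value > best[item_id][0]:
            best[item_id] = [item_value, item_string]
    return best


# Hypothetical sample rows; ID 2 appears first on purpose
rows = ["2\t160\toemd", "1\t98\tedg", "1\t105\tabc", "2\t150\tadsf"]
result = max_per_id(rows)
for item_id, (value, string) in result.items():
    print(item_id, value, string)
```

Here ID 2 is printed before ID 1 because it was seen first, even though ID 1 was updated later.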

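(When you say "read by chunks", do you have something specific in mind? If you mean a specific number of bytes, then you may run into problems if a chunk boundary happens to fall in the middle of a line...)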
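If fixed-size chunks really are required, one way to cope with a boundary that falls mid-line is to hold back the trailing partial line and prepend it to the next chunk. The sketch below assumes a file opened in text mode (so `read` counts characters, not bytes) and "\n" line endings; `iter_lines_in_chunks` and `chunk_size` are hypothetical names. Because the dict approach keeps a running maximum per ID, it should not matter whether all records for an ID land in the same chunk.

```python
import io


def iter_lines_in_chunks(fp, chunk_size=64 * 1024):
    """Yield complete lines from fp, reading chunk_size characters at a time."""
    leftover = ""
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            break
        lines = (leftover + chunk).split("\n")
        # The last piece may be an incomplete line; hold it for the next chunk
        leftover = lines.pop()
        for line in lines:
            yield line
    # Emit a final line that had no trailing newline
    if leftover:
        yield leftover


# A tiny in-memory stand-in for file.txt, with a deliberately small chunk size
sample = io.StringIO("1\t105\tabc\n1\t98\tedg\n2\t160\toemd\n")
lines = list(iter_lines_in_chunks(sample, chunk_size=7))
print(lines)
```

Iterating over the file object directly (for line in fp) already reads in buffered blocks, so this is only worth doing when the chunk size has to be controlled explicitly.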

Answer 2

Code

import csv
import itertools as it


with open("test.csv", newline="") as f:
    reader = csv.DictReader(f, delimiter=" ")              # 1
    # groupby only groups consecutive rows, so rows with the same ID must be adjacent
    for k, g in it.groupby(reader, lambda d: d["ID"]):     # 2
        print(max(g, key=lambda d: float(d["value"])))     # 3

# {'value': '105', 'string': 'abc', 'ID': '1'}
# {'value': '160', 'string': 'oemd', 'ID': '2'}

Details

The with block ensures that the file f is opened and closed safely. The file is an iterable, which lets you loop over it or, ideally, apply itertools.

1. For each line of f, csv.DictReader splits the data and keeps the header row information as the keys of a dictionary, e.g. [{'value': '105', 'string': 'abc', 'ID': '1'}, ...]
2. This data is iterable and is passed to groupby, which groups all of the data by ID. See this post for more details on how groupby works.
3. The max() built-in combined with a special key function returns the dict with the largest "value". See this tutorial for more details on the max() function.
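One caveat worth illustrating: groupby only merges consecutive items with the same key, so this approach relies on all rows for an ID being adjacent in the file. The sketch below uses made-up in-memory records (the "late" row is hypothetical) and a helper named `max_per_group` purely for illustration.

```python
import itertools as it

records = [
    {"ID": "1", "value": "105", "string": "abc"},
    {"ID": "1", "value": "98", "string": "edg"},
    {"ID": "2", "value": "160", "string": "oemd"},
    {"ID": "1", "value": "200", "string": "late"},  # hypothetical out-of-order row
]


def max_per_group(rows):
    """Return the max-value record of each consecutive run of equal IDs."""
    return [
        max(group, key=lambda d: float(d["value"]))
        for _, group in it.groupby(rows, key=lambda d: d["ID"])
    ]


# Without sorting, ID 1 forms two separate runs and is reported twice
unsorted_result = max_per_group(records)
# Sorting by ID first makes each ID a single run
sorted_result = max_per_group(sorted(records, key=lambda d: d["ID"]))

print([r["string"] for r in unsorted_result])  # ['abc', 'oemd', 'late']
print([r["string"] for r in sorted_result])    # ['late', 'oemd']
```

Sorting needs the whole dataset in memory, so for a large file that is not already grouped by ID, a running-maximum dict as in Answer 1 is probably the better fit.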

Read data in chunks and keep one row per ID in Python
