Question

I have a file that is comma-separated (,) and tab-delimited (\t).

68,"phrase"\t
485,"another phrase"\t
43, "phrase 3"\t

Is there a simple way to load this into a Python Counter?

Answer 1

You can use a dict comprehension, which is considered more Pythonic and can be marginally faster:

import csv
from collections import Counter


def convert_counter_like_csv_to_counter(file_to_convert):
    with file_to_convert.open(encoding="utf-8") as f:
        csv_reader = csv.DictReader(f, delimiter="\t", fieldnames=["count", "title"])
        the_counter = Counter({row["title"]: int(float(row["count"])) for row in csv_reader})
    return the_counter
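
As a rough usage sketch, the function above can be tried on a small temporary file. Because it calls file_to_convert.open(encoding=...), this example assumes the argument is a pathlib.Path; the sample rows and the file name counts.tsv are made up for illustration, and the data is assumed to be count, tab, phrase.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path


def convert_counter_like_csv_to_counter(file_to_convert):
    # Same function as in the answer above.
    with file_to_convert.open(encoding="utf-8") as f:
        csv_reader = csv.DictReader(f, delimiter="\t", fieldnames=["count", "title"])
        the_counter = Counter({row["title"]: int(float(row["count"])) for row in csv_reader})
    return the_counter


# Hypothetical sample data: a count, a tab, then the phrase.
sample = "68\tphrase\n485\tanother phrase\n4.3e1\tphrase 3\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "counts.tsv"  # illustrative file name
    path.write_text(sample, encoding="utf-8")
    counter = convert_counter_like_csv_to_counter(path)

print(counter.most_common(1))  # [('another phrase', 485)]
```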

Answer 2

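I couldn't let this go, and I stumbled on what I think is the winner.

In testing, it became clear that looping over the rows of the csv.DictReader was the slowest part; it took about 30 of the 40 seconds.

I switched to a plain csv.reader to see what it would give me. That produced rows as lists. I wrapped it in dict to see whether it would convert directly. It did! (Note that dict(csv_reader) keys on the first column, the count, so this is only lossless when every count is unique.)

I could then iterate over a native dict instead of the csv.DictReader.

The result... 4 million rows done in 3 seconds!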

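import csv
from collections import Counter


def convert_counter_like_csv_to_counter(file_to_convert):
    with file_to_convert.open(encoding="utf-8") as f:
        csv_reader = csv.reader(f, delimiter="\t")
        # Unpack each [count, phrase] row directly. Going through dict(csv_reader)
        # would key on the count, so phrases sharing a count would overwrite each other.
        the_counter = Counter({phrase: int(float(count)) for count, phrase in csv_reader})

    return the_counter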
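
To illustrate why an intermediate dict(csv_reader) can lose rows, here is a small sketch using made-up in-memory data in which two phrases share the same count. It reads from io.StringIO rather than a real file purely for convenience.

```python
import csv
import io
from collections import Counter

# Hypothetical data in which two phrases share the count 68.
rows = "68\tphrase\n485\tanother phrase\n68\tphrase 3\n"

# dict() treats each [count, phrase] row as a key/value pair, so the
# second row with count "68" replaces the first.
via_dict = dict(csv.reader(io.StringIO(rows), delimiter="\t"))
lossy = Counter({phrase: int(float(count)) for count, phrase in via_dict.items()})

# Unpacking rows straight from the reader keeps every phrase.
direct = Counter(
    {phrase: int(float(count)) for count, phrase in csv.reader(io.StringIO(rows), delimiter="\t")}
)

print(len(lossy), len(direct))  # 2 3
```

If every count in the file is known to be unique, both forms give the same Counter.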

Answer 3

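This is my best attempt. It works, but it isn't the fastest.
~~It takes about 1.5 minutes to run on a 4-million-line input file.~~
Following Daniel Mesejo's suggestion, a 4-million-line input file now takes about 40 seconds.

Note: the count values in the CSV can be in scientific notation and need converting, hence the int(float(...)) cast.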

import csv
from collections import Counter

def convert_counter_like_csv_to_counter(file_to_convert):

    the_counter = Counter()
    with file_to_convert.open(encoding="utf-8") as f:
        csv_reader = csv.DictReader(f, delimiter="\t", fieldnames=["count", "title"])
        for row in csv_reader:
            the_counter[row["title"]] = int(float(row["count"]))

    return the_counter
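
The int(float(...)) cast mentioned in the note can be checked in isolation. The values below are hypothetical examples of what the count column might contain; int() alone rejects scientific notation, while going through float() first accepts it (and truncates any fractional part).

```python
# Hypothetical count values as they might appear in the file.
raw_counts = ["68", "4.85e2", "1E3"]

for text in raw_counts:
    try:
        direct = int(text)
    except ValueError:
        direct = None  # int() rejects scientific notation
    print(text, direct, int(float(text)))

converted = [int(float(text)) for text in raw_counts]
print(converted)  # [68, 485, 1000]
```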

Convert a 2-column counter-like CSV file to a Python collections.Counter?
