Question

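I have a large .csv dataset of idioms. Each line contains three comma-separated elements that I want to split apart:

1) the index number (0, 1, 2, 3 ...)

2) the idiom itself

3) whether the idiom is positive/negative/neutral

Here is a small sample of the .csv file: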

0,"I did touch them one time you see but of course there was nothing doing, he wanted me.",neutral

1,We find that choice theorists admit that they introduce a style of moral paternalism at odds with liberal values.,neutral

2,"Well, here I am with an olive branch.",positive

3,"Its rudder and fin were both knocked out, and a four-foot-long gash in the shell meant even repairs on the bank were out of the question.",negative

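As you can see, the idioms are sometimes wrapped in quotes and sometimes not. I don't think that will be hard to handle, though.

I think the best way to organize this in Python is with a dictionary, for example: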

example_dict = {0: ['This is an idiom.', 'neutral']}

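So how can I split each line into three separate strings (on the commas), then use the first string as the key and the other two as the corresponding list items in the dictionary?

My first thought was to split on the commas with the following code: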

for line in file:    
    new_item = ','.join(line.split(',')[1:])

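But all it does is drop everything before the first comma in a line, and I don't think running a bunch of iterations over it would be very efficient.

Could I get some advice on the best way to organize this data?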

Answer 1

Python has an entire module, csv, dedicated to handling csv files. In this case you can use it to build a list of lists from the file. Assuming your file is named idioms.csv:

import csv
with open('idioms.csv', newline='') as idioms_file:
    reader = csv.reader(idioms_file, delimiter=',', quotechar='"')
    idioms_list = [line for line in reader]

# Now you have a list that looks like this (csv.reader returns every field as a string):
# [['0', 'I did touch them...', 'neutral'],
#  ['1', 'We find that choice...', 'neutral'],
#  ...
# ]

You can now sort or organize the data however you like.
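To get the dictionary shape asked for in the question, something like the following might work. It is only a sketch: rows_to_dict is a hypothetical helper name, it assumes every non-empty row has exactly three fields with an integer index, and the sample rows are inlined through io.StringIO in place of idioms.csv so the snippet runs on its own.

```python
import csv
import io

# Inline sample standing in for idioms.csv (assumption: same three-column layout).
sample = io.StringIO(
    '0,"I did touch them one time you see but of course there was nothing doing, he wanted me.",neutral\n'
    '1,We find that choice theorists admit that they introduce a style of moral paternalism at odds with liberal values.,neutral\n'
    '2,"Well, here I am with an olive branch.",positive\n'
)

def rows_to_dict(rows):
    """Map int(index) -> [idiom, sentiment] for each three-field row."""
    result = {}
    for row in rows:
        if not row:  # csv.reader yields an empty list for blank lines
            continue
        index, idiom, sentiment = row
        result[int(index)] = [idiom, sentiment]
    return result

idioms = rows_to_dict(csv.reader(sample))
print(idioms[2])  # ['Well, here I am with an olive branch.', 'positive']
```

With the real file, you would pass csv.reader(idioms_file) from the with block above instead of the inline sample. The quoted commas inside an idiom stay intact because the csv module handles the quoting, which is what a plain line.split(',') cannot do.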
