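I have a large .csv dataset of idioms. Each line contains three comma-separated elements that I want to split apart:
1) an index number (0, 1, 2, 3, ...)
2) the idiom itself
3) whether the idiom is positive, negative, or neutral
Here is a small sample of the .csv file: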
0,"I did touch them one time you see but of course there was nothing doing, he wanted me.",neutral
1,We find that choice theorists admit that they introduce a style of moral paternalism at odds with liberal values.,neutral
2,"Well, here I am with an olive branch.",positive
3,"Its rudder and fin were both knocked out, and a four-foot-long gash in the shell meant even repairs on the bank were out of the question.",negative
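As you can see, the idioms are sometimes wrapped in quotation marks and sometimes not. I don't expect that to be hard to handle, though.
I think the best way to organize this in Python is a dictionary, for example: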
example_dict = {0: ['This is an idiom.', 'neutral']}
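So how can I split each line into three separate strings (based on the commas), then use the first string as the key and the last two strings as the corresponding list items in the dictionary?
My first idea was to split on the commas with the following code: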
for line in file:
    new_item = ','.join(line.split(',')[1:])
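But all this does is remove everything before the first comma in a line, and I don't think running a bunch of iterations over it would be very efficient.
What is the best way to organize this data? Any suggestions are welcome.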
Answer 0 (score: 1)
Python has an entire module dedicated to working with csv files. In this case, you can use it to build a list of lists from the file. Assuming your file is named idioms.csv:
import csv

# newline='' lets the csv module handle line endings itself
with open('idioms.csv', newline='') as idioms_file:
    reader = csv.reader(idioms_file, delimiter=',', quotechar='"')
    idioms_list = [line for line in reader]

# Now you have a list that looks like this (every field is a string,
# and the surrounding quotes have been stripped):
# [['0', 'I did touch them...', 'neutral'],
#  ['1', 'We find that choice...', 'neutral'],
#  ...
# ]
You can now sort or organize the data however you like.
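If you want the dictionary layout from the question, one way to get there is to convert the first field to an int and keep the other two as the list. This is only a minimal sketch: the helper name build_idiom_dict is made up for illustration, it assumes every row has exactly three fields (a blank or malformed row would raise a ValueError), and it reads from an in-memory string here so the example runs without the real file.

```python
import csv
import io


def build_idiom_dict(csv_file):
    """Map each row's integer index to [idiom, sentiment].

    Hypothetical helper; assumes exactly three fields per row.
    """
    idioms = {}
    for index, idiom, sentiment in csv.reader(csv_file):
        idioms[int(index)] = [idiom, sentiment]
    return idioms


# In-memory stand-in for idioms.csv, using two rows from the question.
sample = io.StringIO(
    '0,"I did touch them one time you see but of course there was '
    'nothing doing, he wanted me.",neutral\n'
    '2,"Well, here I am with an olive branch.",positive\n'
)
idiom_dict = build_idiom_dict(sample)
```

With the real file, you would pass the handle from open('idioms.csv', newline='') instead of the StringIO object. Because csv.reader understands the quoting, commas inside a quoted idiom stay part of the idiom rather than splitting it.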