Organizing a dataset with Python

Time: 2018-11-25 04:56:05

Tags: python csv dictionary

I have a large .csv dataset of idioms. Each line contains three comma-separated elements that I want to pull apart:

1) An index number (0, 1, 2, 3 ...)

2) The idiom itself

3) Whether the idiom is positive, negative, or neutral

Here is a small sample of the .csv file:

0,"I did touch them one time you see but of course there was nothing doing, he wanted me.",neutral

1,We find that choice theorists admit that they introduce a style of moral paternalism at odds with liberal values.,neutral

2,"Well, here I am with an olive branch.",positive

3,"Its rudder and fin were both knocked out, and a four-foot-long gash in the shell meant even repairs on the bank were out of the question.",negative

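As you can see, the idioms are sometimes wrapped in quotation marks and sometimes not. I don't expect that to be hard to handle, though.

I think the best way to organize this in Python is with a dictionary, for example: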

example_dict = {0: ['This is an idiom.', 'neutral']}

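So how can I split each line into three separate strings (based on the commas), then use the first string as the key and the last two strings as the corresponding list items in the dictionary?

My first thought was to split on commas with the following code: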

for line in file:    
    new_item = ','.join(line.split(',')[1:])

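But all that does is remove everything before the first comma on a line, and I don't think running it through a bunch of iterations would be very efficient.

What is the best way to organize this data?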
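To make the problem concrete, here is a minimal sketch that runs the snippet above on one of the sample rows and compares it with a plain split and with the standard-library csv module. The variable names are illustrative only, and io.StringIO is used just so the example runs without a file on disk.

```python
import csv
import io

# One row copied from the sample data; the idiom itself contains a comma.
line = '2,"Well, here I am with an olive branch.",positive'

# The snippet from the question: it only drops the index field.
new_item = ','.join(line.split(',')[1:])

# A plain split cuts the quoted idiom in two at its internal comma.
naive_fields = line.split(',')

# csv.reader respects the quotes and yields exactly three fields.
parsed_fields = next(csv.reader(io.StringIO(line)))

print(new_item)
print(len(naive_fields), naive_fields)
print(len(parsed_fields), parsed_fields)
```

The plain split returns four pieces for this row instead of three, which is why splitting on commas by hand is fragile for this data.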

1 Answer:

Answer 0: (score: 1)

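Python has an entire module (csv) dedicated to working with csv files. In this case, you can use it to build a list of lists from the file. Assume for now that your file is named idioms.csv: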

import csv
with open('idioms.csv', newline='') as idioms_file:
    reader = csv.reader(idioms_file, delimiter=',', quotechar='"')
    idioms_list = [line for line in reader]

# Now you have a list of lists that looks like this.
# Note that csv.reader returns every field as a string, including the index:
# [['0', 'I did touch them...', 'neutral'],
#  ['1', 'We find that choice...', 'neutral'],
#  ...
# ]

You can now sort or organize the data however you like.
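If you want the dictionary layout from the question, one possible sketch is below. The helper name load_idioms is hypothetical, and io.StringIO with two sample rows stands in for the real file so the example is self-contained. It assumes every non-empty row has exactly three fields.

```python
import csv
import io

# Two rows from the question's sample, separated by a blank line as in the post.
# In real use, pass open('idioms.csv', newline='') instead of io.StringIO(sample).
sample = '''0,"I did touch them one time you see but of course there was nothing doing, he wanted me.",neutral

2,"Well, here I am with an olive branch.",positive
'''

def load_idioms(csv_file):
    """Hypothetical helper: map each integer index to [idiom, sentiment]."""
    idioms = {}
    for row in csv.reader(csv_file):
        if not row:
            continue  # csv.reader yields [] for blank lines; skip them
        index, idiom, sentiment = row  # assumes exactly three fields per row
        idioms[int(index)] = [idiom, sentiment]
    return idioms

idioms_dict = load_idioms(io.StringIO(sample))
print(idioms_dict[2])
```

Converting the index with int() gives integer keys, matching example_dict in the question. A row with more or fewer than three fields raises a ValueError at the unpacking step, which makes malformed lines easy to spot.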