I have a CSV file that looks like this:
| publish_date |sentence_number|character_count| sentence |
----------------------------------------------------------------------------
| 1 | | | |
----------------------------------------------------------------------------
| 02/01/2012 00:12:00 | -1 | 0 | Sentence 1 here. |
----------------------------------------------------------------------------
| 02/01/2012 00:12:00 | 0 | 14 | Sentence 2 here. |
----------------------------------------------------------------------------
| 02/01/2012 00:12:00 | 1 | 28 | "Sentence 3 here. |
----------------------------------------------------------------------------
| 02/01/2012 00:12:00 | 2 | 42 | Sentence 4 here." |
----------------------------------------------------------------------------
| 02/01/2012 00:12:00 | 3 | 56 | Sentence 5 here. |
----------------------------------------------------------------------------
| end | | | |
----------------------------------------------------------------------------
| 2 | | | |
----------------------------------------------------------------------------
| 02/01/2012 00:12:00 | -1 | 0 | Sentence 1 here. |
----------------------------------------------------------------------------
| 02/01/2012 00:12:00 | 0 | 14 | Sentence 2 here. |
----------------------------------------------------------------------------
| end | | | |
----------------------------------------------------------------------------
| end | | | |
----------------------------------------------------------------------------
What I want to do is combine each block of sentences into a paragraph, so that every block is output as a single paragraph:
["Sentence 1 here.", "Sentence 2 here.", ""Sentence 3 here.", "Sentence 4 here."", "Sentence 5 here."]
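Some of the sentences are quotations that continue into a new sentence, while others are embedded entirely within a single sentence.
So far I have this: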
import csv

def read_file():
    file = open('test.csv', "rU")
    reader = csv.reader(file)
    included_cols = [3]  # index of the "sentence" column
    for row in reader:
        content = list(row[i] for i in included_cols)
        print content
    return content

read_file()
But this only prints a separate one-sentence list for each row, like this:
['Sentence 1 here.']
['Sentence 2 here.']
Any suggestions are appreciated.
Answer 0 (score: 1)
Take the fourth element of each row to build a list of those elements:
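import csv

def read_file():
    # "rU" is the Python 2 universal-newlines mode
    with open('test.csv', "rU") as file:
        reader = csv.reader(file)
        # keep the fourth column of every row that has a non-empty one
        return [row[3] for row in reader if len(row) > 3 and row[3]]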
This should output:
['sentence', 'Sentence 1 here.', 'Sentence 2 here.', ' "Sentence 3 here.', ' Sentence 4 here."', ' Sentence 5 here.', 'Sentence 1 here.', 'Sentence 2 here.']
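Note that the list above still contains the header cell 'sentence' and the leading spaces from some cells. A minimal sketch of one way to drop both, assuming Python 3 and a comma-delimited file whose marker rows ("1", "end") have empty sentence cells. The `SAMPLE` text and the `read_sentences` name are made up for illustration; the real file's layout may differ:

```python
import csv
import io

# Hypothetical sample mirroring the table in the question; the delimiter,
# padding, and empty cells are assumptions about the real file.
SAMPLE = """publish_date,sentence_number,character_count,sentence
1,,,
02/01/2012 00:12:00,-1,0,Sentence 1 here.
02/01/2012 00:12:00,0,14,Sentence 2 here.
02/01/2012 00:12:00,1,28, Sentence 3 here.
end,,,
"""

def read_sentences(handle):
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row so 'sentence' is not returned
    # strip padding, then keep only rows with a non-empty fourth column
    return [row[3].strip() for row in reader
            if len(row) > 3 and row[3].strip()]

sentences = read_sentences(io.StringIO(SAMPLE))
print(sentences)
```

To read a real file in Python 3, `open('test.csv', newline='')` could be passed in place of the `io.StringIO` object.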
If you want to split the result into paragraphs:
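import csv
from itertools import groupby

def read_file():
    with open('temp.txt', "rU") as file:
        reader = csv.reader(file)
        # fourth column of every row that has one; empty cells mark paragraph boundaries
        paras = (row[3] for row in reader if len(row) > 3)
        # group consecutive non-empty cells and drop the runs of empty ones
        return [list(v) for k, v in groupby(paras, key=lambda x: x != "") if k]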
The groupby version should output something like:
[['sentence', 'Sentence 1 here.', 'Sentence 2 here.',
' "Sentence 3 here.', ' Sentence 4 here."', ' Sentence 5 here.'],
['Sentence 1 here.', 'Sentence 2 here.']]
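Since the question asks for each block as a single paragraph, the grouped lists can also be joined into strings. The following is only a sketch under assumptions: Python 3, a comma-delimited file, and marker rows ("1", "2", "end") that carry four columns with an empty sentence cell. `SAMPLE` and `read_paragraphs` are hypothetical names introduced here, not part of the original answer:

```python
import csv
import io
from itertools import groupby

# Hypothetical sample shaped like the table in the question; the exact
# raw layout of the real file is an assumption.
SAMPLE = """publish_date,sentence_number,character_count,sentence
1,,,
02/01/2012 00:12:00,-1,0,Sentence 1 here.
02/01/2012 00:12:00,0,14,Sentence 2 here.
02/01/2012 00:12:00,1,28, "Sentence 3 here.
02/01/2012 00:12:00,2,42, Sentence 4 here."
end,,,
2,,,
02/01/2012 00:12:00,-1,0,Sentence 1 here.
02/01/2012 00:12:00,0,14,Sentence 2 here.
end,,,
end,,,
"""

def read_paragraphs(handle):
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row
    # stripped fourth column of each full row; marker rows yield ""
    cells = (row[3].strip() for row in reader if len(row) > 3)
    # consecutive non-empty cells form one paragraph; empty runs are dropped
    return [" ".join(group)
            for has_text, group in groupby(cells, key=bool) if has_text]

paragraphs = read_paragraphs(io.StringIO(SAMPLE))
for paragraph in paragraphs:
    print(paragraph)
```

Using `key=bool` treats any empty cell as a boundary, so the split relies on the marker rows rather than on the paragraph numbers. A quotation that spans two sentences stays inside one paragraph because both cells fall in the same run.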