Question

I have a CSV file like this:

|     publish_date     |sentence_number|character_count|    sentence       |
----------------------------------------------------------------------------
|          1           |               |               |                   |
----------------------------------------------------------------------------
| 02/01/2012  00:12:00 |      -1       |       0       | Sentence 1 here.  |
----------------------------------------------------------------------------
| 02/01/2012  00:12:00 |       0       |      14       | Sentence 2 here.  |
----------------------------------------------------------------------------
| 02/01/2012  00:12:00 |       1       |      28       | "Sentence 3 here. |
----------------------------------------------------------------------------
| 02/01/2012  00:12:00 |       2       |      42       | Sentence 4 here." |
----------------------------------------------------------------------------
| 02/01/2012  00:12:00 |       3       |      56       | Sentence 5 here.  |
----------------------------------------------------------------------------
|         end          |               |               |                   |
----------------------------------------------------------------------------
|          2           |               |               |                   |
----------------------------------------------------------------------------
| 02/01/2012  00:12:00 |      -1       |       0       | Sentence 1 here.  |
----------------------------------------------------------------------------
| 02/01/2012  00:12:00 |       0       |      14       | Sentence 2 here.  |
----------------------------------------------------------------------------
|         end          |               |               |                   |
----------------------------------------------------------------------------
|         end          |               |               |                   |
----------------------------------------------------------------------------

What I want to do is combine each block of sentences into a paragraph, so that each block is output as a single paragraph:

["Sentence 1 here.", "Sentence 2 here.", ""Sentence 3 here.", "Sentence 4 here."", "Sentence 5 here."]

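Some of the sentences are quotations that carry on into the next sentence, while others are fully embedded within a single sentence.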

So far I have this:

import csv

def read_file():
    # newline='' lets the csv module handle line endings itself (Python 3)
    with open('test.csv', newline='') as file:
        reader = csv.reader(file)
        included_cols = [3]

        for row in reader:
            content = list(row[i] for i in included_cols)
            print(content)
    return content

read_file()

But this only prints each sentence as its own one-element list, like this:

['Sentence 1 here.']
['Sentence 2 here.']

Any suggestions are appreciated.

Answer 1

Just take the fourth element of each row and build a single list from those elements:

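import csv

def read_file():
    # newline='' lets the csv module handle line endings itself (Python 3)
    with open('test.csv', newline='') as file:
        reader = csv.reader(file)
        # keep the fourth column of every row where it exists and is non-empty
        return [row[3] for row in reader if len(row) > 3 and row[3]]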

This should output:

['sentence', 'Sentence 1 here.', 'Sentence 2 here.', ' "Sentence 3 here.', ' Sentence 4 here."', ' Sentence 5 here.', 'Sentence 1 here.', 'Sentence 2 here.']
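The list above still contains the header cell `'sentence'` and some leading spaces. A minimal sketch of one way to drop both, assuming the file is comma-separated with a single header row; the `SAMPLE` data and the function name `read_sentences` are made up for illustration and are not part of the original answer:

```python
import csv
import io

# Hypothetical sample standing in for test.csv (comma-separated is an assumption).
SAMPLE = (
    "publish_date,sentence_number,character_count,sentence\n"
    "1,,,\n"
    "02/01/2012 00:12:00,-1,0, Sentence 1 here.\n"
    "02/01/2012 00:12:00,0,14, Sentence 2 here.\n"
    "end,,,\n"
)

def read_sentences(handle):
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row, if there is one
    # strip surrounding whitespace and drop rows whose fourth column is blank
    return [row[3].strip() for row in reader if len(row) > 3 and row[3].strip()]

sentences = read_sentences(io.StringIO(SAMPLE))
print(sentences)
```

Passing a file handle instead of a filename keeps the function easy to try on in-memory data; for a real file, open it with `open('test.csv', newline='')` and pass the handle in.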

If you want to split the sentences into separate sections, one per paragraph:

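import csv
from itertools import groupby

def read_file():
    with open('test.csv', newline='') as file:
        reader = csv.reader(file)
        paras = (row[3] for row in reader if len(row) > 3)
        # rows with an empty fourth column act as separators between sections;
        # groupby keeps only the runs of non-empty sentences (k is True)
        return [list(v) for k, v in groupby(paras, key=lambda x: x != "") if k]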

groupby should output something like this:

[['sentence', 'Sentence 1 here.', 'Sentence 2 here.', 
 ' "Sentence 3 here.', ' Sentence 4 here."', ' Sentence 5 here.'],
 ['Sentence 1 here.', 'Sentence 2 here.']]
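If each section should end up as one paragraph string rather than a list of sentences, the groups can be joined. The following is only a sketch under assumptions: the file is comma-separated, it has one header row, and the marker rows (`1`, `end`, ...) have an empty fourth column. `SAMPLE` and `read_paragraphs` are hypothetical names used for illustration:

```python
import csv
import io
from itertools import groupby

# Hypothetical sample standing in for test.csv.
SAMPLE = (
    "publish_date,sentence_number,character_count,sentence\n"
    "1,,,\n"
    "02/01/2012 00:12:00,-1,0,Sentence 1 here.\n"
    "02/01/2012 00:12:00,0,14,Sentence 2 here.\n"
    "end,,,\n"
    "2,,,\n"
    "02/01/2012 00:12:00,-1,0,Sentence 3 here.\n"
    "end,,,\n"
    "end,,,\n"
)

def read_paragraphs(handle):
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row
    sentences = (row[3].strip() for row in reader if len(row) > 3)
    # bool("") is False, so blank cells split the stream into runs of sentences;
    # each run of non-empty sentences is joined into one paragraph string
    return [" ".join(group)
            for has_text, group in groupby(sentences, key=bool)
            if has_text]

paragraphs = read_paragraphs(io.StringIO(SAMPLE))
print(paragraphs)
```

Each group is joined inside the comprehension, before `groupby` advances to the next one, which matters because a group iterator is no longer valid once the next group is requested.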

Python: joining CSV rows into sections of different sizes
