Joining CSV rows into sections of varying size with Python

Date: 2015-05-09 16:06:54

Tags: python regex csv

I have a CSV file like this:

|     publish_date     |sentence_number|character_count|    sentence       |
----------------------------------------------------------------------------
|          1           |               |               |                   |
----------------------------------------------------------------------------
| 02/01/2012  00:12:00 |      -1       |       0       | Sentence 1 here.  |
----------------------------------------------------------------------------
| 02/01/2012  00:12:00 |       0       |      14       | Sentence 2 here.  |
----------------------------------------------------------------------------
| 02/01/2012  00:12:00 |       1       |      28       | "Sentence 3 here. |
----------------------------------------------------------------------------
| 02/01/2012  00:12:00 |       2       |      42       | Sentence 4 here." |
----------------------------------------------------------------------------
| 02/01/2012  00:12:00 |       3       |      56       | Sentence 5 here.  |
----------------------------------------------------------------------------
|         end          |               |               |                   |
----------------------------------------------------------------------------
|          2           |               |               |                   |
----------------------------------------------------------------------------
| 02/01/2012  00:12:00 |      -1       |       0       | Sentence 1 here.  |
----------------------------------------------------------------------------
| 02/01/2012  00:12:00 |       0       |      14       | Sentence 2 here.  |
----------------------------------------------------------------------------
|         end          |               |               |                   |
----------------------------------------------------------------------------
|         end          |               |               |                   |
----------------------------------------------------------------------------

What I want to do is combine each block of sentences into a paragraph, so that each block outputs a single paragraph:

["Sentence 1 here.", "Sentence 2 here.", ""Sentence 3 here.", "Sentence 4 here."", "Sentence 5 here."]

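Some sentences are quotations that continue into a new sentence, while others are quotations embedded entirely within a single sentence.

So far, I have this: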

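import csv

def read_file():
    # newline='' is the mode the csv module expects in Python 3
    with open('test.csv', newline='') as f:
        reader = csv.reader(f)
        included_cols = [3]

        for row in reader:
            # Builds a new one-element list for every row
            content = [row[i] for i in included_cols]
            print(content)
    # Only the list from the last row is returned
    return content

read_file()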

But this only outputs a separate list for each sentence, like this:

['Sentence 1 here.']
['Sentence 2 here.']

Any suggestions are appreciated.

1 answer:

Answer 0: (score: 1)

Just take the fourth element from each row to build a list of those elements:

import csv

def read_file():
    with open('test.csv', newline='') as f:
        reader = csv.reader(f)
        # Keep the fourth column of every row that has a non-empty one
        return [row[3] for row in reader if len(row) > 3 and row[3]]

This should output:

['sentence', 'Sentence 1 here.', 'Sentence 2 here.', ' "Sentence 3 here.', ' Sentence 4 here."', ' Sentence 5 here.', 'Sentence 1 here.', 'Sentence 2 here.']
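The list above still contains the header cell (`'sentence'`) and some leading spaces. A minimal sketch of one way to drop both follows, assuming the first row is always a header. The sample data and the name `read_sentences` are made up for illustration, and the code targets Python 3.

```python
import csv
import io

# Hypothetical sample data; note the header row and the stray leading space.
SAMPLE_ROWS = """publish_date,sentence_number,character_count,sentence
02/01/2012 00:12:00,-1,0,Sentence 1 here.
02/01/2012 00:12:00,0,14, Sentence 2 here.
end,,,
"""

def read_sentences(lines):
    """Return the stripped, non-empty fourth-column values, skipping the header."""
    reader = csv.reader(lines)
    next(reader, None)  # drop the header row (assumed to be the first line)
    cleaned = (row[3].strip() for row in reader if len(row) > 3)
    return [s for s in cleaned if s]

sentences = read_sentences(io.StringIO(SAMPLE_ROWS))
print(sentences)
```

Passing any iterable of lines (an open file works as well as `io.StringIO`) keeps the function easy to try out without touching the real file.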

If you want to split this into sections:

import csv
from itertools import groupby

def read_file():
    with open('test.csv', newline='') as f:
        reader = csv.reader(f)
        paras = (row[3] for row in reader if len(row) > 3)
        # Empty cells mark section boundaries; keep only the non-empty runs
        return [list(v) for k, v in groupby(paras, key=lambda x: x != "") if k]

groupby should output something like this:

[['sentence', 'Sentence 1 here.', 'Sentence 2 here.', 
 ' "Sentence 3 here.', ' Sentence 4 here."', ' Sentence 5 here.'],
 ['Sentence 1 here.', 'Sentence 2 here.']]
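If the goal is one paragraph string per block rather than a list of sentences, the groups can be joined. The sketch below runs the same `groupby` idea on an in-memory sample so it can be tried without a file. It assumes that marker rows such as `1` and `end` leave the fourth column empty; the sample data and the name `group_paragraphs` are hypothetical, and the code targets Python 3.

```python
import csv
import io
from itertools import groupby

# Hypothetical sample mimicking the layout in the question: marker rows
# ("1", "2", "end") leave the sentence column empty.
SAMPLE = """publish_date,sentence_number,character_count,sentence
1,,,
02/01/2012 00:12:00,-1,0,Sentence 1 here.
02/01/2012 00:12:00,0,14,Sentence 2 here.
end,,,
2,,,
02/01/2012 00:12:00,-1,0,Sentence 3 here.
end,,,
"""

def group_paragraphs(lines):
    """Join each run of consecutive non-empty sentences into one paragraph."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row (assumed to be the first line)
    sentences = (row[3].strip() for row in reader if len(row) > 3)
    # key=bool is False for empty cells, so they act as separators
    return [" ".join(group)
            for has_text, group in groupby(sentences, key=bool)
            if has_text]

paragraphs = group_paragraphs(io.StringIO(SAMPLE))
print(paragraphs)
```

Because `groupby` only merges consecutive items with the same key, several empty marker rows in a row collapse into a single boundary, so no empty paragraphs are produced.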