Question

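I have a very long text that I need to split into paragraphs, and then create a .csv in which each cell holds one paragraph. This is what I have tried: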

paragraphs = str(chunks)
print (paragraphs)

Paragraphs1 = paragraphs.split("^\n\n")

data1 = zip(Paragraphs1)

with open('Paragraphs1.csv','wb') as f:
    w=csv.writer(f)
    w.writerow(['Paragraphs'])
    for row in data1:
        w.writerow(row)

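This results in a .csv with two rows of unparsed paragraphs. I have also tried using '\n': it produces a new sentence per cell in the .csv, but the .csv should preserve the paragraph structure. Does anyone have a better way of doing this?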

Answer 1

str.split() does not take regular expressions. You are trying to split the text on the literal characters '^\n\n':

>>> 'Text with newlines\n\nand a caret at the end^\n\nwhich will be split'.split('^\n\n')
['Text with newlines\n\nand a caret at the end', 'which will be split']

If you want to split using a regular expression, use the re module:

import re

re.split(r'^\n\n', paragraphs, flags=re.MULTILINE)

The re.MULTILINE flag ensures that ^ matches after every newline, not just at the start of the string.
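As a small illustration of what the flag changes, here is a sketch that runs the same ^ pattern with and without re.MULTILINE. The sample string and the variable names are made up for this example.

```python
import re

# Made-up sample text with three lines.
text = 'first line\nsecond line\nthird line'

# Without the flag, ^ anchors only at the very start of the string.
without_flag = re.findall(r'^\w+', text)
print(without_flag)  # ['first']

# With re.MULTILINE, ^ also anchors right after each newline.
with_flag = re.findall(r'^\w+', text, flags=re.MULTILINE)
print(with_flag)  # ['first', 'second', 'third']
```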

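Note that the pattern r'^\n\n' assumes you want to split on three consecutive newlines: ^ only matches right after a newline (or at the start of the string), and it must then be followed by two more. Demo: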

>>> import re
>>> re.split(r'^\n\n', 'Cool\n\n\nNew paragraph\nruns here\n\n\nAnother paragraph?', flags=re.MULTILINE)
['Cool\n', 'New paragraph\nruns here\n', 'Another paragraph?']

If two newlines are enough, use $\n\n instead:

>>> re.split(r'$\n\n', 'Cool\n\nNew paragraph\nruns here\n\nAnother paragraph?', flags=re.MULTILINE)
['Cool', 'New paragraph\nruns here', 'Another paragraph?']
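To tie the split back to the .csv from the question, here is a minimal sketch that assumes Python 3 and that paragraphs are separated by two newlines. The helper names split_paragraphs and write_paragraphs_csv are hypothetical, and the sample text is the one from the demo above. It writes to an in-memory buffer so the result can be inspected directly.

```python
import csv
import io
import re


def split_paragraphs(text):
    """Split on two consecutive newlines, using the pattern from the answer."""
    return re.split(r'$\n\n', text, flags=re.MULTILINE)


def write_paragraphs_csv(paragraphs, fileobj):
    """Write a header row, then one paragraph per row in a single column."""
    writer = csv.writer(fileobj)
    writer.writerow(['Paragraphs'])
    for paragraph in paragraphs:
        # Wrap each paragraph in a list so it is written as one cell.
        writer.writerow([paragraph])


text = 'Cool\n\nNew paragraph\nruns here\n\nAnother paragraph?'
paragraphs = split_paragraphs(text)

# In-memory stand-in for a file. For a real file on Python 3, pass
# open('Paragraphs1.csv', 'w', newline='') instead (text mode, not 'wb').
buffer = io.StringIO(newline='')
write_paragraphs_csv(paragraphs, buffer)
print(buffer.getvalue())
```

Each paragraph is passed to writerow() as a one-element list, so a paragraph that contains single newlines is quoted by the csv module and stays in one cell.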
