Question

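I have a file of transcripts in the format

(name 1): (sentence)\n (<-- There can be multiples of this pattern)

(name 2): (sentence)\n (sentence)\n

and so on. I need all of the sentences. So far I have gotten it to work by hard-coding the names that appear in the file, but I need it to be generic.

I am on Python 3.6 using re. If anyone knows how to do this with spacy, that would also be a big help.

For an empty statement, I just want to grab the \n that follows it and put it in its own string. I think I will also have to pick up the tape information, like the lines near the end of the sample below, because I can't think of a way to tell whether a line is part of someone's speech or not. Also, sometimes there is more than one word between the start of the line and the colon.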

Mock data:

CRO: How far are you from the World Trade Center, how many blocks, about? Three or four blocks?
63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01
CALLER:
CRO: You're welcome. Thank you.
OPERATOR: Bye.
CRO: Bye.
RECORDER: The preceding portion of tape concludes at 0913 hours, 36 seconds.
This tape will continue on side B.
OPERATOR NEWELL: blah blah.
GUY IN DESK: I speak words!

Answer 1

You can use a lookahead expression that looks for the next name, that is, the same pattern at the start of a line followed by a colon:

s = '''CRO: How far are you from the World Trade Center, how many blocks, about? Three or four blocks?
63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01
CALLER:
CRO: You're welcome. Thank you.
OPERATOR: Bye.
CRO: Bye.
RECORDER: The preceding portion of tape concludes at 0913 hours, 36 seconds.
This tape will continue on side B.
OPERATOR NEWELL: blah blah.
GUY IN DESK: I speak words!'''
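import re
from pprint import pprint

# ^([^:\n]+):  a speaker label at the start of a line, up to the first colon
# \s*(.*?)     the speech, matched lazily (DOTALL lets it span several lines)
# (?=...|\Z)   stop just before the next label, or at the end of the string
pprint(re.findall(r'^([^:\n]+):\s*(.*?)(?=^[^:\n]+?:|\Z)', s, flags=re.MULTILINE | re.DOTALL), width=200)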

This outputs:

[('CRO', 'How far are you from the World Trade Center, how many blocks, about? Three or four blocks?\n63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01\n'),
 ('CALLER', ''),
 ('CRO', "You're welcome. Thank you.\n"),
 ('OPERATOR', 'Bye.\n'),
 ('CRO', 'Bye.\n'),
 ('RECORDER', 'The preceding portion of tape concludes at 0913 hours, 36 seconds.\nThis tape will continue on side B.\n'),
 ('OPERATOR NEWELL', 'blah blah.\n'),
 ('GUY IN DESK', 'I speak words!')]
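
If only the spoken text is needed, the tuples above can be post-processed. The following is a minimal sketch that reuses the pattern from this answer; the helper name `extract_speeches` is hypothetical, and it assumes that empty turns such as `CALLER:` should be dropped and that continuation lines should be joined with a space. Note that a continuation line that itself contains a colon would still be treated as a new speaker.

```python
import re

# Same pattern as in the answer above: a speaker label at the start of a line,
# then everything up to the next label or the end of the text.
TURN = re.compile(r'^([^:\n]+):\s*(.*?)(?=^[^:\n]+?:|\Z)',
                  flags=re.MULTILINE | re.DOTALL)

def extract_speeches(text):
    """Return the non-empty speech of each turn (hypothetical helper)."""
    speeches = []
    for _name, speech in TURN.findall(text):
        # Collapse runs of whitespace, including newlines, into single spaces.
        cleaned = ' '.join(speech.split())
        if cleaned:  # assumption: skip turns with no speech, e.g. "CALLER:"
            speeches.append(cleaned)
    return speeches

sample = ("CRO: Bye.\n"
          "CALLER:\n"
          "RECORDER: The tape ends here.\n"
          "This tape will continue on side B.\n"
          "GUY IN DESK: I speak words!")
print(extract_speeches(sample))
```

Keeping the speaker name instead of discarding it is a one-line change: append `(_name, cleaned)` rather than `cleaned`.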

Answer 2

I would use regular expressions, with nested for loops inside a list comprehension, to capture all the sentences, as shown in the code below.

s ='''(name 1): (sentence1 here)\n (<-- There can be multiples of this pattern)

(name 2): (sentence2 here)\n (sentence3 here)\n'''

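import re

# Split on the "(name N):" labels, then collect every parenthesised group in each piece.
[y.strip('()') for x in re.split(r'\(name \d+\):', s) for y in re.findall(r'\([^)]+\)', x)]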

>>> ['sentence1 here',
    '<-- There can be multiples of this pattern',
    'sentence2 here',
    'sentence3 here']
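
The same split-based idea might be applied to data that looks like the mock transcript rather than the literal placeholder format. This is only a sketch under a stated assumption: a speaker label is one or more upper-case words at the start of a line followed by a colon. The names `LABEL` and `split_on_labels` are hypothetical, and lines without a label (such as tape headers) stay attached to the preceding speech.

```python
import re

# Assumption: a speaker label is one or more upper-case words at the start
# of a line, followed by a colon (e.g. "CRO:", "OPERATOR NEWELL:").
LABEL = re.compile(r'^[A-Z]+(?: [A-Z]+)*:', flags=re.MULTILINE)

def split_on_labels(text):
    """Split text on speaker labels and return the non-empty chunks (hypothetical helper)."""
    chunks = LABEL.split(text)
    # Drop the empty piece before the first label and any empty turns.
    return [chunk.strip() for chunk in chunks if chunk.strip()]

sample = ("CRO: You're welcome. Thank you.\n"
          "OPERATOR: Bye.\n"
          "OPERATOR NEWELL: blah blah.\n"
          "GUY IN DESK: I speak words!")
print(split_on_labels(sample))
```

Restricting labels to upper-case words avoids splitting on a colon that appears inside ordinary speech, at the cost of missing labels written in mixed case.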

Getting sentences from a transcript file
