How to capture all sentences in a file formatted as (name): (sentence)\n(name):

Time: 2018-10-18 02:33:49

Tags: python regex spacy

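I have transcript files in the format

(name): (sentence)\n (<- this pattern may repeat multiple times)

(name): (sentence)\n
  (sentence)\n

and so on. I need all of the sentences. So far I have gotten it working by hard-coding the names that appear in the file, but I need to make it generic.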

utterances = re.findall(r'(?:CALLER: |\nCALLER:\nCRO: |\nCALLER:\nOPERATOR: |\nCALLER:\nRECORDER: |RECORDER: |CRO: |OPERATOR: )(.*?)(?:CALLER: |RECORDER : |CRO: |OPERATOR: |\nCALLER:\n)', raw_calls, re.DOTALL)

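This is Python 3.6 using re. Alternatively, if anyone knows how to do this with spaCy, that would be a great help. Thanks.

For an empty utterance, I just want to grab the \n after it and put it in its own string. I think I will also have to grab the tape information at the end this way, since I cannot think of a way to tell whether a line belongs to someone's speech. Sometimes there are also multiple words between the start of the line and the colon.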

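Mock data:

CRO: How far are you from the World Trade Center, how many blocks, about? Three or four blocks?
63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01
CALLER:
CRO: You're welcome. Thank you.
OPERATOR: Bye.
CRO: Bye.
RECORDER: The preceding portion of tape concludes at 0913 hours, 36 seconds.
This tape will continue on side B.
OPERATOR NEWELL: blah blah.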

2 Answers:

Answer 0: (score: 1)

You can use a lookahead expression that looks for the same name pattern at the start of a line, followed by a colon:

s = '''CRO: How far are you from the World Trade Center, how many blocks, about? Three or four blocks?
63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01
CALLER:
CRO: You're welcome. Thank you.
OPERATOR: Bye.
CRO: Bye.
RECORDER: The preceding portion of tape concludes at 0913 hours, 36 seconds.
This tape will continue on side B.
OPERATOR NEWELL: blah blah.
GUY IN DESK: I speak words!'''
import re
from pprint import pprint
pprint(re.findall(r'^([^:\n]+):\s*(.*?)(?=^[^:\n]+?:|\Z)', s, flags=re.MULTILINE | re.DOTALL), width=200)

This outputs:

[('CRO', 'How far are you from the World Trade Center, how many blocks, about? Three or four blocks?\n63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01\n'),
 ('CALLER', ''),
 ('CRO', "You're welcome. Thank you.\n"),
 ('OPERATOR', 'Bye.\n'),
 ('CRO', 'Bye.\n'),
 ('RECORDER', 'The preceding portion of tape concludes at 0913 hours, 36 seconds.\nThis tape will continue on side B.\n'),
 ('OPERATOR NEWELL', 'blah blah.\n'),
 ('GUY IN DESK', 'I speak words!')]
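
If only clean sentences are needed, the tuples above can be post-processed. The following is a minimal sketch rather than part of the original answer: the helper name `extract_utterances` and the shortened sample text are hypothetical, and collapsing continuation lines into single spaces is an assumption about the desired output.

```python
import re

# Same pattern as above: a speaker label at the start of a line, then
# everything up to the next label or the end of the text.
PATTERN = re.compile(r'^([^:\n]+):\s*(.*?)(?=^[^:\n]+?:|\Z)',
                     flags=re.MULTILINE | re.DOTALL)

def extract_utterances(text):
    """Return (speaker, utterance) pairs with whitespace normalized."""
    pairs = []
    for speaker, body in PATTERN.findall(text):
        # Join continuation lines with single spaces and trim both ends.
        utterance = ' '.join(body.split())
        pairs.append((speaker.strip(), utterance))
    return pairs

# Hypothetical, shortened sample in the same shape as the mock data.
sample = (
    "CRO: Bye.\n"
    "CALLER:\n"
    "RECORDER: The preceding portion of tape concludes.\n"
    "This tape will continue on side B.\n"
    "OPERATOR NEWELL: blah blah."
)
pairs = extract_utterances(sample)
sentences = [u for _, u in pairs if u]  # keep only non-empty utterances
```

Empty utterances such as `CALLER:` stay in `pairs` as empty strings, so they can be kept or filtered out depending on what is needed downstream.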

Answer 1: (score: 0)

You never provided mock data, so I used the following for testing purposes:

name1: Here is a sentence.
name2: Here is another stuff: sentence
which happens to have two lines
name3: Blah.

We can try matching with the following pattern:

^\S+:\s+((?:(?!^\S+:).)+)

This can be explained as:

^\S+:\s+           match the name, followed by a colon, followed by one or more whitespace characters
((?:(?!^\S+:).)+)  then match and capture everything up until the next name

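Note that this handles the edge case of the last sentence: no further name follows it, so the negative lookahead never blocks the match and all of the remaining content is captured.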
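
The breakdown above can also be written directly into the pattern. This is a sketch of the same expression compiled with `re.VERBOSE` so each part carries its own comment; the name `NAME_THEN_TEXT` and the two-line sample text are made up for illustration.

```python
import re

NAME_THEN_TEXT = re.compile(r'''
    ^\S+:          # a name with no whitespace, then a colon, at a line start
    \s+            # one or more whitespace characters
    (              # capture group 1: the utterance
      (?:
        (?!^\S+:)  # stop as soon as a new "name:" begins at a line start
        .          # otherwise consume one character (newlines included)
      )+
    )
''', flags=re.VERBOSE | re.DOTALL | re.MULTILINE)

text = "name1: Here is a sentence.\nname2: Blah."
matches = NAME_THEN_TEXT.findall(text)
```

Because `\S+` cannot match whitespace, a label containing a space (such as `OPERATOR NEWELL:` in the question's data) is not recognized as a name by this pattern, and that line is treated as a continuation of the previous utterance.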

Code example:

import re
line = "name1: Here is a sentence.\nname2: Here is another stuff: sentence\nwhich happens to have two lines\nname3: Blah."
matches = re.findall(r'^\S+:\s+((?:(?!^\S+:).)+)', line, flags=re.DOTALL|re.MULTILINE)
print(matches)

['Here is a sentence.\n', 'Here is another stuff: sentence\nwhich happens to have two lines\n', 'Blah.']
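
As a sketch of how the same negative-lookahead idea might be adapted to the transcript in the question, which has multi-word names such as `OPERATOR NEWELL` and empty utterances such as `CALLER:`, the variant below swaps `\S+` for `[^:\n]+` and allows an empty capture. The pattern, the function name `split_transcript`, and the sample text are assumptions for illustration, not part of the answer above.

```python
import re

# Variant of the pattern above (an assumption, not the answer's own code):
#   [^:\n]+  lets a name contain spaces, e.g. "OPERATOR NEWELL"
#   [ \t]*   skips spaces after the colon without consuming the newline
#   (...)*   allows an utterance to be empty, e.g. "CALLER:"
VARIANT = re.compile(r'^([^:\n]+):[ \t]*((?:(?!^[^:\n]+:).)*)',
                     flags=re.DOTALL | re.MULTILINE)

def split_transcript(text):
    """Return (name, utterance) pairs with edge whitespace stripped."""
    return [(name, body.strip()) for name, body in VARIANT.findall(text)]

# Hypothetical sample with an empty utterance and a continuation line.
sample = (
    "CRO: Bye.\n"
    "CALLER:\n"
    "OPERATOR NEWELL: blah blah.\n"
    "still talking"
)
result = split_transcript(sample)
```

One caveat with this variant: any line that contains a colon, for example a continuation line with a timestamp, would be taken as the start of a new speaker, so it only fits transcripts where continuation lines have no colons.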
