Question

For my master's thesis, I need to extract (speaker, text) tuples from company earnings call transcripts.

The transcripts have the following format:

OPERATOR: Some text with numbers, special characters and linebreaks.

NAME, COMPANY, POSITION: Some text with numbers, special characters and linebreaks.

NAME: Some text with numbers, special characters and linebreaks.

I want to extract all (speaker, text) tuples from a document. For example:

[("OPERATOR", "Some text with numbers, special characters and linebreaks."), ..]

So far, I have tried several regular expressions with Python's re.findall function.

Here is a sample excerpt:

example = """OPERATOR: Good day, ladies and gentlemen, and welcome to the first-quarter 2012
Agilent Technologies earnings conference call. My name is Keith, and I will be
your operator for today. At this time, all participants are in a listen-only
mode. Later on, we will have a question and answer session. (Operator
Instructions) As a reminder, today's conference is being recorded for replay
purposes.

And I would now like to turn the conference over to your host for today, Ms.
Alicia Rodriguez, Vice President of Investor Relations. Please go ahead, ma'am.

ALICIA RODRIGUEZ, VP - IR, AGILENT TECHNOLOGIES INC: Thank you, Keith, and
welcome, everyone, to Agilent's first quarter conference call for fiscal-year
2012. With me are Agilent's President and CEO, Bill Sullivan, as well as Senior
Vice President and CFO, Didier Hirsch. Joining in the Q&A after Didier's
comments will be Agilent's Chief Operating Officer, Ron Nersesian, and the
Presidents of our Electronic Measurement, Life Sciences, and Chemical Analysis
Groups -- Guy Sene, Nick Roelofs, and Mike McMullen.

You can find the press release and information to supplement today's discussion
on our website at www.investor.agilent.com. While there, please click on the
link for financial results, where you will find revenue breakouts and historical
financials for Agilent's operations. We will also post a copy of the prepared
remarks following this call. For any non-GAAP financial measures, you will find
the most directly comparable GAAP financial metrics and reconciliations on our
website.

We will make forward-looking statements about the financial performance of the
Company. These statements are subject to risks and uncertainties, and are only
valid as of today. The Company assumes no obligation to update them. Please look
at the Company's recent SEC filings for a more complete picture of our risks and
other factors.

Before turning the call over to Bill, I would like to remind you that Agilent
will host its annual analysts meeting in New York City on March 8. Details about
the meeting and webcast will be available on the Agilent investor relations
website two weeks prior.

And now, I'd like to turn the call over to Bill.

BILL SULLIVAN, PRESIDENT AND CEO, AGILENT TECHNOLOGIES INC: Thanks, Alicia, and
hello, everyone. Agilent's Q1 orders of $1.62 billion were flat versus last
year. Q1 revenues of $1.64 billion were up 7% year-over-year. Non-GAAP EPS was
$0.69 per share, and operating margin was 19%."""

Here is my code:

import re

# First approach:
r = re.compile(r"^([^a-z:]+?):([\s\S]+?)", flags=re.MULTILINE)
re.findall(r, example)

# Second approach:
r = re.compile(r"^([^a-z:]+?):([\s\S]+)", flags=re.MULTILINE)
re.findall(r, example)

The problem with the first (non-greedy) approach is that it does not capture the speaker's full text.

The problem with the second (greedy) approach is that it does not stop when the next speaker appears.
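
Both failure modes can be reproduced on a much shorter transcript. The following is only a sketch: the sample string and the speaker JOHN DOE are invented for illustration and are not part of the real data.

```python
import re

# Shortened, made-up transcript (hypothetical data for illustration only).
sample = (
    "OPERATOR: Welcome to the call.\n"
    "Please hold.\n"
    "\n"
    "JOHN DOE, CEO: Thanks, everyone."
)

# First approach: the lazy [\s\S]+? is satisfied by a single character,
# so each text group contains only the space after the colon.
lazy = re.findall(r"^([^a-z:]+?):([\s\S]+?)", sample, flags=re.MULTILINE)

# Second approach: the greedy [\s\S]+ runs to the end of the string,
# so the second speaker ends up inside the first speaker's text.
greedy = re.findall(r"^([^a-z:]+?):([\s\S]+)", sample, flags=re.MULTILINE)

print([text for _, text in lazy])
print(len(greedy), "JOHN DOE, CEO:" in greedy[0][1])
```

Note also that [^a-z:] does not exclude newlines, so with these patterns a speaker group can pick up the blank line that precedes it.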

Edit: additional information

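The text group can also contain colons. In some cases a colon appears immediately after the first word of a line, e.g. "for\nexample: ..."
The speaker group can also span multiple lines, e.g. when the company name and the position description are very long.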

Answer 1

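You should avoid matching with [\s\S]+, because it matches any character, including newlines, and therefore runs across speaker boundaries.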
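
To make the difference concrete, here is a minimal sketch comparing [\s\S]+ with .+ on a two-line string (the string itself is made up for illustration). Without the re.DOTALL flag, the dot does not match a newline, which is what allows a pattern to proceed one line at a time.

```python
import re

two_lines = "first line\nsecond line"  # hypothetical sample text

# [\s\S] matches every character, so the match crosses the newline.
everything = re.match(r"[\s\S]+", two_lines).group()

# Without re.DOTALL, "." stops at the newline, so only the first line matches.
first_only = re.match(r".+", two_lines).group()

print(repr(everything))
print(repr(first_only))
```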

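For the second capturing group, you could match .* followed by a repeated group with a negative lookahead, (?:(?!\n[^a-z\r\n]+:)[\r\n].*)*, which keeps consuming lines as long as the next line does not start with a speaker label matching [^a-z\r\n]+:. Compile the pattern with re.MULTILINE so that ^ matches at the start of every line.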

^([^a-z\r\n]+):(.*(?:(?!\n[^a-z\r\n]+:)[\r\n].*)*)
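
The pattern above can be tried out roughly like this. It is a sketch under a few assumptions: the sample is a shortened, partly invented version of the excerpt from the question, the name SPEAKER_TURN is arbitrary, and stripping whitespace from both groups is a convenience rather than part of the pattern.

```python
import re

# Pattern from this answer; re.MULTILINE makes ^ match at every line start.
SPEAKER_TURN = re.compile(
    r"^([^a-z\r\n]+):(.*(?:(?!\n[^a-z\r\n]+:)[\r\n].*)*)",
    flags=re.MULTILINE,
)

# Shortened sample, loosely based on the excerpt in the question.
sample = (
    "OPERATOR: Good day, and welcome to the\n"
    "earnings call.\n"
    "\n"
    "Please go ahead, ma'am.\n"
    "\n"
    "ALICIA RODRIGUEZ, VP - IR, AGILENT TECHNOLOGIES INC: Thank you, Keith. For\n"
    "example: see our website.\n"
    "\n"
    "BILL SULLIVAN, PRESIDENT AND CEO, AGILENT TECHNOLOGIES INC: Thanks, Alicia."
)

# Strip the space after the colon and the trailing blank line of each turn.
pairs = [
    (speaker.strip(), text.strip())
    for speaker, text in SPEAKER_TURN.findall(sample)
]

for speaker, text in pairs:
    print(speaker, "->", repr(text))
```

The line starting with "example:" stays inside the second speaker's text because it begins with a lowercase letter and so cannot match [^a-z\r\n]+:. A text line that starts with an all-uppercase word followed by a colon would still be treated as a new speaker, and speaker labels that wrap onto a second line are not handled by this pattern.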

