Question

I want to extract the lines between a keyword and a sentence from text data. Here is my data:

CUSTOMER SUPPLIED DATA: 
- topic: Sign in & Password Support
- First Name: Brenda
- Last Name: Delacruz
- Account number: xxxxxxxxx
- U-verse 4-digit PIN: My PIN is
- 4 digit PIN: xxxx
- Email: deedelacruz28806@yahoo.com
- I need help with: Forgot password or ID

  ***  System::[chat.automatonClientOutcome] Hello! How may I help you today?   ***  System::[chat.queueWaitDisplayed] We are currently experiencing very high chat volumes which may cause long delays. An agent will be with you as soon as possible.

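Please help me extract the lines under the keyword "CUSTOMER SUPPLIED DATA:" that come before the *** System line begins (that is, the lines between CUSTOMER SUPPLIED DATA: and the *** System line).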

I tried the following code:

import re

m = re.search(r'CUSTOMER SUPPLIED DATA:\s*([^\n]+)', dt["chat_consolidation"][546])

m.group(1)

I only get one of the lines between CUSTOMER SUPPLIED DATA: and the *** System line.

The output is as follows:

[out]: - topic: Sign in & Password Support

But the output I need should look like this:

[Out]: - topic: Sign in & Password Support
- First Name: Brenda
- Last Name: Delacruz
- Account number: xxxxxxxxx
- U-verse 4-digit PIN: My PIN is
- 4 digit PIN: xxxx
- Email: deedelacruz28806@yahoo.com
- I need help with: Forgot password or ID

Thanks in advance for your help.

Answer 1

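For this you need the regex module (the third-party package, not the built-in re), because the pattern below uses \K and \G, which re does not support.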

x="""CUSTOMER SUPPLIED DATA: 
- topic: Sign in & Password Support
- First Name: Brenda
- Last Name: Delacruz
- Account number: xxxxxxxxx
- U-verse 4-digit PIN: My PIN is
- 4 digit PIN: xxxx
- Email: deedelacruz28806@yahoo.com
- I need help with: Forgot password or ID

  ***  System::[chat.automatonClientOutcome] Hello! How may I help you today?   ***  System::[chat.queueWaitDisplayed] We are currently experiencing very high chat volumes which may cause long delays. An agent will be with you as soon as possible.
- topic: Sign in & Password Support
- First Name: Brenda  
  """
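import regex  # third-party module: pip install regex

# \K drops the header from the match; \G anchors each following match to the end of the previous one
print(regex.findall(r"CUSTOMER SUPPLIED DATA: \n\K|\G(?!^)(-[^\n]+)\n", x, flags=regex.VERSION1))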

Output: ['', '- topic: Sign in & Password Support', '- First Name: Brenda', '- Last Name: Delacruz', '- Account number: xxxxxxxxx', '- U-verse 4-digit PIN: My PIN is', '- 4 digit PIN: xxxx', '- Email: deedelacruz28806@yahoo.com', '- I need help with: Forgot password or ID']

See the demo:

https://regex101.com/r/naH3C7/2
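As a small check of why the built-in re module cannot run this pattern as written, the sketch below tries to compile the \K part with re. It assumes a reasonably recent Python 3, where unknown letter escapes in a pattern are rejected; the variable name supports_k is only for illustration.

```python
import re

# Try to compile the \K part of the pattern with the built-in re module.
# Assumption: Python 3.6 or later, where an unknown escape such as \K
# raises re.error instead of being silently accepted.
try:
    re.compile(r"CUSTOMER SUPPLIED DATA: \n\K")
    supports_k = True
except re.error as exc:
    supports_k = False
    print("re rejected the pattern:", exc)
```

If supports_k ends up False, the pattern needs the third-party regex module (or a rewrite that avoids \K and \G).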

Answer 2

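@vks is right: if you want to split the block into separate lines like that, the regex module is the better choice. However, if all you really want is a single string containing everything between CUSTOMER SUPPLIED DATA: and "***  System:", changing your regexp to something like this also works: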

re.search(r"CUSTOMER SUPPLIED DATA:\s*(.+?)\*\*\*  System:", x, re.DOTALL)

With "([^\n]+)" you are asking it to capture everything up to the first \n, which is probably not what you want.
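To make the re.DOTALL approach concrete, here is a minimal runnable sketch using only the built-in re module. The sample text is a shortened copy of the question's data, and the helper name extract_customer_block is hypothetical. It also relaxes the whitespace around the *** marker with \s*, which is an assumption about how consistent the real data is.

```python
import re

# Shortened sample mirroring the question's data (note the trailing space
# after the header and the blank line before the System line).
text = """CUSTOMER SUPPLIED DATA: 
- topic: Sign in & Password Support
- First Name: Brenda
- Last Name: Delacruz

  ***  System::[chat.automatonClientOutcome] Hello! How may I help you today?
- topic: Sign in & Password Support
"""

def extract_customer_block(chat):
    """Return the text between the header and the first '*** System:' marker, or None."""
    # re.DOTALL lets '.' match newlines; the lazy '.+?' stops at the first marker.
    m = re.search(r"CUSTOMER SUPPLIED DATA:\s*(.+?)\s*\*\*\*\s+System:", chat, re.DOTALL)
    return m.group(1) if m else None

block = extract_customer_block(text)
# Split the single string into individual lines if a list is more convenient.
lines = block.splitlines() if block is not None else []
print(block)
print(lines)
```

Because the match stops at the first System marker, the "- topic" line that appears after it is not included. Checking for None before calling group(1) avoids an AttributeError on rows that have no header.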

Python regular expression for a re.search pattern
