I want to extract the lines between a keyword and a sentence from text data. Here is my data:
CUSTOMER SUPPLIED DATA:
- topic: Sign in & Password Support
- First Name: Brenda
- Last Name: Delacruz
- Account number: xxxxxxxxx
- U-verse 4-digit PIN: My PIN is
- 4 digit PIN: xxxx
- Email: deedelacruz28806@yahoo.com
- I need help with: Forgot password or ID
*** System::[chat.automatonClientOutcome] Hello! How may I help you today? *** System::[chat.queueWaitDisplayed] We are currently experiencing very high chat volumes which may cause long delays. An agent will be with you as soon as possible.
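Please help me extract the lines under the keyword "CUSTOMER SUPPLIED DATA:", up to where the *** System line begins (that is, the lines between CUSTOMER SUPPLIED DATA: and the *** System line).
I tried the following code: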
import re
m = re.search(r'CUSTOMER SUPPLIED DATA:\s*([^\n]+)', dt["chat_consolidation"][546])
m.group(1)
Between CUSTOMER SUPPLIED DATA: and the *** System line, I only get one line.
The output is as follows:
[out]: - topic: Sign in & Password Support
But the output I need should look like this:
[Out]: - topic: Sign in & Password Support
- First Name: Brenda
- Last Name: Delacruz
- Account number: xxxxxxxxx
- U-verse 4-digit PIN: My PIN is
- 4 digit PIN: xxxx
- Email: deedelacruz28806@yahoo.com
- I need help with: Forgot password or ID
Thanks in advance for your help.
Answer 0 (score: 1)
For this you need the third-party regex module (pip install regex).
x="""CUSTOMER SUPPLIED DATA:
- topic: Sign in & Password Support
- First Name: Brenda
- Last Name: Delacruz
- Account number: xxxxxxxxx
- U-verse 4-digit PIN: My PIN is
- 4 digit PIN: xxxx
- Email: deedelacruz28806@yahoo.com
- I need help with: Forgot password or ID
*** System::[chat.automatonClientOutcome] Hello! How may I help you today? *** System::[chat.queueWaitDisplayed] We are currently experiencing very high chat volumes which may cause long delays. An agent will be with you as soon as possible.
- topic: Sign in & Password Support
- First Name: Brenda
"""
import regex
# \K drops the header from the match; \G(?!^) continues from where the previous match ended
print(regex.findall(r"CUSTOMER SUPPLIED DATA: *\n\K|\G(?!^)(-[^\n]+)\n", x, flags=regex.VERSION1))
Output: ['', '- topic: Sign in & Password Support', '- First Name: Brenda', '- Last Name: Delacruz', '- Account number: xxxxxxxxx', '- U-verse 4-digit PIN: My PIN is', '- 4 digit PIN: xxxx', '- Email: deedelacruz28806@yahoo.com', '- I need help with: Forgot password or ID']
See the demo.
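Answer 1 (score: 0)
@vks is right: if you want to split it into separate lines like that, the regex module is the better fit. But if all you really want is a single string containing everything between CUSTOMER SUPPLIED DATA: and "*** System:", changing your regexp to something like the following also works: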
re.search(r"CUSTOMER SUPPLIED DATA:\s*(.+?)\*\*\* System:", x, re.DOTALL).group(1)
With "([^\n]+)" you are asking it to capture everything up to the first \n, which is probably not what you want.
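To make the difference concrete, here is a minimal sketch using only the standard re module. The sample text is shortened from the question, and the variable names (text, first_line, block, lines) are just for illustration. The final splitlines() step is an optional extra, assuming you want a list of lines rather than one string.

```python
import re

# Shortened sample; in the question the input would come from
# dt["chat_consolidation"][546].
text = """CUSTOMER SUPPLIED DATA:
- topic: Sign in & Password Support
- First Name: Brenda
- Last Name: Delacruz
*** System::[chat.automatonClientOutcome] Hello! How may I help you today?
"""

# [^\n]+ cannot cross a newline, so only the first line is captured.
first_line = re.search(r"CUSTOMER SUPPLIED DATA:\s*([^\n]+)", text).group(1)

# With re.DOTALL, "." also matches newlines; the lazy ".+?" stops at the
# first "*** System:" marker instead of running to the last one.
match = re.search(r"CUSTOMER SUPPLIED DATA:\s*(.+?)\*\*\* System:", text, re.DOTALL)
block = match.group(1).strip() if match else ""

# Optional: turn the captured block into a list of lines.
lines = block.splitlines()

print(first_line)
print(lines)
```

Checking match for None before calling group(1) avoids an AttributeError on rows that have no CUSTOMER SUPPLIED DATA: section.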