Question

I am trying to extract all of the text, including the category (i.e. A, B, C).

A     <some text1> 

B     <some text2> 

C     <some text3>

However, when I apply this regular expression:

ptrn='\n[A-z]*\t'     

pattern1= '(.*)'+ptrn      

f = re.findall(pattern1,test_doc)

it gives me

f[0] = A     <some text1> 

f[1] = <some text2> 

f[2] = <some text3>

But I want:

f[0] =  A     <some text1>

f[1] =  B     <some text2>

f[2] =  C     <some text3>

This link contains the raw text of many documents. Each document follows this pattern:

category<tab><sometext> \n

So the whole corpus looks like this:

category<tab><sometext1> \n 

category<tab><sometext2> \n

.

.

I want

doc[0] = category<tab><sometext1>

doc[1] = category<tab><sometext2>

.
.
and so on

Any answers or hints would be very helpful :)

Answer 1

Try the following pattern:

import re
pattern = r"(\w+)(\t)(.*)(\b)"
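
A minimal sketch of how this pattern might be applied, assuming the corpus is held in a string named test_doc as in the question. The sample contents below are made up for illustration. Because the pattern contains capture groups, re.findall would return a tuple of groups per match, so the sketch uses re.finditer and takes group(0) to get each whole record.

```python
import re

# Hypothetical sample corpus: one "category<TAB>text" record per line.
test_doc = "A\tsome text1\nB\tsome text2\nC\tsome text3\n"

pattern = r"(\w+)(\t)(.*)(\b)"

# group(0) is the full match: category, tab, and the rest of the line.
docs = [m.group(0) for m in re.finditer(pattern, test_doc)]

# findall returns one tuple of the four capture groups per match instead.
groups = re.findall(pattern, test_doc)

print(docs)       # ['A\tsome text1', 'B\tsome text2', 'C\tsome text3']
print(groups[0])  # ('A', '\t', 'some text1', '')
```

One caveat: since the pattern ends with \b, any trailing non-word characters on a line (spaces, or a closing ">") are left out of the match. If those characters matter, dropping the final (\b) group is one possible adjustment.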

Explanation