My data is in the following format:
Alessandro_Volta Was Alessandro Volta a professor of chemistry? Alessandro Volta was not a professor of chemistry. easy easy data/set4/a10
Alessandro_Volta Was Alessandro Volta a professor of chemistry? No easy hard data/set4/a10
Alessandro_Volta Did Alessandro Volta invent the remotely operated pistol? Alessandro Volta did invent the remotely operated pistol. easy easy data/set4/a10
Alessandro_Volta Did Alessandro Volta invent the remotely operated pistol? Yes easy easy data/set4/a10
Alessandro_Volta Was Alessandro Volta taught in public schools? Volta was taught in public schools. easy easy data/set4/a10
Alessandro_Volta Was Alessandro Volta taught in public schools? Yes easy easy data/set4/a10
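I want to extract the question, i.e. the text between the first \t and the ? (I came up with the solution below, but I don't know whether there is a better one).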
import re

def f(regexStr, target):
    # Search target for the first match of regexStr and report the result.
    mo = re.search(regexStr, target)
    if not mo:
        print("NO MATCH")
    else:
        print("MATCH:", mo.group())

f(r"\^[^~]*~", "{Mat^chThisT~ext}")
This code correctly returns the text between ^ and ~, but when I tried the same approach with \t and ?, it printed NO MATCH.
Answer 0 (score: 3)
If the input format is consistent, why not simply:
with open('input.txt') as input_file:
    questions = [line.split('\t', 2)[1].strip() for line in input_file]
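Assuming the question part of each line in input.txt always starts with a tab character and is followed by a tab character, questions will contain a list of strings made up of the questions.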
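As a rough illustration of the split-based approach, the sketch below runs it on two in-memory lines instead of a file. The helper name extract_questions and the sample list are hypothetical, and the sample assumes the fields are separated by single tab characters, as the question describes.

```python
def extract_questions(lines):
    """Return the second tab-separated field of each line that has one."""
    questions = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank or malformed lines that have no second field.
        if len(fields) > 1:
            questions.append(fields[1].strip())
    return questions


# Hypothetical sample lines; real data would come from the input file.
sample = [
    "Alessandro_Volta\tWas Alessandro Volta a professor of chemistry?\tNo\teasy\thard\tdata/set4/a10\n",
    "Alessandro_Volta\tWas Alessandro Volta taught in public schools?\tYes\teasy\teasy\tdata/set4/a10\n",
]

print(extract_questions(sample))
```

Unlike the one-line list comprehension, this version skips lines without a tab rather than raising an IndexError, which may or may not be what you want for your data.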
Answer 1 (score: 1)
(?<=[ ]{4,}).*?\?
Try this. See the demo.
http://regex101.com/r/yR3mM3/36
import re

# Python's built-in re module rejects the variable-width look-behind (?<=[ ]{4,}),
# so the separator is matched normally and the question is captured in a group.
p = re.compile(r'[ ]{4,}(.*?\?)')
# Assumption: fields are separated by four spaces, which the pattern requires.
test_str = "Alessandro_Volta    Was Alessandro Volta a professor of chemistry?    Alessandro Volta was not a professor of chemistry.    easy    easy    data/set4/a10\nAlessandro_Volta    Was Alessandro Volta a professor of chemistry?    No    easy    hard    data/set4/a10\nAlessandro_Volta    Did Alessandro Volta invent the remotely operated pistol?    Alessandro Volta did invent the remotely operated pistol.    easy    easy    data/set4/a10\nAlessandro_Volta    Did Alessandro Volta invent the remotely operated pistol?    Yes    easy    easy    data/set4/a10\nAlessandro_Volta    Was Alessandro Volta taught in public schools?    Volta was taught in public schools.    easy    easy    data/set4/a10\nAlessandro_Volta    Was Alessandro Volta taught in public schools?    Yes    easy    easy    data/set4/a10"
print(p.findall(test_str))
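Since the question describes the fields as tab-separated, a tab-based variant may be closer to what was asked. The following is only a sketch: the names QUESTION_RE and first_question are hypothetical, and it assumes a literal tab character directly precedes the question.

```python
import re

# A tab, then the shortest run of non-tab characters that ends in "?".
# The "?" is escaped so it is matched literally instead of acting as a quantifier.
QUESTION_RE = re.compile(r"\t([^\t]*?\?)")


def first_question(line):
    """Return the first question found after a tab, or None if there is none."""
    m = QUESTION_RE.search(line)
    return m.group(1) if m else None


line = "Alessandro_Volta\tWas Alessandro Volta a professor of chemistry?\tNo\teasy\thard\tdata/set4/a10"
print(first_question(line))
```

Excluding tabs inside the group keeps the match within a single field, so a line whose question field has no question mark does not spill over into the answer field.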