Here is my test example:
JT - American journal of public health
JID - 1254074
SB - AIM
SB - IM
MH - Adult
MH - Biomedical Research/*organization & administration
MH - Female
MH - Health Care Reform/*history/*methods
AB - OBJECTIVES: We assessed whether a 2-phase labeling and choice
AB- architecture intervention
OWN - NLM
STAT- MEDLINE
DCOM- 20120417
LR - 20170220
IS - 1541-0048 (Electronic)
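Given all the lines that start with "MH", how can I write a regular expression that picks out only the vocabulary terms, so that I can then import them into an Excel sheet? The output should look like this:
[Adult, Biomedical Research, organization & administration, Female, Health Care Reform, history, methods].
Here is my attempt: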
import re
Path = "MH\s*.*"
re.findall(Path,file)
I know this is wrong, but I don't know how to fix it.
Thanks.
Answer 0 (score: 2)
Use re.findall.
Demo:
import re
s = """JT - American journal of public health
JID - 1254074
SB - AIM
SB - IM
MH - Adult
MH - Biomedical Research/*organization & administration
MH - Female
MH - Health Care Reform/*history/*methods
AB - OBJECTIVES: We assessed whether a 2-phase labeling and choice
AB- architecture intervention
OWN - NLM
STAT- MEDLINE
DCOM- 20120417
LR - 20170220
IS - 1541-0048 (Electronic)"""
res = []
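# Capture everything after "MH - " on each line that starts with MH
for i in re.findall(r"^MH\s+-\s+(.*)", s, flags=re.MULTILINE):
    # "/*" separates a heading from its subheadings
    res.extend(i.split("/*"))
print(res)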
Output:
['Adult', 'Biomedical Research', 'organization & administration', 'Female', 'Health Care Reform', 'history', 'methods']
Answer 1 (score: 2)
It looks like you need a bit of regex work here, since you also want to split on /* on some of the lines. This should do it!
import re
my_file = """JT - American journal of public health
JID - 1254074
SB - AIM
SB - IM
MH - Adult
MH - Biomedical Research/*organization & administration
MH - Female
MH - Health Care Reform/*history/*methods
AB - OBJECTIVES: We assessed whether a 2-phase labeling and choice
AB- architecture intervention
OWN - NLM
STAT- MEDLINE
DCOM- 20120417
LR - 20170220
IS - 1541-0048 (Electronic)"""
my_list = my_file.splitlines()
new_list = []
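for item in my_list:
    # Keep only the lines that start with the MH tag
    if re.search(r"^MH\s*-", item):
        # Strip only the leading "MH - " prefix, so hyphens inside a term survive
        item = re.sub(r"^MH\s*-\s*", "", item)
        # Split the heading from its subheadings
        item = item.split("/*")
        new_list = new_list + item
print(new_list)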
Output:
['Adult', 'Biomedical Research', 'organization & administration', 'Female', 'Health Care Reform', 'history', 'methods']
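I split the string into a list of lines. I think there's a good chance you will already have the data as a list when you import it. I also like to apply the regex one line at a time, which makes troubleshooting easier later on.
I match the items that start with MH and capture them. Then I split each item on /* and collect all the pieces in one list that is ready for export to Excel.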
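None of the snippets here covers the Excel step itself. A minimal sketch, assuming a CSV file is an acceptable way to get the list into Excel (Excel opens CSV files directly): the standard-library csv module writes one term per row. The function name write_terms_csv, the "term" header, and the filename mentioned in the comment are all hypothetical.

```python
import csv
import io

# Terms as produced by the snippets above
terms = ['Adult', 'Biomedical Research', 'organization & administration',
         'Female', 'Health Care Reform', 'history', 'methods']

def write_terms_csv(terms, handle):
    """Write one term per row so Excel shows them in a single column."""
    writer = csv.writer(handle)
    writer.writerow(["term"])  # header row; the column name is an assumption
    for term in terms:
        writer.writerow([term])

# An in-memory buffer is used here so the sketch runs anywhere. For a real
# file, pass open("mesh_terms.csv", "w", newline="", encoding="utf-8")
# instead ("mesh_terms.csv" is a hypothetical filename).
buffer = io.StringIO()
write_terms_csv(terms, buffer)
print(buffer.getvalue())
```

If a native .xlsx file is required, a third-party library would be needed; that is outside what this sketch shows.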
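Answer 2 (score: 1)
Just posting the code I tried, after noticing that better-written answers were posted while I was writing it. Please don't judge; this just happens on SO.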
s = """
JT - American journal of public health
JID - 1254074
MH - Adult
MH - Biomedical Research/*organization & administration
MH - Health Care Reform/*history/*methods
AB - OBJECTIVES: We assessed whether a 2-phase labeling and choice
"""
import re
import itertools
matches = re.findall(r"^MH[\s-]+(.*)$", s, re.MULTILINE)
splitmatches = [i.split(r"/*") for i in matches]
flattenedmatches = list(itertools.chain(*splitmatches))
print(flattenedmatches)
Output:
['Adult', 'Biomedical Research', 'organization & administration', 'Health Care Reform', 'history', 'methods']
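One caveat about splitting on the literal /*: as far as I know, in MEDLINE records the subheadings are separated by a plain / and the asterisk only marks a major topic, so a subheading without an asterisk would not be split off. A hedged sketch that splits on / and then drops any leading asterisk is shown here. The function name mesh_terms and the third sample line are hypothetical and do not come from the question's data.

```python
import re

sample = """MH - Adult
MH - Biomedical Research/*organization & administration
MH - *Obesity/prevention & control
AB - not a heading"""

def mesh_terms(text):
    """Return headings and subheadings from MH lines, without '*' markers."""
    terms = []
    # [ \t]* instead of \s* so the pattern never runs on into the next line
    for value in re.findall(r"^MH[ \t]*-[ \t]*(.*)$", text, flags=re.MULTILINE):
        for part in value.split("/"):
            part = part.strip().lstrip("*")  # drop the major-topic marker
            if part:
                terms.append(part)
    return terms

print(mesh_terms(sample))
```

On the question's own sample this gives the same list as the answers above, since every subheading there happens to carry an asterisk.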