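I'm stuck trying to use a regular expression to capture the main file name from each element of a list. Suppose I have a list of file paths: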
path_list = ['/Users/buggylines/histogram/offline-deployer-list_b7bacc7fdb-0e0e08077c_GERONIMO-2886_635_histogrambuglines_635.diff',
'/Users/buggylines/histogram/normal.jsp_aa0c2c26dd-90188cc2a4_GERONIMO-4597_1293_histogrambuglines_1293.diff',
'/Users/buggylines/histogram/hbase-env.sh_aa0c2c26dd-90188cc2a4_GERONIMO-4597_1293_histogrambuglines_1293.diff',
'/Users/buggylines/histogram/LICENSE-tesh_cd1ec17e43-4ebc5e8021_GERONIMO-5702_1573_histogrambuglines_785.diff',
'/Users/buggylines/histogram/geronimo_dcce59ae71-8f5c1aa7a1_GERONIMO-5661_1554_histogrambuglines_54.diff',
'/Users/buggylines/histogram/catalina-6.0.18-G678601.jar.sha1_cd1ec17e43-4ebc5e8021_GERONIMO-5702_1573_histogrambuglines_785.diff',
'/Users/buggylines/histogram/geronimo-naming-1.0.xsd_544dee5179-40a2ae1d41_GERONIMO-1027_131_histogrambuglines_131.diff',
'/Users/buggylines/histogram/6.0.18-G678601.README.TXT_cd1ec17e43-4ebc5e8021_GERONIMO-5702_1573_histogrambuglines_785.diff'
]
I want to use a regex to capture only the name of each file. I need the following output:
expected_output = ['offline-deployer-list',
'normal.jsp',
'hbase-env.sh',
'LICENSE-tesh',
'geronimo',
'catalina-6.0.18-G678601.jar.sha1',
'geronimo-naming-1.0.xsd',
'6.0.18-G678601.README.TXT'
]
This is the code I wrote:
import re

filename = []
for z, path in enumerate(path_list):
    pattern = re.search(r"((?:\w+[-]\w+[-]\w+|\w+[-]\w+|\w+)[.]\w+[_])|(?<=histogram/)(?:\w+[-]\w+[-]\D+[_]|\D+[-]\w+[_])|(?<=histogram\/)(\w+[_])(?<=[_])", path)
    pattern = pattern.groups()
    filename.append(pattern[0])
However, the output is not what I expected. This is what the code produces:
filename = [None,
'normal.jsp_',
'hbase-env.sh_',
None,
None,
'jar.sha1_',
'0.xsd_',
'README.TXT_']
I need help fixing the regex. Thank you very much.
Answer 0 (score: 2)
You can use os.path.basename like this:
import os
path_list = ['/Users/buggylines/histogram/offline-deployer-list_b7bacc7fdb-0e0e08077c_GERONIMO-2886_635_histogrambuglines_635.diff',
'/Users/buggylines/histogram/normal.jsp_aa0c2c26dd-90188cc2a4_GERONIMO-4597_1293_histogrambuglines_1293.diff',
'/Users/buggylines/histogram/hbase-env.sh_aa0c2c26dd-90188cc2a4_GERONIMO-4597_1293_histogrambuglines_1293.diff',
'/Users/buggylines/histogram/LICENSE-tesh_cd1ec17e43-4ebc5e8021_GERONIMO-5702_1573_histogrambuglines_785.diff',
'/Users/buggylines/histogram/geronimo_dcce59ae71-8f5c1aa7a1_GERONIMO-5661_1554_histogrambuglines_54.diff',
'/Users/buggylines/histogram/catalina-6.0.18-G678601.jar.sha1_cd1ec17e43-4ebc5e8021_GERONIMO-5702_1573_histogrambuglines_785.diff',
'/Users/buggylines/histogram/geronimo-naming-1.0.xsd_544dee5179-40a2ae1d41_GERONIMO-1027_131_histogrambuglines_131.diff',
'/Users/buggylines/histogram/6.0.18-G678601.README.TXT_cd1ec17e43-4ebc5e8021_GERONIMO-5702_1573_histogrambuglines_785.diff'
]
output = [os.path.basename(path).split('_')[0] for path in path_list]
print(output)
Output:
['offline-deployer-list', 'normal.jsp', 'hbase-env.sh', 'LICENSE-tesh', 'geronimo', 'catalina-6.0.18-G678601.jar.sha1', 'geronimo-naming-1.0.xsd', '6.0.18-G678601.README.TXT']
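As a small variation on the line above (not part of the original answer), str.partition splits only once at the first underscore and returns the whole base name unchanged when there is no underscore at all. The helper name main_name and the sample variable are made up for illustration.

```python
import os


def main_name(path):
    """Return the part of the file name that precedes the first underscore."""
    base = os.path.basename(path)             # drop the directory part
    name, _sep, _rest = base.partition('_')   # split once, at the first '_'
    return name


sample = '/Users/buggylines/histogram/normal.jsp_aa0c2c26dd-90188cc2a4_GERONIMO-4597_1293_histogrambuglines_1293.diff'
print(main_name(sample))
```

This relies on the same assumption as the answer: the wanted name never contains an underscore itself.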
Answer 1 (score: 1)
Here is one that works for me:
((?<=histogram\/)[a-zA-Z0-9-.]+(?=_))
See it here: https://regex101.com/r/VfQIJC/4
Update:
A more general one that matches the text between the last / and the first _ after it:
((?<=\/)[a-zA-Z0-9-.]+(?!.+\/)(?=_))
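The answer gives the patterns on their own, so here is a rough sketch of how they might be used from Python with re.search. The patterns are copied from above as raw strings; the names specific, general, first_match and paths are illustrative only, and paths holds just two entries of the question's list.

```python
import re

# Patterns copied from the answer, written as raw strings.
specific = re.compile(r'((?<=histogram\/)[a-zA-Z0-9-.]+(?=_))')
general = re.compile(r'((?<=\/)[a-zA-Z0-9-.]+(?!.+\/)(?=_))')

# Two entries taken from the question's path_list.
paths = [
    '/Users/buggylines/histogram/offline-deployer-list_b7bacc7fdb-0e0e08077c_GERONIMO-2886_635_histogrambuglines_635.diff',
    '/Users/buggylines/histogram/catalina-6.0.18-G678601.jar.sha1_cd1ec17e43-4ebc5e8021_GERONIMO-5702_1573_histogrambuglines_785.diff',
]


def first_match(pattern, path):
    """Return the captured file name, or None when the pattern does not match."""
    m = pattern.search(path)
    return m.group(1) if m else None


for p in paths:
    print(first_match(specific, p), first_match(general, p))
```

The first pattern only matches when the file sits directly under a directory named histogram, while the second looks only at the last path component, so it should also work for files in other directories.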
Answer 2 (score: 1)
import re
path_list = ['/Users/buggylines/histogram/offline-deployer-list_b7bacc7fdb-0e0e08077c_GERONIMO-2886_635_histogrambuglines_635.diff',
'/Users/buggylines/histogram/normal.jsp_aa0c2c26dd-90188cc2a4_GERONIMO-4597_1293_histogrambuglines_1293.diff',
'/Users/buggylines/histogram/hbase-env.sh_aa0c2c26dd-90188cc2a4_GERONIMO-4597_1293_histogrambuglines_1293.diff',
'/Users/buggylines/histogram/LICENSE-tesh_cd1ec17e43-4ebc5e8021_GERONIMO-5702_1573_histogrambuglines_785.diff',
'/Users/buggylines/histogram/geronimo_dcce59ae71-8f5c1aa7a1_GERONIMO-5661_1554_histogrambuglines_54.diff',
'/Users/buggylines/histogram/catalina-6.0.18-G678601.jar.sha1_cd1ec17e43-4ebc5e8021_GERONIMO-5702_1573_histogrambuglines_785.diff',
'/Users/buggylines/histogram/geronimo-naming-1.0.xsd_544dee5179-40a2ae1d41_GERONIMO-1027_131_histogrambuglines_131.diff',
'/Users/buggylines/histogram/6.0.18-G678601.README.TXT_cd1ec17e43-4ebc5e8021_GERONIMO-5702_1573_histogrambuglines_785.diff'
]
for i in path_list:
    # Skip three leading directories, then capture everything up to the first "_" or "/".
    s = re.search(r"\/\w+\/\w+\/\w+\/([^_/]+)", i)
    if s:
        print(s.group(1))
Output:
offline-deployer-list
normal.jsp
hbase-env.sh
LICENSE-tesh
geronimo
catalina-6.0.18-G678601.jar.sha1
geronimo-naming-1.0.xsd
6.0.18-G678601.README.TXT
Answer 3 (score: 0)
Why not use split?
>>> names = [x.split("/")[-1].split("_")[0] for x in path_list]
>>> names
['offline-deployer-list', 'normal.jsp', 'hbase-env.sh', 'LICENSE-tesh', 'geronimo', 'catalina-6.0.18-G678601.jar.sha1', 'geronimo-naming-1.0.xsd', '6.0.18-G678601.README.TXT']
Answer 4 (score: 0)
Here is a very Pythonic solution.
import re
path_list = ['/Users/buggylines/histogram/offline-deployer-list_b7bacc7fdb-0e0e08077c_GERONIMO-2886_635_histogrambuglines_635.diff',
'/Users/buggylines/histogram/normal.jsp_aa0c2c26dd-90188cc2a4_GERONIMO-4597_1293_histogrambuglines_1293.diff',
'/Users/buggylines/histogram/hbase-env.sh_aa0c2c26dd-90188cc2a4_GERONIMO-4597_1293_histogrambuglines_1293.diff',
'/Users/buggylines/histogram/LICENSE-tesh_cd1ec17e43-4ebc5e8021_GERONIMO-5702_1573_histogrambuglines_785.diff',
'/Users/buggylines/histogram/geronimo_dcce59ae71-8f5c1aa7a1_GERONIMO-5661_1554_histogrambuglines_54.diff',
'/Users/buggylines/histogram/catalina-6.0.18-G678601.jar.sha1_cd1ec17e43-4ebc5e8021_GERONIMO-5702_1573_histogrambuglines_785.diff',
'/Users/buggylines/histogram/geronimo-naming-1.0.xsd_544dee5179-40a2ae1d41_GERONIMO-1027_131_histogrambuglines_131.diff',
'/Users/buggylines/histogram/6.0.18-G678601.README.TXT_cd1ec17e43-4ebc5e8021_GERONIMO-5702_1573_histogrambuglines_785.diff'
]
filename_re = re.compile(r'^\/Users\/buggylines\/histogram\/(?P<filename>[A-Za-z0-9-.]+)_')
filename = []
for item in path_list:
    filename_match = filename_re.search(item)
    if filename_match:
        # Keep only the named group, i.e. the text before the first "_".
        filename.append(filename_match.group('filename'))
print(filename)