I have a string like this:
76/,2., 115., 12, 5/e12, .111 107,/1, 108/61a, f457f/3 11/150
I want to find the following values in it:
76/2, 115, 12, 5/12, 111, 107/1, 108/61, 457/3 and 11/150
Note that for 107,/1 I want 107/1, but I want 107, /1 to be treated the same as 107 , 1.
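I tried using this regex, but I don't know how to keep only the digits and (if present) the slash character in the results.
Is that possible? I could iterate over the results, check each one for unwanted characters and remove them, but I would prefer a regular expression that does it.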
Answer 0 (score: 2)
Rather than inventing a tangled regex pattern, consider a straightforward re.sub() solution:
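import re
s = '76/,2., 115., 12, 5/e12, .111 107,/1, 108/61a, f457f/3 11/150'
# Outer pattern: one run of non-space characters, with an optional trailing comma captured in group 1.
# Inner substitution: drop everything except digits and slashes, then put the trailing comma back.
result = re.sub(r'\S+[^,\s](,)?',
                lambda m: re.sub(r'[^\d/]+', '', m.group()) + (m.group(1) or ''), s)
print(result)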
Output:
76/2, 115, 12, 5/12, 111 107/1, 108/61, 457/3 11/150
----------
To get a list of the desired values, the above can be shortened to:
s = '76/,2., 115., 12, 5/e12, .111 107,/1, 108/61a, f457f/3 11/150'
result = re.sub(r'\S+', lambda m: re.sub(r'[^\d/]+', '', m.group()), s).split()
print(result)
Output:
['76/2', '115', '12', '5/12', '111', '107/1', '108/61', '457/3', '11/150']
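The question also asks that 107, /1 be treated as two separate values. The following is a minimal sketch of one way to cover that case, not part of the answer above: it splits on whitespace first, keeps only digits and slashes in each token, and then strips any leading or trailing slash. The function name extract_values and the slash-stripping step are assumptions added for illustration.

```python
import re

def extract_values(text):
    """Return the digit/slash values found in each whitespace-separated token."""
    values = []
    for token in text.split():
        # Keep only digits and slashes, as in the answer above.
        cleaned = re.sub(r'[^\d/]+', '', token)
        # Assumption: a slash at either end of a token is noise, so '/1' becomes '1'.
        cleaned = cleaned.strip('/')
        if cleaned:  # skip tokens that held no digits at all
            values.append(cleaned)
    return values

s = '76/,2., 115., 12, 5/e12, .111 107,/1, 108/61a, f457f/3 11/150'
print(extract_values(s))
print(extract_values('107,/1'))   # a single token, so a single value
print(extract_values('107, /1'))  # two tokens, so two values
```

Because the split happens on whitespace, 107,/1 stays one value (107/1) while 107, /1 yields 107 and 1 separately.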