Omit unwanted characters in each regex result

Time: 2017-11-09 15:14:34

Tags: python regex

I have a string like this:

76/,2., 115., 12, 5/e12, .111 107,/1, 108/61a, f457f/3 11/150

I want to find the following values in it:

76/2, 115, 12, 5/12, 111, 107/1, 108/61, 457/3 and 11/150

Note that for 107,/1 I want 107/1, but for 107, /1 I expect 107 and 1 as separate values. I tried using this regex, but I don't know how to keep only the digits and (if present) the slash character in the results.

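Is this possible? I could iterate over the results, check whether each one contains unwanted characters and remove them, but I would prefer a regex that does it directly.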

1 answer:

Answer 0 (score: 2)

Instead of inventing a convoluted regex pattern, consider a straightforward re.sub() solution:

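import re

s = '76/,2., 115., 12, 5/e12, .111 107,/1, 108/61a, f457f/3 11/150'
# Match each token (optionally followed by a comma), strip everything that is
# not a digit or a slash, then put the trailing comma back if there was one.
result = re.sub(r'\S+[^,\s](,)?',
                lambda m: re.sub(r'[^\d/]+', '', m.group()) + (m.group(1) or ''), s)

print(result)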

Output:

76/2, 115, 12, 5/12, 111 107/1, 108/61, 457/3 11/150
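The pattern above is compact, so it may help to see the same substitution spelled out with re.VERBOSE and a named callback. This is only a sketch of the snippet above, not a different method; the names TOKEN, UNWANTED and clean_token are made up for illustration.

```python
import re

# Same pattern as in the answer, written with re.VERBOSE so each part is commented.
TOKEN = re.compile(r"""
    \S+        # one or more non-whitespace characters (greedy)
    [^,\s]     # final character of the token: neither a comma nor whitespace
    (,)?       # optional trailing comma, captured as group 1 so it can be kept
""", re.VERBOSE)

# Any run of characters that are neither digits nor slashes.
UNWANTED = re.compile(r'[^\d/]+')

def clean_token(m):
    # Strip unwanted characters from the whole match (this also removes the
    # trailing comma), then re-append the comma if group 1 captured one.
    return UNWANTED.sub('', m.group()) + (m.group(1) or '')

s = '76/,2., 115., 12, 5/e12, .111 107,/1, 108/61a, f457f/3 11/150'
result = TOKEN.sub(clean_token, s)
print(result)
```

One thing to be aware of: because the pattern requires at least two non-whitespace characters, a token that is a single character long is not matched and is left unchanged.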

----------

To get a list of the desired values, the above can be shortened to:

s = '76/,2., 115., 12, 5/e12, .111 107,/1, 108/61a, f457f/3 11/150'
result = re.sub(r'\S+', lambda m: re.sub(r'[^\d/]+', '', m.group()), s).split()

print(result)

Output:

['76/2', '115', '12', '5/12', '111', '107/1', '108/61', '457/3', '11/150']
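For comparison, the "iterate over the results" idea from the question could look roughly like the sketch below. It assumes, as the one-liner above does, that the values are separated by whitespace; the function name extract_values is hypothetical.

```python
import re

def extract_values(text):
    """Split on whitespace and keep only digits and slashes in each token."""
    cleaned = (re.sub(r'[^\d/]+', '', token) for token in text.split())
    # Drop tokens that contained no digits or slashes at all.
    return [value for value in cleaned if value]

s = '76/,2., 115., 12, 5/e12, .111 107,/1, 108/61a, f457f/3 11/150'
print(extract_values(s))
```

This produces the same list as the re.sub() one-liner for the sample string; which form to prefer is mostly a matter of readability.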