Regex difficulty extracting text from a Pandas DataFrame

Time: 2016-04-01 16:01:17

Tags: python regex pandas

I ran the regex below against my data in an online regex tester and it works fine. However, when I run it in Python 3 with Pandas 0.18, I get NaN in the new 'r' column.

The regex is:

\(\(\d+,\s\d+\],\s\(\d+,\s(\d+)\]\)

The sample data is:

                   WT_g      r_25_text           r
Azmuth_25   Range_25            
(0, 5]      (0, 25]     1   ((0, 5],   (0, 25])     NaN
(25, 30]    (25, 50]    1   ((25, 30], (25, 50])    NaN
(35, 40]    (25, 50]    1   ((35, 40], (25, 50])    NaN
(65, 70]    (50, 75]    1   ((65, 70], (50, 75])    NaN
(85, 90]    (50, 75]    1   ((85, 90], (50, 75])    NaN
(95, 100]   (25, 50]    1   ((95, 100], (25, 50])   NaN
(100, 105]  (50, 75]    1   ((100, 105], (50, 75])  NaN
(110, 115]  (50, 75]    1   ((110, 115], (50, 75])  NaN
(115, 120]  (0, 25]     1   ((115, 120], (0, 25])   NaN

My code:

df_25_sum['r'] = df_25_sum['r_25_text'].str.extract('\(\(\d+,\s\d+\],\s\(\d+,\s(\d+)\]\)')
df_25_sum

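The output is the sample data above. When I add the new column based on the extract, I get NaN.

3 answers:

Answer 0: (score: 0)

If you are actually trying to extract the last number from r_25_text (per your comment), the following regex pattern should work: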

pattern = r'(\d+)(?=(\]\)))'            # find digits next to '])'

df_25_sum['r'] = df_25_sum['r_25_text'].str.extract(pattern)
df_25_sum

The output in column r should be the last number in each row of column r_25_text, i.e. 25, 50, 50, 75, 75, and so on.

See the regex link.
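
A side note on the pattern above: it contains two capturing groups, the digits and the inner (\]\)) inside the lookahead, and str.extract produces one column per capturing group. The following is a minimal sketch of a variant with a single capturing group, not a definitive fix. The sample strings are copied from the question; the names texts, two_groups, one_group and last_numbers are made up for illustration.

```python
import re
import pandas as pd

# Sample strings copied from the r_25_text column in the question.
texts = pd.Series([
    '((0, 5],   (0, 25])',
    '((25, 30], (25, 50])',
    '((65, 70], (50, 75])',
])

two_groups = r'(\d+)(?=(\]\)))'  # pattern from this answer: 2 capturing groups
one_group = r'(\d+)(?=\]\))'     # same lookahead without the inner group

print(re.compile(two_groups).groups)  # 2
print(re.compile(one_group).groups)   # 1

# With exactly one capturing group, expand=False returns a Series,
# which can be assigned directly to a single new column.
last_numbers = texts.str.extract(one_group, expand=False)
print(last_numbers.tolist())  # ['25', '50', '75']
```

The extracted values are strings; converting them with astype(int) is an optional extra step if numbers are needed.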

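Answer 1: (score: 0)

I got this working. It is essentially the same as the answer pylang developed, but I could not get the regex with the '=' sign to work. My final code and regex are: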

pattern = r'(\d+)?\]'            # find digits next to ']'
df_25_sum['r'] = df_25_sum['r_25_text'].str.extract(pattern)


Azmuth_25    Range_25   WT_g  r_25_text r               
(0, 5]      (0, 25]     1     (0, 25]   25
(25, 30]    (25, 50]    1     (25, 50]  50
(35, 40]    (25, 50]    1     (25, 50]  50
(65, 70]    (50, 75]    1     (50, 75]  75
(85, 90]    (50, 75]    1     (50, 75]  75

I can only assume Pandas 0.18 does not support '=' in regular expressions. Thanks again, pylang.
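
On the '=' point: (?=...) is ordinary lookahead syntax in Python's re module, which pandas uses for str.extract, so the '=' itself is unlikely to be the problem; the post does not show the actual error, so the real cause stays a guess. Separately, str.extract takes its groups from the first match only, so the simpler pattern relies on r_25_text holding just the range, as in the output above. A small sketch using re.search, which also returns the first match (the variable names are made up for illustration):

```python
import re

# Lookahead is plain Python regex syntax; no pandas is involved here.
lookahead = r'(\d+)(?=\]\))'
via_lookahead = re.search(lookahead, '((0, 5],   (0, 25])').group(1)
print(via_lookahead)  # 25

# The simpler pattern captures the digits before the FIRST ']'.
simple = r'(\d+)?\]'
short_text = '(0, 25]'             # r_25_text as shown in this answer's output
full_text = '((0, 5],   (0, 25])'  # r_25_text as shown in the question

first_short = re.search(simple, short_text).group(1)
first_full = re.search(simple, full_text).group(1)
print(first_short)  # 25
print(first_full)   # 5
```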

Answer 2: (score: 0)

Have you tried:

import pandas as pd

df_25_sum = pd.DataFrame([
    '((0, 5],   (0, 25])',
    '((25, 30], (25, 50])',
    '((35, 40], (25, 50])'
    ], columns=['r_25_text'])

pattern = r'\(\(\d+,\s\d+\],\s+\(\d+,\s(\d+)\]\)' 

df_25_sum['r'] = df_25_sum['r_25_text'].str.extract(pattern)

df_25_sum

>>>>               r_25_text   r
     0   ((0, 5],   (0, 25])  25
     1  ((25, 30], (25, 50])  50
     2  ((35, 40], (25, 50])  50
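
The only difference between this pattern and the one in the question is \s+ in place of \s between the two intervals. The sketch below isolates that difference with the re module, using one padded and one single-spaced string from the sample data; the variable names are made up for illustration. Whether padding explains every NaN in the question is not clear from the post, since some of the sample rows appear to contain only a single space.

```python
import re

original = r'\(\(\d+,\s\d+\],\s\(\d+,\s(\d+)\]\)'   # question: exactly one whitespace
relaxed = r'\(\(\d+,\s\d+\],\s+\(\d+,\s(\d+)\]\)'   # this answer: one or more

padded = '((0, 5],   (0, 25])'   # three spaces between the intervals
single = '((25, 30], (25, 50])'  # one space between the intervals

results = {}
for text in (padded, single):
    m_orig = re.search(original, text)
    m_relaxed = re.search(relaxed, text)
    # Store the captured number, or None where the pattern does not match.
    results[text] = (
        m_orig.group(1) if m_orig else None,
        m_relaxed.group(1) if m_relaxed else None,
    )

print(results[padded])  # (None, '25')
print(results[single])  # ('50', '50')
```

Where a pattern does not match a row, str.extract returns NaN for that row, which matches the symptom described in the question.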