Extract a group of n digits from strings in a column

Time: 2019-01-23 22:15:09

Tags: python pandas

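I have a column of strings in a pandas DataFrame that contains values such as "AU/4347001", but it also contains less tidy strings such as "Who would have thought this would be so 4347009 difficult".

So, ultimately, there is no consistent pattern to where or how this run of digits appears in the string. It may be at the beginning, in the middle or at the end, and there is no way of knowing exactly how many other characters surround the digits.

Ideally, I would like to return another column of equal length that contains only the digits.

Is this possible?

Any help is much appreciated!

Thanks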

4 answers:

Answer 0: (score: 1)

You can use extract with a capture group for digits (\d+):

import pandas as pd

data = ["AU/4347001",
        "Who would have thought this would be so 4347009 difficult",
        "Another with a no numbers",
        "131242143"]

df = pd.DataFrame(data=data, columns=['txt'])
# Raw string avoids the invalid-escape warning; expand=False returns a Series
result = df.assign(res=df.txt.str.extract(r'(\d+)', expand=False)).fillna('')
print(result)

Output

                                                 txt        res
0                                         AU/4347001    4347001
1  Who would have thought this would be so 434700...    4347009
2                          Another with a no numbers           
3                                          131242143  131242143

Note that in the example above, fillna is used to fill the rows where no group of digits was found (in this case, with an empty string).
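The pattern above takes the first run of digits of any length. If only runs of exactly n digits should match, as the question title suggests, a quantifier with digit lookarounds is one way to sketch it. The value n = 7 and the name N_DIGITS are assumptions based on the sample strings, not something the question states.

```python
import pandas as pd

df = pd.DataFrame({"txt": [
    "AU/4347001",
    "Who would have thought this would be so 4347009 difficult",
    "Another with a no numbers",
    "131242143",
]})

N_DIGITS = 7  # assumed length, taken from the sample strings in the question

# Match a run of exactly N_DIGITS digits that is not part of a longer run:
# the lookbehind and lookahead reject a digit on either side of the group.
pattern = rf"(?<!\d)(\d{{{N_DIGITS}}})(?!\d)"

# expand=False returns a Series, so it can be assigned as a single column.
df["res"] = df["txt"].str.extract(pattern, expand=False)
print(df["res"].tolist())
```

With this pattern the 9-digit string in the last row is left as missing rather than truncated to its first 7 digits, which may or may not be what you want.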

Answer 1: (score: 1)

You can use extract:

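import pandas as pd

df = pd.DataFrame({'text': ["Who would have thought this would be so 4347009 difficult",
                            "24 is me"]})

# expand=False returns a Series holding the first run of digits in each row
df['new_col'] = df['text'].str.extract(r'(\d+)', expand=False)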

                                                text  new_col
0  Who would have thought this would be so 434700...  4347009
1                                           24 is me       24

Answer 2: (score: 1)

Here is our test dataframe:

import pandas as pd

### Create an example Pandas Dataframe
df = pd.DataFrame(data=['something123', 'some456thing', '789somthing',
                        'Lots of numbers 82849585 make a long sentence'], columns=['strings'])

### Create a function for identifying, joining and then turning the string to an integer
### (int('') raises ValueError, so every string must contain at least one digit)
def get_numbers(string):
    return int(''.join([s for s in string if s.isdigit()]))

### Now let's apply the get_numbers function to the strings column
df.loc[:, 'strings_wo_numbers'] = df.loc[:, 'strings'].apply(get_numbers)

Note: this joins all the digits in the string, i.e. "10 olives and 5 apples" becomes 105 rather than 10, 5.
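If the concatenation is not wanted, a regex-based variant can keep each run of digits separate and also cope with strings that contain no digits. This is only a sketch; the helper names get_number_groups and get_first_number are made up for illustration.

```python
import re

def get_number_groups(text):
    """Return every run of digits in text as a separate int (hypothetical helper)."""
    return [int(group) for group in re.findall(r"\d+", text)]

def get_first_number(text):
    """Return the first run of digits as an int, or None if there are no digits."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None

print(get_number_groups("10 olives and 5 apples"))
print(get_first_number("some456thing"))
print(get_first_number("no digits here"))
```

Either helper can be passed to apply in the same way as get_numbers, for example df['strings'].apply(get_first_number).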

Answer 3: (score: 0)

Use str.findall:

df.text.str.findall(r'\d+').str[0]
0    4347009
1         24
Name: text, dtype: object
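It may help to see what findall does on rows with no digits or with several runs of digits. The sketch below uses a made-up Series (the last two strings are not from the question) and assumes nothing beyond the str.findall and str[0] calls used above.

```python
import pandas as pd

s = pd.Series([
    "Who would have thought this would be so 4347009 difficult",
    "24 is me",
    "no digits here",
    "10 olives and 5 apples",
])

# findall returns a list of every run of digits for each row
found = s.str.findall(r"\d+")

# .str[0] takes the first element of each list; an empty list gives NaN
first = found.str[0]

print(found.tolist())
print(first.tolist())
```

Unlike extract, findall keeps all the matches, so the full list in found is still available if more than the first number is needed.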