Pandas - Extracting text from rows

Time: 2018-08-24 15:23:11

Tags: python python-3.x pandas

Suppose I have a DataFrame that looks like this:

df2 = pd.DataFrame(['Apple, 10/01/2016, 31/10/18, david/kate', 'orange', 'pear', 'Apple', '10/01/2016', '02/20/2017'], columns=['A'])

>>> df2

                                         A       file_name
0  Apple, 10/01/2016, 31/10/18, david/kate          a.txt
1                                   orange          a.txt
2                                     pear          b.txt
3                                    Apple          a.txt
4                               10/01/2016          d.txt
5                               02/20/2017          e.txt
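
The constructor call above only builds column A, while the printed frame also has a file_name column. A minimal sketch that reproduces the frame as displayed, assuming the file_name values are exactly the ones in the printed table:

```python
import pandas as pd

# Values copied from the table printed in the question.
df2 = pd.DataFrame({
    'A': ['Apple, 10/01/2016, 31/10/18, david/kate', 'orange', 'pear',
          'Apple', '10/01/2016', '02/20/2017'],
    'file_name': ['a.txt', 'a.txt', 'b.txt', 'a.txt', 'd.txt', 'e.txt'],
})
```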

All I want is to extract the dates from this DataFrame, so the output would look like this:

                        A        file_name
0    10/01/2016, 31/10/18           a.txt
1    Nothing to return              a.txt
2    Nothing to return              b.txt
3    Nothing to return              a.txt
4    10/01/2016                     d.txt
5    02/20/2017                     e.txt

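Does anyone have any suggestions? I'm not sure where to start.

Edit #1:

I edited the original DataFrame and the expected output to better reflect what I need.

3 answers:

Answer 0: (score: 2)

This does not exactly match your expected output, but the resulting structure may be more useful and can easily be converted into what you need.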

Basically, this is a job for a regular expression: the pattern needs to find anything of the form digits/digits/digits.
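
A minimal sketch of the kind of regular expression this answer describes. The use of `Series.str.findall` and the pattern `\d+/\d+/\d+` are assumptions chosen for illustration, not necessarily the answerer's own code. It returns one list of matches per row, which can then be joined into the string form asked for in the question:

```python
import pandas as pd

df2 = pd.DataFrame(['Apple, 10/01/2016, 31/10/18, david/kate', 'orange', 'pear',
                    'Apple', '10/01/2016', '02/20/2017'], columns=['A'])

# Assumed pattern: digits, a slash, digits, a slash, digits.
DATE_PATTERN = r'\d+/\d+/\d+'

# str.findall returns a list of all matches per row (empty list if none).
dates = df2['A'].str.findall(DATE_PATTERN)

# Optional conversion to the text form from the question.
as_text = dates.apply(
    lambda found: ', '.join(found) if found else 'Nothing to return'
)
```

Keeping the lists (rather than joined strings) makes it easy to count or further parse the dates later.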

Answer 1: (score: 1)

Use extractall, then add reindex(df2.index).fillna('Nothing to return') to restore the rows that have no match.

df2.A.str.extractall(r'(((?:\d+[/-])?\d+[/-]\d+))')[0].groupby(level=0).apply(','.join)
Out[459]: 
0    10/01/2016,31/10/18
4             10/01/2016
5             02/20/2017
Name: 0, dtype: object

Update

df2.A.str.extractall(r'(((?:\d+[/-])?\d+[/-]\d+))')[0].groupby(level=0).apply(','.join).reindex(df2.index).fillna('Nothing to return')
Out[463]: 
0    10/01/2016,31/10/18
1      Nothing to return
2      Nothing to return
3      Nothing to return
4             10/01/2016
5             02/20/2017
Name: 0, dtype: object
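
To see why the groupby(level=0) and reindex steps are needed, here is a step-by-step sketch of the same idea. It assumes a single capture group (the pattern above wraps it in a second, redundant one) and joins with ', ' instead of ',' to match the spacing in the question:

```python
import pandas as pd

df2 = pd.DataFrame(['Apple, 10/01/2016, 31/10/18, david/kate', 'orange', 'pear',
                    'Apple', '10/01/2016', '02/20/2017'], columns=['A'])

# extractall returns one row per match, indexed by
# (original row label, match number within that row).
# Rows without any match (1, 2 and 3 here) are simply absent.
matches = df2['A'].str.extractall(r'((?:\d+[/-])?\d+[/-]\d+)')

# Collapse the matches of each original row back into a single string.
joined = matches[0].groupby(level=0).agg(', '.join)

# Bring back the rows that had no match and give them a placeholder.
result = joined.reindex(df2.index).fillna('Nothing to return')
```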

Answer 2: (score: 1)

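import re

def my_func(row):
    # Collect every comma-separated piece that starts with digits/digits/digits.
    dates = []
    for d in row.split(","):
        # Raw string avoids invalid-escape warnings; \d+ requires at least one digit.
        match = re.match(r'\d+/\d+/\d+', d.strip())
        if match:
            dates.append(match.group(0))
    # Join with ', ' so the result matches the output shown below.
    return ', '.join(dates) if dates else "Nothing to return"

# Pass the function directly; wrapping it in a lambda is unnecessary.
df2.A = df2.A.apply(my_func)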

Output:

                        A        file_name
0    10/01/2016, 31/10/18           a.txt
1    Nothing to return              a.txt
2    Nothing to return              b.txt
3    Nothing to return              a.txt
4    10/01/2016                     d.txt
5    02/20/2017                     e.txt