Extract substrings from text in a pandas DataFrame into a new column

Time: 2017-10-24 23:21:15

Tags: python regex string pandas extract

I have a list of words that I want to look for, shown below:

word_list = ['one','three']

I have a column in a pandas DataFrame that contains text like the following.

TEXT                                       |
-------------------------------------------|
"Perhaps she'll be the one for me."        |
"Is it two or one?"                        |
"Mayhaps it be three afterall..."          |
"Three times and it's a charm."            |
"One fish, two fish, red fish, blue fish." |
"There's only one cat in the hat."         |
"One does not simply code into pandas."    |
"Two nights later..."                      |
"Quoth the Raven... nevermore."            |

The desired output is shown below. It keeps the original text column and extracts only the words from word_list into a new column.

TEXT                                       | EXTRACT
-------------------------------------------|---------------
"Perhaps she'll be the one for me."        | one
"Is it two or one?"                        | one
"Mayhaps it be three afterall..."          | three
"Three times and it's a charm."            | three
"One fish, two fish, red fish, blue fish." | one
"There's only one cat in the hat."         | one
"One does not simply code into pandas."    | one
"Two nights later..."                      | 
"Quoth the Raven... nevermore."            |

Is there a way to do this in Python 2.7?

1 answer:

Answer 0 (score: 5)

Use str.extract:

import re

# Build the pattern '(one|three)', take the first match per row, then normalize it
df['EXTRACT'] = df.TEXT.str.extract('({})'.format('|'.join(word_list)),
                        flags=re.IGNORECASE, expand=False).str.lower().fillna('')
df['EXTRACT']

0      one
1      one
2    three
3    three
4      one
5      one
6      one
7         
8         
Name: EXTRACT, dtype: object

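The words in word_list are joined with the regex alternation operator | and wrapped in a capture group; the resulting pattern is passed to str.extract, which returns the first match in each row.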
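To make the pattern construction concrete, here is a minimal sketch using only the standard re module. The sample sentence is taken from the question; re.search stands in for what str.extract does on a single row, which is an illustration rather than how pandas is implemented internally.

```python
import re

word_list = ['one', 'three']

# Join the words with "|" (alternation) and wrap them in a single capture group.
pattern = '({})'.format('|'.join(word_list))
print(pattern)

# For one string, re.search finds the first occurrence of any listed word.
match = re.search(pattern, "Is it two or one?", flags=re.IGNORECASE)
first_word = match.group(1)
print(first_word)
```

Note that this pattern is not anchored to word boundaries, so it would also match inside longer words such as "someone". If that matters for your data, wrapping the group in \b on both sides is one possible adjustment.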

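The re.IGNORECASE flag makes the comparison case-insensitive, and the matches are then lowercased to agree with your expected output. Rows with no match come back as NaN, which fillna('') turns into empty strings.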
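For anyone who wants to reproduce the result end to end, the following is a self-contained sketch that rebuilds the DataFrame from the question. The intermediate names pattern and extracted are only illustrative, and it assumes a pandas version in which str.extract accepts the flags and expand arguments.

```python
import re
import pandas as pd

word_list = ['one', 'three']

# Sample data copied from the question.
df = pd.DataFrame({'TEXT': [
    "Perhaps she'll be the one for me.",
    "Is it two or one?",
    "Mayhaps it be three afterall...",
    "Three times and it's a charm.",
    "One fish, two fish, red fish, blue fish.",
    "There's only one cat in the hat.",
    "One does not simply code into pandas.",
    "Two nights later...",
    "Quoth the Raven... nevermore.",
]})

# Case-insensitive match against '(one|three)'; expand=False returns a Series.
pattern = '({})'.format('|'.join(word_list))
extracted = df['TEXT'].str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Lowercase the matches ("Three" -> "three") and replace missing values with ''.
df['EXTRACT'] = extracted.str.lower().fillna('')
print(df['EXTRACT'])
```

Splitting the chain into two steps is purely for readability; it behaves the same as the one-liner in the answer.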