Text processing in a DataFrame: extracting words

Time: 2020-06-01 23:09:05

Tags: python regex pandas

I want to check the word that follows a number. For example, I have this column, Recipe, in a DataFrame:

Halve the clementine and place into the cavity along with the bay leaves. Transfer the duck to a medium roasting tray and roast for around 1 hour 20 minutes.
Add the stock, then bring to the boil and reduce to a simmer for around 15 minutes.
2 heaped teaspoons Chinese five-spice 
100 ml Marsala
1 litre organic chicken stock

I would like to get a new column containing the extracted number-and-word pairs:

New Column
[1 hour, 20 minutes]
15 minutes
2 heaped
100 ml
1 litre

I then need to compare them against a list of values:

to_compare= ["1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"]

and see how many elements each row has in common with that list. Thanks for your help.

2 answers:

Answer 0: (score: 1)

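We use Series.str.extractall with a pattern of the form digits, whitespace, word characters. We then check which of the matches appear in to_compare, and finally use GroupBy.sum to count the matches per row.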

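# Extract every "number + whitespace + word" occurrence; one result row per match
matches = df['Col'].str.extractall(r'(\d+\s\w+)')
# Flag the matches found in to_compare, then sum the flags per original row
df['matches'] = matches[0].isin(to_compare).groupby(level=0).sum()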

                                                 Col  matches
0  Halve the clementine and place into the cavity...      2.0
1  Add the stock, then bring to the boil and redu...      1.0
2              2 heaped teaspoons Chinese five-spice      0.0
3                                     100 ml Marsala      1.0
4                      1 litre organic chicken stock      0.0
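
One thing to watch with this approach: a row that contains no number at all produces no entry in matches, so its count would be missing (NaN) after the assignment. A minimal, self-contained sketch of one way to handle that, assuming a count of 0 is wanted for such rows. The shortened sample texts and the "A pinch of salt" row are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"Col": [
    "Roast for around 1 hour 20 minutes.",
    "Simmer for around 15 minutes.",
    "2 heaped teaspoons Chinese five-spice",
    "100 ml Marsala",
    "A pinch of salt",  # hypothetical row with no number in it
]})
to_compare = ["1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"]

# One result row per "number + whitespace + word" match
matches = df["Col"].str.extractall(r"(\d+\s\w+)")

# Count, per original row, how many matches appear in to_compare
counts = matches[0].isin(to_compare).groupby(level=0).sum()

# Rows without any match are absent from counts; fill them with 0
df["matches"] = counts.reindex(df.index, fill_value=0).astype(int)

print(df)
```

The reindex call restores the rows that extractall dropped, and astype(int) keeps the column as integers instead of floats.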

For reference, matches contains:

                  0
  match            
0 0          1 hour
  1      20 minutes
1 0      15 minutes
2 0        2 heaped
3 0          100 ml
4 0         1 litre

To collect these into one list per row, use:

matches.groupby(level=0).agg(list)

                      0
0  [1 hour, 20 minutes]
1          [15 minutes]
2            [2 heaped]
3              [100 ml]
4             [1 litre]
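
To attach these lists to the original frame as the new column the question asks for, one possible approach is to aggregate the single capture column as a Series and let pandas align it on the index. A minimal sketch, assuming the column is named Col as above; the column name quantities is made up for illustration, and rows without any match would end up as NaN rather than an empty list:

```python
import pandas as pd

df = pd.DataFrame({"Col": [
    "Halve the clementine and place into the cavity along with the bay leaves. "
    "Transfer the duck to a medium roasting tray and roast for around 1 hour 20 minutes.",
    "Add the stock, then bring to the boil and reduce to a simmer for around 15 minutes.",
    "2 heaped teaspoons Chinese five-spice",
    "100 ml Marsala",
    "1 litre organic chicken stock",
]})

matches = df["Col"].str.extractall(r"(\d+\s\w+)")

# Selecting column 0 gives a Series, so agg(list) yields one list per original row
df["quantities"] = matches[0].groupby(level=0).agg(list)

print(df["quantities"].tolist())
```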

Answer 1: (score: 0)

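You can build a regular expression that extracts a number together with the word that follows it, then apply that function to the whole column of the DataFrame: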

import pandas as pd
import re
df = pd.DataFrame({'text':["Halve the clementine and place into the cavity along with the bay leaves. Transfer the duck to a medium roasting tray and roast for around 1 hour 20 minutes.",
           "Add the stock, then bring to the boil and reduce to a simmer for around 15 minutes.",
           "2 heaped teaspoons Chinese five-spice",
           "100 ml Marsala",
           "1 litre organic chicken stock"]})


def extract_qty(txt):
    # Return every "number + space + word" occurrence found in txt
    return re.findall(r'\d+ \w+', txt)

df['extracted_qty'] = df['text'].apply(extract_qty)

df    
#   text                                                extracted_qty
#0  Halve the clementine and place into the cavity...   [1 hour, 20 minutes]
#1  Add the stock, then bring to the boil and redu...   [15 minutes]
#2  2 heaped teaspoons Chinese five-spice               [2 heaped]
#3  100 ml Marsala                                      [100 ml]
#4  1 litre organic chicken stock                       [1 litre]

Then use to_compare and a list comprehension to keep only the values the two have in common:

to_compare= ["1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"]

df['common'] = df['extracted_qty'].apply(lambda x: [el for el in x if el in to_compare])


#   text                        extracted_qty           common
#0  Halve the clementine ...    [1 hour, 20 minutes]    [1 hour, 20 minutes]
#1  Add the stock, then  ...    [15 minutes]            [15 minutes]
#2  2 heaped teaspoons ...      [2 heaped]              []
#3  100 ml Marsala              [100 ml]                [100 ml]
#4  1 litre organic chicken...  [1 litre]               []
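
Since the question asks how many elements each row shares with to_compare, the count can also be computed directly. A minimal sketch under the same assumptions as above; the helper count_common and the column name n_common are made up for illustration, and to_compare is turned into a set only to make the membership test cheaper:

```python
import re
import pandas as pd

df = pd.DataFrame({"text": [
    "Halve the clementine and place into the cavity along with the bay leaves. "
    "Transfer the duck to a medium roasting tray and roast for around 1 hour 20 minutes.",
    "Add the stock, then bring to the boil and reduce to a simmer for around 15 minutes.",
    "2 heaped teaspoons Chinese five-spice",
    "100 ml Marsala",
    "1 litre organic chicken stock",
]})
to_compare = {"1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"}

def count_common(txt):
    # Count the extracted "number + space + word" pairs that appear in to_compare
    return sum(qty in to_compare for qty in re.findall(r"\d+ \w+", txt))

df["n_common"] = df["text"].apply(count_common)

print(df["n_common"].tolist())
```

Note that the bare "2" in to_compare never matches here, because the pattern always captures a number together with the following word ("2 heaped").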