I want to extract the word that follows a number. For example, I have this column in a DataFrame:
Recipe
Halve the clementine and place into the cavity along with the bay leaves. Transfer the duck to a medium roasting tray and roast for around 1 hour 20 minutes.
Add the stock, then bring to the boil and reduce to a simmer for around 15 minutes.
2 heaped teaspoons Chinese five-spice
100 ml Marsala
1 litre organic chicken stock
I want to get a new column containing the extracted values:
New Column
[1 hour, 20 minutes]
15 minutes
2 heaped
100 ml
1 litre
I need to compare them against a list of values:
to_compare= ["1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"]
I want to see how many elements each row has in common with this list. Thanks for your help.
Answer 0 (score: 1)
We use Series.str.extractall with the pattern digits, whitespace, word characters. Then we check which of the matches appear in to_compare, and finally use GroupBy.sum to count the matches per row:
matches = df['Col'].str.extractall(r'(\d+\s\w+)')
df['matches'] = matches[0].isin(to_compare).groupby(level=0).sum()
Col matches
0 Halve the clementine and place into the cavity... 2.0
1 Add the stock, then bring to the boil and redu... 1.0
2 2 heaped teaspoons Chinese five-spice 0.0
3 100 ml Marsala 1.0
4 1 litre organic chicken stock 0.0
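The steps above can be put together as a self-contained sketch. It assumes the column is named Col, as in the answer, and uses shortened recipe strings. The last row and the reindex call are additions for illustration: a row with no number in it produces no entry in matches, so without reindex its count would be NaN rather than 0.

```python
import pandas as pd

df = pd.DataFrame({"Col": [
    "Transfer the duck to a tray and roast for around 1 hour 20 minutes.",
    "Bring to the boil and reduce to a simmer for around 15 minutes.",
    "2 heaped teaspoons Chinese five-spice",
    "100 ml Marsala",
    "1 litre organic chicken stock",
    "Season to taste",  # hypothetical row with no number at all
]})
to_compare = ["1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"]

# One row per match, indexed by (original row, match number).
matches = df["Col"].str.extractall(r"(\d+\s\w+)")

# True where the extracted text is in the list, summed per original row.
counts = matches[0].isin(to_compare).groupby(level=0).sum()

# Rows without any match are missing from counts; fill them with 0.
df["matches"] = counts.reindex(df.index, fill_value=0).astype(int)
print(df["matches"].tolist())
```

Note that the entry "2" in to_compare can never be counted with this pattern, because the pattern always captures a number together with the word after it.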
In addition, matches returns:
0
match
0 0 1 hour
1 20 minutes
1 0 15 minutes
2 0 2 heaped
3 0 100 ml
4 0 1 litre
To get these as lists, use:
matches.groupby(level=0).agg(list)
0
0 [1 hour, 20 minutes]
1 [15 minutes]
2 [2 heaped]
3 [100 ml]
4 [1 litre]
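To attach these lists to the original frame as the new column the question asks for, something like the following should work. This is a minimal sketch: the column name extracted is a hypothetical choice, and the recipe strings are shortened.

```python
import pandas as pd

df = pd.DataFrame({"Col": [
    "Roast for around 1 hour 20 minutes.",
    "Reduce to a simmer for around 15 minutes.",
    "2 heaped teaspoons Chinese five-spice",
    "100 ml Marsala",
    "1 litre organic chicken stock",
]})

matches = df["Col"].str.extractall(r"(\d+\s\w+)")

# Collect the matches of each original row into one list.
# Selecting column 0 first gives a Series, which aligns with df on its index.
df["extracted"] = matches.groupby(level=0)[0].agg(list)
print(df["extracted"].tolist())
```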
Answer 1 (score: 0)
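You can use a regular expression to build a pattern that extracts a number and the word that follows it, then apply that function to the whole column of the DataFrame: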
import pandas as pd
import re
df = pd.DataFrame({'text':["Halve the clementine and place into the cavity along with the bay leaves. Transfer the duck to a medium roasting tray and roast for around 1 hour 20 minutes.",
"Add the stock, then bring to the boil and reduce to a simmer for around 15 minutes.",
"2 heaped teaspoons Chinese five-spice",
"100 ml Marsala",
"1 litre organic chicken stock"]})
def extract_qty(txt):
    # Return every "number word" pair found in the text.
    return re.findall(r'\d+ \w+', txt)
df['extracted_qty'] = df['text'].apply(extract_qty)
df
# text extracted_qty
#0 Halve the clementine and place into the cavity... [1 hour, 20 minutes]
#1 Add the stock, then bring to the boil and redu... [15 minutes]
#2 2 heaped teaspoons Chinese five-spice [2 heaped]
#3 100 ml Marsala [100 ml]
#4 1 litre organic chicken stock [1 litre]
Use to_compare and a list comprehension to extract the common values:
to_compare= ["1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"]
df['common'] = df['extracted_qty'].apply(lambda x: [el for el in x if el in to_compare])
# text extracted_qty common
#0 Halve the clementine ... [1 hour, 20 minutes] [1 hour, 20 minutes]
#1 Add the stock, then ... [15 minutes] [15 minutes]
#2 2 heaped teaspoons ... [2 heaped] []
#3 100 ml Marsala [100 ml] [100 ml]
#4 1 litre organic chicken... [1 litre] []
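Since the question asks for how many elements each row has in common with the list, the same idea can return a count instead of a list. The sketch below is one possible way to do that: count_common and the n_common column are hypothetical names, the recipe strings are shortened, and to_compare is turned into a set only to make the membership test cheaper.

```python
import re
import pandas as pd

df = pd.DataFrame({"text": [
    "Roast for around 1 hour 20 minutes.",
    "Reduce to a simmer for around 15 minutes.",
    "2 heaped teaspoons Chinese five-spice",
    "100 ml Marsala",
    "1 litre organic chicken stock",
]})
to_compare = {"1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"}

def count_common(txt):
    # Count the extracted "number word" pairs that appear in to_compare.
    return sum(qty in to_compare for qty in re.findall(r"\d+ \w+", txt))

df["n_common"] = df["text"].apply(count_common)
print(df["n_common"].tolist())
```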