Extracting words from a text column in a DataFrame

Time: 2020-05-28 18:40:51

Tags: python pandas text-mining

I need to create new columns from an existing column. The dataset is created with the following code (I have extracted only a few rows):

import pandas as pd

new_dataframe = pd.DataFrame({
    "Name": ['John', 'Lukas', 'Bridget', 'Carol','Madison'],
    "Notes": ["__ years old. NA", "__ years old. NA", 
        "__ years old. NA", "__ years old. Old account.", 
        "__ years old. New VIP account."], 
    "Status": [True, False, True, True, True]})

which produces the following:

Name     Notes                           Status
John     23 years old. NA                True
Lukas    52 years old. NA                False
Bridget  64 years old. NA                True
Carol    31 years old. Old account       True
Madison  54 years old. New VIP account.  True

I need to create two new columns containing the age information in the following formats:

  1. __ years old (three words): e.g. 23 years old;
  2. __ (number only): e.g. 23

In the end I should have:

Name     Notes                           Status  L_Age         S_Age
John     23 years old. NA                True    23 years old  23
Lukas    52 years old. NA                False   52 years old  52
Bridget  64 years old. NA                True    64 years old  64
Carol    31 years old. Old account       True    31 years old  31
Madison  54 years old. New VIP account.  True    54 years old  54

I don't know how to extract the first three words, and then only the first word, to create the new columns. I tried

new_dataframe.loc[new_dataframe.Notes == '', 'L_Age'] = new_dataframe.Notes.str.split()[:3]
new_dataframe.loc[new_dataframe.Notes == '', 'S_Age'] = new_dataframe.Notes.str.split()[0]

but this fails (ValueError: Must have equal len keys and value when setting with an iterable).

Any help would be appreciated.

2 answers:

Answer 0 (score: 2)

You can use this pattern to extract the information and join it to the original DataFrame:

pattern = r'^(?P<L_Age>(?P<S_Age>\d+) years? old)'

new_dataframe = new_dataframe.join(new_dataframe.Notes.str.extract(pattern))

Output:

      Name                           Notes  Status         L_Age S_Age
0     John                23 years old. NA    True  23 years old    23
1    Lukas                52 years old. NA   False  52 years old    52
2  Bridget                64 years old. NA    True  64 years old    64
3    Carol       31 years old. Old account    True  31 years old    31
4  Madison  54 years old. New VIP account.    True  54 years old    54
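
One detail worth checking: `str.extract` returns the captured groups as strings, so `S_Age` is text rather than a number. The following is a minimal sketch, assuming numeric ages are wanted; the sample rows use the concrete ages from the question's output table, and the names `df`, `extracted` and `result` are illustrative only.

```python
import pandas as pd

# Sample rows with concrete ages, taken from the output table in the question.
df = pd.DataFrame({
    "Name": ["John", "Lukas", "Bridget"],
    "Notes": ["23 years old. NA", "52 years old. NA", "64 years old. NA"],
})

# Outer named group -> L_Age, inner named group -> S_Age.
pattern = r"^(?P<L_Age>(?P<S_Age>\d+) years? old)"
extracted = df["Notes"].str.extract(pattern)

# The captured values are strings; convert S_Age to a numeric column.
# Rows where the pattern did not match are NaN and stay NaN.
extracted["S_Age"] = pd.to_numeric(extracted["S_Age"], errors="coerce")

result = df.join(extracted)
print(result)
```

If every row matches, the converted column holds integers; with missing matches pandas typically falls back to a float column so that NaN can be represented.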

Answer 1 (score: 1)

IIUC:

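def get_first_n_words(txt, n):
    # Split on single spaces and keep only the first n words.
    words = txt.split(' ')
    assert len(words) >= n
    return ' '.join(words[:n])

# Note: the third word keeps its trailing period, e.g. "23 years old."
new_dataframe['L_Age'] = new_dataframe['Notes'].apply(lambda x: get_first_n_words(x, 3))
new_dataframe['S_Age'] = new_dataframe['Notes'].apply(lambda x: get_first_n_words(x, 1))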
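
For comparison, the attempt in the question seems to fail for two reasons. The mask `new_dataframe.Notes == ''` selects no rows, because no note is empty. And `.str.split()[:3]` slices the Series itself (its first three rows) rather than each row's list of words. A second `.str` accessor is what indexes into each list. Below is a minimal sketch of that vectorized variant, using two sample rows with the ages from the question's output table. The names `df` and `words` are illustrative only, and the final `rstrip` is an added assumption to drop the period that whitespace splitting leaves on the third word.

```python
import pandas as pd

df = pd.DataFrame({
    "Notes": ["23 years old. NA", "31 years old. Old account."],
})

# .str.split() yields a Series of lists, one list of words per row.
words = df["Notes"].str.split()

# The second .str works element-wise on each list:
# slice the first three words, join them, then strip the trailing period.
df["L_Age"] = words.str[:3].str.join(" ").str.rstrip(".")

# First word only; this is still a string, not a number.
df["S_Age"] = words.str[0]

print(df)
```

This assigns to whole columns, so no boolean mask is needed and the lengths always match.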