I need to create new columns from an existing column. The dataset is created with the following code (I have extracted only a few rows):
import pandas as pd
new_dataframe = pd.DataFrame({
"Name": ['John', 'Lukas', 'Bridget', 'Carol','Madison'],
"Notes": ["__ years old. NA", "__ years old. NA",
"__ years old. NA", "__ years old. Old account.",
"__ years old. New VIP account."],
"Status": [True, False, True, True, True]})
which produces the following:
Name Notes Status
John 23 years old. NA True
Lukas 52 years old. NA False
Bridget 64 years old. NA True
Carol 31 years old. Old account True
Madison 54 years old. New VIP account. True
I need to create two new columns that contain the age information in the following format.
In the end I should have:
Name Notes Status L_Age S_Age
John 23 years old. NA True 23 years old 23
Lukas 52 years old. NA False 52 years old 52
Bridget 64 years old. NA True 64 years old 64
Carol 31 years old. Old account True 31 years old 31
Madison 54 years old. New VIP account. True 54 years old 54
I don't know how to extract the first three words, and then only the first word, to create the new columns. I tried
new_dataframe.loc[new_dataframe.Notes == '', 'L_Age'] = new_dataframe.Notes.str.split()[:3]
new_dataframe.loc[new_dataframe.Notes == '', 'S_Age'] = new_dataframe.Notes.str.split()[0]
but this fails (ValueError: Must have equal len keys and value when setting with an iterable).
Any help would be appreciated.
Answer 0 (score: 2)
You can use this pattern to extract the information and join it back onto the DataFrame:
pattern = r'^(?P<L_Age>(?P<S_Age>\d+) years? old)'
new_dataframe = new_dataframe.join(new_dataframe.Notes.str.extract(pattern))
Output:
Name Notes Status L_Age S_Age
0 John 23 years old. NA True 23 years old 23
1 Lukas 52 years old. NA False 52 years old 52
2 Bridget 64 years old. NA True 64 years old 64
3 Carol 31 years old. Old account True 31 years old 31
4 Madison 54 years old. New VIP account. True 54 years old 54
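As a self-contained check of the pattern, here is a minimal sketch. It assumes the `Notes` values contain real ages (the ages below are copied from the question's output table, since the construction code uses `__` placeholders), and the names `df`, `extracted`, and `result` are illustrative only. The named groups become the column names, and because `str.extract` returns strings, the sketch converts `S_Age` with `pd.to_numeric` in case a numeric column is wanted.

```python
import pandas as pd

# Sample data; ages are assumed from the question's output table.
df = pd.DataFrame({
    "Name": ["John", "Lukas", "Bridget", "Carol", "Madison"],
    "Notes": ["23 years old. NA", "52 years old. NA",
              "64 years old. NA", "31 years old. Old account.",
              "54 years old. New VIP account."],
    "Status": [True, False, True, True, True],
})

# Raw string so that \d reaches the regex engine unchanged.
# Outer group -> L_Age ("23 years old"), inner group -> S_Age ("23").
pattern = r"^(?P<L_Age>(?P<S_Age>\d+) years? old)"

extracted = df["Notes"].str.extract(pattern)

# str.extract yields strings; convert the short age to a number if needed.
extracted["S_Age"] = pd.to_numeric(extracted["S_Age"])

result = df.join(extracted)
print(result)
```

Rows whose `Notes` do not start with a number followed by "year(s) old" would get NaN in both new columns rather than raising an error.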
Answer 1 (score: 1)
IIUC:
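def get_first_n_words(txt, n):
    # Split on single spaces and keep the first n words.
    # Punctuation attached to a word is kept (e.g. 'old.').
    l = txt.split(' ')
    assert len(l) >= n
    return ' '.join(l[:n])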
new_dataframe['L_Age'] = new_dataframe['Notes'].apply(lambda x: get_first_n_words(x, 3))
new_dataframe['S_Age'] = new_dataframe['Notes'].apply(lambda x: get_first_n_words(x, 1))
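The same idea can also be written with the vectorized `.str` accessor instead of `apply`. This is a minimal sketch, not the answer author's code: the sample ages are assumed from the question's output table, and `df` and `words` are illustrative names. It also shows what went wrong in the question's attempt: `Notes.str.split()[:3]` slices the Series itself (its first three rows), whereas `.str[:3]` slices each row's list of words.

```python
import pandas as pd

# Sample data; ages are assumed from the question's output table.
df = pd.DataFrame({
    "Notes": ["23 years old. NA",
              "31 years old. Old account.",
              "54 years old. New VIP account."],
})

# One list of words per row.
words = df["Notes"].str.split()

# .str[:3] slices inside each list; join the words back together and
# drop the trailing period so the result reads "23 years old".
df["L_Age"] = words.str[:3].str.join(" ").str.rstrip(".")

# .str[0] takes the first word of each row (still a string).
df["S_Age"] = words.str[0]

print(df)
```

Assigning to the whole column avoids the `.loc` mask from the question, which compared `Notes` to an empty string and therefore selected no rows.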