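I am trying to capture elements inside a list-valued column of a pandas DataFrame. The filter below keeps the entire list whenever the string is present. How can I keep, row by row, only the elements that contain a specific string and ignore the rest?
Here is what I tried...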
import numpy as np
import pandas as pd

l1 = [1, 2, 3, 4, 5, 6]
l2 = ['hello world \n my world', 'world is a great place \n we live in it', 'planet earth', np.nan, '\n save the water', '']
df = pd.DataFrame(list(zip(l1, l2)),
                  columns=['id', 'sentence'])
df['sentence_split'] = df['sentence'].str.split('\n')
print(df)
The result of this code:
df[df.sentence_split.str.join(' ').str.contains('world', na=False)] # does the trick but still not exactly what I am looking for.
id sentence sentence_split
1 hello world \n my world [hello world , my world]
2 world is a great place \n we live in it [world is a great place , we live in it]
But I am looking for:
id sentence sentence_split
1 hello world \n my world hello world; my world
2 world is a great place \n we live in it world is a great place
Answer 0 (score: 1)
You want to search for a string inside a Series of lists. One way to do it:
# Drop NaN rows
df = df.dropna(subset=["sentence_split"])
Then apply a function that keeps only the list elements containing the string you are looking for:
# Apply this lambda function
df["sentence_split"] = df["sentence_split"].apply(lambda x: [i for i in x if "world" in i])
id sentence sentence_split
0 1 hello world \n my world [hello world , my world]
1 2 world is a great place \n we live in it [world is a great place ]
2 3 planet earth []
4 5 \n save the water []
5 6 []
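The output above still contains lists and keeps the rows with no match, whereas the question asks for a joined string and only the matching rows. A minimal sketch of one way to close that gap, assuming the "; " separator from the desired output and that surrounding whitespace should be stripped. The helper name keep_matching and its parameters are hypothetical, chosen here only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "sentence": [
        "hello world \n my world",
        "world is a great place \n we live in it",
        "planet earth",
        np.nan,
        "\n save the water",
        "",
    ],
})

# Hypothetical helper (not part of pandas): keep only the newline-separated
# parts of a sentence that contain `needle`, joined by `sep`.
def keep_matching(sentence, needle="world", sep="; "):
    if not isinstance(sentence, str):  # NaN and other non-string values
        return ""
    parts = [part.strip() for part in sentence.split("\n")]
    return sep.join(part for part in parts if needle in part)

df["sentence_split"] = df["sentence"].apply(keep_matching)

# Drop rows where nothing matched.
result = df[df["sentence_split"] != ""]
print(result)
```

Handling NaN inside the helper means the dropna step is not needed for this variant; rows without a match are removed by the final filter instead.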