Check whether a string is in a pandas DataFrame column and create a new DataFrame

Time: 2017-05-30 23:59:30

Tags: python pandas dataframe substring

I am trying to check whether a string is in a pandas column. I have tried two approaches, but both of them seem to match substrings.

itemName = "eco drum ecommerce"
words = self.itemName.split(" ")
df.columns = ['key','word','umbrella', 'freq']
df = df.dropna()
df = df.loc[df['word'].isin(words)]

I also tried it this way, but this matches substrings as well:

words = self.itemName.split(" ")
words = '|'.join(words)
df.columns = ['key','word','umbrella', 'freq']
df = df.dropna()
df = df.loc[df['word'].str.contains(words, case=False)]

The input is: "eco drum"

Then I did this:

words = self.itemName.split(" ")
words = '|'.join(words)

Which ends up as this:

eco|drum

This is the "word" column:

[image: contents of the "word" column]

Thanks. Is there a way to do this without matching substrings?

1 Answer:

Answer 0 (score: 2)

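You have the right idea. By default, .contains has its regex pattern matching option set to True. So all you need to do is add anchors to your regex pattern; for example, "ball" becomes "^ball$".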

import pandas as pd
df = pd.DataFrame(columns=['key'])
df["key"] = ["largeball", "ball", "john", "smallball", "Ball"]
# The anchors ^ and $ make the pattern match the whole value only
print(df.loc[df['key'].str.contains("^ball$", case=False)])
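
To see what the anchors change, here is a small sketch that runs the unanchored and the anchored pattern against the same data. The variable names unanchored and anchored are only illustrative and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"key": ["largeball", "ball", "john", "smallball", "Ball"]})

# Unanchored: matches any value that contains "ball" anywhere
unanchored = df.loc[df["key"].str.contains("ball", case=False), "key"].tolist()

# Anchored: matches only values that are exactly "ball" (ignoring case)
anchored = df.loc[df["key"].str.contains("^ball$", case=False), "key"].tolist()

print(unanchored)  # ['largeball', 'ball', 'smallball', 'Ball']
print(anchored)    # ['ball', 'Ball']
```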

More specifically for your question: since you are searching for multiple words, you have to build the regex pattern that you pass to contains.

# Create dataframe
df = pd.DataFrame(columns=['word'])
df["word"] = ["ecommerce", "ecommerce", "ecommerce", "ecommerce", "eco", "drum"]
# Create regex pattern
word = "eco drum"
words = word.split(" ")
words = "|".join("^{}$".format(word) for word in words)
# Find matches in dataframe
print(df.loc[df['word'].str.contains(words, case=False)])

In the line words = "|".join("^{}$".format(word) for word in words), the argument passed to join is called a generator expression. Given ['eco', 'drum'], the line produces this pattern: ^eco$|^drum$
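
As a possible variation on the pattern above: if the search words might contain regex metacharacters such as "." or "+", each word can be escaped with re.escape before it is anchored. And if only exact, case-insensitive matches are needed, one option is to skip regex entirely and compare lowercased values with isin. The sketch below shows both. The helper name build_pattern and the sample data are assumptions made for illustration, not part of the original answer.

```python
import re
import pandas as pd

df = pd.DataFrame({"word": ["ecommerce", "ecommerce", "eco", "drum", "Drum", "eco."]})

def build_pattern(phrase):
    """Hypothetical helper: escape and anchor each whitespace-separated word."""
    return "|".join("^{}$".format(re.escape(w)) for w in phrase.split())

pattern = build_pattern("eco drum")
print(pattern)  # ^eco$|^drum$

# Regex approach: anchored, escaped, case-insensitive
regex_matches = df.loc[df["word"].str.contains(pattern, case=False), "word"].tolist()

# Regex-free alternative: exact, case-insensitive membership test
targets = [w.lower() for w in "eco drum".split()]
isin_matches = df.loc[df["word"].str.lower().isin(targets), "word"].tolist()

print(regex_matches)  # ['eco', 'drum', 'Drum']
print(isin_matches)   # ['eco', 'drum', 'Drum']
```

Both approaches skip "ecommerce" and "eco.", because neither value is exactly equal to one of the search words.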