Question

I need to clean a pandas DataFrame by removing redundant information. For example:

     name                                        strength
770  Vitamin B12 Tab 500mcg                      500 mcg
771  Vitamin B12 Tab 5mcg                        5 mcg
772  Vitamin B12 Tablets 250mcg                  250 mcg
773  Vitamin B12-folic Acid                      None
774  Vitamin B6 & B12 With Folic Acid            None
775  Vitamin Deficiency Injectable System - B12  None
776  Vitamine 110 Liq                            None
777  Vitamine B-12 Tab 100mcg                    100 mcg
778  Vitamine B12 25 Mcg - Tablet                25 mcg
779  Vitamine B12 250mcg                         250 mcg

I need to remove the strength information from the first column, name, i.e.:

     name                                        strength
770  Vitamin B12 Tab                             500 mcg
771  Vitamin B12 Tab                             5 mcg
772  Vitamin B12 Tablets                         250 mcg
773  Vitamin B12-folic Acid                      None
774  Vitamin B6 & B12 With Folic Acid            None
775  Vitamin Deficiency Injectable System - B12  None
776  Vitamine 110 Liq                            None
777  Vitamine B-12 Tab                           100 mcg
778  Vitamine B12 - Tablet                       25 mcg
779  Vitamine B12                                250 mcg

Note that the strength as written in the strength column may not exactly match the strength as written in the name column; the two can differ in whitespace (500 mcg vs. 500mcg).

My straightforward solution is to loop over all possible values of strength and, whenever one matches in the name column, replace it with an empty string.

It does work, but I have a lot of data, and this is the least Pythonic and least efficient way to implement it.

Any suggestions?

Answer 1

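Using the re package to remove the unwanted redundant string, and applying the function to the rows of the pandas DataFrame, should do the job.

The code below shows a possible solution: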

import re
import pandas as pd

def remove_redundant_data(row):
    # Rows without a strength keep their name unchanged.
    if pd.isna(row["strength"]):
        return row["name"]
    # Build a pattern in which the whitespace inside the strength is optional.
    pattern = r"\s?".join(re.escape(part) for part in row["strength"].split())
    return re.sub(pattern + r"\s?", "", row["name"], flags=re.IGNORECASE).strip()

df = pd.DataFrame({
    "name": ["Vitamin B12 Tab 500mcg", "Vitamin B12 Tab 5mcg", "Vitamin B12 Tablets 250mcg",
             "Vitamin B12-folic Acid", "Vitamin B6 & B12 With Folic Acid",
             "Vitamin Deficiency Injectable System - B12", "Vitamine 110 Liq",
             "Vitamine B-12 Tab 100mcg", "Vitamine B12 25 Mcg - Tablet", "Vitamine B12 250mcg"],
    "strength": ["500 mcg", "5 mcg", "250 mcg", None, None, None, None,
                 "100 mcg", "25 mcg", "250 mcg"],
})

# Apply the function row by row (axis=1) and overwrite the name column.
df["name"] = df.apply(remove_redundant_data, axis=1)

The output DataFrame is then:

>>> df
                                         name strength
0                             Vitamin B12 Tab  500 mcg
1                             Vitamin B12 Tab    5 mcg
2                         Vitamin B12 Tablets  250 mcg
3                      Vitamin B12-folic Acid     None
4            Vitamin B6 & B12 With Folic Acid     None
5  Vitamin Deficiency Injectable System - B12     None
6                            Vitamine 110 Liq     None
7                           Vitamine B-12 Tab  100 mcg
8                       Vitamine B12 - Tablet   25 mcg
9                                Vitamine B12  250 mcg

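This way you use the strength column to find the redundant strings in the name column and remove them, while allowing for the fact that the redundant string may be written without the space.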
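To make the pattern-building step concrete, here is a minimal sketch of what happens for a single row. The helper names `strength_pattern` and `strip_strength` are made up for illustration, and the `(?<!\d)` guard is an addition that is not in the answer above: it assumes you do not want "5 mcg" to match inside "25mcg".

```python
import re

def strength_pattern(strength):
    # Hypothetical helper: turn "500 mcg" into a pattern that also matches "500mcg".
    parts = [re.escape(p) for p in strength.split()]
    # (?<!\d) is an added assumption: never start the match in the middle of a number.
    return re.compile(r"(?<!\d)" + r"\s?".join(parts) + r"\s?", re.IGNORECASE)

def strip_strength(name, strength):
    # Remove the strength from the name and trim leftover whitespace.
    return strength_pattern(strength).sub("", name).strip()

print(strip_strength("Vitamin B12 Tab 500mcg", "500 mcg"))        # Vitamin B12 Tab
print(strip_strength("Vitamine B12 25 Mcg - Tablet", "25 mcg"))   # Vitamine B12 - Tablet
print(strip_strength("Vitamin B12 Tab 25mcg", "5 mcg"))           # Vitamin B12 Tab 25mcg
```

Without the lookbehind, the last call would strip "5mcg" and leave a stray "2" behind, so the guard may be worth keeping if strengths in your data can be suffixes of one another.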

Answer 2

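I probably would not match against every possible strength combination. Since the items seem to contain roughly the same characters in both columns, it may be enough to use the strength column for a fuzzy search in the name column.

You could search case-insensitively and without whitespace, and that would probably handle most of the items.

A case-insensitive search can be done with regular expressions in Python: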

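import re

# case-insensitive, with the whitespace removed from the pattern
if re.search('5 mcg'.replace(" ", ""), 'Vitamin B12 Tab 5mcg', re.IGNORECASE):
    print("matched without whitespace")  # this branch runs
# case-insensitive, with the pattern kept as-is
if re.search('25 mcg', 'Vitamine B12 25 Mcg - Tablet', re.IGNORECASE):
    print("matched ignoring case")  # this branch runs as well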

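Of course, replace the literals here with your own variables.

Edit: there may be a more efficient way to do this with regular expressions, so if someone is more proficient with them, I would be happy to learn it.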
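As a rough sketch of how such a check could be run over a whole DataFrame: the function names `_squash` and `strength_in_name` and the three sample rows are illustrative assumptions, and whitespace is stripped from both the name and the strength, which goes slightly beyond the snippet above.

```python
import pandas as pd

def _squash(text):
    # Remove all whitespace and lowercase, e.g. "25 Mcg" -> "25mcg".
    return "".join(text.split()).lower()

def strength_in_name(name, strength):
    # Hypothetical helper: True if the strength appears in the name,
    # ignoring case and whitespace. Missing strengths never match.
    if pd.isna(strength):
        return False
    return _squash(strength) in _squash(name)

df = pd.DataFrame({
    "name": ["Vitamin B12 Tab 5mcg", "Vitamine B12 25 Mcg - Tablet", "Vitamine 110 Liq"],
    "strength": ["5 mcg", "25 mcg", None],
})

# One boolean per row: does the name contain the strength?
df["has_strength"] = [strength_in_name(n, s) for n, s in zip(df["name"], df["strength"])]
print(df)
```

Because removing whitespace joins neighbouring tokens ("B12 25 Mcg" becomes "b1225mcg"), a strength such as "225 mcg" would also be reported as present, so this is best treated as a quick filter rather than an exact match.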

Answer 3

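new_df = []
# First keep only the rows whose strength is not None.
df = df[df["strength"].notna()].copy()
# Split every name into a list of whitespace-separated tokens.
df["name"] = df["name"].str.split()
for i in df["name"]:
    for j in df["strength"]:
        # Compare without whitespace so that "500 mcg" matches the token "500mcg".
        j = j.replace(" ", "")
        if j in i:
            i.remove(j)
    new_df.append(" ".join(i))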

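This may be a better approach. First we reduce the data and get rid of one of the for loops, which brings the complexity of the code down to O(n^2) instead of O(n^3).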
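If the loop over strengths is still too slow, one possible variation (not part of the answer above) is to collect the strength values in a set, so that each token is checked with a single lookup. The sample data and the name `drop_strength_tokens` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Vitamin B12 Tab 500mcg", "Vitamin B12 Tablets 250mcg", "Vitamine 110 Liq"],
    "strength": ["500 mcg", "250 mcg", None],
})

# Known strengths with whitespace removed and lowercased, e.g. "500 mcg" -> "500mcg".
strengths = {s.replace(" ", "").lower() for s in df["strength"].dropna()}

def drop_strength_tokens(name):
    # Keep every whitespace-separated token that is not a known strength.
    return " ".join(t for t in name.split() if t.lower() not in strengths)

df["name"] = df["name"].map(drop_strength_tokens)
print(df["name"].tolist())
# ['Vitamin B12 Tab', 'Vitamin B12 Tablets', 'Vitamine 110 Liq']
```

Like the loop above, this removes any known strength from any row, not only the row's own strength, and it will not catch a strength that is written as two tokens in the name (such as "25 Mcg").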

Answer 4

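Assumption: the strength pattern is always "number + optional whitespace + mcg". If needed, there are ways to generalize it further.

You can use regex and df.apply.

First, define the pattern you are looking for with re.compile(). Then use re.sub() on the name column, as shown in the code below.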

import re
import pandas as pd

# Create a DataFrame for testing
df = pd.DataFrame({
    "name": ["Vitamin B12 500 MCG tab", "Vitamin Deficiency Injectable System - B12",
             "Vitamin Deficiency Injectable System - B12 25 mcg"],
    "strength": ["500 mcg", None, "25 mcg"],
})

# Compile the pattern we are looking for: digits, optional whitespace, "mcg"
p = re.compile(r'\d+\s?mcg', re.IGNORECASE)

# Remove the pattern from every value in the name column and trim the ends
df["name"] = df["name"].apply(lambda x: p.sub('', x).strip())
print(df)

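You can find more information about df.apply in the pandas documentation, and about using regular expressions with Python in the Python documentation.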
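The "number + optional whitespace + mcg" assumption can be relaxed in several ways. As one sketch, the pattern below also accepts decimal numbers and a few other units; the unit list (mcg, mg, g, iu) and the sample names are invented for illustration and would need to be adapted to your data. It uses pandas' vectorized Series.str.replace rather than apply.

```python
import re
import pandas as pd

UNITS = ["mcg", "mg", "g", "iu"]  # assumed units, adjust to your data

# Optional leading whitespace, a number (optionally decimal), optional space, a unit.
pattern = r"\s*\b\d+(?:\.\d+)?\s?(?:" + "|".join(UNITS) + r")\b"

df = pd.DataFrame({
    "name": ["Vitamin B12 500 MCG tab", "Vitamin C 1.5g", "Vitamine 110 Liq"],
})

# Remove every number-plus-unit occurrence, ignoring case, then trim the ends.
df["name"] = (
    df["name"]
    .str.replace(pattern, "", flags=re.IGNORECASE, regex=True)
    .str.strip()
)
print(df["name"].tolist())
# ['Vitamin B12 tab', 'Vitamin C', 'Vitamine 110 Liq']
```

The word boundaries keep "B12" intact and leave "110 Liq" alone because no listed unit follows the number. A short unit such as "g" can still produce false positives on other data, so the list is worth keeping as narrow as possible.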

Removing redundant information from a column in pandas
