Finding an EXACT substring in a pandas text column raises ValueError: cannot reindex from a duplicate axis

Time: 2018-02-05 05:41:00

Tags: python pandas dataframe

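I need to match an EXACT substring in a pandas text column. However, when that DataFrame text column contains duplicate entries, I get: ValueError: cannot reindex from a duplicate axis.

I looked at the following post to work out how to query rows, but it is mainly about matching a whole entry rather than a substring: Select rows from a DataFrame based on values in a column in pandas

The following post shows how to find a substring using a regex pattern, which is exactly what I need in order to match on regex word boundaries, and it is what I use below: How to filter rows containing a string pattern from a Pandas dataframe

I was able to get the code from the second SO post above working, except when my Comments column contains duplicates. Note that entries 600 and 700 in the debug.txt file below are dupes, which is fine. These dupes are expected, so how do I accommodate them?

The data file 'debug.txt', and therefore the DataFrame, has 2 unique columns, so this is not the duplicate-column-name DataFrame problem described in this post: ValueError: cannot reindex from a duplicate axis using isin with pandas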

----- debug.txt -----

PKey, Comments
100,Bad damaged product need refund.
200,second item
300,a third item goes here
400,Outlier text
500,second item
600,item
700,item

My code is below. Any help you can provide in resolving the above ValueError would be appreciated.

import re
import pandas as pd

# Define params used below
fileHeader = True

dictB = {}

inputFile = open("debug.txt", 'r')

if fileHeader == True:
    inputFile.readline()

for line in inputFile:

    inputText = line.split(",")

    primaryKey = inputText[0]
    inputTexttoAnalyze = inputText[1]

    # Clean inputTexttoAnalyze and do other things...

    # NOTE: Very inefficient to add 1 row at a time to a Pandas DF. 
    # They suggest combining the data in some other variable (like my dictionary)
    # then copy it to the DF. 
    # Source: https://stackoverflow.com/questions/10715965/add-one-row-in-a-pandas-dataframe

    dictB[primaryKey] = inputTexttoAnalyze

inputFile.close()

# Below is a List of words that must produce an EXACT match to a *substring* within 
# the data frame Comments column. 
findList = ["damaged product", "item"]

print("\nResults should ONLY have", findList, "\n")


dfB = pd.DataFrame.from_dict(dictB, orient='index').reset_index()
dfB.rename(columns={'index': 'PKey', 0: 'Comments'}, inplace=True)

for entry in findList:
    rgx = '({})'.format("".join(r'(\b%s\b)' % entry))

    # The following line gives the error: ValueError: cannot reindex from a duplicate axis. 
    # I DO have expected duplicate values in my input file.
    resultDFb = dfB.set_index('Comments').filter(regex=rgx, axis=0)
    for key in resultDFb['PKey']:
        print(entry, key)

# This SO post says to run .index.duplicated() to see duplicated results, but I
# don't see any, which is odd since there ARE duplicate results.
# https://stackoverflow.com/questions/38250626/valueerror-cannot-reindex-from-a-duplicate-axis-pandas

print(dfB.index.duplicated())

1 Answer:

Answer 0 (score: 1)

One issue I see is that the Comments header has a leading space (", Comments"), which could cause problems in the DataFrame.
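To make the header issue concrete, here is a minimal sketch. It assumes the file looks like the debug.txt shown in the question; the in-memory `raw` buffer and the names `df_default` and `df_clean` are only illustrative stand-ins. `skipinitialspace=True` is one way to avoid editing the file by hand.

```python
import io
import pandas as pd

# Stand-in for debug.txt, keeping the space after the comma in the header.
raw = io.StringIO(
    "PKey, Comments\n"
    "100,Bad damaged product need refund.\n"
    "200,second item\n"
)

# Default parsing keeps the leading space in the second column name,
# so df_default.Comments would raise AttributeError.
df_default = pd.read_csv(raw)
print(list(df_default.columns))

# skipinitialspace=True drops whitespace that follows the delimiter,
# which also cleans up the header.
raw.seek(0)
df_clean = pd.read_csv(raw, skipinitialspace=True)
print(list(df_clean.columns))
```

If the file cannot be re-read, `df.columns = df.columns.str.strip()` on an existing DataFrame should have the same effect on the column names.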

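If I understand correctly, you are trying to identify all rows in the DataFrame whose Comments contain one of the values in findList.

The following may work for you (after removing the leading space from the Comments header).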

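import pandas as pd
import re

def check(s):
    # Return True if any phrase in findList occurs in s between word boundaries.
    for item in findList:
        # re.escape keeps any regex metacharacters in a phrase literal.
        if re.search(r'\b' + re.escape(item) + r'\b', s):
            return True
    return False


# "damaged prod" stops mid-word, so \b keeps it from matching "damaged product".
findList = ["damaged prod", "item"]

df = pd.read_csv("debug.txt")

# Boolean mask: keep only the rows whose Comments pass the check.
df[df.Comments.apply(check)]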

Out[9]: 
   PKey                          Comments
1   200                       second item
2   300            a third item goes here
4   500                       second item
5   600                              item
6   700                              item
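As a possible alternative to `apply`, the same filter can be sketched with the vectorized `Series.str.contains`. This is an illustration rather than part of the original answer: it uses the question's findList (so "damaged product" matches row 100), reads the data from an in-memory buffer with the header space already removed, and the names `find_list`, `pattern`, `mask` and `matches` are made up for the example. Because Comments stays a regular column, duplicate comments never end up in the index.

```python
import io
import re
import pandas as pd

# Stand-in for debug.txt with the leading header space removed.
raw = io.StringIO(
    "PKey,Comments\n"
    "100,Bad damaged product need refund.\n"
    "200,second item\n"
    "300,a third item goes here\n"
    "400,Outlier text\n"
    "500,second item\n"
    "600,item\n"
    "700,item\n"
)
df = pd.read_csv(raw)

find_list = ["damaged product", "item"]

# One alternation wrapped in word boundaries. re.escape keeps each phrase
# literal; the non-capturing group avoids pandas' match-group warning.
pattern = r"\b(?:" + "|".join(re.escape(p) for p in find_list) + r")\b"

# Boolean mask aligned on the default integer index.
mask = df["Comments"].str.contains(pattern, regex=True)
matches = df[mask]
print(matches)

# The default index has no duplicates; they only appear once Comments
# is moved into the index, as in the question's set_index call.
print(df.index.duplicated().any())
print(df.set_index("Comments").index.duplicated().any())
```

The last two lines also suggest why `dfB.index.duplicated()` in the question shows no duplicates: it inspects the integer index created by `reset_index()`, not the Comments values.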

Hope that helps.