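I need to match an EXACT substring in a pandas text column. However, when the text column of the DataFrame contains duplicate entries, I get: ValueError: cannot reindex from a duplicate axis.
I looked at the following post to work out how to query rows, but it is mostly about matching an entire entry rather than a substring: Select rows from a DataFrame based on values in a column in pandas
The following post shows how to find a substring with a regex pattern. That is what I need in order to use regex word boundaries, and it is what I use below: How to filter rows containing a string pattern from a Pandas dataframe
The code from the second SO post above works for me, except when there are duplicates in my Comments column. Note that entries 600 and 700 in the debug.txt file below are duplicates, which is fine. These duplicates are expected, so how do I accommodate them?
The data file 'debug.txt', and therefore the DataFrame, has 2 unique columns, so this is not the duplicate-column-name problem described in this post: ValueError: cannot reindex from a duplicate axis using isin with pandas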
----- debug.txt -----
PKey, Comments
100,Bad damaged product need refund.
200,second item
300,a third item goes here
400,Outlier text
500,second item
600,item
700,item
My code is below. Any help in resolving the ValueError above would be appreciated.
import re
import pandas as pd

# Define params used below
fileHeader = True
dictB = {}

inputFile = open("debug.txt", 'r')
if fileHeader == True:
    inputFile.readline()
for line in inputFile:
    inputText = line.split(",")
    primaryKey = inputText[0]
    inputTexttoAnalyze = inputText[1]
    # Clean inputTexttoAnalyze and do other things...
    # NOTE: Very inefficient to add 1 row at a time to a Pandas DF.
    # They suggest combining the data in some other variable (like my dictionary)
    # then copy it to the DF.
    # Source: https://stackoverflow.com/questions/10715965/add-one-row-in-a-pandas-dataframe
    dictB[primaryKey] = inputTexttoAnalyze
inputFile.close()

# Below is a List of words that must produce an EXACT match to a *substring* within
# the data frame Comments column.
findList = ["damaged product", "item"]
print("\nResults should ONLY have", findList, "\n")

dfB = pd.DataFrame.from_dict(dictB, orient='index').reset_index()
dfB.rename(columns={'index': 'PKey', 0: 'Comments'}, inplace=True)

for entry in findList:
    rgx = '({})'.format("".join(r'(\b%s\b)' % entry))
    # The following line gives the error: ValueError: cannot reindex from a duplicate axis.
    # I DO have expected duplicate values in my input file.
    resultDFb = dfB.set_index('Comments').filter(regex=rgx, axis=0)
    for key in resultDFb['PKey']:
        print(entry, key)

# This SO post says to run .index.duplicated() to see duplicated results, but I
# don't see any, which is odd since there ARE duplicate results.
# https://stackoverflow.com/questions/38250626/valueerror-cannot-reindex-from-a-duplicate-axis-pandas
print(dfB.index.duplicated())
Answer 0 (score: 1)
One problem I see is that the Comments header has a leading space (", Comments"), which may cause problems in the DataFrame.
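To make the header issue concrete, here is a minimal sketch, assuming the file is loaded with `pandas.read_csv`. The two-row sample and the names `raw`, `df_raw`, and `df_clean` are only illustrative. It shows that the space after the comma ends up in the column name, and that the `skipinitialspace=True` option is one way to avoid editing the file by hand.

```python
import io

import pandas as pd

# Same header as debug.txt, including the space after the comma.
# Only two data rows are kept to keep the sketch short.
raw = "PKey, Comments\n100,Bad damaged product need refund.\n200,second item\n"

# Read as-is: the second column is named " Comments" (with a leading space),
# so attribute access such as df_raw.Comments would not find it.
df_raw = pd.read_csv(io.StringIO(raw))
print(list(df_raw.columns))

# skipinitialspace=True tells the parser to skip spaces that follow a delimiter.
df_clean = pd.read_csv(io.StringIO(raw), skipinitialspace=True)
print(list(df_clean.columns))
```

Stripping the names afterwards with `df.columns = df.columns.str.strip()` should work equally well if changing the read call is not convenient.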
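If I understand correctly, you are trying to identify the rows in the DataFrame whose comment contains one of the strings in findList.
The following may work for you (after removing the leading space from the Comments header).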
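import pandas as pd
import re

def check(s):
    # True if any entry of findList occurs in s between word boundaries
    for item in findList:
        if re.search(r'\b' + item + r'\b', s):
            return True
    return False

# "damaged prod" is only part of a word, so it should NOT match "damaged product"
findList = ["damaged prod", "item"]

df = pd.read_csv("debug.txt")
# Apply check to every comment and keep the rows where it returns True
df[df.Comments.apply(check)]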
Out[9]:
PKey Comments
1 200 second item
2 300 a third item goes here
4 500 second item
5 600 item
6 700 item
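Row 100 is missing from the output above because "damaged prod" stops in the middle of the word "product", so the trailing `\b` cannot match. The sketch below isolates that behaviour with plain `re`. The helper name `contains_phrase` is hypothetical, and wrapping the phrase in `re.escape` is an addition to the answer's code, intended to keep search terms containing regex metacharacters literal.

```python
import re


def contains_phrase(phrase, text):
    """Return True if phrase occurs in text with a word boundary on each side."""
    # re.escape keeps characters such as "." or "(" in the phrase literal.
    pattern = r'\b' + re.escape(phrase) + r'\b'
    return re.search(pattern, text) is not None


text = "Bad damaged product need refund."
print(contains_phrase("damaged prod", text))     # False: "prod" is followed by "u"
print(contains_phrase("damaged product", text))  # True
print(contains_phrase("item", "second item"))    # True
```

Note that `\b` only marks a transition between a word character and a non-word character, so a phrase that itself starts or ends with punctuation may need a different boundary test.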
Hope that helps.
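If the goal is still to print the matching PKey values per search term, as in the question's loop, one possible variation is to build a boolean mask with `Series.str.contains` instead of moving Comments into the index. This is a sketch rather than the answer's method: the data is inlined with the header space already removed, and `matches_by_entry` is a hypothetical name.

```python
import io
import re

import pandas as pd

# Contents of debug.txt, inlined so the sketch is self-contained
# (the leading space in the header has been removed).
raw = """PKey,Comments
100,Bad damaged product need refund.
200,second item
300,a third item goes here
400,Outlier text
500,second item
600,item
700,item
"""
df = pd.read_csv(io.StringIO(raw))

findList = ["damaged product", "item"]

# Hypothetical container: search term -> list of matching PKey values.
matches_by_entry = {}
for entry in findList:
    pattern = r'\b' + re.escape(entry) + r'\b'
    # The mask is computed on the Comments column itself, so Comments never
    # becomes an index and its duplicate values cause no trouble.
    mask = df['Comments'].str.contains(pattern, regex=True)
    matches_by_entry[entry] = df.loc[mask, 'PKey'].tolist()

for entry, keys in matches_by_entry.items():
    print(entry, keys)
```

Because the DataFrame keeps its default unique integer index, the duplicate comments for 600 and 700 are both returned for "item".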