Question

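I have a DataFrame with several columns, and I want to remove the duplicate values within one specific column. I'm stuck because my code makes sense to me logically, but it doesn't seem to work.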

Here is the data I'm working with:

import pandas as pd

d = {'Word': ["hi", "hi", "hi", "hello", "where", "where", "for", "how", "how", "how", "how", "how"],
     'index': [0, 0, 0, 1, 2, 2, 3, 4, 4, 4, 4, 4]}
df = pd.DataFrame(d).set_index('index')

The result I want is that, for each group in the index column, the duplicate values are removed. This is my attempt:

i = 0
while i < (len(set(df.index)) - 1):
    if len(df[["Word"]].loc[i]) > 1: 
        for k in range(1, (len(df[["Word"]].loc[i].reset_index()))):
            if df[["Word"]].loc[i].reset_index().at[k, "Word"] == df[["Word"]].loc[i].reset_index().at[0, "Word"]:
                df[["Word"]].loc[i].reset_index().at[k, "Word"] = ""
    i += 1

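What I'm doing here is treating each group of rows that share an index value as one object, then looping over the range of each group to compare every value with the first one (position 0). If a later value is the same as the first, it should be turned into a blank. After locating each group with .loc, I also reset the index so that I can address each value by position and compare it with the first row. No matter what I do, this never changes the DataFrame. I'd like to know what is happening in my code and why df is not updated at all.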

Answer 1

This is how I managed to get it working:

import numpy as np
import pandas as pd

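# boolean mask marking the rows to blank out (one entry per row of df)
erase = np.zeros(len(df), dtype=bool)

# iterate over each unique word
for word in np.unique(df['Word']):
    found = (df['Word'] == word).to_numpy()

    # check if there is more than one occurrence
    if np.count_nonzero(found) > 1:

        # row positions of every occurrence of this word
        indexes = np.flatnonzero(found)
        firstIndex = indexes[0]
        lastIndex = indexes[-1]

        # mark the rows after the first occurrence, up to and including the last
        # (this assumes the repeats sit in consecutive rows, as in the sample data)
        erase[firstIndex + 1:lastIndex + 1] = True

# blank out the marked rows in the 'Word' column of the main dataframe
df.loc[erase, 'Word'] = ''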
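
As for why the loop in the question never updates df: as far as I can tell, df[["Word"]].loc[i].reset_index() builds a new DataFrame every time it is called, so the assignment with .at writes into a temporary object that is thrown away immediately. The while condition also stops one short, so the last index group (4) is never visited. Below is a minimal sketch that shows the copy behaviour and then one possible vectorized alternative based on DataFrame.duplicated. The names tmp, flat, is_repeat and result are only illustrative, and the sketch assumes the index is named 'index' as in the sample data.

```python
import pandas as pd

d = {'Word': ["hi", "hi", "hi", "hello", "where", "where",
              "for", "how", "how", "how", "how", "how"],
     'index': [0, 0, 0, 1, 2, 2, 3, 4, 4, 4, 4, 4]}
df = pd.DataFrame(d).set_index('index')

# 1) reset_index() returns a new DataFrame, so writing into it
#    does not touch df.
tmp = df[["Word"]].loc[0].reset_index()
tmp.at[1, "Word"] = ""
first_group = df["Word"].tolist()[:3]  # df itself still holds the original words

# 2) Possible alternative: move the index into a regular column, then flag
#    every row whose (index, Word) pair has already appeared earlier.
flat = df.reset_index()
is_repeat = flat.duplicated(subset=['index', 'Word']).to_numpy()

# Replace the flagged words with an empty string, keeping the original index.
result = df.copy()
result['Word'] = result['Word'].mask(is_repeat, '')

print(result)
```

Unlike the loop above, this compares words only within the same index group, so a word that shows up again under a different index value is kept. It also does not need the repeats to sit in consecutive rows.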

How to remove duplicate values within index groups in a pandas DataFrame
