Question

我有一个熊猫数据框，看起来像：

     | col1 | col2 | col3 | col4 | col5 | col6 | col7
row1 |  a   |  b   |  c   |  d   |  e   |  f   |  g
row2 |  a   |  a   |  c   |  d   |  e   |  f   |  g   
row3 |  a   |  b   |  c   |  d   |  a   |  a   |  g   
row4 |  a   |  q   |  q   |  q   |  q   |  q   |  q

我想计算除少于两个条目外，与另一行相同的行数，并将它们放在一列/系列中。

因此，在这种情况下，第2行和第3行与1类似。因此，第1行的条目为2。总体结果为：

     | col1 | col2 | col3 | col4 | col5 | col6 | col7  | almost_dups
row1 |  a   |  b   |  c   |  d   |  e   |  f   |  g    |  2
row2 |  a   |  a   |  c   |  d   |  e   |  f   |  g    |  1
row3 |  a   |  b   |  c   |  d   |  e   |  a   |  a    |  1 
row4 |  a   |  q   |  q   |  q   |  q   |  q   |  q    |  0

我最初的想法是定义行之间的距离度量。

Answer 1

该代码如何。这里是初学者的快速解决方案，但我认为效果很好。

import pandas as pd
# let's create the dataframe
df = pd.DataFrame(data = {'col1': ['a','a','a','a'], 
                          'col2': ['b','a','b','q'],
                          'col3': ['c','c','c','q'],
                          'col4': ['d','d','d','q'], 
                          'col5': ['e','e','a','q'],
                          'col6': ['f','f','a','q'],
                          'col7': ['g','g','g','q']} )

almost_dups = []            # initialize the list we want to compute    
for i in range(len(df)):    # for every dataframe row
    a = df.iloc[i].values   # get row values
    count = 0               # this will count the rows similar to the selected one 
    for j in range(len(df)): # for every other row
        if i!=j:            # if rows are different
            b = df.iloc[j].values
            if sum([i == j for i, j in zip(a, b)])>= 5: # if at least 5 values are same
                count +=1   # increase counter
    almost_dups.append(count) # append the count
df['almost_dups'] = almost_dups   # append the list to dataframe, as a new column

Answer 2

那可以工作（虽然不确定是否已经优化）

app.MapSignalR();

如何找到数据框中几乎重复的行数，即相差少于两个条目？

2 个答案: