I have a pandas DataFrame that looks like this:
| col1 | col2 | col3 | col4 | col5 | col6 | col7
row1 | a | b | c | d | e | f | g
row2 | a | a | c | d | e | f | g
row3 | a | b | c | d | a | a | g
row4 | a | q | q | q | q | q | q
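For each row, I want to count how many other rows are identical to it except for at most two entries, and put those counts in a column/Series.
In this case, rows 2 and 3 are similar to row 1, so the entry for row 1 is 2. The overall result would be: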
| col1 | col2 | col3 | col4 | col5 | col6 | col7 | almost_dups
row1 | a | b | c | d | e | f | g | 2
row2 | a | a | c | d | e | f | g | 1
row3 | a | b | c | d | a | a | g | 1
row4 | a | q | q | q | q | q | q | 0
My initial idea was to define a distance metric between rows.
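One way to make that idea concrete is the Hamming distance, i.e. the number of positions at which two rows differ. The following is only a minimal sketch, assuming the example data above; the helper name `hamming` is hypothetical and is not a pandas function.

```python
import pandas as pd

# Example data from the question (rows as lists of single-character strings)
df = pd.DataFrame(
    [list("abcdefg"), list("aacdefg"), list("abcdaag"), list("aqqqqqq")],
    index=["row1", "row2", "row3", "row4"],
    columns=[f"col{i}" for i in range(1, 8)],
)

def hamming(row_a, row_b):
    """Hypothetical helper: number of columns in which two rows differ."""
    return int((row_a.values != row_b.values).sum())

# Distance from row1 to every row, including itself
dist_to_row1 = {name: hamming(df.loc["row1"], df.loc[name]) for name in df.index}
print(dist_to_row1)  # {'row1': 0, 'row2': 1, 'row3': 2, 'row4': 6}
```

Under this metric, a row would count as an almost-duplicate of another when the distance between them is at most 2.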
Answer 0 (score: 2)
How about this code? It is a quick, beginner-level solution, but I think it works well.
import pandas as pd

# Build the example DataFrame
df = pd.DataFrame(data={'col1': ['a', 'a', 'a', 'a'],
                        'col2': ['b', 'a', 'b', 'q'],
                        'col3': ['c', 'c', 'c', 'q'],
                        'col4': ['d', 'd', 'd', 'q'],
                        'col5': ['e', 'e', 'a', 'q'],
                        'col6': ['f', 'f', 'a', 'q'],
                        'col7': ['g', 'g', 'g', 'q']})

almost_dups = []  # the counts we want to compute, one per row
for i in range(len(df)):  # for every DataFrame row
    a = df.iloc[i].values  # values of the selected row
    count = 0  # counts the rows similar to the selected one
    for j in range(len(df)):  # for every other row
        if i != j:  # skip comparing a row with itself
            b = df.iloc[j].values
            # rows are similar if at least 5 of the 7 values are the same
            if sum(x == y for x, y in zip(a, b)) >= 5:
                count += 1  # increase the counter
    almost_dups.append(count)  # store the count for row i

df['almost_dups'] = almost_dups  # add the list to the DataFrame as a new column
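For larger frames, the nested loop above could be replaced by a single broadcast comparison. This is only a sketch of that alternative, not part of the original answer; it assumes the same example data and the same threshold of at least 5 matching values, and it builds an n × n × columns boolean array, so memory use grows quadratically with the number of rows.

```python
import numpy as np
import pandas as pd

# Same example data as in the loop-based version
df = pd.DataFrame({'col1': ['a', 'a', 'a', 'a'],
                   'col2': ['b', 'a', 'b', 'q'],
                   'col3': ['c', 'c', 'c', 'q'],
                   'col4': ['d', 'd', 'd', 'q'],
                   'col5': ['e', 'e', 'a', 'q'],
                   'col6': ['f', 'f', 'a', 'q'],
                   'col7': ['g', 'g', 'g', 'q']})

values = df.to_numpy()

# matches[i, j] = number of columns in which row i and row j agree
matches = (values[:, None, :] == values[None, :, :]).sum(axis=2)

# A pair is "almost duplicate" if at least 5 values match
similar = matches >= 5
np.fill_diagonal(similar, False)  # a row is not its own almost-duplicate

almost_dups = similar.sum(axis=1)
print(almost_dups.tolist())  # [2, 1, 1, 0]
```

The result can be assigned with `df['almost_dups'] = almost_dups`, just like the list built by the loop.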
Answer 1 (score: 1)
This should work (though I'm not sure it is optimized)