I have the following DataFrame:
import pandas as pd
import numpy as np
raw_data = {'col1': ['a', 'b', 'c', 'd', 'e'],
            'col2': [1, 2, 3, 4, np.nan],
            'col3': ['aa', 'b', 'cc', 'd', 'ff'],
            'col4': [4, 6, 3, 4, np.nan]
            }
df = pd.DataFrame(raw_data, columns=['col1', 'col2', 'col3', 'col4'])
  col1  col2 col3  col4
0    a   1.0   aa   4.0
1    b   2.0    b   6.0
2    c   3.0   cc   3.0
3    d   4.0    d   4.0
4    e   NaN   ff   NaN
For each row, I want to find all columns that hold the same value. The result should look like this:
Row 1: col1 eq col3;
Row 2: col2 eq col4;
Row 3: col1 eq col3; col2 eq col4
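The DataFrame has both string and numeric columns, so it may be worth converting everything to str. NaN values should be ignored, because there are a lot of missing values =)
Thanks a lot
Answer 0 (score: 2)
Here is a for-loop solution you can use... maybe piRSquared can come up with a better pure-pandas solution. This should do the trick.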
row_eqs = {}
# For each row
for idx in df.index:
    # Make a set of all "column equivalencies" for each row
    row_eqs[idx] = set()
    for col in df.columns:
        # Look at all of the other columns that aren't `col`
        other_cols = [c for c in df.columns if c != col]
        # Column value
        col_row_value = df.loc[idx, col]
        for c in other_cols:
            # Other column row value
            c_row_value = df.loc[idx, c]
            # NaN never compares equal to anything, so missing values are skipped
            if c_row_value == col_row_value:
                # Just make your strings here since lists and sets aren't hashable
                eq = ' eq '.join(sorted((c, col)))
                row_eqs[idx].add(eq)
Printing the results:
for idx in row_eqs:
    if row_eqs[idx]:
        print('Row %d: %s' % (idx, '; '.join(row_eqs[idx])))
Row 1: col1 eq col3
Row 2: col2 eq col4
Row 3: col1 eq col3; col2 eq col4
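Edit: a slightly faster approach that builds the set of column pairs once, up front, instead of rescanning the columns for every cell:
import itertools

row_eqs = {}
# Every unordered pair of columns, computed a single time
column_combos = set(itertools.combinations(df.columns, 2))
for idx in df.index:
    row_eqs[idx] = set()
    for col1, col2 in column_combos:
        col1_value = df.loc[idx, col1]
        col2_value = df.loc[idx, col2]
        if col1_value == col2_value:
            eq = ' eq '.join(sorted((col1, col2)))
            row_eqs[idx].add(eq)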
I don't know how large your data is, but the latter solution is about 25% faster than the former.
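The same pairwise idea can also be applied one column pair at a time instead of one cell at a time. The following is only a minimal sketch under a few assumptions that are not part of the answer above: the frame is first cast to object dtype so that string and numeric columns can be compared without dtype surprises, and the matches are stored in lists rather than sets. Missing values are still ignored, because pandas treats a comparison involving NaN as False.

```python
import itertools

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd', 'e'],
    'col2': [1, 2, 3, 4, np.nan],
    'col3': ['aa', 'b', 'cc', 'd', 'ff'],
    'col4': [4, 6, 3, 4, np.nan],
})

# Assumption: cast to object dtype so mixed string/numeric columns compare element-wise
obj = df.astype(object)

row_eqs = {idx: [] for idx in df.index}
for c1, c2 in itertools.combinations(df.columns, 2):
    # Compare two whole columns at once; rows holding NaN come out as False
    mask = (obj[c1] == obj[c2]).to_numpy()
    for idx in df.index[mask]:
        row_eqs[idx].append('%s eq %s' % (c1, c2))

for idx, eqs in row_eqs.items():
    if eqs:
        print('Row %d: %s' % (idx, '; '.join(eqs)))
```

Whether this is actually faster than the loops above depends on the shape of the data; it does one comparison per column pair, so it should favour frames with many rows and few columns.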
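Answer 1 (score: 2)
Here is another answer I came up with. I wasn't sure what to output for a row in which no columns have equal values, so I simply skip such rows in the output. I also added a row in which many columns share the same value, to show what happens in that case.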
import pandas as pd
import numpy as np
raw_data = {'col1': ['a', 'b', 'c', 'd', 'e', 1],
            'col2': [1, 2, 3, 4, np.nan, 1],
            'col3': ['aa', 'b', 'cc', 'd', 'ff', 1],
            'col4': [4, 6, 3, 4, np.nan, 1],
            }
df = pd.DataFrame(raw_data, columns=['col1', 'col2', 'col3', 'col4'])
for row in df.itertuples():
    # Drop the leading Index field so an index label is never mistaken for a cell value
    cells = row[1:]
    values = list(set(cells))  # Get the unique values in the row
    equal_columns = []  # Keep track of column names that are the same
    for v in values:
        # Column names that have this value
        columns = [df.columns[i] for i, x in enumerate(cells) if x == v]
        if len(columns) > 1:
            # If more than 1 column with this value, append to the list
            equal_columns.append(' eq '.join(columns))
    if len(equal_columns) > 0:
        # We have at least 1 set of equal columns
        equal_columns.sort()  # So we always start printing in lexicographic order
        print('Row {0}: {1};'.format(row.Index, '; '.join(equal_columns)))
This gives me the output:
Row 1: col1 eq col3;
Row 2: col2 eq col4;
Row 3: col1 eq col3; col2 eq col4;
Row 5: col1 eq col2 eq col3 eq col4;
Answer 2 (score: 1)
Suppose we have the following DataFrame:
In [1]: from numpy import nan
   ...: from itertools import combinations
   ...: import pandas as pd
   ...:
   ...: df = pd.DataFrame(
   ...:     {'col1': {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e'},
   ...:      'col2': {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0, 4: nan},
   ...:      'col3': {0: 'aa', 1: 'b', 2: 'cc', 3: 'd', 4: 'ff'},
   ...:      'col4': {0: 4.0, 1: 6.0, 2: 3.0, 3: 4.0, 4: nan},
   ...:      'col5': {0: nan, 1: 'b', 2: 'c', 3: nan, 4: 'e'}})
   ...:
In [2]: df
Out[2]:
  col1  col2 col3  col4 col5
0    a   1.0   aa   4.0  NaN
1    b   2.0    b   6.0    b
2    c   3.0   cc   3.0    c
3    d   4.0    d   4.0  NaN
4    e   NaN   ff   NaN    e
Let's generate a query containing every pairwise combination of columns that share the same dtype:
In [3]: qry = \
   ...: (df.dtypes
   ...:    .reset_index(name='type')
   ...:    .groupby('type')['index']
   ...:    .apply(lambda x:
   ...:           '\n'.join(['{0[0]}_{0[1]} = ({0[0]} == {0[1]})'.format(tup, tup)
   ...:                      for tup in combinations(x, 2)]))
   ...:    .str.cat(sep='\n')
   ...: )
In [5]: print(qry)
col2_col4 = (col2 == col4)
col1_col3 = (col1 == col3)
col1_col5 = (col1 == col5)
col3_col5 = (col3 == col5)
Now we can do this:
In [6]: cols = df.columns.tolist()
In [7]: (df.eval(qry, inplace=False)
   ...:    .drop(columns=cols)
   ...:    .apply(lambda r: ';'.join(r.index[r].tolist()).replace('_',' == '), axis=1)
   ...: )
Out[7]:
0
1    col1 == col3;col1 == col5;col3 == col5
2    col2 == col4;col1 == col5
3    col2 == col4;col1 == col3
4    col1 == col5
dtype: object
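Explanation: evaluating the query adds one boolean column per column pair, and dropping the original columns leaves only those flags:
In [9]: df.eval(qry, inplace=False).drop(columns=cols)
Out[9]:
   col2_col4  col1_col3  col1_col5  col3_col5
0      False      False      False      False
1      False       True       True       True
2       True      False       True      False
3       True       True      False      False
4      False      False       True      False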
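For readers who would rather not build a query string, a table of per-pair flags like this one can presumably also be assembled directly from column comparisons. This is a rough sketch and not the answer's method: the names `by_dtype` and `flags` are made up for illustration, and the column order of the result may differ from the `eval` output.

```python
from itertools import combinations

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd', 'e'],
    'col2': [1.0, 2.0, 3.0, 4.0, np.nan],
    'col3': ['aa', 'b', 'cc', 'd', 'ff'],
    'col4': [4.0, 6.0, 3.0, 4.0, np.nan],
    'col5': [np.nan, 'b', 'c', np.nan, 'e'],
})

# Group column names by dtype (hypothetical helper dict)
by_dtype = {}
for col, dtype in df.dtypes.items():
    by_dtype.setdefault(str(dtype), []).append(col)

# One boolean column per same-dtype pair; comparisons involving NaN are False
flags = pd.DataFrame({
    '%s_%s' % (a, b): df[a] == df[b]
    for cols in by_dtype.values()
    for a, b in combinations(cols, 2)
})

# Join the names of the True flags for each row
summary = flags.apply(
    lambda r: ';'.join(r.index[r.to_numpy()]).replace('_', ' == '), axis=1)
print(summary)
```

As with the query-based version, only columns of the same dtype are compared, so a string column is never tested against a numeric one.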
Answer 3 (score: 0)
Another efficient way:
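import numpy as np

a = df.values
# equality[r, i, j] is True when columns i and j hold the same value in row r
equality = (a[:, np.newaxis, :] == a[:, :, np.newaxis])
# Keep only pairs with i < j so each match is reported once
match = row, col1, col2 = np.triu(equality, 1).nonzero()
match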
match is now:
(array([1, 2, 3, 3], dtype=int64),
array([0, 1, 0, 1], dtype=int64),
array([2, 3, 2, 3], dtype=int64))
Then pretty-print it:
dfc = df.columns
for i, r in enumerate(row):
    print(str(r), ' : ', str(dfc[col1[i]]), '=', str(dfc[col2[i]]))
which gives:
1 : col1 = col3
2 : col2 = col4
3 : col1 = col3
3 : col2 = col4
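To get from these flat match arrays to the one-line-per-row format asked for in the question, the matches can be grouped by row. The sketch below is an illustration only and makes two assumptions beyond the answer: missing values are masked out explicitly with `pd.isna` rather than relying on NaN comparison behaviour, and the names `grouped` and `lines` are hypothetical.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd', 'e'],
    'col2': [1, 2, 3, 4, np.nan],
    'col3': ['aa', 'b', 'cc', 'd', 'ff'],
    'col4': [4, 6, 3, 4, np.nan],
})

a = df.to_numpy(dtype=object)
# equality[r, i, j] is True when columns i and j agree in row r
equality = a[:, np.newaxis, :] == a[:, :, np.newaxis]
# Assumption: drop any pair whose value is missing, so NaN can never match
equality &= ~pd.isna(a)[:, :, np.newaxis]
rows, left, right = np.triu(equality, 1).nonzero()

# Collect the matching column pairs under their row label
grouped = defaultdict(list)
for r, i, j in zip(rows, left, right):
    grouped[df.index[r]].append('%s eq %s' % (df.columns[i], df.columns[j]))

lines = ['Row %s: %s' % (idx, '; '.join(eqs)) for idx, eqs in grouped.items()]
print('\n'.join(lines))
```

Note that the intermediate array has shape (rows, columns, columns), so memory use grows with the square of the column count.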