Question

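I have a dataframe and I want to compare string values within the same row. The rows also include empty strings. The code below does the job, but unfortunately it returns True when both strings are "None", i.e. empty strings.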

Col = list(ENTITY.columns.values)
for i in combinations(Col,2):
    df[i[0]+' to '+i[1]+' dedication'] =df.apply(lambda row: row[i[0]] == row[i[1]],axis=1)
    df[i[0]+' to '+i[1]+' dedication'] = np.where(df[i[0]+' to '+i[1]+' dedication'], 'Y', 'N')

For example, if row[i[0]] == "AAA1" and row[i[1]] == "AAA1" the output should be True, but if row[i[0]] == "AAA1" and row[i[1]] == None, or row[i[0]] == None and row[i[1]] == None, the output should be False.

How can I fix this so that the result is True only when both strings are non-empty and match? Is it possible to use isinstance and basestring inside the lambda function? Thanks.

Answer 1

You need pandas.notnull or pandas.isnull to test for None (or NaN):

df.apply(lambda row: (row[i[0]] == row[i[1]]) and
                      pd.notnull(row[i[0]]) and
                      pd.notnull(row[i[1]]), axis=1)
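
The row-wise check above can be tried on a small frame. This is only a minimal sketch: the column names 'A' and 'B' and the helper name both_present_and_equal are made up for illustration and do not come from the question.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({'A': ['AAA1', None, None, 'AAA1'],
                   'B': ['AAA1', 'AAA2', None, 'AAA2']})

def both_present_and_equal(row, left, right):
    # True only when neither value is missing and the two values match.
    return bool(pd.notnull(row[left]) and pd.notnull(row[right])
                and row[left] == row[right])

# axis=1 passes each row to the helper; args supplies the two column labels.
result = df.apply(both_present_and_equal, axis=1, args=('A', 'B'))
print(result.tolist())
```

Checking for nulls before comparing matters here because a plain Python comparison of None with None is True, which is exactly the case the question wants to exclude.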

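But it is better to compare the columns directly. That works perfectly because np.nan != np.nan, and pandas also treats None in a column as a missing value in element-wise comparisons: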

for i in combinations(Col,2):
    df[i[0]+' to '+i[1]+' dedication'] = np.where(df[i[0]] == df[i[1]], 'Y', 'N')

Sample:

df = pd.DataFrame({'Key':[1,2,3,4],
                   'SCANNER A':['AAA1', None, None, 'AAA1'],
                   'SCANNER B':['AAA1', 'AAA2', None, 'AAA2']})

df['new'] = np.where(df['SCANNER A'] == df['SCANNER B'], 'Y', 'N')
print (df)
   Key SCANNER A SCANNER B new
0    1      AAA1      AAA1   Y
1    2      None      AAA2   N
2    3      None      None   N
3    4      AAA1      AAA2   N
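
For more than two columns, the same column-wise comparison can be run over every pair. The sketch below assumes a third column, 'SCANNER C', which is not in the original sample, and shortens the new column names for readability.

```python
from itertools import combinations

import numpy as np
import pandas as pd

df = pd.DataFrame({'Key': [1, 2, 3, 4],
                   'SCANNER A': ['AAA1', None, None, 'AAA1'],
                   'SCANNER B': ['AAA1', 'AAA2', None, 'AAA2'],
                   # Hypothetical third column, added only for this sketch.
                   'SCANNER C': [None, 'AAA2', None, 'AAA1']})

# Compare only the scanner columns, not the key.
cols = [c for c in df.columns if c != 'Key']
for left, right in combinations(cols, 2):
    # Element-wise Series comparison treats missing values as unequal.
    df[left + ' to ' + right] = np.where(df[left] == df[right], 'Y', 'N')

# Show only the newly created comparison columns.
print(df.filter(like=' to '))
```

The list of columns is built before the loop, so the comparison columns created inside the loop are never paired with each other.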

Answer 2

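Normally NaN != NaN, so if the missing values are stored as nulls, a simple comparison is enough. If they are stored as 'None' (a string), the simple comparison is not enough, because 'None' == 'None' is True.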

df = pd.DataFrame(data={'col1':['a', None, None, 'a', 'a'], 'col2': ['a', 'a', None, None, 'b']})

   col1  col2
0     a     a
1  None     a
2  None  None
3     a  None
4     a     b

import itertools

df_result = df.copy()
# DataFrame.iteritems() was removed in pandas 2.0; items() is the equivalent.
for (col1_label, col1), (col2_label, col2) in itertools.combinations(df.items(), 2):
    df_result[col1_label + '_' + col2_label] = col1 == col2

      col1    col2    col1_col2
0     a       a       True
1     None    a       False
2     None    None    False
3     a       None    False
4     a       b       False

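Two small tips

You can use iteritems (items in current pandas versions) to simplify the loop, with no need for indexing.
I try to keep computed results in a DataFrame separate from the initial data and intermediate results. That makes it easier to track down what went wrong and to restart from a midway point. I only reuse the original DataFrame when memory becomes a problem.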

Answer 3

The basic logic here is simple: replace the empty strings with numpy.nan, because NaN never compares equal to itself.

>>> numpy.nan == numpy.nan
False

import numpy as np
ENTITY.replace(to_replace="None",value=np.nan,inplace=True)
# your code below
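
As a rough end-to-end sketch of this idea, the frame below is hypothetical and assumes missing values arrive either as the text "None" or as an empty string. Both are converted to NaN before the columns are compared.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where missing values arrive as the text "None" or "".
entity = pd.DataFrame({'col1': ['a', 'None', 'None', '', 'a'],
                       'col2': ['a', 'a', 'None', '', 'b']})

# Turn both placeholder strings into real missing values.
cleaned = entity.replace(['None', ''], np.nan)

# NaN never equals NaN, so only rows with two real, matching strings are True.
match = cleaned['col1'] == cleaned['col2']
print(match.tolist())
```

Returning a new frame instead of using inplace=True keeps the original data untouched, which makes it easier to check what the replacement actually changed.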

Python Pandas: comparing strings in dataframe rows, excluding empty strings
