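Comparing data in Python pandas

Time: 2019-05-17 09:46:33

Tags: python pandas csv

I am trying to compare two Excel files: one is a user matrix, the other is one I generated from the host. I want to know whether the users in the matrix are set up correctly.

This is the result I got from the host, imported into pandas (here the user groups are the column names!):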

    Name Users  Domain Admins     Administrators   Schema Admins 
0   xxx   NaN             Yes                Yes             NaN                                  

The problem is:

The Excel matrix looks like this:

user:         groups
xxx           administrators
              schema admins
              domain admins

This is what I tried:

I replaced every "Yes" with its column name:

for i in df.columns:
    df[i] = df[i].replace('Yes', i)

Then I dropped the columns that are entirely empty:

group = df.dropna(axis='columns', how='all')

Now it looks like this:

   Name Users  Domain Admins   Administrators  Schema Admins
0  xxx         Domain admins   Administrators  Schema Admins

And the other one looks like this:

User Account Name    Group
0    xxx             Domain Admins, Local admin,Administrators

I don't know what to do next. Please guide me on how to compare the values for every index in a loop.

The two original Excel files look like this:

user:         groups
xxx           administrators
              schema admins
              domain admins

yyy           administrators
              domain admins

zzz           administrators
              schema admins

The other file, for example:

username   administrators   schema admins  domain admins
xxx               yes            yes            NaN
yyy               yes            NaN            yes

3 answers:

Answer 0: (score: 0)

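This can be done as follows:

Step 1: Transform the host df

import pandas as pd

cols = ['administrators', 'schema admins', 'domain admins']
## df1 is the host df; join the names of the groups that are set for each user
df1['merged'] = df1[cols].apply(lambda x: ', '.join(sorted(x.dropna().index)), axis=1)

Step 2: Transform the matrix df

df.user = df.user.ffill()  ## fill the empty rows with the user name above them
grouped_df = df.groupby("user")['groups'].apply(lambda g: ', '.join(sorted(g))).reset_index()  ## merge each user's rows into one

Step 3: Compare the two dfs

## left_on/right_on will change according to the column names you have
result_df = pd.merge(df1, grouped_df, how='inner', left_on="username", right_on="user")
result_df['match'] = result_df['merged'] == result_df['groups']  ## True where both sources agree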
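
To make the three steps concrete, here is a minimal end-to-end sketch. The sample frames `host` and `matrix` are made up to mirror the two files in the question, and the column names `username`, `user` and `groups` are assumptions about how the files were loaded. It uses an outer merge with `indicator=True` instead of an inner merge so that users present in only one source also show up.

```python
import pandas as pd

# Hypothetical sample data mirroring the two files in the question.
host = pd.DataFrame({
    "username": ["xxx", "yyy"],
    "administrators": ["yes", "yes"],
    "schema admins": ["yes", None],
    "domain admins": [None, "yes"],
})
matrix = pd.DataFrame({
    "user": ["xxx", None, None, "yyy", None, "zzz", None],
    "groups": ["administrators", "schema admins", "domain admins",
               "administrators", "domain admins",
               "administrators", "schema admins"],
})

cols = ["administrators", "schema admins", "domain admins"]

# Host side: collect the names of the non-null columns for each user.
host["merged"] = host[cols].apply(
    lambda row: ", ".join(sorted(row.dropna().index)), axis=1
)

# Matrix side: carry the user name down, then build one sorted string per user.
matrix["user"] = matrix["user"].ffill()
grouped = (
    matrix.groupby("user")["groups"]
    .apply(lambda s: ", ".join(sorted(s)))
    .reset_index()
)

# Outer merge on the user name; "_merge" says which source each row came from.
result = pd.merge(host, grouped, how="outer",
                  left_on="username", right_on="user", indicator=True)
result["match"] = result["merged"] == result["groups"]
print(result[["user", "merged", "groups", "_merge", "match"]])
```

With this sample data, `yyy` matches, `xxx` does not (the matrix lists `domain admins`, the host does not), and `zzz` is flagged as `right_only` because it appears only in the matrix. Sorting the group names on both sides keeps the string comparison independent of the order in which groups are listed.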

Answer 1: (score: 0)

You can put the data into dictionaries to make things easier. If the data file looks like this:

user:         groups
xxx           administrators
              schema admins
              domain admins
user:         groups
yyy           administrators
              domain admins
user:         groups
zzz           administrators
              schema admins

The following code creates a dictionary from it:

with open('userdata.txt', 'r') as f:
    # read data file and split into lines; also trim lines; 
    datalist = list(map(lambda x: x.strip(), f.readlines())) 
    userdict = {}                               # dictionary to collect data; 
    username=""; grplist = []; newuser = True   # variable to read data from file: 
    for line in datalist: 
        if line.startswith('user:'):
            if not(username=="" and len(grplist)==0):   # omit at first run
                userdict[username] = grplist            # put user data into dictionary
                username=""; grplist=[]; newuser=True       # clear variable for new user; 
        elif newuser:
            username, grpname = line.split(None, 1)   # split once, so the first group name may contain spaces
            grplist.append(grpname)     # append group name to temporary list
            newuser = False
        else: 
            grplist.append(line)        # append more groups; 

userdict[username] = grplist
print(userdict)

Output:

{'yyy': ['administrators', 'domain admins'], 'zzz': ['administrators', 'schema admins'], 'xxx': ['administrators', 'schema admins', 'domain admins']}

If the data in the second file looks like this:

  Account Name                               Group
          xxx  administrators , schema admins, domain admins
          yyy  administrators , domain admins
          zzz  administrators , schema admins

The following code builds a dictionary from it:

with open('userdata2.txt', 'r') as f:
    # read data file and split into lines; also trim lines; 
    datalines = list(map(lambda x: x.strip(), f.readlines())) 
    userdict2={}
    for line in datalines[1:]:  # omit first line which is only header
        infolist = list(map(lambda x: x.strip(), line.split(" ",1)))
        username = infolist[0].strip()
        grplist = list(map(lambda x: x.strip(), infolist[1].split(",")))
        userdict2[username] = grplist

print(userdict2)

Output:

{'zzz': ['administrators', 'schema admins'], 'xxx': ['administrators', 'schema admins', 'domain admins'], 'yyy': ['administrators', 'domain admins']}

To compare the two dictionaries, just use ==:

print(userdict == userdict2)

Output:

True

To compare the groups of a specific user:

print(userdict['xxx'] == userdict2['xxx'])

Output:

True
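
Note that `==` on lists is order-sensitive, so two users with the same groups listed in a different order would compare as unequal. As a possible refinement (not part of the original answer), the sketch below compares the groups as sets and reports what is missing or extra per user. The helper name `diff_groups` and the sample dictionaries are hypothetical.

```python
# Hypothetical dictionaries in the same shape as userdict and userdict2 above.
userdict = {
    "xxx": ["administrators", "schema admins", "domain admins"],
    "yyy": ["administrators", "domain admins"],
}
userdict2 = {
    "xxx": ["domain admins", "administrators"],
    "yyy": ["domain admins", "administrators"],
}

def diff_groups(expected, actual):
    """Return {user: (missing, extra)} for users whose group sets differ."""
    report = {}
    for user in sorted(set(expected) | set(actual)):
        want = set(expected.get(user, []))
        have = set(actual.get(user, []))
        if want != have:
            # missing: expected but not present; extra: present but not expected
            report[user] = (sorted(want - have), sorted(have - want))
    return report

print(diff_groups(userdict, userdict2))
# prints {'xxx': (['schema admins'], [])}
```

Here `yyy` is not reported even though its groups appear in a different order, while `xxx` is reported as missing `schema admins` in the second dictionary.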

Answer 2: (score: 0)

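I would leave the dataframe imported from the host (let's call it df_host) as it is, and add one boolean column per group to the dataframe imported from the matrix (let's call it df_matrix):

groups = ['Users', 'Domain Admins', 'Administrators', 'Schema Admins']

for g in groups:
    df_matrix[g] = df_matrix.Group.str.contains(g)

Next, I would use the user name as the index in both dataframes:

df_matrix.set_index('Account Name', inplace=True)
df_host.set_index('Name', inplace=True)

You can now easily join the two dataframes on that index.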

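In the end you should have a single dataframe with one row per user and two sets of group columns, one from the host and one from the Excel matrix, which should make the comparison easy.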
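
The join itself could look something like the sketch below. The sample frames are made up to match the shapes shown in the question, and the `_host`/`_matrix` suffixes and the `mismatch` frame are illustrative choices, not something the answer prescribes.

```python
import pandas as pd

# Hypothetical sample data shaped like the two frames in the question.
df_host = pd.DataFrame({
    "Name": ["xxx", "yyy"],
    "Users": [None, None],
    "Domain Admins": ["Yes", None],
    "Administrators": ["Yes", "Yes"],
    "Schema Admins": [None, None],
})
df_matrix = pd.DataFrame({
    "Account Name": ["xxx", "yyy"],
    "Group": ["Domain Admins, Local admin,Administrators",
              "Administrators, Schema Admins"],
})

groups = ["Users", "Domain Admins", "Administrators", "Schema Admins"]

# One boolean column per group on the matrix side (literal substring match).
for g in groups:
    df_matrix[g] = df_matrix["Group"].str.contains(g, regex=False)

df_matrix = df_matrix.set_index("Account Name")
df_host = df_host.set_index("Name")

# Turn the host's "Yes"/empty cells into booleans so both sides are comparable.
host_flags = df_host[groups].eq("Yes")

# Join on the index; the suffixes tell the two sources apart.
joined = host_flags.join(df_matrix[groups], lsuffix="_host", rsuffix="_matrix")

# True wherever host and matrix disagree about a group.
mismatch = pd.DataFrame(
    {g: joined[g + "_host"] != joined[g + "_matrix"] for g in groups}
)
print(mismatch)
```

With this data the only disagreement is `Schema Admins` for `yyy`. Groups that are not in the `groups` list, such as `Local admin`, are ignored by this comparison, and the substring match is case-sensitive.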
