Python 3 - Using pandas to group rows where two columns contain the same values in forward or reverse order: v1,v2 or v2,v1

Time: 2017-12-09 17:08:46

Tags: excel pandas python-3.6

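I'm very new to Python and pandas, but I've written code that reads an Excel workbook and groups rows based on the values contained in two columns.

So rows where Col_1 = A and Col_2 = B, and rows where Col_1 = B and Col_2 = A, should both be assigned GroupID = 1.

[Image: sample spreadsheet data, with rows color-coded for ease of visibility]

I've managed to get this working, but I'd like to know whether there is a simpler, more efficient, cleaner, or less clunky way to do it.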

import pandas as pd
df = pd.read_excel('test.xlsx')

# get column values into a list
col_group = df.groupby(['Header_2','Header_3'])
original_list = list(col_group.groups)

# parse list to remove 'reverse-duplicates'
new_list = []
for a,b in original_list:
    if (b,a) not in new_list:
        new_list.append((a,b))

# iterate through each row in the DataFrame
# check to see if values in the new_list[] exist, in forward or reverse
for index, row in df.iterrows():
    for a,b in new_list:
        # if the values match in the forward direction
        # (== rather than `in`, which would also match substrings)
        if (row["Header_2"] == a) and (row["Header_3"] == b):
            # GroupID value given, where value is index in the new_list[]
            df.loc[index,"GroupID"] = new_list.index((a,b))+1
        # else check if the values match in the reverse direction
        elif (row["Header_2"] == b) and (row["Header_3"] == a):
            df.loc[index,"GroupID"] = new_list.index((a,b))+1

# Finally write the DataFrame to a new spreadsheet
# (passing the path directly saves the file; a bare ExcelWriter must be closed explicitly)
df.to_excel('output.xlsx', sheet_name='Sheet1')

I'm aware of the pandas.groupby([columnA, columnB]) option, but I couldn't figure out a way to create groups that contain both (v1,v2) and (v2,v1).

2 answers:

Answer 0: (score: 0)

A boolean mask should do the trick:

import pandas as pd
df = pd.read_excel('test.xlsx')
mask = ((df['Header_2'] == 'A') & (df['Header_3'] == 'B') |
        (df['Header_2'] == 'B') & (df['Header_3'] == 'A'))

# Label each row in the original DataFrame with
# 1 if it matches the specified criteria, and
# 0 if it does not.
# This column can now be used in groupby operations.
df.loc[:, 'match_flag'] = mask.astype(int)

# Get rows that match the criteria
df[mask]
# Get rows that do not match the criteria
df[~mask]

Edit: updated the answer to address the groupby requirement.

Answer 1: (score: 0)

I would do something like this.
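As a rough illustration of using the flag column in a groupby, here is a minimal sketch. The in-memory DataFrame and the `Value` column are hypothetical stand-ins for the contents of test.xlsx. Note that the mask hard-codes the single pair A/B, so this only separates that one pair from all other rows.

```python
import pandas as pd

# Hypothetical data standing in for test.xlsx; "Value" is an invented column
df = pd.DataFrame({
    "Header_2": ["A", "B", "A", "C"],
    "Header_3": ["B", "A", "C", "A"],
    "Value": [10, 20, 30, 40],
})

# Rows holding the pair A/B in either order
mask = (((df["Header_2"] == "A") & (df["Header_3"] == "B")) |
        ((df["Header_2"] == "B") & (df["Header_3"] == "A")))
df["match_flag"] = mask.astype(int)

# Aggregate per flag value: 1 = the (A, B)/(B, A) rows, 0 = everything else
totals = df.groupby("match_flag")["Value"].sum()
print(totals)
```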

import pandas as pd
df = pd.read_excel('test.xlsx')

#make the ordering consistent
df["group1"] = df[["Header_2","Header_3"]].max(axis=1)
df["group2"] = df[["Header_2","Header_3"]].min(axis=1)

#group them together
df = df.sort_values(by=["group1","group2"])
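To get from the normalized columns to the numeric GroupID the question asks for, one possible continuation is `groupby(...).ngroup()`. This is only a sketch: the sample data is invented, and the pair ordering is computed with plain Python `sorted` rather than row-wise `max`/`min`, on the assumption that this behaves more predictably for string columns.

```python
import pandas as pd

# Hypothetical data standing in for test.xlsx
df = pd.DataFrame({
    "Header_2": ["A", "B", "A", "C", "B"],
    "Header_3": ["B", "A", "C", "A", "C"],
})

# Make the ordering consistent: larger value in group1, smaller in group2
pairs = [sorted(p) for p in zip(df["Header_2"], df["Header_3"])]
df["group1"] = [p[1] for p in pairs]
df["group2"] = [p[0] for p in pairs]

# ngroup() numbers the groups from 0 in sorted key order; add 1 for 1-based IDs
df["GroupID"] = df.groupby(["group1", "group2"]).ngroup() + 1
print(df)
```

Rows holding (A, B) and (B, A) end up with the same group1/group2 values and therefore the same GroupID.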

If you need to handle more than two columns, I can write a more general way to do this.
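The answer does not show that general method, so the following is only one way it might look. The helper name `add_group_id` and the sample data are hypothetical. It assumes the values in the chosen columns can be compared with each other (for example all strings, with no missing cells), and it numbers groups in order of first appearance.

```python
import pandas as pd

def add_group_id(df, columns, name="GroupID"):
    """Return a copy of df with an order-insensitive group number over `columns`."""
    out = df.copy()
    # Sort the values within each row so every ordering of the same values gives one key
    keys = [tuple(sorted(row))
            for row in out[columns].itertuples(index=False, name=None)]
    # Assign 1, 2, 3, ... to keys in order of first appearance
    ids = {}
    out[name] = [ids.setdefault(key, len(ids) + 1) for key in keys]
    return out

# Hypothetical three-column example
data = pd.DataFrame({
    "c1": ["A", "C", "A", "B", "A"],
    "c2": ["B", "A", "A", "A", "B"],
    "c3": ["C", "B", "B", "A", "D"],
})
result = add_group_id(data, ["c1", "c2", "c3"])
print(result)
```

With two columns, `add_group_id(df, ["Header_2", "Header_3"])` would cover the case in the question.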