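I'm very new to Python and pandas, but I've written code that reads an Excel workbook and groups rows based on the values contained in two columns.
So when Col_1 = A and Col_2 = B, or Col_1 = B and Col_2 = A, both rows are assigned GroupID = 1.
[Image: sample spreadsheet data, with rows color coded for ease of visibility]
I've managed to get this working, but I'd like to know whether there is a simpler, more efficient, cleaner, or less clunky way to do it.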
import pandas as pd
df = pd.read_excel('test.xlsx')
# get column values into a list
col_group = df.groupby(['Header_2', 'Header_3'])
original_list = list(col_group.groups)
# parse list to remove 'reverse-duplicates'
new_list = []
for a, b in original_list:
    if (b, a) not in new_list:
        new_list.append((a, b))
# iterate through each row in the DataFrame
# check to see if values in the new_list[] exist, in forward or reverse
for index, row in df.iterrows():
    for a, b in new_list:
        # if the values exist in forward direction
        if (df.loc[index, "Header_2"] == a) and (df.loc[index, "Header_3"] == b):
            # GroupID value given, where value is index in the new_list[]
            df.loc[index, "GroupID"] = new_list.index((a, b)) + 1
        # else check if value exists in the reverse direction
        elif (df.loc[index, "Header_2"] == b) and (df.loc[index, "Header_3"] == a):
            df.loc[index, "GroupID"] = new_list.index((a, b)) + 1
# Finally write the DataFrame to a new spreadsheet
# (the context manager saves and closes the file on exit)
with pd.ExcelWriter('output.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1')
I'm aware of the pandas.groupby([columnA, columnB]) option, but I couldn't figure out a way to create groups that contain both (v1, v2) and (v2, v1).
Answer 0 (score: 0)
A boolean mask should solve this:
import pandas as pd
df = pd.read_excel('test.xlsx')
mask = (((df['Header_2'] == 'A') & (df['Header_3'] == 'B')) |
        ((df['Header_2'] == 'B') & (df['Header_3'] == 'A')))
# Label each row in the original DataFrame with
# 1 if it matches the specified criteria, and
# 0 if it does not.
# This column can now be used in groupby operations.
df.loc[:, 'match_flag'] = mask.astype(int)
# Get rows that match the criteria
df[mask]
# Get rows that do not match the criteria
df[~mask]
Edit: updated the answer to address the groupby requirement.
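As a rough sketch of how the match_flag column might feed into a groupby, here is the same mask applied to a small in-memory DataFrame. The sample rows are made up for illustration and stand in for test.xlsx; only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data standing in for test.xlsx
df = pd.DataFrame({
    "Header_2": ["A", "B", "C", "A"],
    "Header_3": ["B", "A", "D", "C"],
})

# True where the row is (A, B) or (B, A)
mask = (
    ((df["Header_2"] == "A") & (df["Header_3"] == "B"))
    | ((df["Header_2"] == "B") & (df["Header_3"] == "A"))
)
df["match_flag"] = mask.astype(int)

# Count how many rows fall under each flag value
counts = df.groupby("match_flag").size()
print(counts)
```

Note that a single mask only separates one pair from everything else, so labelling every pair this way would need one mask per pair.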
Answer 1 (score: 0)
I would do something like this.
import pandas as pd
df = pd.read_excel('test.xlsx')
#make the ordering consistent
df["group1"] = df[["Header_2","Header_3"]].max(axis=1)
df["group2"] = df[["Header_2","Header_3"]].min(axis=1)
#group them together
df = df.sort_values(by=["group1","group2"])
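Sorting puts matching rows next to each other, but it does not by itself produce the numeric GroupID the question asks for. One possible way to get it, sketched here on made-up sample data, is to build an order-insensitive key per row and number the keys with pd.factorize. This swaps the max/min columns for a sorted tuple, and it assumes the two values in each row can be compared with each other (for example, all strings).

```python
import pandas as pd

# Hypothetical sample data standing in for test.xlsx
df = pd.DataFrame({
    "Header_2": ["A", "B", "C", "D", "A"],
    "Header_3": ["B", "A", "D", "C", "C"],
})

# Order-insensitive key: (A, B) and (B, A) both become ('A', 'B')
keys = [tuple(sorted(pair)) for pair in zip(df["Header_2"], df["Header_3"])]

# factorize numbers distinct keys from 0 in order of first appearance
codes, _ = pd.factorize(pd.Series(keys, index=df.index))
df["GroupID"] = codes + 1

print(df)
```

With this data the GroupID column comes out as 1, 1, 2, 2, 3.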
If you need to handle more than two columns, I can write a more general way of doing this.
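For what it's worth, one way such a generalization could look is to sort the values within each row with NumPy and then number the groups with groupby(...).ngroup(). This is only a sketch: add_group_id is a hypothetical helper name, the Header_4 column and the sample rows are invented, and the values in each row are assumed to be mutually comparable.

```python
import numpy as np
import pandas as pd


def add_group_id(df, cols, id_col="GroupID"):
    """Hypothetical helper: rows holding the same values in `cols`,
    in any order, receive the same 1-based group number."""
    # Sort values within each row so column order no longer matters
    sorted_vals = np.sort(df[cols].to_numpy(dtype=object), axis=1)
    # One key Series per position in the sorted row
    key_cols = [
        pd.Series(sorted_vals[:, i], index=df.index, name=f"_key{i}")
        for i in range(len(cols))
    ]
    out = df.copy()
    # sort=False numbers groups in order of first appearance
    out[id_col] = out.groupby(key_cols, sort=False).ngroup() + 1
    return out


# Made-up three-column example
example = pd.DataFrame({
    "Header_2": ["A", "C", "A", "B"],
    "Header_3": ["B", "A", "B", "A"],
    "Header_4": ["C", "B", "D", "C"],
})
result = add_group_id(example, ["Header_2", "Header_3", "Header_4"])
print(result)
```

The helper returns a copy, so the original DataFrame is left unchanged.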