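I am trying to perform a "change data capture" using two spreadsheets. I have grouped my resulting data and run into a strange problem. Requirements:
Case 1) size of a group == 2, do certain tasks
Case 2) size of a group == 1, do certain tasks
Case 3) size of a group > 2, do certain tasks
Problem: no matter what I try, I cannot break the groupby result apart by group size and then loop over it.
I would like to do something like: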
if group_by_1.filter(lambda x: len(x) == 2):
    for grp, rows in sub(??)group:
        for j in range(len(rows) - 1):
            # check rows[j, 'column1'] != rows[j+1, 'column1']:
            do something
Here is my code snippet. Any help is much appreciated.
import pandas as pd
import numpy as np
# pd.set_option('display.height', 1000)  # deprecated in favor of 'display.max_rows'; recent pandas versions reject this option
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print("reading wolverine xlsx")
# defining metadata
df_header = ['DisplayName','StoreLanguage','Territory','WorkType','EntryType','TitleInternalAlias',
'TitleDisplayUnlimited','LocalizationType','LicenseType','LicenseRightsDescription',
'FormatProfile','Start','End','PriceType','PriceValue','SRP','Description',
'OtherTerms','OtherInstructions','ContentID','ProductID','EncodeID','AvailID',
'Metadata', 'AltID', 'SuppressionLiftDate','SpecialPreOrderFulfillDate','ReleaseYear','ReleaseHistoryOriginal','ReleaseHistoryPhysicalHV',
'ExceptionFlag','RatingSystem','RatingValue','RatingReason','RentalDuration','WatchDuration','CaptionIncluded','CaptionExemption','Any','ContractID',
'ServiceProvider','TotalRunTime','HoldbackLanguage','HoldbackExclusionLanguage']
df_w01 = pd.read_excel("wolverine_1.xlsx", names = df_header)
df_w02 = pd.read_excel("wolverine_2.xlsx", names = df_header)
df_w01['version'] = 'OLD'
df_w02['version'] = 'NEW'
#print(df_w01)
df_m_d = pd.concat([df_w01, df_w02], ignore_index = True).reset_index()
#print(df_m_d)
first_pass_get_duplicates = df_m_d[df_m_d.duplicated(['StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType',
'LicenseRightsDescription','FormatProfile','Start','End','PriceType','PriceValue','ContentID','ProductID',
'AltID','ReleaseHistoryPhysicalHV','RatingSystem','RatingValue','CaptionIncluded'], keep='first')] # This dataframe has records which are DUPES on NEW and OLD
#print(first_pass_get_duplicates)
first_pass_drop_duplicate = df_m_d.drop_duplicates(['StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType',
'LicenseRightsDescription','FormatProfile','Start','End','PriceType','PriceValue','ContentID','ProductID',
'AltID','ReleaseHistoryPhysicalHV','RatingSystem','RatingValue','CaptionIncluded'], keep=False) # This dataframe has records which are unique on the desired values, even on the first pass
#print(first_pass_drop_duplicate)
group_by_1 = first_pass_drop_duplicate.groupby(['StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType','FormatProfile'],as_index=False)
#Best Case group_by has 2 elements on big key and at least one row is 'new'
#print(group_by_1.grouper.group_info[0])
#for i,rows in group_by_1:
#if(.transform(lambda x : len(x)==2)):
#print(group_by_1.grouper.group_info[0])
#print(group_by_1.describe())
'''for i,rows in group_by_1:
    temp_rows = rows.reset_index()
    temp_rows.reindex(index=range(0,len(rows)))
    print("group has: ", len(temp_rows))
    for j in range(len(rows)-1):
        print(j)
        print("this iteration: ", temp_rows.loc[j,'Start'])
        print("next iteration: ", temp_rows.loc[j+1,'Start'])
        if(temp_rows.loc[j+1,'Start'] == temp_rows.loc[j,'Start']):
            print("Match")
        else:
            print("no_match")
            print(temp_rows.loc[j,'Start'])
    print("++++-----++++")'''
Answer 0 (score: 3)
Use groupby and transform on df with np.size.
Consider the dataframe df:
df = pd.DataFrame([
    [1, 2, 3],
    [1, 2, 3],
    [2, 3, 4],
    [2, 3, 4],
    [2, 3, 4],
    [3, 4, 5],
], columns=list('abc'))
and the function my_function:
def my_function(df):
    if df.name == 1:
        return 'blue'
    elif df.name == 2:
        return 'red'
    else:
        return 'green'
The thing to group by is grouper:
grouper = df.groupby('a').b.transform(np.size)
grouper
0 2
1 2
2 3
3 3
4 3
5 1
Name: b, dtype: int64
df.groupby(grouper).apply(my_function)
b
1 blue
2 red
3 green
dtype: object
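To make it clearer what grouping by grouper produces, the sketch below iterates over the size-keyed groups instead of calling apply. It uses transform('size'), which should give the same per-row counts as np.size on this data; the names sizes and parts are illustrative and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame([
    [1, 2, 3],
    [1, 2, 3],
    [2, 3, 4],
    [2, 3, 4],
    [2, 3, 4],
    [3, 4, 5],
], columns=list('abc'))

# Per-row size of the group (keyed on 'a') that each row belongs to.
sizes = df.groupby('a')['b'].transform('size')

# Grouping by that Series buckets the rows by group size, not by 'a'.
parts = {size: sub for size, sub in df.groupby(sizes)}

for size, sub in parts.items():
    print(size, sub.index.tolist())
```

Each value in parts is an ordinary DataFrame, so it can be grouped again on the original key if the per-group rows are needed.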
You should be able to piece this together to get what you want.
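As a rough illustration of how the pieces might fit the three cases in the question, here is a minimal sketch on made-up data. The column names key and group_size and the variables pairs, singles and larger are assumptions for the example; in the real data the key would be the list of grouping columns.

```python
import pandas as pd

# Toy stand-in for the question's data; 'key' plays the role of the big group key.
df = pd.DataFrame({
    'key':     ['x', 'x', 'y', 'z', 'z', 'z'],
    'version': ['OLD', 'NEW', 'NEW', 'OLD', 'NEW', 'NEW'],
})

# Attach the size of each row's group as a regular column.
df['group_size'] = df.groupby('key')['key'].transform('size')

pairs = df[df['group_size'] == 2]     # case 1: groups with exactly two rows
singles = df[df['group_size'] == 1]   # case 2: groups with a single row
larger = df[df['group_size'] > 2]     # case 3: groups with more than two rows

# Each subset can be regrouped on the original key and looped over.
for key, rows in pairs.groupby('key'):
    print(key, rows['version'].tolist())
```

Boolean masks are used here rather than GroupBy.filter; filter returns the matching rows as a DataFrame rather than a True/False value, which is likely why the `if group_by_1.filter(...)` attempt in the question did not behave as a condition.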
Answer 1 (score: 2)
Depending on what you need to do, working with a new index may make your life easier. I have tried to mimic some of your data:
In [1]: import pandas as pd
...: import numpy as np
...:
...: pd.set_option('display.max_rows', 10)
...: pd.set_option('display.max_columns', 50)
...:
...: df_header = ['DisplayName','StoreLanguage','Territory','WorkType','EntryType','TitleInternalAlias',
...:     'TitleDisplayUnlimited','LocalizationType','LicenseType','LicenseRightsDescription',
...:     'FormatProfile','Start','End','PriceType','PriceValue','SRP','Description',
...:     'OtherTerms','OtherInstructions','ContentID','ProductID','EncodeID','AvailID',
...:     'Metadata', 'AltID', 'SuppressionLiftDate','SpecialPreOrderFulfillDate','ReleaseYear','ReleaseHistoryOriginal','ReleaseHistoryPhysicalHV',
...:     'ExceptionFlag','RatingSystem','RatingValue','RatingReason','RentalDuration','WatchDuration','CaptionIncluded','CaptionExemption','Any','ContractID',
...:     'ServiceProvider','TotalRunTime','HoldbackLanguage','HoldbackExclusionLanguage']
...:
...: import itertools as it
...:
...: catcols = 'StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType','FormatProfile'
...:
...: headers = list(catcols) + [chr(c + 65) for c in range(10)]
...:
...: df = pd.DataFrame(data=np.random.rand(100, len(headers)), columns=headers)
...:
...: df.StoreLanguage = list(it.islice((it.cycle(["en", "fr"])), 100))
...:
...: df.Territory =list(it.islice(it.cycle(["us", "fr", "po", "nz", "au"]), 100) )
...:
...: df.TitleInternalAlias =list(it.islice(it.cycle(['a', 'b', 'c']), 100) )
...:
...: df.LocalizationType =list(it.islice(it.cycle(['d', 'g']), 100) )
...:
...: df.LicenseType =list(it.islice(it.cycle(["free", "com", "edu", "home"]), 100) )
...:
...: df.FormatProfile =list(it.islice(it.cycle(["g", "q"]), 100) )
...:
Here is the trick:
...: gb = df.groupby(catcols, as_index=False)
...:
...: reindexed = (df.assign(group_size = gb['A'].transform(lambda x: x.shape[0]))
...: .set_index("group_size")
...: )
...:
In [2]: reindexed.head()
Out[2]:
           StoreLanguage Territory TitleInternalAlias LocalizationType  \
group_size
2.0                   en        us                  a                d
2.0                   fr        fr                  b                g
2.0                   en        po                  c                d
2.0                   fr        nz                  a                g
2.0                   en        au                  b                d

           LicenseType FormatProfile         A         B         C         D  \
group_size
2.0               free             g  0.312705  0.346577  0.910688  0.317494
2.0                com             q  0.575515  0.627054  0.025820  0.943633
2.0                edu             g  0.489421  0.518020  0.988816  0.833306
2.0               home             q  0.146965  0.823234  0.155927  0.865554
2.0               free             g  0.327784  0.107795  0.678729  0.178454

                   E         F         G         H         I         J
group_size
2.0         0.032420  0.232436  0.279712  0.167969  0.847725  0.777870
2.0         0.833150  0.261634  0.832250  0.511341  0.865027  0.850981
2.0         0.924992  0.129079  0.419342  0.603113  0.705015  0.683255
2.0         0.560832  0.434411  0.260553  0.208577  0.259383  0.997590
2.0         0.431881  0.729873  0.606323  0.806250  0.000556  0.793380
In [3]: reindexed.loc[2, "FormatProfile"].head()
Out[3]:
group_size
2.0 g
2.0 q
2.0 g
2.0 q
2.0 g
Name: FormatProfile, dtype: object
You can drop duplicates here...
In [4]: reindexed.loc[2, "FormatProfile"].drop_duplicates()
Out[4]:
group_size
2.0 g
2.0 q
Name: FormatProfile, dtype: object
and recombine the slices however you see fit.
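Once the rows are sliced by group size, the row-against-next-row comparison sketched in the question may not need an explicit loop. The following is a minimal sketch, assuming a hypothetical key column and a Start column like the one in the question; prev_start and changed are illustrative names.

```python
import pandas as pd

# Made-up rows; 'key' stands in for the combined grouping columns.
df = pd.DataFrame({
    'key':   ['x', 'x', 'y', 'y', 'z'],
    'Start': ['2017-01-01', '2017-01-01', '2017-02-01', '2017-03-01', '2017-04-01'],
})

# Previous 'Start' within the same group (missing for the first row of each group).
prev_start = df.groupby('key')['Start'].shift()

# True where a row has a predecessor in its group and the value differs from it.
df['changed'] = prev_start.notna() & (df['Start'] != prev_start)

print(df)
```

Here only the second row of group y is flagged, since it is the only row whose Start differs from the previous row in the same group.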