Pandas: making decisions based on groupby size()

Time: 2016-09-09 21:48:42

Tags: python pandas dataframe

I'm trying to perform a "change data capture" between two spreadsheets. I've grouped the resulting data and have run into a strange problem. Requirements:

Case 1) group size == 2: do certain tasks

Case 2) group size == 1: do certain tasks

Case 3) group size > 2: do certain tasks

The problem: no matter what I try, I cannot split the groupby result by group size and then loop over it.

I want to do something like this:

if (group_by_1.filter(lambda x: len(x) == 2)):
    for grp, rows in sub(??)group:
        for j in range(len(rows) - 1):
            # check rows[j,'column1'] != rows[j+1,'column1']:
            do something
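
In other words, roughly this (a sketch of the intended logic only, using the group_by_1 object defined in the snippet below; 'column1' is just the placeholder name from the pseudocode above):

for grp, rows in group_by_1:
    rows = rows.reset_index(drop=True)
    if len(rows) == 2:                                    # case 1
        for j in range(len(rows) - 1):
            if rows.loc[j, 'column1'] != rows.loc[j + 1, 'column1']:
                pass                                      # do something
    elif len(rows) == 1:                                  # case 2
        pass                                              # do something else
    else:                                                 # case 3
        pass                                              # do something else again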

Here is my code snippet:

import pandas as pd
import numpy as np

pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print("reading wolverine xlxs")


# defining metadata

df_header = ['DisplayName','StoreLanguage','Territory','WorkType','EntryType','TitleInternalAlias',
         'TitleDisplayUnlimited','LocalizationType','LicenseType','LicenseRightsDescription',
         'FormatProfile','Start','End','PriceType','PriceValue','SRP','Description',
         'OtherTerms','OtherInstructions','ContentID','ProductID','EncodeID','AvailID',
         'Metadata', 'AltID', 'SuppressionLiftDate','SpecialPreOrderFulfillDate','ReleaseYear','ReleaseHistoryOriginal','ReleaseHistoryPhysicalHV',
          'ExceptionFlag','RatingSystem','RatingValue','RatingReason','RentalDuration','WatchDuration','CaptionIncluded','CaptionExemption','Any','ContractID',
          'ServiceProvider','TotalRunTime','HoldbackLanguage','HoldbackExclusionLanguage']
df_w01 = pd.read_excel("wolverine_1.xlsx", names = df_header)

df_w02 = pd.read_excel("wolverine_2.xlsx", names = df_header)





df_w01['version'] = 'OLD'
df_w02['version'] = 'NEW'

#print(df_w01)
df_m_d = pd.concat([df_w01, df_w02], ignore_index = True).reset_index()

#print(df_m_d)

first_pass_get_duplicates = df_m_d[df_m_d.duplicated(['StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType',
                                   'LicenseRightsDescription','FormatProfile','Start','End','PriceType','PriceValue','ContentID','ProductID',
                                   'AltID','ReleaseHistoryPhysicalHV','RatingSystem','RatingValue','CaptionIncluded'], keep='first')] # This dataframe has records which are DUPES on NEW and OLD
#print(first_pass_get_duplicates) 

first_pass_drop_duplicate = df_m_d.drop_duplicates(['StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType',
                                   'LicenseRightsDescription','FormatProfile','Start','End','PriceType','PriceValue','ContentID','ProductID',
                                   'AltID','ReleaseHistoryPhysicalHV','RatingSystem','RatingValue','CaptionIncluded'], keep=False) # This dataframe has records which are unique on the desired values, even on the first pass

#print(first_pass_drop_duplicate)


group_by_1 = first_pass_drop_duplicate.groupby(['StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType','FormatProfile'],as_index=False)
#Best Case group_by has 2 elements on big key and at least one row is 'new'
#print(group_by_1.grouper.group_info[0])
  #for i,rows in group_by_1:

  #if(.transform(lambda x : len(x)==2)):
      #print(group_by_1.grouper.group_info[0])

#print(group_by_1.describe())

'''for i,rows in group_by_1:
     temp_rows = rows.reset_index()
     temp_rows.reindex(index=range(0,len(rows)))
     print("group has: ", len(temp_rows))
     for j in range(len(rows)-1):
         print(j)
         print("this iteration: ", temp_rows.loc[j,'Start'])
         print("next iteration: ", temp_rows.loc[j+1,'Start'])
         if(temp_rows.loc[j+1,'Start'] == temp_rows.loc[j,'Start']):
             print("Match")
         else:
             print("no_match")
             print(temp_rows.loc[j,'Start'])
         print("++++-----++++")'''

Any help is greatly appreciated.

2 Answers:

Answer 0 (score: 3)

Use a groupby transform with np.size.

Consider the dataframe df

df = pd.DataFrame([
    [1, 2, 3],
    [1, 2, 3],
    [2, 3, 4],
    [2, 3, 4],
    [2, 3, 4],
    [3, 4, 5],
], columns=list('abc'))

and the function my_function

def my_function(df):
    if df.name == 1:
        return 'blue'
    elif df.name == 2:
        return 'red'
    else:
        return 'green'

The thing to group by is grouper

grouper = df.groupby('a').b.transform(np.size)

grouper
0    2
1    2
2    3
3    3
4    3
5    1
Name: b, dtype: int64

df.groupby(grouper).apply(my_function)

b
1     blue
2      red
3    green
dtype: object

You should be able to piece this together to get what you want.
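
For example, a minimal sketch of piecing it together with the question's first_pass_drop_duplicate frame (the key list comes from the question; the print calls merely stand in for the real per-case work):

import numpy as np

key = ['StoreLanguage', 'Territory', 'TitleInternalAlias',
       'LocalizationType', 'LicenseType', 'FormatProfile']

# size of the key-group each row belongs to, aligned row by row
sizes = first_pass_drop_duplicate.groupby(key)['version'].transform(np.size)

for size, chunk in first_pass_drop_duplicate.groupby(sizes):
    if size == 1:
        print("keys seen once:", len(chunk))             # case 2
    elif size == 2:
        print("old/new pairs:", len(chunk))              # case 1
    else:
        print("keys seen more than twice:", len(chunk))  # case 3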

Answer 1 (score: 2)

Depending on what you need to do, using a new index may make your life easier. I've tried to mimic some of your data:

In [1]:
   ...: pd.set_option('display.max_rows', 10)
   ...: pd.set_option('display.max_columns', 50)
   ...:
   ...:
   ...: df_header = ['DisplayName','StoreLanguage','Territory','WorkType','EntryType','TitleInternalAlias',
   ...:          'TitleDisplayUnlimited','LocalizationType','LicenseType','LicenseRightsDescription',
   ...:          'FormatProfile','Start','End','PriceType','PriceValue','SRP','Description',
   ...:          'OtherTerms','OtherInstructions','ContentID','ProductID','EncodeID','AvailID',
   ...:          'Metadata', 'AltID', 'SuppressionLiftDate','SpecialPreOrderFulfillDate','ReleaseYear','ReleaseHistoryOriginal','ReleaseHistoryPhysicalHV',
   ...:           'ExceptionFlag','RatingSystem','RatingValue','RatingReason','RentalDuration','WatchDuration','CaptionIncluded','CaptionExemption','Any','ContractID',
   ...:           'ServiceProvider','TotalRunTime','HoldbackLanguage','HoldbackExclusionLanguage']
   ...:
   ...:
   ...: import itertools as it
   ...:
   ...: catcols = 'StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType','FormatProfile'
   ...:
   ...: headers = list(catcols) + [chr(c + 65) for c in range(10)]
   ...:
   ...: df = pd.DataFrame(data=np.random.rand(100, len(headers)), columns=headers)
   ...:
   ...: df.StoreLanguage = list(it.islice((it.cycle(["en", "fr"])), 100))
   ...:
   ...: df.Territory  =list(it.islice(it.cycle(["us", "fr", "po", "nz", "au"]), 100) )
   ...:
   ...: df.TitleInternalAlias  =list(it.islice(it.cycle(['a', 'b', 'c']), 100) )
   ...:
   ...: df.LocalizationType  =list(it.islice(it.cycle(['d', 'g']), 100) )
   ...:
   ...: df.LicenseType  =list(it.islice(it.cycle(["free", "com", "edu", "home"]), 100) )
   ...:
   ...: df.FormatProfile  =list(it.islice(it.cycle(["g", "q"]), 100) )
   ...:

Here's the trick:

   ...: gb = df.groupby(catcols, as_index=False)
   ...:
   ...: reindexed = (df.assign(group_size = gb['A'].transform(lambda x: x.shape[0]))
   ...:              .set_index("group_size")
   ...:              )
   ...:


In [2]: reindexed.head()
Out[2]:
           StoreLanguage Territory TitleInternalAlias LocalizationType  \
group_size
2.0                   en        us                  a                d
2.0                   fr        fr                  b                g
2.0                   en        po                  c                d
2.0                   fr        nz                  a                g
2.0                   en        au                  b                d

           LicenseType FormatProfile         A         B         C         D  \
group_size
2.0               free             g  0.312705  0.346577  0.910688  0.317494
2.0                com             q  0.575515  0.627054  0.025820  0.943633
2.0                edu             g  0.489421  0.518020  0.988816  0.833306
2.0               home             q  0.146965  0.823234  0.155927  0.865554
2.0               free             g  0.327784  0.107795  0.678729  0.178454

                   E         F         G         H         I         J
group_size
2.0         0.032420  0.232436  0.279712  0.167969  0.847725  0.777870
2.0         0.833150  0.261634  0.832250  0.511341  0.865027  0.850981
2.0         0.924992  0.129079  0.419342  0.603113  0.705015  0.683255
2.0         0.560832  0.434411  0.260553  0.208577  0.259383  0.997590
2.0         0.431881  0.729873  0.606323  0.806250  0.000556  0.793380

In [3]: reindexed.loc[2, "FormatProfile"].head()
Out[3]:
group_size
2.0    g
2.0    q
2.0    g
2.0    q
2.0    g
Name: FormatProfile, dtype: object

From here you can drop duplicates...

In [4]: reindexed.loc[2, "FormatProfile"].drop_duplicates()
Out[4]:
group_size
2.0    g
2.0    q
Name: FormatProfile, dtype: object

...and recombine the slices however you see fit.
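
As a rough follow-on sketch (reusing the reindexed frame built above; the variable names here are only illustrative), the three cases from the question could then be pulled out of that index directly:

# group_size is now the index, so boolean masks on it split the frame by case
pairs   = reindexed[reindexed.index == 2]   # case 1: key occurs exactly twice
singles = reindexed[reindexed.index == 1]   # case 2: key occurs once
bigger  = reindexed[reindexed.index > 2]    # case 3: key occurs more than twice

for label, chunk in (("pairs", pairs), ("singles", singles), ("bigger", bigger)):
    print(label, len(chunk))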