Taking two or more dataframes in Python and extracting data on a unique key

Asked: 2016-06-04 12:33:58

Tags: python python-2.7 pandas

First, I have two dataframes. In the first one, each row holds a person's name followed by the pages they liked, so the number of columns differs from person to person. The first column is the user name, and the pages they liked fill out the rest of the row; the number of columns for 'random guy' may therefore differ from that of 'mank nion'. 'BlackBuck', '500 Startups', etc. are the names of pages. Say the name of this dataframe is User_page.

random guy      BlackBuck            GiveMeSport    Analytics Ninja 
mank nion       DJ CHETAS            Celina Jaitly  Gurkeerat Singh
pop rajuel      WOW Editions         500 Startups   Biswapati Sarkar
Roshan ghai     MensXP               No Abuse       the smartian 

Now I have another dataframe laid out the same way as the previous one, but with the category of each page in place of the page name. Pages on fb belong to different categories; for example, the category of 'BlackBuck' is 'Transport/Freight'. Some pages share the same name but have different categories, which is why I cannot use the page name directly as a key. Say the name of this dataframe is User_category.

random guy      Transport/Freight    Sport      Insurance Company 
mank nion       Arts/Entertainment   Actress    Actor/Director
pop rajuel      Concert Tour         App Page   Actor/Director
Roshan ghai     News/Media Website   Community  Public Figure  

Now I have two more dataframes. The first has the name of the fb page as its first column, followed by 162 tag columns; element (i, j) is 1 if the i-th page carries the j-th tag, and empty otherwise. It looks like this. The name of this dataframe is Page_tag.

    name of page              tag 1        tag2        tag3
    BlackBuck                     1          1             
    GiveMeSport                   1                      1
    Analytics Ninja               1                      1
    DJ CHETAS                                1           1

The other has the category name as its first column, followed by the same 162 tag columns, like this. Say the name of this dataframe is Category_tag.

   category_name              tag 1        tag2        tag3
    Sport                                     1           1
    App Page                      1                       1
    Actor/Director                1                                        
    Public Figure                         1               1

Now I have to get the tag counts for each user from the pages they liked. For each liked page, I first check whether it exists in the Page_tag dataframe (the third one above); if it does, I take the tag counts from there, i.e. the number of times each tag occurs for that user. That is the first step. If the page name is not found (Page_tag covers only a limited number of pages), I fall back to the page's category (from the second dataframe) and take the tag counts for that category from Category_tag (the fourth dataframe). Summing the two gives my desired output:

username             tag1                   tag2           tag3 
random guy              1                      2             2 
mank nion               2                      1             3
pop rajuel              4                      0             2 
Roshan ghai             0                      2             1

Element (i, j) of this dataframe is the number of times the j-th tag appears for the i-th user. I have written code for this and more, but I am stuck at this particular step. My code is not optimal, as I use loops many times; I would like to do this optimally, and I hope it can be done in pandas.
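As background for the code below: the wide user/page layout (one row per user, pages spread across columns) can be reshaped into one (user, page) row per like with `pd.melt`. A minimal sketch with toy data (column labels 0, 1, 2 stand in for the header-less columns that `read_csv` produces):

```python
import pandas as pd

# Toy wide frame: first column is the user, the rest are liked pages
# (integer column labels, as with header=None in read_csv).
wide = pd.DataFrame([
    ["random guy", "BlackBuck", "GiveMeSport"],
    ["mank nion", "DJ CHETAS", "Celina Jaitly"],
])

# melt turns one row per user into one row per (user, page) pair.
long_df = pd.melt(wide, id_vars=[0], value_vars=[1, 2])[[0, "value"]]
long_df.columns = ["user", "page_name"]
```

After the melt, each user/page pair is its own row, which is what makes the later merges on `page_name` possible.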

Here is the code I have written so far. I think it can be done in a more efficient way. PS: the actual number of users and columns is very large.

from io import StringIO
import pandas as pd
import numpy as np
# DATA FRAME IMPORT AND MELT
data1 = u'''  
random guy,BlackBuck,GiveMeSport,Analytics Ninja 
mank nion,DJ CHETAS,Celina Jaitly,Gurkeerat Singh
pop rajuel,WOW Editions,500 Startups,Biswapati Sarkar
Roshan ghai,MensXP,No Abuse,the smartian 
'''

## reading and melting the datasheet of user name and page_liked
df1 = pd.read_csv(StringIO(data1), sep=",", header=None)

df1 = pd.melt(df1, id_vars=[0], value_vars=[1,2,3])[[0,'value']]
df1.columns = ['user', 'page_name']

data2 = u'''
random guy,Transport/Freight,Sport,Insurance Company
mank nion,Arts/Entertainment,Actress,Actor/Director
pop rajuel,Concert Tour,App Page,Actor/Director
Roshan ghai,News/Media Website,Community,Public Figure
'''


#reading and melting the data sheet of user name and category of page liked
df2 = pd.read_csv(StringIO(data2), sep=",", header=None)
df2 = pd.melt(df2, id_vars=[0], value_vars=[1,2,3])[[0,'value']]
df2.columns = ['user', 'categories']




data3 = u'''
page_name,tag1,tag2,tag3
BlackBuck,1,1,0
GiveMeSport,1,0,1
Gurkeerat Singh,1,0,1
DJ CHETAS,0,1,1
'''

##reading the meta data of page_name and tag
df3 = pd.read_csv(StringIO(data3), sep=",")

data4 = u'''
category,tag1,tag2,tag3
Sport,0,1,1
App Page,1,0,1
Actor/Director,1,0,0
Public Figure,0,1,1
'''
df4 = pd.read_csv(StringIO(data4), sep=",")

##reading the data of category and tag

##adding a column category in df1 based on index
category = df2['categories']
df1['category'] = category


##creating a list of the pages that are present in the metadata (df3)
meta_list = list(df3.iloc[:,0])

## creating two empty dataframes with column name as same as df1
new_df1=pd.DataFrame(columns=['user','page_name','category'])
new_df2=pd.DataFrame(columns=['user','page_name','category'])

#checking if the page is in meta_list: if so, add that row to new_df1, else to new_df2
for i in range(len(df1)):
    if df1.iloc[i,1] in meta_list:
        x = df1.iloc[i]
        new_df1 = new_df1.append(x, ignore_index=True)
    else:
        y = df1.iloc[i]
        new_df2 = new_df2.append(y, ignore_index=True)


## merging new_df1 and new_df2 on page_name and category respectively

mdf1 = pd.merge(new_df1, df3, how= 'left', on = ['page_name'])

mdf2 = pd.merge(new_df2, df4, how= 'left', on=['category'])
## concatenating the two dataframes mdf1 and mdf2 and summing the tags for each of them
finaldf = pd.concat([mdf1[['user', 'tag1', 'tag2', 'tag3']].groupby(['user']).agg(sum),
                 mdf2[['user', 'tag1', 'tag2', 'tag3']].groupby(['user']).agg(sum)]).reset_index()

## finally grouping on user and summing the tags for each user
finaldf1 = finaldf.groupby(['user']).agg(sum).reset_index()

1 Answer:

Answer 0 (score: 0)

Thank you for the code; it is much clearer now.

I tried to optimize your loop: I think you can use `isin` with `any` to build a mask for boolean indexing. I also simplified the `concat`:

##adding a column category in df1 based on index
df1['category'] =  df2['categories']

##creating a list of the pages that are present in the metadata (df3)
meta_list = list(df3.iloc[:,0])

mask = df1.isin(meta_list).any(axis=1)
new_df1 = df1[mask]
new_df2 = df1[~mask]

## merging new_df1 and new_df2 on page_name and category respectively
mdf1 = pd.merge(new_df1, df3, how= 'left', on ='page_name')
mdf2 = pd.merge(new_df2, df4, how= 'left', on='category')
## concatenating the two dataframes mdf1 and mdf2
finaldf = pd.concat([mdf1,mdf2])
## finally grouping on user and summing the tags for each user
finaldf1 = finaldf.groupby('user', as_index=False).sum()
print (finaldf1)
          user  tag1  tag2  tag3
0  Roshan ghai   0.0   1.0   1.0
1    mank nion   1.0   1.0   2.0
2   pop rajuel   2.0   0.0   1.0
3   random guy   2.0   1.0   1.0
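For what it's worth, the page-then-category fallback can also be expressed without splitting `df1` at all: left-merge the long user frame against both lookup tables, then take the page-level tags and fall back to the category-level tags only where the page was missing. A sketch with toy data (frame and column names follow the question; this is an alternative formulation, not the answer's code):

```python
import pandas as pd

# Long user frame: one row per (user, liked page, page category).
users = pd.DataFrame({
    "user": ["random guy", "random guy", "mank nion"],
    "page_name": ["BlackBuck", "GiveMeSport", "DJ CHETAS"],
    "category": ["Transport/Freight", "Sport", "Arts/Entertainment"],
})
# Page-level tag lookup ("DJ CHETAS" is deliberately absent).
page_tags = pd.DataFrame({
    "page_name": ["BlackBuck", "GiveMeSport"],
    "tag1": [1, 1], "tag2": [1, 0], "tag3": [0, 1],
})
# Category-level fallback lookup.
cat_tags = pd.DataFrame({
    "category": ["Arts/Entertainment"],
    "tag1": [0], "tag2": [1], "tag3": [1],
})

tag_cols = ["tag1", "tag2", "tag3"]
# Both merges keep df1's row order and length, so the indices line up.
by_page = users.merge(page_tags, on="page_name", how="left")
by_cat = users.merge(cat_tags, on="category", how="left")
# Prefer page-level tags; fill the NaN rows from the category lookup.
tags = by_page[tag_cols].combine_first(by_cat[tag_cols])
result = (pd.concat([users["user"], tags], axis=1)
            .groupby("user", as_index=False).sum())
```

A page that is missing from both lookup tables simply contributes nothing to the sum, which matches the intent of the original two-way split.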