Python pandas: merging a DataFrame and a list

Time: 2019-10-21 09:33:17

Tags: python pandas

I currently have a pandas DataFrame, df:

         paper     reference
2171686  p84       r51
3816503  p41       r95
4994553  p112      r3
2948201  p112      r61
2957375  p32       r41
2938471  p65       r41
...

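Here, each row of df represents a citation relationship between paper and reference (the paper cites the reference).

I need the following numbers for my analysis:

  1. The frequency of each element of paper in df, i.e. how many references each paper cites

  2. For any two elements of paper, the number of references they both cite

For number 1, I did the following: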

df_count = df.groupby(['paper'])['paper'].count()

For number 2, I wrote the following, which returns the pairs of elements in paper that cite the same element in reference:

from collections import defaultdict

pair = []
d = defaultdict(list)
for idx, row in df.iterrows():
    # Key by reference so that each list holds the papers citing that reference
    d[row['reference']].append(row['paper'])
for ref, lst in d.items():
    for i in range(len(lst)):
        for j in range(i+1, len(lst)):
            pair.append([lst[i], lst[j], ref])
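The nested index loops above can also be written with itertools.combinations. The following is only a minimal sketch: the five-row df is hypothetical sample data (not taken from the question), and the name citing is introduced here for illustration.

```python
from collections import defaultdict
from itertools import combinations

import pandas as pd

# Hypothetical sample data, not from the question.
df = pd.DataFrame({
    "paper": ["p1", "p2", "p3", "p1", "p2"],
    "reference": ["r1", "r1", "r1", "r2", "r2"],
})

# Map each reference to the list of papers that cite it.
citing = defaultdict(list)
for paper, ref in zip(df["paper"], df["reference"]):
    citing[ref].append(paper)

# For every reference, emit each unordered pair of citing papers once.
pair = [
    [a, b, ref]
    for ref, papers in citing.items()
    for a, b in combinations(papers, 2)
]
print(pair)
```

Iterating with zip over the two columns avoids the per-row overhead of iterrows, which may matter on a DataFrame with millions of rows.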

pair is a list of three-element lists: the first two elements are a pair of papers, and the third is the reference that both papers cite. Here is what pair looks like:

[['p88','p7','r11'],
['p94','p33','r11'],
['p75','p33','r43'],
['p5','p12','r79'],
...]

I would like to end up with a DataFrame in the following format:

paper1      freq1       paper2       freq2        common
p17         4           p45          3            2
p5          2           p8           5            2
...

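where paper1 and paper2 are the first two elements of each list in pair, freq1 and freq2 are the corresponding frequencies from df_count, and common is the number of references that paper1 and paper2 share.

How can I build the desired dataset (in the desired format) from df, df_count, and pair?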

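1 Answer:

Answer 0: (score: 1)

I think this can be solved using only pandas.DataFrame.merge. I am not sure it is the most efficient approach, though.

First, generate the common reference counts: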

# Merge the dataframe with itself to generate pairs
# Note that we merge only on reference, i.e. we generate each and every pair
df_pairs = df.merge(df, on=["reference"])

# Dataframe contains duplicate pairs of form (p1, p2) and (p2, p1), remove duplicates
df_pairs = df_pairs[df_pairs["paper_x"] < df_pairs["paper_y"]]

# Now group by pairs, and count the rows
# This will give you the number of common references per each paper pair
# reset_index is necessary to get each row separately
df_pairs = df_pairs.groupby(["paper_x", "paper_y"]).count().reset_index()
df_pairs.columns = ["paper1", "paper2", "common"]
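To see what the self-merge produces, here is a small sketch on hypothetical sample data (the five-row df and the names merged, unique_pairs, and common are assumptions made for illustration). It uses size() with a named result column in place of count() followed by a positional rename; that is a substitution on my part, not the step as written above.

```python
import pandas as pd

# Hypothetical sample data: r1 is cited by p1, p2, p3; r2 by p1, p2.
df = pd.DataFrame({
    "paper": ["p1", "p2", "p3", "p1", "p2"],
    "reference": ["r1", "r1", "r1", "r2", "r2"],
})

# Self-merge on reference: every ordered pair of papers citing the same
# reference appears, including (p, p) and both (p1, p2) and (p2, p1).
merged = df.merge(df, on="reference")

# Keep one ordering per pair; this also drops the (p, p) self-pairs.
unique_pairs = merged[merged["paper_x"] < merged["paper_y"]]

# Count the rows per pair, i.e. the number of shared references.
common = (
    unique_pairs.groupby(["paper_x", "paper_y"])
    .size()
    .reset_index(name="common")
)
print(common)
```

The < comparison on strings is lexicographic (so "p112" sorts before "p32"), which is fine here because any consistent ordering keeps exactly one copy of each pair. Note that the self-merge grows with the square of the number of papers citing each reference, and that duplicate (paper, reference) rows in df would inflate the counts.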

Second, generate the number of references for each paper (you already have this):

df_refs = df.groupby(["paper"]).count().reset_index()
df_refs.columns = ["paper", "freq"]

Third, merge the two DataFrames:

# Note that we merge twice to get the count for both papers in each pair
df_all = df_pairs.merge(df_refs, how="left", left_on="paper1", right_on="paper")
df_all = df_all.merge(df_refs, how="left", left_on="paper2", right_on="paper")

# Get necessary columns and rename them
df_all = df_all[["paper1", "freq_x", "paper2", "freq_y", "common"]]
df_all.columns = ["paper1", "freq1", "paper2", "freq2", "common"]
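As a possible alternative to merging twice, the per-paper counts can be looked up with Series.map, which avoids the _x/_y suffix bookkeeping. This is only a sketch: the small df_pairs and df_refs below are hypothetical stand-ins with the same column names as the frames built above, and the name freq is introduced for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the df_pairs and df_refs frames.
df_pairs = pd.DataFrame({
    "paper1": ["p1", "p1", "p2"],
    "paper2": ["p2", "p3", "p3"],
    "common": [2, 1, 1],
})
df_refs = pd.DataFrame({"paper": ["p1", "p2", "p3"], "freq": [2, 2, 1]})

# Turn the counts into a Series indexed by paper so it can act as a lookup.
freq = df_refs.set_index("paper")["freq"]

# Look up the reference count for each side of the pair.
df_all = df_pairs.assign(
    freq1=df_pairs["paper1"].map(freq),
    freq2=df_pairs["paper2"].map(freq),
)

# Arrange the columns in the requested order.
df_all = df_all[["paper1", "freq1", "paper2", "freq2", "common"]]
print(df_all)
```

Whether this is faster than the two merges on a large DataFrame is not something I have measured.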