I currently have a pandas DataFrame df:
paper reference
2171686 p84 r51
3816503 p41 r95
4994553 p112 r3
2948201 p112 r61
2957375 p32 r41
2938471 p65 r41
...
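Here, each row of df represents a citation relationship between paper and reference (paper cites reference).
I need the following numbers for my analysis:
1. The frequency of each element of paper in df
2. For any two elements picked from paper, the number of reference elements they both cite
For number 1, I did the following: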
df_count = df.groupby(['paper'])['paper'].count()
For number 2, I built the pairs of paper elements that cite the same element of reference:
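from collections import defaultdict
pair = []
# Map each reference to the list of papers that cite it
d = defaultdict(list)
for idx, row in df.iterrows():
    d[row['reference']].append(row['paper'])
# For every reference, emit each unordered pair of papers citing it
for ref, lst in d.items():
    for i in range(len(lst)):
        for j in range(i+1, len(lst)):
            pair.append([lst[i], lst[j], ref])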
pair is a list of three-element lists: the first two elements are a pair of paper values, and the third is the reference that both papers cite. Here is what pair looks like:
[['p88','p7','r11'],
['p94','p33','r11'],
['p75','p33','r43'],
['p5','p12','r79'],
...]
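As an aside, the pair-building loop above can be sketched more compactly with itertools.combinations. This is only an illustration under stated assumptions: the toy df below is hypothetical (it is not the data from the question), and the name citing is introduced here for readability.

```python
from collections import defaultdict
from itertools import combinations

import pandas as pd

# Hypothetical toy data, not taken from the question
df = pd.DataFrame({
    "paper": ["p1", "p2", "p3", "p1", "p2"],
    "reference": ["r1", "r1", "r1", "r2", "r2"],
})

# Map each reference to the list of papers that cite it
citing = defaultdict(list)
for paper, ref in zip(df["paper"], df["reference"]):
    citing[ref].append(paper)

# Every unordered pair of papers that share a reference
pair = [
    [a, b, ref]
    for ref, papers in citing.items()
    for a, b in combinations(papers, 2)
]
print(pair)
```

Iterating over zip of the two columns avoids the per-row overhead of iterrows, which tends to matter on large frames.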
I would like to retrieve a DataFrame in the following format:
paper1 freq1 paper2 freq2 common
p17 4 p45 3 2
p5 2 p8 5 2
...
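where paper1 and paper2 are the first two elements of each list in pair, freq1 and freq2 are the frequencies of each paper as given by df_count, and common is the number of reference elements that paper1 and paper2 both cite.
How can I retrieve the desired dataset (in the desired format) from df, df_count, and pair?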
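Answer 0 (score: 1)
I think this can be solved using only pandas.DataFrame.merge. I'm not sure it is the most efficient approach, though.
First, generate the common reference counts: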
# Merge the dataframe with itself to generate pairs
# Note that we merge only on reference, i.e. we generate each and every pair
df_pairs = df.merge(df, on=["reference"])
# Dataframe contains duplicate pairs of form (p1, p2) and (p2, p1), remove duplicates
df_pairs = df_pairs[df_pairs["paper_x"] < df_pairs["paper_y"]]
# Now group by pairs, and count the rows
# This will give you the number of common references per each paper pair
# reset_index is necessary to get each row separately
df_pairs = df_pairs.groupby(["paper_x", "paper_y"]).count().reset_index()
df_pairs.columns = ["paper1", "paper2", "common"]
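To make the self-merge step concrete, here is a small sketch on hypothetical toy data (the values are assumptions for illustration, not the question's data). It uses groupby(...).size() instead of count(), which is a swapped-in variant that names the count column directly; the result should match the approach above.

```python
import pandas as pd

# Hypothetical toy data: p1 and p2 share r1 and r2, p3 only cites r1
df = pd.DataFrame({
    "paper": ["p1", "p2", "p3", "p1", "p2"],
    "reference": ["r1", "r1", "r1", "r2", "r2"],
})

# Self-merge on reference: one row per (paper_x, paper_y) sharing a reference
df_pairs = df.merge(df, on="reference")

# Keep one orientation of each pair; this also drops self-pairs (p, p)
df_pairs = df_pairs[df_pairs["paper_x"] < df_pairs["paper_y"]]

# Number of rows per pair equals the number of shared references
common = (
    df_pairs.groupby(["paper_x", "paper_y"])
    .size()
    .reset_index(name="common")
    .rename(columns={"paper_x": "paper1", "paper_y": "paper2"})
)
print(common)
```

Note that the `<` comparison on string ids is lexicographic (so "p112" sorts before "p32"); that is fine for deduplication, since it only needs to pick one consistent orientation per pair.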
Second, generate the number of references for each paper (you already have this):
df_refs = df.groupby(["paper"]).count().reset_index()
df_refs.columns = ["paper", "freq"]
Third, merge the two dataframes:
# Note that we merge twice to get the count for both papers in each pair
df_all = df_pairs.merge(df_refs, how="left", left_on="paper1", right_on="paper")
df_all = df_all.merge(df_refs, how="left", left_on="paper2", right_on="paper")
# Get necessary columns and rename them
df_all = df_all[["paper1", "freq_x", "paper2", "freq_y", "common"]]
df_all.columns = ["paper1", "freq1", "paper2", "freq2", "common"]
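Since the question asks about starting from df_count and pair, an alternative sketch is to count the pairs with collections.Counter and look the frequencies up in df_count. This is not the answer's method, only a possible variant; the toy inputs below are hypothetical and merely shaped like the objects in the question.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy inputs shaped like the question's df, df_count and pair
df = pd.DataFrame({
    "paper": ["p1", "p2", "p3", "p1", "p2"],
    "reference": ["r1", "r1", "r1", "r2", "r2"],
})
df_count = df.groupby("paper")["paper"].count()
pair = [
    ["p1", "p2", "r1"],
    ["p1", "p3", "r1"],
    ["p2", "p3", "r1"],
    ["p1", "p2", "r2"],
]

# Count shared references per unordered pair; sorting makes (a, b) == (b, a)
common = Counter(tuple(sorted((a, b))) for a, b, _ in pair)

# Assemble the requested layout, looking up each paper's frequency in df_count
result = pd.DataFrame(
    [(a, df_count[a], b, df_count[b], n) for (a, b), n in common.items()],
    columns=["paper1", "freq1", "paper2", "freq2", "common"],
)
print(result)
```

This variant builds the table in plain Python, so for very large pair lists the merge-based approach above may scale better.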