This file stores paper information and the citation network. The format is as follows:
#index ---- index id of this paper
#* ---- paper title
#@ ---- authors (separated by semicolons)
#t ---- year
#c ---- publication venue
#% ---- the id of references of this paper (there are multiple lines, with each indicating a reference)
Here is an example:
#index 1
#* Book Review: Discover Linux
#@ Marjorie Richardson
#t 1998
#c Linux Journal

#index 2
#* MOSFET table look-up models for circuit simulation
#@
#t 1984
#c Integration, the VLSI Journal

#index 1083734
#* ArnetMiner: extraction and mining of academic social networks
#@ Jie Tang;Jing Zhang;Limin Yao;Juanzi Li;Li Zhang;Zhong Su
#t 2008
#c Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
#% 197394
#% 220708
#% 280819
#% 387427
#% 464434
#% 643007
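I am trying to convert this file into a data frame I can work with. I looped over the lines with a for loop and used a series of if statements to determine the type of each line and append it to the corresponding list (paper_title, authors, and so on), planning to combine the lists into a data frame at the end. However, I noticed that a paper can have multiple reference IDs, as in the example, so the list of reference IDs would not line up with the other lists. Please help!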
Answer 0 (score: 0)
If every element (index, title, authors, year, pub) uses exactly one line, you can convert it simply like this:
text = '''#index 1
#* Book Review: Discover Linux
#@ Marjorie Richardson
#t 1998
#c Linux Journal

#index 2
#* MOSFET table look-up models for circuit simulation
#@
#t 1984
#c Integration, the VLSI Journal

#index 1083734
#* ArnetMiner: extraction and mining of academic social networks
#@ Jie Tang;Jing Zhang;Limin Yao;Juanzi Li;Li Zhang;Zhong Su
#t 2008
#c Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
#% 197394
#% 220708
#% 280819
#% 387427
#% 464434
#% 643007
'''

import pandas as pd

# --- functions ---

def parse(cit):
    # One citation: five fixed lines, then zero or more '#%' reference lines.
    lines = cit.strip().split('\n')

    # Remove the tag first and strip afterwards, so that a bare '#@'
    # (no authors, no trailing space) becomes an empty string.
    index = lines[0].replace('#index', '', 1).strip()
    title = lines[1].replace('#*', '', 1).strip()
    authors = lines[2].replace('#@', '', 1).strip()
    if authors:
        authors = authors.split(';')
    else:
        authors = []
    year = lines[3].replace('#t', '', 1).strip()
    pub = lines[4].replace('#c', '', 1).strip()

    # Everything after the fifth line is a reference ID.
    ids = [x.replace('#%', '', 1).strip() for x in lines[5:]]

    print('found index:', index)
    print('found title:', title)
    print('found authors:', authors)
    print('found year:', year)
    print('found pub:', pub)
    print('found IDs:', ids)
    print('---')

    return [index, title, authors, year, pub, ids]

# --- main ---

# Citations are separated by an empty line.
citations = text.strip().split('\n\n')

rows = []
for cit in citations:
    rows.append(parse(cit))

df = pd.DataFrame(rows, columns=['index', 'title', 'authors', 'year', 'pub', 'ids'])

pd.options.display.max_columns = 100
print(df)
Result:
found index: 1
found title: Book Review: Discover Linux
found authors: ['Marjorie Richardson']
found year: 1998
found pub: Linux Journal
found IDs: []
---
found index: 2
found title: MOSFET table look-up models for circuit simulation
found authors: []
found year: 1984
found pub: Integration, the VLSI Journal
found IDs: []
---
found index: 1083734
found title: ArnetMiner: extraction and mining of academic social networks
found authors: ['Jie Tang', 'Jing Zhang', 'Limin Yao', 'Juanzi Li', 'Li Zhang', 'Zhong Su']
found year: 2008
found pub: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
found IDs: ['197394', '220708', '280819', '387427', '464434', '643007']
---
     index                                              title  \
0        1                        Book Review: Discover Linux
1        2  MOSFET table look-up models for circuit simula...
2  1083734  ArnetMiner: extraction and mining of academic ...

                                             authors  year  \
0                              [Marjorie Richardson]  1998
1                                                 []  1984
2  [Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, L...  2008

                                                 pub  \
0                                      Linux Journal
1                      Integration, the VLSI Journal
2  Proceedings of the 14th ACM SIGKDD internation...

                                                ids
0                                                []
1                                                []
2  [197394, 220708, 280819, 387427, 464434, 643007]
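There are empty lines between citations, so I split the text on them and then run a for-loop that converts every citation separately in the function parse(). I assume that every element uses exactly one line, so no if/else is needed to check the lines.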
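If you can't rely on that assumption (for example, a long venue wraps onto a second line, or a field is missing), you could dispatch on the tag at the start of each line instead. This is only a minimal sketch: parse_records and the dictionary keys are names I made up, and I assume that any line without a known tag continues the preceding title or venue.

```python
def parse_records(text):
    """Parse '#tag value' lines into a list of dicts, one dict per paper."""
    records = []
    current = None   # the paper being filled in
    last_key = None  # 'title' or 'pub' if the previous line may wrap

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank separator lines carry no data
        if line.startswith('#index'):
            current = {'index': line[len('#index'):].strip(), 'title': '',
                       'authors': [], 'year': '', 'pub': '', 'ids': []}
            records.append(current)
            last_key = None
        elif current is None:
            continue  # ignore anything before the first '#index'
        elif line.startswith('#*'):
            current['title'] = line[2:].strip()
            last_key = 'title'
        elif line.startswith('#@'):
            names = line[2:].strip()
            current['authors'] = [a.strip() for a in names.split(';') if a.strip()]
            last_key = None
        elif line.startswith('#t'):
            current['year'] = line[2:].strip()
            last_key = None
        elif line.startswith('#c'):
            current['pub'] = line[2:].strip()
            last_key = 'pub'
        elif line.startswith('#%'):
            ref = line[2:].strip()
            if ref:
                current['ids'].append(ref)  # any number of references
            last_key = None
        elif last_key:
            # Assumption: an untagged line continues a wrapped title or venue.
            current[last_key] += ' ' + line
    return records


sample = """#index 2
#* MOSFET table look-up models for circuit simulation
#@
#t 1984
#c Integration, the VLSI Journal

#index 1083734
#* ArnetMiner: extraction and mining of academic social networks
#@ Jie Tang;Jing Zhang
#t 2008
#c Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data
mining
#% 197394
#% 220708
"""

records = parse_records(sample)
for rec in records:
    print(rec)
```

Because every record is a dict with the same keys, pd.DataFrame(records) should give the same columns as the fixed-position version.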
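Because a citation can have a different number of IDs, they have to be kept as a list in a single column/cell. This is more useful than putting every ID in a new column.
EDIT: now I also split the authors to keep them as a list.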
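Eventually you could duplicate a citation into new rows, each with a single ID. It may be useful in some situations, but in others it can cause problems.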
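One way to get one row per reference ID is DataFrame.explode (available since pandas 0.25). A minimal sketch, assuming a small hand-made frame shaped like the one above; long_df is just a name I picked:

```python
import pandas as pd

# Small stand-in for the parsed data frame (only the columns needed here).
df = pd.DataFrame({
    'index': ['1', '1083734'],
    'title': ['Book Review: Discover Linux',
              'ArnetMiner: extraction and mining of academic social networks'],
    'ids': [[], ['197394', '220708']],
})

# Each list element gets its own row; the other columns are repeated.
# A paper with an empty list is kept as a single row with NaN in 'ids'.
long_df = df.explode('ids').reset_index(drop=True)
print(long_df)
```

The repeated titles are the price of this layout, and the NaN rows need handling (for example with dropna) if you only want real references.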
Or you could keep the IDs in a separate table with the columns index, id, like in a database. But again: it may be useful in some situations, and in others it can cause problems.
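A separate index, id table could be built roughly like this. It is a sketch under the assumption that the IDs are already lists in an ids column; papers, refs and ref_counts are names I chose for illustration:

```python
import pandas as pd

# Small stand-in for the parsed data frame.
papers = pd.DataFrame({
    'index': ['1', '2', '1083734'],
    'title': ['Book Review: Discover Linux',
              'MOSFET table look-up models for circuit simulation',
              'ArnetMiner: extraction and mining of academic social networks'],
    'ids': [[], [], ['197394', '220708', '280819']],
})

# One (index, id) pair per reference; papers without references add no rows.
refs = pd.DataFrame(
    [(paper, ref)
     for paper, ids in zip(papers['index'], papers['ids'])
     for ref in ids],
    columns=['index', 'id'],
)

# The list column is no longer needed in the main table.
papers = papers.drop(columns='ids')

# Example use: number of references per citing paper.
ref_counts = refs.groupby('index').size()
print(refs)
print(ref_counts)
```

The two tables can be joined again with papers.merge(refs, on='index', how='left') whenever the combined view is needed.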
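I would keep it in a database, because that seems a better tool for this kind of data, and it has the means to work with many related tables (relations).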
It could even have a separate table for authors, with the columns index, author.
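To illustrate the database idea, here is a rough sketch with the standard sqlite3 module and an in-memory database. The table and column names are my own choice (I use idx because index is a reserved word in SQL), and the rows are shortened versions of what parse() returns:

```python
import sqlite3

# Rows in the same order as parse() returns them:
# [index, title, authors, year, pub, ids]
rows = [
    ['1', 'Book Review: Discover Linux', ['Marjorie Richardson'],
     '1998', 'Linux Journal', []],
    ['1083734', 'ArnetMiner: extraction and mining of academic social networks',
     ['Jie Tang', 'Jing Zhang'], '2008', 'KDD', ['197394', '220708']],
]

con = sqlite3.connect(':memory:')  # use a file name to keep the data
con.executescript('''
CREATE TABLE papers  (idx TEXT PRIMARY KEY, title TEXT, year TEXT, pub TEXT);
CREATE TABLE authors (idx TEXT, author TEXT);
CREATE TABLE refs    (idx TEXT, ref_id TEXT);
''')

for idx, title, authors, year, pub, ids in rows:
    con.execute('INSERT INTO papers VALUES (?, ?, ?, ?)', (idx, title, year, pub))
    con.executemany('INSERT INTO authors VALUES (?, ?)', [(idx, a) for a in authors])
    con.executemany('INSERT INTO refs VALUES (?, ?)', [(idx, r) for r in ids])
con.commit()

# Example query: number of references per paper, including papers with none.
result = con.execute('''
SELECT p.title, COUNT(r.ref_id)
FROM papers p LEFT JOIN refs r ON r.idx = p.idx
GROUP BY p.idx
ORDER BY p.idx
''').fetchall()
print(result)

author_count = con.execute('SELECT COUNT(*) FROM authors').fetchone()[0]
print(author_count)
```

With this layout neither the variable number of references nor the variable number of authors needs a list inside a cell.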
But for now I skip this part.