Question

This file stores paper information and the citation network. The format is as follows:

#index ---- index id of this paper
#* ---- paper title
#@ ---- authors (separated by semicolons)
#t ---- year
#c ---- publication venue
#% ---- the id of references of this paper (there are multiple lines, with each indicating a reference)

Here is an example:

#index 1
#* Book Review: Discover Linux
#@ Marjorie Richardson
#t 1998
#c Linux Journal

#index 2
#* MOSFET table look-up models for circuit simulation
#@ 
#t 1984
#c Integration, the VLSI Journal

#index 1083734
#* ArnetMiner: extraction and mining of academic social networks
#@ Jie Tang;Jing Zhang;Limin Yao;Juanzi Li;Li Zhang;Zhong Su
#t 2008
#c Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
#% 197394
#% 220708
#% 280819
#% 387427
#% 464434
#% 643007

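I am trying to convert this file into a data frame that I can work with. I tried looping over the lines with a for loop and using a series of if statements to determine each line's type and append it to the corresponding list (paper_title, authors, etc.), and finally combining the lists into a data frame. However, I noticed that a paper can have multiple reference IDs, as in the example, so the reference ID lines will not line up with the rest of the rows. Please help!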

Answer 1

If every element (index, title, authors, year, pub) uses only one line, then you can convert it simply like this:

text = '''#index 1
#* Book Review: Discover Linux
#@ Marjorie Richardson
#t 1998
#c Linux Journal

#index 2
#* MOSFET table look-up models for circuit simulation
#@ 
#t 1984
#c Integration, the VLSI Journal

#index 1083734
#* ArnetMiner: extraction and mining of academic social networks
#@ Jie Tang;Jing Zhang;Limin Yao;Juanzi Li;Li Zhang;Zhong Su
#t 2008
#c Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
#% 197394
#% 220708
#% 280819
#% 387427
#% 464434
#% 643007
'''

import pandas as pd

# --- functions ---

def parse(cit):
    # One citation block. The first five lines are assumed to be, in order,
    # #index, #*, #@, #t, #c; every line after them is a '#%' reference.
    lines = cit.strip().split('\n')
    index = lines[0].replace('#index ', '')
    title = lines[1].replace('#* ', '')
    # Strip after removing the tag so an empty '#@' line gives an empty
    # string whether or not it has a trailing space.
    authors = lines[2].replace('#@', '').strip()
    if authors:
        authors = authors.split(';')
    else:
        authors = []
    year = lines[3].replace('#t ', '')
    pub = lines[4].replace('#c ', '')
    ids = lines[5:]
    ids = [x.replace('#% ', '') for x in ids]

    print('found index:', index)
    print('found title:', title)
    print('found authors:', authors)
    print('found year:', year)
    print('found pub:', pub)
    print('found IDs:', ids)
    print('---')

    return [index, title, authors, year, pub, ids]

# --- main ---
    
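# Citations are separated by an empty line.
citations = text.split('\n\n')

rows = []

for cit in citations:
    # Skip empty chunks, e.g. from extra blank lines at the end of the file.
    if cit.strip():
        rows.append(parse(cit))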
    
df = pd.DataFrame(rows, columns=['index', 'title', 'authors', 'year', 'pub', 'ids'])    
pd.options.display.max_columns = 100
print(df)

Result:

found index: 1
found title: Book Review: Discover Linux
found authors: ['Marjorie Richardson']
found year: 1998
found pub: Linux Journal
found IDs: []
---
found index: 2
found title: MOSFET table look-up models for circuit simulation
found authors: []
found year: 1984
found pub: Integration, the VLSI Journal
found IDs: []
---
found index: 1083734
found title: ArnetMiner: extraction and mining of academic social networks
found authors: ['Jie Tang', 'Jing Zhang', 'Limin Yao', 'Juanzi Li', 'Li Zhang', 'Zhong Su']
found year: 2008
found pub: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
found IDs: ['197394', '220708', '280819', '387427', '464434', '643007']
---
     index                                              title  \
0        1                        Book Review: Discover Linux   
1        2  MOSFET table look-up models for circuit simula...   
2  1083734  ArnetMiner: extraction and mining of academic ...   

                                             authors  year  \
0                              [Marjorie Richardson]  1998   
1                                                 []  1984   
2  [Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, L...  2008   

                                                 pub  \
0                                      Linux Journal   
1                      Integration, the VLSI Journal   
2  Proceedings of the 14th ACM SIGKDD internation...   

                                                ids  
0                                                []  
1                                                []  
2  [197394, 220708, 280819, 387427, 464434, 643007]

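There is an empty line between citations, so I split the text on it and then run a for-loop to convert each citation separately in the function parse(). I assume that every element uses only one line, so no if/else is needed to check the lines.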
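If that assumption does not hold in your file (for example, some papers have no #c line), a parser that looks at the tag of each line is safer than one that relies on line positions. This is only a rough sketch, not part of the code above: the name parse_records and the FIELDS mapping are made up for illustration, and it assumes every tag is followed by a space, as in the example from the question.

```python
# Hypothetical helper (not from the answer's code): dispatch on the tag at
# the start of each line instead of on the line's position in the block.
FIELDS = {'#*': 'title', '#t': 'year', '#c': 'pub'}

def parse_records(text):
    """Return one dict per '#index' block."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank separator lines carry no data
        tag, _, value = line.partition(' ')
        if tag == '#index':
            # A new paper starts here; missing fields keep these defaults.
            current = {'index': value, 'title': '', 'authors': [],
                       'year': '', 'pub': '', 'ids': []}
            records.append(current)
        elif current is None:
            continue  # ignore anything before the first '#index'
        elif tag == '#@':
            current['authors'] = [a for a in value.split(';') if a]
        elif tag == '#%':
            current['ids'].append(value)
        elif tag in FIELDS:
            current[FIELDS[tag]] = value
    return records

# Lines taken from the question's example; the '#c' line of the first
# paper is left out on purpose to show the default.
sample = '''#index 1
#* Book Review: Discover Linux
#@ Marjorie Richardson
#t 1998

#index 1083734
#* ArnetMiner: extraction and mining of academic social networks
#t 2008
#% 197394
#% 220708
'''

records = parse_records(sample)
print(records[0]['pub'])  # empty string, because there was no '#c' line
print(records[1]['ids'])  # ['197394', '220708']
```

A list of dicts like this can be passed straight to pd.DataFrame(records) to get the same columns as before.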

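Because a citation may have a different number of IDs, they have to be kept as a list in a single column/cell. This is more useful than putting every ID in a new column.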

EDIT: Now I also split the authors to keep them as a list.

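Eventually you could duplicate the citation in new rows, each with a single ID. It may be useful in some situations, but it can create problems in others.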
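One way to get that row-per-ID layout is pandas' DataFrame.explode(). The sketch below uses a cut-down frame with only three of the columns and a few IDs from the example; long_df is just a name picked for illustration.

```python
import pandas as pd

# Cut-down version of the frame built above: only three columns.
df = pd.DataFrame({
    'index': ['1', '1083734'],
    'title': ['Book Review: Discover Linux',
              'ArnetMiner: extraction and mining of academic social networks'],
    'ids': [[], ['197394', '220708', '280819']],
})

# One row per (paper, reference ID); the other columns are repeated.
long_df = df.explode('ids', ignore_index=True)
print(long_df)
```

Note that a paper with an empty list still gets one row, with NaN in the ids column, and that the title is repeated on every row. Those are the kinds of problems this layout can create.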

Or you could keep the IDs in a separate table with columns index, id, like in a database. But again: it may be useful in some situations, but it can create problems in others.
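A rough sketch of that two-table layout in pandas, assuming rows shaped like the ones returned by parse() but shortened to index, title and ids. The names papers, refs and merged are only for illustration.

```python
import pandas as pd

# Shortened rows: [index, title, ids]
rows = [
    ['1', 'Book Review: Discover Linux', []],
    ['1083734',
     'ArnetMiner: extraction and mining of academic social networks',
     ['197394', '220708']],
]

# Table 1: one row per paper, without the list column.
papers = pd.DataFrame([r[:2] for r in rows], columns=['index', 'title'])

# Table 2: one row per (paper index, reference id) pair.
refs = pd.DataFrame(
    [(index, ref) for index, _title, ids in rows for ref in ids],
    columns=['index', 'id'],
)

# The tables can be joined again when needed; a left join keeps papers
# that have no references.
merged = papers.merge(refs, on='index', how='left')
print(refs)
print(merged)
```

Here a paper without references simply has no rows in refs, so nothing is duplicated until you actually merge.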

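I would keep it in a database because it seems a better tool for working with this kind of data, and it has tools for working with many tables (relations). It could even use a separate table for authors, with columns index, author.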
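To give an idea of what that could look like, here is a minimal sketch with the standard sqlite3 module and an in-memory database. The table and column names (papers, authors, refs, idx, ref_id) are my own choice, and idx is used instead of index because INDEX is a reserved word in SQL. The rows are assumed to have the same shape as the lists returned by parse().

```python
import sqlite3

# Rows shaped like the output of parse():
# [index, title, authors, year, pub, ids]
rows = [
    ['1', 'Book Review: Discover Linux', ['Marjorie Richardson'],
     '1998', 'Linux Journal', []],
    ['1083734',
     'ArnetMiner: extraction and mining of academic social networks',
     ['Jie Tang', 'Jing Zhang'], '2008',
     'Proceedings of the 14th ACM SIGKDD international conference on '
     'Knowledge discovery and data mining',
     ['197394', '220708']],
]

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE papers  (idx TEXT PRIMARY KEY, title TEXT, year TEXT, pub TEXT);
    CREATE TABLE authors (idx TEXT, author TEXT);
    CREATE TABLE refs    (idx TEXT, ref_id TEXT);
''')

for index, title, authors, year, pub, ids in rows:
    conn.execute('INSERT INTO papers VALUES (?, ?, ?, ?)',
                 (index, title, year, pub))
    conn.executemany('INSERT INTO authors VALUES (?, ?)',
                     [(index, a) for a in authors])
    conn.executemany('INSERT INTO refs VALUES (?, ?)',
                     [(index, r) for r in ids])

# Number of references per paper; LEFT JOIN keeps papers with none.
ref_counts = conn.execute('''
    SELECT p.idx, COUNT(r.ref_id)
    FROM papers AS p
    LEFT JOIN refs AS r ON r.idx = p.idx
    GROUP BY p.idx
    ORDER BY p.idx
''').fetchall()
print(ref_counts)  # [('1', 0), ('1083734', 2)]
```

If you still need a data frame afterwards, pandas.read_sql_query() can load the result of a query like this one.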

But I skip this part for now.
