How to read a text file in a specific format with Python and convert it to a DataFrame

Time: 2021-06-17 02:25:35

Tags: python text import

This file stores paper information and the citation network. The format is as follows:

#index ---- index id of this paper
#* ---- paper title
#@ ---- authors (separated by semicolons)
#t ---- year
#c ---- publication venue
#% ---- the id of references of this paper (there are multiple lines, with each indicating a reference)

Here is an example:

#index 1
#* Book Review: Discover Linux
#@ Marjorie Richardson
#t 1998
#c Linux Journal

#index 2
#* MOSFET table look-up models for circuit simulation
#@ 
#t 1984
#c Integration, the VLSI Journal

#index 1083734
#* ArnetMiner: extraction and mining of academic social networks
#@ Jie Tang;Jing Zhang;Limin Yao;Juanzi Li;Li Zhang;Zhong Su
#t 2008
#c Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
#% 197394
#% 220708
#% 280819
#% 387427
#% 464434
#% 643007

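I am trying to convert this file into a DataFrame I can work with. I tried looping over the lines with a for loop and wrote many if statements to determine the type of each line and append it to the matching list (paper_title, authors, etc.), then combined the lists into a DataFrame at the end. However, I noticed that a paper can have multiple reference IDs, as the example shows, so the list of reference IDs does not line up with the other lists. Please help!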

1 answer:

Answer 0 (score: 0)

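If each element (index, title, authors, year, pub) occupies exactly one line, and the lines always appear in that order, you can convert it simply like this: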

text = '''#index 1
#* Book Review: Discover Linux
#@ Marjorie Richardson
#t 1998
#c Linux Journal

#index 2
#* MOSFET table look-up models for circuit simulation
#@ 
#t 1984
#c Integration, the VLSI Journal

#index 1083734
#* ArnetMiner: extraction and mining of academic social networks
#@ Jie Tang;Jing Zhang;Limin Yao;Juanzi Li;Li Zhang;Zhong Su
#t 2008
#c Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
#% 197394
#% 220708
#% 280819
#% 387427
#% 464434
#% 643007
'''

import pandas as pd

# --- functions ---

def parse(cit):
    lines = cit.strip().split('\n')
    index = lines[0].replace('#index ', '')
    title = lines[1].replace('#* ', '')
    authors = lines[2].replace('#@ ', '')
    if authors:
        authors = authors.split(';')
    else:
        authors = []
    year = lines[3].replace('#t ', '')
    pub = lines[4].replace('#c ', '')
    ids = lines[5:]
    ids = [x.replace('#% ', '')  for x in ids]
    
    print('found index:', index)
    print('found title:', title)
    print('found authors:', authors)
    print('found year:', year)
    print('found pub:', pub)
    print('found IDs:', ids)
    print('---')

    return [index, title, authors, year, pub, ids]

# --- main ---
    
citations = text.split('\n\n')

rows = []

for cit in citations:
    rows.append(parse(cit))
    
df = pd.DataFrame(rows, columns=['index', 'title', 'authors', 'year', 'pub', 'ids'])    
pd.options.display.max_columns = 100
print(df)

Result:

found index: 1
found title: Book Review: Discover Linux
found authors: ['Marjorie Richardson']
found year: 1998
found pub: Linux Journal
found IDs: []
---
found index: 2
found title: MOSFET table look-up models for circuit simulation
found authors: []
found year: 1984
found pub: Integration, the VLSI Journal
found IDs: []
---
found index: 1083734
found title: ArnetMiner: extraction and mining of academic social networks
found authors: ['Jie Tang', 'Jing Zhang', 'Limin Yao', 'Juanzi Li', 'Li Zhang', 'Zhong Su']
found year: 2008
found pub: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
found IDs: ['197394', '220708', '280819', '387427', '464434', '643007']
---
     index                                              title  \
0        1                        Book Review: Discover Linux   
1        2  MOSFET table look-up models for circuit simula...   
2  1083734  ArnetMiner: extraction and mining of academic ...   

                                             authors  year  \
0                              [Marjorie Richardson]  1998   
1                                                 []  1984   
2  [Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, L...  2008   

                                                 pub  \
0                                      Linux Journal   
1                      Integration, the VLSI Journal   
2  Proceedings of the 14th ACM SIGKDD internation...   

                                                ids  
0                                                []  
1                                                []  
2  [197394, 220708, 280819, 387427, 464434, 643007]  

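There is an empty line between citations, so I split the text on it and then run a for-loop that converts each citation separately in the function parse(). I assume each element uses exactly one line and always sits at the same position, so no if/else is needed to check the lines.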
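If that assumption does not hold for the real file (for example, a record with a missing line), the positional lookup would shift. A minimal sketch of a tag-based variant follows. The names `sample`, `TAGS`, `parse_by_tag` and `df_by_tag` are made up for this illustration, and the second record in `sample` is altered on purpose to drop its `#@` line:

```python
import pandas as pd

# Hypothetical sample in the same format; record 2 has no "#@" line at all.
sample = """#index 1
#* Book Review: Discover Linux
#@ Marjorie Richardson
#t 1998
#c Linux Journal

#index 2
#* MOSFET table look-up models for circuit simulation
#t 1984
#c Integration, the VLSI Journal
#% 10
#% 20
"""

# Map each single-valued tag to the column it fills.
TAGS = {'#index': 'index', '#*': 'title', '#t': 'year', '#c': 'pub'}

def parse_by_tag(block):
    # Start with empty defaults so a missing line leaves an empty value.
    record = {'index': '', 'title': '', 'authors': [], 'year': '', 'pub': '', 'ids': []}
    for line in block.strip().splitlines():
        # The tag is everything before the first space, the value is the rest.
        tag, _, value = line.partition(' ')
        value = value.strip()
        if tag == '#%':
            record['ids'].append(value)      # may occur many times
        elif tag == '#@':
            record['authors'] = value.split(';') if value else []
        elif tag in TAGS:
            record[TAGS[tag]] = value
        # Lines with an unknown tag are ignored in this sketch.
    return record

blocks = [b for b in sample.split('\n\n') if b.strip()]
df_by_tag = pd.DataFrame([parse_by_tag(b) for b in blocks])
print(df_by_tag)
```

This looks at the tag of every line instead of its position, so the order of the lines inside a record no longer matters.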

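Because citations can have different numbers of IDs, the IDs have to be kept as a list in a single column/cell. This is more useful than putting every ID in its own column.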

EDIT: Now I also split the authors to keep them as a list.


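You could also duplicate each citation across new rows, with a single ID per row. That can be useful in some situations, but in others it can cause problems.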
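A minimal sketch of the one-row-per-ID layout, using pandas `DataFrame.explode` on a small hand-built frame (`df_demo` and `long_df` are names invented for this example, with only a few of the IDs from above):

```python
import pandas as pd

# Small stand-in for the DataFrame built above: 'ids' holds a list per paper.
df_demo = pd.DataFrame({
    'index': ['1', '1083734'],
    'title': ['Book Review: Discover Linux',
              'ArnetMiner: extraction and mining of academic social networks'],
    'ids': [[], ['197394', '220708', '280819']],
})

# explode() repeats the other columns once for every element of the list.
# A paper with an empty list is kept as a single row with NaN in 'ids'.
long_df = df_demo.explode('ids').reset_index(drop=True)
print(long_df)
```

Note that the title is repeated on every row, which is one of the cases where this layout can get in the way.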

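Or you can keep the IDs in a separate table with the columns index, id, as you would in a database. But again: in some situations that is useful, and in others it can cause problems.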
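A rough sketch of that split into two tables with pandas. The names `df_demo`, `papers` and `references` are invented for this example, and the frame is a shortened stand-in for the one built above:

```python
import pandas as pd

# Shortened stand-in for the DataFrame built above.
df_demo = pd.DataFrame({
    'index': ['1', '1083734'],
    'title': ['Book Review: Discover Linux',
              'ArnetMiner: extraction and mining of academic social networks'],
    'ids': [[], ['197394', '220708', '280819']],
})

# Table 1: one row per paper, without the list column.
papers = df_demo.drop(columns='ids')

# Table 2: one row per (paper, referenced id) pair.
# Papers without references simply have no rows here.
references = (
    df_demo[['index', 'ids']]
    .explode('ids')
    .dropna(subset=['ids'])
    .rename(columns={'ids': 'id'})
    .reset_index(drop=True)
)

print(papers)
print(references)
```

The two tables can be joined again when needed, for example with `papers.merge(references, on='index', how='left')`.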

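I would keep it in a database, because that seems to be a better tool for this kind of data and it has the means to work with many tables (relations). It could even have a separate table for authors, with the columns index, author.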

But for now I skip this part.
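For anyone who wants to try the database route anyway, here is a minimal sketch with Python's built-in `sqlite3` module and an in-memory database. The table and column names (`papers`, `authors`, `refs`, `paper_index`, `ref_id`) are assumptions made for this example, and `rows` is a shortened hand-written list in the same shape that parse() returns:

```python
import sqlite3

# Rows in the same shape as parse() returns:
# [index, title, authors, year, pub, ids]
rows = [
    ['1', 'Book Review: Discover Linux', ['Marjorie Richardson'],
     '1998', 'Linux Journal', []],
    ['1083734', 'ArnetMiner: extraction and mining of academic social networks',
     ['Jie Tang', 'Jing Zhang'], '2008',
     'Proceedings of the 14th ACM SIGKDD international conference on '
     'Knowledge discovery and data mining',
     ['197394', '220708']],
]

conn = sqlite3.connect(':memory:')  # nothing is written to disk
conn.executescript("""
CREATE TABLE papers  (paper_index INTEGER PRIMARY KEY, title TEXT, year INTEGER, pub TEXT);
CREATE TABLE authors (paper_index INTEGER, author TEXT);
CREATE TABLE refs    (paper_index INTEGER, ref_id INTEGER);
""")

for index, title, authors, year, pub, ids in rows:
    conn.execute('INSERT INTO papers VALUES (?, ?, ?, ?)',
                 (int(index), title, int(year), pub))
    conn.executemany('INSERT INTO authors VALUES (?, ?)',
                     [(int(index), a) for a in authors])
    conn.executemany('INSERT INTO refs VALUES (?, ?)',
                     [(int(index), int(r)) for r in ids])
conn.commit()

# Number of references per paper; LEFT JOIN keeps papers without references.
ref_counts = conn.execute("""
    SELECT p.paper_index, COUNT(r.ref_id)
    FROM papers p
    LEFT JOIN refs r ON r.paper_index = p.paper_index
    GROUP BY p.paper_index
    ORDER BY p.paper_index
""").fetchall()
print(ref_counts)
```

Here authors and references each get their own table, so a paper can have any number of either without lists inside cells.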