I am trying to read a .txt file into my Jupyter notebook.
This is my code:
acm = pd.read_csv('outputacm.txt', header=None, error_bad_lines=False)
print(acm)
Here is a sample of my txt file:
2244018
#*OQL[C++]: Extending C++ with an Object Query Capability.
#@José A. Blakeley
#year1995
#confModern Database Systems
#citation14
#index0
#arnetid2
#*Transaction Management in Multidatabase Systems.
#@Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz
#year1995
#confModern Database Systems
#citation22
#index1
#arnetid3
#*Overview of the ADDS System.
#@Yuri Breitbart,Tom C. Reyes
#year1995
#confModern Database Systems
#citation-1
#index2
#arnetid4
The different symbols should correspond to:
#* --- paperTitle
#@ --- Authors
#year ---- Year
#conf --- publication venue
#citation --- citation number (both -1 and 0 means none)
#index ---- index id of this paper
#arnetid ---- pid in arnetminer database
#% ---- the id of references of this paper (there are multiple lines, with each indicating a reference)
#! --- Abstract
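I am not sure how to set this up so that the data is read in correctly. Ideally, I want a DataFrame where each category is its own column and every entry in the document is a row. Thanks!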
Answer 0 (score: 0)
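My regex skills are not as sharp as they should be, but as long as the data keeps the same format and the tag names do not show up at the start of other lines, something like the following may work: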
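import re
import pandas as pd

path = r"filepath.txt"
year = []
conf = []
# continue for all columns

with open(path, 'r') as f:
    for ele in f:
        ele = ele.rstrip('\n')  # drop the trailing newline before slicing
        # '#year1995' -> '1995' (the tag '#year' is 5 characters long)
        if re.match(r'#year', ele):
            year.append(ele[5:])
        # '#confModern Database Systems' -> 'Modern Database Systems'
        if re.match(r'#conf', ele):
            conf.append(ele[5:])
        # continue for all columns with the needed tag

# every list must have the same length, otherwise pd.DataFrame raises a ValueError
df = pd.DataFrame(data={'year': year, 'conf': conf})  # add one entry per list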
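One weakness of keeping a separate list per column is that the lists fall out of step as soon as a paper is missing a tag or has several `#%` lines. A possible alternative, sketched below, collects one dict per paper instead. This is not part of the answer above: the function name `parse_records` and the column names in `TAGS` are my own choices based on the legend in the question, and the sketch assumes every record starts with a `#*` title line, as in the sample.

```python
# Tag prefix -> column name, following the legend in the question.
# The column names themselves are illustrative choices, not fixed by the data.
TAGS = {
    '#*': 'paperTitle',
    '#@': 'Authors',
    '#year': 'Year',
    '#conf': 'Venue',
    '#citation': 'Citations',
    '#index': 'Index',
    '#arnetid': 'ArnetId',
    '#%': 'References',
    '#!': 'Abstract',
}


def parse_records(lines):
    """Group tagged lines into one dict per paper.

    Assumes each paper starts with a '#*' line (true for the sample shown).
    """
    records = []
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith('#*'):
            # A title line opens a new record; references accumulate in a list.
            current = {'References': []}
            records.append(current)
        if current is None:
            # Nothing before the first title belongs to a paper
            # (for example the leading count line '2244018').
            continue
        for prefix, column in TAGS.items():
            if line.startswith(prefix):
                value = line[len(prefix):]
                if column == 'References':
                    if value:
                        current[column].append(value)
                else:
                    current[column] = value
                break
    return records
```

Opening the file and passing the handle to `parse_records`, then handing the result to `pd.DataFrame(...)`, should give one row per paper and one column per tag, with `NaN` wherever a paper lacks a tag. All values come out as strings, so columns such as `Year` or `Citations` would still need converting, for example with `pd.to_numeric`.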