I want to extract some information from a text file (named inf.txt) and use it to build a dataframe in Python. An example of inf.txt is:
bene_id_18900 (Variable1, 43)
bene_id_18900 (Variable4, 0)
dtype: object 0
encrypted 723 beneficiary id (Label1, 43)
encrypted 723 beneficiary id (Label5, 4)
dtype: object 0
bene_id_18900 (Variable1, 43)
bene_id_18900 (Variable4, 0)
dtype: object 0
from (Variable4, 95)
from (VNAME4, 95)
from (Variable6, 94)
from (VNAME6, 94)
dtype: object 2
first day on claim billing statement (Label4, 95)
first day on claim billing statement (Label6, 94)
dtype: object 2
thru (Variable4, 140)
thru (VNAME4, 140)
thru (Variable6, 142)
thru (VNAME6, 142)
dtype: object 3
last day on claim billing statement (Label4, 140)
last day on claim billing statement (Label6, 142)
dtype: object 3
The desired result is:
1 2 3 4 5 6
0 43 na na 0 4 na
1 na na na na na na
2 4 5 na 95 na 94
3 na na na 140 na 142
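The row number comes from the number after dtype: object, and the column number comes from the digit at the end of the name in each pair of parentheses; the second item in the parentheses is the value to store.
For example, the first line contains (Variable1, 43): it belongs to dtype: object 0, so it goes in row 0; it is Variable1, so it goes in column 1.
Another example, the second-to-last line contains (Label6, 142): it belongs to dtype: object 3, so it goes in row 3; it is Label6, so it goes in column 6.
All the other strings, such as "bene_id_18900", "Variable" and "Label", don't actually mean anything.
My idea is to add the corresponding row number to each pair of parentheses, so that later I can keep all the useful information and delete everything else. Like this: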
(1, 43, 0)
(4, 0, 0)
(1, 43, 0)
(5, 4, 0)
(1, 43, 0)
(4, 0, 0)
(4, 95, 1)
(4, 95, 1)
......
......
......
My attempt, though I really don't know how to continue......
with open('/Users/xccxken/Dropbox/inf.txt') as f:
    content = f.readlines()
content = [x.strip() for x in content]
for x in content:
    pass  # stuck here
Answer 0 (score: 0)
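Let's say you know the number of rows (M) and columns (N) in the text file. A simple parse that finds the largest dtype number and the largest Label/Variable number will give you this information. Next, create an M x N array: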
import re
import pandas as pd

# assuming that you have found the max no of rows M and max no of columns N.
M = 4
N = 6
# create MxN list of lists with values 'na'
x = ['na'] * N
data = []
for i in range(M):
    tmp = list(x)
    data.append(tmp)
index_x = -999  # fix for NameError
# data = [x] * M; this does not work since lists are mutable objects
with open('/Users/xccxken/Dropbox/inf.txt') as fh:
    for line in fh:
        line = line.strip()
        if 'dtype' in line:
            # get the x axis index
            index_x = int(line.split(' ')[-1])
        if 'Label' in line:
            # get y axis index
            c = re.search(r'Label(\d), (\d+)', line)
            index_y = int(c.groups()[0])
            # reduce index_y by 1 as the col names start with 1 and python list is 0 index
            if index_y > 0:
                index_y -= 1
            # get value
            value = int(c.groups()[1])
            if index_x >= 0:  # fix the NameError and a logical bug
                # populate the correct x,y location in the list of lists
                data[index_x][index_y] = value
        if 'Variable' in line:
            c = re.search(r'Variable(\d), (\d+)', line)
            index_y = int(c.groups()[0])
            value = int(c.groups()[1])
            if index_y > 0:
                index_y -= 1
            if index_x >= 0:  # fix the NameError and a logical bug
                data[index_x][index_y] = value
# create the col names
cols = range(1, N+1)
# create the dataframe
df = pd.DataFrame(data, columns=cols)
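The code above hard-codes M and N. A minimal sketch of the "simple parse" that could derive them is shown here, assuming rows are numbered from 0 (so M is the largest dtype number plus one) and columns from 1. The helper name `table_shape` and both regular expressions are illustrative assumptions, not part of the original answer.

```python
import re

# Hypothetical patterns: the row marker and a "(VariableK, value)" or
# "(LabelK, value)" entry. VNAME entries are deliberately not matched.
DTYPE_RE = re.compile(r'dtype:\s*object\s+(\d+)')
ENTRY_RE = re.compile(r'\((?:Variable|Label)(\d+),\s*(\d+)\)')


def table_shape(lines):
    """Return (M, N): number of rows and columns implied by the lines."""
    max_row = -1
    max_col = 0
    for line in lines:
        d = DTYPE_RE.search(line)
        if d:
            max_row = max(max_row, int(d.group(1)))
            continue
        e = ENTRY_RE.search(line)
        if e:
            max_col = max(max_col, int(e.group(1)))
    # rows are 0-based, so the count is the largest index plus one
    return max_row + 1, max_col


sample = [
    "bene_id_18900 (Variable1, 43)",
    "dtype: object 0",
    "from (VNAME4, 95)",
    "thru (Variable6, 142)",
    "dtype: object 3",
]
M, N = table_shape(sample)
print(M, N)
```

For a real file, the list of lines could come from `open(path).read().splitlines()`.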
Hope this helps; it works for me. I used this as the sample:
dtype: object 0
encrypted 723 beneficiary id (Label1, 43)
encrypted 723 beneficiary id (Label5, 4)
dtype: object 0
bene_id_18900 (Variable1, 43)
bene_id_18900 (Variable4, 0)
dtype: object 0
from (Variable4, 95)
from (VNAME4, 95)
from (Variable6, 94)
from (VNAME6, 94)
dtype: object 2
first day on claim billing statement (Label4, 95)
first day on claim billing statement (Label6, 94)
dtype: object 2
thru (Variable4, 140)
thru (VNAME4, 140)
thru (Variable6, 142)
thru (VNAME6, 142)
dtype: object 3
last day on claim billing statement (Label4, 140)
last day on claim billing statement (Label6, 142)
dtype: object 3
and the output was:
1 2 3 4 5 6
0 43 na na 95 4 94
1 na na na na na na
2 na na na 140 na 142
3 na na na 140 na 142
Just FYI, I think these are also valid data:
dtype: object 0
from (Variable4, 95) # is valid
from (VNAME4, 95)
from (Variable6, 94)
from (VNAME6, 94) # is valid
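The answer's code treats each dtype line as a header for the entries that follow it. In the question's file the dtype line comes after its entries, so a possible variant is to buffer entries until the closing dtype line is seen. The sketch below does that and produces the (column, value, row) triples the question proposes. It assumes VNAME entries should be skipped; `parse_trailing_dtype` and `to_frame` are hypothetical names.

```python
import re
import pandas as pd

DTYPE_RE = re.compile(r'dtype:\s*object\s+(\d+)')
ENTRY_RE = re.compile(r'\((?:Variable|Label)(\d+),\s*(\d+)\)')


def parse_trailing_dtype(lines):
    """Return (col, value, row) triples; each dtype line closes the group above it."""
    triples = []
    pending = []  # (col, value) pairs waiting for their row number
    for line in lines:
        d = DTYPE_RE.search(line)
        if d:
            row = int(d.group(1))
            triples.extend((col, value, row) for col, value in pending)
            pending = []
            continue
        e = ENTRY_RE.search(line)
        if e:
            pending.append((int(e.group(1)), int(e.group(2))))
    return triples


def to_frame(triples, n_rows, n_cols):
    """Place each value at (row, col) in a frame pre-filled with 'na'."""
    df = pd.DataFrame('na', index=range(n_rows),
                      columns=range(1, n_cols + 1), dtype=object)
    for col, value, row in triples:
        df.at[row, col] = value
    return df


sample = """\
bene_id_18900 (Variable1, 43)
bene_id_18900 (Variable4, 0)
dtype: object 0
encrypted 723 beneficiary id (Label1, 43)
encrypted 723 beneficiary id (Label5, 4)
dtype: object 0
from (Variable4, 95)
from (VNAME4, 95)
from (Variable6, 94)
from (VNAME6, 94)
dtype: object 2
""".splitlines()

triples = parse_trailing_dtype(sample)
df = to_frame(triples, n_rows=3, n_cols=6)
print(df)
```

Later entries overwrite earlier ones that land in the same cell, which is also how the answer's code behaves.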