Build a DataFrame from a txt file

Date: 2017-06-20 17:49:24

Tags: python pandas dataframe

I want to extract some information from a text file (named inf.txt) and use it to build a DataFrame in Python. An example of inf.txt:

bene_id_18900    (Variable1, 43)
bene_id_18900    (Variable4, 0)
dtype: object 0
encrypted 723 beneficiary id    (Label1, 43)
encrypted 723 beneficiary id    (Label5, 4)
dtype: object 0
bene_id_18900    (Variable1, 43)
bene_id_18900    (Variable4, 0)
dtype: object 0
from      (Variable4, 95)
from         (VNAME4, 95)
from      (Variable6, 94)
from         (VNAME6, 94)
dtype: object 2
first day on claim billing statement      (Label4, 95)
first day on claim billing statement      (Label6, 94)
dtype: object 2
thru     (Variable4, 140)
thru        (VNAME4, 140)
thru     (Variable6, 142)
thru        (VNAME6, 142)
dtype: object 3
last day on claim billing statement     (Label4, 140)
last day on claim billing statement     (Label6, 142)
dtype: object 3

The desired result is:

    1   2   3   4   5   6
0   43  na  na  0   4   na
1   na  na  na  na  na  na
2   4   5   na  95  na  94
3   na  na  na  140 na  142

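The row number comes from the number after dtype: object. The column number comes from the digit at the end of the name in each parenthesis, and the second item in the parenthesis is the cell value.

For example, the first line contains (Variable1, 43): it belongs to dtype: object 0, so it goes in row 0; it is Variable1, so it goes in column 1.

Another example: the second-to-last line contains (Label6, 142): it belongs to dtype: object 3, so it goes in the row labeled 3; it is Label6, so it goes in column 6.

Strings such as "bene_id_18900", "Variable", and "Label" carry no meaning of their own.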

My idea is to add the corresponding row number to each parenthesis, so that later I can keep all the useful information and drop the rest. Like this:

(1, 43, 0)
(4, 0, 0)
(1, 43, 0)
(5, 4, 0)
(1, 43, 0)
(4, 0, 0)
(4, 95, 1)
(4, 95, 1)
......
......
......

My attempt so far; I really don't know how to continue...

with open('/Users/xccxken/Dropbox/inf.txt') as f:
    content = f.readlines()
content = [x.strip() for x in content]
for x in content:
    pass  # stuck here: not sure how to parse each line

1 Answer:

Answer 0 (score: 0)

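Let's say you know the number of rows (M) and columns (N) in the text file. A simple parse for the largest dtype number and the largest Label(no)/Variable(no) number would give you this: M is the largest dtype number plus one, because rows start at 0, and N is the largest Label/Variable number. Next, create an M x N array: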

import re
import pandas as pd

# Assumes the number of rows M and the number of columns N are already known.
M = 4
N = 6
# Create an M x N list of lists filled with 'na'.
# data = [['na'] * N] * M would not work: every row would be the same list object.
data = [['na'] * N for _ in range(M)]
index_x = -1  # no 'dtype' line seen yet; also avoids a NameError below

# Matches e.g. "Label4, 95" or "Variable6, 142"; VNAME lines are skipped.
pattern = re.compile(r'(?:Label|Variable)(\d+), (\d+)')

with open('/Users/xccxken/Dropbox/inf.txt') as fh:
    for line in fh:
        line = line.strip()
        if 'dtype' in line:
            # Row index: the number at the end of the 'dtype' line.
            # It applies to the lines that follow it.
            index_x = int(line.split()[-1])
            continue
        c = pattern.search(line)
        if c and index_x >= 0:
            # Column names start at 1, Python lists at 0.
            index_y = int(c.group(1)) - 1
            value = int(c.group(2))
            # Populate the correct (row, column) cell.
            data[index_x][index_y] = value

# Create the column names 1..N and the DataFrame.
cols = range(1, N + 1)
df = pd.DataFrame(data, columns=cols)
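
The answer takes M and N as given. A minimal sketch of the "simple parse" it mentions, assuming rows are numbered from 0 (so M is the largest dtype number plus one) and N is the largest number attached to a Label or Variable name. The helper names `infer_shape` and `make_grid` are hypothetical and not part of the original answer:

```python
import re

# Hypothetical helpers for illustration; not part of the original answer.
DTYPE_RE = re.compile(r'^dtype: object (\d+)$')
ENTRY_RE = re.compile(r'\((?:Label|Variable)(\d+), (\d+)\)')


def infer_shape(lines):
    """Return (M, N): the row and column counts implied by the lines."""
    max_row = -1
    max_col = 0
    for raw in lines:
        line = raw.strip()
        m = DTYPE_RE.match(line)
        if m:
            max_row = max(max_row, int(m.group(1)))
            continue
        e = ENTRY_RE.search(line)
        if e:
            max_col = max(max_col, int(e.group(1)))
    # Rows are 0-based, so the count is the largest index plus one.
    return max_row + 1, max_col


def make_grid(m, n, fill='na'):
    """Build m independent rows; [[fill] * n] * m would alias one row m times."""
    return [[fill] * n for _ in range(m)]


sample = [
    "bene_id_18900    (Variable1, 43)",
    "dtype: object 0",
    "from         (VNAME6, 94)",
    "last day on claim billing statement     (Label6, 142)",
    "dtype: object 3",
]
M, N = infer_shape(sample)
grid = make_grid(M, N)
```

The list comprehension in `make_grid` gives each row its own list, which is why assigning to one cell does not change the other rows.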

Hope this helps; it works for me. I used this as the sample input:

dtype: object 0
encrypted 723 beneficiary id    (Label1, 43)
encrypted 723 beneficiary id    (Label5, 4)
dtype: object 0
bene_id_18900    (Variable1, 43)
bene_id_18900    (Variable4, 0)
dtype: object 0
from      (Variable4, 95)
from         (VNAME4, 95)
from      (Variable6, 94)
from         (VNAME6, 94)
dtype: object 2
first day on claim billing statement      (Label4, 95)
first day on claim billing statement      (Label6, 94)
dtype: object 2
thru     (Variable4, 140)
thru        (VNAME4, 140)
thru     (Variable6, 142)
thru        (VNAME6, 142)
dtype: object 3
last day on claim billing statement     (Label4, 140)
last day on claim billing statement     (Label6, 142)
dtype: object 3

and the output is:

    1   2   3    4   5    6
0  43  na  na   95   4   94
1  na  na  na   na  na   na
2  na  na  na  140  na  142
3  na  na  na  140  na  142

Just FYI, I think these are also valid data:

dtype: object 0
from      (Variable4, 95) # is valid
from         (VNAME4, 95)
from      (Variable6, 94)
from         (VNAME6, 94) # is valid
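
The question describes each dtype: object line as closing the group of lines above it, while the answer's sample starts with a dtype line and applies it to the lines that follow. A minimal sketch of the first reading, assuming that layout, which also collects the (column, value, row) triples the question proposes. The names `parse_triples` and `to_grid` are hypothetical:

```python
import re

DTYPE_RE = re.compile(r'^dtype: object (\d+)$')
ENTRY_RE = re.compile(r'\((?:Label|Variable)(\d+), (\d+)\)')


def parse_triples(lines):
    """Return (column, value, row) triples; the row comes from the NEXT dtype line."""
    pending = []  # entries waiting for their closing dtype line
    triples = []
    for raw in lines:
        line = raw.strip()
        m = DTYPE_RE.match(line)
        if m:
            row = int(m.group(1))
            triples.extend((col, val, row) for col, val in pending)
            pending = []
            continue
        e = ENTRY_RE.search(line)
        if e:  # VNAME lines do not match and are skipped
            pending.append((int(e.group(1)), int(e.group(2))))
    return triples  # entries with no closing dtype line are dropped


def to_grid(triples, fill='na'):
    """Place each value at (row, column - 1); assumes at least one triple."""
    n_rows = max(r for _, _, r in triples) + 1
    n_cols = max(c for c, _, _ in triples)
    grid = [[fill] * n_cols for _ in range(n_rows)]
    for col, val, row in triples:
        grid[row][col - 1] = val  # later entries overwrite earlier ones
    return grid


text = """\
bene_id_18900    (Variable1, 43)
bene_id_18900    (Variable4, 0)
dtype: object 0
encrypted 723 beneficiary id    (Label1, 43)
encrypted 723 beneficiary id    (Label5, 4)
dtype: object 0
from      (Variable4, 95)
from         (VNAME4, 95)
from      (Variable6, 94)
dtype: object 2
thru     (Variable4, 140)
thru     (Variable6, 142)
dtype: object 3
"""
triples = parse_triples(text.splitlines())
grid = to_grid(triples)
```

With pandas available, `pd.DataFrame(grid, columns=range(1, len(grid[0]) + 1))` would turn the grid into a frame with columns 1 to 6 and rows 0 to 3.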