Question

I want to extract some information from a text file (named inf.txt) and use it to build a DataFrame in Python. An example of inf.txt:

bene_id_18900    (Variable1, 43)
bene_id_18900    (Variable4, 0)
dtype: object 0
encrypted 723 beneficiary id    (Label1, 43)
encrypted 723 beneficiary id    (Label5, 4)
dtype: object 0
bene_id_18900    (Variable1, 43)
bene_id_18900    (Variable4, 0)
dtype: object 0
from      (Variable4, 95)
from         (VNAME4, 95)
from      (Variable6, 94)
from         (VNAME6, 94)
dtype: object 2
first day on claim billing statement      (Label4, 95)
first day on claim billing statement      (Label6, 94)
dtype: object 2
thru     (Variable4, 140)
thru        (VNAME4, 140)
thru     (Variable6, 142)
thru        (VNAME6, 142)
dtype: object 3
last day on claim billing statement     (Label4, 140)
last day on claim billing statement     (Label6, 142)
dtype: object 3

The desired result is:

    1   2   3   4   5   6
0   43  na  na  0   4   na
1   na  na  na  na  na  na
2   4   5   na  95  na  94
3   na  na  na  140 na  142

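The row number comes from the number after dtype: object, the column number comes from the digit in the Variable/Label name, and the cell value is the second item in each pair of parentheses.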

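For example, take (Variable1, 43) on the first line: it belongs to dtype: object 0, so it goes in row 0; it is Variable1, so it goes in column 1.

As another example, take (Label6, 142) on the second-to-last line: it belongs to dtype: object 3, so it goes in row 3; it is Label6, so it goes in column 6.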

Strings such as "bene_id_18900", "Variable", and "Label" carry no meaning in themselves.

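My idea is to add the corresponding row number to each pair of parentheses, so that later I can keep all the useful information and drop everything else. Like this: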

(1, 43, 0)
(4, 0, 0)
(1, 43, 0)
(5, 4, 0)
(1, 43, 0)
(4, 0, 0)
(4, 95, 1)
(4, 95, 1)
......
......
......
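
The tagging idea described above could be sketched roughly as follows. This is only an illustration under some assumptions: the helper name to_triples and the two regular expressions are made up for this example, every name inside the parentheses is assumed to be letters followed by the column number, and each dtype: object line is assumed to close the group of lines directly above it.

```python
import re

# "(Variable4, 95)", "(Label6, 142)" and "(VNAME4, 95)" all match:
# letters, then the column number, then the value.
PAIR = re.compile(r"\(\s*[A-Za-z_]+(\d+),\s*(\d+)\s*\)")
# "dtype: object 2" closes the group of lines above it.
FOOTER = re.compile(r"dtype: object (\d+)")

def to_triples(lines):
    """Return (column, value, row) tuples for every parenthesised pair."""
    triples, pending = [], []
    for line in lines:
        footer = FOOTER.search(line)
        if footer:
            # The row number is only known once the dtype line is reached,
            # so the buffered pairs are tagged here.
            row = int(footer.group(1))
            triples.extend((col, val, row) for col, val in pending)
            pending = []
            continue
        pair = PAIR.search(line)
        if pair:
            pending.append((int(pair.group(1)), int(pair.group(2))))
    return triples

sample = [
    "bene_id_18900    (Variable1, 43)",
    "bene_id_18900    (Variable4, 0)",
    "dtype: object 0",
    "from      (Variable4, 95)",
    "from         (VNAME4, 95)",
    "dtype: object 2",
]
print(to_triples(sample))
```

Pairs that are never followed by a dtype line are dropped in this sketch, since their row number is unknown.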

My attempt so far; I really don't know how to proceed...

with open('/Users/xccxken/Dropbox/inf.txt') as f:
    content = f.readlines()
content = [x.strip() for x in content] 
for x in content:
    pass  # stuck here

Answer 1

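Let's say you know the number of rows (M) and columns (N) in the text file. A simple parse that finds the largest dtype number and the largest Label/Variable number will give you this information. Next, create an M x N array: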

import re
import pandas as pd
# assuming that you have found the max no of rows M and max no of columns N.
M = 4
N = 6
# create MxN list of lists with values 'na'
x = ['na'] * N
data = []
for i in range(M):
    tmp = list(x)
    data.append(tmp)
index_x = -999 # fix for NameError
# data = [x] * M; this does not work since lists are mutable objects

with open('/Users/xccxken/Dropbox/inf.txt') as fh:
    for line in fh:
        line = line.strip()
        if 'dtype' in line:
            # get the x axis index
            index_x = int(line.split(' ')[-1])
        if 'Label' in line:
            # get y axis index
            c = re.search(r'Label(\d+), (\d+)', line)
            index_y = int(c.groups()[0])
            # reduce index_y by 1 as the col names start with 1 and python list is 0 index
            if index_y > 0:
                index_y -= 1
            # get value
            value = int(c.groups()[1])
            if index_x >= 0: # fix the NameError and a logical bug
                # populate the correct x,y location in the list of lists
                data[index_x][index_y] = value
        if 'Variable' in line:
            c = re.search(r'Variable(\d+), (\d+)', line)
            index_y = int(c.groups()[0])
            value = int(c.groups()[1])
            if index_y > 0:
                index_y -= 1
            if index_x >= 0: # fix the NameError and a logical bug
                data[index_x][index_y] = value
# create the col names
cols = range(1, N+1)
# create the dataframe
df = pd.DataFrame(data, columns=cols)
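
M and N are hard-coded above. The "simple parse" that finds them might look something like the sketch below. The function name grid_shape is hypothetical, and it assumes row numbers start at 0 while column numbers start at 1, as in the question's example.

```python
import re

def grid_shape(lines):
    """Return (M, N): number of rows and number of columns needed."""
    max_row = -1  # largest number seen after "dtype: object"
    max_col = 0   # largest number seen after "Label" / "Variable"
    for line in lines:
        m = re.search(r"dtype: object (\d+)", line)
        if m:
            max_row = max(max_row, int(m.group(1)))
            continue
        m = re.search(r"\((?:Label|Variable)(\d+),", line)
        if m:
            max_col = max(max_col, int(m.group(1)))
    # Rows are 0-based, so the count is max_row + 1; columns are 1-based.
    return max_row + 1, max_col

sample_lines = [
    "bene_id_18900    (Variable1, 43)",
    "dtype: object 0",
    "thru     (Variable6, 142)",
    "thru        (VNAME6, 142)",
    "dtype: object 3",
]
M, N = grid_shape(sample_lines)
print(M, N)
```

With a real file, the lines could be passed in directly, for example grid_shape(open(path)) for whatever path holds inf.txt.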

Hope this helps. It worked for me; I used this as the sample:

dtype: object 0
encrypted 723 beneficiary id    (Label1, 43)
encrypted 723 beneficiary id    (Label5, 4)
dtype: object 0
bene_id_18900    (Variable1, 43)
bene_id_18900    (Variable4, 0)
dtype: object 0
from      (Variable4, 95)
from         (VNAME4, 95)
from      (Variable6, 94)
from         (VNAME6, 94)
dtype: object 2
first day on claim billing statement      (Label4, 95)
first day on claim billing statement      (Label6, 94)
dtype: object 2
thru     (Variable4, 140)
thru        (VNAME4, 140)
thru     (Variable6, 142)
thru        (VNAME6, 142)
dtype: object 3
last day on claim billing statement     (Label4, 140)
last day on claim billing statement     (Label6, 142)
dtype: object 3

and the output is:

    1   2   3    4   5    6
0  43  na  na   95   4   94
1  na  na  na   na  na   na
2  na  na  na  140  na  142
3  na  na  na  140  na  142
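
Note that this output applies each dtype: object number to the lines that follow it, whereas in the question's file the dtype line comes after the lines it describes. If that reading of the question is right, one possible adjustment is to buffer the pairs and write them into the grid only when the closing dtype line is reached. The sketch below is an illustration, not a drop-in replacement: build_rows and its parameters are names invented here, and VNAME lines are skipped just as in the code above.

```python
import re

PAIR = re.compile(r"\((?:Label|Variable)(\d+),\s*(\d+)\)")
FOOTER = re.compile(r"dtype: object (\d+)")

def build_rows(lines, n_rows, n_cols, fill="na"):
    """Fill an n_rows x n_cols grid, treating each dtype line as a footer."""
    data = [[fill] * n_cols for _ in range(n_rows)]  # independent row lists
    pending = []  # (column, value) pairs waiting for their dtype line
    for line in lines:
        footer = FOOTER.search(line)
        if footer:
            row = int(footer.group(1))
            for col, value in pending:
                data[row][col - 1] = value  # column names are 1-based
            pending.clear()
            continue
        pair = PAIR.search(line)
        if pair:
            pending.append((int(pair.group(1)), int(pair.group(2))))
    return data

sample = """\
bene_id_18900    (Variable1, 43)
bene_id_18900    (Variable4, 0)
dtype: object 0
encrypted 723 beneficiary id    (Label1, 43)
encrypted 723 beneficiary id    (Label5, 4)
dtype: object 0
from      (Variable4, 95)
from      (Variable6, 94)
dtype: object 2
thru     (Variable4, 140)
thru     (Variable6, 142)
dtype: object 3
""".splitlines()

rows = build_rows(sample, n_rows=4, n_cols=6)
for r in rows:
    print(r)
```

The resulting list of lists can be handed to pd.DataFrame(rows, columns=range(1, 7)) in the same way as data above.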

Just FYI, I assumed these are also valid data:

dtype: object 0
from      (Variable4, 95) # is valid
from         (VNAME4, 95)
from      (Variable6, 94)
from         (VNAME6, 94) # is valid
