I have a .txt file of the following shape. Inconveniently, unknown values are simply left blank:
----Header---
Description,
a few lines of description
Still description
# RESIDUE AA STRUCTURE BP1 BP2
1 79 A G 0 0 97
2 80 A A - 0 0 28
3 81 A V E -A 134 0A 53
4 82 A F E -A 133 0A 6
5 83 A K E -A 132 0A 52
11 ! 0 0 0
12 101 A D H 0 0 137
I would like to extract the 2nd, 4th and 5th columns, taking the missing values into account. So what I want is:
function(textfile,1,3,4)
>[79,80,81,82,83,"",101]
>["G","A","V","F","K","!","D"]
>["","","E","E","E","","H"]
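The exact shape of the output does not matter, e.g. an n x 3 array or similar. Because of the poor choice of leaving unknown values blank, I can't use np.loadtxt, since it immediately skips to the next column.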
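To illustrate the column shift described above, here is a minimal sketch of what any whitespace-based split does to rows with blank fields. The three rows and their field widths (4, 5, 2, 2, 3) are assumptions made for the illustration, not the exact layout of the real file.

```python
# Three rows in an assumed fixed-width layout (field widths 4, 5, 2, 2, 3)
rows = [
    "   3" + "   81" + " A" + " V" + "  E",  # every field present
    "   1" + "   79" + " A" + " G" + "   ",  # STRUCTURE left blank
    "  11" + "     " + "  " + " !" + "   ",  # RESIDUE, chain and STRUCTURE left blank
]

# str.split() with no argument treats any run of whitespace as one separator,
# so a blank field simply vanishes and later values move left
split_fields = [row.split() for row in rows]
for fields in split_fields:
    print(fields)
```

In the last row the '!' lands at index 1, where the residue number is expected. Slicing each row by character position instead of splitting on whitespace avoids this.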
Answer 0 (score: 0)
Have you tried pandas.read_csv with the separator set to whitespace?
e.g.
pandas.read_csv('filename.txt', sep=r'\s+')
You also appear to be missing column names.
Answer 1 (score: 0)
You could use Pandas to do this as follows:
import pandas as pd
# Read the file as fixed-width fields and keep only the RESIDUE number, AA and STRUCTURE columns
print(pd.read_fwf('input.txt', widths=(4, 5, 2, 2, 3, 7, 5, 6, 5), usecols=[1, 3, 4], skiprows=6, header=None))
This would display:
       1  3    4
0   79.0  G  NaN
1   80.0  A  NaN
2   81.0  V    E
3   82.0  F    E
4   83.0  K    E
5    NaN  !  NaN
6  101.0  D    H
Alternatively, you could extract the necessary columns by hand as follows:
import itertools
# (start, end) character offsets of the columns to extract
col_locations = [(3, 8), (11, 12), (13, 15)]
with open('input.txt') as f_input:
    # Skip over initial lines until the header row
    next(itertools.dropwhile(lambda x: "RESIDUE" not in x, f_input))
    lines = [row.rstrip() for row in f_input]
data = []
for row in lines:
    data.append([row[start:end].strip() for start, end in col_locations])
data = list(zip(*data))  # Transpose the data
print(data)
This would give you a list as follows:
[('79', '80', '81', '82', '83', '', '101'), ('G', 'A', 'V', 'F', 'K', '!', 'D'), ('', '', 'E', 'E', 'E', '', 'H')]
If you do want the first column converted to numbers, you could apply a per-column conversion function as follows:
import itertools
def num_convert(x):
    # Return an int where possible, otherwise an empty string for a blank field
    try:
        return int(x)
    except ValueError:
        return ''
# (start, end, conversion) for each column to extract
col_locations = [(3, 8, num_convert), (11, 12, str.strip), (13, 15, str.strip)]
with open('input.txt') as f_input:
    # Skip over initial lines until the header row
    next(itertools.dropwhile(lambda x: "RESIDUE" not in x, f_input))
    lines = [row.rstrip() for row in f_input]
data = []
for row in lines:
    data.append([conversion(row[start:end]) for start, end, conversion in col_locations])
data = list(zip(*data))  # Transpose the data
print(data)
Giving you:
[(79, 80, 81, 82, 83, '', 101), ('G', 'A', 'V', 'F', 'K', '!', 'D'), ('', '', 'E', 'E', 'E', '', 'H')]
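The manual approach can also be wrapped in a function shaped like the `function(textfile,1,3,4)` call from the question. The following is only a sketch: `extract_columns` and `make_row` are hypothetical names, the widths are the first five from the read_fwf call in this answer, and the sample rows are built in code so their alignment is an assumption rather than the real file.

```python
import io

WIDTHS = (4, 5, 2, 2, 3)  # assumed: the first five widths used with read_fwf


def extract_columns(lines, widths, wanted, header_marker="RESIDUE"):
    """Slice fixed-width rows after the header line; return one tuple per wanted column."""
    # Turn the widths into (start, end) character offsets
    bounds = []
    start = 0
    for width in widths:
        bounds.append((start, start + width))
        start += width

    lines = iter(lines)
    # Skip everything up to and including the header row
    for line in lines:
        if header_marker in line:
            break

    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore empty lines
        # A slice past the end of a short line yields '', so blanks stay blank
        rows.append([line[bounds[i][0]:bounds[i][1]].strip() for i in wanted])
    return list(zip(*rows))  # transpose rows into columns


def make_row(*fields):
    """Hypothetical helper: right-align each field to its assumed width."""
    return "".join(str(f).rjust(w) for f, w in zip(fields, WIDTHS))


sample = io.StringIO("\n".join([
    "----Header---",
    "Description,",
    "#  RESIDUE AA STRUCTURE",
    make_row(1, 79, "A", "G", ""),
    make_row(3, 81, "A", "V", "E"),
    make_row(11, "", "", "!", ""),
]) + "\n")

columns = extract_columns(sample, WIDTHS, wanted=(1, 3, 4))
print(columns)
```

An open file object can be passed in place of the `io.StringIO` sample, since the function only needs an iterable of lines.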
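Answer 2 (score: 0)
You can use the struct module:
import struct
# Sample row; the exact spacing is assumed so that it matches the 42-byte format below
line = '     5 83 A K  E     -A     132   0A    52'
# struct works on bytes in Python 3: pad or cut to 42 bytes, then unpack fixed-size fields
fields = struct.unpack("6s3s2s3s6s4s7s5s6s", line[:42].ljust(42).encode())
extracted_line = [field.decode().strip() for field in fields]
print(extracted_line)
Some adjustment may be needed, because I don't know whether the values shift left or right as they grow. But this is one way to do it.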
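As a sketch of the adjustment mentioned above, the format string can be generated from a list of field widths, and short rows can be padded so that rows with trailing blanks still unpack. The widths below and the name `unpack_row` are assumptions for illustration; the real file may use a different layout.

```python
import struct

WIDTHS = (4, 5, 2, 2, 3)  # assumed field widths; adjust to the real file
RECORD = struct.Struct("".join(f"{w}s" for w in WIDTHS))  # "4s5s2s2s3s", 16 bytes


def unpack_row(line):
    """Split one fixed-width text row into stripped string fields."""
    raw = line.rstrip("\n").encode("ascii")
    # Cut or pad with spaces to the record size so short rows do not raise struct.error
    raw = raw[:RECORD.size].ljust(RECORD.size)
    return [field.decode("ascii").strip() for field in RECORD.unpack(raw)]


# Sample rows assembled piece by piece so the assumed alignment is explicit
full_row = "   3" + "   81" + " A" + " V" + "  E"
short_row = "  11" + "     " + "  " + " !"  # trailing STRUCTURE field missing

print(unpack_row(full_row))
print(unpack_row(short_row))
```

Compiling the format once with `struct.Struct` avoids re-parsing it for every row, which matters only for large files.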