Question

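I want to read a TSV file into a NumPy array. Is there a general way to read data from a file and convert it into a NumPy float array? (The file also has a few missing values.)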

The file looks like this:

Variable_1 ..... Variable_100
 0.001     ..... 0.25
  ...            ...
 1.65      ..... 1.32

I tried:

def converter(x): 
   return float(x)

data = np.genfromtxt(fname="file.tsv", delimiter="\t", skip_header=0, names=True, converters={"Variable_" + str(n):converter for n in range(1554)})

However, after reading the file, the result is a 1D array rather than an ndarray with 200 rows and 100 columns:

data.shape
(200,)

Answer 1

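Without knowing exactly what file.tsv looks like, you can use the pandas read_csv method to read the .tsv file into memory as a DataFrame, then access the DataFrame's .values attribute, which returns the array of interest: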

import pandas as pd
import numpy as np

# make a dummy .tsv file, save it to disk
dummy = pd.DataFrame(np.random.randint(0,10,(200,100)))
save_path = "foo.tsv"
dummy.to_csv(save_path, index=False, sep="\t")

df = pd.read_csv(save_path, sep="\t")   # read dummy .tsv file into memory

a = df.values  # access the numpy array containing values

You will now have an array of shape (200, 100):

print(a.shape)
print(a)

(200, 100)
[[4 1 8 ... 2 7 0]
 [0 1 9 ... 7 1 3]
 [7 6 6 ... 9 0 2]
 ...
 [1 5 1 ... 1 8 7]
 [7 4 6 ... 9 6 0]
 [2 0 1 ... 3 2 9]]

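You mentioned that the original .tsv file has missing values. To handle that, you can use pandas' fillna method to fill values in a specific column or in the entire DataFrame: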

df["col_1"] = df["col_1"].fillna(1)  # fill missing values with 1 in a single column ("col_1" is a placeholder column name)

df = df.fillna(1)  # fill all missing values with 1 in the entire frame
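
As a rough illustration of the fillna step, here is a minimal sketch that reads a tiny TSV with two empty cells and converts it to a float array. The in-memory text, the column names, and the fill value 0.0 are assumptions made for the example, not taken from the real file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical 3x3 TSV with two empty cells, standing in for the real file.
tsv_text = (
    "Variable_1\tVariable_2\tVariable_3\n"
    "0.001\t\t0.25\n"
    "1.5\t2.5\t\n"
    "1.65\t0.7\t1.32\n"
)

df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Empty cells are parsed as NaN; count them before deciding how to fill.
n_missing = int(df.isna().sum().sum())

# Option 1: keep NaN in the float array.
with_nan = df.to_numpy(dtype=float)

# Option 2: replace NaN with a chosen value (0.0 is an arbitrary choice here).
filled = df.fillna(0.0).to_numpy(dtype=float)

print(n_missing, filled.shape, filled.dtype)
```

Whether to keep NaN or fill it depends on what the downstream computation expects; functions such as np.nanmean can work with the NaN version directly.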

Update

The OP asked for a solution using only numpy's genfromtxt(). In that case, the following is what's needed:

data = np.genfromtxt(fname="foo.tsv", delimiter="\t", skip_header=1, filling_values=1)  # change filling_values as req'd to fill in missing values

print(data.shape)  # (200, 100)
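
The likely reason the original attempt returned shape (200,) is that names=True makes genfromtxt build a structured array, which has one record per row and is therefore 1D. The sketch below contrasts the two calls on a small in-memory TSV; the sample text and the fill value 0.0 are assumptions for illustration only.

```python
import io

import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical 2x3 TSV with one empty cell, standing in for the real file.
tsv_text = (
    "Variable_1\tVariable_2\tVariable_3\n"
    "0.001\t\t0.25\n"
    "1.65\t0.7\t1.32\n"
)

# names=True: the header becomes field names and the result is a 1D structured array.
structured = np.genfromtxt(io.StringIO(tsv_text), delimiter="\t", names=True)

# skip_header=1: the header is skipped and the result is a plain 2D float array.
plain = np.genfromtxt(io.StringIO(tsv_text), delimiter="\t", skip_header=1)

# Same call, but replacing missing cells with 0.0 instead of the default NaN.
filled = np.genfromtxt(
    io.StringIO(tsv_text), delimiter="\t", skip_header=1, filling_values=0.0
)

# If the field names are still wanted, the structured array can be flattened afterwards.
as_2d = rfn.structured_to_unstructured(structured)

print(structured.shape, structured.dtype.names)
print(plain.shape, filled[0, 1], as_2d.shape)
```

Skipping the header is the simpler route when only the numbers matter; the structured form is mainly useful when columns need to be accessed by name.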
