Question

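I have a large number of .txt files (possibly around 10 million), each with the same number of rows and columns. They are actually single-channel images whose pixel values are separated by spaces. This is the code I wrote to do the job, but it is very slow. I would like to know whether anyone can suggest a more optimized/efficient approach: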

require 'torch'

f = assert(io.open(txtFilePath, 'r'))
local tempTensor = torch.Tensor(1, 64, 64):fill(0)
local i = 1
for line in f:lines() do
    local l = line:split(' ')
    for key, val in ipairs(l) do
        tempTensor[{1, i, key}] = tonumber(val)
    end
    i = i + 1
end
f:close()

Answer 1

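In short: change the source files if you can.

The only thing I can suggest is to use binary data instead of text as the source. You are calling slow methods: f:lines(), line:split(' ') and tonumber(val). All of them operate on strings.

As I understand it, your files look like this: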

0 10 20

11 18 22

...

So, convert your source to binary, like this:

<0><10><20><11><18><22>...

where each <n> is a single byte holding that value: <18> is the byte 12 in hexadecimal, <20> is 14, and so on.
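The conversion described above could be done once per file with something like the sketch below. It is only an illustration: `textToBinary` is a hypothetical helper name, and it assumes every pixel value is an integer in the range 0 to 255 so that it fits in one byte. If your values are larger or fractional, a different encoding would be needed.

```lua
-- Hypothetical helper: convert one space-separated text image into raw bytes.
-- Assumption: every value is an integer in 0..255 (one byte per pixel).
local function textToBinary(txtPath, binPath)
    local fin = assert(io.open(txtPath, 'r'))
    local text = fin:read('*a')          -- read the whole file in one call
    fin:close()

    local bytes = {}
    for token in text:gmatch('%S+') do   -- every run of non-whitespace characters
        local v = tonumber(token)
        assert(v and v >= 0 and v <= 255 and v % 1 == 0,
               'value does not fit in one byte: ' .. token)
        bytes[#bytes + 1] = string.char(v)
    end

    local fout = assert(io.open(binPath, 'wb'))
    fout:write(table.concat(bytes))
    fout:close()
    return #bytes                        -- number of pixels written
end
```

The slow string parsing still happens here, but only once per file instead of on every load.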

Reading:

local fid = assert(io.open(sup_filename, "rb"))
while true do
  local bytes = fid:read(1)        -- read one byte as a 1-character string
  if bytes == nil then break end   -- EOF
  local st = bytes:byte()          -- numeric value (0-255) of that byte
  print(st)
end
fid:close()

See https://www.lua.org/pil/21.2.2.html. This will be much faster.
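Since every file has the same dimensions, the whole image can probably be fetched with a single read call rather than byte by byte. The sketch below assumes one byte per pixel and a known size; `readBinaryImage` is a hypothetical helper name, not part of Lua or Torch.

```lua
-- Hypothetical helper: load a raw one-byte-per-pixel image into a flat Lua table.
-- Assumption: rows and cols are known in advance (64 x 64 in the question).
local function readBinaryImage(binPath, rows, cols)
    local fid = assert(io.open(binPath, 'rb'))
    local data = fid:read(rows * cols)   -- one read call for the whole image
    fid:close()
    assert(data and #data == rows * cols, 'unexpected file size: ' .. binPath)

    local pixels = {}
    for i = 1, #data do
        pixels[i] = data:byte(i)         -- numeric value of the i-th byte
    end
    return pixels
end
```

With Torch loaded, `torch.Tensor(readBinaryImage(path, 64, 64)):view(1, 64, 64)` should give a tensor of the same shape as the one in the question. I have not benchmarked this, so treat it as a starting point.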

Using regular expressions (instead of :split() and lines()) might help you, but I doubt it.
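If the files have to stay as text, the pattern-based variant might look roughly like this. Lua has patterns rather than full regular expressions, so the sketch uses `string.gmatch`; `readTextImage` is a hypothetical helper name, and the sketch assumes the values are separated only by whitespace.

```lua
-- Hypothetical helper: parse a space-separated text image in one gmatch pass,
-- without f:lines() or line:split(' ').
local function readTextImage(txtPath)
    local f = assert(io.open(txtPath, 'r'))
    local text = f:read('*a')            -- read the whole file in one call
    f:close()

    local pixels = {}
    for token in text:gmatch('%S+') do   -- every run of non-whitespace characters
        pixels[#pixels + 1] = tonumber(token)
    end
    return pixels                        -- flat table in row-major order
end
```

This still goes through tonumber() for every value, so any gain over the original loop is likely to be modest.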
