I am trying to parse a large CSV file in J, and this is the line-splitting routine I came up with:
splitlines =: 3 : 0
NB. y is the input string
nl_positions =. (y = (10 { a.)) NB. 1 if the character in that position is a newline, 0 otherwise
nl_idx =. (# i.@#) nl_positions NB. A list of newline indexes in the input string
prev_idx =. (# nl_idx) {. 0 , nl_idx NB. The list above, shifted one position to the right, with 0 as the first element
result =. ''
for_i. nl_idx do. NB. For each newline
to_drop =. i_index { prev_idx NB. The number of characters from the start of the string to skip
to_take =. i - to_drop NB. The number of characters in the current line
result =. result , < (to_take {. to_drop }. y) NB. Take the current line, box it and add to the result
end.
)
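However, it is really slow. The performance monitor shows that line 8 takes the most time, probably because of all the memory allocation involved in dropping and taking elements and in extending the result list: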
Time (seconds)
┌────────┬────────┬─────┬─────────────────────────────────────────┐
│all │here │rep │splitlines │
├────────┼────────┼─────┼─────────────────────────────────────────┤
│0.000011│0.000011│ 1│monad │
│0.003776│0.003776│ 1│[1] nl_positions=.(y=(10{a.)) │
│0.012429│0.012429│ 1│[2] nl_idx=.(#i.@#)nl_positions │
│0.000144│0.000144│ 1│[3] prev_idx =.(#nl_idx){.0,nl_idx │
│0.000002│0.000002│ 1│[4] result=.'' │
│0.027566│0.027566│ 1│[5] for_i. nl_idx do. │
│0.940466│0.940466│20641│[6] to_drop=.i_index{prev_idx │
│0.011238│0.011238│20641│[7] to_take=.i-to_drop │
│4.310495│4.310495│20641│[8] result=.result,<(to_take{.to_drop}.y)│
│0.006926│0.006926│20641│[9] end. │
│5.313052│5.313052│ 1│total monad │
└────────┴────────┴─────┴─────────────────────────────────────────┘
Is there a better way? I am looking for an approach that avoids the explicit for loop.

Answer 0 (score: 4)
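If I understand correctly, for now you just want to split a string containing multiple lines into separate lines. (I assume splitting the lines into fields will be the next step?)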
For most of what you want to do, the workhorse primitive is cut (;.). For example:
<;._2 InputString NB. box each segment terminated by the last character in the string
<;._1 InputString NB. box each segment of InputString starting with the first character in the string
cut;._2 InputString NB. apply cut to each line: box the words of the line, separated by 1 or more spaces
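To make the first idiom concrete, here is a minimal sketch of a loop-free replacement for splitlines. The names sample, splitlines2, lines and nofinal are made up for illustration, and the guard that appends a missing final LF is my own addition rather than part of the answer above.

```j
NB. Hypothetical sample input: three lines, each terminated by LF.
sample =: 'a,b,c', LF, 'd,e,f', LF, 'g,h,i', LF

NB. Loop-free sketch of splitlines (splitlines2 is an illustrative name).
NB. (, LF -. {:) appends LF only when the input does not already end with one,
NB. then <;._2 boxes each LF-terminated segment, dropping the LF itself.
splitlines2 =: <;._2 @ (, LF -. {:)

lines =: splitlines2 sample            NB. 3 boxes: a,b,c  d,e,f  g,h,i
nofinal =: splitlines2 'ab', LF, 'cd'  NB. 2 boxes even without a trailing LF
```

Because cut builds all the boxes in one primitive call, the result list is not repeatedly extended the way it is in the for loop, which is where the profile shows the time going.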
Other related resources you may find useful are splitstring, freads, and the tables/dsv and tables/csv addons. freads and splitstring are both available in the standard library (after J6).
'b' freads 'myfile.txt' NB. returns contents of myfile.txt boxed by the last character (equivalent to <;._2 freads 'myfile.txt')
'","' splitstring InputString NB. boxed sub-strings of input string delimited by left argument
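As a rough illustration of how these pieces might combine, the sketch below splits an in-memory string into lines and then each line into fields with splitstring. The sample data and the names sample, lines, fields and first are hypothetical, and it assumes a plain comma delimiter with no quoted fields.

```j
NB. Hypothetical sample input; in practice this could come from freads.
sample =: 'one,two', LF, 'three,four', LF

lines  =: <;._2 sample                 NB. boxed lines: one,two  three,four
fields =: ',' splitstring each lines   NB. each box now holds a boxed list of fields

first =: > {. fields                   NB. fields of the first line: one  two
```

For real CSV data with quoted fields or embedded delimiters, the addons described next are probably the safer choice.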
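The tables/dsv and tables/csv addons can be installed using the Package Manager. Once installed, they can be used to split a file into lines and the lines into fields, as follows: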
require 'tables/csv'
readcsv 'myfile.csv'
',' readdsv 'myfile.txt'
TAB readdsv 'myfile.txt'