Question

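I'm interested in splitting a line using a regular expression in Julia. My input is a corpus in Blei's LDA-C format, consisting of docId wordID:wordCNT. For example, a document with five words is represented as follows: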

186 0:1 12:1 15:2 3:1 4:1

I'm looking for a way to aggregate the words and their counts into separate arrays, i.e. my desired output is:

words =  [0, 12, 15, 3, 4]
counts = [1,  1,  2, 1, 1]

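I have tried m = match(r"(\d+):(\d+)", line). However, it only finds the first pair, 0:1. I'm looking for something similar to Python's re.compile(r'[ :]').split(line). How would I split a line based on a regular expression in Julia?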

Answer 1

There's no need for a regex here; Julia's split function accepts a collection of characters, any of which marks a split point:

julia> v = split(line, [':',' '])
11-element Array{SubString{String},1}:
 "186"
 "0"
 "1"
 "12"
 "1"
 "15"
 "2"
 "3"
 "1"
 "4"
 "1"

julia> words = v[2:2:end]
5-element Array{SubString{String},1}:
 "0"
 "12"
 "15"
 "3"
 "4"

julia> counts = v[3:2:end]
5-element Array{SubString{String},1}:
 "1"
 "1"
 "2"
 "1"
 "1"
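
The question asks for integer arrays, while the slices above are still substrings. A minimal sketch of the conversion follows, assuming every field in the line is a valid integer; the broadcast `parse.` call and the `docid` name are additions for illustration, not part of the original answer:

```julia
line = "186 0:1 12:1 15:2 3:1 4:1"

# Split on either ':' or ' ' and convert every field to Int in one broadcast.
v = parse.(Int, split(line, [':', ' ']))

docid  = v[1]          # leading field of the line
words  = v[2:2:end]    # every second field, starting at the 2nd
counts = v[3:2:end]    # every second field, starting at the 3rd
```

If a field might not be numeric, `tryparse` could be used instead of `parse` so that bad fields show up as `nothing` rather than raising an error.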

Answer 2

I found that the eachmatch method returns an iterator over the regex matches. An alternative solution is to iterate over each match:

words, counts = Int64[], Int64[]
for m in eachmatch(r"(\d+):(\d+)", line)
    wd, cnt = m.captures
    push!(words,  parse(Int64, wd))
    push!(counts, parse(Int64, cnt))
end
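
To apply the loop above to every line of a corpus, it may help to wrap it in a function. This is only a sketch: `parse_ldac_line` is a hypothetical name, and it assumes that only the `wordID:wordCNT` pairs are wanted, so the leading field (which has no colon) is skipped by the pattern.

```julia
# Hypothetical helper: parse one LDA-C style line into word IDs and counts.
function parse_ldac_line(line::AbstractString)
    words, counts = Int[], Int[]
    for m in eachmatch(r"(\d+):(\d+)", line)
        push!(words,  parse(Int, m[1]))   # first capture group: word ID
        push!(counts, parse(Int, m[2]))   # second capture group: count
    end
    return words, counts
end

words, counts = parse_ldac_line("186 0:1 12:1 15:2 3:1 4:1")
```

Indexing the match with `m[1]` and `m[2]` reads the same capture groups as destructuring `m.captures`.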

Answer 3

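As Matt B. mentions, a regex isn't needed here, because Julia's library split() can take an array of characters.

However, when a Regex is needed, the same split() function just works, much as others have suggested: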

line = "186 0:1 12:1 15:2 3:1 4:1"
s = split(line, r":| ")
words = s[2:2:end]
counts = s[3:2:end]
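
As a variation on the snippet above, a character class mirrors the Python pattern from the question more closely. The sketch below assumes Julia 1.0 or later (for the `keepempty` keyword) and adds repeated spaces to the input purely for illustration, to show how empty fields can be dropped:

```julia
line = "186  0:1 12:1   15:2 3:1 4:1"   # note the repeated spaces

# r"[ :]" matches a single space or colon, like Python's r'[ :]'.
# keepempty=false discards the empty strings produced between adjacent delimiters.
s = split(line, r"[ :]"; keepempty=false)

words  = parse.(Int, s[2:2:end])
counts = parse.(Int, s[3:2:end])
```

Without `keepempty=false`, the extra spaces would yield empty substrings and shift the even/odd positions that the slicing relies on.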

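I recently had to do this in some Unicode-processing code, where the split characters were "combining characters" and therefore could not be written as Julia 'single-quoted' character literals. That meant building the pattern from a list of strings: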

split_chars = ["bunch","of","random","delims"]
line = "line_with_these_delims_in_the_middle"
r_split = Regex( join(split_chars, "|") )
split( line, r_split )
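
One caveat with joining delimiters using "|": any delimiter that contains regex metacharacters would be interpreted as a pattern rather than as literal text. A possible workaround, sketched here under the assumption that no delimiter contains the sequence `\E`, is to wrap each one in PCRE's `\Q...\E` literal quoting. The delimiters and input string below are hypothetical:

```julia
# Hypothetical delimiters that contain regex metacharacters.
split_chars = ["||", ".", "->"]

# \Q...\E makes PCRE treat the enclosed text literally
# (assumes no delimiter itself contains the sequence \E).
r_split = Regex(join(("\\Q" * d * "\\E" for d in split_chars), "|"))

parts = split("a||b.c->d", r_split)
```

Here `parts` holds the four single-letter fields; without the quoting, the unescaped "." alone would match every character.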

Splitting a line based on a regular expression in Julia
