Matching an arbitrary number of subsequent tokens in a stream using functional style

Time: 2017-06-05 08:05:56

Tags: ruby functional-programming elixir

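The problem is as follows:

  1. There is a file containing tokens. Each token is on a separate line, accompanied by some metadata (e.g. the document ID),
  2. Some sequences of tokens should be counted; a sequence may consist of one or more tokens,
  3. The sequences are kept in a trie, but this is not a requirement,
  4. The implementation must be very efficient, since the file to process holds gigabytes of data.

My current implementation (in Ruby) is as follows: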

    require 'csv'
    require 'melisa'

    # Normalise one TSV row: strip the newline from the token, treat it as
    # raw bytes and turn the document ID into an integer.
    def convert_tuple(tuple)
      document_id, token_index, space, token = *tuple
      token = token.chomp
      token.force_encoding("ascii-8bit")
      document_id = document_id.to_i
      [document_id, token_index, space, token]
    end

    # Counts the string if it is a known sequence and reports whether the
    # index still holds any key that starts with the string.
    def count_and_match_tokens(string, index, counts, document_id, first_token_index, last_token_index)
      token_id = index[string]
      if token_id
        STDERR.puts "%s\t%s\t%s\t%s" % [document_id, first_token_index, last_token_index, string]
        counts[string] += 1
      end
      index.search(string).size > 0
    end

    counts = Hash.new(0)
    index = Melisa::IntTrie.new
    index.load(index_path)

    CSV.open(input_path, col_sep: "\t") do |input|
      input.each do |tuple|
        document_id, first_token_index, space, token = convert_tuple(tuple)
        recorded_pos = input.pos  # remember where this match attempt started
        last_token_index = first_token_index
        string = token.dup
        while(count_and_match_tokens(string, index, counts, document_id, first_token_index, last_token_index)) do
          next_tuple = input.shift
          break if next_tuple.nil?  # end of input
          last_document_id, last_token_index, space, last_token = convert_tuple(next_tuple)
          break if document_id != last_document_id
          string << " " if space == "1"
          string << last_token
        end
        input.pos = recorded_pos  # rewind so the next token can start its own match
      end
    end

    CSV.open(output_path, "w") do |output|
      counts.each do |tuple|
        output << tuple
      end
    end
    

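The convert_tuple function only performs basic transformations of the data (converting strings to numbers and so on).

The count_and_match_tokens function counts tokens and returns true if the string passed as an argument is a prefix of a different string. I use a trie structure to verify this condition efficiently.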
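The contract of that check can be sketched without a trie. The version below is only an illustration: count_and_match and the plain Set of sequences are hypothetical stand-ins, not the Melisa API, and it assumes that a string which is itself a stored sequence also counts as "still matching", which is how the code above behaves if the trie search returns the key itself.

```ruby
require 'set'

# Hypothetical, trie-free version of the check: count exact matches and
# report whether the string could still grow into a known sequence.
def count_and_match(string, sequences, counts)
  counts[string] += 1 if sequences.include?(string)
  # Linear scan; a trie answers the same question without visiting every key.
  sequences.any? { |sequence| sequence.start_with?(string) }
end

sequences = Set["Bosnia", "Bosnia and Herzegovina", "river"]
counts = Hash.new(0)

count_and_match("Bosnia", sequences, counts)      # counted; a longer sequence may follow
count_and_match("Bosnia and", sequences, counts)  # not counted; still a prefix
count_and_match("Bosnia in", sequences, counts)   # not counted; dead end
```

The linear scan is only there to make the semantics visible; for gigabytes of input the trie lookup is the part that keeps this check cheap.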

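I would like to know what a solution written in functional style would look like. The problem I am facing is that a matched sequence may span many tokens.

In Ruby (or in OO style in general) I can record the position where I started matching (recorded_pos = input.pos) and "reset" the stream when the sub-sequence matching finishes (input.pos = recorded_pos). As a result, the subsequent call to each returns the next token in the stream. So tokens inside an already recognized sequence (tokens processed inside the while loop) can also be the first matching tokens of other sub-sequences.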
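One possible way to avoid rewinding, sketched here in Ruby rather than Elixir, is to fold over the tokens and carry the still-open partial matches in the accumulator. Everything in this sketch is an assumption for illustration: the names prefix? and count_sequences, the Set standing in for the trie, and tokens already parsed into [document_id, token_index, space, token] arrays.

```ruby
require 'set'

# Hypothetical helper: true if `string` is a known sequence or a prefix of one.
# A trie would answer this without scanning every sequence.
def prefix?(string, sequences)
  sequences.any? { |sequence| sequence.start_with?(string) }
end

# Folds over tokens shaped like [document_id, token_index, space, token].
# The accumulator holds the counts and the partial matches that are still
# open, so every token is read once and the input is never rewound.
def count_sequences(tokens, sequences)
  initial = { counts: {}, open: [] }
  result = tokens.reduce(initial) do |state, (document_id, _index, space, token)|
    # Extend every open match from the same document with the new token.
    extended = state[:open]
      .select { |match| match[:document_id] == document_id }
      .map { |match| match.merge(string: match[:string] + (space == "1" ? " " : "") + token) }
    # The token itself may also start a new match.
    candidates = extended + [{ document_id: document_id, string: token }]
    matched = candidates.map { |match| match[:string] }.select { |s| sequences.include?(s) }
    counts = matched.reduce(state[:counts]) { |acc, s| acc.merge(s => acc.fetch(s, 0) + 1) }
    { counts: counts, open: candidates.select { |match| prefix?(match[:string], sequences) } }
  end
  result[:counts]
end

sequences = Set["Bosnia", "Bosnia and Herzegovina", "Herzegovina"]
tokens = [
  [0, "784", "1", "in"],
  [0, "787", "1", "Bosnia"],
  [0, "789", "1", "and"],
  [0, "791", "1", "Herzegovina"],
  [0, "793", "0", ","]
]
counts = count_sequences(tokens, sequences)
# counts maps each recognized sequence to the number of times it was seen
```

Because a new candidate is opened at every token, "Herzegovina" is counted on its own even though it also closes "Bosnia and Herzegovina". Copying the counts hash on every match keeps the fold free of mutation but is not tuned for gigabyte inputs.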

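I would appreciate a solution in Elixir, but any other functional language will do.

Edit

I have provided the definitions of count_and_match_tokens and convert_tuple, as well as sample input and output (the files are truncated, so the counts do not correspond directly to the input file).

The index data structure that appears in the code is a Marisa Trie (Melisa gem: https://github.com/wordtreefoundation/melisa/).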

Sample input:

    0   746 1   The
    0   748 1   river
    0   751 1   Bosna
    0   754 1   (
    0   763 0   )
    0   765 1   (
    0   766 0   Cyrillic
    0   767 0   :
    0   769 1   Босна
    0   770 0   )
    0   772 1   is
    0   774 1   the
    0   776 1   third
    0   778 1   longest
    0   781 1   river
    0   784 1   in
    0   787 1   Bosnia
    0   789 1   and
    0   791 1   Herzegovina
    0   793 0   ,
    0   795 1   and
    0   797 1   is
    0   799 1   considered
    0   801 1   one
    0   803 1   of
    0   805 1   the
    0   807 1   country
    0   808 0   '
    0   809 0   s
    0   811 1   three
    0   813 1   major
    0   815 1   internal
    0   817 1   rivers

Token sequences to be recognized:

    Bosnia
    Bosnia and Herzegovina
    river
    Herzegovina

Sample output:

I hope this helps to understand the problem I am trying to solve.

1 Answer:

Answer 0 (score: 1)

Runnable program (count_sequences.rb):

    #!/usr/bin/env ruby
    require 'set'

    sequence_file, token_file = ARGV

    # Every sequence to recognize, one per line of the sequence file.
    sequences = Set.new(File.readlines(sequence_file).map(&:chomp))

    # Build a trie of words as nested hashes: word => [count, children].
    forest = sequences.map(&:split).each_with_object({}) do |words, root|
      words.reduce(root) do |parent, word|
        (parent[word] ||= [0, {}])[1]
      end
    end
    #=>  {
    #      "Bosnia" => [0, {
    #        "and" => [0, {
    #          "Herzegovina" => [0, {}]
    #        }]
    #      }],
    #      "river" => [0, {}]
    #    }

    # Walk the trie token by token. A token that does not continue the
    # current path is tried as the start of a new sequence; if that fails
    # too, matching restarts from the root.
    File.open(token_file) do |f|
      current_node = forest

      f.each_line do |line|
        token = line.split.last  # the token is the last column of the line
        spec = current_node[token] || forest[token]
        if spec
          spec[0] += 1
          current_node = spec[1]
        else
          current_node = forest
        end
      end
    end
    #=>  {
    #      "Bosnia" => [1, {
    #        "and" => [1, {
    #          "Herzegovina" => [1, {}]
    #        }]
    #      }],
    #      "river" => [2, {}]
    #    }

    # Print only complete sequences (not intermediate prefixes such as
    # "Bosnia and") together with their counts.
    def print_tree(node, sequences, parent = nil)
      node.each do |word, spec|
        sequence = [parent, word].compact.join(' ')
        puts "#{sequence},#{spec[0]}" if sequences.include? sequence
        print_tree(spec[1], sequences, sequence)
      end
    end

    print_tree(forest, sequences)

You can run it with:

    $ ruby count_sequences.rb /path/to/sequences.txt /path/to/tokens.txt

Output: