Question

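I want to search a txt file for a specific word. If I find that word, I want to retrieve the word that immediately follows it in the file. If my text file contains: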

"My name is Jay and I want to go to the store"

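I am searching for the word "want" and would like to add the word "to" to my array. I will be working through a very large text file, so any comments on performance would be great.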

Answer 1

The most readable approach is probably the following:

a = []
str = "My name is Jack and I want to go to the store"
str.scan(/\w+/).each_cons(2) {|x, y| a << y if x == 'to'}
a
  #=> ["go", "the"]

To read the file into a string, use File.read.
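
Putting the two pieces together, a minimal sketch might look like the following. The method name words_after and the temporary file are made up for illustration and are not part of the answer above. Note that File.read loads the whole file into memory, which may matter for a very large file.

```ruby
require 'tempfile'

# Hypothetical helper: returns every word that directly follows `target`.
def words_after(text, target)
  text.scan(/\w+/).each_cons(2).select { |x, _| x == target }.map(&:last)
end

# A temporary file stands in for the real text file (assumption).
Tempfile.create("sample") do |f|
  f.write("My name is Jay and I want to go to the store")
  f.flush
  p words_after(File.read(f.path), "want")  #=> ["to"]
end
```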

Answer 2

Here is one way:

Code

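def find_next(fname, word)
  enum = IO.foreach(fname)
  loop do
    e = enum.next.scan(/\w+/)
    ndx = e.index(word)
    next unless ndx
    # The word is followed by another word on the same line.
    return e[ndx + 1] if ndx < e.size - 1
    # Otherwise return the first word on a later line, if there is one.
    loop do
      line = enum.next
      return line[/\w+/] if line =~ /\w/
    end
    return nil # end of file reached after the match
  end
  nil
end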

Example

text =<<_
It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness,
. . . . .
it was the epoch of belief, it was the epoch of incredulity,
it was the season of light, it was the season of darkness,
it was the spring of hope, it was the winter of despair…
_

FName = "two_cities"
File.write(FName, text)

find_next(FName, "worst")       #=> "of"
find_next(FName, "wisdom")      #=> "it"
find_next(FName, "foolishness") #=> "it"
find_next(FName, "dispair")     #=> nil
find_next(FName, "magpie")      #=> nil

Shorter, but less efficient and problematic for large files:

File.read(FName)[/(?<=\b#{word}\b)\W+(\w+)/,1]
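
If the search word might contain regex metacharacters, a similar one-liner can be wrapped in a small method that escapes the word first. This is only a sketch: next_word_in is a made-up name, and it uses a plain capture group rather than the lookbehind shown above.

```ruby
# Hypothetical helper: first word after `word` in `text`, or nil if none.
def next_word_in(text, word)
  # Regexp.escape keeps characters such as "." from acting as wildcards.
  text[/\b#{Regexp.escape(word)}\b\W+(\w+)/, 1]
end

p next_word_in("it was the worst of times", "worst")
```

Like the one-liner, this expects the whole text as a single string, so the same caveat about large files applies.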

Answer 3

This is probably not the fastest way, but something along these lines should work:

filename = "/path/to/filename"
target_word = "weasel"
next_word = ""

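# File.foreach closes the file once iteration finishes.
File.foreach(filename) do |line|
  words = line.split
  words.each_with_index do |word, index|
    # Keeps the last match found; nil if the target word ends its line.
    next_word = words[index + 1] if word == target_word
  end
end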

Answer 4

Given a File, String, or StringIO stored in file:

pattern, match = 'want', nil
catch :found do
  file.each_line do |line|
    line.split.each_cons(2) do |words|
      if words[0] == pattern
        match = words.pop 
        throw :found
      end
    end
  end
end
match
#=> "to"

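Note that this answer finds at most one match per file, which favors speed, and that working line by line saves memory. If you want to find multiple matches per file, or matches that span a line break, then this other answer is probably the way to go. YMMV.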
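
For the case of several matches per file, including pairs split across a line break, one possible line-by-line sketch is shown below. The method name following_words is hypothetical, and it tokenizes with /\w+/ instead of split, so punctuation is ignored. Treat it as an illustration under those assumptions rather than a tuned implementation.

```ruby
require 'tempfile'

# Hypothetical helper: streams the file one line at a time and collects
# every word that follows `target`, even when a newline separates them.
def following_words(path, target)
  found = []
  prev = nil # the most recent word seen, carried across lines
  File.foreach(path) do |line|
    line.scan(/\w+/) do |w|
      found << w if prev == target
      prev = w
    end
  end
  found
end

# A temporary file stands in for the real text file (assumption).
Tempfile.create("sample") do |f|
  f.write("I want\nto go and I want a candy\n")
  f.flush
  p following_words(f.path, "want")  #=> ["to", "a"]
end
```

Only one line and the result array are held in memory at a time, which should suit large files.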

Answer 5

This is the fastest I could come up with, assuming your file has been loaded into a string:

word = 'want'
array = []
string.scan(/\b#{word}\b\s(\w+)/) do
  array << $1
end

This finds every word that follows your chosen word. For example:

word = 'want'
string = 'My name is Jay and I want to go and I want a candy'
array = []
string.scan(/\b#{word}\b\s(\w+)/) do
  array << $1
end
p array #=> ["to", "a"]

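Testing this on my machine with the string repeated 500,000 times, I got an execution time of 0.6 seconds. I also tried other approaches, such as splitting the string, but this was the fastest solution: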

require 'benchmark'

Benchmark.bm do |bm|
  bm.report do
    word = 'want'
    string = 'My name is Jay and I want to go and I want a candy' * 500_000
    array = []
    string.scan(/\b#{word}\b\s(\w+)/) do
      array << $1
    end
  end
end
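
To compare approaches on your own machine, a sketch like the following times the regex scan against a split-based variant using Benchmark.realtime. The split variant, the trailing space in the string, and the smaller repeat count are assumptions chosen for illustration. Timings will vary with hardware and Ruby version, so none are quoted here.

```ruby
require 'benchmark'

word = 'want'
# The trailing space keeps repeated copies from fusing "candy" and "My".
string = 'My name is Jay and I want to go and I want a candy ' * 50_000

by_scan = by_split = nil

t_scan = Benchmark.realtime do
  # With a capture group, scan returns an array of one-element arrays.
  by_scan = string.scan(/\b#{word}\b\s(\w+)/).flatten
end

t_split = Benchmark.realtime do
  by_split = string.split.each_cons(2).select { |x, _| x == word }.map(&:last)
end

puts format("scan:  %.3fs", t_scan)
puts format("split: %.3fs", t_split)
puts "same result: #{by_scan == by_split}"
```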
