Unable to compose English words with Ruby

Date: 2009-05-09 09:33:40

Tags: ruby knuth

I need to find all the English words that can be made from the letters in the string

 sentence="Ziegler's Giant Bar"

I can make an array of letters with:
 sentence.split(//)  

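How can I produce the more than 4,500 English words from this sentence in Ruby?

[Edit]

It may be best to split the problem into parts:

  1. Build an array of only the words with 10 letters or fewer
  2. Longer words can be looked up separately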

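4 answers:

Answer 0: (score: 8)

[Assuming you can reuse the source letters within a single word]: for each word in the dictionary list, construct two arrays of letters, one for the candidate word and one for the input string. Subtract the input array of letters from the word's array of letters; if no letters are left over, you have a match. The code to do this looks like this: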

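def findWordsWithReplacement(sentence)
    out=[]
    splitArray=sentence.downcase.split(//)
    # String#each no longer exists in Ruby 1.9+, so read the dictionary line by line
    File.foreach("/usr/share/dict/words"){|line|
        # strip rather than strip!, because strip! returns nil when nothing changes
        word=line.strip
        if (word.downcase.split(//) - splitArray).empty?
            out.push word
        end
     }
     return out
end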

You can call the function from irb like this:

output=findWordsWithReplacement("some input string"); puts output.join(" ")

...or here is a wrapper that can be used to call the function interactively from a script:

puts "enter the text."
ARGF.each {|line|
    puts "working..."
    out=findWordsWithReplacement(line)
    puts out.join(" ")
    puts "there were #{out.size} words."
}

When run on a Mac, the output looks like this:

  

$ ./findwords.rb
  enter the text.
  Ziegler's Giant Bar
  working...
  A a aa aal aalii Aani Ab aba abaiser abalienate Abantes Abaris abas abase abaser Abasgi abasia Abassin abatable [...] abbas abbasi abbassi abbatial abbess Abbie Abe abear Abel abele Abelia Abelian Abelite abelite abeltree [...] abettal Abie Abies abietate abiettene abietin Abietineae Abiezer Abigail abigail abigeat abilla abintestate
  [....]
  Z z za Zabaean zabeta Zabian zabra zabti zabtie zag zain zan zanella zant zante Zanzalian zanze Zanzibari zar zaratite zareba zat zati zattare Zea zeal [...] Zebrina zebrine zee zein zeist zel Zelanian Zeltinger Zen Zenaga zenana zer zest zeta ziara ziarat zibeline zibet ziega zieger zig zigzag Zigzagger Zilla zing zingel Zingiber zingiberene Zinnia zinsang Zinzar zira zirai Zirbanit Zirian Zirianian Zizania Zizia zizz
  there were 6725 words.

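That is more than 4,500 words, but only because the Mac dictionary is very large. If you want to reproduce Knuth's result exactly, download and unzip Knuth's dictionary from here: http://www.packetstormsecurity.org/Crackers/wordlists/dictionaries/knuth_words.gz and replace "/usr/share/dict/words" with the path to wherever you unpacked the replacement dictionary. If you did it right, you will get 4514 words, ending with this set: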

  

zanier zanies zaniness Zanzibar zazen [...] Zennist zest zestier zeta Ziegler zig zigging zigzag zigzagging zigzags zing zingier zings zinnia

I believe that answers the original question.
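As a minimal sketch of why the array-subtraction test permits letter reuse: Ruby's `Array#-` removes every occurrence of each right-hand element, so a word passes whenever all of its distinct letters occur somewhere in the input, whatever their counts. The helper name `covered_by?` and the tiny in-memory word list are assumptions made for illustration; they are not part of the script above.

```ruby
# Hypothetical helper: true when every letter of word appears somewhere in sentence.
# Array#- drops ALL occurrences of each right-hand element, so letter counts are ignored.
def covered_by?(word, sentence)
  (word.downcase.chars - sentence.downcase.chars).empty?
end

sentence = "Ziegler's Giant Bar"
sample = %w[zigzag table ruby zest]   # illustrative words, not a real dictionary

sample.each do |word|
  puts "#{word}: #{covered_by?(word, sentence)}"
end
```

Here "zigzag" is accepted even though the sentence contains only one "z", which is the reuse behaviour assumed at the top of this answer.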

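Alternatively, the asker/reader may want to list all the words that can be constructed from the string without reusing any input letter. The code I suggest works as follows: copy the candidate word, then for each letter in the input string, destructively remove the first instance of that letter from the copy (using "slice!"). If this process absorbs all the letters of the copy, accept the word.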

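def findWordsNoReplacement(sentence)
    out=[]
    splitInput=sentence.downcase.split(//)
    # String#each no longer exists in Ruby 1.9+, so read the dictionary line by line
    File.foreach("/usr/share/dict/words"){|line|
        # strip rather than strip!, because strip! returns nil when nothing changes
        word=line.strip
        copy=word.downcase
        # remove one occurrence from the copy for each input character
        splitInput.each {|o| copy.slice!(o) }
        out.push word if copy==""
     }
     return out
end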
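An alternative sketch of the same no-reuse check, swapping the destructive `slice!` loop for per-letter counts. The names `letter_counts` and `buildable?` are hypothetical, and the short word list stands in for a dictionary file. Unlike the version above, this sketch ignores non-letter characters in the candidate word.

```ruby
# Hypothetical helper: count how often each a-z letter occurs,
# ignoring case and skipping punctuation and spaces.
def letter_counts(text)
  counts = Hash.new(0)
  text.downcase.scan(/[a-z]/) { |letter| counts[letter] += 1 }
  counts
end

# Hypothetical helper: true when word needs no letter more often
# than sentence supplies it.
def buildable?(word, sentence)
  available = letter_counts(sentence)
  letter_counts(word).all? { |letter, needed| needed <= available[letter] }
end

sentence = "Ziegler's Giant Bar"
%w[zest table zigzag].each do |word|
  puts "#{word}: #{buildable?(word, sentence)}"
end
```

Counting once per word avoids mutating strings, at the cost of building a small hash for each candidate.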

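Answer 1: (score: 3)

If you want to find words whose letters, and the frequency of each letter, are restricted by the given phrase, you can construct a regex to do this for you: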

sentence = "Ziegler's Giant Bar"

# count how many times each letter occurs in the 
# sentence (ignoring case, and removing non-letters)
counts = Hash.new(0)
sentence.downcase.gsub(/[^a-z]/,'').split(//).each do |letter|
  counts[letter] += 1
end
letters = counts.keys.join
length = counts.values.inject { |a,b| a + b }

# construct a regex that matches up to that many occurrences
# of only those letters, ignoring non-letters
# (in a positive look ahead)
length_regex = /(?=^(?:[^a-z]*[#{letters}]){1,#{length}}[^a-z]*$)/i
# construct regexes that matches each letter up to its
# proper frequency (in a positive look ahead)
count_regexes = counts.map do |letter, count|
  /(?=^(?:[^#{letter}]*#{letter}){0,#{count}}[^#{letter}]*$)/i
end

# combine the regexes, to form a regex that will only
# match words that are made of a subset of the letters in the string
regex = /#{length_regex}#{count_regexes.join('')}/

# open a big file of words, and find all the ones that match
words = File.open("/usr/share/dict/words") do |f|
  f.map { |word| word.chomp }.find_all { |word| regex =~ word }
end

words.length #=> 3182
words #=> ["A", "a", "aa", "aal", "aalii", "Aani", "Ab", "aba", "abaiser", "Abantes",
          "Abaris", "abas", "abase", "abaser", "Abasgi", "abate", "abater", "abatis",
          ...
          "ba", "baa", "Baal", "baal", "Baalist", "Baalite", "Baalize", "baar", "bae",
          "Baeria", "baetzner", "bag", "baga", "bagani", "bagatine", "bagel", "bagganet",
          ...
          "eager", "eagle", "eaglet", "eagre", "ean", "ear", "earing", "earl", "earlet",
          "earn", "earner", "earnest", "earring", "eartab", "ease", "easel", "easer",
          ...
          "gab", "Gabe", "gabi", "gable", "gablet", "Gabriel", "Gael", "gaen", "gaet",
          "gag", "gagate", "gage", "gageable", "gagee", "gageite", "gager", "Gaia",
          ...
          "Iberian", "Iberis", "iberite", "ibis", "Ibsenite", "ie", "Ierne", "Igara",
          "Igbira", "ignatia", "ignite", "igniter", "Ila", "ilesite", "ilia", "Ilian",
          ...
          "laang", "lab", "Laban", "labia", "labiate", "labis", "labra", "labret", "laet",
          "laeti", "lag", "lagan", "lagen", "lagena", "lager", "laggar", "laggen",
          ...
          "Nabal", "Nabalite", "nabla", "nable", "nabs", "nae", "naegate", "naegates",
          "nael", "nag", "Naga", "naga", "Nagari", "nagger", "naggle", "nagster", "Naias",
          ...
          "Rab", "rab", "rabat", "rabatine", "Rabi", "rabies", "rabinet", "rag", "raga",
          "rage", "rager", "raggee", "ragger", "raggil", "raggle", "raging", "raglan",
          ...
          "sa", "saa", "Saan", "sab", "Saba", "Sabal", "Saban", "sabe", "saber",
          "saberleg", "Sabia", "Sabian", "Sabina", "sabina", "Sabine", "sabine", "Sabir",
          ...
          "tabes", "Tabira", "tabla", "table", "tabler", "tables", "tabling", "Tabriz",
          "tae", "tael", "taen", "taenia", "taenial", "tag", "Tagabilis", "Tagal",
          ...
          "zest", "zeta", "ziara", "ziarat", "zibeline", "zibet", "ziega", "zieger",
          "zig", "zing", "zingel", "Zingiber", "zira", "zirai", "Zirbanit", "Zirian"]

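Positive lookaheads let you create a regex that matches a position in the string at which some specified pattern matches, without consuming the part of the string that matched. We use them here to match the same string against multiple patterns in a single regex. The position matches only if all of our patterns match.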
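A small sketch of that idea, using a made-up pair of constraints rather than the full regex built above. Each lookahead is tested at the start of the string and checks one condition, and the combined regex matches only if both hold. The constant names are assumptions for illustration.

```ruby
# Each lookahead is tested at the same position (the start of the string)
# and consumes nothing, so both conditions apply to the whole word.
AT_MOST_ONE_Z  = /(?=^(?:[^z]*z){0,1}[^z]*$)/i   # no more than one "z"
AT_MOST_TWO_GS = /(?=^(?:[^g]*g){0,2}[^g]*$)/i   # no more than two "g"s

# Interpolating one Regexp into another keeps each part's own flags.
COMBINED = /#{AT_MOST_ONE_Z}#{AT_MOST_TWO_GS}/

%w[zig zigzag giggling].each do |word|
  verdict = (COMBINED =~ word) ? "match" : "no match"
  puts "#{word}: #{verdict}"
end
```

"zig" satisfies both limits, "zigzag" fails the "z" limit, and "giggling" fails the "g" limit.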

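If we allow the letters of the original phrase to be reused an unlimited number of times (as Knuth did, according to glenra's comment), then building the regex is even easier: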

sentence = "Ziegler's Giant Bar"

# find all the letters in the sentence
letters = sentence.downcase.gsub(/[^a-z]/,'').split(//).uniq

# construct a regex that matches any line in which
# the only letters used are the ones in the sentence
regex = /^([^a-z]|[#{letters.join}])*$/i

# open a big file of words, and find all the ones that match
words = File.open("/usr/share/dict/words") do |f|
  f.map { |word| word.chomp }.find_all { |word| regex =~ word }
end

words.length #=> 6725
words #=> ["A", "a", "aa", "aal", "aalii", "Aani", "Ab", "aba", "abaiser", "abalienate",
           ...
           "azine", "B", "b", "ba", "baa", "Baal", "baal", "Baalist", "Baalite",
           "Baalize", "baar", "Bab", "baba", "babai", "Babbie", "Babbitt", "babbitt",
           ...
           "Britannian", "britten", "brittle", "brittleness", "brittling", "Briza",
           "brizz", "E", "e", "ea", "eager", "eagerness", "eagle", "eagless", "eaglet",
           "eagre", "ean", "ear", "earing", "earl", "earless", "earlet", "earliness",
           ...
           "eternalize", "eternalness", "eternize", "etesian", "etna", "Etnean", "Etta",
           "Ettarre", "ettle", "ezba", "Ezra", "G", "g", "Ga", "ga", "gab", "gabber",
           "gabble", "gabbler", "Gabe", "gabelle", "gabeller", "gabgab", "gabi", "gable",
           ...
           "grittiness", "grittle", "Grizel", "Grizzel", "grizzle", "grizzler", "grr",
           "I", "i", "iba", "Iban", "Ibanag", "Iberes", "Iberi", "Iberia", "Iberian",
           ...
           "itinerarian", "itinerate", "its", "Itza", "Izar", "izar", "izle", "iztle",
           "L", "l", "la", "laager", "laang", "lab", "Laban", "labara", "labba", "labber",
           ...
           "litter", "litterer", "little", "littleness", "littling", "littress", "litz",
           "Liz", "Lizzie", "Llanberisslate", "N", "n", "na", "naa", "Naassenes", "nab",
           "Nabal", "Nabalite", "Nabataean", "Nabatean", "nabber", "nabla", "nable",
           ...
           "niter", "nitraniline", "nitrate", "nitratine", "Nitrian", "nitrile",
           "nitrite", "nitter", "R", "r", "ra", "Rab", "rab", "rabanna", "rabat",
           "rabatine", "rabatte", "rabbanist", "rabbanite", "rabbet", "rabbeting",
           ...
           "riteless", "ritelessness", "ritling", "rittingerite", "rizzar", "rizzle", "S",
           "s", "sa", "saa", "Saan", "sab", "Saba", "Sabaean", "sabaigrass", "Sabaist",
           ...
           "strigine", "string", "stringene", "stringent", "stringentness", "stringer",
           "stringiness", "stringing", "stringless", "strit", "T", "t", "ta", "taa",
           "Taal", "taar", "Tab", "tab", "tabaret", "tabbarea", "tabber", "tabbinet",
           ...
           "tsessebe", "tsetse", "tsia", "tsine", "tst", "tzaritza", "Tzental", "Z", "z",
           "za", "Zabaean", "zabeta", "Zabian", "zabra", "zabti", "zabtie", "zag", "zain",
           ...
           "Zirian", "Zirianian", "Zizania", "Zizia", "zizz"]

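Answer 2: (score: 1)

I don't think Ruby ships with an English dictionary. But you could try storing all permutations of the original string in an array and checking those strings against Google? Assume a string is actually a word if it gets more than 100,000 hits or so?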
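As a rough sketch of the permutation idea, swapping the Google lookup for a local word set (the hit-count check itself is not implemented here). `KNOWN_WORDS` and `words_from_permutations` are hypothetical names, and the approach is only practical for short inputs, because the number of arrangements grows factorially with the number of letters.

```ruby
require 'set'

# Hypothetical stand-in for a dictionary (or for the search-engine check).
KNOWN_WORDS = Set.new(%w[bar bra tab bat rat art tar a at])

# Try every ordering of every subset of the letters, up to max_length
# letters long, and keep the orderings that are known words.
def words_from_permutations(text, max_length)
  letters = text.downcase.scan(/[a-z]/)
  found = Set.new
  (1..max_length).each do |size|
    letters.permutation(size) do |candidate|
      word = candidate.join
      found << word if KNOWN_WORDS.include?(word)
    end
  end
  found.to_a.sort
end

puts words_from_permutations("Bar at", 3).join(" ")
```

For the 16 letters of "Ziegler's Giant Bar", enumerating all orderings is far too slow, which is why the other answers filter a dictionary instead.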

Answer 3: (score: 1)

You can get an array of letters like this:

sentence = "Ziegler's Giant Bar"
letters = sentence.split(//)
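Note that `split(//)` keeps every character, including the apostrophe and the spaces. A minimal sketch of one way to keep only the letters, assuming case should be ignored (the variable names are illustrative):

```ruby
sentence = "Ziegler's Giant Bar"

# split(//) keeps every character, including the apostrophe and the spaces.
all_chars = sentence.split(//)

# scan(/[a-z]/) on the downcased string keeps only the letters.
letters_only = sentence.downcase.scan(/[a-z]/)

puts all_chars.length     # 19 characters
puts letters_only.length  # 16 letters
```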