Return the frequency of words

Time: 2015-08-01 19:03:01

Tags: ruby regex

I have the following text:

Grier et al. (1983) reported father and 2 sons with typical Aarskog
syndrome, including short stature, hypertelorism, and shawl scrotum.
They tabulated the findings in 82 previous cases. X-linked recessive
inheritance has repeatedly been suggested (see 305400). The family
reported by Welch (1974) had affected males in 3 consecutive
generations. Thus, there is either genetic heterogeneity or this is an
autosomal dominant with strong sex-influence and possibly ascertainment
bias resulting from use of the shawl scrotum as a main criterion.
Stretchable skin was present in the cases of Grier et al. (1983).

I am trying to return the list of words in the text above.

This is what I did:

input_file.read.downcase.scan(/\b[a-z]\b/) {|word| frequency[word] = frequency[word] + 1}

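I get the letters in the document (i.e. a, b, c, ..., z) and their frequencies, not the words. Why is that? And how can I get words instead of individual letters?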

2 answers:

Answer 0: (score: 3)

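http://rubular.com is a great resource.

\b[a-z]\b means any single letter between two word boundaries, so it can only match words that are exactly one letter long.

If you want to allow more than one character: \b[a-z]+\b

That is, any run of one or more letters between two word boundaries.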
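
The effect of the quantifier can be seen in a minimal sketch. The sample string and the method names (SAMPLE, single_letters, whole_words, word_frequency) are made up for illustration and are not from the question; the sketch also assumes the counter is a hash created with Hash.new(0), since the question does not show how frequency was defined.

```ruby
# Hypothetical sample text, not taken from the question.
SAMPLE = 'a cat and a dog'

# Without a quantifier, only words that are exactly one letter long match.
def single_letters(text)
  text.downcase.scan(/\b[a-z]\b/)
end

# With +, each run of letters between word boundaries matches as one word.
def whole_words(text)
  text.downcase.scan(/\b[a-z]+\b/)
end

# Count words with a hash whose missing keys default to 0 (an assumption
# about how the question's `frequency` hash was set up).
def word_frequency(text)
  frequency = Hash.new(0)
  text.downcase.scan(/\b[a-z]+\b/) { |word| frequency[word] += 1 }
  frequency
end

p single_letters(SAMPLE)  # => ["a", "a"]
p whole_words(SAMPLE)     # => ["a", "cat", "and", "a", "dog"]
p word_frequency(SAMPLE)  # counts: a 2, cat 1, and 1, dog 1
```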

Answer 1: (score: 1)

I'd do it like this:

text = 'Foo. (1983). Bar baz foo bar.'
text.downcase
# => "foo. (1983). bar baz foo bar."

downcase folds the text to lowercase, so that matching words are easy to find regardless of case.

text.downcase.gsub(/[^a-z ]+/i, '')
# => "foo  bar baz foo bar"

gsub(/[^a-z ]+/i, '') removes characters that are not part of words, such as punctuation and digits.

text.downcase.gsub(/[^a-z ]+/i, '').split
# => ["foo", "bar", "baz", "foo", "bar"]

split breaks the string into "words", which are delimited by whitespace.

text.downcase.gsub(/[^a-z ]+/i, '').split.each_with_object(Hash.new{ |h,k| h[k] = 0}){ |w, h| h[w] += 1 }
# => {"foo"=>2, "bar"=>2, "baz"=>1}

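each_with_object(Hash.new{ |h,k| h[k] = 0}){ |w, h| h[w] += 1 } is how to walk the array and count the frequency of each element. Hash.new{ |h,k| h[k] = 0} is how to define a hash that automatically creates a 0 value for keys that don't exist yet.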
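
For comparison, here is a small sketch of the same counting step written three ways. The method names and the WORDS constant are hypothetical, chosen only for this example. The tally variant assumes Ruby 2.7 or newer, where Enumerable#tally was added; on older versions only the first two forms apply.

```ruby
# Hypothetical input: the word list produced by the split step above.
WORDS = %w[foo bar baz foo bar]

# Block form, as in the answer: the block runs on a missing key and stores 0.
def count_with_block(words)
  words.each_with_object(Hash.new { |h, k| h[k] = 0 }) { |w, h| h[w] += 1 }
end

# Default-value form: fine here because 0 is an immutable Integer.
def count_with_default(words)
  words.each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }
end

# Enumerable#tally does the same counting (assumes Ruby 2.7 or newer).
def count_with_tally(words)
  words.tally
end

p count_with_block(WORDS)    # counts: foo 2, bar 2, baz 1
p count_with_default(WORDS)  # same counts
p count_with_tally(WORDS)    # same counts
```

All three return equal hashes for this input; which one reads best is mostly a matter of taste and of the Ruby version you have to support.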

Putting all of that together:

text = 'Grier et al. (1983) reported father and 2 sons with typical Aarskog syndrome, including short stature, hypertelorism, and shawl scrotum. They tabulated the findings in 82 previous cases. X-linked recessive inheritance has repeatedly been suggested (see 305400). The family reported by Welch (1974) had affected males in 3 consecutive generations. Thus, there is either genetic heterogeneity or this is an autosomal dominant with strong sex-influence and possibly ascertainment bias resulting from use of the shawl scrotum as a main criterion. Stretchable skin was present in the cases of Grier et al. (1983).'
text.downcase
    .gsub(/[^a-z ]+/i, '')
    .split
    .each_with_object(Hash.new{ |h,k| h[k] = 0}){ |w, h| h[w] += 1 } 

Which results in:

# => {"grier"=>2, "et"=>2, "al"=>2, "reported"=>2, "father"=>1, "and"=>3, "sons"=>1, "with"=>2, "typical"=>1, "aarskog"=>1, "syndrome"=>1, "including"=>1, "short"=>1, "stature"=>1, "hypertelorism"=>1, "shawl"=>2, "scrotum"=>2, "they"=>1, "tabulated"=>1, "the"=>4, "findings"=>1, "in"=>3, "previous"=>1, "cases"=>2, "xlinked"=>1, "recessive"=>1, "inheritance"=>1, "has"=>1, "repeatedly"=>1, "been"=>1, "suggested"=>1, "see"=>1, "family"=>1, "by"=>1, "welch"=>1, "had"=>1, "affected"=>1, "males"=>1,...

If you insist on using a regex and scan:

text.downcase
    .scan(/\b [a-z]+ \b/x)
    .each_with_object(Hash.new{ |h,k| h[k] = 0}){ |w, h| h[w] += 1 } 
# => {"grier"=>2, "et"=>2, "al"=>2, "reported"=>2, "father"=>1, "and"=>3, "sons"=>1, "with"=>2, "typical"=>1, "aarskog"=>1, "syndrome"=>1, "including"=>1, "short"=>1, "stature"=>1, "hypertelorism"=>1, "shawl"=>2, "scrotum"=>2, "they"=>1, "tabulated"=>1, "the"=>4, "findings"=>1, "in"=>3, "previous"=>1, "cases"=>2, "x"=>1, "linked"=>1, "recessive"=>1, "inheritance"=>1, "has"=>1, "repeatedly"=>1, "been"=>1, "suggested"=>1, "see"=>1, "family"=>1, "by"=>1, "welch"=>1, "had"=>1, "affected"=>1, ...
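
The two results differ in one visible way: gsub deletes the hyphen, so "X-linked" becomes "xlinked", while scan stops at the hyphen and yields "x" and "linked". The sketch below reproduces that on a short phrase built from words in the text. The method names are hypothetical, and hyphen_aware_words is only one possible alternative for keeping hyphenated words whole, not something either answer proposes.

```ruby
# Short phrase assembled from words in the question's text.
PHRASE = 'X-linked recessive inheritance with strong sex-influence'

# gsub + split: the hyphen is deleted, so the two halves are glued together.
def stripped_words(text)
  text.downcase.gsub(/[^a-z ]+/, '').split
end

# scan with word boundaries: the hyphen separates two matches.
def boundary_words(text)
  text.downcase.scan(/\b[a-z]+\b/)
end

# Hypothetical alternative: allow hyphens between runs of letters.
# The group is non-capturing so scan still returns whole matches.
def hyphen_aware_words(text)
  text.downcase.scan(/[a-z]+(?:-[a-z]+)*/)
end

p stripped_words(PHRASE).first      # => "xlinked"
p boundary_words(PHRASE).first(2)   # => ["x", "linked"]
p hyphen_aware_words(PHRASE).first  # => "x-linked"
```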

The real difference is that gsub().split is faster than scan(/\b [a-z]+ \b/x).
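
To check the relative speed on your own input, a rough timing sketch like the following may help. The BENCH_TEXT constant, the method names, and the run count are all assumptions made for this example, and the numbers will vary with Ruby version, machine, and input, so no expected timings are shown.

```ruby
# Hypothetical benchmark input: one sentence from the text, repeated.
BENCH_TEXT = ('Grier et al. (1983) reported father and 2 sons. ' * 200).downcase

def split_words(text)
  text.gsub(/[^a-z ]+/, '').split
end

def scan_words(text)
  text.scan(/\b [a-z]+ \b/x)
end

# Time `runs` calls of the given block using a monotonic clock.
def elapsed_seconds(runs = 100)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  runs.times { yield }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

puts format('gsub + split: %.4fs', elapsed_seconds { split_words(BENCH_TEXT) })
puts format('scan:         %.4fs', elapsed_seconds { scan_words(BENCH_TEXT) })
```

This input contains no hyphenated words, so both methods return the same list and the comparison is like for like.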