The n-gram that is the most frequent one among all the words

Asked: 2014-09-04 00:27:34

Tags: c algorithm n-gram

I came across the following programming interview problem:

Challenge 1: N-grams

An N-gram is a sequence of N consecutive characters from a given word. For the word "pilot" there are three 3-grams: "pil", "ilo" and "lot". For a given set of words and an n-gram length, your task is to:

• write a function that finds the n-gram that is the most frequent one among all the words
• print the result to the standard output (stdout)
• if there are multiple n-grams having the same maximum frequency please print the one that is the smallest lexicographically (the first one according to the dictionary sorting order)

Note that your function will receive the following arguments:

• text
    ○ which is a string containing words separated by whitespaces
• ngramLength
    ○ which is an integer value giving the length of the n-gram

Data constraints

• the length of the text string will not exceed 250,000 characters
• all words are alphanumeric (they contain only English letters a-z, A-Z and numbers 0-9)

Efficiency constraints

• your function is expected to print the result in less than 2 seconds

Example

Input
text: "aaaab a0a baaab c"
ngramLength: 3

Output: aaa

Explanation

For the input shown above, the 3-grams sorted by frequency are:

• "aaa" with a frequency of 3
• "aab" with a frequency of 2
• "a0a" with a frequency of 1
• "baa" with a frequency of 1
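For reference, these counts can be reproduced with a few lines of Python using the standard library's `collections.Counter`:

```python
from collections import Counter

text, n = "aaaab a0a baaab c", 3
# Every n-character window of every word (words shorter than n contribute nothing).
counts = Counter(w[i:i+n] for w in text.split() for i in range(len(w) - n + 1))
print(counts.most_common())  # [('aaa', 3), ('aab', 2), ('a0a', 1), ('baa', 1)]
```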

If I had just one hour to solve the problem and chose to do it in C: would implementing a hash table to count the frequency of the n-grams be a good idea? The C standard library has no hash table implementation...

If so, I was thinking of implementing the hash table with separate chaining and ordered linked lists. Would those implementations reduce the time needed to solve the problem?

Is this the fastest option?

Thank you!

8 Answers:

Answer 0 (score: 5)

If implementation efficiency matters and you are using C, I would initialize an array of pointers to the starts of the n-grams in the string, use qsort to sort the pointers according to the n-grams they point to, and then loop over the sorted array to work out the counts.

This should execute fast enough, and there is no need to write any fancy data structures.
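The same sort-then-scan idea, sketched in Python for brevity (in C you would `qsort` an array of `char *` pointers with a comparator that `strncmp`s n characters, then do the same run-counting pass over the sorted array):

```python
def most_frequent_ngram(text, n):
    # Collect every n-gram of every word (in C: pointers into the string).
    grams = [w[i:i+n] for w in text.split() for i in range(len(w) - n + 1)]
    grams.sort()  # equal n-grams become adjacent, like qsort on the pointers
    best, best_count = None, 0
    i = 0
    while i < len(grams):
        # Count the run of identical n-grams starting at position i.
        j = i
        while j < len(grams) and grams[j] == grams[i]:
            j += 1
        # Strict '>' keeps the lexicographically smallest n-gram on ties,
        # because the array is sorted.
        if j - i > best_count:
            best, best_count = grams[i], j - i
        i = j
    return best

print(most_frequent_ngram("aaaab a0a baaab c", 3))  # aaa
```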

Answer 1 (score: 1)

Sorry for posting Python, but this is what I would do: it may give you some ideas for the algorithm. Note that this program handles far more words than required.

from itertools import groupby

someText = "thibbbs is a test and aaa it may haaave some abbba reptetitions "
someText *= 40000
print(len(someText))
n = 3

ngrams = []
for word in filter(lambda x: len(x) >= n, someText.split(" ")):
    for i in range(len(word) - n + 1):
        ngrams.append(word[i:i+n])
        # you could inline all the logic here
        # add to an ordered list keyed by frequency with the actual n-gram as payload

ngrams_freq = [[len(list(group)), key] for key, group in groupby(sorted(ngrams, key=str.lower))]

ngrams_freq_sorted = sorted(ngrams_freq, reverse=True)

popular_ngrams = []

for freq in ngrams_freq_sorted:
    if freq[0] == ngrams_freq_sorted[0][0]:
        popular_ngrams.append(freq[1])
    else:
        break

print("Most popular ngram: " + sorted(popular_ngrams, key=str.lower)[0])
# > 2560000
# > Most popular ngram: aaa
# > [Finished in 1.3s]

Answer 2 (score: 1)

So the basic approach to this problem is:

  1. find all the n-grams in the string
  2. map all duplicate entries into a new structure that holds the n-gram and the number of times it occurs

You can find my C++ solution here: http://ideone.com/MNFSis

    Given:

    const unsigned int MAX_STR_LEN = 250000;
    const unsigned short NGRAM = 3;
    const unsigned int NGRAMS = MAX_STR_LEN-NGRAM;
    //we will need a maximum of "the length of our string" - "the length of our n-gram"
    //places to store our n-grams, and each ngram is specified by NGRAM+1 for '\0'
    char ngrams[NGRAMS][NGRAM+1] = { 0 };
    

    Then, for the first step, here is the code:

    const char *ptr = str;
    int idx = 0;
    //notTerminated checks ptr[0] to ptr[NGRAM-1] are not '\0'
    while (notTerminated(ptr)) { 
        //noSpace checks ptr[0] to ptr[NGRAM-1] are isalpha()
        if (noSpace(ptr)) {
            //safely copy our current n-gram over to the ngrams array
            //we're iterating over ptr and because we're here we know ptr and the next NGRAM spaces
            //are valid letters
            for (int i=0; i<NGRAM; i++) {
                ngrams[idx][i] = ptr[i];
            }
            ngrams[idx][NGRAM] = '\0'; //important to zero-terminate
            idx++;
        }
        ptr++;
    }
    

    At this point we have a list of all the n-grams. Let's find the most popular one:

    FreqNode head = { "HEAD", 0, 0, 0 }; //the start of our list
    
    for (int i=0; i<NGRAMS; i++) {
        if (ngrams[i][0] == '\0') break;
        //insertFreqNode takes a start node, this where we will start to search for duplicates
        //the simplest description is like this:
        //  1 we search from head down each child, if we find a node that has text equal to
        //    ngrams[i] then we update it's frequency count
        //  2 if the freq is >= to the current winner we place this as head.next
        //  3 after program is complete, our most popular nodes will be the first nodes
        //    I have not implemented sorting of these - it's an exercise for the reader ;)
        insertFreqNode(&head, ngrams[i]);
    }
    
    //as the list is ordered, head.next will always be the most popular n-gram
    cout << "Winner is: " << head.next->str << " with " << head.next->freq << " occurrences" << endl;
    

    Good luck!

Answer 3 (score: 1)

Just for fun, I wrote an SQL version (SQL Server 2012):

if object_id('dbo.MaxNgram','IF') is not null
    drop function dbo.MaxNgram;
go

create function dbo.MaxNgram(
     @text      varchar(max)
    ,@length    int
) returns table with schemabinding as
return
    with 
    Delimiter(c) as ( select ' '),
    E1(N) as (
        select 1 from (values 
            (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
        )T(N)
    ),
    E2(N) as (
        select 1 from E1 a cross join E1 b
    ),
    E6(N) as (
        select 1 from E2 a cross join E2 b cross join E2 c
    ),
    tally(N) as (
        select top(isnull(datalength(@text),0))
             ROW_NUMBER() over (order by (select NULL))
        from E6
    ),
    cteStart(N1) as (
        select 1 union all
        select t.N+1 from tally t cross join delimiter 
            where substring(@text,t.N,1) = delimiter.c
    ),
    cteLen(N1,L1) as (
        select s.N1,
               isnull(nullif(charindex(delimiter.c,@text,s.N1),0) - s.N1,8000)
        from cteStart s
        cross join delimiter
    ),
    cteWords as (
        select ItemNumber = row_number() over (order by l.N1),
               Item       = substring(@text, l.N1, l.L1)
        from cteLen l
    ),
    mask(N) as ( 
        select top(@length) row_Number() over (order by (select NULL))
        from E6
    ),
    topItem as (
        select top 1
             substring(Item,m.N,@length) as Ngram
            ,count(*)                    as Length
        from cteWords   w
        cross join mask m
        where m.N     <= datalength(w.Item) + 1 - @length
          and @length <= datalength(w.Item) 
        group by 
            substring(Item,m.N,@length)
        order by 2 desc, 1 
    )
    select d.s
    from (
        select top 1 NGram,Length
        from topItem
    ) t
    cross apply (values (cast(NGram as varchar)),(cast(Length as varchar))) d(s)
;
go

When invoked with the sample input provided by the OP:

set nocount on;
select s as [ ] from MaxNgram(
    'aaaab a0a baaab c aab'
   ,3
);
go

it produces the desired result:

------------------------------
aaa
3

Answer 4 (score: 0)

You can convert each trigram into a RADIX50 code. See http://en.wikipedia.org/wiki/DEC_Radix-50

In radix50, the output value for a trigram fits into a 16-bit unsigned integer.

After that, you can use the radix-encoded trigram as an index into an array.

So your code would be something like:

uint32_t counters[1 << 16]; /* 64K counters; 32-bit so a count can exceed 65535 */

memset(counters, 0, sizeof(counters));

/* radix50() is assumed to encode the 3 chars at p into a value < 64000 */
for (const char *p = txt; p[2] != 0; p++)
    counters[radix50(p)]++;

After that, just search for the maximum value in the array and decode that index back into a trigram.
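A sketch of the trick in Python, using a hypothetical symbol table (a filler plus a-z and 0-9, i.e. 37 of the 40 slots of a base-40 packing) rather than the exact DEC RADIX-50 character table; since 40^3 = 64000, every code fits in 16 bits:

```python
# Hypothetical symbol table: index 0 is a filler, then a-z, then 0-9
# (37 symbols used out of the 40 available in a base-40 packing).
ALPHABET = " abcdefghijklmnopqrstuvwxyz0123456789"

def pack_trigram(tri):
    """Pack 3 characters into one integer < 40**3 = 64000, i.e. 16 bits."""
    code = 0
    for ch in tri:
        code = code * 40 + ALPHABET.index(ch)
    return code

def unpack_trigram(code):
    """Invert pack_trigram: decode a base-40 value back into 3 characters."""
    chars = []
    for _ in range(3):
        code, r = divmod(code, 40)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))

# Count the trigrams of one word in a flat 64K-entry array, as suggested above.
counters = [0] * (40 ** 3)
word = "aaaab"
for i in range(len(word) - 2):
    counters[pack_trigram(word[i:i+3])] += 1

best = max(range(len(counters)), key=lambda c: counters[c])
print(unpack_trigram(best), counters[best])  # aaa 2
```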

I used this trick about 10 years ago, when I implemented the Wilbur-Khovayko algorithm for fuzzy search.

You can download the source here: http://itman.narod.ru/source/jwilbur1.tar.gz

Answer 5 (score: 0)

If you're not tied to C: I wrote this Python script in about 10 minutes. It processes a 1.5 MB file with more than 265,000 words, looking for 3-grams, in 0.4 s (apart from printing the values on the screen).
The text used for the test is Ulysses by James Joyce; you can find it for free here: https://www.gutenberg.org/ebooks/4300

Word separators here are both the space and the newline \n

import sys

text = open(sys.argv[1], 'r').read()
ngram_len = int(sys.argv[2])
text = text.replace('\n', ' ')
words = [word.lower() for word in text.split(' ')]
ngrams = {}
for word in words:
    word_len = len(word)
    if word_len < ngram_len:
        continue
    for i in range(0, (word_len - ngram_len) + 1):
        ngram = word[i:i+ngram_len]
        if ngram in ngrams:
            ngrams[ngram] += 1
        else:
            ngrams[ngram] = 1
# group the n-grams by frequency
ngrams_by_freq = {}
for key, val in ngrams.items():
    if val not in ngrams_by_freq:
        ngrams_by_freq[val] = [key]
    else:
        ngrams_by_freq[val].append(key)
ngrams_by_freq = sorted(ngrams_by_freq.items())
for key in ngrams_by_freq:
    print('{} with frequency of {}'.format(key[1], key[0]))

Answer 6 (score: 0)

You can solve this problem in O(nk) time, where n is the number of words and k is the average number of n-grams per word.

You are correct in thinking that a hash table is a good solution to the problem.

However, since your time to code a solution is limited, I'd suggest using open addressing instead of linked lists. The implementation may be simpler: if you hit a collision, you just walk farther along the table.

Also, make sure you allocate enough memory for your hash table: something around twice the expected number of n-grams should be fine. Since the expected number of n-grams is <= 250,000, a hash table of 500,000 entries should suffice.
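A minimal sketch of that open-addressing scheme (linear probing over parallel key/count arrays, sized at 500,000 as suggested; Python's built-in `hash` stands in for whatever string hash you would write in C):

```python
TABLE_SIZE = 500000  # about twice the maximum possible number of n-grams
keys = [None] * TABLE_SIZE
counts = [0] * TABLE_SIZE

def bump(ngram):
    """Increment ngram's count; on a collision just walk farther along."""
    i = hash(ngram) % TABLE_SIZE
    while keys[i] is not None and keys[i] != ngram:
        i = (i + 1) % TABLE_SIZE  # linear probing
    keys[i] = ngram
    counts[i] += 1

n = 3
for word in "aaaab a0a baaab c".split():
    for i in range(len(word) - n + 1):
        bump(word[i:i+n])

# Most frequent; ties broken by the lexicographically smallest n-gram.
best = min(((k, c) for k, c in zip(keys, counts) if k is not None),
           key=lambda kc: (-kc[1], kc[0]))
print(best)  # ('aaa', 3)
```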

In terms of coding speed, the small input length (250,000 characters) makes sorting and counting feasible. The quickest route is probably to generate an array of pointers to each n-gram, sort the array with an appropriate comparator, and then walk along it keeping track of which n-gram appears most often.

Answer 7 (score: 0)

One simple Python solution for this question:

your_str = "aaaab a0a baaab c"
str_list = your_str.split(" ")
str_hash = {}
ngram_len = 3

for word in str_list:  # 'word' rather than 'str', which shadows the builtin
    start = 0
    end = ngram_len
    len_word = len(word)
    for i in range(0, len_word):
        if end <= len_word:
            if str_hash.get(word[start:end]):
                str_hash[word[start:end]] = str_hash.get(word[start:end]) + 1
            else:
                str_hash[word[start:end]] = 1
            start = start + 1
            end = end + 1
        else:
            break

keys_sorted = sorted(str_hash.items())
for ngram in sorted(keys_sorted, key=lambda x: x[1], reverse=True):
    print("\"%s\" with a frequency of %s" % (ngram[0], ngram[1]))