I came across the following programming interview problem:
Challenge 1: N-grams
An N-gram is a sequence of N consecutive characters from a given word. For the word "pilot" there are three 3-grams: "pil", "ilo" and "lot". For a given set of words and an n-gram length, your task is to:
• write a function that finds the n-gram that is the most frequent one among all the words
• print the result to the standard output (stdout)
• if there are multiple n-grams having the same maximum frequency please print the one that is the smallest lexicographically (the first one according to the dictionary sorting order)
Note that your function will receive the following arguments:
• text
○ which is a string containing words separated by whitespaces
• ngramLength
○ which is an integer value giving the length of the n-gram
Data constraints
• the length of the text string will not exceed 250,000 characters
• all words are alphanumeric (they contain only English letters a-z, A-Z and numbers 0-9)
Efficiency constraints
• your function is expected to print the result in less than 2 seconds
Example
Input
text: "aaaab a0a baaab c"
ngramLength: 3
Output
aaa
Explanation
For the input shown above, the 3-grams sorted by frequency are:
• "aaa" with a frequency of 3
• "aab" with a frequency of 2
• "a0a" with a frequency of 1
• "baa" with a frequency of 1
If I had only one hour to solve the problem and chose to use the C language: would implementing a hash table to count the frequency of the N-grams be a good idea? There is no hash table implementation in the C standard library...
If so, I was thinking of implementing the hash table using separate chaining with ordered linked lists. Those implementations reduce the time you have left to solve the problem...
Is this the fastest option?
Thank you!!!
Answer 0 (score: 5)
If implementation efficiency is important and you are using C, I would initialize an array of pointers to the start of each n-gram in the string, use qsort to sort the pointers according to the n-gram each one is part of, and then loop over the sorted array to work out the counts.
This should execute fast enough, without writing any fancy data structures.
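Since the later answers in this thread use Python, here is a short Python sketch of the same sort-and-scan idea (the helper name `top_ngram` is mine, not from the answer): sorting groups identical n-grams into contiguous runs, so one linear pass finds the longest run, and ties resolve to the lexicographically smallest n-gram for free because the list is sorted.

```python
def top_ngram(text, n):
    # Collect every n-gram of every whitespace-separated word.
    grams = []
    for word in text.split():
        for i in range(len(word) - n + 1):
            grams.append(word[i:i + n])
    # Sorting groups equal n-grams together; scan the runs and keep
    # the longest one. A strictly-greater test keeps the earlier
    # (lexicographically smaller) n-gram on ties.
    grams.sort()
    best, best_count = None, 0
    i = 0
    while i < len(grams):
        j = i
        while j < len(grams) and grams[j] == grams[i]:
            j += 1
        if j - i > best_count:
            best, best_count = grams[i], j - i
        i = j
    return best

print(top_ngram("aaaab a0a baaab c", 3))  # prints "aaa"
```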
Answer 1 (score: 1)
Sorry for posting Python, but this is what I would do. You may get some ideas for the algorithm. Note that this program handles an order of magnitude more words.
from itertools import groupby

someText = "thibbbs is a test and aaa it may haaave some abbba reptetitions "
someText *= 40000
print len(someText)

n = 3

ngrams = []
for word in filter(lambda x: len(x) >= n, someText.split(" ")):
    for i in range(len(word)-n+1):
        ngrams.append(word[i:i+n])
        # you could inline all logic here
        # add to an ordered list for which the frequency is the key for ordering and the payload the actual word

ngrams_freq = list([[len(list(group)), key] for key, group in groupby(sorted(ngrams, key=str.lower))])
ngrams_freq_sorted = sorted(ngrams_freq, reverse=True)

popular_ngrams = []
for freq in ngrams_freq_sorted:
    if freq[0] == ngrams_freq_sorted[0][0]:
        popular_ngrams.append(freq[1])
    else:
        break

print "Most popular ngram: " + sorted(popular_ngrams, key=str.lower)[0]

# > 2560000
# > Most popular ngram: aaa
# > [Finished in 1.3s]
Answer 2 (score: 1)
So, the basic approach for this problem is as follows.
You can find my C++ solution here: http://ideone.com/MNFSis
Given:
const unsigned int MAX_STR_LEN = 250000;
const unsigned short NGRAM = 3;
const unsigned int NGRAMS = MAX_STR_LEN-NGRAM;
//we will need a maximum of "the length of our string" - "the length of our n-gram"
//places to store our n-grams, and each ngram is specified by NGRAM+1 for '\0'
char ngrams[NGRAMS][NGRAM+1] = { 0 };
Then, the first step - here is the code:
const char *ptr = str;
int idx = 0;

//notTerminated checks ptr[0] to ptr[NGRAM-1] are not '\0'
while (notTerminated(ptr)) {
    //noSpace checks ptr[0] to ptr[NGRAM-1] are isalpha()
    if (noSpace(ptr)) {
        //safely copy our current n-gram over to the ngrams array
        //we're iterating over ptr and because we're here we know ptr and the next NGRAM spaces
        //are valid letters
        for (int i=0; i<NGRAM; i++) {
            ngrams[idx][i] = ptr[i];
        }
        ngrams[idx][NGRAM] = '\0'; //important to zero-terminate
        idx++;
    }
    ptr++;
}
At this point we have a list of all the n-grams. Let's find the most popular one:
FreqNode head = { "HEAD", 0, 0, 0 }; //the start of our list

for (int i=0; i<NGRAMS; i++) {
    if (ngrams[i][0] == '\0') break;
    //insertFreqNode takes a start node, this is where we will start to search for duplicates
    //the simplest description is like this:
    // 1 we search from head down each child, if we find a node that has text equal to
    //   ngrams[i] then we update its frequency count
    // 2 if the freq is >= to the current winner we place this as head.next
    // 3 after the program is complete, our most popular nodes will be the first nodes
    // I have not implemented sorting of these - it's an exercise for the reader ;)
    insertFreqNode(&head, ngrams[i]);
}

//as the list is ordered, head.next will always be the most popular n-gram
cout << "Winner is: " << head.next->str << " with " << head.next->freq << " occurrences" << endl;
Good luck!
Answer 3 (score: 1)
Just for fun, I wrote an SQL version (SQL Server 2012):
if object_id('dbo.MaxNgram','IF') is not null
    drop function dbo.MaxNgram;
go

create function dbo.MaxNgram(
     @text   varchar(max)
    ,@length int
) returns table with schemabinding as
return
with
Delimiter(c) as (select ' '),
E1(N) as (
    select 1 from (values
        (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ) T(N)
),
E2(N) as (
    select 1 from E1 a cross join E1 b
),
E6(N) as (
    select 1 from E2 a cross join E2 b cross join E2 c
),
tally(N) as (
    select top(isnull(datalength(@text),0))
        ROW_NUMBER() over (order by (select NULL))
    from E6
),
cteStart(N1) as (
    select 1 union all
    select t.N+1 from tally t cross join delimiter
    where substring(@text,t.N,1) = delimiter.c
),
cteLen(N1,L1) as (
    select s.N1,
        isnull(nullif(charindex(delimiter.c,@text,s.N1),0) - s.N1,8000)
    from cteStart s
    cross join delimiter
),
cteWords as (
    select ItemNumber = row_number() over (order by l.N1),
           Item       = substring(@text, l.N1, l.L1)
    from cteLen l
),
mask(N) as (
    select top(@length) row_Number() over (order by (select NULL))
    from E6
),
topItem as (
    select top 1
         substring(Item,m.N,@length) as Ngram
        ,count(*)                    as Length
    from cteWords w
    cross join mask m
    where m.N <= datalength(w.Item) + 1 - @length
      and @length <= datalength(w.Item)
    group by
        substring(Item,m.N,@length)
    order by 2 desc, 1
)
select d.s
from (
    select top 1 NGram,Length
    from topItem
) t
cross apply (values (cast(NGram as varchar)),(cast(Length as varchar))) d(s)
;
go
go
When called with the sample input provided by the OP,
set nocount on;
select s as [ ] from MaxNgram(
    'aaaab a0a baaab c aab'
   ,3
);
go
it produces the desired result:
------------------------------
aaa
3
Answer 4 (score: 0)
You can convert each trigram into a RADIX50 code. See http://en.wikipedia.org/wiki/DEC_Radix-50
In radix50, the output value for a trigram fits into a 16-bit unsigned integer.
After that, you can use the radix-encoded trigram as an index into an array.
So your code would look something like this:
uint32_t counters[1 << 16]; // 64K counters (32-bit, since a count can exceed 65535)

bzero(counters, sizeof(counters));

for(const char *p = txt; p[2] != 0; p++)
    counters[radix50(p)]++;
After that, just search for the maximum value in the array, and decode that index back into a trigram.
I used this trick about 10 years ago when I implemented the Wilbur-Khovayko algorithm for fuzzy search.
You can download the source here: http://itman.narod.ru/source/jwilbur1.tar.gz.
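To make the packing idea above concrete, here is a hedged Python sketch. The 36-character alphabet and the `pack`/`unpack` helpers are my own illustration, not DEC's actual encoding (real RADIX-50 uses a specific 40-symbol alphabet); the point is only that any alphabet of at most 40 symbols keeps a trigram code below 40**3 = 64000, i.e. inside 16 bits.

```python
# Hypothetical 36-symbol alphabet (a-z, 0-9), padded conceptually to base 40.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
CODE = {ch: i for i, ch in enumerate(ALPHABET)}

def pack(trigram):
    # Treat the trigram as a 3-digit base-40 number.
    a, b, c = (CODE[ch] for ch in trigram.lower())
    return (a * 40 + b) * 40 + c

def unpack(code):
    # Inverse of pack: peel off the base-40 digits.
    a, rem = divmod(code, 1600)
    b, c = divmod(rem, 40)
    return ALPHABET[a] + ALPHABET[b] + ALPHABET[c]

# Count trigrams of each word directly in a flat array of counters.
counters = [0] * (40 ** 3)
for word in "aaaab a0a baaab c".split():
    for i in range(len(word) - 2):
        counters[pack(word[i:i + 3])] += 1

best = max(range(len(counters)), key=lambda i: counters[i])
print(unpack(best), counters[best])  # prints "aaa 3"
```

One caveat: `max` breaks ties by the smallest code, which is the alphabet's order, not ASCII order (digits sort after letters here), so the tie-break differs slightly from the problem's dictionary-order rule.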
Answer 5 (score: 0)
If you are not bound to C: I wrote this Python script in about 10 minutes. It processes a 1.5 MB file, containing more than 265,000 words, looking for 3-grams in 0.4 s (apart from printing the values on screen).
The text used for the test is James Joyce's Ulysses; you can find it for free here: https://www.gutenberg.org/ebooks/4300
The word separators here are both the space and the carriage return \n.
import sys

text = open(sys.argv[1], 'r').read()
ngram_len = int(sys.argv[2])
text = text.replace('\n', ' ')
words = [word.lower() for word in text.split(' ')]
ngrams = {}

for word in words:
    word_len = len(word)
    if word_len < ngram_len:
        continue
    for i in range(0, (word_len - ngram_len) + 1):
        ngram = word[i:i+ngram_len]
        if ngram in ngrams:
            ngrams[ngram] += 1
        else:
            ngrams[ngram] = 1

ngrams_by_freq = {}
for key, val in ngrams.items():
    if val not in ngrams_by_freq:
        ngrams_by_freq[val] = [key]
    else:
        ngrams_by_freq[val].append(key)

ngrams_by_freq = sorted(ngrams_by_freq.items())
for key in ngrams_by_freq:
    print('{} with frequency of {}'.format(key[1:], key[0]))
Answer 6 (score: 0)
You can solve this problem in O(nk) time, where n is the number of words and k is the average number of n-grams per word.
You are right in thinking that a hash table is a good solution to the problem.
However, since you have limited time to code a solution, I would suggest using open addressing instead of linked lists. The implementation may be simpler: if you hit a collision, you just walk further along the array.
Also, make sure to allocate enough memory for your hash table: something around twice the expected number of n-grams should be fine. Since the expected number of n-grams is at most 250,000, a hash table of 500,000 slots should suffice.
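As a rough illustration of the open-addressing scheme described above, here is a Python sketch (Python's built-in dict already hashes internally, so this explicit table is purely didactic; the class and method names are mine): two parallel arrays hold keys and counts, and on a collision we linearly probe to the next slot.

```python
class OpenAddressingCounter:
    # Fixed-capacity table with linear probing: on a collision,
    # step to the next slot until we find the key or an empty slot.
    def __init__(self, capacity=500000):
        self.capacity = capacity
        self.keys = [None] * capacity
        self.counts = [0] * capacity

    def increment(self, key):
        i = hash(key) % self.capacity
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % self.capacity  # probe the next slot
        self.keys[i] = key
        self.counts[i] += 1

    def most_common(self):
        # Highest count wins; ties go to the lexicographically smaller key.
        best = None
        for key, count in zip(self.keys, self.counts):
            if key is None:
                continue
            if best is None or count > best[1] or (count == best[1] and key < best[0]):
                best = (key, count)
        return best

table = OpenAddressingCounter()
for word in "aaaab a0a baaab c".split():
    for i in range(len(word) - 2):
        table.increment(word[i:i + 3])

print(table.most_common())  # prints "('aaa', 3)"
```

Sizing the table to about twice the expected number of keys, as suggested above, keeps probe chains short.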
In terms of coding speed, the small input length (250,000 characters) also makes sorting and counting a feasible option. The quickest way is probably to generate an array of pointers to each n-gram, sort the array with an appropriate comparator, and then walk along it while keeping track of which n-gram appears most often.
Answer 7 (score: 0)
A simple Python solution for this question:
your_str = "aaaab a0a baaab c"
str_list = your_str.split(" ")
str_hash = {}
ngram_len = 3

for str in str_list:
    start = 0
    end = ngram_len
    len_word = len(str)
    for i in range(0, len_word):
        if end <= len_word:
            if str_hash.get(str[start:end]):
                str_hash[str[start:end]] = str_hash.get(str[start:end]) + 1
            else:
                str_hash[str[start:end]] = 1
            start = start + 1
            end = end + 1
        else:
            break

keys_sorted = sorted(str_hash.items())
for ngram in sorted(keys_sorted, key=lambda x: x[1], reverse=True):
    print "\"%s\" with a frequency of %s" % (ngram[0], ngram[1])