I came across the following programming interview problem:
Challenge 1: N-grams
An N-gram is a sequence of N consecutive characters from a given word. For the word "pilot" there are three 3-grams: "pil", "ilo" and "lot". For a given set of words and an n-gram length, your task is to:
• write a function that finds the n-gram that is the most frequent one among all the words
• print the result to the standard output (stdout)
• if there are multiple n-grams having the same maximum frequency please print the one that is the smallest lexicographically (the first one according to the dictionary sorting order)
Note that your function will receive the following arguments:
• text
○ which is a string containing words separated by whitespaces
• ngramLength
○ which is an integer value giving the length of the n-gram
Data constraints
• the length of the text string will not exceed 250,000 characters
• all words are alphanumeric (they contain only English letters a-z, A-Z and numbers 0-9)
Efficiency constraints
• your function is expected to print the result in less than 2 seconds
Example
Input
text: "aaaab a0a baaab c"
ngramLength: 3
Output
aaa
Explanation
For the input shown above, the 3-grams sorted by frequency are:
• "aaa" with a frequency of 3
• "aab" with a frequency of 2
• "a0a" with a frequency of 1
• "baa" with a frequency of 1
If I had only one hour to solve the problem and chose to use the C language: would implementing a hash table to count the frequency of the N-grams be a good idea? There is no hash table implementation in the C standard library...
If so, I was thinking of implementing the hash table using separate chaining with ordered linked lists. Those implementations reduce the time you have left to solve the problem...
Is this the fastest option?
Thank you!!!
Answer 0 (score: 5)
If implementation efficiency is important and you are using C, I would initialize an array of pointers to the start of each n-gram in the string, use qsort to sort the pointers according to the n-gram each one is part of, and then loop over the sorted array to work out the counts.
This should execute fast enough, without writing any fancy data structures.
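Since the later answers in this thread use Python, here is a short Python sketch of the same sort-and-scan idea (the helper name `top_ngram` is mine, not from the answer): sorting groups identical n-grams into contiguous runs, so one linear pass finds the longest run, and ties resolve to the lexicographically smallest n-gram for free because the list is sorted.

```python
def top_ngram(text, n):
    # Collect every n-gram of every whitespace-separated word.
    grams = []
    for word in text.split():
        for i in range(len(word) - n + 1):
            grams.append(word[i:i + n])
    # Sorting groups equal n-grams together; scan the runs and keep
    # the longest one. A strictly-greater test keeps the earlier
    # (lexicographically smaller) n-gram on ties.
    grams.sort()
    best, best_count = None, 0
    i = 0
    while i < len(grams):
        j = i
        while j < len(grams) and grams[j] == grams[i]:
            j += 1
        if j - i > best_count:
            best, best_count = grams[i], j - i
        i = j
    return best

print(top_ngram("aaaab a0a baaab c", 3))  # prints "aaa"
```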
Answer 1 (score: 1)
Sorry for posting Python, but this is what I would do. You may get some ideas for the algorithm. Note that this program handles an order of magnitude more words.
from itertools import groupby

someText = "thibbbs is a test and aaa it may haaave some abbba reptetitions "
someText *= 40000
print len(someText)

n = 3

ngrams = []
for word in filter(lambda x: len(x) >= n, someText.split(" ")):
    for i in range(len(word)-n+1):
        ngrams.append(word[i:i+n])
        # you could inline all logic here
        # add to an ordered list for which the frequency is the key for ordering and the payload the actual word

ngrams_freq = list([[len(list(group)), key] for key, group in groupby(sorted(ngrams, key=str.lower))])
ngrams_freq_sorted = sorted(ngrams_freq, reverse=True)

popular_ngrams = []
for freq in ngrams_freq_sorted:
    if freq[0] == ngrams_freq_sorted[0][0]:
        popular_ngrams.append(freq[1])
    else:
        break

print "Most popular ngram: " + sorted(popular_ngrams, key=str.lower)[0]

# > 2560000
# > Most popular ngram: aaa
# > [Finished in 1.3s]
Answer 2 (score: 1)
So, the basic approach for this problem is as follows.
You can find my C++ solution here: http://ideone.com/MNFSis
Given:
const unsigned int MAX_STR_LEN = 250000;
const unsigned short NGRAM = 3;
const unsigned int NGRAMS = MAX_STR_LEN-NGRAM;
//we will need a maximum of "the length of our string" - "the length of our n-gram"
//places to store our n-grams, and each ngram is specified by NGRAM+1 for '\0'
char ngrams[NGRAMS][NGRAM+1] = { 0 };
Then, the first step - here is the code:
const char *ptr = str;
int idx = 0;

//notTerminated checks ptr[0] to ptr[NGRAM-1] are not '\0'
while (notTerminated(ptr)) {
    //noSpace checks ptr[0] to ptr[NGRAM-1] are isalpha()
    if (noSpace(ptr)) {
        //safely copy our current n-gram over to the ngrams array
        //we're iterating over ptr and because we're here we know ptr and the next NGRAM spaces
        //are valid letters
        for (int i=0; i<NGRAM; i++) {
            ngrams[idx][i] = ptr[i];
        }
        ngrams[idx][NGRAM] = '\0'; //important to zero-terminate
        idx++;
    }
    ptr++;
}
At this point we have a list of all the n-grams. Let's find the most popular one:
FreqNode head = { "HEAD", 0, 0, 0 }; //the start of our list

for (int i=0; i<NGRAMS; i++) {
    if (ngrams[i][0] == '\0') break;
    //insertFreqNode takes a start node, this is where we will start to search for duplicates
    //the simplest description is like this:
    // 1 we search from head down each child, if we find a node that has text equal to
    //   ngrams[i] then we update its frequency count
    // 2 if the freq is >= to the current winner we place this as head.next
    // 3 after the program is complete, our most popular nodes will be the first nodes
    // I have not implemented sorting of these - it's an exercise for the reader ;)
    insertFreqNode(&head, ngrams[i]);
}

//as the list is ordered, head.next will always be the most popular n-gram
cout << "Winner is: " << head.next->str << " with " << head.next->freq << " occurrences" << endl;
Good luck!
Answer 3 (score: 1)
Just for fun, I wrote an SQL version (SQL Server 2012):
if object_id('dbo.MaxNgram','IF') is not null
    drop function dbo.MaxNgram;
go

create function dbo.MaxNgram(
     @text   varchar(max)
    ,@length int
) returns table with schemabinding as
return
with
Delimiter(c) as (select ' '),
E1(N) as (
    select 1 from (values
        (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ) T(N)
),
E2(N) as (
    select 1 from E1 a cross join E1 b
),
E6(N) as (
    select 1 from E2 a cross join E2 b cross join E2 c
),
tally(N) as (
    select top(isnull(datalength(@text),0))
        ROW_NUMBER() over (order by (select NULL))
    from E6
),
cteStart(N1) as (
    select 1 union all
    select t.N+1 from tally t cross join delimiter
    where substring(@text,t.N,1) = delimiter.c
),
cteLen(N1,L1) as (
    select s.N1,
        isnull(nullif(charindex(delimiter.c,@text,s.N1),0) - s.N1,8000)
    from cteStart s
    cross join delimiter
),
cteWords as (
    select ItemNumber = row_number() over (order by l.N1),
           Item       = substring(@text, l.N1, l.L1)
    from cteLen l
),
mask(N) as (
    select top(@length) row_Number() over (order by (select NULL))
    from E6
),
topItem as (
    select top 1
         substring(Item,m.N,@length) as Ngram
        ,count(*)                    as Length
    from cteWords w
    cross join mask m
    where m.N <= datalength(w.Item) + 1 - @length
      and @length <= datalength(w.Item)
    group by
        substring(Item,m.N,@length)
    order by 2 desc, 1
)
select d.s
from (
    select top 1 NGram,Length
    from topItem
) t
cross apply (values (cast(NGram as varchar)),(cast(Length as varchar))) d(s)
;
go
go
When called with the sample input provided by the OP,
set nocount on;
select s as [ ] from MaxNgram(
    'aaaab a0a baaab c aab'
   ,3
);
go
it produces the desired result:
------------------------------
aaa
3
Answer 4 (score: 0)
You can convert each trigram into a RADIX50 code. See http://en.wikipedia.org/wiki/DEC_Radix-50
In radix50, the output value for a trigram fits into a 16-bit unsigned integer.
After that, you can use the radix-encoded trigram as an index into an array.
So your code would look something like this:
uint32_t counters[1 << 16]; // 64K counters (32-bit, since a count can exceed 65535)

bzero(counters, sizeof(counters));

for(const char *p = txt; p[2] != 0; p++)
    counters[radix50(p)]++;
After that, just search for the maximum value in the array, and decode that index back into a trigram.
I used this trick about 10 years ago when I implemented the Wilbur-Khovayko algorithm for fuzzy search.
You can download the source here: http://itman.narod.ru/source/jwilbur1.tar.gz.
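To make the packing idea above concrete, here is a hedged Python sketch. The 36-character alphabet and the `pack`/`unpack` helpers are my own illustration, not DEC's actual encoding (real RADIX-50 uses a specific 40-symbol alphabet); the point is only that any alphabet of at most 40 symbols keeps a trigram code below 40**3 = 64000, i.e. inside 16 bits.

```python
# Hypothetical 36-symbol alphabet (a-z, 0-9), padded conceptually to base 40.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
CODE = {ch: i for i, ch in enumerate(ALPHABET)}

def pack(trigram):
    # Treat the trigram as a 3-digit base-40 number.
    a, b, c = (CODE[ch] for ch in trigram.lower())
    return (a * 40 + b) * 40 + c

def unpack(code):
    # Inverse of pack: peel off the base-40 digits.
    a, rem = divmod(code, 1600)
    b, c = divmod(rem, 40)
    return ALPHABET[a] + ALPHABET[b] + ALPHABET[c]

# Count trigrams of each word directly in a flat array of counters.
counters = [0] * (40 ** 3)
for word in "aaaab a0a baaab c".split():
    for i in range(len(word) - 2):
        counters[pack(word[i:i + 3])] += 1

best = max(range(len(counters)), key=lambda i: counters[i])
print(unpack(best), counters[best])  # prints "aaa 3"
```

One caveat: `max` breaks ties by the smallest code, which is the alphabet's order, not ASCII order (digits sort after letters here), so the tie-break differs slightly from the problem's dictionary-order rule.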
Answer 5 (score: 0)
If you are not bound to C: I wrote this Python script in about 10 minutes. It processes a 1.5 MB file, containing more than 265,000 words, looking for 3-grams in 0.4 s (apart from printing the values on screen).
The text used for the test is James Joyce's Ulysses; you can find it for free here: https://www.gutenberg.org/ebooks/4300
The word separators here are both the space and the carriage return \n.
import sys

text = open(sys.argv[1], 'r').read()
ngram_len = int(sys.argv[2])
text = text.replace('\n', ' ')
words = [word.lower() for word in text.split(' ')]
ngrams = {}

for word in words:
    word_len = len(word)
    if word_len < ngram_len:
        continue
    for i in range(0, (word_len - ngram_len) + 1):
        ngram = word[i:i+ngram_len]
        if ngram in ngrams:
            ngrams[ngram] += 1
        else:
            ngrams[ngram] = 1

ngrams_by_freq = {}
for key, val in ngrams.items():
    if val not in ngrams_by_freq:
        ngrams_by_freq[val] = [key]
    else:
        ngrams_by_freq[val].append(key)

ngrams_by_freq = sorted(ngrams_by_freq.items())
for key in ngrams_by_freq:
    print('{} with frequency of {}'.format(key[1:], key[0]))
Answer 6 (score: 0)
You can solve this problem in O(nk) time, where n is the number of words and k is the average number of n-grams per word.
You are right in thinking that a hash table is a good solution to the problem.
However, since you have limited time to code a solution, I would suggest using open addressing instead of linked lists. The implementation may be simpler: if you hit a collision, you just walk further along the array.
Also, make sure to allocate enough memory for your hash table: something around twice the expected number of n-grams should be fine. Since the expected number of n-grams is at most 250,000, a hash table of 500,000 slots should suffice.
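As a rough illustration of the open-addressing scheme described above, here is a Python sketch (Python's built-in dict already hashes internally, so this explicit table is purely didactic; the class and method names are mine): two parallel arrays hold keys and counts, and on a collision we linearly probe to the next slot.

```python
class OpenAddressingCounter:
    # Fixed-capacity table with linear probing: on a collision,
    # step to the next slot until we find the key or an empty slot.
    def __init__(self, capacity=500000):
        self.capacity = capacity
        self.keys = [None] * capacity
        self.counts = [0] * capacity

    def increment(self, key):
        i = hash(key) % self.capacity
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % self.capacity  # probe the next slot
        self.keys[i] = key
        self.counts[i] += 1

    def most_common(self):
        # Highest count wins; ties go to the lexicographically smaller key.
        best = None
        for key, count in zip(self.keys, self.counts):
            if key is None:
                continue
            if best is None or count > best[1] or (count == best[1] and key < best[0]):
                best = (key, count)
        return best

table = OpenAddressingCounter()
for word in "aaaab a0a baaab c".split():
    for i in range(len(word) - 2):
        table.increment(word[i:i + 3])

print(table.most_common())  # prints "('aaa', 3)"
```

Sizing the table to about twice the expected number of keys, as suggested above, keeps probe chains short.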
In terms of coding speed, the small input length (250,000 characters) also makes sorting and counting a feasible option. The quickest way is probably to generate an array of pointers to each n-gram, sort the array with an appropriate comparator, and then walk along it while keeping track of which n-gram appears most often.
Answer 7 (score: 0)
A simple Python solution for this question:
your_str = "aaaab a0a baaab c"
str_list = your_str.split(" ")
str_hash = {}
ngram_len = 3

for str in str_list:
    start = 0
    end = ngram_len
    len_word = len(str)
    for i in range(0, len_word):
        if end <= len_word:
            if str_hash.get(str[start:end]):
                str_hash[str[start:end]] = str_hash.get(str[start:end]) + 1
            else:
                str_hash[str[start:end]] = 1
            start = start + 1
            end = end + 1
        else:
            break

keys_sorted = sorted(str_hash.items())
for ngram in sorted(keys_sorted, key=lambda x: x[1], reverse=True):
    print "\"%s\" with a frequency of %s" % (ngram[0], ngram[1])