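If you have 65,536 random English words, each 1 to 32 characters long, and you need to count occurrences and sort them either in dictionary order or by frequency rank, how would you structure the data, and which sorting technique would process it fastest?
Answer 0: (score: 17)
65,000 words is, in all seriousness, a minuscule sorting problem. Unless you have to re-sort many times per minute, I suggest you just use the qsort() built into the language. After all, that's what it's for.
I suggest using a simple array of char pointers for the alpha order. To maintain the frequency order, you can use a structure like this:
typedef struct {
    char *word;      // points to one of the strings.
    int frequency;   // counts the number of occurrences.
} tFreq;
in another array, which you can populate fully whenever you create or modify the alpha-sorted pointer array (see below for my reasoning on why this seemingly inefficient process is suitable).
As an example of the speed, consider the following code: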
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define MAXWDS 66000

static char *words[MAXWDS];

// Compare two array elements, each of which is a pointer to a string.
static int compFn (const void *p1, const void *p2) {
    return strcmp (*((const char * const *)p1), *((const char * const *)p2));
}

int main() {
    char *pWord;
    int i, j, sz;
    time_t t0, t1;

    // Generate random words of 1 to 32 uppercase letters.
    srand ((unsigned) time (0));
    for (i = 0; i < MAXWDS; i++) {
        sz = rand() % 32 + 1;
        pWord = words[i] = malloc (sz + 1);
        for (j = 0; j < sz; j++)
            pWord[j] = 'A' + (rand() % 26);
        pWord[sz] = '\0';
    }

    // Time the sort; note that time() only has one-second resolution.
    t0 = time(0);
    qsort (words, MAXWDS, sizeof (char*), compFn);
    t1 = time(0);

    printf ("Time taken for %7d elements was %2d second(s)\n",
        MAXWDS, (int)(t1 - t0));
    return 0;
}
On a 3GHz dual-core Intel chip, this is the output for a few selected values of MAXWDS:
MAXWDS Output
--------- ------
66,000 Time taken for 66000 elements was 0 second(s)
100,000 Time taken for 100000 elements was 0 second(s)
500,000 Time taken for 500000 elements was 0 second(s)
600,000 Time taken for 600000 elements was 1 second(s)
1,000,000 Time taken for 1000000 elements was 1 second(s)
2,000,000 Time taken for 2000000 elements was 2 second(s)
3,000,000 Time taken for 3000000 elements was 5 second(s)
4,000,000 Time taken for 4000000 elements was 7 second(s)
5,000,000 Time taken for 5000000 elements was 9 second(s)
6,000,000 Time taken for 6000000 elements was 10 second(s)
7,000,000 Time taken for 7000000 elements was 11 second(s)
9,999,999 Time taken for 9999999 elements was 21 second(s)
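So, as you can see, qsort is fairly efficient for the data set sizes you're talking about.
In fact, implementing the whole process of maintaining two sorted lists, as in the code below, shows just how insignificant 66,000 elements are. The basic premise is to:
- modify the alpha-sorted strings as needed, then do a full alpha sort of them (t0 to t1).
- use the fact that they are sorted to transfer them to the frequency array, one element per distinct word along with its count (t1 to t2).
- sort that frequency array as well (t2 to t3).
The following code shows how it's done. The only slightly tricky part is the transfer from the alpha array to the frequency array.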
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define MAXWDS 66000

typedef struct {
    char *word;
    int frequency;
} tFreq;

static char *words[MAXWDS];
static tFreq freq[MAXWDS];
static int numFreq;

// Alpha order: compare two pointers to strings.
static int compFn (const void *p1, const void *p2) {
    return strcmp (*((const char * const *)p1), *((const char * const *)p2));
}

// Frequency order, highest first.
static int compFn2 (const void *p1, const void *p2) {
    return ((const tFreq*)p2)->frequency - ((const tFreq*)p1)->frequency;
}

int main() {
    char *pWord;
    int i, j, sz;
    time_t t0, t1, t2, t3;

    // Generate random words.
    srand ((unsigned) time (0));
    for (i = 0; i < MAXWDS; i++) {
        sz = rand() % 32 + 1;
        pWord = words[i] = malloc (sz + 1);
        for (j = 0; j < sz; j++)
            pWord[j] = 'A' + (rand() % 26);
        pWord[sz] = '\0';
    }
    t0 = time(0);

    // Alpha sort.
    qsort (words, MAXWDS, sizeof (char*), compFn);
    t1 = time(0);

    // Pre-condition to simplify loop: make first word with zero frequency.
    freq[0].word = words[0];
    freq[0].frequency = 0;

    // Transfer to frequency array.
    for (i = numFreq = 0; i < MAXWDS; i++) {
        // If new word, add it and set frequency to 0.
        if (strcmp (freq[numFreq].word, words[i]) != 0) {
            numFreq++;
            freq[numFreq].word = words[i];
            freq[numFreq].frequency = 0;
        }
        // Increment frequency (for old and new words).
        freq[numFreq].frequency++;
    }
    numFreq++;
    t2 = time(0);

    // Sort frequency array.
    qsort (freq, numFreq, sizeof (tFreq), compFn2);
    t3 = time(0);

    // Output stats.
    printf ("Time taken for sorting %5d elements was %d seconds.\n",
        MAXWDS, (int)(t1 - t0));
    printf ("Time taken for transferring %5d elements was %d seconds.\n",
        numFreq, (int)(t2 - t1));
    printf ("Time taken for sorting %5d elements was %d seconds.\n",
        numFreq, (int)(t3 - t2));
    printf ("Time taken for everything %5s was %d seconds.\n\n",
        "", (int)(t3 - t0));

    // Show the most frequent words (at most 28 of them).
    for (i = 0; i < 28 && i < numFreq; i++) {
        printf ("[%-15s] %5d\n", freq[i].word, freq[i].frequency);
    }
    return 0;
}
The output for 66,000 random strings is (the first few strings are included so you can see that the sort worked):
Time taken for sorting 66000 elements was 0 seconds.
Time taken for transferring 62422 elements was 0 seconds.
Time taken for sorting 62422 elements was 0 seconds.
Time taken for everything was 0 seconds.
[Z ] 105
[H ] 97
[X ] 95
[P ] 90
[D ] 89
[K ] 87
[T ] 86
[J ] 85
[G ] 84
[F ] 83
[Q ] 83
[W ] 81
[V ] 81
[M ] 80
[I ] 79
[O ] 78
[A ] 78
[B ] 75
[U ] 74
[N ] 73
[C ] 73
[S ] 70
[Y ] 68
[L ] 65
[E ] 60
[R ] 59
[NQ ] 8
[XD ] 8
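So, even if you perform these operations every time you insert or delete a value, they will have no noticeable impact (unless, obviously, you are doing it more than once every couple of seconds, but then you would consider batching the changes for efficiency).
Answer 1: (score: 4)
See http://www.sorting-algorithms.com/ for visual representations of different sorting methods.
Answer 2: (score: 2)
You could write a thesis on "fastest" and still not get a concrete answer.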
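Answer 3: (score: 1)
Answer 4: (score: 1)
I would use whatever sorting algorithm my runtime library happens to provide. Typically, sort() uses quicksort.
Don't worry about the choice of sorting algorithm until you know, because you have measured it, that the standard one isn't working well enough.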
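Measuring is cheap. Here is a minimal sketch, assuming the standard clock() function is fine-grained enough on your platform; the names cmp_str and time_sort_ms are made up for this illustration:

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Comparator for an array of char* (hypothetical helper name). */
static int cmp_str (const void *p1, const void *p2) {
    return strcmp (*(char * const *)p1, *(char * const *)p2);
}

/* Sorts words[0..n-1] with qsort and returns the CPU time used,
   in milliseconds, as reported by clock(). */
static double time_sort_ms (char **words, size_t n) {
    clock_t c0 = clock ();
    qsort (words, n, sizeof (char *), cmp_str);
    clock_t c1 = clock ();
    return 1000.0 * (double)(c1 - c0) / CLOCKS_PER_SEC;
}
```

Call time_sort_ms on your real word array a few times and only look for a different algorithm if the number you get is actually a problem.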
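Answer 5: (score: 0)
Merge sort should work well here, and it is easy to get working in C.
Answer 6: (score: 0)
You can use
struct elem { char *word; int frequency; };   // pointer to 'string' word
struct elem dict[1<<16];                      // number of words
Use the standard qsort to sort by word or by frequency, or use a second array if you need both orders at the same time.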
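Answer 7: (score: 0)
The choice of sorting algorithm depends on how much data you have (65k is not much) and on the trade-off you choose between time and memory. If you want to retrieve data quickly, you have to use more memory. On the other hand, if you decide to save memory, you won't be able to find records quickly.
The choice of algorithm is very simple: use whatever your language's library provides, unless you have evidence that it doesn't work well enough.
You need the data sorted by two criteria, so you actually need two sorted arrays. Both should be arrays of pointers of some kind.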
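One way to read "two sorted arrays of pointers" is a single array of records plus two index arrays that point into it. The sketch below assumes that reading; the Entry type and the names by_word, by_freq and build_indexes are invented for the example, and the tie-break on the word is my own choice:

```c
#include <stdlib.h>
#include <string.h>

/* One record per distinct word (hypothetical layout). */
typedef struct {
    const char *word;
    int frequency;
} Entry;

/* Orders Entry pointers by word, ascending. */
static int by_word (const void *p1, const void *p2) {
    const Entry *a = *(const Entry * const *)p1;
    const Entry *b = *(const Entry * const *)p2;
    return strcmp (a->word, b->word);
}

/* Orders Entry pointers by frequency, highest first;
   equal frequencies fall back to word order. */
static int by_freq (const void *p1, const void *p2) {
    const Entry *a = *(const Entry * const *)p1;
    const Entry *b = *(const Entry * const *)p2;
    if (a->frequency != b->frequency)
        return (a->frequency < b->frequency) - (a->frequency > b->frequency);
    return strcmp (a->word, b->word);
}

/* Fills both index arrays with pointers into entries[] and sorts each.
   alpha[] and freq[] must each have room for n pointers. */
static void build_indexes (Entry *entries, size_t n,
                           Entry **alpha, Entry **freq) {
    for (size_t i = 0; i < n; i++)
        alpha[i] = freq[i] = &entries[i];
    qsort (alpha, n, sizeof (Entry *), by_word);
    qsort (freq, n, sizeof (Entry *), by_freq);
}
```

The records themselves never move, so each word is stored once and the extra memory is two pointers per word.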
Answer 8: (score: 0)
It sounds as though you have to sort it in two different ways: in dictionary order and by frequency.
Answer 9: (score: 0)
Use a trie. That way, both "sorts" become simple traversals of the graph.
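As a rough sketch of the idea, assuming words contain only the letters 'A' to 'Z' and are at most 32 characters long (the TrieNode layout and the trie_* names are made up here): each node keeps a count, and a depth-first walk visits the words in dictionary order. This covers only the dictionary-order traversal; how the frequency order is meant to come out of a traversal is not spelled out, so in this sketch the collected counts would still need sorting by frequency.

```c
#include <stdlib.h>
#include <string.h>

#define ALPHA  26   /* assumes words contain only 'A'..'Z' */
#define MAXLEN 32   /* longest word accepted */

typedef struct TrieNode {
    struct TrieNode *child[ALPHA];
    int count;      /* occurrences of the word ending at this node */
} TrieNode;

/* Allocates an empty node; returns NULL on allocation failure.
   Relies on all-bits-zero being a null pointer, true on common platforms. */
static TrieNode *trie_new (void) {
    return calloc (1, sizeof (TrieNode));
}

/* Adds one occurrence of word; returns 0 on success, -1 on failure. */
static int trie_add (TrieNode *root, const char *word) {
    if (strlen (word) > MAXLEN)
        return -1;
    for (; *word != '\0'; word++) {
        int c = *word - 'A';
        if (c < 0 || c >= ALPHA)
            return -1;
        if (root->child[c] == NULL && (root->child[c] = trie_new ()) == NULL)
            return -1;
        root = root->child[c];
    }
    root->count++;
    return 0;
}

/* Depth-first walk that copies each stored word and its count into
   words[] and counts[] in dictionary order, starting at index n.
   Returns the new number of entries; the caller sizes the arrays.
   buf is scratch space of at least MAXLEN + 1 bytes. */
static size_t trie_collect (const TrieNode *node, char *buf, int depth,
                            char words[][MAXLEN + 1], int counts[], size_t n) {
    if (node->count > 0) {
        buf[depth] = '\0';
        strcpy (words[n], buf);
        counts[n] = node->count;
        n++;
    }
    for (int c = 0; c < ALPHA; c++) {
        if (node->child[c] != NULL) {
            buf[depth] = (char)('A' + c);
            n = trie_collect (node->child[c], buf, depth + 1, words, counts, n);
        }
    }
    return n;
}
```

Counting happens as a side effect of insertion, so duplicates cost nothing extra. The sketch never frees its nodes, which a real program would want to do.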