请查看以下代码:
wcmapper.php (hadoop流媒体工作的映射器)
#!/usr/bin/php
<?php
//sample mapper for hadoop streaming job
$word2count = array();
// input comes from STDIN (standard input)
while (($line = fgets(STDIN)) !== false) {
// remove leading and trailing whitespace and lowercase
$line = strtolower(trim($line));
// split the line into words while removing any empty string
$words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
// increase counters
foreach ($words as $word) {
$word2count[$word] += 1;
}
}
// write the results to STDOUT (standard output)
foreach ($word2count as $word => $count) {
// tab-delimited
echo "$word\t$count\n";
}
?>
wcreducer.php (示例hadoop作业的reducer脚本)
#!/usr/bin/php
<?php
//reducer script for sample hadoop job
$word2count = array();
// input comes from STDIN
while (($line = fgets(STDIN)) !== false) {
// remove leading and trailing whitespace
$line = trim($line);
// parse the input we got from mapper.php
list($word, $count) = explode("\t", $line);
// convert count (currently a string) to int
$count = intval($count);
// sum counts
if ($count > 0) $word2count[$word] += $count;
}
ksort($word2count); // sort the words alphabetically
// write the results to STDOUT (standard output)
foreach ($word2count as $word => $count) {
echo "$word\t$count\n";
}
?>
此代码适用于使用PHP在commoncrawl数据集上的 Wordcount流式传输作业。
在这里,这些代码读取整个输入。这不是我需要的,我需要读取前100行并将它们写入文本文件。我是Hadoop,CommonCrawl和PHP的初学者。那么,我该怎么做呢?
请帮忙。
答案 0 :(得分:1)
在第一个循环中使用计数器,并在计数器达到100时停止循环。然后,有一个只读取直到输入结束的虚拟循环,然后继续使用代码(将结果写入STDOUT) 。写入结果也可以在虚拟循环之前读取,直到STDIN输入结束。示例代码如下:
...
// input comes from STDIN (standard input)
for ($i=1; $i<=100; $i++){
// read the line from STDIN; you
// can add a check to exit if done ($line == false)
$line = fgets(STDIN);
// remove leading and trailing whitespace and lowercase
$line = strtolower(trim($line));
// split the line into words while removing any empty string
$words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
// increase counters
foreach ($words as $word) {
$word2count[$word] += 1;
}
}
// write the results to STDOUT (standard output)
foreach ($word2count as $word => $count) {
// tab-delimited
echo "$word\t$count\n";
}
// Dummy loop (to consume all the mapper input; it may work
// without this loop but I am not sure if this will confuse the
// Hadoop framework; you can try it without this loop and see)
while (($line = fgets(STDIN)) !== false) {
}
答案 1 :(得分:0)
我不确定你如何定义“行”,但如果你想要单词,你可以这样做:
for ($count=0; $count<=100; $count++){
echo $word2count[$count]\t$count\n";
}