I have been trying to create a Twitter hashtag-count Hadoop program. I successfully extract the tweet text, get the hashtags, and have started trying to count them. One of the earliest problems I ran into is that many hashtags are near-identical variants of one another (test, Test, test!, T-est, and so on).
I started by purging the strings of all special characters and removing all whitespace inside the tags, but the problem remained for cases such as near-identical variants of "hawks". I implemented Dice's Coefficient algorithm in a separate class, as follows:
// Using Dice's Coefficient algorithm
public class WordSimilarity {

    public static boolean isStringSimilar(String str1, String str2) {
        return doComparison(str1, str2) >= Analyzer.getSimilarity();
    }

    /** @return lexical similarity value in the range [0,1] */
    private static double doComparison(String str1, String str2) {
        // If the strings are too small, do not compare them at all.
        try {
            if (str1.length() > 3 && str2.length() > 3) {
                ArrayList<String> pairs1 = wordLetterPairs(str1.toUpperCase());
                ArrayList<String> pairs2 = wordLetterPairs(str2.toUpperCase());
                int intersection = 0;
                int union = pairs1.size() + pairs2.size();
                for (int i = 0; i < pairs1.size(); i++) {
                    String pair1 = pairs1.get(i);
                    for (int j = 0; j < pairs2.size(); j++) {
                        String pair2 = pairs2.get(j);
                        if (pair1.equals(pair2)) {
                            intersection++;
                            // Remove the matched pair so it cannot be counted twice.
                            pairs2.remove(j);
                            break;
                        }
                    }
                }
                return (2.0 * intersection) / union;
            } else {
                return 0;
            }
        } catch (NegativeArraySizeException ex) {
            // Thrown by letterPairs() for empty tokens (e.g. consecutive whitespace).
            return 0;
        }
    }

    /** @return an ArrayList of 2-character Strings. */
    private static ArrayList<String> wordLetterPairs(String str) {
        ArrayList<String> allPairs = new ArrayList<>();
        // Tokenize the string and put the tokens/words into an array
        String[] words = str.split("\\s");
        // For each word
        for (int w = 0; w < words.length; w++) {
            // Find the pairs of characters
            String[] pairsInWord = letterPairs(words[w]);
            for (int p = 0; p < pairsInWord.length; p++) {
                allPairs.add(pairsInWord[p]);
            }
        }
        return allPairs;
    }

    /** @return an array of adjacent letter pairs contained in the input string */
    private static String[] letterPairs(String str) {
        int numPairs = str.length() - 1;
        String[] pairs = new String[numPairs];
        for (int i = 0; i < numPairs; i++) {
            pairs[i] = str.substring(i, i + 2);
        }
        return pairs;
    }
}
tl;dr: it compares two words and returns a number between 0 and 1 indicating how similar the two strings are.
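For example, if Analyzer.getSimilarity() were configured to return a threshold of 0.7 (a hypothetical value for illustration, since that class is not shown here), the scores would work out like this:

// Hypothetical sanity check, e.g. inside a main method, assuming a 0.7 threshold:
// "HAWKS" vs "HAWK" -> pairs {HA, AW, WK, KS} vs {HA, AW, WK}
// intersection = 3, union = 7, score = 6.0/7 (about 0.86) -> considered similar
boolean similar = WordSimilarity.isStringSimilar("hawks", "hawk");    // true
// "HAWKS" vs "EAGLE" -> no common pairs, score = 0 -> not similar
boolean different = WordSimilarity.isStringSimilar("hawks", "eagle"); // false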
I then created a custom WritableComparable (I intend to use it as a value in the project, although for now it is only the key):
public class Hashtag implements WritableComparable<Hashtag> {

    private Text hashtag;

    public Hashtag() {
        this.hashtag = new Text();
    }

    public Hashtag(String hashtag) {
        this.hashtag = new Text(hashtag);
    }

    public Text getHashtag() {
        return hashtag;
    }

    public void setHashtag(String hashtag) {
        // Remove characters that add no information to the analysis, but cause problems to the result
        this.hashtag = new Text(hashtag);
    }

    public void setHashtag(Text hashtag) {
        this.hashtag = hashtag;
    }

    // compareTo uses the WordSimilarity algorithm to determine whether the hashtags are similar.
    // If they are, they are considered equal.
    @Override
    public int compareTo(Hashtag o) {
        if (o.getHashtag().toString().equalsIgnoreCase(this.getHashtag().toString())) {
            return 0;
        } else if (WordSimilarity.isStringSimilar(this.hashtag.toString(), o.hashtag.toString())) {
            return 0;
        } else {
            return this.hashtag.toString().compareTo(o.getHashtag().toString());
        }
    }

    @Override
    public String toString() {
        return this.hashtag.toString();
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        this.hashtag.write(dataOutput);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.hashtag.readFields(dataInput);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Hashtag)) return false;
        Hashtag hashtag1 = (Hashtag) o;
        return WordSimilarity.isStringSimilar(this.getHashtag().toString(), hashtag1.getHashtag().toString());
    }

    @Override
    public int hashCode() {
        return Objects.hash(getHashtag());
    }
}
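As a side note, the write/readFields pair can be sanity-checked outside Hadoop with a simple round trip (a minimal sketch of my own, not part of the original program; it needs the java.io imports and can live in a main method):

// Serialize a Hashtag to a byte buffer, then read it back into a fresh instance.
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
Hashtag original = new Hashtag("#test");
original.write(new DataOutputStream(buffer));
Hashtag copy = new Hashtag();
copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
System.out.println(copy); // prints "#test"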
Finally, I wrote the MapReduce code:
public class HashTagCounter {

    private final static IntWritable one = new IntWritable(1);

    public static class HashtagCountMapper extends Mapper<Object, Text, Hashtag, IntWritable> {

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // If the line does not start with '{', it is not valid JSON. Ignore it.
            if (value.toString().startsWith("{")) {
                Status tweet = null;
                try {
                    // Create a Status object from the raw JSON
                    tweet = TwitterObjectFactory.createStatus(value.toString());
                    if (tweet != null && tweet.getText() != null) {
                        StringTokenizer itr = new StringTokenizer(tweet.getText());
                        while (itr.hasMoreTokens()) {
                            String temp = itr.nextToken();
                            // Check only hashtags
                            if (temp.startsWith("#") && temp.length() >= 3 && LanguageChecker.checkIfStringIsInLatin(temp)) {
                                temp = purifyString(temp);
                                context.write(new Hashtag('#' + temp), one);
                            }
                        }
                    }
                } catch (TwitterException tex) {
                    System.err.println("Twitter Exception thrown: " + tex.getErrorMessage());
                }
            }
        }
    }

    public static class HashtagCountCombiner extends Reducer<Hashtag, IntWritable, Hashtag, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Hashtag key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static class HashtagCountReducer extends Reducer<Hashtag, IntWritable, Hashtag, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Hashtag key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    private static String purifyString(String s) {
        // Strip characters matched by the (external) Analyzer.PURE_TEXT pattern, lowercase,
        // then remove accents by decomposing to NFD and dropping the non-ASCII marks.
        s = s.replaceAll(Analyzer.PURE_TEXT.pattern(), "").toLowerCase();
        s = Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
        return s.trim();
    }
}
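The driver is not shown above; for reference, a minimal sketch of the standard Hadoop boilerplate that wires these classes together might look like this (assuming the input and output paths come from the command line, with imports omitted as in the rest of the post):

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "hashtag count");
    job.setJarByClass(HashTagCounter.class);
    job.setMapperClass(HashtagCountMapper.class);
    job.setCombinerClass(HashtagCountCombiner.class);
    job.setReducerClass(HashtagCountReducer.class);
    job.setOutputKeyClass(Hashtag.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}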
Please note that all the imports are present in the actual code; I have omitted them here to keep an already text-heavy post shorter.
The code runs fine, with no errors, and it mostly works. I say mostly because in the part-r-00000 file I get several entries where what looks like the exact same hashtag is listed more than once, each with its own separate count. I tested these strings in Notepad and they appear completely identical (I originally thought it might be an encoding issue; it is not, as all of these hashtags show up as UTF-8 in the original file).
It does not happen with every hashtag, but it does happen with quite a few of them. In theory I could run a second MapReduce job on the output and merge these entries correctly without any trouble (we are talking about a ~100 KB file produced from a 10 GB input), but I believe that would be a waste of computing power.
This leads me to believe that I am missing something about how MapReduce works, and it is driving me crazy. Can anyone explain to me what I am doing wrong, and where the error in my logic is?
Answer (score: 0):
I am guessing that the Hashtag implementation is causing the problem. Text and String differ in how they handle double-byte characters in a UTF-8 character sequence. Also, Text is mutable while String is not, and operations on a String may not behave the way the corresponding operations on a Text do.
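To illustrate the difference (a small sketch of my own, not from the book): Text is backed by UTF-8 bytes and indexed by byte offset, while String is indexed by UTF-16 code unit, so the two diverge as soon as a multi-byte character appears:

Text t = new Text("h\u00E9llo");     // the 'é' occupies two bytes in UTF-8
String s = "h\u00E9llo";
System.out.println(s.length());      // 5 -> number of UTF-16 code units
System.out.println(t.getLength());   // 6 -> number of UTF-8 bytes
System.out.println(s.indexOf('l'));  // 2 -> character index
System.out.println(t.find("l"));     // 3 -> byte offset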
So just read the four pages [115, 118] (both inclusive) from the link below, which opens a PDF of Hadoop: The Definitive Guide.
Hope this reading helps you solve your exact problem.
Thanks.