Hadoop MapReduce word (hashtag) count with similar-word grouping not working

Asked: 2018-01-10 23:46:48

Tags: java hadoop text mapreduce similarity

I have been trying to create a Twitter hashtag-count Hadoop program. I successfully extract the text, get the hashtags, and start trying to count them. One of the earliest problems I ran into is that many hashtags are extremely similar (test, Test, test!, T-est, etc.).

I started by purging the strings of all special characters and removing all whitespace inside the tags, but the problem persisted for cases like several near-identical variants of "hawks". So I implemented Dice's Coefficient algorithm in a separate class, as follows:

//Using Dice's Coefficient algorithm
public class WordSimilarity {

    public static boolean isStringSimilar(String str1, String str2){
        return doComparison(str1, str2) >= Analyzer.getSimilarity();
    }

    /** @return lexical similarity value in the range [0,1] */
    private static double doComparison(String str1, String str2) {
        try {
            // If the strings are too small, do not compare them at all.
            if (str1.length() > 3 && str2.length() > 3) {
                List<String> pairs1 = wordLetterPairs(str1.toUpperCase());
                List<String> pairs2 = wordLetterPairs(str2.toUpperCase());
                int intersection = 0;
                int union = pairs1.size() + pairs2.size();
                for (String pair1 : pairs1) {
                    // Count each pair of pairs2 at most once by removing it on a match
                    for (int j = 0; j < pairs2.size(); j++) {
                        if (pair1.equals(pairs2.get(j))) {
                            intersection++;
                            pairs2.remove(j);
                            break;
                        }
                    }
                }
                return (2.0 * intersection) / union;
            } else {
                return 0;
            }
        } catch (NegativeArraySizeException ex) {
            // letterPairs throws this for an empty token (length - 1 == -1)
            return 0;
        }
    }

    /** @return a List of 2-character Strings. */
    private static List<String> wordLetterPairs(String str) {
        List<String> allPairs = new ArrayList<>();
        // Tokenize the string and collect the letter pairs of every token/word
        for (String word : str.split("\\s")) {
            allPairs.addAll(Arrays.asList(letterPairs(word)));
        }
        return allPairs;
    }

    /** @return an array of adjacent letter pairs contained in the input string */
    private static String[] letterPairs(String str) {
        int numPairs = str.length() - 1;
        String[] pairs = new String[numPairs];
        for (int i = 0; i < numPairs; i++) {
            pairs[i] = str.substring(i, i + 2);
        }
        return pairs;
    }

}

tl;dr: it compares two words and returns a number between 0 and 1 indicating how similar the strings are.
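A minimal usage sketch (assuming the similarity threshold returned by Analyzer.getSimilarity() is around 0.7; the Analyzer class is not shown here):

public class WordSimilarityDemo {
    public static void main(String[] args) {
        // "HAWKS" and "HAWKSS" share the pairs HA, AW, WK, KS:
        // 2 * 4 matches / (4 + 5 pairs) ≈ 0.89, above an assumed 0.7 threshold
        System.out.println(WordSimilarity.isStringSimilar("hawks", "hawkss")); // true
        // "HAWKS" and "EAGLE" share no letter pairs: similarity 0.0
        System.out.println(WordSimilarity.isStringSimilar("hawks", "eagle"));  // false
    }
}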

Then I created a custom WritableComparable (I intend to use it as a value later in the project, although for now it is only the key):

public class Hashtag implements WritableComparable<Hashtag> {

    private Text hashtag;

    public Hashtag(){
        this.hashtag = new Text();
    }

    public Hashtag(String hashtag) {
        this.hashtag = new Text(hashtag);
    }

    public Text getHashtag() {
        return hashtag;
    }

    public void setHashtag(String hashtag) {
        // Remove characters that add no information to the analysis, but cause problems to the result
        this.hashtag = new Text(hashtag);
    }

    public void setHashtag(Text hashtag) {
        this.hashtag = hashtag;
    }

    // Compare To uses the WordSimilarity algorithm to determine if the hashtags are similar. If they are,
    // they are considered equal
    @Override
    public int compareTo(Hashtag o) {
        if(o.getHashtag().toString().equalsIgnoreCase(this.getHashtag().toString())){
            return 0;
        }else if(WordSimilarity.isStringSimilar(this.hashtag.toString(),o.hashtag.toString())){
            return 0;
        }else {
            return this.hashtag.toString().compareTo(o.getHashtag().toString());
        }
    }

    @Override
    public String toString() {
        return this.hashtag.toString();
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        this.hashtag.write(dataOutput);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.hashtag.readFields(dataInput);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Hashtag)) return false;
        Hashtag hashtag1 = (Hashtag) o;
        return WordSimilarity.isStringSimilar(this.getHashtag().toString(),hashtag1.getHashtag().toString());
    }

    @Override
    public int hashCode() {
        return Objects.hash(getHashtag());
    }

}
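The intent is that similar hashtags compare as equal. A quick illustrative sketch (not part of the project, and again assuming a similarity threshold around 0.7):

public class HashtagCompareDemo {
    public static void main(String[] args) {
        // Same tag, different case: equalsIgnoreCase short-circuits to 0
        System.out.println(new Hashtag("#hawks").compareTo(new Hashtag("#Hawks")));  // 0
        // Similar tags: Dice similarity ≈ 0.91, above the assumed threshold, so 0
        System.out.println(new Hashtag("#hawks").compareTo(new Hashtag("#hawkss"))); // 0
        // Unrelated tags: falls back to plain lexicographic comparison (non-zero)
        System.out.println(new Hashtag("#hawks").compareTo(new Hashtag("#eagle")));
    }
}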

Finally, the MapReduce code:

public class HashTagCounter {

    private final static IntWritable one = new IntWritable(1);

    public static class HashtagCountMapper extends Mapper<Object, Text, Hashtag, IntWritable> {

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            //If the line does not start with '{', it is not a valid JSON. Ignore.
            if (value.toString().startsWith("{")) {
                Status tweet = null;
                try {
                    //Create a status object from Raw JSON
                    tweet = TwitterObjectFactory.createStatus(value.toString());
                    if (tweet!=null && tweet.getText() != null) {
                        StringTokenizer itr = new StringTokenizer(tweet.getText());
                        while (itr.hasMoreTokens()) {
                            String temp = itr.nextToken();
                            //Check only hashtags
                            if (temp.startsWith("#") && temp.length()>=3 &&  LanguageChecker.checkIfStringIsInLatin(temp)){
                                temp = purifyString(temp);
                                context.write(new Hashtag('#'+temp), one);
                            }
                        }
                    }
                } catch (TwitterException tex) {
                    System.err.println("Twitter Exception thrown: "+ tex.getErrorMessage());
                }
            }
        }
    }

    public static class HashtagCountCombiner extends Reducer<Hashtag, IntWritable, Hashtag, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Hashtag key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static class HashtagCountReducer extends Reducer<Hashtag, IntWritable, Hashtag, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Hashtag key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    private static String purifyString(String s){
        s = s.replaceAll(Analyzer.PURE_TEXT.pattern(),"").toLowerCase();
        s  = Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
        return s.trim();
    }
}

Note that all the imports are in the code; I have simply omitted them here to shorten an already text-heavy post.
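The Analyzer class is likewise omitted; roughly, it provides something like the following (a minimal sketch, the exact pattern and threshold may differ):

import java.util.regex.Pattern;

// Hypothetical stand-in for the real Analyzer class, which is not shown in this post.
class Analyzer {
    // Assumption: PURE_TEXT matches every character that is neither a letter nor a digit,
    // so purifyString("#Tést-Tag!") would yield "testtag" after the regex replace,
    // lowercasing, and the NFD normalization that strips the accent from 'é'.
    static final Pattern PURE_TEXT = Pattern.compile("[^\\p{L}\\p{N}]");

    // Assumption: the similarity threshold used by WordSimilarity.isStringSimilar.
    static double getSimilarity() { return 0.7; }
}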

The code runs fine without errors, and it mostly works. I say "mostly" because in the part-r-00000 file I get several entries like these:

  • milwauke 2
  • XXXXXXXX <---- some other strings
  • milwaukee 1
  • XXXXXXXX <---- some other strings
  • milwaukee 7

And so on. I tested these strings in Notepad and they look exactly identical (I originally thought it could be an encoding issue. It is not; all of these hashtags appear as UTF-8 in the original file).
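For what it is worth, a quick diagnostic sketch (not part of the job) to verify that two suspiciously identical keys really contain the same bytes:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Print the raw UTF-8 bytes of two keys that look identical,
// to rule out invisible characters or encoding differences.
public class ByteCheck {
    public static void main(String[] args) {
        String a = "milwaukee"; // paste the two suspicious keys here
        String b = "milwaukee";
        System.out.println(Arrays.toString(a.getBytes(StandardCharsets.UTF_8)));
        System.out.println(Arrays.toString(b.getBytes(StandardCharsets.UTF_8)));
    }
}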

It does not happen for all hashtags, but it does happen for quite a few of them. In theory I could run a second MapReduce job on the output and combine them correctly without trouble (we are talking about a ~100 KB file generated from a 10 GB input file), but I believe that would be a waste of computing power.

This makes me believe I am missing something about how MapReduce works, and it is driving me crazy. Can anyone explain to me what I am doing wrong, and where the flaw in my logic is?

1 Answer:

Answer 0 (score: 0)

My guess is that the Hashtag implementation is causing the problem. Text and String differ when they encounter multi-byte characters in a UTF-8 character sequence. Also, Text is mutable while String is not, and the behavior you expect from String operations may differ from the corresponding Text operations.
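For example, a small sketch of one such difference: String compares UTF-16 code units while Text compares raw UTF-8 bytes, so characters outside the Basic Multilingual Plane can even sort in a different order:

import org.apache.hadoop.io.Text;

public class TextVsString {
    public static void main(String[] args) {
        String a = "\uD801\uDC00"; // U+10400, outside the BMP (a surrogate pair in UTF-16)
        String b = "\uFFFD";       // U+FFFD, a single UTF-16 code unit

        // String compares UTF-16 code units: 0xD801 < 0xFFFD, so a < b
        System.out.println(a.compareTo(b) < 0);                      // true
        // Text compares UTF-8 bytes: 0xF0 0x90 0x90 0x80 > 0xEF 0xBF 0xBD, so a > b
        System.out.println(new Text(a).compareTo(new Text(b)) > 0);  // true
    }
}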

So just read the four pages [115, 118] (both inclusive) from the link below; it opens a PDF of Hadoop: The Definitive Guide.

http://javaarm.com/file/apache/Hadoop/books/Hadoop-The.Definitive.Guide_4.edition_a_Tom.White_April-2015.pdf

Hope this reading helps you solve the exact problem.

Thanks.