Question

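Hey all, I'm hitting an odd bug in a small Java program I'm writing for a school project. I know the code is sloppy (it's still a work in progress), but my String variable "year" somehow becomes corrupted after I break out of a loop. I'm using Java with MapReduce and Hadoop to count unigrams and bigrams and sort them by year/author. Using print statements, I confirmed that "year" really is set when I assign temp to it, but at any point after the loop where it is set, the variable is somehow corrupted. The year digits are replaced by a large amount of whitespace (at least that is how it shows up in the console). I tried year=year.trim() and the regex year=year.replaceAll("[^0-9]",""), but neither works. Does anyone have any ideas? I've only included the map class, since that is where the problem is. I should also note that the text files being parsed come from Project Gutenberg; I'm working with a small sample of about 40 random texts from the project.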

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
 public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text(); 
    public synchronized void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        line = line.toLowerCase();
        line = line.replaceAll("[^0-9a-z\\s-*]", "").replaceAll("\\s+", " "); 
        String year=""; // variable to hold date -- somehow this gets cleared out before I need it
        String temp=""; // variable to hold each token
        StringTokenizer tokenizer = new StringTokenizer(line); // Splits document into individual words for parsing
        while (tokenizer.hasMoreTokens()) {

            temp = tokenizer.nextToken(); // grab first token of document
            if (temp.equals("***")) // hit first triple star, break out and move to next while loop
                break;

            if (temp.equals("release")&&tokenizer.hasMoreTokens()){ // if token is "release" followed by "date", extract year
                if (tokenizer.nextToken().equals("date")){
                    while(tokenizer.hasMoreTokens()){
                        temp = tokenizer.nextToken();
                        for (int i = 0; i<temp.length();i++){
                            if (Character.isDigit(temp.charAt(0))){
                                if (temp.length()>3||Integer.parseInt(temp)>=40){
                                    year = temp; // set year = token if token is a number greater than 40 or has >3 digits
                                    break;
                                }
                            }
                        }
                        if (!year.equals("")){ //if date isn't an empty string, it means we have date and break
                            break;            // out of first while loop
                        }
                    }
                    System.out.println("\n"+year+"\n");// year will still print here
                }
            } // but it is gone if I try to print past this point 
        }

        while (tokenizer.hasMoreTokens()){ // keep grabbing tokens until hit another "***", then break and
            temp = tokenizer.nextToken();  // can begin counting unigrams/bigrams
            if (temp.equals("***"))
                break;
        }

        line = line.substring(line.indexOf(temp)); // form a new document starting from location of previous "***"
        line = line.replaceAll("[^a-z\\s-]", "").replaceAll("\\s+", " ");
        line = line.replaceAll("-+", "-");  /*Many calls to remove excess whitespace and punctuation from entire document*/
        line = line.replaceAll(" - ", " "); 
        line = line.replaceAll("- ", " "); 
        line = line.replaceAll(" -", " ");
        line = line.replaceAll("\\s+", " ");

        StringTokenizer toke = new StringTokenizer(line); //start a new tokenizer with re-formatted file

        while(toke.hasMoreTokens()){//continue to grab tokens until EOF
            temp = toke.nextToken();
            //System.out.println(date);

            if (temp.charAt(0)=='-')
                temp = temp.substring(1);//if word starts or ends with hyphen, remove it
            if (temp.length()>1&&temp.charAt(temp.length()-1)=='-')
                temp = temp.replace('-', ' ');

            if ((!temp.equals(" "))){
                word.set(temp+"\t"+year);   
                context.write(word,one); 
            }
        }
    }
 }

Answer 1

Your code has year = temp, so what ends up in year seems to depend on what your input contains.

Possible bug:

for (int i = 0; i<temp.length();i++){
    if (Character.isDigit(temp.charAt(0))){

IMHO, you meant i rather than 0 in charAt:

for (int i = 0; i<temp.length();i++){
    if (Character.isDigit(temp.charAt(i))){
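
Note that with charAt(i), Integer.parseInt(temp) can still throw a NumberFormatException whenever a token contains a digit somewhere but is not entirely numeric. A minimal sketch of one way to guard against that, keeping the question's own rule (more than 3 digits, or a value of at least 40): the class and method names here (YearTokenCheck, isAllDigits, looksLikeYear) are hypothetical and do not come from the original code.

```java
final class YearTokenCheck {

    // True only when the token is non-empty and every character is an ASCII digit.
    static boolean isAllDigits(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i); // index i, not 0, so each character is checked
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    // Same rule as the question: accept tokens with more than 3 digits,
    // or shorter all-digit tokens whose numeric value is at least 40.
    static boolean looksLikeYear(String token) {
        if (!isAllDigits(token)) {
            return false;
        }
        // The length test runs first, so parseInt only ever sees 1 to 3 digits.
        return token.length() > 3 || Integer.parseInt(token) >= 40;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeYear("2008"));
        System.out.println(looksLikeYear("25"));
        System.out.println(looksLikeYear("e2008"));
    }
}
```

With a helper like this, the inner for loop over i is no longer needed; the while loop can simply test each token once.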

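Also consider not using StringTokenizer. From its documentation:

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.

The following example illustrates how the String.split method can be used to break up a string into its basic tokens: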
 String[] result = "this is a test".split("\\s");
 for (int x=0; x<result.length; x++)
     System.out.println(result[x]);
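
Applied to the question, the split-based approach might look like the sketch below. It is only an illustration under some assumptions: the input has already been lowercased and stripped of punctuation as in the question's first replaceAll call, the year is taken to be the first four-digit token after "release date" (a simplification of the question's rule), and the names ReleaseYearFinder and findReleaseYear are hypothetical.

```java
final class ReleaseYearFinder {

    // Returns the first 4-digit token that follows "release date",
    // or an empty string when no such token is found.
    static String findReleaseYear(String text) {
        // "\\s+" (rather than "\\s") avoids empty tokens between repeated spaces.
        String[] tokens = text.trim().toLowerCase().split("\\s+");
        for (int i = 0; i + 1 < tokens.length; i++) {
            if (tokens[i].equals("release") && tokens[i + 1].equals("date")) {
                for (int j = i + 2; j < tokens.length; j++) {
                    if (tokens[j].equals("***")) {
                        break; // header ended without a year
                    }
                    if (tokens[j].matches("\\d{4}")) {
                        return tokens[j];
                    }
                }
            }
        }
        return "";
    }

    public static void main(String[] args) {
        // Hypothetical header text, already cleaned the way the question cleans it.
        System.out.println(findReleaseYear("release date august 1998 ebook 1400"));
    }
}
```

Because split returns a plain array, the scan can look ahead by index instead of consuming tokens, which makes the "release" followed by "date" check easier to follow than with nextToken().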

Answer 2

Found your whitespace...

The two statements that output the year variable add a couple of newlines:

System.out.println("\n"+year+"\n")

or a tab:

word.set(temp+"\t"+year);   
context.write(word,one);

Try removing the \n and the \t.
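
To check whether the whitespace is in the variable itself or is only added by the print statement, it can help to print the value between delimiters with control characters made visible. A small sketch of that idea follows; the DebugPrint class and its visible method are hypothetical helpers, not part of the original code.

```java
final class DebugPrint {

    // Wraps the value in brackets and shows newlines, carriage returns,
    // and tabs as escape sequences so they can be seen in console output.
    static String visible(String s) {
        return "[" + s.replace("\n", "\\n")
                      .replace("\r", "\\r")
                      .replace("\t", "\\t") + "]";
    }

    public static void main(String[] args) {
        String year = "1998"; // sample value for illustration only
        System.out.println(visible(year));               // the variable on its own
        System.out.println(visible("\n" + year + "\n")); // what the question's println emits
    }
}
```

If the bracketed output of year alone shows no escape sequences, the extra blank space in the console comes from the surrounding "\n" and "\t" literals rather than from the variable.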
