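Hey all, I'm hitting a weird error in a small Java program I'm writing for a school project. I'm well aware of how sloppy the code is (it's still a work in progress), but anyway: my String variable "year" somehow gets corrupted after breaking out of a loop. I'm using Java with MapReduce and Hadoop to count unigrams and bigrams and sort them by year/author. Using print statements, I've confirmed that "year" really is set when I assign temp to it, but at any point after the loop in which it is set, the variable is somehow corrupted. The year digits are replaced with a large amount of whitespace (at least that's how it shows up in the console). I've tried year = year.trim() and the regex year = year.replaceAll("[^0-9]",""), but neither works. Does anyone have any ideas?
I've only included the Map class, since that's where the problem is. I should also note that the text files being parsed come from Project Gutenberg. I'm working with a small sample of about 40 random texts from the project.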
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public synchronized void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            line = line.toLowerCase();
            line = line.replaceAll("[^0-9a-z\\s-*]", "").replaceAll("\\s+", " ");
            String year=""; // variable to hold date -- somehow this gets cleared out before I need it
            String temp=""; // variable to hold each token
            StringTokenizer tokenizer = new StringTokenizer(line); // Splits document into individual words for parsing
            while (tokenizer.hasMoreTokens()) {
                temp = tokenizer.nextToken(); // grab first token of document
                if (temp.equals("***")) // hit first triple star, break out and move to next while loop
                    break;
                if (temp.equals("release")&&tokenizer.hasMoreTokens()){ // if token is "release" followed by "date", extract year
                    if (tokenizer.nextToken().equals("date")){
                        while(tokenizer.hasMoreTokens()){
                            temp = tokenizer.nextToken();
                            for (int i = 0; i<temp.length();i++){
                                if (Character.isDigit(temp.charAt(0))){
                                    if (temp.length()>3||Integer.parseInt(temp)>=40){
                                        year = temp; // set year = token if token is a number greater than 40 or has >3 digits
                                        break;
                                    }
                                }
                            }
                            if (!year.equals("")){ //if date isn't an empty string, it means we have date and break
                                break; // out of first while loop
                            }
                        }
                        System.out.println("\n"+year+"\n");// year will still print here
                    }
                } // but it is gone if I try to print past this point
            }
            while (tokenizer.hasMoreTokens()){ // keep grabbing tokens until hit another "***", then break and
                temp = tokenizer.nextToken(); // can begin counting unigrams/bigrams
                if (temp.equals("***"))
                    break;
            }
            line = line.substring(line.indexOf(temp)); // form a new document starting from location of previous "***"
            line = line.replaceAll("[^a-z\\s-]", "").replaceAll("\\s+", " ");
            line = line.replaceAll("-+", "-"); /*Many calls to remove excess whitespace and punctuation from entire document*/
            line = line.replaceAll(" - ", " ");
            line = line.replaceAll("- ", " ");
            line = line.replaceAll(" -", " ");
            line = line.replaceAll("\\s+", " ");
            StringTokenizer toke = new StringTokenizer(line); //start a new tokenizer with re-formatted file
            while(toke.hasMoreTokens()){//continue to grab tokens until EOF
                temp = toke.nextToken();
                //System.out.println(date);
                if (temp.charAt(0)=='-')
                    temp = temp.substring(1);//if word starts or ends with hyphen, remove it
                if (temp.length()>1&&temp.charAt(temp.length()-1)=='-')
                    temp = temp.replace('-', ' ');
                if ((!temp.equals(" "))){
                    word.set(temp+"\t"+year);
                    context.write(word,one);
                }
            }
        }
    }
Answer 0 (score: 1)
Your code contains year = temp, so what ends up in year appears to depend on your input.
Possible bug:
for (int i = 0; i<temp.length();i++){
if (Character.isDigit(temp.charAt(0))){
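IMHO you meant i rather than 0 in charAt, so that every character of the token is checked instead of the first one repeatedly: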
for (int i = 0; i<temp.length();i++){
if (Character.isDigit(temp.charAt(i))){
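A minimal sketch of how the corrected check might look if it were pulled out into a helper. The class name YearCheck and the method looksLikeYear are hypothetical and not part of the original code; the threshold rule (more than 3 digits, or a value of at least 40) is copied from the question. Checking every character before parsing also keeps Integer.parseInt from throwing a NumberFormatException on a token such as "1st".

```java
class YearCheck {
    // Hypothetical helper: true if every character of the token is a digit
    // and the token has more than 3 digits or a numeric value of at least 40
    // (the same rule the question's code applies).
    static boolean looksLikeYear(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); i++) {
            // Index with i, not 0, so that each character is inspected.
            if (!Character.isDigit(token.charAt(i))) {
                return false;
            }
        }
        // The length test short-circuits, so parseInt only sees 1 to 3 digits.
        return token.length() > 3 || Integer.parseInt(token) >= 40;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeYear("1998")); // prints true
        System.out.println(looksLikeYear("1st"));  // prints false
        System.out.println(looksLikeYear("12"));   // prints false
    }
}
```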
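Also consider not using StringTokenizer. From its Javadoc:
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
The following example illustrates how the String.split method can be used to break up a string into its basic tokens:
String[] result = "this is a test".split("\\s");
for (int x=0; x<result.length; x++)
    System.out.println(result[x]);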
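To illustrate the java.util.regex alternative mentioned in the quote, here is a rough sketch that pulls a four-digit year out of a "release date" header line. The class name ReleaseYear, the pattern, and the sample header line are all assumptions made for illustration; real Project Gutenberg headers vary, so the pattern would likely need adjusting.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReleaseYear {
    // Assumed pattern: the words "release date", then as little text as
    // possible, then the first standalone run of exactly four digits.
    private static final Pattern YEAR =
        Pattern.compile("release date.*?\\b(\\d{4})\\b", Pattern.CASE_INSENSITIVE);

    // Returns the year as a string, or "" when no match is found
    // (mirroring the empty-string convention used in the question).
    static String extractYear(String line) {
        Matcher m = YEAR.matcher(line);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        // Hypothetical header line, not taken from a real file.
        String header = "Release Date: March 1, 1998 [EBook #1234]";
        System.out.println(extractYear(header));         // prints 1998
        System.out.println(extractYear("no date here")); // prints an empty line
    }
}
```

A single pattern replaces the nested tokenizer loops, at the cost of having to tune the expression to whatever header formats actually occur in the sample.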
Answer 1 (score: 0)
Found your whitespace...
The two statements that output the year variable add whitespace around it. One adds newlines:
System.out.println("\n"+year+"\n")
and the other adds a tab:
word.set(temp+"\t"+year);
context.write(word,one);
Try removing the \n and \t.
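One way to see exactly which whitespace surrounds a value is to print it between visible delimiters with the control characters escaped. The sketch below assumes a hypothetical helper named visible in a hypothetical class ShowWhitespace; neither is part of the original code.

```java
class ShowWhitespace {
    // Hypothetical debugging helper: wraps the string in brackets and
    // replaces newline, tab, and carriage return with readable escapes.
    static String visible(String s) {
        return "[" + s.replace("\n", "\\n")
                      .replace("\t", "\\t")
                      .replace("\r", "\\r") + "]";
    }

    public static void main(String[] args) {
        String year = "1998";
        System.out.println(visible("\n" + year + "\n")); // prints [\n1998\n]
        System.out.println(visible("word\t" + year));    // prints [word\t1998]
        System.out.println(visible(""));                 // prints []
    }
}
```

With output like this, an empty year shows up as [] rather than as a run of blank console lines, which makes it easier to tell an empty string apart from added whitespace.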