I have two text files:
1 Extract_tweet.txt
- The file format is user_id tweet_id tweet_text
12163922 5407952300 I think I just discovered the hour when the office thermostat changes. And it ain't a good time to be at work...brrrr 2009-11-03 19:22:54
2 locations.txt
- The relevant part of the following data is column 3, which acts like a search string
asciiname: name of geographical point in plain ascii characters, varchar(200)
4045431 Point Poker Point Poker 52.89508 173.29911 T CAPE US AK 016 0 9 America/Adak 2013-10-26
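I want to extract some data from these files. The data must contain only a-z, A-Z, and whitespace. I first thought of tokenizing the string, but since no sentinel (delimiter) is given, I decided to use a regular expression. Please find below the code snippet for extracting the 27 allowed characters, i.e. a-z, A-Z, or whitespace. I want the extracted text in lowercase only, i.e. any uppercase character should be converted to lowercase.
I open file 1, Extract_tweet.txt, and take the complete text as a single string. Then I try to replace every non-alphabetic character with an empty string.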
public void readfromFile() throws FileNotFoundException
{
    Scanner inputStream;
    String source = null;
    FileInputStream file = new FileInputStream("Extract_tweet.txt");
    inputStream = new Scanner(file);
    while (inputStream.hasNextLine()) // Read from file till the last line of the file.
    {
        source = inputStream.nextLine();
        System.out.println(source);
        replaceAll(source);
    }
    inputStream.close();
}

public String replaceAll(String source)
{
    String regex = "[A-Z]*"+"["+source.toLowerCase()+"|"+"[a-z]*"+"[\\s]";
    source = source.replaceAll(regex, "");
    System.out.println(source);
    return source;
}

public static void main(String[] args) {
    StringProcessing sp = new StringProcessing();
    try {
        sp.readfromFile();
    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
After running this code, I get the following error.
60730027 6320951896 @thediscovietnam coo. thanks. just dropped you a line. 2009-12-03 18:41:07
Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal character range near index 88
[A-Z]*[60730027 6320951896 @thediscovietnam coo. thanks. just dropped you a line. 2009-12-03 18:41:07|[a-z]*[\s]
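Answer 0 (score: 0)
Change that line to:
String regex = "[A-Z]* |"+"[a-z]*"+"[\\s]";
This compiles without the exception, because the input line is no longer concatenated into the pattern.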
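For context on why the original pattern failed: concatenating the tweet into the regex after an opening `[` puts the tweet text inside a character class, where the date `2009-12-03` is read as the range `9-1`. That range is invalid because `9` sorts after `1`, hence "Illegal character range". A minimal sketch that reproduces this; the class name `RangeDemo` and the helper `compiles` are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RangeDemo {
    // Hypothetical helper: returns true if the string compiles as a regex.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String source = "just dropped you a line. 2009-12-03 18:41:07";
        // Same construction as in the question: the input lands inside "[...".
        String broken = "[A-Z]*" + "[" + source.toLowerCase() + "|" + "[a-z]*" + "[\\s]";
        System.out.println(compiles(broken));                 // false
        System.out.println(compiles("[A-Z]* |[a-z]*[\\s]"));  // true
    }
}
```

If input text ever does need to appear inside a pattern, `Pattern.quote` can be used to escape it, although that is not required for this task.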
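Answer 1 (score: 0)
I made some changes. However, I want to convert uppercase to lowercase and replace all numeric values with an empty string.
Extending your method: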
public String replaceAll(String source) throws FileNotFoundException {
    String regex = "[A-Z]* |[a-z]*\\s";
    source = source.replaceAll(regex, "")
                   .replaceAll("\\d", "")
                   .toLowerCase();
    System.out.println(source);
    writetoFile(source);
    return source;
}
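Note that the pattern `[a-z]*\\s` matches runs of lowercase letters followed by whitespace, so the snippet above also deletes words rather than keeping them. If the goal is the one stated in the question (keep only letters and whitespace, all in lowercase), a negated character class may be simpler. The following is only a sketch of that alternative, not the answerer's method; the class name `LettersOnly` and the method `keepLetters` are hypothetical.

```java
import java.util.Locale;

public class LettersOnly {
    // Lowercases the input, then deletes every character that is
    // neither a-z nor whitespace.
    static String keepLetters(String source) {
        return source.toLowerCase(Locale.ROOT).replaceAll("[^a-z\\s]", "");
    }

    public static void main(String[] args) {
        String line = "60730027 6320951896 @thediscovietnam coo. thanks. 2009-12-03";
        // Digits and punctuation are dropped; the spaces around them remain.
        System.out.println(keepLetters(line));
    }
}
```

Lowercasing first means a single class `[^a-z\\s]` is enough. `Locale.ROOT` is used so that the result does not depend on the default locale of the machine.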