Extract letters and whitespace from a given text file that contains no sentinel

Asked: 2014-08-16 02:17:41

Tags: java regex string replaceall

I have 2 text files:

1 Extract_tweet.txt - the file format is user_id tweet_id tweet_text

12163922    5407952300  I think I just discovered the hour when the office thermostat changes. And it ain't a good time to be at work...brrrr   2009-11-03 19:22:54

2 locations.txt - the relevant data here is the 3rd column, which acts as the search string

asciiname: name of geographical point in plain ascii characters, varchar(200)

4045431 Point Poker Point Poker     52.89508    173.29911   T   CAPE    US      AK  016         0       9   America/Adak    2013-10-26

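I want to extract some data from these files. The data should consist only of a-z, A-Z and any whitespace. I first thought of tokenizing the string, but since no sentinel is given, I decided to use a regular expression instead. Please find below the code snippet that is meant to extract those 27 characters, i.e. a-z or A-Z or any whitespace. I want the extracted text in lowercase only, i.e. any uppercase character should be converted to lowercase.

I open file 1 - Extract_tweet.txt and take the text as a single string. Then I try to replace every non-letter character with nothing.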

    public void readfromFile() throws FileNotFoundException
    {
        Scanner inputStream;
        String source=null;
        FileInputStream file = new FileInputStream("Extract_tweet.txt");    
        inputStream = new Scanner(file);
        while(inputStream.hasNextLine())    //Read from file till the last line of the file.
        {
            source = inputStream.nextLine();
            System.out.println(source);
            replaceAll(source);

        }
        inputStream.close();
    }
    public String replaceAll(String source) 
    {
        String regex = "[A-Z]*"+"["+source.toLowerCase()+"|"+"[a-z]*"+"[\\s]";
        source = source.replaceAll(regex, "");
        System.out.println(source);
        return source;
    }

    public static void main(String[] args) {

        StringProcessing sp = new StringProcessing();
        try {
            sp.readfromFile();
        } catch (FileNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

    }

After running this code, I get the following error.

60730027    6320951896  @thediscovietnam coo.  thanks. just dropped you a line. 2009-12-03 18:41:07
Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal character range near index 88
[A-Z]*[60730027 6320951896  @thediscovietnam coo.  thanks. just dropped you a line. 2009-12-03 18:41:07|[a-z]*[\s]

2 Answers:

Answer 0: (score: 0)

Please change that line to

String regex = "[A-Z]* |"+"[a-z]*"+"[\\s]";

and it will work fine.
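For context on why the pattern in the question throws: it concatenates the whole input line into a character class, so the `9-1` in the date `2009-12-03` is parsed as a range whose start is greater than its end, and `java.util.regex` rejects that. This appears to be what "near index 88" points at, assuming the columns are tab-separated. A minimal sketch that reproduces the effect; the class name `RangeErrorDemo` and the helper `compiles` are made up for illustration:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RangeErrorDemo {
    // Returns true if the regex compiles, false if the engine rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "9-1" inside [...] is read as a descending range, which is illegal.
        System.out.println(compiles("[2009-12-03]"));
        // A fixed pattern that does not embed the input line compiles.
        System.out.println(compiles("[A-Z]* |[a-z]*\\s"));
    }
}
```

The point of the change is that the regex should be a fixed pattern; the input line should only ever be the string it is applied to.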

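Answer 1: (score: 0)

I made some changes. However, I want to convert uppercase to lowercase and replace all alphanumeric values with null.

Extending your approach: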

public String replaceAll(String source) throws FileNotFoundException {
    String regex = "[A-Z]* |[a-z]*\\s";
    source = source.replaceAll(regex, "")
                   .replaceAll("\\d", "")
                   .toLowerCase();

    System.out.println(source);
    writetoFile(source);
    return source;
}
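If the goal is to keep only letters and whitespace, in lowercase, one alternative is a negated character class that deletes everything else. This is a minimal sketch rather than the code from the question: the class name `LetterFilter` and the method `keepLettersAndSpaces` are made up for illustration, and it assumes that only the ASCII letters a-z and A-Z count as letters.

```java
import java.util.Locale;

class LetterFilter {
    // Removes every character that is not an ASCII letter or whitespace,
    // then lowercases what is left.
    static String keepLettersAndSpaces(String source) {
        return source.replaceAll("[^a-zA-Z\\s]", "")
                     .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Sample line in the tab-separated shape shown in the question.
        String line = "60730027\t6320951896\t@thediscovietnam coo.  thanks.\t2009-12-03 18:41:07";
        System.out.println(keepLettersAndSpaces(line));
    }
}
```

Note that the separating tabs and spaces survive, so the ids and the timestamp collapse to runs of whitespace; trim or split afterwards if only the tweet text is needed.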