Question

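I'm trying to find a fast way to get an array of strings for each of the following in a tweet's text: 1) hashtags, 2) user mentions, and 3) URLs. I have the tweet texts in a CSV file.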

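My current approach takes too much processing time, and I'm wondering whether I can optimize my code. I'll show my regex rules for each match type, but to avoid posting long code I'll only show how I match hashtags. The same technique is used for URLs and user mentions.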

Here it is:

public static String hashtagRegex = "^#\\w+|\\s#\\w+";
public static Pattern hashtagPattern = Pattern.compile(hashtagRegex);

public static String urlRegex = "http+://[\\S]+|https+://[\\S]+";
public static Pattern urlPattern = Pattern.compile(urlRegex);

public static String mentionRegex = "^@\\w+|\\s@\\w+";
public static Pattern mentionPattern = Pattern.compile(mentionRegex);

public static String[] getHashtag(String text) {
   String hashtags[];
   matcher = hashtagPattern.matcher(tweet.getText());

    if ( matcher.find() ) {
        hashtags = new String[matcher.groupCount()];
        for ( int i = 0; matcher.find(); i++ ) {
                    // Also, I'm getting an ArrayIndexOutOfBoundsException here
            hashtags[i] = matcher.group().replace(" ", "").replace("#", "");
        }
    }

   return hashtags;

}

Answer 1

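Matcher#groupCount gives you the number of capturing groups, not the number of matches. That's why you get the ArrayIndexOutOfBoundsException (in your case, the array is initialized with size zero). You may want to use a List, which grows dynamically, to collect the matches instead of an array.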
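As a minimal sketch of the List-based approach (the class name HashtagExtractor and the method name getHashtags are made up for illustration; the pattern is the one from the question), something like this should work. It also drops the separate if (matcher.find()) check, which in the question's code consumes the first match before the loop starts:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the question's code.
final class HashtagExtractor {
    // Same pattern as in the question: '#' at the start of the text or after whitespace.
    private static final Pattern HASHTAG_PATTERN = Pattern.compile("^#\\w+|\\s#\\w+");

    static List<String> getHashtags(String text) {
        List<String> hashtags = new ArrayList<>();
        Matcher matcher = HASHTAG_PATTERN.matcher(text);
        // Each call to find() advances to the next match, so no count is needed up front.
        while (matcher.find()) {
            // Drop the optional leading whitespace character, then the '#'.
            hashtags.add(matcher.group().trim().substring(1));
        }
        return hashtags;
    }
}
```

If you really need a String[], you can call hashtags.toArray(new String[0]) at the end.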

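One (potential) way to speed things up might be to tokenize the text on whitespace and then just check whether each token starts with a fragment such as http, @ or #. That way, you can avoid regular expressions entirely. (I haven't profiled this, so I can't tell what the performance impact is.)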
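A rough sketch of the tokenizing idea could look like the following. The class TweetTokenScanner and its fields are hypothetical names, and this is not a drop-in replacement for the regexes: a token keeps any trailing punctuation (for example "#java," yields "java,"), whereas \w+ would stop at the comma, so you may need extra trimming depending on your data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class that sorts whitespace-separated tokens by their prefix.
final class TweetTokenScanner {
    final List<String> hashtags = new ArrayList<>();
    final List<String> mentions = new ArrayList<>();
    final List<String> urls = new ArrayList<>();

    static TweetTokenScanner scan(String text) {
        TweetTokenScanner result = new TweetTokenScanner();
        // The default delimiters are space, tab, newline, carriage return and form feed.
        StringTokenizer tokens = new StringTokenizer(text);
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            if (token.startsWith("http://") || token.startsWith("https://")) {
                result.urls.add(token);
            } else if (token.length() > 1 && token.charAt(0) == '#') {
                result.hashtags.add(token.substring(1)); // store without the '#'
            } else if (token.length() > 1 && token.charAt(0) == '@') {
                result.mentions.add(token.substring(1)); // store without the '@'
            }
        }
        return result;
    }
}
```

A single pass like this collects all three kinds of matches at once, instead of running three separate matchers over the same text. Whether it is actually faster for your CSV is something you would have to measure.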
