Extracting the original text from a POS-tagged string

Time: 2017-02-01 08:09:41

Tags: java

I have a string that looks like this:

   The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.

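I want to extract only the original text and discard the POS tags. What regex can I use to do this? I know I can split on /, but I still need to remove the tags and keep the words. Should I use a regular expression to identify the tags? The result I am after is: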

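   The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced "no evidence" that any irregularities took place.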

2 Answers:

Answer 0: (score: 3)

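You can remove the POS tags with the pattern /.*?(\s|$), which matches from a slash up to and including the next whitespace character or the end of the string. I think the following code gets you very close to what you want.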

String input = "The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.";
// Replace each "/tag" plus the whitespace after it (or the end of input) with a single space.
input = input.replaceAll("/.*?(?:\\s|$)", " ");
System.out.println(input);

Output:

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary
election produced `` no evidence '' that any irregularities took place .
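
If the same replacement has to run over many lines, one option is to compile the pattern once and reuse it. The sketch below is only an illustration: the class name PosTagRegex and the method stripTags are hypothetical, and it swaps in the simpler pattern /\S* (a slash followed by any run of non-whitespace characters), which leaves the existing spaces in place instead of inserting new ones.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class PosTagRegex {
    // Compiled once and reused: a slash plus the tag characters up to the next whitespace.
    private static final Pattern TAG = Pattern.compile("/\\S*");

    // Returns the input with every "/tag" suffix removed; spaces between words are kept.
    static String stripTags(String tagged) {
        return TAG.matcher(tagged).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("The/at Fulton/np-tl County/nn-tl ./."));
        // prints: The Fulton County .
    }
}
```

Because nothing is appended after the last tag, this variant does not leave a trailing space at the end of the line.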

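Answer 1: (score: 0)

So this is what I quickly wrote to extract the desired string. Do you have any better or more efficient ideas? I need to do this on a large amount of data.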

public static void main(String[] args) {
    StringBuilder sb = new StringBuilder();
    String str = "The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.";
    // Split into word/tag tokens, then keep only the part before the first slash.
    String[] newLine = str.split(" ");
    for (String word : newLine) {
        int index = word.indexOf("/");
        // A token without a slash is kept whole; substring(0, -1) would throw.
        String newWord = index >= 0 ? word.substring(0, index) : word;
        sb.append(newWord);
        sb.append(" ");
    }
    System.out.println(sb);
}
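
One possible alternative for large inputs, sketched under the assumption that every tag runs from the first slash of a token to the next whitespace character, is a single pass over the characters with no split and no regex. The class name PosTagScanner and the method stripTags are hypothetical; whether this is actually faster on your data would need to be measured.

```java
// Hypothetical helper class; the names are illustrative only.
class PosTagScanner {
    // Copies the input while skipping everything from a slash up to the next whitespace.
    static String stripTags(String tagged) {
        StringBuilder sb = new StringBuilder(tagged.length());
        boolean inTag = false;
        for (int i = 0; i < tagged.length(); i++) {
            char c = tagged.charAt(i);
            if (Character.isWhitespace(c)) {
                inTag = false;   // whitespace ends the current tag
                sb.append(c);
            } else if (c == '/') {
                inTag = true;    // the first slash of a token starts its tag
            } else if (!inTag) {
                sb.append(c);    // ordinary word character
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("took/vbd place/nn ./."));
        // prints: took place .
    }
}
```

Like the loop above, this cuts each token at its first slash, and it leaves tokens without a slash unchanged.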