I have a string that looks like this:
> The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.
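I want to extract only the raw text and discard the POS tags. What regex can I use to do this? I know I can split on /, but I still need to drop the tags. Should I use a regular expression to identify the tags? The result I want is:
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .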
Answer 0 (score: 3)
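You can use the pattern /.*?(\s|$) to remove the POS tags. It matches a slash and everything after it, up to and including the next whitespace character (or the end of the string), and each match is replaced with a single space. I think the following code gets you very close to what you want.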
String input = "The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.";
// Replace each "/tag" plus the whitespace after it (or the end of input) with one space.
input = input.replaceAll("/.*?(?:\\s|$)", " ");
System.out.println(input);
Output:
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .
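One side effect of the replaceAll call above is that it swallows the whitespace after each tag and puts a single space back, so the result ends with a trailing space. If that matters, a possible variant is to delete only the tag and leave the spacing alone. The sketch below is an assumption, not part of the answer above: the class name TagStripper, the method strip, and the lookahead pattern are all hypothetical. Precompiling the Pattern also avoids recompiling the regex on every call, which String.replaceAll does internally.

```java
import java.util.regex.Pattern;

class TagStripper {
    // Hypothetical variant of the pattern above: a slash followed by a run of
    // non-space, non-slash characters, matched only when that run ends the token.
    private static final Pattern TAG = Pattern.compile("/[^\\s/]+(?=\\s|$)");

    // Deletes only the tag, so the original spacing is left untouched.
    static String strip(String tagged) {
        return TAG.matcher(tagged).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "The/at Fulton/np-tl County/nn-tl ``/`` no/at evidence/nn ''/'' ./.";
        // Prints: The Fulton County `` no evidence '' .
        System.out.println(strip(input));
    }
}
```

Because the tag may not contain a slash, a word that itself contains one (for example 1/2/cd) keeps its inner slash and loses only the final tag.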
Answer 1 (score: 0)
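So this is what I quickly wrote to extract the desired string. Does anyone have a better or more efficient idea? I need to do this on a large amount of data.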
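public class StripTags {
    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        String str = "The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.";
        // Tokens are separated by single spaces; each token has the form word/tag.
        for (String token : str.split(" ")) {
            // The tag follows the last slash, so a word that contains '/' stays intact.
            int index = token.lastIndexOf('/');
            // A token without a slash is kept as-is instead of throwing in substring().
            String word = (index >= 0) ? token.substring(0, index) : token;
            sb.append(word).append(' ');
        }
        // trim() drops the trailing space added after the last word.
        System.out.println(sb.toString().trim());
    }
}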
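On the efficiency question: for large inputs, one possible approach is to read a line at a time and strip the tags in a single pass, without split() or a regex, so no intermediate array of tokens is built. This is a minimal sketch, assuming tokens are separated by spaces and the tag follows the last slash of each token. The class name StreamingTagStripper and the method stripLine are hypothetical, and the StringReader only stands in for a real file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class StreamingTagStripper {
    // Copies the word part of each space-separated token into one output buffer.
    static String stripLine(String line) {
        StringBuilder out = new StringBuilder(line.length());
        int start = 0;
        while (start < line.length()) {
            int end = line.indexOf(' ', start);
            if (end < 0) {
                end = line.length();
            }
            // Search backwards from the end of the token for its last slash.
            int slash = line.lastIndexOf('/', end - 1);
            // A slash before 'start' belongs to an earlier token: keep the whole token.
            int wordEnd = (slash >= start) ? slash : end;
            if (wordEnd > start) { // skips empty tokens caused by repeated spaces
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(line, start, wordEnd);
            }
            start = end + 1;
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: in real use this reader would wrap a file rather than a string.
        String sample = "The/at Fulton/np-tl County/nn-tl\nJury/nn-tl said/vbd ./.\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(sample))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(stripLine(line));
            }
        }
    }
}
```

Whether this is actually faster than the regex on your data is something to measure rather than assume; the main gain is that memory use stays proportional to one line instead of the whole input.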