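I have a section of a book, complete with punctuation, line breaks, and so on. I want to extract the first n words from the text and divide them into 5 parts. Regular expressions mystify me. This is what I am trying; it produces an array with a single element, at index 0, that contains all of the input text: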
public static String getNumberWords2(String s, int nWords){
    String[] m = s.split("([a-zA-Z_0-9]+\b.*?)", (nWords / 5));
    return "Part One: \n" + m[1] + "\n\n" +
           "Part Two: \n" + m[2] + "\n\n" +
           "Part Three: \n" + m[3] + "\n\n" +
           "Part Four: \n" + m[4] + "\n\n" +
           "Part Five: \n" + m[5];
}
Thanks!
Answer 0 (score: 5)
I think the simplest and most efficient approach is to simply find a "word" repeatedly:
Pattern p = Pattern.compile("(\\w+)");
Matcher m = p.matcher(chapter);
while (m.find()) {
    String word = m.group();
    ...
}
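You can change the definition of a "word" by modifying the regex. What I've written just uses the regex notion of word characters (\w), and I wonder whether it might be more appropriate than what you are trying to do. Note that it will not match quote characters such as apostrophes, which you may need to allow inside a word.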
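As an illustration of changing the definition of a "word", here is a minimal sketch that also accepts apostrophes inside a word and stops after the first n matches. The class name FirstWords, the method firstWords, and the exact pattern are assumptions made for this example, not part of the answer above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstWords {
    // Assumed definition of a "word": runs of word characters, optionally
    // joined by single apostrophes (so "don't" counts as one word).
    private static final Pattern WORD = Pattern.compile("\\w+(?:'\\w+)*");

    // Collects at most n words from the start of the text.
    static List<String> firstWords(String text, int n) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (words.size() < n && m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [Don't, panic, it's, only]
        System.out.println(firstWords("Don't panic, it's only a test.", 4));
    }
}
```

If the text contains fewer than n words, the list simply comes back shorter, so callers should check its size before dividing it into parts.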
Answer 1 (score: 2)
There is a better alternative made for this: BreakIterator. That would be the most correct way to parse words in Java.
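A minimal sketch of how java.text.BreakIterator might be used to collect the first n words. The class and method names are hypothetical, and the filter that keeps only tokens starting with a letter or digit is an assumption about what should count as a word:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakIteratorWords {
    // Collects at most n words using BreakIterator's word boundaries.
    static List<String> firstWords(String text, int n) {
        List<String> words = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next();
                end != BreakIterator.DONE && words.size() < n;
                start = end, end = it.next()) {
            String token = text.substring(start, end);
            // Boundaries also delimit whitespace and punctuation, so keep
            // only tokens that begin with a letter or digit.
            if (Character.isLetterOrDigit(token.codePointAt(0))) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [Hello, world, How]
        System.out.println(firstWords("Hello, world! How are you?", 3));
    }
}
```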
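Answer 2 (score: 0)
(See the next attempt further below. I am leaving this part here for the sake of the thought process...)
Based on my reading of the split() javadoc, I think I know what is going on.
You want to split the string on whitespace, at most n times.
String[] m = s.split("\\s+", nWords);
Then, if you must, splice the tokens back together with whitespace between them: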
StringBuffer strBuf = new StringBuffer();
for (int i = 0; i < nWords; i++) {
    strBuf.append(m[i]).append(" ");
}
Finally, chop it up into five equal strings:
String[] out = new String[5];
String str = strBuf.toString();
int length = str.length();
int chopLength = length / 5;
for (int i = 0; i < 5; i++) {
    int startIndex = i * chopLength;
    out[i] = str.substring(startIndex, startIndex + chopLength);
}
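It is late for me, so you may want to double-check this for correctness yourself. I think the code is at least in the right neighborhood.
OK, here is attempt number 3. After running it through the debugger, I can verify that the only remaining problem is the integer math when the length of the string being sliced is not a multiple of 5, and how best to handle the leftover characters.
It isn't pretty, but it works.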
String[] sliceAndDiceNTimes(String victim, int slices, int wordLimit) {
    // Add one to the wordLimit here, because the rest of the input string
    // (past the number of times split() does its magic) will be in the last
    // array member
    String[] words = victim.split("\\s+", wordLimit + 1);
    // Guard against input that contains fewer than wordLimit words
    int wordCount = Math.min(wordLimit, words.length);
    StringBuffer partialVictim = new StringBuffer();
    for (int i = 0; i < wordCount; i++) {
        partialVictim.append(words[i]).append(' ');
    }
    String[] resultingSlices = new String[slices];
    String recycledVictim = partialVictim.toString().trim();
    int length = recycledVictim.length();
    // Integer division: the last (length % slices) characters are dropped
    int chopLength = length / slices;
    for (int i = 0; i < slices; i++) {
        int chopStartIdx = i * chopLength;
        resultingSlices[i] = recycledVictim.substring(chopStartIdx, chopStartIdx + chopLength);
    }
    return resultingSlices;
}
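One possible way to deal with the leftover characters mentioned above is to give each of the first length % slices pieces one extra character, so that nothing is dropped. This is only a sketch of that idea; the class name EvenSlices and the method slice are hypothetical and not part of the attempt above:

```java
class EvenSlices {
    // Splits s into `slices` pieces whose lengths differ by at most one;
    // the first (s.length() % slices) pieces get one extra character.
    static String[] slice(String s, int slices) {
        String[] out = new String[slices];
        int base = s.length() / slices;
        int extra = s.length() % slices;
        int start = 0;
        for (int i = 0; i < slices; i++) {
            int len = base + (i < extra ? 1 : 0);
            out[i] = s.substring(start, start + len);
            start += len;
        }
        return out;
    }

    public static void main(String[] args) {
        // prints abc|def|gh|ij|kl
        System.out.println(String.join("|", slice("abcdefghijkl", 5)));
    }
}
```

Note that this still cuts by character count, so a piece may begin or end in the middle of a word.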
Important notes:
Answer 3 (score: 0)
I am only guessing at what you need; hopefully this is close:
public static void main(String[] args) {
    String text = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, " +
        "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. " +
        "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris " +
        "nisi ut aliquip ex ea commodo consequat. Rosebud.";
    String[] words = text.split("\\s+");
    final int N = words.length;
    final int C = 5;
    final int R = (N + C - 1) / C;
    for (int r = 0; r < R; r++) {
        for (int x = r, i = 0; (i < C) && (x < N); i++, x += R) {
            System.out.format("%-15s", words[x]);
        }
        System.out.println();
    }
}
This produces:
Lorem          sed            dolore         quis           ex
ipsum          do             magna          nostrud        ea
dolor          eiusmod        aliqua.        exercitation   commodo
sit            tempor         Ut             ullamco        consequat.
amet,          incididunt     enim           laboris        Rosebud.
consectetur    ut             ad             nisi
adipisicing    labore         minim          ut
elit,          et             veniam,        aliquip
import java.util.Scanner;

// Builds a regex that matches the next n whitespace-separated words
static String nextNwords(int n) {
    return "(\\S+\\s*){N}".replace("N", String.valueOf(n));
}
static String[] splitFive(String text, final int N) {
    Scanner sc = new Scanner(text);
    String[] parts = new String[5];
    for (int r = 0; r < 5; r++) {
        // The first N % 5 parts get one extra word
        parts[r] = sc.findInLine(nextNwords(N / 5 + (r < (N % 5) ? 1 : 0)));
    }
    return parts;
}
public static void main(String[] args) {
    String text = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, " +
        "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. " +
        "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris " +
        "nisi ut aliquip ex ea commodo consequat. Rosebud.";
    for (String part : splitFive(text, 23)) {
        System.out.println(part);
    }
}
This prints the first 23 words of text:
Lorem ipsum dolor sit amet,
consectetur adipisicing elit, sed do
eiusmod tempor incididunt ut labore
et dolore magna aliqua.
Ut enim ad minim
Or 7:
Lorem ipsum
dolor sit
amet,
consectetur
adipisicing
Or 3:
Lorem
ipsum
dolor
<blank>
<blank>
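The part sizes in these outputs follow from the expression N / 5 + (r < (N % 5) ? 1 : 0). A small sketch that evaluates only that arithmetic (the class name PartSizes and the method partSizes are hypothetical helpers for illustration):

```java
import java.util.Arrays;

class PartSizes {
    // Number of words assigned to each part when n words are divided into
    // `parts` groups; the first n % parts groups get one extra word.
    static int[] partSizes(int n, int parts) {
        int[] sizes = new int[parts];
        for (int r = 0; r < parts; r++) {
            sizes[r] = n / parts + (r < (n % parts) ? 1 : 0);
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(partSizes(23, 5))); // [5, 5, 5, 4, 4]
        System.out.println(Arrays.toString(partSizes(7, 5)));  // [2, 2, 1, 1, 1]
        System.out.println(Arrays.toString(partSizes(3, 5)));  // [1, 1, 1, 0, 0]
    }
}
```

A size of 0 yields the pattern (\S+\s*){0}, which matches the empty string; that is why the last two parts for 3 words are blank.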
Answer 4 (score: -1)
I have a really ugly solution:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Returns { the collected words joined by spaces, the index just past the last word }
public static Object[] getNumberWords(String s, int nWords, int offset) {
    Object[] os = new Object[2];
    Pattern p = Pattern.compile("(\\w+)");
    Matcher m = p.matcher(s);
    m.region(offset, m.regionEnd());
    int wc = 0;
    int end = offset;
    String total = "";
    // '<' rather than '<=' so that at most nWords words are collected
    while (wc < nWords && m.find()) {
        String word = m.group();
        total += word + " ";
        // Track the real end of the match so the next call resumes after this word
        end = m.end();
        wc++;
    }
    os[0] = total;
    os[1] = end;
    return os;
}
String foo(String s, int n) {
    Object[] os = getNumberWords(s, n, 0);
    String a = (String) os[0];
    String m[] = new String[5];
    int indexCount = 0;
    int lastEndIndex = 0;
    // Stop after five parts even when n is not a multiple of 5
    for (int count = (n / 5); count <= n && indexCount < 5; count += (n / 5)) {
        if (a.length() < count) { count = a.length(); }
        os = getNumberWords(a, (n / 5), lastEndIndex);
        lastEndIndex = (Integer) os[1];
        m[indexCount] = (String) os[0];
        indexCount++;
    }
    return "Part One: \n" + m[0] + "\n\n" +
           "Part Two: \n" + m[1] + "\n\n" +
           "Part Three: \n" + m[2] + "\n\n" +
           "Part Four: \n" + m[3] + "\n\n" +
           "Part Five: \n" + m[4];
}