Split sentences with new line as delimiter using Java (1.7) Matcher

Time: 2016-02-12 20:22:04

Tags: java regex

I have a long sentence with embedded new lines or carriage returns that I want to split into separate sentences. An example such as: This is a new line=?xxx\n What's \n up should produce This is a new line=?xxx and What's and up.

I do not want to use String.split("\n") but instead something like:

String x = "  This is a new line=?xxx\n Whats' \n up";
// This is not correct
Pattern p = Pattern.compile("(.*[\r\n]+|$)");
Matcher m = p.matcher(x);
while (m.find()) {
    System.out.println(m.group(1));
}

The above code prints the first two lines but never prints the last one (up).

What's wrong with my regex?

5 answers:

Answer 0: (score: 1)

If you want to match, then use this regex:

(.+?)(?:[\r\n]|$)

(?:[\r\n]|$) will match a line end (\r or \n) OR the end of input, thus making sure the last line is also matched.

However, String.split should be the preferred way here.

RegEx Demo
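A minimal sketch of how the regex above could be driven with a Matcher on Java 1.7. The class name LineMatcherSketch and the helper lines are hypothetical names chosen for illustration; only the pattern itself comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
final class LineMatcherSketch {

    // Group 1 captures the line text (reluctantly); the non-capturing group
    // then consumes a single CR or LF, or asserts the end of the input.
    private static final Pattern LINE = Pattern.compile("(.+?)(?:[\r\n]|$)");

    // Returns the content of group 1 for every match, i.e. every non-empty line.
    static List<String> lines(String input) {
        List<String> result = new ArrayList<String>();
        Matcher m = LINE.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String x = " This is a new line=?xxx\n Whats' \n up";
        for (String line : lines(x)) {
            // Brackets make leading and trailing spaces visible.
            System.out.println("[" + line + "]");
        }
    }
}
```

Because group 1 uses .+?, empty lines produce no match at all, and surrounding spaces are kept as they appear in the input.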

Answer 1: (score: 1)

Why is your regex incorrect?

The (.*[\r\n]+|$) pattern contains 2 alternatives:

  • .*[\r\n]+ - zero or more characters other than line terminators (see below), followed by one or more line breaks (CR and/or LF)
  • | - or...
  • $ - end of string

So you actually misplaced the grouping. Here is how you wanted it to look:

String p = "(.*(?:[\r\n]+|$))";
String x = " This is a new line=?xxx\n Whats' \n up";
Matcher m = Pattern.compile(p).matcher(x);
while (m.find()) {
    System.out.println(m.group(1));
}

See IDEONE demo

If you want to match lines, it is easier to use a . pattern, which matches any character except newline, carriage return, and a few more "vertical whitespace" characters:

Pattern p = Pattern.compile(".+"); // for non-empty lines
Pattern p = Pattern.compile(".*"); // for empty lines as well

See the Java demo:

String x = " This is a new line=?xxx\n Whats' \n up";
Pattern ptrn = Pattern.compile(".+");
Matcher matcher = ptrn.matcher(x);
while (matcher.find()) {
    System.out.println(matcher.group(0));
}

See what . actually does not match:

  • A newline (line feed) character ('\n'),
  • A carriage-return character followed immediately by a newline character ("\r\n"),
  • A standalone carriage-return character ('\r'),
  • A next-line character ('\u0085'),
  • A line-separator character ('\u2028'), or
  • A paragraph-separator character ('\u2029').

If UNIX_LINES mode is activated, then the only line terminators recognized are newline characters.
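The points in this answer can be checked with a small sketch. It compares the misplaced and the corrected grouping on the question's input, then runs .+ with and without Pattern.UNIX_LINES on a string that mixes CRLF and a line-separator character. The class DotLineTerminatorSketch, its helpers findAll and visible, and the sample string mixed are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
final class DotLineTerminatorSketch {

    // Returns the full text (group 0) of every match found in the input.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> matches = new ArrayList<String>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    // Makes line breaks visible when a match is printed.
    static String visible(String s) {
        return s.replace("\r", "<CR>").replace("\n", "<LF>").replace("\u2028", "<LS>");
    }

    static void show(String label, List<String> matches) {
        System.out.println(label);
        for (String match : matches) {
            System.out.println("  [" + visible(match) + "]");
        }
    }

    public static void main(String[] args) {
        String x = " This is a new line=?xxx\n Whats' \n up";

        // "$" is a top-level alternative here, so the text " up" is never captured.
        show("misplaced grouping", findAll(Pattern.compile("(.*[\r\n]+|$)"), x));

        // ".*" must now be followed by line breaks OR the end of the string.
        show("corrected grouping", findAll(Pattern.compile("(.*(?:[\r\n]+|$))"), x));

        // Assumed sample input mixing CRLF and the line-separator character.
        String mixed = "a\r\nb\u2028c";
        show("default dot", findAll(Pattern.compile(".+"), mixed));
        show("UNIX_LINES dot", findAll(Pattern.compile(".+", Pattern.UNIX_LINES), mixed));
    }
}
```

Both grouped patterns can also yield an empty match at the very end of the input, since .* may match nothing right before $; a caller that only wants real lines would skip empty results.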

Answer 2: (score: 1)

Why go this route when there's support out of the box in java.util.regex.Pattern?


Answer 3: (score: 0)

Match the input using a reluctant quantifier.

Try this regex:

"(?m).*$"

The (?m) flag makes $ match at every end of line (platform-independently), and the dot still won't match newlines (so there is no need for a reluctant quantifier). Use m.group(0) or just m.group().

To match non-empty sentences, use a "+":

"(?m).+$"

To match non-blank sentences (at least 1 non-whitespace character), the pattern must additionally require a \S somewhere on the line.

See live demo.
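A minimal sketch of the multiline variants, assuming Java 1.7. The first two patterns are the ones given in this answer. The third, (?m).*\S.*$, is an assumption: it is one way to require at least one non-whitespace character per line and is not necessarily the pattern the answer used. The class name MultilineSketch and the helper findAll are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
final class MultilineSketch {

    // Returns the full text of every match of the regex in the input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Assumed sample input with an empty line and a whitespace-only line.
        String x = " This is a new line=?xxx\n\n   \n Whats' \n up";

        // Every line, but also empty matches wherever ".*" can match nothing.
        System.out.println(findAll("(?m).*$", x).size() + " matches with .*");

        // Non-empty lines only; a whitespace-only line still counts.
        for (String line : findAll("(?m).+$", x)) {
            System.out.println("non-empty: [" + line + "]");
        }

        // ASSUMPTION: requiring \S somewhere on the line skips blank lines.
        for (String line : findAll("(?m).*\\S.*$", x)) {
            System.out.println("non-blank: [" + line + "]");
        }
    }
}
```

Matches are not trimmed, so a caller that wants clean sentences would still call trim() on each result.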

Answer 4: (score: 0)

Try this:

It worked for me.