Question

I'm trying to create a regex (implementable in JavaScript/Node.js) to:

Add a space whenever a letter or character (A-Z, a-z, !@#$%^&*(), etc., but NOT a number) is followed by a period that is then followed by a capital letter (with no space in between), and/or
Add a period (.) whenever whitespace is followed by a single capital letter (A-Z, but NOT a number or other character), UNLESS more than one capital letter follows, as in an acronym, and/or
Add a period (.) whenever a character, letter, or number is NOT followed by anything else in the string.

For example, in the first case:

This is a sample sentence.This is a sample new sentence.

Should become:

This is a sample sentence. This is a sample new sentence.

In the second case, for example:

This is a sample sentence This is a sample new sentence.

Should become:

This is a sample sentence. This is a sample new sentence.

But also, in the second case:

This is a sample sentence with TEST This is a sample new sentence.

Should become:

This is a sample sentence with TEST. This is a sample new sentence.

In the third case, for example:

This is a sample sentence. This is a sample new sentence

Should become:

This is a sample sentence. This is a sample new sentence.

Notice the differences in placement of periods and spacing for these examples that I am looking to search for and change.

I've searched for variants of this and found some, but nothing that fits the exact criteria listed above. I'm only worried about periods and spaces at this point in time, not other types of punctuation unless there is a more universal solution that can apply to more than just these cases. I'm looking to use this to start cleaning up the grammar in some log files and other areas.

I apologize in advance if this reads as too complicated. Leave a comment and I will gladly clarify if needed.

Answer 1

While I should include the standard caution against using programmatic means to mess around with natural languages (which are very complex and difficult for computers to understand), a series of regexes that (when run in sequence on the string) do what you want appears below.

For the first scenario:

s/([^0-9.])\.([A-Z])/\1. \2/g

For the second scenario:

s/([^.]) ([A-Z][a-z])/\1. \2/g

For the third scenario:

s/([^.])$/\1./g

To break it down a little:

s/A/B/g means "replace every occurrence of regex A in the text with B".

(A) means "capture A so we can use it again later" (this is known as a capture group).

[^0-9.] means "match any single character that is not a digit or the period character". This is a negated character class.

\. matches the literal period (".") character.

$ is the end anchor: it matches at the end of the string (or at the end of each line when the engine runs in multi-line mode).

\1 and \2 refer to the first and second capture groups, respectively.

So, basically, these regexes capture the text around the spot to be modified, then replace the whole match with the captured text plus the inserted space or period.
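
Since the question asks for JavaScript, here is a minimal sketch of the same three steps with String.prototype.replace. It assumes the whole text is a single string, and the first pattern requires a capital letter after the period, as the question specifies. fixSentences is a hypothetical helper name, not part of any library. Note that JavaScript writes backreferences in the replacement string as $1 and $2 rather than \1 and \2.

```javascript
// Sketch only: the three substitutions run in sequence on one string.
// fixSentences is a hypothetical name chosen for this example.
function fixSentences(text) {
  return text
    // Scenario 1: non-digit, period, capital letter with no space -> insert a space.
    .replace(/([^0-9.])\.([A-Z])/g, "$1. $2")
    // Scenario 2: a space before a capitalized word, with no period before
    // the space -> insert a period. All-caps words such as TEST are skipped
    // because [A-Z][a-z] needs a lowercase second letter.
    .replace(/([^.]) ([A-Z][a-z])/g, "$1. $2")
    // Scenario 3: the string does not end with a period -> append one.
    .replace(/([^.])$/, "$1.");
}

console.log(fixSentences("This is a sample sentence.This is a sample new sentence"));
// prints "This is a sample sentence. This is a sample new sentence."
```

One caveat with the second step: it cannot tell a new sentence from a proper noun, so "I met John" becomes "I met. John." under this sketch. Without the m flag, $ only matches at the very end of the input, so multi-line log text would need to be split into lines first or handled with a multi-line variant.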

Answer 2

For the 1st case, use the following to match and replace with space:

(?<=\.)(?=[^\d\s])

For the 2nd and 3rd cases, use the following regex to match and replace with .

(?<!\.)$|(?<![.\s])(?=\s[A-Z][a-z])
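
A minimal sketch of how zero-width lookaround patterns like these can be applied in Node.js. The patterns used here are one possible variant: the space pattern uses a lookbehind so the insertion point falls after the period, and the period pattern skips positions already preceded by a period and requires a lowercase second letter so acronyms are left alone. addSpaces and addPeriods are hypothetical helper names, and lookbehind support is assumed (current V8-based Node.js releases have it).

```javascript
// Sketch only: each pattern matches an empty position, and replace() inserts
// the replacement text at that position.

// Insert a space right after a period that is directly followed by a
// character that is neither a digit nor whitespace.
const addSpaces = (text) => text.replace(/(?<=\.)(?=[^\d\s])/g, " ");

// Insert a period at the end of the string if it does not already end with
// one, or before whitespace that precedes a capitalized word, unless a period
// (or more whitespace) already sits before that whitespace.
const addPeriods = (text) =>
  text.replace(/(?<!\.)$|(?<![.\s])(?=\s[A-Z][a-z])/g, ".");

console.log(addPeriods(addSpaces("A sentence.Another one Then a third")));
// prints "A sentence. Another one. Then a third."
```

Because the matches are zero-width, no capture groups or backreferences are needed; the trade-off is that the conditions live entirely in the lookarounds, which can be harder to read.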
