Python regular expression to separate words that are joined together in a string

Time: 2016-10-05 03:54:51

Tags: python-2.7

test_text = "AirMail from cairnsReceived but NOTavailable at the postOFFICE"

I want to be able to separate the words that are joined together, so that

print test_text

prints the same string as:

test_text = "Air Mail from cairns Received but NOT available at the post OFFICE"

I tried the following code, but it does not do quite what I want:

cleaned_text1 = re.sub(r'([A-Z][^A-Z]*)', r' \1', test_text)
print cleaned_text1

I get the following output:

" Air Mail from cairns Received but  N O Tavailable at the post O F F I C E"

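To see why every capital letter ends up with its own space, it can help to list what the pattern actually matches. The sketch below is only an illustration (the `matches` variable is not part of the question's code): each uppercase letter starts a new match, so an all-caps run such as OFFICE is matched one letter at a time.

```python
import re

test_text = "AirMail from cairnsReceived but NOTavailable at the postOFFICE"

# Each match is one uppercase letter plus any non-uppercase characters after it,
# so consecutive capitals produce one-letter matches.
matches = re.findall(r'[A-Z][^A-Z]*', test_text)
print(matches)
```

Since `re.sub` puts a space in front of every match, the one-letter matches are what split NOT and OFFICE apart.
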
1 Answer:

Answer 0 (score: 0)

The following code should be enough for your needs.

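But if you also need to split on symbols, that is another story... I have attached some test cases to show the situations this approach does not handle. In short, it does not handle trailing spaces, or two or more spaces between words and symbols; it leaves them as they are.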
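If the extra whitespace matters, one option is to clean it up in a separate step. This is a minimal sketch, assuming it is acceptable to collapse every run of whitespace to a single space and to trim both ends; `normalize_spaces` is a hypothetical helper, not part of the code in this answer.

```python
def normalize_spaces(text):
    # str.split() with no arguments splits on any run of whitespace and
    # drops leading and trailing whitespace; joining with a single space
    # therefore collapses every run.
    return ' '.join(text.split())

print(normalize_spaces("  Air..Mail From   cairns Received  "))
# Air..Mail From cairns Received
```

Symbols such as ".." are left untouched by this step.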

It still needs improvement, because I use a while loop in the splitWords function to make up for a shortcoming in the regular expression.

Hope it helps.

import re

def subFunc(matchobj):
    # Insert a space at the first position where the case changes
    # (lowercase to uppercase, or an uppercase run to lowercase).
    text = matchobj.group(0)
    for c in range(len(text) - 1):
        if text[c].isupper() != text[c + 1].isupper():
            return ' '.join([text[:c+1], text[c+1:]])


def splitWords(test_text):
    # One pass can miss a boundary whose first character was consumed by the
    # previous match (e.g. "iL" right after "OFFICEi"), so repeat until stable.
    pattern = r'([a-z][A-Z])|([A-Z]{2,}[a-z])'
    cleaned_text1 = re.sub(pattern, subFunc, test_text)
    while test_text != cleaned_text1:
        test_text = cleaned_text1
        cleaned_text1 = re.sub(pattern, subFunc, test_text)

    print cleaned_text1

test_text = "AirMail from cairnsReceived but NOTavailable at the postOFFICE"
goal_text = "Air Mail from cairns Received but NOT available at the post OFFICE"
splitWords(test_text)
# Air Mail from cairns Received but NOT available at the post OFFICE

test_text = "AirMail from cairnsReceived but NOTavailable at the postOFFICEiLovePython"
splitWords(test_text)
# Air Mail from cairns Received but NOT available at the post OFFICE i Love Python

test_text = "AirMailFrOm cairnsReceived butNOTavailable at the postOFFICEiLovePYTHON"
splitWords(test_text)
# Air Mail Fr Om cairns Received but NOT available at the post OFFICE i Love PYTHON

test_text = "  Air..Mail From   cairnsReceived butNOTavailable at the postOFFICEiLovePYTHON"
splitWords(test_text)
#   Air..Mail From   cairns Received but NOT available at the post OFFICE i Love PYTHON
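
The while loop is needed because each match consumes characters that the next boundary may need. One possible way around it, offered as a sketch rather than a drop-in replacement, is to match only the zero-width positions between characters using lookbehind and lookahead assertions, so that nothing is consumed. The names `BOUNDARY` and `split_words` are hypothetical, and the function returns the string instead of printing it.

```python
import re

# Zero-width positions where a space is wanted:
#   1. between a lowercase letter and an uppercase letter (r|M in "AirMail")
#   2. between two or more uppercase letters and a lowercase letter
#      (T|a in "NOTavailable")
BOUNDARY = re.compile(r'(?<=[a-z])(?=[A-Z])|(?<=[A-Z]{2})(?=[a-z])')

def split_words(text):
    # Nothing is consumed by the pattern, so a single pass finds every boundary.
    return BOUNDARY.sub(' ', text)

print(split_words("AirMail from cairnsReceived but NOTavailable at the postOFFICE"))
# Air Mail from cairns Received but NOT available at the post OFFICE
```

On the four test strings above this gives the same results as splitWords; like the original, it leaves symbols and existing whitespace alone.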