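Find all lowercase words in a sentence

Asked: 2018-08-06 14:36:34

Tags: python regex

I need to find all the lowercase words in a sentence using Python. I thought about using a regex like this: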

import re
re.findall(r'\b[^A-Z()\s\d]+\b', 'A word, TWO words')

It works except when I have a case like Aword. How can I fix this?

In general, the regex should handle the following cases:

Aword --> output: word
A word --> output: word
A word word --> output [word, word]
A(word) AND A pers --> output [word, pers]
AwordWOrd --> output [word, rd]

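3 answers:

Answer 0 (score: 5)

You don't actually need a regex for this task; you can use str methods. The regex-based approach is fairly fast, but using str.translate is faster still.

This is the fastest solution I have found. We create a translation table (a dict) that maps every ASCII character that is not a lowercase letter to a space. We then use str.split to split the resulting string into a list; str.split() splits on any run of whitespace and discards the whitespace, leaving only the words we want.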

from string import ascii_lowercase

# Create a translation table that maps all ASCII chars
# except lowercase letters to space
bad = bytes(set(range(128)) - set(ascii_lowercase.encode()))
table = dict.fromkeys(bad, ' ')

def find_lower(s):
    """ Translate non-lowercase chars to space """
    return s.translate(table).split()
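As a quick check, here is a minimal self-contained sketch that runs the same translation-table idea against the cases listed in the question. The `cases` dict is only there for illustration; the expected values are taken from the question.

```python
from string import ascii_lowercase

# Map every ASCII char that is not a lowercase letter to a space
bad = bytes(set(range(128)) - set(ascii_lowercase.encode()))
table = dict.fromkeys(bad, ' ')

def find_lower(s):
    """Translate non-lowercase ASCII chars to space, then split on whitespace."""
    return s.translate(table).split()

# Inputs and expected outputs from the question (illustrative container)
cases = {
    'Aword': ['word'],
    'A word': ['word'],
    'A word word': ['word', 'word'],
    'A(word) AND A pers': ['word', 'pers'],
    'AwordWOrd': ['word', 'rd'],
}

for text, expected in cases.items():
    print('{!r:22} -> {}'.format(text, find_lower(text)))
    assert find_lower(text) == expected
```

Note that the table only covers ASCII code points. str.translate leaves unmapped characters unchanged, so non-ASCII characters (uppercase or not) pass through untouched; if the input may contain them, one of the other approaches may be a better fit.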

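Here is some test code that compares various approaches, including Ajax1234's regex solution and some suggestions from regulars in the sopython chat room, including Kevin and user3483203.

The test data for this code consists of strings containing datalen words, where datalen ranges from 32 to 1024. Each word contains 8 random characters; the random word generator mostly chooses lowercase letters.

The timeit.Timer.repeat docs mention that the important number in these results is the minimum (the first in each list); the other numbers only indicate how much the results were affected by variations in system load.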
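To make the point about the minimum concrete, here is a small sketch of taking the best of several repeats. The helper name `best_time` and the sample string are made up for this illustration; the full test script below sorts the timings instead, which puts the minimum first.

```python
from timeit import Timer

def best_time(func, data, loops=1000, repeats=3):
    """Return the smallest total time over `repeats` runs of `loops` calls."""
    timer = Timer(lambda: func(data))
    # repeat() returns one total time (in seconds) per run; the minimum
    # is the run least disturbed by other activity on the machine.
    return min(timer.repeat(repeats, loops))

sample = 'A(word) AND A pers'
print('{:.6f}'.format(best_time(str.split, sample)))
```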

#! /usr/bin/env python3

""" Find all "words" of lowercase chars in a string

    Speed tests, using the timeit module, of various approaches

    See https://stackoverflow.com/q/51710087

    Written by Ajax1234, PM 2Ring, Kevin, and user3483203
    2018.08.07
"""

import re
from string import ascii_lowercase, printable
from timeit import Timer
from random import seed, choice

seed(17)

# A collection of chars with lots of lowercase
# letters to use for making random words
test_chars = 5 * ascii_lowercase + printable

def randword(n):
    """ Make a random "word" of n chars."""
    return ''.join([choice(test_chars) for _ in range(n)])

# Create a translation table that maps all ASCII chars
# except lowercase letters to space
bad = bytes(set(range(128)) - set(ascii_lowercase.encode()))
table = dict.fromkeys(bad, ' ')
def find_lower_pm2r(s, table=table):
    """ Translate non-lowercase chars to space """
    return s.translate(table).split()

def find_lower_pm2r_byte(s):
    """ Convert to bytes & test the ASCII code to see if it's in range """
    return bytes(b if 97 <= b <= 122 else 32 for b in s.encode()).decode().split()

def find_lower_ajax(s):
    """ Use a regex """
    return re.findall('[a-z]+', s)

def find_lower_kevin(s):
    """ Use the str.islower method """
    return "".join([c if c.islower() else " " for c in s]).split()

lwr = set(ascii_lowercase)
def find_lower_3483203(s, lwr=lwr):
    """ Test using a set """
    return ''.join([i if i in lwr else ' ' for i in s]).split()

functions = (
    find_lower_ajax,
    find_lower_pm2r,
    find_lower_pm2r_byte,
    find_lower_kevin,
    find_lower_3483203,
)

def verify(data, verbose=False):
    """ Check that all functions give the same results """
    if verbose:
        print('Verifying:', repr(data))
    results = []
    for func in functions:
        result = func(data)
        results.append(result)
        if verbose:
            print('{:20} : {}'.format(func.__name__, result))
    head, *tail = results
    return all(u == head for u in tail)

def time_test(loops, data):
    """ Perform the timing tests """
    timings = []
    for func in functions:
        t = Timer(lambda: func(data))
        result = sorted(t.repeat(3, loops))
        timings.append((result, func.__name__))
    timings.sort()
    for result, name in timings:
        print('{:20} : {:.6f}, {:.6f}, {:.6f}'.format(name, *result))
    print()

# Check that all functions perform correctly
datalen = 8
data = ' '.join([randword(8) for _ in range(datalen)])
print(verify(data, True), '\n')

# Time it!
loops = 1024
datalen = 32
for _ in range(6):
    data = ' '.join([randword(8) for _ in range(datalen)])
    print('loops', loops, 'len', datalen, verify(data, False))
    time_test(loops, data)
    loops //= 2
    datalen *= 2

Output

Verifying: '3c/zpws% OO8Dtcgl u;Zdm{y. dx]JTyjb pj;+ ym\t O6d.Jbg8 f\tRxrbau z`rxnkI:'
find_lower_ajax      : ['c', 'zpws', 'tcgl', 'u', 'dm', 'y', 'dx', 'yjb', 'pj', 'ym', 'd', 'bg', 'f', 'xrbau', 'z', 'rxnk']
find_lower_pm2r      : ['c', 'zpws', 'tcgl', 'u', 'dm', 'y', 'dx', 'yjb', 'pj', 'ym', 'd', 'bg', 'f', 'xrbau', 'z', 'rxnk']
find_lower_pm2r_byte : ['c', 'zpws', 'tcgl', 'u', 'dm', 'y', 'dx', 'yjb', 'pj', 'ym', 'd', 'bg', 'f', 'xrbau', 'z', 'rxnk']
find_lower_kevin     : ['c', 'zpws', 'tcgl', 'u', 'dm', 'y', 'dx', 'yjb', 'pj', 'ym', 'd', 'bg', 'f', 'xrbau', 'z', 'rxnk']
find_lower_3483203   : ['c', 'zpws', 'tcgl', 'u', 'dm', 'y', 'dx', 'yjb', 'pj', 'ym', 'd', 'bg', 'f', 'xrbau', 'z', 'rxnk']
True 

loops 1024 len 32 True
find_lower_pm2r      : 0.038420, 0.075005, 0.082880
find_lower_ajax      : 0.065296, 0.083511, 0.117944
find_lower_3483203   : 0.136276, 0.139128, 0.139208
find_lower_kevin     : 0.225619, 0.241822, 0.250794
find_lower_pm2r_byte : 0.249634, 0.257480, 0.268771

loops 512 len 64 True
find_lower_pm2r      : 0.026582, 0.026888, 0.027445
find_lower_ajax      : 0.059608, 0.061116, 0.074781
find_lower_3483203   : 0.129526, 0.130411, 0.163533
find_lower_kevin     : 0.217885, 0.219185, 0.219834
find_lower_pm2r_byte : 0.237033, 0.237225, 0.237880

loops 256 len 128 True
find_lower_pm2r      : 0.020133, 0.020144, 0.020194
find_lower_ajax      : 0.059215, 0.060153, 0.076451
find_lower_3483203   : 0.125678, 0.125989, 0.127963
find_lower_kevin     : 0.215228, 0.215832, 0.218419
find_lower_pm2r_byte : 0.234180, 0.237770, 0.240791

loops 128 len 256 True
find_lower_pm2r      : 0.017107, 0.017151, 0.017376
find_lower_ajax      : 0.061019, 0.062389, 0.074479
find_lower_3483203   : 0.123576, 0.123802, 0.126174
find_lower_kevin     : 0.212917, 0.213197, 0.214432
find_lower_pm2r_byte : 0.231248, 0.232049, 0.233519

loops 64 len 512 True
find_lower_pm2r      : 0.014723, 0.014752, 0.014787
find_lower_ajax      : 0.054442, 0.055595, 0.068130
find_lower_3483203   : 0.121101, 0.121847, 0.122723
find_lower_kevin     : 0.210416, 0.211491, 0.211810
find_lower_pm2r_byte : 0.232548, 0.232655, 0.234670

loops 32 len 1024 True
find_lower_pm2r      : 0.013886, 0.014000, 0.014106
find_lower_ajax      : 0.051643, 0.052614, 0.065182
find_lower_3483203   : 0.121135, 0.121708, 0.124333
find_lower_kevin     : 0.210581, 0.212073, 0.212232
find_lower_pm2r_byte : 0.245451, 0.251015, 0.252851

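These results were obtained with Python 3.6.0 on my ancient single-core 32-bit 2GHz machine running a Debian-derived Linux. YMMV.

user3483203 has added some Pandas and matplotlib code to generate a graph from the timeit results.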

[Image: Graph of timeit results]

Answer 1 (score: 3)

You can use [a-z]:

import re
_input = ['AwordWOrd', 'Aword', 'A word', 'A word word', 'A(word) AND A pers']
results = [re.findall('[a-z]+', i) for i in _input] 

Output:

[['word', 'rd'], ['word'], ['word'], ['word', 'word'], ['word', 'pers']]
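For context on why the pattern in the question misses Aword: `\b` only matches at a transition between a word character and a non-word character, and there is no such transition between `A` and `w`. The sketch below compares the two patterns side by side; the variable names are just for illustration.

```python
import re

original = re.compile(r'\b[^A-Z()\s\d]+\b')  # pattern from the question
simple = re.compile(r'[a-z]+')               # pattern from this answer

for text in ('Aword', 'A word, TWO words'):
    print('{!r:22} original={} simple={}'.format(
        text, original.findall(text), simple.findall(text)))
```

Keep in mind that `[a-z]` only covers the ASCII lowercase letters.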

Answer 2 (score: 0)

I believe this should do the trick:

import re
re.findall(r'[a-z\s\d]+\b', 'Aword, TWO words')
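One thing to be aware of with this pattern: because the character class includes `\s` and `\d`, the matches can contain whitespace and digits as well as lowercase letters. The sketch below shows what it returns for the sample string, followed by a hypothetical clean-up step (not part of the answer) that strips the whitespace.

```python
import re

pattern = re.compile(r'[a-z\s\d]+\b')  # pattern from this answer
matches = pattern.findall('Aword, TWO words')
print(matches)

# Hypothetical clean-up: strip surrounding whitespace and drop empty results.
# Digits would still be kept, since the pattern allows them.
words = [m.strip() for m in matches if m.strip()]
print(words)
```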