I need to find all the lowercase words in a sentence using Python. I thought of using a regex like this:
import re
re.findall(r'\b[^A-Z()\s\d]+\b', 'A word, TWO words')
It works except when I have a case like Aword. How can I fix it?
In general, the regex should handle the following cases:
Aword --> output: word
A word --> output: word
A word word --> output [word, word]
A(word) AND A pers --> output [word, pers]
AwordWOrd --> output [word, rd]
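Answer 0 (score: 5)
You don't actually need a regex for this task; you can use str methods. The regex-based approach is fairly fast, but using str.translate is faster still.
Here's the fastest solution I've found. We create a translation table (a dict) that maps every ASCII character that isn't a lowercase letter to a space. We then use str.split to split the resulting string into a list; called with no arguments, str.split() splits on any run of whitespace and discards the whitespace, leaving only the words we want.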
from string import ascii_lowercase

# Create a translation table that maps all ASCII chars
# except lowercase letters to space
bad = bytes(set(range(128)) - set(ascii_lowercase.encode()))
table = dict.fromkeys(bad, ' ')

def find_lower(s):
    """ Translate non-lowercase chars to space, then split into words """
    return s.translate(table).split()
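As a quick sanity check, here is a minimal sketch that runs the translation-table approach against the cases listed in the question. The `cases` dict is only an illustrative name, not part of the original answer. One assumption worth flagging: the table covers ASCII only, so any non-ASCII character is left untouched by str.translate and stays inside the word.

```python
from string import ascii_lowercase

# Map every ASCII char except a-z to a space (same table as above)
bad = bytes(set(range(128)) - set(ascii_lowercase.encode()))
table = dict.fromkeys(bad, ' ')

def find_lower(s):
    """Translate non-lowercase ASCII chars to space, then split."""
    return s.translate(table).split()

# The question's examples (hypothetical name `cases`, for illustration only)
cases = {
    'Aword': ['word'],
    'A word': ['word'],
    'A word word': ['word', 'word'],
    'A(word) AND A pers': ['word', 'pers'],
    'AwordWOrd': ['word', 'rd'],
}

for text, expected in cases.items():
    print('{!r:22} -> {}'.format(text, find_lower(text)))

# Non-ASCII chars are not in the table, so they pass through unchanged
print(find_lower('naïve Café'))  # ['naïve', 'afé']
```

If the input may contain non-ASCII text and only a-z should survive, the regex `[a-z]+` or an explicit membership test may be the safer choice.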
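Here's some test code that compares various approaches, including Ajax1234's regex solution and some suggestions from regulars in the sopython chat room, including Kevin and user3483203.
The test data for this code consists of strings containing datalen words, where datalen ranges from 32 to 1024. Each word contains 8 random characters; the random word generator mostly picks lowercase letters.
The timeit.Timer.repeat docs mention that the important figure in these results is the minimum (the first in each list); the other figures only indicate how much the results were influenced by variations in system load.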
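To illustrate the point about taking the minimum, here is a small sketch of how a single function could be timed on its own. The helper name `best_time` and the sample input are made up for this example; only `Timer` and `Timer.repeat` come from the timeit module.

```python
from timeit import Timer

def best_time(func, data, repeat=3, number=1000):
    """Time func(data); return (fastest run, all runs sorted ascending).

    Each run calls func(data) `number` times, and there are `repeat` runs.
    """
    times = Timer(lambda: func(data)).repeat(repeat=repeat, number=number)
    return min(times), sorted(times)

# Hypothetical example: time str.split on a short string
best, times = best_time(str.split, 'a few words to split')
print('best: {:.6f}  all runs: {}'.format(best, times))
```

The slower runs are usually explained by other processes competing for the CPU, which is why the fastest run is the one to compare.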
#! /usr/bin/env python3

""" Find all "words" of lowercase chars in a string

    Speed tests, using the timeit module, of various approaches

    See https://stackoverflow.com/q/51710087

    Written by Ajax1234, PM 2Ring, Kevin, and user3483203
    2018.08.07
"""

import re
from string import ascii_lowercase, printable
from timeit import Timer
from random import seed, choice

seed(17)

# A collection of chars with lots of lowercase
# letters to use for making random words
test_chars = 5 * ascii_lowercase + printable

def randword(n):
    """ Make a random "word" of n chars."""
    return ''.join([choice(test_chars) for _ in range(n)])

# Create a translation table that maps all ASCII chars
# except lowercase letters to space
bad = bytes(set(range(128)) - set(ascii_lowercase.encode()))
table = dict.fromkeys(bad, ' ')

def find_lower_pm2r(s, table=table):
    """ Translate non-lowercase chars to space """
    return s.translate(table).split()

def find_lower_pm2r_byte(s):
    """ Convert to bytes & test the ASCII code to see if it's in range """
    return bytes(b if 97 <= b <= 122 else 32 for b in s.encode()).decode().split()

def find_lower_ajax(s):
    """ Use a regex """
    return re.findall('[a-z]+', s)

def find_lower_kevin(s):
    """ Use the str.islower method """
    return "".join([c if c.islower() else " " for c in s]).split()

lwr = set(ascii_lowercase)
def find_lower_3483203(s, lwr=lwr):
    """ Test using a set """
    return ''.join([i if i in lwr else ' ' for i in s]).split()

functions = (
    find_lower_ajax,
    find_lower_pm2r,
    find_lower_pm2r_byte,
    find_lower_kevin,
    find_lower_3483203,
)

def verify(data, verbose=False):
    """ Check that all functions give the same results """
    if verbose:
        print('Verifying:', repr(data))
    results = []
    for func in functions:
        result = func(data)
        results.append(result)
        if verbose:
            print('{:20} : {}'.format(func.__name__, result))
    head, *tail = results
    return all(u == head for u in tail)

def time_test(loops, data):
    """ Perform the timing tests """
    timings = []
    for func in functions:
        t = Timer(lambda: func(data))
        result = sorted(t.repeat(3, loops))
        timings.append((result, func.__name__))
    timings.sort()
    for result, name in timings:
        print('{:20} : {:.6f}, {:.6f}, {:.6f}'.format(name, *result))
    print()

# Check that all functions perform correctly
datalen = 8
data = ' '.join([randword(8) for _ in range(datalen)])
print(verify(data, True), '\n')

# Time it!
loops = 1024
datalen = 32
for _ in range(6):
    data = ' '.join([randword(8) for _ in range(datalen)])
    print('loops', loops, 'len', datalen, verify(data, False))
    time_test(loops, data)
    loops //= 2
    datalen *= 2
Output
Verifying: '3c/zpws% OO8Dtcgl u;Zdm{y. dx]JTyjb pj;+ ym\t O6d.Jbg8 f\tRxrbau z`rxnkI:'
find_lower_ajax      : ['c', 'zpws', 'tcgl', 'u', 'dm', 'y', 'dx', 'yjb', 'pj', 'ym', 'd', 'bg', 'f', 'xrbau', 'z', 'rxnk']
find_lower_pm2r      : ['c', 'zpws', 'tcgl', 'u', 'dm', 'y', 'dx', 'yjb', 'pj', 'ym', 'd', 'bg', 'f', 'xrbau', 'z', 'rxnk']
find_lower_pm2r_byte : ['c', 'zpws', 'tcgl', 'u', 'dm', 'y', 'dx', 'yjb', 'pj', 'ym', 'd', 'bg', 'f', 'xrbau', 'z', 'rxnk']
find_lower_kevin     : ['c', 'zpws', 'tcgl', 'u', 'dm', 'y', 'dx', 'yjb', 'pj', 'ym', 'd', 'bg', 'f', 'xrbau', 'z', 'rxnk']
find_lower_3483203   : ['c', 'zpws', 'tcgl', 'u', 'dm', 'y', 'dx', 'yjb', 'pj', 'ym', 'd', 'bg', 'f', 'xrbau', 'z', 'rxnk']
True

loops 1024 len 32 True
find_lower_pm2r      : 0.038420, 0.075005, 0.082880
find_lower_ajax      : 0.065296, 0.083511, 0.117944
find_lower_3483203   : 0.136276, 0.139128, 0.139208
find_lower_kevin     : 0.225619, 0.241822, 0.250794
find_lower_pm2r_byte : 0.249634, 0.257480, 0.268771

loops 512 len 64 True
find_lower_pm2r      : 0.026582, 0.026888, 0.027445
find_lower_ajax      : 0.059608, 0.061116, 0.074781
find_lower_3483203   : 0.129526, 0.130411, 0.163533
find_lower_kevin     : 0.217885, 0.219185, 0.219834
find_lower_pm2r_byte : 0.237033, 0.237225, 0.237880

loops 256 len 128 True
find_lower_pm2r      : 0.020133, 0.020144, 0.020194
find_lower_ajax      : 0.059215, 0.060153, 0.076451
find_lower_3483203   : 0.125678, 0.125989, 0.127963
find_lower_kevin     : 0.215228, 0.215832, 0.218419
find_lower_pm2r_byte : 0.234180, 0.237770, 0.240791

loops 128 len 256 True
find_lower_pm2r      : 0.017107, 0.017151, 0.017376
find_lower_ajax      : 0.061019, 0.062389, 0.074479
find_lower_3483203   : 0.123576, 0.123802, 0.126174
find_lower_kevin     : 0.212917, 0.213197, 0.214432
find_lower_pm2r_byte : 0.231248, 0.232049, 0.233519

loops 64 len 512 True
find_lower_pm2r      : 0.014723, 0.014752, 0.014787
find_lower_ajax      : 0.054442, 0.055595, 0.068130
find_lower_3483203   : 0.121101, 0.121847, 0.122723
find_lower_kevin     : 0.210416, 0.211491, 0.211810
find_lower_pm2r_byte : 0.232548, 0.232655, 0.234670

loops 32 len 1024 True
find_lower_pm2r      : 0.013886, 0.014000, 0.014106
find_lower_ajax      : 0.051643, 0.052614, 0.065182
find_lower_3483203   : 0.121135, 0.121708, 0.124333
find_lower_kevin     : 0.210581, 0.212073, 0.212232
find_lower_pm2r_byte : 0.245451, 0.251015, 0.252851
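These results were obtained using Python 3.6.0 on my ancient single-core 32-bit 2GHz machine running a Debian-derivative Linux distro. YMMV.
user3483203 has added some Pandas and matplotlib code to generate graphs from the timeit results.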
Answer 1 (score: 3)
You can use [a-z]:
import re
_input = ['AwordWOrd', 'Aword', 'A word', 'A word word', 'A(word) AND A pers']
results = [re.findall('[a-z]+', i) for i in _input]
Output:
[['word', 'rd'], ['word'], ['word'], ['word', 'word'], ['word', 'pers']]
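For readers wondering why the pattern in the question misses Aword, the sketch below compares the two patterns side by side. The likely explanation is that `\b` only matches between a word character and a non-word character, and there is no such boundary between `A` and `w`, so the original pattern can never start a match there. The variable names `original` and `simple` are just labels for this example.

```python
import re

original = r'\b[^A-Z()\s\d]+\b'  # pattern from the question
simple = '[a-z]+'                # pattern from this answer

for text in ['Aword', 'A word', 'AwordWOrd']:
    print('{!r:12} original={} simple={}'.format(
        text, re.findall(original, text), re.findall(simple, text)))

# 'Aword' has word boundaries only at its start and end, so the
# original pattern finds nothing, while [a-z]+ needs no boundary.
```

Dropping the `\b` anchors and matching on a positive character class sidesteps the boundary problem entirely.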
Answer 2 (score: 0)
I believe this should do the trick:
import re
re.findall(r'[a-z\s\d]+\b', 'Aword, TWO words')
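A quick check of this pattern, as a sketch rather than a definitive verdict: because the character class also includes `\s` and `\d`, it appears to pick up whitespace (and would pick up digits) along with the lowercase letters. The comparison against `[a-z]+` is added here for illustration only.

```python
import re

text = 'Aword, TWO words'

# Pattern from this answer: whitespace and digits are part of the class
with_spaces = re.findall(r'[a-z\s\d]+\b', text)
print(with_spaces)   # ['word', ' ', ' words']

# Lowercase letters only, for comparison
letters_only = re.findall('[a-z]+', text)
print(letters_only)  # ['word', 'words']
```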