I need to split a string into words, but also get the starting and ending offset of each word. For example, if the input string is:
input_string = "ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE"
I want to get:
[('ONE', 0, 2), ('ONE', 5, 7), ('ONE', 9, 11), ('TWO', 17, 19), ('TWO', 21, 23),
('ONE', 25, 27), ('TWO', 29, 31), ('TWO', 33, 35), ('THREE', 37, 41)]
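I have some working code that does this using input_string.split and calls to .index, but it is slow. I tried writing it by iterating through the string manually, but that was slower still. Does anyone have a fast algorithm for this?
Here are my two versions: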
def using_split(line):
    words = line.split()
    offsets = []
    running_offset = 0
    for word in words:
        word_offset = line.index(word, running_offset)
        word_len = len(word)
        running_offset = word_offset + word_len
        offsets.append((word, word_offset, running_offset - 1))
    return offsets

def manual_iteration(line):
    start = 0
    offsets = []
    word = ''
    for off, char in enumerate(line + ' '):
        if char in ' \t\r\n':
            if off > start:
                offsets.append((word, start, off - 1))
            start = off + 1
            word = ''
        else:
            word += char
    return offsets
By using timeit, "using_split" is the fastest, followed by "manual_iteration"; the slowest so far is the re.finditer approach suggested in the answers below.
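For reference, a comparison like the one described can be reproduced with a small timeit harness. This is only a minimal sketch, not the asker's actual benchmark: the names using_finditer and benchmark and the call count are assumptions made for illustration, and the timings will vary with the machine and Python version.

```python
import re
import timeit

LINE = "ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE"

def using_split(line):
    # Split first, then locate each word again with str.index.
    offsets = []
    running_offset = 0
    for word in line.split():
        word_offset = line.index(word, running_offset)
        running_offset = word_offset + len(word)
        offsets.append((word, word_offset, running_offset - 1))
    return offsets

def using_finditer(line):
    # One regex pass; m.end() is exclusive, so subtract 1 for an inclusive end.
    return [(m.group(0), m.start(), m.end() - 1) for m in re.finditer(r"\S+", line)]

def benchmark(func, number=10000):
    """Return the approximate time per call in microseconds."""
    total = timeit.timeit(lambda: func(LINE), number=number)
    return total / number * 1e6

if __name__ == "__main__":
    for func in (using_split, using_finditer):
        print("%-16s %.2f usec per call" % (func.__name__, benchmark(func)))
```

Both functions should return the same list, so any difference in the printed numbers reflects speed only.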
Answer 0 (score: 19)
The following will do it:
import re
s = 'ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE'
ret = [(m.group(0), m.start(), m.end() - 1) for m in re.finditer(r'\S+', s)]
print(ret)
This produces:
[('ONE', 0, 2), ('ONE', 5, 7), ('ONE', 9, 11), ('TWO', 17, 19), ('TWO', 21, 23),
('ONE', 25, 27), ('TWO', 29, 31), ('TWO', 33, 35), ('THREE', 37, 41)]
Answer 1 (score: 8)
The following runs slightly faster, saving about 30%. All I did was bind the functions to local names in advance:
def using_split2(line, _len=len):
    words = line.split()
    index = line.index
    offsets = []
    append = offsets.append
    running_offset = 0
    for word in words:
        word_offset = index(word, running_offset)
        word_len = _len(word)
        running_offset = word_offset + word_len
        append((word, word_offset, running_offset - 1))
    return offsets
Answer 2 (score: 7)
import re

def split_span(s):
    for match in re.finditer(r"\S+", s):
        span = match.span()
        yield match.group(0), span[0], span[1] - 1
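Answer 3 (score: 2)
I was able to get about a 35% speedup in a few minutes by outright cheating: I used cython to turn your using_split() function into a C-based python module. This was my first excuse to try cython, and I found it pretty easy and rewarding; see below.
Punting to C was a last resort: first, I spent a few hours trying to find a faster algorithm than your using_split() version. The thing is, the native python str.split() is surprisingly fast, far faster than anything I tried using numpy or re. So even though you are scanning the string twice, str.split() is fast enough that it doesn't seem to matter, at least not for this particular test data.
To use cython, I put your parser in a file named parser.pyx: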
===================== parser.pyx ==============================
def using_split(line):
    words = line.split()
    offsets = []
    running_offset = 0
    for word in words:
        word_offset = line.index(word, running_offset)
        word_len = len(word)
        running_offset = word_offset + word_len
        offsets.append((word, word_offset, running_offset - 1))
    return offsets
===============================================================
Then I ran this to install cython (assuming a debian-ish Linux box):
sudo apt-get install cython
Then I called the parser from this python script:
================== using_cython.py ============================
#!/usr/bin/python
import pyximport; pyximport.install()
import parser
input_string = "ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE"
def parse():
    return parser.using_split(input_string)
===============================================================
To test, I ran this:
python -m timeit "import using_cython; using_cython.parse();"
On my machine, your pure-python using_split() function averages about 8.5 usec per run, while my cython version averages about 5.5 usec.
More details at http://docs.cython.org/src/userguide/source_files_and_compilation.html
Answer 4 (score: 1)
Warning: the speed of this solution is limited by the speed of light:
def get_word_context(input_string):
    start = 0
    for word in input_string.split():
        c = word[0]  # first character
        start = input_string.find(c, start)
        end = start + len(word) - 1
        yield (word, start, end)
        start = end + 2  # end + 1 is whitespace, so the next word starts no earlier than end + 2

print(list(get_word_context("ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE")))
[('ONE', 0, 2), ('ONE', 5, 7), ('ONE', 9, 11), ('TWO', 17, 19), ('TWO', 21, 23),
('ONE', 25, 27), ('TWO', 29, 31), ('TWO', 33, 35), ('THREE', 37, 41)]
Answer 5 (score: 0)
The following ideas may result in a speedup:
Note: I haven't tested these, but here is an example
from collections import deque

def using_split(line):
    # Upper bound on (whitespace gap + word length); index() raises ValueError if it is exceeded
    MAX_WORD_LENGTH = 10
    line_index = line.index
    words = line.split()
    offsets = deque()
    offsets_append = offsets.append
    running_offset = 0
    for word in words:
        word_offset = line_index(word, running_offset, running_offset + MAX_WORD_LENGTH)
        running_offset = word_offset + len(word)
        offsets_append((word, word_offset, running_offset - 1))
    return list(offsets)
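One caveat about bounding the search: str.index(sub, start, end) only succeeds when the whole word lies inside line[start:end], so a long run of whitespace (or a word longer than the bound) raises ValueError. A small sketch of that behaviour follows; the helper name bounded_index and the test line are made up for illustration.

```python
line = "ONE" + " " * 12 + "TWO"  # 'TWO' starts at offset 15
MAX_WORD_LENGTH = 10             # same bound as in the sketch above

def bounded_index(line, word, start, bound):
    """Return the offset of word, or None if it is not fully inside line[start:start + bound]."""
    try:
        return line.index(word, start, start + bound)
    except ValueError:
        return None

print(bounded_index(line, "TWO", 3, MAX_WORD_LENGTH))  # None: the window is line[3:13]
print(line.index("TWO", 3))                            # 15: the unbounded search finds it
```

If the bound is kept, it probably needs to cover the longest expected gap plus the longest expected word, or the function needs a fallback to the unbounded search.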
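Answer 6 (score: 0)
Here is a C-oriented approach that iterates over the whole string only once. You can also define your own separators. Tested and working, but it could probably be cleaner.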
def mySplit(myString, mySeperators):
    w = []
    iW = False  # True while we are inside a word
    word = [None, None, None]  # [text, start, end]
    for i, c in enumerate(myString):
        if c not in mySeperators:
            if not iW:
                word[1] = i
                iW = True
        elif iW:
            word[2] = i - 1
            word[0] = myString[word[1]:i]
            w.append(tuple(word))
            word = [None, None, None]
            iW = False
    if iW:
        # flush the last word when the string does not end with a separator
        w.append((myString[word[1]:], word[1], len(myString) - 1))
    return w

mySeperators = [" ", "\t"]
myString = "ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE"
splitted = mySplit(myString, mySeperators)
print(splitted)
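As a possible cleanup of the same single-pass idea, the state can be reduced to one variable holding the start of the current word. This is a sketch under that assumption, not the answerer's code; the name split_with_offsets and its default separators are hypothetical.

```python
def split_with_offsets(text, separators=" \t"):
    """Return (word, start, end) tuples with inclusive end offsets, scanning text once."""
    result = []
    start = None  # offset where the current word began, or None between words
    for i, ch in enumerate(text):
        if ch in separators:
            if start is not None:
                result.append((text[start:i], start, i - 1))
                start = None
        elif start is None:
            start = i
    if start is not None:
        # the text ended in the middle of a word
        result.append((text[start:], start, len(text) - 1))
    return result

print(split_with_offsets("a,b;;c", ",;"))  # [('a', 0, 0), ('b', 2, 2), ('c', 5, 5)]
```

Passing the separators as a string keeps the membership test cheap for a handful of characters; a set would be the safer choice for a large separator alphabet.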
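Answer 7 (score: 0)
This seems to work pretty quickly (match.end() is exclusive, hence the - 1 to get the inclusive end offset asked for):
import re
tuple_list = [(match.group(), match.start(), match.end() - 1) for match in re.compile(r"\S+").finditer(input_string)]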
Answer 8 (score: 0)
Here are a couple of ideas you could profile to see whether they are fast enough:
# pad with one space on each side so every word has a transition before and after it
input_string = "".join([" ", "ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE", " "])

# idea 1: take the words from split() and the offsets from the whitespace transitions
# pre processing
from itertools import chain
stuff = list(chain(*zip(range(len(input_string)), range(len(input_string)))))
print(stuff)
stuff = iter(stuff)
next(stuff)
# calculate
switches = (i for i in range(0, len(input_string) - 1) if (input_string[next(stuff)] in " \t\r\n") ^ (input_string[next(stuff)] in " \t\r\n"))
print([(word, next(switches), next(switches) - 1) for word in input_string.split()])

# idea 2: take both the words and the offsets from the transitions
# pre processing
stuff = list(chain(*zip(range(len(input_string)), range(len(input_string)))))
print(stuff)
stuff = iter(stuff)
next(stuff)
# calculate
switches = (i for i in range(0, len(input_string) - 1) if (input_string[next(stuff)] in " \t\r\n") ^ (input_string[next(stuff)] in " \t\r\n"))
# i is the padded index just before the word, so the word itself is input_string[i + 1:j + 1]
print([(input_string[i + 1:j + 1], i, j - 1) for i, j in zip(switches, switches)])
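Answer 9 (score: 0)
I found that the python loop is the slow operation here, so I started using bitmaps. I got this far and it is still fast, but I can't figure out a loop-free way to get the start/stop indices out of it: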
# maps each of the 256 character codes to "0" (whitespace) or "1" (anything else)
table = "".join(["0" if chr(i).isspace() else "1" for i in range(256)])

def indexed6(line):
    binline = line.translate(table)
    return int(binline, 2) ^ int(binline + "0", 2)
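The returned integer has a bit set for each start position and for each stop position + 1.
P.S. zip() is relatively slow: fast enough to use once, too slow to use three times.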
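To make the bitmap idea concrete, here is one way the set bits could be turned back into offsets. It is only a sketch: it still loops in Python (so it does not solve the loop-free problem), it assumes every character in the line has a code below 256, and the names transition_bits and decode_offsets are invented for illustration.

```python
# "0" for whitespace, "1" for anything else, indexed by character code (0..255 only)
TABLE = "".join("0" if chr(i).isspace() else "1" for i in range(256))

def transition_bits(line):
    """Integer with a bit set at each word start and at each word stop + 1."""
    binline = line.translate(TABLE)
    return int(binline, 2) ^ int(binline + "0", 2)

def decode_offsets(line):
    """Turn the transition bitmap back into (word, start, inclusive_end) tuples."""
    if not line:
        return []
    # Left-pad to len(line) + 1 digits so string position k corresponds to offset k.
    bits = format(transition_bits(line), "b").zfill(len(line) + 1)
    positions = [k for k, bit in enumerate(bits) if bit == "1"]
    # Set bits alternate: start, stop + 1, start, stop + 1, ...
    return [(line[s:e], s, e - 1) for s, e in zip(positions[0::2], positions[1::2])]

print(decode_offsets("  a b"))  # [('a', 2, 2), ('b', 4, 4)]
```

The XOR of the bitmap with itself shifted by one position marks exactly the places where a whitespace character and a non-whitespace character are adjacent, which is why the set bits come in start / stop + 1 pairs.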