Let's say I have a 1000 GB text file. I need to find how many times a phrase occurs in the text.
Is there any faster way to do this than the one I am using below? And how much time would it take to complete the task?
phrase = "how fast it is"
count = 0
with open('bigfile.txt') as f:
    for line in f:
        count += line.count(phrase)
If I am right, then without the file already in memory I would have to wait for the PC to load the file each time I do a search, which should take at least 4000 seconds for a 250 MB/s hard drive and a 1000 GB file.
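(As a quick sanity check on that estimate: 1000 GB is roughly 1,000,000 MB, and 1,000,000 MB ÷ 250 MB/s = 4000 s, i.e. a bit over an hour just to read the file once from disk.)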
Answer 0 (score: 24)
I used file.read() to read the data in chunks; in the examples below the chunk sizes are 100 MB, 500 MB, 1 GB and 2 GB respectively. The size of my text file is 2.1 GB.
Code:
from functools import partial

def read_in_chunks(size_in_bytes):
    s = 'Lets say i have a text file of 1000 GB'
    with open('data.txt', 'r+b') as f:
        prev = ''
        count = 0
        f_read = partial(f.read, size_in_bytes)
        for text in iter(f_read, ''):
            if not text.endswith('\n'):
                # if file contains a partial line at the end, then don't
                # use it when counting the substring count.
                text, rest = text.rsplit('\n', 1)
                # pre-pend the previous partial line if any.
                text = prev + text
                prev = rest
            else:
                # if the text ends with a '\n' then simple pre-pend the
                # previous partial line.
                text = prev + text
                prev = ''
            count += text.count(s)
        count += prev.count(s)
        print count
Timings:
read_in_chunks(104857600)
$ time python so.py
10000000
real 0m1.649s
user 0m0.977s
sys 0m0.669s
read_in_chunks(524288000)
$ time python so.py
10000000
real 0m1.558s
user 0m0.893s
sys 0m0.646s
read_in_chunks(1073741824)
$ time python so.py
10000000
real 0m1.242s
user 0m0.689s
sys 0m0.549s
read_in_chunks(2147483648)
$ time python so.py
10000000
real 0m0.844s
user 0m0.415s
sys 0m0.408s
On the other hand, the simple loop version takes about 6 seconds on my system:
def simple_loop():
    s = 'Lets say i have a text file of 1000 GB'
    with open('data.txt') as f:
        print sum(line.count(s) for line in f)
$ time python so.py
10000000
real 0m5.993s
user 0m5.679s
sys 0m0.313s
Results of @SlaterTyranus' grep version on my file:
$ time grep -o 'Lets say i have a text file of 1000 GB' data.txt|wc -l
10000000
real 0m11.975s
user 0m11.779s
sys 0m0.568s
Results of @woot's solution:
$ time cat data.txt | parallel --block 10M --pipe grep -o 'Lets\ say\ i\ have\ a\ text\ file\ of\ 1000\ GB' | wc -l
10000000
real 0m5.955s
user 0m14.825s
sys 0m5.766s
I got the best timing when I used 100 MB as the block size:
$ time cat data.txt | parallel --block 100M --pipe grep -o 'Lets\ say\ i\ have\ a\ text\ file\ of\ 1000\ GB' | wc -l
10000000
real 0m4.632s
user 0m13.466s
sys 0m3.290s
Results of @woot's second solution:
$ time python woot_thread.py # CHUNK_SIZE = 1073741824
10000000
real 0m1.006s
user 0m0.509s
sys 0m2.171s
$ time python woot_thread.py #CHUNK_SIZE = 2147483648
10000000
real 0m1.009s
user 0m0.495s
sys 0m2.144s
System specs: Core i5-4670, 7200 RPM HDD
Answer 1 (score: 8)
Here is a Python attempt... You may need to play with THREADS and CHUNK_SIZE. It is also a bunch of code written in a short time, so I may not have thought of everything. I do overlap my buffers to catch occurrences that fall between chunks, and I extend the last slice to include the remainder of the file.
import os
import threading

INPUTFILE ='bigfile.txt'
SEARCH_STRING='how fast it is'
THREADS = 8 # Set to 2 times number of cores, assuming hyperthreading
CHUNK_SIZE = 32768

FILESIZE = os.path.getsize(INPUTFILE)
SLICE_SIZE = FILESIZE / THREADS

class myThread (threading.Thread):
    def __init__(self, filehandle, seekspot):
        threading.Thread.__init__(self)
        self.filehandle = filehandle
        self.seekspot = seekspot
        self.cnt = 0

    def run(self):
        self.filehandle.seek( self.seekspot )

        p = self.seekspot
        if FILESIZE - self.seekspot < 2 * SLICE_SIZE:
            readend = FILESIZE
        else:
            readend = self.seekspot + SLICE_SIZE + len(SEARCH_STRING) - 1

        overlap = ''
        while p < readend:
            if readend - p < CHUNK_SIZE:
                buffer = overlap + self.filehandle.read(readend - p)
            else:
                buffer = overlap + self.filehandle.read(CHUNK_SIZE)
            if buffer:
                self.cnt += buffer.count(SEARCH_STRING)
            overlap = buffer[len(buffer)-len(SEARCH_STRING)+1:]
            p += CHUNK_SIZE

filehandles = []
threads = []
for fh_idx in range(0,THREADS):
    filehandles.append(open(INPUTFILE,'rb'))
    seekspot = fh_idx * SLICE_SIZE
    threads.append(myThread(filehandles[fh_idx],seekspot ) )
    threads[fh_idx].start()

totalcount = 0
for fh_idx in range(0,THREADS):
    threads[fh_idx].join()
    totalcount += threads[fh_idx].cnt

print totalcount
Answer 2 (score: 7)
cat bigfile.txt | parallel --block 10M --pipe grep -o 'how\ fast\ it\ is' | wc -l
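For reference, GNU parallel's --pipe option splits stdin into chunks of roughly the --block size (split on line boundaries) and pipes each chunk into its own grep process; since grep -o prints one match per line, the final wc -l sums the occurrences across all chunks.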
Answer 3 (score: 3)
Have you considered indexing your file? The way a search engine works is by creating a mapping from words to their locations in the file. Say you have this file:
Foo bar baz dar. Dar bar haa.
You create an index that looks like this:
{
    "foo": {0},
    "bar": {4, 21},
    "baz": {8},
    "dar": {12, 17},
    "haa": {25},
}
A hash-table index can be looked up in O(1), so it is really fast.
And when someone searches for the query "bar baz", you first break the query down into its constituent words, ["bar", "baz"], then look up {4, 21} and {8}; you can then use those positions to jump straight to the places where the query text could possibly exist.
There are also out-of-the-box solutions for indexed search engines, for example Solr or ElasticSearch.
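A rough in-memory sketch of this idea in Python (the helper names build_index and count_phrase are made up for illustration; a real engine such as Solr persists and compresses the index):
import re
from collections import defaultdict

def build_index(filename):
    # Map each lower-cased word to the set of its word positions in the file.
    index = defaultdict(set)
    pos = 0
    with open(filename) as f:
        for line in f:
            for word in re.findall(r'\w+', line.lower()):
                index[word].add(pos)
                pos += 1
    return index

def count_phrase(index, phrase):
    words = re.findall(r'\w+', phrase.lower())
    if not words or words[0] not in index:
        return 0
    # Keep only the starting positions that are followed by the remaining
    # words of the phrase at consecutive positions.
    candidates = index[words[0]]
    for offset, word in enumerate(words[1:], 1):
        candidates = set(p for p in candidates if p + offset in index.get(word, set()))
    return len(candidates)
Note that this counts word-sequence matches, so the result can differ slightly from a raw substring count (punctuation and partial words are treated differently).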
Answer 4 (score: 2)
I'd suggest doing this with grep instead of Python. It will be faster, and generally speaking, if you are dealing with 1000 GB of text on your local machine you are probably doing something wrong, but all judgments aside, grep comes with a couple of options that will make your life easier.
grep -o '<your_phrase>' bigfile.txt|wc -l
Specifically, this will count the number of occurrences of your desired phrase; since grep -o prints each match on its own line, multiple occurrences on a single line are counted as well.
If you don't need that (counting matching lines is enough), you can do something like this instead:
grep -c '<your_phrase>' bigfile.txt
Answer 5 (score: 2)
Here is a third, longer method that uses a database. The database is sure to be larger than the text. I am not sure whether the indexes are optimal, and some space savings could come from playing with that a little (for example, maybe WORD, POS, WORD is better, or perhaps WORD, POS is fine on its own; it needs a little experimenting).
This might not perform well on the test above because it contains a lot of repeated text, but it might perform well on more unique data.
First, create the database by scanning the words and their positions:
import sqlite3
import re

INPUT_FILENAME = 'bigfile.txt'
DB_NAME = 'words.db'
FLUSH_X_WORDS=10000

conn = sqlite3.connect(DB_NAME)
cursor = conn.cursor()

cursor.execute("""
CREATE TABLE IF NOT EXISTS WORDS (
     POS INTEGER
    ,WORD TEXT
    ,PRIMARY KEY( POS, WORD )
) WITHOUT ROWID
""")

cursor.execute("""
DROP INDEX IF EXISTS I_WORDS_WORD_POS
""")

cursor.execute("""
DROP INDEX IF EXISTS I_WORDS_POS_WORD
""")

cursor.execute("""
DELETE FROM WORDS
""")

conn.commit()

def flush_words(words):
    for word in words.keys():
        for pos in words[word]:
            cursor.execute('INSERT INTO WORDS (POS, WORD) VALUES( ?, ? )', (pos, word.lower()) )
    conn.commit()

words = dict()
pos = 0
recomp = re.compile('\w+')
with open(INPUT_FILENAME, 'r') as f:
    for line in f:
        for word in [x.lower() for x in recomp.findall(line) if x]:
            pos += 1
            if words.has_key(word):
                words[word].append(pos)
            else:
                words[word] = [pos]
            if pos % FLUSH_X_WORDS == 0:
                flush_words(words)
                words = dict()

if len(words) > 0:
    flush_words(words)
    words = dict()

cursor.execute("""
CREATE UNIQUE INDEX I_WORDS_WORD_POS ON WORDS ( WORD, POS )
""")

cursor.execute("""
CREATE UNIQUE INDEX I_WORDS_POS_WORD ON WORDS ( POS, WORD )
""")

cursor.execute("""
VACUUM
""")

cursor.execute("""
ANALYZE WORDS
""")
Then we search the database by generating SQL:
import sqlite3
import re

SEARCH_PHRASE = 'how fast it is'
DB_NAME = 'words.db'

conn = sqlite3.connect(DB_NAME)
cursor = conn.cursor()

recomp = re.compile('\w+')
search_list = [x.lower() for x in recomp.findall(SEARCH_PHRASE) if x]

from_clause = 'FROM\n'
where_clause = 'WHERE\n'
num = 0
fsep = ' '
wsep = ' '
for word in search_list:
    num += 1
    from_clause += '{fsep}words w{num}\n'.format(fsep=fsep,num=num)
    where_clause += "{wsep} w{num}.word = '{word}'\n".format(wsep=wsep, num=num, word=word)
    if num > 1:
        where_clause += " AND w{num}.pos = w{lastnum}.pos + 1\n".format(num=str(num),lastnum=str(num-1))
    fsep = ' ,'
    wsep = ' AND'

sql = """{select}{fromc}{where}""".format(select='SELECT COUNT(*)\n',fromc=from_clause, where=where_clause)

res = cursor.execute( sql )
print res.fetchone()[0]
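For illustration, with SEARCH_PHRASE = 'how fast it is' the generated SQL comes out along these lines (the exact whitespace depends on the fsep/wsep separator strings):
SELECT COUNT(*)
FROM
 words w1
 ,words w2
 ,words w3
 ,words w4
WHERE
  w1.word = 'how'
 AND w2.word = 'fast'
 AND w2.pos = w1.pos + 1
 AND w3.word = 'it'
 AND w3.pos = w2.pos + 1
 AND w4.word = 'is'
 AND w4.pos = w3.pos + 1
Each wN is constrained to sit at the position immediately after wN-1, so COUNT(*) is the number of places where the whole phrase occurs.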
Answer 6 (score: 2)
We are talking about a simple count of a specific substring within a rather large data stream. The task is almost certainly I/O bound, but very easily parallelized. The first layer is the raw read speed; we can choose to reduce the amount read by using compression, or to distribute the transfer rate by storing the data in multiple places. Then we have the search itself; substring search is a well-known problem, and again I/O limited. If the data set comes from a single disk, any optimization is pretty much moot, as there is no way that disk can beat a single core in speed.
Assuming we do have chunks, for instance separate blocks of a bzip2 file (if we use a threaded decompressor), stripes in a RAID, or distributed nodes, we have much to gain from processing them individually. Each chunk is searched for needle, then joints can be formed by taking len(needle)-1 characters from the end of one chunk and the beginning of the next, and searching within those.
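As a minimal sequential sketch of that per-chunk-plus-joints counting (chunks is assumed to be an iterable of plain strings; the threaded prototype further below does the same with mmap'ed blocks):
def count_with_joints(chunks, needle):
    # Count occurrences inside each chunk, then check the seam between
    # consecutive chunks using len(needle)-1 characters from each side;
    # a seam that short cannot re-count a match that lies wholly in one chunk.
    total = 0
    border = len(needle) - 1
    prev_tail = ''
    for chunk in chunks:
        total += chunk.count(needle)
        if border > 0:
            total += (prev_tail + chunk[:border]).count(needle)
            prev_tail = chunk[-border:]
    return total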
A quick benchmark demonstrates that the regular expression state machine operates faster than the usual in operator:
>>> timeit.timeit("x.search(s)", "s='a'*500000; import re; x=re.compile('foobar')", number=20000)
17.146117210388184
>>> timeit.timeit("'foobar' in s", "s='a'*500000", number=20000)
24.263535976409912
>>> timeit.timeit("n in s", "s='a'*500000; n='foobar'", number=20000)
21.562405109405518
Another optimization step we can perform, given that we have the data in a file, is to mmap it instead of using the usual read operations. This permits the operating system to use its disk buffers directly. It also allows the kernel to satisfy multiple read requests in arbitrary order without extra system calls, which lets us exploit things like an underlying RAID when operating in multiple threads.
Here is a quickly tossed together prototype. A few things could obviously be improved, such as distributing the chunk processing if we have a multi-node cluster, doing the tail+head check by passing one piece to the neighbouring worker (an order which is not known in this implementation) instead of sending both to a special worker, and implementing an inter-thread bounded queue (pipe) class instead of matching semaphores. It would probably also make sense to move the worker threads outside of the main thread function, since the main thread keeps altering its locals.
from mmap import mmap, ALLOCATIONGRANULARITY, ACCESS_READ
from re import compile, escape
from threading import Semaphore, Thread
from collections import deque

def search(needle, filename):
    # Might want chunksize=RAID block size, threads
    chunksize=ALLOCATIONGRANULARITY*1024
    threads=32
    # Read chunk allowance
    allocchunks=Semaphore(threads) # should maybe be larger
    chunkqueue=deque() # Chunks mapped, read by workers
    chunksready=Semaphore(0)
    headtails=Semaphore(0) # edges between chunks into special worker
    headtailq=deque()
    sumq=deque() # worker final results
    # Note: although we do push and pop at differing ends of the
    # queues, we do not actually need to preserve ordering.
    def headtailthread():
        # Since head+tail is 2*len(needle)-2 long,
        # it cannot contain more than one needle
        htsum=0
        matcher=compile(escape(needle))
        heads={}
        tails={}
        while True:
            headtails.acquire()
            try:
                pos,head,tail=headtailq.popleft()
            except IndexError:
                break # semaphore signaled without data, end of stream
            try:
                prevtail=tails.pop(pos-chunksize)
                if matcher.search(prevtail+head):
                    htsum+=1
            except KeyError:
                heads[pos]=head
            try:
                nexthead=heads.pop(pos+chunksize)
                if matcher.search(tail+nexthead):
                    htsum+=1
            except KeyError:
                tails[pos]=tail
        # No need to check spill tail and head as they are shorter than needle
        sumq.append(htsum)
    def chunkthread():
        threadsum=0
        # escape special characters to achieve fixed string search
        matcher=compile(escape(needle))
        borderlen=len(needle)-1
        while True:
            chunksready.acquire()
            try:
                pos,chunk=chunkqueue.popleft()
            except IndexError: # End of stream
                break
            # Let the re module do the heavy lifting
            threadsum+=len(matcher.findall(chunk))
            if borderlen>0:
                # Extract the end pieces for checking borders
                head=chunk[:borderlen]
                tail=chunk[-borderlen:]
                headtailq.append((pos,head,tail))
                headtails.release()
            chunk.close()
            allocchunks.release() # let main thread allocate another chunk
        sumq.append(threadsum)
    with open(filename,'rb') as infile:
        htt=Thread(target=headtailthread)
        htt.start()
        chunkthreads=[]
        for i in range(threads):
            t=Thread(target=chunkthread)
            t.start()
            chunkthreads.append(t)
        pos=0
        fileno=infile.fileno()
        while True:
            allocchunks.acquire()
            chunk=mmap(fileno, chunksize, access=ACCESS_READ, offset=pos)
            chunkqueue.append((pos,chunk))
            chunksready.release()
            pos+=chunksize
            if pos>chunk.size(): # Last chunk of file?
                break
        # File ended, finish all chunks
        for t in chunkthreads:
            chunksready.release() # wake thread so it finishes
        for t in chunkthreads:
            t.join() # wait for thread to finish
        headtails.release() # post event to finish border checker
        htt.join()
        # All threads finished, collect our sum
        return sum(sumq)

if __name__=="__main__":
    from sys import argv
    print "Found string %d times"%search(*argv[1:])
Also, modifying the whole thing to use some mapreduce routine (map data chunks to counts, heads and tails; reduce by summing the counts and checking the tail+head pieces) is left as an exercise.
Edit: Since it appears this search will be repeated with different needles, an index would be much faster, making it possible to skip searching sections that are known not to match. One possibility is to build a map of which blocks contain any occurrence of various n-grams (accounting for block boundaries by allowing the n-grams to overlap into the next block); those maps can then be combined to find more complex conditions before the blocks of original data need to be loaded. There are certainly databases that do this; look for full-text search engines.
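A rough sketch of such an n-gram-to-block map (build_ngram_index, candidate_chunks and max_needle_len are made-up names for illustration; chunks is assumed to be a list of strings):
def build_ngram_index(chunks, n, max_needle_len):
    # Map each n-gram to the set of chunk indices whose extended text
    # (the chunk plus the first max_needle_len-1 characters of the next
    # chunk) contains it.  Any needle of at most max_needle_len characters
    # that starts in chunk i lies entirely within chunk i's extended text.
    index = {}
    for i, chunk in enumerate(chunks):
        overlap = chunks[i+1][:max_needle_len-1] if i+1 < len(chunks) else ''
        text = chunk + overlap
        for j in range(len(text) - n + 1):
            index.setdefault(text[j:j+n], set()).add(i)
    return index

def candidate_chunks(index, needle, n):
    # Only chunks whose extended text contains every n-gram of the needle
    # can possibly contain it (assumes len(needle) >= n); all other chunks
    # can be skipped without loading their original data.
    grams = [needle[j:j+n] for j in range(len(needle) - n + 1)]
    sets = [index.get(g, set()) for g in grams]
    return set.intersection(*sets)
Only the candidate chunks then need to be loaded and searched for the actual needle.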
Answer 7 (score: 1)
I concede that grep will be faster. I am assuming this is one large string-based file.
But you could do something like the following if you really wanted to.
import os
import re
import mmap

fileName = 'bigfile.txt'
phrase = re.compile("how fast it is")

with open(fileName, 'r') as fHandle:
    data = mmap.mmap(fHandle.fileno(), os.path.getsize(fileName), access=mmap.ACCESS_READ)
    matches = re.findall(phrase, data)
    print('matches = {0}'.format(len(matches)))