BeautifulSoup kills program due to large workload

Asked: 2019-04-30 13:40:26

Tags: python python-3.x beautifulsoup

I'm developing a Python script (https://github.com/BrunorpPaixao/lmfm4u) that cleans up the text you get from Facebook when you ask them to send you all the messages you have ever sent or received.

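For most files it works fine, but when I try to scan a huge text file, after staring at a frozen screen for a few minutes, I get a message in the terminal saying "Killed". I think the problem comes from BeautifulSoup and the way it loads large files.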

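I'd like to know whether there is any way to improve my code to speed up the "cleaning", or whether there is something wrong with the way I'm currently doing it.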

So... when we get the files from Facebook, they look like this:

</div><div></div><div></div></div></div><div class="_3-94 _2lem">11/12/2017, 21:48</div></div><div class="pam _3-95 _2pi0 _2lej uiBoxWhite noborder"><div class="_3-96 _2pio _2lek _2lel">Francisco Zacarias</div><div class="_3-96 _2let"><div><div></div><div>eles nao deram respwan</div><div></div><div></div></div></div><div class="_3-94 _2lem">11/12/2017, 21:48</div></div><div class="pam _3-95 _2pi0 _2lej uiBoxWhite noborder"><div class="_3-96 _2pio _2lek _2lel">Francisco Zacarias</div><div class="_3-96 _2let"><div><div></div><div>deve ter bugado ent</div><div></div><div></div></div></div><div class="_3-94 _2lem">11/12/2017, 21:48</div></div><div class="pam _3-95 _2pi0 _2lej uiBoxWhite noborder"><div class="_3-96 _2pio _2lek _2lel">Bruno Paixao</div><div class="_3-96 _2let"><div><div></div><div>e usares nos pilares</div><div></div><div></div></div></div><div class="_3-94 _2lem">11/12/2017, 21:44</div></div><div class="pam _3-95 _2pi0 _2lej uiBoxWhite noborder"><div class="_3-96 _2pio _2lek _2lel">Bruno Paixao</div><div class="_3-96 _2let"><div><div></div><div>é so dropares 5</div><div></div><div></div></div></div><div class="_3-94 _2lem">11/12/2017, 21:44</div></div><div class="pam _3-95 _2pi0 _2lej uiBoxWhite noborder"><div class="_3-96 _2pio _2lek _2lel">Bruno Paixao</div><div class="_3-96 _2let"><div><div></div><div>e usas uma em cada pilar</div><div></div><div></div></div></div><div class="_3-94 

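It's a pile of HTML on one single line. I use BeautifulSoup to strip all the HTML tags, then a regular expression to split the messages, and finally format them properly.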

import re
import sys
import time

from bs4 import BeautifulSoup

# `document` is assumed to hold the contents of the Facebook HTML file.
# Matches timestamps such as "11/12/2017, 21:48".
regex = re.compile(r'\d{2}/\d{2}/\d{4},.\d{2}:\d{2}')
CleanStartTime = time.time()
# Strip all HTML tags, keeping only the text.
cleaned = BeautifulSoup(document, "lxml").get_text()
cleaned = cleaned.split(" ")
if len(cleaned) < 290:
    print("Wrong type of file, please choose a facebook messenger history file.")
    sys.exit()
# Drop the first 290 words with one slice instead of 290 pop(0) calls.
cleaned = " ".join(cleaned[290:])
cleanedwregex = re.split(regex, cleaned)
listofdates = re.findall(regex, cleaned)
CleanEndTime = time.time()
print("HTML cleaned in %.2f seconds" % (CleanEndTime - CleanStartTime))

PrintStartTime = time.time()
# Pair each date with the text that precedes it; join once instead of repeated +=.
gucciString = "".join(
    date + " | " + text + "\n" for date, text in zip(listofdates, cleanedwregex)
)

In the end, the output looks like this:

11/12/2017, 21:44 | Bruno Paixaotens de matar os orcs
11/12/2017, 21:44 | Bruno Paixaoai
11/12/2017, 21:31 | Francisco Zacariaso que é que tenho de fazer aqui
11/12/2017, 21:31 | Francisco Zacarias
11/12/2017, 21:31 | Francisco Zacariascaro engenheiro do ips peço desculpa estar a incomodar mas nao sei o que fazer agora

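I'd appreciate some help with "BeautifulSouping" large files. I have thought about checking the total number of characters (some files reach 58 million characters), processing the file in pieces, and appending each piece to the final file, but before trying that I would like to hear some of your opinions.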
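To make the "process it in pieces" idea concrete, here is a minimal sketch of what I have in mind. It is only an assumption about how this could work, not something I have run on the real export: it swaps BeautifulSoup for the standard library's html.parser.HTMLParser, which accepts the input chunk by chunk and never builds a document tree. The names TextCollector, extract_text and iter_messages, and the 1 MiB chunk size, are made up for illustration. The extracted text may differ slightly from what get_text() returns, and the 290-word skip from my script is not reproduced here.

```python
import io
import re
from html.parser import HTMLParser

# Same timestamp pattern as in the script, e.g. "11/12/2017, 21:48".
DATE_RE = re.compile(r"\d{2}/\d{2}/\d{4},.\d{2}:\d{2}")


class TextCollector(HTMLParser):
    """Hypothetical helper: keeps text nodes only, builds no tree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(stream, chunk_size=1 << 20):
    """Read `stream` in chunks and return the text with all tags removed.

    HTMLParser.feed() buffers incomplete tags, so a tag that is cut in
    half by a chunk boundary is completed by the next feed() call.
    """
    parser = TextCollector()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    parser.close()  # flush whatever is still buffered
    return "".join(parser.parts)


def iter_messages(text):
    """Yield 'date | text before that date', one line at a time."""
    previous_end = 0
    for match in DATE_RE.finditer(text):
        yield match.group() + " | " + text[previous_end:match.start()]
        previous_end = match.end()


if __name__ == "__main__":
    sample = (
        "<div>Bruno Paixao</div><div>ai</div>"
        '<div class="_3-94 _2lem">11/12/2017, 21:44</div>'
    )
    # A tiny chunk size forces tags to be split across feed() calls.
    for line in iter_messages(extract_text(io.StringIO(sample), chunk_size=7)):
        print(line)
```

With a real export, the stream would be an open file (for example one opened with open(path, encoding="utf-8")), and each line from iter_messages could be written straight to the output file instead of being collected in one big string. The full text is still held in memory, but that should be far smaller than a parsed tree of a 58-million-character document.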

If you want to see the full code, click the GitHub link at the top. Thank you very much for your help!

0 answers:

No answers