Question

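I need to analyze Common Crawl data, and I am using Python 2.7. I looked at some WARC files and found that the warc.gz files contain some binary data. I need to parse the HTML source with bs4, but how can I tell which records are text and which are binary? For example, there is a record for this URL that contains binary data: http://aa-download.avg.com/filedir/inst/avg_free_x86_all_2015_5315a8160.exe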

How can I skip the binary data and get only the text data in Python?

Answer 1

You can use python-magic to identify the content.

In [1]: import magic

In [2]: magic.from_file('places.sqlite')
Out[2]: b'SQLite 3.x database, user version 33, last written using SQLite version 3015001'

In [3]: magic.from_file('installed-port-list.txt')
Out[3]: b'ASCII text'

In [4]: magic.from_file('quotes.gz')
Out[4]: b'gzip compressed data, was "quotes", last modified: Tue Dec  6 20:35:44 2016, from Unix'

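Note that although these examples use the from_file function, python-magic also has a from_buffer function, which is the one to use when the record payload is already in memory.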
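As a rough sketch of how from_buffer could be applied to this problem: the snippet below asks python-magic for a MIME type (via its `mime=True` argument) and keeps only payloads whose type starts with `text/`. The helper names `sniff_mime` and `is_text` are made up for illustration, the 2048-byte sample size is an arbitrary choice, and the NUL-byte fallback is my own assumption for machines where python-magic or libmagic is not available; it is not part of the library. Reading records out of the WARC file is not shown.

```python
try:
    import magic  # python-magic; requires the libmagic shared library
except ImportError:
    magic = None


def sniff_mime(payload):
    """Guess a MIME type from the first bytes of a record payload."""
    head = payload[:2048]  # a small sample is usually enough for detection
    from_buffer = getattr(magic, "from_buffer", None)
    if from_buffer is not None:
        return from_buffer(head, mime=True)
    # Fallback heuristic (an assumption, not python-magic behaviour):
    # treat anything containing a NUL byte as binary.
    return "application/octet-stream" if b"\x00" in head else "text/plain"


def is_text(payload):
    """Return True if the payload looks like text (for example HTML)."""
    mime = sniff_mime(payload)
    if isinstance(mime, bytes):  # some python-magic versions return bytes
        mime = mime.decode("ascii", "replace")
    return mime.startswith("text/")


# Hypothetical payloads standing in for WARC response bodies.
html_body = b"<html><body><p>hello</p></body></html>"
exe_body = b"MZ\x90\x00\x03\x00\x00\x00\x04\x00\x00\x00\xff\xff\x00\x00"

for body in (html_body, exe_body):
    if is_text(body):
        print("text record, pass it to bs4")
    else:
        print("binary record, skip it")
```

Checking only the `text/` prefix will also skip formats such as `application/json` or `application/xml`; if those matter, the condition would need to be widened. The Content-Type header of the HTTP response stored in each record may serve as a cheaper first filter, though servers do not always report it accurately.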
