Changing the delimiter to read a file in PySpark

Time: 2016-11-24 17:48:37

Tags: python apache-spark pyspark delimiter warc

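I am trying to read a .warc.gz file into an RDD with PySpark. I want the delimiter to be three newlines, so that each record is read as one element of the RDD and I can then parse the records and use the information in them. To start with, I am interested in reading the HTML content of the response records.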

    WARC/1.0
    WARC-Type: request
    WARC-Date: 2014-08-20T06:36:13Z
    WARC-Record-ID: <urn:uuid:0fa7a21c-8de1-44ef-a896-f39aad9fb915>
    Content-Length: 317
    Content-Type: application/http; msgtype=request
    WARC-Warcinfo-ID: <urn:uuid:ac993447-4652-47a0-be86-c14c7dc60e5e>
    WARC-IP-Address: 85.214.72.216
    WARC-Target-URI: http://0pointer.de/photos/?gallery=Chorin%202010-10&photo=119&exif_style=&show_thumbs=

    GET /photos/?gallery=Chorin%202010-10&photo=119&exif_style=&show_thumbs= HTTP/1.0
    Host: 0pointer.de
    Accept-Encoding: x-gzip, gzip, deflate
    User-Agent: CCBot/2.0 (http://commoncrawl.org/faq/)
    Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8



    WARC/1.0
    WARC-Type: response
    WARC-Date: 2014-08-20T06:36:13Z
    WARC-Record-ID: <urn:uuid:f95806a3-162c-41d5-a7d5-a6af7084409b>
    Content-Length: 7502
    Content-Type: application/http; msgtype=response
    WARC-Warcinfo-ID: <urn:uuid:ac993447-4652-47a0-be86-c14c7dc60e5e>
    WARC-Concurrent-To: <urn:uuid:0fa7a21c-8de1-44ef-a896-f39aad9fb915>
    WARC-IP-Address: 85.214.72.216
    WARC-Target-URI: http://0pointer.de/photos/?gallery=Chorin%202010-10&photo=119&exif_style=&show_thumbs=
    WARC-Payload-Digest: sha1:MOKD54JQHY4EWHNOJLT6IXM3ZTACA3CJ
    WARC-Block-Digest: sha1:VEYQQ2LH25SNUWZNVD4KA7EZWRKWK4HG

    HTTP/1.1 200 OK
    Date: Wed, 20 Aug 2014 06:36:13 GMT
    Server: Apache
    X-Powered-By: PHP/5.3.8-1+b1
    Content-Length: 7319
    Connection: close
    Content-Type: text/html; charset=utf-8

    <?xml version="1.0"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
       "http://www.w3.org/TR/2000/REC-xhtml1-20000126/DTD/xhtml1-strict.dtd">
    <html>
    <head>
    <!-- This makes IE6 suck less (a bit) -->
    <!--[if lt IE 7]>
    <script src="inc/styles/ie7/ie7-standard.js" type="text/javascript">
    </script>
    ...
    </html>


    WARC/1.0
    WARC-Type: metadata
    WARC-Date: 2014-08-20T06:36:13Z
    WARC-Record-ID: <urn:uuid:e32aadef-5864-48e5-8829-c1a22223fb86>
    Content-Length: 20
    Content-Type: application/warc-fields
    WARC-Warcinfo-ID: <urn:uuid:ac993447-4652-47a0-be86-c14c7dc60e5e>
    WARC-Concurrent-To: <urn:uuid:f95806a3-162c-41d5-a7d5-a6af7084409b>
    WARC-Target-URI: http://0pointer.de/photos/?gallery=Chorin%202010-10&photo=119&exif_style=&show_thumbs=

    fetchTimeMs: 476



    WARC/1.0
    WARC-Type: request
    WARC-Date: 2014-08-20T05:06:10Z
    WARC-Record-ID: <urn:uuid:010961f7-7378-4ab2-b180-f971f10dff7b>
    Content-Length: 316
    Content-Type: application/http; msgtype=request
    WARC-Warcinfo-ID: <urn:uuid:ac993447-4652-47a0-be86-c14c7dc60e5e>
    WARC-IP-Address: 85.214.72.216
    WARC-Target-URI: http://0pointer.de/photos/?gallery=Hamburg%20Nature&photo=40&exif_style=&show_thumbs=

    GET /photos/?gallery=Hamburg%20Nature&photo=40&exif_style=&show_thumbs= HTTP/1.0
    Host: 0pointer.de
    Accept-Encoding: x-gzip, gzip, deflate
    User-Agent: CCBot/2.0 (http://commoncrawl.org/faq/)
    Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8



    WARC/1.0
    WARC-Type: response
    WARC-Date: 2014-08-20T05:06:10Z
    WARC-Record-ID: <urn:uuid:7c520a24-46eb-435b-ae08-5d51bbe4ff32>
    Content-Length: 7492
    Content-Type: application/http; msgtype=response
    WARC-Warcinfo-ID: <urn:uuid:ac993447-4652-47a0-be86-c14c7dc60e5e>
    WARC-Concurrent-To: <urn:uuid:010961f7-7378-4ab2-b180-f971f10dff7b>
    WARC-IP-Address: 85.214.72.216
    WARC-Target-URI: http://0pointer.de/photos/?gallery=Hamburg%20Nature&photo=40&exif_style=&show_thumbs=
    WARC-Payload-Digest: sha1:Z7OTT2W742LWVRPCNR7DYVSXDT72I3GH
    WARC-Block-Digest: sha1:6CX2E5F3DA6PLY5R5FN7Y4YG73SFMWDI

    HTTP/1.1 200 OK
    Date: Wed, 20 Aug 2014 05:06:10 GMT
    Server: Apache
    X-Powered-By: PHP/5.3.8-1+b1
    Content-Length: 7309
    Connection: close
    Content-Type: text/html; charset=utf-8

    <?xml version="1.0"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
       "http://www.w3.org/TR/2000/REC-xhtml1-20000126/DTD/xhtml1-strict.dtd">
    <html>
    <head>
    <!-- This makes IE6 suck less (a bit) -->
    <!--[if lt IE 7]>
    <script src="inc/styles/ie7/ie7-standard.js" type="text/javascript">
    </script>
    ...
    </html>


    WARC/1.0
    WARC-Type: metadata
    WARC-Date: 2014-08-20T05:06:10Z
    WARC-Record-ID: <urn:uuid:7c899d9d-1934-4096-9037-f8e8edcbf238>
    Content-Length: 20
    Content-Type: application/warc-fields
    WARC-Warcinfo-ID: <urn:uuid:ac993447-4652-47a0-be86-c14c7dc60e5e>
    WARC-Concurrent-To: <urn:uuid:7c520a24-46eb-435b-ae08-5d51bbe4ff32>
    WARC-Target-URI: http://0pointer.de/photos/?gallery=Hamburg%20Nature&photo=40&exif_style=&show_thumbs=

    fetchTimeMs: 502

I tried:

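    from pyspark import SparkConf, SparkContext

    # Set the record delimiter on the SparkConf before creating the context
    conf = SparkConf().setAppName("wdps phase1").setMaster("local")
    conf.set("textinputformat.record.delimiter", "\n\n\n")
    sc = SparkContext(conf=conf)

    # path points to the .warc.gz file; keep only the response records
    data = sc.textFile(path)
    sample = data.filter(checkResponse)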
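
For comparison, here is a minimal sketch of another way to pass the same delimiter setting: through the `conf` argument of `SparkContext.newAPIHadoopFile`, which hands per-job Hadoop configuration to the input format. It assumes the records really are separated by three `\n` characters once decompressed; if the file uses `\r\n` line endings, the delimiter string would need to change. The names `RECORD_DELIMITER`, `HADOOP_CONF`, and `read_records` are made up for illustration, and the sketch has not been checked against this particular file.

```python
# Hypothetical names; only the Hadoop key and the PySpark call are real.
RECORD_DELIMITER = "\n\n\n"  # assumption: records are separated by three newlines
HADOOP_CONF = {"textinputformat.record.delimiter": RECORD_DELIMITER}


def read_records(sc, path):
    """Return an RDD with one string per delimiter-separated record.

    `sc` is an existing SparkContext; `path` is the input file.
    """
    pairs = sc.newAPIHadoopFile(
        path,
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf=HADOOP_CONF,
    )
    # Each element is a (byte offset, record text) pair; keep only the text.
    return pairs.map(lambda kv: kv[1])
```

With a live context this would be called as `data = read_records(sc, path)`, after which the `filter` with `checkResponse` applies as before.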

checkResponse is a function that parses each RDD element as a WARC record, using a Python library, and extracts some information from it.

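    import warc
    from io import StringIO

    def checkResponse(record_text):
        """Return True if record_text parses as a WARC 'response' record."""
        try:
            # WARCFile is a reader; read_record() returns the first record in it
            record = warc.WARCFile(fileobj=StringIO(record_text)).read_record()
            return record['WARC-Type'] == 'response'
        except Exception:
            # Anything that fails to parse is treated as a non-response
            return False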
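
As a library-free cross-check of the same idea, here is a minimal sketch that reads the `WARC-Type` header directly from one record's text and returns the part after the HTTP headers. It assumes each RDD element holds exactly one record laid out like the sample above (WARC headers, blank line, HTTP headers, blank line, body), with `\n` or `\r\n` line endings. `parse_warc_headers`, `is_response`, and `html_of` are hypothetical helper names, not part of the `warc` library.

```python
import re

# Two consecutive line breaks (LF or CRLF) separate the header blocks.
_BLANK_LINE = re.compile(r"\r?\n\r?\n")


def parse_warc_headers(record_text):
    """Return the WARC header fields of one record as a dict."""
    head = _BLANK_LINE.split(record_text.strip(), maxsplit=1)[0]
    headers = {}
    for line in head.splitlines()[1:]:  # skip the "WARC/1.0" version line
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return headers


def is_response(record_text):
    """True if the record's WARC-Type header is 'response'."""
    return parse_warc_headers(record_text).get("WARC-Type") == "response"


def html_of(record_text):
    """Return the body after the HTTP headers, or None if there is none."""
    parts = _BLANK_LINE.split(record_text.strip(), maxsplit=2)
    return parts[2] if len(parts) == 3 else None
```

On an RDD of record strings this could be used as `data.filter(is_response).map(html_of)`. Splitting at most twice keeps any blank lines inside the HTML body intact.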

0 answers:

No answers