Python 3 html.parser: "feed" returns None

Asked: 2016-09-22 05:00:12

Tags: python-3.x html-parsing

I am trying to save the result of parser.feed to a string so that I can parse it further, but parser.feed returns None.

Here is my code:

import requests
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        return("Encountered some data  : ", data.encode('utf-8'))

list_of_10K_text_files = ['https://www.sec.gov/Archives/edgar/data/200406/000020040616000071/0000200406-16-000071.txt', 
                      'https://www.sec.gov/Archives/edgar/data/40545/000004054516000145/0000040545-16-000145.txt', 
                      'https://www.sec.gov/Archives/edgar/data/1095130/000161577416007303/0001615774-16-007303.txt']

page = requests.get(list_of_10K_text_files[0])

parser = MyHTMLParser()

pos_Large_Acc_filer = (page.text).find('Large accelerated filer')
pos_Small_Reporting_Co = (page.text).find('Smaller reporting company')

# I would like to save the results of parser.feed to "text_for_file"
# as a string for further parsing
text_for_file = parser.feed(page.text[pos_Large_Acc_filer:(pos_Small_Reporting_Co+150)])

# Output Desired in the text_for_file variable
---------------------------------------------------------------------------
Encountered some data  :  b'Large accelerated filer\xc2\xa0\xc2\xa0'
Encountered some data  :  b'\xc3\xbe'
Encountered some data  :  b'\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0Accelerated filer\xc2\xa0\xc2\xa0'
Encountered some data  :  b'o'
Encountered some data  :  b'\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0Non-accelerated filer\xc2\xa0\xc2\xa0'
Encountered some data  :  b'o'
Encountered some data  :  b'\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0Smaller reporting company\xc2\xa0\xc2\xa0'
Encountered some data  :  b'o'

At the moment parser.feed returns None, but I need it to return the output shown above, in a format that lets me parse that text further.

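Edit: In case you are wondering why I am parsing .txt files, below is a sample of the text in one of them. It is clearly HTML, apart from the first 50 or so lines of header information (which I have not included).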

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
    <!-- Document created using Wdesk 1 -->
    <!-- Copyright 2016 Workiva -->
    <title>10-K</title>
</head>
    <body style="font-family:Times New Roman;font-size:10pt;">
        <a name="s5971963f20334f9f9b208ef25f6cc9cd"></a>
        <div style="line-height:120%;padding-top:2px;text-align:center;font-size:12pt;">
            <font style="font-family:inherit;font-size:12pt;font-weight:bold;">UNITED STATES</font>
        </div>
        <div style="line-height:120%;text-align:center;font-size:12pt;">
            <font style="font-family:inherit;font-size:12pt;font-weight:bold;">SECURITIES AND EXCHANGE COMMISSION</font>
        </div>
        <div style="line-height:120%;text-align:center;font-size:12pt;">
            <font style="font-family:inherit;font-size:12pt;font-weight:bold;">Washington,&#160;D.C. 20549</font>
        </div> 

Edit

The parser's source code can be found at the following link: html.parser Source Code

The feed function starts at line 158. feed calls self.goahead(0) and does not return its result; the goahead function starts at line 193.

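The function handle_data (its source starts at line 534) is sometimes called from goahead, but the base-class handle_data body is just pass. This seems odd, and it may be the culprit in my particular problem.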
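The behaviour described above can be checked without reading the source. The following is a minimal sketch, not the code from the question: the class name ReturningParser, the calls counter and the HTML fragment are all made up for illustration. It suggests that whatever handle_data returns is discarded, and that feed itself returns None.

```python
from html.parser import HTMLParser


class ReturningParser(HTMLParser):
    """Hypothetical parser whose handle_data returns a value, as in the question."""

    def __init__(self):
        super().__init__()
        self.calls = 0  # how many times the parser invoked handle_data

    def handle_data(self, data):
        self.calls += 1
        # The parser calls this method but ignores its return value.
        return ("Encountered some data  : ", data.encode("utf-8"))


parser = ReturningParser()
# Made-up fragment standing in for the slice of page.text.
result = parser.feed("<p>Large accelerated filer</p><p>o</p>")
parser.close()

print(result)            # None: feed has no return statement
print(parser.calls > 0)  # True: handle_data was still called
```

So the callbacks do run; their return values simply have nowhere to go.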

1 answer:

Answer 0 (score: 0)

First, I would like to thank @Jean-François Fabre for helping me explain and structure my question better, and for the work he has put into this problem so far.

It turns out that one solution to my problem (found here: @WillTownes-StackOverflow) is to redirect stdout to a file, like this:

import sys

# Note: this only captures output if handle_data print()s its data instead of returning it.
temp = sys.stdout                                                             # store the original stdout object for later
sys.stdout = open("Form_10K_Data.txt", "w+")                                  # redirect all prints to this log file
parser.feed(page.text[pos_Large_Acc_filer:(pos_Small_Reporting_Co+150)])      # nothing appears on screen; it is written to the log file instead
sys.stdout.close()                                                            # the redirected stdout is an ordinary file object
sys.stdout = temp                                                             # restore printing to the interactive prompt

with open("Form_10K_Data.txt") as f:
    filer_file = f.read().split('\n')[:-1]

However, this feels hacky. Is there a more Pythonic solution to this problem?
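One possible alternative, offered as a sketch rather than a definitive answer, is to let the parser accumulate the data on itself and read it back after feed returns. The class name CollectingParser, the chunks attribute and the sample fragment below are assumptions made for illustration. The fragment only imitates the slice of page.text from the question, so the example runs without a network request.

```python
from html.parser import HTMLParser


class CollectingParser(HTMLParser):
    """Hypothetical subclass that stores text nodes instead of printing them."""

    def __init__(self):
        # convert_charrefs=True turns references such as &#160; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []  # one entry per handle_data call

    def handle_data(self, data):
        self.chunks.append(data)


# Made-up fragment; in the question this would be the slice of page.text.
sample = (
    "<font>Large accelerated filer&#160;&#160;</font><font>&#254;</font>"
    "<font>Smaller reporting company</font><font>o</font>"
)

collector = CollectingParser()
collector.feed(sample)
collector.close()

text_for_file = collector.chunks  # a list of str, ready for further parsing
for chunk in text_for_file:
    print("Encountered some data  : ", chunk.encode("utf-8"))
```

Keeping the chunks as a list of str avoids the round trip through a file. Call "".join(text_for_file) if a single string is preferred.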