Beautiful Soup crashes the Python process

Time: 2017-03-01 13:06:39

Tags: python beautifulsoup

My Python process, which works through multiple pages of a website, crashes on this line:

soup = BeautifulSoup(cleaned_html, "lxml")

Moreover, it is a different page every time.

I am using Python 2.7, bs4 0.0.1 and lxml 3.6.0.
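
For reference, the versions can be confirmed from inside Python itself (a quick sketch; note that the "bs4" distribution on PyPI is only a thin dummy wrapper around beautifulsoup4, which is presumably why pip lists it as 0.0.1):

    import bs4
    import lxml.etree

    # Beautiful Soup's real version lives on the bs4 module, not on the
    # dummy "bs4" PyPI package.
    print(bs4.__version__)

    # lxml exposes its version both as a string and as a tuple
    print(lxml.etree.__version__)
    print(lxml.etree.LXML_VERSION)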

Can you help me? Thanks in advance!

My code:

 def clean_html(self, html, document_format):

        """ This function cleans and rearranges HTML and retearnes beautiful soup """

        cleaned_html = html

        # Remove all unimportant tags, except for the ones used by Abbyy
        cleaned_html = self.remove_unimportant_tags_except_for_p_b_font_a(cleaned_html)

        # Replace "nbsp;" with " "
        cleaned_html = self.replace_html_symbols(cleaned_html)

        # Remove extra spaces
        cleaned_html = self.remove_extra_space(cleaned_html)

        # Adjust html for the files from Abbyy
        if document_format == 'abbyy':
            logger.info("Record is made by Abbyy")
            cleaned_html = self.adjust_abbyy_tags(cleaned_html)
        elif document_format == 'sec':
            logger.info("Record is a SEC document")
            cleaned_html = self.adjust_sec_tags(cleaned_html)

        # Remove the unimportant tags used by Abbyy
        cleaned_html = self.remove_p_b_font_a(cleaned_html)

        # Remove extra spaces
        cleaned_html = self.remove_extra_space(cleaned_html)

        logger.info("HTML is cleaned before making soup")

        # Make soup
        try:
            if document_format in ("abbyy", "sec"):
                soup = BeautifulSoup(cleaned_html, "html5lib")
            else:
                soup = BeautifulSoup(cleaned_html, "lxml")
        except Exception as e:
            logger.warning("Beautiful soup cannot be made out of this page: {}".format(str(e)))
            return None  

        logger.info("Soup is made") 

        # Remove tags (with their content) that should not contribute any text:
        # scripts, styles, strikethrough and embedded/external objects
        for tag_name in ('script', 'style', 'del', 's', 'strike', 'base',
                         'basefont', 'noscript', 'applet', 'embed', 'object'):
            for s in soup(tag_name):
                s.extract()

        logger.info("Soup is cleaned") 

        return soup
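
Regarding the parser switch inside clean_html (html5lib for Abbyy and SEC documents, lxml for everything else): the two parsers repair broken markup differently, which is easy to see on a small standalone comparison (a toy snippet, not taken from my actual pages):

    from bs4 import BeautifulSoup

    # A deliberately malformed snippet: lxml and html5lib fix up invalid
    # markup in different ways, so the resulting trees (and get_text())
    # can differ for the same input.
    broken = "<p>one<td>two</p>three"

    print(BeautifulSoup(broken, "lxml").get_text())
    print(BeautifulSoup(broken, "html5lib").get_text())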

If I do not specify "lxml", I get the following warning:

C:\Users\EERMIL~1\AppData\Local\Temp\2\_MEI38~1\bs4\__init__.py:166: UserWarning:
No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

    BeautifulSoup([your markup])

to this:

    BeautifulSoup([your markup], "lxml")

If I use "html5lib" instead of "lxml", the Python process does not crash, but I cannot get all of the text out of the HTML pages. That is, I get the following error

'NoneType' object has no attribute 'next_element'

when I execute the following code:

    for child in soup.children:

        # Only process real Tag elements (plain string nodes are skipped)
        if str(type(child)) == "<class 'bs4.element.Tag'>":
            # If the name has strange symbols, skip it
            if re.search('[^a-z0-9]', child.name):
                continue
            # If there is no text inside, skip it
            try:
                if not re.search('(\w|\d)', child.get_text()):
                    continue
            except Exception as e:
                logger.warning("Unexpected exception in getting text from tag {}: {}".format(str(child), str(e)))
                continue
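
As an aside, the type check on the children can also be written with isinstance instead of comparing the repr of the type; a small self-contained sketch with toy markup (not my real pages) showing the same filter:

    from bs4 import BeautifulSoup
    from bs4.element import Tag

    soup = BeautifulSoup("<html><body><p>hi</p>plain text</body></html>", "html5lib")

    for child in soup.body.children:
        # Keep only real Tag elements; NavigableString children are skipped,
        # which is also what the str(type(child)) comparison above does.
        if not isinstance(child, Tag):
            continue
        print("{}: {}".format(child.name, child.get_text()))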

0 Answers