My Python process crashes while it is processing several pages from one website, on this line:
soup = BeautifulSoup(cleaned_html, "lxml")
Moreover, it is a different page every time. I am using Python 2.7, bs4 0.0.1 and lxml 3.6.0. Can you help me? Thanks in advance! My code:
def clean_html(self, html, document_format):
    """Clean and rearrange the HTML and return a BeautifulSoup object."""
    cleaned_html = html
    # Remove all unimportant tags, except for the ones used by Abbyy
    cleaned_html = self.remove_unimportant_tags_except_for_p_b_font_a(cleaned_html)
    # Replace "&nbsp;" with " "
    cleaned_html = self.replace_html_symbols(cleaned_html)
    # Remove extra spaces
    cleaned_html = self.remove_extra_space(cleaned_html)
    # Adjust the HTML for the files that come from Abbyy
    if document_format == 'abbyy':
        logger.info("Record is made by Abbyy")
        cleaned_html = self.adjust_abbyy_tags(cleaned_html)
    elif document_format == 'sec':
        logger.info("Record is a SEC document")
        cleaned_html = self.adjust_sec_tags(cleaned_html)
    # Remove the unimportant tags used by Abbyy
    cleaned_html = self.remove_p_b_font_a(cleaned_html)
    # Remove extra spaces
    cleaned_html = self.remove_extra_space(cleaned_html)
    logger.info("HTML is cleaned before making soup")
    # Make soup
    try:
        if document_format in ("abbyy", "sec"):
            soup = BeautifulSoup(cleaned_html, "html5lib")
        else:
            soup = BeautifulSoup(cleaned_html, "lxml")
    except Exception as e:
        logger.warning("Beautiful soup cannot be made out of this page: {}".format(str(e)))
        return None
    logger.info("Soup is made")
    # Remove script, style and other unwanted tag containers with their content
    for tag_name in ('script', 'style', 'del', 's', 'strike', 'base',
                     'basefont', 'noscript', 'applet', 'embed', 'object'):
        for tag in soup(tag_name):
            tag.extract()
    logger.info("Soup is cleaned")
    return soup
If I do not specify "lxml", I get the following warning instead:
C:\Users\EERMIL~1\AppData\Local\Temp\2\_MEI38~1\bs4\__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:
 BeautifulSoup([your markup])
to this:
 BeautifulSoup([your markup], "lxml")
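Just to confirm I understand the warning: pinning the parser explicitly makes the result reproducible across machines. A minimal sketch of what I mean (the markup here is just a made-up malformed snippet, not one of my real pages):

```python
from bs4 import BeautifulSoup

# Deliberately malformed markup; different parsers may repair it differently
markup = "<p>unclosed <b>bold"

# Pinning the parser ("html.parser" ships with Python) silences the warning
# and keeps the behaviour identical on every machine
soup = BeautifulSoup(markup, "html.parser")
print(soup.get_text())
```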
If I use "html5lib" instead of "lxml", the Python process does not crash, but then I cannot extract all of the text from the HTML page. Namely, I get the following error:
'NoneType' object has no attribute 'next_element'
when I execute the following code:
for child in soup.children:
    # Only look at real tags (skip NavigableStrings, comments, etc.)
    if str(type(child)) == "<class 'bs4.element.Tag'>":
        # If the name has strange symbols, skip the tag
        if re.search(r'[^a-z0-9]', child.name):
            continue
        # If there is no text inside, skip the tag
        try:
            if not re.search(r'(\w|\d)', child.get_text()):
                continue
        except Exception as e:
            logger.warning("Unexpected exception in getting text from tag {}: {}".format(str(child), str(e)))
            continue
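Since a hard crash (a segfault in lxml's C extension, which is what I suspect is happening) cannot be caught with try/except, one workaround I am considering is to parse each page in a child process, so that a crashing page only kills the child. This is just a sketch; run_isolated and PARSE_SNIPPET are my own made-up names, not part of the code above:

```python
import subprocess
import sys

def run_isolated(py_code, *args):
    """Run a snippet of Python in a separate child process.

    If the snippet segfaults (e.g. inside lxml's C extension), only
    the child dies; the parent just sees a non-zero return code.
    """
    proc = subprocess.Popen([sys.executable, "-c", py_code] + list(args),
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, _ = proc.communicate()
    return proc.returncode, out

# Hypothetical per-page parse job: the file name is passed as argv[1]
PARSE_SNIPPET = """
import sys
from bs4 import BeautifulSoup
with open(sys.argv[1], 'rb') as f:
    soup = BeautifulSoup(f.read(), 'lxml')
print(soup.get_text()[:200])
"""
```

A non-zero return code would then mean "log and skip this page" instead of the whole scraper dying. Is there a better way to track down why lxml crashes on these pages?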