Beautiful Soup doesn't return anything legible

Time: 2020-03-27 06:14:19

Tags: python web-scraping encoding beautifulsoup python-requests

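I am trying to scrape several web pages that should all share the same layout. Some of them work fine, but others return empty lists. Printing the page content returns various forms of gibberish.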

Initial setup code:

import csv
import requests
from bs4 import BeautifulSoup
import soupsieve

addresses = ["https://aWebsite.asp?id=1234"]

for url in addresses:
    page = requests.get(url)

But from here on, no matter which encoding or function I use, I get various kinds of gibberish.

When I try:

soup = BeautifulSoup(page.content)  # or: soup = BeautifulSoup(page.content, features="lxml")
cells = soup.select("tr p")
print(soup)
print(cells)

It prints:

<html><body><p>‹      ¼XñOܸþ™JïüTR½Y8½+…Mž(P•{TÔëUò&amp;</p></body></html>

[]
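One possible reading of this output, offered as an assumption rather than a confirmed diagnosis: the leading `‹` is how byte 0x8B renders in Windows-1252, and 0x1F 0x8B is the magic number that starts every gzip stream. `requests` normally decompresses gzip bodies on its own when the server declares `Content-Encoding: gzip`, so this would only apply if the header is missing or the body was compressed twice. The sketch below uses a hypothetical helper, `maybe_gunzip`, and made-up sample HTML in place of `page.content`.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def maybe_gunzip(raw: bytes) -> bytes:
    """Decompress raw if it starts with the gzip magic number, else return it unchanged."""
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw


# Made-up stand-in for page.content (not taken from the real site).
html = b"<table><tr><td><p>hello</p></td></tr></table>"
compressed = gzip.compress(html)

# Decoded as Windows-1252, the magic bytes become a control character plus U+2039 (the "‹" sign).
print(ascii(compressed[:2].decode("cp1252")))

print(maybe_gunzip(compressed) == html)  # compressed input is restored
print(maybe_gunzip(html) == html)        # plain input passes through untouched
```

If the guess holds, passing `maybe_gunzip(page.content)` to `BeautifulSoup` instead of `page.content` would be the place to try it; if the first two bytes are something else, the cause lies elsewhere.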

When I try:

soup = BeautifulSoup(page.content, features='html.parser')

cells = soup.select("tr p")
print(soup)
print(cells)

It prints: (I have cut the printed soup very short, but this is a sample)

5���������/�b뱻��־E�g;�jw��&amp;����Ͻ��a��`~�?]5-�
�����[+�j</em"�p��></m.ﻟ�s)1�3�s�c{uoũⶰ�v^goh�m���;h�></f�g���*></n�`�<7`�ly��x�#tb></ii�����t�>

[]
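The `�` characters in this output are U+FFFD replacement characters, which a decoder emits for bytes that are not valid in the encoding it assumed. That is consistent with binary data being decoded as text, although the post does not confirm what the bytes really are. A small sketch of the effect, using gzip-compressed sample data as a stand-in for a binary `page.content`:

```python
import gzip

# Made-up binary payload standing in for page.content.
raw = gzip.compress(b"<tr><p>cell</p></tr>")

# A strict decode refuses the data outright...
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ...while a lenient decode swaps every invalid byte for U+FFFD.
as_text = raw.decode("utf-8", errors="replace")

print(strict_ok)
print("\ufffd" in as_text)
```

Seeing replacement characters therefore says little about which text encoding to pick; it mostly signals that the bytes were not text in that encoding to begin with.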

When I try:

soup = BeautifulSoup(page.content).get_text().strip().encode("utf-8")

cells = soup.select("tr p")
print(soup)
print(cells)

It prints:

b'\xe2\x80\xb9      \xc2\xbc=\xc3\x9br\xc2\xb9r\xc3\x8fR\xc3\x95\xc3\xbe'

Traceback (most recent call last):
  File "warn_scrape.py", line 45, in <module>
    cells = soup.select("tr p")
AttributeError: 'bytes' object has no attribute 'select'
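This traceback comes from the chained calls rather than from the page itself: `.get_text()` returns a `str`, and `.encode("utf-8")` turns that into `bytes`, so the name `soup` no longer refers to a parsed document and has no `select` method. A minimal sketch that keeps the parsed tree and the extracted text in separate variables; the sample HTML is invented for illustration and `html.parser` is chosen only to avoid extra dependencies:

```python
from bs4 import BeautifulSoup

# Invented sample markup, standing in for a correctly decoded page.
sample_html = "<table><tr><td><p>first</p></td><td><p>second</p></td></tr></table>"

soup = BeautifulSoup(sample_html, features="html.parser")  # stays a BeautifulSoup object
page_text = soup.get_text().strip()                        # plain str, kept separately
cells = soup.select("tr p")                                # still works on the tree

print(type(soup).__name__)
print(page_text)
print(len(cells))
```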

When I try:

soup = BeautifulSoup(page.content)

cells = soup.select("tr p").get_text()
print(soup)
print(cells)

It prints:

<html><body><p>‹      ¼</p></body></html>

Traceback (most recent call last):
  File "warn_scrape.py", line 45, in <module>
    cells = soup.select("tr p").get_text()
AttributeError: 'list' object has no attribute 'get_text'
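This second traceback is also independent of the garbled content: `select` returns a list-like collection of tags, and `get_text` exists on each tag, not on the collection. A short sketch of calling it per element, again with invented sample HTML and `html.parser` as an arbitrary parser choice:

```python
from bs4 import BeautifulSoup

# Invented sample markup, standing in for a correctly decoded page.
sample_html = "<table><tr><td><p> first </p></td><td><p>second</p></td></tr></table>"

soup = BeautifulSoup(sample_html, features="html.parser")

# Call get_text on every matched tag instead of on the result collection.
cell_texts = [cell.get_text(strip=True) for cell in soup.select("tr p")]

print(cell_texts)
```

With the pages that currently yield gibberish, this would still produce an empty list until the content itself parses correctly.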

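I have tried several other approaches, as well as combinations of the above, including `.prettify` and `encode("ascii")`, and they all return empty lists and some form of garbled HTML.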

0 answers:

No answers yet