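I'm trying to scrape several web pages that should all share the same layout. Some of them work fine, but others return empty lists. Printing the page content for those gives various forms of gibberish.
Initial setup code: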
import csv
import requests
from bs4 import BeautifulSoup
import soupsieve
addresses = ["https://aWebsite.asp?id=1234"]
for url in addresses:
    page = requests.get(url)
From here on, though, no matter which encoding or parser I use, I get some kind of gibberish.
When I try:
soup = BeautifulSoup(page.content) #or soup = BeautifulSoup(page.content, features="lxml")
cells = soup.select("tr p")
print(soup)
print(cells)
It prints:
<html><body><p>‹ ¼XñOܸþ™JïüTR½Y8½+…Mž(P•{TÔëUò&</p></body></html>
[]
When I try:
soup = BeautifulSoup(page.content, features='html.parser')
cells = soup.select("tr p")
print(soup)
print(cells)
It prints: (I've cut the printed soup down a lot, but this is a sample)
5���������/�b뱻��־E�g;�jw��&����Ͻ��a��`~�?]5-�
�����[+�j</em"�p��></m.ﻟ�s)1�3�s�c{uoũⶰ�v^goh�m���;h�></f�g���*></n�`�<7`�ly��x�#tb></ii�����t�>
[]
When I try:
soup = BeautifulSoup(page.content).get_text().strip().encode("utf-8")
cells = soup.select("tr p")
print(soup)
print(cells)
It prints:
b'\xe2\x80\xb9 \xc2\xbc=\xc3\x9br\xc2\xb9r\xc3\x8fR\xc3\x95\xc3\xbe'
Traceback (most recent call last):
File "warn_scrape.py", line 45, in <module>
cells = soup.select("tr p")
AttributeError: 'bytes' object has no attribute 'select'
When I try:
soup = BeautifulSoup(page.content)
cells = soup.select("tr p").get_text()
print(soup)
print(cells)
It prints:
<html><body><p>‹ ¼</p></body></html>
Traceback (most recent call last):
File "warn_scrape.py", line 45, in <module>
cells = soup.select("tr p").get_text()
AttributeError: 'list' object has no attribute 'get_text'
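I've tried a number of other approaches and combinations of the above, including .prettify and encode("ascii"), and they all return empty lists and some form of garbled HTML.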
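One thing that may be worth checking (a guess from the output, not confirmed): the `‹` at the start of the printed soup corresponds to byte 0x8B in Windows-1252, which is the second byte of the gzip magic number 0x1F 0x8B. If so, the body might be gzip data that was never decompressed. The sketch below uses only the standard library and made-up sample HTML; `maybe_gunzip` is a hypothetical helper name, not part of requests or BeautifulSoup.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_gunzip(raw: bytes) -> bytes:
    """Return `raw` decompressed if it starts with the gzip magic bytes,
    otherwise return it unchanged."""
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw


# Offline demonstration with made-up HTML (not the real page).
sample_html = b"<table><tr><td><p>cell</p></td></tr></table>"
compressed = gzip.compress(sample_html)

print(compressed[:2])                            # b'\x1f\x8b'
print(maybe_gunzip(compressed) == sample_html)   # True
print(maybe_gunzip(sample_html) == sample_html)  # True
```

In the loop above, this would be used as `BeautifulSoup(maybe_gunzip(page.content), "html.parser")`. Printing `page.content[:2]` and `page.headers.get("Content-Encoding")` for a failing page should show whether this assumption holds for these pages.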