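I'm trying to build a Python crawler with the requests library. When I use the get method, the result I retrieve looks like this: THá» THAO. But when I use curl, I get THỂ THAO, which is the result I expect. Here is my code: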
def get_raw_channel():
    r = requests.get('http://vtv.vn/')
    raw_html = r.text
    soup = BeautifulSoup(raw_html)
    o_tags = soup.find_all("option")
    for o_tag in o_tags:
        print o_tag.text
        # raw_channel = RawChannel(o_tag.text.strip(), o_tag['value'])
        # channels_file.write(raw_channel.__str__() + '\n')
This is my curl command: curl http://vtv.vn/
Question: Why do the results differ? How can I get the same result with requests that I get with curl?
Answer 0 (score: 1)
I tried your code, and in my case the encoding was 'ISO-8859-1'. Try encoding your data as UTF-8 before processing it in BeautifulSoup, for example:
...
raw_html = r.text.encode("utf-8")
soup = BeautifulSoup(raw_html)
...
Update: I ran some tests, and everything works for me once I explicitly set the encoding on the response. Take a look:
In [1]: import requests
In [2]: from BeautifulSoup import BeautifulSoup
In [3]: r = requests.get('http://vtv.vn/')
In [4]: r.encoding = "utf-8"
In [5]: raw_html = r.text
In [6]: soup = BeautifulSoup(raw_html)
In [7]: soup.findAll("option")
Out[7]:
[<option value="1">
VTV1</option>,
... stripped out some output ...
VTVCab3 - Thể thao TV</option>,
<option value="13">
... stripped out some output ...
]
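
To illustrate why setting the encoding matters, here is a minimal offline sketch of what is probably happening. It assumes Python 3 and assumes the server sends UTF-8 bytes that get decoded as ISO-8859-1. It does not contact vtv.vn, and the variable names are only illustrative.

```python
# Assumption: the page is served as UTF-8, but the body is decoded as
# ISO-8859-1 because no charset was detected from the response headers.
expected = "THỂ THAO"

# The bytes a UTF-8 server would send for this text.
raw_bytes = expected.encode("utf-8")

# Decoding those bytes with the wrong codec produces mojibake like the
# output in the question (the third byte becomes an invisible control character).
garbled = raw_bytes.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 recovers the original text, which is
# what setting r.encoding = "utf-8" before reading r.text achieves.
fixed = raw_bytes.decode("utf-8")

# Already-garbled text can be repaired by reversing the wrong decode first.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(fixed)
print(repaired)
```

If this is indeed the cause, another option may be to pass the raw bytes in r.content to BeautifulSoup and let it detect the encoding, or to set r.encoding = r.apparent_encoding before reading r.text. Note that calling r.text.encode("utf-8") on text that was already decoded with the wrong codec would not undo the garbling on its own.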