I want to scrape this site:
https://xueqiu.com/yaodewang
I want to scrape all of his articles. I used BeautifulSoup and Requests:
import requests
from bs4 import BeautifulSoup

url = 'https://xueqiu.com/yaodewang'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36'}
r = requests.get(url, headers=headers).content
soup = BeautifulSoup(r, 'lxml')
articles = soup.find_all('ul', {'class': 'status-list'})
print(articles)
But the result is empty! It returns:
[]
So I also tried some other selectors like these:
# art = soup.find_all('div', {'class': 'allStatuses no-head'})
# art = soup.find_all('div', {'class': 'status_bd'})
# art = soup.find_all('div', {'class': 'status_content container active tab-pane'})
with the same empty result.
I need your help, thank you very much!
Answer 0 (score: 1)
The data you want is not actually located inside an element with the status-list class. If you inspect the page source, you will find an empty container:
<div class="status_bd">
<div id="statusLists" class="allStatuses no-head"></div>
</div>
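A quick way to confirm this is to parse that static markup directly: the container BeautifulSoup sees in the raw HTML has no children at all, so any find_all() inside it comes back empty. A minimal sketch (using the built-in html.parser so no extra dependency is needed):

```python
from bs4 import BeautifulSoup

# The exact static markup from the page source: the statuses
# container exists but is empty before JavaScript runs.
html = '''
<div class="status_bd">
    <div id="statusLists" class="allStatuses no-head"></div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
container = soup.find('div', {'class': 'allStatuses'})

# find_all(True) returns every tag descendant; here there are none.
print(container.find_all(True))  # []
```

This is why every selector aimed at the rendered status list returns [].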
Instead, the statuses live inside a script element. You need to find that element, extract the desired object, load it from JSON into a Python dictionary, and pull out the information you want:
import json
import re

import requests
from bs4 import BeautifulSoup

url = 'https://xueqiu.com/yaodewang'
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36'
}

r = requests.get(url, headers=headers).content
soup = BeautifulSoup(r, 'lxml')

# The statuses are embedded in a script tag as a JavaScript assignment;
# capture the object literal that follows "SNB.data.statuses = ".
pattern = re.compile(r"SNB\.data\.statuses = ({.*?});", re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)

# Parse the captured object as JSON into a Python dictionary.
data = json.loads(pattern.search(script.text).group(1))
for item in data["statuses"]:
    print(item["description"])
Prints:
The best advice: Remember common courtesy and act toward others as you want them to act toward you.
Lighten up! It's the weekend. we're just having a little fun! Industrial Bank is expected to rise,next week...
...
点.点.点... 点到这个,学位、学历、成绩单翻译一下要50块、100块的...
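The script-extraction step can be exercised offline, with no network request, by running the same regex and json.loads() against a sample script body. The payload below is made up for illustration; only the `SNB.data.statuses = ...;` shape matches the real page:

```python
import json
import re

# Hypothetical script content in the same shape as the page's
# "SNB.data.statuses = {...};" assignment.
sample_script = '''
SNB.data.statuses = {"statuses": [
    {"description": "first status"},
    {"description": "second status"}
]};
'''

# Non-greedy match with DOTALL so the object literal can span lines;
# group(1) captures everything between the "= " and the closing "};".
pattern = re.compile(r"SNB\.data\.statuses = ({.*?});", re.MULTILINE | re.DOTALL)
data = json.loads(pattern.search(sample_script).group(1))

descriptions = [item["description"] for item in data["statuses"]]
print(descriptions)  # ['first status', 'second status']
```

Testing against a fixed string like this makes it easy to adjust the regex if the site changes its markup.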