Scraping articles with Python 3.4, BeautifulSoup and Requests

Asked: 2016-05-01 02:05:52

Tags: python python-3.x web-scraping beautifulsoup python-requests

I want to scrape this website:

https://xueqiu.com/yaodewang

I want to scrape all of his articles. I used BeautifulSoup and Requests:

import requests
from bs4 import BeautifulSoup
url = 'https://xueqiu.com/yaodewang'
header = {'user-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36'}
r = requests.get(url,headers = header).content
soup = BeautifulSoup(r,'lxml')
artile = soup.find_all('ul',{'class':'status-list'})
print(artile)

But the result is empty! It returns:

 []

So I also tried some other selectors like these:

# art = soup.find_all('div',{'class':'allStatuses no-head'})
# art = soup.find_all('div',{'class':'status_bd'})
# art = soup.find_all('div',{'class':'status_content container active tab-pane'})

However, these return the wrong content. I want output like the article text shown in the screenshot.

I need your help, thank you very much!

1 Answer:

Answer 0 (score: 1)

The desired data is not actually located inside the element with the status-list class. If you inspect the page source, you will find only an empty container:

<div class="status_bd">
    <div id="statusLists" class="allStatuses no-head"></div>
</div>
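
You can confirm this with the soup object already built in the question's code; a minimal check (the id and class names are taken from the snippet above):

container = soup.find('div', id='statusLists')
print(container)           # prints the empty container shown above
print(container.contents)  # no article markup inside it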

Instead, the statuses are embedded in a script element, which you need to find; then extract the desired object, load it from JSON into a Python dictionary, and pull out the information you need:

import json
import re
import requests
from bs4 import BeautifulSoup

url = 'https://xueqiu.com/yaodewang'
headers = {
    'user-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36'
}
r = requests.get(url, headers=headers).content
soup = BeautifulSoup(r, 'lxml')

# Find the <script> element whose text contains the "SNB.data.statuses = {...};" assignment
pattern = re.compile(r"SNB\.data\.statuses = ({.*?});", re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)

# Extract the JSON object from the matched script, parse it, and print each status description
data = json.loads(pattern.search(script.text).group(1))
for item in data["statuses"]:
    print(item["description"])

It prints:

The best advice: Remember common courtesy and act toward others as you want them to act toward you.
Lighten up! It&#39;s the weekend. we&#39;re just having a little fun! Industrial Bank is expected to rise,next week...
...
点.点.点... 点到这个,学位、学历、成绩单翻译一下要50块、100块的...
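
Note that the description fields can contain HTML-escaped characters (the &#39; above). If you want plain text, here is a minimal follow-up sketch using the standard library html module (available since Python 3.4):

import html

for item in data["statuses"]:
    # unescape entities such as &#39; back to a plain apostrophe
    print(html.unescape(item["description"]))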