Python does not find text in an HTML tag

Date: 2016-02-04 03:09:24

Tags: python beautifulsoup python-requests lxml

It looks like Python cannot find the text when the tag is marked display = none. What can I do to solve this?

Here is my code:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.domcop.com/domains/great-expired-domains/')
soup = BeautifulSoup(r.text, 'html.parser')
data = soup.find('div', {'id':'all-domains'})
data.text

The code returns []

I also tried XPath:

from lxml import etree

data = etree.HTML(r.text)
anchor = data.xpath('//div[@id="all-domains"]/text()')

It returns the same thing...

1 answer:

Answer 0: (score: 1)

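Yes, the id="all-domains" element is empty because its content is set dynamically by JavaScript executed in the browser. With requests, you only get the initial HTML page, without the "dynamic" part. To get all the domains, I simply iterate over the table rows and extract the text of each domain link. Working sample: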

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.domcop.com/domains/great-expired-domains/',
                 headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36"})

soup = BeautifulSoup(r.text, 'html.parser')
for domain in soup.select("tbody#domcop-table-body tr td a.domain-link"):
    print(domain.get_text())

Prints:

u2tourfans.com
tvadsview.com
gfanatic.com
blucigs.com
...
twply.com
sweethomeparis.com
vvchart.com
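
To see the difference between a hidden element and an empty one without hitting the live site, here is a minimal sketch that parses a small inline HTML string. The markup in SAMPLE_HTML, the "hidden-note" id, and the example-one.com / example-two.com names are made up for illustration; only the "all-domains" id and the table selector come from the code above. It assumes the raw page looks roughly like this, which may not match the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the initial HTML that requests receives.
# It is NOT copied from the real page; it only mimics the relevant structure.
SAMPLE_HTML = """
<div id="all-domains" style="display: none"></div>
<div id="hidden-note" style="display: none">still parsed</div>
<table>
  <tbody id="domcop-table-body">
    <tr><td><a class="domain-link" href="#">example-one.com</a></td></tr>
    <tr><td><a class="domain-link" href="#">example-two.com</a></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The parser does not apply CSS, so text inside a display: none element
# is returned like any other text.
hidden_text = soup.find("div", {"id": "hidden-note"}).get_text()

# An element that JavaScript would fill later has no text in the raw HTML.
empty_text = soup.find("div", {"id": "all-domains"}).get_text()

# The same selector as in the working sample, applied to the static rows.
domains = [a.get_text()
           for a in soup.select("tbody#domcop-table-body tr td a.domain-link")]

print(repr(hidden_text))  # 'still parsed'
print(repr(empty_text))   # ''
print(domains)            # ['example-one.com', 'example-two.com']
```

So display: none by itself does not hide anything from BeautifulSoup. The lookup comes back empty only when the text is not present in the HTML that the server returned.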