Unable to parse web page content with Beautiful Soup

Asked: 2017-08-21 19:50:12

Tags: python-3.x web-scraping beautifulsoup

I have been using Beautiful Soup to parse web pages for data extraction, and so far it has worked very well for me on other pages. However, when I try to count the <a> tags on this page,

from bs4 import BeautifulSoup
import requests
catsection = "cricket"
url_base = "http://www.dnaindia.com/"
i = 89

url = url_base + catsection + "?page=" + str(i)
print(url)

#This is the page I'm trying to parse and also the one in the hyperlink
#I get the correct url i'm looking for at this stage

r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

# Count every <a> tag found on the page
j = 0
for num in soup.find_all('a'):
    j += 1
print(j)

I get 0 as the output. This makes me think the two lines after r = requests.get(url) may not be working (there is obviously no chance the page contains no <a> tags at all), and I am not sure what alternative I can use here. Has anyone found a solution or run into a similar problem before? Thanks in advance.

2 Answers:

Answer 0 (score: 1)

You need to pass some information along with the request to the server. The following code should work... you can also play around with the other header parameters.

from bs4 import BeautifulSoup
import requests
catsection = "cricket"
url_base = "http://www.dnaindia.com/"
i = 89

url = url_base + catsection + "?page=" + str(i)
print(url)

# A browser-like User-agent so the server does not reject the request
headers = {
    'User-agent': 'Mozilla/5.0'
}

#This is the page I'm trying to parse and also the one in the hyperlink
#I get the correct url i'm looking for at this stage

r = requests.get(url, headers=headers)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

j = 0
for num in soup.find_all('a'):
    j += 1
print(j)

Answer 1 (score: 0)

Feed any URL into the parser and check the <a> tags available on that page:

from bs4 import BeautifulSoup
import requests

url_base = "http://www.dnaindia.com/cricket?page=1"
# Send a custom User-agent string so the server serves the real page
res = requests.get(url_base, headers={'User-agent': 'Existed'})
soup = BeautifulSoup(res.text, 'html.parser')
a_tag = soup.select('a')
print(len(a_tag))