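Using BeautifulSoup

Time: 2016-10-06 11:13:25

Tags: python-2.7 web-scraping beautifulsoup html-parsing

I want to fetch the contents of a table every time it is updated, using BeautifulSoup. Why doesn't this code work? It returns no output, or sometimes throws an exception.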

from bs4 import BeautifulSoup
import urllib2
url =    "http://tenders.ongc.co.in/wps/portal/!ut/p/b1/04_Sj9CPykssy0xPLMnMz0vMAfGjzOINLc3MPB1NDLwsPJ1MDTzNPcxMDYJCjA0MzIAKIoEKDHAARwNC-sP1o8BK8Jjg55Gfm6pfkBthoOuoqAgArsFI6g!!/pw/Z7_1966IA40J8IB50I7H650RT30D2/ren/m=view/s=normal/p=struts.portlet.action=QCPtenderHomeQCPlatestTenderListAction/p=struts.portlet.mode=view/=/#Z7_1966IA40J8IB50I7H650RT30D2"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)
divcontent = soup.find('div', {"id":"latestTrPagging", "class":"content2"})
table = soup.find_all('table')
rows = table.findAll('tr', {"class":"even", "class": "odd"})
for row in rows:
    cols = row.findAll('td', {"class":"tno"})
for td in cols:
    print td.text(text=True)

The URL is https://tenders.ongc.co.in/wps/portal/!ut/p/b1/04_Sj9CPykssy0xPLMnMz0vMAfGjzOINLc3MPB1NDLwsPJ1MDTzNPcxMDYJCjA0MzIAKIoEKDHAARwNC-sP1o8BK8Jjg55Gfm6pfkBthoOuoqAgArsFI6g!!/pw/Z7_1966IA40J8IB50I7H650RT30D2/ren/m=view/s=normal/p=struts.portlet.action=QCPtenderHomeQCPlatestTenderListAction/p=struts.portlet.mode=view/=/#Z7_1966IA40J8IB50I7H650RT30D2. I just want to get the table portion and be notified when a new tender is posted.

1 Answer:

Answer 0 (score: 0)

This worked for me: use requests instead of urllib2, set a User-Agent header, and adjust some of the locators:

from bs4 import BeautifulSoup
import requests


url = "https://tenders.ongc.co.in/wps/portal/!ut/p/b1/04_Sj9CPykssy0xPLMnMz0vMAfGjzOINLc3MPB1NDLwsPJ1MDTzNPcxMDYJCjA0MzIAKIoEKDHAARwNC-sP1o8BK8Jjg55Gfm6pfkBthoOuoqAgArsFI6g!!/pw/Z7_1966IA40J8IB50I7H650RT30D2/ren/m=view/s=normal/p=struts.portlet.action=QCPtenderHomeQCPlatestTenderListAction/p=struts.portlet.mode=view/=/#Z7_1966IA40J8IB50I7H650RT30D2"
page = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36"})
soup = BeautifulSoup(page.content, "html.parser")

divcontent = soup.find('div', {"id": "latestTrPagging", "class": "content2"})
table = soup.find('table')
rows = table.find_all('tr', {"class": ["even", "odd"]})

for row in rows:
    cols = row.find_all('td', {"class": "tno"})
    for td in cols:
        print(td.get_text())

This prints the top 10 tender numbers:

LC1MC16044[NIT]
LC1MC16043[NIT]
LC1MC16045[NIT]
EY1VC16028[NIT]
RC2SC16050(E -tender)[NIT]
RC2SC16048(E -tender)[NIT]
RC2SC16049(E -tender)[NIT]
UI1MC16002[NIT]
V16RC16015[E-Gas]
K16AC16002[E-Procurement]
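
The question also asks about being notified when a new tender appears. One possible approach, sketched here under assumptions, is to store the tender numbers seen so far and compare each new scrape against them. The function name `find_new_tenders` and the JSON state file are hypothetical choices for illustration, not part of the original answer, and the actual notification (email, log line, and so on) is left to the caller.

```python
import json
from pathlib import Path


def find_new_tenders(current, state_file):
    """Return tender numbers not seen in earlier runs, then record them.

    `current` is the list of tender numbers from the latest scrape.
    `state_file` is a hypothetical path to a JSON file holding the
    numbers seen so far.
    """
    path = Path(state_file)
    # Load the previously seen numbers, or start empty on the first run.
    seen = set(json.loads(path.read_text())) if path.exists() else set()

    new = []
    for number in current:
        if number not in seen:
            seen.add(number)
            new.append(number)

    # Persist the updated set so the next run only reports later additions.
    path.write_text(json.dumps(sorted(seen)))
    return new
```

Calling this on a schedule (for example from cron) with the list of scraped numbers returns only the tenders that were not present on earlier runs. The very first run reports every number as new.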

Note how multiple classes ("even" and "odd") are handled: they are passed as a list under a single "class" key.
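
To show why the list form matters, here is a small sketch run against made-up markup (the HTML below is a hypothetical stand-in, not the real ONGC page). A Python dict literal with a repeated key keeps only the last value, so `{"class": "even", "class": "odd"}` from the question is really `{"class": "odd"}` and would match only the odd rows.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<table>
  <tr class="header"><td class="tno">Tender No</td></tr>
  <tr class="even"><td class="tno">A1</td></tr>
  <tr class="odd"><td class="tno">B2</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# The repeated "class" key collapses to its last value.
duplicate_key_filter = {"class": "even", "class": "odd"}

# Matches rows whose class is "odd" only.
only_odd = soup.find_all("tr", duplicate_key_filter)

# A list under one key matches rows carrying any of the listed classes.
both = soup.find_all("tr", {"class": ["even", "odd"]})


def tender_numbers(rows):
    """Collect the text of every td.tno cell in the given rows."""
    return [
        td.get_text(strip=True)
        for row in rows
        for td in row.find_all("td", {"class": "tno"})
    ]


print(tender_numbers(only_odd))
print(tender_numbers(both))
```

Two other details in the question's code would also raise errors: `soup.find_all('table')` returns a list-like result set rather than a single tag, so it has no `findAll` method of its own, and `td.text` is a string property rather than a callable, which is why the answer uses `soup.find('table')` and `td.get_text()`.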