Isolate 'a' tags by class using Beautiful Soup

Time: 2017-03-08 15:20:10

Tags: python html beautifulsoup

I want to write the URL links from this page to a file, but each row of the table contains two 'td a' tags. I only want the one with class="pagelink" href="/search".

I tried the code below, hoping to select only the tags with "class": "pagelink", but it raises an error:


AttributeError: 'Doctype' object has no attribute 'find_all'

Can anyone help?

import requests
from bs4 import BeautifulSoup as soup
import csv

writer.writerow(['URL', 'Reference', 'Description', 'Address'])

url = "https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&PAGE=0&DISPLAY_COUNT=1000&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=city+of+edinburgh&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY#results"

response = session.get(url)                 #not used until after the iteration begins
html = soup(response.text, 'lxml')

for link in html:
    prop_link = link.find_all("td a", {"class":"pagelink"})

    writer.writerow([prop_link])
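For context, the AttributeError above can be reproduced in isolation: iterating over a BeautifulSoup object yields its top-level children, and the first child of a document with a doctype declaration is a Doctype node, which has no find_all method. A minimal sketch with stand-in HTML (not the live SAA page):

```python
from bs4 import BeautifulSoup

# Stand-in document; iterating the soup yields its top-level children,
# and the first child here is the Doctype node.
doc = BeautifulSoup("<!DOCTYPE html><html><body></body></html>", "html.parser")
first_child = next(iter(doc))
print(type(first_child).__name__)   # Doctype

try:
    first_child.find_all("a")
except AttributeError as e:
    print(e)   # same error as in the question
```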

2 answers:

Answer 0 (score: 3)

Your html variable contains a Doctype object, which is not what you can iterate over. You need to call find_all or select on that object to find the nodes you want.

Example:

import requests
from bs4 import BeautifulSoup as soup
import csv

outputfilename = 'Ed_Streets2.csv'

#inputfilename = 'Edinburgh.txt'

baseurl = 'https://www.saa.gov.uk'

outputfile = open(outputfilename, 'w', newline='')   # text mode with newline='' for csv in Python 3
writer = csv.writer(outputfile)
writer.writerow(['URL', 'Reference', 'Description', 'Address'])

session = requests.session()

url = "https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&PAGE=0&DISPLAY_COUNT=100&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=city+of+edinburgh&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY#results"

response = session.get(url)
html = soup(response.text, 'lxml')

# find_all is called on the BeautifulSoup object itself, not on its children
prop_link = html.find_all("a", class_="pagelink button small")

for link in prop_link:
    prop_url = baseurl + link["href"]
    print(prop_url)
    writer.writerow([prop_url, "", "", ""])

outputfile.close()
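One caveat: find_all with a multi-class string like "pagelink button small" only matches when the class attribute is exactly that string, in that order. When only one of the classes matters, a CSS selector via select is more robust. A minimal sketch with stand-in markup (not the live SAA page, whose structure is assumed here):

```python
from bs4 import BeautifulSoup

# Stand-in table markup mimicking the two-links-per-row structure
html_doc = """
<table>
  <tr>
    <td><a class="pagelink button small" href="/search/prop1">Prop 1</a></td>
    <td><a class="maplink" href="/map/prop1">Map</a></td>
  </tr>
</table>
"""
page = BeautifulSoup(html_doc, "html.parser")

# Matches any <a> inside a <td> whose class list contains "pagelink"
links = page.select("td a.pagelink")
print([a["href"] for a in links])   # ['/search/prop1']
```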

Answer 1 (score: 0)

Try this.
You need to find the links before starting the loop.

import requests
from bs4 import BeautifulSoup as soup
import csv

# The original snippet used writer without defining it; 'links.csv' is a placeholder filename
outputfile = open('links.csv', 'w', newline='')
writer = csv.writer(outputfile)
writer.writerow(['URL', 'Reference', 'Description', 'Address'])

url = "https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&PAGE=0&DISPLAY_COUNT=1000&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=city+of+edinburgh&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY#results"

response = requests.get(url)
html = soup(response.text, 'lxml')

prop_link = html.find_all("a", {"class":"pagelink button small"})

for link in prop_link:
    if link is not None and link.has_attr("href"):
        wr = link["href"]
        writer.writerow([wr])
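Answer 0 builds absolute URLs by string concatenation with a base URL; urllib.parse.urljoin handles both absolute and relative hrefs and avoids doubled or missing slashes. A small sketch (the hrefs here are hypothetical):

```python
from urllib.parse import urljoin

base = "https://www.saa.gov.uk"
hrefs = ["/search/prop1", "search/prop2"]   # hypothetical relative links
full_urls = [urljoin(base, h) for h in hrefs]
print(full_urls)
# ['https://www.saa.gov.uk/search/prop1', 'https://www.saa.gov.uk/search/prop2']
```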