Isolate 'a' tags by class using Beautiful Soup

Time: 2017-03-08 15:20:10

Tags: python html beautifulsoup

I want to write the URL links from this page to a file, but each row of the table contains two 'td a' tags. I only want the one with class="pagelink" href="/search".

I tried the code below, hoping to select only the tags with "class": "pagelink", but it raises an error:


AttributeError: 'Doctype' object has no attribute 'find_all'

Can anyone help?

import requests
from bs4 import BeautifulSoup as soup
import csv

writer.writerow(['URL', 'Reference', 'Description', 'Address'])

url = "https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&PAGE=0&DISPLAY_COUNT=1000&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=city+of+edinburgh&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY#results"

response = session.get(url)                 #not used until after the iteration begins
html = soup(response.text, 'lxml')

for link in html:
    prop_link = link.find_all("td a", {"class":"pagelink"})

    writer.writerow([prop_link])
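For context, the AttributeError above can be reproduced in isolation: iterating over a BeautifulSoup object yields its top-level children, and the first child of a document with a doctype declaration is a Doctype node, which has no find_all method. A minimal sketch with stand-in HTML (not the live SAA page):

```python
from bs4 import BeautifulSoup

# Stand-in document; iterating the soup yields its top-level children,
# and the first child here is the Doctype node.
doc = BeautifulSoup("<!DOCTYPE html><html><body></body></html>", "html.parser")
first_child = next(iter(doc))
print(type(first_child).__name__)   # Doctype

try:
    first_child.find_all("a")
except AttributeError as e:
    print(e)   # same error as in the question
```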

2 answers:

Answer 0 (score: 3)

Your html variable contains a Doctype object, which is not what you can iterate over. You need to call find_all or select on that object to find the nodes you want.

Example:

import requests
from bs4 import BeautifulSoup as soup
import csv

outputfilename = 'Ed_Streets2.csv'

#inputfilename = 'Edinburgh.txt'

baseurl = 'https://www.saa.gov.uk'

outputfile = open(outputfilename, 'w', newline='')   # text mode with newline='' for csv in Python 3
writer = csv.writer(outputfile)
writer.writerow(['URL', 'Reference', 'Description', 'Address'])

session = requests.session()

url = "https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&PAGE=0&DISPLAY_COUNT=100&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=city+of+edinburgh&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY#results"

response = session.get(url)
html = soup(response.text, 'lxml')

# find_all is called on the BeautifulSoup object itself, not on its children
prop_link = html.find_all("a", class_="pagelink button small")

for link in prop_link:
    prop_url = baseurl + link["href"]
    print(prop_url)
    writer.writerow([prop_url, "", "", ""])

outputfile.close()
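One caveat: find_all with a multi-class string like "pagelink button small" only matches when the class attribute is exactly that string, in that order. When only one of the classes matters, a CSS selector via select is more robust. A minimal sketch with stand-in markup (not the live SAA page, whose structure is assumed here):

```python
from bs4 import BeautifulSoup

# Stand-in table markup mimicking the two-links-per-row structure
html_doc = """
<table>
  <tr>
    <td><a class="pagelink button small" href="/search/prop1">Prop 1</a></td>
    <td><a class="maplink" href="/map/prop1">Map</a></td>
  </tr>
</table>
"""
page = BeautifulSoup(html_doc, "html.parser")

# Matches any <a> inside a <td> whose class list contains "pagelink"
links = page.select("td a.pagelink")
print([a["href"] for a in links])   # ['/search/prop1']
```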

Answer 1 (score: 0)

Try this.
You need to find the links before starting the loop.

import requests
from bs4 import BeautifulSoup as soup
import csv

# The original snippet used writer without defining it; 'links.csv' is a placeholder filename
outputfile = open('links.csv', 'w', newline='')
writer = csv.writer(outputfile)
writer.writerow(['URL', 'Reference', 'Description', 'Address'])

url = "https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&PAGE=0&DISPLAY_COUNT=1000&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=city+of+edinburgh&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY#results"

response = requests.get(url)
html = soup(response.text, 'lxml')

prop_link = html.find_all("a", {"class":"pagelink button small"})

for link in prop_link:
    if link is not None and link.has_attr("href"):
        wr = link["href"]
        writer.writerow([wr])
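Answer 0 builds absolute URLs by string concatenation with a base URL; urllib.parse.urljoin handles both absolute and relative hrefs and avoids doubled or missing slashes. A small sketch (the hrefs here are hypothetical):

```python
from urllib.parse import urljoin

base = "https://www.saa.gov.uk"
hrefs = ["/search/prop1", "search/prop2"]   # hypothetical relative links
full_urls = [urljoin(base, h) for h in hrefs]
print(full_urls)
# ['https://www.saa.gov.uk/search/prop1', 'https://www.saa.gov.uk/search/prop2']
```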