Trying to scrape email addresses

Time: 2019-07-30 08:53:30

Tags: python-3.x web-scraping scrapy

I am trying to scrape this website:

https://www.united-church.ca/search/locator/all?keyw=&mission_units_ucc_ministry_type_advanced=10&locll=

I managed to scrape most of it, but I cannot get the email addresses. Can you help me scrape them? I am using Scrapy.

# -*- coding: utf-8 -*-
import scrapy
from ..items import ChurchItem


class ChurchSpiderSpider(scrapy.Spider):
    name = 'church_spider'
    page_number = 1
    start_urls = ['https://www.united-church.ca/search/locator/all?keyw=&mission_units_ucc_ministry_type_advanced=10&locll=']

    def parse(self, response):
        container = response.css(".icon-ministry")
        for t in container:
            # Create a fresh item per church so yielded items do not share state.
            items = ChurchItem()
            church_name = t.css(".field-name-locator-ministry-title a::text").extract()
            church_phone = t.css(".field-name-field-phone::text").extract()
            church_address = t.css(".thoroughfare::text").extract()
            church_email = t.css(".field-name-field-mu-email span::text").extract()

            items["church_name"] = church_name
            items["church_phone"] = church_phone
            items["church_address"] = church_address
            items["church_email"] = church_email

            yield items

        # next_page = 'https://www.united-church.ca/search/locator/all?keyw=&mission_units_ucc_ministry_type_advanced=10&locll=&page=' + str(ChurchSpiderSpider.page_number)
        # if ChurchSpiderSpider.page_number <= 110:
        #     ChurchSpiderSpider.page_number += 1
        #     yield response.follow(next_page, callback=self.parse)

I found a partial solution, but it is still not complete. The output now looks like this:

{'church_address': ['7763 Highway 21'],
 'church_email': ['herbklaehn', ' [at] ', 'gmail.com'],
 'church_name': ['Allenford United Church'],
 'church_phone': ['519-35-6232']}

Can you help me replace [at] with @ and combine the parts into a single string?

1 answer:

Answer 0 (score: 0)

Join the list elements into one string, then replace the placeholder:

email = ''.join(church_email).replace(" [at] ","@")
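
For reference, here is a slightly more defensive sketch of the same idea, written as a standalone helper. The function name `join_email` is hypothetical (it is not part of Scrapy or the question's code), and it assumes the address never legitimately contains spaces, so it also tolerates variations in the whitespace around `[at]`.

```python
def join_email(parts):
    """Join the text fragments of an obfuscated address and restore the '@'.

    `parts` is the list returned by .extract(), for example
    ['herbklaehn', ' [at] ', 'gmail.com'].
    """
    # Concatenate the fragments into a single string.
    raw = "".join(parts)
    # Swap the "[at]" placeholder for "@", then remove leftover spaces
    # (assumption: a valid address here contains no spaces).
    return raw.replace("[at]", "@").replace(" ", "")


if __name__ == "__main__":
    print(join_email(["herbklaehn", " [at] ", "gmail.com"]))
```

Inside the spider, this could be used as `items["church_email"] = join_email(church_email)`. An empty list (a church with no listed email) simply yields an empty string.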