How to collect data with BeautifulSoup in Python

Time: 2015-02-23 17:35:25

Tags: python beautifulsoup

I'm trying to collect data with BeautifulSoup in Python. It collects every field except the email. How can I collect the email as well?

def scrapeProfileData(profilePageSource):
    time.sleep(6)
    try:
        personName = str(profilePageSource.find("title").get_text().encode("utf-8"))[2:-1]
    except:
        personName =""


    try:
        industry = str(profilePageSource.find("dd", class_="industry").get_text().encode("utf-8"))[2:-1]
    except:
        industry = ""
    try:
        location = str(profilePageSource.find("span", class_="locality").get_text().encode("utf-8"))[2:-1]
    except:
        location = ""
    try:
        title = str(profilePageSource.find("p", class_="title").get_text().encode("utf-8"))[2:-1]
    except:
        title = ""
    try:
        email = str(profilePageSource.find("@", class_="contact-field").get_text().encode("utf-8"))[2:-1]
    except:
        email = ""
        pass

Here is the HTML I'm trying to collect the data from:

<dd class="industry"><a href="/vsearch/p?f_I=43&amp;trk=prof-0-ovw-industry" name="industry" title="Find other members in this industry">Financial Services</a></dd>

<span class="locality"><a href="/vsearch/p?f_G=gb%3A4573&amp;trk=prof-0-ovw-location" name='location' title="Find other members in London, Greater London, United Kingdom">London, Greater London, United Kingdom</a></span>

<p class="title">&#x2714;&#x2714;Sales &amp; Business Development Mobile Payments, Telecoms, Cloud&#x2714;&#x2714;</p>

<table summary="Online Contact Info"><tr><th>Email</th><td><div id="email"><div id="email-view"><ul><li><a href="mailto:username@domain.com">username@domain.com</a></li></ul></div>

I want to collect the email as well, but I need some suggestions on how to do it.

Thanks

1 Answer:

Answer 0 (score: 1)

You can access the email element with the following CSS selector:

div#email-view a[href]

And, in Python code:

email = profilePageSource.select("div#email-view a[href]")[0].get_text()
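
Note that indexing with [0] raises IndexError when the selector matches nothing, for example on a profile that does not show an email. A minimal sketch of a safer variant, assuming bs4 4.4.0 or later (for select_one); the extract_email helper name and the SAMPLE_HTML constant are made up for illustration, and the sample markup is the fragment from the question with its closing tags added:

```python
from bs4 import BeautifulSoup

# Fragment from the question; closing tags added so it stands alone.
SAMPLE_HTML = """
<table summary="Online Contact Info"><tr><th>Email</th><td>
<div id="email"><div id="email-view"><ul>
<li><a href="mailto:username@domain.com">username@domain.com</a></li>
</ul></div></div>
</td></tr></table>
"""


def extract_email(soup):
    """Hypothetical helper: return the email text, or "" if it is absent."""
    # select_one returns the first match, or None when nothing matches.
    link = soup.select_one("div#email-view a[href]")
    return link.get_text(strip=True) if link is not None else ""


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(extract_email(soup))  # username@domain.com
```

Returning an empty string mirrors the question's code, which falls back to "" for every missing field.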

Or, without a CSS selector, using find():

email = profilePageSource.find("div", id="email-view").a.get_text()
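
The find() form raises AttributeError when the div is missing, because find() returns None in that case. A sketch of a guarded version follows; the helper name extract_email_with_find is hypothetical, and reading the address from the mailto: href instead of the link text is an assumption about what is more robust, not something the page requires:

```python
from bs4 import BeautifulSoup


def extract_email_with_find(soup):
    """Hypothetical helper: find the email link without CSS selectors."""
    container = soup.find("div", id="email-view")
    if container is None:
        return ""
    # href=True matches only <a> tags that actually carry an href attribute.
    link = container.find("a", href=True)
    if link is None:
        return ""
    href = link["href"]
    if href.startswith("mailto:"):
        # Drop the scheme and any "?subject=..." query part.
        return href[len("mailto:"):].split("?", 1)[0]
    # Fall back to the visible link text.
    return link.get_text(strip=True)


html = '<div id="email-view"><ul><li><a href="mailto:username@domain.com">username@domain.com</a></li></ul></div>'
print(extract_email_with_find(BeautifulSoup(html, "html.parser")))  # username@domain.com
```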