BeautifulSoup: filtering data from list elements on an HTML page

Date: 2015-11-17 02:59:47

Tags: python html beautifulsoup

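I am trying to collect data from a number of HTML pages, specifically the data in list elements. I want to add this data to a dictionary for later use. The extraction works as I expect, but entering the data into the dict does not: I am currently overwriting each entry rather than adding a new one. Can anyone point out where I am going wrong?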

Current code

from BeautifulSoup import BeautifulSoup
import requests
import re

person_dict = {}

.....
<snip>
<snip>
.....

soup = BeautifulSoup(response.text)

    div = soup.find('div', {'id': 'object-a'})
    ul = div.find('ul', {'id': 'object-a-1'})
    li_a = ul.findAll('a', {'class': 'title'})
    li_p = ul.findAll('p', {'class': 'url word'})
    li_po = ul.findAll('p')

    for a in li_a:
        nametemp = a.text
        name = (nametemp.split(' - ')[0])
        person_dict.update({'Name': name})     #I attempted updating
    for lip in li_p:
        person_dict['url'] = lip.text          #I attempted adding directly

    for email in li_po:   
        reg_emails = re.compile('[a-zA-Z0-9.]*' + '@')        
        person_dict['email'] = reg_emails.findall(email.text)

print person_dict # results in 1 entry being returned

Test data

<div id="object-a">
    <ul id="object-a-1">
            <li>
              <a href="www.url.com/person" class="title">Person1</a>
              <p class="url word">www.url.com/Person1</p>
              <p>Person 1, some foobar possibly an email@address.com &nbsp;...</p>
            </li>


            <li>
              <a href="www.url.com/person" class="title">Person2</a>
              <p class="url word">www.url.com/Person1</p>
              <p>Person 2, some foobar possibly an email@address.com &nbsp;...</p>
            </li>


            <li>
              <a href="www.url.com/person" class="title">Person3</a>
              <p class="url word">www.url.com/Person1</p>
              <p>Person 3, some foobar, possibly an email@address.com &nbsp;...</p>
            </li>
    </ul>

2 Answers:

Answer 0: (score: 1)

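Whether you need a dictionary is up to you, but if you do choose one, it would probably be better to have a separate dictionary for each list item rather than a single dictionary for all of the entries.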
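To illustrate why a single dictionary ends up holding only one entry, here is a minimal sketch. The `rows` list is a hypothetical stand-in for the values scraped from each `<li>` in your test data; it is not part of your code.

```python
# Hypothetical stand-in for the (name, url) pairs scraped from each <li>
rows = [
    ("Person1", "www.url.com/Person1"),
    ("Person2", "www.url.com/Person1"),
    ("Person3", "www.url.com/Person1"),
]

# One shared dictionary: every pass assigns to the same two keys,
# so only the values from the last row survive.
single = {}
for name, url in rows:
    single["Name"] = name
    single["url"] = url

# One dictionary per row, collected in a list: nothing is overwritten.
people = []
for name, url in rows:
    people.append({"Name": name, "url": url})

print(single)
print(len(people))
```

A dictionary holds one value per key, so assigning to `'Name'` again replaces the earlier value rather than adding a second entry.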

I would suggest storing all of your entries in a list. The code below shows two suggestions: either use a tuple to store the various pieces of information for each item, or use a dictionary.

If you only plan to display the information or write it to a file, the tuple solution would be faster.

For your example HTML, the following code collects the entries both ways and prints them:

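from BeautifulSoup import BeautifulSoup
import re

# Two possible ways of storing your data: a list of tuples, or a list of dictionaries
entries_tuples = []
entries_dictionary = []

# text is assumed to hold the HTML source, e.g. response.text from the question
soup = BeautifulSoup(text)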

div = soup.find('div', {'id': 'object-a'})
ul = div.find('ul', {'id': 'object-a-1'})

for li in ul.findAll('li'):
    title = li.find('a', {'class': 'title'})
    url_href = title.get('href')
    person = title.text
    url_word = li.find('p', {'class': 'url word'}).text
    emails = re.findall(r'\s+(\S+@\S+)(?:\s+|\Z)', li.findAll('p')[1].text, re.M)       # allow for multiple emails

    entries_tuples.append((url_href, person, url_word, emails))
    entries_dictionary.append({'url_href' : url_href, 'person' : person, 'url_word' : url_word, 'emails' : emails})

for url_href, person, url_word, emails in entries_tuples:
    print '{:25} {:10} {:25} {}'.format(url_href, person, url_word, emails)

print

for entry in entries_dictionary:
    print '{:25} {:10} {:25} {}'.format(entry['url_href'], entry['person'], entry['url_word'], entry['emails'])

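Note that extracting an email address from text is a whole question in itself. The above solution could easily match entries that are not actually valid email addresses, but it will suffice here.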
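If you want something a little stricter, the following is a rough sketch rather than a full address validator. It assumes an address has at least one dot in its domain and should not include a trailing dot or comma. `EMAIL_RE` and `find_emails` are names made up for this example.

```python
import re

# Local part, "@", then a domain with at least one dot-separated label.
# Deliberately simple; real-world addresses can be more varied than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_RE.findall(text)


print(find_emails("Person 1, some foobar possibly an email@address.com  ..."))
```

Because the group in the pattern is non-capturing, `findall` returns the whole matched address each time.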

Answer 1: (score: 0)

You are probably going about this the wrong way. Try something like this:

from BeautifulSoup import BeautifulSoup
import re

text = open('soup.html') # You are opening the file differently
soup = BeautifulSoup(text)
list_items = soup.findAll('li')

people = []

for item in list_items:
    name = item.find('a', {'class': 'title'}).text
    url = item.find('p', {'class': 'url word'}).text
    email_text = item.findAll('p')[1].text
    match = re.search(r'[\w\.-]+@[\w\.-]+', email_text)
    email = match.group(0) if match else None    # None when the item has no address

    person = {'name': name, 'url': url, 'email': email}
    people.append(person)

print people
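The `from BeautifulSoup import BeautifulSoup` import above is the older BeautifulSoup 3 package. If you are on Python 3 with the `bs4` package installed (an assumption on my part), a roughly equivalent sketch might look like the following. The `html` string is two items adapted from the question's test data, with the address removed from the second one to exercise the no-match case.

```python
import re

from bs4 import BeautifulSoup  # assumes the bs4 package is installed

# Adapted from the question's test data; the second item has no address
html = """
<div id="object-a">
    <ul id="object-a-1">
        <li>
          <a href="www.url.com/person" class="title">Person1</a>
          <p class="url word">www.url.com/Person1</p>
          <p>Person 1, some foobar possibly an email@address.com &nbsp;...</p>
        </li>
        <li>
          <a href="www.url.com/person" class="title">Person2</a>
          <p class="url word">www.url.com/Person1</p>
          <p>Person 2, no address here</p>
        </li>
    </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

people = []
for item in soup.find_all("li"):
    name = item.find("a", class_="title").get_text()
    url = item.find("p", class_="url word").get_text()
    # The second <p> in each item holds the free text that may contain an address
    email_text = item.find_all("p")[1].get_text()
    match = re.search(r"[\w.-]+@[\w.-]+", email_text)
    email = match.group(0) if match else None
    people.append({"name": name, "url": url, "email": email})

print(people)
```

Each pass builds a fresh dictionary and appends it, so `people` ends up with one entry per `<li>`.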