BeautifulSoup Python script no longer works for extracting a simple field

Date: 2014-12-10 02:33:28

Tags: python python-2.7 web-scraping beautifulsoup html-parsing

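The script used to work, but it no longer does, and I can't figure out why. I am trying to open the link and extract/print the Religion field. According to Firebug, the Religion entry sits inside a 'tbody' and then a 'td' tag structure. But now the script finds None when it searches for these tags. I also looked at the lxml output of 'print Soup_FamSearch', and I don't see the 'tbody' and 'td' tags that show up in Firebug.

Please let me know what I am missing.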

import urllib2
import re
import csv
from bs4 import BeautifulSoup
import time
from unicodedata import normalize

FamSearchURL = 'https://familysearch.org/pal:/MM9.1.1/KH21-211'
OpenFamSearchURL = urllib2.urlopen(FamSearchURL)
Soup_FamSearch = BeautifulSoup(OpenFamSearchURL, 'lxml')
OpenFamSearchURL.close()

# find() returns None when the parsed document contains no <tbody>
tbodyTags = Soup_FamSearch.find('tbody')
trTags = tbodyTags.find_all('tr', class_='result-item ')

for trTag in trTags:
    tdTags_label = trTag.find('td', class_='result-label ')
    if tdTags_label:
        tdTags_label_string = tdTags_label.get_text(strip=True)

        # strip=True removes the trailing space, so compare against 'Religion:'
        if tdTags_label_string == 'Religion:':
            print trTag.find('td', class_='result-value ')

1 answer:

Answer 0 (score: 1)

Find the Religion: label by its text and get the next td sibling:

soup.find(text='Religion:').parent.find_next_sibling('td').get_text(strip=True)

Demo:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> 
>>> response = requests.get('https://familysearch.org/pal:/MM9.1.1/KH21-211')
>>> soup = BeautifulSoup(response.content, 'lxml')
>>> 
>>> soup.find(text='Religion:').parent.find_next_sibling('td').get_text(strip=True)
Methodist
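
The same lookup can be tried offline. The following is a minimal sketch, not the real page: the HTML fragment is a made-up stand-in that only assumes a label cell followed by a value cell, and it uses the built-in 'html.parser' so that lxml is not required. It also uses string=, the newer name for the text= argument shown above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the real FamilySearch markup may differ.
HTML = """
<table>
  <tr class="result-item"><td class="result-label">Religion:</td><td class="result-value">Methodist</td></tr>
  <tr class="result-item"><td class="result-label">Nationality:</td><td class="result-value">Canadian</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Locate the text node that is exactly 'Religion:'.
label = soup.find(string="Religion:")

# Step up to the enclosing <td>, then across to the next <td> sibling.
religion = label.parent.find_next_sibling("td").get_text(strip=True)

print(religion)
```

Note that the lookup relies only on the label text and the sibling relationship. It needs neither a tbody wrapper nor the class attributes.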

You can then wrap this in a reusable function:

def get_field_value(soup, field):
    return soup.find(text='%s:' % field).parent.find_next_sibling('td').get_text(strip=True)

print get_field_value(soup, 'Religion')
print get_field_value(soup, 'Nationality')
print get_field_value(soup, 'Birthplace')

Prints:

Methodist
Canadian
Ontario
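
The function above raises AttributeError if a label is missing, because find() returns None in that case. A more defensive variant might look like the sketch below. The name get_field_value_safe, the default parameter, and the sample HTML are all assumptions made for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup


def get_field_value_safe(soup, field, default=None):
    """Return the text of the <td> after the '<field>:' label, or default."""
    target = "%s:" % field

    # Match the label while ignoring surrounding whitespace.
    label = soup.find(string=lambda s: s is not None and s.strip() == target)
    if label is None:
        return default

    # Climb to the enclosing cell even if the label is wrapped in another tag.
    cell = label.find_parent("td")
    if cell is None:
        return default

    value = cell.find_next_sibling("td")
    if value is None:
        return default
    return value.get_text(strip=True)


# Hypothetical fragment with trailing spaces and a nested <b> in the labels.
SAMPLE = """
<table>
  <tr><td>Religion: </td><td> Methodist </td></tr>
  <tr><td><b>Birthplace:</b></td><td>Ontario</td></tr>
</table>
"""

sample_soup = BeautifulSoup(SAMPLE, "html.parser")
print(get_field_value_safe(sample_soup, "Religion"))
print(get_field_value_safe(sample_soup, "Occupation", default="n/a"))
```

Returning a default instead of raising keeps a batch scrape running when a single record lacks a field. Whether that is preferable to failing loudly depends on the use case.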