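Extracting data (episode titles) from a Wikipedia table

Asked: 2014-09-17 04:43:20

Tags: python web-scraping beautifulsoup html-table wikipedia

I am trying to extract TV episode titles from tables on Wikipedia using BeautifulSoup and Python. To show what I have done so far, I am working with two tables: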

1:http://en.wikipedia.org/wiki/Community_(season_1)

2:http://en.wikipedia.org/wiki/Two_and_a_Half_Men_(season_1)

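In these tables, each episode title is contained in a <td class="summary">. In the first table, the <td> also contains an <a>TitleName</a>, and I can extract the data fine with the following code: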

import urllib
import urllib2
from bs4 import BeautifulSoup
url = "http://en.wikipedia.org/wiki/Community_(season_1)"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)

for names in soup.select('td[class="summary"] > a'):
    print names.string

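The problem comes with the second table, Two and a Half Men, where the title sits directly inside the <td>. I use this code to extract the titles: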

import urllib
import urllib2
from bs4 import BeautifulSoup
url = "http://en.wikipedia.org/wiki/Two_and_a_Half_Men_(season_1)"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
for lel in soup.select('td[class="summary"]'):
    print lel.string

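But the titles come out with quotation marks around them, i.e. "". I guess removing the quotes would be easy, but what if, within one table, some <td> cells contain an <a> and some do not? How do I make Python decide whether it should look for the <a> element?

If I remove > a from the selector in the first code block, I get None as output, because both the parent and the child contain a string. If I go ahead and use names.strings, I get:

<generator object _all_strings at 0x01B1CDA0>

If I use soup.get_text(), I get UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position 6818, character maps to <undefined>

Please help :)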

2 Answers:

Answer 0 (score: 2)

How about using .text?

import urllib
import urllib2
from bs4 import BeautifulSoup
url = "http://en.wikipedia.org/wiki/Two_and_a_Half_Men_(season_1)"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
for lel in soup.select('td[class="summary"]'):
    print lel.text.replace('"','') # remove the quote marks as well

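This prints all the names without the quotes, and it also fixes the None problem, because .text concatenates every string inside the cell whether or not the title is wrapped in an <a>.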

Pilot
Most Chicks Won't Eat Veal
Big Flappy Bastards
etc...
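
To make the mixed case concrete, here is a minimal Python 3 sketch, not the code from the question. It runs against a small inline sample instead of the live Wikipedia pages. SAMPLE_HTML, episode_titles and console_safe are hypothetical names made up for illustration, and the third sample title is invented to exercise the en dash. It shows that get_text() works the same whether or not the cell wraps its title in an <a>, whereas .string returns None as soon as a tag has more than one child. The console_safe helper is one possible way around the UnicodeEncodeError from the question, assuming the console uses a legacy code page such as cp437 (the question does not say which one).

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the two cell styles from the question:
# a title wrapped in <a>, a bare title, and a title containing U+2013.
SAMPLE_HTML = """
<table>
  <tr><td class="summary">"<a href="/wiki/Pilot">Pilot</a>"</td></tr>
  <tr><td class="summary">"Big Flappy Bastards"</td></tr>
  <tr><td class="summary">"Example \u2013 Title"</td></tr>
</table>
"""


def episode_titles(html):
    """Return the text of every <td class="summary">, quotes removed."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    # Matches the same cells as soup.select('td[class="summary"]') here.
    for cell in soup.find_all("td", class_="summary"):
        # get_text() joins all strings in the cell, so a nested <a>
        # needs no special handling.
        titles.append(cell.get_text().replace('"', "").strip())
    return titles


def console_safe(text, encoding="cp437"):
    """Replace characters the given console encoding cannot represent."""
    return text.encode(encoding, errors="replace").decode(encoding)


if __name__ == "__main__":
    for title in episode_titles(SAMPLE_HTML):
        print(console_safe(title))
```

The replacement in console_safe is lossy: unencodable characters become "?". If the terminal can display UTF-8, printing the title directly is the better choice.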

Answer 1 (score: 1)

Have you thought about trying the tvrage API?

import tvrage.api
community = tvrage.api.Show('Community')
twohalfmen = tvrage.api.Show('Two and a Half Men')
comeps = community.season(1).episode(1)
twoeps = twohalfmen.season(1).episode(2)
>>> comeps
Community 1x01 Pilot
>>> twoeps
Two and a Half Men 1x02 Big Flappy Bastards
>>> community.season(1)
{1: Community 1x01 Pilot, 2: Community 1x02 Spanish 101, 3: Community 1x03 Introduction to Film,
4: Community 1x04 Social Psychology, 5: Community 1x05 Advanced Criminal Law, 6: Community 1x06 Football, Feminism and You,
7: Community 1x07 Introduction to Statistics, 8: Community 1x08 Home Economics, 9: Community 1x09 Debate 109, 10: Community 1x10 Environmental Science,
11: Community 1x11 The Politics of Human Sexuality, 12: Community 1x12 Comparative Religion, 13: Community 1x13 Investigative Journalism, 14: Community 1x14 Interpretive Dance, 15: Community 1x15 Romantic Expressionism, 16: Community 1x16 Communication Studies, 17: Community 1x17 Physical Education, 18: Community 1x18 Basic Genealogy, 19: Community 1x19 Beginner Pottery, 20: Community 1x20 The Science of Illusion, 21: Community 1x21 Contemporary American Poultry, 22: Community 1x22 The Art of Discourse, 23: Community 1x23 Modern Warfare, 24: Community 1x24 English as a Second Language, 25: Community 1x25 Pascal's Triangle Revisited}