I am trying to extract TV episode titles from tables on Wikipedia using BeautifulSoup and Python. To explain what I have done so far, I am working with two tables:
1:http://en.wikipedia.org/wiki/Community_(season_1)
2:http://en.wikipedia.org/wiki/Two_and_a_Half_Men_(season_1)
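In both tables, each episode title is contained in a <td class="summary"> cell.
In the first table, the <td> also contains an <a>TitleName</a> element, and I can extract the data just fine with the following code: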
import urllib
import urllib2
from bs4 import BeautifulSoup
url = "http://en.wikipedia.org/wiki/Community_(season_1)"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
for names in soup.select('td[class="summary"] > a'):
    print names.string
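The problem appears in the second table, Two and a Half Men, where the title sits directly inside the <td>, without an <a> around it.
I use this code to extract the titles: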
import urllib
import urllib2
from bs4 import BeautifulSoup
url = "http://en.wikipedia.org/wiki/Two_and_a_Half_Men_(season_1)"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
for lel in soup.select('td[class="summary"]'):
    print lel.string
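But the titles come with quotation marks around them, i.e. "".
I guess removing the quotes would be easy, but what if, within one table, some <td> cells contain an <a> and some do not? How do I make Python decide whether it should look for an <a> element?
If I remove the > a from the selector in the first code block, the output is None, because both the parent and the child contain strings. If I switch to names.strings, I get: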
<generator object _all_strings at 0x01B1CDA0>
If I use soup.get_text(), I get:
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position 6818, character maps to <undefined>
Please help :)
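A note on the last error: a 'charmap' UnicodeEncodeError usually comes from printing to a console whose code page cannot represent a character such as the en dash (U+2013), not from BeautifulSoup itself. The following is only a rough Python 3 sketch of that situation. The cp437 encoding, the sample string, and the helper name console_safe are assumptions made for illustration, since the actual console encoding is not stated here.

```python
def console_safe(text, encoding="cp437"):
    """Replace characters the target encoding cannot represent with '?'."""
    # errors="replace" substitutes '?' instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="replace").decode(encoding)


# Hypothetical sample text containing an en dash (U+2013).
title = "Episodes 1\u20135"

try:
    # Encoding for a code page that lacks U+2013 fails, much like the print above.
    title.encode("cp437")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)

# The sanitized text can be printed on such a console.
print(console_safe(title))
```

Replacing characters loses information, so encoding the output explicitly as UTF-8 (for example when writing to a file) may be the better choice when the destination supports it.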
Answer 0 (score: 2)
How about using .text?
import urllib
import urllib2
from bs4 import BeautifulSoup
url = "http://en.wikipedia.org/wiki/Two_and_a_Half_Men_(season_1)"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
for lel in soup.select('td[class="summary"]'):
    print lel.text.replace('"','') # remove the quote marks as well
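This prints all the names without the quotes, and it fixes the None problem: .text joins every string inside the cell, whereas .string returns None when a tag has more than one child.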
Pilot
Most Chicks Won't Eat Veal
Big Flappy Bastards
etc...
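To show that the same idea covers a table where only some cells wrap the title in an <a>, here is a minimal Python 3 sketch. It runs on a small inline HTML sample instead of the live Wikipedia page; the sample markup and the function name episode_titles are made up for illustration, and the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one cell wraps the title in <a>, the other does not.
SAMPLE_HTML = """
<table>
  <tr><td class="summary">"<a href="/wiki/Pilot">Pilot</a>"</td></tr>
  <tr><td class="summary">"Big Flappy Bastards"</td></tr>
</table>
"""


def episode_titles(html):
    """Return the text of every <td class="summary"> cell, without quotes."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for cell in soup.select("td.summary"):
        # get_text() joins all nested strings, so it works with or without <a>.
        # .string would be None for the first cell, which has several children.
        titles.append(cell.get_text().strip().strip('"'))
    return titles


print(episode_titles(SAMPLE_HTML))
```

Because get_text() does not care how deeply the title is nested, no explicit check for an <a> element is needed.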
Answer 1 (score: 1)
Have you thought about trying the tvrage API?
import tvrage.api
community = tvrage.api.Show('Community')
twohalfmen = tvrage.api.Show('Two and a Half Men')
comeps = community.season(1).episode(1)
twoeps = twohalfmen.season(1).episode(2)
>>> comeps
Community 1x01 Pilot
>>> twoeps
Two and a Half Men 1x02 Big Flappy Bastards
>>> community.season(1)
{1: Community 1x01 Pilot, 2: Community 1x02 Spanish 101, 3: Community 1x03 Introduction to Film,
4: Community 1x04 Social Psychology, 5: Community 1x05 Advanced Criminal Law, 6: Community 1x06 Football, Feminism and You,
7: Community 1x07 Introduction to Statistics, 8: Community 1x08 Home Economics, 9: Community 1x09 Debate 109, 10: Community 1x10 Environmental Science,
11: Community 1x11 The Politics of Human Sexuality, 12: Community 1x12 Comparative Religion, 13: Community 1x13 Investigative Journalism, 14: Community 1x14 Interpretive Dance, 15: Community 1x15 Romantic Expressionism, 16: Community 1x16 Communication Studies, 17: Community 1x17 Physical Education, 18: Community 1x18 Basic Genealogy, 19: Community 1x19 Beginner Pottery, 20: Community 1x20 The Science of Illusion, 21: Community 1x21 Contemporary American Poultry, 22: Community 1x22 The Art of Discourse, 23: Community 1x23 Modern Warfare, 24: Community 1x24 English as a Second Language, 25: Community 1x25 Pascal's Triangle Revisited}