I need to extract the text inside the title tag from this source:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html dir="ltr" lang="en">
<head>
<title>Microsoft to acquire Nokia’s devices & services business, license Nokia’s patents and mapping services</title>
<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE9; IE=10" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta id="ctl00_WtCampaignId" name="DCSext.wt_linkid" />
</title>
I am using this to extract the text:
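import urllib2
from bs4 import BeautifulSoup  # assumed; the original import line was not shown

# Send a browser-like User-agent header with the request
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
ourUrl = opener.open("http://www.thehindubusinessline.com/industry-and-economy/info-tech/nokia-cannot-license-brand-nokia-post-microsoft-deal/article5156470.ece").read()
soup = BeautifulSoup(ourUrl)
print soup
dem = soup.findAll('p')      # all <p> elements
hea = soup.findAll('title')  # all <title> elements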
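This code extracts the p tags correctly but fails when I try to extract the title. Thanks. I have included only part of the code; don't worry, the rest works fine.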
Answer 0 (score: 0)
Your HTML has an error: it contains two </title> closing tags:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html dir="ltr" lang="en">
<head>
<title>Microsoft to acquire Nokia’s devices & services business, license Nokia’s patents and mapping services</title>
<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE9; IE=10" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta id="ctl00_WtCampaignId" name="DCSext.wt_linkid" />
</title> # duplicate: <title> was already closed above
So the corrected markup should look like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html dir="ltr" lang="en">
<head>
<title>Microsoft to acquire Nokia’s devices & services business, license Nokia’s patents and mapping services</title>
<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE9; IE=10" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta id="ctl00_WtCampaignId" name="DCSext.wt_linkid" />
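If you cannot edit the page itself, it may help to check whether the title can still be read despite the stray closing tag. Below is a minimal sketch of that check. It uses Python 3's standard-library `html.parser` rather than BeautifulSoup, and the names `TitleExtractor`, `extract_title`, and `SAMPLE_HTML` are made up for illustration. It keeps the text of the first `<title>` element and ignores any `</title>` that has no matching open tag.

```python
from html.parser import HTMLParser

# The head of the page from the question, including the stray </title>.
SAMPLE_HTML = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html dir="ltr" lang="en">
<head>
<title>Microsoft to acquire Nokia’s devices & services business, license Nokia’s patents and mapping services</title>
<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE9; IE=10" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta id="ctl00_WtCampaignId" name="DCSext.wt_linkid" />
</title>
"""


class TitleExtractor(HTMLParser):
    """Collects the text of the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_title = False  # True while inside the first <title>
        self.done = False      # True once the first <title> has been closed
        self.parts = []        # text chunks seen inside the title

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        # Only close a title that is actually open; a stray </title> is ignored.
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the stripped text of the first <title>, or '' if there is none."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


if __name__ == "__main__":
    print(extract_title(SAMPLE_HTML))
```

With BeautifulSoup, the equivalent lookup would normally be `soup.title.string`, assuming the parser in use found a title element at all.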