Beautiful Soup - stripping HTML tags returns strange characters

Time: 2018-04-21 12:00:17

Tags: python html beautifulsoup python-requests

I took most of my code from this accepted Stack Overflow answer and plugged it into the following script (running under Python 2.7):

import sys
import SelectProxy
from bs4 import BeautifulSoup, NavigableString
import requests
import json

sys.path.append("G:\\Python27\\Kodi")

session = requests.Session()

url = 'http://www.tvguide.co.uk/mobile/channellisting.asp?ch=66'


headers = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.',
'Connection': 'keep-alive',
'Host': 'www.tvguide.co.uk',
'Referer': 'http://www.tvguide.co.uk/mobile/',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}

r = session.get(url, headers=headers)

print r.text



def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(html, "lxml")

    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            s = ""

            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)

            tag.replaceWith(s)

    return soup

invalid_tags = ['td', 'tr', 'div', 'a', 'span', 'br']
print strip_tags(r.text, invalid_tags)

...This removes the tags, but I now get a lot of strange text printed to the screen:

</body></html>
<html><body>

                        The latest national and international stories as they break   

                            <html><body>
</body></html>
<html><body></body></html>
<html><body>Rating:  <html><body>3.1</body></html></body></html>
</body></html>
</body></html>
</body></html>

...Can anyone tell me what I am doing wrong?

Thanks

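1 Answer:

Answer 0: (score: 1)

The tags can actually help you get the text you want. Most of the text on that page is inside an HTML table, so it can be extracted row by row as follows: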

from bs4 import BeautifulSoup
import requests
import re

r = requests.get('http://www.tvguide.co.uk/mobile/channellisting.asp?ch=66')
soup = BeautifulSoup(r.text, "html.parser")

for tr in soup.select('table tr'):
    if not tr.script:
        print ' -'.join(re.sub(r'\s+', ' ', td.text) for td in tr.find_all('td'))

This gives you output starting:

6:00am - Breakfast A round-up of national and international news, plus sports reports, weather forecasts and arts and entertainment features. Including NewsWatch at 7.45 Rating: 1.4 
7:00am - Breakfast A round-up of national and international news, plus sports reports, weather forecasts and arts and entertainment features. Including NewsWatch at 7.45 Rating: 1.4 
8:00am - Breakfast A round-up of national and international news, plus sports reports, weather forecasts and arts and entertainment features. Including NewsWatch at 7.45 Rating: 1.4 
9:00am - BBC News The latest national and international stories as they break Rating: 3.1 
10:00am - BBC News The latest national and international stories as they break Rating: 3.1 
10:30am - The Travel Show 20/04/2018 Join the team on their journey of discovery as they explore new destinations around the globe and uncover hidden sides to some of the world's favourite holiday hotspots Rating: 4 
11:00am - BBC News The latest national and international stories as they break Rating: 3.1 
11:30am - Dateline London 21/04/2018 Foreign correspondents currently posted to London look at events in the UK through outsiders' eyes, and at how the issues of the week are being tackled around the world Rating: 6.3 
12:00pm - BBC News The latest national and international stories as they break Rating: 3.1 
12:30pm - Click 20/04/2018 A guide to the latest gadgets, websites, games and computer industry news Rating: 3.3
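
If you would rather keep the tag-stripping approach from the question, the nested <html><body> fragments most likely come from the recursion: each recursive call re-parses a fragment with lxml, which wraps it in <html><body>, and unicode() on the returned soup serializes that wrapper as well. Below is a minimal Python 3 sketch that parses once and relies on Beautiful Soup's Tag.unwrap() instead. SAMPLE_HTML is made-up markup standing in for the real page, and html.parser is used here because it does not add a wrapper around fragments.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page is not reproduced here.
SAMPLE_HTML = (
    "<div>The latest stories <br/>"
    "<span>Rating: <a href='#'>3.1</a></span></div>"
)


def strip_tags(html, invalid_tags):
    """Drop the named tags but keep their contents, parsing only once."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a list of tag names; unwrap() replaces each
    # matching tag with its own children, so no re-parsing is needed.
    for tag in soup.find_all(invalid_tags):
        tag.unwrap()
    return soup


invalid_tags = ["td", "tr", "div", "a", "span", "br"]
cleaned = strip_tags(SAMPLE_HTML, invalid_tags)
print(cleaned.get_text())
```

Because the document is parsed a single time, no extra wrapper elements can be introduced part-way through, and tags that are not in invalid_tags are left in place.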