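Removing \r and whitespace

Date: 2018-03-04 18:41:00

Tags: python python-2.7 beautifulsoup kodi

How can I remove all blank lines from printed text using BeautifulSoup and Python? I'm still new to this; I think what I'm describing might be called whitespace?

Current output: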

02:00 - 05:00 NHL: Columbus Blue Jackets at San Jose Sharks

 - Channel 60







02:30 - 04:30 NCAAB: Quinnipiac vs Fairfield

 - Channel 04







03:00 - 05:00 MLS: Portland Timbers at Los Angeles Galaxy

 - Channel 05

Desired output:

02:00 - 05:00 NHL: Columbus Blue Jackets at San Jose Sharks - Channel 60
02:30 - 04:30 NCAAB: Quinnipiac vs Fairfield - Channel 04 
03:00 - 05:00 MLS: Portland Timbers at Los Angeles Galaxy - Channel 05

Code:

import urllib2
from bs4 import BeautifulSoup

pg_source = ''
req = urllib2.Request('http://rushmore.tv/schedule')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36')

try:
    response = urllib2.urlopen(req)
    pg_source = response.read().decode('utf-8' , 'ignore')
    response.close()
except:
    pass

content = []
soup = BeautifulSoup(pg_source)
content = BeautifulSoup(soup.find('ul', { 'id' : 'myUL' }).prettify())

print (content.text)

2 Answers:

Answer 0: (score: 0)

With a bit of list comprehension, .split(), .strip(), and .join(), you can build the output like this:

Code:

text = [l.strip() for l in content.text.split('\n') if l.strip()]
print('\n'.join(' '.join(l) for l in zip(text[::2], text[1::2])))
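To see what those two lines do without fetching the page, here is a minimal sketch that runs on a literal string. The sample text is copied from the "current output" in the question and stands in for content.text; the helper name join_pairs and the stray '\r' after "Channel 04" are assumptions added for illustration.

```python
# Stand-in for content.text, copied from the question's "current output".
# The '\r' after "Channel 04" is an assumption, added to show that strip() removes it.
sample_text = (
    "02:00 - 05:00 NHL: Columbus Blue Jackets at San Jose Sharks\n"
    "\n"
    " - Channel 60\n"
    "\n\n\n\n\n\n\n"
    "02:30 - 04:30 NCAAB: Quinnipiac vs Fairfield\n"
    "\n"
    " - Channel 04\r\n"
    "\n\n\n\n\n\n\n"
    "03:00 - 05:00 MLS: Portland Timbers at Los Angeles Galaxy\n"
    "\n"
    " - Channel 05\n"
)


def join_pairs(raw_text):
    """Drop blank lines, then glue each event line to the channel line after it."""
    # strip() trims spaces, tabs, '\r' and '\n'; lines that end up empty are skipped.
    lines = [line.strip() for line in raw_text.split('\n') if line.strip()]
    # Even indexes hold the event, odd indexes hold its channel.
    return [' '.join(pair) for pair in zip(lines[::2], lines[1::2])]


result = join_pairs(sample_text)
print('\n'.join(result))
```

The pairing relies on every event being followed by exactly one channel line. If an entry is missing its channel, all later pairs shift by one, so this is worth checking against the real page.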

Test code:

import urllib2
from bs4 import BeautifulSoup

pg_source = ''
req = urllib2.Request('http://rushmore.tv/schedule')
req.add_header('User-Agent',
               'Mozilla/5.0 (Windows NT 6.3) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36')

try:
    response = urllib2.urlopen(req)
    pg_source = response.read().decode('utf-8', 'ignore')
    response.close()
except:
    pass

content = []
soup = BeautifulSoup(pg_source)
content = BeautifulSoup(soup.find('ul', {'id': 'myUL'}).prettify())

text = [l.strip() for l in content.text.split('\n') if l.strip()]
print('\n'.join(' '.join(l) for l in zip(text[::2], text[1::2])))

Result:

21:00 - 23:00 NCAAB:    Pepperdine vs Saint Mary's - Channel 03
21:30 - 00:00 AFL: Gold Coast vs. Geelong - Channel 47
22:00 - 00:00 A-League: Western Sydney Wanderers vs Perth Glory - BT Sport 1
22:45 - 03:00 Ski Classic: Mora - Channel 93
23:00 - 00:30 Freestyle Skiing WC: Ski Cross - Channel 106

Answer 1: (score: 0)

A very simple way to get the same result with less code is to use the requests module.

Here is the code.

import requests
from bs4 import BeautifulSoup

html = requests.get('http://rushmore.tv/schedule').text

soup = BeautifulSoup(html,'lxml')

ul = soup.find('ul', { 'id' : 'myUL' })

for content in ul.find_all('li'):
    print(content.text)

Give it a try. It works fine for me.
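If an li spans several lines in the page source, its .text keeps the embedded newlines, so the loop above may still print blank lines. One possible way around that is get_text() with a separator and strip=True, sketched below on inline HTML. The markup here is hypothetical, since the real structure of the schedule page is not shown in the question, and 'html.parser' is used only to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real structure of the schedule page is not shown in the question.
html = """
<ul id="myUL">
  <li>
    <a href="#">02:00 - 05:00 NHL: Columbus Blue Jackets at San Jose Sharks</a>
    <span> - Channel 60</span>
  </li>
  <li>
    <a href="#">02:30 - 04:30 NCAAB: Quinnipiac vs Fairfield</a>
    <span> - Channel 04</span>
  </li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')
ul = soup.find('ul', {'id': 'myUL'})

# get_text(separator, strip=True) strips every text fragment, skips the ones
# that end up empty, and joins the remaining fragments with the separator.
rows = [li.get_text(' ', strip=True) for li in ul.find_all('li')]
print('\n'.join(rows))
```

Because each li is handled on its own, this does not depend on event and channel lines alternating perfectly.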