Python: reorder and remove characters from an HTML page title

Time: 2016-04-28 20:26:03

Tags: python html-parsing

I am running Python 2.7.11 on Windows 10, using beautifulsoup4 and lxml.

import urllib2
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen("http://www.daisuki.net/us/en/anime/watch.GUNDAMUNICORNRE0096.13142.html"), "lxml")
Name = soup.title.string

print(Name.replace('#', ""))

Output:

01 DEPARTURE 0096 - MOBILE SUIT GUNDAM UNICORN RE:0096 - DAISUKI

Desired output:

MOBILE SUIT GUNDAM UNICORN RE:0096 - 01 DEPARTURE 0096

How can I remove the trailing " - DAISUKI" and reorder the rest of the string?

2 answers:

Answer 0 (score: 1)

Split the title on " - " and rearrange the parts:

>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> 
>>> soup = BeautifulSoup(urllib2.urlopen("http://www.daisuki.net/us/en/anime/watch.GUNDAMUNICORNRE0096.13142.html"), "lxml")
>>> Name = soup.title.string
>>> 
>>> " - ".join(Name.replace('#', "").split(" - ")[1::-1])
u'MOBILE SUIT GUNDAM UNICORN RE:0096 - 01 DEPARTURE 0096'
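The slice in the one-liner above can be unpacked step by step. This is only a minimal sketch: `reorder_title` is a hypothetical helper name, the title is hard-coded from the question's output instead of being fetched, and it assumes the title always has the form "episode - series - site".

```python
def reorder_title(title, sep=" - "):
    """Return 'series - episode' from an 'episode - series - site' title."""
    # Drop any '#' characters, then split into the title's parts.
    parts = title.replace("#", "").split(sep)
    # parts[1::-1] starts at index 1 and steps backwards to index 0,
    # so it yields [series, episode] and discards everything from index 2 on.
    return sep.join(parts[1::-1])


# Title string taken from the question's output (assumed page title).
title = "01 DEPARTURE 0096 - MOBILE SUIT GUNDAM UNICORN RE:0096 - DAISUKI"
result = reorder_title(title)
print(result)
```

Because the slice stops at index 0, any third part (here the site name) is dropped without having to name it explicitly.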

Answer 1 (score: 1)

Hacky solution incoming:

Name = "01 DEPARTURE 0096 - MOBILE SUIT GUNDAM UNICORN RE:0096 - DAISUKI"
# Split on '-', keep the first two parts, swap them, and trim the outer whitespace.
print("- ".join(reversed(Name.split('-')[:2])).strip())
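Splitting on a bare '-' would also cut inside an episode or series name that contains a hyphen. A slightly more defensive sketch is shown here, assuming the site suffix is always " - DAISUKI" and the separator is always " - " with spaces; `SITE_SUFFIX` and `clean_title` are hypothetical names, not part of any library.

```python
SITE_SUFFIX = " - DAISUKI"  # assumed constant suffix of every page title


def clean_title(title, suffix=SITE_SUFFIX, sep=" - "):
    """Strip the site suffix, then swap the episode and series parts."""
    # Remove the site name only when it is actually present at the end.
    if title.endswith(suffix):
        title = title[:-len(suffix)]
    # Split once, at the first separator: episode on the left, series on the right.
    episode, found, series = title.partition(sep)
    if not found:
        # No separator: nothing to reorder, return the title unchanged.
        return title.strip()
    return (series + sep + episode).strip()


Name = "01 DEPARTURE 0096 - MOBILE SUIT GUNDAM UNICORN RE:0096 - DAISUKI"
cleaned = clean_title(Name)
print(cleaned)
```

Using `str.partition` keeps hyphens inside a part intact, at the cost of hard-coding the suffix.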