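I'm trying to write a Python script that reads a Crunchyroll page and gives me the ssid of the subtitles.
When I open the page source and search for ssid, I want to extract the number that follows ssid in this element: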
<a href="/i-cant-understand-what-my-husband-is-saying/episode-1-wriggling-memories-678035?ssid=154757" title="English (US)">English (US)</a>
I want to extract "154757", but I can't seem to get my script to work.
Here is my script so far:
import feedparser
import re
import urllib2
from urllib2 import urlopen
from bs4 import BeautifulSoup
feed = feedparser.parse('http://www.crunchyroll.com/rss/anime')
url1 = feed['entries'][0]['link']
soup = BeautifulSoup(urlopen(url1), 'html.parser')
How can I modify my code to search for and extract that specific number?
Answer 0 (score: 1)
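This should get you started on extracting the ssid for each entry. Note that some of these links don't have any ssid, so you'll need to build in some error handling. There's no need for the re or urllib2 modules here.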
import feedparser
import requests
from bs4 import BeautifulSoup

d = feedparser.parse('http://www.crunchyroll.com/rss/anime')
for entry in d.entries:
    # Fetch the episode page behind each feed entry
    r = requests.get(entry.link)
    # Name the parser explicitly so bs4 doesn't have to guess one
    soup = BeautifulSoup(r.text, 'html.parser')
    # Subtitle links sit inside spans with this class
    subtitles = soup.find_all('span', {'class': 'showmedia-subtitle-text'})
    for span in subtitles:
        for a in span.find_all('a'):
            print(a['href'])
Output:
--snip--
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=166035
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=165817
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=165819
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=166783
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=165839
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=165989
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=166051
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166011
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=165995
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=165997
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166033
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=165825
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166013
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166009
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166003
/etotama/episode-11-catrat-shuffle-678659?ssid=166007
/etotama/episode-11-catrat-shuffle-678659?ssid=165969
/etotama/episode-11-catrat-shuffle-678659?ssid=166489
/etotama/episode-11-catrat-shuffle-678659?ssid=166023
/etotama/episode-11-catrat-shuffle-678659?ssid=166015
/etotama/episode-11-catrat-shuffle-678659?ssid=166049
/etotama/episode-11-catrat-shuffle-678659?ssid=165993
/etotama/episode-11-catrat-shuffle-678659?ssid=165981
--snip--
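There are more, but I left them out for brevity. From these results you should be able to pull out the ssid with a bit of slicing, since the ssids all appear to be 6 digits long. Something like:
print(a['href'][-6:])
should do the trick and get you the ssid.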
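Slicing the last six characters only works while every ssid has exactly six digits and the href ends with it. As a hedged alternative, here is a minimal sketch that reads the ssid from the query string instead and returns None for links that carry no ssid. It assumes Python 3 (in Python 2 the same functions live in the urlparse module); the helper name extract_ssid and the second sample link are hypothetical and not part of the original answer.

```python
from urllib.parse import urlparse, parse_qs


def extract_ssid(href):
    """Return the ssid query parameter of href as a string, or None if absent."""
    query = urlparse(href).query           # the part after "?", e.g. "ssid=154757"
    values = parse_qs(query).get("ssid")   # list of values for ssid, or None
    return values[0] if values else None


hrefs = [
    # href taken from the question
    "/i-cant-understand-what-my-husband-is-saying/episode-1-wriggling-memories-678035?ssid=154757",
    # hypothetical link without an ssid, to show the fallback
    "/etotama/episode-11-catrat-shuffle-678659",
]
for href in hrefs:
    print(extract_ssid(href))
```

Inside the loop over the anchors you could call extract_ssid(a['href']) and skip any result that is None, which covers the links that have no ssid.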