Follow-up to this question: Python beautifulsoup how to get the line after 'href'
I have this HTML code:
<a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html" class="ss-titre">
Monte le son </a>
<div class="rs-cell-details">
<a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html" class="ss-titre">
"Rubin_Steiner" </a>
<a href="http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html" class="ss-titre">
Fare maohi </a>
As you can see, "Monte le son" and "Rubin_Steiner" are both linked to the same ID (101973832), while "Fare maohi" is linked to ID 102103928.
So what I actually have are these lists (example: one URL per ID):
url = ['http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html', 'http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html']
titles = ['Monte le son', 'Rubin_Steiner', 'Fare maohi'] #2 entries for id 101973832
#1 entry for id 102103928
An ID can have three titles, or one, or none at all...
How can I associate the ID in the URL (101973832) with its titles, to get this result:
result = ['"Monte le son Rubin_Steiner 101973832"', 'Fare maohi 102103928']
The result will be displayed in my Gtk interface. It needs to contain the ID so that I can find the corresponding URL, like this:
choice = self.liste.get_active_text() # choice = result
for adress in url:
    if id in adress:
        adresse = url
I hope my question isn't too hard to understand...
Edit: this is how I get the titles and URLs:
url = "http://pluzz.francetv.fr/recherche?recherche=" + mot # mot is a word for my Gtk search
try:
    f = urllib.urlopen(url)
    page = f.read()
    f.close()
except:
    self.champ.set_text("La recherche a échoué")
    pass
soup = BeautifulSoup(page)
titres=[]
list_url=[]
for link in soup.findAll('a'):
    lien = link.get('href')
    if lien == None:
        lien = ""
    if "http://pluzz.francetv.fr/videos/" in lien:
        titre = (link.text.strip())
        if "Voir cette vidéo" in titre:
            titre = ""
        if "Lire la vidéo" in titre:
            titre = ""
        titres.append(titre)
        list_url.append(lien)
Answer 0 (score: 0)
If I understand correctly, all of your URLs and titles end up in lists like the ones in your example.
import re
In [111]: titles = ['Monte le son', 'Rubin_Steiner']
In [112]: url = ['http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html']
In [113]: get_id = re.findall(r'\d+', url[0]) # find consecutive digits
In [114]: results = [x for x in titles] + get_id
In [115]: results
Out[115]: ['Monte le son', 'Rubin_Steiner', '101973832']
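Note that `re.findall` with `\d+` returns every run of digits in the URL, so a title slug that itself contains digits would produce more than one match. A minimal sketch that anchors on the `,<digits>.html` ending seen in the example URLs instead; the helper name `extract_id` is hypothetical, and the pattern assumes every video URL ends that way:

```python
import re

# Assumption: video URLs end in ",<digits>.html", as in the question's examples.
ID_PATTERN = re.compile(r',(\d+)\.html$')

def extract_id(url):
    """Return the numeric ID at the end of a video URL, or None if absent."""
    match = ID_PATTERN.search(url)
    return match.group(1) if match else None

print(extract_id('http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html'))
```

Returning `None` for URLs without an ID lets the caller skip them rather than fail on an empty match list.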
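As I said in the comments, when you append titles to your titles list you should group the ones that belong together in a sublist; without some way of indexing the groups there is no way to tell which title belongs to which URL. I have grouped them in sublists here to show you how it works.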
In [3]: url = ['http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html', 'http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html']
In [4]: titles = [['Monte le son', 'Rubin_Steiner'], ['Fare maohi']] # need to sub list to match to url position
In [5]: get_ids = [re.findall(r'\d+', x) for x in url] # get all ids, position in list will match sub list position in titles
In [6]: results= [t + i for t, i in zip(titles, get_ids)] # this is why sub lists are useful, each position of the sub lists correspond.
In [7]: results
Out[7]: [['Monte le son', 'Rubin_Steiner', '101973832'], ['Fare maohi', '102103928']]
In [11]: final_results=[ " ".join(y) for y in results ] # join strings in each sublist
In [12]: final_results
Out[12]: ['Monte le son Rubin_Steiner 101973832', 'Fare maohi 102103928']
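The sublists above were typed in by hand. One way to build them while scraping is to key the titles by the ID taken from each link's URL. A rough sketch, assuming `list_url` and `titres` are the parallel lists filled by the loop in the question (one entry per link, with empty strings for skipped titles); the names `group_by_id`, `titles_by_id` and `url_by_id` are made up for illustration:

```python
import re
from collections import OrderedDict

def group_by_id(list_url, titres):
    """Group titles by the ID in their URL, keeping first-seen order."""
    titles_by_id = OrderedDict()  # id -> list of titles
    url_by_id = {}                # id -> URL, for looking up the choice later
    for lien, titre in zip(list_url, titres):
        ids = re.findall(r'\d+', lien)
        if not ids:
            continue  # no ID in this URL, nothing to group on
        video_id = ids[-1]  # assumption: the ID is the last run of digits
        url_by_id[video_id] = lien
        group = titles_by_id.setdefault(video_id, [])
        if titre and titre not in group:
            group.append(titre)  # skip empty and repeated titles
    results = [" ".join(titles + [video_id])
               for video_id, titles in titles_by_id.items()]
    return results, url_by_id

list_url = [
    'http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html',
    'http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html',
    'http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html',
]
titres = ['Monte le son', 'Rubin_Steiner', 'Fare maohi']

results, url_by_id = group_by_id(list_url, titres)
print(results)

# The ID is the last word of the chosen entry, so the URL lookup is direct.
choice = results[0]
print(url_by_id[choice.split()[-1]])
```

Keeping a separate `url_by_id` dictionary means the Gtk callback can take the last word of the selected text and look the URL up directly, instead of looping over every URL.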