Python: associating the id in a URL with the URL's titles in a list

Asked: 2014-05-15 09:41:10

Tags: python list beautifulsoup

Following up on this question: Python beautifulsoup how to get the line after 'href'

I have this HTML code:

    <a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html" class="ss-titre"> 
                            Monte le son         </a>
    <div class="rs-cell-details">
                            <a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"  class="ss-titre">
                                    "Rubin_Steiner"                 </a>
<a href="http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html" class="ss-titre"> 
                        Fare maohi              </a>

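As you can see, "Monte le son" and "Rubin_Steiner" are associated with the same id (101973832), and "Fare maohi" is associated with id 102103928.

So in practice I have these lists (in this example, one result per id):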

url = ['http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html', 'http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html']      
titles = ['Monte le son', 'Rubin_Steiner', 'Fare maohi']   #2 entries for id 101973832
                                                           #1 entry for id 102103928

An id can have 3 titles, or 1, or none...

How can I associate the id from the URL (101973832) with its titles, to get this result:

result = ['Monte le son Rubin_Steiner 101973832', 'Fare maohi 102103928']

The result will be displayed in my Gtk interface. Each entry needs to contain the id so that I can find the corresponding URL, like this:

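choice = self.liste.get_active_text()     # choice is one entry of result
video_id = choice.split()[-1]             # the id is the last word of the entry
for adress in url:
    if video_id in adress:
        adresse = adress                  # keep the matching URL, not the whole list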

I hope my question is not too hard to understand...

Edit: this is how I get the titles and URLs:

url = "http://pluzz.francetv.fr/recherche?recherche=" + mot # mot is a word for my Gtk search
try:
    f = urllib.urlopen(url)
    page = f.read()
    f.close()
except:
    self.champ.set_text("La recherche a échoué")
    pass
soup = BeautifulSoup(page)
titres = []
list_url = []
for link in soup.findAll('a'):
    lien = link.get('href')
    if lien is None:
        lien = ""
    if "http://pluzz.francetv.fr/videos/" in lien:
        titre = link.text.strip()
        if "Voir cette  vidéo" in titre:
            titre = ""
        if "Lire la vidéo" in titre:
            titre = ""
        titres.append(titre)
        list_url.append(lien)

1 Answer:

Answer 0 (score: 0)

If I understand correctly, all of your URLs and titles are in lists like the ones in your example.

import re

In [111]: titles = ['Monte le son', 'Rubin_Steiner']

In [112]: url = ['http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html']

In [113]: get_id = re.findall(r'\d+', url[0]) # find consecutive digits

In [114]: results = titles + get_id

In [115]: results
Out[115]: ['Monte le son', 'Rubin_Steiner', '101973832']
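
Note that `\d+` matches any run of digits, so it would also pick up digits that happen to appear in the title part of a URL. A minimal sketch of a stricter extraction, assuming every video URL ends in `,<id>.html` as in your examples (the helper name `extract_id` is only illustrative):

```python
import re

# Assumption: video URLs end with ",<digits>.html", as in the examples above.
ID_PATTERN = re.compile(r',(\d+)\.html$')

def extract_id(url):
    """Return the numeric id at the end of the URL, or None if there is none."""
    match = ID_PATTERN.search(url)
    return match.group(1) if match else None

print(extract_id('http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html'))
```

Returning None for URLs that do not fit the pattern lets the caller skip them instead of failing on an empty list.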

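As I said in the comments, when you append titles to the titles list you should group the titles that belong together in a sublist; without some way of indexing the groups, there is no way to tell which title belongs to which URL. I have grouped them in sublists here to show you how it works.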

In [3]: url = ['http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html',   'http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html']

In [4]: titles = [['Monte le son', 'Rubin_Steiner'], ['Fare maohi']]   # need to sub list to match to url position

In [5]: get_ids = [re.findall(r'\d+', x) for x in url] # get all ids, position in list will match sub list position in titles

In [6]: results= [t + i for t, i in zip(titles, get_ids)] # this is why sub lists are useful, each position of the sub lists correspond.

In [7]: results

Out[7]: [['Monte le son', 'Rubin_Steiner', '101973832'], ['Fare maohi', '102103928']]

In [11]: final_results = [" ".join(y) for y in results] # join strings in each sublist

In [12]: final_results

Out[12]: ['Monte le son Rubin_Steiner 101973832', 'Fare maohi 102103928']
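
If building the sublists by hand is not practical, one option is to group the titles by id while walking over the scraped links. This is a rough sketch, not tested against the real site: it assumes `list_url` and `titres` are the parallel lists from your edit (one entry per link, with an empty string for titles you blanked out), and the function names `build_entries` and `url_for_choice` are made up for illustration.

```python
import re
from collections import OrderedDict

def build_entries(list_url, titres):
    """Group titles by the id found in their URL, keeping first-seen order."""
    groups = OrderedDict()   # id -> list of titles
    urls = {}                # id -> url
    for lien, titre in zip(list_url, titres):
        # Assumption: video URLs end with ",<digits>.html"
        match = re.search(r',(\d+)\.html$', lien)
        if match is None:
            continue         # not a video URL of the expected shape
        video_id = match.group(1)
        urls[video_id] = lien
        group = groups.setdefault(video_id, [])
        if titre and titre not in group:
            group.append(titre)   # skip empty and duplicate titles
    entries = [" ".join(titles + [video_id])
               for video_id, titles in groups.items()]
    return entries, urls

def url_for_choice(choice, urls):
    """The id is the last word of the chosen entry; look its URL up directly."""
    if not choice:
        return None
    return urls.get(choice.split()[-1])

# Sample data shaped like the lists in the question
list_url = ['http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html',
            'http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html',
            'http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html']
titres = ['Monte le son', 'Rubin_Steiner', 'Fare maohi']

entries, urls = build_entries(list_url, titres)
print(entries)
print(url_for_choice(entries[1], urls))
```

Keeping a separate id-to-URL dictionary means the Gtk callback does not have to scan the whole URL list; it only needs the last word of the selected text.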