How do I select specific words and put them into a list of tuples?

Time: 2015-01-24 13:47:46

Tags: python python-3.x beautifulsoup

Using BeautifulSoup, I got a long string as a result. It looks like this:

<a href="link1"><span>title1</span></a>
<a href="link2"><span>title2</span></a>
<a href="link3"><span>title3</span></a>
<a href="link4"><span>title4</span></a>

I want to pick out just the "link#" and "title#" values and put them in a list, like this:

[(link1,title1),(link2,title2),(link3,title3),(link4,title4)]

Because I know so little Python, I don't even know what to search for. I have been trying to do this for 6 hours and still can't find a way.

The BeautifulSoup code I am using:

def extract(self):

    self.url ="http://aetoys.tumblr.com"
    self.source = requests.get(self.url)
    self.text = self.source.text
    self.soup = BeautifulSoup(self.text)

    for self.div in self.soup.findAll('li',{'class':'has-sub'}):
        for self.li in self.div.find_all('a'):
            print(self.li)

1 answer:

Answer 0 (score: 1)

You just need to extract the href:

out = []  # a list of lists, one inner list per matching <li>
for self.div in self.soup.findAll('li',{'class':'has-sub'}):
    hrefs = [x["href"] for x in self.div.find_all('a',href=True)]
    out.append(hrefs)
    print(hrefs)



['#', '#', '/onepiece_book', '/onepiece', '#', '/naruto_book', '/naruto', '#', '/bleach_book', '/bleach', '/kingdom', '/tera', '/torico', '/titan', '/seven', '/fairytail', '/soma', '/amsal', '/berserk', '/ghoul', '/kaizi', '/piando']
['#', '/onepiece_book', '/onepiece']
['#', '/naruto_book', '/naruto']
['#', '/bleach_book', '/bleach']
['#', '/conan', '/silver', '/hai', '/nise', '/hunterbyhunter', '/baku', '/unhon', '/souleater', '/liargame', '/kenichi', '/dglayman', '/magi', '/suicide', '/pedal']
['#', '/dobaku', '/gisei', '/dragonball', '/hagaren', '/gantz', '/doctor', '/dunk', '/susi', '/reborn', '/airgear', '/island', '/crows', '/beelzebub', '/zzang', '/akira', '/tennis', '/kuroco', '/claymore', '/deathnote']

To get a single list:

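import requests
from bs4 import BeautifulSoup

url = "http://aetoys.tumblr.com"
source = requests.get(url)
text = source.text
soup = BeautifulSoup(text)

# One flat list: iterate over every matching <li>, then over every <a> inside it
print([x["href"] for div in soup.findAll('li',{'class':'has-sub'}) for x in div.find_all('a',href=True)])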


['#', '#', '/onepiece_book', '/onepiece', '#', '/naruto_book', '/naruto', '#', '/bleach_book', '/bleach', '/kingdom', '/tera', '/torico', '/titan', '/seven', '/fairytail', '/soma', '/amsal', '/berserk', '/ghoul', '/kaizi', '/piando', '#', '/onepiece_book', '/onepiece', '#', '/naruto_book', '/naruto', '#', '/bleach_book', '/bleach', '#', '/conan', '/silver', '/hai', '/nise', '/hunterbyhunter', '/baku', '/unhon', '/souleater', '/liargame', '/kenichi', '/dglayman', '/magi', '/suicide', '/pedal', '#', '/dobaku', '/gisei', '/dragonball', '/hagaren', '/gantz', '/doctor', '/dunk', '/susi', '/reborn', '/airgear', '/island', '/crows', '/beelzebub', '/zzang', '/akira', '/tennis', '/kuroco', '/claymore', '/deathnote']
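
The flat list repeats entries (apparently because some of the li.has-sub elements sit inside others) and it contains '#' placeholders. If that is not wanted, one possible way to clean it up is sketched below. The hrefs list is a shortened, hand-copied sample of the output above, and unique is just an illustrative name.

```python
# Shortened sample of the flat list printed above (hand-copied for illustration)
hrefs = ['#', '#', '/onepiece_book', '/onepiece', '#', '/naruto_book', '/naruto',
         '#', '/onepiece_book', '/onepiece']

# Drop the '#' placeholders, then remove duplicates while keeping first-seen order.
# dict.fromkeys preserves insertion order in Python 3.7+.
unique = list(dict.fromkeys(h for h in hrefs if h != '#'))
print(unique)  # ['/onepiece_book', '/onepiece', '/naruto_book', '/naruto']
```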

If you really want tuples:

out = []  # a list of tuples, one tuple of hrefs per matching <li>
for div in soup.findAll('li',{'class':'has-sub'}):
    out.append(tuple(x["href"] for x in div.find_all('a',href=True)))
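
Note that the snippets above collect only the href values. If the goal is the list of (link, title) pairs shown in the question, a minimal sketch might look like this. It assumes every <a> wraps its title text the way the sample markup does. The html string and the name pairs are made up for illustration, and "html.parser" is simply the parser that ships with Python, so the real page may parse slightly differently.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (stands in for the real page)
html = """
<a href="link1"><span>title1</span></a>
<a href="link2"><span>title2</span></a>
<a href="link3"><span>title3</span></a>
<a href="link4"><span>title4</span></a>
"""

soup = BeautifulSoup(html, "html.parser")

# For each <a> that has an href, pair the href with the tag's visible text
pairs = [(a["href"], a.get_text(strip=True)) for a in soup.find_all("a", href=True)]
print(pairs)
# [('link1', 'title1'), ('link2', 'title2'), ('link3', 'title3'), ('link4', 'title4')]
```

On the real page the same comprehension could be run over div.find_all('a', href=True) inside the existing loop instead of over the whole soup.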