After looping, what is the correct way to return all the iterated elements in a list?

Date: 2017-04-10 02:02:46

Tags: python python-3.x loops data-structures

I have the following function which takes an .html document and extracts some content:

        for e in soup.tbody.findAll('a', href=True):
            tree = etree.fromstring(str(e))
            for e in tree.xpath('//b'):
            # Here, instead of iterating, I would like to collect all the
            # elements yielded by tree.xpath() into a single list

How can I return all the elements e from tree.xpath('//a') as a list in one go? I tried appending each element to a new list, and also new_lis += element, but neither is working.

2 answers:

Answer 0 (score: 2)

There are two ways you can do this:

  1. Simply append everything to a list you create outside of your loops (or use a list comprehension):

    def extract(html_file):
        soup = BeautifulSoup(open(html_file), 'lxml')
        results = []
        for e in soup.tbody.findAll('a', href=True):
            results.append(e['href'])
    
        return results
    
    def extract_with_list_comprehension(html_file):
        soup = BeautifulSoup(open(html_file), 'lxml')
        return [e['href'] for e in soup.tbody.findAll('a', href=True)]
    
  2. Turn extract into a generator and yield results as you find them, then iterate over the generator:

    def extract(html_file):
        for e in BeautifulSoup(open(html_file), 'lxml').findAll('a', href=True):
            yield e['href']
    

    and then you can turn it into a list with list() if you need to:

    all_links = list(extract('~/some/html/file.here'))
    

Answer 1 (score: 1)

Try building the desired string on each iteration, appending it to a list, and then returning the list:

def extract(html_file):
    url_list = []
    soup = BeautifulSoup(open(html_file), 'lxml')
    try:
        for e in soup.tbody.findAll('a', href=True):
            tree = etree.fromstring(str(e))
            for e in tree.xpath('//a'):
                url = 'www.example.com' + e.get('href') + ' | title: ' + e.get('title')
                url_list.append(url)
    except AttributeError:
        print('NaN')

    return url_list
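
The collect-into-a-list pattern above can be sketched with only the standard library (no BeautifulSoup or lxml needed for a well-formed snippet); the HTML fragment, tag names, and the 'www.example.com' prefix are illustrative placeholders, not taken from the actual document:

```python
# Minimal sketch: iterate over <a> elements and collect one formatted
# string per element into a list with a single comprehension.
import xml.etree.ElementTree as ET

snippet = (
    '<tbody>'
    '<a href="/page1" title="First">one</a>'
    '<a href="/page2" title="Second">two</a>'
    '</tbody>'
)

tree = ET.fromstring(snippet)
# Each iteration builds the desired string; the comprehension gathers
# them all into url_list in one pass.
url_list = ['www.example.com' + a.get('href') + ' | title: ' + a.get('title')
            for a in tree.iter('a')]
print(url_list)
```

This prints `['www.example.com/page1 | title: First', 'www.example.com/page2 | title: Second']`; the same comprehension shape works with BeautifulSoup's findAll() results in place of tree.iter('a').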