After looping, what is the correct way to return all the iterated elements in a list?

Date: 2017-04-10 02:02:46

Tags: python python-3.x loops data-structures

I have the following function which takes an .html document and extracts some content:

        for e in soup.tbody.findAll('a', href=True):
            tree = etree.fromstring(str(e))
            for e in tree.xpath('//b'):
            # Here, instead of iterating, I would like to collect all the
            # elements yielded by tree.xpath() into a single list

How can I return all the elements e from tree.xpath('//a') as a list in one go? I tried appending each element to a new list, and also new_lis += element, but neither is working.

2 answers:

Answer 0 (score: 2)

There are two ways you can do this:

  1. Simply append everything to a list you create outside of your loops (or use a list comprehension):

    def extract(html_file):
        soup = BeautifulSoup(open(html_file), 'lxml')
        results = []
        for e in soup.tbody.findAll('a', href=True):
            results.append(e['href'])
    
        return results
    
    def extract_with_list_comprehension(html_file):
        soup = BeautifulSoup(open(html_file), 'lxml')
        return [e['href'] for e in soup.tbody.findAll('a', href=True)]
    
  2. Turn extract into a generator and yield results as you find them, then iterate over the generator:

    def extract(html_file):
        for e in BeautifulSoup(open(html_file), 'lxml').findAll('a', href=True):
            yield e['href']
    

    and then you can turn it into a list with list() if you need to:

    all_links = list(extract('~/some/html/file.here'))
    

Answer 1 (score: 1)

Try building the desired string on each iteration, appending it to a list, and then returning the list:

def extract(html_file):
    url_list = []
    soup = BeautifulSoup(open(html_file), 'lxml')
    try:
        for e in soup.tbody.findAll('a', href=True):
            tree = etree.fromstring(str(e))
            for e in tree.xpath('//a'):
                url = 'www.example.com' + e.get('href') + ' | title: ' + e.get('title')
                url_list.append(url)
    except AttributeError:
        print('NaN')

    return url_list
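
The collect-into-a-list pattern above can be sketched with only the standard library (no BeautifulSoup or lxml needed for a well-formed snippet); the HTML fragment, tag names, and the 'www.example.com' prefix are illustrative placeholders, not taken from the actual document:

```python
# Minimal sketch: iterate over <a> elements and collect one formatted
# string per element into a list with a single comprehension.
import xml.etree.ElementTree as ET

snippet = (
    '<tbody>'
    '<a href="/page1" title="First">one</a>'
    '<a href="/page2" title="Second">two</a>'
    '</tbody>'
)

tree = ET.fromstring(snippet)
# Each iteration builds the desired string; the comprehension gathers
# them all into url_list in one pass.
url_list = ['www.example.com' + a.get('href') + ' | title: ' + a.get('title')
            for a in tree.iter('a')]
print(url_list)
```

This prints `['www.example.com/page1 | title: First', 'www.example.com/page2 | title: Second']`; the same comprehension shape works with BeautifulSoup's findAll() results in place of tree.iter('a').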