I have the following function, which takes an .html document and extracts some content:
    for e in soup.tbody.findAll('a', href=True):
        tree = etree.fromstring(str(e))
        for e in tree.xpath('//b'):
            # Here, instead of the above line, I would like to get all the printed elements of tree.xpath() in a single string
How can I return all the `e`s in tree.xpath('//a') as a list in one go? I tried appending each element to a new list, and also new_lis += element, but it's not working.
Answer 0 (score: 2)
There are two ways you can do this:
Simply push everything onto a list you create outside of any of your loops (or use a list comprehension):
    from bs4 import BeautifulSoup

    def extract(html_file):
        soup = BeautifulSoup(open(html_file), 'lxml')
        results = []
        for e in soup.tbody.findAll('a', href=True):
            results.append(e['href'])
        return results

    def extract_with_list_comprehension(html_file):
        soup = BeautifulSoup(open(html_file), 'lxml')
        return [e['href'] for e in soup.tbody.findAll('a', href=True)]
Turn extract into a generator and just yield as you find things, then iterate over the result:
    def extract(html_file):
        for e in BeautifulSoup(open(html_file), 'lxml').findAll('a', href=True):
            yield e['href']
and then you can turn it into a list with list() if you need to:

    all_links = list(extract('~/some/html/file.here'))
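The generator-to-list pattern can be sketched end to end without any third-party packages; in this demo, xml.etree.ElementTree stands in for lxml and an inline XML fragment stands in for the HTML file (both are assumptions made for the sketch, not part of the answer above):

```python
import xml.etree.ElementTree as ET

# Inline fragment standing in for a parsed HTML file.
FRAGMENT = '<tbody><tr><td><a href="/a">A</a></td><td><a href="/b">B</a></td></tr></tbody>'

def extract(markup):
    tree = ET.fromstring(markup)
    for e in tree.iter('a'):         # yield each href as it is found
        yield e.get('href')

all_links = list(extract(FRAGMENT))  # materialize the generator into a list
print(all_links)                     # ['/a', '/b']
```

Because extract is a generator, nothing is collected until list() (or a for loop) consumes it.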
Answer 1 (score: 1)
Try building the desired string for each iteration, append the string to the list, then return the list:
    from bs4 import BeautifulSoup
    from lxml import etree

    def extract(html_file):
        url_list = []
        soup = BeautifulSoup(open(html_file), 'lxml')
        try:
            for e in soup.tbody.findAll('a', href=True):
                tree = etree.fromstring(str(e))
                for e in tree.xpath('//a'):
                    url = 'www.example.com' + e.get('href') + ' | title: ' + e.get('title')
                    url_list.append(url)
        except AttributeError:
            print('NaN')
        return url_list
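The inner append loop can also be collapsed into a single list comprehension, which is the one-step collection the question asks about. A minimal sketch, assuming xml.etree.ElementTree's findall('.//a') in place of lxml's tree.xpath('//a') and an inline fragment in place of the file (both substitutions are for the demo only):

```python
import xml.etree.ElementTree as ET

fragment = '<div><a href="/x" title="X">x</a><a href="/y" title="Y">y</a></div>'
tree = ET.fromstring(fragment)

# One-pass equivalent of building url and calling url_list.append(url).
url_list = ['www.example.com' + e.get('href') + ' | title: ' + e.get('title')
            for e in tree.findall('.//a')]
print(url_list)  # ['www.example.com/x | title: X', 'www.example.com/y | title: Y']
```

The comprehension builds each string and the list in one expression, so no separate accumulator variable is needed.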