I'm currently using the following Python snippet to get all the <p> elements of a web page:
from bs4 import BeautifulSoup

def scraping(url, html):
    data = {}
    soup = BeautifulSoup(html, "lxml")
    data["news"] = []
    # Collect every <p> inside the news container
    page = soup.find("div", {"class": "container_news"}).findAll('p')
    page_text = ''
    for p in page:
        page_text += ''.join(p.findAll(text=True))
    data["news"].append(page_text)
    print(page_text)
    return data
However, the output of page_text looks like this:
"['New news on the internet. ', 'Here is some text. ', ""Here is some other."", ""And then there are other variations \n\nLooks like there are some non-text elements. \n\xa0""]" ...
Is it possible to clean up the content and merge the list into a single string? A BeautifulSoup solution would be preferred over a regex variant.
Thanks!
Answer 0 (score: 4)
I'm not sure how important it is to maintain data["news"], but this can be done in one line:
page_text = ' '.join(e.text for p in page for e in p.findAll(text=True))
You can use any string you like as the delimiter in place of ' '.
Otherwise:
page_text = []
for p in page:
    page_text.extend(e.text for e in p.findAll(text=True))
data["news"].append(page_text)
print(' '.join(page_text))
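If the stray whitespace and the trailing \xa0 also need to go, one possible variation is to let BeautifulSoup's get_text() do the stripping and then collapse whatever whitespace is left. This is only a sketch, not the code from the answer above: the function name extract_news_text, the separator parameter and the sample HTML are made up for illustration, and it uses the built-in "html.parser" rather than "lxml" so it runs without extra dependencies.

```python
from bs4 import BeautifulSoup


def extract_news_text(html, separator=" "):
    """Return the text of every <p> inside div.container_news as one string."""
    # Assumption: "html.parser" is used here; the question uses "lxml".
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"class": "container_news"})
    if container is None:
        return ""
    # get_text(" ", strip=True) trims each text fragment and drops empty ones.
    paragraphs = (p.get_text(" ", strip=True) for p in container.find_all("p"))
    # split() + join collapses remaining whitespace runs (newlines, "\xa0").
    return separator.join(" ".join(text.split()) for text in paragraphs if text)


# Hypothetical sample markup, loosely modelled on the output in the question.
sample = (
    '<div class="container_news">'
    "<p>New news on the internet. </p>"
    "<p>Here is <b>some</b> text. </p>"
    "<p>Looks like there are some non-text elements. \n\xa0</p>"
    "</div>"
)
print(extract_news_text(sample))
```

The result is a single plain string, so it can be appended to data["news"] directly. Pass a different separator (for example "\n") if the paragraphs should stay on separate lines.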