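I am trying to do web scraping with requests and BeautifulSoup. The crawler code works, but the extraction part does not: the only data written to the output file is the header row. I have looked at dozens of examples online and still cannot solve the problem. Where am I going wrong?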
secondSoupParser = BeautifulSoup(raw_html, 'html.parser')
list_of_headers = []
list_of_paras = []
try:
    results_parser = secondSoupParser.find('div', attrs={'style':'padding-left:10px;width:98%'})
except AttributeError as e:
    logging.exception(e)
    sys.exit(1)
for div in results_parser.findAll('h2'):
    for para in div.findAll('p'):
        para_text = para.text.strip()
        list_of_paras.append(para_text)
    list_of_headers.append(list_of_paras)
filenameTest = (output_directory + '/'+ 'test' + '-' + timestamp + '.csv')
output_file2 = open(filenameTest, 'w', encoding='utf8')
writer2 = csv.writer(output_file2)
writer2.writerow(['Test'])
writer2.writerow(list_of_headers)
The HTML of the target page is structured as follows:
<div style="padding-left:10px;width:98%">
<p><i>Last revised: A date is here</i></p>
<h2>Header One</h2>
<p>Some text goes here.</p>
<h2>Header Two</h2>
<p>Some text goes here.</p>
<h2>Header Three</h2>
<p>Some text goes here.</p>
<h2>Header Four</h2>
<p>Some text goes here.</p>
<h2>Header Five</h2>
<p>Some text goes here.</p>
<h2>Header Six</h2>
<p>Some text goes here.</p>
</div>
Answer 0 (score: 1)
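The <p> tags are not nested inside the <h2> tags. They are siblings that follow them, so calling findAll('p') on an <h2> returns nothing, and there is no need to loop over the <h2> tags first. This should be enough to extract the text of the <p> tags into a list: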
results_parser = secondSoupParser.find('div', attrs={'style': 'padding-left:10px;width:98%'})
for para in results_parser.findAll('p'):
    para_text = para.text.strip()
    list_of_paras.append(para_text)
list_of_headers.append(list_of_paras)
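
If the goal is to keep each header together with its paragraph, something along these lines might work. This is only a minimal sketch, assuming every <h2> is followed by its own <p> as in the sample markup. The shortened raw_html string and the names container, rows and buffer are made up for illustration and are not from the original code. It also writes to an in-memory buffer instead of the real output file.

```python
import csv
import io

from bs4 import BeautifulSoup

# Shortened sample markup mirroring the structure shown in the question.
raw_html = """
<div style="padding-left:10px;width:98%">
<p><i>Last revised: A date is here</i></p>
<h2>Header One</h2>
<p>Some text goes here.</p>
<h2>Header Two</h2>
<p>Some text goes here.</p>
</div>
"""

soup = BeautifulSoup(raw_html, 'html.parser')
container = soup.find('div', attrs={'style': 'padding-left:10px;width:98%'})

# find() returns None when nothing matches instead of raising,
# so the result is checked explicitly here.
if container is None:
    raise SystemExit('container div not found')

rows = []
for header in container.find_all('h2'):
    # The <p> is a sibling that follows the <h2>, not a child of it.
    para = header.find_next_sibling('p')
    para_text = para.get_text(strip=True) if para is not None else ''
    rows.append([header.get_text(strip=True), para_text])

# In-memory buffer for illustration; a real file opened with
# newline='' could be passed to csv.writer in the same way.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Header', 'Text'])
writer.writerows(rows)  # one CSV row per [header, paragraph] pair
print(buffer.getvalue())
```

Note that writerows() emits one row per inner list, whereas writerow() on a list of lists puts each inner list into a single cell as its string representation. Starting from the <h2> tags also skips the leading "Last revised" paragraph, which a plain findAll('p') on the div would include.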