#Extract Record
for person in soup.find_all(['b','h1']):
    with open('test.csv', 'a') as csv_file:
        writer = csv.writer(csv_file)
        #Header
        header_tag = soup.find_all('h1')[k]
        k += 1
        header = header_tag.text.strip().replace('\n', ' ').encode('windows-1252', errors='replace')
        print(header)
        #Name
        name_tag = header_tag.find_all_next('p')[1]
        name = name_tag.text.strip().replace('\n', ' ').encode('windows-1252', errors='replace')
        print(name)
        #writer.writerow([name])
        #Workplace
        workplace_tag = name_tag.find_all_next('i')[0]
        workplace = workplace_tag.text.strip().replace('\n', ' ').encode('windows-1252', errors='replace')
        print(workplace)
        #writer.writerow([workplace])
        #Abstract
        while workplace_tag.find_all_next('p')[l] != 'h1':
            abstract_tag = workplace_tag.find_all_next('p')[l]
            abstract = abstract_tag.text.strip().replace('\n', ' ').encode('windows-1252', errors='replace')
            l += 1
            print(abstract)
            #writer.writerow([abstract])
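The code above outputs what I need, with one exception: I run into trouble when I try to combine the p tags collected by the while loop at the bottom of the Abstract section.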
Using print(abstract, end='') does not work as expected.
I also tried this approach:
#Abstract
abstracts = ''
while workplace_tag.find_all_next('p')[l] != 'h1':
    abstract_tag = workplace_tag.find_all_next('p')[l]
    abstract = abstract_tag.text.strip().replace('\n', ' ').encode('windows-1252', errors='replace')
    l += 1
    abstracts += abstract.decode('windows-1252', errors='replace')
    print(abstracts)
    #writer.writerow([abstract])
This code almost works, but it makes my while loop condition always true, so it prints the same first set of p tags over and over without end.
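One likely reason the condition never becomes false, sketched below: the loop compares a BeautifulSoup Tag object to the string 'h1', and a Tag does not compare equal to a plain string. Also, find_all_next('p') only ever returns p tags and keeps going past the next h1 into later records. The markup here is made up for illustration and is not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented only to demonstrate the comparison.
html = "<h1>Record 1</h1><p>Abstract A</p><h1>Record 2</h1><p>Abstract B</p>"
soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h1")
paragraphs = heading.find_all_next("p")

# A Tag is never equal to a plain string, even when the tag really is an <h1>,
# so `tag != 'h1'` is True for every tag tested here.
always_true = [tag != "h1" for tag in list(paragraphs) + [heading]]

# The tag's .name attribute is what identifies the element type.
names = [tag.name for tag in paragraphs]

print(always_true)
print(names)
print(len(paragraphs))  # the search crosses the second <h1> and finds both <p> tags
```

Checking tag.name == 'h1' instead of comparing the tag itself gives a condition that can actually become false.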
The output I currently get from my code is:
Name
Workplace
Abstract A
Abstract b
Abstract c
But I need it to look like:
Name
Workplace
Abstract A, Abstract b, Abstract c
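A minimal sketch of one way to get that single line, assuming the abstract paragraphs sit between the workplace tag and the next h1. The helper name collect_abstract and the sample markup are hypothetical, not taken from the real page. The idea is to search for p and h1 tags together, stop at the first h1, collect the text pieces in a list, and join them once at the end.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one record followed by the next one.
html = """
<h1>Header</h1>
<p>Name</p>
<i>Workplace</i>
<p>Abstract A</p>
<p>Abstract b</p>
<p>Abstract c</p>
<h1>Next header</h1>
<p>Belongs to the next record</p>
"""
soup = BeautifulSoup(html, "html.parser")


def collect_abstract(start_tag):
    """Join the text of each <p> after start_tag, stopping at the next <h1>."""
    parts = []
    # Asking for both tag names keeps document order, so the <h1> acts as a stop marker.
    for tag in start_tag.find_all_next(["p", "h1"]):
        if tag.name == "h1":
            break
        parts.append(tag.text.strip().replace("\n", " "))
    return ", ".join(parts)


workplace_tag = soup.find("i")
abstracts = collect_abstract(workplace_tag)
print(abstracts)  # Abstract A, Abstract b, Abstract c
```

Keeping the pieces as str and joining them with ', ' avoids the encode/decode round trip; encoding can be left to the moment the row is written out.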