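I'm learning web scraping, and while formatting the scraped data I ran into a problem: my two variables, first_line and second_line, both show the same value, and that value is the one meant for second_line.
When I print first_line inside the else branch I get the expected result, but outside the if/else, first_line shows the value copied from second_line.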
while current_page < 201:
    print(current_page)
    url = base_url + loc + "&start=" + str(current_page)
    yelp_r = requests.get(url)
    yelp_soup = BeautifulSoup(yelp_r.text, 'html.parser')
    file_path = 'yelp-{loc}-2.txt'.format(loc=loc)
    with open(file_path, "a") as textfile:
        business = yelp_soup.findAll('div',{'class':'biz-listing-large'})
        for biz in business:
            title = biz.findAll('a', {'class':'biz-name'})[0].text
            print(title)
            second_line = ""
            first_line = ""
            try:
                address = biz.findAll('address')[0].contents
                for item in address:
                    if "br" in str(item):
                        second_line = second_line + item.getText()
                    else:
                        first_line = item.strip(" \n\t\r")
                        print(first_line)
                print(first_line)
                print(second_line)
            except:
                pass
            print('\n')
            try:
                phone = biz.findAll('span',{'class':'biz-phone'})[0].text
            except:
                phone = None
            print(phone)
            page_line = "{title}\n{address_1}\n{address_2}\n{phone}".format(
                title=title,
                address_1=first_line,
                address_2=second_line,
                phone=phone
            )
            textfile.write(page_line)
    current_page += 10
Answer 0 (score: 0)
If you call .get_text() on a node, it gives you the full text. You can then split on newlines to get the first and second lines:
first_line, second_line = biz.findAll('address')[0].get_text().split('\n')
However, since you are only printing f'{first_line}\n{second_line}', why do you need to separate them at all?
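If you do want the two lines separately (for example to write them to the file), here is a minimal sketch of the same idea. It assumes the address markup looks roughly like the hypothetical sample_html below, with the two lines separated by a <br> tag; the real Yelp markup may differ. The helper name split_address is made up for illustration. Passing separator="\n" and strip=True to get_text() inserts a newline between text fragments and trims each one, so the result does not depend on whitespace in the page source, and it avoids the ValueError that the two-name unpacking raises when the split does not yield exactly two parts.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one <address> block on the page.
sample_html = "<address>\n    123 Main St<br>Springfield, IL 62701\n</address>"


def split_address(address_tag):
    """Return (first_line, second_line) from an <address> tag (illustrative helper)."""
    # separator="\n" joins the text fragments with newlines;
    # strip=True trims each fragment and drops empty ones.
    lines = address_tag.get_text(separator="\n", strip=True).split("\n")
    first_line = lines[0]
    # Anything after the first fragment is treated as the second line.
    second_line = " ".join(lines[1:])
    return first_line, second_line


soup = BeautifulSoup(sample_html, "html.parser")
first_line, second_line = split_address(soup.find("address"))
print(first_line)
print(second_line)
```

In the loop from the question, this would be called as split_address(biz.findAll('address')[0]) in place of the inner for loop over address.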