I'm new to Python, and I'm having trouble getting my for loop to extract all of the web links on a specific site. Here is my code:
import requests
import csv
from bs4 import BeautifulSoup

j= [["Population and Housing Unit Estimates"]] # Title
k= [["Web Links"]] # Column Headings
example_listing='https://www.census.gov/programs-surveys/popest.html' #Source
r=requests.get(example_listing) #Grab page source html
html_page=r.text
soup=BeautifulSoup(html_page,'html.parser') #Build Beautiful Soup object to help parse the html
with open('HTMLList.csv','w',newline="") as f: #Choose what you want to grab
    writer=csv.writer(f,delimiter=' ',lineterminator='\r')
    writer.writerows(j)
    writer.writerows(k)
    for link in soup.find_all('a'):
        f.append(link.get('href'))
        if not f:
            ""
        else:
            writer.writerow(f)
    f.close()
Any help is greatly appreciated. I really don't know where to go from here. Thank you!
Answer 0 (score: 2)
Assuming you want to save the URLs from the site to a CSV file, one per row: first, don't reuse `f`, which already refers to the file object (file objects have no `append` method). You can write each link directly to the CSV by wrapping it in a list: `writer.writerow([link.get('href')])`. Hope this helps; otherwise, please edit your question and add more details.
import csv
import requests
from bs4 import BeautifulSoup

j= [["Population and Housing Unit Estimates"]] # Title
k= [["Web Links"]] # Column Headings
example_listing='https://www.census.gov/programs-surveys/popest.html' #Source
r=requests.get(example_listing) #Grab page source html
html_page=r.text
soup=BeautifulSoup(html_page,'html.parser') #Build Beautiful Soup object to help parse the html
with open('HTMLList.csv','w', newline="") as f: #Choose what you want to grab
    writer=csv.writer(f, delimiter=' ',lineterminator='\r')
    writer.writerows(j)
    writer.writerows(k)
    for link in soup.find_all('a'):
        url = link.get('href')
        if url:
            writer.writerow([url])
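Note that a page's navigation is often repeated, so the same href can end up in the CSV many times. As a side note (not part of the answer above), here is a small sketch of a helper that keeps only the first occurrence of each href while preserving order; the sample hrefs are hypothetical:

```python
def unique_hrefs(hrefs):
    """Drop None values and duplicates, keeping first-seen order."""
    seen = set()
    out = []
    for h in hrefs:
        if h and h not in seen:
            seen.add(h)
            out.append(h)
    return out

# Hypothetical hrefs as soup.find_all('a') might yield them:
print(unique_hrefs(['/en.html', None, '/topics/housing.html', '/en.html']))
```

You could pass `[link.get('href') for link in soup.find_all('a')]` through this helper before the writing loop.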
Answer 1 (score: 1)
import requests
import csv
from bs4 import BeautifulSoup

j= ["Population and Housing Unit Estimates"] # Title
k= ["Web Links"] # Column Headings
example_listing='https://www.census.gov/programs-surveys/popest.html' #Source
r=requests.get(example_listing) #Grab page source html
html_page=r.text
soup=BeautifulSoup(html_page,'html.parser') #Build Beautiful Soup object to help parse the html
with open('HTMLList.csv','w',newline="") as f: #Choose what you want to grab
    writer=csv.writer(f,delimiter=' ',lineterminator='\r')
    writer.writerow(j)
    writer.writerow(k)
    for link in soup.find_all('a'):
        if link.get('href') is not None:
            writer.writerow([link.get('href')])
HTMLList.csv
"Population and Housing Unit Estimates"
"Web Links"
https://www.census.gov/en.html
https://www.census.gov/topics/population/age-and-sex.html
https://www.census.gov/topics/business-economy.html
https://www.census.gov/topics/education.html
https://www.census.gov/topics/preparedness.html
https://www.census.gov/topics/employment.html
https://www.census.gov/topics/families.html
https://www.census.gov/topics/population/migration.html
https://www.census.gov/geography.html
https://www.census.gov/topics/health.html
https://www.census.gov/topics/population/hispanic-origin.html
https://www.census.gov/topics/housing.html
https://www.census.gov/topics/income-poverty.html
https://www.census.gov/topics/international-trade.html
https://www.census.gov/topics/population.html
.......
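One more thing to be aware of: `link.get('href')` returns the attribute exactly as it appears in the HTML, which may be a relative path (`topics/housing.html`) or an in-page anchor (`#main`) rather than a full URL. A minimal sketch, assuming you want absolute URLs in the CSV, using the standard library's `urllib.parse.urljoin` to resolve hrefs against the page URL (the sample hrefs below are hypothetical):

```python
from urllib.parse import urljoin

base = 'https://www.census.gov/programs-surveys/popest.html'

# Hypothetical hrefs of the kinds a page may contain:
hrefs = ['/en.html', 'topics/housing.html', 'https://example.com/x', '#main', None]

links = []
for href in hrefs:
    if not href or href.startswith('#'):  # skip missing hrefs and in-page anchors
        continue
    links.append(urljoin(base, href))     # resolve relative paths against the page URL

print(links)
```

In the answers above, you would apply `urljoin(example_listing, url)` before `writer.writerow([...])`.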