I am using the following code:
import urllib
import urllib.request
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
years = list(range(1956,2016))
for year in years:
    my_urls = ('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm',)
    my_url = my_urls[0]
    for my_url in my_urls:
        uClient = uReq(my_url)
        html_input = uClient.read()
        uClient.close()
        page_soup = BeautifulSoup(html_input, "html.parser")
        container = page_soup.findAll("li")
        filename = "singoli" + str(year) + ".csv"
        f = open(filename, "w")
        headers = "lista"
        f.write(headers)
        lista = container[0].text
        print("lista: " + lista)
        f.write(lista + "\n")
        f.close()
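The output contains text that does not appear to belong to the "li" container, yet it gets written to the file. This is the unwanted text: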
<!--
google_ad_client = "ca-pub-9635531430093553";
/* in medias res */
google_ad_slot = "9880694813";
google_ad_width = 468;
google_ad_height = 60;
//-->
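How do I get rid of it?
Answer 0 (score: 0)
The text you don't want comes from a script element. Remove the script elements before extracting the text and it works: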
import urllib
import urllib.request
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
years = list(range(1956,2016))
for year in years:
    my_urls = ('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm',)
    my_url = my_urls[0]
    for my_url in my_urls:
        uClient = uReq(my_url)
        html_input = uClient.read()
        uClient.close()
        page_soup = BeautifulSoup(html_input, "html.parser")
        [s.extract() for s in page_soup('script')]
        container = page_soup.findAll("li")
        filename = "singoli" + str(year) + ".csv"
        f = open(filename, "w")
        headers = "lista"
        f.write(headers)
        lista = container[0].text
        print("lista: " + lista)
        f.write(lista + "\n")
        f.close()
All I did was add one line:
[s.extract() for s in page_soup('script')]
It finds the script elements and removes them from the parsed tree.
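
If you want to try the idea without hitting the site, here is a minimal offline sketch. The sample HTML and the function name list_item_texts are made up for illustration and are not the real page markup. It also swaps the list comprehension for a plain loop and uses decompose() rather than extract(), on the assumption that the removed tags are not needed afterwards.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: an ad script nested inside an <li>.
SAMPLE_HTML = """
<ul>
  <li>Song A
    <script type="text/javascript"><!--
google_ad_client = "ca-pub-9635531430093553";
google_ad_width = 468;
//--></script>
  </li>
  <li>Song B</li>
</ul>
"""


def list_item_texts(html):
    """Return the text of every <li>, with <script> elements removed first."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop every <script> element from the tree before reading any text.
    for script in soup.find_all("script"):
        script.decompose()
    # strip=True trims the whitespace left behind around the removed tags.
    return [li.get_text(strip=True) for li in soup.find_all("li")]


items = list_item_texts(SAMPLE_HTML)
print(items)
```

extract() would work here too; it returns the removed tag, which is only useful if you want to keep it around.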