Excluding unwanted multi-line text when parsing HTML with BeautifulSoup

Date: 2017-09-07 17:08:04

Tags: python python-3.x parsing beautifulsoup

I am using the following code:

import urllib
import urllib.request
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
years = list(range(1956,2016))


for year in years:
    my_urls = ('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm',)
    my_url = my_urls[0]
    for my_url in my_urls:
        uClient = uReq(my_url)
        html_input = uClient.read()
        uClient.close()
        page_soup = BeautifulSoup(html_input, "html.parser")
        container = page_soup.findAll("li")
        filename = "singoli" + str(year) + ".csv"
        f = open(filename, "w")
        headers = "lista"
        f.write(headers)
        lista = container[0].text
        print("lista: " + lista)
        f.write(lista + "\n")
        f.close()

The text I get does not appear to be inside an "li" container, yet it is still written to the output. This is the unwanted text:

<!--
google_ad_client = "ca-pub-9635531430093553";
/* in medias res */
google_ad_slot = "9880694813";
google_ad_width = 468;
google_ad_height = 60;
//-->

How do I get rid of it?

1 answer:

Answer 0 (score: 0)

The text you don't want comes from a script element, so remove the script elements before you start processing and it works:

import urllib
import urllib.request
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
years = list(range(1956,2016))


for year in years:
    my_urls = ('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm',)
    my_url = my_urls[0]
    for my_url in my_urls:
        uClient = uReq(my_url)
        html_input = uClient.read()
        uClient.close()
        page_soup = BeautifulSoup(html_input, "html.parser")
        [s.extract() for s in page_soup('script')]
        container = page_soup.findAll("li")
        filename = "singoli" + str(year) + ".csv"
        f = open(filename, "w")
        headers = "lista"
        f.write(headers)
        lista = container[0].text
        print("lista: " + lista)
        f.write(lista + "\n")
        f.close()

All I did was add one line:

[s.extract() for s in page_soup('script')]

It finds every script element and removes it from the parse tree, so its contents never show up in the text of the surrounding elements.
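The same cleanup can also be written as plain loops, and it can be extended to strip stray HTML comments as well. Below is a minimal sketch, assuming the same page_soup object as in the answer; the Comment pass is an extra precaution I am adding for illustration, not something the original answer requires:

from bs4 import Comment

# remove <script> (and, optionally, <style>) tags entirely;
# decompose() deletes the tag and its contents from the tree
for tag in page_soup(["script", "style"]):
    tag.decompose()

# also drop bare HTML comments such as the "<!-- ... //-->" block,
# in case any of them sit outside a <script> tag
for comment in page_soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()

decompose() differs from extract() only in that it discards the removed tag instead of returning it; either way, the <li> text written to the CSV no longer contains the ad snippet.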