Question

I am scraping a news website with the following code:

# -*- coding: utf-8 -*-
archivo = open("News_Content.txt","w")
import urllib
import re
from BeautifulSoup import BeautifulSoup
links = open("MyFileWithLinks.txt").readlines()
i = 0
while i< len(links):
    conn = urllib.urlopen(links[i])
    html = conn.read()
    soup = BeautifulSoup(html)
    p = soup.find("div", attrs={'class':'single-content'})
    p1 = p.text
    p2 = BeautifulSoup(p1)
    archivo.write(str(p2))
    archivo.write("\n")
    print(i)
    i = i + 1
print("DONE")
archivo.close()

However, when I print the news content, the result is:

Some Useful Text .googletag.cmd.push(function() { googletag.display('div-gpt-ad-1417813885451-0'); }) More Useful Text
$("ul.social_media").clone(true).prependTo( "#redes-bottom" );
            });

I want to remove all of the googletag code. I tried using replace, but it did not work. Can you help me?

Answer 1

Could you use CSS selectors and then call the get_text() method on each element returned?

For example:

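import io
import urllib  # Python 2, as in the question; Python 3 uses urllib.request.urlopen
from bs4 import BeautifulSoup  # select() and get_text() need BeautifulSoup 4 (bs4)

# io.open with an explicit encoding avoids UnicodeEncodeError on non-ASCII text
with io.open('News_Content.txt', 'w', encoding='utf-8') as f_out:
    with open('MyFileWithLinks.txt') as f_in:
        for link in f_in:
            content = urllib.urlopen(link.strip()).read()  # strip the trailing newline
            soup = BeautifulSoup(content, 'html.parser')
            # Keep only the <p> elements inside div.single-content
            tags = soup.select('div.single-content p')
            for tag in tags:
                f_out.write(tag.get_text() + u'\n')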
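
If the useful text is not all inside p elements, another option is to delete the script elements from the container before extracting its text, so their source never ends up in the output. Here is a minimal sketch, assuming BeautifulSoup 4 (bs4). SAMPLE_HTML and extract_article_text are made-up names for illustration, and the markup is only a guess modelled on the output in the question, not the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, modelled on the output shown in the question.
SAMPLE_HTML = """
<div class="single-content">
  <p>Some Useful Text</p>
  <script>googletag.cmd.push(function() { googletag.display('div-gpt-ad-1417813885451-0'); });</script>
  <p>More Useful Text</p>
  <script>$("ul.social_media").clone(true).prependTo("#redes-bottom");</script>
</div>
"""


def extract_article_text(html):
    """Return the text of div.single-content, one text node per line."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="single-content")
    if container is None:
        # The page has no such div; return an empty string instead of failing.
        return ""
    # Remove script and style elements so their source is not part of the text.
    for tag in container.find_all(["script", "style"]):
        tag.decompose()
    # stripped_strings yields each remaining text node with whitespace trimmed,
    # skipping nodes that are empty after trimming.
    return "\n".join(container.stripped_strings)


print(extract_article_text(SAMPLE_HTML))
```

In the loop above you would call extract_article_text(content) and write the result to f_out. Checking for None also avoids an AttributeError on pages that do not contain the div at all.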

Removing GoogleTag from BeautifulSoup
