I am scraping a news website with the following code:
    # -*- coding: utf-8 -*-
    import urllib
    import re
    from BeautifulSoup import BeautifulSoup

    archivo = open("News_Content.txt", "w")
    links = open("MyFileWithLinks.txt").readlines()
    i = 0
    while i < len(links):
        conn = urllib.urlopen(links[i])
        html = conn.read()
        soup = BeautifulSoup(html)
        p = soup.find("div", attrs={'class': 'single-content'})
        p1 = p.text
        p2 = BeautifulSoup(p1)
        archivo.write(str(p2))
        archivo.write("\n")
        print(i)
        i = i + 1
    print("DONE")
    archivo.close()
But when I print the news, the result is:
    Some Useful Text .googletag.cmd.push(function() { googletag.display('div-gpt-ad-1417813885451-0'); }) More Useful Text
    $("ul.social_media").clone(true).prependTo( "#redes-bottom" );
    });
I want to remove all of the googletag snippets. I tried using replace, but it didn't work. Can you help me?
Answer 0 (score: 1)
Could you use CSS selectors and then call the get_text() method on each returned object?
For example:
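    # Python 2; select() and get_text() need BeautifulSoup 4 (bs4)
    import urllib
    from bs4 import BeautifulSoup

    with open('News_Content.txt', 'w') as f_out:
        with open('MyFileWithLinks.txt') as f_in:
            for link in f_in:
                # strip the trailing newline from each URL
                content = urllib.urlopen(link.strip()).read()
                soup = BeautifulSoup(content)
                # only <p> elements inside the article container
                tags = soup.select('div.single-content p')
                for tag in tags:
                    # encode: Python 2 file objects expect byte strings
                    f_out.write(tag.get_text().encode('utf-8') + '\n')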
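
A likely reason the googletag text shows up is that the ad snippets live in <script> elements inside div.single-content, so taking the text of the whole div can pull in their source. As an alternative to selecting only the <p> elements, the sketch below removes those elements explicitly before extracting text. It assumes BeautifulSoup 4 (bs4) on Python 3. SAMPLE_HTML and extract_article_text are hypothetical names used for illustration, and the markup is invented to resemble the output shown in the question, not taken from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment standing in for one downloaded article.
SAMPLE_HTML = """
<div class="single-content">
  <p>Some Useful Text.</p>
  <script>googletag.cmd.push(function() { googletag.display('div-gpt-ad-0'); });</script>
  <p>More Useful Text</p>
  <script>$("ul.social_media").clone(true).prependTo("#redes-bottom");</script>
</div>
"""


def extract_article_text(html):
    """Return the text of div.single-content with script/style content removed."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.select_one("div.single-content")
    if container is None:
        # Assumption: pages without the container simply yield no text.
        return ""
    # Drop embedded JavaScript and CSS so their source never reaches get_text().
    for tag in container.find_all(["script", "style"]):
        tag.decompose()
    # Join the remaining non-empty text fragments, one per line.
    return container.get_text(separator="\n", strip=True)


if __name__ == "__main__":
    print(extract_article_text(SAMPLE_HTML))
```

In the scraping loop, the same function could be applied to each downloaded page in place of the p.text step. Whether the real pages keep their ads in <script> elements is an assumption worth checking against the page source.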