This is the code I wrote:
import urllib2
import codecs
import urllib
import re
from bs4 import BeautifulSoup
from lxml.html import fromstring
import codecs
url="http://www.thehindu.com/sci-tech/science/iit-bombay-birds-eye-view-and-quantum-biology/article18191268.ece"
htmltext = urllib.urlopen(url).read()
resp = urllib.urlopen(url)
respData =resp.read()
paras = re.findall(r'<p>(.*?)</p>',str(respData))
soup = BeautifulSoup(htmltext,"lxml")
webpage_title = soup.find_all('h1', attrs = {"class": "title"})
webpage_title = webpage_title[0].get_text(strip=True)
with codecs.open("E:\\Crawler_paras_sorted_test_webpages_complete.txt", "w+", encoding="utf-8") as f:
    f.write(webpage_title)
soup = BeautifulSoup(htmltext,"lxml")
ut_container = soup.find("div", {"class": "ut-container"})
time = ut_container.find("none").text.strip()
with codecs.open("E:\\Crawler_paras_sorted_test_webpages_complete.txt", "a+",encoding="utf-8") as f:
    f.write(time)
The output written to the file is:
IIT Bombay: Bird’s eye view and quantum biologyApril 22, 2017 18:56 IST
I want the output to be saved like this:
IIT Bombay: Bird’s eye view and quantum biology
April 22, 2017 18:56 IST
Answer 0 (score: 0)
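Since the question is quite general, I am only offering an idea for this context.
After writing webpage_title, you need to add a newline yourself, because write() does not append one: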
f.write(webpage_title)
f.write("\n")
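
The idea above could be sketched as follows, assuming the title and timestamp have already been extracted as Unicode strings. The helper name `write_lines` and the temporary output path are hypothetical and not part of the original code; the sample strings are copied from the question's output. Writing both values inside one `with` block also avoids reopening the file in append mode.

```python
import codecs
import os
import tempfile


def write_lines(path, lines):
    """Write each string in `lines` to `path`, one per line, as UTF-8."""
    with codecs.open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line)
            f.write(u"\n")  # explicit newline; write() never adds one


# Stand-ins for the scraped values (taken from the question's sample output).
webpage_title = u"IIT Bombay: Bird\u2019s eye view and quantum biology"
time = u"April 22, 2017 18:56 IST"

# Hypothetical output location; the question writes to a fixed path on drive E.
out_path = os.path.join(tempfile.mkdtemp(), "crawler_output.txt")
write_lines(out_path, [webpage_title, time])
```

With this shape, each value ends up on its own line regardless of how many fields are added later.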
Answer 1 (score: 0)
I used the Windows-style "\r\n" line ending, and it worked like a charm:
import urllib2
import codecs
import urllib
import re
from bs4 import BeautifulSoup
from lxml.html import fromstring
import codecs
url="http://www.thehindu.com/sci-tech/science/iit-bombay-birds-eye-view-and-quantum-biology/article18191268.ece"
htmltext = urllib.urlopen(url).read()
resp = urllib.urlopen(url)
respData =resp.read()
paras = re.findall(r'<p>(.*?)</p>',str(respData))
soup = BeautifulSoup(htmltext,"lxml")
webpage_title = soup.find_all('h1', attrs = {"class": "title"})
webpage_title = webpage_title[0].get_text(strip=True)
with codecs.open("E:\\Crawler_paras_sorted_test_webpages_complete.txt", "w+", encoding="utf-8") as f:
    f.write(webpage_title+"\r\n")
soup = BeautifulSoup(htmltext,"lxml")
ut_container = soup.find("div", {"class": "ut-container"})
time = ut_container.find("none").text.strip()
with codecs.open("E:\\Crawler_paras_sorted_test_webpages_complete.txt", "a+",encoding="utf-8") as f:
    f.write(time)
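
One possible reason a bare "\n" may not have appeared to work on Windows: `codecs.open` always opens the underlying file in binary mode, so "\n" is written as-is rather than converted to "\r\n", and older versions of Notepad do not show a bare "\n" as a line break. As a hedged alternative, the sketch below uses `io.open`, which in text mode translates "\n" to the platform's line separator on write. The helper name `save_article_header` and the temporary path are hypothetical, and the sample strings stand in for the scraped values.

```python
import io
import os
import tempfile


def save_article_header(path, title, timestamp):
    """Write the title and timestamp on separate lines as UTF-8 text."""
    # Text mode with default newline handling converts u"\n" to os.linesep
    # on write ("\r\n" on Windows, "\n" on Linux and macOS).
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(title + u"\n")
        f.write(timestamp + u"\n")


# Hypothetical output location used only for this illustration.
header_path = os.path.join(tempfile.mkdtemp(), "article_header.txt")
save_article_header(
    header_path,
    u"IIT Bombay: Bird\u2019s eye view and quantum biology",
    u"April 22, 2017 18:56 IST",
)
```

This keeps the code free of hard-coded "\r\n" while still producing files that open correctly on Windows.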