I am trying to export scraped data to a .txt file:
from bs4 import BeautifulSoup
import requests
import os

os.getcwd()   # '/home/folder'
os.mkdir("Probeersel6")
os.chdir("Probeersel6")
os.getcwd()   # '/home/Desktop/folder'
os.mkdir("img") #now `folder`

url = "http://nos.nl/artikel/2093082-steeds-meer-nekklachten-bij-kinderen-door-gebruik-tablets.html"
r = requests.get(url)
soup = BeautifulSoup(r.content)
data = soup.find_all("article", {"class": "article"})

with open(""%s".txt", "wb" %(url)) as file:
    for item in data:
        print item.contents[0].find_all("time", {"datetime": "2016-03-16T09:50:30+0100"})[0].text
        print item.contents[0].find_all("a", {"class": "link-grey"})[0].text
        print "\n"
        print item.contents[0].find_all("img", {"class": "media-full"})[0]
        print "\n"
        print item.contents[1].find_all("div", {"class": "article_textwrap"})[0].text
Where should file.write() go to make this work?
I am also trying to give the .txt file the same name as the URL. Should I use string formatting for that?
with open(""%s".txt", "wb" %(url)) as file:
url = "http://nos.nl/artikel/2093082-steeds-meer-nekklachten-bij-kinderen-door-gebruik-tablets.html"
Answer 0 (score: 2)
You need to pass the content to file.write() as its argument. I would probably do something like this:
#!/usr/bin/python3
#
from bs4 import BeautifulSoup
import requests

url = 'http://nos.nl/artikel/2093082-steeds-meer-nekklachten-bij-kinderen-door-gebruik-tablets.html'
# Use the last path segment of the URL, minus its extension, as the file name.
file_name = url.rsplit('/', 1)[1].rsplit('.', 1)[0]

r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
data = soup.find_all('article', {'class': 'article'})

# Build the whole text first: one formatted chunk per article.
content = ''.join('{}\n{}\n\n{}\n{}'.format(
    item.contents[0].find_all('time', {'datetime': '2016-03-16T09:50:30+0100'})[0].text,
    item.contents[0].find_all('a', {'class': 'link-grey'})[0].text,
    item.contents[0].find_all('img', {'class': 'media-full'})[0],
    item.contents[1].find_all('div', {'class': 'article_textwrap'})[0].text,
) for item in data)

# Write everything in one call, with an explicit encoding.
with open('./{}.txt'.format(file_name), mode='wt', encoding='utf-8') as file:
    file.write(content)
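On the file-name part of the question: a full URL contains characters such as "/" and ":" that cannot appear in a file name, which is why the code above keeps only the last path segment. Here is a minimal sketch of the same idea using only the standard library. The helper name filename_from_url and the "index" fallback are assumptions made for illustration, not part of the answer above.

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(url, extension=".txt"):
    """Hypothetical helper: derive a file name from the last path segment of a URL."""
    path = urlparse(url).path                      # "/artikel/2093082-...-tablets.html"
    last_segment = posixpath.basename(path)        # drop the leading directories
    stem, _ = posixpath.splitext(last_segment)     # drop the ".html" suffix
    return (stem or "index") + extension           # fallback when the path ends in "/"


url = "http://nos.nl/artikel/2093082-steeds-meer-nekklachten-bij-kinderen-door-gebruik-tablets.html"
print(filename_from_url(url))
```

The result can be passed straight to open(), for example open(filename_from_url(url), mode='wt', encoding='utf-8').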
Answer 1 (score: 0)
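I was working on a web scraping project and this problem gave me a lot of trouble. I tried almost every solution for dealing with Python encodings (converting to UTF with string.encode(), converting to ASCII, converting with the 'unicodedata' module, using .decode() followed by .encode(), blood sacrifices to Tim Peters, and so on).
None of them worked consistently, which struck me as very un-Pythonic.
So what I ended up using is the following: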
html = bs.prettify()  # bs is your BeautifulSoup object
with open("out.txt", "w") as out:
    for ch in html:
        try:
            out.write(ch)
        except Exception:
            pass  # skip any character that cannot be written
It isn't perfect, but it gave me the best results. When I open the output file in a browser, the page is parsed correctly almost every time.
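The character-by-character loop works by skipping whatever the default encoding cannot represent, so some text is silently lost. Assuming Python 3, a sketch of an alternative is to give open() an explicit encoding, and an errors policy only when a limited encoding is really required. The save_text helper and the sample string are illustrative stand-ins; in practice the text would come from bs.prettify().

```python
import os
import tempfile


def save_text(text, path, encoding="utf-8", errors="strict"):
    """Hypothetical helper: write text using an explicit encoding and error policy."""
    with open(path, "w", encoding=encoding, errors=errors) as out:
        out.write(text)


sample = "caf\u00e9 costs 2\u20ac"    # stand-in for bs.prettify()
out_dir = tempfile.mkdtemp()          # scratch directory for the demo files

utf8_path = os.path.join(out_dir, "out.txt")
ascii_path = os.path.join(out_dir, "out_ascii.txt")

save_text(sample, utf8_path)                                       # keeps every character
save_text(sample, ascii_path, encoding="ascii", errors="replace")  # "?" marks unencodable characters
```

With UTF-8 nothing is dropped. The errors="replace" variant still loses information, but it leaves a visible "?" where a character could not be encoded instead of omitting it.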