I found this code online and would like to know how to export the data it collects to a CSV file.
from urllib.request import urlopen
from bs4 import BeautifulSoup

# url holds the address of the page to scrape
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out
# get text
text = soup.body.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each (split on double spaces)
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
Answer 0 (score: 1)
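The code you have simply extracts all of the text from the given URL. This discards the page's structure, which makes it hard to tell where the text you want starts and ends.
On the page you linked, you can extract all of the headlines by looking at the HTML source and noting that each of the 5 stories has a unique HTML ID. You can use soup.find() to locate these elements and extract the text from them. That gives you the headline and summary for each article, which you can then write to a CSV file. The following was tested with Python 3.5.2: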
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

html = urlopen("http://www.thestar.com.my/news/nation/")
soup = BeautifulSoup(html, "html.parser")

# IDs found by looking at the HTML source in a browser
ids = [
    "slcontent3_3_ileft_0_hlFirstStory",
    "slcontent3_3_ileft_0_hlSecondStory",
    "slcontent3_3_ileft_0_lvStoriesRight_ctrl0_hlStoryRight",
    "slcontent3_3_ileft_0_lvStoriesRight_ctrl1_hlStoryRight",
    "slcontent3_3_ileft_0_lvStoriesRight_ctrl2_hlStoryRight"]

# newline="" lets the csv module control the line endings itself
with open("news.csv", "w", newline="", encoding='utf-8') as f_news:
    csv_news = csv.writer(f_news)
    csv_news.writerow(["Headline", "Summary"])

    for story_id in ids:
        # The headline is the <a> element carrying the ID
        headline = soup.find("a", id=story_id)
        # The summary is the first <p> that follows the headline
        summary = headline.find_next("p")
        csv_news.writerow([headline.text, summary.text])
This gives you a CSV file like the following:
Headline,Summary
Many say convicted serial rapist Selva still considered ‘a person of high risk’,PETALING JAYA: Convicted serial rapist Selva Kumar Subbiah will be back in the country from Canada in three days and a policeman who knows him says there is no guarantee that he will not strike again.
Liow: Way too many road accidents,"PETALING JAYA: Road accidents took the lives of 7,152 and incurred a loss of about RM9.2bil in Malaysia last year, says Datuk Seri Liow Tiong Lai."
Ex-civil servant wins RM27.4mil jackpot,PETALING JAYA: It was the ang pow of his life.
"Despite latest regulation, many still puff away openly at parks and R&R","KUALA LUMPUR: It was another cloudy afternoon when office workers hung out at the popular KLCC park, puffing away at the end of lunch hour, oblivious to the smoking ban there."
Police warn groups not to cause disturbances on Thaipusam,GEORGE TOWN: Police have warned supporters of the golden and silver chariots against provoking each other during the Thaipusam celebration next week.
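Because the IDs are hard-coded, soup.find() returns None as soon as the site changes its markup, and the script then fails on the attribute access. A minimal sketch of one way to guard against that, and to check that fields containing commas survive the CSV round trip, is below. The sample HTML, the IDs story_one, story_two and story_missing, and the helper names extract_rows and rows_to_csv are all made up for illustration; they are not taken from the real page.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
  <a id="story_one" href="#">Liow: Way too many road accidents</a>
  <p>PETALING JAYA: Road accidents took the lives of 7,152 last year.</p>
  <a id="story_two" href="#">Ex-civil servant wins jackpot</a>
  <p>PETALING JAYA: It was the ang pow of his life.</p>
</body></html>
"""


def extract_rows(html, ids):
    """Return [headline, summary] pairs for every ID found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for story_id in ids:
        headline = soup.find("a", id=story_id)
        if headline is None:
            # The ID is not on the page (for example after a redesign); skip it.
            continue
        summary = headline.find_next("p")
        summary_text = summary.get_text(strip=True) if summary else ""
        rows.append([headline.get_text(strip=True), summary_text])
    return rows


def rows_to_csv(rows):
    """Serialise the rows to CSV text, header row first."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(["Headline", "Summary"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_HTML, ["story_one", "story_missing", "story_two"])
csv_text = rows_to_csv(rows)
print(csv_text)

# Parse the text again to confirm the quoted fields come back unchanged.
parsed = list(csv.reader(io.StringIO(csv_text, newline="")))
```

The csv module puts quotes around any field that contains a comma, which is why some rows in the output above are quoted and others are not. Reading the file back with csv.reader undoes the quoting, so no manual escaping is needed.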