How do I export a web-scraped table into a CSV with multiple rows?

Asked: 2017-06-12 09:32:58

Tags: python web-scraping beautifulsoup export-to-csv

I wrote this code in Python 2.7.13 to scrape a data table from a website.

import urllib2
from bs4 import BeautifulSoup
import csv
import os

out=open("proba.csv","rb")
data=csv.reader(out)

def make_soup(url):
    thepage = urllib2.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

maindatatable=""
soup = make_soup("https://www.mnb.hu/arfolyamok")

for record in soup.findAll('tr'):
    datatable=""
    for data in record.findAll('td'):
        datatable=datatable+","+data.text
    maindatatable = maindatatable + "\n" + datatable[1:]

header = "Penznem,Devizanev,Egyseg,Penznemforintban"
print maindatatable

file = open(os.path.expanduser("proba.csv"),"wb")

utf16_str1 =header.encode('utf16')
utf16_str2 = maindatatable.encode('utf16')
file.write(utf16_str1)
file.write(utf16_str2)
file.close()

I want to export it to a CSV with the following 4 columns:

"Penznem Devizanev Egyseg Penznemforintban"

The data are separated with ",", but the last two values are really a single value (283,45).

How can I solve this?
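
For illustration (the currency values below are made up), this is the split that happens when "," is both the field separator and the decimal mark: a parser such as Python's csv module, or Excel reading a comma-delimited file, sees five fields where the table only has four columns.

import csv

# one scraped record joined with commas; the rate uses a decimal comma (made-up values)
row = "EUR,euro,1,283,45"
print(list(csv.reader([row])))
# prints [['EUR', 'euro', '1', '283', '45']] -> five fields instead of four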

1 Answer:

Answer 0 (score: 0):

You cannot directly avoid that comma, but what you can do is use another separator, i.e. ; (semicolon). When you open the file in Excel, select the semicolon (;) as the separator and you will get the expected result!



import urllib2
from bs4 import BeautifulSoup
import csv
import os

out=open("proba.csv","rb")
data=csv.reader(out)

def make_soup(url):
    thepage = urllib2.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

maindatatable=""
soup = make_soup("https://www.mnb.hu/arfolyamok")

for record in soup.findAll('tr'):
    datatable=""
    for data in record.findAll('td'):
        datatable=datatable+";"+data.text
    maindatatable = maindatatable + "\n" + datatable[1:]

header = "Penznem;Devizanev;Egyseg;Penznemforintban"
print maindatatable

file = open(os.path.expanduser("proba.csv"),"wb")

utf16_str1 =header.encode('utf16')
utf16_str2 = maindatatable.encode('utf16')
file.write(utf16_str1)
file.write(utf16_str2)
file.close()
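
A minimal sketch of the same semicolon idea using Python's csv module instead of manual string concatenation (assuming Python 2.7 as in the question; the cells are encoded as UTF-8 here rather than the UTF-16 used above, so adjust the encoding if Excel needs something else). csv.writer handles the delimiter for you and also quotes any field that happens to contain it:

import urllib2
import csv
from bs4 import BeautifulSoup

# download and parse the exchange-rate page
soup = BeautifulSoup(urllib2.urlopen("https://www.mnb.hu/arfolyamok"), "html.parser")

with open("proba.csv", "wb") as f:
    writer = csv.writer(f, delimiter=';')
    writer.writerow(["Penznem", "Devizanev", "Egyseg", "Penznemforintban"])
    for record in soup.findAll('tr'):
        # header rows contain only <th> cells, so they yield an empty list and are skipped
        cells = [cell.text.strip().encode('utf-8') for cell in record.findAll('td')]
        if cells:
            writer.writerow(cells)

This also makes the unused csv.reader lines at the top of the script unnecessary, and each record lands on its own CSV row without any manual "\n" bookkeeping.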