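This is my first time using Python scraping tools, so I have patched this code together from various places.
I have now run into two problems that I don't know how to solve:
First, my tbl list is only written to the first cell of test.csv, even though I have set a delimiter in .writer().
Second, the output in the CSV file has some encoding issues, even though I see no problems when I print it in the Python shell.
I am currently using Python 2.7.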
import urllib2
from bs4 import BeautifulSoup
import csv
import pandas as pd
site= "https://www.investing.com/currencies/usd-sgd-forward-rates"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site,headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
px_table = str(soup.find('table', attrs={'id':'curr_table'}))
print type(px_table)
tbl = pd.read_html(px_table, encoding='utf-8')
with open('test.csv', 'w') as myFile:
    wr = csv.writer(myFile, delimiter=' ')
    wr.writerow(tbl)
Output:
Unnamed: 0 Name Bid Ask High Low Chg. Time
0 NaN USDSGDÂ ONÂ FWD -0.85 0.15 -0.29 -1.19 0.75 9:40:00
1 NaN USDSGDÂ TNÂ FWD -0.50 -0.45 -0.35 -0.45 -0.08 9:43:00
2 NaN USDSGDÂ SNÂ FWD -0.30 -0.20 -0.29 -0.21 0.10 9:42:00
3 NaN USDSGDÂ SWÂ FWD -2.17 -1.69 -1.80 -1.80 -0.16 9:42:00
4 NaN USDSGDÂ 2WÂ FWD -5.32 -1.72 -3.58 -3.44 -1.22 9:43:00
5 NaN USDSGDÂ 3WÂ FWD -6.15 -4.35 -5.12 -5.17 -0.30 9:42:00
6 NaN USDSGDÂ 1MÂ FWD -8.53 -7.74 -8.00 -8.10 -0.17 9:42:00
7 NaN USDSGDÂ 2MÂ FWD -15.81 -14.81 -14.75 -15.15 -0.25 9:43:00
8 NaN USDSGDÂ 3MÂ FWD -25.00 -24.07 -23.53 -24.07 -0.40 9:42:00
9 NaN USDSGDÂ 4MÂ FWD -35.72 -27.72 -32.16 -32.37 -1.18 9:43:00
10 NaN USDSGDÂ 5MÂ FWD -46.53 -35.47 -40.00 -40.96 -2.41 9:42:00
11 NaN USDSGDÂ 6MÂ FWD -50.83 -48.67 -48.75 -50.00 0.94 9:42:00
12 NaN USDSGDÂ 7MÂ FWD -65.77 -53.06 -59.68 -58.69 -3.27 9:43:00
13 NaN USDSGDÂ 8MÂ FWD -79.41 -59.65 -66.98 -69.70 -6.61 9:42:00
14 NaN USDSGDÂ 9MÂ FWD -84.51 -73.85 -74.05 -79.19 -1.84 9:42:00
15 NaN USDSGDÂ 10MÂ FWD -102.16 -75.06 -85.01 -87.28 -9.66 9:43:00
16 NaN USDSGDÂ 11MÂ FWD -109.81 -84.92 -96.50 -96.31 -7.91 9:43:00
17 NaN USDSGDÂ 1YÂ FWD -107.88 -103.13 -104.47 -107.00 2.63 9:43:00
18 NaN USDSGDÂ 15MÂ FWD -140.08 -106.19 -132.00 -121.00 6.92 9:40:00
19 NaN USDSGDÂ 21MÂ FWD -200.00 -151.00 -185.50 -180.50 14.00 9:40:00
20 NaN USDSGDÂ 2YÂ FWD -196.50 -121.50 -162.40 -197.50 50.50 9:40:00
21 NaN USDSGDÂ 3YÂ FWD -355.00 -306.00 -347.00 -330.00 20.00 9:43:00
22 NaN USDSGDÂ 4YÂ FWD 145.00 211.00 0.00 0.00 1.00 31/07
23 NaN USDSGDÂ 5YÂ FWD 117.00 187.00 0.00 0.00 -4.00 31/07
24 NaN USDSGDÂ 7YÂ FWD 63.00 189.00 0.00 0.00 -1.00 31/07
25 NaN USDSGDÂ 10YÂ FWD -30.00 127.00 0.00 0.00 10.00 31/07
Answer 0 (score: 1)
You should use the pandas to_csv() function to write the table. You can also specify an encoding for the file, for example utf-8:
import urllib2
from bs4 import BeautifulSoup
import pandas as pd
site = "https://www.investing.com/currencies/usd-sgd-forward-rates"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page, "lxml")
px_table = str(soup.find('table', attrs={'id':'curr_table'}))
df_table = pd.read_html(px_table, encoding='utf-8')[0]
del df_table['Unnamed: 0']
df_table.to_csv('test.csv', encoding='utf-8', index=False)
This gives you a test.csv that starts like this:
Name,Bid,Ask,High,Low,Chg.,Time
USDSGD ON FWD,-1.35,0.65,-0.29,-1.19,0.25,12:10:00
USDSGD TN FWD,-0.54,-0.46,-0.35,-0.49,-0.12,11:11:00
USDSGD SN FWD,-0.43,-0.14,-0.29,-0.25,-0.03,12:11:00
USDSGD SW FWD,-1.99,-1.51,-1.8,-1.8,0.02,12:10:00
USDSGD 2W FWD,-5.63,-1.53,-3.58,-3.44,-1.53,12:11:00
This code also removes the unwanted Unnamed: 0 column and disables writing the index column to the CSV file.
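As for why everything landed in a single cell in the question: read_html() returns a list of DataFrames, and csv.writer's writerow() treats each element of the sequence it receives as one cell. A list holding one DataFrame therefore becomes one row with one cell containing the printed table. Below is a minimal sketch of that behaviour. It is written for Python 3 (an assumption, since the question uses 2.7) and uses a small hand-built DataFrame as a hypothetical stand-in for the scraped table, so it runs without network access.

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for the scraped table; values copied from the CSV sample above.
df = pd.DataFrame({
    "Name": ["USDSGD ON FWD", "USDSGD TN FWD"],
    "Bid": [-1.35, -0.54],
})
tbl = [df]  # same shape as the result of pd.read_html(): a list of DataFrames

# What the question did: the list has one element, so the row has one cell.
buf = io.StringIO()
csv.writer(buf).writerow(tbl)
one_cell_rows = list(csv.reader(io.StringIO(buf.getvalue())))
print(len(one_cell_rows), len(one_cell_rows[0]))

# What to_csv() does instead: one CSV row per DataFrame row, plus a header row.
csv_text = tbl[0].to_csv(index=False)
proper_rows = list(csv.reader(io.StringIO(csv_text)))
print(proper_rows[0])
```

Picking the DataFrame out of the list first (the [0] in the code above) and letting pandas do the writing avoids the problem entirely.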
Alternatively, you can drop BeautifulSoup altogether, since read_html() returns a list of dataframes for all the tables it is able to find:
import urllib2
import pandas as pd
site = "https://www.investing.com/currencies/usd-sgd-forward-rates"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)
df_table = pd.read_html(page.read(), encoding='utf-8')[1]
df_table.drop(df_table.columns[[0]], axis=1, inplace=True)
df_table['Name'] = df_table['Name'].str.encode('ascii', errors='ignore')
df_table.to_csv('test.csv', encoding='ascii', index=False)
This approach also forces the Name column to be converted to ASCII.
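The stray "Â" characters in the question's output most likely come from non-breaking spaces (U+00A0) in the Name cells: UTF-8 stores that character as the bytes C2 A0, and a program that reads the file as Latin-1 or a similar single-byte encoding shows C2 as "Â". The sketch below reproduces the effect with the standard library only. The sample name is an assumed form of a scraped cell, not data taken from the site.

```python
NBSP = u"\u00a0"  # non-breaking space
# Assumed form of one scraped Name cell (hypothetical sample value).
name = u"USDSGD" + NBSP + u"ON" + NBSP + u"FWD"

# UTF-8 stores NBSP as the two bytes C2 A0.
utf8_bytes = name.encode("utf-8")

# Decoding those bytes as Latin-1 turns C2 into "Â", as in the question's output.
garbled = utf8_bytes.decode("latin-1")
print(repr(garbled))

# Option from the answer: drop every non-ASCII character (this glues the words together).
ascii_only = name.encode("ascii", "ignore").decode("ascii")
print(ascii_only)

# Alternative: swap NBSP for an ordinary space, which keeps the words apart.
cleaned = name.replace(NBSP, u" ")
print(cleaned)
```

Applied to the DataFrame, the replacement variant would be df_table['Name'].str.replace(u'\xa0', u' '). On Python 3, as far as I know, .str.encode() yields bytes objects, so the replacement is probably the safer choice there.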