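My code below scrapes a website and exports the resulting DataFrame to an Excel file. However, I need to strip the unnecessary characters from the first column and tidy it up so that I don't have to rename the months by hand in the Excel file. Each row carries a contract code from the website, such as HOZ18 (December 2018) or HOZ19 (December 2019), which I'm not interested in, and I don't want the "\" either. So I only want Dec 18, Jan 19, Feb 20, etc. in the first column.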
from urllib.request import urlopen
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://shared.websol.barchart.com/quotes/quote.php?page=quote&sym=ho&x=13&y=8&domain=if&display_ice=1&enabled_ice_exchanges=&tz=0&ed=0"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')
Contracts=[]
LastPrice=[]
data_rows = soup.findAll('tr')[2:]
for td in data_rows:
    Contract = td.findAll('td')[0].text
    Contracts.append(Contract)
    LstPrice = td.findAll('td')[7].text
    LastPrice.append(LstPrice)
df = pd.DataFrame({'Contracts': Contracts, 'Previous Settled': LastPrice})
0 Cash (HOY00) 2.1884
1 Dec \'18 (HOZ18) 2.2371
2 Jan \'19 (HOF19) 2.2238
3 Feb \'19 (HOG19) 2.2125
Answer 0 (score: 0)
If you want to convert strings like Dec \'18 (HOZ18) into December 18, here is one solution.
1) Define a function that cleans the string:
# define a dictionary to convert short month names to full ones
month_mapper = {
    'Jan': 'January',
    'Feb': 'February',
    'Mar': 'March',
    'Apr': 'April',
    'May': 'May',
    'Jun': 'June',
    'Jul': 'July',
    'Aug': 'August',
    'Sep': 'September',
    'Oct': 'October',
    'Nov': 'November',
    'Dec': 'December',
}
def clean_month_string(s):
    # replace the '\' char with empty string
    s = s.replace('\\', '')
    # split into three pieces on space
    # eg, "Dec '18 (HOZ18)" ->
    # month = "Dec"
    # year = "'18"
    # code = "(HOZ18)"
    month, year, code = s.split(' ')
    # convert month using month mapper
    month = month_mapper[month]
    # remove the ' at the start of the year
    year = year.replace("'", "")
    # return new month and new year (dropping code)
    return ' '.join([month, year])
2) Use apply to run the function on every row of the DataFrame.
# drop that first row, which is not properly formatted
df = df.drop(0).reset_index(drop=True)
# apply the function to your 'Contracts' series.
df['Contracts'] = df['Contracts'].apply(clean_month_string)
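Since the question asks for short labels such as Dec 18 rather than December 18, a regex-based variant may be closer to what is wanted. The following is only a sketch: `CONTRACT_PATTERN` and `short_contract_label` are hypothetical names, and it assumes every contract string starts with a three-letter month, an optional backslash, an apostrophe and a two-digit year. Strings that don't fit (such as the Cash row) are returned unchanged instead of raising.

```python
import re

# Assumed shape: "Dec \'18 (HOZ18)" or "Dec '18 (HOZ18)".
# Group 1 is the three-letter month, group 2 the two-digit year.
CONTRACT_PATTERN = re.compile(r"^\s*([A-Za-z]{3})\s+\\?'(\d{2})\b")

def short_contract_label(s):
    """Return a label like 'Dec 18'; return s unchanged if it does not match."""
    match = CONTRACT_PATTERN.match(s)
    if match is None:
        # e.g. "Cash (HOY00)" has no month/year part, so leave it alone
        return s
    month, year = match.groups()
    return f"{month} {year}"

# Sample strings, with and without the stray backslash
samples = ["Cash (HOY00)", "Dec \\'18 (HOZ18)", "Jan '19 (HOF19)"]
labels = [short_contract_label(s) for s in samples]
print(labels)
```

It can be passed to apply in the same way as clean_month_string, and because non-matching rows pass through untouched, dropping row 0 first becomes optional.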
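Answer 1 (score: 0)
Here is an option that doesn't need .apply(). It assumes all years are in the 21st century, so I'm not sure whether it will work for you. It also stores the month as a number, which can be useful; if it isn't, you can drop that part.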
import pandas as pd
import re
import datetime
# Data setup.
data = pd.DataFrame(['Dec \'18 (HOZ18)', 'Jan \'19 (HOF19)', 'Feb \'19 (HOG19)'], columns = ['string'])
# Extract the month abbreviation using regex, then map it to a month number.
data['month_number'] = [datetime.datetime.strptime(re.sub(r"\s'.*", '', i), '%b').month for i in data['string']]
# Extract the two-digit year, prepend '20' and store as an integer.
data['year'] = [int('20' + re.search(r'\d\d', i).group(0)) for i in data['string']]
print(data)
This gives:
            string  month_number  year
0  Dec '18 (HOZ18)            12  2018
1  Jan '19 (HOF19)             1  2019
2  Feb '19 (HOG19)             2  2019
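If a single text label is still wanted for the Excel column, the month number and year can be joined back together. This is a rough sketch using only the standard library; `parse_contract` and `make_label` are hypothetical helper names, the same 21st-century assumption applies, and `%b` depends on the current locale (English abbreviations are assumed here).

```python
import datetime
import re

def parse_contract(s):
    """Return (month_number, four_digit_year) from a string like "Dec '18 (HOZ18)"."""
    # Keep only the text before the first whitespace, i.e. the month abbreviation
    month_abbr = re.sub(r"\s.*", "", s)
    month_number = datetime.datetime.strptime(month_abbr, "%b").month
    # First pair of digits is the two-digit year; assumes years 2000-2099
    year = int("20" + re.search(r"\d\d", s).group(0))
    return month_number, year

def make_label(month_number, year):
    """Format the pair as a short label such as 'Dec 18'."""
    return datetime.date(year, month_number, 1).strftime("%b %y")

strings = ["Dec '18 (HOZ18)", "Jan '19 (HOF19)", "Feb '19 (HOG19)"]
parsed = [parse_contract(s) for s in strings]
labels = [make_label(m, y) for m, y in parsed]
print(labels)
```

Keeping the numeric month and year alongside the label makes it easy to sort the contracts chronologically before exporting.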