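I am tracking Maersk cargo ships and want to automate the process. So far I can fetch the data, but the cleaning part is killing me.
I am using BS4 (BeautifulSoup).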
from bs4 import BeautifulSoup
import pandas as pd
import requests
import time
header = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
#gets the data
def get_data(x):
    soup = BeautifulSoup(requests.get(url, headers={"User-Agent":header}).text, 'lxml')
    data = soup.find_all("td")
    list_of_prices = [x.text for x in data]
    return list_of_prices
#convert to a dictionary that can easily be converted to a pandas dataframe
def Convert(a):
    pts = get_data(a)
    it = iter(pts)
    res_dct = dict(zip(it, it))
    return res_dct
# makes it a dataframe with the required columns
def make_df():
    todf = Convert(get_data(url))
    df = pd.DataFrame((todf), index=[0])
    keep_flag = df[['Flag']]
    keep_ETA = df[['ETA']]
    keep_speed = df[['Course / Speed']]
    keep_report = df[['Last report ']]
    new_df = pd.concat([keep_flag, keep_ETA, keep_speed, keep_report], axis = 1).T
    #date = pd.Timestamp.today()
    return new_df
# how I print
urls = {
    "EMMA MAERSK": "https://www.vesselfinder.com/vessels/EMMA-MAERSK-IMO-9321483-MMSI-220417000",
    "MANILA MAERSK": "https://www.vesselfinder.com/vessels/MANILA-MAERSK-IMO-9780469-MMSI-219038000"
}
for ele, url in urls.items():
    print(ele, make_df())
The output looks like this:
EMMA MAERSK                                      0
Flag                           Denmark
ETA                      Nov 24, 00:01
Course / Speed        232.0° / 11.7 kn
Last report     Nov 22, 2019 08:10 UTC
MANILA MAERSK                                      0
Flag                           Denmark
ETA                      Nov 23, 11:30
Course / Speed        182.4° / 13.4 kn
Last report     Nov 22, 2019 08:31 UTC
That is a readable format, but I would like to know how to turn it into a single DataFrame.
I tried:
new_df = []
for ele, url in urls.items():
    data = ele, make_df()
    ddf = new_df.append(data)
appended_data = pd.DataFrame(new_df)
appended_data.to_excel('appended.xlsx')
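But it does not give me the output I am hoping for.
I want the two columns placed side by side rather than stacked one below the other, so that EMMA MAERSK and MANILA MAERSK sit next to each other.
Thanks!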
Answer 0 (score: 1)
Using your own functions:
dictionary_list = []
for ele, url in urls.items():
    values_dict = Convert(get_data(url))
    values_dict["Name"] = ele
    dictionary_list.append(values_dict)
Create the DataFrame from dictionary_list:
pd.DataFrame(dictionary_list)[["Name", "Flag", "ETA", "Course / Speed", "Last report "]]
This returns:
            Name     Flag            ETA    Course / Speed             Last report
0    EMMA MAERSK  Denmark  Nov 24, 00:01  240.5° / 11.9 kn  Nov 22, 2019 08:59 UTC
1  MANILA MAERSK  Denmark  Nov 23, 11:30  179.6° / 14.1 kn  Nov 22, 2019 09:01 UTC
You can then rename the columns with rename.
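As a rough sketch of that rename step, and of one way to get the side-by-side layout the question asks for, the snippet below works offline on the two rows shown in the output above instead of scraping. The new column name and the variable side_by_side are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Stand-in for dictionary_list: the two rows from the output above,
# hard-coded so the example runs without any network access.
dictionary_list = [
    {"Name": "EMMA MAERSK", "Flag": "Denmark", "ETA": "Nov 24, 00:01",
     "Course / Speed": "240.5° / 11.9 kn",
     "Last report ": "Nov 22, 2019 08:59 UTC"},
    {"Name": "MANILA MAERSK", "Flag": "Denmark", "ETA": "Nov 23, 11:30",
     "Course / Speed": "179.6° / 14.1 kn",
     "Last report ": "Nov 22, 2019 09:01 UTC"},
]

df = pd.DataFrame(dictionary_list)

# rename takes a mapping of old column names to new ones;
# here it only drops the trailing space from the scraped label.
df = df.rename(columns={"Last report ": "Last report"})

# Use the vessel name as the index, then transpose so that
# each vessel becomes its own column.
side_by_side = df.set_index("Name").T
print(side_by_side)
```

Calling side_by_side.to_excel('appended.xlsx') should then write that layout, assuming an Excel writer engine such as openpyxl is installed.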
Answer 1 (score: 1)
You just need to collect all the data in one place and then convert it to a DataFrame:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import time
header = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
#gets the data
def get_data(url):
    # Fetch the page at the given url and return the text of every <td> cell
    soup = BeautifulSoup(requests.get(url, headers={"User-Agent":header}).text, 'lxml')
    cells = soup.find_all("td")
    return [cell.text for cell in cells]
#convert to a dictionary that can easily be converted to a pandas dataframe
def Convert(cells):
    # cells alternate label, value; zip(it, it) pairs them up
    # (the page is not fetched a second time here)
    it = iter(cells)
    res_dct = dict(zip(it, it))
    # keep only the wanted fields and collect them in the module-level list
    data.append({'flag' : res_dct.get('Flag',''),
                 'ETA' : res_dct.get('ETA',''),
                 'Course / Speed' : res_dct.get('Course / Speed',''),
                 'Last report' : res_dct.get('Last report ','')})
# how I print
urls = {
    "EMMA MAERSK": "https://www.vesselfinder.com/vessels/EMMA-MAERSK-IMO-9321483-MMSI-220417000",
    "MANILA MAERSK": "https://www.vesselfinder.com/vessels/MANILA-MAERSK-IMO-9780469-MMSI-219038000"
}
data = []
for ele, url in urls.items():
    Convert(get_data(url))
# one row per vessel
df = pd.DataFrame(data)
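To make the collect-then-convert idea easier to try without hitting the website, here is a minimal offline sketch. The cell lists are hard-coded stand-ins for the scraped td text (values copied from the output in the question), and cells_to_record is a hypothetical helper name, not something from the answer above. It also adds the vessel name to each record, which the code above does not do.

```python
import pandas as pd

def cells_to_record(name, cells):
    """Pair alternating label/value cells and keep the wanted fields.

    Hypothetical helper for illustration only.
    """
    it = iter(cells)
    # zip(it, it) pulls two items per step from the same iterator,
    # giving (cells[0], cells[1]), (cells[2], cells[3]), ...
    pairs = dict(zip(it, it))
    return {
        "Name": name,
        "Flag": pairs.get("Flag", ""),
        "ETA": pairs.get("ETA", ""),
        "Course / Speed": pairs.get("Course / Speed", ""),
        # the scraped label carries a trailing space ("Last report ")
        "Last report": pairs.get("Last report ", ""),
    }

# Assumed sample data standing in for get_data(url) results
sample_cells = {
    "EMMA MAERSK": [
        "Flag", "Denmark", "ETA", "Nov 24, 00:01",
        "Course / Speed", "232.0° / 11.7 kn",
        "Last report ", "Nov 22, 2019 08:10 UTC",
    ],
    "MANILA MAERSK": [
        "Flag", "Denmark", "ETA", "Nov 23, 11:30",
        "Course / Speed", "182.4° / 13.4 kn",
        "Last report ", "Nov 22, 2019 08:31 UTC",
    ],
}

# Collect one dict per vessel, then build the DataFrame in a single call
records = [cells_to_record(name, cells) for name, cells in sample_cells.items()]
df = pd.DataFrame(records)
print(df)
```

Building the list first and calling pd.DataFrame once keeps each vessel on its own row and avoids appending to a DataFrame inside the loop.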