我正在抓取worldometers主页以使用Python提取表中的数据,但由于值输入不正确,我感到很苦恼。 (字符串是...(国家/地区:美国,西班牙,意大利...)。
import requests
import lxml.html as lh
import pandas as pd
from tabulate import tabulate
url="https://www.worldometers.info/coronavirus/"
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')
#Create empty list
col=[]
colLen = len(tr_elements[1])
i=0
#For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
i+=1
name=t.text_content()
print ('%d:"%s"'%(i,name))
col.append((name,[]))
print(colLen)
#Since out first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
#T is our j'th row
T=tr_elements[j]
if len(T)!=len(tr_elements[0]): break
#i is the index of our column
i=0
#Iterate through each element of the row
for t in T.iterchildren():
data=t.text_content()
#Append the data to the empty list of the i'th column
col[i][1].append(data)
#Increment i for the next column
i+=1
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
df.head()
#Print Total Cases Col (this is incorrect when comparing to the webpage)
print(col[1][0:])
#Print Country Col (this is correct)
print(col[0][0:])
我似乎无法弄清楚问题是什么。请帮助解决问题。我也乐意建议您以其他方式进行此操作:)