Scraping the Worldometers homepage to extract the COVID-19 table data, but the values are not extracted correctly (Python)

Time: 2020-04-20 14:48:27

Tags: python html pandas web-scraping

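I am scraping the Worldometers homepage with Python to extract the data in the table, but I am struggling because the numeric values do not come through correctly. The string values do come through (Country: USA, Spain, Italy...).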

import requests
import lxml.html as lh
import pandas as pd
from tabulate import tabulate

url="https://www.worldometers.info/coronavirus/"
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')

#Create empty list
col=[]
colLen = len(tr_elements[1])
i=0
#For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print ('%d:"%s"'%(i,name))
    col.append((name,[]))

print(colLen)


#Since our first row is the header, data is stored on the second row onwards

for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]

    if len(T)!=len(tr_elements[0]): break

    #i is the index of our column
    i=0
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1

Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
df.head()

#Print Total Cases Col (this is incorrect when comparing to the webpage)
print(col[1][0:])

#Print Country Col (this is correct)
print(col[0][0:])

I can't seem to figure out what the problem is. Please help me solve it. I am also open to suggestions for doing this another way :)
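Two things in the code above may be worth checking. The XPath `//tr` matches rows from every table on the page, not only the one of interest, and the loop uses `break` on the first row whose cell count differs from the header, which stops collection entirely. Also, each entry of `col` is a `(name, list)` tuple, so `col[1][0:]` prints the whole tuple rather than just the values. The following is a minimal sketch of one alternative, not a confirmed fix: it reads rows from a single table selected by its `id` and skips mismatched rows instead of stopping. It uses the standard-library `html.parser` only to stay dependency-free; the same scoping idea should carry over to lxml with an XPath such as `//table[@id="..."]//tr`. The table ids and the sample HTML are assumptions made for illustration, and the numbers are made up.

```python
from html.parser import HTMLParser

# Made-up sample standing in for the real page; the table ids are assumptions.
SAMPLE_HTML = """
<table id="main_table_countries_today">
  <tr><th>Country</th><th>Total Cases</th></tr>
  <tr><td>USA</td><td>1,234</td></tr>
  <tr><td>Spain</td><td>567</td></tr>
</table>
<table id="other_table">
  <tr><td>ignored</td></tr>
</table>
"""


class TableRows(HTMLParser):
    """Collect the cell text of every row inside one table, chosen by id.

    Nested tables are not handled in this sketch.
    """

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.in_table = False
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # Only the table with the requested id is recorded.
            self.in_table = dict(attrs).get("id") == self.table_id
        elif self.in_table and tag == "tr":
            self.rows.append([])
        elif self.in_table and tag in ("td", "th"):
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif tag in ("td", "th"):
            if self.in_table and self.in_cell:
                # Trim surrounding whitespace once the cell is complete.
                self.rows[-1][-1] = self.rows[-1][-1].strip()
            self.in_cell = False

    def handle_data(self, data):
        if self.in_table and self.in_cell:
            self.rows[-1][-1] += data


parser = TableRows("main_table_countries_today")
parser.feed(SAMPLE_HTML)

header, *body = parser.rows
# Skip rows whose cell count differs from the header instead of stopping.
records = [dict(zip(header, row)) for row in body if len(row) == len(header)]
print(records)
```

Because each record is a dict keyed by header name, a list like `records` can be passed straight to `pd.DataFrame(records)` if a DataFrame is wanted.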

Data table on the webpage

Command Prompt output for Country (correct)

Command Prompt output for Total Cases (incorrect)
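The screenshots are not reproduced here, so it is not possible to say exactly how the Total Cases output differs from the page. One thing that is certain is that `text_content()` returns strings, so numeric cells arrive as text and may carry thousands separators, a leading `+`, or surrounding whitespace. A minimal sketch of cleaning such cells follows; the helper name `to_int` and the sample cells are hypothetical.

```python
def to_int(cell):
    """Convert a scraped cell such as ' 1,234 ' or '+56' to an int.

    Returns None for empty or non-numeric cells.
    """
    cleaned = cell.strip().replace(",", "").lstrip("+")
    return int(cleaned) if cleaned.isdigit() else None


raw_cells = ["1,234", " +56 ", "", "N/A"]  # made-up sample values
values = [to_int(c) for c in raw_cells]
print(values)  # [1234, 56, None, None]
```

Returning `None` rather than raising keeps blank cells from aborting the whole column; pandas will treat them as missing values when the list is placed in a DataFrame.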

0 answers:

No answers