Adding headers to a table I scraped

Time: 2019-06-06 12:29:28

Tags: python web-scraping python-requests python-requests-html

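I have been following an online tutorial, but instead of the tutorial's data, which comes with headers, I want to use the code below:

My problem is that my table has no headers, so the first row is used as the header. How can I set the headers to "Ride" and "Queue Time"?

Thanks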

import requests
import lxml.html as lh
import pandas as pd

url='http://www.ridetimes.co.uk/'

page = requests.get(url)

doc = lh.fromstring(page.content)

# Every <tr> on the page is one table row
tr_elements = doc.xpath('//tr')

col = []
# For each cell in the first row, store its text (used as the header) and an empty list
for i, t in enumerate(tr_elements[0], start=1):
    name = t.text_content()
    print('%d:"%s"' % (i, name))
    col.append((name, []))
    print(col)

3 answers:

Answer 0: (score: 0)

How about trying:

>>> pd.DataFrame(col,columns=["Ride","Queue Time"])
               Ride Queue Time
0  Spinball Whizzer         []
1            0 mins         []

If I have understood correctly, this is the answer.
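If the goal is one DataFrame row per table row, a minimal sketch might look like the following. It collects the cell texts of every `<tr>` and passes the column names to the constructor. The HTML string is a made-up stand-in for the page (an assumption, so that the example runs offline); the real markup may differ.

```python
import lxml.html as lh
import pandas as pd

# Hypothetical sample markup standing in for the real page (assumption).
SAMPLE_HTML = """
<table>
  <tr><td>Spinball Whizzer</td><td>0 mins</td></tr>
  <tr><td>Nemesis</td><td>5 mins</td></tr>
</table>
"""

doc = lh.fromstring(SAMPLE_HTML)

# One list of cell texts per <tr>; every row is data because the table has no header row.
rows = [[cell.text_content().strip() for cell in tr] for tr in doc.xpath('//tr')]

# The column names are supplied explicitly instead of being taken from the first row.
df = pd.DataFrame(rows, columns=['Ride', 'Queue Time'])
print(df)
```

Because the names are passed to the constructor, the first ride stays in the data instead of being consumed as a header.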

Answer 1: (score: 0)

Use pandas to fetch the table, then simply assign the column names:

import pandas as pd

url='http://www.ridetimes.co.uk/'
df = pd.read_html(url)[0]

df.columns = ['Ride', 'Queue Time']

Output:

print(df)
               Ride             Queue Time
0  Spinball Whizzer                 0 mins
1           Nemesis                 5 mins
2          Oblivion                 5 mins
3        Wicker Man                 5 mins
4        The Smiler                10 mins
5              Rita                20 mins
6          TH13TEEN                25 mins
7         Galactica  Currently Unavailable
8        Enterprise  Currently Unavailable
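The same approach can be tried offline. The sketch below feeds `pd.read_html` a small made-up HTML string (an assumption standing in for the live page) through `StringIO`, then assigns the names. It needs an HTML parser such as lxml installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample markup standing in for the real page (assumption).
SAMPLE_HTML = """
<table>
  <tr><td>Spinball Whizzer</td><td>0 mins</td></tr>
  <tr><td>Nemesis</td><td>5 mins</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found.
df = pd.read_html(StringIO(SAMPLE_HTML))[0]

# The list assigned here must have exactly one name per column.
df.columns = ['Ride', 'Queue Time']
print(df)
```

Assigning to `df.columns` raises a `ValueError` if the number of names does not match the number of columns, which is a quick check that the table was parsed as expected.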

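Answer 2: (score: 0)

Consider using the same source the page itself uses to update its values, which returns JSON. A random number is appended to the URL to prevent cached results from being served. This returns rides from all groups, not just Thrill.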

import requests
import random
import pandas as pd

# A random query parameter prevents cached results from being served
i = random.randint(1, 1000000000000000000)
r = requests.get('http://ridetimes.co.uk/queue-times-new.php?r=' + str(i)).json()
df = pd.DataFrame([(item['ride'], item['time']) for item in r], columns=['Ride', 'Queue Time'])
print(df)
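As a variation on the string concatenation above, the random value can be passed through the `params` argument of requests, which builds the query string itself. The sketch below only prepares the request and prints the resulting URL, so nothing is sent over the network; the variable names are illustrative.

```python
import random

import requests

# Endpoint quoted in the answer above.
BASE_URL = 'http://ridetimes.co.uk/queue-times-new.php'

# Random value used only to make each URL unique.
cache_buster = random.randint(1, 10**18)

# Build (but do not send) the request to inspect the final URL.
prepared = requests.Request('GET', BASE_URL, params={'r': cache_buster}).prepare()
print(prepared.url)
```

The same `params` dictionary can be passed to `requests.get` directly when the request is actually made.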

If you only want the Thrill group, modify this line:

df = pd.DataFrame([(item['ride'], item['time']) for item in r if item['group'] == 'Thrill'], columns=['Ride', 'Queue Time'])
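Both variants can be folded into one helper with an optional group filter. This is only a sketch: the function name `rides_to_frame` and the sample payload are hypothetical, and the keys `ride`, `time` and `group` are taken from the code above rather than from any documented API.

```python
import pandas as pd

# Hypothetical payload shaped like the JSON used above (values are illustrative).
sample = [
    {'ride': 'Nemesis', 'time': '5 mins', 'group': 'Thrill'},
    {'ride': 'Example Family Ride', 'time': '10 mins', 'group': 'Family'},
]


def rides_to_frame(items, group=None):
    """Build the Ride / Queue Time table, optionally keeping a single group."""
    rows = [
        (item['ride'], item['time'])
        for item in items
        if group is None or item['group'] == group
    ]
    return pd.DataFrame(rows, columns=['Ride', 'Queue Time'])


all_rides = rides_to_frame(sample)
thrill_only = rides_to_frame(sample, group='Thrill')
print(thrill_only)
```

With the live data, `sample` would be replaced by the list returned from `.json()`.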