How do I make read_html loop? (Python)

Time: 2020-04-29 13:44:05

Tags: python pandas beautifulsoup jupyter-notebook

I currently have the following code, where df2 = df[0] limits it to collecting data for only one game on the corresponding date. I am trying to figure out how to collect the data for multiple games that took place on the same day.

The idea is to pull the match data for every game played on a given day and keep going through the whole page.

For example, tables[20] returns the HTML links

1) href="/matches/2338970/north-vs-mad-lions-dreamhack-open-leipzig-2020"

2) href="/matches/2338968/heroic-vs-mad-lions-dreamhack-open-leipzig-2020"
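
For reference, pd.read_html returns a list of DataFrames, one per <table> element in the markup it is given, which is why df2 = df[0] keeps only the first match of the day. A minimal sketch with made-up two-match markup:

import pandas as pd

# two <table> elements standing in for two matches played on the same day
html = """
<table><tr><td>North</td><td>1 - 2</td><td>MAD Lions</td></tr></table>
<table><tr><td>Heroic</td><td>2 - 0</td><td>MAD Lions</td></tr></table>
"""

dfs = pd.read_html(html)  # one DataFrame per <table>
print(len(dfs))           # 2
df2 = dfs[0]              # [0] keeps only the first match...
for match_df in dfs:      # ...whereas looping over the list keeps them all
    print(match_df.iloc[0].tolist())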

My attempt does not update each variable (teamchoose, chosen, maps) per match; instead it repeats the first match's data for the other matches (see the screenshot).


1 Answer:

Answer 0 (score: 0)

OK. So you just need to loop through the sub-tables pulled out of each table. I also made one other change: instead of setting index = 0 and then incrementing it after each loop, you can use enumerate().
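
To make that second change concrete, here is the pattern side by side (a minimal, runnable sketch; the list contents are placeholders):

tables = ["day 1", "day 2", "day 3"]  # placeholder standing in for the parsed day-blocks

# manual counter: you have to remember to increment it after each loop
index = 0
for table in tables:
    print(index, table)
    index += 1

# enumerate() yields (index, item) pairs, so there is no counter bookkeeping
for index, table in enumerate(tables):
    print(index, table)

With that in place, see if this works: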

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
from tabulate import tabulate
import re


headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
team_id = 8362
list_dfs = []  # initialise once, outside the loops, so every page accumulates
for i in range(0, 1):
    offset = i * 100  # each HLTV results page holds 100 matches
    url = "https://www.hltv.org/results?offset={}&team={}".format(offset, team_id)

    res = requests.get(url, headers=headers)
    soup = bs(res.content, 'html.parser')

    # each "results-sublist" div groups the matches played on one day
    tables = soup.find_all("div", {"class": "results-sublist"})

    for index, table in enumerate(tables):
        print('Page %s:\t%s of %s' % (i + 1, index + 1, len(tables)))
        dfs = pd.read_html(str(table))  # <--- returns all the sub-tables in a list called dfs
        for tableIdx, df2 in enumerate(dfs):  # <---- additional loop: one pass per match that day
            # <--- also grab the link belonging to this table/match, in case there is more than 1 match
            link = table.find_all('a', href=True)[tableIdx]
            link = "https://www.hltv.org/" + link.get('href')
            res = requests.get(link, headers=headers)
            soup = bs(res.content, 'lxml')
            temp = soup.find_all("div", {"class": "padding"})
            # data-unix is in milliseconds; scale to nanoseconds for pd.to_datetime
            date = pd.to_datetime(int(soup.select(".timeAndEvent div")[0]['data-unix']) * 1000000)

            # veto lines look like "<div>1. Team removed Map</div>"
            out = re.findall(r'<div>\d\.(.*?)</div>', str(temp))

            dict_choices = {"teamchoose": [], "chosen": [], "maps": []}
            for choice in out[0:6]:
                split = choice.strip(" ").split(" ")
                dict_choices["teamchoose"].append(" ".join(split[:-2]))
                dict_choices["chosen"].append(split[-2])
                dict_choices["maps"].append(split[-1])
            try:
                # the seventh line is the map that was left over
                left = out[6]
                split = left.strip(" ").split(" ")
                dict_choices["teamchoose"].append(split[2])
                dict_choices["chosen"].append(split[2])
                dict_choices["maps"].append(split[0])
            except IndexError:
                pass
            df = pd.DataFrame.from_dict(dict_choices, orient='index').transpose()

            df["opponent"] = df2[2].iloc[0]
            df["team"] = df2[0].iloc[0]
            df["match"] = index
            df['date'] = date
            list_dfs.append(df)


df_out = pd.concat(list_dfs)
df_out = df_out[['match', 'date', 'team', 'opponent', 'teamchoose', 'chosen', 'maps']]
df_out.to_csv("{}_vetoes.csv".format(team_id), index=False)
print(tabulate(df_out, headers='keys', tablefmt='psql'))
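
For what it's worth, the veto parsing in the middle boils down to one regex plus a split. Here it is applied to a made-up veto line (the sample string is illustrative, not real scraped output):

import re

# hypothetical veto markup in the shape the match page's "padding" div uses
sample = "<div>1. North removed Inferno</div><div>2. MAD Lions removed Vertigo</div>"

out = re.findall(r'<div>\d\.(.*?)</div>', sample)
# out == [' North removed Inferno', ' MAD Lions removed Vertigo']

for choice in out:
    split = choice.strip(" ").split(" ")
    print(" ".join(split[:-2]),  # team, e.g. "MAD Lions"
          split[-2],             # action, e.g. "removed"
          split[-1])             # map, e.g. "Vertigo"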