I currently have the following code; df2 = df[0] limits it to collecting data for only one game on the corresponding date. I'm trying to figure out how to collect the data for multiple games that took place on the same day.
The idea is to pull the match data for every game played on a given day, and keep doing that across the whole page.
For example, tables[20] returns the HTML links:
1) href="/matches/2338970/north-vs-mad-lions-dreamhack-open-leipzig-2020"
2) href="/matches/2338968/heroic-vs-mad-lions-dreamhack-open-leipzig-2020"
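A minimal sketch of what this looks like, assuming the same page structure as the code further down (the offset and team id are just example values):

# Sketch: list every match link inside each per-day "results-sublist" div,
# rather than only the first one. URL parameters are example values.
import requests
from bs4 import BeautifulSoup as bs

headers = {'User-Agent': 'Mozilla/5.0'}
res = requests.get("https://www.hltv.org/results?offset=0&team=8362", headers=headers)
soup = bs(res.content, 'html.parser')
for day in soup.find_all("div", {"class": "results-sublist"}):
    links = ["https://www.hltv.org" + a.get('href') for a in day.find_all('a', href=True)]
    print(links)  # all matches played on that date, not just the first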
I tried this, but it doesn't update each variable (teamchoose, chosen, maps); instead it repeats the first match's data for the other matches.
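That behaviour is consistent with only ever reading the first table per day. A tiny self-contained sketch (with made-up HTML) shows what pd.read_html returns when one day has two matches:

# pd.read_html returns one DataFrame per <table>, so a day with two
# matches yields a list of two; keeping only element [0] drops the rest.
import io
import pandas as pd

day_html = """
<div>
  <table><tr><td>North</td><td>1 - 2</td><td>MAD Lions</td></tr></table>
  <table><tr><td>Heroic</td><td>0 - 2</td><td>MAD Lions</td></tr></table>
</div>
"""
dfs = pd.read_html(io.StringIO(day_html))
print(len(dfs))  # -> 2
for tableIdx, match_df in enumerate(dfs):
    print(tableIdx, match_df.iloc[0].tolist())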
Answer 0 (score: 0)
OK. So you just need to iterate through the sub-tables pulled from each results table. I also made one other change: instead of setting index = 0 and incrementing it after each loop, you can use enumerate(). See if this works:
import requests
import pandas as pd
import re
from bs4 import BeautifulSoup as bs
from tabulate import tabulate

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
team_id = 8362

list_dfs = []
for i in range(0, 1):
    offset = i * 100
    url = "https://www.hltv.org/results?offset={}&team={}".format(offset, team_id)
    res = requests.get(url, headers=headers)
    soup = bs(res.content, 'html.parser')

    # each "results-sublist" div holds all matches played on one date
    tables = soup.find_all("div", {"class": "results-sublist"})
    for index, table in enumerate(tables):
        print('Page %s:\t%s of %s' % (i + 1, index + 1, len(tables)))
        dfs = pd.read_html(str(table))  # <--- returns all the tables into a list called dfs
        for tableIdx, df2 in enumerate(dfs):  # <---- add additional loop here: one iteration per match that day
            # <--- also need to grab the correct link for the associated table/match if there is more than 1 match
            link = table.find_all('a', href=True)[tableIdx]
            link = "https://www.hltv.org" + link.get('href')
            res = requests.get(link, headers=headers)
            match_soup = bs(res.content, 'lxml')

            temp = match_soup.find_all("div", {"class": "padding"})
            # data-unix is in milliseconds; multiply to nanoseconds for pd.to_datetime
            date = pd.to_datetime(int(match_soup.select(".timeAndEvent div")[0]['data-unix']) * 1000000)

            # veto lines look like "<div>1. Team removed Map</div>"
            out = re.findall(r'<div>\d\.(.*?)</div>', str(temp))
            dict_choices = {"teamchoose": [], "chosen": [], "maps": []}
            for choice in out[0:6]:
                split = choice.strip(" ").split(" ")
                dict_choices["teamchoose"].append(" ".join(split[:-2]))
                dict_choices["chosen"].append(split[-2])
                dict_choices["maps"].append(split[-1])
            try:
                # seventh line, if present, is the leftover map ("Map was left over")
                left = out[6]
                split = left.strip(" ").split(" ")
                dict_choices["teamchoose"].append(split[2])
                dict_choices["chosen"].append(split[2])
                dict_choices["maps"].append(split[0])
            except IndexError:
                pass

            df = pd.DataFrame(dict_choices)
            df["opponent"] = df2[2].iloc[0]
            df["team"] = df2[0].iloc[0]
            df["match"] = index
            df['date'] = date
            list_dfs.append(df)

df_out = pd.concat(list_dfs)
df_out = df_out[['match', 'date', 'team', 'opponent', 'teamchoose', 'chosen', 'maps']]
df_out.to_csv("{}_vetoes.csv".format(team_id), index=False)
print(tabulate(df_out, headers='keys', tablefmt='psql'))
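If the enumerate() pairing is unclear, here is the same idea in isolation, with stub lists standing in for dfs and the anchor tags (the two hrefs are the ones from the question):

# enumerate() replaces a manual "index = 0 ... index += 1" counter and
# keeps each parsed sub-table paired with the link at the same position.
dfs_stub = ["table for match 1", "table for match 2"]
links_stub = ["/matches/2338970/north-vs-mad-lions-dreamhack-open-leipzig-2020",
              "/matches/2338968/heroic-vs-mad-lions-dreamhack-open-leipzig-2020"]
for tableIdx, df2 in enumerate(dfs_stub):
    print(tableIdx, df2, "->", links_stub[tableIdx])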