So I'm scraping a set of similar pages with BS4, and since the data is stored in tables it's a very simple process: find the table and call df1 = pd.read_html(str(table)).
The problem is that the tables are similar but not always identical, i.e. the number of columns is not always the same.
For example, the table on page 1 has the columns Id, Name, DOB, College, Years_experience, Nationality, while the same table on page 2 has all the same columns except College.
That is:
Id, Name, DOB, College, Years_experience, Nationality
vs
Id, Name, DOB, Years_experience, Nationality
Since I want to store the data in a single CSV, my question is how to define all the columns so that, if a table is missing some of them, the missing values are written to the CSV as nulls.
Something like: check the column names, and if one is not found, fill it with null for every row.
Is there a simple solution, or do I need to build a dictionary and do everything manually?
By the way, it doesn't have to be pandas if there is a generally better solution for this; I'm just used to it because it makes reading HTML tables very easy.
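For what it's worth, pandas handles exactly this column alignment when the per-page frames are combined before writing; here is a minimal sketch with two made-up frames (the data and file name are purely illustrative, not from the question):

import pandas as pd

# Two hypothetical tables: the second one has no College column
page1 = pd.DataFrame({'Id': [1], 'Name': ['A'], 'College': ['X'], 'Nationality': ['US']})
page2 = pd.DataFrame({'Id': [2], 'Name': ['B'], 'Nationality': ['FR']})

# concat aligns on column names; columns missing from a frame come through as NaN
combined = pd.concat([page1, page2], ignore_index=True, sort=False)
combined.to_csv('out.csv', index=False)  # NaN cells are written as empty fields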
So I'm doing something like this:
import pandas as pd
import requests
from bs4 import BeautifulSoup

for urlx in urls:
    url = str(urlx)
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find('table', id='abc')
    df1 = pd.read_html(str(table))
    df1[0]['URL'] = urlx
    df1[0].to_csv('_out.csv', encoding='utf-8', float_format="%.3f", index=False, header=None, mode='a')
Thanks
Edit: added more info
Answer 0: (score: 1)
You can do it like this:
df = pd.DataFrame()
for urlx in urls:
    res = requests.get(urlx, headers=headers)
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find('table', id='abc')
    df1 = pd.read_html(str(table))[0]
    df1['URL'] = urlx
    df = df.append(df1, sort=False).reset_index(drop=True)

df.to_csv('file.csv', index=False)
Or skip BeautifulSoup entirely and go straight to pandas:
df = pd.DataFrame()
for urlx in urls:
    df1 = pd.read_html(urlx, attrs={'id': 'abc'})[0]
    df1['URL'] = urlx
    df = df.append(df1, sort=False).reset_index(drop=True)

df.to_csv('file.csv', index=False)
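Because append with sort=False aligns the frames on column names, columns missing from a given page simply come through as NaN and end up as empty cells in the CSV, which is the null-fill behaviour asked about. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current pandas the same loop would collect the frames in a list and concatenate once at the end; a sketch of the equivalent, using the same placeholder urls and table id as above:

frames = []
for urlx in urls:
    df1 = pd.read_html(urlx, attrs={'id': 'abc'})[0]
    df1['URL'] = urlx
    frames.append(df1)

# concat aligns on column names; columns absent from a page become NaN
df = pd.concat(frames, ignore_index=True, sort=False)
df.to_csv('file.csv', index=False)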
Here it is with your code:
import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0)'}

#years_url = ['https://widgets.sports-reference.com/wg.fcgi?css=1&site=bbr&url=%2Fplayers%2Fi%2Fiversal01.html&div=div_playoffs_per_game']
years_url = ['https://www.basketball-reference.com/players/o/odomla01.html','https://www.basketball-reference.com/players/r/russebi01.html']

df = pd.DataFrame() #< Initialize empty dataframe before loop
for year_url in years_url:
    res = requests.get(year_url, headers=headers)
    soup = BeautifulSoup(res.content, 'html.parser')
    table = soup.find('table', id='per_game')
    df1 = pd.read_html(str(table))[0]
    df1['player'] = year_url #<---- HERE WAS YOUR ERROR
    df1['Name'] = soup.find('h1', {'itemprop': 'name'}).text #<-- I added this
    df = df.append(df1, sort=False).reset_index(drop=True)

df.to_csv('123.csv', index=False) #<--- Took this out of the for loop
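Writing df.to_csv once after the loop is what makes the null filling work: the accumulated frame already carries the union of all columns, so the header is written a single time and any page that lacks a column just gets empty cells in those rows. Appending page by page with mode='a', as in the original snippet, can't line the columns up after the fact.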
OR
import pandas as pd

#years_url = ['https://widgets.sports-reference.com/wg.fcgi?css=1&site=bbr&url=%2Fplayers%2Fi%2Fiversal01.html&div=div_playoffs_per_game']
years_url = ['https://www.basketball-reference.com/players/o/odomla01.html','https://www.basketball-reference.com/players/r/russebi01.html']

df = pd.DataFrame() #< Took this out of the for loop
for year_url in years_url:
    df1 = pd.read_html(year_url, attrs={'id':'per_game'})[0]
    df1['player'] = year_url
    df = df.append(df1, sort=False).reset_index(drop=True)

df.to_csv('123.csv', index=False)