Pandas - appending to CSV, inconsistent number of columns

Date: 2020-05-06 13:23:15

Tags: python pandas web-scraping beautifulsoup

So I'm scraping web pages with BS4, and since the data is stored in tables it's a very simple process: find the table and read it with df1 = pd.read_html(str(table)).

The problem is that the tables are similar but not always identical, i.e. the number of columns is not always the same. For example, the table on page 1 has the columns Id, Name, DOB, College, Years_experience, Nationality, while the same table on page 2 has the same columns except College. That is:

Id, Name, DOB, College, Years_experience, Nationality

vs

Id, Name, DOB, Years_experience, Nationality

Since I want to store the data in a single CSV, my question is how to define all the columns up front so that, if a table is missing some of them, the missing values are written to the CSV as empty/null.

Something like this: check the column names, and if a column is not found, fill it with nulls in every row.

Is there a simple solution, or do I need to build a dictionary and do everything by hand?
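
In other words, something along the lines of this rough sketch (the all_columns list is just a placeholder built from the example columns above, not real code of mine):

all_columns = ['Id', 'Name', 'DOB', 'College', 'Years_experience', 'Nationality']

# reindex() keeps the columns a page does have, adds any missing ones filled
# with NaN, and puts them in the same order for every page
df1[0] = df1[0].reindex(columns=all_columns)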

By the way, it doesn't have to be Pandas if there is a better general solution; I'm just used to it because it makes reading HTML tables very easy.

So far I'm doing something like this:

for urlx in urls:
    url = str(urlx)
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.content, 'lxml')

    table = soup.find('table', id='abc')
    df1 = pd.read_html(str(table))  # returns a list of DataFrames

    df1[0]['URL'] = urlx  # record which page each row came from

    # append this page's rows to the output file
    df1[0].to_csv('_out.csv', encoding='utf-8', float_format="%.3f", index=False, header=None, mode='a')

Thanks

Edit: added more information

2 Answers:

Answer 0 (score: 1):

You can do something like this:

df = pd.DataFrame()
for urlx in urls:
    res = requests.get(urlx, headers=headers)
    soup = BeautifulSoup(res.content,'lxml')
    table = soup.find('table', id='abc')

    df1 = pd.read_html(str(table))[0]
    df1['URL'] = urlx

    df = df.append(df1, sort=False).reset_index(drop=True)

df.to_csv('file.csv', index=False)
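
The reason this copes with the missing College column is that pandas aligns on column names when combining frames and fills anything a page doesn't have with NaN, which to_csv then writes as an empty field. A tiny self-contained sketch with made-up data (not from your pages) showing the behaviour:

import pandas as pd

page1 = pd.DataFrame({'Id': [1], 'Name': ['A'], 'College': ['Duke']})
page2 = pd.DataFrame({'Id': [2], 'Name': ['B']})   # no College column

# concat aligns on column names; df.append() used above is a thin wrapper around it
combined = pd.concat([page1, page2], sort=False).reset_index(drop=True)
print(combined)
#    Id Name College
# 0   1    A    Duke
# 1   2    B     NaN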

Or skip BeautifulSoup entirely and go straight to pandas:

df = pd.DataFrame()
for urlx in urls:
    df1 = pd.read_html(urlx, attrs={'id':'abc'})[0]
    df1['URL'] = urlx
    df = df.append(df1, sort=False).reset_index(drop=True)

df.to_csv('file.csv', index=False)
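
Note that in this version pd.read_html() downloads the page itself, and the attrs={'id':'abc'} argument narrows the result down to the table whose id attribute matches, so requests and BeautifulSoup are not needed at all.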

And here it is with your code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0)'}

#years_url = ['https://widgets.sports-reference.com/wg.fcgi?css=1&site=bbr&url=%2Fplayers%2Fi%2Fiversal01.html&div=div_playoffs_per_game']
years_url = ['https://www.basketball-reference.com/players/o/odomla01.html','https://www.basketball-reference.com/players/r/russebi01.html']

df = pd.DataFrame() #< Initialize empty dataframe before loop
for year_url in years_url:
    res = requests.get(year_url,headers=headers)
    soup = BeautifulSoup(res.content,'html.parser')
    table = soup.find('table', id='per_game')

    df1 = pd.read_html(str(table))[0]
    df1['player'] = year_url  #<---- HERE WAS YOUR ERROR
    df1['Name'] = soup.find('h1',{'itemprop':'name'}).text #<-- I added this

    df = df.append(df1, sort=False).reset_index(drop=True)

df.to_csv('123.csv', index=False) #<--- Took this out of the for loop
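
Writing the CSV once, after the loop, also means the header row is written a single time and reflects the union of all columns seen across the pages, rather than whichever columns the first page happened to have.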

OR

import pandas as pd

#years_url = ['https://widgets.sports-reference.com/wg.fcgi?css=1&site=bbr&url=%2Fplayers%2Fi%2Fiversal01.html&div=div_playoffs_per_game']
years_url = ['https://www.basketball-reference.com/players/o/odomla01.html','https://www.basketball-reference.com/players/r/russebi01.html']

df = pd.DataFrame() #< Initialize empty dataframe before loop
for year_url in years_url:
    df1 = pd.read_html(year_url, attrs={'id':'per_game'})[0]
    df1['player'] = year_url
    df = df.append(df1, sort=False).reset_index(drop=True)

df.to_csv('123.csv', index=False) 
