Pandas - appending to CSV, inconsistent number of columns

Date: 2020-05-06 13:23:15

Tags: python pandas web-scraping beautifulsoup

So I'm scraping web pages with BS4, and since the data is stored in tables it's a very simple process: find the table and read it with df1 = pd.read_html(str(table)).

The problem is that the tables are similar but not always identical, i.e. the number of columns is not always the same. For example, the table on page 1 has the columns Id, Name, DOB, College, Years_experience, Nationality, while the same table on page 2 has the same columns except College. That is:

Id, Name, DOB, College, Years_experience, Nationality

vs

Id, Name, DOB, Years_experience, Nationality

Since I want to store the data in a single CSV, my question is how to define all the columns up front so that, if a table is missing some of them, the missing values are written to the CSV as empty/null.

Something like this: check the column names, and if a column is not found, fill it with nulls in every row.

Is there a simple solution, or do I need to build a dictionary and do everything by hand?
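
In other words, something along the lines of this rough sketch (the all_columns list is just a placeholder built from the example columns above, not real code of mine):

all_columns = ['Id', 'Name', 'DOB', 'College', 'Years_experience', 'Nationality']

# reindex() keeps the columns a page does have, adds any missing ones filled
# with NaN, and puts them in the same order for every page
df1[0] = df1[0].reindex(columns=all_columns)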

By the way, it doesn't have to be Pandas if there is a better general solution; I'm just used to it because it makes reading HTML tables very easy.

So far I'm doing something like this:

for urlx in urls:
    url = str(urlx)
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.content, 'lxml')

    table = soup.find('table', id='abc')
    df1 = pd.read_html(str(table))  # returns a list of DataFrames

    df1[0]['URL'] = urlx  # record which page each row came from

    # append this page's rows to the output file
    df1[0].to_csv('_out.csv', encoding='utf-8', float_format="%.3f", index=False, header=None, mode='a')

Thanks

Edit: added more information

2 Answers:

Answer 0 (score: 1):

You can do something like this:

df = pd.DataFrame()
for urlx in urls:
    res = requests.get(urlx, headers=headers)
    soup = BeautifulSoup(res.content,'lxml')
    table = soup.find('table', id='abc')

    df1 = pd.read_html(str(table))[0]
    df1['URL'] = urlx

    df = df.append(df1, sort=False).reset_index(drop=True)

df.to_csv('file.csv', index=False)
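
The reason this copes with the missing College column is that pandas aligns on column names when combining frames and fills anything a page doesn't have with NaN, which to_csv then writes as an empty field. A tiny self-contained sketch with made-up data (not from your pages) showing the behaviour:

import pandas as pd

page1 = pd.DataFrame({'Id': [1], 'Name': ['A'], 'College': ['Duke']})
page2 = pd.DataFrame({'Id': [2], 'Name': ['B']})   # no College column

# concat aligns on column names; df.append() used above is a thin wrapper around it
combined = pd.concat([page1, page2], sort=False).reset_index(drop=True)
print(combined)
#    Id Name College
# 0   1    A    Duke
# 1   2    B     NaN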

Or skip BeautifulSoup entirely and go straight to pandas:

df = pd.DataFrame()
for urlx in urls:
    df1 = pd.read_html(urlx, attrs={'id':'abc'})[0]
    df1['URL'] = urlx
    df = df.append(df1, sort=False).reset_index(drop=True)

df.to_csv('file.csv', index=False)
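
Note that in this version pd.read_html() downloads the page itself, and the attrs={'id':'abc'} argument narrows the result down to the table whose id attribute matches, so requests and BeautifulSoup are not needed at all.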

And here it is with your code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0)'}

#years_url = ['https://widgets.sports-reference.com/wg.fcgi?css=1&site=bbr&url=%2Fplayers%2Fi%2Fiversal01.html&div=div_playoffs_per_game']
years_url = ['https://www.basketball-reference.com/players/o/odomla01.html','https://www.basketball-reference.com/players/r/russebi01.html']

df = pd.DataFrame() #< Initialize empty dataframe before loop
for year_url in years_url:
    res = requests.get(year_url,headers=headers)
    soup = BeautifulSoup(res.content,'html.parser')
    table = soup.find('table', id='per_game')

    df1 = pd.read_html(str(table))[0]
    df1['player'] = year_url  #<---- HERE WAS YOUR ERROR
    df1['Name'] = soup.find('h1',{'itemprop':'name'}).text #<-- I added this

    df = df.append(df1, sort=False).reset_index(drop=True)

df.to_csv('123.csv', index=False) #<--- Took this out of the for loop
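
Writing the CSV once, after the loop, also means the header row is written a single time and reflects the union of all columns seen across the pages, rather than whichever columns the first page happened to have.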

OR

import pandas as pd

#years_url = ['https://widgets.sports-reference.com/wg.fcgi?css=1&site=bbr&url=%2Fplayers%2Fi%2Fiversal01.html&div=div_playoffs_per_game']
years_url = ['https://www.basketball-reference.com/players/o/odomla01.html','https://www.basketball-reference.com/players/r/russebi01.html']

df = pd.DataFrame() #< Initialize empty dataframe before loop
for year_url in years_url:
    df1 = pd.read_html(year_url, attrs={'id':'per_game'})[0]
    df1['player'] = year_url
    df = df.append(df1, sort=False).reset_index(drop=True)

df.to_csv('123.csv', index=False) 
