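I'm having trouble using BeautifulSoup to scrape footballers' details into a workable pandas table.
The problem is that some of the data I scrape is "extra" and fills my table with junk. For example: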
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0"}
page = requests.get('https://www.transfermarkt.co.uk/manchester-united/startseite/verein/985', headers=HEADERS)
soup = BeautifulSoup(page.content, 'html.parser')
playerdata = soup.find_all(class_='posrela')
names = [';'.join(pt.findAll(text=True)) for pt in playerdata]
df = pd.DataFrame(names)
df = pd.DataFrame([sub.split(";") for sub in names])
print(df.replace('^$', np.nan, regex=True))
Result:
python testing5.py
0 1 2 3
0 David de Gea D. de Gea Keeper None
1 Sergio Romero S. Romero Keeper None
2 Joel Pereira J. Pereira Keeper None
3 Eric Bailly E. Bailly Centre-Back
4 Victor Lindelöf V. Lindelöf Centre-Back None
5 Marcos Rojo M. Rojo Centre-Back
6 Chris Smalling C. Smalling Centre-Back None
7 Phil Jones P. Jones Centre-Back
8 Daley Blind D. Blind Left-Back None
9 Luke Shaw Luke Shaw Left-Back None
10 Matteo Darmian M. Darmian Right-Back None
11 Antonio Valencia A. Valencia Right-Back None
12 Nemanja Matic N. Matic Defensive Midfield None
13 Michael Carrick M. Carrick Defensive Midfield
14 Paul Pogba P. Pogba Central Midfield None
15 Ander Herrera A. Herrera Central Midfield None
16 Marouane Fellaini M. Fellaini Central Midfield None
17 Ashley Young A. Young Left Midfield None
18 Henrikh Mkhitaryan H. Mkhitaryan Attacking Midfield None
19 Juan Mata Juan Mata Attacking Midfield None
20 Jesse Lingard J. Lingard Left Wing None
21 Romelu Lukaku R. Lukaku Centre-Forward None
22 Anthony Martial A. Martial . Centre-Forward
23 Marcus Rashford M. Rashford Centre-Forward None
24 Zlatan Ibrahimovic Z. Ibrahimovic Centre-Forward
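As you can see, where I've got rid of the empty data, the remaining data has been pushed into the wrong cells. You may ask why I have a 4th column: I'm going to insert more data there, but for now I need to clean up column 3.
As you can see, I've tried using a regex to replace the blanks with NaN in the first instance. But no matter what I try, I can't "select" the empty cells. I can't touch them!
When I try to treat 'names' like a list, the interpreter tells me it's not a list but a ResultSet!
I'm wondering if anyone can help. As a programming noob I've made good progress, but I've hit a wall.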
Answer 0 (score: 2)
You can use post-processing: with loc and notnull, copy the non-NaN values from column 3 into column 2:
df.loc[df[3].notnull(), 2] = df[3]
#remove column 3
df = df.drop(3, axis=1)
Another solution is to use mask:
df[2] = df[2].mask(df[3].notnull(), df[3])
df = df.drop(3, axis=1)
Or, very similarly, with numpy.where:
df[2] = np.where(df[3].notnull(), df[3], df[2])
df = df.drop(3, axis=1)
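To see that the three variants above do the same job, here is a minimal sketch on a hand-built frame rather than the scraped page. The two rows and the `"\xa0"` filler in column 2 are made up for illustration (the real extra entry may be something else), and the helper names `fix_with_loc`, `fix_with_mask` and `fix_with_where` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the scraped frame (an assumption, not real page data):
# in row 1 an extra entry sits in column 2 and the position lands in column 3.
raw = pd.DataFrame([
    ["David de Gea", "D. de Gea", "Keeper", None],
    ["Eric Bailly", "E. Bailly", "\xa0", "Centre-Back"],
])

def fix_with_loc(df):
    out = df.copy()
    # Where column 3 holds a value, copy it into column 2 (aligned on the index).
    out.loc[out[3].notnull(), 2] = out[3]
    return out.drop(3, axis=1)

def fix_with_mask(df):
    out = df.copy()
    # mask() replaces the entries of column 2 where the condition is True.
    out[2] = out[2].mask(out[3].notnull(), out[3])
    return out.drop(3, axis=1)

def fix_with_where(df):
    out = df.copy()
    # np.where picks column 3 where it is not null, otherwise column 2.
    out[2] = np.where(out[3].notnull(), out[3], out[2])
    return out.drop(3, axis=1)

a = fix_with_loc(raw)
b = fix_with_mask(raw)
c = fix_with_where(raw)
print(a)
```

All three leave `raw` untouched because each works on a copy, which makes it easy to compare them side by side before applying one to the real frame.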
I tried to improve your solution a little:
playerdata = soup.find_all(class_='posrela')
names = [list(pt.findAll(text=True)) for pt in playerdata]
df = pd.DataFrame(names)
df.loc[df[3].notnull(), 2] = df[3]
df = df.drop(3, axis=1)
print (df)
0 1 2
0 David de Gea D. de Gea Keeper
1 Sergio Romero S. Romero Keeper
2 Joel Pereira J. Pereira Keeper
3 Eric Bailly E. Bailly Centre-Back
4 Victor Lindelöf V. Lindelöf Centre-Back
5 Marcos Rojo M. Rojo Centre-Back
6 Chris Smalling C. Smalling Centre-Back
7 Phil Jones P. Jones Centre-Back
8 Daley Blind D. Blind Left-Back
9 Luke Shaw Luke Shaw Left-Back
10 Matteo Darmian M. Darmian Right-Back
11 Antonio Valencia A. Valencia Right-Back
12 Nemanja Matic N. Matic Defensive Midfield
13 Michael Carrick M. Carrick Defensive Midfield
14 Paul Pogba P. Pogba Central Midfield
15 Ander Herrera A. Herrera Central Midfield
16 Marouane Fellaini M. Fellaini Central Midfield
17 Ashley Young A. Young Left Midfield
18 Henrikh Mkhitaryan H. Mkhitaryan Attacking Midfield
19 Juan Mata Juan Mata Attacking Midfield
20 Jesse Lingard J. Lingard Left Wing
21 Romelu Lukaku R. Lukaku Centre-Forward
22 Anthony Martial A. Martial Centre-Forward
23 Marcus Rashford M. Rashford Centre-Forward
24 Zlatan Ibrahimovic Z. Ibrahimovic Centre-Forward
Another solution:
playerdata = soup.find_all(class_='posrela')
names = []
for pt in playerdata:
    L = list(pt.findAll(text=True))
    # check the length of the list
    if len(L) == 4:
        # move the 4th value into the 3rd position
        L[2] = L[3]
    # append the first 3 values of the list
    names.append(L[:3])
df = pd.DataFrame(names)
print (df)
0 1 2
0 David de Gea D. de Gea Keeper
1 Sergio Romero S. Romero Keeper
2 Joel Pereira J. Pereira Keeper
3 Eric Bailly E. Bailly Centre-Back
4 Victor Lindelöf V. Lindelöf Centre-Back
5 Marcos Rojo M. Rojo Centre-Back
6 Chris Smalling C. Smalling Centre-Back
7 Phil Jones P. Jones Centre-Back
8 Daley Blind D. Blind Left-Back
9 Luke Shaw Luke Shaw Left-Back
10 Matteo Darmian M. Darmian Right-Back
11 Antonio Valencia A. Valencia Right-Back
12 Nemanja Matic N. Matic Defensive Midfield
13 Michael Carrick M. Carrick Defensive Midfield
14 Paul Pogba P. Pogba Central Midfield
15 Ander Herrera A. Herrera Central Midfield
16 Marouane Fellaini M. Fellaini Central Midfield
17 Ashley Young A. Young Left Midfield
18 Henrikh Mkhitaryan H. Mkhitaryan Attacking Midfield
19 Juan Mata Juan Mata Attacking Midfield
20 Jesse Lingard J. Lingard Left Wing
21 Romelu Lukaku R. Lukaku Centre-Forward
22 Anthony Martial A. Martial Centre-Forward
23 Marcus Rashford M. Rashford Centre-Forward
24 Zlatan Ibrahimovic Z. Ibrahimovic Centre-Forward
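The length check above could also be written without special-casing four items, by always taking the first two entries and the last one. This is only a sketch under the assumption that the name and short name always come first and the position always comes last; the function name `normalise` and the sample lists (including the `"\xa0"` filler) are made up for illustration.

```python
def normalise(parts):
    """Return [name, short_name, position] from a list of scraped text nodes."""
    # Assumption: the first two entries are the names and the last is the position.
    if len(parts) < 3:
        raise ValueError("expected at least 3 text entries, got %d" % len(parts))
    return [parts[0], parts[1], parts[-1]]

# Hand-written stand-ins for list(pt.findAll(text=True)); not real page data.
scraped = [
    ["David de Gea", "D. de Gea", "Keeper"],
    ["Eric Bailly", "E. Bailly", "\xa0", "Centre-Back"],
]

rows = [normalise(parts) for parts in scraped]
print(rows)
```

The resulting `rows` can be passed to `pd.DataFrame(rows)` exactly like `names` in the loop above. Unlike the `len(L) == 4` check, this version would also cope with more than one extra entry, but it fails loudly if fewer than three entries come back.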
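Answer 1 (score: 1)
If you are going to extract more data, I suggest extracting all of it in an order that fits easily into a DataFrame. Unless the data is extracted in the right format, you will have to keep running unnecessary clean-up operations.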
playerdata = soup.find_all(class_='inline-table')
# get_text() returns str; renderContents() returns bytes on Python 3
names = [[x.find('img')['title'],
          x.find_all(class_='spielprofil_tooltip')[-1].get_text(),
          x.find_all('tr')[-1].find('td').get_text()] for x in playerdata]
df = pd.DataFrame(names, columns=['Name', 'Short', 'Position'])
Name Short Position
0 David de Gea D. de Gea Keeper
1 Sergio Romero S. Romero Keeper
2 Joel Pereira J. Pereira Keeper
3 Eric Bailly E. Bailly Centre-Back
4 Victor Lindelöf V. Lindelöf Centre-Back
5 Marcos Rojo M. Rojo Centre-Back
6 Chris Smalling C. Smalling Centre-Back
7 Phil Jones P. Jones Centre-Back
8 Daley Blind D. Blind Left-Back
9 Luke Shaw Luke Shaw Left-Back
10 Matteo Darmian M. Darmian Right-Back
11 Antonio Valencia A. Valencia Right-Back
12 Nemanja Matic N. Matic Defensive Midfield
13 Michael Carrick M. Carrick Defensive Midfield
14 Paul Pogba P. Pogba Central Midfield
15 Ander Herrera A. Herrera Central Midfield
16 Marouane Fellaini M. Fellaini Central Midfield
17 Ashley Young A. Young Left Midfield
18 Henrikh Mkhitaryan H. Mkhitaryan Attacking Midfield
19 Juan Mata Juan Mata Attacking Midfield
20 Jesse Lingard J. Lingard Left Wing
21 Romelu Lukaku R. Lukaku Centre-Forward
22 Anthony Martial A. Martial Centre-Forward
23 Marcus Rashford M. Rashford Centre-Forward
24 Zlatan Ibrahimovic Z. Ibrahimovic Centre-Forward
25 Romelu Lukaku Romelu Lukaku Centre-Forward
26 Paul Pogba Paul Pogba Central Midfield
27 Anthony Martial Anthony Martial Centre-Forward
28 Marcus Rashford Marcus Rashford Centre-Forward
29 Eric Bailly Eric Bailly Centre-Back
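The field-by-field approach can be tried offline on a small HTML string. The markup below is a hypothetical simplification of what an `inline-table` block might look like (the real page structure may differ), and the injury `span` is invented to show that extra elements no longer shift the columns.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical, simplified markup; not copied from the real page.
HTML = """
<table class="inline-table">
  <tr>
    <td rowspan="2"><img src="a.jpg" title="David de Gea"></td>
    <td><a class="spielprofil_tooltip" href="#">D. de Gea</a></td>
  </tr>
  <tr><td>Keeper</td></tr>
</table>
<table class="inline-table">
  <tr>
    <td rowspan="2"><img src="b.jpg" title="Eric Bailly"></td>
    <td><a class="spielprofil_tooltip" href="#">E. Bailly</a>
        <span class="injury-icon" title="Injured">&nbsp;</span></td>
  </tr>
  <tr><td>Centre-Back</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

rows = []
for table in soup.find_all(class_="inline-table"):
    # Each field is picked from its own element, so stray nodes are ignored.
    name = table.find("img")["title"]
    short = table.find_all(class_="spielprofil_tooltip")[-1].get_text(strip=True)
    position = table.find_all("tr")[-1].find("td").get_text(strip=True)
    rows.append([name, short, position])

df = pd.DataFrame(rows, columns=["Name", "Short", "Position"])
print(df)
```

Because every column comes from a specific tag, the extra `span` in the second block has no effect on the result, which is the point of extracting in DataFrame order in the first place.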