Scraping a table from a website: can't target the right table

时间:2019-12-09 15:46:41

标签: python html web-scraping html-table

I'm new to Python, just started learning it, and am stuck on the following problem: I want to scrape portfolio data from a website (https://www.wikifolio.com/de/de/w/wffalkinve; scroll down and click "Portfolio"). I can't get hold of the rows with the tr class "c-portfolio"; instead I always end up with the values of the first table on the right, the one showing "Erstemission 20.09.2019" and so on.

I've tried more than 15 approaches, working through web tutorials and questions/answers on reddit/Stack Overflow, but couldn't solve it; I suspect this site is a special case. Below is my most advanced code.

I'd be very grateful for any suggestions! :)

Best, Julian

import requests
import lxml.html as lh
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

url='https://www.wikifolio.com/de/de/w/wffalkinve'

#Create a handle, page, to handle the contents of the website
page = requests.get(url)

#Store the contents of the website under doc
doc = lh.fromstring(page.content)

#Collect every <tbody> element of the site's HTML
tr_elements = doc.xpath('//tbody')

#Check how many rows each <tbody> contains
[len(T) for T in tr_elements[:12]]

which gives the output [5, 4]


#Create empty list
col=[]
i=0

#For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print('%d: %s' % (i,name))
    col.append((name,[]))
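The header-extraction idea above can be sketched self-contained on a fixed HTML snippet, targeting the rows by their class with an XPath predicate instead of grabbing every `<tbody>` (the snippet is an assumption for illustration, not the real wikifolio markup):

```python
import lxml.html as lh

# Hypothetical stand-in for the page markup (assumption, not the real site)
snippet = """
<table>
  <tbody>
    <tr class="c-portfolio"><td>Name</td><td>Quantity</td></tr>
    <tr class="c-portfolio"><td>Foo AG</td><td>10</td></tr>
  </tbody>
</table>
"""
doc = lh.fromstring(snippet)

# Select the rows by class instead of indexing into //tbody
rows = doc.xpath('//tr[@class="c-portfolio"]')
cells = [[td.text_content() for td in row] for row in rows]
print(cells)  # [['Name', 'Quantity'], ['Foo AG', '10']]
```

With a class-based predicate the selection no longer depends on which table happens to come first in the document.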

Another attempt:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from urllib.request import urlopen
from bs4 import BeautifulSoup

url ='https://www.wikifolio.com/de/de/w/wffalkinve'
html = urlopen(url)

soup = BeautifulSoup(html, 'lxml')
type(soup)



# Print the first 10 rows for a sanity check
rows = soup.find_all('tr')
print(rows[:10])

# Collect and print the <td> cells of each row
for row in rows:
    row_td = row.find_all('td')
    print(row_td)


# Strip the tags from the last row's cells
str_cells = str(row_td)
cleantext = BeautifulSoup(str_cells, "lxml").get_text()
print(cleantext)

import re

# Regex-strip the tags from every row's cells
clean = re.compile('<.*?>')
list_rows = []
for row in rows:
    cells = row.find_all('td')
    list_rows.append(clean.sub('', str(cells)))
print(list_rows[:10])

df = pd.DataFrame(list_rows)
df.head(10)

df1 = df[0].str.split(',', expand=True)
df1.head(10)
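Rather than regex-stripping the tags and re-splitting on commas, the cells can be kept structured from the start. A self-contained variant on a made-up snippet (the markup and the column names are assumptions):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup standing in for the portfolio table (assumption)
snippet = """
<table>
  <tr class="c-portfolio"><td>Foo AG</td><td>10</td></tr>
  <tr class="c-portfolio"><td>Bar SE</td><td>5</td></tr>
</table>
"""
soup = BeautifulSoup(snippet, 'lxml')
rows = soup.find_all('tr', class_='c-portfolio')

# One list of cell texts per row, straight into a DataFrame
data = [[td.get_text(strip=True) for td in row.find_all('td')]
        for row in rows]
df = pd.DataFrame(data, columns=['Name', 'Shares'])
print(df)
```

This avoids the comma-split step entirely, which breaks whenever a cell itself contains a comma.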

I thought this would be the solution, but it doesn't solve the problem at all:

from bs4 import BeautifulSoup
import requests
a = requests.get("https://www.wikifolio.com/de/de/w/wffalkinve")
soup = BeautifulSoup(a.text, 'lxml')
# searching for the rows directly
rows = soup.find_all('tr', {'class': 'c-portfolio'})
print(rows[:100])

Edit: to make it easier to find the table in question: tr class c-portfolio

https://ibb.co/nckVC1h
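One quick diagnostic worth running on the raw response: if the class name never occurs anywhere in `page.text`, the portfolio rows are injected client-side by JavaScript, and no static parser (lxml or BeautifulSoup) will ever find them; a browser-driven tool or the site's data endpoint would be needed instead. A minimal sketch of the check with simulated responses (both HTML strings are assumptions):

```python
# Simulated server responses (assumptions for illustration)
rendered = '<tr class="c-portfolio"><td>Foo AG</td></tr>'
js_only = '<div id="portfolio-app"></div>'  # rows injected client-side

def has_portfolio_rows(html: str) -> bool:
    # Cheap substring check: does the raw response contain the row class?
    return 'c-portfolio' in html

print(has_portfolio_rows(rendered))  # True
print(has_portfolio_rows(js_only))   # False
```

Running `has_portfolio_rows(requests.get(url).text)` would show whether the rows are in the server response at all.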

0 Answers:

No answers