I'm new to Python, just started learning it, and I've run into the following problem: I want to scrape portfolio data from a website (https://www.wikifolio.com/de/de/w/wffalkinve; scroll down and click on "Portfolio"). However, I can't get hold of the right tr class "c-portfolio" and always end up with the values of the first table on the right, e.g. "Erstemission 20.09.2019".
I've tried more than 15 approaches based on web tutorials and questions/answers from Reddit/Stack Overflow, but couldn't solve it; I suspect this site is quite special. Below is my most advanced code.
I'd be very grateful for any suggestions! :)
Best, Julian
import requests, six
import lxml.html as lh
from itertools import cycle, islice
from matplotlib import colors
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
url='https://www.wikifolio.com/de/de/w/wffalkinve'
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Grab every <tbody> element of the site's HTML code
tr_elements = doc.xpath('//tbody')
#Check the row count of each of the first 12 <tbody> elements
[len(T) for T in tr_elements[:12]]
This gives the output [5, 4], i.e. two <tbody> elements with 5 and 4 rows.
#Create empty list
col=[]
i=0
#For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print('%d: %s' % (i, name))
    col.append((name, []))
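A side note on this attempt: `doc.xpath('//tbody')` collects every table body on the page, so `tr_elements[0]` is simply whichever table happens to come first in the HTML (here the facts box with "Erstemission 20.09.2019"), not necessarily the portfolio table. Selecting rows by their class attribute is more robust. A minimal sketch of the idea using only the standard library's `xml.etree.ElementTree` on a hypothetical, simplified snippet (with lxml the equivalent expression would be `doc.xpath('//tr[contains(@class, "c-portfolio")]')`):

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup mimicking a page with two tables.
html = """
<root>
  <tbody>
    <tr class="facts"><td>Erstemission 20.09.2019</td></tr>
  </tbody>
  <tbody>
    <tr class="c-portfolio"><td>Position A</td></tr>
    <tr class="c-portfolio"><td>Position B</td></tr>
  </tbody>
</root>
"""

doc = ET.fromstring(html)
# Select rows by their class instead of taking the first <tbody> blindly.
rows = doc.findall(".//tr[@class='c-portfolio']")
print([r.find('td').text for r in rows])  # ['Position A', 'Position B']
```

This only filters on an exact class match; for the real page, lxml's `contains()` predicate is the safer choice since elements often carry several classes.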
Another attempt:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from urllib.request import urlopen
from bs4 import BeautifulSoup
url ='https://www.wikifolio.com/de/de/w/wffalkinve'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
type(soup)
soup.find_all('tr')
# Print the first 10 rows for sanity check
rows = soup.find_all('tr')
print(rows[:10])
for row in rows:
    row_td = row.find_all('td')
print(row_td)
type(row_td)
str_cells = str(row_td)
cleantext = BeautifulSoup(str_cells, "lxml").get_text()
print(cleantext)
import re
list_rows = []
for row in rows:
    cells = row.find_all('td')
    str_cells = str(cells)
    clean = re.compile('<.*?>')
    clean2 = re.sub(clean, '', str_cells)
    list_rows.append(clean2)
print(clean2)
type(clean2)
df = pd.DataFrame(list_rows)
df.head(10)
df1 = df[0].str.split(',', expand=True)
df1.head(10)
I thought this would be the solution, but it didn't solve the problem at all:
from bs4 import BeautifulSoup
import requests
a = requests.get("https://www.wikifolio.com/de/de/w/wffalkinve")
soup = BeautifulSoup(a.text, 'lxml')
# searching for the rows directly
rows = soup.find_all('tr', {'class': 'c-portfolio'})
print(rows[:100])
Edit: to make it easier to find the relevant table: tr class c-portfolio.
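Before trying more selectors, it may be worth checking whether the "c-portfolio" rows are present in the raw HTML that requests receives at all. If the portfolio widget is filled in by JavaScript after the page loads, requests/BeautifulSoup never see those rows (they don't execute scripts), and `find_all('tr', {'class': 'c-portfolio'})` correctly returns an empty list; the data would then have to come from the site's background request (visible in the browser dev tools' network tab) or a browser-automation tool such as Selenium. A quick sketch of the check, using a hypothetical helper name (`has_portfolio_rows`) and a plain substring test on the response body:

```python
def has_portfolio_rows(html):
    """True if the server-rendered HTML already contains the portfolio rows."""
    return 'c-portfolio' in html

# Usage against the live page (needs network access):
# import requests
# resp = requests.get('https://www.wikifolio.com/de/de/w/wffalkinve')
# print(has_portfolio_rows(resp.text))

# Behavior on sample inputs:
print(has_portfolio_rows('<tr class="c-portfolio"><td>Pos A</td></tr>'))  # True
print(has_portfolio_rows('<div>loading...</div>'))                        # False
```

If this prints False for the live page, no amount of selector tweaking on the requests response will help, since the rows simply aren't in the downloaded document.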