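I want to extract specific information from a website using BeautifulSoup, but I have not found the right approach yet. The website shows the following information: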
B. Hübner wechselt für 3.711.638 von Computer zu Marcel.
Ginczek wechselt für 2.845.000 von Computer zu Max.
Embolo wechselt für 6.640.000 von Computer zu Chrissi.
Jäkel wechselt für 220.000 von Thilo zu Computer.
Raphaël Guerreiro wechselt für 3.640.000 von Malte zu Computer.
In the page source it looks like this:
<div class="article_content2">
<div class="article_content_text">
<a href="../../bundesligaspieler/32426-B.+H%C3%BCbner.html" onclick="return(openSmallWindow('../../bundesligaspieler/32426-B.+H%C3%BCbner.html','44f6'))" style="font-weight:normal;" target="_blank">
B. Hübner
</a>
wechselt für 3.711.638 von Computer zu
<a href="playerInfo.phtml?pid=13059320" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059320','p_13059320'))" style="font-weight:normal;" target="_blank">
Marcel
</a>
.
<br/>
<a href="../../bundesligaspieler/31700-Ginczek.html" onclick="return(openSmallWindow('../../bundesligaspieler/31700-Ginczek.html','44f6'))" style="font-weight:normal;" target="_blank">
Ginczek
</a>
wechselt für 2.845.000 von Computer zu
<a href="playerInfo.phtml?pid=13059734" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059734','p_13059734'))" style="font-weight:normal;" target="_blank">
Max
</a>
.
<br/>
<a href="../../bundesligaspieler/32642-Embolo.html" onclick="return(openSmallWindow('../../bundesligaspieler/32642-Embolo.html','44f6'))" style="font-weight:normal;" target="_blank">
Embolo
</a>
wechselt für 6.640.000 von Computer zu
<a href="playerInfo.phtml?pid=13059329" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059329','p_13059329'))" style="font-weight:normal;" target="_blank">
Chrissi
</a>
.
<br/>
<br/>
<a href="../../bundesligaspieler/33109-J%C3%A4kel.html" onclick="return(openSmallWindow('../../bundesligaspieler/33109-J%C3%A4kel.html','44f6'))" style="font-weight:normal;" target="_blank">
Jäkel
</a>
wechselt für 220.000 von
<a href="playerInfo.phtml?pid=13059353" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059353','p_13059353'))" style="font-weight:normal;" target="_blank">
Thilo
</a>
zu Computer.
<br/>
<a href="../../bundesligaspieler/32632-Rapha%C3%ABl+Guerreiro.html" onclick="return(openSmallWindow('../../bundesligaspieler/32632-Rapha%C3%ABl+Guerreiro.html','44f6'))" style="font-weight:normal;" target="_blank">
Raphaël Guerreiro
</a>
wechselt für 3.640.000 von
<a href="playerInfo.phtml?pid=13059325" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059325','p_13059325'))" style="font-weight:normal;" target="_blank">
Malte
</a>
zu Computer.
<br/>
<br/>
</div>
</div>
So far I have only managed to extract the whole code:
import re
import requests
from bs4 import BeautifulSoup
r=requests.get("https://classic.comunio.de/login.phtml?login=USER&pass=PASSWORD")
soup = BeautifulSoup(r.text, 'lxml')
player_all = soup.find_all('a', href=re.compile('bundesligaspieler'))
As output, I would like to get something like this:
Füllkrug,4.7787.771,Computer,Marcel
Sergio Córdova,379.000,Computer,Thilo
J. Boateng,2.164.007,Computer,Marcel
Stindl,5.922.500,Nicolas,Computer
Answer 0 (score: 0)
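Are you actually getting the HTML you expect in r.text? Logging in with a GET request via requests.get does not look right. You need to send a POST request as shown below.
To extract the transfer details, I then iterate over all strings and try to match every pair of people with the transfer that took place between them.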
import csv
import re
from io import StringIO
from pprint import pprint
from typing import IO
import requests
from bs4 import BeautifulSoup
def get_report_html():
    res = requests.post('https://classic.comunio.de/login.phtml', data={
        "login": "your_username",
        "pass": "your_password",
        "action": "login",
        ">>+Login": "-1"
    })
    res.raise_for_status()
    return res.text
def parse_exchange_details(soup: BeautifulSoup) -> list:
    name_els = soup.select('.article_content_text a')
    person_names = [a.text.strip() for a in name_els]
    exchanges = []
    persons = []
    action = None
    amount = None
    for s in soup.stripped_strings:
        if s in person_names:
            persons.append(s)
        # determine exchange direction
        if 'von Computer zu' in s:
            action = 'withdraw'
        elif 'zu Computer' in s:
            action = 'deposit'
        # look for numbers such as 3.711.638 (raw string, also matches a single digit)
        m = re.search(r'(\d+(?:\.\d+)*)', s)
        if m:
            amount = m.group(1)
        # did we collect all exchange details
        if len(persons) == 2 and action and amount:
            p1, p2 = persons
            if action == 'deposit':
                from_, to = p2, 'computer'
            else:
                from_, to = 'computer', p2
            exc = {
                'who': p1,
                'amount': amount,
                'from': from_,
                'to': to
            }
            exchanges.append(exc)
            # reset for the next exchange
            persons = []
            action = None
            amount = None
    return exchanges
def write_csv(file: IO, report: list):
    if not report:
        return  # nothing to write, and report[0] would fail
    fields = list(report[0].keys())
    w = csv.DictWriter(file, fieldnames=fields)
    for item in report:
        w.writerow(item)
if __name__ == '__main__':
    html = '''
<div class="article_content2">
<div class="article_content_text">
<a>B. Hübner</a> wechselt für 3.711.638 von Computer zu <a>Marcel</a> .
<br/>
<a>Ginczek</a> wechselt für 2.845.000 von Computer zu <a>Max</a> .
<br/>
<a>Embolo</a> wechselt für 6.640.000 von Computer zu <a>Chrissi</a> .
<br/>
<br/>
<a>Jäkel</a> wechselt für 220.000 von <a>Thilo</a> zu Computer.
<br/>
<a>Raphaël Guerreiro</a> wechselt für 3.640.000 von <a>Malte</a> zu Computer.
<br/>
<br/>
</div>
</div>
'''
    soup = BeautifulSoup(html, 'html.parser')
    exchanges = parse_exchange_details(soup)
    pprint(exchanges, width=200)
    file = StringIO()
    # or `with open('filename.csv', 'w') as file:`
    write_csv(file, exchanges)
    file.seek(0)
    print(file.read())
Output:
[{'amount': '3.711.638', 'from': 'computer', 'to': 'Marcel', 'who': 'B. Hübner'},
{'amount': '2.845.000', 'from': 'computer', 'to': 'Max', 'who': 'Ginczek'},
{'amount': '6.640.000', 'from': 'computer', 'to': 'Chrissi', 'who': 'Embolo'},
{'amount': '220.000', 'from': 'Thilo', 'to': 'computer', 'who': 'Jäkel'},
{'amount': '3.640.000', 'from': 'Malte', 'to': 'computer', 'who': 'Raphaël Guerreiro'}]
B. Hübner,3.711.638,computer,Marcel
Ginczek,2.845.000,computer,Max
Embolo,6.640.000,computer,Chrissi
Jäkel,220.000,Thilo,computer
Raphaël Guerreiro,3.640.000,Malte,computer
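The amounts above stay strings that use '.' as a thousands separator, and the CSV goes to an in-memory buffer. If numeric amounts and a file on disk are wanted, something along these lines might work. It is only a sketch: `parse_amount` and `save_transfers` are hypothetical helper names, and it assumes every amount is a whole number without a decimal part.

```python
import csv


def parse_amount(text):
    """Turn a dotted amount such as '3.711.638' into the integer 3711638."""
    # assumption: '.' only ever separates thousands, never decimals
    return int(text.replace('.', ''))


def save_transfers(path, exchanges):
    """Write the exchange dicts to a UTF-8 CSV file with a header row."""
    fields = ['who', 'amount', 'from', 'to']
    # newline='' lets the csv module control line endings itself;
    # an explicit encoding keeps names like 'Hübner' intact on every platform
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for item in exchanges:
            row = dict(item, amount=parse_amount(item['amount']))
            writer.writerow(row)
```

Calling `save_transfers('transfers.csv', exchanges)` with the list returned by `parse_exchange_details` would then produce one row per transfer, with the amount stored as a plain integer.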
Answer 1 (score: 0)
import re
from pprint import pprint
from bs4 import BeautifulSoup

# html3 is assumed to hold the HTML sample shown further down
soup = BeautifulSoup(html3, 'html.parser')
name_els = soup.select('.article_content_text a')
person_names = [a.text.strip() for a in name_els]
exchanges = []
persons = []
action = None
amount = None
for s in soup.stripped_strings:
    if s in person_names:
        persons.append(s)
        continue
    # determine exchange direction
    if 'von Computer zu' in s:
        action = 'withdraw'
    elif 'zu Computer' in s:
        action = 'deposit'
    elif s.endswith('von'):
        # the seller is a person; stays 'swap' unless 'zu Computer.' follows
        action = 'swap'
    # look for numbers (also matches a bare 0)
    m = re.search(r'\d+(?:\.\d+)*', s)
    if m:
        amount = m.group(0)
    # every sentence ends with '.', so all details are collected by then
    if not s.endswith('.'):
        continue
    p1 = None
    if action == 'swap' and len(persons) == 3:
        p1, from_, to = persons
    elif action == 'deposit' and len(persons) == 2:
        p1, from_, to = persons[0], persons[1], 'computer'
    elif action == 'withdraw' and len(persons) == 2:
        p1, from_, to = persons[0], 'computer', persons[1]
    if p1:
        exc = {
            'who': p1,
            'amount': amount,
            'from': from_,
            'to': to
        }
        exchanges.append(exc)
    # reset for the next exchange
    persons = []
    action = None
    amount = None
pprint(exchanges, width=200)
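Since a transfer can also take place directly between two players, which I had forgotten at first, I tried to adapt the code. Here is a sample of the HTML for such a section.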
<div class="article_content_text">
<a href="../../bundesligaspieler/32780-Tolisso.html" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('../../bundesligaspieler/32780-Tolisso.html','7cbb'))">Tolisso</a> wechselt für 8.640.000 von Computer zu <a href="playerInfo.phtml?pid=13059329" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059329','p_13059329'))">Chrissi</a>.<br><a href="../../bundesligaspieler/32897-L%C3%B6wen.html" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('../../bundesligaspieler/32897-L%C3%B6wen.html','7cbb'))">Löwen</a> wechselt für 2.712.122 von Computer zu <a href="playerInfo.phtml?pid=13059337" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059337','p_13059337'))">Niklas</a>.<br><a href="../../bundesligaspieler/31740-Plattenhardt.html" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('../../bundesligaspieler/31740-Plattenhardt.html','7cbb'))">Plattenhardt</a> wechselt für 2.260.000 von Computer zu <a href="playerInfo.phtml?pid=13059734" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059734','p_13059734'))">Max</a>.<br><a href="../../bundesligaspieler/32845-Sancho.html" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('../../bundesligaspieler/32845-Sancho.html','7cbb'))">Sancho</a> wechselt für 14.118.000 von Computer zu <a href="playerInfo.phtml?pid=13059315" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059315','p_13059315'))">Dennis</a>.<br><br><a href="../../bundesligaspieler/32584-Demme.html" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('../../bundesligaspieler/32584-Demme.html','7cbb'))">Demme</a> wechselt für 2.603.700 von <a href="playerInfo.phtml?pid=13060984" target="_blank" style="font-weight:normal;" 
onclick="return(openSmallWindow('playerInfo.phtml?pid=13060984','p_13060984'))">Johannes</a> zu Computer.<br><a href="../../bundesligaspieler/33108-Stierlin.html" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('../../bundesligaspieler/33108-Stierlin.html','7cbb'))">Stierlin</a> wechselt für 163.200 von <a href="playerInfo.phtml?pid=13060984" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13060984','p_13060984'))">Johannes</a> zu Computer.<br><a href="../../bundesligaspieler/32374-Kosti%C4%87.html" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('../../bundesligaspieler/32374-Kosti%C4%87.html','7cbb'))">Kostić</a> wechselt für 7.068.600 von <a href="playerInfo.phtml?pid=13059315" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059315','p_13059315'))">Dennis</a> zu Computer.<br><a href="../../bundesligaspieler/31372-Hitz.html" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('../../bundesligaspieler/31372-Hitz.html','7cbb'))">Hitz</a> wechselt für 222.200 von <a href="playerInfo.phtml?pid=13060984" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13060984','p_13060984'))">Johannes</a> zu Computer.<br><br><a href="../../bundesligaspieler/33026-Kabak.html" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('../../bundesligaspieler/33026-Kabak.html','7cbb'))">Kabak</a> wechselt für 300.000 von <a href="playerInfo.phtml?pid=13059320" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059320','p_13059320'))">Marcel</a> zu <a href="playerInfo.phtml?pid=13060183" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13060183','p_13060183'))">Olé Sané</a>.<br><a href="../../bundesligaspieler/33096-Trimmel.html" target="_blank" 
style="font-weight:normal;" onclick="return(openSmallWindow('../../bundesligaspieler/33096-Trimmel.html','7cbb'))">Trimmel</a> wechselt für 0 von <a href="playerInfo.phtml?pid=13060183" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13060183','p_13060183'))">Olé Sané</a> zu <a href="playerInfo.phtml?pid=13059320" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059320','p_13059320'))">Marcel</a>.<br><a href="../../bundesligaspieler/32208-Dahoud.html" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('../../bundesligaspieler/32208-Dahoud.html','7cbb'))">Dahoud</a> wechselt für 0 von <a href="playerInfo.phtml?pid=13060183" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13060183','p_13060183'))">Olé Sané</a> zu <a href="playerInfo.phtml?pid=13059320" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059320','p_13059320'))">Marcel</a>.
</div>
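Another way to handle all three cases (Computer to person, person to Computer, person to person) may be to lean on the markup itself: in the sample, the footballer links point at `bundesligaspieler` pages, and each sentence runs from such a link to the next `<br>`. The following is a rough sketch of that idea rather than a tested solution for the live site. `parse_transfers` and `SENTENCE` are hypothetical names, and it assumes every sentence has the form "wechselt für AMOUNT von X zu Y." and that no name contains " zu ".

```python
import re
from bs4 import BeautifulSoup, Tag

# "wechselt für <amount> von <seller> zu <buyer>."
SENTENCE = re.compile(r'wechselt für (\d+(?:\.\d+)*) von (.+?) zu (.+?)\s*\.$')


def parse_transfers(html):
    """Return one dict per transfer sentence found in the markup."""
    soup = BeautifulSoup(html, 'html.parser')
    transfers = []
    # each footballer link (href contains 'bundesligaspieler') starts a sentence
    for link in soup.find_all('a', href=re.compile('bundesligaspieler')):
        parts = []
        for sib in link.next_siblings:
            if isinstance(sib, Tag) and sib.name == 'br':
                break  # <br> ends the sentence
            parts.append(sib.get_text() if isinstance(sib, Tag) else str(sib))
        # collapse newlines and repeated spaces into single spaces
        sentence = ' '.join(''.join(parts).split())
        m = SENTENCE.search(sentence)
        if m:
            amount, seller, buyer = m.groups()
            transfers.append({
                'who': link.get_text(strip=True),
                'amount': amount,
                'from': seller,
                'to': buyer,
            })
    return transfers
```

Because the seller and buyer are read from the sentence text, "Computer" needs no special case, and an amount of 0 is matched like any other number. Sentences that do not fit the assumed pattern are skipped silently, which may or may not be what you want.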