Problem extracting the correct information with BeautifulSoup

Time: 2019-07-16 16:19:16

Tags: python web-scraping

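I want to use BeautifulSoup to extract specific information from a website, but I have not found the right way to do it yet. The website shows the following information: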

B. Hübner wechselt für 3.711.638 von Computer zu Marcel.

Ginczek wechselt für 2.845.000 von Computer zu Max.

Embolo wechselt für 6.640.000 von Computer zu Chrissi.

Jäkel wechselt für 220.000 von Thilo zu Computer.

Raphaël Guerreiro wechselt für 3.640.000 von Malte zu Computer.

In the page source it looks like this:

<div class="article_content2">
 <div class="article_content_text">
  <a href="../../bundesligaspieler/32426-B.+H%C3%BCbner.html" onclick="return(openSmallWindow('../../bundesligaspieler/32426-B.+H%C3%BCbner.html','44f6'))" style="font-weight:normal;" target="_blank">
   B. Hübner
  </a>
  wechselt für 3.711.638 von Computer zu
  <a href="playerInfo.phtml?pid=13059320" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059320','p_13059320'))" style="font-weight:normal;" target="_blank">
   Marcel
  </a>
  .
  <br/>
  <a href="../../bundesligaspieler/31700-Ginczek.html" onclick="return(openSmallWindow('../../bundesligaspieler/31700-Ginczek.html','44f6'))" style="font-weight:normal;" target="_blank">
   Ginczek
  </a>
  wechselt für 2.845.000 von Computer zu
  <a href="playerInfo.phtml?pid=13059734" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059734','p_13059734'))" style="font-weight:normal;" target="_blank">
   Max
  </a>
  .
  <br/>
  <a href="../../bundesligaspieler/32642-Embolo.html" onclick="return(openSmallWindow('../../bundesligaspieler/32642-Embolo.html','44f6'))" style="font-weight:normal;" target="_blank">
   Embolo
  </a>
  wechselt für 6.640.000 von Computer zu
  <a href="playerInfo.phtml?pid=13059329" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059329','p_13059329'))" style="font-weight:normal;" target="_blank">
   Chrissi
  </a>
  .
  <br/>
  <br/>
  <a href="../../bundesligaspieler/33109-J%C3%A4kel.html" onclick="return(openSmallWindow('../../bundesligaspieler/33109-J%C3%A4kel.html','44f6'))" style="font-weight:normal;" target="_blank">
   Jäkel
  </a>
  wechselt für 220.000 von
  <a href="playerInfo.phtml?pid=13059353" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059353','p_13059353'))" style="font-weight:normal;" target="_blank">
   Thilo
  </a>
  zu Computer.
  <br/>
  <a href="../../bundesligaspieler/32632-Rapha%C3%ABl+Guerreiro.html" onclick="return(openSmallWindow('../../bundesligaspieler/32632-Rapha%C3%ABl+Guerreiro.html','44f6'))" style="font-weight:normal;" target="_blank">
   Raphaël Guerreiro
  </a>
  wechselt für 3.640.000 von
  <a href="playerInfo.phtml?pid=13059325" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059325','p_13059325'))" style="font-weight:normal;" target="_blank">
   Malte
  </a>
  zu Computer.
  <br/>
  <br/>
 </div>
</div>

So far I have only managed to extract the whole markup:

import re  # needed for re.compile below

import requests
from bs4 import BeautifulSoup

r=requests.get("https://classic.comunio.de/login.phtml?login=USER&pass=PASSWORD")

soup = BeautifulSoup(r.text, 'lxml')

player_all = soup.find_all('a', href=re.compile('bundesligaspieler'))

As output, I would like to get something like this:

Füllkrug,4.7787.771,Computer,Marcel

Sergio Córdova,379.000,Computer,Thilo

J. Boateng,2.164.007,Computer,Marcel

Stindl,5.922.500,Nikolas,Computer

2 answers:

Answer 0: (score: 0)

Are you getting the HTML you expect in r.text? Logging in with a GET request through requests.get does not look right. You need to send a POST request, as shown below.

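Then, to extract the exchange details, I iterate over all the strings in the document and try to pair every two persons with whatever exchange took place between them.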

import csv
import re
from io import StringIO
from pprint import pprint
from typing import IO

import requests
from bs4 import BeautifulSoup


def get_report_html():
    res = requests.post('https://classic.comunio.de/login.phtml', data={
        "login": "your_username",
        "pass": "your_password",
        "action": "login",
        ">>+Login": "-1"
    })
    res.raise_for_status()
    return res.text


def parse_exchange_details(soup: BeautifulSoup) -> list:
    name_els = soup.select('.article_content_text a')
    person_names = [a.text.strip() for a in name_els]

    exchanges = []

    persons = []
    action = None
    amount = None
    for s in soup.stripped_strings:
        if s in person_names:
            persons.append(s)

        # determine exchange direction
        if 'von Computer zu' in s:
            action = 'withdraw'
        elif 'zu Computer' in s:
            action = 'deposit'

        # look for numbers
        # raw string avoids the invalid-escape warning; `*` also accepts a single-digit amount such as 0
        m = re.search(r'(\d[\d.]*)', s)
        if m:
            amount = m.group(1)

        # did we collect all exchange details
        if len(persons) == 2 and action and amount:
            p1, p2 = persons
            if action == 'deposit':
                from_, to = p2, 'computer'
            else:
                from_, to = 'computer', p2

            exc = {
                'who': p1,
                'amount': amount,
                'from': from_,
                'to': to
            }
            exchanges.append(exc)

            # reset for the next exchange
            persons = []
            action = None
            amount = None
    return exchanges

def write_csv(file: IO, report: list):
    fields = list(report[0].keys())
    w = csv.DictWriter(file, fieldnames=fields)
    for item in report:
        w.writerow(item)

if __name__ == '__main__':
    html = '''
<div class="article_content2">
 <div class="article_content_text">
  <a>B. Hübner</a> wechselt für 3.711.638 von Computer zu <a>Marcel</a> .
  <br/>
  <a>Ginczek</a> wechselt für 2.845.000 von Computer zu <a>Max</a> .
  <br/>
  <a>Embolo</a> wechselt für 6.640.000 von Computer zu <a>Chrissi</a> .
  <br/>
  <br/>
  <a>Jäkel</a> wechselt für 220.000 von <a>Thilo</a> zu Computer.
  <br/>
  <a>Raphaël Guerreiro</a> wechselt für 3.640.000 von <a>Malte</a> zu Computer.
  <br/>
  <br/>
 </div>
</div>
    '''
    soup = BeautifulSoup(html, 'html.parser')
    exchanges = parse_exchange_details(soup)
    pprint(exchanges, width=200)

    file = StringIO()
    # or `with open('filename.csv', 'w', newline='') as file:`
    write_csv(file, exchanges)
    file.seek(0)
    print(file.read())

Output:

[{'amount': '3.711.638', 'from': 'computer', 'to': 'Marcel', 'who': 'B. Hübner'},
 {'amount': '2.845.000', 'from': 'computer', 'to': 'Max', 'who': 'Ginczek'},
 {'amount': '6.640.000', 'from': 'computer', 'to': 'Chrissi', 'who': 'Embolo'},
 {'amount': '220.000', 'from': 'Thilo', 'to': 'computer', 'who': 'Jäkel'},
 {'amount': '3.640.000', 'from': 'Malte', 'to': 'computer', 'who': 'Raphaël Guerreiro'}]

B. Hübner,3.711.638,computer,Marcel
Ginczek,2.845.000,computer,Max
Embolo,6.640.000,computer,Chrissi
Jäkel,220.000,Thilo,computer
Raphaël Guerreiro,3.640.000,Malte,computer
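
The amounts above stay as strings such as '3.711.638'. If you want to calculate with them, a small helper along these lines might do. It is only a sketch: parse_amount is a hypothetical name, not part of the code above, and it assumes the dots are always thousands separators and never a decimal point.

```python
def parse_amount(text):
    """Convert an amount such as '3.711.638' to the integer 3711638.

    Assumption: every dot is a thousands separator (German notation),
    so the dots can simply be dropped.
    """
    return int(text.replace(".", ""))


# Sample rows in the same shape as the parser's result above
exchanges = [
    {"who": "B. Hübner", "amount": "3.711.638", "from": "computer", "to": "Marcel"},
    {"who": "Jäkel", "amount": "220.000", "from": "Thilo", "to": "computer"},
]

# Total value of all listed transfers
total = sum(parse_amount(e["amount"]) for e in exchanges)
print(total)
```

Keeping the original string in the report and converting only where a number is needed means the CSV still shows the amounts as they appear on the site.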

Answer 1: (score: 0)

import re
from pprint import pprint

from bs4 import BeautifulSoup

# html3 is assumed to hold the HTML fragment shown below
soup = BeautifulSoup(html3, 'html.parser')
name_els = soup.select('.article_content_text a')
person_names = [a.text.strip() for a in name_els]
exchanges = []

persons = []
action = None
amount = None
for s in soup.stripped_strings:
    if s in person_names:
        persons.append(s)
    else:
        # determine exchange direction
        if 'von Computer zu' in s:
            action = 'withdraw'
        elif 'zu Computer' in s:
            action = 'deposit'
        elif s.endswith('von'):
            # strings are stripped, so there is no trailing space after "von";
            # a later "zu Computer." turns this into a deposit
            action = 'swap'

        # look for numbers (`*` so that an amount of 0 is matched too)
        m = re.search(r'(\d[\d.]*)', s)
        if m:
            amount = m.group(1)

    # a Computer transfer involves two names, a swap between two people three
    needed = 3 if action == 'swap' else 2

    # did we collect all exchange details
    if action and amount is not None and len(persons) == needed:
        if action == 'withdraw':
            from_, to = 'computer', persons[1]
        elif action == 'deposit':
            from_, to = persons[1], 'computer'
        else:
            from_, to = persons[1], persons[2]

        exc = {
            'who': persons[0],
            'amount': amount,
            'from': from_,
            'to': to
        }
        exchanges.append(exc)

        # reset for the next exchange
        persons = []
        action = None
        amount = None

pprint(exchanges, width=200)

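Since an exchange can also take place between two players, which I had forgotten about at first, I tried to modify the code accordingly. Here is a sample of the HTML for one such section.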

<div class="article_content_text">
            <a href="../../bundesligaspieler/32780-Tolisso.html" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('../../bundesligaspieler/32780-Tolisso.html','7cbb'))">Tolisso</a> wechselt für 8.640.000 von Computer zu <a href="playerInfo.phtml?pid=13059329" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059329','p_13059329'))">Chrissi</a>.<br><a href="../../bundesligaspieler/32897-L%C3%B6wen.html" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('../../bundesligaspieler/32897-L%C3%B6wen.html','7cbb'))">Löwen</a> wechselt für 2.712.122 von Computer zu <a href="playerInfo.phtml?pid=13059337" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059337','p_13059337'))">Niklas</a>.<br><a href="../../bundesligaspieler/31740-Plattenhardt.html" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('../../bundesligaspieler/31740-Plattenhardt.html','7cbb'))">Plattenhardt</a> wechselt für 2.260.000 von Computer zu <a href="playerInfo.phtml?pid=13059734" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059734','p_13059734'))">Max</a>.<br><a href="../../bundesligaspieler/32845-Sancho.html" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('../../bundesligaspieler/32845-Sancho.html','7cbb'))">Sancho</a> wechselt für 14.118.000 von Computer zu <a href="playerInfo.phtml?pid=13059315" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059315','p_13059315'))">Dennis</a>.<br><br><a href="../../bundesligaspieler/32584-Demme.html" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('../../bundesligaspieler/32584-Demme.html','7cbb'))">Demme</a> wechselt für 2.603.700 von <a href="playerInfo.phtml?pid=13060984" target="_blank" style="font-weight:normal;" 
onclick="return(openSmallWindow('playerInfo.phtml?pid=13060984','p_13060984'))">Johannes</a> zu Computer.<br><a href="../../bundesligaspieler/33108-Stierlin.html" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('../../bundesligaspieler/33108-Stierlin.html','7cbb'))">Stierlin</a> wechselt für 163.200 von <a href="playerInfo.phtml?pid=13060984" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13060984','p_13060984'))">Johannes</a> zu Computer.<br><a href="../../bundesligaspieler/32374-Kosti%C4%87.html" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('../../bundesligaspieler/32374-Kosti%C4%87.html','7cbb'))">Kostić</a> wechselt für 7.068.600 von <a href="playerInfo.phtml?pid=13059315" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059315','p_13059315'))">Dennis</a> zu Computer.<br><a href="../../bundesligaspieler/31372-Hitz.html" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('../../bundesligaspieler/31372-Hitz.html','7cbb'))">Hitz</a> wechselt für 222.200 von <a href="playerInfo.phtml?pid=13060984" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13060984','p_13060984'))">Johannes</a> zu Computer.<br><br><a href="../../bundesligaspieler/33026-Kabak.html" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('../../bundesligaspieler/33026-Kabak.html','7cbb'))">Kabak</a> wechselt für 300.000 von <a href="playerInfo.phtml?pid=13059320" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059320','p_13059320'))">Marcel</a> zu <a href="playerInfo.phtml?pid=13060183" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13060183','p_13060183'))">Olé Sané</a>.<br><a href="../../bundesligaspieler/33096-Trimmel.html" target="_blank" 
style="font-weight:normal;" onclick="return(openSmallWindow('../../bundesligaspieler/33096-Trimmel.html','7cbb'))">Trimmel</a> wechselt für 0 von <a href="playerInfo.phtml?pid=13060183" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13060183','p_13060183'))">Olé Sané</a> zu <a href="playerInfo.phtml?pid=13059320" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059320','p_13059320'))">Marcel</a>.<br><a href="../../bundesligaspieler/32208-Dahoud.html" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('../../bundesligaspieler/32208-Dahoud.html','7cbb'))">Dahoud</a> wechselt für 0 von <a href="playerInfo.phtml?pid=13060183" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13060183','p_13060183'))">Olé Sané</a> zu <a href="playerInfo.phtml?pid=13059320" target="_blank" style="font-weight:normal;" onclick="return(openSmallWindow('playerInfo.phtml?pid=13059320','p_13059320'))">Marcel</a>.
            </div>
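
As an alternative to tracking state across the strings, all three sentence forms in this sample (Computer to person, person to Computer, person to person) could probably be handled by one regular expression applied per sentence. This is only a sketch using the standard re module: TRANSFER_RE and parse_transfer are hypothetical names, and it assumes you already have one transfer sentence per string (for example by splitting the text of the container at each <br>).

```python
import re

# One pattern for all three forms; "Computer" is treated as just another
# possible value of the from/to fields.
TRANSFER_RE = re.compile(
    r"^(?P<who>.+?) wechselt für (?P<amount>\d[\d.]*) "
    r"von (?P<from_>.+?) zu (?P<to>.+?)\s*\.$"
)


def parse_transfer(sentence):
    """Return (who, amount, from, to), or None if the sentence does not match."""
    # Collapse line breaks and repeated spaces left over from the markup
    normalized = " ".join(sentence.split())
    m = TRANSFER_RE.match(normalized)
    if m is None:
        return None
    return m.group("who", "amount", "from_", "to")


sentences = [
    "Tolisso wechselt für 8.640.000 von Computer zu Chrissi.",
    "Demme wechselt für 2.603.700 von Johannes zu Computer.",
    "Kabak wechselt für 300.000 von Marcel zu Olé Sané.",
    "Trimmel wechselt für 0 von Olé Sané zu Marcel.",
]
for sentence in sentences:
    print(parse_transfer(sentence))
```

The pattern relies on the fixed words "wechselt für", "von" and "zu", so a name that itself contains " zu " would be split in the wrong place.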