Scraping one more field from a web page

Time: 2021-04-08 00:42:18

Tags: python selenium web-scraping beautifulsoup

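My code opens a web page and pulls certain data from each row.

However, I would also like to get the "topics" from each row. For example, row 1 lists "Presidential Sessions and Community Psychiatry" above the speaker text.

My code currently scrapes the Titles and Chairs of each row (shown as Role and Name), but it does not scrape the topics. How can I get them?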

import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

# Start one Chrome session and reuse it for the whole script.
driver = webdriver.Chrome()
driver.get('https://s7.goeshow.com/apa/annual/2021/session_search.cfm?_ga=2.259773066.1015449088.1617295032-97934194.1617037074')
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')

tables = soup.select('#datatable')
for table in tables:
    for title in table.select('tr td.title'):
        print(title.text.strip())
        title_row = title.parent
        speaker_row = title_row.next_sibling
        for speaker in speaker_row.select('span.session-speaker'):
            role = speaker.select_one('span.session-speaker-role').text.strip()
            name = speaker.select_one('span.session-speaker-name').text.strip()
            topic = speaker.select_one('span.session-track-label').text.strip()
            print(role, name, topic)

        print()

2 answers:

Answer 0: (score: 1)

tables = soup.select('#datatable')
for table in tables:
    for title in table.select('tr td.title'):
        print(title.text.strip())
        title_row = title.parent
        speaker_row = title_row.next_sibling
        for topic in speaker_row.select('span.session-track-label'):
            print(topic.text.strip())
        for speaker in speaker_row.select('span.session-speaker'):
            role = speaker.select_one('span.session-speaker-role').text.strip()
            name = speaker.select_one('span.session-speaker-name').text.strip()
            
            print(role, name)

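If you want all the topics printed before the names and roles, you have to select them from the row itself (speaker_row), not from inside each span.session-speaker element.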
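The row-level selection described above can be tried offline on a small HTML fragment. This is only a sketch: the markup in SAMPLE_HTML is invented for illustration and the real page may be structured differently, and extract_sessions is a hypothetical helper name. It also uses find_next_sibling('tr') rather than next_sibling, on the assumption that whitespace text nodes may sit between rows.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration; the real page may differ.
SAMPLE_HTML = """
<table id="datatable">
  <tr><td class="title">Session A</td></tr>
  <tr><td>
    <span class="session-track-label">Topic One</span>
    <span class="session-track-label">Topic Two</span>
    <span class="session-speaker">
      <span class="session-speaker-role">Chair:</span>
      <span class="session-speaker-name">Jane Doe</span>
    </span>
  </td></tr>
</table>
"""


def extract_sessions(html):
    """Return one dict per session with its title, topics and speakers."""
    soup = BeautifulSoup(html, "html.parser")
    sessions = []
    for title in soup.select("#datatable tr td.title"):
        # find_next_sibling skips whitespace text nodes between <tr> tags.
        detail_row = title.parent.find_next_sibling("tr")
        if detail_row is None:
            continue
        # Topics are selected from the row, not from each speaker span.
        topics = [
            label.get_text(strip=True)
            for label in detail_row.select("span.session-track-label")
        ]
        speakers = []
        for speaker in detail_row.select("span.session-speaker"):
            role = speaker.select_one("span.session-speaker-role")
            name = speaker.select_one("span.session-speaker-name")
            speakers.append((
                role.get_text(strip=True) if role else "",
                name.get_text(strip=True) if name else "",
            ))
        sessions.append({
            "title": title.get_text(strip=True),
            "topics": topics,
            "speakers": speakers,
        })
    return sessions


sessions = extract_sessions(SAMPLE_HTML)
for session in sessions:
    print(session["title"], session["topics"], session["speakers"])
```

Returning a list of dicts instead of printing keeps the topics attached to their session, which may be convenient if the results are later loaded into pandas.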

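Answer 1: (score: 1)

I think the line below only yields the "role" and the "name".

In other words, "span.session-speaker" only contains "span.session-speaker-role" and "span.session-speaker-name", so the topic label cannot be found inside it.

for speaker in speaker_row.select('span.session-speaker'):

You can try the following code.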

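# Collect the topic labels once per row; they sit next to the speakers, not inside them.
L_topics = [topic.text.strip() for topic in speaker_row.select('span.session-track-label')]
for speaker in speaker_row.select('td.session-divider-line'):
    role = speaker.select_one('span.session-speaker-role').text.strip()
    name = speaker.select_one('span.session-speaker-name').text.strip()
    # Join the topics instead of indexing [0] and [1], so a row with
    # fewer than two topics does not raise IndexError.
    print(role, name, ', '.join(L_topics))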
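To illustrate why the lookup in the question fails, the sketch below runs the same select_one call against a hypothetical fragment in which the topic label is a sibling of the speaker span rather than a child. The markup is an assumption made up for this example. select_one returns None when nothing matches, so calling .text on the result raises AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the topic label sits beside the speaker span.
html = (
    '<td>'
    '<span class="session-track-label">Topic One</span>'
    '<span class="session-speaker">'
    '<span class="session-speaker-role">Chair:</span>'
    '<span class="session-speaker-name">Jane Doe</span>'
    '</span>'
    '</td>'
)
cell = BeautifulSoup(html, "html.parser")
speaker = cell.select_one("span.session-speaker")

# Searching inside the speaker span finds no topic label.
missing = speaker.select_one("span.session-track-label")

error_message = None
try:
    missing.text  # attribute access on None
except AttributeError as exc:
    error_message = str(exc)

# Searching from the enclosing cell finds the label.
found = cell.select_one("span.session-track-label")

print(missing)
print(error_message)
print(found.get_text(strip=True))
```

Checking the result of select_one for None before reading .text is a simple way to avoid this error when an element may be absent.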