Scrape certain data from a column

Time: 2018-02-01 18:42:06

Tags: python python-3.x beautifulsoup

I need to scrape links, together with their text, from a table. There are two identical tables on the page, and I need to select the second one.

The data is only in the 3rd column of this table. Each cell contains 4 different links (<a href="link">link</a>), and I need only the 2nd link in each cell.

I was able to extract the 3rd column with this:

# The page has two tables with this class; take the second one.
table = soup.find_all(class_='bordercolor')[1]
rows = table.find_all('tr')

# Collect the third cell of every row, skipping the header row.
third_columns = []
for row in rows[1:]:
    third_columns.append(row.find_all('td')[2])

for third in third_columns:
    print(third.text)

EDIT: the example page I am trying to extract from is https://www.simplemachines.org/community/index.php?board=9.0 and I need to extract each subject together with its link.

1 answer:

Answer 0 (score: 1)

Try this. It should fetch the titles and their links from that table on that page:

import requests
from bs4 import BeautifulSoup

with requests.Session() as s:
    # Send a browser-like User-Agent while keeping the session's other default headers.
    s.headers.update({"User-Agent": "Mozilla/5.0"})
    res = s.get("https://www.simplemachines.org/community/index.php?board=9.0")
    soup = BeautifulSoup(res.text, 'lxml')
    table = soup.select("table.table_grid")[0]
    for items in table.select("tr"):
        # Take at most the first link inside the subject cell's msg_* element.
        data = [' , '.join([item.text, item['href']]) for item in items.select("td.subject [id^='msg_'] a")[:1]]
        print(data)

Partial output:

['SMF 1.1.x incompatibility with recent PHP versions (PHP5.5+) , https://www.simplemachines.org/community/index.php?P=e48c449722e0aebc62554d9fb6bb2a48&topic=534915.0']
['SMF Online Manual -- FAQs available , https://www.simplemachines.org/community/index.php?P=e48c449722e0aebc62554d9fb6bb2a48&topic=472542.0']
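
For the more general case described in the question (second of two same-class tables, 3rd column, 2nd link in each cell), a minimal offline sketch might look like the following. The SAMPLE_HTML markup and the second_links helper are hypothetical and only stand in for a real page; the 'bordercolor' class name is taken from the question's code, and the actual page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a real page: two tables with the
# same class, and four links in the third cell of each data row.
SAMPLE_HTML = """
<table class="bordercolor"><tr><td>first table</td></tr></table>
<table class="bordercolor">
  <tr><th>Icon</th><th>Status</th><th>Subject</th></tr>
  <tr><td></td><td></td><td>
    <a href="/a1">one</a> <a href="/topic1">First topic</a>
    <a href="/a3">three</a> <a href="/a4">four</a>
  </td></tr>
  <tr><td></td><td></td><td>
    <a href="/b1">one</a> <a href="/topic2">Second topic</a>
    <a href="/b3">three</a> <a href="/b4">four</a>
  </td></tr>
</table>
"""


def second_links(html, table_class="bordercolor"):
    """Return (text, href) for the 2nd link in the 3rd column of the 2nd table."""
    soup = BeautifulSoup(html, "html.parser")
    # Index 1 selects the second table carrying the given class.
    table = soup.find_all("table", class_=table_class)[1]
    results = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        cells = row.find_all("td")
        if len(cells) < 3:
            continue  # row has no third column
        links = cells[2].find_all("a", href=True)
        if len(links) < 2:
            continue  # cell has no second link
        link = links[1]
        results.append((link.get_text(strip=True), link["href"]))
    return results


for text, href in second_links(SAMPLE_HTML):
    print(text, href)
```

The length checks skip rows that lack a third cell or a second link, so an irregular row does not raise an IndexError the way a bare [2] or [1] index would.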