I need to scrape links, together with their text, from a table. There are two identical tables on the page, and I need to select the second one.
The data I want is only in the 3rd column of this table.
Each cell contains 4 different links of the form <a href="link">link</a>,
and I need only the 2nd link in each cell.
I was able to extract the 3rd column with this:
table = soup.find_all(class_='bordercolor')[1]
rows = table.find_all('tr')
first_columns = []
third_columns = []
for row in rows[1:]:
    third_columns.append(row.find_all('td')[2])
for third in third_columns:
    print(third.text)
EDIT: Example page I am trying to extract from: https://www.simplemachines.org/community/index.php?board=9.0 (I need to extract each subject together with its link.)
Answer 0 (score: 1)
Try this. It should fetch the titles and their links from that table on that page:
import requests
from bs4 import BeautifulSoup

with requests.Session() as s:
    # Send a browser-like User-Agent with the request
    s.headers = {"User-Agent":"Mozilla/5.0"}
    res = s.get("https://www.simplemachines.org/community/index.php?board=9.0")
    soup = BeautifulSoup(res.text, 'lxml')
    # First table with class "table_grid" holds the topic list
    table = soup.select("table.table_grid")[0]
    for items in table.select("tr"):
        # Take only the first matching subject link in each row
        data = [' , '.join([item.text,item['href']]) for item in items.select("td.subject [id^='msg_'] a")[:1]]
        print(data)
Partial output:
['SMF 1.1.x incompatibility with recent PHP versions (PHP5.5+) , https://www.simplemachines.org/community/index.php?P=e48c449722e0aebc62554d9fb6bb2a48&topic=534915.0']
['SMF Online Manual -- FAQs available , https://www.simplemachines.org/community/index.php?P=e48c449722e0aebc62554d9fb6bb2a48&topic=472542.0']
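For the general case described in the question (second of two tables with the same class, third column, second link in each cell), something along these lines may work. This is only a minimal sketch: the HTML below is made-up markup standing in for the real page, the helper name second_links is hypothetical, and the built-in "html.parser" is used instead of lxml so that nothing beyond bs4 is required. Rows or cells that do not have the expected shape are skipped rather than raising an IndexError.

```python
from bs4 import BeautifulSoup

# Made-up markup (an assumption, not the real page): two tables sharing a class,
# data in the 3rd column, four links per cell.
HTML = """
<table class="bordercolor">
<tr><th>A</th><th>B</th><th>C</th></tr>
<tr><td>x</td><td>y</td><td><a href="/skip1">s1</a> <a href="/skip2">s2</a></td></tr>
</table>
<table class="bordercolor">
<tr><th>A</th><th>B</th><th>C</th></tr>
<tr><td>1</td><td>2</td><td><a href="/a1">a1</a> <a href="/t1">Topic one</a> <a href="/a3">a3</a> <a href="/a4">a4</a></td></tr>
<tr><td>3</td><td>4</td><td><a href="/b1">b1</a> <a href="/t2">Topic two</a> <a href="/b3">b3</a> <a href="/b4">b4</a></td></tr>
</table>
"""


def second_links(html, table_class="bordercolor"):
    """Return (text, href) for the 2nd link in the 3rd column of the 2nd table."""
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.find_all("table", class_=table_class)
    if len(tables) < 2:
        return []  # no second table on the page
    results = []
    for row in tables[1].find_all("tr")[1:]:  # skip the header row
        cells = row.find_all("td")
        if len(cells) < 3:
            continue  # row has no 3rd column
        links = cells[2].find_all("a", href=True)
        if len(links) < 2:
            continue  # cell has no 2nd link
        link = links[1]
        results.append((link.get_text(strip=True), link["href"]))
    return results


for text, href in second_links(HTML):
    print(text, href)
```

On a live page you would pass res.text instead of HTML and adjust the class name and the column/link positions to whatever the actual markup uses.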