如何在curand文档中从<a> in Python

时间:2016-10-23 01:07:53

标签: python web-scraping beautifulsoup

I'm just getting started with web scraping and I'm trying to pull specific links from a webpage. I've successfully made a list of the <a> classes (not sure what those things are actually called) that contain the hrefs for the links, but my problem is that they also contain other noise I don't want. How can I isolate just the href part of it? Here's the code:

data2015 = requests.get('https://www.tabroom.com/index/index.mhtml?country=&year=2015&month=')
data2015 = BeautifulSoup(data2015.content, 'lxml')
data2015 = data2015.find_all('tr')
for tr in range(len(data2015)): 
    data2015[tr] = data2015[tr].find_all('td')

relevantData = [0,2]
for tr in range(len(data2015)):
    try:
        data2015[tr] = [data2015[tr][i] for i in relevantData]
    except:
        pass
    for td in range(len(data2015[tr])):
        data2015[tr][td] = [data2015[tr][td].get_text().strip(), data2015[tr][td].find_all('a')]

To clarify, data2015 is now a list of lists containing two elements, the second of which's (also a list of two elements) second element is the <a> thing containing the link, but also other stuff I don't want. That element looks like this:

[<a class="white smallish nearfull" href="tourn/index.mhtml?tourn_id=4445">\n\t\t\t\t\t\t\tGranite District Novice Imp Spar and Congress\n\t\t\t\t\t\t</a>]

How can I finish cleaning this so that I get just the link in a way that I can then open that with BSoup?

1 个答案:

答案 0 :(得分:0)

也许是这样的,假设你只是试图将表格中每行的文本/链接拉到某个列表中的自己的行,这是你的数据集。

这将找到行中的所有链接,并将text / href放在每行的列表中的两元组中。因此,您的数据集将包含每行的列表,其中包含该行中每个锚点的两元组元素,每个元组中包含text / href。

 from bs4 import BeautifulSoup as BS
import requests 

response = requests.get('https://www.tabroom.com/index/index.mhtml?country=&year=2015&month=')
soup = BS(response.content, "html.parser")
trs = soup.find('table', {'id': 'tournlist'}).find_all('tr')
dataset = [
    [(y.text.strip(), y['href']) for y in x.find_all("a")] for x in trs
]
import pprint
pprint.pprint(dataset)