Question

I'm trying to scrape stock ticker symbols from a web page whose source looks like this:

<thead>
                            <tr>
                                <th>Company</th>
                                 <th>Symbol</th>
                                 <th>Weight</th>
                        </tr>
                    </thead>


                    <tbody>

                        <tr>
                            <td><a href="http://www.google.com/finance?q=AAPL">Apple Inc.</a></td>
                            <td><form action="/charts" method="post"> <div><input type="hidden" name="symbol" value="AAPL"/> <input type="submit" value="AAPL"/> </div></form></td>
                            <td>3.635302</td>
                        </tr>

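So far, my Python code (below) only writes the company name ("Apple Inc.") and the weight (3.635) to a CSV file, but I also want to include the ticker symbol 'AAPL'. On the site, the symbol is formatted as a hyperlink, and I'm not sure how to scrape that data.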

import requests
from bs4 import BeautifulSoup

url = "http://slickcharts.com/sp500"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html5lib")

# First table on the page; skip the header row
table = soup.find_all('table')[0]
rows = table.find_all('tr')[1:]

data = {
    'Company': [],
    'Symbol': [],
    'Weight': []
}

for row in rows:
    cols = row.find_all('td')
    data['Company'].append(cols[0].get_text())
    data['Symbol'].append(cols[1].get_text())  # this is the column that comes back empty
    data['Weight'].append(cols[2].get_text())

Answer 1

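cols[1].get_text() contains nothing, because the symbol is not stored as text: it sits in the value attribute of the <input> elements inside the form.

You need to read that attribute instead:

data['Symbol'].append(cols[1].find('input')['value'])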
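
To check this without hitting the site, here is a minimal sketch that parses one row copied from the question's page source. It assumes the built-in "html.parser" backend rather than the "html5lib" parser used in the question, and the name ROW_HTML is made up for this example.

```python
from bs4 import BeautifulSoup

# One row copied from the page source in the question (ROW_HTML is a made-up name).
ROW_HTML = """
<table><tbody>
<tr>
  <td><a href="http://www.google.com/finance?q=AAPL">Apple Inc.</a></td>
  <td><form action="/charts" method="post"> <div><input type="hidden" name="symbol" value="AAPL"/> <input type="submit" value="AAPL"/> </div></form></td>
  <td>3.635302</td>
</tr>
</tbody></table>
"""

# "html.parser" ships with Python; the question used "html5lib" instead.
soup = BeautifulSoup(ROW_HTML, "html.parser")
cols = soup.find("tr").find_all("td")

# The cell holds only whitespace between tags, so the stripped text is empty.
text_in_cell = cols[1].get_text(strip=True)

# The symbol is an attribute of the first <input>, so read it by key.
symbol = cols[1].find("input")["value"]

print(repr(text_in_cell))
print(symbol)
```

Both <input> elements in that cell carry the same value, so taking the first one returned by find('input') is enough here.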

Answer 2

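You can get the symbol by finding the <a> tag and reading its href attribute, as shown below. Splitting the link on "=" then gives a list whose second element is the required AAPL.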

import requests
from bs4 import BeautifulSoup

url = "http://slickcharts.com/sp500"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html5lib")

# First table on the page; skip the header row
table = soup.find_all('table')[0]
rows = table.find_all('tr')[1:]

data = {
    'Company': [],
    'Symbol': [],
    'Weight': [],
    'q': []
}
for row in rows:
    cols = row.find_all('td')
    data['Company'].append(cols[0].get_text())
    data['Symbol'].append(cols[1].get_text())  # unchanged from the question; the symbol is collected in 'q'
    data['Weight'].append(cols[2].get_text())
    # href looks like ".../finance?q=AAPL"; the part after "=" is the symbol
    data['q'].append(cols[0].find("a").get("href").split("=")[1])
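
One caveat with split("="): it relies on q being the only parameter in the link. If the href ever carried another parameter first (for example a hypothetical "?hl=en&q=AAPL"), index 1 would be "en&q". A slightly more defensive sketch, using only the standard library, could look like this; symbol_from_href is a helper name invented for this example.

```python
from urllib.parse import urlparse, parse_qs


def symbol_from_href(href):
    """Return the value of the `q` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(href).query)  # e.g. {"q": ["AAPL"]}
    values = query.get("q")
    return values[0] if values else None


print(symbol_from_href("http://www.google.com/finance?q=AAPL"))
```

Inside the loop it would be used as data['q'].append(symbol_from_href(cols[0].find("a").get("href"))).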

Scraping an HTML web page with Python

2 answers: