Not sure if what I'm trying to do is possible... but here goes. I'm trying to navigate through and scrape information from this table (simplified)...
<tr class="transaction odd" id="transaction_7"><td><a href="/show_customer/11111">Erin</a></td></tr>
<tr class="transaction even" id="transaction_6"><td><a href="/show_customer/2222">Jack</a></td></tr>
<tr class="transaction odd" id="transaction_5"><td><a href="/show_customer/3333">Carl</a></td></tr>
<tr class="transaction even" id="transaction_4"><td><a href="/show_customer/4444">Kelly</a></td></tr>
Here is the code I use to scrape the table and output it to a csv... it works great.
columns = ["User Name", "Source", "Staff", "Location", "Attended On", "Used", "Date"]
table = []
# table_1 is the BeautifulSoup Tag for the table, obtained earlier
for row in table_1.find_all('tr'):
    tds = row.find_all('td')
    try:
        data = [td.get_text() for td in tds]
        for field, value in zip(columns, data):
            print("{}: {}".format(field, value))
        table.append(data)
    except:
        print("Bad string value")

import csv
with open("myfile.csv", "wb") as outf:
    outcsv = csv.writer(outf)
    # header row
    outcsv.writerow(columns)
    # data
    outcsv.writerows(table)
What I need to do is navigate to each link in the table
<a href="/show_customer/11111">Erin</a>
and grab the customer email address from this html
<div class="field">
  <div class="label">Email</div>
  <p>XXXX@email.com</p>
</div>
and add it to the relevant row in my csv.
Any help is greatly appreciated!
Answer 0 (score: 1)
You have to make an http request for the href in each td. Here is how you can modify your existing code:
from urllib2 import urlopen
from bs4 import BeautifulSoup

for row in table_1.find_all('tr'):
    tds = row.find_all('td')
    # Get all the hrefs to make http requests to
    links = [a.get('href') for a in row.find_all('a')]
    try:
        data = [td.get_text() for td in tds]
        for field, value in zip(columns, data):
            print("{}: {}".format(field, value))
        # For every href make a request, get the page,
        # and create a BS object for it
        for link in links:
            link_soup = BeautifulSoup(urlopen(link))
            # Use the link_soup BS instance to get the email
            # by navigating the div and p, and add it to your data
        table.append(data)
    except:
        print("Bad string value")
Note that your hrefs are relative to the site's url, so after you extract an href you have to prepend the site's url to it to form a valid url.
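For example, with the standard library's urljoin (base_url here is a hypothetical placeholder for the real site's address):

from urlparse import urljoin  # urllib.parse in Python 3

base_url = "http://www.example.com"  # hypothetical; substitute the actual site
for link in links:
    full_url = urljoin(base_url, link)
    # e.g. "/show_customer/11111" -> "http://www.example.com/show_customer/11111"
    link_soup = BeautifulSoup(urlopen(full_url))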