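Can anyone suggest how to retrieve the href and img values from a <td>? I wrote the code below, which produces the output shown further down. I can retrieve the text values of the <td> cells, but I'm not sure how to go further.
Please note that there are many <tr> rows; I've shown only a couple as examples.
My code: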
from bs4 import BeautifulSoup
import urllib2

url = "http://mywebsite.com/"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
records = []
tabledata = soup.find("table", {"class": "class1"})
for row in tabledata.findAll('tr'):
    col = row.findAll('td')
    if col:
        col1 = col[1].string.strip()
        col2 = col[2].string.strip()
        col3 = col[3].string.strip()
        record = '%s %s %s' % (col1, col2, col3)
        records.append(record)
for values in records:
    print values
Data:
<table class="class1">
  <tr>
    <th></th>
    <th>Heading1</th>
    <th>Heading2</th>
    <th>Heading3</th>
  </tr>
  <tr>
    <td><img src="http://image.com/new.png"/></td>
    <td>Data1</td>
    <td><a href="www.sample.com">Data2</a></td>
    <td>Data3</td>
  </tr>
Output:
Data1 Data2 Data3
Required output:
Data1 Data2 Data3 www.sample.com new.png
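Answer 0 (score: 0)
The .string attribute returns only a cell's text content, not the attributes of the tags inside it. For each col you also need to find the other tags you are interested in (<a> and <img>) and extract the attributes you want to print from them.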
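As a rough illustration of that point, the sketch below parses the row from the question and reads the text and the attributes separately. It assumes Python 3 and bs4 with the built-in "html.parser"; the names HTML, cells, cell_text, img_src and link_href are made up for this example and are not part of the original code.

```python
from bs4 import BeautifulSoup

# Sample row copied from the question's table (hypothetical constant name).
HTML = """
<table class="class1">
  <tr>
    <td><img src="http://image.com/new.png"/></td>
    <td>Data1</td>
    <td><a href="www.sample.com">Data2</a></td>
    <td>Data3</td>
  </tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
cells = soup.find("table", {"class": "class1"}).find_all("td")

# .string yields text only; the image cell contains no text, so it is None.
cell_text = [td.string for td in cells]

# Attribute values live on the child tags and are read with .get().
img_src = cells[0].find("img").get("src")
link_href = cells[2].find("a").get("href")

print(cell_text)
print(img_src, link_href)
```

Note that the first cell's .string is None, so calling .strip() on it directly would raise an AttributeError; the question's code avoids this only because it starts at col[1].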
Answer 1 (score: 0)
Here is a solution:
from bs4 import BeautifulSoup
import urllib2
#url="http://mywebsite.com/"
#page=urllib2.urlopen(url)

def getdata(col):
    # Collect image sources, link targets and text from a single <td>.
    record = []
    for image in col.findAll('img'):
        src = image.get('src')
        record.append(src)
    for a in col.findAll('a'):
        href = a.get('href')
        record.append(href)
    if col.string:
        record.append(col.string.strip())
    return record

def extract():
    # Reads a local copy of the page saved as test.html.
    url = "test.html"
    soup = BeautifulSoup(open(url).read())
    records = []
    tabledata = soup.find("table", {"class": "class1"})
    for row in tabledata.findAll('tr'):
        cols = row.findAll('td')
        for col in cols:
            record = getdata(col)
            records.extend(record)
    return records

if __name__ == "__main__":
    records = extract()
    print "records:", records
    for v in records:
        print v
Output:
http://image.com/new.png
Data1
www.sample.com
Data2
Data3
Loop over every 'td', extract the required data, and append it to the records.
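The loop above yields one flat list of values. If the goal is one line per row in the order shown under "Required output" (text first, then the href, then the image file name), a variation along these lines might work. This is only a sketch under assumptions: it targets Python 3 with the "html.parser" backend, and row_to_record, extract_records and HTML are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup

# Sample table based on the question's data (hypothetical constant name).
HTML = """
<table class="class1">
  <tr><th></th><th>Heading1</th><th>Heading2</th><th>Heading3</th></tr>
  <tr>
    <td><img src="http://image.com/new.png"/></td>
    <td>Data1</td>
    <td><a href="www.sample.com">Data2</a></td>
    <td>Data3</td>
  </tr>
</table>
"""


def row_to_record(row):
    """Join one row's cell texts, link targets and image file names."""
    texts, hrefs, images = [], [], []
    for td in row.find_all("td"):
        text = td.get_text(strip=True)
        if text:
            texts.append(text)
        hrefs.extend(a["href"] for a in td.find_all("a", href=True))
        # Keep only the last path segment of each src, e.g. "new.png".
        images.extend(
            img["src"].rsplit("/", 1)[-1]
            for img in td.find_all("img", src=True)
        )
    return " ".join(texts + hrefs + images)


def extract_records(html):
    """Return one formatted string per data row of the class1 table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "class1"})
    records = []
    for row in table.find_all("tr"):
        # Header rows contain only <th> cells, so skip them.
        if row.find("td") is not None:
            records.append(row_to_record(row))
    return records


records = extract_records(HTML)
for record in records:
    print(record)
```

Grouping values by kind within each row keeps the output order independent of which column the image or link happens to sit in.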