BeautifulSoup href

Date: 2013-08-22 10:00:43

Tags: python beautifulsoup

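Can someone suggest how I can retrieve the href and img values from a <td>? I wrote the code below, which produces the output shown further down. I can retrieve the text of each <td>, but I am not sure how to get the attribute values of the tags nested inside it.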

Please note that the table contains many <tr> rows. I have shown only a header row and one data row as an example.

My code:

from bs4 import BeautifulSoup
import urllib2
url="http://mywebsite.com/"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

records = [] 
tabledata = soup.find("table", {"class" : "class1"})
for row in tabledata.findAll('tr'):
    col = row.findAll('td')
    if col:
        col1 = col[1].string.strip()
        col2 = col[2].string.strip()
        col3 = col[3].string.strip()
        record = '%s %s %s' % (col1,col2,col3)
        records.append(record)


for values in records:
    print values

Data:

<table class="class1">
<tr>
<th></th>
<th>Heading1</th>
<th>Heading2</th>
<th>Heading3</th>
</tr>
<tr>
<td><img src="http://image.com/new.png"/></td>
<td>Data1</td>
<td><a href="www.sample.com">Data2</a></td>
<td>Data3</td>
</tr>

Output:

Data1 Data2 Data3

Required output:

Data1 Data2 Data3 www.sample.com new.png

2 answers:

Answer 0 (score: 0)

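The string attribute most likely returns only the text content of the child nodes. You also need to look inside each col for the other tags you are interested in (<a> and <img>) and extract from them the attributes you want to print.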
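
The difference between the text and the attributes can be seen in a minimal sketch. It assumes Python 3 and the built-in "html.parser" backend, neither of which appears in the question; the inline HTML string is a trimmed copy of the question's data row, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the data row from the question.
html = """
<table class="class1">
<tr>
<td><img src="http://image.com/new.png"/></td>
<td>Data1</td>
<td><a href="www.sample.com">Data2</a></td>
<td>Data3</td>
</tr>
</table>
"""

# Assumption: the stdlib "html.parser" backend; the question names no parser.
soup = BeautifulSoup(html, "html.parser")
cells = soup.find("table", {"class": "class1"}).find_all("td")

# .string yields text only: None for the image-only cell,
# and the link text for the cell that wraps an <a>.
image_cell_text = cells[0].string
link_cell_text = cells[2].string

# Attribute values have to be read from the nested tags themselves.
img_src = cells[0].find("img").get("src")
link_href = cells[2].find("a").get("href")

print(image_cell_text, link_cell_text, img_src, link_href)
```

Here .string never exposes the src or href values, which is why the question's code prints only Data1 Data2 Data3.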

Answer 1 (score: 0)

Here is a solution:

from bs4 import BeautifulSoup
import urllib2
#url="http://mywebsite.com/"
#page=urllib2.urlopen(url)


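def getdata(col):
    # Collect the image sources, link targets and text of one <td> cell.
    record = []
    for image in col.findAll('img'):
        src = image.get('src')
        record.append(src)
    for a in col.findAll('a'):
        href = a.get('href')
        record.append(href)
    # col.string is None for cells without text, such as the image-only cell.
    if col.string:
        record.append(col.string.strip())
    return record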


def extract():
    url="test.html"
    soup = BeautifulSoup(open(url).read())

    records = [] 
    tabledata = soup.find("table", {"class" : "class1"})
    for row in tabledata.findAll('tr'):
        cols = row.findAll('td')
        for col in cols:
            record = getdata(col)
            records.extend(record)
    return records

if __name__ == "__main__":
    records = extract()
    print "records:", records
    for v in records:
        print v

Output:

http://image.com/new.png
Data1
www.sample.com
Data2
Data3

Loop over all the 'td' cells, extract the necessary data from each one, and append it to the records.
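
The output above is a flat list, while the question asks for one line per row with the image reduced to its file name. A possible variant is sketched below, assuming Python 3 and the "html.parser" backend. The helper names row_to_record and extract_records and the SAMPLE_HTML string are hypothetical and not part of either answer, and taking the file name with posixpath.basename is an assumption about what "new.png" in the required output means.

```python
import posixpath

from bs4 import BeautifulSoup

# Sample table modelled on the question's data (hypothetical test input).
SAMPLE_HTML = """
<table class="class1">
<tr><th></th><th>Heading1</th><th>Heading2</th><th>Heading3</th></tr>
<tr>
<td><img src="http://image.com/new.png"/></td>
<td>Data1</td>
<td><a href="www.sample.com">Data2</a></td>
<td>Data3</td>
</tr>
</table>
"""


def row_to_record(row):
    """Build one space-separated line for a <tr>; return None for header rows."""
    cells = row.find_all("td")
    if not cells:
        return None  # header rows contain only <th> cells
    texts, hrefs, images = [], [], []
    for cell in cells:
        text = cell.get_text(strip=True)
        if text:
            texts.append(text)
        for a in cell.find_all("a"):
            if a.get("href"):
                hrefs.append(a.get("href"))
        for img in cell.find_all("img"):
            if img.get("src"):
                # Assumption: only the file name of the image is wanted.
                images.append(posixpath.basename(img.get("src")))
    # Cell texts first, then link targets, then image file names.
    return " ".join(texts + hrefs + images)


def extract_records(html):
    """Return one record string per data row of the table with class 'class1'."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "class1"})
    records = []
    for row in table.find_all("tr"):
        record = row_to_record(row)
        if record is not None:
            records.append(record)
    return records


for line in extract_records(SAMPLE_HTML):
    print(line)
```

Grouping the values per row keeps each record aligned with its <tr>, so tables with many rows produce one line each.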