Question

I am trying to use the following code to search the page and write the output for each iteration on a single line.

import urllib2
from bs4 import BeautifulSoup
import re
page = urllib2.urlopen("http://www.siema.org/members.html")
soup = BeautifulSoup(page)
tds = soup.findAll('td', attrs={'class':'content'})
for table in zip(*[iter(tds)]*2):
    data = [re.sub('\s+', ' ', text).strip().encode('utf8') for text in table.find_all(text=True) if text.strip()]
    print [','.join(data) for x in data]

Right now I get output like this:

A K Ponnusamy & Co
cjm@yahoo.co.in
Manufacturing of Rough Castings
Aelenke PL Industrials

All types of Pulleys
Agri Pump Industries

Submersible Pumpsset Jet Pumps Centrifugal Monoblocks Motor & pumps
Akshaya Engineering

pumpsets
Altech Industries
altech@vsnl.com|www.altechindustries.org
Engineering College Lab Equipment (FM and Therai lab Equipment)
Ammurun Foundry
ammarun@vsnl.com|www.ammarun.com
Grey Iron & S.G. Iron Rough Castings
Anugraha Valve Castings Ltd
anugraha@anugrahavalvecastings.com
valve & spares
Apex Bright Bars (Cbe) Pvt Ltd
apexcbe@sify.com

I want it to look like this:

A K Ponnusamy & Co  |cjm@yahoo.co.in  |  Manufacturing of Rough Castings
Aelenke PL Industrials |    | All types of Pulleys

Answer 1

Your zip(*[iter(tds)]*2) returns a list of tuples, each containing td tags. The table variable is therefore a tuple, and a tuple has no find_all method.
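To see why the loop variable ends up being a tuple, here is a minimal sketch of the grouping idiom on its own. It assumes plain strings standing in for the td tags, and group is a hypothetical helper name that does not appear in the original code.

```python
def group(items, n):
    """Group items into consecutive n-tuples; leftover items are dropped."""
    # The same iterator object is repeated n times, so zip pulls
    # n consecutive items from it to build each tuple.
    return list(zip(*[iter(items)] * n))

# Stand-in values for the td tags (assumption: three cells per table row).
cells = ['name1', 'mail1', 'product1', 'name2', 'mail2', 'product2']

rows = group(cells, 3)
print(rows)

# Each element is a plain tuple, so it has no find_all method;
# only the items inside the tuple would have one.
print(hasattr(rows[0], 'find_all'))
```

Grouping by 2, as in the question, would pair cells across row boundaries, which is presumably why the code below groups by 3.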

This:

import urllib2
from bs4 import BeautifulSoup
import re
page = urllib2.urlopen("http://www.siema.org/members.html")
soup = BeautifulSoup(page)
tds = soup.findAll('td', attrs={'class':'content'})
for table in zip( *[iter(tds)]*3 ):
    data = []
    for td in table:
        data += [re.sub('\s+', ' ', text).strip().encode('utf8') for text in td.find_all(text=True) if text.strip()]
    print ', '.join(data)

returns:

Name & Address of the Company, E Mail & Web, Product Manufactured
A K Ponnusamy & Co, cjm@yahoo.co.in, Manufacturing of Rough Castings
Aelenke PL Industrials, All types of Pulleys
Agri Pump Industries, Submersible Pumpsset, Jet Pumps, Centrifugal Monoblocks, Motor & pumps
... more skipped ...

The first td tags on that page contain the column headings, which you will probably want to skip.
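One possible way to skip the heading row is to drop the first group with itertools.islice. This is only a sketch: it assumes plain strings standing in for the td tags, exactly one heading row, and three cells per row. data_rows and its parameters are hypothetical names.

```python
from itertools import islice

# Stand-in values for the td tags; the first three are the headings.
cells = [
    'Name & Address of the Company', 'E Mail & Web', 'Product Manufactured',
    'A K Ponnusamy & Co', 'cjm@yahoo.co.in', 'Manufacturing of Rough Castings',
]

def data_rows(items, width=3, header_rows=1):
    """Group items into width-tuples and skip the first header_rows groups."""
    grouped = zip(*[iter(items)] * width)
    return list(islice(grouped, header_rows, None))

for row in data_rows(cells):
    print(' | '.join(row))
```

Slicing the list before grouping (tds[3:]) should have the same effect under the same assumptions.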

Answer 2

This is very similar to the previous answer, but the output is slightly nicer.

for table in zip( *[iter(tds)]*3 ):
    row = [', '.join([re.sub('\s+', ' ', text).strip().encode('utf8') 
                        for text in td.find_all(text=True) 
                        if text.strip()])
                       for td in table]
    print ' | '.join(row)

It gives the following output:

Name & Address of the Company | E Mail & Web | Product Manufactured
A K Ponnusamy & Co | cjm@yahoo.co.in | Manufacturing of Rough Castings
Aelenke PL Industrials |  | All types of Pulleys
...
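The difference between the two answers is where the join happens. The sketch below illustrates it without Beautiful Soup, assuming each cell is represented as a list of raw text fragments (a stand-in for what td.find_all(text=True) yields). clean, flat_row, and format_row are hypothetical names.

```python
import re

def clean(fragment):
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r'\s+', ' ', fragment).strip()

def flat_row(cells):
    """Answer 1 style: pool all fragments, so an empty cell disappears."""
    return ', '.join(clean(f) for cell in cells for f in cell if f.strip())

def format_row(cells):
    """Answer 2 style: join per cell first, so an empty cell stays as ''."""
    return ' | '.join(
        ', '.join(clean(f) for f in cell if f.strip()) for cell in cells
    )

# Stand-in row whose middle cell holds only whitespace.
row = [['Aelenke PL  Industrials'], ['\n'], ['All types of\n Pulleys']]

print(flat_row(row))    # the empty column is lost
print(format_row(row))  # the empty column is kept between the bars
```

Joining per cell keeps the column count fixed at three, which matters if the lines are later split back into fields.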

AttributeError: 'tuple' object has no attribute 'find_all'
