I am trying to use the following code to search the page and write the output for each iteration on a single line.
import urllib2
from bs4 import BeautifulSoup
import re
page = urllib2.urlopen("http://www.siema.org/members.html")
soup = BeautifulSoup(page)
tds = soup.findAll('td', attrs={'class':'content'})
for table in zip(*[iter(tds)]*2):
    data = [re.sub('\s+', ' ', text).strip().encode('utf8') for text in table.find_all(text=True) if text.strip()]
    print [','.join(data) for x in data]
Right now I get output like this:
A K Ponnusamy & Co
cjm@yahoo.co.in
Manufacturing of Rough Castings
Aelenke PL Industrials
All types of Pulleys
Agri Pump Industries
Submersible Pumpsset Jet Pumps Centrifugal Monoblocks Motor & pumps
Akshaya Engineering
pumpsets
Altech Industries
altech@vsnl.com|www.altechindustries.org
Engineering College Lab Equipment (FM and Therai lab Equipment)
Ammurun Foundry
ammarun@vsnl.com|www.ammarun.com
Grey Iron & S.G. Iron Rough Castings
Anugraha Valve Castings Ltd
anugraha@anugrahavalvecastings.com
valve & spares
Apex Bright Bars (Cbe) Pvt Ltd
apexcbe@sify.com
I want it to look like this:
A K Ponnusamy & Co |cjm@yahoo.co.in | Manufacturing of Rough Castings
Aelenke PL Industrials | | All types of Pulleys
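Answer 0 (score: 4)
Your zip(*[iter(tds)]*2) returns a list of tuples of td tags, so the table variable is a tuple, which has no find_all method. Each row on that page spans three td cells (name, e-mail/web, product), so group the tags in threes and call find_all on each td inside the tuple.
This: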
import urllib2
from bs4 import BeautifulSoup
import re
page = urllib2.urlopen("http://www.siema.org/members.html")
soup = BeautifulSoup(page)
tds = soup.findAll('td', attrs={'class':'content'})
for table in zip( *[iter(tds)]*3 ):
    data = []
    for td in table:
        data += [re.sub('\s+', ' ', text).strip().encode('utf8') for text in td.find_all(text=True) if text.strip()]
    print ', '.join(data)
returns:
Name & Address of the Company, E Mail & Web, Product Manufactured
A K Ponnusamy & Co, cjm@yahoo.co.in, Manufacturing of Rough Castings
Aelenke PL Industrials, All types of Pulleys
Agri Pump Industries, Submersible Pumpsset, Jet Pumps, Centrifugal Monoblocks, Motor & pumps
... more skipped ...
The first TD tags on that page contain the header row, which you will probably want to skip.
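One way to skip the header row is sketched below. This is a minimal, network-free illustration rather than the answer's own code: the `cells` list is a hypothetical stand-in for the text of the td tags, `grouped` is a helper name introduced here, and the snippet is written for Python 3.

```python
from itertools import islice


def grouped(items, n):
    """Yield consecutive n-tuples from items; leftover items are dropped."""
    # The same iterator is repeated n times, so zip pulls n items per tuple.
    return zip(*[iter(items)] * n)


# Hypothetical stand-in for the td texts; the first three are the header cells.
cells = [
    "Name & Address of the Company", "E Mail & Web", "Product Manufactured",
    "A K Ponnusamy & Co", "cjm@yahoo.co.in", "Manufacturing of Rough Castings",
    "Aelenke PL Industrials", "", "All types of Pulleys",
]

# islice(..., 1, None) skips the first group, which is the header row.
rows = list(islice(grouped(cells, 3), 1, None))
lines = [" | ".join(row) for row in rows]

for line in lines:
    print(line)
```

Each element of `rows` is a tuple, which is why the original code failed when it called find_all on the tuple itself instead of on its members.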
Answer 1 (score: 2)
This is very similar to the previous answer, but the output is formatted slightly better.
for table in zip( *[iter(tds)]*3 ):
    row = [', '.join([re.sub('\s+', ' ', text).strip().encode('utf8')
                      for text in td.find_all(text=True)
                      if text.strip()])
           for td in table]
    print ' | '.join(row)
It gives the following output:
Name & Address of the Company | E Mail & Web | Product Manufactured
A K Ponnusamy & Co | cjm@yahoo.co.in | Manufacturing of Rough Castings
Aelenke PL Industrials | | All types of Pulleys
...
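The difference between the two answers can be seen without fetching the page. The sketch below is only an illustration under stated assumptions: `row` is a hypothetical stand-in in which each inner list holds the text fragments of one td (what td.find_all(text=True) would yield), `clean` is a helper name introduced here, and the code targets Python 3.

```python
import re


def clean(fragment):
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", fragment).strip()


# Hypothetical row: the middle cell (e-mail/web) has no text at all.
row = [["Aelenke PL\n Industrials"], [], ["All types of Pulleys"]]

# Flattening all fragments into one list loses the empty cell entirely.
flat = ", ".join(clean(t) for cell in row for t in cell if t.strip())

# Joining per cell first keeps one slot per column, even when it is empty.
per_cell = " | ".join(
    ", ".join(clean(t) for t in cell if t.strip()) for cell in row
)

print(flat)
print(per_cell)
```

Keeping one joined string per td means every output line has the same number of columns, which matters if the result is later parsed as delimited data.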