<div class="members_box_second">
<div class="members_box0">
<p>1</p>
</div>
<div class="members_box1">
<p class="clear"><b>Name:</b><span>Mr.Jagadhesan.S</span></p>
<p class="clear"><b>Designation:</b><span>Proprietor</span></p>
<p class="clear"><b>CODISSIA - Designation:</b><span>(Founder President, CODISSIA)</span></p>
<p class="clear"><b>Name of the Industry:</b><span>Govardhana Engineering Industries</span></p>
<p class="clear"><b>Specification:</b><span>LIFE</span></p>
<p class="clear"><b>Date of Admission:</b><span>19.12.1969</span></p>
</div>
<div class="members_box2">
<p>Ukkadam South</p>
<p class="clear"><b>Phone:</b><span>2320085, 2320067</span></p>
<p class="clear"><b>Email:</b><span><a href="mailto:jagadhesan@infognana.com">jagadhesan@infognana.com</a></span></p>
</div>
</div>
<div class="members_box">
<div class="members_box0">
<p>2</p>
</div>
<div class="members_box1">
<p class="clear"><b>Name:</b><span>Mr.Somasundaram.A</span></p>
<p class="clear"><b>Designation:</b><span>Proprietor</span></p>
<p class="clear"><b>Name of the Industry:</b><span>Everest Engineering Works</span></p>
<p class="clear"><b>Specification:</b><span>LIFE</span></p>
<p class="clear"><b>Date of Admission:</b><span>19.12.1969</span></p>
</div>
<div class="members_box2">
<p>Alagar Nivas, 284 NSR Road</p>
<p class="clear"><b>Phone:</b><span>2435674</span></p>
<h4>Factory Address</h4>
Coimbatore - 641 027
<p class="clear"><b>Phone:</b><span>2435674</span></p>
</div>
</div>
I have the above structure. From it, I am trying to scrape the text only from within the divs with class members_box1 and members_box2.

I have the following script, which only gets data from members_box1:

from bs4 import BeautifulSoup
import urllib2
import csv
import re
page = urllib2.urlopen("http://www.codissia.com/member/members-directory/?mode=paging&Keyword=&Type=&pg=1")
soup = BeautifulSoup(page.read())
for eachuniversity in soup.findAll('div',{'class':'members_box1'}):
    data = [re.sub('\s+', ' ', text).strip().encode('utf8') for text in eachuniversity.find_all(text=True) if text.strip()]
    print '\n'
This is how I tried to get the data from both boxes:
from bs4 import BeautifulSoup
import urllib2
import csv
import re
page = urllib2.urlopen("http://www.codissia.com/member/members-directory/?mode=paging&Keyword=&Type=&pg=1")
soup = BeautifulSoup(page.read())
eachbox2 = soup.findAll('div ', {'class':'members_box2'})
for eachuniversity in soup.findAll('div',{'class':'members_box1'}):
    data = eachbox2 + [re.sub('\s+', ' ', text).strip().encode('utf8') for text in eachuniversity.find_all(text=True) if text.strip()]
    print data
But the result I get is the same as for members_box1 alone.

Update

I want each iteration's output to come out on a single line, like this:
Name:,Mr.Srinivasan.N,Designation:,Proprietor,CODISSIA - Designation:,(Past President, CODISSIA),Name of the Industry:,Arian Soap Manufacturing Co,Specification:,LIFE,Date of Admission:,19.12.1969, "Parijaat" 26/1Shanker Mutt Road, Basavana Gudi,Phone:,2313861
But what I get is this:
Name:,Mr.Srinivasan.N,Designation:,Proprietor,CODISSIA - Designation:,(Past President, CODISSIA),Name of the Industry:,Arian Soap Manufacturing Co,Specification:,LIFE,Date of Admission:,19.12.1969
"Parijaat" 26/1Shanker Mutt Road, Basavana Gudi,Phone:,2313861
Answer 0 (score: 4)
The problem is that you're adding eachbox2 to data on every iteration, instead of adding it to the list of things you loop over. On top of that, you have a stray space in 'div ' instead of 'div', which causes eachbox2 to be an empty list.

Try this:
eachbox1 = soup.findAll('div', {'class':'members_box1'})
eachbox2 = soup.findAll('div', {'class':'members_box2'})
for eachuniversity in eachbox1 + eachbox2:
    data = [re.sub('\s+', ' ', text).strip().encode('utf8') for text in eachuniversity.find_all(text=True) if text.strip()]
This isn't the best way to do it; it's just the simplest fix to what you already have. BeautifulSoup offers a variety of ways to search for multiple things in one query. For example, you can search based on a sequence of values ('members_box1', 'members_box2'), on a regular expression re.compile(r'members_box[12]'), or on a filter function...
Answer 1 (score: 4)
You can use a regex to match either members_box1 or members_box2:
import re
eachbox = soup.findAll('div', {'class':re.compile(r'members_box[12]')})
for eachuniversity in eachbox:
    # process each matched div here
For example:
import bs4 as bs
import urllib2
import re
import csv
page = urllib2.urlopen("http://www.codissia.com/member/members-directory/?mode=paging&Keyword=&Type=&pg=1")
content = page.read()
soup = bs.BeautifulSoup(content)
with open('/tmp/ccc.csv', 'wb') as f:
    writer = csv.writer(f, delimiter=',', lineterminator='\n')
    eachbox = soup.find_all('div', {'class':re.compile(r'members_box[12]')})
    # The divs come back in document order: box1, box2, box1, box2, ...
    # so pairing consecutive elements keeps each member's two boxes together.
    for pair in zip(*[iter(eachbox)]*2):
        writer.writerow([text.strip() for item in pair for text in item.stripped_strings])
Note that you have to remove the stray space after div in soup.findAll('div ') for it to find any <div> tags at all.
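If you want to convince yourself of that, here is a quick sanity check against the soup object built above (Python 2 prints, to match the rest of the code):

# No tag is literally named 'div ' (with a trailing space), so the first
# search finds nothing, while the corrected one finds every <div> on the page.
print len(soup.findAll('div '))   # 0
print len(soup.findAll('div'))    # a positive count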
The code above uses the very handy grouper idiom:

zip(*[iter(iterable)]*n)

This expression collects items from iterable and groups them into tuples of n items, so it lets you iterate over the iterable in chunks of n. I've tried to explain how the grouper idiom works here.
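As a quick, self-contained illustration of the idiom (plain numbers, nothing to do with the scraping code):

nums = [1, 2, 3, 4, 5, 6]
# The same iterator object is repeated n times, so each tuple that zip()
# builds pulls n consecutive items from it.
print zip(*[iter(nums)] * 2)   # [(1, 2), (3, 4), (5, 6)]
print zip(*[iter(nums)] * 3)   # [(1, 2, 3), (4, 5, 6)]

In the CSV-writing code above, n is 2 and the iterable is the list of members_box1 / members_box2 divs, so each tuple holds the two boxes belonging to one member.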