Leave a blank if the value does not exist (BS4, Python)

Time: 2013-11-15 08:17:26

Tags: python python-2.7 web-scraping beautifulsoup

I have the following code to scrape data. The data gets scraped, but the output is a bit messed up.

from bs4 import BeautifulSoup
import urllib2
import re
import csv
with open('ccccc.csv', 'wb') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    for i in xrange(1,3):
        try:
            page = urllib2.urlopen("http://www.codissia.com/member/members-directory/?mode=paging&Keyword=&Type=&pg={}".format(i))
        except urllib2.HTTPError:
            continue
        else:
            soup = BeautifulSoup(page.read(), from_encoding=page.info().getparam('charset'))
            eachbox = soup.find_all('div', {'class':re.compile(r'members_box[12]')})
            for pair in zip(*[iter(eachbox)]*2):
                writer.writerow([text.strip() for item in pair for text in item.stripped_strings])

In the image I have attached, you can see that the columns do not line up.

This is the structure of the data I am scraping:

<div class="members_box_second">
                    <div class="members_box0">
                        <p>1</p>
                    </div>
                    <div class="members_box1">
                        <p class="clear"><b>Name:</b><span>Mr.Jagadhesan.S</span></p>
                        <p class="clear"><b>Designation:</b><span>Proprietor</span></p>
                        <p class="clear"><b>CODISSIA - Designation:</b><span>(Founder President, CODISSIA)</span></p>
                        <p class="clear"><b>Name of the Industry:</b><span>Govardhana Engineering Industries</span></p>
                        <p class="clear"><b>Specification:</b><span>LIFE</span></p>
                        <p class="clear"><b>Date of Admission:</b><span>19.12.1969</span></p>
                    </div>
                    <div class="members_box2">
                        <p>Ukkadam South</p>
                        <p class="clear"><b>Phone:</b><span>2320085, 2320067</span></p>
                        <p class="clear"><b>Email:</b><span><a href="mailto:jagadhesan@infognana.com">jagadhesan@infognana.com</a></span></p>                       
                    </div>
</div>
<div class="members_box">
                    <div class="members_box0">
                        <p>2</p>
                    </div>
                    <div class="members_box1">
                        <p class="clear"><b>Name:</b><span>Mr.Somasundaram.A</span></p>
                        <p class="clear"><b>Designation:</b><span>Proprietor</span></p>

                        <p class="clear"><b>Name of the Industry:</b><span>Everest Engineering Works</span></p>
                        <p class="clear"><b>Specification:</b><span>LIFE</span></p>
                        <p class="clear"><b>Date of Admission:</b><span>19.12.1969</span></p>
                    </div>
                    <div class="members_box2">
                        <p>Alagar Nivas, 284 NSR Road</p>
                        <p class="clear"><b>Phone:</b><span>2435674</span></p>      
                        <h4>Factory Address</h4>
                        Coimbatore - 641 027
                        <p class="clear"><b>Phone:</b><span>2435674</span></p>
                    </div>
</div>

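I want the data to land in the corresponding columns. For example, all names should fall under the same Name column, and likewise for phone numbers, emails, and so on. If a phone number is missing, the CSV file should have a blank in that cell. I am not even close to an idea of how to achieve this.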

[Image: my current output]

1 answer:

Answer 0 (score: 1)

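Based on your requirements, I suggest extracting all the values into a fixed set of fields first and then writing them out in a fixed order. For example, look at the following code: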

from bs4 import BeautifulSoup
import urllib2
import re
import csv

# Fixed column order so every row lines up; I put only the most relevant fields, you can add more.
FIELDS = ['Name', 'Designation', 'Name of the Industry', 'Specification',
          'Date of Admission', 'Address', 'Phone', 'Email']

with open('ccccc.csv', 'wb') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(FIELDS)  # header row
    for i in xrange(1, 3):
        try:
            page = urllib2.urlopen("http://www.codissia.com/member/members-directory/?mode=paging&Keyword=&Type=&pg={}".format(i))
        except urllib2.HTTPError:
            continue
        else:
            soup = BeautifulSoup(page.read(), from_encoding=page.info().getparam('charset'))
            eachbox = soup.find_all('div', {'class': re.compile(r'members_box[12]')})
            for pair in zip(*[iter(eachbox)]*2):
                # Start every field as an empty string so missing values stay blank.
                row = dict.fromkeys(FIELDS, '')
                for item in pair:
                    for p in item.find_all('p'):
                        if p.b is not None and p.span is not None:
                            # Labelled line: <b>Label:</b><span>value</span>
                            key = p.b.get_text(strip=True).rstrip(':')
                            # Keep only known fields; the first value wins (e.g. the first Phone).
                            if key in row and not row[key]:
                                row[key] = p.span.get_text(strip=True)
                        elif not row['Address']:
                            # The first unlabelled <p> (in members_box2) is the address.
                            row['Address'] = p.get_text(strip=True)

                # Write to the CSV in the fixed column order (encode for Python 2's csv module).
                writer.writerow([row[key].encode('utf-8') for key in FIELDS])
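
The per-record extraction in the loop above can be tried in isolation, without hitting the site. The following is only a minimal sketch, assuming Python 3 with `bs4` installed and the built-in `html.parser` backend; the `SAMPLE` markup is trimmed from the question, and `FIELDS` and `extract_record` are names made up for this illustration.

```python
from bs4 import BeautifulSoup

# Markup trimmed from the question's second record; it has no Email line.
SAMPLE = """
<div class="members_box1">
  <p class="clear"><b>Name:</b><span>Mr.Somasundaram.A</span></p>
  <p class="clear"><b>Designation:</b><span>Proprietor</span></p>
</div>
<div class="members_box2">
  <p>Alagar Nivas, 284 NSR Road</p>
  <p class="clear"><b>Phone:</b><span>2435674</span></p>
</div>
"""

# Hypothetical column list for this sketch.
FIELDS = ["Name", "Designation", "Address", "Phone", "Email"]


def extract_record(boxes):
    """Return a dict holding every field; fields not found stay ''."""
    record = dict.fromkeys(FIELDS, "")
    for box in boxes:
        for p in box.find_all("p"):
            if p.b is not None and p.span is not None:
                # Labelled line: strip the trailing colon from the label.
                label = p.b.get_text(strip=True).rstrip(":")
                if label in record and not record[label]:
                    record[label] = p.span.get_text(strip=True)
            elif not record["Address"]:
                # First unlabelled paragraph is treated as the address.
                record["Address"] = p.get_text(strip=True)
    return record


soup = BeautifulSoup(SAMPLE, "html.parser")
record = extract_record(soup.find_all("div"))
print([record[name] for name in FIELDS])
```

Because the record starts out with every field set to an empty string, the missing Email simply remains `''` and still occupies its own column.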

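This way, if an element does not exist, an empty string is written for it, because the dictionary is initialised with empty strings, and the column order in the CSV is respected.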
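
As an alternative to pre-filling the dictionary, the standard library's `csv.DictWriter` can fill the gaps itself through its `restval` argument. The sketch below is not part of the original answer and assumes Python 3; the records reuse values from the question's sample markup, and an in-memory `io.StringIO` buffer stands in for the output file.

```python
import csv
import io

FIELDS = ["Name", "Phone", "Email"]

# Already-extracted records; the second one has no Email key at all.
records = [
    {"Name": "Mr.Jagadhesan.S", "Phone": "2320085, 2320067",
     "Email": "jagadhesan@infognana.com"},
    {"Name": "Mr.Somasundaram.A", "Phone": "2435674"},
]

buffer = io.StringIO()  # stands in for open('ccccc.csv', 'w', newline='')
writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval="",
                        quoting=csv.QUOTE_ALL)
writer.writeheader()        # column names come from FIELDS
writer.writerows(records)   # missing keys are written as restval

lines = buffer.getvalue().splitlines()
for line in lines:
    print(line)
```

The column order comes from `fieldnames`, so it does not depend on the order in which keys were inserted into each dictionary.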