Using BeautifulSoup

Time: 2017-05-16 09:05:49

Tags: python csv web-scraping beautifulsoup

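I'm trying to scrape the text details of store locations and write them to a CSV file using BeautifulSoup. The 2 stores in Alabama are inside one LocationSecContent class, and the 17 stores in Arizona are inside another LocationSecContent. In Georgia, the first store, Airport, is in a single class inside LocationSecContent, and the remaining 4 are in another class, location, inside LocationSecContent. I want to scrape the text details and write the store details (name, location, street, phone, fax, hours content, and everything else) to a CSV file. I'm using Firebug in Firefox. Sorry if there are any mistakes; I'm a beginner with BeautifulSoup.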

Here is what I tried:

from bs4 import BeautifulSoup
import requests

page = requests.get('http://freshvites.com/store-locator/')

soup = BeautifulSoup(page.text, 'html.parser')
d={}
for table in soup.find_all("div", {"class":"content freshvites-location"}):
    table
for col in table.find_all("td"):

    LocationSecHdr=col.find_all("div",{'class':'LocationSecHdr'})
    Location=col.find_all("div",{'class':'location'})


dt="LocationSecHdr:%s,Location: %s" %(LocationSecHdr, Location)
zx=BeautifulSoup(dt, "html.parser")

print zx.get_text()

I can't iterate over the rows and scrape the text.

Approach 2:

from bs4 import BeautifulSoup

import requests


page = requests.get('http://freshvites.com/store-locator/')
#print page


soup = BeautifulSoup(page.text, 'html.parser')
#print soup.find_all('a')

for table in soup.find_all("div",{'class':'content freshvites-location'}):
    table


LocationSecHdr=''
LocationSecContent=''
Location=''
LocationTitle=''
Phone=''
Fax=''
HoursTitle=''
HoursContent=''


for col in table.find_all("td"):      
    LocationSecHdr=col.find_all("div",{'class':'LocationSecHdr'})
    #LocationSecContent= col.find_all("div",{'class':'LocactionSecContent'})
    #Location= col.find_all("div",{'class':'location'})
    LocationTitle= col.find_all("div",{'class':'locationTitle'})
    Phone= col.find_all("div",{'class':'Phone'})
    Fax= col.find_all("div",{'class':'Fax'})
    HoursContent=col.find_all("div",{'class':'HoursContent'})

    data="LocationSecHdr: %s, LocationSecContent: %s, Location:%s, LocationTitle : %s, Phone:%s, Fax :%s, HoursContent:%s " %(LocationSecHdr, LocationSecContent, Location, LocationTitle, Phone, Fax, HoursContent)
    zax=BeautifulSoup(data,"html.parser")

print zax.get_text()

With this code I can't get the store addresses, and I also don't know how to collect these details as a dictionary.

1 answer:

Answer 0 (score: 1)

I think I have enough information now to answer your question.

You are looking for the wrong tag/class combination. All the information for a location is contained inside a <div class="location">. Here is a sample:

<div class="location">
<div class="locationTitle">32nd Street &amp; Thunderbird</div>
Fresh Vitamins<br> 
13802 N. 32nd St #11<br> 
Phoenix, AZ 85032<br>
<div class="Phone">&nbsp;</div>
<div class="Fax">877.935.6902</div>
<div class="HoursTitle">Hours:</div>
<div class="HoursContent">9am - 7pm M-F<br> 9am - 6pm Sat<br> 11am - 4pm Sun</div>
</div>

As you can see in the sample, there is no <tr> or <td>, so searching for those tags doesn't make sense.
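This can be checked directly against the sample. The following is only a sketch: it assumes the sample above is representative of the live page, and the SAMPLE string and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# The sample location block quoted above, pasted in as a string (illustrative).
SAMPLE = """<div class="location">
<div class="locationTitle">32nd Street &amp; Thunderbird</div>
Fresh Vitamins<br>
13802 N. 32nd St #11<br>
Phoenix, AZ 85032<br>
<div class="Phone">&nbsp;</div>
<div class="Fax">877.935.6902</div>
<div class="HoursTitle">Hours:</div>
<div class="HoursContent">9am - 7pm M-F<br> 9am - 6pm Sat<br> 11am - 4pm Sun</div>
</div>"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# No table cells exist in this markup, so a <td> search finds nothing.
td_cells = soup.find_all("td")

# class_="location" matches the class token exactly, so "locationTitle" is not included.
location_divs = soup.find_all("div", class_="location")

print(len(td_cells), len(location_divs))
```

On this sample the <td> search comes back empty while the div search finds the one location block, which is why the loops over table.find_all("td") in the question never produce anything.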

Here's a short Python script I put together to find all locations:

from bs4 import BeautifulSoup
import requests

page = requests.get('http://freshvites.com/store-locator/')

soup = BeautifulSoup(page.content, 'html.parser')

for div in soup.find_all("div", {"class":"location"}):
    print(div)

Now you just need to extract the fields you need from each div. Everything you need for that should be easy to find on SO.
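As a rough sketch of that extraction step, each location div could be turned into a dictionary and the dictionaries written out with csv.DictWriter. This assumes every location block looks like the sample above; the helper names (parse_location, write_csv), the FIELD_CLASSES mapping, and the CSV column names are my own choices, not anything defined by the site. The state header (LocationSecHdr) is left out because it does not appear in the sample.

```python
import csv
import io

from bs4 import BeautifulSoup, NavigableString

# Sample markup copied from above so the sketch runs without a network call.
SAMPLE_HTML = """<div class="location">
<div class="locationTitle">32nd Street &amp; Thunderbird</div>
Fresh Vitamins<br>
13802 N. 32nd St #11<br>
Phoenix, AZ 85032<br>
<div class="Phone">&nbsp;</div>
<div class="Fax">877.935.6902</div>
<div class="HoursTitle">Hours:</div>
<div class="HoursContent">9am - 7pm M-F<br> 9am - 6pm Sat<br> 11am - 4pm Sun</div>
</div>"""

# Hypothetical mapping: CSS class in the page -> column name in the CSV.
FIELD_CLASSES = {
    "locationTitle": "title",
    "Phone": "phone",
    "Fax": "fax",
    "HoursContent": "hours",
}
COLUMNS = ["title", "address", "phone", "fax", "hours"]


def parse_location(div):
    """Turn one <div class="location"> into a flat dict of strings."""
    record = {}
    for css_class, column in FIELD_CLASSES.items():
        child = div.find("div", class_=css_class)
        # The separator keeps <br>-separated lines apart; strip drops blank pieces.
        record[column] = child.get_text("; ", strip=True) if child else ""
    # The address lines are bare text nodes sitting directly under the div.
    address_lines = [
        node.strip()
        for node in div.children
        if isinstance(node, NavigableString) and node.strip()
    ]
    record["address"] = ", ".join(address_lines)
    return record


def write_csv(records, fileobj):
    """Write the records to any file-like object opened in text mode."""
    writer = csv.DictWriter(fileobj, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(records)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = [parse_location(div) for div in soup.find_all("div", class_="location")]

buffer = io.StringIO()
write_csv(records, buffer)
print(buffer.getvalue())
```

To run it against the real page, replace SAMPLE_HTML with page.content from the requests call above, and pass open('stores.csv', 'w', newline='') instead of the in-memory buffer.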