Scraping <span> text with BeautifulSoup and urllib

Time: 2017-06-26 04:24:10

Tags: python web-scraping beautifulsoup urllib

I'm using BeautifulSoup to scrape data from a website. For whatever reason, I can't seem to find a way to get the text between the span elements to print. Here is what I'm working with:

data = """ <div class="grouping">
     <div class="a1 left" style="width:20px;">Text</div>
     <div class="a2 left" style="width:30px;"><span 
     id="target_0">Data1</span>
   </div>
   <div class="a3 left" style="width:45px;"><span id="div_target_0">Data2
   </span></div>
   <div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3
   </span</div>
</div>
"""

My end goal is to be able to print a list ["Text", "Data1", "Data2"] for each entry. But right now I can't get python and urllib to produce any of the text between the span tags. Here is what I'm running:

import urllib
from bs4 import BeautifulSoup

url = 'http://target.com'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")

Search_List = [0,4,5] # list of Target IDs to scrape

for i in Search_List:
    h = str(i)
    root = 'target_' + h
    taggr = soup.find("span", { "id" : root })
    print taggr, ", ", taggr.text

When I use urllib, it produces this:

<span id="target_0"></span>, 
<span id="target_4"></span>, 
<span id="target_5"></span>, 

However, I also downloaded the HTML file, and when I parse the downloaded file it produces this output (the one I want):

<span id="target_0">Data1</span>, Data1 
<span id="target_4">Data1</span>, Data1
<span id="target_5">Data1</span>, Data1

Can someone explain to me why urllib doesn't produce the results?
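One quick check (a debugging sketch, assuming Python 2 and the placeholder URL from the code above) is whether the expected text appears anywhere in the raw response urllib returns:

import urllib

html = urllib.urlopen('http://target.com').read()   # raw HTML, before any Javascript runs
print 'Data1' in html       # False suggests the span contents are filled in client-side
print html[:500]            # inspect the start of the raw response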

2 answers:

Answer 0 (score: 0)

Use this code:

...
soup = BeautifulSoup(html, 'html.parser')

your_data = list()

for line in soup.findAll('span', attrs={'id': 'target_0'}):
    your_data.append(line.text)


...

Likewise, append whatever you need from the other class attributes to the your_data list, and then write your_data out to a CSV file. Hope this helps; if not, let me know.
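For the CSV step, a minimal sketch (assuming your_data has been filled as above; the filename output.csv is just an example):

import csv

# Write the collected values as one row; 'wb' matches the Python 2 used elsewhere
# in this post (on Python 3 you would open with 'w' and newline='').
with open('output.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(your_data)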

Answer 1 (score: 0)

You can use the following approach to build the lists, based on the source HTML you've shown:

from bs4 import BeautifulSoup

data = """ 
<div class="grouping">
     <div class="a1 left" style="width:20px;">Text0</div>
     <div class="a2 left" style="width:30px;"><span id="target_0">Data1</span></div>
     <div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
     <div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>

<div class="grouping">
     <div class="a1 left" style="width:20px;">Text2</div>
     <div class="a2 left" style="width:30px;"><span id="target_2">Data1</span></div>
     <div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
     <div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>

<div class="grouping">
     <div class="a1 left" style="width:20px;">Text4</div>
     <div class="a2 left" style="width:30px;"><span id="target_4">Data1</span></div>
     <div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
     <div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
"""

soup = BeautifulSoup(data, "lxml")

search_ids = [0, 4, 5] # list of Target IDs to scrape

for i in search_ids:
    span = soup.find("span", id='target_{}'.format(i))

    if span:
        grouping = span.parent.parent
        print list(grouping.stripped_strings)[:-1]      # -1 to remove "Data3"

The example HTML has been modified slightly to show it finding IDs 0 and 4. This displays the following output:

[u'Text0', u'Data1', u'Data2']
[u'Text4', u'Data1', u'Data2']

Note, if the HTML returned from the URL is different to what you see when viewing the page source in your browser (i.e. the data you want is missing entirely), then you will need to use something like selenium to connect to your browser and extract the HTML, as sketched below. This is because in that case the HTML is probably being generated locally via Javascript, and urllib does not have a Javascript processor.
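A minimal selenium sketch, assuming Chrome and a matching chromedriver are installed, with http://target.com standing in for the real URL:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()            # starts a real browser so Javascript can run
driver.get('http://target.com')
html = driver.page_source              # HTML after Javascript has populated the spans
driver.quit()

soup = BeautifulSoup(html, 'lxml')     # parse as before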