How to download in-text images with Beautiful Soup

Date: 2018-03-20 16:44:18

Tags: python html web-scraping beautifulsoup python-requests

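I'm trying to write a website scraper in Python using Beautiful Soup and requests. I can easily collect all the text I want, but some of the text I'm trying to download contains important inline images. I want to replace each image with its title and add it to a string that I can parse later, but I'm not sure how to do this.

Here is a sample of the kind of HTML I'm trying to parse: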

    <td colspan="3"><b>"Assemble under Siegfried!"</b> 
        <a href="/wiki/index.php/File:Continuous.png" class="image" title="CONT"><img alt="CONT" src="/wiki/images/thumb/7/78/Continuous.png/14px-Continuous.png" width="14" height="17" srcset="/wiki/images/thumb/7/78/Continuous.png/21px-Continuous.png 1.5x, /wiki/images/7/78/Continuous.png 2x">
        </a> This unit gains +10 attack for each 
        <a href="/wiki/index.php/File:Black.png" class="image" title="Black"><img alt="Black" src="/wiki/images/thumb/7/71/Black.png/15px-Black.png" width="15" height="15" srcset="/wiki/images/thumb/7/71/Black.png/23px-Black.png 1.5x, /wiki/images/thumb/7/71/Black.png/30px-Black.png 2x">
        </a> and 
        <a href="/wiki/index.php/File:White.png" class="image" title="White"><img alt="White" src="/wiki/images/thumb/8/80/White.png/15px-White.png" width="15" height="15" srcset="/wiki/images/thumb/8/80/White.png/23px-White.png 1.5x, /wiki/images/thumb/8/80/White.png/30px-White.png 2x">
        </a> ally besides this unit.
    </td>

From this HTML I want to extract:

"Assemble under Siegfried! CONT This unit gains +10 attack for each Black and White ally besides this unit."

The plain get_text() method does not include the images' titles, and that is the problem.
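To make the problem concrete, here is a minimal sketch on a trimmed copy of the markup above (the `sample` and `plain` names are only for illustration). The link contains nothing but an `<img>`, so get_text() has no text to return for it, while the title sits in an attribute:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (one image link only).
sample = (
    '<td><b>"Assemble under Siegfried!"</b> '
    '<a href="/wiki/index.php/File:Continuous.png" class="image" title="CONT">'
    '<img alt="CONT" src="/wiki/images/thumb/7/78/Continuous.png/14px-Continuous.png"></a> '
    'This unit gains +10 attack.</td>'
)
soup = BeautifulSoup(sample, "html.parser")

plain = soup.find("td").get_text()
print("CONT" in plain)          # False: the title is an attribute, not text
print(soup.find("a")["title"])  # CONT
```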

3 answers:

Answer 0: (score: 0)

Oh... I see what you need.

Try this:

html_data = """ <td colspan="3"><b>"Assemble under Siegfried!"</b> 
    <a href="/wiki/index.php/File:Continuous.png" class="image" title="CONT"><img alt="CONT" src="/wiki/images/thumb/7/78/Continuous.png/14px-Continuous.png" width="14" height="17" srcset="/wiki/images/thumb/7/78/Continuous.png/21px-Continuous.png 1.5x, /wiki/images/7/78/Continuous.png 2x">
    </a> This unit gains +10 attack for each 
    <a href="/wiki/index.php/File:Black.png" class="image" title="Black"><img alt="Black" src="/wiki/images/thumb/7/71/Black.png/15px-Black.png" width="15" height="15" srcset="/wiki/images/thumb/7/71/Black.png/23px-Black.png 1.5x, /wiki/images/thumb/7/71/Black.png/30px-Black.png 2x">
    </a> and 
    <a href="/wiki/index.php/File:White.png" class="image" title="White"><img alt="White" src="/wiki/images/thumb/8/80/White.png/15px-White.png" width="15" height="15" srcset="/wiki/images/thumb/8/80/White.png/23px-White.png 1.5x, /wiki/images/thumb/8/80/White.png/30px-White.png 2x">
    </a> ally besides this unit.
</td>"""
from bs4 import BeautifulSoup
html = BeautifulSoup(html_data, "html.parser")

texts = [html.find("b").get_text()]
for a in html.find_all("a"):
    texts.append(a.attrs.get("title"))
    texts.append(a.next_element.next_element.next_element.strip())
print(" ".join(texts))

I'm not sure what you really want, but my guess is that you need the titles of the tags.

Example:

from bs4 import BeautifulSoup

html = BeautifulSoup(html_data, "html.parser")
for a in html.find_all("a"):
    print(a.attrs.get("title"))

Output:

CONT
Black
White

If you want to download the images:

from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

cdn_url = "http://some.com/"  # root URL of the site that serves the static content
html = BeautifulSoup(html_data, "html.parser")
for img in html.find_all("img"):
    # img_response.content holds the image bytes; write them to a file to keep the image
    img_response = requests.get(urljoin(cdn_url, img.attrs.get("src")))
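The loop above only fetches each image; the response body still has to be written to disk. A minimal sketch of that step follows. The helper names `image_targets` and `download_images`, the `images` output directory, and the use of the last path segment as the file name are all assumptions made for illustration, and the base URL is still a placeholder for the real site root.

```python
import os
import posixpath
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def image_targets(html_data, base_url):
    """Return (absolute_url, file_name) pairs for every <img> that has a src."""
    soup = BeautifulSoup(html_data, "html.parser")
    targets = []
    for img in soup.find_all("img"):
        src = img.get("src")
        if not src:
            continue  # skip images without a src attribute
        url = urljoin(base_url, src)
        # Assumption: the last path segment is a usable file name.
        file_name = posixpath.basename(urlparse(url).path)
        targets.append((url, file_name))
    return targets


def download_images(html_data, base_url, out_dir="images"):
    """Fetch every image and write its bytes into out_dir (hypothetical helper)."""
    os.makedirs(out_dir, exist_ok=True)
    for url, file_name in image_targets(html_data, base_url):
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # fail on HTTP errors instead of saving an error page
        with open(os.path.join(out_dir, file_name), "wb") as fh:
            fh.write(response.content)
```

Calling `download_images(html_data, "http://some.com/")` would then save files such as `15px-Black.png` under `images/`. Two images that share a file name would overwrite each other, so a real scraper may need a more careful naming scheme.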

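Answer 1: (score: 0)

The output you want is not that easy to get from the HTML above (at least for me). However, here is an approach I tried that produces the desired output.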

from bs4 import BeautifulSoup

content="""
<td colspan="3"><b>"Assemble under Siegfried!"</b> 
    <a href="/wiki/index.php/File:Continuous.png" class="image" title="CONT"><img alt="CONT" src="/wiki/images/thumb/7/78/Continuous.png/14px-Continuous.png" width="14" height="17" srcset="/wiki/images/thumb/7/78/Continuous.png/21px-Continuous.png 1.5x, /wiki/images/7/78/Continuous.png 2x">
    </a> This unit gains +10 attack for each 
    <a href="/wiki/index.php/File:Black.png" class="image" title="Black"><img alt="Black" src="/wiki/images/thumb/7/71/Black.png/15px-Black.png" width="15" height="15" srcset="/wiki/images/thumb/7/71/Black.png/23px-Black.png 1.5x, /wiki/images/thumb/7/71/Black.png/30px-Black.png 2x">
    </a> and 
    <a href="/wiki/index.php/File:White.png" class="image" title="White"><img alt="White" src="/wiki/images/thumb/8/80/White.png/15px-White.png" width="15" height="15" srcset="/wiki/images/thumb/8/80/White.png/23px-White.png 1.5x, /wiki/images/thumb/8/80/White.png/30px-White.png 2x">
    </a> ally besides this unit.
</td>
"""
soup = BeautifulSoup(content,"lxml")
part1 = soup.select_one("td > b").text.strip('"')
part2 = ' '.join(''.join([''.join([item['title'], item.next_sibling]) for item in soup.select("td a")]).split())
print("{} {}".format(part1,part2))

Output:

Assemble under Siegfried! CONT This unit gains +10 attack for each Black and White ally besides this unit.
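The expression that builds part2 is dense, so here is the same idea unrolled into a loop, as a rough sketch only. The function name `title_and_tail` and the shortened `sample` markup are made up for illustration, the pieces are joined with spaces rather than concatenated directly, and `html.parser` is used only so that the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup


def title_and_tail(soup):
    """Join each link's title with the text node that follows the link."""
    pieces = []
    for link in soup.select("td a"):
        pieces.append(link["title"])
        tail = link.next_sibling  # the text right after </a>, if there is any
        if isinstance(tail, str):
            pieces.append(tail)
    # split() followed by join() collapses newlines and runs of spaces
    return " ".join(" ".join(pieces).split())


sample = (
    '<td><b>"Assemble under Siegfried!"</b> '
    '<a href="/wiki/index.php/File:Black.png" title="Black"><img alt="Black"></a> and '
    '<a href="/wiki/index.php/File:White.png" title="White"><img alt="White"></a> ally.</td>'
)
soup = BeautifulSoup(sample, "html.parser")
print(title_and_tail(soup))  # Black and White ally.
```

The `isinstance` check guards against a link that is the last node in the cell or is followed directly by another tag, cases in which concatenating `item.next_sibling` to a string would raise a TypeError.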

Let's not do this again.

Answer 2: (score: 0)

Another approach is to iterate over the contents of the td tag. I find this a bit easier to understand.

html = '''<td colspan="3"><b>"Assemble under Siegfried!"</b> 
    <a href="/wiki/index.php/File:Continuous.png" class="image" title="CONT"><img alt="CONT" src="/wiki/images/thumb/7/78/Continuous.png/14px-Continuous.png" width="14" height="17" srcset="/wiki/images/thumb/7/78/Continuous.png/21px-Continuous.png 1.5x, /wiki/images/7/78/Continuous.png 2x">
    </a> This unit gains +10 attack for each 
    <a href="/wiki/index.php/File:Black.png" class="image" title="Black"><img alt="Black" src="/wiki/images/thumb/7/71/Black.png/15px-Black.png" width="15" height="15" srcset="/wiki/images/thumb/7/71/Black.png/23px-Black.png 1.5x, /wiki/images/thumb/7/71/Black.png/30px-Black.png 2x">
    </a> and 
    <a href="/wiki/index.php/File:White.png" class="image" title="White"><img alt="White" src="/wiki/images/thumb/8/80/White.png/15px-White.png" width="15" height="15" srcset="/wiki/images/thumb/8/80/White.png/23px-White.png 1.5x, /wiki/images/thumb/8/80/White.png/30px-White.png 2x">
    </a> ally besides this unit.
</td>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
final_text = []

for content in soup.find('td').contents:
    if content.name == 'a':
        final_text.append(content['title'])
    elif content.name == 'b':
        final_text.append(content.text.strip())
    else:
        final_text.append(content.strip())

print(' '.join(final_text))

Output:

"Assemble under Siegfried!"  CONT This unit gains +10 attack for each Black and White ally besides this unit.

Or, as a one-liner:

final_text = ' '.join((x['title'] if x.name == 'a' else (x.text.strip() if x.name == 'b' else x.strip())) for x in soup.find('td').contents)
print(final_text)

Or, better yet, wrap it in a function with a name similar to get_text() that returns the text of the td tag:

def get_modified_text(td):
    return ' '.join((x['title'] if x.name == 'a' else (x.text.strip() if x.name == 'b' else x.strip())) for x in td.contents)

soup = BeautifulSoup(html, 'lxml')
print(get_modified_text(soup.find('td')))
# "Assemble under Siegfried!"  CONT This unit gains +10 attack for each Black and White ally besides this unit.

Note: if you don't want the quotation marks " around the first piece of text, use .strip('"') on the b tag's text.
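A different route to the same result, sketched here under my own assumptions rather than taken from the code above: swap each image link for its title with replace_with(), then let get_text() do the rest. The function name `text_with_titles` and the trimmed `sample` markup are hypothetical, and note that the function modifies the tag it is given.

```python
from bs4 import BeautifulSoup


def text_with_titles(td):
    """Return td's text with every titled <a> replaced by its title."""
    for link in td.find_all("a", title=True):
        link.replace_with(link["title"])  # mutates the parsed tree in place
    # Collapse all whitespace runs so the result is a single clean line.
    return " ".join(td.get_text().split())


sample = (
    '<td><b>"Assemble under Siegfried!"</b> '
    '<a href="/wiki/index.php/File:Continuous.png" title="CONT"><img alt="CONT"></a> '
    'This unit gains +10 attack.</td>'
)
td = BeautifulSoup(sample, "html.parser").find("td")
result = text_with_titles(td).replace('"', "")
print(result)
# Assemble under Siegfried! CONT This unit gains +10 attack.
```

Because the closing quotation mark ends up in the middle of the combined string, this sketch removes quotes with `.replace('"', "")`; `.strip('"')` on the full result would only remove the leading one. Unlike the branch-per-tag loop, this version does not need to know which other tags (such as b) appear inside the cell.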