Can't figure out how to scrape data in the body tag using Beautiful Soup (Python)

Time: 2016-06-19 02:18:03

Tags: python beautifulsoup

from bs4 import BeautifulSoup
import urllib
from openpyxl import Workbook
from openpyxl.compat import range
from openpyxl.cell import get_column_letter

r = urllib.urlopen('https://www.vrbo.com/576329').read()
soup = BeautifulSoup(r)
rate = soup.find_all('body')

print rate
print type(soup)

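I'm trying to capture the values in the container, such as data-bedrooms="3", specifically the value given inside the quotes, but I don't know what they are formally called or how to parse them.

Below is a partial sample of the printed "body". So I know the values are there; capturing a specific part is what I can't manage:

data-ratemaximum="$260" data-rateminimum="$220" data-rateunits="night" data-rawlistingnumber="576329" data-requestuuid="73bcfaa3-9637-40a8-801c-ae86f93caf39" data-searchpdptab="C" data-serverday="18" data-showbookingphone="false"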

3 Answers:

Answer 0: (score: 1)

To get the value of an attribute, use rate['attr'], for example:

from bs4 import BeautifulSoup
import urllib
from openpyxl import Workbook
from openpyxl.compat import range
from openpyxl.cell import get_column_letter

r = urllib.urlopen('https://www.vrbo.com/576329').read()
soup = BeautifulSoup(r, "html.parser")
rate = soup.find('body')
print rate['data-ratemaximum']
print rate['data-rateunits']
print rate['data-rawlistingnumber']
print rate['data-requestuuid']
print rate['data-searchpdptab']
print rate['data-serverday']
print rate['data-showbookingphone']

print rate
print type(soup)
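
The attribute lookup above can be tried offline. The following is a minimal sketch written for Python 3; the `SAMPLE_HTML` string is a hypothetical stand-in for the downloaded page, built from the attribute values quoted in the question, and the `data-not-there` name is made up to show the missing-attribute case.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, assembled from the
# attribute values quoted in the question (not the real page source).
SAMPLE_HTML = """
<html><body data-bedrooms="3" data-ratemaximum="$260" data-rateminimum="$220"
 data-rateunits="night" data-rawlistingnumber="576329">
<p>listing</p>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
body = soup.find("body")  # find() returns a single Tag, not a list

# Subscript access raises KeyError when the attribute is absent.
rate_maximum = body["data-ratemaximum"]

# .get() returns None (or a supplied default) instead of raising.
missing = body.get("data-not-there")

print(rate_maximum, body["data-rateunits"], missing)
```

Using `.get()` may be the safer choice when it is not certain that every page carries every attribute.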


Answer 1: (score: 0)

You need to pick apart your results. It may help to know that what you are looking for is called a tag attribute in HTML:

body_tag = rate[0]
data_bedrooms = body_tag.attrs['data-bedrooms']

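The code above assumes there is only one <body>. If there are more, you will need a for loop over rate. You may also want to convert the value to an integer with int().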
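
The loop and the int() conversion mentioned above might look like the sketch below (Python 3). `find_all` returns a list-like result, which is why it has to be indexed or looped over before attributes can be read. The markup is hypothetical: a real page has a single <body>, so <div> tags stand in here to show several matches, one of them without the attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only to illustrate looping over several matches.
SAMPLE_HTML = """
<div data-bedrooms="3"></div>
<div data-bedrooms="2"></div>
<div></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

bedroom_counts = []
for tag in soup.find_all("div"):
    # attrs is a plain dict of attribute name -> value
    raw = tag.attrs.get("data-bedrooms")
    if raw is not None:
        # attribute values arrive as strings; convert explicitly
        bedroom_counts.append(int(raw))

print(bedroom_counts)
```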

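Answer 2: (score: -1)

Not sure whether you only want data-bedrooms from the soup object. I took a cursory look at the output and was able to infer that the data-* items you mention are attributes, not tags. If the doc structure is consistent, you can locate the tag that carries each attribute and make the lookups more efficient: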

import re
# regex pattern for attribs
data_tag_pattern = re.compile(r'^data-')

# Create list of attribs
attribs_wanted = "data-bedrooms data-rateminimum data-rateunits data-rawlistingnumber data-requestuuid data-searchpdptab data-serverday data-showbookingphone".split()


# Search entire tree
for item in soup.findAll():
    # Use descendants to recurse downwards
    for child in item.descendants:
        try:
            for attribute in child.attrs:
                if data_tag_pattern.match(attribute) and attribute in attribs_wanted:
                    print("{}: {}".format(attribute, child[attribute]))
        except AttributeError:
            pass

This produces output along these lines:

data-showbookingphone: False
data-bedrooms: 3
data-requestuuid: 2b6f4d21-8b04-403d-9d25-0a660802fb46
data-serverday: 18
data-rawlistingnumber: 576329
data-searchpdptab: C

HTH!
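
A shorter variant of the same idea, offered as a sketch rather than a drop-in replacement: visit every tag once with `find_all(True)` and keep any attribute whose name starts with `data-`. The sample markup and the `data_attrs` name are assumptions for illustration (Python 3).

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class and id attributes are there to show
# that non-data attributes are skipped.
SAMPLE_HTML = """
<html><body data-bedrooms="3" data-serverday="18" class="page">
<div data-rateunits="night" id="rates"></div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

data_attrs = {}
# find_all(True) matches every tag, so each tag is visited exactly once.
for tag in soup.find_all(True):
    for name, value in tag.attrs.items():
        if name.startswith("data-"):
            data_attrs[name] = value

print(data_attrs)
```

If two tags carry the same attribute name, the later one overwrites the earlier one in this sketch; collecting values in a list per name would avoid that.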