I am scraping this URL.
I need to scrape the page's main content, such as Room Features and Internet Access.
Here is my code:
for h3s in Column:  # Suppose this is div.RightColumn
    for index, test in enumerate(h3s.select("h3")):
        print("Feature title: " + str(test.text))
        for v in h3s.select("ul")[index]:
            print(v.string.strip())
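This code scrapes all the <li> items, but when it reaches Internet Access I get
AttributeError: 'NoneType' object has no attribute 'strip'
because the <li> data under the Internet Access heading is enclosed in double quotes, e.g. "Wired High Speed Internet Access ..."
I tried replacing print(v.string.strip()) with print(v), which outputs <li>Wired High...</li>
I also tried print(v.text), but that did not work either.
The relevant part of the page is: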
<h3>Internet Access</h3>
<ul>
<li>Wired High Speed Internet Access in All Guest Rooms
<span class="fee">
25 USD per day
</span>
</li>
</ul>
Answer 0 (score: 1)
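A BeautifulSoup element only has a .string value if that string is the only child of the element. Your <li> tag contains a <span> element as well as text.
Use the .text attribute to extract all the strings as one:
print(v.text.strip())
or use the get_text() method:
print(v.get_text().strip())
which also comes with a handy strip flag to remove extra whitespace:
print(v.get_text(' ', strip=True))
The first argument is the separator used to join the individual strings together; I used a space here.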
Demo:
>>> from bs4 import BeautifulSoup
>>> sample = '''\
... <h3>Internet Access</h3>
... <ul>
... <li>Wired High Speed Internet Access in All Guest Rooms
... <span class="fee">
... 25 USD per day
... </span>
... </li>
... </ul>
... '''
>>> soup = BeautifulSoup(sample, 'html.parser')
>>> soup.li
<li>Wired High Speed Internet Access in All Guest Rooms
<span class="fee">
25 USD per day
</span>
</li>
>>> soup.li.string
>>> soup.li.text
u'Wired High Speed Internet Access in All Guest Rooms\n \n 25 USD per day\n \n'
>>> soup.li.get_text(' ', strip=True)
u'Wired High Speed Internet Access in All Guest Rooms 25 USD per day'
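To make the only-child rule concrete, here is a minimal sketch that puts a text-only <li> next to one that also holds a <span>. The two-item list is a made-up fragment for illustration, not the real page, and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical two-item list: the first <li> holds only text,
# the second mixes text with a nested <span>.
html = """
<ul>
  <li>Non-Smoking Room</li>
  <li>Wired High Speed Internet Access <span class="fee">25 USD per day</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
plain, mixed = soup.find_all("li")

plain_string = plain.string  # exactly one text child, so a string comes back
mixed_string = mixed.string  # text plus a <span>: more than one child, so None
mixed_text = mixed.get_text(" ", strip=True)  # joins every nested string

print(repr(plain_string))
print(repr(mixed_string))
print(mixed_text)
```

Calling .strip() on mixed_string is what raises the AttributeError from the question, while get_text() works for both items.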
Be sure to call get_text() on the <ul> element itself:
for index, test in enumerate(h3s.select("h3")):
    print("Feature title: ", test.text)
    ul = h3s.select("ul")[index]
    print(ul.get_text(' ', strip=True))
You could use the find_next_sibling() function here instead of indexing into the .select() result:
for header in h3s.select("h3"):
    print("Feature title: ", header.text)
    ul = header.find_next_sibling("ul")
    print(ul.get_text(' ', strip=True))
Demo:
>>> for header in h3s.select("h3"):
...     print("Feature title: ", header.text)
...     ul = header.find_next_sibling("ul")
...     print(ul.get_text(' ', strip=True))
...
Feature title: Room Features
Non-Smoking Room Connecting Rooms Available Private Terrace Sea View Room Suites Available Private Balcony Bay View Room Honeymoon Suite Starwood Preferred Guest Room Room with Sitting Area
Feature title: Internet Access
Wired High Speed Internet Access in All Guest Rooms 25 USD per day
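If each list item should stay separate rather than being joined into one long line, the same approach can be applied per <li>. The following is a self-contained sketch under a few assumptions: the HTML is a shortened stand-in for div.RightColumn (the real page is not reproduced here), and extract_features is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the page's div.RightColumn.
html = """
<div class="RightColumn">
  <h3>Room Features</h3>
  <ul>
    <li>Non-Smoking Room</li>
    <li>Sea View Room</li>
  </ul>
  <h3>Internet Access</h3>
  <ul>
    <li>Wired High Speed Internet Access in All Guest Rooms
      <span class="fee">
        25 USD per day
      </span>
    </li>
  </ul>
</div>
"""


def extract_features(column):
    """Map each <h3> title to the cleaned <li> texts of the <ul> that follows it."""
    features = {}
    for header in column.select("h3"):
        ul = header.find_next_sibling("ul")
        if ul is None:  # no <ul> sibling after this heading
            continue
        title = header.get_text(strip=True)
        features[title] = [li.get_text(" ", strip=True) for li in ul.find_all("li")]
    return features


soup = BeautifulSoup(html, "html.parser")
column = soup.select_one("div.RightColumn")
features = extract_features(column)

for title, items in features.items():
    print("Feature title:", title)
    for item in items:
        print(" -", item)
```

The None check matters because find_next_sibling() returns None when no matching sibling follows, which would otherwise raise the same kind of AttributeError as in the question.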