How can I get specific items when the elements share the same class name and attributes?
I need to get these 3 items:
April 14, 2013
580
Fort Pierce, FL
<dl class="pairsJustified">
<dt>Joined:</dt>
<dd>Apr 14, 2013</dd>
</dl>
<dl class="pairsJustified">
<dt>Messages:</dt>
<dd><a href="search/member?user_id=13302" class="concealed"
rel="nofollow">580</a></dd>
</dl>
<dl class="pairsJustified">
<dt>Location:</dt>
<dd>
<a href="misc/location-info?location=Fort+Pierce%2C+FL" target="_blank"
rel="nofollow noreferrer" itemprop="address" class="concealed">Fort
Pierce, FL</a>
Answer 0 (score: 1)
Since the values all lie under <dd> tags, you can use .find_all():
from bs4 import BeautifulSoup
test = '''<dl class="pairsJustified">
<dt>Joined:</dt>
<dd>Apr 14, 2013</dd>
</dl>
<dl class="pairsJustified">
<dt>Messages:</dt>
<dd><a href="search/member?user_id=13302" class="concealed"
rel="nofollow">580</a></dd>
</dl>
<dl class="pairsJustified">
<dt>Location:</dt>
<dd>
<a href="misc/location-info?location=Fort+Pierce%2C+FL" target="_blank"
rel="nofollow noreferrer" itemprop="address" class="concealed">Fort Pierce, FL</a>'''
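soup = BeautifulSoup(test, 'html.parser')
# Every value of interest sits inside a <dd> element
data = soup.find_all("dd")
for d in data:
    # .text includes the text of nested <a> tags; strip() drops surrounding newlines
    print(d.text.strip())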
OUTPUT:
Apr 14, 2013
580
Fort Pierce, FL
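If you need to pick out one specific value rather than every <dd>, one option is to pair each <dt> label with its <dd> value. The following is only a sketch: the closing </dd> and </dl> tags at the end of the sample are my assumption (the snippet in the question is cut off), and the names html and info are just illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup from the question; the final </dd></dl> is assumed,
# because the original snippet is truncated.
html = '''<dl class="pairsJustified">
<dt>Joined:</dt>
<dd>Apr 14, 2013</dd>
</dl>
<dl class="pairsJustified">
<dt>Messages:</dt>
<dd><a href="search/member?user_id=13302" class="concealed"
rel="nofollow">580</a></dd>
</dl>
<dl class="pairsJustified">
<dt>Location:</dt>
<dd>
<a href="misc/location-info?location=Fort+Pierce%2C+FL" target="_blank"
rel="nofollow noreferrer" itemprop="address" class="concealed">Fort Pierce, FL</a>
</dd>
</dl>'''

soup = BeautifulSoup(html, 'html.parser')

info = {}
for dl in soup.find_all('dl', class_='pairsJustified'):
    dt = dl.find('dt')
    dd = dl.find('dd')
    if dt is None or dd is None:
        continue  # skip blocks that do not hold a label/value pair
    # "Joined:" -> "Joined"
    label = dt.get_text(strip=True).rstrip(':')
    info[label] = dd.get_text(strip=True)

print(info['Joined'])
print(info['Messages'])
print(info['Location'])
```

With the values keyed by label, it no longer matters that all three <dl> blocks share the same class.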
Answer 1 (score: 0)
Here is a good starting point:
In [18]: for a in response.css('.extraUserInfo'):
    ...:     print(a.css('*::text').extract())
    ...:     print('\n\n\n')
    ...:
['\n', '\n', '\n', '\n'] # <--this (and other outputs like this) is because there is an extra `extraUserInfo` class block above the desired info block if the user has a user group picture/avatar below their username
['\n', '\n', 'Joined:', '\n', 'Mar 24, 2013', '\n', '\n', '\n', 'Messages:', '\n', '6,747', '\n', '\n']
['\n', '\n', '\n', '\n']
['\n', '\n', 'Joined:', '\n', 'Mar 24, 2013', '\n', '\n', '\n', 'Messages:', '\n', '6,747', '\n', '\n']
['\n', '\n', 'Joined:', '\n', 'Apr 14, 2013', '\n', '\n', '\n', 'Messages:', '\n', '580', '\n', '\n', '\n', 'Location:', '\n', '\n', 'Fort Pierce, FL', '\n', '\n', '\n']
['\n', '\n', 'Joined:', '\n', 'Oct 20, 2012', '\n', '\n', '\n', 'Messages:', '\n', '2,476', '\n', '\n', '\n', 'Location:', '\n', '\n', 'Philadelphia, PA', '\n', '\n', '\n']
['\n', '\n', 'Joined:', '\n', 'Dec 11, 2012', '\n', '\n', '\n', 'Messages:', '\n', '2,938', '\n', '\n', '\n', 'Location:', '\n', '\n', 'Colorado', '\n', '\n', '\n']
['\n', '\n', 'Joined:', '\n', 'Sep 30, 2016', '\n', '\n', '\n', 'Messages:', '\n', '833', '\n', '\n', '\n', 'Location:', '\n', '\n', 'Indiana', '\n', '\n', '\n']
...
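There are many ways to approach this. A little fiddling will get the data into whatever format you like. The approach above is only a starting point, because many of the output lines are just lists of newline characters. That happens because (it appears) when a user has a user group image (such as the Arizona Tesla one), the extraUserInfo class is also used to group that block of html. There will be better ways to group this...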
Basically, response.css('.extraUserInfo') collects every block with the class extraUserInfo, which appear to be the blocks that hold the user info you are looking for.
From there, use the *::text pseudo-selector to extract all of the underlying text, and parse the resulting array.
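The array-parsing step could look roughly like this. It is a minimal sketch in plain Python, not part of Scrapy: parse_user_info is a hypothetical helper name, the sample lists are copied from the output above, and it assumes every label ends with a colon while values do not.

```python
def parse_user_info(texts):
    """Turn ['\\n', 'Joined:', '\\n', 'Apr 14, 2013', ...] into a dict."""
    # Drop the whitespace-only entries and trim the rest
    tokens = [t.strip() for t in texts if t.strip()]
    info = {}
    label = None
    for tok in tokens:
        if tok.endswith(':'):
            # Assumption: anything ending in ':' is a label such as "Joined:"
            label = tok[:-1]
        elif label is not None:
            info[label] = tok
            label = None
    return info


# Two sample rows taken from the output shown earlier in this answer
rows = [
    ['\n', '\n', '\n', '\n'],
    ['\n', '\n', 'Joined:', '\n', 'Apr 14, 2013', '\n', '\n', '\n',
     'Messages:', '\n', '580', '\n', '\n', '\n',
     'Location:', '\n', '\n', 'Fort Pierce, FL', '\n', '\n', '\n'],
]

# Rows that held only newlines parse to an empty dict and are filtered out
users = [info for info in map(parse_user_info, rows) if info]
print(users)
```

In the Scrapy shell you would pass a.css('*::text').extract() for each block instead of the hard-coded rows.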
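If you look closely at the html structure, there are certainly better ways to go about this, so that the way you extract the data reduces the processing you have to do afterwards, but this should put you on the right path. The CSS selector or XPath documentation should help a lot.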