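I have this HTML code: http://imgur.com/a/dPNYI
I am trying to extract and print the line highlighted in the image ("some text").
"some text" is the text of the first div with class=chat-message nested inside div id=chat-messages (in other words, I am trying to extract the text of the first child div of div id=chat-messages; all of its children have a similar structure).
What I tried: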
import requests
from bs4 import BeautifulSoup
url = "the url this is used for"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
g_data = soup.find('div',{'class':'chat-message-content selectable'})
print(g_data.text)
This gives me the error:
AttributeError: 'NoneType' object has no attribute 'text'
It seems that g_data is None.
What am I doing wrong? Thanks!
HTML code:
<html>
<head>
<title>
</title>
</head>
<body>
<div id="main">
<div data-reactroot="" id="app">
<div class="top-bar-authenticated" id="top-bar">
</div>
<div class="closed" id="navigation-bar">
</div>
<div id="right-sidebar">
<div id="chat">
<div id="chat-head">
</div>
<div id="chat-title">
</div>
<div id="chat-messages">
<div class="chat-message">
<div class="chat-message-avatar" style="background-image: url('https://steamcdn-a.akamaihd.net/steamcommunity/public/images/avatars/65/657dcec97cc00bc378629930ecae1776c0d981e0.jpg');">
</div>
<a class="chat-message-username clickable">
<div class="iron-color">
aloe
</div></a>
<div class="chat-message-content selectable">
<!-- react-text: 2532 -->some text<!-- /react-text -->
</div>
</div>
<div class="chat-message">
<div class="chat-message-avatar" style="background-image: url('https://steamcdn-a.akamaihd.net/steamcommunity/public/images/avatars/65/657dcec97cc00bc378629930ecae1776c0d981e0.jpg');">
</div>
<a class="chat-message-username clickable">
<div class="iron-color">
aloe
</div></a>
<div class="chat-message-content selectable">
<!-- react-text: 2533 -->some other text<!-- /react-text -->
</div>
</div>
<div class="chat-message">
</div>
<div class="chat-message">
</div>
<div class="chat-message">
</div>
<div class="chat-message">
</div>
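Answer 0 (score: 1)
Reading your comments on the question, I see that you are trying to parse a site that loads its content with JavaScript, which is why requests alone does not work for you: it fetches only the initial HTML and never runs the scripts that insert the chat messages. You should use selenium together with a web driver (for example, Chromedriver or PhantomJS). Something like the code below: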
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.csgoarena.com/home")
soup = BeautifulSoup(driver.page_source, 'lxml')
g_data = soup.findAll('div',{'class':'chat-message-content selectable'})
print(g_data)
Since you want the .text of every selected element:
>>> for match in g_data:
...     print(match.text)
...
not everytime :D
I understand :)
NuuZy csgoarena.com but he won GA's only when it were long
Yea I always saw him
Everyday
caught
(...)
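To make the cause of the AttributeError concrete, here is a minimal sketch that runs the question's lookup against two inline HTML strings. Both strings are hypothetical stand-ins, not the site's real responses: static_html imitates what requests might receive before any JavaScript runs, and rendered_html imitates the markup from the question after the page has rendered. The helper name first_message and the use of the built-in 'html.parser' (instead of 'lxml') are also assumptions made to keep the sketch self-contained.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins: an empty React mount point (roughly what a plain
# HTTP fetch could return) versus the markup after JavaScript has run.
static_html = '<div id="main"><div id="app"></div></div>'
rendered_html = (
    '<div id="chat-messages">'
    '<div class="chat-message">'
    '<div class="chat-message-content selectable">'
    '<!-- react-text: 2532 -->some text<!-- /react-text -->'
    '</div></div></div>'
)


def first_message(html):
    """Return the first chat message's text, or None if it is not in the markup."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('div', {'class': 'chat-message-content selectable'})
    # find() returns None when nothing matches, so check before reading text
    return tag.get_text(strip=True) if tag is not None else None


print(first_message(static_html))
print(first_message(rendered_html))
```

Checking for None before accessing .text avoids the AttributeError, but it does not make the missing content appear; for that the page still has to be rendered, for example with selenium as above.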
Answer 1 (score: 0)
If you want to search for tags that match two or more CSS classes, you should use a CSS selector, for example soup.select('div.chat-message-content.selectable').
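A minimal sketch of the CSS selector approach, assuming the page has already been rendered and its markup looks like the HTML in the question. The inline html string is a trimmed, hypothetical copy of that markup, and 'html.parser' is used only so the sketch has no extra dependency.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical copy of the rendered markup from the question
html = '''
<div id="chat-messages">
  <div class="chat-message">
    <div class="chat-message-content selectable">some text</div>
  </div>
  <div class="chat-message">
    <div class="chat-message-content selectable">some other text</div>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Chained class selectors match divs that carry both classes, in any order
all_messages = [
    tag.get_text(strip=True)
    for tag in soup.select('div.chat-message-content.selectable')
]

# select_one() returns the first match, or None if nothing matches
first = soup.select_one('#chat-messages > div.chat-message div.chat-message-content')
first_text = first.get_text(strip=True) if first is not None else None

print(all_messages)
print(first_text)
```

Unlike passing the exact string 'chat-message-content selectable' to find(), the selector does not depend on the order of the class names in the attribute.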