The HTML below is a single "tr" block from a table. My goal is to extract each value from the "tr" block.
HTML:
<tr><th colspan="7" class="tablehead">
Historical Data for NIFTY 50
</th></tr>
<tr><th colspan="7" class="tablehead">
For the period 01-01-2019 to 10-01-2019
</th></tr>
<tr>
<th >Date</th>
<th >Open</th>
<th >High</th>
<th >Low</th>
<th >Close</th>
<th>Shares Traded</th>
<th >Turnover <br/>(<!--Rs.--> <img src="/images/rup_t1.gif"
alt = "Rs." border="0"> Cr)</th>
</tr>
<tr>
<td class="date"><nobr>01-Jan-2019</nobr></td>
<td class="number"> 10881.70</td>
<td class="number"> 10923.60</td>
<td class="number"> 10807.10</td>
<td class="number"> 10910.10</td>
<td class="number"> 159404542</td>
<td class="number"> 8688.26</td>
</tr>
Python code:
rows = soup.find_all("tr")
for row in rows:
    cells = row.find_all('td')
    print(cells)
Output:
[]
Desired output:
[01-Jan-2019,10881.70,10923.60,10807.10,10910.10,159404542,8688.26]
Answer 0 (score: 0)
from bs4 import BeautifulSoup
test = '''<tr> <td class="date"><nobr>01-Jan-2019</nobr></td>
<td class="number">10881.70</td>
<td class="number">10923.60</td>
<td class="number">10807.10</td>
<td class="number">10910.10</td>
<td class="number">159404542</td>
<td class="number">8688.26</td>
</tr>'''
soup = BeautifulSoup(test, 'html.parser')
data = soup.find_all("tr")
for d in data:
    # Join the text of all cells in the row into one comma-separated string
    print([d.text.strip().replace("\n", ",").rstrip()])
Output:
['01-Jan-2019,10881.70,10923.60,10807.10,10910.10,159404542,8688.26']
Edit:
from bs4 import BeautifulSoup
test = '''<tr> <td class="date"><nobr>01-Jan-2019</nobr></td>
<td class="number">10881.70</td>
<td class="number">10923.60</td>
<td class="number">10807.10</td>
<td class="number">10910.10</td>
<td class="number">159404542</td>
<td class="number">8688.26</td>
</tr>'''
soup = BeautifulSoup(test, 'html.parser')
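date_tag = soup.find("nobr")  # the <nobr> element that holds the date
data = soup.find_all("td", class_="number")
data_list = []
# Iterating over a tag yields its children; here that is the date string
for child in date_tag:
    data_list.append(child)
for d in data:
    data_list.append(d.text)
print(data_list)
Output: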
['01-Jan-2019', '10881.70', '10923.60', '10807.10', '10910.10', '159404542', '8688.26']
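A row-by-row variant may be closer to what the question asks for. The following is only a minimal sketch, assuming the table is parsed with html.parser and that header rows contain only th cells; the names html and table_rows are illustrative and do not come from the original post.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table in the question: one header row, one data row.
html = """
<table>
<tr><th>Date</th><th>Open</th><th>High</th><th>Low</th>
<th>Close</th><th>Shares Traded</th><th>Turnover</th></tr>
<tr>
<td class="date"><nobr>01-Jan-2019</nobr></td>
<td class="number"> 10881.70</td>
<td class="number"> 10923.60</td>
<td class="number"> 10807.10</td>
<td class="number"> 10910.10</td>
<td class="number"> 159404542</td>
<td class="number"> 8688.26</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

table_rows = []
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    if not cells:
        # Header rows hold only <th> cells, so find_all("td") returns [].
        continue
    # get_text(strip=True) removes the leading space inside each cell.
    table_rows.append([cell.get_text(strip=True) for cell in cells])

print(table_rows)
# [['01-Jan-2019', '10881.70', '10923.60', '10807.10', '10910.10', '159404542', '8688.26']]
```

Skipping rows without td cells keeps the empty lists produced by the header rows out of the result, and each data row stays grouped in its own list.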
Answer 1 (score: 0)
It looks like you can specify multiple selectors using the CSS "or" (comma) syntax:
soup.select("nobr, td.number")
That is:
data = [item.text for item in soup.select("nobr, td.number")]
Output:
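As a self-contained check of the selector, here is a small sketch run against the data row from the question. It assumes html.parser and uses get_text(strip=True) instead of .text so the leading spaces in the cells are dropped; the names row_html and values are illustrative only.

```python
from bs4 import BeautifulSoup

# Data row copied from the question, wrapped in a <table> for completeness.
row_html = """
<table><tr>
<td class="date"><nobr>01-Jan-2019</nobr></td>
<td class="number"> 10881.70</td>
<td class="number"> 10923.60</td>
<td class="number"> 10807.10</td>
<td class="number"> 10910.10</td>
<td class="number"> 159404542</td>
<td class="number"> 8688.26</td>
</tr></table>
"""

soup = BeautifulSoup(row_html, "html.parser")

# The comma acts as "or": match every <nobr> and every <td class="number">,
# returned in document order.
values = [item.get_text(strip=True) for item in soup.select("nobr, td.number")]

print(values)
# ['01-Jan-2019', '10881.70', '10923.60', '10807.10', '10910.10', '159404542', '8688.26']
```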
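Answer 2 (score: 0)
Here is the answer. Defining a function that checks whether a cell's class is one of the two expected classes solves the problem.
def class_variants(css_class):
    # css_class is None for tags that have no class attribute
    wanted = ['date', 'number']
    return css_class in wanted
rows = soup.find_all("tr")
for row in rows:
    cells = row.find_all('td', class_=class_variants)
    print(str(cells))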
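Answer 3 (score: 0)
I'm not sure this is what you need, but I took your HTML and inserted another date <td> into the data row (just so we have two dates and can see whether both get captured), as well as another <td> with a dummy class to see whether it can be excluded. It now looks like this: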
html ='''
<tr><th colspan="7" class="tablehead">
Historical Data for NIFTY 50
</th></tr>
<tr><th colspan="7" class="tablehead">
For the period 01-01-2019 to 10-01-2019
</th></tr>
<tr>
<th >Date</th>
<th >Open</th>
<th >High</th>
<th >Low</th>
<th >Close</th>
<th>Shares Traded</th>
<th >Turnover <br/>(<!--Rs.--> <img src="/images/rup_t1.gif"
alt = "Rs." border="0"> Cr)</th>
</tr>
<tr>
<td class="date"><nobr>01-Jan-2019</nobr></td>
<td class="number"> 10881.70</td>
<td class="number"> 10923.60</td>
<td class="date"><nobr>08-Jan-2017</nobr></td>
<td class="number"> 10807.10</td>
<td class="number"> 10910.10</td>
<td class="sushi"> zumba</td>
<td class="number"> 159404542</td>
<td class="number"> 8688.26</td>
</tr>
'''
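Code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# Collect the text of every <td> whose class is "date" or "number"
info = [td.text for td in soup.find_all('td', {'class': ['date', 'number']})]
or
info = [item.text for item in soup.select(".date, .number")]
Then:
for i in info:
    print(i.strip())
Output: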
01-Jan-2019
10881.70
10923.60
08-Jan-2017
10807.10
10910.10
159404542
8688.26