I have the HTML extract below. Note that every row I need to capture repeats these two td elements.
<table class="ent">
<tbody class=""><tr class="tablestyle">
<td class="hide_on_mobile"> <a href="../" class="">
<img class="ProductImage" src="https://.."></a>
</td>
<td class="hide_on_mobile" align="center">
<strong class="">
<span style="font-size:1.4em;" class="">Scraped okay - col0</span>
<br>
<br>Scrape this text - col1</strong><br>
<br><i><span style="color:indigo;" class="">Scrape this text - col2
<br class="">
<br>Next Event: Scrape this text -col3</span></i>
</td>
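I need to capture 4 different pieces of data: col0, col1, col2 and col3.
I already have col0 working. I still need to capture col1, col2 and col3.
I am trying to use the BRs, i.e. the ones after the span:
get the text after the 2nd BR for col1
get the text after the 3rd BR for col2
get the text after the 5th BR for col3
I cannot get col1 to work with br > br. Any ideas on how to solve this?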
import sqlite3
import datetime
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "http:/*"
r = requests.get(url)
source = r.text
t = datetime.datetime.now().date()
soup = BeautifulSoup(source, "lxml")
row_count=200
row_marker = 0
new_table = pd.DataFrame(columns = ["col0", "col1", "col2","col3", "DateAdded"], index = range(0,row_count)) # I don't know the number of rows
# For col0
column_marker = 0
for layout in soup.select("strong > span"):
    new_table.iat[row_marker, column_marker] = layout.text.strip()
    new_table.iat[row_marker, 4] = t
    row_marker += 1
# For col1
column_marker = 1
row_marker = 0
for layout in soup.select("strong > span > br > br"):
    new_table.iat[row_marker, column_marker] = layout.text.strip()
    row_marker += 1
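A note on why the col1 loop never runs, as far as I can tell: `<br>` is a void element, so it cannot contain another `<br>`, and in this markup the BRs are siblings of the span rather than children of it. The selector "strong > span > br > br" therefore matches nothing. The sketch below reproduces this on a trimmed copy of the second td. The `html` string and the use of "html.parser" instead of "lxml" are assumptions made only to keep the example self-contained.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the second <td> from the extract above (assumed sample data).
html = """
<table class="ent"><tbody><tr class="tablestyle">
<td class="hide_on_mobile" align="center">
<strong class="">
<span style="font-size:1.4em;" class="">Scraped okay - col0</span>
<br>
<br>Scrape this text - col1</strong><br>
<br><i><span style="color:indigo;" class="">Scrape this text - col2
<br class="">
<br>Next Event: Scrape this text -col3</span></i>
</td></tr></tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

# <br> has no children and is a sibling of the span, so nothing matches.
no_match = soup.select("strong > span > br > br")

# The col1 text is a plain text node that follows the second <br> inside <strong>.
second_br = soup.select_one("strong").find_all("br")[1]
col1 = second_br.next_sibling.strip()

print(no_match)
print(col1)
```

Counting BRs this way is fragile if the markup changes, so it is probably better treated as a diagnostic than as the final scraper.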
Answer 0: (score: 0)
# since you said there are multiple trs
trs = soup.find_all('tr')  # soup is the BeautifulSoup object from the question
for tr in trs:
    l = []
    td = tr.find_all('td')
    # the first td never has data, according to the question above
    for tags in td[1]:
        try:
            if tags.text:
                print(tags.text)
                l.extend(tags.text.split('\n'))
        except AttributeError:
            # older bs4 versions: plain text nodes have no .text attribute
            pass
# once there are more trs, keep the code below inside the loop,
# then store the data in a df, since each iteration gives a new list
str_data = [' '.join(s.split()) for s in l if s]
str_data = [s for s in str_data if s]  # drop entries that were only whitespace
print(str_data)
Output
['Scraped okay - col0',
'Scrape this text - col1',
'Scrape this text - col2',
'Next Event: Scrape this text -col3']
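The same four strings can also be collected per row with `stripped_strings`, which avoids the manual splitting on newlines. This is only a sketch under some assumptions: the helper name `extract_rows`, the sample `html` string, and the rule of skipping rows that do not yield exactly four text pieces are all mine, and "html.parser" is used instead of "lxml" just to keep the example dependency-free.

```python
import datetime
from bs4 import BeautifulSoup

# One row copied from the extract in the question (assumed sample data).
html = """
<table class="ent">
<tbody class=""><tr class="tablestyle">
<td class="hide_on_mobile"> <a href="../" class="">
<img class="ProductImage" src="https://.."></a>
</td>
<td class="hide_on_mobile" align="center">
<strong class="">
<span style="font-size:1.4em;" class="">Scraped okay - col0</span>
<br>
<br>Scrape this text - col1</strong><br>
<br><i><span style="color:indigo;" class="">Scrape this text - col2
<br class="">
<br>Next Event: Scrape this text -col3</span></i>
</td></tr></tbody></table>
"""

COLUMNS = ["col0", "col1", "col2", "col3"]


def extract_rows(soup):
    """Hypothetical helper: one dict per <tr> whose second <td> holds four text pieces."""
    today = datetime.date.today()
    rows = []
    for tr in soup.find_all("tr"):
        tds = tr.find_all("td")
        if len(tds) < 2:
            continue  # no data cell in this row
        # stripped_strings yields every non-empty text node in document order
        parts = [" ".join(s.split()) for s in tds[1].stripped_strings]
        if len(parts) != len(COLUMNS):
            continue  # unexpected layout, skip rather than guess
        rows.append(dict(zip(COLUMNS, parts), DateAdded=today))
    return rows


rows = extract_rows(BeautifulSoup(html, "html.parser"))
print(rows)
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)`, so there is no need to preallocate 200 rows when the row count is unknown.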