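I want to parse a large .txt file and extract data based on its parent tag. The problem is that a class such as class="ro" matches hundreds of different pieces of text and numbers, most of which are useless.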
import requests
from bs4 import BeautifulSoup

data = requests.get('https://www.sec.gov/Archives/edgar/data/320193/0000320193-18-000070.txt')
# load the data
soup = BeautifulSoup(data.text, 'html.parser')
# get the data
for tr in soup.find_all('tr', {'class':['rou','ro','re','reu']}):
    db = [td.text.strip() for td in tr.find_all('td')]
    print(db)
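As mentioned above, this does fetch all of those tags, but 95% of what comes back is useless. I would like to filter by filename with a for loop or something similar, along the lines of "for every document with FILENAME = R2, R3, etc.", and then grab all tags with class "ro", "rou", and so on. Everything I have tried so far returns empty containers. Can anyone help? Thanks in advance!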
<DOCUMENT>
<TYPE>XML
<SEQUENCE>14
<FILENAME>R2.htm <------- for everything with this filename
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
<html>
<head>
<title></title>
.....removed for brevity
</head>
<body>
.....removed for brevity
<td class="text"> <span></span> <------ return this tag
</td>
.....removed for brevity
</tr>
Two complete example files can be found here:
https://www.sec.gov/Archives/edgar/data/1800/0001104659-18-065076.txt
https://www.sec.gov/Archives/edgar/data/1084869/0001437749-18-020205.txt
Answer 0 (score: 1)
I'm not sure what output you want, but with bs4 4.7.1 you can use the :contains pseudo-class to filter on the filename tags:
import requests
from bs4 import BeautifulSoup

data = requests.get('https://www.sec.gov/Archives/edgar/data/320193/0000320193-18-000070.txt')
soup = BeautifulSoup(data.text, 'lxml')
filenames = ['R2.htm', 'R3.htm']

for filename in filenames:
    print('-----------------------------')
    i = 1
    for item in soup.select('filename:contains("' + filename + '")'):
        print(filename, ' ', 'result' + str(i))
        for tr in item.find_all('tr', {'class':['rou','ro','re','reu']}):
            db = [td.text.strip() for td in tr.find_all('td')]
            print(db)
        i+=1
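
A different way to get the same kind of filtering is to cut the raw text into its <DOCUMENT> sections before any HTML parsing, and keep only the sections whose <FILENAME> line matches exactly. The sketch below is not the approach from the answer above. It uses only the standard library, and it assumes each section is closed by </DOCUMENT> and wraps its payload in <TEXT>...</TEXT> (the excerpt in the question is cut off before those closing tags). The names split_documents, select_documents and SAMPLE are made up for illustration, and SAMPLE is invented data in the same layout as the question's excerpt.

```python
import re

# One <DOCUMENT>...</DOCUMENT> section (closing tag is an assumption, see above).
DOCUMENT_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL | re.IGNORECASE)
# The <FILENAME> header is unclosed; take the first token that follows it.
FILENAME_RE = re.compile(r"<FILENAME>\s*([^\s<]+)", re.IGNORECASE)
# The payload is assumed to sit between <TEXT> and </TEXT>.
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL | re.IGNORECASE)


def split_documents(raw_text):
    """Return {filename: text body} for every <DOCUMENT> section in raw_text."""
    documents = {}
    for match in DOCUMENT_RE.finditer(raw_text):
        section = match.group(1)
        name = FILENAME_RE.search(section)
        body = TEXT_RE.search(section)
        if name and body:
            documents[name.group(1)] = body.group(1).strip()
    return documents


def select_documents(raw_text, wanted):
    """Keep only the sections whose filename is exactly one of `wanted`."""
    documents = split_documents(raw_text)
    return {name: documents[name] for name in wanted if name in documents}


# Invented sample data, shaped like the excerpt in the question.
SAMPLE = """\
<DOCUMENT>
<TYPE>XML
<SEQUENCE>14
<FILENAME>R2.htm
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
<html><body><table><tr class="ro"><td class="text">Cash</td></tr></table></body></html>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>XML
<SEQUENCE>24
<FILENAME>R12.htm
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
<html><body><table><tr class="ro"><td class="text">Other</td></tr></table></body></html>
</TEXT>
</DOCUMENT>
"""

selected = select_documents(SAMPLE, ["R2.htm", "R3.htm"])
print(sorted(selected))  # ['R2.htm']
```

Each body in selected can then be handed to BeautifulSoup(body, 'html.parser') and searched with the same find_all('tr', {'class': [...]}) loop as in the question. Because the filename is compared exactly, asking for R2.htm does not pick up R12.htm, which a plain substring test could in principle also accept. As a side note on the answer's selector, newer Soup Sieve releases spell the pseudo-class :-soup-contains and treat :contains as deprecated, so the selector may need adjusting depending on the installed version.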