I'm still trying to learn how to scrape websites, and I ran into this case related to my work. Site: https://home.treasury.gov/policy-issues/financial-sanctions/consolidated-sanctions-list-data-files
Filtering for the sanctions list, I'm trying to grab the 3 CSV hrefs you can see on that page.
I know I could just import the CSVs from those links directly in Python, but I want to build up my scraping knowledge. All the "tr" elements share the same class name, "text-align-left", so my idea was to loop over the main "table" element with class "ms-rteTable-1" and then narrow the selection down to the links ending in CSV, but my loop doesn't seem to work. What am I doing wrong? Any help is appreciated.
import requests
from bs4 import BeautifulSoup

source_html = requests.get('https://home.treasury.gov/policy-issues/financial-sanctions/consolidated-sanctions-list-data-files').text
soup = BeautifulSoup(source_html, 'lxml')
for i in soup.find_all('table', attrs={'class': 'ms-rteTable-1'}):
    print(i.a['href'])
I only get the first file (the .zip). What should I do?
Answer 0 (score: 1)
Another way, using an XPath selector:
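(The code for this answer did not come through cleanly, so the following is only a sketch of what an lxml/XPath approach might look like, reusing the table class and URL from the question; the exact expression is an assumption, not the original answer's code.)

import requests
from lxml import html

url = 'https://home.treasury.gov/policy-issues/financial-sanctions/consolidated-sanctions-list-data-files'
tree = html.fromstring(requests.get(url).content)

# Assumed XPath: every href inside the ms-rteTable-1 table
hrefs = tree.xpath('//table[contains(@class, "ms-rteTable-1")]//a/@href')
# Keep only the links that end in .csv
print([h for h in hrefs if h.lower().endswith('.csv')])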
Answer 1 (score: 0)
I'd like to offer a different approach: why not select all the anchors that have .csv in them?
Here's how:
import requests
from bs4 import BeautifulSoup

url = 'https://home.treasury.gov/policy-issues/financial-sanctions/consolidated-sanctions-list-data-files'
# Collect every <a> tag whose text contains ".CSV"
anchors = BeautifulSoup(requests.get(url).text, 'lxml').find_all(lambda t: t.name == "a" and ".CSV" in t.text)
print([a["href"] for a in anchors])
Output:
['https://www.treasury.gov/ofac/downloads/consolidated/cons_prim.csv', 'https://www.treasury.gov/ofac/downloads/consolidated/cons_add.csv', 'https://www.treasury.gov/ofac/downloads/consolidated/cons_alt.csv', 'https://www.treasury.gov/ofac/downloads/consolidated/cons_comments.csv']
As for your code: find_all only matches that one table, so your loop runs a single time, and i.a returns just the first anchor inside the table, which is why you only ever see the .zip link. You need to walk through all of the table's rows/cells to collect every href attribute.
To fix this, try the following:
import requests
from bs4 import BeautifulSoup

url = 'https://home.treasury.gov/policy-issues/financial-sanctions/consolidated-sanctions-list-data-files'
soup = BeautifulSoup(requests.get(url).text, 'html.parser').find('table', attrs={'class': 'ms-rteTable-1'})

for item in soup.find_all("th"):
    anchor = item.a
    if anchor:
        print(anchor["href"])
Output:
https://www.treasury.gov/ofac/downloads/consolidated/consall.zip
https://www.treasury.gov/ofac/downloads/sanctions/1.0/cons_advanced.xml
https://www.treasury.gov/ofac/downloads/consolidated/consolidated.xml
https://www.treasury.gov/ofac/downloads/consolidated/cons_prim.del
https://www.treasury.gov/ofac/downloads/consolidated/cons_add.del
https://www.treasury.gov/ofac/downloads/consolidated/cons_alt.del
https://www.treasury.gov/ofac/downloads/consolidated/cons_comments.del
https://www.treasury.gov/ofac/downloads/consolidated/cons_prim.ff
https://www.treasury.gov/ofac/downloads/consolidated/cons_add.ff
https://www.treasury.gov/ofac/downloads/consolidated/cons_alt.ff
https://www.treasury.gov/ofac/downloads/consolidated/cons_comments.ff
https://www.treasury.gov/ofac/downloads/consolidated/cons_prim.csv
https://www.treasury.gov/ofac/downloads/consolidated/cons_add.csv
https://www.treasury.gov/ofac/downloads/consolidated/cons_alt.csv
https://www.treasury.gov/ofac/downloads/consolidated/cons_comments.csv
https://www.treasury.gov/ofac/downloads/consolidated/cons_prim.pip
https://www.treasury.gov/ofac/downloads/consolidated/cons_add.pip
https://www.treasury.gov/ofac/downloads/consolidated/cons_alt.pip
https://www.treasury.gov/ofac/downloads/consolidated/cons_comments.pip
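If you want to stay even closer to the question's original plan (start from the ms-rteTable-1 table and keep only the links ending in .csv), a minimal sketch with a CSS selector could look like this; the selector is a suggestion of mine rather than part of either answer:

import requests
from bs4 import BeautifulSoup

url = 'https://home.treasury.gov/policy-issues/financial-sanctions/consolidated-sanctions-list-data-files'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# Anchors inside the ms-rteTable-1 table whose href ends with ".csv"
for a in soup.select('table.ms-rteTable-1 a[href$=".csv"]'):
    print(a['href'])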