Scraping a table with BeautifulSoup4 when there is no unique class - learning

Asked: 2020-11-12 16:35:15

Tags: python web-scraping beautifulsoup

I'm still learning how to scrape websites, and I ran into this case related to my work. Website: https://home.treasury.gov/policy-issues/financial-sanctions/consolidated-sanctions-list-data-files

From the sanctions list page, I'm trying to grab the three CSV hrefs you see below:

  • CONS_PRIM.CSV
  • CONS_ADD.CSV
  • CONS_ALT.CSV

I know I could just import the CSVs from the links directly in Python, but I want to build up my scraping knowledge. All the "tr" elements have the same class name, "text-align-left". So my idea was to iterate over the main "table" element with class "ms-rteTable-1" and then narrow the selection down to the entries ending in CSV, but my loop doesn't seem to work. What am I doing wrong? Any help is appreciated.

import requests
from bs4 import BeautifulSoup

source_html = requests.get('https://home.treasury.gov/policy-issues/financial-sanctions/consolidated-sanctions-list-data-files').text
soup = BeautifulSoup(source_html, 'lxml')

for i in soup.find_all('table', attrs={'class': 'ms-rteTable-1'}):
    print(i.a['href'])

I only get the first file, the .zip. How can I get the rest?

2 Answers:

Answer 0 (score: 1)

Another way: use an XPath selector.

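A minimal sketch of that idea, assuming lxml's html module and the table class from the question (the contains() filter on the href is my choice, since XPath 1.0 has no ends-with()):

import requests
from lxml import html

url = 'https://home.treasury.gov/policy-issues/financial-sanctions/consolidated-sanctions-list-data-files'
tree = html.fromstring(requests.get(url).content)

# Select the href attributes of anchors inside the target table
# whose URLs contain ".csv"
hrefs = tree.xpath('//table[contains(@class, "ms-rteTable-1")]'
                   '//a[contains(@href, ".csv")]/@href')
print(hrefs)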

Answer 1 (score: 0)

I'd like to offer a different approach: why not just select all the anchors that have .csv in them?

Here's how:

import requests
from bs4 import BeautifulSoup

url = 'https://home.treasury.gov/policy-issues/financial-sanctions/consolidated-sanctions-list-data-files'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# Match every anchor whose visible text contains ".CSV"
anchors = soup.find_all(lambda t: t.name == "a" and ".CSV" in t.text)
print([a["href"] for a in anchors])

Output:

['https://www.treasury.gov/ofac/downloads/consolidated/cons_prim.csv', 'https://www.treasury.gov/ofac/downloads/consolidated/cons_add.csv', 'https://www.treasury.gov/ofac/downloads/consolidated/cons_alt.csv', 'https://www.treasury.gov/ofac/downloads/consolidated/cons_comments.csv']
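
Note that this also picks up cons_comments.csv. If you want exactly the three files from the question, you can filter by filename; a minimal sketch (the set of names comes straight from the question):

import requests
from bs4 import BeautifulSoup

url = 'https://home.treasury.gov/policy-issues/financial-sanctions/consolidated-sanctions-list-data-files'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# Keep only the three filenames the question asks for
wanted = {"cons_prim.csv", "cons_add.csv", "cons_alt.csv"}
csv_links = [a["href"] for a in soup.find_all("a", href=True)
             if a["href"].rsplit("/", 1)[-1].lower() in wanted]
print(csv_links)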

As for your code: you're looping over a single item, the table itself, and i.a returns only the first anchor inside that table, which is why you only get the first .zip link. You have to iterate over all of its cells to collect every href attribute.

To fix this, try the following:

import requests
from bs4 import BeautifulSoup

url = 'https://home.treasury.gov/policy-issues/financial-sanctions/consolidated-sanctions-list-data-files'
table = BeautifulSoup(requests.get(url).text, 'html.parser').find('table', attrs={'class': 'ms-rteTable-1'})

# The download links sit inside <th> cells, at most one anchor per cell
for item in table.find_all("th"):
    anchor = item.a
    if anchor:
        print(anchor["href"])

Output:

https://www.treasury.gov/ofac/downloads/consolidated/consall.zip
https://www.treasury.gov/ofac/downloads/sanctions/1.0/cons_advanced.xml
https://www.treasury.gov/ofac/downloads/consolidated/consolidated.xml
https://www.treasury.gov/ofac/downloads/consolidated/cons_prim.del
https://www.treasury.gov/ofac/downloads/consolidated/cons_add.del
https://www.treasury.gov/ofac/downloads/consolidated/cons_alt.del
https://www.treasury.gov/ofac/downloads/consolidated/cons_comments.del
https://www.treasury.gov/ofac/downloads/consolidated/cons_prim.ff
https://www.treasury.gov/ofac/downloads/consolidated/cons_add.ff
https://www.treasury.gov/ofac/downloads/consolidated/cons_alt.ff
https://www.treasury.gov/ofac/downloads/consolidated/cons_comments.ff
https://www.treasury.gov/ofac/downloads/consolidated/cons_prim.csv
https://www.treasury.gov/ofac/downloads/consolidated/cons_add.csv
https://www.treasury.gov/ofac/downloads/consolidated/cons_alt.csv
https://www.treasury.gov/ofac/downloads/consolidated/cons_comments.csv
https://www.treasury.gov/ofac/downloads/consolidated/cons_prim.pip
https://www.treasury.gov/ofac/downloads/consolidated/cons_add.pip
https://www.treasury.gov/ofac/downloads/consolidated/cons_alt.pip
https://www.treasury.gov/ofac/downloads/consolidated/cons_comments.pip