Extract selective text using Beautiful Soup and write the result to CSV

Time: 2017-07-03 10:21:34

Tags: python-3.x web-scraping beautifulsoup

I am trying to extract selected text from the website [https://data.gov.au/dataset?q=&groups=business&sort=extras_harvest_portal%20asc%2C%20score%20desc%2C%20metadata_modified%20desc&_organization_limit=0&organization=reservebankofaustralia&_groups_limit=0]

and have written the following code using Beautiful Soup: `

import re
import urllib.request

from bs4 import BeautifulSoup

wiki = "https://data.gov.au/dataset?q=&groups=business&sort=extras_harvest_portal+asc%2C+score+desc%2C+metadata_modified+desc&_organization_limit=0&organization=reservebankofaustralia&_groups_limit=0"
page = urllib.request.urlopen(wiki)
soup = BeautifulSoup(page, "html.parser")
data2 = soup.find_all('h3', class_="dataset-heading")

data3 = []
getdata = []
for link in data2:
    # Matches every link whose href contains /dataset/, including the "XLS" labels
    data3 = soup.find_all("a", href=re.compile('/dataset/', re.IGNORECASE))
for data in data3:
    getdata = data.text  # rebinds getdata to a single string on each pass
    print(getdata)

len(getdata)
`

My HTML looks like this:



<a href="/dataset/banks-assets" class="label" data-format="xls">XLS</a>

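When I run the code above I get the text I want, but the word 'XLS' keeps appearing as well. I want to remove 'XLS' and write the remaining text to a single column of a CSV file. My output is: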

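  • Banks – Assets
  • XLS
  • Consolidated Exposures – Immediate and Ultimate Risk Basis
  • XLS
  • Foreign Exchange Transactions and Holdings of Official Reserve Assets
  • XLS
  • Finance Companies and General Financiers – Selected Assets and Liabilities
  • XLS
  • Liabilities and Assets – Monthly XLS Consolidated Exposures – Immediate Risk Basis – International Claims by Country
  • XLS and so on.......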

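I checked whether the output above is a list. It is reported as a list, but with only one element, even though, as shown above, my output consists of many pieces of text. Please help me solve this.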

1 Answer:

Answer 0: (score: 1)

If the goal is just to remove the XLS rows from the resulting column, that can be done, for example, like this:

from urllib.request import urlopen
import re

from bs4 import BeautifulSoup

wiki = "https://data.gov.au/dataset?q=&groups=business&sort=extras_harvest_portal+asc%2C+score+desc%2C+metadata_modified+desc&_organization_limit=0&organization=reservebankofaustralia&_groups_limit=0"
page = urlopen(wiki)
soup = BeautifulSoup(page, "html.parser")

# Every link whose href contains /dataset/: the dataset titles plus the "XLS" labels
data3 = soup.find_all("a", href=re.compile('/dataset/', re.IGNORECASE))

getdata = []
for data in data3:
    # Skip the format labels and collect only the dataset titles
    if data.text.upper() != 'XLS':
        getdata.append(data.text)
print(getdata)

You will get a list containing the desired text. It can then easily be converted into a DataFrame, in which this data appears as a column.

import pandas as pd
df = pd.DataFrame(columns=['col1'], data=getdata)

Output:

                                                 col1
0                                      Banks – Assets
1   Consolidated Exposures – Immediate and Ultimat...
2   Foreign Exchange Transactions and Holdings of ...
3   Finance Companies and General Financiers – Sel...
4                    Liabilities and Assets – Monthly
5   Consolidated Exposures – Immediate Risk Basis ...
6        Consolidated Exposures – Ultimate Risk Basis
7   Banks – Consolidated Group off-balance Sheet B...
8        Liabilities of Australian-located Operations
9   Building Societies – Selected Assets and Liabi...
10  Consolidated Exposures – Immediate Risk Basis ...
11         Banks – Consolidated Group Impaired Assets
12  Assets and Liabilities of Australian-Located O...
13                                      Managed Funds
14            Daily Net Foreign Exchange Transactions
15        Consolidated Exposures-Immediate Risk Basis
16                                  Public Unit Trust
17                            Securitisation Vehicles
18            Assets of Australian-located Operations
19                 Banks – Consolidated Group Capital
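
As an alternative to comparing the link text with 'XLS', the titles could probably be picked out by structure instead. The sketch below assumes that each title link is nested inside an `h3` with class `dataset-heading` (the element the question already queries) and that the XLS label sits outside it; the HTML fragment is a hypothetical stand-in, and the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the listing page; the real markup may differ
html = """
<ul>
  <li>
    <h3 class="dataset-heading"><a href="/dataset/banks-assets">Banks – Assets</a></h3>
    <a href="/dataset/banks-assets" class="label" data-format="xls">XLS</a>
  </li>
  <li>
    <h3 class="dataset-heading"><a href="/dataset/managed-funds">Managed Funds</a></h3>
    <a href="/dataset/managed-funds" class="label" data-format="xls">XLS</a>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

titles = []
for heading in soup.find_all("h3", class_="dataset-heading"):
    link = heading.find("a")  # first link nested in the heading, or None
    if link is not None:
        titles.append(link.get_text(strip=True))

print(titles)
```

Selecting by position in the markup avoids depending on the label text, so other format labels would be skipped as well, provided the assumption about the nesting holds.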

To write it to a CSV file:

df.to_csv(r'C:\Users\Username\output.csv')
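
If pandas is not otherwise needed, the same single-column file can be written with the standard library. This is only a minimal sketch: `write_single_column` is a hypothetical helper name, the short `getdata` list stands in for the scraped titles, and the relative path `output.csv` is an assumption.

```python
import csv


def write_single_column(path, values, header="col1"):
    """Write each value on its own row under a single header."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([header])
        writer.writerows([value] for value in values)


# Hypothetical sample standing in for the scraped getdata list
getdata = ["Banks – Assets", "Managed Funds", "Public Unit Trust"]
write_single_column("output.csv", getdata)
```

Unlike the default `df.to_csv` call, this writes no index column, so the file contains only the header and the titles.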