I wrote the following code with Beautiful Soup: `
import urllib.request
wiki = "https://data.gov.au/dataset?q=&groups=business&sort=extras_harvest_portal+asc%2C+score+desc%2C+metadata_modified+desc&_organization_limit=0&organization=reservebankofaustralia&_groups_limit=0"
page = urllib.request.urlopen(wiki)
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(page)
data2 = soup.find_all('h3', class_="dataset-heading")
data3 = []
getdata = []
for link in data2:
    data3 = soup.find_all("a", href=re.compile('/dataset/', re.IGNORECASE))
for data in data3:
    getdata = data.text
    print(getdata)
len(getdata)
`
My HTML looks like:
<a href="/dataset/banks-assets" class="label" data-format="xls">XLS</a>
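When I run the code above, I get the text I want, but the word 'XLS' also shows up. I want to remove 'XLS' and write the remaining text to a CSV file in a single column. My output is:
I checked whether the output above is a list. It is reported as a list, but it has only one element, even though, as shown above, my output is many lines of text. Please help me solve this.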
Answer 0 (score: 1)
If the goal is just to remove the XLS rows from the result column, you can do it like this, for example:
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
wiki = "https://data.gov.au/dataset?q=&groups=business&sort=extras_harvest_portal+asc%2C+score+desc%2C+metadata_modified+desc&_organization_limit=0&organization=reservebankofaustralia&_groups_limit=0"
page = urlopen(wiki)
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(page, "html.parser")
data2 = soup.find_all('h3', class_="dataset-heading")
data3 = []
getdata = []
for link in data2:
    # The search runs on the whole page, so every pass returns the same links
    data3 = soup.find_all("a", href=re.compile('/dataset/', re.IGNORECASE))
for data in data3:
    # Skip the format labels and collect every other link text
    if data.text.upper() != 'XLS':
        getdata.append(data.text)
print(getdata)
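The key change from the question's code is `getdata.append(data.text)` in place of `getdata = data.text`. The following is a minimal sketch of that difference; the `link_texts` values are made up for illustration and stand in for whatever the scraper returns.

```python
# Hypothetical link texts, standing in for the scraped <a> texts.
link_texts = ["Managed Funds", "XLS", "Public Unit Trust", "XLS"]

# Assignment: the name is rebound on every pass, so only the last text
# survives and the original empty list is discarded.
last_only = []
for text in link_texts:
    last_only = text

# Append with a filter: every text that is not "XLS" is kept in the list.
collected = []
for text in link_texts:
    if text.upper() != "XLS":
        collected.append(text)

print(repr(last_only))
print(collected)
```

Printing inside the loop shows many lines either way, which is probably why the question's output looked complete even though only one value was stored.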
This gives you a list containing the desired text. It can then easily be converted to a DataFrame, where the data appears as a column.
import pandas as pd
df = pd.DataFrame(columns=['col1'], data=getdata)
Output:
col1
0 Banks – Assets
1 Consolidated Exposures – Immediate and Ultimat...
2 Foreign Exchange Transactions and Holdings of ...
3 Finance Companies and General Financiers – Sel...
4 Liabilities and Assets – Monthly
5 Consolidated Exposures – Immediate Risk Basis ...
6 Consolidated Exposures – Ultimate Risk Basis
7 Banks – Consolidated Group off-balance Sheet B...
8 Liabilities of Australian-located Operations
9 Building Societies – Selected Assets and Liabi...
10 Consolidated Exposures – Immediate Risk Basis ...
11 Banks – Consolidated Group Impaired Assets
12 Assets and Liabilities of Australian-Located O...
13 Managed Funds
14 Daily Net Foreign Exchange Transactions
15 Consolidated Exposures-Immediate Risk Basis
16 Public Unit Trust
17 Securitisation Vehicles
18 Assets of Australian-located Operations
19 Banks – Consolidated Group Capital
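A possible alternative to comparing the text with 'XLS' is to read the titles only from the `h3` headings that `data2` already selects, so the format labels are never collected. This is only a sketch: the markup in `SAMPLE_HTML` is an assumption about how the page nests its headings and labels (it is not copied from the site), and `heading_titles` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Assumed page structure: the title link sits inside the h3 heading,
# while the "XLS" format label is a sibling link outside of it.
SAMPLE_HTML = """
<ul>
  <li>
    <h3 class="dataset-heading"><a href="/dataset/managed-funds">Managed Funds</a></h3>
    <a href="/dataset/managed-funds" class="label" data-format="xls">XLS</a>
  </li>
  <li>
    <h3 class="dataset-heading"><a href="/dataset/public-unit-trust">Public Unit Trust</a></h3>
    <a href="/dataset/public-unit-trust" class="label" data-format="xls">XLS</a>
  </li>
</ul>
"""


def heading_titles(html):
    """Return the link text found inside each dataset heading."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for heading in soup.find_all("h3", class_="dataset-heading"):
        link = heading.find("a")  # first link inside this heading, if any
        if link is not None:
            titles.append(link.get_text(strip=True))
    return titles


titles = heading_titles(SAMPLE_HTML)
print(titles)
```

If the real page nests its elements differently, the text comparison used in the answer remains the safer filter.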
Write it to CSV:
df.to_csv(r'C:\Users\Username\output.csv')
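By default `to_csv` also writes the DataFrame index as a first column. Since the question asks for a single column, passing `index=False` may be closer to what is wanted. The sketch below uses two made-up titles and writes to an in-memory buffer instead of a file, purely so the result can be inspected; with a real file you would pass the path instead.

```python
import io

import pandas as pd

# Hypothetical titles standing in for the scraped list.
titles = ["Managed Funds", "Public Unit Trust"]
df = pd.DataFrame({"col1": titles})

# index=False leaves out the row numbers, so the CSV has only one column.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

Adding `header=False` as well would also drop the `col1` line if only the titles are wanted.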