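I have a table whose rows each have 6 columns. I have read the values in the columns and added them to a data frame, but columns 1 and 6 also contain links that I want to add as well. I admit I am new to Python and need some help.
I have tried creating a second data frame and storing the links in its first column, but the two data frames end up with different numbers of rows.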
import urllib3
from bs4 import BeautifulSoup
import pandas as pd
import time

COLUMNS = ['Legal Name', 'Status', 'Size', 'Suburb or Town', 'State', 'ABN']
COLUMNS2 = ['Link1']

urls = []
for i in range(3):
    quotepage = "https://www.acnc.gov.au/charity?items_per_page=60&"
    quotepage = quotepage + "facet__select__field_beneficiaries=0&"
    quotepage = quotepage + "facet__select__field_countries=0&"
    quotepage = quotepage + "facet__select__acnc_search_api_sub_history=0&"
    quotepage = quotepage + "facet__select__field_status=307&"
    quotepage = quotepage + "page="+str(i)+"#search"
    #print (quotepage)
    urls.append(quotepage)

i=0
dataframes = []
dataframes2 = []
cy_data = []
cy_data2 = []
for url in urls:
    i=i+1
    print(i)
    http = urllib3.PoolManager()
    response = http.request('GET', url)
    soup = BeautifulSoup(response.data, "html5lib")
    pagetable = soup.find('table')
    rows = soup.find("table").find_all('tr')
    time.sleep(.5)
    for row in rows:
        cells = row.find_all("td")
        cells = cells[0:6]  # Select the correct columns
        cy_data.append([cell.text.strip() for cell in cells])
    links = pagetable.find_all("a")
    for link in links:
        if len(link["href"]) == 41:  # href for charity
            cy_data2.append(link["href"])

dataframes.append(pd.DataFrame(cy_data, columns=COLUMNS).drop(0, axis=0))
dataframes2.append(pd.DataFrame(cy_data2, columns=COLUMNS2).drop(0, axis=0))
#data = pd.concat([dataframes, dataframes2], axis=1)
data = pd.concat(dataframes)
data2 = pd.concat(dataframes2)
All I want is to add the links to the data frame.
Answer 0 (score: 0)
Do not drop the zero index from the data frames, like this:
dataframes.append(pd.DataFrame(cy_data, columns=COLUMNS))
dataframes2.append(pd.DataFrame(cy_data2, columns=COLUMNS2))
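A small sketch of why dropping index 0 hurts here, using made-up values rather than the real ACNC data. It assumes the rows list starts with a placeholder for the header row while the links list holds only real links, so `drop(0, axis=0)` on the links frame throws away a genuine link. The names `rows_df` and `links_df` and all values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins: a placeholder for the header row plus two data rows.
cy_data = [[None, None], ["Charity A", "Registered"], ["Charity B", "Registered"]]
# The links list only ever contains real links, one per data row.
cy_data2 = ["/charity/aaa", "/charity/bbb"]

rows_df = pd.DataFrame(cy_data, columns=["Legal Name", "Status"]).drop(0, axis=0)
links_df = pd.DataFrame(cy_data2, columns=["Link1"]).drop(0, axis=0)

# The header placeholder is gone from rows_df, but links_df lost a real link.
print(len(rows_df), len(links_df))
print(links_df["Link1"].tolist())
```

Keeping index 0 in both frames only lines up once the header row no longer ends up in the rows list at all.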
Then change the line that finds the table rows to:
rows = soup.find("table").find("tbody").find_all('tr')
Result:
DataFrame 1 [180 rows x 6 columns]
DataFrame 2 [180 rows x 1 columns]
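If the goal is a single data frame, one possible variation (not part of the answer above) is to read each link from the same row as its text, so the two can never get out of step. The sketch below runs on a small hypothetical HTML snippet, shortened to 3 columns, instead of the live site; the markup, the helper `cell_href`, and the column name `Link6` are assumptions for illustration. It uses the built-in `html.parser` so that it runs without `html5lib`, and the snippet spells out `<tbody>` explicitly.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML standing in for one results page; the real markup may differ.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Legal Name</th><th>Status</th><th>ABN</th></tr></thead>
  <tbody>
    <tr><td><a href="/charity/aaa">Charity A</a></td><td>Registered</td>
        <td><a href="/abn/111">111</a></td></tr>
    <tr><td><a href="/charity/bbb">Charity B</a></td><td>Registered</td>
        <td><a href="/abn/222">222</a></td></tr>
  </tbody>
</table>
"""


def cell_href(cell):
    """Return the href of the first link in a cell, or None if there is none."""
    link = cell.find("a")
    return link.get("href") if link is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = []
# Restricting the search to <tbody> skips the header row, which has no <td> cells.
for row in soup.find("table").find("tbody").find_all("tr"):
    cells = row.find_all("td")
    values = [cell.text.strip() for cell in cells]
    # Text and links come from the same row, so they stay aligned.
    # The last cell stands in for column 6 of the real table.
    records.append(values + [cell_href(cells[0]), cell_href(cells[-1])])

data = pd.DataFrame(
    records, columns=["Legal Name", "Status", "ABN", "Link1", "Link6"]
)
print(data.shape)
```

With this layout there is no second data frame to reconcile, and a row without a link simply gets `None` in the link column.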