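I have been trying to get some information from a table on the stock exchange's website (https://www.idx.co.id/en-us/listed-companies/company-profiles/)
using Python (lxml, requests, and pandas). This is the reference I used:
https://towardsdatascience.com/web-scraping-html-tables-with-python-c9baba21059
As I am an absolute beginner in Python/programming, could someone advise how to apply .xpath only to the row elements in the table body
and extract their contents? I have also looked into using bs4/BeautifulSoup, but could not get that to work either. Any help or advice is much appreciated! Thank you for your time.
My code: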
from lxml import html as lh
import requests
import pandas as pd
#create a handle page to handle the contents of the website
page = requests.get('http://www.idx.co.id/en-us/listed-companies/company-profiles/')
# stores contents under doc
doc = lh.fromstring(page.content)
#parses data stored in between <tr>..<tr> of the html
tr_elements = doc.xpath('//*[@id="companyTable"]/tbody')
#create empty list
col = []
i = 0
for j in range(0,len(tr_elements)):
    #T is our j'th row
    T = tr_elements[j]
    #If row is not of size 4, the //tr data is not from our table
    if len(T)!=4:
        break
    # i is column index
    i=0
    # Iterate through each element of the row
    for t in T.iterchildren():
        data = t.text_content()
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1
[len(C) for (title,C) in col] # checking no of values in all columns
Dict = {title:column for (title,column) in col}
df = pd.DataFrame(Dict)
print(df)
Output of print(df):
Empty DataFrame
Columns: []
Index: []
Expected output:
Columns: [No, Code, Name, Listing Date]
Index: [1, AALI, Astra Agro Lestari Tbk, 09 Dec 1997]
Answer 0 (score: 0)
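The reason you get an empty result is that the page loads the table contents with AJAX (it uses https://datatables.net). If you want to scrape JavaScript-generated results, requests
is not enough, because it does not execute JavaScript. You would need a library such as selenium-python to drive a real browser (for example through Chromedriver) or a headless one. If you want to go down that route, there are plenty of tutorials on the internet.
However, there is a better way. If you understand how AJAX works, it is clear that the page has to call an API to retrieve the data. Once you have found that API, you can call it directly and retrieve the data with just a few lines of code: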
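To illustrate the point about rows only existing after JavaScript has run, here is a minimal sketch that also shows an XPath selecting the <tr> elements inside the table body rather than the <tbody> itself. The two HTML snippets are hypothetical stand-ins (the real page's markup may differ), and table_rows is a helper name made up for this example:

```python
from lxml import html as lh


def table_rows(page_source, table_id="companyTable"):
    """Return the text of every cell, row by row, from the table body."""
    doc = lh.fromstring(page_source)
    # Select the <tr> elements themselves, not the enclosing <tbody>.
    rows = doc.xpath('//*[@id=$tid]/tbody/tr', tid=table_id)
    return [
        [cell.text_content().strip() for cell in row.iterchildren()]
        for row in rows
    ]


# Hypothetical markup as a plain HTTP client might receive it:
# the table skeleton is there, but the body has not been filled in yet.
AS_SERVED = """
<html><body>
<table id="companyTable">
  <thead><tr><th>No</th><th>Code</th><th>Name</th><th>Listing Date</th></tr></thead>
  <tbody></tbody>
</table>
</body></html>
"""

# Hypothetical markup after the browser has run the page's JavaScript.
AFTER_JS = AS_SERVED.replace(
    "<tbody></tbody>",
    "<tbody><tr><td>1</td><td>AALI</td>"
    "<td>Astra Agro Lestari Tbk</td><td>09 Dec 1997</td></tr></tbody>",
)

print(table_rows(AS_SERVED))
print(table_rows(AFTER_JS))
```

If you do go the browser route, the rendered HTML (for example selenium's driver.page_source) could be passed to the same helper.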
import requests
import pandas as pd
res = requests.get('https://www.idx.co.id/umbraco/Surface/ListedCompany/GetCompanyProfiles?draw=1&columns%5B0%5D%5Bdata%5D=KodeEmiten&columns%5B0%5D%5Bname%5D&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=false&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=KodeEmiten&columns%5B1%5D%5Bname%5D&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=false&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=NamaEmiten&columns%5B2%5D%5Bname%5D&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=false&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=TanggalPencatatan&columns%5B3%5D%5Bname%5D&columns%5B3%5D%5Bsearchable%5D=true&columns%5B3%5D%5Borderable%5D=false&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&start=0&length=700&search%5Bvalue%5D&search%5Bregex%5D=false&_=155082600847')
data = res.json()
df = pd.DataFrame.from_dict(data['data'])
print(df.columns)
print(df)
Result:
Index(['Alamat', 'BAE', 'DataID', 'Divisi', 'EfekEmiten_EBA', 'EfekEmiten_ETF',
'EfekEmiten_Obligasi', 'EfekEmiten_SPEI', 'EfekEmiten_Saham', 'Email',
'Fax', 'JenisEmiten', 'KegiatanUsahaUtama', 'KodeDivisi', 'KodeEmiten',
'Logo', 'NPKP', 'NPWP', 'NamaEmiten', 'PapanPencatatan', 'Sektor',
'Status', 'SubSektor', 'TanggalPencatatan', 'Telepon', 'Website', 'id'],
dtype='object')
Alamat ... id
0 Jl Pulo Ayang Raya Blok OR No. 1 Kawasan Indu... ... 0
1 Sahid Office Boutique, Blok G Jl Jend Sudirman... ... 0
2 Plaza ABDA Lt. 27 Jl. Jend. Sudirman Kav. 59 ... ... 0
3 Gedung TMT 1 Lantai 18 Jl. Cilandak KKO No. 1... ... 0
4 Gedung Kawan Lama Lantai 5 Jl. Puri Kencana N... ... 0
5 ACSET Building, Jalan Majapahit No.26, Kelurah... ... 0
6 Perkantoran Hijau Arkadia Tower C Lantai 15\rJ... ... 0
7 Jalan Raya Pasar Minggu Km. 18 Jakarta 12510 ... 0
8 Gedung The Landmark I Lantai 26-31\r\nJl. Jend... ... 0
9 Gedung Wisma 46 Kota BNI Kav 1 LT. 20 JL Jend.... ... 0
.. ... ... ..
620 Gedung Graha Irama lt. 2-E\rJl. H.R. Rasuna Sa... ... 0
621 Plaza Mutiara Lt. 5,\rJl. Dr. Ide Anak Agung G... ... 0
622 Jl. Jemur Sari Selatan IV/3, \r\nSurabaya 6023... ... 0
623 Jl. Pantai Indah Selatan I, Elang Laut Blok A ... ... 0
624 Jalan Karet Pedurenan No. 240, Karet Kuningan,... ... 0
[625 rows x 27 columns]
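The API returns 27 columns, while the expected output in the question only has No, Code, Name and Listing Date. A rough sketch of how the result could be cut down to that shape follows. The mapping of KodeEmiten, NamaEmiten and TanggalPencatatan to those headers is inferred from the request URL, the sample record and its date format are assumptions made for illustration, and to_profile_table is a hypothetical helper name:

```python
import pandas as pd

# Assumed mapping from the API's field names to the headers shown on the page.
COLUMN_NAMES = {
    "KodeEmiten": "Code",
    "NamaEmiten": "Name",
    "TanggalPencatatan": "Listing Date",
}


def to_profile_table(records):
    """Reduce raw API records to the columns No, Code, Name, Listing Date."""
    df = pd.DataFrame(records)
    # Keep only the three fields of interest and give them readable names.
    table = df[list(COLUMN_NAMES)].rename(columns=COLUMN_NAMES)
    # Assumption: the date arrives as a string that pandas can parse.
    table["Listing Date"] = (
        pd.to_datetime(table["Listing Date"]).dt.strftime("%d %b %Y")
    )
    # "No" is simply a 1-based row counter.
    table.insert(0, "No", range(1, len(table) + 1))
    return table


# Hypothetical record standing in for one entry of data['data'];
# the timestamp format is a guess, and fields not listed above are dropped.
sample = [
    {
        "KodeEmiten": "AALI",
        "NamaEmiten": "Astra Agro Lestari Tbk",
        "TanggalPencatatan": "1997-12-09T00:00:00",
        "id": 0,
    }
]

table = to_profile_table(sample)
print(table)
```

With the real response, to_profile_table(data['data']) would be the equivalent call, provided the date strings are in a format pd.to_datetime accepts.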