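I am learning web scraping and am trying the following problem:
Read all the data from the Wikipedia page for A. P. J. Abdul Kalam and extract his achievements.
I want to extract this table: (image of the table I want to extract from that page)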
from urllib.request import urlopen as ur
import wikipedia as wp
from bs4 import BeautifulSoup as bs
x=wp.search("A P J ABDUL KALAM")
p=wp.page("A P J ABDUL KALAM")
parse=bs(p.html(),"lxml")
for i in parse.findAll("table",{"class":"wikitable sortable"}):
    print(i.text)
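When I run the code above, I get the table, but it is not laid out in rows and columns.
Answer 0 (score: 4)
You need to reformat it: take the rows of the table one at a time and collect the text of each cell.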
import wikipedia as wp
from bs4 import BeautifulSoup as bs
# Fetch the article and parse its rendered HTML
p = wp.page("A P J ABDUL KALAM")
parse = bs(p.html(), "lxml")
# Locate the table, then skip the first row, which holds the column titles
table = parse.find("table", {"class": "wikitable sortable"})
rows = table.findAll('tr')[1:]
for row in rows:
    # Collect the text of every cell and drop embedded newlines
    columns = [data.text for data in row.findAll('td')]
    columns = [col.replace('\n', '') for col in columns]
    print(columns)
Output:
['2014', 'Doctor of Science', 'Edinburgh University, UK[168]']
['2013', 'Von Braun Award', 'National Space Society']
['2012', 'Doctor of Laws (Honoris Causa)', 'Simon Fraser University[169]']
['2011', 'IEEE Honorary Membership', 'IEEE[170]']
['2010', 'Doctor of Engineering', 'University of Waterloo[171]']
['2009', 'Honorary Doctorate', 'Oakland University[172]']
['2009', 'Hoover Medal', 'ASME Foundation, USA[173]']
['2009', 'International von Kármán Wings Award', 'California Institute of Technology, USA[174]']
['2008', 'Doctor of Engineering (Honoris Causa)', 'Nanyang Technological University, Singapore[175]']
['2008', 'Doctor of Science (Honoris Causa)', 'Aligarh Muslim University, Aligarh[176][177]']
['2007', 'Honorary Doctorate of Science and Technology', 'Carnegie Mellon University[178]']
['2007', 'King Charles II Medal', 'Royal Society, UK[179][180][181]']
['2007', 'Honorary Doctorate of Science', 'University of Wolverhampton, UK[182]']
['2000', 'Ramanujan Award', 'Alwars Research Centre, Chennai[183]']
['1998', 'Veer Savarkar Award', 'Government of India[13]']
['1997', 'Indira Gandhi Award for National Integration', 'Indian National Congress[13][183]']
['1997', 'Bharat Ratna', 'Government of India[183][184]']
['1995', 'Honorary Fellow', 'National Academy of Medical Sciences,[185]']
['1994', 'Distinguished Fellow', 'Institute of Directors (India)[186]']
['1990', 'Padma Vibhushan', 'Government of India[183][187]']
['1981', 'Padma Bhushan', 'Government of India[183][187]']
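The third column in the output above still carries Wikipedia footnote markers such as [168]. If those are unwanted, one possible cleanup is sketched below. The helper names clean_cell and rows_to_csv are hypothetical, not part of the answer, and the pattern assumes the markers are purely numeric.

```python
import csv
import io
import re

# Matches Wikipedia-style footnote markers such as "[168]" or "[13]".
FOOTNOTE = re.compile(r"\[\d+\]")


def clean_cell(text):
    """Remove footnote markers and surrounding whitespace from one cell."""
    return FOOTNOTE.sub("", text).strip()


def rows_to_csv(rows):
    """Serialise cleaned rows as CSV text; csv quotes cells that contain commas."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([clean_cell(cell) for cell in row])
    return buffer.getvalue()


# Two rows copied from the output above.
sample = [
    ['2014', 'Doctor of Science', 'Edinburgh University, UK[168]'],
    ['2007', 'King Charles II Medal', 'Royal Society, UK[179][180][181]'],
]
print(rows_to_csv(sample))
```

Going through the csv module rather than joining with commas matters here because several organisation names contain commas themselves.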
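Answer 1 (score: 1)
I would read the HTML into data frames as follows. pd.read_html returns a list with one DataFrame per table on the page, so I then index the result to get the desired table.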
import pandas as pd
result = pd.read_html("https://en.wikipedia.org/wiki/A._P._J._Abdul_Kalam")
print(result[1])
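The positional index result[1] depends on how many tables come before the one you want, and that can change when the page is edited. As a hedged alternative, read_html accepts a match argument that keeps only tables containing the given text. The sketch below runs on a small hypothetical HTML sample rather than the live page, and find_tables is an illustrative helper name; it assumes a parser such as lxml is installed for pandas to use.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real page: two tables, only one is wanted.
SAMPLE_HTML = """
<table class="infobox">
  <tr><th>Field</th><th>Value</th></tr>
  <tr><td>Born</td><td>1931</td></tr>
</table>
<table class="wikitable sortable">
  <tr><th>Year of award or honour</th><th>Name of award or honour</th><th>Awarding organisation</th></tr>
  <tr><td>2013</td><td>Von Braun Award</td><td>National Space Society</td></tr>
</table>
"""


def find_tables(html, text):
    """Return the DataFrames for tables whose content matches text."""
    # StringIO wraps the literal HTML so pandas treats it as a file-like object.
    return pd.read_html(io.StringIO(html), match=text)


tables = find_tables(SAMPLE_HTML, "Awarding organisation")
print(tables[0])
```

If no table matches, read_html raises a ValueError, so a layout change fails loudly instead of silently returning the wrong table.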
Answer 2 (score: 1)
I used qmaruf's answer and added prettier output using the PrettyTable library.
from prettytable import PrettyTable
import wikipedia as wp
from bs4 import BeautifulSoup as bs
p = wp.page("A P J ABDUL KALAM")
parse = bs(p.html(), "lxml")
table = parse.find("table", {"class": "wikitable sortable"})
# The first row holds the column titles in <th> cells
title_row = table.findAll('tr')[0]
title_row_list = [r.text.strip() for r in title_row.findAll('th')]
rows = table.findAll('tr')[1:]
pretty_table = PrettyTable()
pretty_table.field_names = title_row_list
for row in rows:
    # Collect the text of every cell and drop embedded newlines
    columns = [data.text for data in row.findAll('td')]
    columns = [col.replace('\n', '') for col in columns]
    pretty_table.add_row(columns)
# Print once, after all rows have been added
print(pretty_table)
Output:
+-------------------------+----------------------------------------------+--------------------------------------------------+
| Year of award or honour |           Name of award or honour            |              Awarding organisation               |
+-------------------------+----------------------------------------------+--------------------------------------------------+
|           2014          |              Doctor of Science               |          Edinburgh University, UK[168]           |
|           2013          |               Von Braun Award                |              National Space Society              |
|           2012          |        Doctor of Laws (Honoris Causa)        |           Simon Fraser University[169]           |
|           2011          |           IEEE Honorary Membership           |                    IEEE[170]                     |
|           2010          |            Doctor of Engineering             |           University of Waterloo[171]            |
|           2009          |              Honorary Doctorate              |             Oakland University[172]              |
|           2009          |                 Hoover Medal                 |            ASME Foundation, USA[173]             |
|           2009          |     International von Kármán Wings Award     |   California Institute of Technology, USA[174]   |
|           2008          |    Doctor of Engineering (Honoris Causa)     | Nanyang Technological University, Singapore[175] |
|           2008          |      Doctor of Science (Honoris Causa)       |   Aligarh Muslim University, Aligarh[176][177]   |
|           2007          | Honorary Doctorate of Science and Technology |         Carnegie Mellon University[178]          |
|           2007          |            King Charles II Medal             |         Royal Society, UK[179][180][181]         |
|           2007          |        Honorary Doctorate of Science         |       University of Wolverhampton, UK[182]       |
|           2000          |               Ramanujan Award                |       Alwars Research Centre, Chennai[183]       |
|           1998          |             Veer Savarkar Award              |             Government of India[13]              |
|           1997          | Indira Gandhi Award for National Integration |        Indian National Congress[13][183]         |
|           1997          |                 Bharat Ratna                 |          Government of India[183][184]           |
|           1995          |               Honorary Fellow                |    National Academy of Medical Sciences,[185]    |
|           1994          |             Distinguished Fellow             |       Institute of Directors (India)[186]        |
|           1990          |               Padma Vibhushan                |          Government of India[183][187]           |
|           1981          |                Padma Bhushan                 |          Government of India[183][187]           |
+-------------------------+----------------------------------------------+--------------------------------------------------+
Answer 3 (score: 0)
A simpler version that also includes the headers is possible. Try not to use compound class names like wikitable sortable; instead, use a single class attached to that table, because compound class names are prone to break.
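A minimal sketch of that advice, under some assumptions: the single class is taken to be wikitable, the HTML is a small hypothetical sample standing in for p.html(), and table_rows is an illustrative helper name. It uses Python's built-in html.parser so that only bs4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for p.html(); note the extra class on the second table.
SAMPLE_HTML = """
<table class="infobox"><tr><td>not this one</td></tr></table>
<table class="wikitable sortable jquery-tablesorter">
  <tr><th>Year of award or honour</th><th>Name of award or honour</th><th>Awarding organisation</th></tr>
  <tr><td>2013</td><td>Von Braun Award</td><td>National Space Society</td></tr>
</table>
"""


def table_rows(html, css_class="wikitable"):
    """Return every row of the first table carrying css_class, headers included."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any table whose class list contains css_class.
    table = soup.find("table", class_=css_class)
    if table is None:
        return []
    return [
        # Selecting th and td together keeps the header row in the result.
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]


for row in table_rows(SAMPLE_HTML):
    print(row)
```

Searching for the exact string "wikitable sortable" would miss the table in this sample, because its class attribute carries a third class. Matching the single class wikitable still finds it. With the live page, table_rows(p.html()) would take the place of the sample.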