Question

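I'm learning web scraping and trying the following problem:

Read all the data from A. P. J. Abdul Kalam's Wikipedia page and extract his achievements.

I want to extract this table: [image of the table I want to extract from that page]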

from urllib.request import urlopen as ur
import wikipedia as wp
from bs4 import BeautifulSoup as bs
x=wp.search("A P J ABDUL KALAM")
p=wp.page("A P J ABDUL KALAM")
parse=bs(p.html(),"lxml")
for i in parse.findAll("table",{"class":"wikitable sortable"}):
     print(i.text)

When I run the code above I get the table, but it is not laid out in rows and columns:

[image of the output in its current form]

Answer 1

You need to reformat the output: walk the table row by row and collect the text of each cell.

import wikipedia as wp
from bs4 import BeautifulSoup as bs

# Fetch the article and parse its rendered HTML.
p = wp.page("A P J ABDUL KALAM")
parse = bs(p.html(), "lxml")

# Locate the awards table and skip its header row.
table = parse.find("table", {"class": "wikitable sortable"})
rows = table.find_all('tr')[1:]

for row in rows:
    # Collect the text of each cell, dropping embedded newlines.
    columns = [data.text.replace('\n', '') for data in row.find_all('td')]
    print(columns)

Output:

['2014', 'Doctor of Science', 'Edinburgh University, UK[168]']
['2013', 'Von Braun Award', 'National Space Society']
['2012', 'Doctor of Laws (Honoris Causa)', 'Simon Fraser University[169]']
['2011', 'IEEE Honorary Membership', 'IEEE[170]']
['2010', 'Doctor of Engineering', 'University of Waterloo[171]']
['2009', 'Honorary Doctorate', 'Oakland University[172]']
['2009', 'Hoover Medal', 'ASME Foundation, USA[173]']
['2009', 'International von Kármán Wings Award', 'California Institute of Technology, USA[174]']
['2008', 'Doctor of Engineering (Honoris Causa)', 'Nanyang Technological University, Singapore[175]']
['2008', 'Doctor of Science (Honoris Causa)', 'Aligarh Muslim University, Aligarh[176][177]']
['2007', 'Honorary Doctorate of Science and Technology', 'Carnegie Mellon University[178]']
['2007', 'King Charles II Medal', 'Royal Society, UK[179][180][181]']
['2007', 'Honorary Doctorate of Science', 'University of Wolverhampton, UK[182]']
['2000', 'Ramanujan Award', 'Alwars Research Centre, Chennai[183]']
['1998', 'Veer Savarkar Award', 'Government of India[13]']
['1997', 'Indira Gandhi Award for National Integration', 'Indian National Congress[13][183]']
['1997', 'Bharat Ratna', 'Government of India[183][184]']
['1995', 'Honorary Fellow', 'National Academy of Medical Sciences,[185]']
['1994', 'Distinguished Fellow', 'Institute of Directors (India)[186]']
['1990', 'Padma Vibhushan', 'Government of India[183][187]']
['1981', 'Padma Bhushan', 'Government of India[183][187]']
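
The cells in the output above still carry Wikipedia citation markers such as [168]. If those are unwanted, a small regular expression can remove them. This is only a sketch using the standard library; `strip_citations` is a hypothetical helper name, not part of the wikipedia or BeautifulSoup packages.

```python
import re

# Matches bracketed numeric footnote markers such as "[168]" or "[13]".
CITATION = re.compile(r"\[\d+\]")


def strip_citations(cell):
    """Remove citation markers and surrounding whitespace from one cell."""
    return CITATION.sub("", cell).strip()


row = ['2007', 'King Charles II Medal', 'Royal Society, UK[179][180][181]']
cleaned = [strip_citations(cell) for cell in row]
print(cleaned)
```

The helper could be applied to each entry of `columns` inside the loop before printing.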

Answer 2

I would read the HTML into data frames with pandas as follows, then index the result to get the table I want.

import pandas as pd
result = pd.read_html("https://en.wikipedia.org/wiki/A._P._J._Abdul_Kalam")
print(result[1])
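
Indexing by position depends on how many tables come before the one you want, and that can change when the page is edited. As a hedged alternative, `pandas.read_html` accepts a `match` argument that keeps only tables containing the given text. The sketch below runs against a small stand-in HTML string rather than the live page; `find_table` is a hypothetical helper name, and pandas needs an HTML parser such as lxml installed for this to work.

```python
import io

import pandas as pd

# Stand-in HTML with two tables; only the second mentions an award.
HTML = """
<table><tr><th>Key</th><th>Value</th></tr><tr><td>Born</td><td>1931</td></tr></table>
<table class="wikitable sortable">
<tr><th>Year</th><th>Award</th></tr>
<tr><td>1997</td><td>Bharat Ratna</td></tr>
<tr><td>1990</td><td>Padma Vibhushan</td></tr>
</table>
"""


def find_table(html, text):
    """Return the first table whose contents match `text`."""
    tables = pd.read_html(io.StringIO(html), match=text)
    return tables[0]


awards = find_table(HTML, "Bharat Ratna")
print(awards)
```

Against the real article, the same `match` argument could be passed together with the URL used above.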

Answer 3

I used qmaruf's answer and added prettier output using the PrettyTable library.

from prettytable import PrettyTable
import wikipedia as wp
from bs4 import BeautifulSoup as bs
# Fetch the article and parse its rendered HTML.
p=wp.page("A P J ABDUL KALAM")
parse=bs(p.html(), "lxml")
table = parse.find("table",{"class":"wikitable sortable"})
title_row = table.findAll('tr')[0]
title_row_list = [r.text.strip() for r in title_row.findAll('th')]

rows = table.findAll('tr')[1:]

pretty_table = PrettyTable()
pretty_table.field_names = title_row_list

for row in rows:
    columns = [data.text for data in row.findAll('td')]
    columns = [col.replace('\n', '') for col in columns]
    pretty_table.add_row(columns)

print(pretty_table)

Output:

+-------------------------+----------------------------------------------+--------------------------------------------------+
| Year of award or honour |           Name of award or honour            |              Awarding organisation               |
+-------------------------+----------------------------------------------+--------------------------------------------------+
|           2014          |              Doctor of Science               |          Edinburgh University, UK[168]           |
|           2013          |               Von Braun Award                |              National Space Society              |
|           2012          |        Doctor of Laws (Honoris Causa)        |           Simon Fraser University[169]           |
|           2011          |           IEEE Honorary Membership           |                    IEEE[170]                     |
|           2010          |            Doctor of Engineering             |           University of Waterloo[171]            |
|           2009          |              Honorary Doctorate              |             Oakland University[172]              |
|           2009          |                 Hoover Medal                 |            ASME Foundation, USA[173]             |
|           2009          |     International von Kármán Wings Award     |   California Institute of Technology, USA[174]   |
|           2008          |    Doctor of Engineering (Honoris Causa)     | Nanyang Technological University, Singapore[175] |
|           2008          |      Doctor of Science (Honoris Causa)       |   Aligarh Muslim University, Aligarh[176][177]   |
|           2007          | Honorary Doctorate of Science and Technology |         Carnegie Mellon University[178]          |
|           2007          |            King Charles II Medal             |         Royal Society, UK[179][180][181]         |
|           2007          |        Honorary Doctorate of Science         |       University of Wolverhampton, UK[182]       |
|           2000          |               Ramanujan Award                |       Alwars Research Centre, Chennai[183]       |
|           1998          |             Veer Savarkar Award              |             Government of India[13]              |
|           1997          | Indira Gandhi Award for National Integration |        Indian National Congress[13][183]         |
|           1997          |                 Bharat Ratna                 |          Government of India[183][184]           |
|           1995          |               Honorary Fellow                |    National Academy of Medical Sciences,[185]    |
|           1994          |             Distinguished Fellow             |       Institute of Directors (India)[186]        |
|           1990          |               Padma Vibhushan                |          Government of India[183][187]           |
|           1981          |                Padma Bhushan                 |          Government of India[183][187]           |
+-------------------------+----------------------------------------------+--------------------------------------------------+
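
If installing PrettyTable is not an option, a rough equivalent can be built with the standard library alone. This is a minimal sketch, not a replacement for the library: `format_table` is a hypothetical helper, and the sample rows are taken from the output above.

```python
def format_table(header, rows):
    """Return the header and rows as left-aligned, pipe-separated lines."""
    table = [header] + rows
    # Each column is as wide as its longest cell.
    widths = [max(len(str(cell)) for cell in column) for column in zip(*table)]
    return [
        " | ".join(str(cell).ljust(width) for cell, width in zip(row, widths))
        for row in table
    ]


header = ["Year", "Award"]
rows = [["1997", "Bharat Ratna"], ["1990", "Padma Vibhushan"]]
lines = format_table(header, rows)
print("\n".join(lines))
```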

Answer 4

To make it simpler and include the headers, you can try the following. Try not to use compound class names such as "wikitable sortable"; instead, use a single class attached to that table, because compound class names are prone to break.
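
The code that accompanied this answer is not preserved here, so the following is only a sketch of the idea under stated assumptions: the single class is assumed to be `wikitable`, and a small stand-in HTML string is used instead of the live page. It shows that an exact compound-class lookup fails once the attribute value changes, while a selector on one class still matches, and it reads `th` and `td` cells together so the header row is included.

```python
from bs4 import BeautifulSoup

# Stand-in HTML; note the class order differs from "wikitable sortable".
HTML = """
<table class="sortable wikitable">
<tr><th>Year</th><th>Award</th></tr>
<tr><td>1997</td><td>Bharat Ratna</td></tr>
<tr><td>1990</td><td>Padma Vibhushan</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# The exact compound string does not match the reordered attribute.
exact = soup.find("table", {"class": "wikitable sortable"})

# A CSS selector on one class matches whatever other classes are present.
table = soup.select_one("table.wikitable")

# Reading th and td cells together keeps the header row in the output.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
for row in rows:
    print(row)
```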

How do I print a scraped Wikipedia table row by row in Python?
