Question

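I'm currently trying to parse all the tables on this Wiki page. However, as you can tell from my code, I'm only retrieving one table. I want to grab all the tables and put them in the appropriate columns/rows.

Below is my code; I'm a bit lost about what I need to do next.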

import csv
import urllib 
import requests
import codecs
import re
from bs4 import BeautifulSoup

url = \
    'https://en.wikipedia.org/wiki/List_of_school_shootings_in_the_United_States'

response = requests.get(url)
html = response.content

#remove references Brackets
removeBrackets = re.sub(r'\[.*\]', '', html)
#remove Trailing 0's in numbers
removeTrails = removeBrackets.replace('0,000,001','')

soup = BeautifulSoup(removeTrails)

table = soup.find('table', {'class': 'sortable wikitable'})

# remove all extra tags in the HTML Tables
for div in soup.findAll('span', 'sortkey'):
    div.extract();
for div in soup.findAll('span', 'sorttext'):
    div.extract();

#scan through table
list_of_rows = []
for row in table.findAll('tr')[1:]:
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace('&nbsp;', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)
#write 
outfile = open("schoolshootings.csv", "wb")
writer = csv.writer(outfile)
writer.writerow([s.encode('utf8') if type(s) is unicode else s for s in row]) 
writer.writerow(["Date", "Location", "Deaths", "Injuries", "Description"])
writer.writerows(list_of_rows)

Answer 1

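You need to use findAll instead of find for the tables. If you change this line

table = soup.find('table', {'class': 'sortable wikitable'})

to:

for table in soup.findAll('table', {'class': 'sortable wikitable'}):

and indent every line down to list_of_rows.append(list_of_cells) by an extra 4 spaces, it will get all the other tables. You also need to move list_of_rows = [] above the for loop, so the list is not reset for each table.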
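The difference between the two calls can be seen on a tiny stand-in page. This is only a sketch: the inline HTML and its cell values are made-up placeholders, not the real Wikipedia markup, and it uses find_all, which is the current bs4 spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical two-table page standing in for the real Wikipedia HTML.
html = """
<table class="sortable wikitable">
  <tr><th>Date</th><th>Location</th></tr>
  <tr><td>row 1a</td><td>row 1b</td></tr>
</table>
<table class="sortable wikitable">
  <tr><th>Date</th><th>Location</th></tr>
  <tr><td>row 2a</td><td>row 2b</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find() stops at the first match, so only one table is ever scanned.
first_only = soup.find("table", {"class": "sortable wikitable"})

# find_all() returns every match; loop over them and fill one shared list
# that is created once, before the loop.
list_of_rows = []
for table in soup.find_all("table", {"class": "sortable wikitable"}):
    for row in table.find_all("tr")[1:]:  # skip each table's header row
        list_of_rows.append([cell.text for cell in row.find_all("td")])
```

Here first_only holds just the first table, while list_of_rows ends up with the data rows of both.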

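Edited to add

You have a bunch of regex that you really don't need, since it's easier to use .text. Also, when you extract the spans with class sorttext, you remove the date field, which you don't want. Since I removed the regex, I also needed to extract the spans with display:none: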

import urllib.request
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_school_shootings_in_the_United_States'
html = urllib.request.urlopen(url).read()
# name the parser explicitly so bs4 does not have to guess (and warn)
soup = BeautifulSoup(html, 'html.parser')

list_of_rows = []
for table in soup.findAll('table', {'class': 'sortable wikitable'}):

    # remove the hidden sort helpers from this table's cells
    for span in table.findAll('span', 'sortkey'):
        span.extract()
    for span in table.findAll('span', {'style': 'display:none'}):
        span.extract()

    # scan through the table, skipping its header row
    for row in table.findAll('tr')[1:]:
        list_of_cells = []
        for cell in row.findAll('td'):
            list_of_cells.append(cell.text)
        list_of_rows.append(list_of_cells)
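The code above stops once list_of_rows is filled. Since it uses urllib.request, it runs on Python 3, where the question's open(..., "wb") and unicode check would no longer work. A minimal sketch of the CSV step under that assumption follows; the sample rows are placeholders rather than real data, and write_rows is a hypothetical helper name.

```python
import csv
import io

# Placeholder rows; in the real script these come from the scraped tables.
list_of_rows = [
    ["date A", "place A", "1", "2", "description A"],
    ["date B", "place B", "0", "3", "déscription B"],
]
header = ["Date", "Location", "Deaths", "Injuries", "Description"]


def write_rows(outfile, header, rows):
    """Write the header once, then every collected row."""
    writer = csv.writer(outfile)
    writer.writerow(header)
    writer.writerows(rows)


# To write a real file on Python 3, open it in text mode and let the
# file object handle the encoding:
# with open("schoolshootings.csv", "w", newline="", encoding="utf-8") as f:
#     write_rows(f, header, list_of_rows)

# An in-memory buffer keeps this sketch free of side effects.
buffer = io.StringIO(newline="")
write_rows(buffer, header, list_of_rows)
csv_text = buffer.getvalue()
```

Opening the file with newline="" is what the csv module documentation recommends, and it avoids blank lines between rows on Windows.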
