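I am currently trying to parse all the tables on this Wiki page. However, as you can tell from my code, I am only retrieving one table. I want to grab all the tables and put them into the proper columns/rows.
Below is my code; I am a bit lost on what I need to do next.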
import csv
import urllib
import requests
import codecs
import re
from bs4 import BeautifulSoup

url = \
    'https://en.wikipedia.org/wiki/List_of_school_shootings_in_the_United_States'
response = requests.get(url)
html = response.content

# remove reference brackets
removeBrackets = re.sub(r'\[.*\]', '', html)
# remove trailing 0's in numbers
removeTrails = removeBrackets.replace('0,000,001', '')
soup = BeautifulSoup(removeTrails)
table = soup.find('table', {'class': 'sortable wikitable'})

# remove all extra tags in the HTML tables
for div in soup.findAll('span', 'sortkey'):
    div.extract();
for div in soup.findAll('span', 'sorttext'):
    div.extract();

# scan through table
list_of_rows = []
for row in table.findAll('tr')[1:]:
    list_of_cells = []
    for cell in row.findAll('td'):
        # strip non-breaking spaces from the cell text
        text = cell.text.replace(u'\xa0', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

# write
outfile = open("schoolshootings.csv", "wb")
writer = csv.writer(outfile)
writer.writerow([s.encode('utf8') if type(s) is unicode else s for s in row])
writer.writerow(["Date", "Location", "Deaths", "Injuries", "Description"])
writer.writerows(list_of_rows)
Answer 0 (score: 1)
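You need to use findAll for the tables too, instead of find. If you change this line
table = soup.find('table', {'class': 'sortable wikitable'})
to:
for table in soup.findAll('table', {'class': 'sortable wikitable'}):
and indent all the lines down to list_of_rows.append(list_of_cells) an extra 4 spaces, it will get all the other tables. You will also need to move list_of_rows = [] above the for table loop, so that the list is not reset for each table.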
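The difference between find and findAll can be checked on a small page without hitting Wikipedia. The sketch below is only an illustration: SAMPLE_HTML and the helper rows_from_tables are hypothetical names that do not come from the question, and the two tiny tables stand in for the real article.

```python
from bs4 import BeautifulSoup

# Hypothetical two-table page standing in for the Wikipedia article.
SAMPLE_HTML = """
<table class="sortable wikitable">
  <tr><th>Date</th><th>Location</th></tr>
  <tr><td>1900</td><td>Town A</td></tr>
</table>
<table class="sortable wikitable">
  <tr><th>Date</th><th>Location</th></tr>
  <tr><td>1950</td><td>Town B</td></tr>
  <tr><td>1960</td><td>Town C</td></tr>
</table>
"""


def rows_from_tables(tables):
    """Collect the cell text of every data row (header row skipped)."""
    rows = []  # created once, outside the loops, so rows from all tables accumulate
    for table in tables:
        for row in table.findAll('tr')[1:]:
            rows.append([cell.text for cell in row.findAll('td')])
    return rows


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
attrs = {'class': 'sortable wikitable'}

# find returns only the first matching table; findAll returns every match.
first_only = rows_from_tables([soup.find('table', attrs)])
all_tables = rows_from_tables(soup.findAll('table', attrs))

print(len(first_only), len(all_tables))
```

One caveat, as far as I know: a class string containing a space is matched against the exact value of the class attribute, so 'sortable wikitable' would not match a table whose attribute reads "wikitable sortable". If the order on the real page differs, a CSS selector such as soup.select('table.wikitable.sortable') matches regardless of order.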
Edited to add
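You have a bunch of regular expressions that you don't really need, because it is easier to use .text. Also, when you extract the span elements with class sorttext, you remove the date text, which you want to keep. Since I removed the regular expressions, I also needed to extract the span elements styled with display:none: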
import urllib.request
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_school_shootings_in_the_United_States'
html = urllib.request.urlopen(url).read()
# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')

list_of_rows = []  # created once so rows from every table accumulate
for table in soup.findAll('table', {'class': 'sortable wikitable'}):
    # remove all extra tags in the HTML tables
    for span in table.findAll('span', 'sortkey'):
        span.extract()
    for span in table.findAll('span', {'style': 'display:none'}):
        span.extract()
    # scan through table, skipping the header row
    for row in table.findAll('tr')[1:]:
        list_of_cells = []
        for cell in row.findAll('td'):
            list_of_cells.append(cell.text)
        list_of_rows.append(list_of_cells)
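The code above stops once list_of_rows is filled, while the writing step in the question relies on Python 2 idioms ("wb" mode and the unicode type). A minimal sketch of the same step for Python 3 follows; the rows are placeholder values I made up so the snippet runs on its own, and only the header and file name are taken from the question.

```python
import csv

# Placeholder rows shaped like list_of_rows above (not real data).
list_of_rows = [
    ['date 1', 'place 1', '1', '0', 'description 1'],
    ['date 2', 'place 2', '0', '2', 'description with a comma, and an é'],
]
header = ["Date", "Location", "Deaths", "Injuries", "Description"]

# newline='' lets the csv module control line endings;
# encoding='utf-8' handles non-ASCII text without manual .encode() calls.
with open("schoolshootings.csv", "w", newline="", encoding="utf-8") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(header)
    writer.writerows(list_of_rows)
```

In Python 3 the csv module works on text, so the per-cell encode from the question is no longer needed, and the header is written before the data rows.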
The following code does what you need:
graph.cypher.execute('''
MERGE (tom:Person {name: "Tom"})
MERGE (jerry:Person {name: "Jerry"})
CREATE UNIQUE (tom)-[:KNOWS]->(jerry)
''')