I am trying to extract dates from <li> tags and store them in an Excel file.
<li>January 13, 1991: At least 40 people <a href ="......."> </a> </li>
Code:
import urllib2
import os
from datetime import datetime
import re
os.environ["LANG"]="en_US.UTF-8"
from bs4 import BeautifulSoup
page1 = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)
li =soup.find_all("li")
count = 0
while count < len(li):
    soup = BeautifulSoup(li[count])
    date_string, rest = soup.li.text.split(':', 1)
    print datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
    count+=1
Error:
Traceback (most recent call last):
  File "C:\Users\sony\Desktop\Trash\Crawler Try\trytest.py", line 13, in <module>
    soup =BeautifulSoup(li[count])
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 161, in __init__
    markup = markup.read()
TypeError: 'NoneType' object is not callable
[Finished in 4.0s with exit code 1]
I don't know how to write each piece of extracted text to Excel, so I haven't included code for that part. See the related question: Web crawler to extract in between the list
Answer 0 (score: 1)
The problem is that there are irrelevant li tags that do not contain the data you need.
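Be more specific. For example, if you want the list of events for the "20th century", first find the heading, then get the event list from its parent's following ul sibling. Also, not every item in the list has a date in the %B %d, %Y format, so you need to handle that with a try/except block: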
import urllib2
from datetime import datetime
from bs4 import BeautifulSoup
page1 = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)
events = soup.find('span', id='20th_century').parent.find_next_sibling('ul')
for event in events.find_all('li'):
    try:
        date_string, rest = event.text.split(':', 1)
        print datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
    except ValueError:
        print event.text
Prints:
19/09/1902
30/12/1903
11/01/1908
24/12/1913
23/10/1942
09/03/1946
1954 500-800 killed at Kumbha Mela, Allahabad.
01/01/1956
02/01/1971
03/12/1979
20/10/1982
29/05/1985
13/03/1988
20/08/1988
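The try/except handling can also be pulled out into a small helper that works on plain text, which makes it easy to check without fetching the page. This is only a sketch: the name split_event and the (date, description) return shape are assumptions for illustration, not part of the code above.

```python
from datetime import datetime


def split_event(text):
    """Split 'Month D, YYYY: description' into (dd/mm/YYYY, description).

    Hypothetical helper: returns (None, text) when the item has no colon
    or the part before the colon is not a '%B %d, %Y' date.
    """
    try:
        # Raises ValueError on unpacking if there is no ':' in the text.
        date_string, rest = text.split(':', 1)
        # Raises ValueError if the prefix is not in the expected date format.
        parsed = datetime.strptime(date_string.strip(), '%B %d, %Y')
        return parsed.strftime('%d/%m/%Y'), rest.strip()
    except ValueError:
        return None, text.strip()
```

Called on the sample item from the question, split_event("January 13, 1991: At least 40 people") gives ('13/01/1991', 'At least 40 people'), while the 1954 Kumbha Mela line comes back unchanged with None as its date.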
Updated version (gets all the ul groups under a century):
events = soup.find('span', id='20th_century').parent.find_next_siblings()
for tag in events:
    if tag.name == 'h2':
        break
    for event in tag.find_all('li'):
        try:
            date_string, rest = event.text.split(':', 1)
            print datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
        except ValueError:
            print event.text
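As for storing the results for Excel, which the question asks about but the code above does not cover, one possible approach is to write a CSV file, since Excel can open those. The sketch below assumes Python 3 and assumes the events have already been collected as (date, description) pairs; the function name write_events_csv, the column names, and the sample rows are illustrative only.

```python
import csv
import io


def write_events_csv(fileobj, events):
    """Write (date, description) pairs as CSV rows to an open text file object."""
    writer = csv.writer(fileobj)
    writer.writerow(['date', 'description'])
    for date, description in events:
        # Items without a parseable date get an empty date cell.
        writer.writerow([date or '', description])


# Illustrative rows, taken from the items shown in the question and output.
sample = [
    ('13/01/1991', 'At least 40 people'),
    (None, '1954 500-800 killed at Kumbha Mela, Allahabad.'),
]

# In-memory demo; for a real file, open it with open(path, 'w', newline='').
buffer = io.StringIO()
write_events_csv(buffer, sample)
```

The csv module quotes the second description automatically because it contains a comma, so it still lands in a single cell.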