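I'm trying to pull information out of a number of different tables on an HTML page, without any of the HTML indentation or tag formatting. I'm using get_text to produce the content I want, but the output comes with a large amount of whitespace and tabs. I tried .strip, but it doesn't do what I want.
Here is the Python script I'm using: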
import csv, simplejson, urllib
url="http://www.thecomedystudio.com/schedule.html"
response=urllib.urlopen(url)
from bs4 import BeautifulSoup
html = response
soup = BeautifulSoup(html.read())
text = soup.get_text()
print text
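Eventually I'd like to build a CSV of the event calendar, but first I want to produce a .txt file, or something else that doesn't need much manual cleanup.
Any help is appreciated.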
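For the intermediate plain-text step described above, one option is to let get_text do the stripping per text node rather than calling .strip on the whole result. The following is only a minimal sketch: SAMPLE_HTML is made-up markup standing in for the real schedule page, and clean_text is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real schedule page may be structured differently.
SAMPLE_HTML = """
<table>
  <tr>
    <td>   <div align="center"><b>Fri<br>2.27.15</b></div>   </td>
    <td>
        Rick Canavan hosts.
    </td>
  </tr>
</table>
"""


def clean_text(html):
    """Return the visible text with one stripped, non-empty string per line."""
    soup = BeautifulSoup(html, "html.parser")
    # strip=True strips every text node and drops the ones that are only
    # whitespace; separator="\n" puts each remaining piece on its own line.
    return soup.get_text(separator="\n", strip=True)


print(clean_text(SAMPLE_HTML))
```

Calling .strip on the full string only trims the two ends, which is probably why it seemed to have no effect on the whitespace between table cells.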
Answer 0 (score: 1)
You don't need to "clean" the HTML in order to parse it with BeautifulSoup.
Parse the dates and events directly into a CSV file:
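# Python 2 (urllib2; the csv file is opened in binary mode)
import csv
from urllib2 import urlopen
from bs4 import BeautifulSoup

url = "http://www.thecomedystudio.com/schedule.html"
soup = BeautifulSoup(urlopen(url))

with open('output.csv', 'wb') as f:
    writer = csv.writer(f)
    # each date is a bold label inside a centered div within a table cell
    for item in soup.select('td div[align=center] > b'):
        # join the text pieces of the label (e.g. weekday and date) with a space
        date = ' '.join(el.strip() for el in item.find_all(text=True))
        # the event description is in the next cell of the same row
        event = item.parent.parent.find_next_sibling('td').get_text(strip=True)
        writer.writerow([date, event])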
Contents of output.csv after running the script:
Fri 2.27.15,"Rick Canavan hosts with Christine An, Rachel Bloom, Dan Crohn, Wes Hazard, James Huessy, Kelly MacFarland, Peter Martin, Ted Pettingell."
Sat 2.28.15,"Rick Jenkins hosts Taylor Connelly, Lilian DeVane, Andrew Durso, Nate Johnson, Peter Martin, Andrew Mayer, Kofi Thomas, Tim Willis."
Sun 3.1.15,"Peter Martin hosts Sunday Funnies with Nonye Brown-West, Ryan Donahue, Joe Kozlowski, Casey Malone, Etrane Martinez, Kwasi Mensah, Anthony Zonfrelli, Christa Weiss and Sam Jay closing."
Tue 3.3.15,Mystery Lounge! The old-est and only-est magic show in New England! with guest comedian Ryan Donahue.
...
Thu 12.31.15,"New Year's Eve! with Rick Jenkins, Nathan Burke."
Fri 1.1.16,Rick Canavan hosts New Year's Day.
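The script in this answer targets Python 2 (urllib2, a csv file opened with 'wb'). Below is a rough Python 3 sketch of the same selector-based approach, split into functions so it can be tried without network access. The names extract_events and write_csv and the SAMPLE_HTML markup are assumptions made for illustration; the sample only mimics the structure implied by the selector and is not a copy of the real page. It also uses find_parent("td") where the original uses item.parent.parent, and skips rows that have no following cell.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup shaped like what the selector expects:
# a bold date inside a centered div, followed by a cell with the event text.
SAMPLE_HTML = """
<table>
  <tr>
    <td><div align="center"><b>Fri<br>2.27.15</b></div></td>
    <td> Rick Canavan hosts with Christine An, Rachel Bloom. </td>
  </tr>
  <tr>
    <td><div align="center"><b>Tue<br>3.3.15</b></div></td>
    <td>Mystery Lounge!</td>
  </tr>
</table>
"""


def extract_events(html):
    """Return a list of (date, event) tuples found in the schedule markup."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select('td div[align="center"] > b'):
        # stripped_strings yields the label's text pieces without whitespace
        date = " ".join(item.stripped_strings)
        # the event text is assumed to live in the next <td> of the same row
        cell = item.find_parent("td").find_next_sibling("td")
        if cell is None:
            continue  # no event cell next to this date; skip the row
        rows.append((date, cell.get_text(strip=True)))
    return rows


def write_csv(rows, fileobj):
    """Write the (date, event) rows to an open text-mode file object."""
    csv.writer(fileobj).writerows(rows)


rows = extract_events(SAMPLE_HTML)
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())

# To run against the live page instead (not executed here):
#   from urllib.request import urlopen
#   html = urlopen("http://www.thecomedystudio.com/schedule.html").read()
#   with open("output.csv", "w", newline="") as f:
#       write_csv(extract_events(html), f)
```

In Python 3 the csv module expects a text-mode file opened with newline="", which replaces the 'wb' mode used in the Python 2 version.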