Question

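I want to scrape text from the URL "http://www.nycgo.com/venues/thalia-restaurant#menu". The text I am interested in is under the 'menu' tab on the page. I tried using BeautifulSoup to get all of the text on the page, but the value returned by the following code is missing all of the text in the menu.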

import urllib2
from bs4 import BeautifulSoup as BS  # BS is assumed to be bs4's BeautifulSoup

html = urllib2.urlopen("http://www.nycgo.com/venues/thalia-restaurant#menu")
html = html.read()
soup = BS(html)
print soup.get_text()

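When I inspect an element in the menu, its content appears to be part of the page's HTML. I also noticed that when I browse to the page myself, the menu takes a few seconds to load fully. I am not sure whether that is why the code above fails to pick up the menu content.

Any insights would be appreciated.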

Answer 1

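While soup.get_text() will return all the text from an HTML document (web page), the problem here is that the menu is embedded in the page as a PDF, which Beautiful Soup cannot access. The actual PDF file is defined in the page's Javascript as follows: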

{
    name: "menu",
    show: Boolean(1),
    url: "/assets/files/programs/rw/2016W/thalia-restaurant.pdf"
}

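The simplest way to extract it is probably a regular expression. Running a regular expression over HTML is usually a bad idea, but here you are looking for something very specific: a file name, enclosed in "quotes", that ends in .pdf. The following code will find and extract the URL: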

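import re
from urllib import urlopen  # Python 2; in Python 3 this lives in urllib.request

html = urlopen("http://www.nycgo.com/venues/thalia-restaurant#menu")
html_doc = html.read()

# Find the first double-quoted string that ends in .pdf, without spanning other quotes
match = re.search(br'"([^"]*\.pdf)"', html_doc)
pdf_url = "http://www.nycgo.com" + match.group(1).decode('utf8')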

pdf_url is now:

u'http://www.nycgo.com/assets/files/programs/rw/2016W/thalia-restaurant.pdf'
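
For readers on Python 3, the same idea can be sketched as a small helper that works on HTML already fetched as text. The function name find_pdf_url and the sample string are illustrative assumptions, not part of the original answer, and urljoin is swapped in for manual string concatenation to resolve the relative path:

```python
import re
from urllib.parse import urljoin

# Matches a double-quoted string that ends in .pdf and contains no inner quotes.
PDF_PATTERN = re.compile(r'"([^"]*\.pdf)"')


def find_pdf_url(html_text, base_url):
    """Return the absolute URL of the first quoted .pdf path in html_text, or None."""
    match = PDF_PATTERN.search(html_text)
    if match is None:
        return None
    # urljoin resolves a site-relative path such as /assets/... against the page URL.
    return urljoin(base_url, match.group(1))


# Illustrative input modelled on the JavaScript fragment shown earlier.
sample = '{ name: "menu", show: Boolean(1), url: "/assets/files/programs/rw/2016W/thalia-restaurant.pdf" }'
print(find_pdf_url(sample, "http://www.nycgo.com/venues/thalia-restaurant#menu"))
```

Returning None when nothing matches lets the caller handle a page without a PDF link, rather than failing on match.group(1).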

Extracting the text from the PDF, however, is a bit trickier. You can start by downloading the file:

from urllib import urlretrieve
urlretrieve(pdf_url, "download.pdf")
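
In Python 3, urlretrieve lives in urllib.request. The sketch below wraps the download and adds a check for the %PDF- signature that PDF files begin with, so that an HTML error page saved under a .pdf name is caught early. The helper name download_pdf is hypothetical, and the demonstration uses a local file:// URL with placeholder bytes rather than the real menu:

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def download_pdf(url, dest_path):
    """Download url to dest_path and report whether the result looks like a PDF."""
    urlretrieve(url, dest_path)
    with open(dest_path, "rb") as handle:
        # PDF files start with the signature %PDF- followed by a version number.
        return handle.read(5) == b"%PDF-"


# Demonstration without network access: serve placeholder bytes from a file:// URL.
workdir = tempfile.mkdtemp()
source = Path(workdir) / "source.pdf"
source.write_bytes(b"%PDF-1.4\n% placeholder content, not a real menu\n")
target = os.path.join(workdir, "download.pdf")
print(download_pdf(source.as_uri(), target))
```

With the real page, the call would be download_pdf(pdf_url, "download.pdf").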

Then extract the text with the convert_pdf_to_txt function described in this answer to another question:

text = convert_pdf_to_txt("download.pdf")
print(text)

This returns:

NEW YOUR CITY 
RESTAURANT WEEK

WINTER 2016

MONDAY - FRIDAY
828 Eighth Avenue
New York City, 10019

Tel: 212.399.4444

www.restaurantthalia.com

LUNCH $25
FIRST COURSE
CREAMY POLENTA
fricassee of truffle mushrooms

...
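
The convert_pdf_to_txt function comes from the linked answer and is not reproduced here. As a rough stand-in, assuming the third-party pdfminer.six package is installed, its high-level extract_text function can play the same role. The wrapper below and the nonblank_lines helper are illustrative names, not the original implementation:

```python
def convert_pdf_to_txt(path):
    """Return the text of the PDF at path using pdfminer.six (assumed to be installed)."""
    # Imported here so the rest of the module loads even without pdfminer.six.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def nonblank_lines(text):
    """Strip surrounding whitespace from each line and drop the empty ones."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Tidying a fragment of the output shown above; no PDF is needed for this part.
sample = "LUNCH $25\nFIRST COURSE\nCREAMY POLENTA\nfricassee of truffle mushrooms\n\n"
print(nonblank_lines(sample))
```

Extracted PDF text tends to carry stray blank lines and trailing spaces, so a small clean-up step like nonblank_lines can make the menu easier to process further.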
