Here is the web page:
<html>
<head>
<!--eBay V3- msxml 6.0 XXXXXXXXXXXXXXXXXXXXXXXXXX-->
<!--srcId - File Exchange Programmatically Upload-->
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<title>Upload File Programmatically</title><script language="JavaScript"><!--
var pageName = "File Exchange Upload";
//--></script><script language="javascript" src="http://include.ebaystatic.com/js/e867/us/legacy/globals_e8672us.js"> </script><script src="http://include.ebaystatic.com/js/e885/us/legacy/common_functions_e8852us.js"> </script></head>
<body>
File upload successful. Your ref # is 711103172.<br><a href="javascript:void(0);" onclick="self.close();return false;">Close</a></body>
</html>
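I need to extract the number 711103172. Is BeautifulSoup a good fit for this, or would some other approach work better? (I am currently using BS, but this page has very little structure.)
I can get the contents of the body returned: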
<body>
File upload successful. Your ref # is 711103172.<br><a href="javascript:void(0);" onclick="self.close();return false;">Close</a></body>
However, once I get that far, I am stuck.
Answer 0 (score: 2)
Use BeautifulSoup to get the body text, then use regular expressions to extract the number you need:
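import re
from bs4 import BeautifulSoup

# Paste the page source between the triple quotes.
data = """
Your HTML code here
"""
soup = BeautifulSoup(data, "html.parser")
# Search the body text; (\d+) captures the reference number.
match = re.search(r'Your ref # is (\d+)', soup.body.text)
print(match.group(1) if match else 'Not Found')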
Prints:
711103172
FYI, the (\d+) part of the regular expression is a saving/capturing group. \d+ matches one or more digits.
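Since the question also asks about other approaches: on a page with this little structure, the regular expression can probably be applied to the raw page source directly, without parsing the HTML first. The following is only a minimal sketch, assuming the message always uses the exact wording "Your ref # is" followed by digits. The names extract_ref and REF_PATTERN are made up for illustration, and the sample string is the body from the question.

```python
import re

# Sample page body copied from the question.
html = '''<body>
File upload successful. Your ref # is 711103172.<br><a href="javascript:void(0);" onclick="self.close();return false;">Close</a></body>'''

# Compile once; group 1 is the run of digits after the fixed phrase.
REF_PATTERN = re.compile(r'Your ref # is (\d+)')

def extract_ref(page_source):
    """Return the reference number as a string, or None if it is absent."""
    match = REF_PATTERN.search(page_source)
    return match.group(1) if match else None

ref = extract_ref(html)
print(ref)
```

Here match.group(0) would be the whole matched text ("Your ref # is 711103172"), while match.group(1) is only what the capturing group saved, i.e. the digits. If the wording of the message can vary, going through BeautifulSoup first, as in the answer above, is likely the safer choice.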