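I am building a proxy scraper with regular expressions. Parsing HTML with re is terrible, so I need to make sure no strings show up in the final result. How do I replace all strings with a space? The code I currently use to clean up the parsed data is: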
print title.replace(',', '').replace("!", '').replace(":", '').replace(";", '').replace(str, '')
The str part is what I tried... it did not work. Is there another way to do this?
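Answer 0 (score: 3)
If you want to extract all visible numbers from an HTML document, you can first parse the document with BeautifulSoup and extract the text from it. After that, you can pull all the numbers out of those text elements: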
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
# let’s use the StackOverflow homepage as an example
r = urlopen('http://stackoverflow.com')
# Name the parser explicitly so the result does not depend on
# which parsers happen to be installed.
soup = BeautifulSoup(r, 'html.parser')
# As we don’t want to get the content from script related
# elements, remove those.
for script in soup(['script', 'noscript']):
    script.extract()
# And now extract the numbers using regular expressions from
# all text nodes we can find in the (remaining) document.
numbers = [n for t in soup(text=True) for n in re.findall(r'\d+', t)]
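After that, numbers will contain all the numbers that are visible in the document. If you want to restrict the search to certain elements only, you can change the soup(text=True) part.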
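As a rough illustration of restricting the search to certain elements, the sketch below only looks at the text inside <td> cells. The sample markup and the choice of <td> are assumptions made for the example, not something taken from the question; find_all and get_text are standard BeautifulSoup calls.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup; the <td> cells stand in for a proxy table.
html = """
<html><body>
  <h1>Proxy list 2024</h1>
  <table>
    <tr><td>127.0.0.1</td><td>8080</td></tr>
    <tr><td>10.0.0.2</td><td>3128</td></tr>
  </table>
  <script>var x = 42;</script>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Only look at the text inside <td> elements instead of every text node.
cells = [td.get_text() for td in soup.find_all('td')]

# Pull the digit runs out of each selected cell.
numbers = [n for cell in cells for n in re.findall(r'\d+', cell)]
print(numbers)
```

Because only the <td> elements are searched, the numbers in the heading and in the script are never considered, so no separate script-removal step is needed here.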
Answer 1 (score: 1)
# ASCII codes of every character we do NOT want to keep (see
# http://www.asciitable.com/): everything except the digits
# '0'-'9' (48-57) and '.' (46). This includes all the letters.
replace1 = list(range(0, 46)) + list(range(58, 127)) + [47]
text = '<html><body><p>127.0.0.1</p></body></html>'  # Test data.
tmp = ''
for ch in text:  # go through each character in the text
    # keep the character only if its ASCII value is not in the
    # list of blacklisted ASCII values
    if ord(ch) not in replace1:
        tmp += ch
print(tmp)
127.0.0.1
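The same filtering can probably be written more compactly with re.sub and a negated character class. This is only a sketch of an alternative, not the answerer's method, and keep_digits_and_dots is a hypothetical helper name introduced for illustration.

```python
import re

def keep_digits_and_dots(text):
    """Remove every character except the digits 0-9 and '.'."""
    # [^0-9.] matches any single character that is not a digit or a dot.
    return re.sub(r'[^0-9.]', '', text)

text = '<html><body><p>127.0.0.1</p></body></html>'  # same test data as above
print(keep_digits_and_dots(text))
```

Like the loop above, this keeps digits that appear inside tag names or attributes (the 1 in <h1>, for example), so it is only safe on input where the markup itself contains no digits or dots.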