I am trying to scrape a website. So far I can scrape the site, but I want to write the output to a text file and then remove certain strings from it.
from urllib.request import urlopen
from bs4 import BeautifulSoup
delete = ['https://', 'http://', 'b\'http://', 'b\'https://']
url = urlopen('https://openphish.com/feed.txt')
bs = BeautifulSoup(url.read(), 'html.parser')
print(bs.encode('utf_8'))
The result is a long list of links; here is a sample of the output:

"b'https://certain-wrench.000webhostapp.com/auth/signin/details.html\nhttps://sweer-adherence.000webhostapp.com/auth/signin/details.html\n"
UPDATED
import requests
from bs4 import BeautifulSoup

url = "https://openphish.com/feed.txt"
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
with open('url.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(soup.prettify())
delete = ["</p>", "</body>", "</html>", "<body>", "<p>", "<html>", "www.",
"https://", "http://", " ", " ", " "]
with open(r'C:\Users\v-morisv\Desktop\scripts\url.txt', 'r') as file:
with open(r'C:\Users\v-morisv\Desktop\scripts\url1.txt', 'w') as
file1:
for line in file:
for word in delete:
line = line.replace(word, "")
print(line, end='')
file1.write(line)
The code above works, but I have one problem: I don't get just the domain name, I also get everything after the slash, so a result looks like bofawebplus.webcindario.com/index4.html, and I would like to remove the "/" and everything that follows it.
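A minimal sketch of one way to do that, assuming each cleaned line starts with the host name after the replace loop above (the sample value is hypothetical):

# Hypothetical line after the replace loop has stripped the scheme.
line = "bofawebplus.webcindario.com/index4.html\n"

# Keep only the part before the first '/', dropping the path entirely.
domain = line.split('/', 1)[0].strip()
print(domain)  # bofawebplus.webcindario.com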
Answer 0 (score: 0)
This looks like the right situation for a regular expression.
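A sketch of that idea, assuming the goal is to pull just the host name out of each line of the feed; the pattern and variable names below are illustrative, not taken from the original answer:

import re
from urllib.request import urlopen

# Illustrative pattern: skip an optional scheme, then capture everything
# up to the first '/' or whitespace (the host name).
host_pattern = re.compile(r'^(?:https?://)?([^/\s]+)')

feed = urlopen('https://openphish.com/feed.txt')
for raw_line in feed:
    match = host_pattern.match(raw_line.decode('utf-8'))
    if match:
        print(match.group(1))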
Answer 1 (score: 0)
There is no reason to use BeautifulSoup here; it is for parsing HTML, but the URL being opened is plain text.

Here is a solution that does what you need. It uses Python's urlparse as an easier and more reliable way of extracting the domain name, and a Python set to remove duplicate entries, since there are a lot of them.
from urllib.request import urlopen
from urllib.parse import urlparse

feed_list = urlopen('https://openphish.com/feed.txt')

domains = set()
for line in feed_list:
    url = urlparse(line)
    domain = url.netloc.decode('utf-8')  # decode from utf-8 bytes to string
    domains.add(domain)  # keep all the domains in a set to remove duplicates

for domain in domains:
    print(domain)
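Since the goal in the question was to end up with a text file, a small hedged addition (the file name is an example, not part of the original answer):

# Optionally write the deduplicated domains to a file, one per line.
# The file name 'domains.txt' is just an example.
with open('domains.txt', 'w', encoding='utf-8') as f_out:
    for domain in sorted(domains):
        f_out.write(domain + '\n')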