I am trying to scrape a website. So far I can scrape the site, but I want to write the output to a text file and then remove certain strings from it.
from urllib.request import urlopen
from bs4 import BeautifulSoup
delete = ['https://', 'http://', 'b\'http://', 'b\'https://']
url = urlopen('https://openphish.com/feed.txt')
bs = BeautifulSoup(url.read(), 'html.parser')
print(bs.encode('utf_8'))
The result is a long list of links; here is a sample of the output:

"b'https://certain-wrench.000webhostapp.com/auth/signin/details.html\nhttps://sweer-adherence.000webhostapp.com/auth/signin/details.html\n"
UPDATED
import requests
from bs4 import BeautifulSoup

url = "https://openphish.com/feed.txt"
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
with open('url.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(soup.prettify())
delete = ["</p>", "</body>", "</html>", "<body>", "<p>", "<html>", "www.",
"https://", "http://", " ", " ", " "]
with open(r'C:\Users\v-morisv\Desktop\scripts\url.txt', 'r') as file:
with open(r'C:\Users\v-morisv\Desktop\scripts\url1.txt', 'w') as
file1:
for line in file:
for word in delete:
line = line.replace(word, "")
print(line, end='')
file1.write(line)
The code above works, but I have one problem: I don't get just the domain name, I also get everything after the slash, so a result looks like bofawebplus.webcindario.com/index4.html, and I would like to remove the "/" and everything that follows it.
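A minimal sketch of one way to do that, assuming each cleaned line starts with the host name after the replace loop above (the sample value is hypothetical):

# Hypothetical line after the replace loop has stripped the scheme.
line = "bofawebplus.webcindario.com/index4.html\n"

# Keep only the part before the first '/', dropping the path entirely.
domain = line.split('/', 1)[0].strip()
print(domain)  # bofawebplus.webcindario.com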
Answer 0 (score: 0)
This looks like the right situation for a regular expression.
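A sketch of that idea, assuming the goal is to pull just the host name out of each line of the feed; the pattern and variable names below are illustrative, not taken from the original answer:

import re
from urllib.request import urlopen

# Illustrative pattern: skip an optional scheme, then capture everything
# up to the first '/' or whitespace (the host name).
host_pattern = re.compile(r'^(?:https?://)?([^/\s]+)')

feed = urlopen('https://openphish.com/feed.txt')
for raw_line in feed:
    match = host_pattern.match(raw_line.decode('utf-8'))
    if match:
        print(match.group(1))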
Answer 1 (score: 0)
There is no reason to use BeautifulSoup here; it is for parsing HTML, but the URL being opened is plain text.

Here is a solution that does what you need. It uses Python's urlparse as an easier and more reliable way of extracting the domain name, and a Python set to remove duplicate entries, since there are a lot of them.
from urllib.request import urlopen
from urllib.parse import urlparse

feed_list = urlopen('https://openphish.com/feed.txt')

domains = set()
for line in feed_list:
    url = urlparse(line)
    domain = url.netloc.decode('utf-8')  # decode from utf-8 bytes to string
    domains.add(domain)  # keep all the domains in a set to remove duplicates

for domain in domains:
    print(domain)
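Since the goal in the question was to end up with a text file, a small hedged addition (the file name is an example, not part of the original answer):

# Optionally write the deduplicated domains to a file, one per line.
# The file name 'domains.txt' is just an example.
with open('domains.txt', 'w', encoding='utf-8') as f_out:
    for domain in sorted(domains):
        f_out.write(domain + '\n')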