Extracting URLs and titles from a text file in Python

Date: 2018-08-03 14:36:10

Tags: python regex

I have a text file that looks like this:

<a href="https://en.wikipedia.org/wiki/Scotland" h="ID=SERP,5161.1">Scotland - Wikipedia
<a href="https://www.visitscotland.com/" h="ID=SERP,5177.1">VisitScotland - Official Site
<a href="https://www.bbc.co.uk/news/scotland" h="ID=SERP,5191.1">BBC Scotland News - Official Site
<a href="https://www.lonelyplanet.com/scotland" h="ID=SERP,5207.1">Scotland travel - Lonely Planet

I want to extract the URLs from this file, keeping only the main domain (e.g. "en.wikipedia.org", "www.bbc.co.uk", etc.), and write them to Links.txt,

and write the titles (i.e. "Scotland - Wikipedia", "VisitScotland - Official Site", etc.) to Titles.txt.

I am new to regex and tried a few regex functions for the extraction, but without success.

4 answers:

Answer 0 (score: 1)

If your file is an HTML file, you can use BeautifulSoup:

from bs4 import BeautifulSoup

html = ...  # YOUR FILE HERE

soup = BeautifulSoup(html, 'html.parser')  # pass a parser explicitly to avoid a warning
links = soup.find_all('a')

for tag in links:
    link = tag.get('href', None)
    if link is not None:
        pass  # do whatever with the link

Answer 1 (score: 0)

Explanations of the two regexes used below: here and here. Assuming your data is stored in data.txt:

import re

with open('data.txt', 'r', newline='') as f_in, \
    open('links.txt', 'w', newline='') as links_out, \
    open('titles.txt', 'w', newline='') as titles_out:

    data = f_in.read()

    # capture everything between href=" and the closing quote
    for link in re.findall(r'(?:href=")([^"]+)', data):
        links_out.write(link + '\n')

    # capture everything after the tag's closing '>' up to each line end (re.M)
    for title in re.findall(r'(?:>)(.*?)$', data, flags=re.M):
        titles_out.write(title + '\n')

In titles.txt you will have:

Scotland - Wikipedia
VisitScotland - Official Site
BBC Scotland News - Official Site
Scotland travel - Lonely Planet

In links.txt you will have:

https://en.wikipedia.org/wiki/Scotland
https://www.visitscotland.com/
https://www.bbc.co.uk/news/scotland
https://www.lonelyplanet.com/scotland
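The re.M (MULTILINE) flag is what makes $ match at the end of every line rather than only at the end of the whole string; a minimal illustration (with made-up sample strings):

```python
import re

s = "first>One Title\nsecond>Another Title"

# Without re.M, '$' only matches at the very end of the string,
# and '.' cannot cross the newline, so only the last line matches.
print(re.findall(r'>(.*?)$', s))              # ['Another Title']

# With re.M, '$' also matches just before each newline.
print(re.findall(r'>(.*?)$', s, flags=re.M))  # ['One Title', 'Another Title']
```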

Note: parsing HTML documents is better done with BeautifulSoup or a similar library; it is much more robust.

Edit:

To keep only the domain, you can use urllib.parse.urlparse:

# on the top:
from urllib.parse import urlparse

for link in re.findall(r'(?:href=")([^"]+)', data):
    url = urlparse(link)
    links_out.write(url.scheme + '://' + url.netloc + '\n')

links.txt will then look like this:

https://en.wikipedia.org
https://www.visitscotland.com
https://www.bbc.co.uk
https://www.lonelyplanet.com
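Note that the question actually asked for just the main domain (e.g. "en.wikipedia.org") without the scheme; urlparse exposes that directly through netloc (or hostname, which additionally lowercases and strips any port), so the scheme + '://' prefix above can simply be dropped:

```python
from urllib.parse import urlparse

url = urlparse('https://en.wikipedia.org/wiki/Scotland')
print(url.netloc)    # en.wikipedia.org
print(url.hostname)  # en.wikipedia.org
```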

Answer 2 (score: 0)

import re
s = """<a href="https://en.wikipedia.org/wiki/Scotland" h="ID=SERP,5161.1">Scotland - Wikipedia
<a href="https://www.visitscotland.com/" h="ID=SERP,5177.1">VisitScotland - Official Site
<a href="https://www.bbc.co.uk/news/scotland" h="ID=SERP,5191.1">BBC Scotland News - Official Site
<a href="https://www.lonelyplanet.com/scotland" h="ID=SERP,5207.1">Scotland travel - Lonely Planet"""

links = re.findall(r"href=\"(.*?)\"", s)
titles = re.findall(r">(.*)", s)
print(links)
print(titles)

Writing to files:

with open("links.txt", "w") as links_file, open("titles.txt", "w") as titles_file:
    links_file.write("\n".join(links))
    titles_file.write("\n".join(titles))

Output:

['https://en.wikipedia.org/wiki/Scotland', 'https://www.visitscotland.com/', 'https://www.bbc.co.uk/news/scotland', 'https://www.lonelyplanet.com/scotland']
['Scotland - Wikipedia', 'VisitScotland - Official Site', 'BBC Scotland News - Official Site', 'Scotland travel - Lonely Planet']

Answer 3 (score: 0)

Parsing HTML with regular expressions is almost always a bad idea, even when a regex solution appears to work. You can run into all sorts of problems when the input contains unexpected characters, or tags that carry additional attributes, and so on.
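For instance, a hypothetical tag (not from the question's data) whose attribute value itself contains a > is enough to break the regex approach from the earlier answers:

```python
import re

# made-up input: an extra attribute whose value contains '>'
s = '<a href="https://example.com" data-note="score>9000">Example Title'

# the naive pattern grabs everything after the FIRST '>',
# which here sits inside an attribute value
print(re.findall(r'>(.*)', s))  # ['9000">Example Title']
```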

Here is a way to do it with Python's built-in libraries for parsing HTML and URLs. The modules used are html.parser and urllib.parse:

from html.parser import HTMLParser
from urllib.parse import urlparse

class URLTitleParser(HTMLParser):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.links = []
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag.lower() != 'a':
            return

        for key, value in attrs:
            if key == 'href':
                url = urlparse(value)
                self.links.append(url.hostname)
                break

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only chunks between tags
            self.titles.append(text)


if __name__ == '__main__':
    parser = URLTitleParser()

    with open('data.txt') as data:
        parser.feed(data.read())

    with open('links.txt', 'w') as links:
        links.write('\n'.join(parser.links))

    with open('titles.txt', 'w') as titles:
        titles.write('\n'.join(parser.titles))