I have a text file that looks like this:
<a href="https://en.wikipedia.org/wiki/Scotland" h="ID=SERP,5161.1">Scotland - Wikipedia
<a href="https://www.visitscotland.com/" h="ID=SERP,5177.1">VisitScotland - Official Site
<a href="https://www.bbc.co.uk/news/scotland" h="ID=SERP,5191.1">BBC Scotland News - Official Site
<a href="https://www.lonelyplanet.com/scotland" h="ID=SERP,5207.1">Scotland travel - Lonely Planet
From this text file I want to extract just the main domain of each URL (e.g. "en.wikipedia.org", "www.bbc.co.uk", etc.) into Links.txt,
and the titles (i.e. "Scotland - Wikipedia", "VisitScotland - Official Site", etc.) into Titles.txt.
I am new to regex and tried the extraction with a few regex functions, but without success.
Answer 0 (score: 1)
If your file is an HTML file, you can use BeautifulSoup:
from bs4 import BeautifulSoup

html = #YOUR FILE HERE
soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all('a')
for tag in links:
    link = tag.get('href', None)
    if link is not None:
        pass  # Do whatever with the link
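As a minimal sketch of the approach above, applied to one line of the sample data (this assumes the third-party bs4 package is installed; `get_text()` retrieves the anchor text, which is the title the question asks for):

```python
from bs4 import BeautifulSoup

# One sample line from the question's file.
html = '<a href="https://en.wikipedia.org/wiki/Scotland" h="ID=SERP,5161.1">Scotland - Wikipedia'
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all('a'):
    link = tag.get('href', None)
    if link is not None:
        print(link)            # the full URL
        print(tag.get_text())  # the title text
```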
Answer 1 (score: 0)
Explanation of this regex here and here. Assuming your data is stored in data.txt:
import re

with open('data.txt', 'r', newline='') as f_in, \
        open('links.txt', 'w', newline='') as links_out, \
        open('titles.txt', 'w', newline='') as titles_out:
    data = f_in.read()
    for link in re.findall(r'(?:href=")([^"]+)', data):
        links_out.write(link + '\n')
    for title in re.findall(r'(?:>)(.*?)$', data, flags=re.M):
        titles_out.write(title + '\n')
In titles.txt you will have:
Scotland - Wikipedia
VisitScotland - Official Site
BBC Scotland News - Official Site
Scotland travel - Lonely Planet
In links.txt you will have:
https://en.wikipedia.org/wiki/Scotland
https://www.visitscotland.com/
https://www.bbc.co.uk/news/scotland
https://www.lonelyplanet.com/scotland
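The two regexes above can be checked without any files, on a single sample line (the `$` anchor with `re.M` matches at each end of line, so the lazy `.*?` still captures the whole title):

```python
import re

# One sample line instead of data.txt.
sample = '<a href="https://en.wikipedia.org/wiki/Scotland" h="ID=SERP,5161.1">Scotland - Wikipedia'

# Same patterns as in the answer above.
links = re.findall(r'(?:href=")([^"]+)', sample)
titles = re.findall(r'(?:>)(.*?)$', sample, flags=re.M)
print(links)   # ['https://en.wikipedia.org/wiki/Scotland']
print(titles)  # ['Scotland - Wikipedia']
```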
Note: parsing HTML documents is done better, and more robustly, with BeautifulSoup or a similar library.
EDIT:
To extract only the domain, you can use urllib.parse.urlparse:
# on the top:
from urllib.parse import urlparse

# then replace the link loop with:
for link in re.findall(r'(?:href=")([^"]+)', data):
    url = urlparse(link)
    links_out.write(url.scheme + '://' + url.netloc + '\n')
links.txt will then look like this:
https://en.wikipedia.org
https://www.visitscotland.com
https://www.bbc.co.uk
https://www.lonelyplanet.com
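To see why this produces bare domains, here is how `urlparse` splits one of the sample links; `netloc` is the "main domain" the question asks for:

```python
from urllib.parse import urlparse

# One of the links from the sample data.
url = urlparse('https://www.bbc.co.uk/news/scotland')
print(url.scheme)  # https
print(url.netloc)  # www.bbc.co.uk
print(url.path)    # /news/scotland
```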
Answer 2 (score: 0)
import re
s = """<a href="https://en.wikipedia.org/wiki/Scotland" h="ID=SERP,5161.1">Scotland - Wikipedia
<a href="https://www.visitscotland.com/" h="ID=SERP,5177.1">VisitScotland - Official Site
<a href="https://www.bbc.co.uk/news/scotland" h="ID=SERP,5191.1">BBC Scotland News - Official Site
<a href="https://www.lonelyplanet.com/scotland" h="ID=SERP,5207.1">Scotland travel - Lonely Planet"""
links = re.findall(r"href=\"(.*?)\"", s)
titles = re.findall(r">(.*)", s)
print(links)
print(titles)
Writing to the files:
with open("links.txt", "w") as links_file, open("titles.txt", "w") as titles_file:
    links_file.write("\n".join(links))
    titles_file.write("\n".join(titles))
Output:
['https://en.wikipedia.org/wiki/Scotland', 'https://www.visitscotland.com/', 'https://www.bbc.co.uk/news/scotland', 'https://www.lonelyplanet.com/scotland']
['Scotland - Wikipedia', 'VisitScotland - Official Site', 'BBC Scotland News - Official Site', 'Scotland travel - Lonely Planet']
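Note that this answer writes the full URLs, while the question asked for just the main domains; a small sketch of reducing the extracted links to domains with the standard library's `urlparse`:

```python
from urllib.parse import urlparse

# The links extracted by the regex above (abbreviated to two here).
links = ['https://en.wikipedia.org/wiki/Scotland', 'https://www.visitscotland.com/']
domains = [urlparse(link).netloc for link in links]
print(domains)  # ['en.wikipedia.org', 'www.visitscotland.com']
```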
Answer 3 (score: 0)
Even though a regex solution works here, parsing HTML with a real parser is more reliable. Here is one way to parse the HTML and URLs using only Python's built-in libraries. The modules used are html.parser and urllib.parse:
from html.parser import HTMLParser
from urllib.parse import urlparse


class URLTitleParser(HTMLParser):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.links = []
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag.lower() != 'a':
            return
        for key, value in attrs:
            if key == 'href':
                url = urlparse(value)
                self.links.append(url.hostname)
                break

    def handle_data(self, data):
        self.titles.append(data.strip())


if __name__ == '__main__':
    parser = URLTitleParser()
    with open('data.txt') as data:
        parser.feed(data.read())
    with open('links.txt', 'w') as links:
        links.write('\n'.join(parser.links))
    with open('titles.txt', 'w') as titles:
        titles.write('\n'.join(parser.titles))
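One detail worth knowing: this answer uses `url.hostname`, while the regex answer's EDIT uses `url.netloc`. For the question's sample links they give the same bare domain, but they differ when a URL carries a port or mixed case, as this hypothetical URL (not from the sample data) shows:

```python
from urllib.parse import urlparse

# Hypothetical URL with explicit port and mixed-case host, for illustration.
url = urlparse('https://www.BBC.co.uk:443/news/scotland')
print(url.hostname)  # lower-cased, port stripped: www.bbc.co.uk
print(url.netloc)    # verbatim authority: www.BBC.co.uk:443
```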