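I'm trying to extract all domains from a text file and save them to another text file, but the output contains the domain names mixed with other strings, and it also returns: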
ads.css
abc.js
Kashi.png
My input is:
token$script,domain=liveresult.ru
euroiphone.eu##.div-download-h
||ausujet.com/skins/common/ads.js
@@||cyberdean.fr/js/advertisement.js
biggestplayer.me##.adblock + *
hearthhead.com,wowhead.com##.block-bg
wowhead.com##.block-bgimg
euroiphone.eu##.div-download-h
euroiphone.eu##.div-download-v
findretros.com##.fuck-adblock
@@||ausujet.com/skins/common/ads.js
@@||cyberdean.fr/js/advertisement.js
@@||dbz-fantasy.com/ads.css
@@||dev-dyod.fr/styles/ads.css
forums.ru###mdl_adb
ostroh.info###modal.modal-bg
7days2die.info###nafikblock
all-episodes.net###odin
There are many rules like these that I need to extract domains from.
My result should be:
liveresult.ru
cyberdean.fr
euroiphone.eu
ausujet.com
biggestplayer.me
hearthhead.com
wowhead.com
euroiphone.eu
ausujet.com
cyberdean.fr
dbz-fantasy.com
dev-dyod.fr
forums.ru
7days2die.info
I tried:
import re
domains = ['ru', 'fr', 'eu', 'com']
with open('easylist.txt', 'r') as f:
    a = f.read()
result = re.findall(r'[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', a)
unique_result = list(set(result))
for r in result:
    domain_name = r.split('.')[1]
    if domain_name in domains:
        file_out.write(r+/n)
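But with this approach I have to list every TLD by hand, which is laborious. I want a pattern that extracts the domains automatically and ignores things like ads.js, ads.css, advertise.js and so on. Please tell me what I'm doing wrong.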
Answer 0 (score: 0)
To write each string on its own line, use file_out.write(r + '\n'). You can also use a set to drop duplicate matches:
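import re

domains = ['ru', 'fr', 'eu', 'com']

with open('easylist.txt', 'r') as f:
    a = f.read()

result = re.findall(r'[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', a)
# A set drops duplicate matches
unique_result = set(result)

# 'domains.txt' is a placeholder name; the snippet in the question never opens file_out
with open('domains.txt', 'w') as file_out:
    for r in unique_result:
        # Take the last dot-separated label as the TLD
        domain_name = r.split('.')[-1]
        # Write the match only if its TLD is in the list of domains
        if domain_name in domains:
            file_out.write(r + '\n')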
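If keeping a TLD list up to date is the painful part, another option is to look at where the domain sits in each rule instead of at its suffix. The sketch below is only an illustration and assumes the file contains just the three rule shapes shown in the question: a hostname after `||` (optionally prefixed with `@@`), a comma-separated domain list before `##`, and a `domain=` option after `$`. The names `extract_domains` and `collect_domains` are made up for this example, and real filter lists contain more syntax than this handles.

```python
import re

# Hostname right after the "||" anchor, e.g. "||ausujet.com/skins/..." or "@@||..."
ANCHOR_RE = re.compile(r'^(?:@@)?\|\|([a-z0-9.-]+)', re.IGNORECASE)
# Value of a "domain=" option, e.g. "token$script,domain=liveresult.ru"
OPTION_RE = re.compile(r'\$(?:.*,)?domain=([^,]+)', re.IGNORECASE)
# Comma-separated domains in front of "##" (also "#@#" and "#?#")
COSMETIC_RE = re.compile(r'^([^#\s]+)#[@?]?#')


def extract_domains(line):
    """Return the domains named by one rule, based on the rule's shape."""
    line = line.strip()
    found = []
    m = COSMETIC_RE.match(line)
    if m:
        found = m.group(1).split(',')
    else:
        m = ANCHOR_RE.match(line)
        if m:
            found.append(m.group(1))
        m = OPTION_RE.search(line)
        if m:
            # Several domains in one option are separated by "|"
            found.extend(m.group(1).split('|'))
    # A leading "~" marks an excluded domain; drop the marker, keep the name
    return [d.lstrip('~').lower() for d in found if d.lstrip('~')]


def collect_domains(lines):
    """Collect domains from all rules, without duplicates, in first-seen order."""
    seen = {}
    for line in lines:
        for domain in extract_domains(line):
            seen.setdefault(domain, None)
    return list(seen)


if __name__ == '__main__':
    sample = [
        'token$script,domain=liveresult.ru',
        '||ausujet.com/skins/common/ads.js',
        'hearthhead.com,wowhead.com##.block-bg',
        'forums.ru###mdl_adb',
    ]
    for domain in collect_domains(sample):
        print(domain)
```

Because the path after the hostname is never matched, names such as ads.js or ads.css do not show up in the result, and a TLD like .info needs no special handling. To process the real file, pass the open file object to `collect_domains` and write each returned domain followed by '\n'.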