Retrieve unique websites from a list of URLs

Time: 2012-06-11 19:25:11

Tags: regex

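I have a list containing a large number of page URLs, and I want to retrieve the unique websites (host names) from it. For example, this input: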

"http://www.gadgetgiants.com/products/mica-8-inch-touchscreen-android-2-3-tablet-wifi-1-2ghz-cpu-flash10-3"
"http://www.malma.mx/products/pan-digital"
"http://www.gadgetgiants.com/products/snowpad-7-capacitive-multi-touch-screen-android-2-3-tabletwifi-samsung-cortex-a8-1-2ghz-cpu-camera-1080p-external-3g"
"http://www.spiritualityandwellness.com/products/internalized-motivation"
"http://www.spiritualityandwellness.com/products/evergreen-motivation"

should result in:

www.gadgetgiants.com
www.malma.mx
www.spiritualityandwellness.com

2 answers:

Answer 0 (score: 1)

egrep -o "www\.[a-zA-Z0-9.-]*\.[a-zA-Z]{2,4}" YOUR_FILE_NAME | sort -u

The regular expression is borrowed from elsewhere.

(Edit) Usage and output example:

$ cat ur.txt
"http://www.gadgetgiants.com/products/mica-8-inch-touchscreen-android-2-3"
"http://www.malma.mx/products/pan-digital"
"http://www.gadgetgiants.com/products/snowpad-7-capacitive-multi-touch"
"http://www.spiritualityandwellness.com/products/internalized-motivation"
"http://www.spiritualityandwellness.com/products/evergreen-motivation"
"http://www.swellness.com.au/products/evergreen-motivation"

$ egrep -o "www\.[a-zA-Z0-9.-]*\.[a-zA-Z]{2,4}" ur.txt | sort -u
www.gadgetgiants.com
www.malma.mx
www.spiritualityandwellness.com
www.swellness.com.au
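
Note that the pattern above only matches hosts that start with "www." and allows 2 to 4 letters after the last dot, so a host without the "www." prefix is skipped and a longer top-level domain is cut off after four letters. A possible variant, offered only as a sketch, keys on the "://" separator instead. It assumes every line holds a single URL with a scheme; the function name unique_hosts and the third sample URL are made up for illustration.

```shell
#!/bin/sh
# Sketch: print each distinct host found right after "://" in the given file.
# unique_hosts is a hypothetical helper name, not part of the original answer.
unique_hosts() {
    # -o prints only the matched part, e.g. "://www.malma.mx";
    # the match stops at the first "/" or double quote.
    grep -Eo '://[^/"]+' "$1" |
        cut -c4- |   # drop the leading "://"
        sort -u      # sort and keep one copy of each host
}

# Demo input: the last URL is invented, with no "www." and a long TLD.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
"http://www.gadgetgiants.com/products/mica-8-inch-touchscreen-android-2-3"
"http://www.malma.mx/products/pan-digital"
"http://www.gadgetgiants.com/products/snowpad-7-capacitive-multi-touch"
"http://shop.example.online/products/pan-digital"
EOF

unique_hosts "$tmp"
rm -f "$tmp"
```

A port number or user name in the URL would stay in the output with this approach, so it is only suitable for simple lists like the one in the question.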

Answer 1 (score: 0)

An idea that does not use a regular expression:

Retrieve the host from each address:

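// Requires: using System;
Uri uri = new Uri(yourLink);   // yourLink: one URL string from the list
string host = uri.Host;        // host name only, e.g. "www.malma.mx"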

Now you can put all of these hosts into a HashSet (or any other set-like collection), which keeps only one copy of each.
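
The same "take the host, then collect into a set" idea can also be sketched on the command line, to stay with the tooling of the other answer. This is not the C# code above, only an approximation: it splits each line on "/" instead of parsing the URL, and an awk array stands in for the HashSet. The function name hosts_in_order is hypothetical, and the sketch assumes one quoted URL per line, as in the question.

```shell
#!/bin/sh
# Sketch: split each URL on "/" and use an awk array as the set of hosts.
# hosts_in_order is a hypothetical helper name.
hosts_in_order() {
    # With "/" as the field separator, "http://host/path" gives
    # $1 = "http:", $2 = "" and $3 = "host".
    # seen[] plays the role of the HashSet: a host is printed only
    # the first time it appears, so input order is preserved.
    tr -d '"' < "$1" |
        awk -F/ '$3 != "" && !seen[$3]++ { print $3 }'
}

# Demo input taken from the question (paths shortened).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
"http://www.gadgetgiants.com/products/mica-8-inch-touchscreen-android-2-3"
"http://www.malma.mx/products/pan-digital"
"http://www.gadgetgiants.com/products/snowpad-7-capacitive-multi-touch"
"http://www.spiritualityandwellness.com/products/internalized-motivation"
EOF

hosts_in_order "$tmp"
rm -f "$tmp"
```

Unlike Uri.Host, plain field splitting leaves a port or user name attached to the host, so a real URL parser is the safer choice when the list is not as uniform as the sample.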