Question

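I have this bash script, which I wrote to analyze the HTML of any given web page. What it should actually do is return the domains found on that page. Currently it returns the number of URLs on the page.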

#!/bin/sh

echo "Enter a url eg www.bbc.com:"
read url
content=$(wget "$url" -q -O -)
echo "Enter file name to store URL output"
read file
echo $content > $file
echo "Enter file name to store filtered links:"
read links
found=$(cat $file | grep -o -E 'href="([^"#]+)"' | cut -d '"' -f2 | sort | uniq | awk   '/http/' > $links)
output=$(egrep -o '^http://[^/]+/' $links | sort | uniq -c > out)
cat out

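How can I make it return the domains instead of the URLs? From my programming knowledge I know it should parse from the right, but I am new to bash scripting. Can someone help me? This is as far as I have got.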

Answer 1

I know there is a better way to do this in awk, but you can do it with sed by adding this after awk '/http/':

| sed -e 's;https\?://;;' | sed -e 's;/.*$;;'

Then you will want to move your sort and uniq to the end of that.

So the whole line looks like:

found=$(cat $file | grep -o -E 'href="([^"#]+)"' | cut -d '"' -f2 | awk   '/http/' | sed -e 's;https\?://;;' | sed -e 's;/.*$;;' | sort | uniq -c > out)

And you can get rid of this line:

output=$(egrep -o '^http://[^/]+/' $links | sort | uniq -c > out)
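
Putting these steps together, a minimal sketch might look like the following. It assumes the HTML arrives on standard input; the function name list_domains and the sample HTML are made up for illustration, and it uses sed -E with https? in place of the GNU-specific \? so that it should also work with BSD sed.

```shell
#!/bin/sh
# Hypothetical helper: read HTML on stdin, print each linked host with a count.
list_domains() {
    grep -o -E 'href="[^"#]+"' |     # pull out href="..." attributes
        cut -d '"' -f2 |             # keep only the attribute value
        awk '/http/' |               # keep links that mention http (as in the question)
        sed -E 's;https?://;;' |     # strip the scheme
        sed -e 's;/.*$;;' |          # strip everything after the host
        sort | uniq -c               # count each host
}

# Made-up sample page, for illustration only.
printf '%s\n' \
    '<a href="http://www.example.com/a">a</a>' \
    '<a href="https://www.example.com/b">b</a> <a href="http://example.org/">c</a>' \
    '<a href="/relative">d</a>' |
    list_domains
```

Because sort and uniq -c run after the path has been removed, links to different pages on the same host are counted together.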

Answer 2

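Edit 2: Note that you may want to adjust the search pattern in the sed expression to suit your needs. This solution only considers the http[s]?:// protocol and www. servers...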

Edit:
If you want both the counts and the domain names:

lynx -dump -listonly http://zelleke.com | \
  sed -n -E '4,$ s@^.*http[s]?://([^/]*).*$@\1@p' | \
  sort | \
  uniq -c | \
  sed 's/www.//'

gives

      2 wordpress.org
     10 zelleke.com

Original answer:

You may want to use lynx to extract the links from the URL:
lynx -dump -listonly http://zelleke.com

gives

# blank line at the top of the output
References

   1. http://www.zelleke.com/feed/
   2. http://www.zelleke.com/comments/feed/
   3. http://www.zelleke.com/
   4. http://www.zelleke.com/#content
   5. http://www.zelleke.com/#secondary
   6. http://www.zelleke.com/
   7. http://www.zelleke.com/wp-login.php
   8. http://www.zelleke.com/feed/
   9. http://www.zelleke.com/comments/feed/
  10. http://wordpress.org/
  11. http://www.zelleke.com/
  12. http://wordpress.org/

From this output, you can get the desired result with:

lynx -dump -listonly http://zelleke.com | \
  sed -n '4,$ s@^.*http://\([^/]*\).*$@\1@p' | \
  sort -u | \
  sed 's/www.//'

gives

wordpress.org
zelleke.com
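
As a variation on the pipeline above, the sketch below strips a leading www. before counting, so that www.example.com and example.com would end up in the same bucket. It is only an illustration: the function name count_hosts is made up, it reads already-saved lynx -dump -listonly style text from standard input instead of calling lynx, and it selects numbered link lines rather than using the 4,$ line address.

```shell
#!/bin/sh
# Hypothetical helper: read "lynx -dump -listonly" style text on stdin and
# print "count host" pairs, folding www.example.com into example.com first.
count_hosts() {
    sed -n -E 's@^ *[0-9]+\. https?://([^/]*).*$@\1@p' |  # host of each numbered link
        sed -e 's/^www\.//' |                              # drop a leading "www."
        sort | uniq -c                                     # count each host
}

# Shortened sample in the same shape as the lynx listing.
count_hosts <<'EOF'

References

   1. http://www.zelleke.com/feed/
   2. http://www.zelleke.com/
   3. http://wordpress.org/
EOF
```

Anchoring the www. pattern with ^ and escaping the dot avoids removing text such as "wwwx" from the middle of a host name.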

Answer 3

You can use sed to strip the scheme and the path from the URLs:

sed 's@http://@@; s@/.*@@'

I would also like to point out that these two lines are wrong:

found=$(cat $file | grep -o -E 'href="([^"#]+)"' | cut -d '"' -f2 | sort | uniq | awk   '/http/' > $links)
output=$(egrep -o '^http://[^/]+/' $links | sort | uniq -c > out)

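You have to use either the redirection (> out) or the command substitution $(), but not both at once, because in that case the variable will be empty.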
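
The effect can be seen in a small sketch like the one below. The variable names and the temporary file are made up for the demonstration.

```shell
#!/bin/sh
tmp=$(mktemp)   # scratch file for the demonstration

# Redirecting inside $() sends stdout to the file, so nothing is left to capture.
captured=$(printf 'example.com\n' > "$tmp")
echo "captured with redirect: '$captured'"
echo "file contents: '$(cat "$tmp")'"

# Without the redirect, the variable receives the output.
captured_plain=$(printf 'example.com\n')
echo "captured without redirect: '$captured_plain'"

rm -f "$tmp"
```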

This part

content=$(wget "$url" -q -O -)
echo $content > $file

would also be better written this way:

wget "$url" -q -O - > $file
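
One likely reason the direct redirect is preferable: an unquoted $content is split into words by the shell, so echo writes the whole page on a single line. The sketch below illustrates this with a made-up two-line string standing in for a downloaded page.

```shell
#!/bin/sh
# Made-up two-line "page" standing in for the wget output.
content=$(printf 'line one\nline two\n')

# Unquoted expansion: the shell splits on whitespace and echo rejoins with spaces.
unquoted=$(echo $content)

# Quoted expansion keeps the newline between the two lines.
quoted=$(printf '%s\n' "$content")

printf 'unquoted: %s\n' "$unquoted"
printf 'quoted:\n%s\n' "$quoted"
```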

Answer 4

You may be interested in this:

http://tools.ietf.org/html/rfc3986#appendix-B

It explains how to parse a URI with a regular expression.

So you can parse the URI from the left in this way and extract the "authority", which contains the domain and subdomain names.

sed -r 's_^([^:/?#]+:)?(//([^/?#]*))?.*_\3_g';
grep -Eo '[^\.]+\.[^\.]+$' # pipe with first line, give what you need

This is also interesting:

http://www.scribd.com/doc/78502575/124/Extracting-the-Host-from-a-URL

Assuming that URLs always start this way

https?://(www\.)?

is very dangerous.
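
To illustrate, here is a small sketch that applies the sed expression from this answer to URIs that do not start with http:// or www. The function name uri_authority and the sample URIs are made up, and -E is used in place of -r on the assumption that the local sed accepts it (current GNU and BSD sed both do).

```shell
#!/bin/sh
# Hypothetical helper: print the authority component of each URI on stdin,
# using the simplified RFC 3986 appendix B expression from this answer.
uri_authority() {
    sed -E 's_^([^:/?#]+:)?(//([^/?#]*))?.*_\3_'
}

# Made-up URIs: another scheme, a scheme-relative link, and one with
# user info and a port.
printf '%s\n' \
    'ftp://files.example.org/pub/readme.txt' \
    '//cdn.example.net/lib.js' \
    'https://user@shop.example.com:8443/cart?id=1#top' |
    uri_authority
```

Note that the authority may still carry user info and a port (user@shop.example.com:8443 for the last sample), so a further step would be needed if only the host name is wanted.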

Bash script to return domains instead of URLs
