I'm trying to write a sed expression that removes URLs from a file.
Example:
http://samgovephotography.blogspot.com/ updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:)
Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N https://hollywoodmomblog.com/?p=2442 Thx to HMB Contributor @kdpartak :)
But I can't get it to work:
sed 's/[\w \W \s]*http[s]*:\/\/\([\w \W]\)\+[\w \W \s]*/ /g' posFile
FIXED!
This handles almost all cases, even malformed URLs:
sed 's/[\w \W \s]*http[s]*[a-zA-Z0-9 : \. \/ ; % " \W]*/ /g' positiveTweets | grep "http" | more
Answer 0: (score: 10)
The following removes http:// or https:// and everything up to the next whitespace character:
sed -e 's!http\(s\)\{0,1\}://[^[:space:]]*!!g' posFile
updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:)
Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N Thx to HMB Contributor @kdpartak :)
Edit:
I should have used:
sed -e 's!http[s]\?://\S*!!g' posFile
"[s]\?" is a far more readable way of writing "an optional s" than "\(s\)\{0,1\}".
"\S*" is likewise a more readable version of "any non-space characters" than "[^[:space:]]*".
I must have been using the sed that ships with macOS when I wrote this answer (brew install gnu-sed FTW). Both \? and \S are GNU sed extensions.
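There are better URL regexes out there (ones that account for schemes other than HTTP(S), for example), but this one does the job for the examples you gave. Why complicate things?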
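For readers without GNU sed, here is a minimal sketch of the same substitution written with POSIX-only syntax, so it should behave the same under GNU sed and the BSD sed found on macOS. The function name `strip_urls` and the shortened sample tweets are assumptions made for illustration; they are not part of the original answer.

```shell
#!/bin/sh
# Shortened versions of the two tweets from the question (illustrative input).
input='http://samgovephotography.blogspot.com/ updated my blog just a little bit ago.
Meet Former Child Star https://hollywoodmomblog.com/?p=2442 Thx to HMB Contributor'

# strip_urls is a hypothetical helper name.
# s\{0,1\} makes the "s" optional; [^[:space:]]* consumes the rest of the URL
# up to (but not including) the next whitespace character.
strip_urls() {
    sed -e 's!https\{0,1\}://[^[:space:]]*!!g'
}

result=$(printf '%s\n' "$input" | strip_urls)
printf '%s\n' "$result"
```

Only the URL itself is removed, so the whitespace around it stays behind: a leading space on the first line and a double space in the second.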
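Answer 1: (score: 0)
The accepted answer provided the approach I used to remove URLs and the like from my files. However, it left "blank" lines behind. Here is a solution.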
sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' input_file
perl -i -pe 's/^'`echo "\012"`'${2,}//g' input_file
The GNU sed flags and expressions used are:
-i Edit in-place
-e [-e script] --expression=script : basically, add the commands in script
(expression) to the set of commands to be run while processing the input
^ Match start of line
$ Match end of line
? Match zero or one of the preceding regular expression (written \? in GNU sed basic regex)
{2,} Match 2 or more of preceding regular expression
\S* Zero or more non-space characters; alternative to: [^[:space:]]*
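The quantifiers in the table above can be checked with a small experiment. This sketch uses `grep -E` (extended regex, where `?` and `{2,}` need no backslashes) to count how many candidate strings each pattern matches. The helper name `count_matches` and the test strings are assumptions chosen purely for illustration.

```shell
#!/bin/sh
# count_matches is a hypothetical helper: it prints how many of the given
# strings (one per line) match the extended regular expression in $1.
count_matches() {
    pattern=$1
    shift
    printf '%s\n' "$@" | grep -Ec -- "$pattern"
}

# "?" = zero or one of the preceding item: matches http and https, not httpss.
optional_s=$(count_matches '^https?$' http https httpss)

# "{2,}" = two or more of the preceding item: matches aa, aaa and aaaa, not a.
two_or_more=$(count_matches '^a{2,}$' a aa aaa aaaa)

# "[^[:space:]]*" = zero or more non-space characters (portable spelling of \S*):
# matches "abc" and the empty string, not "a b".
non_space=$(count_matches '^[^[:space:]]*$' 'abc' 'a b' '')

echo "optional_s=$optional_s two_or_more=$two_or_more non_space=$non_space"
```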
However,
sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g'
leaves non-printing characters behind, presumably \n (newlines). Standard sed-based approaches for removing "blank" lines, tabs and spaces, e.g.
sed -i 's/^[ \t]*//; s/[ \t]*$//'
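do not work here: unless you use a "branch label" to process newlines, sed's s command cannot replace them, because sed reads input one line at a time and the trailing newline is not part of the text it matches against.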
The solution is to use the following perl expression:
perl -i -pe 's/^'`echo "\012"`'${2,}//g'
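This uses a shell substitution, '`echo "\012"`', to replace the octal value \012 (i.e., a newline, \n) occurring 2 or more times, {2,} (otherwise we would unwrap all lines), with something else; here //, i.e., nothing.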
[The second reference below provides a wonderful table of these values!]
The perl flags used are:
-p Places a printing loop around your command,
so that it acts on each line of standard input
-i Edit in-place
-e Allows you to provide the program as an argument,
rather than in a file
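As an alternative to the perl step, and not the method used in this answer, sed's `d` command can delete whole lines that end up empty or whitespace-only. It does this without substituting newline characters, so no branch labels are needed. The sketch below is hedged accordingly: the function name `strip_urls_and_blank_lines` is hypothetical, the input is a shortened version of the example data, and the approach also deletes lines that were already blank before the URLs were removed, which may or may not be what you want.

```shell
#!/bin/sh
# Hypothetical helper: remove http(s), www. and ftp URLs, then delete any
# line that is left empty or contains only whitespace.
# Caveat: lines that were blank in the original input are deleted as well.
strip_urls_and_blank_lines() {
    sed -e 's!https\{0,1\}://[^[:space:]]*!!g' \
        -e 's!www\.[^[:space:]]*!!g' \
        -e 's!ftp:[^[:space:]]*!!g' \
        -e '/^[[:space:]]*$/d'
}

# Shortened sample input (illustrative only).
cleaned=$(printf '%s\n' \
    'Some text ...' \
    'https://stackoverflow.com/questions/4283344/sed-to-remove-urls-from-a-file' \
    'www.google.ca/?q=halifax' \
    'ftp://ftp.ncbi.nlm.nih.gov/' \
    'Some more text.' | strip_urls_and_blank_lines)
printf '%s\n' "$cleaned"
```

With this input, only the two text lines remain. The patterns use POSIX bracket expressions rather than \S, so they do not depend on GNU extensions.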
References:
Example:
$ cat url_test_input.txt
Some text ...
https://stackoverflow.com/questions/4283344/sed-to-remove-urls-from-a-file
https://www.google.ca/search?dcr=0&ei=QCsyWtbYF43YjwPpzKyQAQ&q=python+remove++citations&oq=python+remove++citations&gs_l=psy-ab.3...1806.1806.0.2004.1.1.0.0.0.0.61.61.1.1.0....0...1c.1.64.psy-ab..0.0.0....0.-cxpNc6youY
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html
http://datasynce.org/2017/05/sentiment-analysis-on-python-through-textblob/
https://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
http://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
ftp://ftp.ncbi.nlm.nih.gov/
ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/alignment_indices/20100804.alignment.index
Some more text.
$ sed -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' url_test_input.txt > a
$ cat a
Some text ...
Some more text.
$ perl -i -pe 's/^'`echo "\012"`'${2,}//g' a
Some text ...
Some more text.
$