我正在使用Twitter数据,并且稍微清理数据,我想摆脱所有标点符号。我能够轻松地做到这一点,但我的问题是我还想保留URL,其中包括一些标点符号。
例如,让我们说Tweet A的内容是:
tweet = "check out, my http://google.com/324fasdcsd?asdf=32& , .! :) links https://google.com/ersf8vad?dasd=d&d=2 hooray!".
我可以使用以下代码消除标点符号。但是,这会消除所有标点符号,包括URL中的标点符号。
cleaned = re.sub(r'[^a-zA-Z0-9\s]','',tweet)
这会产生:
cleaned = "check out my httpgooglecom324fasdcsdasdf32 links httpsgooglecomersf8vaddasddd2 hooray"
但是,我希望最终输出看起来像URL中的标点符号保持在哪里:
cleaned = "check out my http://google.com/324fasdcsd?asdf=32& links https://google.com/ersf8vad?dasd=d&d=2 hooray".
使用Python,我该怎么做?在此先感谢您的帮助!
答案 0 :(得分:2)
使用John Gruber's regex查找网址:
import re
gruber = re.compile(r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))""")
在网址上拆分推文:
tweet = "This is my site http://www.example.com/, and this site http://stackoverflow.com rules!"
split_tweet = gruber.split(tweet)
你得到一个字符串列表。非URL始终是列表中的偶数条目,URL是奇数。因此,我们可以迭代列表并从偶数编号中删除标点符号。 (出现range()
迭代的罕见用例!)
from string import punctuation
punc_table = {ord(c): None for c in punctuation)
for i in range(0, len(split_tweet), 2):
split_tweet[i] = split_tweet[i].translate(punc_table)
现在我们一起加入它:
final_tweet = "".join(split_tweet)
这是Python,其中大部分可以使用生成器表达式在一行中完成,因此最终代码为:
import re
from string import punctuation
punc_table = {ord(c): None for c in punctuation)
gruber = re.compile(r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))""")
tweet = "This is my site http://www.example.com/, and this site http://stackoverflow.com rules!"
final_tweet = "".join(t if i % 2 else t.translate(punc_table) for (i, t) in enumerate(gruber.split(tweet)))
请注意,我使用了{3}的Python 3样式。对于Python 2,您不需要创建str.translate
,只需使用{Wevenman的答案中所见的punc_table
即可。您也可能希望使用text.translate(None, punctuation)
代替xrange
。
答案 1 :(得分:1)
这是一种方法。首先找到网址,然后找到所有标点符号,然后删除网址中没有的任何标点符号。
可能不是最有效的方法,但至少它比疯狂的正则表达式更容易理解!
import re
def remove_punc_except_urls(s, punctuationRegex=r'[^a-zA-Z0-9\s]'):
# arrays to keep track of indices
urlInds = []
puncInds = []
# find all the urls
for m in re.finditer(r'(https?|ftp)://[^\s/$.?#].[^\s]*', s):
urlInds.append((m.start(0), m.end(0)))
# find all the punctuation
for m in re.finditer(punctuationRegex, s):
puncInds.append((m.start(0), m.end(0)))
# start removing punctuation from end so that indices do not change
puncInds.reverse()
# go through each of the punctuation indices and remove the character if it is not inside a url
for puncRange in puncInds:
inUrl = False
# check each url to see if this character is in it
for urlRange in urlInds:
if puncRange[0] >= urlRange[0] and puncRange[0] <= urlRange[1]:
inUrl = True
break
if not inUrl:
# remove the punctuation from the string
s = s[:puncRange[0]] + s[puncRange[1]:]
return s
以下是您的例子:
samp = 'check out, my http://google.com/324fasdcsd?asdf=32& , .! :) links https://google.com/ersf8vad?dasd=d&d=2 hooray!'
print(samp)
print(remove_punc_except_urls(samp))
输出:
check out, my http://google.com/324fasdcsd?asdf=32& , .! :) links https://google.com/ersf8vad?dasd=d&d=2 hooray!
check out my http://google.com/324fasdcsd?asdf=32& links https://google.com/ersf8vad?dasd=d&d=2 hooray
答案 2 :(得分:0)
假设您的推文内容存储为名为string
的{{1}}:
tweet
答案 3 :(得分:0)
你可以做到的一种方法是找到网址;移除并保存它们;去掉了痘痘;找到新破坏的网址;并用保存的那些替换破碎的那些:
import re
tweet = "check out, my http://google.com/324fasdcsd?asdf=32& , .! :) links https://google.com/ersf8vad?dasd=d&d=2 hooray!"
urls_real = []
urls_busted = []
p = re.compile("http\S*")
for m in p.finditer(tweet):
urls_real.append(m.group())
tweet = re.sub(r'[^a-zA-Z0-9\s]','',tweet)
for m in p.finditer(tweet):
urls_busted.append(m.group())
for i in range(len(urls_real)):
tweet = tweet.replace(urls_busted[i], urls_real[i])
print(tweet)
结果:
check out my http://google.com/324fasdcsd?asdf=32& links https://google.com/ersf8vad?dasd=d&d=2 hooray
此代码要求普通和已破坏的网址都以http
开头,并以空白字符结尾。埃里克斯在他的回答中使用的正则表达式也有效(并且更强大)。