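As part of an Information Retrieval project in Python (building a small search engine), I want to keep only the plain text of some downloaded tweets (a .csv tweet dataset, 27,000 tweets to be exact). A tweet looks like this: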
"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —@POTUS https://twitter.com/OZRd5o4wRL
or
"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —@POTUS in Greece https://twitter.com/PIO9dG2qjX
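I want to use a regular expression to remove the unnecessary parts of each tweet, such as URLs, punctuation, and so on.
So the result would be: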
"The basic longing to live with dignity these yearnings are universal They burn in every human heart POTUS"
and
"Democracy allows us to peacefully work through our differences and move closer to our ideals POTUS in Greece"
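I tried this:
pattern = RegexpTokenizer(r'[A-Za-z]+|^[0-9]')
but it does not work perfectly; for example, parts of the URL still remain in the result.
Please help me find a regular expression pattern that does what I need.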
Answer 0 (score: 1)
This may help.
Demo:
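One way to get the output asked for in the question, as a minimal sketch and not necessarily the answerer's exact code. It uses only the standard `re` module. The helper name `clean_tweet` and the two patterns are assumptions chosen for illustration. The key point is ordering: a pattern like `[A-Za-z]+` also matches the letter runs inside a URL ("https", "twitter", "com", ...), so the URL has to be removed before the letters are extracted.

```python
import re

# Assumed patterns, chosen for illustration:
# a URL is "http://" or "https://" followed by any run of non-whitespace.
URL_RE = re.compile(r"https?://\S+")
# Anything that is not an ASCII letter: punctuation, digits, "@", dashes, spaces.
NON_LETTER_RE = re.compile(r"[^A-Za-z]+")


def clean_tweet(text: str) -> str:
    """Return only the words of a tweet, separated by single spaces."""
    # 1. Remove URLs first so their fragments never reach step 2.
    text = URL_RE.sub(" ", text)
    # 2. Collapse every run of non-letter characters into one space.
    text = NON_LETTER_RE.sub(" ", text)
    # 3. Drop the space left at either end.
    return text.strip()


if __name__ == "__main__":
    # The two sample tweets from the question (\u2014 is the dash before @POTUS).
    tweets = [
        '"The basic longing to live with dignity...these yearnings are '
        'universal. They burn in every human heart 1234." \u2014@POTUS '
        "https://twitter.com/OZRd5o4wRL",
        '"Democracy...allows us to peacefully work through our differences, '
        'and move closer to our ideals" \u2014@POTUS in Greece '
        "https://twitter.com/PIO9dG2qjX",
    ]
    for tweet in tweets:
        print(clean_tweet(tweet))
```

On the two sample tweets this prints the two desired results from the question. If you would rather keep `RegexpTokenizer`, the same idea should carry over: strip URLs with `URL_RE.sub(" ", text)` first, then tokenize with `[A-Za-z]+`. Note that this sketch also drops digits and non-ASCII letters, which matches the examples given but may be too aggressive for other tweets.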