Question

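As part of an Information Retrieval project in Python (building a small search engine), I want to keep only the plain text of some downloaded tweets (a .csv tweets dataset, 27,000 tweets to be exact). A tweet looks like this: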

"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." ‚Äî@POTUS https://twitter.com/OZRd5o4wRL

or

"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" ‚Äî@POTUS in Greece https://twitter.com/PIO9dG2qjX

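I want to use a regular expression to remove the unnecessary parts of each tweet, such as URLs, punctuation, and so on.

So the result would be: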

"The basic longing to live with dignity these yearnings are universal They burn in every human heart POTUS"

and

"Democracy allows us to peacefully work through our differences and move closer to our ideals POTUS in Greece"

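I tried this: pattern = RegexpTokenizer(r'[A-Za-z]+|^[0-9]'), but it does not work perfectly; for example, parts of the URL still remain in the result.

Please help me find a regular expression pattern that does what I need.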

Answer 1

This might help.

Demo:

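A minimal sketch of one possible approach, using only Python's standard re module (this is an illustration, not necessarily the answerer's original code). The attempted pattern [A-Za-z]+ also matches the letter runs inside a URL (https, twitter, com, and so on), which is why URL fragments survive. Removing the URL first and only then collapsing every run of non-letter characters into a single space avoids that. The names URL_RE, NON_LETTER_RE and clean_tweet are made up for this example, and the sketch assumes that only ASCII letters should be kept, as in the expected output in the question.

```python
import re

# Assumption: URLs start with http://, https:// or www. and run to the next whitespace.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Anything that is not an ASCII letter (digits, punctuation, '@', stray encoding debris).
NON_LETTER_RE = re.compile(r"[^A-Za-z]+")


def clean_tweet(text):
    """Return the tweet with URLs removed and only ASCII words kept."""
    text = URL_RE.sub(" ", text)         # 1. drop URLs before touching anything else
    text = NON_LETTER_RE.sub(" ", text)  # 2. each run of non-letters becomes one space
    return text.strip()                  # 3. trim leading/trailing spaces


# The two sample tweets from the question.
tweets = [
    '"The basic longing to live with dignity...these yearnings are universal. '
    'They burn in every human heart 1234." ‚Äî@POTUS https://twitter.com/OZRd5o4wRL',
    '"Democracy...allows us to peacefully work through our differences, '
    'and move closer to our ideals" ‚Äî@POTUS in Greece https://twitter.com/PIO9dG2qjX',
]

for tweet in tweets:
    print(clean_tweet(tweet))
```

If the tokens are needed as a list (for example, to build an index), clean_tweet(tweet).split() gives them directly.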
