Question

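I am reading an RSS feed ('http://www.reddit.com/new/.rss?sort=new') and uploading it to a SQL database. These are my steps:

From that URL I am able to create a pandas DataFrame, which I then upload to the SQL database. The DataFrame columns are title, link, summary, author and tags. What is the best way to clean the summary column and strip out all the tags?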

'<!-- SC_OFF --><div class="md"><p>The title says most of it, I’m running about a 12-13 min mile. I haven’t run in about 4.5 years and I need to get to my fastest 1.5 with more in the tank afterward, and I need it to be solid. </p> <p>I’ve read blogs and running guides, but I thought I’d get it from the source, people who just love to run, just like the way I used to love to lift. </p> <p>I guess my question is, where do I start? Some say football conditioning, others say just run… Some even say just walk. I’m trying to slim down fast and have a solid mile and a half to 2-mile sprint. </p> <p>The only other conditioning I’m doing right now is three days of fight sports (2 Krav/kickboxing, 1 combat fitness style). Looking at running 3ish days and taking Sunday off.</p> </div><!-- SC_ON --> &#32; submitted by &#32; <a href="https://www.reddit.com/user/Logical_penguin"> /u/Logical_penguin </a> &#32; to &#32; <a href="https://www.reddit.com/r/running/"> r/running </a> <br/> <span><a href="https://www.reddit.com/r/running/comments/drt0nf/im_65_335lbs_ex_amature_strong_man_and_i_need_help/">[link]</a></span> &#32; <span><a href="https://www.reddit.com/r/running/comments/drt0nf/im_65_335lbs_ex_amature_strong_man_and_i_need_help/">[comments]</a></span>'

I can use the following for one part of it:

df['summary'] = df['summary'].map(lambda x: x.lstrip('<!-- SC_OFF --->'))

However, doing this for everything in the summary column would take far too long.

Answer 1

import re
df['summary'] = df['summary'].map(lambda x: re.sub('<[^<]+?>', '', x))

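This strips the tags. For your example, the result is (HTML entities such as &#32; are not tags, so they are left in place):

'The title says most of it, I’m running about a 12-13 min mile. I haven’t run in about 4.5 years and I need to get to my fastest 1.5 with more in the tank afterward, and I need it to be solid.  I’ve read blogs and running guides, but I thought I’d get it from the source, people who just love to run, just like the way I used to love to lift.  I guess my question is, where do I start? Some say football conditioning, others say just run… Some even say just walk. I’m trying to slim down fast and have a solid mile and a half to 2-mile sprint.  The only other conditioning I’m doing right now is three days of fight sports (2 Krav/kickboxing, 1 combat fitness style). Looking at running 3ish days and taking Sunday off.  &#32; submitted by &#32;  /u/Logical_penguin  &#32; to &#32;  r/running   [link] &#32; [comments]'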
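The regex only removes markup; character references such as &#32; in the sample survive it. Below is a minimal sketch of one way to deal with them, assuming the standard-library html.unescape is acceptable here and that collapsing runs of whitespace is wanted. The helper name clean_summary and the shortened sample string are hypothetical, not part of the original answer.

```python
import html
import re

TAG_RE = re.compile(r'<[^<]+?>')  # same pattern as in the answer, compiled once
WS_RE = re.compile(r'\s+')        # any run of whitespace


def clean_summary(text):
    """Strip tags, decode HTML entities, and collapse runs of whitespace."""
    no_tags = TAG_RE.sub('', text)
    decoded = html.unescape(no_tags)  # '&#32;' becomes a plain space
    return WS_RE.sub(' ', decoded).strip()


# Shortened, made-up sample in the same shape as the feed summary.
sample = ('<p>Looking at running.</p> &#32; submitted by &#32; '
          '<a href="#"> /u/Logical_penguin </a>')
print(clean_summary(sample))
```

If this fits the data, it could be applied to the column with df['summary'].map(clean_summary).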

Performance of re and lstrip for a single operation on the same example:
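A minimal sketch of how such a timing could be reproduced with the standard-library timeit module. The shortened sample string, the helper names and the repetition count are assumptions for illustration, and no figures are quoted because they vary by machine.

```python
import re
import timeit

# Shortened stand-in for the summary string from the question.
sample = ('<!-- SC_OFF --><div class="md"><p>The title says most of it.</p> '
          '</div><!-- SC_ON -->')


def strip_with_re(text):
    # The answer's approach: remove every <...> tag.
    return re.sub('<[^<]+?>', '', text)


def strip_with_lstrip(text):
    # The question's attempt. Note that lstrip removes any leading characters
    # that appear in the given set, not the literal prefix.
    return text.lstrip('<!-- SC_OFF --->')


n = 10_000  # number of repetitions (arbitrary choice)
t_re = timeit.timeit(lambda: strip_with_re(sample), number=n)
t_ls = timeit.timeit(lambda: strip_with_lstrip(sample), number=n)
print(f're.sub: {t_re / n * 1e6:.2f} us per call')
print(f'lstrip: {t_ls / n * 1e6:.2f} us per call')
```

The two functions do not produce the same result, so the numbers only indicate the relative cost of a regex substitution, not a like-for-like comparison.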

For a DataFrame with 10 rows (all holding the same string):

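In this case, re is more powerful than lstrip / rstrip, which strip a set of characters from the ends of a string rather than a literal prefix or suffix. The comparison here is only meant to give an idea of how much time re takes to run.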

Also, to save time, it is better to apply re to df.values rather than using apply / map.
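The answer does not show what this looks like, so the following is only one possible reading: a minimal sketch that loops over the underlying array of strings with a precompiled pattern. The two-row DataFrame is made up for illustration, and the code assumes the column holds no missing values (a NaN would make the substitution raise a TypeError).

```python
import re

import pandas as pd

TAG_RE = re.compile(r'<[^<]+?>')  # same pattern as above, compiled once

# Made-up data standing in for the feed summaries.
df = pd.DataFrame({
    'summary': ['<p>first <b>post</b></p>', '<div class="md">second</div>'],
})

# Iterate over the raw array of strings instead of Series.map with a lambda.
df['summary'] = [TAG_RE.sub('', s) for s in df['summary'].values]

print(df['summary'].tolist())
```

Whether this is actually faster than map on a given dataset is worth measuring before relying on it.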
