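I'm reading an RSS feed ('http://www.reddit.com/new/.rss?sort=new') and uploading it to a SQL database. Here are my steps:
From that URL I can build a pandas DataFrame and then upload it to the SQL database. The DataFrame's columns are title, link, summary, author, and tags. What is the best way to clean up the summary column and strip all the HTML tags from it?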
'<!-- SC_OFF --><div class="md"><p>The title says most of it, I’m running about a 12-13 min mile. I haven’t run in about 4.5 years and I need to get to my fastest 1.5 with more in the tank afterward, and I need it to be solid. </p> <p>I’ve read blogs and running guides, but I thought I’d get it from the source, people who just love to run, just like the way I used to love to lift. </p> <p>I guess my question is, where do I start? Some say football conditioning, others say just run… Some even say just walk. I’m trying to slim down fast and have a solid mile and a half to 2-mile sprint. </p> <p>The only other conditioning I’m doing right now is three days of fight sports (2 Krav/kickboxing, 1 combat fitness style). Looking at running 3ish days and taking Sunday off.</p> </div><!-- SC_ON -->   submitted by   <a href="https://www.reddit.com/user/Logical_penguin"> /u/Logical_penguin </a>   to   <a href="https://www.reddit.com/r/running/"> r/running </a> <br/> <span><a href="https://www.reddit.com/r/running/comments/drt0nf/im_65_335lbs_ex_amature_strong_man_and_i_need_help/">[link]</a></span>   <span><a href="https://www.reddit.com/r/running/comments/drt0nf/im_65_335lbs_ex_amature_strong_man_and_i_need_help/">[comments]</a></span>'
I can use the following for part of it:
df['summary'] = df['summary'].map(lambda x: x.lstrip('<!-- SC_OFF --->'))
However, doing this for everything in the summary column would take far too long.
Answer 0: (score: 0)
import re
df['summary'] = df['summary'].map(lambda x: re.sub('<[^<]+?>', '', x))
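This removes the tags. For your example, the result would be:
'The title says most of it, I’m running about a 12-13 min mile. I haven’t run in about 4.5 years and I need to get to my fastest 1.5 with more in the tank afterward, and I need it to be solid.  I’ve read blogs and running guides, but I thought I’d get it from the source, people who just love to run, just like the way I used to love to lift.  I guess my question is, where do I start? Some say football conditioning, others say just run… Some even say just walk. I’m trying to slim down fast and have a solid mile and a half to 2-mile sprint.  The only other conditioning I’m doing right now is three days of fight sports (2 Krav/kickboxing, 1 combat fitness style). Looking at running 3ish days and taking Sunday off.    submitted by    /u/Logical_penguin    to    r/running   [link]   [comments]'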
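If the leftover runs of spaces or any HTML entities are a problem, the same regex can be wrapped in a small helper. This is only a sketch: the name `clean_summary`, the shortened sample string, and the extra unescape and whitespace steps are my own assumptions, not part of the original answer.

```python
import html
import re

TAG_RE = re.compile(r'<[^<]+?>')   # same pattern as in the answer above
WS_RE = re.compile(r'\s+')         # any run of whitespace

def clean_summary(text):
    """Strip tags, decode HTML entities, and collapse whitespace (hypothetical helper)."""
    no_tags = TAG_RE.sub('', text)
    unescaped = html.unescape(no_tags)          # e.g. '&amp;' becomes '&'
    return WS_RE.sub(' ', unescaped).strip()    # single spaces, no leading/trailing space

# Shortened, made-up sample in the same shape as the feed's summary field.
sample = ('<!-- SC_OFF --><div class="md"><p>Looking at running 3ish days.</p> </div>'
          '<!-- SC_ON -->   submitted by   '
          '<a href="https://www.reddit.com/user/Logical_penguin"> /u/Logical_penguin </a>')

print(clean_summary(sample))
```

With a DataFrame like the one in the question, this could be applied as `df['summary'] = df['summary'].map(clean_summary)`.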
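In this case, re is more powerful than lstrip / rstrip: those only trim a set of characters from the ends of a string, whereas re can match tags anywhere in it. Comparing the two is useful only as a rough indication of how much time re takes to run.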
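A rough way to time this yourself is sketched below with `timeit`. The sample string, the row count, and the function names are hypothetical, and the numbers will depend on your data and machine, so no results are quoted here.

```python
import re
import timeit

# Hypothetical data: one short HTML snippet repeated many times.
sample = '<div class="md"><p>Hello <b>world</b></p></div>'
rows = [sample] * 10_000

TAG_RE = re.compile(r'<[^<]+?>')

def with_re():
    # Removes every tag, wherever it appears in the string.
    return [TAG_RE.sub('', s) for s in rows]

def with_lstrip():
    # Only trims leading characters found in the given set; inner tags stay.
    return [s.lstrip('<!-- SC_OFF --->') for s in rows]

t_re = timeit.timeit(with_re, number=5)
t_lstrip = timeit.timeit(with_lstrip, number=5)
print(f're.sub: {t_re:.4f}s  lstrip: {t_lstrip:.4f}s')
```

Note that the two functions do not produce the same output: `with_lstrip` leaves most of the markup in place, so the timing only shows what the extra work done by `re` costs.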
Also, to save time, it is better to run re over df.values than to go through apply / map.
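A minimal sketch of that idea, assuming a small made-up DataFrame with a `summary` column: the pattern is compiled once and applied in a list comprehension over the underlying array. Whether this is actually faster than `map` on your data is worth measuring rather than taking for granted.

```python
import re
import pandas as pd

TAG_RE = re.compile(r'<[^<]+?>')   # compile once, reuse for every row

# Hypothetical stand-in for the DataFrame built from the RSS feed.
df = pd.DataFrame({
    'summary': ['<p>first <b>post</b></p>', '<div class="md">second</div>'],
})

# Loop over the raw values instead of calling map/apply on the Series.
df['summary'] = [TAG_RE.sub('', s) for s in df['summary'].values]

print(df['summary'].tolist())
```

pandas also offers `df['summary'].str.replace(r'<[^<]+?>', '', regex=True)`, which does the same substitution and handles missing values for you.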