Question

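I'm new to Python (and to most coding beyond SQL, SAS, and a little R), and I'm trying to use it to build a dataset from data spread across many different web pages. Thanks in advance for your help.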

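I'm using Python 3.4.4 and have successfully pulled down the page source, but I'm having trouble writing a regular expression that isolates the specific data elements/metrics I want. Below is a sample of the page source. I want to isolate each of the integers that sit between the <td class=...> tags.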

<tr class="Company"><td class="Company"> <a href="http://www.theacsi.org/index.php?option=com_content&view=article&id=149&catid=&Itemid=214&amp;c=Liz+Claiborne&amp;i=Apparel" id="L">Liz Claiborne</a> </td><td class="Baseline"> 84 </td><td class="Y1995"> 81 </td><td class="Y1996"> 81 </td><td class="Y1997"> 77 </td><td class="Y1998"> 78 </td><td class="Y1999"> 76 </td><td class="Y2000"> 79 </td><td class="Y2001"> 79 </td><td class="Y2002"> 80 </td><td class="Y2003"> 78 </td><td class="Y2004"> 79 </td><td class="Y2005"> 78 </td><td class="Y2006"> 81 </td><td class="Y2007"> 79 </td><td class="Y2008"> 79 </td><td class="Y2009"> 82 </td><td class="Y2010"> 79 </td><td class="Y2011"> 79 </td><td clas

Answer 1

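I think you may want to take a look at lxml and XPath, not to mention the other scraping tools out there. There are thousands of posts on this topic. Please check the following link: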

http://docs.python-guide.org/en/latest/scenarios/scrape/
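
To make the lxml/XPath suggestion concrete, here is a minimal sketch rather than a definitive solution. It assumes the third-party lxml package is installed (pip install lxml) and that every row of interest looks like the sample in the question. The names SAMPLE_TABLE and parse_company_rows are made up for illustration, the sample row is shortened to three score columns, and its href is replaced by a placeholder.

```python
from lxml import html  # third-party package: pip install lxml

# Shortened copy of the row from the question, wrapped in <table> so the
# parser sees well-formed table markup. The href is a placeholder.
SAMPLE_TABLE = (
    '<table><tr class="Company">'
    '<td class="Company"> <a href="#" id="L">Liz Claiborne</a> </td>'
    '<td class="Baseline"> 84 </td>'
    '<td class="Y1995"> 81 </td>'
    '<td class="Y1996"> 81 </td>'
    '</tr></table>'
)


def parse_company_rows(page_source):
    """Return a list of (company_name, {column_class: score}) tuples."""
    tree = html.fromstring(page_source)
    rows = []
    for tr in tree.xpath('//tr[@class="Company"]'):
        # string(...) flattens the cell, including the link text inside it.
        name = tr.xpath('string(td[@class="Company"])').strip()
        scores = {}
        for td in tr.xpath('td[not(@class="Company")]'):
            text = td.text_content().strip()
            if text.isdigit():  # skip empty or non-numeric cells
                scores[td.get("class")] = int(text)
        rows.append((name, scores))
    return rows


print(parse_company_rows(SAMPLE_TABLE))
```

Keying the scores by the td class (Baseline, Y1995, and so on) keeps each number tied to its year, which should make it easier to stack rows from many pages into one dataset.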

If you don't want to use another module, use the built-in re (regular expression) module, which gives you the tools to extract specific text from a string.
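
As a rough sketch of the re approach, the pattern below pulls the column label and the integer out of each score cell. It assumes the cells always have the exact shape shown in the question (a class of Baseline or Y followed by four digits, and a whole number inside). SAMPLE_ROW, CELL_PATTERN, and extract_scores are illustrative names, and the sample is shortened to three score columns with a placeholder href.

```python
import re

# Shortened copy of the row from the question. The href is a placeholder.
SAMPLE_ROW = (
    '<tr class="Company">'
    '<td class="Company"> <a href="#" id="L">Liz Claiborne</a> </td>'
    '<td class="Baseline"> 84 </td>'
    '<td class="Y1995"> 81 </td>'
    '<td class="Y1996"> 81 </td>'
)

# Group 1 captures the column label, group 2 the integer inside the cell.
CELL_PATTERN = re.compile(r'<td class="(Baseline|Y\d{4})">\s*(\d+)\s*</td>')


def extract_scores(page_source):
    """Return (column_label, score) pairs for every numeric score cell."""
    return [(label, int(value))
            for label, value in CELL_PATTERN.findall(page_source)]


print(extract_scores(SAMPLE_ROW))
```

Cells that are empty or hold non-numeric text simply don't match, so they are skipped. A regex like this is brittle if the site changes its markup, which is the main reason a real HTML parser is usually the safer choice.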
