Scraping WoS with Python

Time: 2016-12-09 16:40:26

Tags: python regex beautifulsoup

I am trying to download information from the WoS (Web of Science) database. I need information such as the article title, authors, times cited, volume, and so on.

Here is my code:

import sys
from BeautifulSoup import BeautifulSoup
import urllib
import re
var = raw_input("Link WoS: ")
conn = urllib.urlopen(var)
html = conn.read()
soup = BeautifulSoup(html)
titles = re.findall('<value lang_id="">(.+?)</value>',str(soup))
volume = re.findall('Volume: </span><span class="data_bold"><value>(.+?)</value>', str(soup))
print(volume)

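It works well for getting the titles. However, I am having trouble getting the following information: volume, issue, pages, date (published), and times cited. Here is the source of the web page: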

</span><span name="source_title_1" id="source_title_1">
<value>
<span class="hitHilite">EDUCATIONAL RESEARCH</span>
</value>
</span>&nbsp;&nbsp;<span class="label">Volume: </span><span     class="data_bold">
<value>35</value>
</span> &nbsp;&nbsp;<span class="label">Issue: </span><span  class="data_bold">
<value>1</value>
</span> &nbsp;&nbsp;<span class="label">Pages: </span><span class="data_bold">
<value>3-25</value>
</span> &nbsp;&nbsp;<span class="label">Published: </span><span class="data_bold">
<value>SPR 1993</value>
</span> 
</div>
<div style="display: inline-block" id="links_1">
<nobr><span id="links_openurl_1"></span> <span id="links_full_text_1">     </span> <span id="links_doc_del_1"></span> <span id="links_patent_1">    </span> </nobr>
</div>
<div class="search-action-item">
<span id="solo_full_text_1" class="solo_full_text"></span><a      name="full_text_1" id="full_text_1" title="Full Text" class="button2link     button-ft" href="javascript:;"><span id="full_text_1" name="full_text_1" title="Full Text" class="button2 button-ft">Full Text</span></a>
<div class="popup-full-text" id="full_text_1_menu">
<span id="full_text_1_links"></span>
</div>
</div>
<script type="text/javascript">$("#full_text_1").hide();</script><span style="display: inline-block" class="button-abstract" id="ViewAbstract1_text"><a title="View Abstract" alt="View Abstract" onclick="return hide_show_abstract('1', 'http://images.webofknowledge.com/WOKRS523R4/images/spacer.gif', 'http://images.webofknowledge.com/WOKRS523R4/images/spacer.gif', 'View Abstract', 'Close Abstract');" href="javascript:;" class="button9"><img align="absmiddle" title="View Abstract" alt="View Abstract" src="http://images.webofknowledge.com/WOKRS523R4/images/spacer.gif" id="ViewAbstract1_img">View Abstract<nobr></nobr></a></span><span style="display: none" class="button-abstract" id="HideAbstract1_text"><a title="Close Abstract" alt="Close Abstract" onclick="return hide_show_abstract('1',  'http://images.webofknowledge.com/WOKRS523R4/images/spacer.gif', 'http://images.webofknowledge.com/WOKRS523R4/images/spacer.gif', 'View Abstract', 'Close Abstract');" href="javascript:;" class="button9"><img align="absmiddle" title="Close Abstract" alt="Close Abstract" src="http://images.webofknowledge.com/WOKRS523R4/images/spacer.gif" id="HideAbstract1_img">Close Abstract<nobr></nobr></a></span><span style="display: none" url="http://apps.webofknowledge.com/ViewAbstract.do?product=WOS&amp;search_mode=GeneralSearch&amp;viewType=ViewAbstract&amp;qid=5&amp;SID=W1tvVEGCvoimqQujw4V&amp;page=1&amp;doc=1" id="ViewAbstract_Span1">
<!----></span></div><div class="search-results-data">
<div class="search-results-data-cite">Times Cited: <a title="View all of the articles that cite this one" href="/CitingArticles.do?product=WOS&amp;SID=W1tvVEGCvoimqQujw4V&amp;search_mode=CitingArticles&amp;parentProduct=WOS&amp;parentQid=5&amp;parentDoc=1&amp;REFID=448550&amp;excludeEventConfig=ExcludeIfFromNonInterProduct">487</a>
<br>

I think I am having trouble because the data are numeric... Can you help me?

3 answers:

Answer 0 (score: 1)

BeautifulSoup has its own support for regular expressions.
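As a rough sketch of that idea (not the answerer's original code), bs4's `find` and `find_all` accept a compiled pattern as a filter. The HTML fragment below is a hypothetical cut-down version of the question's page source, and the variable names are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the question's page source.
html = '''
<span class="label">Volume: </span><span class="data_bold">
<value>35</value>
</span>
<span class="label">Issue: </span><span class="data_bold">
<value>1</value>
</span>
'''

soup = BeautifulSoup(html, "html.parser")

# A compiled pattern can be passed as the `string` filter: this finds the
# text node that starts with "Volume", wherever it sits in the tree.
label_text = soup.find(string=re.compile(r"^Volume"))

# Step from the text node to its <span>, then on to the next <value> tag.
volume = label_text.parent.find_next("value").get_text(strip=True)
print(volume)
```

The pattern only has to match the label text, so the whitespace and line breaks between the tags no longer matter.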

Note: this is only an example, written without access to the actual HTML.

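Answer 1 (score: 0)

BeautifulSoup will do a lot of the heavy lifting for you. Regular expressions are usually a last resort for HTML. It is also best to use the latest version of the library (bs4), as in the following code.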

HTML = '''\
<value>
<span class="htmllite">EDUCATIONAL RESEARCH</span>
</value>
</span>&nbsp;&nbsp;<span class="label">Volume: </span><span class="data_bold">
<value>29</value>
</span>&nbsp;&nbsp;<span class="label">Issue: </span><span class="data_bold">
<value>2</value>
</span>&nbsp;&nbsp;<span class="label">Pages: </span><span class="data_bold">
<value>26-152</value>
</span>&nbsp;&nbsp;<span class="label">Published: </span><span class="data_bold">
<value>JUN 1987</value>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(HTML, "html.parser")

# Each label <span> is followed by a sibling <span> that holds the <value>.
items = soup.find_all('span', class_='label')
for item in items:
    label = item.contents[0]
    sibling = item.find_next_sibling('span')
    value = sibling.select('value')[0].text
    print(label, value)

Result:

Volume:  29
Issue:  2
Pages:  26-152
Published:  JUN 1987

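I am nowhere near smart enough to have written this without trial and error. Are you using something like IDLE that lets you try out alternatives and run code snippets to see what results they give?

PS: When you come back to SO, please post HTML and other text as text (not as an image file) so that answerers can cut and paste.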
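The question also asks for the citation count, which sits in a different element from the label/value pairs. A minimal sketch, assuming the page keeps the structure shown in the question (a `div` with class `search-results-data-cite` containing a link); the fragment below is a shortened, hypothetical copy of that source with the `href` cut down, and `record` is just an illustrative name:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the question's page source.
html = '''
<span class="label">Volume: </span><span class="data_bold">
<value>35</value>
</span> &nbsp;&nbsp;<span class="label">Published: </span><span class="data_bold">
<value>SPR 1993</value>
</span>
<div class="search-results-data-cite">Times Cited: <a title="View all of the articles that cite this one" href="/CitingArticles.do">487</a>
<br>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# Collect the label/value pairs into a dictionary keyed by label name.
record = {}
for label in soup.find_all("span", class_="label"):
    key = label.get_text(strip=True).rstrip(":")
    value_tag = label.find_next_sibling("span").find("value")
    record[key] = value_tag.get_text(strip=True)

# The citation count is the text of the link inside the "cite" div.
cite_link = soup.select_one("div.search-results-data-cite a")
if cite_link is not None:
    record["Times Cited"] = int(cite_link.get_text(strip=True))

print(record)
```

Keeping the values in a dictionary per record avoids depending on the order in which the fields appear on the page.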

Answer 2 (score: 0)

I finally did it! I just wrote this:

numericValues= re.findall('<value>(.+?)</value>', str(soup)) 

This gives the following output:

['100-121', '35', '1', '3-25', 'SPR 1993']

I don't know what the first value is, but the ones after it are the values I need. Then I just iterate over the values:

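# Each record contributes five <value> entries; the first one is skipped.
columnVolume, columnIssue, columnPages, columnDate = [], [], [], []
i = 0
while i + 4 < len(numericValues):  # stop before an incomplete trailing group
    columnVolume.append(numericValues[i+1])
    columnIssue.append(numericValues[i+2])
    columnPages.append(numericValues[i+3])
    columnDate.append(numericValues[i+4][-4:])  # keep only the year
    i = i + 5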
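For what it is worth, the pattern in the question most likely failed not because the values are numeric but because the page source has extra spaces inside the `<span>` tag and a line break before `<value>`, which a literal pattern cannot match. A hedged sketch of a whitespace-tolerant pattern that also captures the label, so the values need not be picked out by position; the fragment is copied from the question's source and the names `pattern` and `fields` are illustrative:

```python
import re

# Fragment in the same shape as the question's page source, including the
# irregular spacing inside the <span> tags and the newline before <value>.
html = '''<span class="label">Volume: </span><span     class="data_bold">
<value>35</value>
</span> &nbsp;&nbsp;<span class="label">Issue: </span><span  class="data_bold">
<value>1</value>
</span> &nbsp;&nbsp;<span class="label">Pages: </span><span class="data_bold">
<value>3-25</value>
</span> &nbsp;&nbsp;<span class="label">Published: </span><span class="data_bold">
<value>SPR 1993</value>
</span>'''

# \s+ and \s* absorb the variable spacing and the newline that a literal
# 'Volume: </span><span class="data_bold"><value>' pattern cannot match.
pattern = re.compile(
    r'<span class="label">([^<:]+):\s*</span>'
    r'<span\s+class="data_bold">\s*<value>(.*?)</value>'
)

# findall returns (label, value) tuples, which dict() turns into a mapping.
fields = dict(pattern.findall(html))
print(fields)
```

This still depends on the exact tag layout, so it is only as robust as the page markup is stable.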

Thank you all for your help!