Question

I'm currently trying to scrape some graphs from a web page, but I'm new to this, so I don't know the best approach.

<font color="DarkBLUE">
g:=Graph&lt;5|{ {2, 3}, {4, 5}, {1, 3}, {1, 2}, {1, 5}, {1, 4}, {2, 4}, {3, 5}, {2,
5}, {3, 4} }&gt;;</font>

What I need is the g:=Graph<..> part. This is what I've tried so far (based on a few similar questions):

tree = lh.fromstring(data)
rate = tree.xpath("//font[@color='DarkBLUE']")
graphurls.append(rate[0].text_content())

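The problem is that it picks up a lot of other things as well. I think this should be doable, because the text has a unique pattern, g:=Graph<...>, so nothing else should get scraped.

Can you help me?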

Answer 1

I would try regular expressions (https://docs.python.org/3/library/re.html). You can experiment at https://regex101.com/ until you find the right pattern.

Specifically, you can use a capture group to find the repeated sequence of

"{2, 3}, {4, 5}, {1, 3}, {1, 2}, {1, 5}, {1, 4}, {2, 4}, {3, 5}, ..."
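As a rough illustration of the capture-group idea, the sketch below pulls each {a, b} pair out of the graph text. The pattern and the names `text`, `pair_pattern`, and `edges` are assumptions made for this example; they do not come from the page being scraped.

```python
import re

# The graph definition as it appears after HTML entities are decoded
# (note the line break inside "{2,\n5}").
text = ("g:=Graph<5|{ {2, 3}, {4, 5}, {1, 3}, {1, 2}, {1, 5}, {1, 4}, "
        "{2, 4}, {3, 5}, {2,\n5}, {3, 4} }>;")

# Two capture groups, one per vertex; \s* tolerates spaces and line breaks.
pair_pattern = re.compile(r"\{\s*(\d+)\s*,\s*(\d+)\s*\}")

# findall returns one (a, b) tuple of strings per matched pair.
edges = [(int(a), int(b)) for a, b in pair_pattern.findall(text)]
print(edges)
```

The outer brace of the set is skipped automatically, because it is followed by another `{` rather than a digit.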

Re-reading your question, you may already know all of this, but you can also use regular expressions in Beautiful Soup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-regular-expression
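A minimal sketch of the Beautiful Soup route, assuming bs4 4.4 or later (where the `string=` argument is available) and a shortened copy of the sample markup. The variable names are made up for the example.

```python
import re
from bs4 import BeautifulSoup

# Shortened sample markup (an assumption for this sketch).
html = '''<font color="DarkBLUE">
g:=Graph&lt;5|{ {2, 3}, {4, 5} }&gt;;</font>

<font color="DarkBLUE">h:=Other&lt;...&gt;;</font>'''

soup = BeautifulSoup(html, "html.parser")

# string= accepts a compiled regex; only <font> tags whose single text
# child matches the pattern are returned.
matches = soup.find_all("font", string=re.compile(r"g:=Graph"))

graph_texts = [tag.get_text().strip() for tag in matches]
print(graph_texts)
```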

Answer 2

First method: you already have a string, so you can filter the results with string methods, e.g.

if text.strip().startswith('g:=Graph') :

Example:

data = '''<font color="DarkBLUE">
g:=Graph&lt;5|{ {2, 3}, {4, 5}, {1, 3}, {1, 2}, {1, 5}, {1, 4}, {2, 4}, {3, 5}, {2,
5}, {3, 4} }&gt;;</font>

<font color="DarkBLUE">h:=Other&lt;...&gt;;</font>'''

import lxml.html as lh

tree = lh.fromstring(data)
rate = tree.xpath("//font[@color='DarkBLUE']")

for item in rate:
    text = item.text_content()
    text = text.strip()
    if text.startswith('g:=Graph'):
        print(' OK:', text)
    else:
        print('NOT:', text)

Second method: you can filter with XPath

tree.xpath("//font[@color='DarkBLUE' and contains(text(), 'g:=Graph')]")

or

tree.xpath("//font[@color='DarkBLUE'][contains(text(), 'g:=Graph')]")

Example:

data = '''<font color="DarkBLUE">
g:=Graph&lt;5|{ {2, 3}, {4, 5}, {1, 3}, {1, 2}, {1, 5}, {1, 4}, {2, 4}, {3, 5}, {2,
5}, {3, 4} }&gt;;</font>

<font color="DarkBLUE">h:=Other&lt;...&gt;;</font>'''

import lxml.html as lh

tree = lh.fromstring(data)
rate = tree.xpath("//font[@color='DarkBLUE' and contains(text(), 'g:=Graph')]")

for item in rate:
    text = item.text_content()
    text = text.strip()
    print(text)

Finally, you can use starts-with(), but the text in the data begins with a line break, so the string in the XPath needs a leading \n

tree.xpath("//font[@color='DarkBLUE' and starts-with(text(), '\ng:=Graph')]")
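If hard-coding the leading line break feels fragile, one possible variation (not part of the original answer) is to wrap the text in XPath's normalize-space(), which trims surrounding whitespace before starts-with() is applied. The sketch below uses a shortened copy of the sample data.

```python
import lxml.html as lh

# Shortened sample markup (an assumption for this sketch).
data = '''<font color="DarkBLUE">
g:=Graph&lt;5|{ {2, 3}, {4, 5} }&gt;;</font>

<font color="DarkBLUE">h:=Other&lt;...&gt;;</font>'''

tree = lh.fromstring(data)

# normalize-space() strips leading/trailing whitespace (and collapses inner
# runs), so the line break before "g:=Graph" no longer matters.
query = ("//font[@color='DarkBLUE' and "
         "starts-with(normalize-space(text()), 'g:=Graph')]")

found = [item.text_content().strip() for item in tree.xpath(query)]
print(found)
```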

BTW: xpath cheatsheet

Answer 3

One way is with a regex:

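import re

# text_content() decodes &lt; and &gt;, so each definition ends with ">;".
# re.DOTALL lets "." match the line break inside the definition.
graphs = re.findall(r"g:=.*?>;", rate[0].text_content(), flags=re.DOTALL)

This captures every match that starts with "g:=" and ends with ">;". It searches for such matches in the string rate[0].text_content().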

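Note: apply this to the string, i.e. the result of .text_content(), not to the raw HTML. In the raw HTML the angle brackets are still escaped as &lt; and &gt;.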
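To make the note concrete, here is a small end-to-end sketch under a few assumptions: `extract_graphs` is a hypothetical helper name, the sample markup is shortened, and the pattern assumes each definition ends with ">;" once the entities are decoded.

```python
import re
import lxml.html as lh

def extract_graphs(html):
    """Return every g:=Graph<...>; definition found in the page text."""
    # text_content() decodes entities such as &lt; and &gt;.
    text = lh.fromstring(html).text_content()
    # DOTALL lets "." cross line breaks; the non-greedy ".*?" stops at the first ">;".
    raw = re.findall(r"g:=Graph<.*?>;", text, flags=re.DOTALL)
    # Collapse the line wrapping inside each definition to single spaces.
    return [" ".join(m.split()) for m in raw]

data = '''<font color="DarkBLUE">
g:=Graph&lt;5|{ {2, 3}, {2,
5}, {3, 4} }&gt;;</font>

<font color="DarkBLUE">h:=Other&lt;...&gt;;</font>'''

graphs = extract_graphs(data)
print(graphs)
```

Running the same pattern on `data` directly finds nothing, because the raw HTML contains `&lt;` rather than `<`.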

Scraping specific text from a web page

3 answers: