Question

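I am trying to fetch data from a URL. The format of its content is shown below.

What I want to do:
1) Read line by line and check whether the line contains one of the required keywords. 2) If it does, store the "GETCONTENT" part of the previous line in a list.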

<http://www.example.com/XYZ/a-b-c/w#>DONTGETCONTENT    
 a       <http://www.example.com/XYZ/mount/v1#NNNN> , 
<http://www.w3.org/2002/w#Individual> ;
        <http://www.w3.org/2000/01/rdf-schema#label>
                "some content , "some url content ;
        <http://www.example.com/XYZ/log/v1#hasRelation>
                <http://www.example.com/XYZ/data/v1#Change> ;
        <http://www.example.com/XYZ/log/v1#ServicePage>
                <https://dev.org.net/apis/someLabel> ;
        <http://www.example.com/XYZ/log/v1#Description>
                "Some API Content .

<http://www.example.com/XYZ/model/v1#GETBBBBBB>
a       <http://www.w3.org/01/07/w#BBBBBB> ;
        <http://www.w3.org/2000/01/schema#domain>
                <http://www.example.com/XYZ/data/v1#xyz> ;
        <http://www.w3.org/2000/01/schema#label1>
               "some content , "some url content ;
        <http://www.w3.org/2000/01/schema#range>
                <http://www.w3.org/2001/XMLSchema#boolean> ;
       <http://www.example.com/XYZ/log/v1#Description>
            "Some description .

<http://www.example.com/XYZ/datamodel-ee/v1#GETAAAAAA>
 a       <http://www.w3.org/01/07/w#AAAAAA> ;
        <http://www.w3.org/2000/01/schema#domain>
                <http://www.example.com/XYZ/data/v1#Version> ;
        <http://www.w3.org/2000/01/schema#label>
                "some content ;
        <http://www.w3.org/2000/01/schema#range>
            <http://www.example.com/XYZ/data/v1#uuu> .

<http://www.example.com/XYZ/datamodel/v1#GETCCCCCC>
 a       <http://www.w3.org/01/07/w#CCCCCC , 
<http://www.w3.org/2002/07/w#Name> 
        <http://www.w3.org/2000/01/schema#domain>
                <http://www.example.com/XYZ/data/v1#xyz> ;
        <http://www.w3.org/2000/01/schema#label1>
              "some content , "some url content ;
        <http://www.w3.org/2000/01/schema#range>
               <http://www.w3.org/2001/XMLSchema#boolean> ;
        <http://www.example.com/XYZ/log/v1#Description>
               "Some description .

Below is the code I have tried so far, but it prints the entire contents of the file:

import re

def read_from_url():
    try:
        from urllib.request import urlopen
    except ImportError:
        from urllib2 import urlopen
    url_link = "examle.com"
    html = urlopen(url_link)
    previous = None
    for line in html:
        previous = line
        line = re.search(r"^(\s*a\s*)|\#GETBBBBBB|#GETAAAAAA|#GETCCCCCC\b",
                         line.decode('UTF-8'))
        print(previous)

if __name__ == '__main__':
    read_from_url()

Expected output:

GETBBBBBB , GETAAAAAA , GETCCCCCC

Thanks in advance!

Answer 1

When it comes to reading data from a URL, the requests library is much simpler:

import requests

url = "https://www.example.com/your/target.html"
text = requests.get(url).text

If you do not have it installed yet, you can install it with:

pip3 install requests
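
If adding a third-party package is not an option, the standard library can fetch the page as well. The following is only a minimal sketch: fetch_text is a hypothetical helper name, the 10-second timeout is an arbitrary choice, and falling back to UTF-8 when the server declares no charset is an assumption.

```python
from urllib.request import urlopen


def fetch_text(url, timeout=10):
    """Return the body of `url` decoded to a string (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        # Assumption: use UTF-8 when the response does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

It would be called as text = fetch_text("https://www.example.com/your/target.html"), after which text can be processed exactly like the requests result.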

Next, why cram all the words into a single regular expression when you could use a list of words and a for loop instead?

For example:

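import re

search_words = "hello word world".split(" ")
matching_lines = []

for (i, line) in enumerate(text.splitlines()):
  line = line.strip()
  if len(line) < 1:
    continue  # skip empty lines
  for word in search_words:
    # raw strings, so that \b is a word boundary and not a backspace character
    if re.search(r"\b" + word + r"\b", line):
      matching_lines.append(line)
      break  # one match is enough; go on to the next line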

You would then output the result like this:

print(matching_lines)

Running this code with the text variable set to:

"""
this word will save the line
ignore me!
hello my friend!
what about me?
"""

should output:

[
  "this word will save the line",
  "hello my friend!"
]

You can make the search case-insensitive with the lower method, like so:

search_words = [word.lower() for word in "hello word world".split(" ")]
matching_lines = []

for (i, line) in enumerate(text.splitlines()):
  line = line.strip()
  if len(line) < 1:
    continue  # skip empty lines
  line = line.lower()
  for word in search_words:
    # raw strings, so that \b is a word boundary and not a backspace character
    if re.search(r"\b" + word + r"\b", line):
      matching_lines.append(line)
      break  # one match is enough; go on to the next line

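Notes and information:

Stopping the inner loop at the first matching word keeps a line from being appended once per matching word; this needs break, because continue would only skip ahead to the next word
The enumerate function lets us iterate over both the index and the current line
I lowercased the search words once, outside the for loop, so that lower is not called again for every word comparison on every line
I did not call lower on the line until after the empty-line check, since there is no point in lowercasing an empty line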
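
As a variation on the case-insensitive search, the re.IGNORECASE flag can do the case folding, so the stored lines keep their original casing. This is only a sketch under assumptions: find_matching_lines is a hypothetical name, and re.escape is added on the assumption that the search words are meant literally.

```python
import re


def find_matching_lines(text, words):
    """Return the non-empty lines of `text` that contain any of `words` as a whole word."""
    # One pattern per word, compiled once; IGNORECASE replaces the lower() calls.
    patterns = [
        re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        for word in words
    ]
    result = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        # any() stops at the first pattern that matches, so a line is stored once.
        if any(pattern.search(line) for pattern in patterns):
            result.append(line)
    return result


sample = """
This WORD will save the line
ignore me!
Hello my friend!
what about me?
"""
print(find_matching_lines(sample, ["hello", "word", "world"]))
# ['This WORD will save the line', 'Hello my friend!']
```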

Good luck.

Answer 2

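I am confused about a few things; answering them may help the community help you better. Specifically, I cannot tell what form the input takes (i.e., whether it is a txt file, or a URL that you send a request to and then parse the response of). I also cannot tell whether you want the whole line, just the URL, or just the bit that follows the hash sign.

Still, since you say you are looking for a program that outputs GETBBBBBB , GETAAAAAA , GETCCCCCC, here is a quick way to get those specific values (assuming the data is held in a string):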

search = re.findall(r'#(GET[ABC]{6})>', string)
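
To illustrate that one-liner, here is a small self-contained run against an excerpt adapted from the question's data. The excerpt is shortened for the example; the closing '>' in the pattern restricts matches to IRIs that end right after the GET... fragment.

```python
import re

# Shortened excerpt of the data shown in the question (illustration only).
string = """\
<http://www.example.com/XYZ/a-b-c/w#>DONTGETCONTENT
 a       <http://www.example.com/XYZ/mount/v1#NNNN> ,
<http://www.example.com/XYZ/model/v1#GETBBBBBB>
a       <http://www.w3.org/01/07/w#BBBBBB> ;
<http://www.example.com/XYZ/datamodel-ee/v1#GETAAAAAA>
 a       <http://www.w3.org/01/07/w#AAAAAA> ;
<http://www.example.com/XYZ/datamodel/v1#GETCCCCCC>
 a       <http://www.w3.org/01/07/w#CCCCCC ,
"""

search = re.findall(r"#(GET[ABC]{6})>", string)
print(" , ".join(search))  # GETBBBBBB , GETAAAAAA , GETCCCCCC
```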

Otherwise, if you are reading a txt file, this may help:

import re

with open('example_file.txt', 'r') as file:
    lst = []
    for line in file:
        # findall returns every captured GET... fragment found on the line
        search = re.findall(r'#(GET[ABC]{6})', line)
        if search:
            lst += search
    print(lst)
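
If the goal really is the "previous line" logic described in the question, one possible reading is: whenever a line starts with the bare token a, look at the line before it and keep its GET... fragment. The sketch below assumes exactly that; collect_subjects and SUBJECT are hypothetical names, and the pattern assumes the wanted fragment always has the shape #GET...> at the end of the line.

```python
import re

# Assumed keyword shape: '#GET...' immediately followed by the closing '>'.
SUBJECT = re.compile(r"#(GET\w+)>\s*$")


def collect_subjects(lines):
    """Collect the GET... fragment of each line that directly precedes an 'a' line."""
    found = []
    previous = ""
    for line in lines:
        tokens = line.split()
        # In the question's data, the second line of a block starts with 'a'.
        if tokens and tokens[0] == "a":
            match = SUBJECT.search(previous)
            if match:
                found.append(match.group(1))
        previous = line
    return found


sample = [
    "<http://www.example.com/XYZ/a-b-c/w#>DONTGETCONTENT",
    " a       <http://www.example.com/XYZ/mount/v1#NNNN> ,",
    "",
    "<http://www.example.com/XYZ/model/v1#GETBBBBBB>",
    "a       <http://www.w3.org/01/07/w#BBBBBB> ;",
    "",
    "<http://www.example.com/XYZ/datamodel-ee/v1#GETAAAAAA>",
    " a       <http://www.w3.org/01/07/w#AAAAAA> ;",
]
print(collect_subjects(sample))  # ['GETBBBBBB', 'GETAAAAAA']
```

The function accepts any iterable of strings, so an open text file can be passed directly; lines read from urlopen are bytes and would need decoding first.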

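Of course, these are just a few quick suggestions in case they help. Otherwise, please answer the questions I raised at the start of my reply; that may help someone better understand what you expect.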

Reading and processing data from a URL in Python