Question

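I want to extract specific words from a text file. Here is the sample text file:
https://drive.google.com/file/d/0BzQ6rtO2VN95d3NrTjktMExfNkU/view?usp=sharing
Please take a look at it.
I am trying to extract strings of the form:

"Name": "the name in front of it"
"Link": "the link in front of it"

From the input file, I expect output like this: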

"Name":"JTLnet"
"Link":"http://jtlnet.com"
"Name":"Apache 1.3"
"Link":"http://httpd.apache.org/docs/1.3"
"Name":"Apache"
"Link":"http://httpd.apache.org/"
.
.
.
"Name":"directNIC"
"Link":"http://directnic.com"

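If these words appear anywhere in the file, they should be extracted to another file. How can I achieve this extraction? Please treat the file as a small part of a much larger file. Also, it is a text file, not JSON. Please help.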

Answer 1

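Since the text file is not well formatted, the only option is a regular expression. The snippet below works for the given sample file.

Keep in mind that this requires loading the whole file into memory.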

import json
import re

# Read the whole file into memory
with open(r'filepath') as f:
    textCorpus = f.read()

# Replace empty strings with non-empty ones so the regex matches easily
textCorpus = textCorpus.replace('""', '" "')
lstMatches = re.findall(r'"Name".+?"Link":".+?"', textCorpus)
# Open in text mode: writing str to a binary-mode file fails on Python 3
with open(r'new_file.txt', 'a') as wf:
    for eachMatch in lstMatches:
        # Wrap each match in braces so it parses as a JSON object
        convJson = "{" + eachMatch + "}"
        json_data = json.loads(convJson)
        wf.write(json_data["Name"] + "\n")
        wf.write(json_data["Link"] + "\n")
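
To see what the two steps above do (padding empty strings, then wrapping each match in braces for json.loads), here is a minimal sketch that runs on an in-memory string instead of a file. The sample record layout, including the "Icon" field, and the function name extract_pairs are assumptions made for illustration; the real file may look different.

```python
import json
import re

# Hypothetical sample: the field layout (including "Icon") is assumed,
# not taken from the actual file.
sample = (
    '{"Categories":"1","Name":"JTLnet","Icon":"","Link":"http://jtlnet.com"},'
    '{"Categories":"2","Name":"Apache","Icon":"a.png","Link":"http://httpd.apache.org/"}'
)


def extract_pairs(text):
    """Return (name, link) tuples found in text."""
    # Pad empty strings so that ".+?" always has a character to match.
    text = text.replace('""', '" "')
    pairs = []
    for match in re.findall(r'"Name".+?"Link":".+?"', text):
        # Each match is a comma-separated run of "key":"value" fields,
        # so adding braces turns it into a parsable JSON object.
        record = json.loads("{" + match + "}")
        pairs.append((record["Name"], record["Link"]))
    return pairs


for name, link in extract_pairs(sample):
    print(name, link)
```

One side effect worth knowing: a field that was originally empty comes back as a single space because of the padding step.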

Answer 2

A short solution using re.findall() and a flattening list comprehension:

import re

with open('test.txt', 'r') as fh:
    p = re.compile(r'(?:"Categories":[^,]+,)("Name":"[^"]+"),(?:[^,]+,)("Link":"[^"]+")')
    result = [pair for l in re.findall(p, fh.read()) for pair in l]

print('\n'.join(result))

Output (fragment):

"Name":"JTLnet"
"Link":"http://jtlnet.com"
"Name":"Apache 1.3"
"Link":"http://httpd.apache.org/docs/1.3"
"Name":"Apache"
"Link":"http://httpd.apache.org/"
"Name":"PHP"
....
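
The pattern above has two capturing groups, so re.findall() returns a list of (Name, Link) tuples, which the list comprehension then flattens. Here is a small sketch of that behaviour on an in-memory string. The sample record layout is an assumption: the pattern expects a "Categories" field directly before "Name" and exactly one comma-free field between "Name" and "Link".

```python
import re

PAIR_RE = re.compile(
    r'(?:"Categories":[^,]+,)("Name":"[^"]+"),(?:[^,]+,)("Link":"[^"]+")'
)

# Hypothetical sample; the "Icon" field stands in for whatever field
# sits between "Name" and "Link" in the real file.
sample = (
    '{"Categories":"1","Name":"JTLnet","Icon":"","Link":"http://jtlnet.com"},'
    '{"Categories":"2","Name":"Apache","Icon":"a.png","Link":"http://httpd.apache.org/"}'
)

# One tuple per record, one tuple item per capturing group.
matches = PAIR_RE.findall(sample)

# Flatten the tuples into a single list of "key":"value" strings.
flat = [field for pair in matches for field in pair]

print('\n'.join(flat))
```

Records that deviate from this layout (an extra field, or a comma inside the intermediate value) are skipped silently, so it is worth comparing the number of matches with the number of records you expect.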

Answer 3

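Your file is malformed JSON with extraneous double quotes. That is enough to prevent the json module from loading it, which leaves you with lower-level regex parsing.

Assumptions:

The interesting part after "Name" or "Link" is:
- separated from the identifier by a colon (:)
- enclosed in double quotes (") and contains no double quotes itself
The file is organized in lines
The name and link fields always sit on a single line (no newline inside a field)

You can then process the file line by line, with a simple re.finditer on each line: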

import re

# inputfile is the path of the text file to scan
rx = re.compile(r'(("Name":".*?")|("Link":".*?"))')
with open(inputfile) as fd:
    for line in fd:
        l = rx.finditer(line)
        for elt in l:
            print(elt.group(0))

If you want to write the data to another file, open it with `with open(outputfile, "w") as fdout:` before the snippet above, and replace the print line with:

fdout.write(elt.group(0) + "\n")
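
Putting the two pieces together, the line-by-line variant that writes to a second file could look like the sketch below. The function name extract_fields, the UTF-8 encoding, and the slightly simplified pattern ([^"]* instead of .*?, which matches the same text when values contain no double quotes) are assumptions for illustration, not part of the original answer.

```python
import re

# Matches "Name":"..." or "Link":"..." where the value has no double quotes.
FIELD_RE = re.compile(r'"(?:Name|Link)":"[^"]*"')


def extract_fields(input_path, output_path):
    """Copy every Name/Link field from input_path to output_path, one per line.

    Returns the number of fields written.
    """
    count = 0
    # Encoding is assumed to be UTF-8; adjust if the real file differs.
    with open(input_path, encoding="utf-8") as fd, \
            open(output_path, "w", encoding="utf-8") as fdout:
        # Iterating over the file object reads one line at a time,
        # so the whole file never has to fit in memory.
        for line in fd:
            for match in FIELD_RE.finditer(line):
                fdout.write(match.group(0) + "\n")
                count += 1
    return count
```

A call such as extract_fields("input.txt", "output.txt") (both file names are placeholders) would then produce the output file in a single pass, which suits the large-file case mentioned in the question.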

How can I extract specific words from a file using Python 3.6?
