Question

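I'm working on an assignment and, while reading through similar topics, found a very interesting one here: Find string between two substrings

My goal is to use Python to search text files for 3 specific patterns. The text files are unclassified, and I need to:

1) Start the search from the keyword "more info" (skipping everything before it)

2) Classify each document according to the following:
A1) String "big home" and its price
A2) String "big home", price not found
B1) String "small home" and its price
B2) String "small home", price not found
C1) Strings "big home" and "small home" and their price
C2) Strings "big home" and "small home", price missing
D) No string found (neither "big home" nor "small home")

For A, B, and C, find the price and print something like 'Big home price 50USD'; if no price is found, say so.

I'm using Python for text analysis that returns a taxonomy of the keywords found, and I need to classify the documents (text files) according to patterns A, B, C, and D above.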

data_train['classi'] = data_train['text'].apply(lambda x: len([x for x in x if x.startswith('classi')]))
data_train[['text','classi']].head()

Here is the output:

text    classi
0   [big home, forrest, suburb, more info,          0
1   [town, pool, more info,                         0
2   [small home,more info,  forrest, suburb         1
3   [big home, more info,  forrest, price 50        1
4   [big home, forrest,  more info,  city           0

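What I want: 1) start the search from the keyword "more info"; 2) classify the text documents into A, B, C, and D (get the string with its price, or a note saying there is no price).

Any support is highly appreciated!

Edit:

Maybe it would be interesting to use NLTK here, any ideas?
I'm actually playing around with https://pythex.org/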

Answer 1

I would do something like this:

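from pathlib import Path


def do_something(price):
    # Placeholder: replace with whatever you want to do with the result.
    print(price)


for file in Path("my_folder").glob("*.txt"):
    with file.open('r') as f:
        more_info_flag = False
        for line in f:
            # Skip every line until the one that contains "more info".
            if not more_info_flag:
                if "more info" in line:
                    more_info_flag = True
                else:
                    continue
            # "big home" is spelled as in the question's sample data.
            if "big home" in line:
                if "price is" in line:
                    # split() with no argument ignores the space after "price is".
                    price = int(line.split("price is")[1].split()[0])
                else:
                    price = None
                do_something(price)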

I think this works for the file you posted; if the others are formatted differently, it will need to be modified...
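To cover the full A/B/C/D classification from the question, something along these lines might work. It is only a sketch under several assumptions: the `classify` function and `PRICE_RE` pattern are hypothetical names, the price is assumed to appear as "price 50" or "price is 50", only the text after "more info" is examined, and a document without "more info" is treated as category D.

```python
import re

# Assumed price formats: "price 50" (question's sample) or "price is 50" (loop above).
PRICE_RE = re.compile(r"price(?:\s+is)?\s+(\d+)")


def classify(text, start_keyword="more info"):
    """Return (label, price) based on the text that follows start_keyword."""
    _, found, tail = text.partition(start_keyword)
    if not found:
        # Assumption: no "more info" keyword means nothing to classify.
        return "D", None

    has_big = "big home" in tail
    has_small = "small home" in tail
    if has_big and has_small:
        prefix = "C"
    elif has_big:
        prefix = "A"
    elif has_small:
        prefix = "B"
    else:
        return "D", None

    match = PRICE_RE.search(tail)
    price = int(match.group(1)) if match else None
    # Suffix 1 = price found, 2 = price not found.
    return prefix + ("1" if price is not None else "2"), price
```

You could then call `classify(file.read_text())` for each file and print, for example, `f"Big home price {price}USD"` when the label is A1, or a "price not found" note when it ends in 2.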
