Question

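I am trying to extract a specific string value from a text file (file1.txt), then use the extracted string (a URL) to create an HTTP GET request. The HTTP response should be saved as a new HTML file in a directory. The string I want to extract is the value of a specific key.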

For example: "display_url":"test.com" (extract "test.com", then create the HTTP request)

The contents of my txt file:

{"created_at":"Thu Nov 15 11:35:00 +0000 2018","id":15292802,"id_str":325802","text":"test8 https://t.co/ZtCsuk7Ek2 #osining","source":"\u003ca href=\"http://twitter.com\" rel=\"nofollow\"\u003eTwitter Web Client\u003c/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_user_id_str":null,"user":{"id":961508,"id_str":"961","name":"Online S","screen_name":"osectraining","location":"Israel","url":"https://www.test.co.il","description":"test","translator_type":"none","protected":false,"verified":false,"followers_count":2,"friends_count":51,"listed_count":0,"favourites_count":0,"statuses_count":7,"created_at":"Thu Feb 08 07:54:39 +0000 2018","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","profile_background_tile":false,"profile_link_color":"1B95E0","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,"profile_image_url":"http://pbs.twimg.com/profile_images/961510231346958336/d_KhBeTD_normal.jpg","profile_image_url_https":"https://pbs.twimg.com/profile_images/961510231346958336/d_KhBeTD_normal.jpg","profile_banner_url":"https://pbs.twimg.com/profile_banners/961508561217052675/1518076913","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"osectraining","indices":[33,46]}],"urls":[{"url":"https://t.co/ZtCsuk7Ek2","expanded_url":"http://test.com","display_url":"test.com","indices":[7,30]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1542281700508"}

My code:

import re
with open('file1') as f:
found = []
for line in f.readlines():
    found += re.findall(r'"display_url":\s(\w+)\s', line)
print(found)

Answer 1

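Note that indentation is critical in Python. It is not clear to me whether you made a mistake in your code's indentation or in formatting the question you posted. That said...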

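You need to do four things to complete the task:

1. Read file1.txt from disk.
2. Parse the contents of the file to find display_url.
3. Call the URL to get a response.
4. Write the response to disk.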

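Your code attempts steps 1 and 2, but it has some problems. The first problem is that there is an error in your text file: in the key-value pair "id_str":325802" the value is missing its opening quotation mark, so the file is not valid JSON.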

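Once you have fixed that, you need to fix the indentation of your code so that f is still open at the point where you try to use it. Finally, I don't think a regular expression is really the right approach here.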

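You can read the file and easily parse it into a Python dictionary. Finding the information you need requires knowing the structure of the JSON. Here is one way to do it: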

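import json

# Read the whole file as one string (it holds a single JSON object).
with open('./file1.txt', 'r') as f:
    text = f.read()

# Parse the JSON text into a Python dictionary.
dictionary = json.loads(text)
entities = dictionary.get('entities')
# 'urls' is a list of dictionaries; take the first entry.
urls = entities.get('urls')[0]
display_url = urls.get('display_url')
print(display_url)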

Now you need to figure out steps 3 and 4, which really are the easy part compared to step 2.
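For reference, steps 3 and 4 could be sketched as below, using only the standard library. This is one possible approach, not the only one. The function names (to_full_url, fetch, save_response, download) are made up for illustration, and prepending http:// is an assumption: display_url ("test.com") carries no scheme, so one has to be chosen before the request can be made.

```python
import urllib.request
from pathlib import Path


def to_full_url(display_url):
    """Return a URL with a scheme; http:// is assumed when none is present."""
    if display_url.startswith(("http://", "https://")):
        return display_url
    return "http://" + display_url


def fetch(url, timeout=10):
    """Step 3: send an HTTP GET request and return the raw response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def save_response(body, directory, filename):
    """Step 4: write the response bytes to directory/filename."""
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the directory if needed
    target = target_dir / filename
    target.write_bytes(body)
    return target


def download(display_url, directory, filename):
    """Fetch the page behind display_url and save it as an HTML file."""
    body = fetch(to_full_url(display_url))
    return save_response(body, directory, filename)
```

With the value printed above, a call such as download(display_url, 'output', 'test.com.html') would request http://test.com and write the body to output/test.com.html. The directory and file names here are placeholders. Writing bytes rather than text avoids having to guess the page's encoding.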

Answer 2

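From your text, it looks like your file contains JSON data. So instead of reading lines, you can load the JSON and then easily get the value of display_url. For example: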

import json
with open('file1') as f:
    data = json.load(f)

urls = [x["display_url"] for x in data["entities"]["urls"]]
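If the file turns out to hold one JSON object per line (the loop over lines in the question hints at this, but it is an assumption), a line-by-line variant may be more forgiving. The sketch below skips lines that are not valid JSON, such as a record with a broken quotation mark, instead of failing on the whole file. The function name collect_display_urls is hypothetical.

```python
import json


def collect_display_urls(lines):
    """Collect every display_url from an iterable of JSON lines."""
    found = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        # 'entities' or 'urls' may be missing or null; fall back to empty values.
        entities = record.get("entities") or {}
        for entry in entities.get("urls") or []:
            if "display_url" in entry:
                found.append(entry["display_url"])
    return found
```

Since an open file object yields its lines when iterated, it can be passed in directly: with open('file1') as f: urls = collect_display_urls(f).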

Extract a specific string from a text file, then create an HTTP request
