Python replace ""

Date: 2017-02-27 14:02:23

Tags: html python-3.x replace anaconda tweets

I am new to Python, and I am trying to understand and use the script from this link in Anaconda, running on Python 3.5.2. I have had to change a few things so that the script could run under this version of Python, since it dates from 2013. The script (as modified by inexperienced me) is below, and my problem lies in the try block, in the line html = f.read().replace("</html>", "") + "</html>".

I simply cannot understand the reason for + "</html>" after the closing parenthesis. From what I have found about the replace() method, it takes at least two arguments: the old characters and the new ones. As it stands, this script jumps to except Exception as e: and prints out a bytes-like object is required, not 'str'.

As far as I can tell, this is because the read happens in bytes, while the replace method expects strings. I tried to split this line into:

html = f.read
html = str.replace("</html>", "") + "</html>"

but this raises replace() takes at least 2 arguments (1 given). I also tried changing the contents of html from bytes to str, like this:

html = str(f.read(), 'utf-8')
html = str.replace("</html>", "")

but this also returns the error replace() takes at least 2 arguments (1 given). When I removed html = str.replace("</html>", "") entirely, and therefore skipped soup = BeautifulSoup(html), I ended up with a warning that no parser was explicitly specified, and later an AttributeError: NoneType object has no attribute get_dictionary.
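For what it's worth, the bytes-versus-str mismatch behind the first error can be reproduced in isolation (the sample HTML below is made up):

```python
data = b"<html><p>hi</p></html>"  # urlopen(...).read() returns bytes

try:
    data.replace("</html>", "")  # str arguments on a bytes object
except TypeError as e:
    print(e)  # a bytes-like object is required, not 'str'

# Option 1: stay in bytes and use bytes arguments
stripped_bytes = data.replace(b"</html>", b"") + b"</html>"

# Option 2: decode to str first, then use str arguments
text = data.decode("utf-8").replace("</html>", "") + "</html>"
print(text)
```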

Any help understanding why the line above is needed, what it is for, and how to use it would be greatly appreciated. Thanks.

#!/usr/bin/python

import sys
import urllib.request
import re
import json

from bs4 import BeautifulSoup

import socket

socket.setdefaulttimeout(10)

cache = {}

for line in open(sys.argv[1]):
    fields = line.rstrip('\n').split('\t')
    sid = fields[0]
    uid = fields[1]

    # url = 'http://twitter.com/%s/status/%s' % (uid, sid)
    # print url
    tweet = None
    text = "Not Available"
    if sid in cache:
        text = cache[sid]
    else:
        try:
            f = urllib.request.urlopen("http://twitter.com/%s/status/%s" % (uid, sid))
            print('URL: ', f.geturl())
            # Thanks to Arturo!
            # html = f.read()
            html = f.read().replace("</html>", "") + "</html>"
            soup = BeautifulSoup(html)
            jstt = soup.find_all("p", "js-tweet-text")
            tweets = list(set([x.get_text() for x in jstt]))
            # print len(tweets)
            # print tweets
            if (len(tweets)) > 1:
                continue

            text = tweets[0]
            cache[sid] = tweets[0]

            for j in soup.find_all("input", "json-data", id="init-data"):
                js = json.loads(j['value'])
                if js.has_key("embedData"):
                    tweet = js["embedData"]["status"]
                    text = js["embedData"]["status"]["text"]
                    cache[sid] = text
                    break
        except Exception as e:
            print(e)
            continue

    if tweet is not None and tweet["id_str"] != sid:
        text = "Not Available"
        cache[sid] = "Not Available"
    text = text.replace('\n', ' ', )
    text = re.sub(r'\s+', ' ', text)
    # print json.dumps(tweet, indent=2)
    print("\t".join(fields + [text]).encode('utf-8'))

2 Answers:

Answer 0: (score: 2)

str.replace here uses replace in its unbound form (the method is called on the type str itself rather than on a str object).

Used this way, str.replace actually takes 3 arguments: the string to operate on, the character or string to be replaced, and the new character or string.

'abcd'.replace('d', 'z') is equivalent to str.replace('abcd', 'd', 'z'):

print('abcd'.replace('d', 'z'))
# abcz
print(str.replace('abcd', 'd', 'z'))
# abcz
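Incidentally, this is also why the attempts in the question failed: in str.replace("</html>", ""), the literal "</html>" is consumed as the string being operated on, so only one of the required (old, new) pair remains. A minimal reproduction (the exact wording of the TypeError varies by Python version):

```python
try:
    str.replace("</html>", "")  # "</html>" fills the first slot; the new string is missing
except TypeError as e:
    print(e)  # e.g. replace() takes at least 2 arguments (1 given)

# Supplying all three arguments works:
result = str.replace("<body></body></html>", "</html>", "")
print(result)  # <body></body>
```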

Answer 1: (score: 0)

I have accepted the solution provided by @DeepSpace as the answer, since it helped me realise how to overcome the problem I was facing. The code below now runs under Python 3 when executed from the command prompt as follows (note that I ran it from the Windows command prompt):

python download_tweets.py input_file.tsv > output_file.tsv. The code follows:

#!/usr/bin/python

import sys
import urllib.request
import re
import json

from bs4 import BeautifulSoup

import socket

socket.setdefaulttimeout(10)

cache = {}

for line in open(sys.argv[1]):
    fields = line.rstrip('\n').split('\t')
    sid = fields[0]
    uid = fields[1]

    tweet = None
    text = "Not Available"
    if sid in cache:
        text = cache[sid]
    else:
        try:
           f = urllib.request.urlopen("http://twitter.com/%s/status/%s" % (uid, sid))
           # print('URL: ', f.geturl())
           # Thanks to Arturo!
           html = str.replace(str(f.read(), 'utf-8'), "</html>", "")
           # html = f.read().replace("</html>", "") + "</html>" # original line
           soup = BeautifulSoup(html, "lxml") # added "lxml" as it was giving warnings
           jstt = soup.find_all("p", "js-tweet-text")
           tweets = list(set([x.get_text() for x in jstt]))
           # print(len(tweets))
           if len(tweets) > 1:
               continue

           text = tweets[0]
           cache[sid] = tweets[0]

           for j in soup.find_all("input", "json-data", id="init-data"):
               js = json.loads(j['value'])
               if "embedData" in js:
                   # if js.has_key("embedData"): # original line
                   tweet = js["embedData"]["status"]
                   text = js["embedData"]["status"]["text"]
                   cache[sid] = text
                   break
        except Exception as e:
            print(e)
            continue

    if tweet is not None and tweet["id_str"] != sid:
        text = "Not Available"
        cache[sid] = "Not Available"
    text = text.replace('\n', ' ')
    text = re.sub(r'\s+', ' ', text)
    # print(json.dumps(tweet, indent=2))
    print(" \t ".join(fields + [text]).encode('utf-8'))
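As a footnote, the two Python 2 to 3 changes made above can be sketched on their own (the dictionary below is a made-up stand-in for the page's embedded JSON):

```python
# dict.has_key() was removed in Python 3; membership tests use the `in` operator
js = {"embedData": {"status": {"id_str": "1", "text": "hello"}}}
assert "embedData" in js

# urlopen(...).read() returns bytes; convert to str before string operations
raw = b"<html>page</html>"
html = str(raw, "utf-8").replace("</html>", "")
print(html)  # <html>page
```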