I am new to Python, and I am trying to understand and use the script from this link in Anaconda running Python 3.5.2. I have had to change a few things so that the script could run under this version of Python, since it dates from 2013. The script (modified by inexperienced me) is below, and my problem is with the line `html = f.read().replace("</html>", "") + "</html>"` in the try block.

I simply cannot understand the reason for the `+ "</html>"` after the closing parenthesis. From what I have found about the `replace()` method, it takes at least two arguments: the old string and the new one. As it stands, the script jumps to `except Exception as e:` and prints `a bytes-like object is required, not 'str'`.

As far as I understand, this is because the read is done in bytes while the replace method works on strings. I tried to split the line into:

    html = f.read
    html = str.replace("</html>", "") + "</html>"

but this raises `replace() takes at least 2 arguments (1 given)`. I also tried changing the content of `html` from `bytes` to `str`, as follows:

    html = str(f.read(), 'utf-8')
    html = str.replace("</html>", "")

but this too returns the error `replace() takes two arguments (1 given)`. When I remove `html = str.replace("</html>", "") + "</html>"` entirely, and therefore skip `soup = BeautifulSoup(html)`, I end up with a warning that no parser was explicitly specified, and later with an AttributeError: `NoneType object has no attribute get_dictionary`.

Any help on what the line above is needed for, why it is used, and how to use it would be much appreciated. Thank you.
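For reference, the bytes-versus-str mismatch can be reproduced in isolation with a stand-in byte string (a made-up payload, not the real Twitter response):

```python
raw = b"<html><p>tweet text</p></html>"  # stand-in for f.read(), which returns bytes

# Calling bytes.replace() with str arguments raises the TypeError seen above
try:
    raw.replace("</html>", "")
except TypeError as err:
    print(err)  # a bytes-like object is required, not 'str'

# Option 1: decode to str first, then use str arguments
html = raw.decode('utf-8').replace("</html>", "") + "</html>"
print(html)

# Option 2: stay in bytes and use bytes arguments
html_bytes = raw.replace(b"</html>", b"") + b"</html>"
print(html_bytes.decode('utf-8'))
```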
#!/usr/bin/python
import sys
import urllib.request
import re
import json
from bs4 import BeautifulSoup
import socket

socket.setdefaulttimeout(10)

cache = {}
for line in open(sys.argv[1]):
    fields = line.rstrip('\n').split('\t')
    sid = fields[0]
    uid = fields[1]
    # url = 'http://twitter.com/%s/status/%s' % (uid, sid)
    # print url
    tweet = None
    text = "Not Available"
    if sid in cache:
        text = cache[sid]
    else:
        try:
            f = urllib.request.urlopen("http://twitter.com/%s/status/%s" % (uid, sid))
            print('URL: ', f.geturl())
            # Thanks to Arturo!
            # html = f.read()
            html = f.read().replace("</html>", "") + "</html>"
            soup = BeautifulSoup(html)
            jstt = soup.find_all("p", "js-tweet-text")
            tweets = list(set([x.get_text() for x in jstt]))
            # print len(tweets)
            # print tweets
            if (len(tweets)) > 1:
                continue
            text = tweets[0]
            cache[sid] = tweets[0]
            for j in soup.find_all("input", "json-data", id="init-data"):
                js = json.loads(j['value'])
                if js.has_key("embedData"):
                    tweet = js["embedData"]["status"]
                    text = js["embedData"]["status"]["text"]
                    cache[sid] = text
                    break
        except Exception as e:
            print(e)
            continue
    if tweet is not None and tweet["id_str"] != sid:
        text = "Not Available"
        cache[sid] = "Not Available"
    text = text.replace('\n', ' ')
    text = re.sub(r'\s+', ' ', text)
    # print json.dumps(tweet, indent=2)
    print("\t".join(fields + [text]).encode('utf-8'))
Answer 0 (score: 2)

`str.replace` used in this form is the unbound form of `replace` (the method is called from the class `str` rather than from a `str` object). `str.replace` actually takes 3 arguments: the string to operate on, the character or string to be replaced, and the new character or string.

`'abcd'.replace('d', 'z')` is equivalent to `str.replace('abcd', 'd', 'z')`:

    print('abcd'.replace('d', 'z'))
    # abcz

    print(str.replace('abcd', 'd', 'z'))
    # abcz
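Applied to the line from the question, the failing attempt supplies no string to operate on; a sketch with a placeholder string (not the real page) shows why, and gives the two working forms:

```python
html = "<html><body>tweet</body></html>"  # placeholder for the decoded page

# Broken: str.replace("</html>", "") binds "</html>" as the string to
# operate on and "" as the old substring, leaving the new substring
# missing -- hence "replace() takes at least 2 arguments (1 given)"

# Working: bound method call on the string itself
print(html.replace("</html>", ""))

# Equivalent unbound call, passing the string explicitly as the first argument
print(str.replace(html, "</html>", ""))
```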
Answer 1 (score: 0)

I have accepted the solution provided by @DeepSpace as the answer, because it helped me realize how to overcome the problem I was facing. The code below now runs under Python 3 when launched from the command prompt as follows (note that I ran it from the Windows command prompt):

    python download_tweets.py input_file.tsv > output_file.tsv

The code follows:
#!/usr/bin/python
import sys
import urllib.request
import re
import json
from bs4 import BeautifulSoup
import socket

socket.setdefaulttimeout(10)

cache = {}
for line in open(sys.argv[1]):
    fields = line.rstrip('\n').split('\t')
    sid = fields[0]
    uid = fields[1]
    tweet = None
    text = "Not Available"
    if sid in cache:
        text = cache[sid]
    else:
        try:
            f = urllib.request.urlopen("http://twitter.com/%s/status/%s" % (uid, sid))
            # print('URL: ', f.geturl())
            # Thanks to Arturo!
            html = str.replace(str(f.read(), 'utf-8'), "</html>", "")
            # html = f.read().replace("</html>", "") + "</html>"  # original line
            soup = BeautifulSoup(html, "lxml")  # added "lxml" as it was giving warnings
            jstt = soup.find_all("p", "js-tweet-text")
            tweets = list(set([x.get_text() for x in jstt]))
            # print(len(tweets))
            if (len(tweets)) > 1:
                continue
            text = tweets[0]
            cache[sid] = tweets[0]
            for j in soup.find_all("input", "json-data", id="init-data"):
                js = json.loads(j['value'])
                if "embedData" in js:
                    # if js.has_key("embedData"):  # original line
                    tweet = js["embedData"]["status"]
                    text = js["embedData"]["status"]["text"]
                    cache[sid] = text
                    break
        except Exception as e:
            print(e)
            continue
    if tweet is not None and tweet["id_str"] != sid:
        text = "Not Available"
        cache[sid] = "Not Available"
    text = text.replace('\n', ' ')
    text = re.sub(r'\s+', ' ', text)
    # print(json.dumps(tweet, indent=2))
    print(" \t ".join(fields + [text]).encode('utf-8'))
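The `has_key` change in the loop reflects the removal of `dict.has_key` in Python 3; membership is now tested with the `in` operator. A sketch with a made-up `json-data` payload (not real Twitter data):

```python
import json

value = '{"embedData": {"status": {"id_str": "123", "text": "sample tweet"}}}'
js = json.loads(value)

# Python 2 style: js.has_key("embedData") -- raises AttributeError under Python 3
if "embedData" in js:
    tweet = js["embedData"]["status"]
    print(tweet["text"])  # sample tweet
```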