Question

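This BeautifulSoup parser works fine while looping over the data, and it prints the right things. The last line of code (the output to CSV) reports that user2 is not defined, even though it appears to be... Any ideas? (Thanks, all! It was an indentation error, and the code has now been edited. The code works!)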

import csv
from bs4 import BeautifulSoup

# Create output file and write headers
f = csv.writer(open('/Users/xx/Downloads/#parsed.csv', "w"), delimiter = '\t')
f.writerow(["date", "username", "tweet"]) #csv column headings
soup = BeautifulSoup(open("/Users/simonlindgren/Downloads/#raw.html")) #input html document 

tweetdata = soup.find_all("div", class_="content") #find anchors of each tweet
#print tweetdata
for tweet in tweetdata:
    username = tweet.find_all(class_="username js-action-profile-name")
    for user in username:
        user2 = user.get_text()
        #print user2
    date = tweet.find_all(class_="_timestamp js-short-timestamp ")
    for d in date:
        date2 = d.get_text()
        tweet = tweet.find_all(class_="js-tweet-text tweet-text")
        for t in tweet:
            tweet2 = t.get_text().encode('utf-8')
            tweet3 = tweet2.replace('\n', ' ')
            tweet4 = tweet3.replace('\"','')

    f.writerow([date2, user2, tweet4])

Answer 1

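The problem is that user2 is only assigned inside the for user in username: loop. A Python for loop does not create its own scope, but if the loop body never runs (for example, because find_all found no match for a tweet), or if the line that uses user2 sits at the wrong indentation level, user2 has not been bound when f.writerow is reached. Changing the code to f.writerow([username, date, tweet]) should run without a NameError, but I suspect it would not produce the output you want, because those values still contain HTML markup (which is why you use get_text() to extract the data from the tags).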
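As a minimal sketch of that failure mode, the Python 3 snippet below uses plain lists as stand-ins for the result of find_all; the helper name last_text is hypothetical and only serves the illustration. Inside a function the error surfaces as UnboundLocalError, which is a subclass of NameError.

```python
def last_text(matches):
    """Return the last cleaned item, mimicking the loop pattern in the question."""
    for item in matches:
        text = item.strip()
    # If `matches` was empty, `text` was never assigned, so this line fails.
    return text


print(last_text([" @alice "]))  # the loop ran once, so `text` is bound

try:
    last_text([])  # empty result: the loop body never runs
except NameError as exc:
    # UnboundLocalError is a subclass of NameError, so it is caught here.
    print("NameError:", exc)
```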

Instead, assuming each tweet has exactly one username, one date, and one tweet text body, you can change the code to the following:

tweetdata = soup.find_all("div", class_="content") #find anchors of each tweet
#print tweetdata
for tweet in tweetdata:
    # pull out our data
    username = tweet.find_all(class_="username js-action-profile-name")
    date = tweet.find_all(class_="_timestamp js-short-timestamp ")
    text = tweet.find_all(class_="js-tweet-text tweet-text")

    # tuple() takes a single iterable, so build it from a generator.
    # Every field is encoded to UTF-8 so that print and csv (Python 2)
    # can handle non-ASCII text; the order matches the CSV header.
    our_data = tuple(tag[0].get_text().encode('utf-8')
                     for tag in (date, username, text))
    print "Date: %s - User: %s - Text: %s" % our_data

    # write to CSV
    f.writerow(our_data)

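This avoids the unnecessary for loops (since each tweet has only one username, date, and text body). If you need to write the row out as a list, change our_data from a tuple to a list.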
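On the tuple-versus-list point, here is a small Python 3 sketch showing that csv.writer's writerow accepts either one. The field values are made up for illustration, and an in-memory io.StringIO buffer stands in for the real output file.

```python
import csv
import io

# Hypothetical values standing in for the text extracted from one tweet.
our_data = ("2 Jan", "@alice", "hello\nworld")

buffer = io.StringIO()  # in-memory file, so the sketch touches no real path
writer = csv.writer(buffer, delimiter="\t")
writer.writerow(["date", "username", "tweet"])  # same headings as the question

# Replace newlines so each tweet stays on one row, as the question's code does.
cleaned = tuple(field.replace("\n", " ") for field in our_data)

# writerow accepts any iterable of fields: a tuple and a list give the same row.
writer.writerow(cleaned)
writer.writerow(list(cleaned))

rows = buffer.getvalue().splitlines()
print(rows)
```

When writing to a real file in Python 3, opening it with newline="" (as the csv module documentation recommends) avoids blank lines between rows on some platforms.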

Python NameError when writing to CSV
