Writing scraped data to a CSV file with Beautiful Soup

Date: 2020-02-16 13:13:00

Tags: python csv beautifulsoup

This is how I scrape the data using BeautifulSoup.

import re

comments = []
users_list = []
users = driver.find_elements_by_class_name('_6lAjh')

for user in users:
    users_list.append(user.text)

i = 0
texts_list = []
texts = driver.find_elements_by_class_name('C4VMK')

for txt in texts:
    texts_list.append(txt.text.split(users_list[i])[1].replace("\r"," ").replace("\n"," "))
    i += 1

# Count the comments once, after the loop has finished
comments_count = len(users_list)

for i in range(1, comments_count):
    user = users_list[i]
    text = texts_list[i]
    print("User ",user)
    print("Text ",text)
    print()
    comments.append(users_list[i])
    comments.append(texts_list[i])
    idxs = [m.start() for m in re.finditer('@', text)]
    for idx in idxs:
        handle = text[idx:].split(" ")[0]
        print(handle)

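This is the text data I have. It includes the usernames, the comments, and the like counts from Instagram. In 'heyyy 3w1 likeReply', 'heyyy' is the comment, '3w' means the comment was written 3 weeks ago, and '1 like' is the number of likes.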

print(comments)
['User1',
 ' ? 3w1 likeReply',
 'User2',
 ' ? 3w1 likeReply',
 'User3',
 ' Looking good! Collab, DM "bruteimpact.fashion 3wReply',
 'User4',
 ' heyyy 3w5 likeReply']

I want to save it to a CSV file like the one below (three columns: ID, Comments, likes_count):

ID     Comments                                       likes_count
User1  ?                                              1
User2  ?                                              1
User3  Looking good! Collab, DM "bruteimpact.fashion  0
User4  heyyy                                          5

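This is the code I have written so far, but it is far from the result I want, and I have no idea how to get there. I also don't know how to build a separate "likes_count" column by splitting the like count out of the comment text. That said, I would be happy with a CSV file that has only the "ID" and "Text" columns, without "likes_count". Please help!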

fields = ["User", "Text"]
rows = [comments]
filename = "insta_records.csv"
with open(filename, 'w', encoding='utf-8') as csvfile: 
    csvwriter = csv.writer(csvfile) 
    csvwriter.writerow(fields) 
    csvwriter.writerows(rows) 

1 Answer:

Answer 0 (score: 1)

You have a flat list, so you can use zip() to pair each user with their comment:

comments = ['User1',
 ' ? 3w1 likeReply',
 'User2',
 ' ? 3w1 likeReply',
 'User3',
 ' Looking good! Collab, DM "bruteimpact.fashion 3wReply',
 'User4',
 ' heyyy 3w5 likeReply']

import csv

rows = []
for user, text in zip(comments[::2], comments[1::2]):
    print(user, text)
    rows.append([user, text])  # collect one [user, text] row per comment

fields = ["User", "Text"]
filename = "insta_records.csv"
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open(filename, 'w', encoding='utf-8', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(fields)
    csvwriter.writerows(rows)

Result on screen:

User1  ? 3w1 likeReply
User2  ? 3w1 likeReply
User3  Looking good! Collab, DM "bruteimpact.fashion 3wReply
User4  heyyy 3w5 likeReply

And in the file:

User,Text
User1, ? 3w1 likeReply
User2, ? 3w1 likeReply
User3," Looking good! Collab, DM ""bruteimpact.fashion 3wReply"
User4, heyyy 3w5 likeReply
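
The User3 line looks different because the csv module quotes any field that contains a comma or a double quote, and doubles the embedded quote character. A minimal sketch of that round trip, using an in-memory io.StringIO buffer in place of the real file (the buffer and the variable names are only for this example):

```python
import csv
import io

rows = [
    ["User1", " ? 3w1 likeReply"],
    ["User3", ' Looking good! Collab, DM "bruteimpact.fashion 3wReply'],
]

# Write to an in-memory buffer instead of a real file
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["User", "Text"])
writer.writerows(rows)

# Read the same text back: csv.reader undoes the quoting
buffer.seek(0)
restored = list(csv.reader(buffer))

print(buffer.getvalue())
print(restored[1:] == rows)
```

So the extra quotes in the file are harmless: any CSV reader that follows the same quoting rules gives back the original text.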

To create the other columns you first have to process the comment text, using split(), replace(), slicing [start:end], and so on.

rows = []
for user, text in zip(comments[::2], comments[1::2]):
    parts = text.rsplit(' ', 2)#[:-1]
    parts.insert(0, user)
    print(parts)
    rows.append(parts)

Result on screen:

['User1', ' ?', '3w1', 'likeReply']
['User2', ' ?', '3w1', 'likeReply']
['User3', ' Looking good! Collab, DM', '"bruteimpact.fashion', '3wReply']
['User4', ' heyyy', '3w5', 'likeReply']

But '3wReply' has no space in it, so that row does not split correctly and needs more work.
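
One possible way around the missing space is to match the whole tail of the string with a regular expression instead of splitting on spaces. This is only a sketch. It assumes every string ends with an age token such as 3w, then an optional "N like" or "N likes", then the word Reply. That layout is inferred from the four sample strings alone. TAIL and split_comment are names made up for this example.

```python
import csv
import io
import re

# ASSUMPTION: tail layout inferred from the sample data only:
#   " <age><unit>[<count> like(s)]Reply", e.g. " 3w1 likeReply" or " 3wReply"
TAIL = re.compile(r"\s(\d+[a-z]+)(?:(\d+) likes?)?Reply$")

def split_comment(text):
    """Return (comment, age, likes) for one scraped string (hypothetical helper)."""
    match = TAIL.search(text)
    if match is None:
        # Unknown layout: keep the text unchanged and report zero likes
        return text.strip(), "", 0
    comment = text[:match.start()].strip()
    likes = int(match.group(2) or 0)
    return comment, match.group(1), likes

comments = ['User1', ' ? 3w1 likeReply',
            'User2', ' ? 3w1 likeReply',
            'User3', ' Looking good! Collab, DM "bruteimpact.fashion 3wReply',
            'User4', ' heyyy 3w5 likeReply']

rows = []
for user, text in zip(comments[::2], comments[1::2]):
    comment, age, likes = split_comment(text)
    rows.append([user, comment, likes])

# Written to an in-memory buffer here; use open(..., newline='') for a real file
output = io.StringIO()
writer = csv.writer(output)
writer.writerow(["ID", "Comments", "likes_count"])
writer.writerows(rows)
print(output.getvalue())
```

A string that does not fit the assumed layout is kept whole with a like count of 0, so odd rows show up in the CSV instead of crashing the script.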

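BTW: when you have '3w5' you can use split('w') to get ['3', '5'], but the HTML may contain other text in place of 'w', so it needs more work. You may be able to split it more reliably by using more specific rules in BeautifulSoup.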
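
If you stay with string processing, a regular expression can stand in for split('w') so that the unit letter does not have to be 'w'. A small sketch, assuming the token is always a number, then one or more lowercase letters, then an optional like count. Only 'w' appears in the sample data, so the '2d1' token and the names TOKEN and split_token are hypothetical.

```python
import re

# ASSUMPTION: token shape is <number><lowercase unit letters><optional like count>
TOKEN = re.compile(r"(\d+)([a-z]+)(\d*)")

def split_token(token):
    """Return (amount, unit, likes), or None if the token has another shape."""
    match = TOKEN.fullmatch(token)
    if match is None:
        return None
    amount, unit, likes = match.groups()
    return (int(amount), unit, int(likes) if likes else 0)

# '2d1' is a made-up token to show a unit other than 'w'
for token in ["3w5", "3w", "2d1"]:
    print(token, split_token(token))
```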