
Time: 2019-06-19 13:11:12

Tags: python web-scraping instagram hashtag

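I am trying to analyze trends based on a given list of keywords. The keywords are coconut, avocado, and oil. I want to build a script that takes each keyword and gives me related hashtags (so for coconut it would be #coconuttree, #coconutwax, etc.). I tried using instabot: https://github.com/instabot-py/instabot.py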

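I followed a link in which the author finds related hashtags starting from an Instagram post. What I need is to modify the code below so that the keywords themselves, rather than an Instagram post, are used as the source of the hashtag search. This is that person's code: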

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import datetime

driver = webdriver.Chrome()

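# Extract description of a post from Instagram link
# NOTE: post_url is a placeholder; the page load was missing from the snippet
post_url = 'https://www.instagram.com/p/<post-id>/'
driver.get(post_url)
soup = BeautifulSoup(driver.page_source,"lxml")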
desc = " "

for item in soup.findAll('a'):
    desc= desc + " " + str(item.string)

# Extract tag list from Instagram post description
# Keep only the words that start with '#', then drop the '#' itself
taglist = [x.strip('#') for x in desc.split() if x.startswith('#')]

# (OR) Copy-paste your tag list manually here
#taglist = ['art', 'instaart', 'iblackwork']


# Define dataframe to store hashtag information
tag_df  = pd.DataFrame(columns = ['Hashtag', 'Number of Posts', 'Posting Freq (mins)'])

# Loop over each hashtag to extract information
for tag in taglist:

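    # Load the 'explore tags' page for this hashtag before parsing it
    driver.get('https://www.instagram.com/explore/tags/' + str(tag) + '/')
    soup = BeautifulSoup(driver.page_source,"lxml")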

    # Extract current hashtag name
    tagname = tag
    # Extract total number of posts in this hashtag
    # NOTE: Class name may change in the website code
    # Get the latest class name by inspecting web code
    nposts = soup.find('span', {'class': 'g47SY'}).text

    # Extract all post links from 'explore tags' page
    # Needed to extract post frequency of recent posts
    myli = []
    for a in soup.find_all('a', href=True):
        myli.append(a['href'])

    # Keep link of only 1st and 9th most recent post 
    newmyli = [x for x in myli if x.startswith('/p/')]
    del newmyli[:9]
    del newmyli[9:]
    del newmyli[1:8]

    timediff = []

    # Extract the posting time of 1st and 9th most recent post for a tag
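    for j in range(len(newmyli)):
        # Open the post page itself; links look like '/p/<post-id>/'
        driver.get('https://www.instagram.com' + str(newmyli[j]))
        soup = BeautifulSoup(driver.page_source,"lxml")

        for i in soup.findAll('time'):
            if i.has_attr('datetime'):
                timediff.append(i['datetime'])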

    # Calculate time difference between posts
    # For obtaining posting frequency
    datetimeFormat = '%Y-%m-%dT%H:%M:%S.%fZ'
    diff = datetime.datetime.strptime(timediff[0], datetimeFormat)\
        - datetime.datetime.strptime(timediff[1], datetimeFormat)
    pfreq= int(diff.total_seconds()/(9*60))

    # Add hashtag info to dataframe
    tag_df.loc[len(tag_df)] = [tagname, nposts, pfreq]


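# Close the browser once all hashtags are processed
driver.quit()

# Check the final dataframe
print(tag_df)

# CSV output for hashtag analysis
# NOTE: 'hashtag_list.csv' is an example filename
tag_df.to_csv('hashtag_list.csv')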
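
The posting-frequency step inside the loop can be pulled out into a small function that can be checked without a browser. This is only a sketch: the name `posting_freq_minutes` and the `n_posts` parameter are hypothetical, while the timestamp format and the division by 9 posts are taken from the code above.

```python
import datetime

# Timestamp format used by the 'datetime' attribute in the code above
DATETIME_FORMAT = '%Y-%m-%dT%H:%M:%S.%fZ'

def posting_freq_minutes(newest, oldest, n_posts=9):
    """Approximate minutes per post between two post timestamps.

    newest / oldest: 'datetime' strings of the 1st and 9th most recent post.
    n_posts: number of posts the time span is spread over (9 in the code above).
    """
    diff = (datetime.datetime.strptime(newest, DATETIME_FORMAT)
            - datetime.datetime.strptime(oldest, DATETIME_FORMAT))
    return int(diff.total_seconds() / (n_posts * 60))
```

Inside the loop this would be called as `posting_freq_minutes(timediff[0], timediff[1])`, which assumes both timestamps were actually found on the pages.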

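The result should look the same as in the code above: a DataFrame with 3 columns containing the hashtag, the number of posts, and the posting frequency. I think 5 related hashtags should be used for each keyword. Thank you for your time.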
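
To make the "5 related hashtags per keyword" requirement concrete, here is a rough sketch of the selection step only. It assumes a list of candidate hashtag names has already been collected somehow (for example by scraping); how to obtain that list from Instagram is not covered. The function name `related_tags` and its parameters are hypothetical.

```python
def related_tags(keyword, candidates, limit=5):
    """Return up to `limit` distinct hashtags that contain `keyword`.

    candidates: iterable of hashtag strings, with or without a leading '#'.
    The keyword itself is skipped, so 'coconut' yields e.g. 'coconuttree'.
    """
    keyword = keyword.lower()
    seen = set()
    result = []
    for raw in candidates:
        tag = raw.lstrip('#').lower()
        # Keep tags that extend the keyword and have not been seen yet
        if keyword in tag and tag != keyword and tag not in seen:
            seen.add(tag)
            result.append(tag)
        if len(result) == limit:
            break
    return result


# Example: build one tag list per keyword from a shared candidate pool
keywords = ['coconut', 'avocado', 'oil']
candidates = ['#coconuttree', '#coconutwax', 'avocadotoast', 'oliveoil', 'coconut']
taglist_by_keyword = {kw: related_tags(kw, candidates) for kw in keywords}
```

Each resulting list could then be passed through the hashtag loop above in place of `taglist`.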

0 answers:
