Error combining data extracted with XPath

Time: 2017-10-10 15:48:23

Tags: python xpath

def get_user_data(self,start_url):
    html = self.session.get(url=start_url,headers=self.headers,cookies=self.cookies).content
    selector = etree.fromstring(html,etree.HTMLParser(encoding='utf-8'))

    # BEGIN
    if selector.xpath('//div[contains(@class,"c") and contains(@id,"M")]/div[2]'):
        user_id = selector.xpath('//div[contains(@class,"c") and contains(@id,"M")]/div[1]/a/@href')
        img = selector.xpath('//div[contains(@class,"c") and contains(@id,"M")]/div[2]/a/img/@src')
        praise_num = selector.xpath('//div[contains(@class,"c") and contains(@id,"M")]/div[2]/a[3]/text()')
        transmit_num = selector.xpath('//div[contains(@class,"c") and contains(@id,"M")]/div[2]/a[4]/text()')
    else:
        user_id = selector.xpath('//div[contains(@class,"c") and contains(@id,"M")]/div/a/@href')
        img = ''
        praise_num = selector.xpath('//div[contains(@class,"c") and contains(@id,"M")]/div/a[3]/text()')
        transmit_num = selector.xpath('//div[contains(@class,"c") and contains(@id,"M")]/div[2]/a[4]')
        # OVER

    contents = selector.xpath('//span[@class="ctt"]/text()')
    times = selector.xpath('//span[@class="ct"]/text()')
    for each_text, each_time in zip(contents,times):
        data = {}
        data['content'] = each_text.encode().decode('utf-8').replace('\u200b','')
        try:
            if re.search('from',each_time.encode().decode('utf-8')):
                month_day, time, device = each_time.split(maxsplit=2)
                data['mobile_phone'] = device
            else:
                month_day, time = each_time.split(maxsplit=1)
                data['mobile_phone'] = ''
            data['create_time'] = month_day +' '+ time
            data['crawl_time'] = datetime.strftime(datetime.now(),'%Y-%m-%d %H:%M:%S')

This is the HTML when the user has uploaded an image:

<div class="c" id="M_Fp01sdJgm">
    <div>
        <a class="nk" href="https://weibo.cn/thebs">figre</a>
            <img src="https://h5.sinaimg.cn/upload/2016/05/26/319/5338.gif" alt="V"/>
            <img src="https://h5.sinaimg.cn/upload/2016/05/26/319/donate_btn_s.png" alt="M"/>
      <span class="ctt">
                    ":"resampling
                    <span class="kt">resampling</span>
                    ":Cleantech entrepreneurs are splicing genes in the search for greener fuels
                ​</span>&nbsp;
                [<a href="https://weibo.cn/mblog/picAll/Fp01sdJgm?rl=2">2 pieces of the package</a>
                </div>
    <div>
        <a href="https://weibo.cn/mblog/pic/Fp01sdJgm?rl=1">
          <img src="http://wx1.sinaimg.cn/wap180/3ed2e6e8gy1fk7hohl2i5j219s0ps4qp.jpg" alt="images" class="ib" />
        </a>&nbsp;
        <a href="https://weibo.cn/mblog/oripic?id=Fp01sdJgm&amp;u=3ed2e6e8gy1fk7hohl2i5j219s0ps4qp">image</a>&nbsp;
        <a href="https://weibo.cn/attitude/Fp01sdJgm/add?uid=5757914684&amp;rl=1&amp;st=7b15a6">praise[28094]</a>&nbsp;
        <a href="https://weibo.cn/repost/Fp01sdJgm?uid=1054009064&amp;rl=1">transmit[1164]</a>&nbsp;
        <a href="https://weibo.cn/comment/Fp01sdJgm?uid=1054009064&amp;rl=1#cmtfrm" class="cc">comment[4097]</a>&nbsp;<a href="https://weibo.cn/fav/addFav/Fp01sdJgm?rl=1&amp;st=7b15a6">save</a>
        "<!---->&nbsp;"
        <span class="ct">10月05日 20:08&nbsp;from iPhone 7 Plus

This is the HTML when the user has not uploaded an image:

<div class="c" id="M_Fp01sdJgm">
    <div>
        <a class="nk" href="https://weibo.cn/thebs">figre</a>
        <a href="https://weibo.cn/mblog/pic/Fp01sdJgm?rl=1">
          <img src="http://wx1.sinaimg.cn/wap180/3ed2e6e8gy1fk7hohl2i5j219s0ps4qp.jpg" alt="images" class="ib" />
        </a>&nbsp;
        <a href="https://weibo.cn/mblog/oripic?id=Fp01sdJgm&amp;u=3ed2e6e8gy1fk7hohl2i5j219s0ps4qp">image</a>&nbsp;
        <a href="https://weibo.cn/attitude/Fp01sdJgm/add?uid=5757914684&amp;rl=1&amp;st=7b15a6">praise[28094]</a>&nbsp;
        <a href="https://weibo.cn/repost/Fp01sdJgm?uid=1054009064&amp;rl=1">transmit[1164]</a>&nbsp;
        <a href="https://weibo.cn/comment/Fp01sdJgm?uid=1054009064&amp;rl=1#cmtfrm" class="cc">comment[4097]</a>&nbsp;<a href="https://weibo.cn/fav/addFav/Fp01sdJgm?rl=1&amp;st=7b15a6">save</a>
        "<!---->&nbsp;"
        <span class="ct">10月05日 20:08&nbsp;from iPhone 7 Plus

I am trying to extract data from the HTML and put it together with zip(). I believe the code between # BEGIN and # OVER is wrong.
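One likely reason the fields do not line up: each page-wide XPath query returns a flat list for the whole page, so a post without an image simply contributes nothing to the image list. The sketch below uses made-up values (the hrefs and file name are hypothetical, not taken from the page) to show how zip() then pairs the wrong items.

```python
# Hypothetical page-wide results for three posts; only the second post has an image.
user_ids = ['/u/a', '/u/b', '/u/c']
imgs = ['b.jpg']

# zip() pairs items by position and stops at the shortest list,
# so the image of the second post ends up next to the first user.
naive = list(zip(user_ids, imgs))
print(naive)  # [('/u/a', 'b.jpg')]

# Keeping one record per post preserves the association,
# with '' standing in for a missing image.
per_post = [
    {'user_id': '/u/a', 'img': ''},
    {'user_id': '/u/b', 'img': 'b.jpg'},
    {'user_id': '/u/c', 'img': ''},
]
```

This suggests querying inside each post element rather than zipping lists collected from the whole document.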

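I need to determine whether the user has uploaded an image and then parse the HTML accordingly. After that, the strings are saved to MySQL. What I need right now are user_id, img, praise_num, and transmit_num, put together into one record. How should I write this?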
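A minimal sketch of one possible approach, assuming lxml is installed and that every post is a div with class "c" and an id starting with "M_", as in the HTML above. It loops over the posts and runs relative XPath queries inside each one, so a missing image yields an empty string instead of shifting the other fields. The names parse_posts, first, and count_from are hypothetical helpers, and selecting links by "/attitude/" and "/repost/" in the href is an assumption based on the sample markup rather than a documented page layout.

```python
import re
from lxml import etree

# Assumption: each post is <div class="c" id="M_...">, as in the sample HTML.
POST_XPATH = '//div[@class="c" and starts-with(@id, "M_")]'
COUNT_RE = re.compile(r'\[(\d+)\]')


def first(values, default=''):
    """Return the first XPath result, or a default when the list is empty."""
    return values[0] if values else default


def count_from(label):
    """Extract the number from a label such as 'praise[28094]' (0 if absent)."""
    match = COUNT_RE.search(label)
    return int(match.group(1)) if match else 0


def parse_posts(html):
    """Build one dict per post so that the fields of a post stay together.

    `html` is expected to be bytes, like `response.content` in the question.
    """
    root = etree.fromstring(html, etree.HTMLParser(encoding='utf-8'))
    posts = []
    for post in root.xpath(POST_XPATH):
        # Relative paths (starting with '.') search only inside this post.
        praise = first(post.xpath('.//a[contains(@href, "/attitude/")]/text()'))
        transmit = first(post.xpath('.//a[contains(@href, "/repost/")]/text()'))
        posts.append({
            'user_id': first(post.xpath('./div[1]/a[@class="nk"]/@href')),
            'img': first(post.xpath('.//img[@class="ib"]/@src')),
            'praise_num': count_from(praise),
            'transmit_num': count_from(transmit),
        })
    return posts
```

Selecting links by their href rather than by position (a[3], a[4]) avoids the need for separate branches, because the position of the praise and transmit links changes depending on whether the image links are present. The content and time fields from the question could be read the same way, with relative queries for span[@class="ctt"] and span[@class="ct"] inside the loop.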

0 answers:

No answers