Using regular expressions to pull data and insert it into a .csv file

Time: 2017-12-06 21:28:47

Tags: python regex

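So I'm using regular expressions to pull data from a web page. Done.

Now I'm trying to insert this data into a .csv file. No problem, right?

The trouble is that I can't get the data out of the loop I created so that I can insert it into the .csv file. It looks like the best approach is to create a list, somehow insert the data into that list, and then write the list to the csv file. But how do I do that with my current setup?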

import re
import sqlite3 as lite
import mysql.connector
import urllib.request
from bs4 import BeautifulSoup
import csv

#We're pulling info on socks from e-commerce site Aliexpress

url="https://www.aliexpress.com/premium/socks.html?SearchText=socks&ltype=wholesale&d=y&tc=ppc&blanktest=0&initiative_id=SB_20171202125044&origin=y&catId=0&isViewCP=y"

req = urllib.request.urlopen(url)
soup = BeautifulSoup(req, "html.parser")
div = soup.find_all("div", attrs={"class":"item"})

for item in div:
    title_pattern = '<img alt="(.*?)\"'
    comp = re.compile(title_pattern)
    href = re.findall(comp, str(item))
    for x in href:
        print(x)

    price_pattern = 'itemprop="price">(.*?)<'
    comp = re.compile(price_pattern)
    href = re.findall(comp, str(item))
    for x in href:
        print(x)

    seller_pattern = '<a class="store j-p4plog".*?>(.*?)<'
    comp = re.compile(seller_pattern)
    href = re.findall(comp, str(item))
    for x in href:
        print(x)

    orders_pattern = '<em title="Total Orders">.*?<'
    comp = re.compile(orders_pattern)
    href = re.findall(comp, str(item))
    for x in href:
        print(x[32:-1])

    feedback_pattern = '<a class="rate-num j-p4plog".*?>(.*)<'
    comp = re.compile(feedback_pattern)
    href = re.findall(comp, str(item))
    for x in href:
        print(x)

# Creation and insertion of CSV file

# csvfile = "aliexpress.csv"
# csv = open(csvfile, "w")
# columnTitleRow = "Title,Price,Seller,Orders,Feedback,Pair"
# csv.write(columnTitleRow)
#
# for stuff in div:
#     title = 
#     price = 
#     seller = 
#     orders = 
#     feedback = 
#     row = title + "," + price + "," + seller + "," + orders + "," + feedback + "," + "\n"
#     csv.write(row)

I'd like to be able to print these lists row by row.

1 Answer:

Answer 0: (score: 0)

  

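It looks like the best approach is to create a list, somehow insert the data into that list, and then write the list to the csv file. But how do I do that with my current setup?

Yes, you're right. Replace your print statements with appends to a list: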

data = []
for item in div:
    title_pattern = '<img alt="(.*?)\"'
    comp = re.compile(title_pattern)
    href = re.findall(comp, str(item))
    for x in href:
        data.append(x)

    price_pattern = 'itemprop="price">(.*?)<'
    comp = re.compile(price_pattern)
    href = re.findall(comp, str(item))
    for x in href:
        data.append(x)

Then:

with open("aliexpress.csv", "w", newline="") as f:
    csv.writer(f).writerow(data)
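Note that appending every match to a single list produces one long row. If you want one row per product, as the question asks, something like the following might work. This is only a sketch under assumptions: the HTML strings are made-up stand-ins for str(item) (the real Aliexpress markup may differ), only two of the five columns are shown, and the helper names first_match, build_rows and rows_to_csv are hypothetical.

```python
import csv
import io
import re

# Hypothetical stand-ins for str(item); the real page markup may differ.
items = [
    '<img alt="Cotton socks, 5 pairs" src="a.jpg"><span itemprop="price">US $3.99</span>',
    '<img alt="Wool socks" src="b.jpg"><span itemprop="price">US $7.50</span>',
]

# Same patterns as in the question, compiled once outside the loop.
title_re = re.compile(r'<img alt="(.*?)"')
price_re = re.compile(r'itemprop="price">(.*?)<')


def first_match(pattern, text):
    """Return the first captured group, or an empty string if nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else ""


def build_rows(html_items):
    """Build one [title, price] row per item instead of one flat list."""
    return [[first_match(title_re, h), first_match(price_re, h)] for h in html_items]


def rows_to_csv(rows, header):
    """Render the header and rows as CSV text using csv.writer."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(header)   # one row
    writer.writerows(rows)    # many rows at once
    return buffer.getvalue()


rows = build_rows(items)
csv_text = rows_to_csv(rows, ["Title", "Price"])
print(csv_text)
```

To write to disk instead of a string, pass a file opened with open("aliexpress.csv", "w", newline="") to csv.writer in place of the StringIO buffer.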

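From what I recall, the csv writer takes a list anyway, not a pre-rendered CSV string. That's the whole point: it takes the raw data, escapes it properly, and adds the commas for you.

Edit: As explained in the comments, I misremembered the csv writer's interface. The method that takes a list is writerow, not write. Updated.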
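To illustrate why that escaping matters, here is a small sketch comparing hand-built rows (as in the commented-out code in the question) with csv.writer. The product title is invented for the example; it contains both a comma and quotes on purpose.

```python
import csv
import io

# Invented sample row; the title contains a comma and double quotes.
row = ['Socks, "warm" edition', "US $3.99"]

# Manual concatenation: the embedded comma and quotes are not escaped.
naive_line = ",".join(row)

# csv.writer quotes the field and doubles the embedded quotes.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
escaped_line = buffer.getvalue().rstrip("\r\n")

# Read both lines back to see how many fields a CSV parser finds.
naive_fields = next(csv.reader([naive_line]))
parsed_fields = next(csv.reader([escaped_line]))

print(len(naive_fields), naive_fields)
print(len(parsed_fields), parsed_fields)
```

The hand-built line splits into more fields than were written, while the line produced by csv.writer reads back as the original two fields.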
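For anyone else tripping over the interface, a quick sketch of what is and isn't there. The sample values are invented; the point is only that writerow lives on the writer object returned by csv.writer, and that it expects a sequence of fields.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer)

# The writer object has writerow/writerows but no write method,
# and the csv module itself has no writerow function.
has_write = hasattr(writer, "write")
module_has_writerow = hasattr(csv, "writerow")

writer.writerow(["Wool socks", "US $7.50"])  # a list: one field per element
writer.writerow("abc")  # a bare string is iterated character by character

lines = buffer.getvalue().splitlines()
print(has_write, module_has_writerow)
print(lines)
```

The second call shows a common slip: passing a plain string instead of a list gives one column per character, so wrap single values in a list.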