I want this output written out via CSV:
['https://www.lendingclub.com/loans/personal-loans' '6.16% to 35.89%']
['https://www.lendingclub.com/loans/personal-loans' '1% to 6%']
['https://www.marcus.com/us/en/personal-loans' '6.99% to 24.99%']
['https://www.marcus.com/us/en/personal-loans' '6.99% to 24.99%']
['https://www.marcus.com/us/en/personal-loans' '6.99% to 24.99%']
['https://www.marcus.com/us/en/personal-loans' '6.99% to 24.99%']
['https://www.marcus.com/us/en/personal-loans' '6.99% to 24.99%']
['https://www.discover.com/personal-loans/' '6.99% to 24.99%']
But when I run the code to write the output to CSV, only the last line ends up in the CSV file:
['https://www.discover.com/personal-loans/' '6.99% to 24.99%']
Is it because my printed output isn't comma-separated? I was trying to avoid having to insert commas by using a space as the delimiter. Let me know what you think. I'd appreciate any help with this, since the last thing I want to do is re-scrape all of this collected data.
import csv
import datetime
import re

import numpy as np
import requests as r
from bs4 import BeautifulSoup as bs

plcompetitors = ['https://www.lendingclub.com/loans/personal-loans',
                 'https://www.marcus.com/us/en/personal-loans',
                 'https://www.discover.com/personal-loans/']
# cycle through links in array until it finds APR rates/fixed or variable using regex
for link in plcompetitors:
    cdate = datetime.date.today()
    l = r.get(link)
    l.encoding = 'utf-8'
    data = l.text
    soup = bs(data, 'html.parser')
    # captures Discover's rate perfectly but catches too much for lightstream/prosper
    paragraph = soup.find_all(text=re.compile('[0-9]%'))
    for n in paragraph:
        matches = re.findall(r'(?i)\d+(?:\.\d+)?%\s*(?:to|-)\s*\d+(?:\.\d+)?%', n.string)
        try:
            irate = str(matches[0])
            array = np.asarray(irate)
            array2 = np.append(link, irate)
            array2 = np.asarray(array2)
            print(array2)
            #with open('test.csv', "w") as csv_file:
            #    writer = csv.writer(csv_file, delimiter=' ')
            #    for line in test:
            #        writer.writerow(line)
        except IndexError:
            pass
Answer 0 (score: 1)
Pandas comes in handy when working with CSV files.
import datetime
import requests as r
from bs4 import BeautifulSoup as bs
import numpy as np
import regex as re
import pandas as pd

plcompetitors = ['https://www.lendingclub.com/loans/personal-loans',
                 'https://www.marcus.com/us/en/personal-loans',
                 'https://www.discover.com/personal-loans/']
df = pd.DataFrame({'Link': [], 'APR Rate': []})
# cycle through links in array until it finds APR rates/fixed or variable using regex
for link in plcompetitors:
    cdate = datetime.date.today()
    l = r.get(link)
    l.encoding = 'utf-8'
    data = l.text
    soup = bs(data, 'html.parser')
    # captures Discover's rate perfectly but catches too much for lightstream/prosper
    paragraph = soup.find_all(text=re.compile('[0-9]%'))
    for n in paragraph:
        matches = re.findall(r'(?i)\d+(?:\.\d+)?%\s*(?:to|-)\s*\d+(?:\.\d+)?%', n.string)
        irate = ''
        try:
            irate = str(matches[0])
            df2 = pd.DataFrame({'Link': [link], 'APR Rate': [irate]})
            df = pd.concat([df, df2], join="inner")
        except IndexError:
            pass
df.to_csv('CSV_File.csv', index=False)
I have stored each link and its irate value in a data frame df2 and concatenated it to the parent data frame df.
Finally, I write the parent data frame df to a CSV file.
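As a side note, pd.concat inside a loop copies the whole frame on every iteration; an equivalent (and usually faster) pattern is to collect plain rows first and build the frame once. A minimal sketch, using hypothetical (link, rate) pairs in place of the live scraping:

```python
import pandas as pd

# hypothetical scraped results standing in for the requests/BeautifulSoup loop
scraped = [
    ('https://www.lendingclub.com/loans/personal-loans', '6.16% to 35.89%'),
    ('https://www.discover.com/personal-loans/', '6.99% to 24.99%'),
]

rows = [{'Link': link, 'APR Rate': irate} for link, irate in scraped]
df = pd.DataFrame(rows)                   # one DataFrame, built in a single step
df.to_csv('CSV_File.csv', index=False)    # same output file as above
```

The scraping loop would simply do rows.append(...) where the original code calls pd.concat.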
Answer 1 (score: 0)
I think the problem is that you are opening the file in write mode (the "w" in open('test.csv', "w")), which means Python overwrites whatever was already written to the file. I think you are looking for append mode:
# open the file before the loop, and close it after
csv_file = open('test.csv', 'a')  # change the 'w' to an 'a'
csv_file.truncate(0)  # clear the contents of the file
writer = csv.writer(csv_file, delimiter=' ')  # make the writer beforehand for efficiency
for n in paragraph:
    matches = re.findall(r'(?i)\d+(?:\.\d+)?%\s*(?:to|-)\s*\d+(?:\.\d+)?%', n.string)
    try:
        irate = str(matches[0])
        array = np.asarray(irate)
        array2 = np.append(link, irate)
        array2 = np.asarray(array2)
        print(array2)
        writer.writerow(array2)  # write the [link, rate] pair as one row
    except IndexError:
        pass
# close the file
csv_file.close()
Let me know if this doesn't work!
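The overwrite-versus-append behaviour is easy to verify in isolation. A minimal sketch, independent of the scraping code (demo.csv is just a throwaway filename):

```python
import csv

# write mode: each open() discards what was there before
for row in [['a', '1%'], ['b', '2%']]:
    with open('demo.csv', 'w', newline='') as f:
        csv.writer(f).writerow(row)   # only the last row survives

with open('demo.csv') as f:
    print(f.read().strip())           # b,2%

# append mode: rows accumulate across repeated opens
open('demo.csv', 'w').close()         # start from an empty file
for row in [['a', '1%'], ['b', '2%']]:
    with open('demo.csv', 'a', newline='') as f:
        csv.writer(f).writerow(row)

with open('demo.csv') as f:
    print(len(f.read().splitlines())) # 2
```

This is also why the original question only ever saw the last row: the file was reopened in 'w' mode on every write.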