Python: comparing a get_text item with a list item

Time: 2017-02-20 15:38:11

Tags: python csv web-scraping comparison string-conversion

I am continuing my Python project, but I have hit a frustrating stage.

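I have a code snippet that finds the last post dates on a forum and saves each one both in a temporary variable (intended for checking each date) and in a global variable for later use.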

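However, the approach I am trying is to get all the last post dates from the forum and compare them with the dates already stored in a .csv file, to see whether there are any new posts. If there are none, there is no need to scrape the data.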

That is the exact part I am struggling with: I cannot compare my mined (get_text) elements with the items in the .csv list.

Any ideas would be appreciated. I have tried several approaches; the last one, left below, still does not work.

Code:

#Preparing csv file to be read through to check if dates match
storedDates = open(os.path.expanduser("PostDates.csv"))
csv_storedDates = csv.reader(storedDates)
dateRow = list(csv_storedDates) #Storing all the dates as a "List" object
listLength = len(dateRow) #Grabbing the csv List length
startingDate = 0 #Variable for looping through each date for each post.

lPostDate = lPostDate2 = ""

#Looping through 6 times (As that's how many pages each forum has, and collecting Next Page Link,Each Thread Title, It's Link
#.. last post date (To know how recent it is) and assigning next page link to current url, and continuing loop.
while number < 6:
    for postDate in soup.find_all(title=re.compile("^Replies:")):
        tempData = ""
        tempData += (postDate.get_text("\n", strip=True)[0:10] + "\n")
        lPostDate += (postDate.get_text("\n", strip=True)[0:10] + "\n")
        if any(tempData in s for s in dateRow[startingDate]):
            print("Matched a date" + tempData + "to one from database" + dateRow[startingDate])
            startingDate +=1
        else :
            startingDate += 1
            print("Date " + tempData + "was not matched to anything" + str(dateRow[startingDate]))

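This is only part of the code, but it is the only part I am trying to get working right now. Assume PostDates.csv already contains data. Here is what the output looks like: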

Date 02-11-2017
was not matched to anything['02-11-2017']
Date 01-10-2017
was not matched to anything['01-10-2017']
Date 02-12-2017
was not matched to anything['02-12-2017']
Date 10-01-2016
was not matched to anything['10-01-2016']
Date 09-30-2016
was not matched to anything['09-30-2016']
Date 08-10-2016
was not matched to anything['08-10-2016']
Date 10-01-2015
was not matched to anything['10-01-2015']
Date 10-01-2015
was not matched to anything['10-01-2015']
Date 08-29-2015
was not matched to anything['08-29-2015']
Date 03-16-2015
was not matched to anything['03-16-2015']
Date 07-16-2014
was not matched to anything['07-16-2014']
Date 07-13-2014
was not matched to anything['07-13-2014']
Date 02-11-2014
was not matched to anything['02-11-2014']
Date 07-02-2013
was not matched to anything['07-02-2013']
Date 06-28-2013
was not matched to anything['06-28-2013']
Date 04-22-2013
was not matched to anything['04-22-2013']
Date 05-28-2012
was not matched to anything['05-28-2012']
Date 05-25-2012
was not matched to anything['05-25-2012']
Date 05-09-2012
was not matched to anything['05-09-2012']
Date 06-10-2010
was not matched to anything['06-10-2010']
Date 01-18-2010
was not matched to anything['01-18-2010']
Date 01-18-2010
was not matched to anything['01-18-2010']
Date 12-29-2009
was not matched to anything['12-29-2009']
Date 06-08-2009
was not matched to anything['06-08-2009']
Date 02-02-2009
was not matched to anything['02-02-2009']
Date 11-24-2008
was not matched to anything['11-24-2008']
Date 09-02-2008
was not matched to anything['09-02-2008']
Date 08-07-2008
was not matched to anything['08-07-2008']
Date 06-05-2008
was not matched to anything['06-05-2008']
Date 05-22-2008
was not matched to anything['05-22-2008']
Date 04-21-2008
was not matched to anything['04-21-2008']
Date 03-29-2008
was not matched to anything['03-29-2008']
1
Date 02-11-2017
was not matched to anything['02-11-2017']
Date 01-10-2017
was not matched to anything['01-10-2017']
Date 11-07-2007
was not matched to anything['11-07-2007']
Date 11-07-2007
was not matched to anything['11-07-2007']
Date 09-19-2007
was not matched to anything['09-19-2007']
Date 09-01-2007
was not matched to anything['09-01-2007']
Date 08-31-2007
was not matched to anything['08-31-2007']
Date 08-31-2007
was not matched to anything['08-31-2007']
Date 08-30-2007
was not matched to anything['08-30-2007']
Date 08-24-2007
was not matched to anything['08-24-2007']
Date 08-19-2007
was not matched to anything['08-19-2007']
Date 08-08-2007
was not matched to anything['08-08-2007']
Date 08-03-2007
was not matched to anything['08-03-2007']
Date 07-29-2007
was not matched to anything['07-29-2007']
Date 07-18-2007
was not matched to anything['07-18-2007']
Date 06-26-2007
was not matched to anything['06-26-2007']
Date 06-26-2007
was not matched to anything['06-26-2007']
Date 01-12-2007
was not matched to anything['01-12-2007']
Date 12-05-2006
was not matched to anything['12-05-2006']
Date 11-16-2006
was not matched to anything['11-16-2006']
Date 11-05-2006
was not matched to anything['11-05-2006']
Date 11-05-2006
was not matched to anything['11-05-2006']
Date 11-03-2006
was not matched to anything['11-03-2006']
Date 09-19-2006
was not matched to anything['09-19-2006']
Date 09-19-2006
was not matched to anything['09-19-2006']
Date 09-19-2006
was not matched to anything['09-19-2006']
Date 09-12-2006
was not matched to anything['09-12-2006']
Date 08-17-2006
was not matched to anything['08-17-2006']
Date 08-07-2006
was not matched to anything['08-07-2006']
Date 08-02-2006
was not matched to anything['08-02-2006']
Date 07-16-2006
was not matched to anything['07-16-2006']
Date 07-07-2006
was not matched to anything['07-07-2006']

I have not pasted the output past page 2, since with 6 pages there is a lot of data.

This is what was scraped earlier and stored in the .csv file (the dateRow variable):

Date,
02-11-2017
01-10-2017
02-12-2017
10-01-2016
09-30-2016
08-10-2016
10-01-2015
10-01-2015
08-29-2015
03-16-2015
07-16-2014
07-13-2014
02-11-2014
07-02-2013
06-28-2013
04-22-2013
05-28-2012
05-25-2012
05-09-2012
06-10-2010
01-18-2010
01-18-2010
12-29-2009
06-08-2009
02-02-2009
11-24-2008
09-02-2008
08-07-2008
06-05-2008
05-22-2008
04-21-2008
03-29-2008
02-11-2017
01-10-2017
11-07-2007
11-07-2007
09-19-2007
09-01-2007
08-31-2007
08-31-2007
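
For reference, a file shaped like the one above can be loaded roughly as follows. This is a sketch under assumptions: `io.StringIO` stands in for the real PostDates.csv, only a few of the listed dates are included, and `stored_dates` and `is_new` are hypothetical names that do not appear in the original code.

```python
import csv
import io

# A few lines copied from the file contents above; io.StringIO stands in for
# the real PostDates.csv so the sketch runs without the file.
sample = "Date,\n02-11-2017\n01-10-2017\n02-12-2017\n"

rows = list(csv.reader(io.StringIO(sample)))
# rows[0] is the header row; every later row is a one-element list.

# Skip the header and collect the dates into a set for fast lookups.
stored_dates = {row[0].strip() for row in rows[1:] if row}


def is_new(scraped_date):
    """Return True when a scraped date is not in the stored set (hypothetical helper)."""
    return scraped_date.strip() not in stored_dates


print(rows[0], rows[1])
print(is_new("02-11-2017\n"), is_new("03-01-2017"))
```

A set ignores position, so it answers "has this date been stored before" rather than "is this date at the same index", which is an assumption about what the comparison is meant to do.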

Any suggestions on how to approach this so that matching dates are found would be greatly appreciated, thanks!

1 Answer:

Answer 0: (score: 1)

To summarize our conversation in the comments: you wrote any(tempData in s for s in dateRow[startingDate]), and I suspected a type mismatch, so it helps to look at how any() is defined:

  

any(iterable): Return True if any element of the iterable is true. If the iterable is empty, return False. Equivalent to:

def any(iterable):
    for element in iterable:
        if element:
            return True
    return False
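
As a quick check of that definition, the sketch below passes any() a generator expression and a list built from the same tests. The sample row is made up for illustration; it is only shaped like one row returned by csv.reader.

```python
# Hypothetical sample row, shaped like one row returned by csv.reader.
row = ["02-11-2017"]

# any() accepts any iterable: a generator expression or a list both work.
from_generator = any("02-11" in s for s in row)
from_list = any(["02-11" in s for s in row])

# An empty iterable gives False, as the documentation states.
from_empty = any([])

print(from_generator, from_list, from_empty)
```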

Broken apart, your code looks like this:

>>> # The parentheses make the generator expression valid on its own
>>> iterable = (tempData in s for s in dateRow[startingDate])
>>> any(iterable)
False

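But is it really iterable? Let's see:

>>> type(iterable)
<class 'generator'>

It is: a generator is iterable, so any() accepts it as written. The list version is iterable too:

>>> type([tempData in s for s in dateRow[startingDate]])
<class 'list'>

>>> hasattr([tempData in s for s in dateRow[startingDate]], '__iter__')
True

So wrapping the generator in square brackets does not change what any() returns. The mismatch comes from the values being compared: tempData ends with the "\n" appended in the loop, while the strings in dateRow do not, so tempData in s is False for every row. Strip the newline before comparing.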
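
The failing check from the question can be reproduced in isolation. This is a minimal sketch using one value from the posted output: `scraped` and `row` are stand-ins for `tempData` and `dateRow[startingDate]`, and the assumption, based on the posted code, is that the scraped string carries the trailing "\n" that the loop appends.

```python
# Stand-ins for the question's variables (value taken from the posted output).
scraped = "02-11-2017" + "\n"   # the loop appends "\n" to the 10-character date
row = ["02-11-2017"]            # one row as csv.reader returns it

# The substring test fails: "02-11-2017\n" is not contained in "02-11-2017".
with_newline = any(scraped in s for s in row)

# Stripping whitespace before comparing makes the same test succeed.
stripped = any(scraped.strip() in s for s in row)

print(with_newline, stripped)
```

If partial matches are not wanted, exact equality (`scraped.strip() == s`) is stricter than a substring test. Note also that the posted loop starts at index 0, which is the `Date,` header row, and increments `startingDate` before printing in the else branch, so the row it prints is not the row it compared.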