How to check whether words from one CSV file appear in a column of another CSV file

Time: 2019-09-08 13:18:38

Tags: python-3.x pandas csv

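I have two CSV files: dictionary.csv, which contains a list of words, and story.csv. story.csv has many columns, one of which, news_story, contains a lot of text. I want to check whether any of the words listed in dictionary.csv occur in the news_story column. Then I want to write every row whose news_story contains one of those words to a new CSV file called New.csv.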

This is the code I have tried so far:

import csv
import pandas as pd

news=pd.read_csv("story.csv")
dictionary=pd.read_csv("dictionary.csv")

pattern = '|'.join(dictionary)

exist=news['news_story'].str.contains(pattern)
for CHECK in exist:
    if not CHECK:
        news['NEWcolumn']='NO'
    else:
        news['NEWcolumn']='YES'

news.to_csv('New.csv')

I keep getting no matches, even though some rows should come out true.

story.csv

news_url news_title news_date news_story
goog.com functional 2019      This story is about a functional requirement
live.com pbandJ     2001      I made a sandwich today
key.com  uAndI      1992      A code name of a spy

dictionary.csv

red
tie
lace
books
functional

New.csv

news_url news_title news_date news_story
goog.com functional 2019      This story is about a functional requirement

1 Answer:

Answer 0: (score: 1)

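First read the word list into a Series. Pass header=None so that the first word is not consumed as a header row, and squeeze=True so that read_csv returns the single column as a Series instead of a DataFrame (the squeeze argument was deprecated in pandas 1.4 and removed in 2.0):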

dictionary=pd.read_csv("dictionary.csv", header=None, squeeze=True)
print (dictionary)
0           red
1           tie
2          lace
3         books
4    functional
Name: 0, dtype: object
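
On pandas versions where read_csv no longer accepts squeeze=True, the same Series can probably be obtained by squeezing the resulting DataFrame. The following is a minimal sketch under that assumption; the in-memory StringIO objects are stand-ins for dictionary.csv so the snippet runs without files on disk. It also shows why the attempt in the question misbehaves: iterating a DataFrame yields its column labels, so joining it produces a pattern made only of the header.

```python
import io

import pandas as pd

# Stand-in for dictionary.csv (assumption: one word per line, no header row).
dictionary_text = "red\ntie\nlace\nbooks\nfunctional\n"

# header=None keeps "red" as data instead of treating it as a column name;
# squeeze("columns") turns the one-column DataFrame into a Series.
dictionary = pd.read_csv(io.StringIO(dictionary_text), header=None).squeeze("columns")

# What the question's code does: the first line becomes the header, and
# iterating the DataFrame yields column labels rather than cell values.
as_frame = pd.read_csv(io.StringIO(dictionary_text))
broken_pattern = "|".join(as_frame)

print(list(dictionary))
print(broken_pattern)
```

With the header consumed, only the first word ever reaches the pattern, which is consistent with no row of the sample data matching.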

pattern = '|'.join(dictionary)
# to avoid matching substrings, use word boundaries
#pattern = '|'.join(r"\b{}\b".format(x) for x in dictionary)
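
To illustrate the difference between the two patterns, here is a small sketch. It assumes the dictionary words should be matched literally, so each one is passed through re.escape (an addition not in the answer), and it wraps the whole alternation in one non-capturing group. The fourth story is a hypothetical row added only to show a substring hit.

```python
import re

import pandas as pd

words = pd.Series(["red", "tie", "lace", "books", "functional"])

# Plain alternation: also matches inside longer words ("red" in "hired").
loose = "|".join(re.escape(w) for w in words)

# Word-bounded alternation: matches whole words only.
strict = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"

stories = pd.Series([
    "This story is about a functional requirement",
    "I made a sandwich today",
    "A code name of a spy",
    "She hired a new assistant",  # hypothetical row, not in story.csv
])

loose_hits = stories.str.contains(loose)
strict_hits = stories.str.contains(strict)

print(loose_hits.tolist())   # [True, False, False, True]
print(strict_hits.tolist())  # [True, False, False, False]
```

Escaping matters once the word list contains characters such as "." or "+", which would otherwise be read as regular expression syntax.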

Finally, filter with boolean indexing:

exist = news['news_story'].str.contains(pattern)
news[exist].to_csv('New.csv')

Details

print (news[exist])
   news_url  news_title  news_date  \
0  goog.com  functional       2019   

                                     news_story  
0  This story is about a functional requirement
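
Two optional refinements, sketched here as assumptions rather than as part of the answer: na=False treats rows with a missing news_story as non-matches instead of producing NaN in the mask, and index=False keeps the row index out of the output so that it has the same columns as the New.csv shown in the question. The DataFrame below rebuilds the sample data in memory, plus one hypothetical row with a missing story, and writes to a StringIO buffer instead of a real file.

```python
import io
import re

import pandas as pd

news = pd.DataFrame({
    "news_url": ["goog.com", "live.com", "key.com", "none.com"],
    "news_title": ["functional", "pbandJ", "uAndI", "empty"],
    "news_date": [2019, 2001, 1992, 2000],
    "news_story": [
        "This story is about a functional requirement",
        "I made a sandwich today",
        "A code name of a spy",
        None,  # hypothetical row with a missing story
    ],
})
dictionary = pd.Series(["red", "tie", "lace", "books", "functional"])

pattern = r"\b(?:" + "|".join(re.escape(w) for w in dictionary) + r")\b"

# na=False makes missing stories count as "no match" so the mask stays boolean.
exist = news["news_story"].str.contains(pattern, na=False)

# index=False omits the row index, leaving only the original columns.
buffer = io.StringIO()
news[exist].to_csv(buffer, index=False)
output = buffer.getvalue()
print(output)
```

Replacing the buffer with a path such as 'New.csv' would write the same content to disk.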