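I have two CSV files: dictionary.csv, which contains a list of words, and story.csv. story.csv has many columns, and one of them, news_story, contains a lot of words. I want to check whether any of the words in dictionary.csv occur in the news_story column. Then I want to write every row whose news_story contains a word from dictionary.csv to a new CSV file named New.csv. This is the code I have tried so far: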
import csv
import pandas as pd
news=pd.read_csv("story.csv")
dictionary=pd.read_csv("dictionary.csv")
pattern = '|'.join(dictionary)
exist=news['news_story'].str.contains(pattern)
for CHECK in exist:
    if not CHECK:
        news['NEWcolumn']='NO'
    else:
        news['NEWcolumn']='YES'
news.to_csv('New.csv')
I keep getting False, even though some rows should be True.
story.csv
news_url news_title news_date news_story
goog.com functional 2019 This story is about a functional requirement
live.com pbandJ 2001 I made a sandwich today
key.com uAndI 1992 A code name of a spy
dictionary.csv
red
tie
lace
books
functional
New.csv
news_url news_title news_date news_story
goog.com functional 2019 This story is about a functional requirement
Answer 0 (score: 1)
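First read the column into a Series: pass header=None so that the first value is not consumed as the header, and squeeze=True so that read_csv returns a Series (the squeeze parameter was removed from read_csv in pandas 2.0; on newer versions, call .squeeze("columns") on the result instead):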
dictionary=pd.read_csv("dictionary.csv", header=None, squeeze=True)
print (dictionary)
0 red
1 tie
2 lace
3 books
4 functional
Name: 0, dtype: object
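Why the header matters can be checked directly. The sketch below is only an illustration: it uses an in-memory string as a stand-in for dictionary.csv (an assumption, so it runs without files), and it uses DataFrame.squeeze("columns") in place of the squeeze=True argument so that it also runs on pandas 2.0 and later.

```python
import io
import pandas as pd

# In-memory stand-in for dictionary.csv (assumed contents, taken from the question).
DICTIONARY_TEXT = "red\ntie\nlace\nbooks\nfunctional\n"

# Default read: the first line becomes the column name, so iterating the
# DataFrame (which is what '|'.join does) yields only that name.
default_read = pd.read_csv(io.StringIO(DICTIONARY_TEXT))
broken_pattern = "|".join(default_read)

# header=None keeps "red" as data; squeeze("columns") turns the
# one-column DataFrame into a Series of words.
dictionary = pd.read_csv(io.StringIO(DICTIONARY_TEXT), header=None).squeeze("columns")
pattern = "|".join(dictionary)

print(broken_pattern)
print(pattern)
```

With the default read, the pattern contains only the word that was taken as the header, which would explain why nothing in news_story matched in the question's code.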
pattern = '|'.join(dictionary)
# to avoid matching substrings, use word boundaries
#pattern = '|'.join(r"\b{}\b".format(x) for x in dictionary)
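The difference between the two patterns shows up on a row that contains a dictionary word only as part of a longer word. This is a minimal sketch: the third story is a made-up sentence added for illustration, and re.escape is an extra precaution (not in the answer) in case a dictionary entry contains regex metacharacters.

```python
import re
import pandas as pd

words = ["red", "tie", "lace", "books", "functional"]

stories = pd.Series([
    "This story is about a functional requirement",
    "I made a sandwich today",
    "The replacement arrived on Friday",  # hypothetical row: "lace" appears only inside "replacement"
])

# Plain alternation matches substrings as well as whole words.
plain_pattern = "|".join(words)

# \b on both sides restricts each alternative to whole words;
# re.escape guards against special characters in the word list.
bounded_pattern = "|".join(r"\b{}\b".format(re.escape(w)) for w in words)

plain = stories.str.contains(plain_pattern)
bounded = stories.str.contains(bounded_pattern)

print(pd.DataFrame({"story": stories, "plain": plain, "bounded": bounded}))
```

Which pattern is right depends on whether partial matches such as "replacement" should count as containing "lace".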
Finally, filter with boolean indexing:
exist = news['news_story'].str.contains(pattern)
news[exist].to_csv('New.csv')
Details:
print (news[exist])
news_url news_title news_date \
0 goog.com functional 2019
news_story
0 This story is about a functional requirement
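If the YES/NO column from the question is still wanted, the loop there cannot produce it, because each iteration assigns a single value to the entire column and the last row's result wins. One possible vectorized version is sketched below; it rebuilds the sample data from the question in memory (an assumption, so it runs without the CSV files) and uses numpy.where, which is one option among several.

```python
import numpy as np
import pandas as pd

# Sample rows copied from story.csv in the question.
news = pd.DataFrame({
    "news_url": ["goog.com", "live.com", "key.com"],
    "news_title": ["functional", "pbandJ", "uAndI"],
    "news_date": [2019, 2001, 1992],
    "news_story": [
        "This story is about a functional requirement",
        "I made a sandwich today",
        "A code name of a spy",
    ],
})
words = ["red", "tie", "lace", "books", "functional"]

# One boolean per row: does news_story contain any dictionary word?
exist = news["news_story"].str.contains("|".join(words))

# Map the whole boolean Series to labels in one step instead of looping.
news["NEWcolumn"] = np.where(exist, "YES", "NO")

# Keep only matching rows; index=False leaves the row index out of the CSV text.
matches = news[exist]
csv_text = matches.to_csv(index=False)
print(csv_text)
```

Passing a file name such as 'New.csv' to to_csv, together with index=False, would write the same text to disk without the extra index column.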