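I am going to use tm for text mining. However, my CSV file is not read in correctly. Below is the dput output after reading it with the read.table function in R. There are three columns: lie, sentiment, and review. However, the review text spills into a fourth and later columns that have no column names. I am new to R and text mining. If I use read.csv, it gives me an error. Please suggest a better way to read the CSV file.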
Update
> dput(head(df))
structure(list(V1 = c("lie,sentiment,review", "f,n,'Mike\\'s",
"f,n,'i", "f,n,'After", "f,n,'Olive", "f,n,'I"), V2 = c("", "Pizza",
"really", "I", "Oil", "went"), V3 = c("", "High", "like", "went",
"Garden", "to"), V4 = c("", "Point,", "this", "shopping", "was",
"the"), V5 = c("", "NY", "buffet", "with", "very", "Chilis"),
V6 = c("", "Service", "restaurant", "some", "disappointing.",
"on"), V7 = c("", "was", "in", "of", "I", "Erie"), V8 = c("",
"very", "Marshall", "my", "expect", "Blvd"), V9 = c("", "slow",
"street.", "friend,", "good", "and"), V10 = c("", "and",
"they", "we", "food", "had"), V11 = c("", "the", "have",
"went", "and", "the"), V12 = c("", "quality", "a", "to",
"good", "worst"), V13 = c("", "was", "lot", "DODO", "service",
"meal"), V14 = c("", "low.", "of", "restaurant", "(at", "of"
), V15 = c("", "You", "selection", "for", "least!!)", "my"
), V16 = c("", "would", "of", "dinner.", "when", "life."),
V17 = c("", "think", "american,", "I", "I", "We"), V18 = c("",
"they", "japanese,", "found", "go", "arrived"), V19 = c("",
"would", "and", "worm", "out", "and"), V20 = c("", "know",
"chinese", "in", "to", "waited"), V21 = c("", "at", "dishes.",
"one", "eat.", "5"), V22 = c("", "least", "we", "of", "The",
"minutes"), V23 = c("", "how", "also", "the", "meal", "for"
), V24 = c("", "to", "got", "dishes", "was", "a"), V25 = c("",
"make", "a", ".'", "cold", "hostess,"), V26 = c("", "good",
"free", "", "when", "and"), V27 = c("", "pizza,", "drink",
"", "we", "then"), V28 = c("", "not.", "and", "", "got",
"were"), V29 = c("", "Stick", "free", "", "it,", "seated"
), V30 = c("", "to", "refill.", "", "and", "by"), V31 = c("",
"pre-made", "there", "", "the", "a"), V32 = c("", "dishes",
"are", "", "waitor", "waiter"), V33 = c("", "like", "also",
"", "had", "who"), V34 = c("", "stuffed", "different", "",
"no", "was"), V35 = c("", "pasta", "kinds", "", "manners",
"obviously"), V36 = c("", "or", "of", "", "whatsoever.",
"in"), V37 = c("", "a", "dessert.", "", "Don\\'t", "a"),
V38 = c("", "salad.", "the", "", "go", "terrible"), V39 = c("",
"You", "staff", "", "to", "mood."), V40 = c("", "should",
"is", "", "the", "We"), V41 = c("", "consider", "very", "",
"Olive", "order"), V42 = c("", "dining", "friendly.", "",
"Oil", "drinks"), V43 = c("", "else", "it", "", "Garden.",
"and"), V44 = c("", "where.'", "is", "", "\nf,n,", "it"),
V45 = c("", "", "also", "", "The", "took"), V46 = c("", "",
"quite", "", "Seven", "them"), V47 = c("", "", "cheap", "",
"Heaven", "15"), V48 = c("", "", "compared", "", "restaurant",
"minutes"), V49 = c("", "", "with", "", "was", "to"), V50 = c("",
"", "the", "", "never", "bring"), V51 = c("", "", "other",
"", "known", "us"), V52 = c("", "", "restaurant", "", "for",
"both"), V53 = c("", "", "in", "", "a", "the"), V54 = c("",
"", "syracuse", "", "superior", "wrong"), V55 = c("", "",
"area.", "", "service", "beers"), V56 = c("", "", "i", "",
"but", "which"), V57 = c("", "", "will", "", "what", "were"
), V58 = c("", "", "definitely", "", "we", "barely"), V59 = c("",
"", "coming", "", "experienced", "cold."), V60 = c("", "",
"back", "", "last", "Then"), V61 = c("", "", "here.'", "",
"week", "we"), V62 = c("", "", "", "", "was", "order"), V63 = c("",
"", "", "", "a", "an"), V64 = c("", "", "", "", "disaster.",
"appetizer"), V65 = c("", "", "", "", "The", "and"), V66 = c("",
"", "", "", "waiter", "wait"), V67 = c("", "", "", "", "would",
"25"), V68 = c("", "", "", "", "not", "minutes"), V69 = c("",
"", "", "", "notice", "for"), V70 = c("", "", "", "", "us",
"cold"), V71 = c("", "", "", "", "until", "southwest"), V72 = c("",
"", "", "", "we", "egg"), V73 = c("", "", "", "", "asked",
"rolls,"), V74 = c("", "", "", "", "him", "at"), V75 = c("",
"", "", "", "4", "which"), V76 = c("", "", "", "", "times",
"point"), V77 = c("", "", "", "", "to", "we"), V78 = c("",
"", "", "", "bring", "just"), V79 = c("", "", "", "", "us",
"paid"), V80 = c("", "", "", "", "the", "and"), V81 = c("",
"", "", "", "menu.", "left."), V82 = c("", "", "", "", "The",
"Don\\'t"), V83 = c("", "", "", "", "food", "go.'"), V84 = c("",
"", "", "", "was", ""), V85 = c("", "", "", "", "not", ""
), V86 = c("", "", "", "", "exceptional", ""), V87 = c("",
"", "", "", "either.", ""), V88 = c("", "", "", "", "It",
""), V89 = c("", "", "", "", "took", ""), V90 = c("", "",
"", "", "them", ""), V91 = c("", "", "", "", "though", ""
), V92 = c("", "", "", "", "2", ""), V93 = c("", "", "",
"", "minutes", ""), V94 = c("", "", "", "", "to", ""), V95 = c("",
"", "", "", "bring", ""), V96 = c("", "", "", "", "us", ""
), V97 = c("", "", "", "", "a", ""), V98 = c("", "", "",
"", "check", ""), V99 = c("", "", "", "", "after", ""), V100 = c("",
"", "", "", "they", ""), V101 = c("", "", "", "", "spotted",
""), V102 = c("", "", "", "", "we", ""), V103 = c("", "",
"", "", "finished", ""), V104 = c("", "", "", "", "eating",
""), V105 = c("", "", "", "", "and", ""), V106 = c("", "",
"", "", "are", ""), V107 = c("", "", "", "", "not", ""),
V108 = c("", "", "", "", "ordering", ""), V109 = c("", "",
"", "", "more.", ""), V110 = c("", "", "", "", "Well,", ""
), V111 = c("", "", "", "", "never", ""), V112 = c("", "",
"", "", "more.", ""), V113 = c("", "", "", "", "\nf,n,",
""), V114 = c("", "", "", "", "I", ""), V115 = c("", "",
"", "", "went", ""), V116 = c("", "", "", "", "to", ""),
V117 = c("", "", "", "", "XYZ", ""), V118 = c("", "", "",
"", "restaurant", ""), V119 = c("", "", "", "", "and", ""
), V120 = c("", "", "", "", "had", ""), V121 = c("", "",
"", "", "a", ""), V122 = c("", "", "", "", "terrible", ""
), V123 = c("", "", "", "", "experience.", ""), V124 = c("",
"", "", "", "I", ""), V125 = c("", "", "", "", "had", ""),
V126 = c("", "", "", "", "a", ""), V127 = c("", "", "", "",
"YELP", ""), V128 = c("", "", "", "", "Free", ""), V129 = c("",
"", "", "", "Appetizer", ""), V130 = c("", "", "", "", "coupon",
""), V131 = c("", "", "", "", "which", ""), V132 = c("",
"", "", "", "could", ""), V133 = c("", "", "", "", "be",
""), V134 = c("", "", "", "", "applied", ""), V135 = c("",
"", "", "", "upon", ""), V136 = c("", "", "", "", "checking",
""), V137 = c("", "", "", "", "in", ""), V138 = c("", "",
"", "", "to", ""), V139 = c("", "", "", "", "the", ""), V140 = c("",
"", "", "", "restaurant.", ""), V141 = c("", "", "", "",
"The", ""), V142 = c("", "", "", "", "person", ""), V143 = c("",
"", "", "", "serving", ""), V144 = c("", "", "", "", "us",
""), V145 = c("", "", "", "", "was", ""), V146 = c("", "",
"", "", "very", ""), V147 = c("", "", "", "", "rude", ""),
V148 = c("", "", "", "", "and", ""), V149 = c("", "", "",
"", "didn\\'t", ""), V150 = c("", "", "", "", "acknowledge",
""), V151 = c("", "", "", "", "the", ""), V152 = c("", "",
"", "", "coupon.", ""), V153 = c("", "", "", "", "When",
""), V154 = c("", "", "", "", "I", ""), V155 = c("", "",
"", "", "asked", ""), V156 = c("", "", "", "", "her", ""),
V157 = c("", "", "", "", "about", ""), V158 = c("", "", "",
"", "it,", ""), V159 = c("", "", "", "", "she", ""), V160 = c("",
"", "", "", "rudely", ""), V161 = c("", "", "", "", "replied",
""), V162 = c("", "", "", "", "back", ""), V163 = c("", "",
"", "", "saying", ""), V164 = c("", "", "", "", "she", ""
), V165 = c("", "", "", "", "had", ""), V166 = c("", "",
"", "", "already", ""), V167 = c("", "", "", "", "applied",
""), V168 = c("", "", "", "", "it.", ""), V169 = c("", "",
"", "", "Then", ""), V170 = c("", "", "", "", "I", ""), V171 = c("",
"", "", "", "inquired", ""), V172 = c("", "", "", "", "about",
""), V173 = c("", "", "", "", "the", ""), V174 = c("", "",
"", "", "free", ""), V175 = c("", "", "", "", "salad", ""
), V176 = c("", "", "", "", "that", ""), V177 = c("", "",
"", "", "they", ""), V178 = c("", "", "", "", "serve.", ""
), V179 = c("", "", "", "", "She", ""), V180 = c("", "",
"", "", "rudely", ""), V181 = c("", "", "", "", "said", ""
), V182 = c("", "", "", "", "that", ""), V183 = c("", "",
"", "", "you", ""), V184 = c("", "", "", "", "have", ""),
V185 = c("", "", "", "", "to", ""), V186 = c("", "", "",
"", "order", ""), V187 = c("", "", "", "", "the", ""), V188 = c("",
"", "", "", "main", ""), V189 = c("", "", "", "", "course",
""), V190 = c("", "", "", "", "to", ""), V191 = c("", "",
"", "", "get", ""), V192 = c("", "", "", "", "that.", ""),
V193 = c("", "", "", "", "Overall,", ""), V194 = c("", "",
"", "", "I", ""), V195 = c("", "", "", "", "had", ""), V196 = c("",
"", "", "", "a", ""), V197 = c("", "", "", "", "bad", ""),
V198 = c("", "", "", "", "experience", ""), V199 = c("",
"", "", "", "as", ""), V200 = c("", "", "", "", "I", ""),
V201 = c("", "", "", "", "had", ""), V202 = c("", "", "",
"", "taken", ""), V203 = c("", "", "", "", "my", ""), V204 = c("",
"", "", "", "family", ""), V205 = c("", "", "", "", "to",
""), V206 = c("", "", "", "", "that", ""), V207 = c("", "",
"", "", "restaurant", ""), V208 = c("", "", "", "", "for",
""), V209 = c("", "", "", "", "the", ""), V210 = c("", "",
"", "", "first", ""), V211 = c("", "", "", "", "time", ""
), V212 = c("", "", "", "", "and", ""), V213 = c("", "",
"", "", "I", ""), V214 = c("", "", "", "", "had", ""), V215 = c("",
"", "", "", "high", ""), V216 = c("", "", "", "", "hopes",
""), V217 = c("", "", "", "", "from", ""), V218 = c("", "",
"", "", "the", ""), V219 = c("", "", "", "", "restaurant",
""), V220 = c("", "", "", "", "which", ""), V221 = c("",
"", "", "", "is,", ""), V222 = c("", "", "", "", "otherwise,",
""), V223 = c("", "", "", "", "my", ""), V224 = c("", "",
"", "", "favorite", ""), V225 = c("", "", "", "", "place",
""), V226 = c("", "", "", "", "to", ""), V227 = c("", "",
"", "", "dine.", ""), V228 = c("", "", "", "", "\nf,n,",
""), V229 = c("", "", "", "", "I", ""), V230 = c("", "",
"", "", "went", ""), V231 = c("", "", "", "", "to", ""),
V232 = c("", "", "", "", "ABC", ""), V233 = c("", "", "",
"", "restaurant", ""), V234 = c("", "", "", "", "two", ""
), V235 = c("", "", "", "", "days", ""), V236 = c("", "",
"", "", "ago", ""), V237 = c("", "", "", "", "and", ""),
V238 = c("", "", "", "", "I", ""), V239 = c("", "", "", "",
"hated", ""), V240 = c("", "", "", "", "the", ""), V241 = c("",
"", "", "", "food", ""), V242 = c("", "", "", "", "and",
""), V243 = c("", "", "", "", "the", ""), V244 = c("", "",
"", "", "service.", ""), V245 = c("", "", "", "", "We", ""
), V246 = c("", "", "", "", "were", ""), V247 = c("", "",
"", "", "kept", ""), V248 = c("", "", "", "", "waiting",
""), V249 = c("", "", "", "", "for", ""), V250 = c("", "",
"", "", "over", ""), V251 = c("", "", "", "", "an", ""),
V252 = c("", "", "", "", "hour", ""), V253 = c("", "", "",
"", "just", ""), V254 = c("", "", "", "", "to", ""), V255 = c("",
"", "", "", "get", ""), V256 = c("", "", "", "", "seated",
""), V257 = c("", "", "", "", "and", ""), V258 = c("", "",
"", "", "once", ""), V259 = c("", "", "", "", "we", ""),
V260 = c("", "", "", "", "ordered,", ""), V261 = c("", "",
"", "", "our", ""), V262 = c("", "", "", "", "food", ""),
V263 = c("", "", "", "", "came", ""), V264 = c("", "", "",
"", "out", ""), V265 = c("", "", "", "", "cold.", ""), V266 = c("",
"", "", "", "I", ""), V267 = c("", "", "", "", "ordered",
""), V268 = c("", "", "", "", "the", ""), V269 = c("", "",
"", "", "pasta", ""), V270 = c("", "", "", "", "and", ""),
V271 = c("", "", "", "", "it", ""), V272 = c("", "", "",
"", "was", ""), V273 = c("", "", "", "", "terrible", ""),
V274 = c("", "", "", "", "-", ""), V275 = c("", "", "", "",
"completely", ""), V276 = c("", "", "", "", "bland", ""),
V277 = c("", "", "", "", "and", ""), V278 = c("", "", "",
"", "very", ""), V279 = c("", "", "", "", "unappatizing.",
""), V280 = c("", "", "", "", "I", ""), V281 = c("", "",
"", "", "definitely", ""), V282 = c("", "", "", "", "would",
""), V283 = c("", "", "", "", "not", ""), V284 = c("", "",
"", "", "recommend", ""), V285 = c("", "", "", "", "going",
""), V286 = c("", "", "", "", "there,", ""), V287 = c("",
"", "", "", "especially", ""), V288 = c("", "", "", "", "if",
""), V289 = c("", "", "", "", "you\\'re", ""), V290 = c("",
"", "", "", "in", ""), V291 = c("", "", "", "", "a", ""),
V292 = c("", "", "", "", "hurry!'", "")), .Names = c("V1",
"V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "V11",
"V12", "V13", "V14", "V15", "V16", "V17", "V18", "V19", "V20",
"V21", "V22", "V23", "V24", "V25", "V26", "V27", "V28", "V29",
"V30", "V31", "V32", "V33", "V34", "V35", "V36", "V37", "V38",
"V39", "V40", "V41", "V42", "V43", "V44", "V45", "V46", "V47",
"V48", "V49", "V50", "V51", "V52", "V53", "V54", "V55", "V56",
"V57", "V58", "V59", "V60", "V61", "V62", "V63", "V64", "V65",
"V66", "V67", "V68", "V69", "V70", "V71", "V72", "V73", "V74",
"V75", "V76", "V77", "V78", "V79", "V80", "V81", "V82", "V83",
"V84", "V85", "V86", "V87", "V88", "V89", "V90", "V91", "V92",
"V93", "V94", "V95", "V96", "V97", "V98", "V99", "V100", "V101",
"V102", "V103", "V104", "V105", "V106", "V107", "V108", "V109",
"V110", "V111", "V112", "V113", "V114", "V115", "V116", "V117",
"V118", "V119", "V120", "V121", "V122", "V123", "V124", "V125",
"V126", "V127", "V128", "V129", "V130", "V131", "V132", "V133",
"V134", "V135", "V136", "V137", "V138", "V139", "V140", "V141",
"V142", "V143", "V144", "V145", "V146", "V147", "V148", "V149",
"V150", "V151", "V152", "V153", "V154", "V155", "V156", "V157",
"V158", "V159", "V160", "V161", "V162", "V163", "V164", "V165",
"V166", "V167", "V168", "V169", "V170", "V171", "V172", "V173",
"V174", "V175", "V176", "V177", "V178", "V179", "V180", "V181",
"V182", "V183", "V184", "V185", "V186", "V187", "V188", "V189",
"V190", "V191", "V192", "V193", "V194", "V195", "V196", "V197",
"V198", "V199", "V200", "V201", "V202", "V203", "V204", "V205",
"V206", "V207", "V208", "V209", "V210", "V211", "V212", "V213",
"V214", "V215", "V216", "V217", "V218", "V219", "V220", "V221",
"V222", "V223", "V224", "V225", "V226", "V227", "V228", "V229",
"V230", "V231", "V232", "V233", "V234", "V235", "V236", "V237",
"V238", "V239", "V240", "V241", "V242", "V243", "V244", "V245",
"V246", "V247", "V248", "V249", "V250", "V251", "V252", "V253",
"V254", "V255", "V256", "V257", "V258", "V259", "V260", "V261",
"V262", "V263", "V264", "V265", "V266", "V267", "V268", "V269",
"V270", "V271", "V272", "V273", "V274", "V275", "V276", "V277",
"V278", "V279", "V280", "V281", "V282", "V283", "V284", "V285",
"V286", "V287", "V288", "V289", "V290", "V291", "V292"), row.names = c(NA,
6L), class = "data.frame")
Dataset:
lie sentiment review
f n 'Mike\'s Pizza High Point NY Service was very slow and the quality was low. You would think they would know at least how to make good pizza not. Stick to pre-made dishes like stuffed pasta or a salad. You should consider dining else where.'
f n 'i really like this buffet restaurant in Marshall street. they have a lot of selection of american japanese and chinese dishes. we also got a free drink and free refill. there are also different kinds of dessert. the staff is very friendly. it is also quite cheap compared with the other restaurant in syracuse area. i will definitely coming back here.'
f n 'After I went shopping with some of my friend we went to DODO restaurant for dinner. I found worm in one of the dishes .'
f n 'Olive Oil Garden was very disappointing. I expect good food and good service (at least!!) when I go out to eat. The meal was cold when we got it and the waitor had no manners whatsoever. Don\'t go to the Olive Oil Garden. '
f n 'The Seven Heaven restaurant was never known for a superior service but what we experienced last week was a disaster. The waiter would not notice us until we asked him 4 times to bring us the menu. The food was not exceptional either. It took them though 2 minutes to bring us a check after they spotted we finished eating and are not ordering more. Well never more. '
f n 'I went to XYZ restaurant and had a terrible experience. I had a YELP Free Appetizer coupon which could be applied upon checking in to the restaurant. The person serving us was very rude and didn\'t acknowledge the coupon. When I asked her about it she rudely replied back saying she had already applied it. Then I inquired about the free salad that they serve. She rudely said that you have to order the main course to get that. Overall I had a bad experience as I had taken my family to that restaurant for the first time and I had high hopes from the restaurant which is otherwise my favorite place to dine. '
f n 'I went to ABC restaurant two days ago and I hated the food and the service. We were kept waiting for over an hour just to get seated and once we ordered our food came out cold. I ordered the pasta and it was terrible - completely bland and very unappatizing. I definitely would not recommend going there especially if you\'re in a hurry!'
f n 'I went to the Chilis on Erie Blvd and had the worst meal of my life. We arrived and waited 5 minutes for a hostess and then were seated by a waiter who was obviously in a terrible mood. We order drinks and it took them 15 minutes to bring us both the wrong beers which were barely cold. Then we order an appetizer and wait 25 minutes for cold southwest egg rolls at which point we just paid and left. Don\'t go.'
f n 'OMG. This restaurant is horrible. The receptionist did not greet us we just stood there and waited for five minutes. The food came late and served not warm. Me and my pet ordered a bowl of salad and a cheese pizza. The salad was not fresh the crust of a pizza was so hard like plastics. My dog didn\'t even eat that pizza. I hate this place!!!!!!!!!!'
Thanks in advance,
Answer 0 (score: 2)
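I don't know why you removed the file from your original post, @Yes Boss, but this answer is based on that file rather than on your dput output. The file basically has two problems that keep you from reading it: 1. your quote character is ' rather than the more common "; 2. ' is also used inside the review column, which is a bit too much for base R (it tries to split into new columns at those instances). Luckily, the data.table package is smarter and can deal with problem #2: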
library(data.table)
df <- fread(file = "deception.csv", quote="\'")
The resulting object will be a data.table rather than a data.frame:
> str(df)
Classes ‘data.table’ and 'data.frame': 92 obs. of 3 variables:
$ lie : chr "f" "f" "f" "f" ...
$ sentiment: chr "n" "n" "n" "n" ...
$ review : chr "Mike\\'s Pizza High Point, NY Service was very slow and the quality was low. You would think they would know at"| __truncated__ "i really like this buffet restaurant in Marshall street. they have a lot of selection of american, japanese, an"| __truncated__ "After I went shopping with some of my friend, we went to DODO restaurant for dinner. I found worm in one of the dishes ." "Olive Oil Garden was very disappointing. I expect good food and good service (at least!!) when I go out to eat."| __truncated__ ...
- attr(*, ".internal.selfref")=<externalptr>
You can turn this behaviour off by setting data.table = FALSE in fread() (although, if you are willing, I would recommend learning how to use data.table).
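As a rough illustration of the data.table = FALSE option, here is a minimal sketch. It assumes the data.table package is installed and uses a small hypothetical sample written to a temporary file rather than the original deception.csv. The sample reviews contain commas but no embedded apostrophes, so it only shows the quote handling and the return type:

```r
library(data.table)

# Hypothetical two-row sample; not the original deception.csv.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "lie,sentiment,review",
  "f,n,'Service was slow, and the food was cold.'",
  "f,n,'We waited an hour, then left.'"
), path)

# quote = "'" treats the single quote as the quoting character;
# data.table = FALSE asks fread() for a plain data.frame.
reviews <- fread(file = path, header = TRUE, quote = "'", data.table = FALSE)

class(reviews)
dim(reviews)
```

Because the commas inside the quoted reviews are not treated as separators, the result should have three columns rather than one column per comma.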
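A note of personal opinion: if you want to get into text mining, have a look at the quanteda package instead of tm. It is faster and takes a more modern approach to many tasks.
Answer 1 (score: 0)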
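For this particular text file, you need to look at the quote argument. In read.table(), the default quote argument covers both single and double quotes. Here you only need the single quote: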
df <- read.table("filename", header = TRUE, quote = "\'")
str(df)
# 'data.frame': 9 obs. of 3 variables:
# $ lie : Factor w/ 1 level "f": 1 1 1 1 1 1 1 1 1
# $ sentiment: Factor w/ 1 level "n": 1 1 1 1 1 1 1 1 1
# $ review : Factor w/ 9 levels "After I went shopping with some of my friend we went to DODO restaurant for dinner. I found worm in one of the dishes .",..: 6 2 1 7 9 5 3 4 8
That should do it for you.
I suggest reading the help file for read.table() all the way through. There is a lot to consider.
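To try the quote argument without the original file, the following is a minimal sketch in base R. The temporary file and its two rows are hypothetical, modelled on the whitespace-separated dataset shown in the question, and the sample reviews deliberately contain no escaped apostrophes:

```r
# Hypothetical two-row sample written to a temporary file.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "lie sentiment review",
  "f n 'Service was very slow and the food was cold.'",
  "f n 'I found a worm in one of the dishes.'"
), path)

# quote = "'" makes the single quote the only quoting character, so each
# quoted review stays in one field even though it contains spaces.
# stringsAsFactors = FALSE keeps the text as character rather than factor.
reviews <- read.table(path, header = TRUE, quote = "'",
                      stringsAsFactors = FALSE)

str(reviews)
```

With stringsAsFactors = FALSE the review column comes back as character, which is usually more convenient for text mining than the factor columns shown in the output above.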