Subsetting a character column against a character vector in R

Time: 2016-10-18 08:05:35

Tags: r dataframe

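I am analysing some subtitles, and I have managed to clean them and compute word frequencies. Now I want to remove all the stop words (the list that ships with the "tm" package).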

Here is a sample of the data:

words2 <- c("a", "be", "am", "you", "lannister", "wolf", "angry", "scandals", "should", "me")
frequency2 <- c(12,10,15, 20, 5, 10,8,3,9,20)
stopwordslst <- c("i","me","my","myself","we","our","ours","ourselves","you","your","yours","yourself","it","its","they","them","thei","theirs","themselves", "what",
"those","am","is","are","be","been","being","have","has","does","did","doing","would","should")

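So I tried to write a for loop; the idea was to build a logical vector and then remove every row where it is TRUE. But I cannot find the right way to do this so that the result keeps the same data.frame structure.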

Here is my attempt:

for(i in words){
if(i == stopwordslst[]){
  (data1[-i,])
 }
}

The expected result is the same data frame, but like this:

words       frequency
lannister   5
wolf        10
angry       8
scandals    3 

Thanks in advance.

2 answers:

Answer 0: (score: 0)

Iteratively removing the words that appear in stopwordslst worked for me:

df = data.frame(words=words2,frequency=frequency2)
df = df[(sapply(c(1:nrow(df)),FUN = function(x){sum(df$words[x]==stopwordslst)})==0),]
> df
      words frequency
5 lannister         5
6      wolf        10
7     angry         8
8  scandals         3
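For comparison, here is a short sketch of why a comparison such as `i == stopwordslst` cannot be used directly inside `if()`, and how `any()` collapses it to the single TRUE/FALSE needed per word. It assumes the sample vectors from the question; `is_stop` and `kept` are names introduced here for illustration only.

```r
# Sample data copied from the question
words2 <- c("a", "be", "am", "you", "lannister", "wolf", "angry", "scandals", "should", "me")
frequency2 <- c(12, 10, 15, 20, 5, 10, 8, 3, 9, 20)
stopwordslst <- c("i","me","my","myself","we","our","ours","ourselves","you","your",
                  "yours","yourself","it","its","they","them","thei","theirs",
                  "themselves","what","those","am","is","are","be","been","being",
                  "have","has","does","did","doing","would","should")

df <- data.frame(words = words2, frequency = frequency2, stringsAsFactors = FALSE)

# "be" == stopwordslst yields one logical per stop word, not a single value,
# which is why it cannot serve as an if() condition on its own.
length("be" == stopwordslst) == length(stopwordslst)

# any() reduces that vector to one TRUE/FALSE; vapply() repeats this per row.
is_stop <- vapply(df$words,
                  function(w) any(w == stopwordslst),
                  logical(1),
                  USE.NAMES = FALSE)

# Keep only the rows whose word is not a stop word.
kept <- df[!is_stop, ]
kept
```

Rows keep their original row names here, so the data frame structure is preserved and only the matching rows are dropped.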

Answer 1: (score: 0)

As @Sotos mentioned, you can negate %in% with ! to get the words to keep, and use the same index to select the frequencies.

df <- data.frame(words = words2[!words2 %in% stopwordslst],
                 frequency = frequency2[!words2 %in% stopwordslst])
df
#      words frequency
#1         a        12
#2 lannister         5
#3      wolf        10
#4     angry         8
#5  scandals         3

Note: 'a' is not in your stopwordslst, so it is kept in the result.

Or, a little cleaner:

idx <- !words2 %in% stopwordslst
df <- data.frame(words = words2[idx],frequency = frequency2[idx])
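If the words and frequencies already live in a data frame, the same logical index can be applied to its rows directly instead of rebuilding the frame. This is a minimal sketch, assuming a data frame named `data1` (the name used in the question's loop) built from the sample vectors; `result` is a name introduced here for illustration.

```r
# Sample data copied from the question
words2 <- c("a", "be", "am", "you", "lannister", "wolf", "angry", "scandals", "should", "me")
frequency2 <- c(12, 10, 15, 20, 5, 10, 8, 3, 9, 20)
stopwordslst <- c("i","me","my","myself","we","our","ours","ourselves","you","your",
                  "yours","yourself","it","its","they","them","thei","theirs",
                  "themselves","what","those","am","is","are","be","been","being",
                  "have","has","does","did","doing","would","should")

# Assumption: data1 holds the cleaned words and their counts
data1 <- data.frame(words = words2, frequency = frequency2, stringsAsFactors = FALSE)

# Row-subset with a negated %in% test; the trailing comma keeps all columns
result <- data1[!data1$words %in% stopwordslst, ]

# Optional: renumber the rows 1..n after dropping the stop words
rownames(result) <- NULL
result
```

With this shortened list, 'a' survives. With the full list from the tm package (typically obtained via `tm::stopwords("en")`, assuming the package is installed), 'a' would likely be dropped as well, which would match the expected output in the question.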