Removing digits from a pandas DataFrame and applying CountVectorizer

Asked: 2019-09-12 20:06:20

Tags: python pandas dataframe nlp

I have data in the following format:

   author      text
0  garyvee     A lot of people misunderstand Gary’s message o...
1  jasonfried  "I can’t remember having a goal. An actual goa...
2  biz         "Tools that can create media that looks and so...

I tried the following to clean the text:

text_data.loc[:,"text"] = text_data.text.apply(lambda x : str.lower(x))
text_data.loc[:,"text"] = text_data.text.apply(lambda x : " ".join(re.findall('[\w]+',x)))

I got output, but it still contains digits that I don't want in my text analysis:

0    a lot of people misunderstand gary s message o...
1    i can t remember having a goal an actual goal ...
2    tools that can create media that looks and sou...
Name: text, dtype: object

To remove the digits from the text strings, I tried:

text_data.loc[:,"text"] = text_data.text.apply(lambda x : " ".join(re.sub('^[0-9\.]*$','',x)))

This also produces output, but not the result I want.

How can I avoid this? And how do I implement CountVectorizer?

1 answer:

Answer 0: (score: 0)

I actually made a mistake at this stage:

text_data.loc[:,"text"] = text_data.text.apply(lambda x : " ".join(re.sub('^[0-9\.]*$','',x)))

It should be:

text_data.loc[:,"text"] = text_data.text.apply(lambda x : re.sub('^[0-9\.]*$','',x))
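
To see why the first version misbehaves, the sketch below reproduces both lines on a made-up sentence using only the standard `re` module. It also suggests that the anchored pattern `^[0-9\.]*$` matches only when the whole string consists of digits and dots, so digits inside a sentence may be left untouched. The names `sample` and `strip_digits` are hypothetical, and the `\d+` pattern is an assumed alternative rather than part of the answer above.

```python
import re

# Made-up sample sentence, not taken from the original data.
sample = "in 2019 i had 3 goals"

# First version: re.sub returns a str, and " ".join over a str
# inserts a space between every single character.
spaced_out = " ".join(re.sub(r'^[0-9\.]*$', '', sample))

# Corrected version: the ^...$ anchors only match when the *entire*
# string is digits and dots, so this sentence comes back unchanged.
anchored = re.sub(r'^[0-9\.]*$', '', sample)

def strip_digits(text):
    """Hypothetical helper: replace every run of digits, then collapse spaces."""
    return " ".join(re.sub(r'\d+', ' ', text).split())

print(spaced_out)
print(anchored)
print(strip_digits(sample))
```

Under the same assumption, the helper could be applied per row with `text_data.text.apply(strip_digits)`.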
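
The CountVectorizer part of the question is not covered above, so here is a minimal sketch assuming scikit-learn is installed. The two sentences in `docs` are made up for illustration, and the letters-only `token_pattern` is an assumption: it is one way to keep numeric tokens out of the vocabulary without a separate cleaning step.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sample documents, not taken from the original data.
docs = [
    "the 3 founders shipped 2 products",
    "the founders shipped products and more products",
]

# Assumed token pattern: words of two or more ASCII letters only, so tokens
# such as "3" or "2019" never enter the vocabulary. The default pattern,
# r"(?u)\b\w\w+\b", would keep multi-digit tokens like "2019".
vectorizer = CountVectorizer(token_pattern=r"(?u)\b[a-zA-Z]{2,}\b")
counts = vectorizer.fit_transform(docs)  # sparse matrix: documents x terms

vocab = vectorizer.vocabulary_  # dict mapping each term to its column index
print(sorted(vocab))
print(counts.shape)
```

With the DataFrame from the question, the input would presumably be `text_data["text"]` in place of `docs`.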