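Error splitting an input txt file

Time: 2016-11-17 01:10:02

Tags: python python-3.x pycharm

I'm trying to write a program that takes two txt files named by the user, reads the keywords file and splits it into words and values, then reads the tweets file and splits each line into a location and the tweet/time.

Example keywords file (a single-spaced .txt file):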

love,10
like,5
best,10
hate,1
lol,10
better,10

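Example tweets file (note that only four lines are shown here; the actual .txt file has a few hundred):

[41.298669629999999, -81.915329330000006] 6 2011-08-28 19:02:36 Work needs to fly by ... I'm so excited to see Spy Kids 4 with then love of my life ... ARREIC

[33.702900329999999, -117.95095704000001] 6 2011-08-28 19:03:13 Today is going to be the greatest day of my life. Hired to take pictures at my best friend's gparents 50th anniversary. 60 old people. Woo.

[38.809954939999997, -77.125144050000003] 6 2011-08-28 19:07:05 I just put my life in like 5 suitcases

[27.994195699999999, -82.569434900000005] 6 2011-08-28 19:08:02 @Miss_mariiix3 is the love of my life

So far my program looks like this: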

#prompt the user for the file name of keywords file
keywordsinputfile = input("Please input file name: ")
tweetsinputfile = input ("Please input tweets file name: ")

#try to open given input file
try:
    k=open(keywordsinputfile, "r")
except IOError:
    print ("{} file not found".format(keywordsinputfile))
try:
    t=open(tweetsinputfile, "r")
except IOError:
    print ("{} file not found".format(tweetsinputfile))
    exit()

def main ():   #main function
    kinputfile = open(keywordsinputfile, "r")         #Opens File for keywords
    tinputfile = open(tweetsinputfile, "r")           #Opens file for tweets
    HappyWords = {}
    HappyValues = {}
    for line in kinputfile:                           #splits keywords
        entries = line.split(",")
        hvwords = str(entries[0])
        hvalues = int(entries[1])
        HappyWords["keywords"] = hvwords           #stores Happiness keywords
        HappyValues["values"] = hvalues            #stores Happiness Values
    for line in tinputfile:
        twoparts = line.split("]")  #splits tweet file by ] creating a location and tweet parts, tweets are ignored for now
        startlocation = (twoparts[0])   #takes the first part (the locations)
    def testing(startlocation):
        for line in startlocation:     
            intlocation = line.split("[")      #then gets rid of the "[" at the beginning of the locations
            print (intlocation)
    testing(startlocation)

main()

What I'm hoping to get out of this is the following (for many lines; the actual file contains far more than the four shown above):

41.298669629999999, -81.915329330000006
33.702900329999999, -117.95095704000001
38.809954939999997, -77.125144050000003
27.994195699999999, -82.569434900000005

What I get is:

['', '']
['2']
['7']
['.']
['9']
['9']
['4']
['1']
['9']
['5']
['6']
['9']
['9']
['9']
['9']
['9']
['9']
['9']
['9']
[',']
[' ']
['-']
['8']
['2']
['.']
['5']
['6']
['9']
['4']
['3']
['4']
['9']
['0']
['0']
['0']
['0']
['0']
['0']
['0']
['5']

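So in other words, it only processes the last line of the txt file and splits it up character by character.

After this, I have to store the locations in such a way that I can split them again, with the first part going into one list and the second part into another, for example: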

for line in locations:
    entries = line.split(",")
    latitude = int(entries[0])
    longitude = int(entries[1])

Thanks in advance!

2 answers:

Answer 0: (score: 0)

You just need to stick in a few tracing print statements to show what is going on. Here is how I did it:

for line in tinputfile:
    twoparts = line.split("]")  #splits tweet file by ] creating a location and tweet parts, tweets are ignored for now
    startlocation = (twoparts[0])   #takes the first part (the locations)
    print ("-----------")
    print ("twoparts", twoparts) 
    print ("startlocation", startlocation)
def testing(startlocation):
    for line in startlocation:     
        print ("line", line)
        intlocation = line.split("[")      #then gets rid of the "[" at the beginning of the locations
        print ("intlocation", intlocation)
testing(startlocation)

...and here is the start of the trace:

-----------
twoparts ['[41.298669629999999, -81.915329330000006', " 6 2011-08-28 19:02:36 Work needs to fly by ... I'm so excited to see Spy Kids 4 with then love of my life ... ARREIC\n"]
startlocation [41.298669629999999, -81.915329330000006
-----------
twoparts ['[33.702900329999999, -117.95095704000001', " 6 2011-08-28 19:03:13 Today is going to be the greatest day of my life. Hired to take pictures at my best friend's gparents 50th anniversary. 60 old people. Woo.\n"]
startlocation [33.702900329999999, -117.95095704000001
-----------
twoparts ['[38.809954939999997, -77.125144050000003', ' 6 2011-08-28 19:07:05 I just put my life in like 5 suitcases\n']
startlocation [38.809954939999997, -77.125144050000003
-----------
twoparts ['[27.994195699999999, -82.569434900000005', ' 6 2011-08-28 19:08:02 @Miss_mariiix3 is the love of my life\n']
startlocation [27.994195699999999, -82.569434900000005
line [
intlocation ['', '']
line 2
intlocation ['2']
line 7

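Analysis:

There are two basic problems:

  1. Your processing statement testing(startlocation) is outside the loop, so it only uses the last input line.
  2. As you can see in the output for twoparts, your desired coordinates are still in string format, not a list of floats. You need to strip off the bracket and split them apart, then convert them to float. In the current form, when you iterate over startlocation, you iterate through the characters of the string, not through the two floats.
  3. Also: why define the function in the middle of main()? It gets redefined every time main() runs. Move it above the main program; that's where well-behaved functions live. :-)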
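Points 1 and 3 can be sketched together as follows. This is a minimal illustration rather than the asker's final program: the names parse_location and read_locations are hypothetical, and io.StringIO stands in for the real tweets file.

```python
import io

def parse_location(line):
    """Return [latitude, longitude] as floats from one tweet line."""
    # Everything before the first "]" is the bracketed coordinate pair.
    location_text = line.split("]")[0]
    # Drop the leading "[" and split the two numbers on the comma.
    return [float(x) for x in location_text.lstrip("[").split(",")]

def read_locations(tweet_file):
    """Parse every line of an open tweets file, not just the last one."""
    locations = []
    for line in tweet_file:
        if line.strip():  # skip blank lines
            locations.append(parse_location(line))  # called inside the loop
    return locations

# io.StringIO is a stand-in for the real tweets file (an assumption for this demo).
sample = io.StringIO(
    "[38.809954939999997, -77.125144050000003] 6 2011-08-28 19:07:05 I just put my life in like 5 suitcases\n"
    "[27.994195699999999, -82.569434900000005] 6 2011-08-28 19:08:02 @Miss_mariiix3 is the love of my life\n"
)
for latitude, longitude in read_locations(sample):
    print(latitude, longitude)
```

Both functions are defined once at module level, and the parsing call sits inside the loop so every line is handled.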

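    Added information on point 2:

    Let's step through your code using the last line of the sample input. Start at the top of the loop, with a line fetched from tinputfile: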

    twoparts = line.split("]")
    

    twoparts is now a pair of elements, two strings:

    ['[27.994195699999999, -82.569434900000005',
     ' 6 2011-08-28 19:08:02 @Miss_mariiix3 is the love of my life\n']
    

    Then startlocation is set to the first element:

    '[27.994195699999999, -82.569434900000005'
    

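    Next comes the definition of the function testing, which changes nothing in the data. The statement after that calls testing, and we enter the routine.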

    testing(startlocation)
    for line in startlocation:
    

    The important part here is that startlocation is a string:

    '[27.994195699999999, -82.569434900000005'
    

    ...so when you execute that loop, you iterate through the string, one character at a time.
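The effect is easy to reproduce in isolation. The short string below is a made-up stand-in for startlocation; the behaviour is the same for the full coordinate string.

```python
startlocation = "[27.99, -82.56"  # shortened, hypothetical sample value

# Iterating over a string yields one-character strings.
chars = [ch for ch in startlocation]
print(chars[:3])

# Splitting the single character "[" on "[" gives two empty strings;
# any other single character comes back as a one-element list.
print("[".split("["))
print("2".split("["))
```

This is why the posted output starts with ['', ''] and continues with one list per character.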

    Correction:

    To be honest, I don't know what testing is supposed to do. It looks like all you need to do is get rid of that leading bracket:

    intlocation = startlocation.split('[')[1]   # keep the part after the bracket
    

    ...or simply

    intlocation = startlocation[1:]
    

    If instead you want the float values as a two-element list, knock off the bracket as above, split the elements at the comma, and convert each one to float:

    intlocation = [ float(x) for x in startlocation[1:].split(',') ]
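Building on that conversion, the two separate lists mentioned in the question could be filled as below. This is only a sketch: the list names are assumptions, and the sample strings are the bracket-stripped pieces shown in the trace. Note that it uses float(), since int() raises ValueError on strings such as '41.298669629999999'.

```python
# Hypothetical input: the "startlocation" strings collected from each line.
startlocations = [
    "[41.298669629999999, -81.915329330000006",
    "[33.702900329999999, -117.95095704000001",
]

latitudes = []
longitudes = []
for startlocation in startlocations:
    # Strip the leading "[", split on the comma, convert both parts to float.
    lat, lon = [float(x) for x in startlocation[1:].split(",")]
    latitudes.append(lat)
    longitudes.append(lon)

print(latitudes)
print(longitudes)
```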
    

Answer 1: (score: 0)

It looks like what is really needed is ast.literal_eval:

import ast

for line in tinputfile:
    twoparts = line.split("]")
    startlocation = ast.literal_eval(twoparts[0] + ']')  # add the ']' back in
    # startlocation is now a list of two coordinates.

But you could still use re:

> import re
> example = '[27.994195699999999, -82.569434900000005] 6 2011-08-28 19:02:36 text text text text'
> fmt = re.split(r'\[(-?[0-9.]+),\s?(-?[0-9.]+).\s*\d\s*(\d{4}-\d{1,2}-\d{1,2}\s+\d{2}:\d{2}:\d{2})',example)
> fmt
['', '27.994195699999999', '-82.569434900000005', '2011-08-28 19:02:36', ' text text text text']
> location = (float(fmt[1]), float(fmt[2]))
> time = fmt[3]
> text = fmt[4]

So, what is going on?

Each (...) in the regular expression (the re module) tells re.split to "make this piece its own index".

The first and second are -?[0-9.]+. This matches anything that may have a minus sign followed by digits and decimal points (we could be stricter, but you don't really need to be).

The next (...) group matches any date: \d{4} means "four digits" and \d{1,2} means "one or two digits".

Alternatively, you could use the two of them together.
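One way the two approaches might be combined is sketched below. This is an illustration under stated assumptions, not necessarily what the answer had in mind: the pattern LINE_RE and the function parse_tweet are hypothetical names, and the pattern assumes every line has the shape "[lat, lon] digit date time text".

```python
import ast
import re

# Hypothetical pattern: bracketed pair, a one-digit field, a timestamp, then the text.
LINE_RE = re.compile(
    r"(\[[^\]]*\])\s*\d\s*"
    r"(\d{4}-\d{1,2}-\d{1,2}\s+\d{2}:\d{2}:\d{2})\s*(.*)"
)

def parse_tweet(line):
    """Return (location, time, text) for one tweet line."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError("unrecognised tweet line: {!r}".format(line))
    # re isolates the bracketed text; ast.literal_eval turns it into a list of floats.
    location = ast.literal_eval(match.group(1))
    return location, match.group(2), match.group(3)

example = "[27.994195699999999, -82.569434900000005] 6 2011-08-28 19:02:36 text text text text"
location, time, text = parse_tweet(example)
print(location, time, text)
```

Here re only has to find the pieces, while ast.literal_eval handles the number parsing, so the regular expression does not need to describe what a float looks like.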