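Okay, so here's the problem. I have some text files with 14,000+ words in them, but the words are all on one line, so if you use an editor without word wrap you can't read the file. I therefore want to add a newline or return character to my file after at least 1000 words, at the next occurrence of a ".". My first thought was to count the lines, add them up, and insert a \n character when the total reaches 1000, but everything is on one line. That makes things much harder, and I can't find a way to do what I want short of going through the text file myself and adding the line breaks by hand, which defeats the purpose of running a Python script to do it for me automatically. Is this possible, or am I crazy for thinking so? Thanks in advance for any help you can provide! I've included my various attempts below.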
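In this attempt the code works as expected, except that instead of printing "Word Count is over 1000." about 14 times (the file's word count is 14,000 and something), it prints only once, because there is only one line to read.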
text_file = "textfile.txt"
numLines = 0
numWords = 0
numChars = 0
with open(text_file, 'r') as file:
    for line in file:
        wordsList = line.split()
        numLines += 1
        numWords += len(wordsList)
        numChars += len(line)
        if numWords > 1000:
            print("Word Count is over 1000.")
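In this next attempt I tried something similar, but I still get the same result as above. Instead of writing \n\n\n\n into the text file about 14 times, it happens only once, at the end of the file.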
def oldWordCounter(input_file):
    word_count = 0
    with open(input_file, 'r') as f:
        for line in f:
            word_count = len(line.split(' '))
            print("Word count = %s \n" % word_count)
            if word_count > 1000:
                with open(input_file, 'a') as f:
                    f.write("\n\n\n\n")
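I'm sure I'm just missing something simple, but I'm very new to Python. It has even driven me to ask a question here. I'm at my wits' end and can't seem to get any further than this. So again, thank you very much for any help you can provide on this problem!
Also, below is how I planned to add the new lines after the next period occurs. I'm not sure whether it will help, but it may give you a better idea of what I'm trying to accomplish.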
def splitOnPeriod(input_file):
    with open(input_file, "r") as f:
        for line in f:
            searchPhrase = "."
            if searchPhrase in line:
                file = open(input_file, "a")
                file.write("\n\n\n\n")
                print("found it\n")
Here is a small portion of the text I'm working with...
World headquarters, only business Google without bada bing bada boom, guess who's back inside your room. It is the Thrive time show on your radio. My name is Clay Clark, the former and recovering disc jockey. I am joined today Inside the Box rocks with with a guy. He sees he's on telling you what he's he's back in Tulsa for at least the foreseeable future, maybe maybe for several days several minutes. It'S dr. Robert zoellner, sir welcome back. I am so fired up today. I am in such a great mood and right now I could see Marshall and I could see his reaction as I get to announce why I'm so happy all really. Yes, I glorious thing happen this weekend. You'Re discovering more hair is growing and I like you're, going with that by the way this happened to do with a little support. We Americans love so much call football Hurricane football. Absolutely I mean the world. I have waited a year to get the world right again and in my Oklahoma, Sooners go up to Columbus and whoop. I mean now. Let'S talk about the facts here, cuz there's a lot of people listening. This is a business, show its business school without the BS to keep it relevant to make sure that understand this Oklahoma. If I'm correct was right, number 5 correct and I believe that Ohio state was ranked number 2. Yes, why you leave in the box of rocks? Do is In-N-Out Marshall to the drivers who don't know Marshall, for business coaches in Ohio from Ohio and he's not so he really cares about Ohio. Yes, fifth-ranked Boomer Sooners went up there and beat him was a close. Now. It wasn't even close, really really good, and so then I'm so that was Saturday and then Sunday this last weekend and I've been waiting to have Marshall in the Box, because I can't make this announcement without you really here to sit on that till Wednesday. Clear the clear that kind of thing I didn't seem last couple things on Sunday, the Dallas Cowboys won the double bonus. 
Can I will quick on this and I've loved the Patriots and Jonathan are off as he hates the Patriots, and so whenever his Giants lose, I almost feel better about their loss. I almost feel better about their loss, then actual win for the Patriots and when I saw the Cowboys just turn it on I'm like this is great. I don't care what team it is as long as they're playing the Giants. I am I'm almost. I wouldn't make a prayer chain, but I will be on the verge of making your prayer chain for your team excited to see, but I don't care who it is they beat. The Giants is a great thing for American I'm a Little Lamb lunch Wagers. I am going to whenever he pays off on The Chew very slowly and enjoy every moment of tizers have reserved, but I'll have to I'll. Have I don't normally do it, but since you're paying for it Marshall, I think I will now on Today Show we're breaking down to six books that every entrepreneur should read the six books at every entrepreneur should read, and a book number one was thinking, Grow. Rich book number to you can actually get that book for free. It is start here the book The we put together the documents, our business cyst shamelessly. So if you want to learn how to grow successful company to start here to 550 page book, it's absolutely free to download it Thrive time show. And we just hit the amazon.com best sellers list on that. So if you go to Amazon now and you type in like business Consulting into the search bar, that book actually comes up in the top five books now, and so that's a book that you can get there for free to ebook, it's absolutely free for you. We move on now to book number 3, which is Titan now. Titan is the book that documents, the Life, The Life and Times of John D Rockefeller, who actually grew up like everybody else, use Easy. You start somewhere. 
He grew up poor and at the age of 16 he began working to support his mother because his father was an absent father and actually decided to leave his family and get married to another woman without telling his current wife it's breaking down some notable quotables from That book and I'm going to go ahead and give you the first notable quotable. This is John D. Rockefeller Miss. Is it from the book tighten the author writes he had a great generals, ability to focus on his goals and a brush aside obstacles as Petty distractions. He wants said you can abuse me.
Answer 0 (score: 1)
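This code starts a new line every 1000 words, and also starts a new line and resets the count whenever it reaches a word containing a ".":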
words = s.split()
new_text = ""
word_count = 0
for word in words:
    new_text += word + " "
    word_count += 1
    if word_count == 1000 or "." in word:
        new_text += "\n"
        word_count = 0
where s is the string read from the file.
After that, just write new_text to the file.
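The loop above breaks at every period, whereas the question asks for a break only at the first period after at least 1000 words. A minimal sketch of that variant follows; the function name break_after_min_words and its parameters min_words and sentence_end are made up for illustration, and it assumes a sentence ends with a word whose last character is ".".

```python
def break_after_min_words(text, min_words=1000, sentence_end="."):
    """Insert a newline at the first sentence end once min_words have accumulated.

    Hypothetical helper: the name and parameters are illustrative assumptions.
    """
    lines = []
    current = []
    for word in text.split():
        current.append(word)
        # Break only when enough words have accumulated AND this word closes a sentence.
        if len(current) >= min_words and word.endswith(sentence_end):
            lines.append(" ".join(current))
            current = []
    if current:  # keep trailing words that never reached a break point
        lines.append(" ".join(current))
    return "\n".join(lines)
```

Because the break waits for a sentence end, lines come out at 1000 words or somewhat more, never in the middle of a sentence. Abbreviations such as "D." would also count as sentence ends under this assumption.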
Answer 1 (score: 0)
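Read all the words into a list, then write them to the file, appending '\n' after every 1000 words or after any word that contains a period.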
AllWords = []
for line in open("data_words.txt"):
    row = line.split(' ')
    AllWords += list(row)
line_breaker = 1000
i = 1
with open("/home/kiran/km/km_hadoop/data/data_wordcount_op.txt", 'a') as file:
    for word in AllWords:
        if "." in word or i == line_breaker:
            file.write(word.strip('\n') + "\n")
            i = 0
        else:
            file.write(word.strip('\n') + " ")
        i += 1
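Two details of the code above may be worth adjusting: split(' ') produces empty strings wherever two spaces are adjacent, and opening the output in 'a' mode appends a second copy of the text if the script is run twice. A hedged sketch of the same idea with those two changes is below; rewrap_file, src_path and dst_path are hypothetical names, not part of the original answer.

```python
def rewrap_file(src_path, dst_path, line_breaker=1000):
    """Copy src_path to dst_path, breaking after a period or every line_breaker words.

    Hypothetical helper: the name and parameters are illustrative assumptions.
    """
    with open(src_path, "r") as src:
        # split() with no argument drops empty strings and embedded newlines
        words = src.read().split()
    count = 0
    # "w" overwrites dst_path, so running the script twice does not duplicate text
    with open(dst_path, "w") as dst:
        for word in words:
            count += 1
            if "." in word or count == line_breaker:
                dst.write(word + "\n")
                count = 0
            else:
                dst.write(word + " ")
```

It would be called as rewrap_file("data_words.txt", "output.txt"), where the output file name is again only an example.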
Answer 2 (score: 0)
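To answer your first question, I defined a linewrapper function that takes a file and the wrap length you want. Using the modulo operator, we divide the loop index by wrap_length minus 1, since indexing starts at 0. The modulo operator lets us determine whether the index is evenly divisible by that number. For example, if wrap_length is 97 and i is 96, then 96 % 96 leaves no remainder and the value is 0, while any smaller nonzero i leaves a remainder other than 0. We also need to check that i is not 0, because 0 divided by anything leaves no remainder. You can read more about how the operator works here: https://docs.python.org/3.3/reference/expressions.html#binary-arithmetic-operations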
def linewrapper(input_file, wrap_length):
    with open(input_file, 'r') as input_file, open('output.txt', 'w') as output_file:
        for line in input_file:
            words = line.split()
            for i in range(0, len(words)):
                output_file.write('%s ' % words[i])
                if i != 0 and i % (wrap_length - 1) == 0:
                    output_file.write("\n")

linewrapper('input.txt', 100)
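One thing to watch with the check i % (wrap_length - 1) == 0: with wrap_length 100 it first fires at i = 99, after 100 words, and then at i = 198, 297 and so on, so every line after the first holds 99 words. If equal-length lines matter, counting words from 1 and testing against wrap_length itself avoids the drift. The sketch below shows that variant on a string rather than a file; wrap_words is a hypothetical name used only for illustration.

```python
def wrap_words(text, wrap_length):
    """Return text with a newline after every wrap_length words.

    Hypothetical helper: the name is an illustrative assumption.
    """
    pieces = []
    # enumerate(..., start=1) makes i the number of words seen so far,
    # so no "- 1" adjustment and no special case for zero are needed
    for i, word in enumerate(text.split(), start=1):
        pieces.append(word)
        pieces.append("\n" if i % wrap_length == 0 else " ")
    return "".join(pieces).rstrip(" ")
```

For the question's file, this would be applied to the whole text read with f.read() and the result written to a new file.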