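I have a list of 10k words in a text file, like this:
G15 KDN C30A Action Standard Air Brush Air Dilution
I am trying to convert them to lowercase tokens with this code, for later processing with GenSim: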
data = [line.strip() for line in open("C:\corpus\TermList.txt", 'r')]
texts = [[word for word in data.lower().split()] for word in data]
I get the following traceback:
AttributeError                            Traceback (most recent call last)
<ipython-input-84-33bbe380449e> in <module>()
1 data = [line.strip() for line in open("C:\corpus\TermList.txt", 'r')]
----> 2 texts = [[word for word in data.lower().split()] for word in data]
3
AttributeError: 'list' object has no attribute 'lower'
Any suggestions about what I am doing wrong and how to correct it would be greatly appreciated. Thanks!
Answer 0: (score: 11)
Try:
data = [line.strip() for line in open("C:\corpus\TermList.txt", 'r')]
texts = [[word.lower() for word in text.split()] for text in data]
You were trying to apply .lower() to data, which is a list.
.lower() is a string method, so it can only be called on strings.
Answer 1: (score: 1)
You need
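To make the difference concrete, here is a minimal sketch that uses a small in-memory list as a stand-in for the lines read from the file (the two sample strings are only an assumption for illustration). It shows that the list itself has no lower attribute while each of its elements does, and then applies the corrected comprehension:

```python
# Stand-in for the stripped lines read from the file (illustrative data only)
data = ["G15 KDN C30A", "Action Standard Air Brush"]

# The list has no .lower(), but every element of it is a str and does
print(hasattr(data, "lower"))     # False
print(hasattr(data[0], "lower"))  # True

# Iterate over the lines first, then lowercase each word of each line
texts = [[word.lower() for word in text.split()] for text in data]
print(texts)
```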
texts = [[word.lower() for word in line.split()] for line in data]
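For each line in data ([... for line in data]), the code generates a list of lowercase words ([word.lower() for word in line.split()]). Each str line contains a sequence of whitespace-separated words. line.split() converts that sequence into a list, and word.lower() converts each word to lowercase.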
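The same steps can be unrolled for a single line, which may make the nested comprehension easier to read. This is only a sketch; the variable names words and lowered are chosen here for illustration:

```python
# One line of the corpus, as shown in the question
line = "G15 KDN C30A Action Standard Air Brush Air Dilution"

# Step 1: split on whitespace to get a list of words
words = line.split()

# Step 2: lowercase every word
lowered = [word.lower() for word in words]

print(words)
print(lowered)
```

Called without arguments, split() treats any run of whitespace as a single separator, so lines with repeated spaces or tabs between words should not produce empty tokens.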
Answer 2: (score: 0)
What you did wrong is call a string method (lower()) on a list (in your case, data).
data = [line.strip() for line in open('corpus.txt', 'r')]
What you should do once you have the lines as list entries:
texts = [[words for words in sentences.lower().split()] for sentences in data]
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^*********^^^^^^^^^^^^^^^^^^^^^^*********^^^^
#you should call lower on iter. value - in our case it is "sentences"
This gives you a list of lists. Each inner list holds the lowercased words of one line.
$ tail -n 10 corpus.txt
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> data = [line.strip() for line in open('corpus.txt', 'r')]
>>> texts = [[words for words in sentences.lower().split()] for sentences in data]
>>> texts[:5]
[['g15', 'kdn', 'c30a', 'action', 'standard', 'air', 'brush', 'air', 'dilution'], ['g15', 'kdn', 'c30a', 'action', 'standard', 'air', 'brush', 'air', 'dilution'], ['g15', 'kdn', 'c30a', 'action', 'standard', 'air', 'brush', 'air', 'dilution'], ['g15', 'kdn', 'c30a', 'action', 'standard', 'air', 'brush', 'air', 'dilution'], ['g15', 'kdn', 'c30a', 'action', 'standard', 'air', 'brush', 'air', 'dilution']]
>>>
Of course, you can flatten the result or keep it as it is.
>>> flattened = reduce(lambda x,y: x+y, texts)
>>> flattened[:30]
['g15', 'kdn', 'c30a', 'action', 'standard', 'air', 'brush', 'air', 'dilution', 'g15', 'kdn', 'c30a', 'action', 'standard', 'air', 'brush', 'air', 'dilution', 'g15', 'kdn', 'c30a', 'action', 'standard', 'air', 'brush', 'air', 'dilution', 'g15', 'kdn', 'c30a']
>>>
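The session above runs on Python 2.7, where reduce is a builtin. In Python 3 it has to be imported from functools. As a sketch of an alternative, itertools.chain.from_iterable flattens one level of nesting without building an intermediate list at every step. The short texts list below is illustrative data, not the real corpus:

```python
from functools import reduce  # needed in Python 3; reduce is a builtin in Python 2
from itertools import chain

# Small stand-in for the list of token lists built earlier
texts = [["g15", "kdn"], ["c30a", "action"], ["standard"]]

# Same approach as in the session above
flat_reduce = reduce(lambda x, y: x + y, texts)

# Alternative: chain the inner lists together in a single pass
flat_chain = list(chain.from_iterable(texts))

print(flat_reduce)
print(flat_chain)
```

Both give the same flat list. The reduce version creates a new list on every concatenation, so the chain version is likely the better fit for a corpus of around 10k lines.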
Answer 3: (score: 0)
We can convert the items of a list to lowercase letters.
>>> words = ["PYTHON", "PROGRAMMING"]
>>> type((words))
>>> for i in words:
...     print(i.lower())
...
Output:
python
programming
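The loop above only prints the lowercased words. If they are needed for further processing, as in the question, a list comprehension keeps them in a new list. A minimal sketch, with lowered as an illustrative name:

```python
words = ["PYTHON", "PROGRAMMING"]

# Build a new list instead of printing each item
lowered = [w.lower() for w in words]

print(lowered)
```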