Input data:
[{"is_sarcastic": 1, "headline": "thirtysomething scientists unveil doomsday clock of hair loss", "article_link": "https://www.theonion.com/thirtysomething-scientists-unveil-doomsday-clock-of-hai-1819586205"},
{"is_sarcastic": 0, "headline": "dem rep. totally nails why congress is falling short on gender, racial equality", "article_link": "https://www.huffingtonpost.com/entry/donna-edwards-inequality_us_57455f7fe4b055bb1170b207"}
]
Expected output:
["thirtysomething scientists unveil doomsday clock hair loss",
"dem rep totally nails why congress falling short gender racial equality"]
I was able to get the expected output with the following snippet.
import re

stop_words = ["a", "about", "above", "after", "again", "..."]
_corpus, _result = [], []
for text in data:
    # keep words that are not stop words and are longer than 2 characters
    text_clean = [word for word in re.split(r'\W+', text['headline'])
                  if word.lower() not in stop_words and len(word) > 2]
    _corpus.append(' '.join(text_clean))
    _result.append(text['is_sarcastic'])
Purely as a learning exercise, I tried to make this more concise, but I could not reproduce the same result with the snippet below.
_corpus, _result = map(list, zip(
    *[(''.join(word), text['is_sarcastic'])
      for text in data
      for word in re.split(r'\W+', text['headline'])
      if word.lower() not in stop_words and len(word) > 2]))
I get a list of words instead of a list of strings, for example ['thirtysomething', 'scientists', ...]. I am not using the join method correctly. How can I make this work?
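Edit 1: My goal is to get a list of strings, not a list of words.
Edit 2: I did not include the whole dataset because I did not think it was relevant to the question.
Edit 3: Please ignore this post; I am having trouble communicating clearly. Thank you for the help you have given me.
Edit 4: Reformatted the question.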
Answer 0 (score: 0)
You want to convert the following snippet into a list comprehension:
import re

stop_words = ["a", "about", "above", "after", "again", "..."]
_corpus, _result = [], []
for text in data:
    text_clean = [word for word in re.split(r'\W+', text['headline'])
                  if word.lower() not in stop_words and len(word) > 2]
    _corpus.append(' '.join(text_clean))
    _result.append(text['is_sarcastic'])
That is not a good idea, because the code is already hard to read! You should start by extracting a function:
def clean(headline):
    # split on non-word characters, then drop stop words and short words
    return [word for word in re.split(r'\W+', headline)
            if word.lower() not in stop_words and len(word) > 2]

_corpus, _result = [], []
for text in data:
    _corpus.append(' '.join(clean(text['headline'])))
    _result.append(text['is_sarcastic'])
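To make the behaviour of clean concrete, here is a minimal, self-contained sketch run on the second headline from the question. The STOP_WORDS name and its shortened contents are assumptions made for illustration (the real list is longer), and storing it as a set is my own choice, not part of the original code.

```python
import re

# Hypothetical, shortened stop-word list; the real one is longer.
STOP_WORDS = {"a", "about", "above", "after", "again"}

def clean(headline):
    """Split on non-word characters, then drop stop words and words of 2 characters or fewer."""
    return [word for word in re.split(r'\W+', headline)
            if word.lower() not in STOP_WORDS and len(word) > 2]

words = clean("dem rep. totally nails why congress is falling short on gender, racial equality")
print(words)
# ['dem', 'rep', 'totally', 'nails', 'why', 'congress', 'falling', 'short', 'gender', 'racial', 'equality']
print(' '.join(words))
# dem rep totally nails why congress falling short gender racial equality
```

Note that "is" and "on" disappear because of the len(word) > 2 test, not because of the stop-word list. The same test also discards the empty string that re.split produces when a headline ends in punctuation.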
If you want a list comprehension, use a single list that stores pairs:
_ret = []
for text in data:
    _ret.append((' '.join(clean(text['headline'])), text['is_sarcastic']))
# [('thirtysomething scientists unveil doomsday clock hair loss', 1), ('dem rep totally nails why congress falling short gender racial equality', 0)]
This loop converts easily into a list comprehension. To get the final result, zip the elements to rebuild two tuples:
_corpus, _result = zip(*_ret)
# ('thirtysomething scientists unveil doomsday clock hair loss', 'dem rep totally nails why congress falling short gender racial equality') (1, 0)
Or, as you did, convert them to lists:
_corpus, _result = map(list, zip(*_ret))
# ['thirtysomething scientists unveil doomsday clock hair loss', 'dem rep totally nails why congress falling short gender racial equality'] [1, 0]
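As a small sketch of what zip(*...) does here, the example below "unzips" a list of pairs shaped like _ret. The pairs and variable names are hypothetical, and the empty-input handling at the end is my own addition rather than something from the original answer.

```python
# Hypothetical (cleaned headline, label) pairs, shaped like _ret.
pairs = [("thirtysomething scientists", 1), ("dem rep totally", 0)]

# zip(*pairs) is the same as zip(pairs[0], pairs[1]): it groups all
# first items together and all second items together.
corpus, labels = zip(*pairs)                        # two tuples
corpus_list, labels_list = map(list, zip(*pairs))   # two lists

print(corpus, labels)
# ('thirtysomething scientists', 'dem rep totally') (1, 0)
print(corpus_list, labels_list)
# ['thirtysomething scientists', 'dem rep totally'] [1, 0]

# Caveat: with no pairs at all, zip(*[]) yields nothing, so unpacking
# into two names raises ValueError. One possible fallback:
try:
    a, b = zip(*[])
except ValueError:
    a, b = (), ()
```

If data can be empty, a guard of this kind (or an explicit check before unpacking) avoids the error.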
Full code:
import re

stop_words = ["a", "about", "above", "after", "again", "..."]
def clean(headline):
    return [word for word in re.split(r'\W+', headline) if word.lower() not in stop_words and len(word) > 2]

# data is the list of records shown in the question
_ret = [(' '.join(clean(text['headline'])), text['is_sarcastic']) for text in data]
_corpus, _result = map(list, zip(*_ret))
print(_corpus, _result)
# ['thirtysomething scientists unveil doomsday clock hair loss', 'dem rep totally nails why congress falling short gender racial equality'] [1, 0]
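This is not far from what you wrote, but text['is_sarcastic'] was in the wrong place: your comprehension paired the label with each individual word instead of with each joined headline.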