Training a GPT2 language model on Hinglish (Hindi + English) Twitter data

Time: 2020-03-11 09:10:11

Tags: nlp pytorch huggingface-transformers

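I am building some NLP applications for Twitter data. As a first step, I am building a tweet generator that is trained on the tweets of a specific set of users. I am using the ru_transformers repo as a reference. Many thanks to Mikhail Grankin for sharing his work and writing such a detailed article.

So far I have trained the model on a very small dataset (about 20 MB). The idea is to get a model that overfits, so that I can check that all the pieces work before moving on to training on the full dataset.

In the training dataset I have one tweet per line, with an empty line between tweets. I am using the YTTM tokenizer as described in the article. So far I have done very little preprocessing: I only removed very short tweets. Apart from that, I want to keep all other information intact. Most of the tweets are written in Hinglish (Hindi written in English script), and there is also a good number of native Hindi words as well as lots of emojis.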
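The data preparation described above can be sketched roughly as follows. This is a minimal illustration, not the code I actually ran: the function name `build_corpus` and the `min_chars` threshold are hypothetical, and the exact cutoff for a "very short" tweet is an assumption.

```python
def build_corpus(tweets, min_chars=20):
    """Return training text: one tweet per line, blank line between tweets.

    `min_chars` is a hypothetical cutoff for dropping very short tweets.
    """
    kept = []
    for tweet in tweets:
        # Collapse internal newlines and repeated spaces so that each
        # tweet occupies exactly one line in the output.
        line = " ".join(tweet.split())
        if len(line) >= min_chars:
            kept.append(line)
    if not kept:
        return ""
    # A blank line separates consecutive tweets; the file ends with a newline.
    return "\n\n".join(kept) + "\n"


if __name__ == "__main__":
    sample = ["ok", "Jai Hind Jai Bharat,\nhappy returns of the day"]
    print(build_corpus(sample), end="")
```

Everything else (hashtags, mentions, emojis, mixed scripts) passes through untouched, which matches the goal of keeping the remaining information intact.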

Below are samples generated by the model trained on this very small dataset. I specify the prompt, and the model returns 3 samples.

Prompt: "we need to"
{
    "replies": [
        "ting to the law of the state and state leaders. Our judiciary needs water listing our government also as our cooperation.\" <|n|<|n| @ pradip103 these guys will be closed and still such subjects who are alive & amp; good in state forever.",
        " started trending # terrorism <|n| <|n| Next year we are begging congress # Hindus <|n| <|n| only indians are telling and respect for others and what we are working <|n|n| Many happy returns of the day @ sard",
        " Woman ... At least approximately Indians have been almost 25% Muslim population percentage in south India and is all Indians including 30%. Only game is now."
    ]
}

Prompt: "we need to"
{
    "replies": [
        " mouga kabvan? Kisse Owaisi Ko sikhate hein?",
        " sir Mr.",
        "ఏ turned out to create mayhem against Islamism and population of India. Else how will it be chief of that India chief left?\" <|n|<|n| Khan is punching towards Suit. Including his Congi IT cell workout."
    ]
}

Prompt: "we need to"
{
    "replies": [
        " ने सोशल मीडिया पर कब्ज़ा किया था| # HinduRashtra # HDL\" <|n|||||n| @ upma23 जन्मदिन की हार्दिक शुभकामनाएं । भगवान श्री कृष्ण? <|n||||n| @ ashish_prataps धन्यवाद! Taged Champion!",
        " this might be so apt about this. My part is right in Mumbai. Jai Hind Jai Bharat?? <|n|||<|n| And yet to cry # Pigs the inhumanity.",
        "-a journo of sexual slavery.# India # ExitPoll # OlaHuUber ?? <|n| <|n|| # OlaHuUber ?? Israel-e- Medina ?? <|n| <|n| OlaHuUber ?? Media is a loser Bollywood Funny person."
    ]
}

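One obvious problem I need to fix is these "<|n|" characters. Any ideas on how to fix them? Is there anything else I am doing wrong that I should be aware of before moving on to training the full model? Is YTTM a good choice of tokenizer? It seems to me that it did a good job, but I would like to be sure. Any comments or suggestions are welcome.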
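As a stopgap, the stray markers could be cleaned from the generated text after decoding. The sketch below assumes the fragments are remnants of a newline placeholder such as `<|n|>` that was split or partially decoded. It only treats the symptom, not the underlying tokenizer or decoding issue. The names `MARKER_RUN` and `clean_reply` are hypothetical.

```python
import re

# A run of broken newline markers: it starts with "<|n|" and may continue
# with any mix of "<", ">", "|", whitespace, or further "n|" fragments.
# Assumption: every such run stands for one line break between tweets.
MARKER_RUN = re.compile(r"\s*<\|n\|(?:[<>|\s]|n\|)*")


def clean_reply(text):
    """Replace each run of stray markers with a single newline."""
    cleaned = MARKER_RUN.sub("\n", text)
    # Drop newlines left at the very start or end, but keep leading spaces,
    # since a reply is a continuation of the prompt.
    return cleaned.strip("\n")


if __name__ == "__main__":
    raw = ' started trending # terrorism <|n| <|n| Next year <|n|n| Many happy returns'
    print(clean_reply(raw))
```

Splitting the cleaned string on newlines would then give the individual generated tweets.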

0 answers:

No answers yet