Question

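I want to read 2 strings and an integer from each line of a file. The amount and position of whitespace in the file is unknown. The first string has no fixed length, and the second string is 2 characters long. The code is as follows: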

TypeError: All estimators should implement fit and transform. 'Pipeline(memory=None, steps=[('vec', Doc2vec())])' (type <class 'sklearn.pipeline.Pipeline'>) doesn't

The output is always something like:

class Doc2vec(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self


def vec(data):
    print('starting')
    SentimentDocument = namedtuple('SentimentDocument', 'words tags split sentiment')

    alldocs = []

    for line_no, line in data.iterrows():
        #tokens = gensim.utils.to_unicode(line).split()
        words = gensim.utils.simple_preprocess(line['post'])
        tags = [line_no] # 'tags = [tokens[0]]' would also work at extra memory cost
        split = ['train', 'test'][line_no//1200]  # 25k train, 25k test, 25k extra
        if gensim.utils.simple_preprocess(line['type']) == ['depression']:
                        sentiment = (1.0)
        else:
                sentiment = (0.0)
        alldocs.append(SentimentDocument(words, tags, split, sentiment))



    train_docs = [doc for doc in alldocs if doc.split == 'train']
    test_docs = [doc for doc in alldocs if doc.split == 'test']

    #print('%d docs: %d train-sentiment, %d test-sentiment' % (len(alldocs), len(train_docs), len(test_docs)))
    from random import shuffle
    doc_list = alldocs[:]  
    shuffle(doc_list)
    cores = multiprocessing.cpu_count()
    assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

    simple_models = [

        # PV-DM w/ default averaging; a higher starting alpha may improve CBOW/PV-DM modes
        Doc2Vec(dm=1, vector_size=100, window=10, negative=5, hs=0, min_count=2, sample=0, 
                epochs=20, workers=cores, alpha=0.05, comment='alpha=0.05')
    ]

    for model in simple_models:
        model.build_vocab(train_docs)
        #print("%s vocabulary scanned & state initialized" % model)

    models_by_name = OrderedDict((str(model), model) for model in simple_models)
    model.train(train_docs, total_examples=len(train_docs), epochs=model.epochs)
    train_targets, train_regressors = zip(*[(doc.words, doc.sentiment) for doc in train_docs])
    import numpy as np
    X = []
    for i in range(len(train_targets)):
        X.append(model.infer_vector(train_targets[i]))
    train_x = np.asarray(X)
    print(type(train_x))
    return(train_x)


class I_counter(BaseEstimator, TransformerMixin):
  def fit(self, x, y=None):
    return self



  def transform(self, data):
    def i_count(name):
        tokens = nltk.word_tokenize(name)
        count = tokens.count("I")
        count2 = tokens.count("i")
        return(count+count2)
    vecfunc = np.vectorize(i_count)
    data =  np.transpose(np.matrix(data['post']))
    result = vecfunc(data)
    return result

Answer 1

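If you want to read strings of up to 50 characters, the array must be able to hold 51 characters. This is because strings in C always end with a \0 character. So change the array declaration to this line: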

char firstString[51], secondString[3];

Another problem in the code is that you need to put & in front of number in the fscanf statement. Like this:

while (fscanf(fp, "%50s %2s %1d", firstString, secondString, &number) == 3)
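Putting the two fixes together, a minimal sketch of the reading loop might look like the following. The struct record type and the read_records helper are hypothetical names introduced only for illustration; the buffer sizes and the format string are the ones discussed above.

```c
#include <stdio.h>

/* Hypothetical record type: the field names follow the question,
   the struct itself is an assumption made for this sketch. */
struct record {
    char firstString[51];  /* up to 50 characters plus the terminating '\0' */
    char secondString[3];  /* 2 characters plus the terminating '\0' */
    int number;
};

/* Reads at most max records from fp and returns how many were parsed.
   %50s and %2s skip leading whitespace and limit the number of characters
   stored; %1d reads a single digit, as in the format string above. */
size_t read_records(FILE *fp, struct record *out, size_t max)
{
    size_t n = 0;

    while (n < max &&
           fscanf(fp, "%50s %2s %1d",
                  out[n].firstString,
                  out[n].secondString,
                  &out[n].number) == 3) {   /* note the & before number */
        n++;
    }
    return n;
}
```

The field widths in the format (50 and 2) are deliberately one less than the array sizes (51 and 3), so fscanf always has room to store the \0.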

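It is very good that you check the return values of fopen and fscanf. Another thing you should do is turn on compiler warnings. Compiling the code with -Wall -Wextra produces the following warnings: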

warning: format ‘%d’ expects argument of type ‘int *’, but argument 5 has type ‘int’ [-Wformat=]
     while (fscanf(fp, "%50s %2s %1d", firstString, secondString, number) == 3)
                                   ^
warning: unused parameter ‘argc’ [-Wunused-parameter]
 int main (int argc, char **argv)
               ^~~~

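So, as you can see, the warnings give you a hint about the problem with number.

The other warning is not very important. Not using argc is not always wrong, but you can make the code safer with something like this: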

if (argc < 2) {
    printf("No input file given as argument.\n");
    exit(EXIT_FAILURE);
}

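It is also a good idea to print error messages to stderr instead of stdout. Using perror is one way. Another way, very similar to printf, is simply to use fprintf. In fact, fprintf(stdout, ... ) means the same as printf( ... ). So just use fprintf and specify stderr. Your first error printout could be fprintf(stderr, "Error in opening file: %s\n", argv[1]).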
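As a rough illustration of the stderr advice, the file-opening step could be wrapped like this. The open_input helper is a hypothetical name, and appending strerror(errno) to the message is an addition of this sketch, not something the question's code does.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: opens path for reading and reports a failure on
   stderr. Returns NULL on failure, like fopen itself. */
FILE *open_input(const char *path)
{
    FILE *fp = fopen(path, "r");

    if (fp == NULL) {
        /* errno is read while the arguments are evaluated, before fprintf
           runs, so the reason printed is the one set by fopen. */
        fprintf(stderr, "Error in opening file: %s: %s\n",
                path, strerror(errno));
    }
    return fp;
}
```

A caller would typically check the result, for example by exiting with EXIT_FAILURE when open_input(argv[1]) returns NULL.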

Answer 2

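I think you are probably getting an empty string in firstString because secondString is not long enough to hold 2 characters and a null byte, and it so happens that, on your machine with your compiler, secondString is stored in memory immediately before firstString. When fscanf() copies AB and a null byte into secondString, the null byte overwrites the first byte of firstString, so it appears empty.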

You could demonstrate this by printing &firstString[1] with "[%s]" and seeing that it contains most of what you expected.
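Overflowing a real 2-byte array is undefined behaviour, so it cannot be shown safely as is. The sketch below instead simulates the suspected layout inside a single buffer, where the same 3-byte write is well defined. The simulate_overlap function and the word "hello" are assumptions made purely for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Simulates the suspected memory layout inside ONE buffer:
   bytes 0-1 play the role of secondString,
   bytes 2-52 play the role of firstString. */
void simulate_overlap(char buf[53])
{
    char *second = buf;      /* 2-byte slot */
    char *first = buf + 2;   /* 51-byte slot directly after it */

    strcpy(first, "hello");  /* fscanf stores the first word first */
    memcpy(second, "AB", 3); /* then 'A', 'B', '\0': 3 bytes into a 2-byte slot */

    /* first now starts with '\0', but the rest of the word is still there */
    printf("first = [%s], rest = [%s]\n", first, first + 1);
}
```

Calling simulate_overlap on a 53-byte buffer prints an empty first string followed by the tail of the word, which matches the symptom described above.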

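The standard does not guarantee that layout; it is just a plausible guess that explains what you are seeing. You either need to use %2c instead of %2s to get AB into secondString (in which case the name 'string' is a misnomer, because it is not a null-terminated string), or you need to increase the size of secondString to at least 3 bytes to allow for the terminating null. firstString has a similar size problem. The off-by-one difference between the size of the array in the code (say N bytes) and the size in the scanf() format string (N-1 bytes) is a nuisance, but it is hallowed by years of tradition (it was already so in 1978, in 7th Edition Unix). Unfortunately, changing it now would be worse than leaving it alone.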
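The %2c alternative can be sketched as follows. For self-containment this uses sscanf on a string rather than fscanf on a file, and parse_line is a hypothetical helper name; the conversion behaviour is the same for both functions.

```c
#include <stdio.h>

/* Parses one line into first (null-terminated, up to 50 characters),
   code (exactly 2 bytes, NOT null-terminated) and number.
   Returns 1 if all three fields were read, 0 otherwise. */
int parse_line(const char *line, char first[51], char code[2], int *number)
{
    /* The space before %2c matters: unlike %s, %c does not skip
       leading whitespace on its own. */
    return sscanf(line, "%50s %2c %1d", first, code, number) == 3;
}
```

Because code has no terminating null, it must be printed with a precision, for example printf("%.2s\n", code), and compared with memcmp rather than strcmp.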

You also need to fix the call to fscanf() so that you pass &number rather than number.

Why does fscanf ignore the first string in a text file?

2 answers: