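I want to use Keras for authorship attribution. I have a list of (text, label) pairs. I am trying to use Keras's built-in vectorizer, but I get the following error: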
Vectorizing sequence data...
Traceback (most recent call last):
  File "", line 1, in
  File "/home/angelo/org/courses/corpusling/finalproject/src/neuralnet.py", line 46, in
    X_train = tokenizer.texts_to_matrix(X_train, mode='binary')
  File "/home/angelo/org/courses/corpusling/finalproject/venv0/lib/python3.5/site-packages/keras/preprocessing/text.py", line 166, in texts_to_matrix
    sequences = self.texts_to_sequences(texts)
  File "/home/angelo/org/courses/corpusling/finalproject/venv0/lib/python3.5/site-packages/keras/preprocessing/text.py", line 131, in texts_to_sequences
    for vect in self.texts_to_sequences_generator(texts):
  File "/home/angelo/org/courses/corpusling/finalproject/venv0/lib/python3.5/site-packages/keras/preprocessing/text.py", line 150, in texts_to_sequences_generator
    i = self.word_index.get(w)
AttributeError: 'Tokenizer' object has no attribute 'word_index'
Here is my current code:
import glob
import os

import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.preprocessing.text import Tokenizer
from keras.utils import np_utils


def get_label(filename):
    tmp = os.path.split(filename)[0]
    label = os.path.basename(tmp)
    return label


def read_file(filename):
    with open(filename) as f:
        text = f.read()
    return text


traindocs = "../data/C50/C50train/*/*.txt"
testdocs = "../data/C50/C50test/*/*.txt"

documents_train = (read_file(f) for f in glob.iglob(traindocs))
labels_train = (get_label(f) for f in glob.iglob(traindocs))

documents_test = (read_file(f) for f in glob.iglob(testdocs))
labels_test = (get_label(f) for f in glob.iglob(testdocs))

df_train = pd.DataFrame([documents_train, labels_train])
df_train = df_train.transpose()
df_train.rename(columns={0: 'text', 1: 'author'}, inplace=True)

df_test = pd.DataFrame([documents_test, labels_test])
df_test = df_test.transpose()
df_test.rename(columns={0: 'text', 1: 'author'}, inplace=True)

max_words = 1000

print('Vectorizing sequence data...')
tokenizer = Tokenizer(nb_words=max_words)

X_train, Y_train = df_train.text, df_train.author
X_test, Y_test = df_test.text, df_test.author

X_train = tokenizer.texts_to_matrix(X_train, mode='binary')
X_test = tokenizer.texts_to_matrix(X_test, mode='binary')

nb_classes = np.max(Y_train) + 1
print('Convert class vector to binary class matrix (for use with categorical_crossentropy)')
Y_train = np_utils.to_categorical(Y_train, nb_classes)
Y_test = np_utils.to_categorical(Y_test, nb_classes)

model = Sequential()
model.add(Dense(output_dim=512, input_dim=(max_words,)))
model.add(Activation("relu"))
model.add(Dense(output_dim=(np.max(Y_train)+1)))
model.add(Activation("softmax"))

model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])

model.fit(X_train, Y_train, nb_epoch=5, batch_size=32)

loss_and_metrics = model.evaluate(X_test, Y_test, batch_size=32)
indraforyou.. 12
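You need to call tokenizer.fit_on_texts(texts) before using tokenizer.texts_to_matrix(). Here texts is the list of text data (both train and test). fit_on_texts() uses it to build word_index, which is simply a mapping from each unique word to a number. This mapping is later used to generate the matrix.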
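To make the ordering concrete, here is a small pure-Python sketch of the idea. It is not Keras's implementation: TinyTokenizer, its num_words parameter, and the sample sentences are hypothetical names made up for illustration. It only mimics the relevant behaviour, namely that word_index does not exist until fit_on_texts() has run, so calling texts_to_matrix() first fails with the same kind of AttributeError.

```python
# Simplified stand-in for illustration only; not Keras's actual implementation.
from collections import Counter


class TinyTokenizer:
    """Hypothetical minimal tokenizer mirroring the fit-then-transform order."""

    def __init__(self, num_words):
        # In this sketch column 0 is left unused; words occupy 1..num_words-1.
        self.num_words = num_words
        # word_index is deliberately NOT created here.

    def fit_on_texts(self, texts):
        counts = Counter(w for text in texts for w in text.lower().split())
        # Most frequent words get the smallest indices (ties broken
        # alphabetically), starting at 1.
        ranked = sorted(counts, key=lambda w: (-counts[w], w))
        self.word_index = {w: i for i, w in enumerate(ranked, start=1)}

    def texts_to_matrix(self, texts):
        matrix = []
        for text in texts:
            row = [0] * self.num_words
            for w in text.lower().split():
                # Raises AttributeError if fit_on_texts() was skipped.
                i = self.word_index.get(w)
                if i is not None and i < self.num_words:
                    row[i] = 1  # binary mode: mark the word as present
            matrix.append(row)
        return matrix


train_texts = ["the cat sat", "the dog sat down"]
test_texts = ["the bird sat"]

tokenizer = TinyTokenizer(num_words=5)

# Transforming before fitting fails, as in the question.
try:
    tokenizer.texts_to_matrix(train_texts)
    failed_before_fit = False
except AttributeError:
    failed_before_fit = True

# Fit first (on train and test texts, as suggested above), then transform.
tokenizer.fit_on_texts(train_texts + test_texts)
X_train = tokenizer.texts_to_matrix(train_texts)
X_test = tokenizer.texts_to_matrix(test_texts)
```

In the question's script, the equivalent change would be a tokenizer.fit_on_texts(...) call placed before the two tokenizer.texts_to_matrix(...) lines.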