Neural Machine Translation: the technology behind Google Translate
Building a simple neural machine translation model
Machine translation:
Two schools of thought:
- an empirical trial-and-error approach that relies on statistical methods
- a theoretical approach rooted in fundamental linguistic research
Babel Fish: the world's first web translation tool, launched in 1997 by the AltaVista search engine.
An introduction to Seq2Seq modeling: sequence-to-sequence models are used for NLP tasks such as text summarization, speech recognition, and DNA sequence modeling.
A typical seq2seq model has two main components:
a) an encoder, which reads the source sequence and compresses it into a fixed-length context vector
b) a decoder, which generates the target sequence from that context vector
Implementation in Python using Keras:
We will use the Chinese-English sentence pair data from http://www.manythings.org/anki/. The download is a folder named cmn-eng that contains the file cmn.txt.
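Before parsing, it is worth peeking at the raw file. The snippet below is a quick sanity check, not part of the original pipeline; note that newer dumps of cmn.txt may append a third, attribution column after the English and Chinese fields.

# quick sanity check: inspect the first few raw lines (tab-separated)
with open('cmn.txt', mode='rt', encoding='utf-8') as f:
    for _ in range(3):
        print(repr(f.readline()))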
1. Import the required libraries:
import string
import re
from numpy import array, argmax, random, take
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, RepeatVector
from keras.preprocessing.text import Tokenizer
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from keras import optimizers
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.max_colwidth', 200)
2. Read the data into the IDE:
# function to read raw text file
def read_text(filename):
    # open the file
    file = open(filename, mode='rt', encoding='utf-8')
    # read all text
    text = file.read()
    file.close()
    return text

# split the text into Chinese-English sentence pairs
def to_lines(text):
    sents = text.strip().split('\n')
    sents = [i.split('\t') for i in sents]
    return sents

data = read_text("cmn.txt")
cmn_eng = to_lines(data)
cmn_eng = array(cmn_eng)
We keep only the first 50,000 sentence pairs to reduce the model's training time.
cmn_eng = cmn_eng[:50000,:]
Text preprocessing:
Most of this data is unstructured, so it needs cleaning before modeling.
a) Text cleaning
# remove punctuation from both columns
cmn_eng[:,0] = [s.translate(str.maketrans('', '', string.punctuation)) for s in cmn_eng[:,0]]
cmn_eng[:,1] = [s.translate(str.maketrans('', '', string.punctuation)) for s in cmn_eng[:,1]]
cmn_eng
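One caveat: string.punctuation covers only ASCII characters, so full-width Chinese punctuation (。，！？ and so on) survives the step above. The extension below is an optional sketch, not part of the original tutorial; the cn_punct string is an illustrative, non-exhaustive selection.

# optional: also strip common full-width Chinese punctuation (illustrative list)
cn_punct = '。，、！？；：“”‘’（）《》'
cmn_eng[:,1] = [s.translate(str.maketrans('', '', cn_punct)) for s in cmn_eng[:,1]]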
b) Text-to-sequence conversion:
First we record the length of every sentence, in two separate lists for English and Chinese.
# empty lists
eng_l = []
cmn_l = []

# populate the lists with sentence lengths
for i in cmn_eng[:,0]:
    eng_l.append(len(i.split()))
for i in cmn_eng[:,1]:
    cmn_l.append(len(i.split()))

length_df = pd.DataFrame({'eng': eng_l, 'cmn': cmn_l})
length_df.hist(bins=30)
plt.show()
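Keep in mind that len(i.split()) counts whitespace-delimited tokens, and Chinese text in this corpus is typically not space-segmented, so most Chinese "lengths" will come out as 1. A quick look at the maxima (an optional check, not in the original) helps justify the padding length of 8 chosen below.

# optional: inspect the longest sentences to sanity-check the padding length
print('max eng tokens:', max(eng_l))
print('max cmn tokens:', max(cmn_l))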
Next, we vectorize the text data using Keras's Tokenizer() class.
# function to build a tokenizer
def tokenization(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# prepare English tokenizer
eng_tokenizer = tokenization(cmn_eng[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = 8
print('English Vocabulary Size: %d' % eng_vocab_size)

# prepare Chinese tokenizer
cmn_tokenizer = tokenization(cmn_eng[:, 1])
cmn_vocab_size = len(cmn_tokenizer.word_index) + 1
cmn_length = 8
print('Chinese Vocabulary Size: %d' % cmn_vocab_size)

# encode and pad sequences
def encode_sequences(tokenizer, length, lines):
    # integer encode sequences
    seq = tokenizer.texts_to_sequences(lines)
    # pad sequences with 0 values
    seq = pad_sequences(seq, maxlen=length, padding='post')
    return seq
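As a quick usage check (illustrative only; the integer ids depend on the fitted vocabulary), encoding a sample sentence shows the integer-plus-padding format the model will consume:

# illustrative: encode one sentence; the ids depend on the fitted vocabulary
sample = encode_sequences(eng_tokenizer, eng_length, ['how are you'])
print(sample)   # e.g. [[ 34   9   7   0   0   0   0   0]]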
3. Model building:
We now split the data into training and test sets for model training and evaluation.
from sklearn.model_selection import train_test_split

# split data into train and test set
train, test = train_test_split(cmn_eng, test_size=0.2, random_state=12)

# prepare training data
trainX = encode_sequences(cmn_tokenizer, cmn_length, train[:, 1])
trainY = encode_sequences(eng_tokenizer, eng_length, train[:, 0])

# prepare validation data
testX = encode_sequences(cmn_tokenizer, cmn_length, test[:, 1])
testY = encode_sequences(eng_tokenizer, eng_length, test[:, 0])
Now we define our Seq2Seq model architecture:
- For the encoder, we use an embedding layer followed by an LSTM layer
- For the decoder, we use another LSTM layer followed by a dense layer
# build NMT model
def define_model(in_vocab, out_vocab, in_timesteps, out_timesteps, units):
    model = Sequential()
    model.add(Embedding(in_vocab, units, input_length=in_timesteps, mask_zero=True))
    model.add(LSTM(units))
    model.add(RepeatVector(out_timesteps))
    model.add(LSTM(units, return_sequences=True))
    model.add(Dense(out_vocab, activation='softmax'))
    return model

# model compilation
model = define_model(cmn_vocab_size, eng_vocab_size, cmn_length, eng_length, 512)
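To confirm that the layer shapes line up before training, Keras's standard model.summary() prints the architecture:

model.summary()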
We compile the model with the RMSprop optimizer:
rms = optimizers.RMSprop(lr=0.001)
model.compile(optimizer=rms, loss='sparse_categorical_crossentropy')
sparse_categorical_crossentropy is used as the loss function because it lets us keep the target sequences as integer ids instead of one-hot encoding them.
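The shape difference is worth seeing concretely (a small illustration; the sample count follows from the 80/20 split above):

# sparse targets stay as integer ids, shape (samples, 8, 1)
print(trainY.reshape(trainY.shape[0], trainY.shape[1], 1).shape)
# a one-hot alternative would need shape (samples, 8, eng_vocab_size)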
4. Train the model
We train for 30 epochs with a batch size of 512 and a validation split of 20%: 80% of the data is used to train the model and the remaining 20% to evaluate it. We also use the ModelCheckpoint() callback to save the model with the lowest validation loss.
filename = 'model.h1.24_jan_19'
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min')

# train model
history = model.fit(trainX, trainY.reshape(trainY.shape[0], trainY.shape[1], 1),
                    epochs=30, batch_size=512, validation_split=0.2,
                    callbacks=[checkpoint], verbose=1)
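Once training finishes, plotting the training and validation loss from the returned history object is a quick way to check for overfitting (a standard Keras pattern):

# plot training vs. validation loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['train', 'validation'])
plt.show()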
5. Load the saved model and make predictions on the unseen data (testX).
model = load_model('model.h1.24_jan_19')
preds = model.predict_classes(testX.reshape((testX.shape[0], testX.shape[1])))
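The predictions are integer sequences, so they must be mapped back to words through the tokenizer's word_index; the helper below is a minimal sketch of that reverse lookup. Note also that predict_classes was removed in recent Keras releases, and an argmax over model.predict output is the usual replacement.

import numpy as np

# on newer Keras versions, replace predict_classes with:
# preds = np.argmax(model.predict(testX), axis=-1)

# minimal sketch: map a predicted integer id back to a word
def get_word(n, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == n:
            return word
    return None

first_pred = [get_word(idx, eng_tokenizer) for idx in preds[0]]
print(first_pred)   # padded positions come back as None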