TensorFlow 2: The Open-Source TensorFlow Recommenders Component

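TensorFlow is an end-to-end open-source machine learning platform. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that helps researchers advance the state of the art in machine learning and lets developers easily build and deploy ML-powered applications.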

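Solutions to common machine learning problems:

Beginner: image classification

Intermediate: recommender systems that predict user behavior

Advanced: generative adversarial networks, for example training a GAN with the Keras Subclassing API to generate images of handwritten digits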

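Google has open-sourced TensorFlow Recommenders, a new open-source TensorFlow package for building recommenders. Its features can be summarized in four points:

It helps develop and evaluate flexible candidate nomination models;

It makes it easy to incorporate item, user, and context information into recommendation models;

It can train multi-task models that jointly optimize multiple recommendation objectives;

It serves the final model with TensorFlow Serving.

TensorFlow Recommenders (TFRS) is a library for building recommender system models with TensorFlow. It supports the full workflow of building a recommender system: data preparation, model formulation, model training, model evaluation, and deployment.

The library is built on Keras, which makes it easier to build complex models.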

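TensorFlow Recommenders

TensorFlow Recommenders supports:

Building and evaluating flexible recommendation retrieval models

Freely incorporating item, user, and context information into recommendation models

Jointly training multi-task models that optimize multiple recommendation objectives

TensorFlow Recommenders modules:

datasets: dataset module

examples: functionality used in the examples

layers: layers module

losses: loss functions module

metrics: metrics module

models: models module

tasks: tasks library module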

Installing TFRS and the datasets package

pip install tensorflow-recommenders
pip install tensorflow-datasets

Recall Model

Ignoring the users' ratings, this model uses <user ID, movie title> pairs as training data to predict each user's top N movies.

from typing import Dict, Text

import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs
import numpy as np

# Model definition: subclass tfrs.Model
# and implement the loss in compute_loss.
class MovieLensModel(tfrs.Model):
    # We derive from a custom base class to help reduce boilerplate. Under the hood,
    # these are still plain Keras Models.

    def __init__(
            self,
            user_model: tf.keras.Model,
            movie_model: tf.keras.Model,
            task: tfrs.tasks.Retrieval):
        super().__init__()

        # Set up user and movie representations.
        self.user_model = user_model
        self.movie_model = movie_model

        # Set up a retrieval task.
        self.task = task

    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
        # Define how the loss is computed.

        user_embeddings = self.user_model(features["user_id"])
        movie_embeddings = self.movie_model(features["movie_title"])

        return self.task(user_embeddings, movie_embeddings)


if __name__ == '__main__':
    # Load the data (the TFDS dataset name is "movielens", without an underscore).
    # Ratings data.
    ratings = tfds.load("movielens/100k-ratings", split="train")
    # Features of all the available movies.
    movies = tfds.load("movielens/100k-movies", split="train")
    print("movie_id,movie_title,user_gender,user_id,user_rating")
    for line in ratings.take(3):
        # print(line)
        res = [line["movie_id"].numpy(), line["movie_title"].numpy(), line["user_gender"].numpy(),
               line["user_id"].numpy(), line["user_rating"].numpy()]
        print(res)
    print("movie_id,movie_title")
    for line in movies.take(3):
        res = [line["movie_id"].numpy(), line["movie_title"].numpy()]
        print(res)

    # Select the basic features.
    ratings = ratings.map(lambda x: {
        "movie_title": x["movie_title"],
        "user_id": x["user_id"]
    })
    movies = movies.map(lambda x: x["movie_title"])

    # Build the vocabularies that map raw strings to integer ids.
    # (tf.keras.layers.StringLookup replaces the older
    # tf.keras.layers.experimental.preprocessing.StringLookup path.)
    user_ids_vocabulary = tf.keras.layers.StringLookup(mask_token=None)
    user_ids_vocabulary.adapt(ratings.map(lambda x: x["user_id"]))

    movie_titles_vocabulary = tf.keras.layers.StringLookup(mask_token=None)
    movie_titles_vocabulary.adapt(movies)

    # Define the user model and the item (movie) model.
    user_model = tf.keras.Sequential([
        user_ids_vocabulary,
        tf.keras.layers.Embedding(user_ids_vocabulary.vocabulary_size(), 64)
    ])
    movie_model = tf.keras.Sequential([
        movie_titles_vocabulary,
        tf.keras.layers.Embedding(movie_titles_vocabulary.vocabulary_size(), 64)
    ])

    # Define your objectives.
    task = tfrs.tasks.Retrieval(metrics=tfrs.metrics.FactorizedTopK(
        movies.batch(128).map(movie_model)
    )
    )

    # Create a retrieval model.
    model = MovieLensModel(user_model, movie_model, task)
    model.compile(optimizer=tf.keras.optimizers.Adagrad(0.5))

    # Train for 3 epochs.
    model.fit(ratings.batch(4096), epochs=3)

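    # Use brute-force search to set up retrieval using the trained representations.
    index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
    # Recent TFRS releases index a dataset of (identifier, embedding) pairs with
    # index_from_dataset; early releases used index.index(candidates, identifiers).
    index.index_from_dataset(
        movies.batch(100).map(lambda title: (title, model.movie_model(title))))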

    # Get some recommendations.
    _, titles = index(np.array(["42"]))
    print(f"Top 3 recommendations for user 42: {titles[0, :3]}")

1. tfrs.tasks.Retrieval: a retrieval (recall) model that retrieves O(thousands) of candidates from a corpus of O(millions) of candidates. Its metrics are top-k metrics: given a user and a movie that user is known to have watched, how highly would the model rank the true movie out of all possible movies?
2. tfrs.tasks.Ranking: a ranker model that scores the candidates retrieved by the retrieval model and returns a ranked shortlist of a few dozen candidates.
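
To make the top-k metric concrete, here is a minimal plain-Python sketch of what one retrieval step amounts to: score every candidate by the dot product of the user and candidate embeddings, sort, and check whether the true item lands in the top k. The embeddings, the titles, and the helper names (dot, top_k, hit_at_k) are all made up for illustration; TFRS computes this on batched tensors, and this is not its implementation.

```python
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Affinity score between a user embedding and a candidate embedding."""
    return sum(x * y for x, y in zip(a, b))


def top_k(user: Sequence[float],
          candidates: Dict[str, Sequence[float]],
          k: int) -> List[str]:
    """Return the k candidate ids with the highest dot-product score."""
    ranked = sorted(candidates,
                    key=lambda cid: dot(user, candidates[cid]),
                    reverse=True)
    return ranked[:k]


def hit_at_k(user: Sequence[float],
             candidates: Dict[str, Sequence[float]],
             true_id: str,
             k: int) -> bool:
    """Top-k metric for one example: is the watched item among the top k?"""
    return true_id in top_k(user, candidates, k)


# Hypothetical 2-dimensional embeddings, chosen by hand for the example.
user_42 = [1.0, 0.5]
movies = {
    "Movie A": [0.9, 0.1],    # score  0.95
    "Movie B": [-0.5, 0.2],   # score -0.40
    "Movie C": [0.2, 1.0],    # score  0.70
}

recommendations = top_k(user_42, movies, k=2)
print(recommendations)  # ['Movie A', 'Movie C']
```

Averaging hit_at_k over many (user, watched movie) pairs gives a top-k accuracy figure of the kind a retrieval task reports.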

Rank Model

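The ranking model only has to score the few dozen or few hundred items that the retrieval stage returns, so its total computation is relatively small. It can therefore afford to be more complex, with a richer architecture and more network layers.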

from typing import Dict, Text

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

import tensorflow_recommenders as tfrs


class RankingModel(tf.keras.Model):

    def __init__(self):
        super().__init__()
        embedding_dimension = 32

        # unique_user_ids and unique_movie_titles are module-level globals
        # that the main block below creates before the model is instantiated.

        # Compute embeddings for users.
        self.user_embeddings = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=unique_user_ids, mask_token=None),
            tf.keras.layers.Embedding(len(unique_user_ids) + 1, embedding_dimension)
        ])

        # Compute embeddings for movies.
        self.movie_embeddings = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=unique_movie_titles, mask_token=None),
            tf.keras.layers.Embedding(len(unique_movie_titles) + 1, embedding_dimension)
        ])

        # Compute predictions.
        self.ratings = tf.keras.Sequential([
            # Learn multiple dense layers.
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(64, activation="relu"),
            # Make rating predictions in the final layer.
            tf.keras.layers.Dense(1)
        ])

    def call(self, inputs):
        user_id, movie_title = inputs

        user_embedding = self.user_embeddings(user_id)
        movie_embedding = self.movie_embeddings(movie_title)

        return self.ratings(tf.concat([user_embedding, movie_embedding], axis=1))


class MovielensModel(tfrs.models.Model):

    def __init__(self):
        super().__init__()
        self.ranking_model: tf.keras.Model = RankingModel()
        self.task: tf.keras.layers.Layer = tfrs.tasks.Ranking(
            loss=tf.keras.losses.MeanSquaredError(),
            metrics=[tf.keras.metrics.RootMeanSquaredError()]
        )

    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
        rating_predictions = self.ranking_model(
            (features["user_id"], features["movie_title"]))

        # The task computes the loss and the metrics.
        return self.task(labels=features["user_rating"], predictions=rating_predictions)


if __name__ == '__main__':
    ratings = tfds.load("movielens/100k-ratings", split="train")

    ratings = ratings.map(lambda x: {
        "movie_title": x["movie_title"],
        "user_id": x["user_id"],
        "user_rating": x["user_rating"]
    })

    tf.random.set_seed(42)
    shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

    train = shuffled.take(80_000)
    test = shuffled.skip(80_000).take(20_000)

    movie_titles = ratings.batch(1_000_000).map(lambda x: x["movie_title"])
    user_ids = ratings.batch(1_000_000).map(lambda x: x["user_id"])

    unique_movie_titles = np.unique(np.concatenate(list(movie_titles)))
    unique_user_ids = np.unique(np.concatenate(list(user_ids)))

    model = MovielensModel()
    model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

    cached_train = train.shuffle(100_000).batch(8192).cache()
    cached_test = test.batch(4096).cache()

    model.fit(cached_train, epochs=3)
    model.evaluate(cached_test, return_dict=True)

    test_ratings = {}
    test_movie_titles = ["M*A*S*H (1970)", "Dances with Wolves (1990)", "Speed (1994)"]
    for movie_title in test_movie_titles:
        test_ratings[movie_title] = model.ranking_model((
            tf.constant(["42"]),
            tf.constant([movie_title]))).numpy()

    print("Ratings:")
    for title, score in sorted(test_ratings.items(), key=lambda x: x[1], reverse=True):
        print(f"{title}: {score}")
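
The ranking task in the listing above is configured with a mean-squared-error loss and a root-mean-squared-error metric. As a rough illustration of what those two numbers mean, the following plain-Python sketch computes both for a handful of hand-picked ratings. The values and function names are hypothetical, and the Keras classes compute the same quantities on tensors rather than lists.

```python
import math
from typing import Sequence


def mean_squared_error(labels: Sequence[float],
                       predictions: Sequence[float]) -> float:
    """Average of the squared differences; the quantity minimized in training."""
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)


def root_mean_squared_error(labels: Sequence[float],
                            predictions: Sequence[float]) -> float:
    """Square root of the MSE, expressed in the same units as the ratings."""
    return math.sqrt(mean_squared_error(labels, predictions))


# Hypothetical true ratings (1 to 5 stars) and model outputs for four examples.
true_ratings = [4.0, 3.0, 5.0, 2.0]
predicted = [3.0, 3.0, 4.0, 4.0]

mse = mean_squared_error(true_ratings, predicted)         # (1 + 0 + 1 + 4) / 4
rmse = root_mean_squared_error(true_ratings, predicted)
print(mse, rmse)
```

An RMSE near 1.2 here would mean the predictions are off by a bit more than one star on average, which is why RMSE is easier to read than the raw loss.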

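Basic Functions

A Dataset represents a collection of input-pipeline elements (nested structures of tensors) together with a "logical plan" of transformations over those elements. The elements of a Dataset can be tensors, tuples, dictionaries, and so on.
In TF 1.x-style graph mode, which the snippets in this section use through tf.compat.v1, a Dataset is consumed through an Iterator object that extracts its elements one at a time. Under TensorFlow 2's default eager execution, a Dataset can also be iterated directly with a Python for loop.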

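1. batch: the batch size is the number of samples used for each gradient update. batch adds a batch dimension to the data.

import tensorflow as tf

# Session-based iteration needs graph mode; TensorFlow 2 is eager by default.
tf.compat.v1.disable_eager_execution()

# Create a dataset of 0..9 and take 6 elements per batch.
dataset = tf.data.Dataset.range(10).batch(6)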
iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
next_element = iterator.get_next()

with tf.compat.v1.Session() as sess:
    for i in range(2):
        value = sess.run(next_element)
        print(value)
# [0 1 2 3 4 5]
# [6 7 8 9]
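
The short final batch above is expected: 10 elements do not divide evenly by 6. The following plain-Python sketch models that grouping behavior, including the effect of the drop_remainder option that tf.data's batch also accepts. The generator itself is an illustration written for this note, not TensorFlow code.

```python
from typing import Iterable, Iterator, List


def batch(elements: Iterable[int], batch_size: int,
          drop_remainder: bool = False) -> Iterator[List[int]]:
    """Group consecutive elements into lists of at most batch_size items."""
    current: List[int] = []
    for element in elements:
        current.append(element)
        if len(current) == batch_size:
            yield current
            current = []
    # The final batch may be smaller than batch_size.
    if current and not drop_remainder:
        yield current


full = list(batch(range(10), 6))
trimmed = list(batch(range(10), 6, drop_remainder=True))
print(full)     # [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9]]
print(trimmed)  # [[0, 1, 2, 3, 4, 5]]
```

Dropping the remainder is mainly useful when a model needs every batch to have exactly the same shape.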

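2. repeat: repeats the whole dataset multiple times.

import tensorflow as tf

# Session-based iteration needs graph mode; TensorFlow 2 is eager by default.
tf.compat.v1.disable_eager_execution()

dataset = tf.data.Dataset.range(10).batch(6)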
dataset = dataset.repeat(2)
iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
next_element = iterator.get_next()

with tf.compat.v1.Session() as sess:
  for i in range(4):
    value = sess.run(next_element)
    print(value)
    # [0 1 2 3 4 5]
    # [6 7 8 9]
    # [0 1 2 3 4 5]
    # [6 7 8 9]
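
The relative order of batch and repeat changes the result. The plain-Python sketch below models both orders on the same 0..9 data: batching before repeating ends every epoch with its own short batch, while repeating before batching lets batches run across the epoch boundary. The helper functions are illustrative stand-ins written for this note, not TensorFlow APIs.

```python
from itertools import chain, islice
from typing import Callable, Iterable, Iterator, List


def batch(elements: Iterable[int], size: int) -> Iterator[List[int]]:
    """Group consecutive elements into lists of at most `size` items."""
    iterator = iter(elements)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def repeat(make_elements: Callable[[], Iterable], count: int) -> Iterator:
    """Replay a freshly created pass over the data `count` times."""
    return chain.from_iterable(make_elements() for _ in range(count))


# batch -> repeat: every epoch ends with its own short batch.
batch_then_repeat = list(repeat(lambda: batch(range(10), 6), 2))
print(batch_then_repeat)
# [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9], [0, 1, 2, 3, 4, 5], [6, 7, 8, 9]]

# repeat -> batch: batches run across the epoch boundary.
repeat_then_batch = list(batch(repeat(lambda: range(10), 2), 6))
print(repeat_then_batch)
# [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 0, 1], [2, 3, 4, 5, 6, 7], [8, 9]]
```

Batching first keeps epochs cleanly separated, which is usually what you want when per-epoch metrics matter.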

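3. shuffle: randomizes the order of the elements. The buffer_size argument sets the size of the buffer used for shuffling and is required; omitting it raises an error. With buffer_size=1 nothing is shuffled, that is, the original order is kept, and the larger buffer_size is, the more thoroughly the data is shuffled; a buffer at least as large as the dataset gives a full shuffle. In other words, the pipeline maintains a buffer of buffer_size elements, draws each output element at random from that buffer, and refills the freed slot with the next input element.

import tensorflow as tf

# Session-based iteration needs graph mode; TensorFlow 2 is eager by default.
tf.compat.v1.disable_eager_execution()

dataset = tf.data.Dataset.range(10).shuffle(20).batch(6)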
dataset = dataset.repeat(2)
iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
next_element = iterator.get_next()

with tf.compat.v1.Session() as sess:
  for i in range(4):
    value = sess.run(next_element)
    print(value)

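    # Sample outputs for different buffer_size values passed to shuffle()
    # (apart from buffer_size = 1, results vary from run to run).

    # buffer_size = 1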
    # [0 1 2 3 4 5]
    # [6 7 8 9]
    # [0 1 2 3 4 5]
    # [6 7 8 9]

    # buffer_size = 2
    # [1 0 2 4 5 3]
    # [6 7 8 9]
    # [1 2 0 4 3 5]
    # [7 6 9 8]

    # buffer_size = 10
    # [6 3 0 9 4 7]
    # [2 8 1 5]
    # [5 7 6 8 9 2]
    # [4 0 1 3]

    # buffer_size = 20
    # [3 9 1 0 7 5]
    # [2 8 4 6]
    # [2 8 0 3 5 7]
    # [9 1 6 4]
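
The sample outputs above show that a small buffer only moves elements a short distance. A minimal plain-Python model of a shuffle buffer makes the reason visible: an element can be emitted only once it has entered the buffer, so output position i can hold nothing later than input position i + buffer_size - 1. The function name buffered_shuffle and its seed parameter are inventions for this sketch, and it only approximates what tf.data does internally.

```python
import random
from typing import Iterable, Iterator, List, Optional


def buffered_shuffle(elements: Iterable[int], buffer_size: int,
                     seed: Optional[int] = None) -> Iterator[int]:
    """Yield elements in a locally randomized order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer: List[int] = []
    for element in elements:
        if len(buffer) < buffer_size:
            buffer.append(element)      # still filling the buffer
            continue
        slot = rng.randrange(buffer_size)
        yield buffer[slot]              # emit one random buffered element
        buffer[slot] = element          # refill the slot with the new element
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


unshuffled = list(buffered_shuffle(range(10), buffer_size=1))
lightly_shuffled = list(buffered_shuffle(range(10), buffer_size=3, seed=0))
print(unshuffled)        # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(lightly_shuffled)  # a permutation in which no value runs far ahead
```

With buffer_size=1 the only element available to emit is always the oldest one, which is why the order is unchanged.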

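The position of shuffle in the pipeline matters: shuffle first, then batch. If you batch first and shuffle afterwards, only whole batches are shuffled while the elements inside each batch keep their original order, so the randomization is much weaker (the individual samples are effectively not shuffled).

A common pattern in training is shuffle -> batch -> repeat.

This shuffles the data first, then packs it into batches, and repeats the whole dataset for multiple epochs. By default, shuffle reshuffles the data on every epoch.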
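
The difference between the two orders can be seen in a small plain-Python sketch. It uses a full shuffle (the equivalent of a buffer at least as large as the dataset) and a fixed seed purely to keep the run repeatable; the batch helper is written for this note and is not a TensorFlow function.

```python
import random
from typing import List, Sequence


def batch(elements: Sequence, size: int) -> List[list]:
    """Split a sequence into consecutive chunks of at most `size` items."""
    return [list(elements[i:i + size]) for i in range(0, len(elements), size)]


rng = random.Random(0)   # fixed seed only to keep the run repeatable
data = list(range(10))

# shuffle -> batch: individual elements are mixed before they are grouped.
shuffled = data[:]
rng.shuffle(shuffled)
shuffle_then_batch = batch(shuffled, 4)

# batch -> shuffle: only the order of whole batches changes.
batch_then_shuffle = batch(data, 4)
rng.shuffle(batch_then_shuffle)

print(shuffle_then_batch)
print(batch_then_shuffle)
```

In the second result every batch is still a run of consecutive numbers, which is the weak randomization described above.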

4. There are two ways to define a keras.Model:

(1) Functional API: with the "Functional API", you start from Input, chain layer calls to specify the model's forward pass, and finally create your model from inputs and outputs:

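import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(128,))  # build an input tensor
x = layers.Dense(256, activation='relu')(inputs)
x = layers.Dense(512, activation='relu')(x)
x = layers.Dense(128, activation='relu')(x)
x = layers.Dense(64, activation='relu')(x)
predictions = layers.Dense(10, activation='softmax')(x)
model = tf.keras.Model(inputs=inputs, outputs=predictions)
# Compile the model.
model.compile(optimizer=tf.keras.optimizers.RMSprop(0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Placeholder random data (an assumption; the original leaves data and labels undefined).
data = np.random.random((1000, 128)).astype('float32')
labels = tf.keras.utils.to_categorical(
    np.random.randint(10, size=(1000,)), num_classes=10)
model.fit(data, labels, batch_size=32, epochs=5)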

(2) Model subclassing: you implement two methods, __init__ and call. By subclassing the Model class, you define your layers in __init__ and implement the model's forward pass in call.

Subclassing tf.keras.Model and defining your own forward pass lets you build fully customizable models.

import os
import tensorflow as tf
import numpy as np
from tensorflow import keras

def conv3x3(channels, stride=1, kernel=(3, 3)):
    return keras.layers.Conv2D(channels, kernel, strides=stride, padding='same',
                               use_bias=False,
                               kernel_initializer=tf.random_normal_initializer())

class ResnetBlock(keras.Model):

    def __init__(self, channels, strides=1, residual_path=False):
        super().__init__()

        self.channels = channels
        self.strides = strides
        self.residual_path = residual_path

        self.conv1 = conv3x3(channels, strides)
        self.bn1 = keras.layers.BatchNormalization()
        self.conv2 = conv3x3(channels)
        self.bn2 = keras.layers.BatchNormalization()

        if residual_path:
            self.down_conv = conv3x3(channels, strides, kernel=(1, 1))
            self.down_bn = tf.keras.layers.BatchNormalization()

    def call(self, inputs, training=None):
        residual = inputs

        x = self.bn1(inputs, training=training)
        x = tf.nn.relu(x)
        x = self.conv1(x)
        x = self.bn2(x, training=training)
        x = tf.nn.relu(x)
        x = self.conv2(x)

        if self.residual_path:
            residual = self.down_bn(inputs, training=training)
            residual = tf.nn.relu(residual)
            residual = self.down_conv(residual)

        x = x + residual
        return x

class ResNet(keras.Model):

    def __init__(self, block_list, num_classes, initial_filters=16, **kwargs):
        super().__init__(**kwargs)

        self.num_blocks = len(block_list)
        self.block_list = block_list

        self.in_channels = initial_filters
        self.out_channels = initial_filters
        self.conv_initial = conv3x3(self.out_channels)

        self.blocks = keras.models.Sequential(name='dynamic-blocks')

        for block_id in range(len(block_list)):
            for layer_id in range(block_list[block_id]):

                if block_id != 0 and layer_id == 0:
                    block = ResnetBlock(self.out_channels,
                                        strides=2, residual_path=True)
                else:
                    if self.in_channels != self.out_channels:
                        residual_path = True
                    else:
                        residual_path = False
                    block = ResnetBlock(self.out_channels,
                                        residual_path=residual_path)

                self.in_channels = self.out_channels

                self.blocks.add(block)

            self.out_channels *= 2

        self.final_bn = keras.layers.BatchNormalization()
        self.avg_pool = keras.layers.GlobalAveragePooling2D()
        self.fc = keras.layers.Dense(num_classes)

    def call(self, inputs, training=None):
        out = self.conv_initial(inputs)
        out = self.blocks(out, training=training)
        out = self.final_bn(out, training=training)
        out = tf.nn.relu(out)
        out = self.avg_pool(out)
        out = self.fc(out)
        return out

if __name__ == "__main__":
    model = ResNet([2, 2, 2], 10)
    model.build(input_shape=(None, 28, 28, 1))
    model.summary()

Create the layers in the __init__ method and set them as attributes of the class instance.

Define the forward pass in the call method.

5. map: applies the function map_func to the dataset. map takes a function, passes every element of the Dataset to it as input, and builds a new Dataset from the return values:

import tensorflow as tf

tf.compat.v1.disable_eager_execution()

if __name__ == '__main__':

    dataset = tf.data.Dataset.range(10).shuffle(1)
    dataset = dataset.map(lambda x: x+10)
    iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
    next_element = iterator.get_next()

    with tf.compat.v1.Session() as sess:
        for i in range(5):
            value = sess.run(next_element)
            print(value)
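
For comparison, the same kind of pipeline can be consumed without a Session under TensorFlow 2's default eager execution. The sketch below is meant to run as a separate script that does not call disable_eager_execution; the batch size of 4 is an arbitrary choice for the example.

```python
import tensorflow as tf

# Eager execution is the default in TensorFlow 2, so no Session is needed.
dataset = tf.data.Dataset.range(10).map(lambda x: x + 10).batch(4)

# A Dataset is a Python iterable in eager mode; each element is a tensor.
batches = [batch.numpy().tolist() for batch in dataset]
print(batches)
# [[10, 11, 12, 13], [14, 15, 16, 17], [18, 19]]
```

The for-loop style is usually the simpler choice in new TensorFlow 2 code; the Session-based form is mainly needed when working with TF 1.x-style graphs.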