This time, I load the images uploaded to GCS, Dockerize my own training script, and push it to GCR.
Then I run a training job on AI Platform.
Roughly following the tutorial here, I pushed the Docker image to GCR and then ran a training job.
First, enable the "AI Platform Training & Prediction, Compute Engine and Container Registry" APIs,
then run the job on a GCP VM instance.
Table of Contents
1. Upload images to GCS
2. Set environment variables
3. Python training scripts
4. Build the Docker container and push it to GCR
5. Run the job on AI Platform
1. Upload images to GCS
Create a notebook instance on AI Platform (region: us-central1).
Log in and create a bucket for uploading images, "mlops-test-bakura".
$ gsutil mb gs://mlops-test-bakura/
>>>
Creating gs://mlops-test-bakura/...
Upload the images into a "right" folder via the GUI (Cloud Console).
# Confirm the images are inside the "right" folder on GCS
$ gsutil ls gs://mlops-test-bakura/right/*.jpg
>>>
〜
gs://mlops-test-bakura/right/ml_670008765.jpg
gs://mlops-test-bakura/right/nm_78009843.jpg
gs://mlops-test-bakura/right/kj_78009847.jpg
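The same upload can also be scripted with the google-cloud-storage client (the library data_utils.py uses later). A minimal sketch, assuming the images sit in a local folder named right:

import glob
import os
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("mlops-test-bakura")

# Upload every local right/*.jpg to gs://mlops-test-bakura/right/
for path in glob.glob("right/*.jpg"):
    blob = bucket.blob("right/" + os.path.basename(path))
    blob.upload_from_filename(path)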
2. Set environment variables
# Create the output bucket and set environment variables
export BUCKET_ID=output-aiplatform
gsutil mb gs://$BUCKET_ID/
export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
3. Python training scripts
# Clone the sample code
$ git clone https://github.com/GoogleCloudPlatform/cloudml-samples
$ cd cloudml-samples/tensorflow/con*/un*/
# Training scripts
$ tree
>>>
├── Dockerfile
├── data_utils.py
├── model.py
└── task.py
Now rewrite the training scripts for my own use case.
model.py
My own network (transfer learning on VGG16).
from tensorflow.keras import Sequential
from tensorflow.keras.layers import *
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.models import Model


def sonar_model():
    base_model = VGG16(weights='imagenet',
                       include_top=True,
                       input_tensor=Input(shape=(224, 224, 3)))
    x = base_model.get_layer(index=-5).output
    x = Dropout(rate=0.3)(x)
    x = GlobalAveragePooling2D()(x)
    o = Dense(3, activation='softmax')(x)

    model = Model(inputs=base_model.input, outputs=o)
    model.compile(loss='categorical_crossentropy',
                  optimizer='sgd',
                  metrics=['accuracy'])
    return model
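Before baking this into the container, the network can be sanity-checked locally. A minimal sketch (run it in the same directory as model.py; the ImageNet weights are downloaded on first use):

from model import sonar_model

m = sonar_model()
m.summary()
print(m.output_shape)  # expect (None, 3): a 3-class softmax on 224x224x3 input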
data_utils.py
Load the images from the GCS bucket created earlier.
import datetime
from google.cloud import storage
import tempfile
import os, cv2
import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow
from tensorflow.keras.utils import to_categorical

client = storage.Client()
BUCKET_NAME = "mlops-test-bakura"
FOLDER_NAME = "right"
NUM_CLS = 2 + 1


def load_label(y, num_classes=3):
    return to_categorical(y, num_classes=num_classes)


def download(bucket_name, folder_name):
    images = []
    labels = []
    c = 0
    for blob in client.list_blobs(bucket_name, prefix=folder_name):
        _, _ext = os.path.splitext(blob.name)
        _, temp_local_filename = tempfile.mkstemp(suffix=_ext)
        blob.download_to_filename(temp_local_filename)
        img = cv2.imread(temp_local_filename)
        images.append(cv2.resize(img, (224, 224)))
        # Assign labels by position: roughly the first 200 images are class 0,
        # the next 200 are class 1, and the rest are class 2.
        if len(images) == 200:
            c += 1
        elif len(images) == 400:
            c += 1
        labels.append(c)
        # print(f"Blob {blob.name} downloaded to {temp_local_filename}.")
    return np.array(images) / 255, np.array(labels)


def load_data(args):
    imgs, labels = download(BUCKET_NAME, FOLDER_NAME)
    labels = load_label(labels, num_classes=NUM_CLS)
    print(imgs.shape, labels.shape)
    train_f, test_f, train_l, test_l = train_test_split(
        imgs, labels, test_size=args.test_split, random_state=args.seed)
    return train_f, test_f, train_l, test_l


def save_model(model_dir, model_name):
    """Saves the model to Google Cloud Storage"""
    bucket = storage.Client().bucket(model_dir)
    blob = bucket.blob('{}/{}'.format(
        datetime.datetime.now().strftime('sonar_%Y%m%d_%H%M%S'),
        model_name))
    blob.upload_from_filename(model_name)
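The labeling in download() is positional: roughly the first 200 blobs get class 0, the next 200 class 1, and the rest class 2. A minimal sketch to check the class counts locally (assumes GCS credentials are configured):

import numpy as np
import data_utils

imgs, labels = data_utils.download("mlops-test-bakura", "right")
print(imgs.shape)           # (N, 224, 224, 3), pixel values scaled to [0, 1]
print(np.bincount(labels))  # number of images assigned to each of the 3 classes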
task.py
import argparse

import data_utils
import model


def train_model(args):
    train_features, test_features, train_labels, test_labels = \
        data_utils.load_data(args)

    sonar_model = model.sonar_model()

    sonar_model.fit(train_features, train_labels,
                    epochs=args.epochs,
                    batch_size=args.batch_size)

    score = sonar_model.evaluate(test_features, test_labels,
                                 batch_size=args.batch_size)
    print(score)

    # Export the trained model
    sonar_model.save(args.model_name)

    if args.model_dir:
        # Save the model to GCS
        data_utils.save_model(args.model_dir, args.model_name)


def get_args():
    parser = argparse.ArgumentParser(description='Keras Sonar Example')
    parser.add_argument('--model-dir', type=str,
                        help='Where to save the model')
    parser.add_argument('--model-name', type=str, default='sonar_model.h5',
                        help='What to name the saved model file')
    parser.add_argument('--batch-size', type=int, default=4,
                        help='input batch size for training (default: 4)')
    parser.add_argument('--test-split', type=float, default=0.2,
                        help='split size for training / testing dataset')
    parser.add_argument('--epochs', type=int, default=1,
                        help='number of epochs to train (default: 1)')
    parser.add_argument('--seed', type=int, default=42,
                        help='random seed (default: 42)')
    args = parser.parse_args()
    return args


def main():
    args = get_args()
    train_model(args)


if __name__ == '__main__':
    main()
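task.py is normally invoked through its CLI (the Dockerfile's ENTRYPOINT below does exactly that), but train_model() can also be called directly for a quick local run. A minimal sketch that mirrors the CLI defaults via argparse.Namespace:

from argparse import Namespace
import task

# model_dir=None skips the GCS upload; the other values mirror the CLI defaults
args = Namespace(model_dir=None, model_name='sonar_model.h5',
                 batch_size=4, test_split=0.2, epochs=1, seed=42)
task.train_model(args)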
Dockerfile
FROM tensorflow/tensorflow:nightly
WORKDIR /root

ENV DEBIAN_FRONTEND=noninteractive

# Installs pandas, google-cloud-storage, and scikit-learn
# scikit-learn is used when loading the data
RUN pip install pandas google-cloud-storage scikit-learn

# Install OpenCV (cv2) for the image preprocessing in data_utils.py, plus curl
RUN apt-get update && apt-get install -y python-opencv python3-opencv curl

# The data for this sample has been publicly hosted on a GCS bucket.
# Download the data from the public Google Cloud Storage bucket for this sample
# (left over from the original sample; the rewritten scripts do not use it)
RUN curl https://storage.googleapis.com/cloud-samples-data/ml-engine/sonar/sonar.all-data --output ./sonar.all-data

# Copies the trainer code to the docker image.
COPY model.py ./model.py
COPY data_utils.py ./data_utils.py
COPY task.py ./task.py

# Set up the entry point to invoke the trainer.
ENTRYPOINT ["python", "task.py"]
4. Build the Docker container and push it to GCR
Next, build the custom Docker container and push it to GCR.
# Check that Docker works
sudo docker run busybox date
# Authenticate Docker with gcloud
gcloud auth configure-docker
# Build the Docker image
REGION=us-central1
export IMAGE_REPO_NAME=sonar_tf_nightly_container
export IMAGE_TAG=sonar_tf
# IMAGE_URI: the complete URI location for Cloud Container Registry
export IMAGE_URI=gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME:$IMAGE_TAG
export JOB_NAME=custom_container_tf_nightly_job_$(date +%Y%m%d_%H%M%S)
# docker build (docker build -t gcr.io/[project id]/[app]:latest .)
sudo docker build -f Dockerfile -t $IMAGE_URI ./
# Check that the container runs correctly
sudo docker run $IMAGE_URI --epochs 1
>>>>
〜〜〜〜
553467904/553467096 [==============================] - 3s 0us/step
2021-03-13 03:46:29.032178: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-03-13 03:46:29.032776: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2299995000 Hz
30/30 [==============================] - 103s 3s/step - loss: 0.2690 - accuracy: 0.8701
8/8 [==============================] - 7s 847ms/step - loss: 4.4660e-04 - accuracy: 1.0000
[0.0004466005484573543, 1.0]
# Push the image to GCR
# docker push gcr.io/[project id]/[app]:latest
sudo docker push $IMAGE_URI
The Docker image is now pushed to GCR.
5. Run the job on AI Platform
# Submit the job
$ gcloud components install beta
$ gcloud beta ai-platform jobs submit training $JOB_NAME --region $REGION --master-image-uri $IMAGE_URI --scale-tier BASIC -- --model-dir=$BUCKET_ID --epochs=1
# Monitor the job status and stream the logs
gcloud ai-platform jobs describe $JOB_NAME
gcloud ai-platform jobs stream-logs $JOB_NAME
>>>>
〜〜〜〜〜
INFO 2021-03-03 07:28:04 +0000 master-replica-0 Test set: Average loss: 0.0516, Accuracy: 9839/10000 (98%)
INFO 2021-03-03 07:28:04 +0000 master-replica-0
INFO 2021-03-03 07:30:30 +0000 service Job completed successfully.
# Confirm the model is saved in GCS
$ gsutil ls gs://$BUCKET_ID/sonar_*
>>>
gs://output-aiplatform/sonar_20210313_055918/sonar_model.h5
The job completed successfully and the .h5 weights were saved to GCS.
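To double-check, the saved weights can be pulled down and loaded locally. A minimal sketch; the object path is the one shown in the gsutil output above:

from google.cloud import storage
from tensorflow.keras.models import load_model

# Download the trained weights from the output bucket
bucket = storage.Client().bucket("output-aiplatform")
blob = bucket.blob("sonar_20210313_055918/sonar_model.h5")
blob.download_to_filename("sonar_model.h5")

model = load_model("sonar_model.h5")
model.summary()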
References
・スタートガイド: カスタム コンテナを使用したトレーニング
・AI Platform(GCP)でGPU 100個同時に使いテンションあがった
・GitHub: cloudml-samples