2021-05-24

Summary of tricks for a face-based gender and age estimation algorithm (age gender estimation model) [Machine Learning]

I built one of those popular "age gender estimation" models that predict gender and age from a human face, so I'm writing down the techniques I used as a memo before I forget them.

Table of contents
1. Age-gender-estimation model body
2. Post-processing tricks
3. Prediction results

1. Age-gender-estimation model body

I used InceptionV3. Apart from the final layers, I did nothing special.

from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import *

EPOCHS = 5
BATCH_SIZE = 8
HEIGHT = WIDTH = 299

def load_model(gender_cls=2, generation_cls=6, identity_cls=21):
    adam = Adam(learning_rate=0.0001, beta_1=0.9, beta_2=0.999, epsilon=1e-07)
    input_shape = (HEIGHT, WIDTH, 3)
    base_model = InceptionV3(input_shape=input_shape, weights='imagenet', include_top=True)
    bottle = base_model.get_layer(index=-3).output
    bottle = Dropout(rate=0.3)(bottle)
    bottle = GlobalAveragePooling2D()(bottle)
    # gender
    gender_output = Dense(units=gender_cls, activation='softmax', name='gender_output')(bottle)
    
    # generation
    generation_output = Dense(units=generation_cls, activation='softmax', name='generation_output')(bottle)
    
    # identity age
    identity_outout = Dense(units=identity_cls, activation='softmax', name='identity_outout')(bottle)

    model = Model(inputs=base_model.input, outputs=[generation_output, identity_outout, gender_output])
    model.compile(optimizer=adam,
                  loss={'generation_output': 'categorical_crossentropy',
                      'identity_outout': 'categorical_crossentropy',
                      'gender_output': 'binary_crossentropy'},
                  #loss_weights={'realage_output': 1, 'gender_output': 10},
                  metrics={'generation_output':'accuracy',
                           'identity_outout':'accuracy',
                           'gender_output': 'accuracy'})

    model.summary()
    return model

if __name__=='__main__':
    model = load_model()

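2. Post-processing tricks

This relies on the fact that each person's face has its own character, so the face itself hardly changes over 2 to 3 years.

The graph below shows how the number of wrinkles on a human face increases with age.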

f:id:trafalbad:20210524111813p:plain

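You can see that wrinkles increase as people get older.

Predicting the exact age across the whole range at once is hard, so I first broke the problem into smaller, easier pieces with the following steps.

1. Use generation (6 categories) and identity-age (a coarse age, 21 categories) as the prediction targets
2. Predict generation and identity-age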

f:id:trafalbad:20210524111852p:plain

3. Use the predicted generation to narrow the identity-age range, then derive the age from it

f:id:trafalbad:20210524111940p:plain

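In this way I split a hard problem into easy ones and reduced the prediction error.

This is simpler and more accurate than regressing or classifying over 100 individual ages.

Post-processing code

# Classify ages into 6 generation categories: 0~7, 7~15, 15~25, 25~45, ...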

def return_generation(age):
    if age <7:
        return 0
    elif age >=7 and age<15:
        return 1
    elif age >=15 and age < 25:
        return 2
    elif age >=25 and age < 45:
        return 3
    elif age >=45 and age <70:
        return 4
    elif age >=70:
        return 5
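
The same bucketing can also be written as a boundary lookup. A minimal sketch, assuming the five boundaries used in return_generation above; `GENERATION_BOUNDS` and `generation_of` are names made up for illustration and are not part of the original code:

```python
from bisect import bisect_right

# Boundaries between generations, copied from return_generation
GENERATION_BOUNDS = [7, 15, 25, 45, 70]

def generation_of(age):
    """Return the generation index 0..5 for an age in years."""
    # bisect_right counts how many boundaries are <= age,
    # which is exactly the generation index
    return bisect_right(GENERATION_BOUNDS, age)

print([generation_of(a) for a in (3, 7, 20, 30, 69, 70)])
```

Keeping the boundaries in one list makes it easier to try different generation splits without touching the branching logic.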

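Generation is predicted over 6 classes. Accuracy: 83%

Identity age is predicted over 21 classes. Accuracy: 46%

First predict the generation to narrow the identity-age range, then take np.argmax over identity-age within that range to get a rough actual age.

Splitting the hard problem into easy ones improved the accuracy dramatically.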

class PostProcess(object):
    def __init__(self, pred_generation, pred_identity):
        self.pred_generation = pred_generation
        self.pred_identity = pred_identity
    
    def generation2identity(self, generation):
        if generation==0:
            return 0, 3
        elif generation==1:
            return 3, 5
        elif generation==2:
            return 6, 9
〜〜〜〜 omitted 〜〜〜〜
        
        
    def post_age_process(self):
        # Get the identity-age index range from the predicted generation
        lowidx, largeidx = self.generation2identity(np.argmax(self.pred_generation))
        print("lowidx, largeidx from generation", lowidx, largeidx, np.argmax(self.pred_generation))

        # Narrow identity-age down to that range
        slice_pred_identity = self.pred_identity[0][lowidx:largeidx]
        print('pred_identity', self.pred_identity)
        print('list', slice_pred_identity)
        print("pred identity idx", np.argmax(slice_pred_identity)+lowidx)

        # Pick the identity-age class and convert it to an actual age
        a = np.argmax(slice_pred_identity)+lowidx
        if a==0:
            return 2
        elif a==1:
            return 4
        elif a==2:
            return 6
        elif a==3:
            return 9
〜〜〜 omitted 〜〜〜

3. Prediction results

Face 1

f:id:trafalbad:20210524164046j:plain

Predicted age: 21, Female
f:id:trafalbad:20210524164130p:plain

Face 2

f:id:trafalbad:20210524164118j:plain

Predicted age: 2, Female

f:id:trafalbad:20210524164205p:plain

The predictions look pretty good.

Full code

import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

import sys
import time
import cv2
import numpy as np

from models import model, load, identity_age
from models.load import to_mean_pixel, MEAN_AVG

x = 100
y = 250
savepath='/Users/~/desktop/'

age_estimate_model = model.load_model()
age_estimate_model.load_weights('weights/best.hdf5')

def draw_label(image, point, label, font=cv2.FONT_HERSHEY_PLAIN,
               font_scale=1.5, thickness=2):
    text_color = (255, 255, 255)
    cv2.putText(image, label, point, font, font_scale, text_color, thickness, lineType=cv2.LINE_AA)


img_path='image1.jpg'
img = cv2.imread(img_path)
img = cv2.resize(img, (299, 299))
cimg = img.copy()
img = to_mean_pixel(img, MEAN_AVG)
img = img.astype(np.float32)/255
img = np.reshape(img, (1, 299, 299, 3))

generation, iden, gender = age_estimate_model.predict(img)
pred_identity = PostProcess(generation, iden).post_age_process()
    
pred_gender = "Male" if np.argmax(gender) < 0.5 else "Female"
print('gender is {0} and predict age is {1}'.format(pred_gender, pred_identity))
label = "{0} age  {1}".format(int(pred_identity), "Male" if np.argmax(gender) < 0.5 else "Female")

draw_label(cimg, (x, y), label)
cv2.imwrite('prediction.png', cimg)

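By reducing the error with tricks like these, I got fairly good predictions from a simple model without resorting to a heavy benchmark model.

Just like the human brain, when a problem is hard, drop the perfectionism: even after simplifying it to a level a kindergartener could solve, you can still end up with a well-performing model.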

2021-05-10

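Downloading a YOLOv4 dataset from the Google Open Images Dataset and creating the annotation file

This post covers how to download data for YOLOv4 from the Google Open Images Dataset.

The Google Open Images Dataset has good-quality data for everything from object detection to segmentation, and comes in versions v1 through v6.

Downloading it directly is a bit of a hassle (figuring out how is tedious).

So this is a memo on how to download the object-detection data for a specific class.

Downloading the data

I use OIDv4_ToolKit.

It can download the images and annotation data for just the classes you choose from Open Images Dataset V4.

The annotation data covers object detection only. Segmentation is not supported.

Bounding boxes come as .txt files in the format [name_of_the_class, left, top, right, bottom], so conversion may be needed depending on your use case.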
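
If the target framework expects the Darknet-style normalized `class_id cx cy w h` layout instead, the conversion is only a few lines. A rough sketch, assuming the image width and height are known; `to_yolo_line`, the class-id mapping, and the 1024x1024 size in the example are hypothetical and not part of OIDv4_ToolKit:

```python
def to_yolo_line(label_line, img_w, img_h, class_ids):
    """Convert 'name left top right bottom' (pixels) to 'id cx cy w h' (normalized)."""
    # Split from the right so class names containing spaces stay intact
    name, left, top, right, bottom = label_line.strip().rsplit(" ", 4)
    left, top, right, bottom = map(float, (left, top, right, bottom))
    cx = (left + right) / 2 / img_w   # box center, as a fraction of the width
    cy = (top + bottom) / 2 / img_h   # box center, as a fraction of the height
    w = (right - left) / img_w
    h = (bottom - top) / img_h
    return "%d %.6f %.6f %.6f %.6f" % (class_ids[name], cx, cy, w, h)

# Example with a made-up box and image size
print(to_yolo_line("Knife 0 0 512 1024", 1024, 1024, {"Knife": 0}))
```

Whether this conversion is needed depends on the loader: the annotation file built later in this post keeps pixel coordinates.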

$ git clone https://github.com/EscVM/OIDv4_ToolKit.git
$ cd OIDv4_ToolKit
$ pip3 install -r requirements.txt

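This time I download class == Knife.
The available classes can be looked up from the search box on the Google Open Images Dataset site.

--type_csv accepts one of [train/validation/test/all].
I want everything, so I specify all.

The other arguments can be specified as described on GitHub.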

・IsOccluded: Indicates that the object is occluded by another object in the image.
・ IsTruncated: Indicates that the object extends beyond the boundary of the image.
・ IsGroupOf: Indicates that the box spans a group of objects (e.g., a bed of flowers or a crowd of people). We asked annotators to use this tag for cases with more than 5 instances which are heavily occluding each other and are physically touching.
・ IsDepiction: Indicates that the object is a depiction (e.g., a cartoon or drawing of the object, not a real physical instance).
・IsInside: Indicates a picture taken from the inside of the object (e.g., a car interior or inside of a building).
・n_threads: Select how many threads you want to use. The ToolKit will take care for you to download multiple images in parallel, considerably speeding up the downloading process.
・limit: Limit the number of images being downloaded. Useful if you want to restrict the size of your dataset.
・y: Answer yes when missing csv files have to be downloaded.

$ python3 main.py downloader --classes Knife --type_csv all

Answer [Y] to every prompt and the download proceeds.

Folder structure

$ tree OID
>>>>

OID
|-- Dataset
|   |-- test
|   |   |
|   |   |
|   |   |-- Knife
            |--~.jpg  (Knife images)
            -- Label
                 |-- ~.txt  (label text for boxes)
|   |-- train
|   |   |
|   |   |
|   |   |-- Knife
            |--~.jpg  (Knife images)
            -- Label
                 |-- ~.txt  (label text for boxes)
|   |-- validation
|   |   |
|   |   |
|   |   |-- Knife
            |--~.jpg  (Knife images)
            -- Label
                 |-- ~.txt  (label text for boxes)
`-- csv_folder
    |-- class-descriptions-boxable.csv
    |-- test-annotations-bbox.csv
    |-- train-annotations-bbox.csv
    `-- validation-annotations-bbox.csv

Dataset details and contents

# Contents of a txt file in OIDv4_ToolKit/OID/Dataset/train/Knife/Label
# validation and test have the same format

$ cat 870eb1cdddbcce5a.txt
Knife 24.320256 23.04 767.360256 849.92

# Number of entries in OIDv4_ToolKit/OID/Dataset/train/Knife/Label
$ ls -l | wc -l
611

# Number of images in OIDv4_ToolKit/OID/Dataset/train/Knife
$ ls |wc -l 
611

# Number of images and labels in OIDv4_ToolKit/OID/Dataset/test/Knife
161
# Number of images and labels in OIDv4_ToolKit/OID/Dataset/validation/Knife
56

# Contents of the files in csv_folder
$ cat class-descriptions-boxable.csv

~~
/m/0pcr,Alpaca
/m/0pg52,Taxi
/m/0ph39,Canoe
/m/0qjjc,Remote control
/m/0qmmr,Wheelchair
/m/0wdt60w,Rugby ball
/m/0xfy,Armadillo
/m/0xzly,Maracas
/m/0zvk5,Helmet

$ cat test-annotations-bbox.csv
>>>
fffc6543b32da1dd,freeform,/m/0jbk,1,0.013794,0.999996,0.388438,0.727906,0,0,1,0,0
fffd0258c243bbea,freeform,/m/01g317,1,0.000120,0.999896,0.000000,1.000000,1,0,1,0,0

$ cat validation-annotations-bbox.csv

>>>
ffff21932da3ed01,freeform,/m/0c9ph5,1,0.540223,0.624863,0.493633,0.577892,1,0,1,0,0
ffff21932da3ed01,freeform,/m/0cgh4,1,0.002521,1.000000,0.000000,0.998685,0,0,0,0,1
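
The rows above appear to follow the Open Images bbox column order `ImageID, Source, LabelName, Confidence, XMin, XMax, YMin, YMax, IsOccluded, IsTruncated, IsGroupOf, IsDepiction, IsInside`, with coordinates normalized to the 0..1 range. A small sketch of parsing one row under that assumption (`COLUMNS` and `parse_bbox_row` are hypothetical names, not part of the toolkit):

```python
import csv
import io

# Assumed column order of the *-annotations-bbox.csv files
COLUMNS = ["ImageID", "Source", "LabelName", "Confidence",
           "XMin", "XMax", "YMin", "YMax",
           "IsOccluded", "IsTruncated", "IsGroupOf", "IsDepiction", "IsInside"]

def parse_bbox_row(row_text):
    """Parse one headerless CSV row into a dict with typed fields."""
    values = next(csv.reader(io.StringIO(row_text)))
    row = dict(zip(COLUMNS, values))
    for key in ("XMin", "XMax", "YMin", "YMax"):
        row[key] = float(row[key])   # normalized coordinates
    for key in COLUMNS[8:]:
        row[key] = int(row[key])     # attribute flags stored as integers
    return row

row = parse_bbox_row("ffff21932da3ed01,freeform,/m/0c9ph5,1,"
                     "0.540223,0.624863,0.493633,0.577892,1,0,1,0,0")
print(row["LabelName"], row["XMin"], row["XMax"])
```

The `LabelName` value is the machine id that class-descriptions-boxable.csv maps to a readable class name.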

Knife image data

f:id:trafalbad:20210510112838j:plain

Loading the data into YOLOv4

Put the Knife folder into the data folder.

Then create the text file for YOLOv4.

import os

classes = ['Knife']
classes_dicts = {key:idx for idx, key in enumerate(classes)}

def main(label_path, jpg_path_name, save_filetxt_name):
    with open(save_filetxt_name, 'w') as f:
        for path in os.listdir(label_path):
            filename = path.replace('txt', 'jpg')
            f.write(os.path.join(jpg_path_name, filename))
            
            # Each label line is "<class> <x_min> <y_min> <x_max> <y_max>"
            with open(os.path.join(label_path, path), 'r', encoding='utf-8') as loadf:
                for line in loadf:
                    cls, x_min, y_min, x_max, y_max = line.split()
                    x_min, y_min, x_max, y_max = int(float(x_min)), int(float(y_min)), int(float(x_max)), int(float(y_max))
                    cls = classes_dicts[cls]
                    box_info = " %d,%d,%d,%d,%d" % (
                    x_min, y_min, x_max, y_max, int(cls))
                    f.write(box_info)
            f.write('\n')
            
if __name__=='__main__':
    data_type='test'
    assert data_type in ['train', 'validation', 'test'], 'choose data_type from [train, validation, test]'
    jpg_path_name = 'data/Knife/Dataset/{}/Knife'.format(data_type)
    save_filetxt_name = 'data/pytorch_yolov4_{}.txt'.format(data_type)
    label_path = 'data/Knife/Dataset/{}/Knife/Label'.format(data_type)
    main(label_path, jpg_path_name, save_filetxt_name)

Let's open the resulting "pytorch_yolov4_validation.txt".

# loader function
data_type='validation'
save_filetxt_name = 'data/pytorch_yolov4_{}.txt'.format(data_type)
label_path = save_filetxt_name

def open_txtfile(label_path):
    # Map each image path to its list of [x_min, y_min, x_max, y_max, class_id] boxes
    truth = {}
    with open(label_path, 'r', encoding='utf-8') as f:
        for line in f:
            data = line.split()
            truth[data[0]] = []
            for i in data[1:]:
                truth[data[0]].append([int(float(j)) for j in i.split(',')])
    print(truth)
    return truth

open_txtfile(label_path)

>>>

data/Knife/Dataset/validation/Knife/2497ac78d31d89d5.jpg 15,166,942,489,0
data/Knife/Dataset/validation/Knife/09a9a9d1fe0a592a.jpg 55,313,333,1024,0
data/Knife/Dataset/validation/Knife/f2a2a1a0095f5d79.jpg 108,481,1024,636,0
data/Knife/Dataset/validation/Knife/4b6d3c391753e5ce.jpg 225,59,372,219,0 539,242,1024,292,0 611,478,1024,720,0 776,179,1024,244,0
data/Knife/Dataset/validation/Knife/4b1fc77d58646a7e.jpg 65,66,983,744,0
〜〜〜

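It more or less matches the format in the GitHub README.md, so the annotation txt file for YOLOv4 is ready.

Changes made to the YOLOv4 files

To load the data, I changed the following points.

In the Yolo_dataset class in dataset.py, I removed the os.path.join used when loading the image.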

# dataset.py
class Yolo_dataset(Dataset):
〜〜
    def __getitem__(self, index):
        if not self.train:
            return self._get_val_item(index)
        # img_path is used as-is (this is where os.path.join was removed)
        img_path = self.imgs[index]
        bboxes = np.array(self.truth.get(img_path), dtype=float)
        use_mixup = self.cfg.mixup
        if random.randint(0, 1):
            use_mixup = 0

        for i in range(use_mixup + 1):
            if i != 0:
                img_path = random.choice(list(self.truth.keys()))
                bboxes = np.array(self.truth.get(img_path), dtype=float)
            img = cv2.imread(img_path)
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

To control memory usage, I added a num_worker argument to train() in train.py.

def train(model, device, config, epochs=5, batch_size=1, save_cp=True, num_worker = 0, log_step=20, img_scale=0.5):
    train_dataset = Yolo_dataset(config.train_label, config, train=True)
    val_dataset = Yolo_dataset(config.val_label, config, train=False)
    〜〜〜〜
    writer.close()

# train.py => run
# adjust memory usage with num_worker

try:
    train(model=model,
          device=device,
          config=cfg,
          epochs=cfg.TRAIN_EPOCHS,
          num_worker=0)
except KeyboardInterrupt:
    torch.save(model.state_dict(), 'INTERRUPTED.pth')
    logging.info('Saved interrupt')
    try:
        sys.exit(0)
    except SystemExit:
        os._exit(0)

>>>>>

2021-05-11 07:57:09,935 <ipython-input-3-6cfc1c1d5a28>[line:36] INFO: Starting training:
        Epochs:          300
        Batch size:      64
        Subdivisions:    16
        Learning rate:   0.001
        Training size:   610
        Validation size: 56
        Checkpoints:     True
        Device:          cpu
        Images size:     608
        Optimizer:       adam
        Dataset classes: 1
        Train label path:data/pytorch_yolov4_train.txt
        Pretrained:
    
Epoch 1/300:   0%|       | 0/610 [00:07<?, ?img/s]

It ran without problems.

References

・はじめての Google Open Images Dataset V6 (Getting started with Google Open Images Dataset V6)

・OIDv4_ToolKit

2021-04-06

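Notes on setting up an M1 MacBook Air

Infrastructure

As of 2021/04, an M1 Mac is not usable out of the box: brew is broken and tensorflow can't be installed.
These are my notes from the trial and error.

This is a record of what I ran, in chronological order.

・ubuntu-18.04-arm64.iso for a virtual environment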

f:id:trafalbad:20210406003747j:plain

Installing python3 and pip3

tensorflow does not support python3.9, so I drop down to python3.8.

First, to avoid mixing with the Intel build, remove brew's python3.

$ brew uninstall --ignore-dependencies python3
$ python3 --version
>>>
Python 3.8.2

$ python3 -m pip install --upgrade pip --user
$ pip3 --version   
>>>>   
pip 21.0.1 from /Users/ha~/Library/Python/3.8/lib/python/site-packages/pip (python 3.8)


$ which python3
>>>>>
/usr/bin/python3

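＊＊＊＊＊＊pay attention
On an M1 Mac (as of 2021/04), things like anaconda and brew are installed under /opt. For Homebrew on M1 Macs, the official documentation recommends installing to /opt/homebrew (to avoid clashing with the Intel version so the two can coexist).

Installing conda

First, download anaconda the usual way.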

$ conda
>>>>
zsh: command not found: conda

It spits out an error, so check whether it was installed correctly.

$ /opt/anaconda3/bin/conda init zsh

If no error appears, it's OK.

$ /opt/anaconda3/bin/conda --version
>>>
conda 4.9.2

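So it was this kind of error after all: conda is installed, it just isn't on the PATH.

Typing the full path every time is tedious, so I create a shortcut for conda.

I made conda available through my own command, "conde".

# add the command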
$ sudo vi ~/.bashrc
$ sudo vi ~/.zshenv

>>>>>
alias conde='/opt/anaconda3/bin/conda'

# make the shell load the alias files
$ sudo vi ~/.bash_profile
>>>
source ~/.bashrc
source ~/.zshenv


$ source ~/.bash_profile

# check
$ conde --version
>>>>
conda 4.9.2

Done.

Checking and switching the architecture

# mode at purchase
$ uname -m
>>>>arm64
$ arch 
>>>>arm64

Switch the architecture with the arch command.

$ arch -x86_64 bash

# probably running under Rosetta 2 (Intel)
$ arch
i386

Switch back.
$ arch -arm64 bash
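
The current mode can also be checked from Python. A tiny sketch using the standard `platform` module; that it reports `arm64` natively and `x86_64` under Rosetta 2 is my assumption about this setup, and `describe_arch` is a made-up helper name:

```python
import platform

def describe_arch(machine=None):
    """Return a short label for the CPU architecture Python is running as."""
    machine = machine or platform.machine()
    if machine in ("arm64", "aarch64"):
        return "native ARM (%s)" % machine
    if machine in ("x86_64", "i386", "AMD64"):
        return "Intel/x86 (%s), possibly under Rosetta 2 on an M1 Mac" % machine
    return "other (%s)" % machine

print(describe_arch())
```

Running this in each shell before installing packages helps avoid mixing arm64 and Intel builds by accident.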

Installing tensorflow

Running the tensorflow installed with pip3 fails with the error "zsh: illegal hardware instruction".
Installing with conda was the only way I could find.

# uninstall the tensorflow installed with pip3
$ pip3 uninstall tensorflow

# install tensorflow with conda
$ (/opt/anaconda3/bin/conda) conde install tensorflow

>>>>

Specifications:

  - tensorflow -> python[version='2.7.*|3.7.*|3.6.*|3.5.*']

Your python: python=3.8

I was told to use an older python such as 3.6 or 3.5.

$ (/opt/anaconda3/bin/conda) conde install python=3.6

$ (/opt/anaconda3/bin/conda) conde install tensorflow

Check that it runs.

# tf.py
import tensorflow
from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras.models import Model
from tensorflow.keras.layers import *
from tensorflow.keras.optimizers import RMSprop, Adam, SGD

$ python3 tf.py

No errors, so tensorflow finally runs.
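
A slightly more explicit check than "no error came out" is to ask Python whether a package can be located before importing it. A minimal sketch with the standard library only; `has_module` is a hypothetical helper and the package names in the loop are just examples:

```python
import importlib.util

def has_module(name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None

for name in ("tensorflow", "numpy"):
    print(name, "found" if has_module(name) else "NOT found")
```

This only confirms that the package is visible to the current interpreter; a broken binary can still fail at import time.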

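＊＊＊＊Pay attention
The M1 MacBook Air initially had python3.9, and tensorflow did not support python3.9, so it could not be installed with pip3.
As a last resort, installing with conda was the only option.

References

・M1 Mac+tensorflow-macosでディープラーニングする (Deep learning with M1 Mac + tensorflow-macos)

・M1 Mac買ったので行ったセットアップを書いていく (Writing up the setup I did after buying an M1 Mac)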

2021-01-28

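Coding test: applied problems for training problem-solving skills, from Codility at 2021/01

Know-how and techniques

Codility problems

Notes from 2021/01.

Codility difficulty levels
"PAINLESS" < "RESPECTABLE" < "AMBITIOUS"
in increasing order of difficulty.

Start with divide and conquer: solve a simple case first, test it, then write general code.
When stuck on an error, google it not only for python but also for java or c++ and adapt what you find. Java in particular has plenty of answers, so adapting from java is recommended.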

f:id:trafalbad:20210128211805p:plain

Iteration：BinaryGap（PAINLESS）

my solution

def reset(ones, zeros):
    ones = 1
    zeros = 0
    return ones, zeros
    
def solution(N):
    binari = bin(N)[2:]
    ones, zeros = 0, 0
    lenth = []
    for i, val in enumerate(binari):
        if val==str(1):
            ones+=1
        else:
            zeros+=1          
        if ones==2:
            lenth.append(zeros)
            ones, zeros = reset(ones, zeros)
    return max(lenth) if lenth else 0

smart code solution

def solution(N):
    N = str(bin(N)[2:])
    count = False
    gap = 0
    max_gap = 0
    for i in N:
        if i == '1' and count==False:
            count = True
        if i == '0' and count == True:
            gap += 1
        if i == '1' and count == True:
            max_gap = max(max_gap, gap)
            gap = 0
    return max_gap
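
For comparison, the gap search can also lean on string methods. A short sketch, not taken from Codility; `binary_gap` is just a name chosen here:

```python
def binary_gap(n):
    """Length of the longest run of zeros enclosed by ones in n's binary form."""
    # Trailing zeros are not closed by a one, so drop them first
    bits = bin(n)[2:].rstrip('0')
    # The pieces left between consecutive ones are exactly the gaps
    return max(len(gap) for gap in bits.split('1'))

print(binary_gap(1041))
```

It trades the explicit state tracking for one pass of `split`, which is easy to read but builds a temporary list.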

Array:CyclicRotation（PAINLESS）

My solution

def solution(A, K):
    if not A:  # an empty list has no last element to rotate
        return A
    for _ in range(K):
        last = A[-1]
        del A[-1]
        A=[last]+A
    return A

smart solution

def solution(A, K):
    l = len(A)
    if l < 2:
        return A
    elif l == K:
        return A
    else:
        B = [0]*l
        for i in range(l):
            B[(i+K)%l] = A[i]
        return B

Time Complexity：TapeEquilibrium（PAINLESS）

My solution

def solution(A):
    diff = float('inf')
    for i in range(1, len(A)):
        s1 = sum(A[:i])
        s2 = sum(A[i:])
        diff = min(diff, abs(s1-s2))
    return diff

smart solution

def solution(A):
    total, minimum, left = sum(A), float('inf'), 0
    for a in A[:-1]:
        left += a
        minimum = min(abs(total - left - left), minimum)
    return minimum

Counting Elements：MaxCounters（RESPECTABLE）

My solution

def solution(N, A):
    arr = [0]*N
    for val in A:
        if val == N + 1:  # N + 1 means "set every counter to the current maximum"
            arr = [max(arr)]*N
        else:
            arr[val-1]+=1
    return arr

smart solution

def solution2(N, A):
    counters = [0] * N
    for el in A:
        if el <= N:
            counters[el - 1] += 1
        else:
            counters = [max(counters)] * N
    return counters
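
Rebuilding the whole list on every max-counter operation costs O(N) each time. A common way around this, sketched below under my own naming (`max_counters`, `floor`, `current_max`), is to remember the value all counters were last raised to and apply it lazily:

```python
def max_counters(n, ops):
    """Apply increase / max-counter operations in O(n + len(ops))."""
    counters = [0] * n
    floor = 0         # value every counter was last raised to
    current_max = 0   # largest counter value seen so far
    for op in ops:
        if 1 <= op <= n:
            i = op - 1
            # Catch up with the last max-counter operation before increasing
            if counters[i] < floor:
                counters[i] = floor
            counters[i] += 1
            current_max = max(current_max, counters[i])
        else:
            # op == n + 1: postpone the update, only remember the new floor
            floor = current_max
    # Counters never touched after the last max operation still need the floor
    return [max(c, floor) for c in counters]

print(max_counters(5, [3, 4, 4, 6, 1, 4, 4]))
```

The final list comprehension is the only place where all counters are visited, so the cost no longer depends on how many max-counter operations appear.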

CoderByte

CoderByte Challenge Library

f:id:trafalbad:20210130212935p:plain

Easy & Algorithm

Find Intersection

FindIntersection(strArr) read the array of strings stored in strArr which will contain 2 elements: the first element will represent a list of comma-separated numbers sorted in ascending order, the second element will represent a second list of comma-separated numbers (also sorted). Your goal is to return a comma-separated string containing the numbers that occur in elements of strArr in sorted order. If there is no intersection, return the string false.

Input: ["1, 3, 4, 7, 13", "1, 2, 4, 13, 15"] 
Output: 1,4,13

def FindIntersection(strArr):
    st1 = list(map(int, strArr[0].split(', ')))
    st2 = list(map(int, strArr[1].split(', ')))
    string = []
    for s in st1:
        if s in st2:
            string.append(str(s))
    return ','.join(string) if string else 'false'

Codeland Username Validation

Have the function CodelandUsernameValidation(str) take the str parameter being passed and determine if the string is a valid username according to the following rules:

1. The username is between 4 and 25 characters.
2. It must start with a letter.
3. It can only contain letters, numbers, and the underscore character.
4. It cannot end with an underscore character.

If the username is valid then your program should return the string true, otherwise return the string false.

# sample1
input: "aa_" 
Output: false

#sample2
Input: "u__hello_world123" 
Output: true

def CodelandUsernameValidation(strParam):
    if len(strParam)<4 or len(strParam)>25:
        return 'false'
    if not strParam[0].isalpha() or strParam[-1]=='_':
        return 'false'
    # reject the name as soon as a character other than a letter, digit or '_' appears
    for x in strParam:
        if not (x.isalpha() or x=='_' or x.isdigit()):
            return 'false'
    return 'true'
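
The four rules also fit into one regular expression. A sketch, assuming "letters" means ASCII letters only (`str.isalpha` would also accept non-ASCII letters); `USERNAME_RE` and `is_valid_username` are hypothetical names:

```python
import re

# A letter, then 2-23 letters/digits/underscores, then a letter or digit:
# 4 to 25 characters in total, never ending with an underscore
USERNAME_RE = re.compile(r'[A-Za-z][A-Za-z0-9_]{2,23}[A-Za-z0-9]')

def is_valid_username(name):
    """Return the string 'true' or 'false', as the challenge expects."""
    return 'true' if USERNAME_RE.fullmatch(name) else 'false'

print(is_valid_username("aa_"), is_valid_username("u__hello_world123"))
```

Encoding the length limits in the quantifier keeps every rule in one place, at the price of a pattern that needs the comment above it to stay readable.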

Questions Marks

Have the function QuestionsMarks(str) take the str string parameter, which will contain single digit numbers, letters, and question marks, and check if there are exactly 3 question marks between every pair of two numbers that add up to 10. If so, then your program should return the string true, otherwise it should return the string false. If there aren't any two numbers that add up to 10 in the string, then your program should return false as well.

For example: if str is "arrb6???4xxbl5???eee5" then your program should return true because there are exactly 3 question marks between 6 and 4, and 3 question marks between 5 and 5 at the end of the string.

# sample1
Input: "aa6?9" 
Output: false

#sample2
Input: "acc?7??sss?3rr1??????5" 
Output: true

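def QuestionsMarks(strParam):
  prev = None     # last digit seen so far
  marks = 0       # question marks seen since that digit
  found = False   # True once a pair adding up to 10 has been checked
  for s in strParam:
    if s == '?':
      marks += 1
    elif s.isdigit():
      # a pair of neighbouring digits that adds up to 10 needs exactly 3 '?'
      if prev is not None and prev + int(s) == 10:
        if marks != 3:
          return 'false'
        found = True
      prev = int(s)
      marks = 0
  return 'true' if found else 'false'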

Longest Word

Have the function LongestWord(sen) take the sen parameter being passed and return the largest word in the string. If there are two or more words that are the same length, return the first word from the string with that length. Ignore punctuation and assume sen will not be empty.

# sample1
Input: "fun&!! time" 
Output: time

#sample2
Input: "I love dogs" 
Output: love

def LongestWord(sen):
    stack = {}
    string=''
    for s in list(sen):
        if s.isalpha():
            string +=s
        else:
            stack[string]=len(string)
            string=''
    stack[string]=len(string)
    return max(stack, key=stack.get)

First Factorial

Have the function FirstFactorial(num) take the num parameter being passed and return the factorial of it. For example: if num = 4, then your program should return (4 * 3 * 2 * 1) = 24. For the test cases, the range will be between 1 and 18 and the input will always be an integer.

# sample1
Input: 4 
Output: 24

#sample2
Input: 8 
Output: 40320

def FirstFactorial(num):
  factorial = 1
  for i in range(num, 0, -1):
    factorial *= i
  return factorial

Min Window Substring(Medium)

#algorithm #Facebook

MinWindowSubstring(strArr) take the array of strings stored in strArr, which will contain only two strings, the first parameter being the string N and the second parameter being a string K of some characters, and your goal is to determine the smallest substring of N that contains all the characters in K. For example: if strArr is ["aaabaaddae", "aed"] then the smallest substring of N that contains the characters a, e, and d is "dae" located at the end of the string. So for this example your program should return the string dae.

Another example: if strArr is ["aabdccdbcacd", "aad"] then the smallest substring of N that contains all of the characters in K is "aabd" which is located at the beginning of the string. Both parameters will be strings ranging in length from 1 to 50 characters and all of K's characters will exist somewhere in the string N. Both strings will only contains lowercase alphabetic characters.

Input: ["ahffaksfajeeubsne", "jefaa"] 
Output: aksfaje

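from collections import Counter

def MinWindowSubstring(strArr):
    N, K = strArr
    need = Counter(K)   # how many of each character the window still lacks
    missing = len(K)    # total number of required characters not yet covered
    best = ''
    left = 0
    for right, ch in enumerate(N):
        if need[ch] > 0:
            missing -= 1
        need[ch] -= 1
        if missing == 0:
            # shrink from the left while the window still covers K
            while need[N[left]] < 0:
                need[N[left]] += 1
                left += 1
            if not best or right - left + 1 < len(best):
                best = N[left:right + 1]
            # give up the leftmost required character and keep searching
            need[N[left]] += 1
            missing += 1
            left += 1
    return best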

2021-01-27

Basic problems for coding tests, from LeetCode

LeetCode problems

f:id:trafalbad:20210127075001p:plain

1. Two Sum

# exactly one solution
Input: nums = [2,7,11,15], target = 9
Output: [0,1]
Explanation: Because nums[0] + nums[1] == 9, we return [0, 1].

from typing import List  # provided by LeetCode; needed when running locally

class Solution:
    def twoSum(self, nums: List[int], target: int) -> List[int]:
        # stack = {}
        stack = []
        for idx, p in enumerate(nums):
            if p in stack:
                # idx2 = stack[p]
                idx2 = stack.index(p)
                return [idx2, idx]
            else:
                # stack[target-p]=idx
                stack.append(target-p)
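
The commented-out lines hint at a dict-based variant, which avoids the linear `stack.index` lookup. A minimal sketch of that idea as a plain function (the name `two_sum` and the `None` fallback are mine):

```python
def two_sum(nums, target):
    """Return indices of the two numbers adding up to target, or None."""
    seen = {}  # value needed to complete a pair -> index of its partner
    for idx, value in enumerate(nums):
        if value in seen:
            return [seen[value], idx]
        seen[target - value] = idx
    return None

print(two_sum([2, 7, 11, 15], 9))
```

Each lookup in the dict is O(1) on average, so the whole scan is linear instead of quadratic.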

7. Reverse Integer

Given a signed 32-bit integer x, return x with its digits reversed. If reversing x causes the value to go outside the signed 32-bit integer range [-2^31, 2^31 - 1], then return 0.

Assume the environment does not allow you to store 64-bit integers (signed or unsigned).

Input: x = 123
Output: 321

Input: x = 120
Output: 21

class Solution:
    def reverse(self, x: int) -> int:
        reverse_str = str(int(abs(x)))[::-1]
        submit = int(reverse_str)
        # must check first 
        if submit>= 2** 31 -1 or submit<= -2** 31:
            return 0
        elif x<0:
            return -submit
        else:
            return submit

9. Palindrome Number

Given an integer x, return true if x is palindrome integer.

An integer is a palindrome when it reads the same backward as forward. For example, 121 is palindrome while 123 is not.

Input: x = 121
Output: true

Input: x = -121
Output: false

class Solution:
    def isPalindrome(self, x: int) -> bool:
        return str(x)==str(x)[::-1]

13. Roman to Integer

Input: s = "MCMXCIV"
Output: 1994
Explanation: M = 1000, CM = 900, XC = 90 and IV = 4.

class Solution:
    def romanToInt(self, s: str) -> int:
        d = {'M': 1000,'D': 500 ,'C': 100,'L': 50,'X': 10,'V': 5,'I': 1}
        total = 0
        for i in range(0, len(s)-1):
            if d[s[i]]>=d[s[i+1]]:
                total += d[s[i]]
            else:
                total -= d[s[i]]
        # the last symbol is not covered by the loop above
        total += d[s[-1]]
        return total

20. Valid Parentheses

Given a string s containing just the characters '(', ')', '{', '}', '[', ']', determine if the input string is valid.
An input string is valid if:

・Open brackets must be closed by the same type of brackets.
・Open brackets must be closed in the correct order.

# sample1
Input: s = "{[]}"
Output: true

# sample2
Input: s = "()[]{}"
Output: true


# sample3
Input: s = "([)]"
Output: false

class Solution(object):
    def isValid(self, s):
        stack = []
        mapping = {")": "(", "}": "{", "]": "["}
        for char in s:
            if char in mapping.keys():
                # when else, stack is empty
                c = stack.pop() if stack else '#'
                if mapping[char] != c:
                    return False
            else:
                stack.append(char)
        # for cases like '['
        return not stack

26. Remove Duplicates from Sorted Array

Given a sorted array nums, remove the duplicates in-place such that each element appears only once and returns the new length.

Do not allocate extra space for another array, you must do this by modifying the input array in-place with O(1) extra memory.

# In short: remove the duplicate elements without using a new list, and return the length of the list
Input: nums = [0,0,1,1,1,2,2,3,3,4]
Output: 5, nums = [0,1,2,3,4]
Explanation: Your function should return length = 5, with the first five elements of nums being modified to 0, 1, 2, 3, and 4 respectively. It doesn't matter what values are set beyond the returned length.

class Solution:
    def removeDuplicates(self, nums: List[int]) -> int:
        i=0
        n = len(nums)
        for _ in range(n-1):
            if nums[i]==nums[i+1]:
                del nums[i]
            else:
                i +=1
        return len(nums)

35. Search Insert Position

Given a sorted array of distinct integers and a target value, return the index if the target is found. If not, return the index where it would be if it were inserted in order.

# sample1
Input: nums = [1,3,5,6], target = 5
Output: 2
# sample2
Input: nums = [1,3,5,6], target = 2
Output: 1

class Solution:
    def searchInsert(self, nums: List[int], target: int) -> int:
        for i,n in enumerate(nums):
            if nums[i] >= target:
                return i
            elif i == len(nums) - 1:
                return len(nums)
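
Because the array is sorted, the standard library's `bisect_left` gives the same answer with a binary search. A tiny sketch (`search_insert` is a hypothetical name):

```python
from bisect import bisect_left

def search_insert(nums, target):
    """Index of target in sorted nums, or where it would be inserted."""
    # bisect_left returns the first position whose value is >= target
    return bisect_left(nums, target)

print(search_insert([1, 3, 5, 6], 2))
```

This also handles an empty list, for which it returns 0.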

53. Maximum Subarray

Given an integer array nums, find the contiguous subarray (containing at least one number) which has the largest sum and return its sum.

Input: nums = [-2,1,-3,4,-1,2,1,-5,4]
Output: 6
Explanation: [4,-1,2,1] has the largest sum = 6.

class Solution:
    def maxSubArray(self, nums: List[int]) -> int:
        if not nums:
            return 0
        # curSum: largest sum of a subarray ending at the current element
        # maxSum: largest subarray sum seen so far
        curSum = maxSum = nums[0]
        for num in nums[1:]:
            curSum = max(num, curSum + num)
            maxSum = max(maxSum, curSum)

        return maxSum

169. Majority Element

Given an array nums of size n, return the majority element.

The majority element is the element that appears more than ⌊n / 2⌋ times. You may assume that the majority element always exists in the array.

# sample1
Input: nums = [3,2,3]
Output: 3
# sample2
Input: nums = [2,2,1,1,1,2,2]
Output: 2

class Solution:
    def majorityElement(self, nums: List[int]) -> int:
        d={}
        for val in nums:
            if val in d:
                d[val] +=1
            else:
                d[val] =1
        return max(d, key=d.get)  # returns the key with the largest value in the dict
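
As an alternative that needs no dict, the Boyer-Moore majority vote keeps a single candidate and a counter. A sketch, relying on the problem's guarantee that a majority element exists (`majority_element` is a name chosen here):

```python
def majority_element(nums):
    """Boyer-Moore majority vote: O(n) time, O(1) extra space."""
    candidate, count = None, 0
    for value in nums:
        if count == 0:
            candidate = value   # start backing a new candidate
        # matching values vote for the candidate, others cancel one vote
        count += 1 if value == candidate else -1
    return candidate

print(majority_element([2, 2, 1, 1, 1, 2, 2]))
```

Without the guarantee, the returned candidate would have to be verified with a second pass that counts its occurrences.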

171. Excel Sheet Column Number (how I solved it: googling skills)

Given a column title as it appears in an Excel sheet, return its corresponding column number.

For example:

    A -> 1
    B -> 2
    C -> 3
    ...
    Z -> 26
    AA -> 27
    AB -> 28
    ...
# sample1
Input: "A"   Output: 1
# sample 2
Input: "ZY"  Output: 701

How I solved it: I googled "アルファベット　数字 python" ("alphabet number python").

class Solution:
    def titleToNumber(self, alpha: str) -> int:
        num=0
        for index, item in enumerate(list(alpha)):
            num += pow(26,len(alpha)-index-1)*(ord(item)-ord('A')+1)
        return num
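
The same conversion can be written as base-26 accumulation from left to right, without `pow`. A short sketch (`title_to_number` is a name picked for illustration):

```python
def title_to_number(title):
    """Convert an Excel column title such as 'ZY' to its 1-based number."""
    num = 0
    for ch in title:
        # shift the digits read so far one place left, then add the new one
        num = num * 26 + (ord(ch) - ord('A') + 1)
    return num

print(title_to_number("ZY"))
```

For "ZY" this computes 26 * 26 + 25 = 701, matching sample 2.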

2021-01-27

Basic problems for coding tests, from HackerRank

Know-how and techniques

HackerRank Interview Preparation Kit

f:id:trafalbad:20210127054815j:plain

Type : Array

Arrays: Left Rotation

Explanation
When we perform d = 4 left rotations, the array undergoes the following sequence of changes: [1,2,3,4,5] → [2,3,4,5,1] → [3,4,5,1,2] → [4,5,1,2,3] → [5,1,2,3,4]

Sample Input

5 4
1 2 3 4 5

Sample Output

5 1 2 3 4

Solution

def rotLeft(a, d):
    return a[d:] + a[:d]

2D Array - DS

In a 6×6 array there are 16 hourglasses, each made up of the elements at the positions shown below.

a b c
  d
e f g

Find the maximum hourglass sum.
Solution

def hourglass_sums(arr):
    sums=[]
    for w in range(4):
        for h in range(4):
            hourglass = arr[h][w]+arr[h][w+1]+arr[h][w+2]+arr[h+1][w+1]+arr[h+2][w]+arr[h+2][w+1]+arr[h+2][w+2]
            sums.append(hourglass)
    return max(sums)

New Year Chaos

Sample Input

STDIN       Function
-----       --------
2           t = 2
5           n = 5
2 1 5 3 4   q = [2, 1, 5, 3, 4]
5           n = 5
2 5 1 3 4   q = [2, 5, 1, 3, 4]

Sample Output

3
Too chaotic

Solution

def minimumBribes(q):
    bribes = 0
    q = [i-1 for i in q]
    # reverse loop
    for i in range(len(q)-1,-1,-1):
        if q[i] - i > 2:
            print('Too chaotic')
            return
        # get specified value in loop
        for j in range(max(0, q[i] - 2),i):
            if q[j] > q[i]:
                bribes+=1
    print(bribes)

Type：Dictionaries and Hashmaps

Two Strings

Sample Input

2
hello
world
hi
world

Sample Output

YES
NO

Solution

def twoStrings(s1, s2):
    for s in s1:
        if s in s2:
            return 'YES'
    return 'NO'

Count Triplets

For example, sample input

len=5 ratio=5
1 5 5 25 125

Sample Output

4

The triplets that satisfy the condition are at indices (0,1,3), (0,2,3), (1,3,4), (2,3,4).

Solution

from collections import Counter

def countTriplets(arr, r):
    r2 = Counter()
    r3 = Counter()
    count = 0
    for p in arr:
        if p in r3:
            count += r3[p]
        if p in r2:
            r3[p*r] += r2[p]
        r2[p*r] +=1
    return count
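
To convince myself that the one-pass Counter solution counts the right thing, a brute-force reference is handy on small inputs. This is only a checking aid (the name count_triplets_bruteforce is mine); it is O(n^3) and far too slow to submit.

```python
from itertools import combinations


def count_triplets_bruteforce(arr, r):
    """Count index triples i < j < k with arr[j] == arr[i]*r and arr[k] == arr[j]*r."""
    return sum(
        1
        for i, j, k in combinations(range(len(arr)), 3)
        if arr[j] == arr[i] * r and arr[k] == arr[j] * r
    )


print(count_triplets_bruteforce([1, 5, 5, 25, 125], 5))  # 4
```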

type：Sorting

Mark and Toys

prices = [1, 2, 3, 4]
k = 7

The budget is 7 units of currency. He can buy items that cost [1, 2, 3] for 6, or [3, 4] for 7 units. The maximum is 3 items.

Sample Input

7 50
1 12 5 111 200 1000 10

Sample Output

4

He can buy at most 4 toys. These toys have the following prices: [1, 12, 5, 10].

Solution

def maximumToys(prices, k):
    total = 0
    count = 0
    # buy the cheapest toys first
    for p in sorted(prices):
        if total + p > k:
            break
        total += p
        count += 1
    # reached both when the budget runs out and when every toy was affordable
    return count

Fraudulent Activity Notifications

Sample Input 1

n=5 (number of days) d=4 (trailing days)
1 2 3 4 4

Sample Output

0

Four days of trailing data are required, so the first day a notice might go out is day 5. The trailing expenditures are [1,2,3,4], with a median of 2.5. The client spends 4, which is less than 2 × 2.5 = 5, so no notification is sent.

Solution

import bisect as bs


def median(days, d):
    # days is kept sorted and always holds exactly d values
    half = d // 2
    if d % 2 == 0:
        return (days[half] + days[half - 1]) / 2
    return days[half]


def activityNotifications(expenditure, d):
    notifications = 0
    days = sorted(expenditure[:d])  # sorted window of the trailing d days
    for i in range(d, len(expenditure)):  # the final day is checked as well
        if expenditure[i] >= median(days, d) * 2:
            notifications += 1
        # slide the window: drop the oldest day, insert today in sorted position
        del days[bs.bisect_left(days, expenditure[i - d])]
        bs.insort(days, expenditure[i])
    return notifications

Greedy Algorithms

Minimum Absolute Difference in an Array

Given an array of integers, find the minimum absolute difference between any two elements in the array.

Sample Input

5
1 -3 71 68 17

Sample Output

3

Explanation
The minimum absolute difference is |71-68|=3

Solution

def minimumAbsoluteDifference(arr):
    arr = sorted(arr)
    minabs = abs(arr[0] - arr[1])
    for i in range(0, len(arr)-1):
        if abs(arr[i] - arr[i+1])<minabs:
            minabs = abs(arr[i] - arr[i+1])
    return minabs

2020-12-21

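4th AI Edge Competition (Segmentation): Report and Log [Hardware]

Machine Learning / Hardware

I took part in SIGNATE's 4th AI Edge Competition, so this post is a log that doubles as my report.

This competition was serious about hardware, not just machine learning.

Contents
1. The network
2. Optimizations in the C++ application code
3. The hardware platform

1. The network

1.1 Model and strategy

The model is U-Net, chosen because it is easy to customize. The libraries are Keras and TensorFlow, pinned to the versions below for the conversion step before quantization.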

・Keras==2.2.4
・tensorflow-gpu==1.13.1

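I chose U-Net because it is easy to customize the network, both when going from pretraining to fine-tuning and when tuning for accuracy.

The depth is 512. To keep processing fast I kept the model small, at 14,067,237 parameters.

The strategy was to beat the benchmark with this model.

The reasoning was that if this model could beat the benchmark, the tuning and the processing speed would set it well apart from the other entrants and give me an advantage.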

Model size compared with YOLOv3, for reference

Model	Parameters
YOLOv3	62,002,753
YOLOv3-tiny	8,861,918
This U-Net	14,067,237

Size at depth 512 vs. depth 1024

Depth	Parameters
512	14,067,237
1024	31,055,557
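
As a rough feel for what these counts mean in memory, the sketch below converts them to weight storage. It assumes the figures are parameter counts, 4 bytes per float32 weight before quantization, and 1 byte per weight after 8-bit quantization; activations and file-format overhead are ignored, and the helper name weight_mib is mine.

```python
# Parameter counts quoted in the tables above.
PARAMS = {
    "YOLOv3": 62_002_753,
    "YOLOv3-tiny": 8_861_918,
    "U-Net depth 512": 14_067_237,
    "U-Net depth 1024": 31_055_557,
}


def weight_mib(n_params: int, bytes_per_param: int) -> float:
    """Approximate weight storage in MiB for a given bytes-per-weight."""
    return n_params * bytes_per_param / (1024 ** 2)


for name, n in PARAMS.items():
    fp32 = weight_mib(n, 4)  # float32 weights (assumption)
    int8 = weight_mib(n, 1)  # 8-bit weights after quantization (assumption)
    print(f"{name:18s} {n:>12,d} params  ~{fp32:6.1f} MiB fp32  ~{int8:5.1f} MiB int8")
```

Under these assumptions the depth-512 U-Net needs a little under 54 MiB as float32, and the depth-1024 variant is about 2.2 times larger.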

Figure: network diagram of the U-Net used here

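I also considered taking the depth-1024 model (31,055,557 parameters) and slimming it down with pruning and similar techniques.

In addition, I expected the segmentation task and the preprocessing to put a lot of computation on the PS side of the hardware, so I used softmax as the final layer.

That let the hardware run softmax on the dedicated softmax IP and raised DPU utilization.

Approaches I did not adopt

The approach I rejected was to build a model of depth 1024 or more and then shrink it with pruning or distillation.

Pruning and distillation are strongly hardware-specific, so in terms of learning curve and difficulty they would have cost too much time and development effort (hard to fit in when self-taught).

Without that slimming, a depth-1024 model exceeds 30,000,000 parameters, which shows up directly in processing speed, so I did not use this strategy.

1.2 Network design points for avoiding compile-time errors

Some layer arrangements caused accuracy loss or errors just before or after quantization, so I built the network without them.

What I changed

1. Strictly keep the layer order "Conv2D => BatchNormalization (BN) => relu"

The bad arrangement is "relu => BN", which fails at compile time.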

x = Conv2D(filters = n_filters, kernel_size = (kernel_size, kernel_size),
            kernel_initializer = 'he_normal', padding = 'same')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)

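2. Use Dropout on the decoder side

Using BN on the decoder side (where Concatenate or Add layers are used) leads to accuracy loss after quantization.

3. Use Conv2DTranspose instead of Conv2D before softmax

Because I used softmax, quantization required the layer before softmax to be something other than Conv2D (Conv2DTranspose, SeparableConv2D, and so on).

1.3 Using softmax with the DPU in mind

So that the hardware could use the softmax IP, I used softmax as the final layer even in U-Net.

Using softmax brought the following advantages.

・Higher DPU utilization

・With multithreading, PS work, DPU runs, and softmax runs can be processed in parallel

・softmax gave better accuracy than sigmoid or relu

Conditions for using softmax in U-Net

With the Vitis DPU, trial and error in getting softmax to work in U-Net showed the following.

・As a compile-time constraint, the layer immediately before softmax must be something other than Conv2D (Conv2DTranspose, SeparableConv2D, etc.)

# Code near the final layers of the U-Net (model) at fine-tuning time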
x=model.get_layer(index=-5).output
x = Conv2DTranspose(nClasses, kernel_size=1, use_bias=False)(x)
x = (Activation("softmax"))(x)

・Unless Conv2DTranspose is given "use_bias=False", the output of the DNNDK library call "dpuGetOutputTensorScale()" changes, which can cause errors in the softmax output
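
Why the output scale matters can be seen with a small pure-Python model of the softmax step. The sketch assumes the softmax stage multiplies the DPU's int8 outputs by the tensor scale before exponentiating; the function softmax_from_int8, the raw scores, and the scale values are all made up for illustration and are not DNNDK API.

```python
import math


def softmax_from_int8(raw, scale):
    """Softmax over one pixel's int8 class scores after rescaling them to floats."""
    logits = [v * scale for v in raw]        # dequantize: int8 value times the tensor scale
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


raw_scores = [12, -3, 40, 7, 0]              # hypothetical int8 outputs for 5 classes
for scale in (0.25, 0.125):                  # hypothetical output-tensor scales
    probs = softmax_from_int8(raw_scores, scale)
    print(scale, [round(p, 3) for p in probs])
```

The same int8 tensor gives different probabilities for different scales, so a scale that silently changes between builds shows up directly in the softmax output.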

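1.4. Techniques used to improve accuracy

At depth 512, simply building the network was not enough to get past IoU = 60%, so I worked on improvements that did not depend on the network itself.
The main techniques are listed below.

Resizing to an image size that stays close to the original (HEIGHT, WIDTH) ratio, i.e. an aspect-ratio-preserving resize

=> Resizing to shape = (400, 680) kept the size ratio of the original images, and the OpenCV resize was set up to preserve the aspect ratio. Compared with a square resize such as (224, 224), this improved prediction accuracy for small objects (signal, pedestrian).
My guess is that less positional information is lost in the resize.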
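
How much a target size distorts the image can be checked with a few lines of arithmetic. The sketch below uses a made-up source size (1200 x 2000, not the competition's real image size) and two hypothetical helpers, fit_keep_aspect and aspect_error.

```python
def fit_keep_aspect(src_h, src_w, max_h, max_w):
    """Largest (h, w) that fits inside (max_h, max_w) with the source aspect ratio."""
    ratio = min(max_h / src_h, max_w / src_w)
    return max(1, round(src_h * ratio)), max(1, round(src_w * ratio))


def aspect_error(src_h, src_w, dst_h, dst_w):
    """Relative stretch between the two axes; 0.0 means the aspect ratio is unchanged."""
    return abs((dst_w / dst_h) / (src_w / src_h) - 1.0)


src_h, src_w = 1200, 2000                        # hypothetical source size
print(fit_keep_aspect(src_h, src_w, 400, 680))   # (400, 667)
print(aspect_error(src_h, src_w, 400, 680))      # small distortion
print(aspect_error(src_h, src_w, 224, 224))      # large distortion
```

For this example source, a (400, 680) target stretches the image by about 2%, while a square (224, 224) target squeezes the width by about 40%, which is the kind of distortion that hurts small objects.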

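Preprocessing that brightens dark images (mean pixel value below 80) with histogram equalization (CLAHE)

=> This slightly improved accuracy on fine details in dark images. Dark images have a skewed pixel distribution, so I treated
"low mean pixel value = dark image"
and brightened images whose mean pixel value is below 80 with histogram equalization.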

import cv2
import numpy as np


def clahe(bgr):
    # equalize only the lightness channel in LAB space so colours are preserved
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    lab_planes = list(cv2.split(lab))  # list, because newer OpenCV returns a tuple
    equalizer = cv2.createCLAHE(clipLimit=6.0, tileGridSize=(8, 8))
    lab_planes[0] = equalizer.apply(lab_planes[0])
    lab = cv2.merge(lab_planes)
    return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)


def NormalizeImageArr(path, H, W):
    NORM_FACTOR = 255
    img = cv2.imread(path, 1)
    # cv2.resize takes dsize as (width, height)
    img = cv2.resize(img, (W, H), interpolation=cv2.INTER_NEAREST)
    if img.mean() < 80:  # dark image: brighten with CLAHE
        img = clahe(img)
    img = img.astype(np.float32)
    img = img / NORM_FACTOR
    return img
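
To show the idea behind the brightness gate without OpenCV, here is a dependency-free sketch. Note that it uses plain global histogram equalization on a flat list of grey levels, not the tiled, clip-limited CLAHE used above; the function names and sample pixel values are made up.

```python
def equalize(gray):
    """Global histogram equalization for a flat list of 8-bit grey levels."""
    hist = [0] * 256
    for v in gray:
        hist[v] += 1
    # cumulative distribution of grey levels
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    span = len(gray) - cdf_min
    if span == 0:                       # flat image: nothing to stretch
        return list(gray)
    return [round((cdf[v] - cdf_min) / span * 255) for v in gray]


def brighten_if_dark(gray, threshold=80):
    """Equalize only when the mean grey level is below the threshold."""
    mean = sum(gray) / len(gray)
    return equalize(gray) if mean < threshold else list(gray)


dark = [10, 12, 12, 15, 20, 20, 25, 30]          # hypothetical dark patch
print(brighten_if_dark(dark))
print(brighten_if_dark([100, 150, 200, 250]))    # bright patch is returned unchanged
```

The dark patch gets spread over the full 0 to 255 range, while the bright patch passes through untouched because its mean is above the threshold.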

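Training with augmentation

Augmentations that do not have to shift object positions (noise, contrast changes, horizontal flip, and the like) were the most effective.
When the scene contains objects that never appear upside down, such as cars, vertical flip is counterproductive.

Also, rather than applying every augmentation at once, training on them a little at a time seemed to raise accuracy steadily.
I trained with augmentation in the following steps.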

Epochs	Dataset	IoU	Augmentation
100	Cityscapes	n/a	none
200	train: 2143 images (competition images), val: 100 images (competition images)	train = 83.8%, val = 74%	none
200	train: 2143 images (flipped images only), val: 100 images (flipped images only)	train = 76%, val = 68%	Horizontal flip (also applied to validation)
100	train: 4286 images, val: 100 images	train = 88.5%, val = 75.8%	Horizontal flip (not applied to val)
100	train: 4286 images, val: 100 images	train = 89.2%, val = 77.5%	Contrast (not applied to val)
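
One detail that matters for segmentation is that a geometric augmentation such as horizontal flip has to be applied to the image and its label mask together. A minimal sketch on nested lists (the tiny image, mask, and function names are made up, not my training pipeline):

```python
import random


def hflip(rows):
    """Mirror a 2-D grid (list of rows) left to right."""
    return [row[::-1] for row in rows]


def augment_pair(image, mask, p_flip=0.5, rng=random):
    """Flip image and label mask together so every pixel keeps its label."""
    if rng.random() < p_flip:
        return hflip(image), hflip(mask)
    return image, mask


image = [[10, 20, 30],
         [40, 50, 60]]           # hypothetical 2x3 grey image
mask = [[0, 0, 1],
        [0, 2, 1]]               # hypothetical class id per pixel

flipped_image, flipped_mask = augment_pair(image, mask, p_flip=1.0)
print(flipped_image)  # [[30, 20, 10], [60, 50, 40]]
print(flipped_mask)   # [[1, 0, 0], [1, 2, 0]]
```

Noise and contrast augmentations, by contrast, only touch the image, which is one reason they are easy to add later without regenerating masks.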

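Pretraining on the Cityscapes dataset

There was a precedent for this in the previous competition, so I copied it, and accuracy went up considerably.

I also tried residual structures and extra sub-layers that are supposed to help segmentation, but at depth 512 the model's representational power is limited and they had almost no effect. Layers such as SENet that use a Multiply operation also ran into constraints like compile-time errors, and it was painful not to be able to gain accuracy there.

What I learned from these PDCA cycles is that throwing techniques at the problem blindly, without some logic or hypothesis for why accuracy should improve, wastes most of the effort.

1.5. Final network results (IoU etc.)

In the end the model reached IoU = 61% (approximately) at a size of 14,067,237 parameters.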
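
For reference, IoU for a segmentation result can be computed per class from the predicted and ground-truth label maps and then averaged. The sketch below is a generic mean-IoU on flat label lists with made-up data; the competition's exact scoring rules may differ.

```python
def class_iou(pred, truth, cls):
    """IoU of one class between two equally sized flat label lists; None if the class is absent."""
    inter = sum(1 for p, t in zip(pred, truth) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, truth) if p == cls or t == cls)
    return inter / union if union else None


def mean_iou(pred, truth, classes):
    """Average IoU over the classes that appear in the prediction or the ground truth."""
    scores = [s for s in (class_iou(pred, truth, c) for c in classes) if s is not None]
    return sum(scores) / len(scores)


truth = [0, 0, 1, 1, 2, 2, 2, 0]     # hypothetical ground-truth labels
pred = [0, 1, 1, 1, 2, 2, 0, 0]      # hypothetical predictions
print(mean_iou(pred, truth, classes=[0, 1, 2]))
```

Here the per-class IoUs are 1/2, 2/3, and 2/3, so the mean comes out at about 0.61.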

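2. Optimizations in the C++ application code

To get the most out of processing speed and the hardware, I focused on two points in the C++ code.

2.1 Reducing the amount of computation

There was a lot of PS-side computation and I used three threads, so trimming and simplifying redundant code had a large effect on processing speed.
In particular, each rewrite of the following kinds gained about 30 ms.

・Making for loops more efficient

・Removing unnecessary functions and unnecessary calls to them

・Turning fixed values into constants

2.2. Three-way multithreading to get the most out of the hardware

PS-side work other than the DPU computation (preprocessing, the for loops in segmentation, and so on) took up a large share, so going to three threads made processing about 30 to 50 ms faster.

Below is the speed difference between two and three threads with the DPU parameters (B1152, "DSP48 USAGE=LOW", etc.).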

Number of threads	Average processing time per image (ms)
2	1061
3	1007

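2.3 Processing PS work, DPU runs (PL), and softmax runs (PL) in parallel with multithreading for a further speedup

Normally the point of multithreading here is to speed things up by running "PS and PL" in parallel, but because this project uses the DNNDK library, the PL side uses independent methods for the following two operations.

・DPU computation
・softmax computation

DPU method	dpuRunTask()
softmax method	dpuRunSoftmax()

For that reason I treated

・PS computation
・DPU computation
・softmax computation

as three separate stages to run in parallel across threads. Both the DPU call and the softmax call are guarded by a mutex via std::lock_guard<std::mutex> lock(mtx_) (mutual exclusion, not asynchronous execution),

so each thread takes its turn on the hardware while the other threads continue their PS-side work, which allowed a further speedup from multithreading.

Excerpt (key parts only) of the thread function (main_thread()) that processes PS, DPU (PL), and softmax (PL) work in parallel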

#include <thread> 
#include <opencv2/opencv.hpp>
#include <opencv2/core.hpp>
#include <dnndk/dnndk.h>
#include <mutex>  
std::mutex mtx_;
〜〜
〜〜

int main_thread(DPUKernel *kernelConv, int s_num, int e_num, int tid){
  assert(kernelConv);
  DPUTask *task = dpuCreateTask(kernelConv, DPU_MODE_NORMAL); 
  〜〜〜
  // Main Loop
  int cnt=0;
  for(cnt=s_num; cnt<=e_num; cnt+=BLOCK_SIZE){
      for(int i=0; i<BLOCK_SIZE;i++){
        if(cnt+i>e_num) break;
        Mat img;
        resize(input_image[i], img, for_resize, INTER_NEAREST);
        // pre-process dark images with histogram equalization (CLAHE)
        Mat clahe_img = img;
        if((int)mean(img)[0] < 80) {
           clahe_img = clahe_preprocess(img);	
        }
    
        float *softmax = new float[outWidth*outHeight*outChannel];
        // Set image into Conv Task with mean value
        set_input_image(task, outWidth, clahe_img);
        {
          std::lock_guard<std::mutex> lock(mtx_);
          dpuRunTask(task);
        }
        {
          std::lock_guard<std::mutex> lock(mtx_);
          //cout << "outScale : " << outScale << endl;
          int8_t *outAddr = (int8_t *)dpuGetOutputTensorAddress(task, CONV_OUTPUT_NODE);
          dpuRunSoftmax(outAddr, softmax, outChannel,outSize/outChannel, outScale);
        }

        // Post process
        PostProc(softmax, outHeight, outWidth, outChannel, image_file_name[i].c_str());
        delete[] softmax;
      }
  }
  dpuDestroyTask(task);
   return 0;
}
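
The locking pattern in the excerpt can be sketched in Python as well: preprocessing and post-processing run unlocked in each thread, and only the call that touches the shared hardware is wrapped in a lock. Everything here is a stand-in (fake_accelerator, the arithmetic, the item counts); it mirrors the structure, not the DNNDK API.

```python
import threading

hw_lock = threading.Lock()          # plays the role of mtx_ in the C++ excerpt
results = {}
results_lock = threading.Lock()


def fake_accelerator(x):
    """Hypothetical stand-in for the DPU run followed by the softmax run."""
    return x * 2


def worker(items):
    for item in items:
        pre = item + 1                       # PS-side preprocessing, runs unlocked
        with hw_lock:                        # one thread at a time on the shared hardware
            out = fake_accelerator(pre)
        post = out - 1                       # PS-side post-processing, runs unlocked
        with results_lock:
            results[item] = post


def run(n_items=12, n_threads=3):
    # Split the work across threads (the C++ excerpt passes each thread an
    # s_num..e_num range instead of an interleaved slice).
    chunks = [list(range(t, n_items, n_threads)) for t in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [results[i] for i in range(n_items)]


print(run())
```

Keeping the locked region as short as possible is what lets the other threads overlap their PS-side work with the hardware call.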

Figure: PS, DPU, and softmax work processed in parallel with three threads

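3. The hardware platform

3.1 Development environment

I followed a Qiita article on setting up the Vitis-AI development environment. I did not use the Vitis-AI-Runtime library, so development was based on the DNNDK library.

3.2 Building the DPU hardware platform

I mainly followed the Vitis-AI environment setup tutorial and the materials from the 2nd AI Edge Competition (hereafter, the reference materials), and built the platform by improving on the tutorial platform.

3.2.1 Using the softmax IP

To make as much use of the DPU as possible, I used the softmax IP.

I used Conv2DTranspose in the U-Net and chose the softmax operation with integration with the model in mind.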

Figure: platform including the softmax IP (unneeded IPs already removed)

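3.2.2 Building and improving the platform

Because development was DNNDK-based, I took hints from the reference materials and mainly modified the Vitis-AI platform tutorial.
The first goal was to get a DPU-equipped platform running under the conditions from the reference materials:

・A B1600 DPU linked with softmax

・A DPU frequency of 250 MHz

To get there, I first modified the tutorial platform by

1. Removing unneeded IPs

2. Removing unneeded clocks

and built the platform with WNS = 0.027 ns.

For this model, "DepthwiseConv" has to be enabled, so I changed the parameters and the frequency.

3.3 DPU parameters and frequency

Because the DPU parameter "DepthwiseConv" had to be enabled, the configuration became too large and the board rebooted at high frequencies, so I could not deploy the B1600 DPU at 250 MHz.

"DepthwiseConv" uses 3292 LUTs on B1600, so the DPU resources had to be cut considerably compared with the reference materials.

In particular, this model relies heavily on convolution layers, and
unless "Channel Augmentation" is enabled, processing speed drops considerably even with "DSP48 Usage" set to High, so I made

・"Channel Augmentation" enabled

・"DepthwiseConv" enabled

mandatory.
In the end, the board rebooted at 225 MHz and above with "DSP48 USAGE" set to HIGH, so I settled on 200 MHz and deployed the DPU with the parameters below.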

Frequency	200 MHz
DPU	B1600 (ReLU+ReLU6)
Channel Augmentation	Enable
DepthWiseConv	Enable
PoolAverage	Disable
DSP48 USAGE	HIGH
RAM USAGE	LOW
Softmax	Enable

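Resource limits prevented raising the frequency any further.

3.4 Speeding up with impl and synth strategies

With these DPU parameters, 200 MHz is not enough to bring out the DPU's full performance, so I tried several strategy combinations, using a Fixstars article as a guide, to see whether they could improve processing speed.

Apparently, according to the "高集積度FPGA設計ガイド" (high-density FPGA design guide), the more resources a design uses, the lower its density has to be kept, or those resources become hard to use.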


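The DPU used a lot of resources here, so I picked the following combination of strategies, which spread the logic out to lower the density, and it made processing about 35 ms faster.

Impl	Congestion_SpreadLogic_high
Synth	Flow_AreaOptimized_high
WNS	0.131 ns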
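
A positive WNS can be turned into a rough estimate of how much faster the clock could go on this particular implementation. The sketch assumes the reported WNS belongs to the 200 MHz DPU clock domain and uses the usual relation f_max = 1 / (T - WNS); it is only an estimate, since a re-run at a different frequency would place and route differently.

```python
def max_frequency_mhz(target_mhz: float, wns_ns: float) -> float:
    """Estimate the highest clock that would still meet timing on this exact implementation."""
    period_ns = 1000.0 / target_mhz          # requested clock period
    critical_path_ns = period_ns - wns_ns    # positive WNS means the path is shorter than the period
    return 1000.0 / critical_path_ns


print(round(max_frequency_mhz(200.0, 0.131), 1))  # 205.4
```

So a WNS of 0.131 ns at 200 MHz leaves only a few MHz of headroom, which fits with the design not closing at 225 MHz.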

I also tried the strategy that spreads logic across SSI, "impl : Congestion_SSI_SpreadLogic_low", but although SSI has lower power consumption, its density is higher, so the combination above gave better processing performance.

With this strategy combination, B1600, a 200 MHz clock, and so on, I deployed a DPU that met the constraints.

Utilization of the DPU and softmax, compared with the PS, was as follows.

PS & PL Total	93%
DPU	43%
Softmax	6%

Figure: power consumption report for this design

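References

・Vitis-AI platform site (Qiita)

・DPU-TRD

・Xilinx GitHub Vitis-AI-TUTORIAL

・Vivado の合成／インプリメンテーションストラテジを変えてみる（WNS・走行時間編） (Trying different Vivado synthesis/implementation strategies: WNS and run time)

・Materials from the 2nd AI Edge Competition

・Zynq DPU v3.2 guide

・高集積度 FPGA 設計手法ガイド (High-density FPGA design methodology guide)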
