machine_learningヘッダファイルパッケージで用いている計算式

6. 複数の入力変数セットを持つ問題への応用

(Formulas used in the machine_learning header file package; 6. An application to problems having multiple input variable sets)

前節までで以下の式が得られた。
The following formulas were obtained in the previous sections.

2節の(1)式
Eq. (1) of section 2:
\[\begin{equation} E=-\frac{1}{N}\sum_{n=0}^{N-1}\sum_{i=0}^{J^{(M+1)}-1} t_i^{(n)}\ln x_i^{(M+1,n)} \label{eq.E} \end{equation}\]
2節の(2)式
Eq. (2) of section 2:
\[\begin{eqnarray} y_j^{(m,n)} &=& \sum_{i=0}^{J^{(m)}-1}W_{j,i}^{(m)}x_i^{(m,n)}+W_{j,J^{(m)}}^{(m)} \nonumber \\ & & \left(m=0,\cdots,M; n=0,\cdots,N-1; j=0,\cdots,J^{(m+1)}-1\right) \label{eq.x2y} \end{eqnarray}\]
2節の(3)式
Eq. (3) of section 2:
\[\begin{eqnarray} x_j^{(m+1,n)} &=& f_j^{(m)}\left(y_0^{(m,n)},\cdots,y_{J^{(m+1)}-1}^{(m,n)}\right) \nonumber \\ & & \left(m=0,\cdots,M; n=0,\cdots,N-1; j=0,\cdots,J^{(m+1)}-1\right) \label{eq.y2x} \end{eqnarray}\]
4.5節の(5)式
Eq. (5) of section 4.5:
\[\begin{eqnarray} \PartialDiff{E}{W_{j,i}^{(m)}} &=& -\frac{1}{N}\sum_{n=0}^{N-1} \left[\left(1-\delta_{iJ^{(m)}}\right)x_{i}^{(m,n)} +\delta_{iJ^{(m)}}\right] \left[E_j^{(m,n)}-E_{J^{(M+1)}-1}^{(M,n)}\delta_{mM}\right] \nonumber \\ & & \left(m=0,\cdots,M; i=0,\cdots,J^{(m)}; j=0,\cdots,J^{(m+1)}-1; \right. \nonumber \\ & & \left. m\neq M \mbox{ or } j\neq J^{(m+1)}-1\right) \label{eq.dEdW.useEj} \end{eqnarray}\]
4.5節の(6)式
Eq. (6) of section 4.5:
\[\begin{eqnarray} E_j^{(M,n)} &=& \sum_{i'=0}^{J^{(M+1)}-1} \frac{t_{i'}^{(n)}}{x_{i'}^{(M+1,n)}} \PartialDiff{f_{i'}^{(M)} \left(y_0^{(M,n)},\cdots,y_{J^{(M+1)}-1}^{(M,n)}\right)} {y_{j}^{(M,n)}} \nonumber \\ & & \left(n=0,\cdots,N-1;j=0,\cdots,J^{(M+1)}-1\right) \label{eq.Ej.M1} \end{eqnarray}\]
4.5節の(7)式
Eq. (7) of section 4.5:
\[\begin{eqnarray} E_j^{(m,n)} &=& \sum_{j'=0}^{J^{(m+2)}-1}E_{j'}^{(m+1,n)} \sum_{i''=0}^{J^{(m+1)}-1} W_{j',i''}^{(m+1)} \PartialDiff{f_{i''}^{(m)} \left(y_0^{(m,n)},\cdots,y_{J^{(m+1)}-1}^{(m,n)}\right)} {y_{j}^{(m,n)}} \nonumber \\ & & \left(m=0,\cdots,M-1; n=0,\cdots,N-1;j=0,\cdots,J^{(m+1)}-1\right) \label{eq.Ej.recursive} \end{eqnarray}\]

通常の機械学習においては分類クラスが\(J^{(0)}\)個(固定)の入力変数の関数として与えられる。しかし\(J^{(0)}\)が教師データ毎に異なる場合や、入力変数全体ではなくそのうちの一部のみの関数として分類クラスを与えたい場合がある。
In ordinary machine learning, the grouping class of each data is given as a function of a fixed number \(J^{(0)}\) of input variables. In some cases, however, \(J^{(0)}\) varies among the teaching data, or the grouping class should be given as a function of only part of the input variables rather than all of them.

そのような例として地震の自動検知のための教師付き学習が挙げられる。地震の自動検知では連続地震波形を一定長さ毎のタイムウインドウに区切って特徴量を抽出し、次にそれらの特徴量と「本物の地震か否か」の判断とを結びつける教師付き学習を行う。「特徴量」が入力変数、「本物の地震か否か」の判断が分類クラスとなる。このとき問題となるのが「特徴量」の計算に用いたタイムウインドウよりも継続時間が長い地震(や微動)の処理である。タイムウインドウ毎に「本物の地震か否か」を判断する場合、複数個のタイムウインドウにまたがる地震の教師データは複数個の地震として重複して用いられることになる。それだけならまだ良いが、長い地震波形の中の「あまり地震らしくない部分」までもが「本物の地震である」という教師データとして扱われてしまい、誤検知の要因となる。これを避けるためには個々のタイムウインドウ毎に「本物の地震か否か」を判断するのではなく、地震の開始から終了までの全ウインドウでの特徴量と「本物の地震か否か」の判断とを結びつけるアプローチが有効である。このアプローチを取る場合、地震毎に入力変数(特徴量)の個数\(J^{(0)}\)が異なることになる。また上記の誤検知要因を軽減するため、全ウインドウを等しく扱うのではなく、地震の開始から終了までの全ウインドウのうち「最も地震らしい部分」を見て「本物の地震か否か」を判断するのが有効と考えられ、上記の「入力変数全体ではなくそのうちの一部のみの関数として分類クラスを与えたい場合」の例となる。
As an example, consider supervised learning for the automatic detection of earthquakes. In this problem, continuous seismograms are divided into time windows of constant length, and several characteristics are extracted from each window. Supervised learning then relates these characteristics to a judgement of “whether a true earthquake or not”. Here the characteristics are the input variables, and the judgement is the grouping class. The difficulty is how to treat an earthquake (or a tremor) whose duration is longer than the time window used to compute the characteristics. If the judgement of “whether a true earthquake or not” is made for each time window, the teaching data for a long-duration earthquake are treated as multiple earthquakes, so the data are duplicated. A more serious problem is that this can cause misdetection: the waveform of a long-duration earthquake may contain portions that do not look like typical earthquake waveforms, and such portions are nevertheless labelled as an earthquake in the teaching data. One way to avoid these problems is to make the judgement not for individual time windows but for each earthquake as a whole, based on the characteristics in all time windows spanning the earthquake from its onset to its end. In this approach, the number of input variables (characteristics) \(J^{(0)}\) differs from earthquake to earthquake. Moreover, to reduce the misdetection mentioned above, the time windows should not all be treated equally; rather, the judgement should be based on the most earthquake-like portion among the windows. This is therefore an example of the second case mentioned above, in which the grouping class should be given as a function of only part of the input variables.

このような問題を念頭に置いて以下のように定式化を行う。 \(J^{(0)}\)個の入力変数\(X_i^{(0)}\)の関数として分類クラスが与えられる問題を考える。そのための教師データが\(N\)個与えられており、そのうちの\(n\)番目の教師データの分類クラスを\(c^{(n)}\)、その1-of-K表現を\(t_j^{(n)}\) \((j=0,\cdots,J^{(M+1)}-1)\)とする(ここまでは普通の機械学習と同じ)。この\(n\)番目の教師データにおいて、 \(J^{(0)}\)個の入力変数のセットが\(K^{(n)}\)通り与えられており、 \(k\)番目のセットにおける\(i\)番目の入力変数の値を\(x_i^{(0,n,k)}\)とする。上記の地震検知の例で言えば \(K^{(n)}\)が\(n\)番目の地震の開始から終了までをカバーするタイムウインドウ数 (地震毎に異なるので引数\(^{(n)}\)が付く)、 \(k\)がタイムウインドウ番号、 \(J^{(0)}\)が各タイムウインドウにおける特徴量(入力変数)の個数である。入力変数と出力変数(各クラスに分類される理論確率)との関係は (\ref{eq.x2y})(\ref{eq.y2x})と同様に \[\begin{eqnarray} y_j^{(m,n,k)} &=& \sum_{i=0}^{J^{(m)}-1}W_{j,i}^{(m)}x_i^{(m,n,k)}+W_{j,J^{(m)}}^{(m)} \nonumber \\ & & \left(m=0,\cdots,M; n=0,\cdots,N-1; k=0,\cdots,K^{(n)}-1; \right. \nonumber \\ & & \left. j=0,\cdots,J^{(m+1)}-1\right) \label{eq.x2y.multi} \end{eqnarray}\] \[\begin{eqnarray} x_j^{(m+1,n,k)} &=& f_j^{(m)}\left(y_0^{(m,n,k)},\cdots,y_{J^{(m+1)}-1}^{(m,n,k)}\right) \nonumber \\ & & \left(m=0,\cdots,M; n=0,\cdots,N-1; k=0,\cdots,K^{(n)}-1; \right. \nonumber \\ & & \left. 
j=0,\cdots,J^{(m+1)}-1\right) \label{eq.y2x.multi} \end{eqnarray}\] で与えられる。ここで\(W_{j,i}^{(m)}\)の値や\(f_j^{(m)}\)の関数形は\(k\)に依存しない。 (\ref{eq.x2y.multi})(\ref{eq.y2x.multi})式を繰り返し用いることで各教師データ、各\(k\)に対する出力層での理論値\(x_j^{(M+1,n,k)}\)が求まる。この理論値は\(n\)番目の教師データの\(k\)番目の入力変数セットが各クラスに分類される理論確率を表している。地震検知の例で言えば\(n\)番目の教師データの\(k\)番目のタイムウインドウにおける特徴量(入力変数)が\(x_i^{(0,n,k)}\)、それらを用いて計算される「\(k\)番目のタイムウインドウが本物の地震である理論確率」が \(x_j^{(M+1,n,k)}\)である。 \(x_i^{(0,n,k)}\)と\(x_j^{(M+1,n,k)}\)を結ぶ関係式自体はタイムウインドウによらず共通である。それが(\ref{eq.x2y.multi})(\ref{eq.y2x.multi})式において \(W_{j,i}^{(m)}\)の値や\(f_j^{(m)}\)の関数形が\(k\)に依存しないということの意味するところである。こうして\(x_j^{(M+1,n,k)}\)が求まったら、次に各\(n\)に対する最適な\(k\)を1つ選び、その\(k\)に対する\(x_j^{(M+1,n,k)}\)を\(x_j^{(M+1,n)}\)とおく。最適な\(k\)の選び方としては、\(x_j^{(M+1,n,k)}\)の\(j\)に関する線型結合 \[\begin{equation} C_k=\sum_{j=0}^{J^{(M+1)}-1}a_jx_j^{(M+1,n,k)} \label{eq.Ck} \end{equation}\] を最大化する\(k\)を選ぶ。地震の自動検知の例で言えば複数のタイムウインドウのうち「最も地震らしいタイムウインドウ」を用いて地震か否かを判断するのであるから、「地震である」確率が最大(「地震でない」確率が最小)になるような\(k\)を選べば良く、例えば\(j=0\)を「地震でない」、\(j=1\)を「地震である」に割り当てるなら \(a_0=-1\), \(a_1=1\)とすれば良い。このようにして\(x_j^{(M+1,n)}\)が求まれば交差エントロピー誤差の計算には(\ref{eq.E})式がそのまま利用できる。
This problem can be formulated as follows. Consider a problem in which the grouping class of each data is given as a function of \(J^{(0)}\) input variables \(X_i^{(0)}\). A total of \(N\) teaching data are given; the grouping class of the \(n\)-th teaching data is \(c^{(n)}\) and its 1-of-K representation is \(t_j^{(n)}\) \((j=0,\cdots,J^{(M+1)}-1)\). Up to this point the setting is the same as in the ordinary machine learning problem treated earlier. Now suppose that the \(n\)-th teaching data comes with \(K^{(n)}\) sets of input variables, each set consisting of \(J^{(0)}\) variables, and let \(x_i^{(0,n,k)}\) denote the value of the \(i\)-th input variable in the \(k\)-th set. In the earthquake detection example above, \(K^{(n)}\) is the number of time windows covering the entire duration of the \(n\)-th earthquake (it differs from earthquake to earthquake, hence the superscript \(^{(n)}\)), \(k\) is the time window index, and \(J^{(0)}\) is the number of characteristics (input variables) in each time window. The relation between the input and output variables is given by eqs. (\ref{eq.x2y.multi}) and (\ref{eq.y2x.multi}), which are natural extensions of (\ref{eq.x2y}) and (\ref{eq.y2x}), respectively. Here the output variables are the theoretical probabilities of being classified into the individual classes. The values of \(W_{j,i}^{(m)}\) and the functional forms of \(f_j^{(m)}\) are independent of \(k\). The theoretical values in the output layer, \(x_j^{(M+1,n,k)}\), are computed for each teaching data and each \(k\) by repeated use of eqs. (\ref{eq.x2y.multi}) and (\ref{eq.y2x.multi}); they represent the theoretical probabilities that the \(k\)-th input variable set of the \(n\)-th teaching data is classified into the individual classes. 
In the earthquake detection example, \(x_i^{(0,n,k)}\) are the characteristics (input variables) in the \(k\)-th time window of the \(n\)-th teaching data, and \(x_j^{(M+1,n,k)}\) are the theoretical probabilities, calculated from these input variables, of the \(k\)-th time window being a true earthquake or not. The relation between \(x_i^{(0,n,k)}\) and \(x_j^{(M+1,n,k)}\) is common to all time windows; this is what is meant above by “the values of \(W_{j,i}^{(m)}\) and the functional forms of \(f_j^{(m)}\) are independent of \(k\)”. Once the \(x_j^{(M+1,n,k)}\) values are obtained in this way, an optimal \(k\) is selected for each \(n\), and \(x_j^{(M+1,n,k)}\) for that \(k\) is denoted \(x_j^{(M+1,n)}\). The optimal \(k\) is the one that maximizes the linear combination \(C_k\) of \(x_j^{(M+1,n,k)}\) over \(j\), defined in eq. (\ref{eq.Ck}). In the earthquake detection example, whether an event is an earthquake is judged from the time window that is most earthquake-like; thus the optimal \(k\) should maximize the probability of “it is an earthquake” (i.e., minimize the probability of “it is not an earthquake”). This is achieved by setting \(a_0=-1\) and \(a_1=1\) if \(j=0\) represents “it is not an earthquake” and \(j=1\) represents “it is an earthquake”. Once the \(x_j^{(M+1,n)}\) values are obtained in this way, the cross entropy error can be computed by eq. (\ref{eq.E}) without modification.
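The selection of \(k\) and the evaluation of eq. (\ref{eq.E}) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the package's own code: the function names `select_best_set` and `cross_entropy` are hypothetical, the output-layer probabilities \(x_j^{(M+1,n,k)}\) are assumed to be already computed and stored as nested lists `outputs[n][k][j]`, and ties in \(C_k\) are broken by taking the smallest \(k\).

```python
import math

def select_best_set(outputs_nk, a):
    """Return the index k that maximizes C_k = sum_j a[j] * outputs_nk[k][j]."""
    scores = [sum(a_j * x_j for a_j, x_j in zip(a, x_k)) for x_k in outputs_nk]
    return max(range(len(scores)), key=scores.__getitem__)

def cross_entropy(outputs, targets, a):
    """Cross entropy error using, for each n, only the set k selected by C_k.

    outputs[n][k][j]: output-layer probability x_j^(M+1,n,k) (assumed layout)
    targets[n][j]:    1-of-K representation t_j^(n)
    """
    total = 0.0
    for x_nk, t_n in zip(outputs, targets):
        x_n = x_nk[select_best_set(x_nk, a)]  # this plays the role of x_j^(M+1,n)
        # Terms with t = 0 contribute nothing and are skipped to avoid log(0).
        total -= sum(t * math.log(x) for t, x in zip(t_n, x_n) if t > 0)
    return total / len(outputs)

# Toy data: j = 0 is "not an earthquake", j = 1 is "an earthquake".
outputs = [
    [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]],  # n = 0, K = 3 time windows
    [[0.7, 0.3], [0.95, 0.05]],            # n = 1, K = 2 time windows
]
targets = [[0, 1], [1, 0]]
a = [-1.0, 1.0]

# Selects k = 1 for n = 0 and k = 0 for n = 1 (the most earthquake-like windows).
chosen = [select_best_set(x_nk, a) for x_nk in outputs]
E = cross_entropy(outputs, targets, a)
```

Note that the number of windows differs between the two events, which is exactly the situation the formulation is meant to handle.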

交差エントロピー誤差の微分の計算は次のようにすれば良い。この問題設定で交差エントロピー誤差の計算に用いられる式 (\ref{eq.E})(\ref{eq.x2y.multi})(\ref{eq.y2x.multi})(\ref{eq.Ck}) を普通の機械学習で用いられる式 (\ref{eq.E})(\ref{eq.x2y})(\ref{eq.y2x}) と比較すると、各\(m\), \(n\), \(i\)に対する\(x_i^{(m,n)}\)を最適な\(k\)(\(C_k\)の最大値を与える\(k\))に対する\(x_i^{(m,n,k)}\)で置き換えた式になっていることが分かる。したがって、普通の機械学習において交差エントロピー誤差の微分の計算に用いられる式 (\ref{eq.dEdW.useEj})(\ref{eq.Ej.M1})(\ref{eq.Ej.recursive}) において\(x_i^{(m,n)}\)を最適な\(k\)に対する\(x_i^{(m,n,k)}\)で置き換えれば交差エントロピー誤差の微分を計算できることが分かる。但しモデルパラメータが変われば最適な\(k\)も変化するので、パラメータを動かすたびに毎回(\ref{eq.Ck})を計算して最適な\(k\)を決め直す必要がある。
The derivatives of the cross entropy error can be calculated as follows. Comparing the formulas used to compute the cross entropy error in this problem (eqs. \ref{eq.E}, \ref{eq.x2y.multi}, \ref{eq.y2x.multi}, and \ref{eq.Ck}) with those used in ordinary machine learning (eqs. \ref{eq.E}, \ref{eq.x2y}, and \ref{eq.y2x}) shows that the former are obtained from the latter by replacing \(x_i^{(m,n)}\) for each \(m\), \(n\), and \(i\) with \(x_i^{(m,n,k)}\) for the optimal \(k\), i.e., the \(k\) that maximizes \(C_k\). Thus, the derivatives of the cross entropy error can be calculated by making the same replacement in eqs. (\ref{eq.dEdW.useEj}), (\ref{eq.Ej.M1}), and (\ref{eq.Ej.recursive}), which are used in ordinary machine learning. Note that the optimal \(k\) depends on the model parameters, so it must be re-determined with eq. (\ref{eq.Ck}) every time the parameters are updated.
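A minimal training loop illustrating the re-selection of \(k\) at every update might look like the following. It rests on several assumptions that are not taken from the package: a single layer (\(M=0\)), a softmax activation for \(f_j^{(0)}\), plain gradient descent with a fixed step size, and hypothetical function names. For softmax, the sum in eq. (\ref{eq.Ej.M1}) reduces to \(t_j^{(n)}-x_j^{(1,n)}\), so the gradient is the mean of \((x_j^{(1,n)}-t_j^{(n)})\,x_i^{(0,n,k)}\) over \(n\), evaluated at the selected \(k\). Unlike eq. (\ref{eq.dEdW.useEj}), which excludes the last output row, this sketch updates every row of \(W\).

```python
import math

def softmax(y):
    """Numerically stable softmax over a list of floats."""
    top = max(y)
    e = [math.exp(v - top) for v in y]
    total = sum(e)
    return [v / total for v in e]

def forward(W, x):
    """Single layer (M = 0): y_j = sum_i W[j][i] * x[i] + W[j][-1], then softmax."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) + row[-1] for row in W])

def best_set(W, x_nk, a):
    """Return (k, probabilities) for the set k that maximizes C_k under the current W."""
    probs = [forward(W, x_k) for x_k in x_nk]
    k = max(range(len(probs)),
            key=lambda kk: sum(aj * p for aj, p in zip(a, probs[kk])))
    return k, probs[k]

def train(inputs, targets, a, n_iter=200, lr=0.5):
    """Gradient descent; inputs[n][k][i] = x_i^(0,n,k), targets[n][j] = t_j^(n)."""
    n_in, n_out = len(inputs[0][0]), len(targets[0])
    W = [[0.0] * (n_in + 1) for _ in range(n_out)]
    for _ in range(n_iter):
        grad = [[0.0] * (n_in + 1) for _ in range(n_out)]
        for x_nk, t_n in zip(inputs, targets):
            k, p = best_set(W, x_nk, a)   # optimal k is re-determined at every update
            x = list(x_nk[k]) + [1.0]     # append 1 for the bias column W[j][-1]
            for j in range(n_out):
                for i in range(n_in + 1):
                    grad[j][i] += (p[j] - t_n[j]) * x[i] / len(inputs)
        for j in range(n_out):
            for i in range(n_in + 1):
                W[j][i] -= lr * grad[j][i]
    return W

# Toy data: one feature per time window; j = 0 "not an earthquake", j = 1 "an earthquake".
inputs = [
    [[0.1], [2.0], [0.3]],   # event 0: earthquake, 3 windows
    [[0.2], [0.1]],          # event 1: noise, 2 windows
    [[1.8], [0.0]],          # event 2: earthquake, 2 windows
    [[0.3], [0.2], [0.1]],   # event 3: noise, 3 windows
]
targets = [[0, 1], [1, 0], [0, 1], [1, 0]]
a = [-1.0, 1.0]

W = train(inputs, targets, a)
predictions = [best_set(W, x_nk, a) for x_nk in inputs]
```

Because the selected \(k\) is treated as fixed within each gradient evaluation, the loop is ordinary gradient descent applied to whichever input sets are currently optimal; on this toy data the selected window of each event is the one with the largest feature value.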