Methods xxx (xxxx) xxx–xxx
information about the whole image. How to keep the contextual information of each small patch has been a recent research focus. The two methods proposed in recent works by Awan et al. and Nazeri et al. were also dedicated to retaining contextual information and achieved good performance. The method proposed by Awan et al. bound together four feature vectors extracted from spatially close patches and flattened them into one. However, this simple flattening approach has difficulty integrating spatially close features. Another method to preserve the contextual information of each small patch was proposed by Nazeri et al. In their method, a first patch-wise CNN acts as an autoencoder that extracts the most salient features of image patches, while a second image-wise CNN classifies the whole image. However, they retained only the information from the top, bottom, left and right of a patch, so remote context information was still not retained.
In response to the above disadvantages, we propose utilizing an RNN, incorporated directly on top of a CNN feature extractor, to fuse the contextual information of the features and make the final image-wise classification decision. In our method, the CNN captures patch features, while the RNN captures short-term and long-term dependencies between the patches to retain contextual information. LSTM is one of the most common variations of RNNs. The connections between RNN units form a directed loop, which creates the internal state of the network and gives it the ability to remember inputs at a distance. To further improve our model, we use the bidirectional long short-term memory network (BLSTM), which is an extension of LSTM. BLSTM combines the outputs of two LSTMs, one processing the input data from left to right and the other processing it from right to left. This structure provides the output layer with complete past and future contextual information for each position in the input sequence. This is consistent with a characteristic of our problem: the context of a patch in a pathological image has no preferred direction, whether up versus down or left versus right. In our proposed method, 12 feature vectors (12 × 1 × 5376) are extracted from one pathological image by the CNN. These 12 feature vectors are input into a 4-layer bidirectional LSTM. Finally, we add a fully connected layer after the last layer of the LSTM (as shown in Fig. 4). Because we are performing 4-class classification, this fully connected layer has 4 output nodes.
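The data flow described above (12 patch feature vectors, a 4-layer bidirectional LSTM, then a 4-node fully connected layer) can be illustrated with a minimal pure-Python sketch. It is not the authors' TensorFlow implementation. The weights are random and untrained, the feature dimension is shrunk from 5376 to 8 so that it runs quickly, and the hidden size of 4 is arbitrary. How the final states are fed to the fully connected layer is also an assumption, since the text does not specify it. All function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def make_lstm_params(input_size, hidden_size, rng):
    """Random (untrained) weights for one LSTM direction; gates i, f, g, o stacked row-wise."""
    scale = 1.0 / math.sqrt(hidden_size)
    rows, cols = 4 * hidden_size, input_size + hidden_size
    W = [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]
    b = [0.0] * rows
    return W, b

def lstm_pass(seq, params, hidden_size, reverse=False):
    """Run one LSTM over a list of vectors; return the hidden state at every position."""
    W, b = params
    H = hidden_size
    h, c = [0.0] * H, [0.0] * H
    order = range(len(seq) - 1, -1, -1) if reverse else range(len(seq))
    outputs = [None] * len(seq)
    for t in order:
        xh = seq[t] + h  # concatenate current input with previous hidden state
        z = [sum(w * v for w, v in zip(row, xh)) + bias for row, bias in zip(W, b)]
        i = [sigmoid(v) for v in z[0:H]]            # input gate
        f = [sigmoid(v) for v in z[H:2 * H]]        # forget gate
        g = [math.tanh(v) for v in z[2 * H:3 * H]]  # candidate cell state
        o = [sigmoid(v) for v in z[3 * H:4 * H]]    # output gate
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        outputs[t] = h
    return outputs

def blstm_layer(seq, fwd, bwd, hidden_size):
    """Concatenate a left-to-right pass and a right-to-left pass at each position."""
    left_to_right = lstm_pass(seq, fwd, hidden_size)
    right_to_left = lstm_pass(seq, bwd, hidden_size, reverse=True)
    return [a + b for a, b in zip(left_to_right, right_to_left)]

def build_model(feature_dim, hidden_size, num_layers, num_classes, seed=0):
    rng = random.Random(seed)
    layers, in_dim = [], feature_dim
    for _ in range(num_layers):
        layers.append((make_lstm_params(in_dim, hidden_size, rng),
                       make_lstm_params(in_dim, hidden_size, rng)))
        in_dim = 2 * hidden_size  # the next layer sees both directions
    fc = [[rng.uniform(-0.1, 0.1) for _ in range(2 * hidden_size)]
          for _ in range(num_classes)]
    return layers, fc

def classify(patch_features, model, hidden_size):
    layers, fc = model
    seq = patch_features
    for fwd, bwd in layers:
        seq = blstm_layer(seq, fwd, bwd, hidden_size)
    # Assumption: feed the final forward state (last position) and the final
    # backward state (first position) to the fully connected layer.
    summary = seq[-1][:hidden_size] + seq[0][hidden_size:]
    logits = [sum(w * v for w, v in zip(row, summary)) for row in fc]
    return softmax(logits)

# Toy input: 12 patches, each with an 8-dimensional stand-in feature vector.
rng = random.Random(1)
patches = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(12)]
model = build_model(feature_dim=8, hidden_size=4, num_layers=4, num_classes=4)
probs = classify(patches, model, hidden_size=4)
print(len(probs), sum(probs))
```

Because the backward pass starts at the last patch, the representation of the first patch already depends on the most distant one. This is the property the text relies on for retaining remote context.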
Fig. 4. Schematic overview of the bidirectional LSTM with four layers. Each hollow circle in the figure represents a neuron. At the bottom of the figure is the input layer, with twelve nodes. A fully connected layer with four nodes is attached at the end of the network to obtain a 4-class classification result. The arrows in the figure indicate the direction of data flow.
In this section, we present the performance of our proposed algorithm on our released dataset. All experiments in this paper were run on NVIDIA Tesla K40 GPUs using the TensorFlow framework. We mainly use accuracy and sensitivity to evaluate the performance of our method. For a given category, an image that belongs to that category is treated as positive; otherwise, it is treated as negative. The sensitivity and accuracy can be defined as follows:

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP, TN, FP and FN are the numbers of true-positive, true-negative, false-positive and false-negative predictions, respectively.
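As a small worked illustration, the sketch below computes both metrics for one category treated as positive against all others, using the standard definitions sensitivity = TP / (TP + FN) and accuracy = (TP + TN) / (TP + TN + FP + FN). The integer labels and the predictions are made up for the example and are not results from the paper. The function names are hypothetical.

```python
def one_vs_rest_counts(y_true, y_pred, category):
    """Count TP, TN, FP, FN when `category` is positive and every other class is negative."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == category and pred == category:
            tp += 1
        elif truth == category:
            fn += 1
        elif pred == category:
            fp += 1
        else:
            tn += 1
    return tp, tn, fp, fn

def sensitivity(tp, fn):
    # Fraction of truly positive images that were predicted positive.
    return tp / (tp + fn) if (tp + fn) else 0.0

def accuracy(tp, tn, fp, fn):
    # Fraction of all images whose positive/negative status was predicted correctly.
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

# Illustrative labels for a 4-class problem (classes 0..3); not real data.
y_true = [0, 0, 1, 2, 3, 3, 3, 1]
y_pred = [0, 1, 1, 2, 3, 3, 0, 1]

tp, tn, fp, fn = one_vs_rest_counts(y_true, y_pred, category=3)
sens = sensitivity(tp, fn)
acc = accuracy(tp, tn, fp, fn)
print(tp, tn, fp, fn, sens, acc)
```

Repeating the call for each of the four categories gives per-class sensitivities.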
5.1. t-SNE Visualization
T-distributed stochastic neighbor embedding (t-SNE) is a nonlinear dimensionality reduction method that is well suited to embedding high-dimensional data in a low-dimensional space for visualization. Specifically, it models each high-dimensional object by a two- or three-dimensional point such that, with high probability, similar objects are modeled by nearby points and dissimilar objects are modeled by distant points. The core idea behind t-SNE is to find a two-dimensional representation of the data that preserves the neighborhood relations between the data points as much as possible. Fig. 5a and b show the two-dimensional (2D) t-SNE representation of the 12 feature vectors extracted from the 12 patches of a breast cancer pathological image. Each data point in Fig. 5a and b represents the feature vector extracted from the corresponding patch in Fig. 5c and d.
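For reference, the standard t-SNE formulation can be written as follows. This is a summary of the general method, not something specific to this paper. The symbols are assumptions introduced here: $x_i$ is a high-dimensional feature vector, $y_i$ its 2D embedding, $N$ the number of points (12 patches in this setting), and $\sigma_i$ a per-point bandwidth set by the perplexity.

```latex
% Similarity of x_j to x_i in the high-dimensional space (Gaussian kernel)
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

% Similarity of y_i and y_j in the 2D embedding (Student-t kernel, one degree of freedom)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% The embedding is found by minimizing the Kullback-Leibler divergence
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

Minimizing $C$ places points with large $p_{ij}$ close together in the plane, which is why patches with similar features appear as neighboring points regardless of where they lie in the image.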
The data points of some spatially remote patches lie very close together in the final 2D representation because tumor features are diverse in pathology and irregularly distributed. Therefore, in routine pathological diagnosis, spatially remote patches with key features for the classification of breast tumors are also an important basis for the final classification decision. As shown in Fig. 5c, patches 2, 3, 9 and 10 all have key pathologic features of invasive carcinoma, such as tumor cells showing diffuse or sheet distribution, small adenoid structures, or cell nests with tubular structure. Jointly considering patch 2 with the spatially remote patches 9 or 10 can support the classification decision for invasive breast cancer as accurately as jointly considering patch 2 with the spatially close patch 3.