medical image analysis notes 1

Posted by Possum on Fri, 21 Jan 2022 10:59:32 +0100

AI for medical image analysis

1. Data exploration (course code)

In the first assignment of this course, you will use chest X-ray images taken from the ChestX-ray8 dataset.
In this notebook, you will have the opportunity to explore this dataset and familiarize yourself with some of the techniques you will use in your first graded assignment.

Before you start coding for any machine learning project, the first step is to explore your data. The standard Python package for analyzing and manipulating data is pandas.
In the next two code cells, you import pandas and numpy (a package for numeric operations), then use pandas to read the csv file into a data frame and print the first few rows.

# Import necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import os
import seaborn as sns
sns.set()

# Read csv file containing training data
train_df = pd.read_csv("nih/train-small.csv")
# Print the shape of the data frame, then show the first 5 rows
print(f'There are {train_df.shape[0]} rows and {train_df.shape[1]} columns in this data frame')
train_df.head()

Look at the columns in this csv file. The file contains the name of each chest X-ray image (the "Image" column), and the columns filled with 1s and 0s identify the diagnoses given for each X-ray image.

Data type and null check

# Look at the data type of each column and whether null values are present
train_df.info()
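
When a data frame has many columns, a per-column count of missing values can be easier to scan than the info() output. The following is a minimal sketch on a made-up toy frame (toy_df and its values are assumptions for illustration, not rows from train-small.csv):

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing label value (illustrative only).
toy_df = pd.DataFrame({
    "Image": ["a.png", "b.png", "c.png"],
    "Mass": [0, 1, np.nan],
    "PatientId": [1, 2, 2],
})

# isnull() marks missing cells as True; summing counts them per column.
null_counts = toy_df.isnull().sum()
print(null_counts)
```

The same call on train_df would return one count per column, with all zeros meaning no null values.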

Unique ID check

"PatientId" has an identification number for each patient.
One thing you want to know about such medical data sets is whether you are looking at duplicate data for some patients, or whether each image represents a different person.

print(f"The total patient ids are {train_df['PatientId'].count()}, from those the unique ids are {train_df['PatientId'].value_counts().shape[0]} ")

The total patient ids are 1000, from those the unique ids are 928

# The pandas value_counts() function returns how often each value occurs
count = train_df['PatientId'].value_counts()
count.shape
count

(928,)

As you can see, the number of unique patients in the dataset is less than the total, so there must be some overlap. For patients with multiple records, you need to ensure that they do not appear in both training and test sets to avoid data leakage (described later in this week's lecture).
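
One common way to avoid this kind of leakage is to split by patient rather than by image. The following is a minimal sketch, not the course's own splitting code: toy_df, split_by_patient, test_fraction and seed are all hypothetical names and values chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "Image": [f"img_{i}.png" for i in range(10)],
    "PatientId": [1, 1, 2, 3, 3, 3, 4, 5, 6, 6],
})

def split_by_patient(df, id_col="PatientId", test_fraction=0.3, seed=0):
    """Split rows so that every patient lands in exactly one subset."""
    rng = np.random.default_rng(seed)
    # Shuffle the unique patient ids, not the rows.
    patient_ids = rng.permutation(df[id_col].unique())
    n_test = max(1, int(round(len(patient_ids) * test_fraction)))
    # All rows of the selected patients go to the test subset.
    in_test = df[id_col].isin(patient_ids[:n_test])
    return df[~in_test], df[in_test]

train_part, test_part = split_by_patient(toy_df)

# Patients present in both subsets; an empty set means no patient overlap.
overlap = set(train_part["PatientId"]) & set(test_part["PatientId"])
print(f"Patients in both subsets: {len(overlap)}")
```

Because the split is made on patient ids, the subset sizes in rows only approximate test_fraction when patients have different numbers of images.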

Explore data labels

Run the next two code cells to create a list of names for each patient condition or disease.

# DataFrame.keys() returns the column names of the data frame, which include the different diseases
columns = train_df.keys()
columns = list(columns)
print(columns)

# Remove unnecessary elements
columns.remove('Image')
columns.remove('PatientId')
# Get the total classes
print(f"There are {len(columns)} columns of labels for these conditions: {columns}")

There are 14 columns of labels for these conditions: ['Atelectasis', 'Cardiomegaly', 'Consolidation', 'Edema', 'Effusion', 'Emphysema', 'Fibrosis', 'Hernia', 'Infiltration', 'Mass', 'Nodule', 'Pleural_Thickening', 'Pneumonia', 'Pneumothorax']

Run the next cell to print out the number of positive labels (1) for each condition.

# Print out the number of positive labels for each class
for column in columns:
    print(f"The class {column} has {train_df[column].sum()} samples")

Look at the label counts for each of the classes above.
Does this look like a balanced dataset?
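
Raw counts are easier to judge as fractions of the dataset. The sketch below computes the positive fraction per class on a made-up label table (toy_labels and its values are assumptions, not the real label counts):

```python
import pandas as pd

# Made-up label columns standing in for the 14 condition columns of train_df.
toy_labels = pd.DataFrame({
    "Mass":   [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "Edema":  [0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "Hernia": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
})

# The mean of a 0/1 column is the fraction of rows with a positive label.
positive_fraction = toy_labels.mean().sort_values(ascending=False)
for name, fraction in positive_fraction.items():
    print(f"{name}: {fraction:.1%} positive")
```

On the real data, the equivalent would be train_df[columns].mean(); classes whose fraction is far from 50% indicate class imbalance.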

Data visualization

Using the image names listed in the csv file, you can retrieve the image associated with each row of data in the data frame. Run the following cells to visualize randomly selected images from the dataset.

# Extract numpy values from Image column in data frame
images = train_df['Image'].values   

# Extract 9 random images from it
random_images = [np.random.choice(images) for i in range(9)]
# Signature: numpy.random.choice(a, size=None, replace=True, p=None)
# Randomly draws elements from a (if a is an ndarray, it must be one-dimensional) and returns an array of the given size
# replace: True means the same element can be drawn more than once, False means it cannot
# p: array of probabilities, one per element of a; by default, every element is equally likely

# Location of the image dir
img_dir = 'nih/images-small/'

print('Display Random Images')

# Adjust the size of your images
plt.figure(figsize=(20,10))

# Iterate and plot random images
for i in range(9):
    plt.subplot(3, 3, i + 1)
    img = plt.imread(os.path.join(img_dir, random_images[i]))
    # Note: plt.imread reads the image directly and returns a numpy array; see https://matplotlib.org/api/_as_gen/matplotlib.pyplot.imread.html
    plt.imshow(img, cmap='gray')
    plt.axis('off')
    
# Adjust subplot parameters to give specified padding
plt.tight_layout()
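
Because the list comprehension above calls np.random.choice once per image, each draw is independent and the same image can show up more than once in the grid. If nine distinct images are wanted, a single draw with replace=False is one option. This is a minimal sketch using hypothetical file names (image_names) and a seeded generator, neither of which comes from the notebook:

```python
import numpy as np

# Hypothetical file names standing in for train_df['Image'].values.
image_names = np.array([f"image_{i:03d}.png" for i in range(20)])

# Seeded only to make the sketch repeatable.
rng = np.random.default_rng(seed=42)

# One call draws all 9 names; replace=False guarantees no repeats.
distinct_images = rng.choice(image_names, size=9, replace=False)

print(len(distinct_images), len(set(distinct_images)))
```

Both printed numbers are 9, since no name can be drawn twice.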

Investigate a single image

Run the following cells to view the first image in the dataset and print out some details of the image content.

# Get the first image that was listed in the train_df dataframe
sample_img = train_df.Image[0]
raw_image = plt.imread(os.path.join(img_dir, sample_img))
plt.imshow(raw_image, cmap='gray')
plt.colorbar()
plt.title('Raw Chest X Ray Image')
print(f"The dimensions of the image are {raw_image.shape[1]} pixels width and {raw_image.shape[0]} pixels height, one single color channel")
print(f"The maximum pixel value is {raw_image.max():.4f} and the minimum is {raw_image.min():.4f}")
print(f"The mean value of the pixels is {raw_image.mean():.4f} and the standard deviation is {raw_image.std():.4f}")


Investigate pixel value distribution
Run the following cell to plot the distribution of pixel values in the image shown above.
For more on seaborn usage, see https://blog.csdn.net/qq_34264472/article/details/53814653

# Plot a histogram of the distribution of the pixels
sns.distplot(raw_image.ravel(), 
             label=f'Pixel Mean {np.mean(raw_image):.4f} & Standard Deviation {np.std(raw_image):.4f}', kde=False)
plt.legend(loc='upper center')
plt.title('Distribution of Pixel Intensities in the Image')
plt.xlabel('Pixel Intensity')
plt.ylabel('# Pixels in Image')

Image preprocessing in Keras

Before training, you will first modify your images to make them better suited for training a convolutional neural network. For this task, you will use the Keras ImageDataGenerator class to perform data preprocessing and data augmentation.
Run the next two cells to import this class and create an image generator for preprocessing.

# Import data generator from keras
from keras.preprocessing.image import ImageDataGenerator

# Normalize images
image_generator = ImageDataGenerator(
    samplewise_center=True, #Set each sample mean to 0.
    samplewise_std_normalization= True # Divide each input by its standard deviation
)

Standardization: the image_generator created above will adjust your image data so that the new mean of the data is 0 and the standard deviation of the data is 1. In other words, the generator replaces each pixel value in the image with a new value calculated by subtracting the mean and dividing by the standard deviation.
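
The subtract-the-mean, divide-by-the-standard-deviation step can be reproduced by hand with numpy. The sketch below uses random values in place of a real X-ray (fake_image is an assumption), and the small epsilon guarding against division by zero is a choice made for this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random 8-bit-style values standing in for one grayscale X-ray.
fake_image = rng.integers(0, 256, size=(64, 64)).astype(np.float32)

# Subtract the image's own mean, then divide by its own standard deviation.
# epsilon avoids division by zero for a constant image (sketch assumption).
epsilon = 1e-6
standardized = (fake_image - fake_image.mean()) / (fake_image.std() + epsilon)

print(f"mean {standardized.mean():.4f}, std {standardized.std():.4f}")
```

The statistics are computed per image, which is what "samplewise" refers to: each image is standardized with its own mean and standard deviation rather than with values computed over the whole dataset.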

Run the next cell to preprocess your data using image_generator.
In this step, you will also reduce the image size to 320x320 pixels.

# Flow from directory with specified batch size and target image size
generator = image_generator.flow_from_dataframe(
        dataframe=train_df,
        directory="nih/images-small/",
        x_col="Image", # features
        y_col= ['Mass'], # labels; the 'Mass' column must be in train_df
        class_mode="raw", # use the label values as they appear in the data frame
        batch_size= 1, # images per batch
        shuffle=False, # shuffle the rows or not
        target_size=(320,320) # (height, width) of output image
)

Run the next cell to plot an example of a preprocessed image:

# Plot a processed image
sns.set_style("white")
generated_image, label = generator.__getitem__(0)
plt.imshow(generated_image[0], cmap='gray')
plt.colorbar()
plt.title('Preprocessed Chest X Ray Image')
print(f"The dimensions of the image are {generated_image.shape[2]} pixels width and {generated_image.shape[1]} pixels height")
print(f"The maximum pixel value is {generated_image.max():.4f} and the minimum is {generated_image.min():.4f}")
print(f"The mean value of the pixels is {generated_image.mean():.4f} and the standard deviation is {generated_image.std():.4f}")

Run the following cells to see a comparison of the pixel value distribution in the new preprocessed image with the original image.

# Include a histogram of the distribution of the pixels
sns.set()
plt.figure(figsize=(10, 7))

# Plot histogram for original image
sns.distplot(raw_image.ravel(), 
             label=f'Original Image: mean {np.mean(raw_image):.4f} - Standard Deviation {np.std(raw_image):.4f} \n '
             f'Min pixel value {np.min(raw_image):.4} - Max pixel value {np.max(raw_image):.4}',
             color='blue', 
             kde=False)

# Plot histogram for generated image
sns.distplot(generated_image[0].ravel(), 
             label=f'Generated Image: mean {np.mean(generated_image[0]):.4f} - Standard Deviation {np.std(generated_image[0]):.4f} \n'
             f'Min pixel value {np.min(generated_image[0]):.4} - Max pixel value {np.max(generated_image[0]):.4}', 
             color='red', 
             kde=False)

# Place legends
plt.legend()
plt.title('Distribution of Pixel Intensities in the Image')
plt.xlabel('Pixel Intensity')
plt.ylabel('# Pixel')

Topics: Python Deep Learning
