我正在尝试使用DICOM文件检测肺癌结节。癌症检测的主要步骤包括以下步骤。
1) Preprocessing
* Converting the pixel values to Hounsfield Units (HU)
* Resampling to an isomorphic resolution to remove variance in scanner resolution
*Lung segmentation
2) Training the data set using preprocessed images in Tensorflow CNN
3) Testing and validation
我遵循了几个在线教程来做到这一点。
我需要在
中结合给定的解决方案 1)https://www.kaggle.com/gzuidhof/full-preprocessing-tutorial
2)https://www.kaggle.com/sentdex/first-pass-through-data-w-3d-convnet。
我可以在第二个链接中实现这个例子。但由于缺乏良好的肺部分割和其他一些预处理步骤,我需要将链接一中的步骤与链接二结合起来。但是我在做这件事时遇到了很多错误。由于我是python的新手,有人可以帮我解决它。
有20个患者文件夹,每个患者文件夹都有多个切片,这些是dicom文件。
对于process_data方法,发送每个患者的slices_path和患者编号。
def process_data(slices,patient,labels_df,img_px_size,hm_slices):
try:
label=labels_df.get_value(patient,'cancer')
patient_pixels = get_pixels_hu(slices)
segmented_lungs2, spacing = resample(patient_pixels, slices, [1,1,1])
new_slices=[]
segmented_lung = segment_lung_mask(segmented_lungs2, False)
segmented_lungs_fill = segment_lung_mask(segmented_lungs2, True)
segmented_lungs=segmented_lungs_fill-segmented_lung
#This method returns smallest integer not less than x.
chunk_sizes =math.ceil(len(segmented_lungs)/HM_SLICES)
for slice_chunk in chunks(segmented_lungs,chunk_sizes):
slice_chunk=list(map(mean,zip(*slice_chunk))) #list - []
#print (slice_chunk)
new_slices.append(slice_chunk)
print(len(segmented_lungs), len(new_slices))
if len(new_slices)==HM_SLICES-1:
new_slices.append(new_slices[-1])
if len(new_slices)==HM_SLICES-2:
new_slices.append(new_slices[-1])
new_slices.append(new_slices[-1])
if len(new_slices)==HM_SLICES+2:
new_val =list(map(mean, zip(*[new_slices[HM_SLICES-1],new_slices[HM_SLICES],])))
del new_slices[HM_SLICES]
new_slices[HM_SLICES-1]=new_val
if len(new_slices)==HM_SLICES+1:
new_val =list(map(mean, zip(*[new_slices[HM_SLICES-1],new_slices[HM_SLICES],])))
del new_slices[HM_SLICES]
new_slices[HM_SLICES-1]=new_val
print('LENGTH ',len(segmented_lungs), len(new_slices))
except Exception as e:
# again, some patients are not labeled, but JIC we still want the error if something
# else is wrong with our code
print(str(e))
#print(len(new_slices))
if label==1: label=np.array([0,1])
elif label==0: label=np.array([1,0])
return np.array(new_slices),label
主要方法
# Some constants
#data_dir = '../../CT_SCAN_IMAGE_SET/IMAGES/'
#patients = os.listdir(data_dir)
#labels_df=pd.read_csv('../../CT_SCAN_IMAGE_SET/stage1_labels.csv',index_col=0)
#patients.sort()
#print (labels_df.head())
much_data=[]
much_data2=[]
for num,patient in enumerate(patients):
if num%100==0:
print (num)
try:
slices = load_scan(data_dir + patients[num])
img_data,label=process_data(slices,patients[num],labels_df,IMG_PX_SIZE,HM_SLICES)
much_data.append([img_data,label])
#much_data2.append([processed,label])
except:
print ('This is unlabeled data')
np.save('muchdata-{}-{}-{}.npy'.format(IMG_PX_SIZE,IMG_PX_SIZE,HM_SLICES),much_data)
#np.save('muchdata-{}-{}-{}.npy'.format(IMG_PX_SIZE,IMG_PX_SIZE,HM_SLICES),much_data2)
预处理部分工作正常,但是当我试图输入最终输出到卷积NN并训练数据集时,以下是我收到的错误,包括我提出的一些评论
0
shape hu
(113, 512, 512)
Resize factor
[ 2.49557522 0.6015625 0.6015625 ]
shape
(282, 308, 308)
chunk size
15
282 19
LENGTH 282 20
Tensor("Placeholder:0", dtype=float32)
..........1.........
..........2.........
..........3.........
..........4.........
WARNING:tensorflow:From C:\Research\Python_installation\lib\site-packages\tensorflow\python\util\tf_should_use.py:170: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.global_variables_initializer` instead.
..........5.........
..........6.........
Epoch 1 completed out of 20 loss: 0
..........7.........
Traceback (most recent call last):
File "C:\Research\LungCancerDetaction\sendbox2.py", line 436, in <module>
train_neural_network(x)
File "C:\Research\LungCancerDetaction\sendbox2.py", line 424, in train_neural_network
print('Accuracy:',accuracy.eval({x:[i[0] for i in validation_data], y:[i[1] for i in validation_data]}))
File "C:\Research\Python_installation\lib\site-packages\tensorflow\python\framework\ops.py", line 606, in eval
return _eval_using_default_session(self, feed_dict, self.graph, session)
File "C:\Research\Python_installation\lib\site-packages\tensorflow\python\framework\ops.py", line 3928, in _eval_using_default_session
return session.run(tensors, feed_dict)
File "C:\Research\Python_installation\lib\site-packages\tensorflow\python\client\session.py", line 789, in run
run_metadata_ptr)
File "C:\Research\Python_installation\lib\site-packages\tensorflow\python\client\session.py", line 968, in _run
np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
File "C:\Research\Python_installation\lib\site-packages\numpy\core\numeric.py", line 531, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: could not broadcast input array from shape (20,310,310) into shape (20)
我认为这是'segmented_lungs = segmented_lungs_fill-segmented_lung'的问题
在工作示例中,
segmented_lungs=[cv2.resize(each_slice,(IMG_PX_SIZE,IMG_PX_SIZE)) for each_slice in patient_pixels]
请帮我解决这个问题。一段时间以后我无法继续。如果有什么不清楚请告诉我。
以下是尝试过的整个代码。
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import dicom
import os
import scipy.ndimage
import matplotlib.pyplot as plt
import cv2
import math
import tensorflow as tf
from skimage import measure, morphology
from mpl_toolkits.mplot3d.art3d import Poly3DCollection
# Some constants
data_dir = '../../CT_SCAN_IMAGE_SET/IMAGES/'
patients = os.listdir(data_dir)
labels_df=pd.read_csv('../../CT_SCAN_IMAGE_SET/stage1_labels.csv',index_col=0)
patients.sort()
print (labels_df.head())
#Image pixel array watching
for patient in patients[:10]:
#label is to get the label of the patient. This is what done in the .get_value method.
label=labels_df.get_value(patient,'cancer')
path=data_dir+patient
slices = [dicom.read_file(path + '/' + s) for s in os.listdir(path)]
#You have dicom files and they have attributes.
slices.sort(key = lambda x: float(x.ImagePositionPatient[2]))
print (len(slices),slices[0].pixel_array.shape)
#If u need to see many slices and resize the large pixelated 2D images into 150*150 pixelated images
IMG_PX_SIZE=50
HM_SLICES=20
for patient in patients[:1]:
#label is to get the label of the patient. This is what done in the .get_value method.
label=labels_df.get_value(patient,'cancer')
path=data_dir+patient
slices = [dicom.read_file(path + '/' + s) for s in os.listdir(path)]
#You have dicom files and they have attributes.
slices.sort(key = lambda x: float(x.ImagePositionPatient[2]))
#This shows the pixel arrayed image related to the second slice of each patient
#subplot
fig=plt.figure()
for num,each_slice in enumerate(slices[:16]):
print (num)
y=fig.add_subplot(4,4,num+1)
#down sizing everything. Resize the imag size as their pixel values are 512*512
new_image=cv2.resize(np.array(each_slice.pixel_array),(IMG_PX_SIZE,IMG_PX_SIZE))
y.imshow(new_image)
plt.show()
print (len(patients))
###################################################################################
def get_pixels_hu(slices):
image = np.array([s.pixel_array for s in slices])
# Convert to int16 (from sometimes int16),
# should be possible as values should always be low enough (<32k)
image = image.astype(np.int16)
# Set outside-of-scan pixels to 0
# The intercept is usually -1024, so air is approximately 0
image[image == -2000] = 0
# Convert to Hounsfield units (HU)
for slice_number in range(len(slices)):
intercept = slices[slice_number].RescaleIntercept
slope = slices[slice_number].RescaleSlope
if slope != 1:
image[slice_number] = slope * image[slice_number].astype(np.float64)
image[slice_number] = image[slice_number].astype(np.int16)
image[slice_number] += np.int16(intercept)
return np.array(image, dtype=np.int16)
#The next problem is each patient is got different number of slices . This is a performance issue.
# Take the slices and put that into a list of slices and chunk that list of slices into fixed numer of
#chunk of slices and averaging those chunks.
#yield is like 'return'. It returns a generator
def chunks(l,n):
for i in range(0,len(l),n):
#print ('Inside yield')
#print (i)
yield l[i:i+n]
def mean(l):
return sum(l)/len(l)
def largest_label_volume(im, bg=-1):
vals, counts = np.unique(im, return_counts=True)
counts = counts[vals != bg]
vals = vals[vals != bg]
if len(counts) > 0:
return vals[np.argmax(counts)]
else:
return None
def segment_lung_mask(image, fill_lung_structures=True):
# not actually binary, but 1 and 2.
# 0 is treated as background, which we do not want
binary_image = np.array(image > -320, dtype=np.int8)+1
labels = measure.label(binary_image)
# Pick the pixel in the very corner to determine which label is air.
# Improvement: Pick multiple background labels from around the patient
# More resistant to "trays" on which the patient lays cutting the air
# around the person in half
background_label = labels[0,0,0]
#Fill the air around the person
binary_image[background_label == labels] = 2
# Method of filling the lung structures (that is superior to something like
# morphological closing)
if fill_lung_structures:
# For every slice we determine the largest solid structure
for i, axial_slice in enumerate(binary_image):
axial_slice = axial_slice - 1
labeling = measure.label(axial_slice)
l_max = largest_label_volume(labeling, bg=0)
if l_max is not None: #This slice contains some lung
binary_image[i][labeling != l_max] = 1
binary_image -= 1 #Make the image actual binary
binary_image = 1-binary_image # Invert it, lungs are now 1
# Remove other air pockets insided body
labels = measure.label(binary_image, background=0)
l_max = largest_label_volume(labels, bg=0)
if l_max is not None: # There are air pockets
binary_image[labels != l_max] = 0
return binary_image
#Loading the files
#Load the scans in given folder path
def load_scan(path):
slices = [dicom.read_file(path + '/' + s) for s in os.listdir(path)]
slices.sort(key = lambda x: float(x.ImagePositionPatient[2]))
try:
slice_thickness = np.abs(slices[0].ImagePositionPatient[2] - slices[1].ImagePositionPatient[2])
except:
slice_thickness = np.abs(slices[0].SliceLocation - slices[1].SliceLocation)
for s in slices:
s.SliceThickness = slice_thickness
return slices
def resample(image, scan, new_spacing=[1,1,1]):
# Determine current pixel spacing
spacing = np.array([scan[0].SliceThickness] + scan[0].PixelSpacing, dtype=np.float32)
resize_factor = spacing / new_spacing
new_real_shape = image.shape * resize_factor
new_shape = np.round(new_real_shape)
real_resize_factor = new_shape / image.shape
new_spacing = spacing / real_resize_factor
print ('Resize factor')
print (real_resize_factor)
image = scipy.ndimage.interpolation.zoom(image, real_resize_factor, mode='nearest')
print ('shape')
print (image.shape)
return image, new_spacing
'''def chunks(l,n):
for i in range(0,len(l),n):
#print ('Inside yield')
#print (i)
yield l[i:i+n]
def mean(l):
return sum(l)/len(l)'''
#processing data
def process_data(slices,patient,labels_df,img_px_size,hm_slices):
#for patient in patients[:10]:
#label is to get the label of the patient. This is what done in the .get_value method.
try:
label=labels_df.get_value(patient,'cancer')
print ('label process data')
print (label)
#path=data_dir+patient
#slices = [dicom.read_file(path + '/' + s) for s in os.listdir(path)]
#You have dicom files and they have attributes.
slices.sort(key = lambda x: float(x.ImagePositionPatient[2]))
#This shows the pixel arrayed image related to the second slice of each patient
patient_pixels = get_pixels_hu(slices)
print ('shape hu')
print (patient_pixels.shape)
segmented_lungs2, spacing = resample(patient_pixels, slices, [1,1,1])
#print ('Pix shape')
#print (segmented_lungs2.shape)
#segmented_lungs=np.array(segmented_lungs2).tolist()
new_slices=[]
segmented_lung = segment_lung_mask(segmented_lungs2, False)
segmented_lungs_fill = segment_lung_mask(segmented_lungs2, True)
segmented_lungs=segmented_lungs_fill-segmented_lung
#print ('length of segmented lungs')
#print (len(segmented_lungs))
#print ('Shape of segmented lungs......................................')
#print (segmented_lungs.shape)
#print ('hiiii')
#segmented_lungs=[cv2.resize(each_slice,(IMG_PX_SIZE,IMG_PX_SIZE)) for each_slice in segmented_lungs3]
#print ('bye')
#print ('length of slices')
#print (len(slices))
#print ('shape of slices')
#print (slices.shape)
#print (each_slice.pixel_array)
#This method returns smallest integer not less than x.
chunk_sizes =math.ceil(len(segmented_lungs)/HM_SLICES)
print ('chunk size ')
print (chunk_sizes)
for slice_chunk in chunks(segmented_lungs,chunk_sizes):
slice_chunk=list(map(mean,zip(*slice_chunk))) #list - []
#print (slice_chunk)
new_slices.append(slice_chunk)
print(len(segmented_lungs), len(new_slices))
if len(new_slices)==HM_SLICES-1:
new_slices.append(new_slices[-1])
if len(new_slices)==HM_SLICES-2:
new_slices.append(new_slices[-1])
new_slices.append(new_slices[-1])
if len(new_slices)==HM_SLICES-3:
new_slices.append(new_slices[-1])
new_slices.append(new_slices[-1])
new_slices.append(new_slices[-1])
if len(new_slices)==HM_SLICES+2:
new_val =list(map(mean, zip(*[new_slices[HM_SLICES-1],new_slices[HM_SLICES],])))
del new_slices[HM_SLICES]
new_slices[HM_SLICES-1]=new_val
if len(new_slices)==HM_SLICES+1:
new_val =list(map(mean, zip(*[new_slices[HM_SLICES-1],new_slices[HM_SLICES],])))
del new_slices[HM_SLICES]
new_slices[HM_SLICES-1]=new_val
if len(new_slices)==HM_SLICES+3:
new_val =list(map(mean, zip(*[new_slices[HM_SLICES-1],new_slices[HM_SLICES],])))
del new_slices[HM_SLICES]
new_slices[HM_SLICES-1]=new_val
print('LENGTH ',len(segmented_lungs), len(new_slices))
except Exception as e:
# again, some patients are not labeled, but JIC we still want the error if something
# else is wrong with our code
print(str(e))
#print(len(new_slices))
if label==1: label=np.array([0,1])
elif label==0: label=np.array([1,0])
return np.array(new_slices),label
# Some constants
#data_dir = '../../CT_SCAN_IMAGE_SET/IMAGES/'
#patients = os.listdir(data_dir)
#labels_df=pd.read_csv('../../CT_SCAN_IMAGE_SET/stage1_labels.csv',index_col=0)
#patients.sort()
#print (labels_df.head())
much_data=[]
much_data2=[]
for num,patient in enumerate(patients):
if num%100==0:
print (num)
try:
slices = load_scan(data_dir + patients[num])
img_data,label=process_data(slices,patients[num],labels_df,IMG_PX_SIZE,HM_SLICES)
much_data.append([img_data,label])
#much_data2.append([processed,label])
except:
print ('This is unlabeled data')
np.save('muchdata-{}-{}-{}.npy'.format(IMG_PX_SIZE,IMG_PX_SIZE,HM_SLICES),much_data)
#np.save('muchdata-{}-{}-{}.npy'.format(IMG_PX_SIZE,IMG_PX_SIZE,HM_SLICES),much_data2)
IMG_SIZE_PX = 50
SLICE_COUNT = 20
n_classes=2
batch_size=10
x = tf.placeholder('float')
y = tf.placeholder('float')
keep_rate = 0.8
def conv3d(x, W):
return tf.nn.conv3d(x, W, strides=[1,1,1,1,1], padding='SAME')
def maxpool3d(x):
# size of window movement of window as you slide about
return tf.nn.max_pool3d(x, ksize=[1,2,2,2,1], strides=[1,2,2,2,1], padding='SAME')
def convolutional_neural_network(x):
# # 5 x 5 x 5 patches, 1 channel, 32 features to compute.
weights = {'W_conv1':tf.Variable(tf.random_normal([3,3,3,1,32])),
# 5 x 5 x 5 patches, 32 channels, 64 features to compute.
'W_conv2':tf.Variable(tf.random_normal([3,3,3,32,64])),
# 64 features
'W_fc':tf.Variable(tf.random_normal([54080,1024])),
'out':tf.Variable(tf.random_normal([1024, n_classes]))}
biases = {'b_conv1':tf.Variable(tf.random_normal([32])),
'b_conv2':tf.Variable(tf.random_normal([64])),
'b_fc':tf.Variable(tf.random_normal([1024])),
'out':tf.Variable(tf.random_normal([n_classes]))}
# image X image Y image Z
x = tf.reshape(x, shape=[-1, IMG_SIZE_PX, IMG_SIZE_PX, SLICE_COUNT, 1])
conv1 = tf.nn.relu(conv3d(x, weights['W_conv1']) + biases['b_conv1'])
conv1 = maxpool3d(conv1)
conv2 = tf.nn.relu(conv3d(conv1, weights['W_conv2']) + biases['b_conv2'])
conv2 = maxpool3d(conv2)
fc = tf.reshape(conv2,[-1, 54080])
fc = tf.nn.relu(tf.matmul(fc, weights['W_fc'])+biases['b_fc'])
fc = tf.nn.dropout(fc, keep_rate)
output = tf.matmul(fc, weights['out'])+biases['out']
return output
much_data = np.load('muchdata-50-50-20.npy')
# If you are working with the basic sample data, use maybe 2 instead of 100 here... you don't have enough data to really do this
train_data = much_data[:-4]
validation_data = much_data[-4:]
def train_neural_network(x):
print ('..........1.........')
prediction = convolutional_neural_network(x)
print ('..........2.........')
#cost = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(prediction,y) )
cost = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(logits=prediction,labels=y))
print ('..........3.........')
optimizer = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(cost)
print ('..........4.........')
hm_epochs = 20
with tf.Session() as sess:
sess.run(tf.initialize_all_variables())
successful_runs = 0
total_runs = 0
print ('..........5.........')
for epoch in range(hm_epochs):
epoch_loss = 0
for data in train_data:
total_runs += 1
try:
X = data[0]
Y = data[1]
_, c = sess.run([optimizer, cost], feed_dict={x: X, y: Y})
epoch_loss += c
successful_runs += 1
except Exception as e:
# I am passing for the sake of notebook space, but we are getting 1 shaping issue from one
# input tensor. Not sure why, will have to look into it. Guessing it's
# one of the depths that doesn't come to 20.
pass
#print(str(e))
print ('..........6.........')
print('Epoch', epoch+1, 'completed out of',hm_epochs,'loss:',epoch_loss)
print ('..........7.........')
correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
print('Accuracy:',accuracy.eval({x:[i[0] for i in validation_data], y:[i[1] for i in validation_data]}))
print('Done. Finishing accuracy:')
print('Accuracy:',accuracy.eval({x:[i[0] for i in validation_data], y:[i[1] for i in validation_data]}))
print('fitment percent:',successful_runs/total_runs)
print (x)
# Run this locally:
train_neural_network(x)
P.S:resample(),segment_lung_mask()方法可以从链接1找到。
答案 0 :(得分:4)
对于培训,你有
for data in train_data:
total_runs += 1
try:
X = data[0]
Y = data[1]
_, c = sess.run([optimizer, cost], feed_dict={x: X, y: Y})
因此,x和y分别是单行train_data
的前两个元素。
但是,在计算准确度时
print('Accuracy:',accuracy.eval({x:[i[0] for i in validation_data], y:[i[1] for i in validation_data]}))
因此x是{em>所有行validation_data
的第一个元素,它给出了(20,310,310)
的维度,不能将其广播到维度{{的占位符1}}。同上y。 (广播意味着如果你给它一个尺寸(20)
的张量,它就会知道将310列中的每一列分别送到占位符。它无法弄清楚如何处理张量为{{ 1}}。)
顺便说一下,当您声明占位符时,最好指定其维度,使用(20, 310)
作为维度,具体取决于单独示例的数量。这样,程序可以在尺寸不匹配时发出警告。
答案 1 :(得分:1)
错误消息似乎表明尚未正确定义占位符张量x
和y
。它们的形状应与输入值X = data[0]
和Y = data[1]
相同,例如
x = tf.placeholder(shape=[20,310,310], dtype=tf.float32)
# if y is a scalar:
y = tf.placeholder(shape=[], dtype=tf.float32)