Nov-11-2017, 05:08 PM
(This post was last modified: Nov-11-2017, 05:09 PM by PythonNewbie.)
Hello all,
This is my first post here, and I hope to find some help.
I am trying to reproduce the results of an example. The example isn't provided in full, so I had to write some parts myself with my limited knowledge of Python. In it, the file "seeds.tsv" is read by a function that returns the data and labels, as follows (the function is defined in a separate file called "load.py"):
import numpy as np

def load_dataset(dataset_name):
    '''
    data,labels = load_dataset(dataset_name)
    Load a given dataset
    Returns
    -------
    data : numpy ndarray
    labels : numpy ndarray of str
    '''
    data = []
    labels = []
    with open('{0}.tsv'.format(dataset_name)) as ifile:
        for line in ifile:
            tokens = line.strip().split('\t')
            data.append([float(tk) for tk in tokens[:-1]])
            labels.append(tokens[-1])
    data = np.array(data)
    labels = np.array(labels)
    return data, labels

After reading the file, I used k-fold cross-validation for the nearest neighbor algorithm as follows:

from load import load_dataset
import numpy as np
import random

feature_neames = ['area',
                  'perimeter',
                  'compactness',
                  'length of kernel',
                  'width of kernel',
                  'asymmetry coefficient',
                  'length of kernel groove']

data, lables = load_dataset('seeds')

"""
rndInx = random.sample(range(len(lables)), len(lables))
data = data[rndInx]
lables = lables[rndInx]
print(lables)
"""
#print(lables.shape)

# This function returns the squared Euclidean distance between two points in N-dimensional space
def distance(f1, f2):
    return np.sum((f1 - f2)**2)

# 10-fold cross validation
fold = 10  # number of folds and blocks in each fold
elem = int(len(lables)/fold)  # number of elements in each block
error = 0.0
for fi in range(fold):
    nearestLable = []
    training = np.ones(len(lables), bool)
    training[fi*fold: fi*fold + elem] = False
    testing = ~ training
    data_tr = data[training]
    data_ts = data[testing]
    labels_tr = lables[training]
    labels_ts = lables[testing]
    for x_ts in data_ts:
        dists = np.array([distance(x_ts, y_tr) for y_tr in data_tr])
        nearest = dists.argmin()
        nearestLable.append(labels_tr[nearest])
    error += np.sum(nearestLable != labels_ts)

print("\n\nThe accuracy of the nearest neighbor"
      " \nclassifier using %i-fold cross "
      "\nvalidation is: %1.2f" %(fold, (1-(error/len(lables)))))

When I run the above code without randomizing the data for 10-fold cross-validation, I get an accuracy of ~0.86 (it should be 0.88, as reported in the original example). But when I randomize the data using the random indices rndInx (the block enclosed in triple quotes in the second code segment), I get an accuracy of 0.38, and I am not sure why. The original data is ordered in the sense that examples of the same class are placed contiguously. But when I use 70-fold cross-validation, I get an accuracy of 0.98. Am I doing something wrong?
Thanks in advance
