python - Using dimensionality reduction on a matrix
For supervised learning, my matrix is huge, and as a result none of the models agree to run on it. I read that PCA can reduce the dimensionality to a large extent.
Below is my code:
```python
import subprocess
import numpy as np

def run(command):
    output = subprocess.check_output(command, shell=True)
    return output

# read the vocabulary; each entry becomes one column of the matrix
f = open('/users/ya/documents/10percent/vik.txt', 'r')
vocab_temp = f.read().split()
f.close()
col = len(vocab_temp)
print("training column size:")
print(col)

# count the lines of the training file to get the row dimension
row = run('cat /users/ya/documents/10percent/x_true.txt | wc -l').split()[0]
print("training row size:")
print(row)

matrix_tmp = np.zeros((int(row), col), dtype=np.int64)
print("train matrix size:")
print(matrix_tmp.size)

# label_tmp.ndim must equal 1
label_tmp = np.zeros(int(row), dtype=np.int64)

# binary bag-of-words: set a 1 for every vocabulary word in each line
f = open('/users/ya/documents/10percent/x_true.txt', 'r')
count = 0
for line in f:
    line_tmp = line.split()
    for word in line_tmp:
        if word not in vocab_temp:
            continue
        matrix_tmp[count][vocab_temp.index(word)] = 1
    count = count + 1
f.close()

print("train matrix is:\n")
print(matrix_tmp)
print(label_tmp)
print(len(label_tmp))
print("no. of topics in train:")
print(len(set(label_tmp)))
print("train label size:")
print(len(label_tmp))
```
I wish to apply PCA to `matrix_tmp`, which has size (202180 x 9984). How can I modify my code to include it?
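If the dense matrix is already built, one direct option is to run `TruncatedSVD` on it. A minimal sketch, using a small random stand-in matrix instead of the real (202180 x 9984) `matrix_tmp`, and an arbitrary choice of 10 components:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# hypothetical small stand-in for the large binary matrix_tmp
matrix_tmp = np.random.randint(0, 2, size=(200, 50)).astype(np.int64)

# reduce the 50 columns down to 10 components
svd = TruncatedSVD(n_components=10)
reduced = svd.fit_transform(matrix_tmp)
print(reduced.shape)  # (200, 10)
```

At the real scale, building the dense matrix first wastes memory; the answer below avoids that by vectorizing directly into a sparse matrix.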
```python
import codecs
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

with codecs.open('input_file', 'r', encoding='utf-8') as inf:
    lines = inf.readlines()

# binary bag-of-words, produced directly as a sparse matrix
vectorizer = CountVectorizer(binary=True)
x_train = vectorizer.fit_transform(lines)

perform_pca = False
if perform_pca:
    n_components = 100
    pca = TruncatedSVD(n_components)
    x_train = pca.fit_transform(x_train)
```
1- Vectorization: the vectorizers available in sklearn produce sparse matrices instead of a full matrix with a massive number of 0 values.
2- PCA, if needed: `TruncatedSVD` is used here because it works directly on sparse input.
3- Performance: play with the parameters of the vectorizer and of the PCA step if needed.