python - Using dimensionality reduction on a matrix
For supervised learning, my matrix is huge, and as a result none of the models agree to run on it. I read that PCA can reduce the dimensionality to a large extent.
Below is my code:
```python
import subprocess
import numpy as np

def run(command):
    output = subprocess.check_output(command, shell=True)
    return output

# read the vocabulary; each entry becomes one column of the matrix
f = open('/users/ya/documents/10percent/vik.txt', 'r')
vocab_temp = f.read().split()
f.close()
col = len(vocab_temp)
print("training column size:")
print(col)

# count the lines of the training file to get the row dimension
row = run('cat /users/ya/documents/10percent/x_true.txt | wc -l').split()[0]
print("training row size:")
print(row)

matrix_tmp = np.zeros((int(row), col), dtype=np.int64)
print("train matrix size:")
print(matrix_tmp.size)

# label_tmp.ndim must equal 1
label_tmp = np.zeros(int(row), dtype=np.int64)

# binary bag-of-words: set a 1 for every vocabulary word in each line
f = open('/users/ya/documents/10percent/x_true.txt', 'r')
count = 0
for line in f:
    line_tmp = line.split()
    for word in line_tmp:
        if word not in vocab_temp:
            continue
        matrix_tmp[count][vocab_temp.index(word)] = 1
    count = count + 1
f.close()

print("train matrix is:\n")
print(matrix_tmp)
print(label_tmp)
print(len(label_tmp))
print("no. of topics in train:")
print(len(set(label_tmp)))
print("train label size:")
print(len(label_tmp))
```
I wish to apply PCA to `matrix_tmp`, which has size (202180 x 9984). How can I modify my code to include it?
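If the dense matrix is already built, one direct option is to run `TruncatedSVD` on it. A minimal sketch, using a small random stand-in matrix instead of the real (202180 x 9984) `matrix_tmp`, and an arbitrary choice of 10 components:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# hypothetical small stand-in for the large binary matrix_tmp
matrix_tmp = np.random.randint(0, 2, size=(200, 50)).astype(np.int64)

# reduce the 50 columns down to 10 components
svd = TruncatedSVD(n_components=10)
reduced = svd.fit_transform(matrix_tmp)
print(reduced.shape)  # (200, 10)
```

At the real scale, building the dense matrix first wastes memory; the answer below avoids that by vectorizing directly into a sparse matrix.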
```python
import codecs
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

with codecs.open('input_file', 'r', encoding='utf-8') as inf:
    lines = inf.readlines()

# binary bag-of-words, produced directly as a sparse matrix
vectorizer = CountVectorizer(binary=True)
x_train = vectorizer.fit_transform(lines)

perform_pca = False
if perform_pca:
    n_components = 100
    pca = TruncatedSVD(n_components)
    x_train = pca.fit_transform(x_train)
```
1- Vectorization: the vectorizers available in sklearn produce sparse matrices instead of a full matrix with a massive number of 0 values.
2- PCA, if needed: `TruncatedSVD` is used here because it works directly on sparse input.
3- Performance: play with the parameters of the vectorizer and of the PCA step if needed.