python - Using dimensionality reduction on a matrix


For supervised learning, my matrix is so large that the models take a very long time to run on it. I read that PCA can reduce the dimensionality to a large extent.

Below is my code:

    import subprocess

    import numpy as np


    def run(command):
        # Run a shell command and return its raw output (bytes).
        output = subprocess.check_output(command, shell=True)
        return output

    # Read the vocabulary: one column per word.
    f = open('/users/ya/documents/10percent/vik.txt', 'r')
    vocab_temp = f.read().split()
    f.close()
    col = len(vocab_temp)
    print("training column size:")
    print(col)

    #dataset = list()

    # Count the lines of the training file: one row per line.
    row = run('cat ' + '/users/ya/documents/10percent/x_true.txt' + " | wc -l").split()[0]
    print("training row size:")
    print(row)
    matrix_tmp = np.zeros((int(row), col), dtype=np.int64)
    print("train matrix size:")
    print(matrix_tmp.size)
    # label_tmp.ndim must equal 1
    label_tmp = np.zeros((int(row)), dtype=np.int64)
    f = open('/users/ya/documents/10percent/x_true.txt', 'r')
    count = 0
    for line in f:
        line_tmp = line.split()
        #print(line_tmp)
        for word in line_tmp[0:]:
            if word not in vocab_temp:
                continue
            # Mark the word as present in this row.
            matrix_tmp[count][vocab_temp.index(word)] = 1
        count = count + 1
    f.close()
    print("train matrix is:\n ")
    print(matrix_tmp)
    print(label_tmp)
    print(len(label_tmp))
    print("no. of topics in train:")
    print(len(set(label_tmp)))
    print("train label size:")
    print(len(label_tmp))

I wish to apply PCA to matrix_tmp, which has a size of (202180 x 9984). How can I modify my code to include it?

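    import codecs

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import CountVectorizer

    # Read the documents, one per line.
    with codecs.open('input_file', 'r', encoding='utf-8') as inf:
        lines = inf.readlines()

    # binary=True records only whether a word occurs, as in the question's matrix.
    vectorizer = CountVectorizer(binary=True)
    x_train = vectorizer.fit_transform(lines)  # sparse matrix

    perform_pca = False
    if perform_pca:
        n_components = 100
        # TruncatedSVD works directly on the sparse matrix.
        pca = TruncatedSVD(n_components)
        x_train = pca.fit_transform(x_train)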

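1- Vectorization: the vectorizers available in sklearn produce sparse matrices instead of a full matrix holding a massive number of 0 values.

2- PCA, if needed. TruncatedSVD is used here because it accepts the sparse matrix directly.

3- Performance: play with the parameters of the vectorizer and of the PCA step, if needed.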
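
To make the three points concrete, here is a minimal sketch on toy data. The `vocab` list and `lines` below are made-up stand-ins for the vocabulary file and the training file, and `n_components=2` is chosen only to fit the toy size. It assumes you want to keep the column order of your own word list, which `CountVectorizer` allows through its `vocabulary` parameter. It also shows why the dense matrix is the problem: at 202180 x 9984 with `int64` entries it needs roughly 16 GB.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the vocabulary and training files.
vocab = ["apple", "banana", "cherry", "date", "elder", "fig"]
lines = [
    "apple banana apple",
    "banana cherry",
    "date elder fig unknown",
    "apple fig",
]

# A fixed vocabulary keeps the column order of the word list; words outside
# it are ignored, and binary=True records presence only (0 or 1).
vectorizer = CountVectorizer(vocabulary=vocab, binary=True)
x_sparse = vectorizer.fit_transform(lines)

# Bytes held by the sparse (CSR) arrays versus an equivalent dense int64 array.
sparse_bytes = (x_sparse.data.nbytes
                + x_sparse.indices.nbytes
                + x_sparse.indptr.nbytes)
dense_bytes = x_sparse.shape[0] * x_sparse.shape[1] * np.dtype(np.int64).itemsize
print("sparse bytes:", sparse_bytes, "dense bytes:", dense_bytes)

# Size of the dense int64 matrix from the question.
question_dense_bytes = 202180 * 9984 * np.dtype(np.int64).itemsize
print("dense matrix in the question:", question_dense_bytes, "bytes")

# Reduce the sparse matrix without ever converting it to a dense one.
svd = TruncatedSVD(n_components=2, random_state=0)
x_reduced = svd.fit_transform(x_sparse)
print("reduced shape:", x_reduced.shape)
print("explained variance kept:", svd.explained_variance_ratio_.sum())
```

On real data, `n_components` is the main parameter to tune: the summed `explained_variance_ratio_` gives a rough idea of how much information each setting keeps.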

