python - How can I handle huge matrices?


I am performing topic detection with supervised learning. However, my matrices are huge (202180 x 15000) and I am unable to fit them into the models I want. Most of the matrix consists of zeros. Only logistic regression works. Is there a way I can continue working with the same matrix but enable it to work with the models I want? Can I create my matrices in a different way?

Here is my code:

    import numpy as np
    import subprocess
    from sklearn.linear_model import SGDClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn import metrics

    def run(command):
        # Run a shell command and return its raw output
        output = subprocess.check_output(command, shell=True)
        return output

Load the vocabulary

    f = open('/users/win/documents/wholedata/rightvo.txt', 'r')
    vocab_temp = f.read().split()
    f.close()
    col = len(vocab_temp)
    print("training column size:")
    print(col)

Create the train matrix

    row = run('cat ' + '/users/win/documents/wholedata/x_tr.txt' + " | wc -l").split()[0]
    print("training row size:")
    print(row)
    matrix_tmp = np.zeros((int(row), col), dtype=np.int64)
    print("train matrix size:")
    print(matrix_tmp.size)

    label_tmp = np.zeros((int(row)), dtype=np.int64)
    f = open('/users/win/documents/wholedata/x_tr.txt', 'r')
    count = 0
    for line in f:
        line_tmp = line.split()
        #print(line_tmp)
        for word in line_tmp[0:]:
            if word not in vocab_temp:
                continue
            matrix_tmp[count][vocab_temp.index(word)] = 1
        count = count + 1
    f.close()
    print("train matrix is:\n ")
    print(matrix_tmp)
    print(label_tmp)
    print("train label size:")
    print(len(label_tmp))

    f = open('/users/win/documents/wholedata/rightvo.txt', 'r')
    vocab_tmp = f.read().split()
    f.close()
    col = len(vocab_tmp)
    print("test column size:")
    print(col)

Make the test matrix

    row = run('cat ' + '/users/win/documents/wholedata/x_te.txt' + " | wc -l").split()[0]
    print("test row size:")
    print(row)
    matrix_tmp_test = np.zeros((int(row), col), dtype=np.int64)
    print("test matrix size:")
    print(matrix_tmp_test.size)

    label_tmp_test = np.zeros((int(row)), dtype=np.int64)

    f = open('/users/win/documents/wholedata/x_te.txt', 'r')
    count = 0
    for line in f:
        line_tmp = line.split()
        #print(line_tmp)
        for word in line_tmp[0:]:
            if word not in vocab_tmp:
                continue
            matrix_tmp_test[count][vocab_tmp.index(word)] = 1
        count = count + 1
    f.close()
    print("test matrix is: \n")
    print(matrix_tmp_test)
    print(label_tmp_test)

    print("test label size:")
    print(len(label_tmp_test))

    # Test labels (the variable is named xtrain but holds the labels from y_te.txt)
    xtrain = []
    with open("/users/win/documents/wholedata/y_te.txt") as filer:
        for line in filer:
            xtrain.append(line.strip().split())
    xtrain = np.ravel(xtrain)
    label_tmp_test = xtrain

    # Train labels
    ytrain = []
    with open("/users/win/documents/wholedata/y_tr.txt") as filer:
        for line in filer:
            ytrain.append(line.strip().split())
    ytrain = np.ravel(ytrain)
    label_tmp = ytrain

Load the supervised model

    model = LogisticRegression()
    model = model.fit(matrix_tmp, label_tmp)
    #print(model)
    print("entered 1")
    y_train_pred = model.predict(matrix_tmp_test)
    print("entered 2")
    print(metrics.accuracy_score(label_tmp_test, y_train_pred))

You can use a particular data structure available in the scipy package called a sparse matrix: http://docs.scipy.org/doc/scipy/reference/sparse.html
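As a rough illustration of how the binary document-term matrix from the question could be built directly in sparse form, here is a minimal sketch. The helper name `build_sparse_matrix` and the tiny `vocab` and `docs` values are made up for this example and stand in for the real files. It also swaps the `vocab.index(word)` lookup for a dictionary, which is a separate change from the sparse storage itself.

```python
import numpy as np
from scipy.sparse import csr_matrix


def build_sparse_matrix(docs, vocab):
    """Build a binary document-term matrix in CSR format.

    docs  -- iterable of token lists, one list per document (hypothetical input)
    vocab -- list of vocabulary words; a word's position is its column index
    """
    # A dict gives constant-time lookups; list.index() scans the whole list.
    word_to_col = {word: j for j, word in enumerate(vocab)}
    rows, cols = [], []
    n_rows = 0
    for i, tokens in enumerate(docs):
        n_rows = i + 1
        # The set keeps each (row, column) pair unique, so entries stay 0 or 1.
        for j in {word_to_col[w] for w in tokens if w in word_to_col}:
            rows.append(i)
            cols.append(j)
    data = np.ones(len(rows), dtype=np.int8)
    return csr_matrix((data, (rows, cols)), shape=(n_rows, len(vocab)))


# Tiny made-up data, only to show the shape of the result
vocab = ["apple", "banana", "cherry"]
docs = [["apple", "cherry", "apple"], ["durian"], ["banana"]]
X = build_sparse_matrix(docs, vocab)
print(X.toarray())
```

Only the positions of the non-zero entries are stored, so the full dense array is never allocated. scikit-learn estimators such as `LogisticRegression` and `SGDClassifier` accept scipy sparse matrices in `fit` and `predict`, so a matrix built this way can be passed in place of the dense one.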

According to the definition:

A sparse matrix is simply a matrix with a large number of zero values. In contrast, a matrix where many or most entries are non-zero is said to be dense. There are no strict rules for what constitutes a sparse matrix, so we'll say that a matrix is sparse if there is some benefit to exploiting its sparsity. Additionally, there are a variety of sparse matrix formats which are designed to exploit different sparsity patterns (the structure of non-zero values in a sparse matrix) and different methods for accessing and manipulating matrix entries.
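To make the "benefit of exploiting sparsity" concrete, the sketch below first computes the memory a dense int64 array of the question's shape (202180 x 15000) would need, then compares dense and CSR storage on a scaled-down random matrix. The scaled-down shape and the figure of 15 non-zero entries per row are assumptions chosen for illustration; the question only says that most entries are zero. `csr_nbytes` is a hypothetical helper, not a scipy function.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Dense cost for the shape given in the question, at 8 bytes per int64 entry
n_rows, n_cols = 202180, 15000
dense_bytes_full = n_rows * n_cols * np.dtype(np.int64).itemsize
print("dense int64, full size: %.1f GB" % (dense_bytes_full / 1e9))  # about 24.3 GB


def csr_nbytes(m):
    """Bytes used by the three arrays that back a CSR matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


# Scaled-down example: 2000 x 1500 with 15 non-zeros per row (assumed density)
rng = np.random.default_rng(0)
small_rows, small_cols, nnz_per_row = 2000, 1500, 15
cols = np.concatenate(
    [rng.choice(small_cols, size=nnz_per_row, replace=False) for _ in range(small_rows)]
)
rows = np.repeat(np.arange(small_rows), nnz_per_row)
data = np.ones(rows.size, dtype=np.int8)
X = csr_matrix((data, (rows, cols)), shape=(small_rows, small_cols))

dense_small = small_rows * small_cols * np.dtype(np.int64).itemsize
print("dense bytes:", dense_small)
print("CSR bytes:  ", csr_nbytes(X))
```

The CSR cost grows with the number of non-zero entries rather than with rows times columns, which is why a mostly-zero matrix of this shape can fit in memory when its dense form cannot.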

