python - How can I handle huge matrices?


I am performing topic detection with supervised learning. However, my matrices are huge (202180 x 15000) and I am unable to fit them into the models I want. Most of the matrix consists of zeros. Only logistic regression works. Is there a way I can keep working with the same matrix but make it work with the models I want? Can I create the matrices in a different way?

Here is my code:

    import subprocess

    import numpy as np
    from sklearn import metrics
    from sklearn.linear_model import LogisticRegression
    from sklearn.linear_model import SGDClassifier

    def run(command):
        # check_output returns bytes; decode so the caller can split and cast to int
        output = subprocess.check_output(command, shell=True)
        return output.decode()

Load the vocabulary:

    f = open('/users/win/documents/wholedata/rightvo.txt', 'r')
    vocab_temp = f.read().split()
    f.close()
    col = len(vocab_temp)
    print("training column size:")
    print(col)

Create the training matrix:

    row = run('cat ' + '/users/win/documents/wholedata/x_tr.txt' + " | wc -l").split()[0]
    print("training row size:")
    print(row)
    matrix_tmp = np.zeros((int(row), col), dtype=np.int64)
    print("train matrix size:")
    print(matrix_tmp.size)

    label_tmp = np.zeros((int(row)), dtype=np.int64)
    f = open('/users/win/documents/wholedata/x_tr.txt', 'r')
    count = 0
    for line in f:
        line_tmp = line.split()
        #print(line_tmp)
        for word in line_tmp[0:]:
            if word not in vocab_temp:
                continue
            matrix_tmp[count][vocab_temp.index(word)] = 1
        count = count + 1
    f.close()
    print("train matrix is:\n ")
    print(matrix_tmp)
    print(label_tmp)
    print("train label size:")
    print(len(label_tmp))

    f = open('/users/win/documents/wholedata/rightvo.txt', 'r')
    vocab_tmp = f.read().split()
    f.close()
    col = len(vocab_tmp)
    print("test column size:")
    print(col)

Make the test matrix:

    row = run('cat ' + '/users/win/documents/wholedata/x_te.txt' + " | wc -l").split()[0]
    print("test row size:")
    print(row)
    matrix_tmp_test = np.zeros((int(row), col), dtype=np.int64)
    print("test matrix size:")
    print(matrix_tmp_test.size)

    label_tmp_test = np.zeros((int(row)), dtype=np.int64)

    f = open('/users/win/documents/wholedata/x_te.txt', 'r')
    count = 0
    for line in f:
        line_tmp = line.split()
        #print(line_tmp)
        for word in line_tmp[0:]:
            if word not in vocab_tmp:
                continue
            matrix_tmp_test[count][vocab_tmp.index(word)] = 1
        count = count + 1
    f.close()
    print("test matrix is: \n")
    print(matrix_tmp_test)
    print(label_tmp_test)
    print("test label size:")
    print(len(label_tmp_test))

    xtrain = []
    with open("/users/win/documents/wholedata/y_te.txt") as filer:
        for line in filer:
            xtrain.append(line.strip().split())
    xtrain = np.ravel(xtrain)
    label_tmp_test = xtrain

    ytrain = []
    with open("/users/win/documents/wholedata/y_tr.txt") as filer:
        for line in filer:
            ytrain.append(line.strip().split())
    ytrain = np.ravel(ytrain)
    label_tmp = ytrain

Load the supervised model:

    model = LogisticRegression()
    model = model.fit(matrix_tmp, label_tmp)
    #print(model)
    print("entered 1")
    y_train_pred = model.predict(matrix_tmp_test)
    print("entered 2")
    print(metrics.accuracy_score(label_tmp_test, y_train_pred))

You can use a particular data structure available in the SciPy package called a sparse matrix: http://docs.scipy.org/doc/scipy/reference/sparse.html
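As a minimal sketch of the idea (assuming SciPy is installed; the toy shape and values are made up for illustration), a mostly-zero array can be stored in compressed sparse row (CSR) format, which keeps only the non-zero entries:

    import numpy as np
    from scipy import sparse

    # a small dense array that is mostly zeros
    dense = np.zeros((4, 5), dtype=np.int64)
    dense[0, 1] = 1
    dense[2, 3] = 1

    # CSR stores only the non-zero values plus their index arrays,
    # so memory scales with the number of non-zeros, not rows * cols
    mat = sparse.csr_matrix(dense)
    print(mat.nnz)    # 2
    print(mat.shape)  # (4, 5)

For your 202180 x 15000 int64 matrix, that is the difference between roughly 24 GB of stored zeros and memory proportional to the number of ones.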

According to the definition:

A sparse matrix is a matrix with a large number of zero values. In contrast, a matrix where many or most of the entries are non-zero is said to be dense. There are no strict rules for what constitutes a sparse matrix; we'll say a matrix is sparse if there is some benefit to exploiting its sparsity. Additionally, there are a variety of sparse matrix formats designed to exploit different sparsity patterns (the structure of the non-zero values in the sparse matrix) and different methods for accessing and manipulating the matrix entries.
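Applied to the code in your question, here is a hedged sketch (reusing the variable names and paths from the question; the vocab_index dict is my addition for lookup speed, not part of your original code). It builds the training matrix directly in LIL format, which supports cheap incremental assignment, then converts to CSR before fitting, since scikit-learn estimators such as LogisticRegression and SGDClassifier accept CSR input:

    import numpy as np
    from scipy import sparse
    from sklearn.linear_model import LogisticRegression

    # map each vocabulary word to its column once;
    # a dict lookup is O(1) versus O(n) for list.index
    vocab_index = {word: i for i, word in enumerate(vocab_temp)}

    # lil_matrix never allocates the full dense array of zeros
    matrix_tmp = sparse.lil_matrix((int(row), col), dtype=np.int64)

    with open('/users/win/documents/wholedata/x_tr.txt') as f:
        for count, line in enumerate(f):
            for word in line.split():
                j = vocab_index.get(word)
                if j is not None:
                    matrix_tmp[count, j] = 1

    # convert to CSR for fast row access and for model fitting
    matrix_tmp = matrix_tmp.tocsr()

    model = LogisticRegression()
    model.fit(matrix_tmp, label_tmp)

The same pattern works for the test matrix. Alternatively, scikit-learn's CountVectorizer (with binary=True) produces this kind of sparse document-term matrix for you directly.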

