python - How can I handle huge matrices?
I am performing topic detection with supervised learning. However, my matrices are huge (202180 x 15000) and I am unable to fit them into the models I want. Most of the matrix consists of zeros. Only logistic regression works. Is there a way I can keep working with the same matrix but make it usable with the models I want? Can I create the matrices in a different way?

Here is my code:
    import numpy as np
    import subprocess
    from sklearn.linear_model import SGDClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn import metrics

    def run(command):
        output = subprocess.check_output(command, shell=True)
        return output
Load vocabulary:
    f = open('/users/win/documents/wholedata/rightvo.txt', 'r')
    vocab_temp = f.read().split()
    f.close()
    col = len(vocab_temp)
    print("training column size:")
    print(col)
Create train matrix:
    # Count the lines of the training file via `wc -l` to size the matrix
    row = run('cat '+'/users/win/documents/wholedata/x_tr.txt'+" | wc -l").split()[0]
    print("training row size:")
    print(row)
    matrix_tmp = np.zeros((int(row), col), dtype=np.int64)
    print("train matrix size:")
    print(matrix_tmp.size)
    label_tmp = np.zeros((int(row)), dtype=np.int64)
    f = open('/users/win/documents/wholedata/x_tr.txt', 'r')
    count = 0
    for line in f:
        line_tmp = line.split()
        #print(line_tmp)
        for word in line_tmp[0:]:
            if word not in vocab_temp:
                continue
            matrix_tmp[count][vocab_temp.index(word)] = 1
        count = count + 1
    f.close()
    print("train matrix is:\n ")
    print(matrix_tmp)
    print(label_tmp)
    print("train label size:")
    print(len(label_tmp))
    f = open('/users/win/documents/wholedata/rightvo.txt', 'r')
    vocab_tmp = f.read().split()
    f.close()
    col = len(vocab_tmp)
    print("test column size:")
    print(col)
Make test matrix:
    row = run('cat '+'/users/win/documents/wholedata/x_te.txt'+" | wc -l").split()[0]
    print("test row size:")
    print(row)
    matrix_tmp_test = np.zeros((int(row), col), dtype=np.int64)
    print("test matrix size:")
    print(matrix_tmp_test.size)
    label_tmp_test = np.zeros((int(row)), dtype=np.int64)
    f = open('/users/win/documents/wholedata/x_te.txt', 'r')
    count = 0
    for line in f:
        line_tmp = line.split()
        #print(line_tmp)
        for word in line_tmp[0:]:
            if word not in vocab_tmp:
                continue
            matrix_tmp_test[count][vocab_tmp.index(word)] = 1
        count = count + 1
    f.close()
    print("test matrix is: \n")
    print(matrix_tmp_test)
    print(label_tmp_test)
    print("test label size:")
    print(len(label_tmp_test))

    # Read the label files and flatten them into 1-D arrays
    xtrain = []
    with open("/users/win/documents/wholedata/y_te.txt") as filer:
        for line in filer:
            xtrain.append(line.strip().split())
    xtrain = np.ravel(xtrain)
    label_tmp_test = xtrain
    ytrain = []
    with open("/users/win/documents/wholedata/y_tr.txt") as filer:
        for line in filer:
            ytrain.append(line.strip().split())
    ytrain = np.ravel(ytrain)
    label_tmp = ytrain
Load supervised model:
    model = LogisticRegression()
    model = model.fit(matrix_tmp, label_tmp)
    #print(model)
    print("entered 1")
    y_train_pred = model.predict(matrix_tmp_test)
    print("entered 2")
    print(metrics.accuracy_score(label_tmp_test, y_train_pred))
You can use a particular data structure available in the scipy package, called a sparse matrix: http://docs.scipy.org/doc/scipy/reference/sparse.html

According to the definition:
A sparse matrix is a matrix in which a large number of the values are 0. In contrast, a matrix where many or most of the entries are non-zero is said to be dense. There are no strict rules for what constitutes a sparse matrix, so we'll say that a matrix is sparse if there is some benefit to exploiting its sparsity. Additionally, there are a variety of sparse matrix formats which are designed to exploit different sparsity patterns (the structure of non-zero values in the sparse matrix) and different methods for accessing and manipulating the matrix entries.
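For instance, here is a minimal sketch of how the train-matrix loop in the question could build a scipy.sparse matrix instead of a dense numpy array. The file path and vocabulary list come from the question; the helper name build_sparse_matrix and the dictionary-based word lookup are assumptions of this sketch, not part of the original code:

    import numpy as np
    from scipy.sparse import csr_matrix

    # Hypothetical helper, not from the original post
    def build_sparse_matrix(data_path, vocab):
        # Map each word to its column index once; calling vocab.index(word)
        # inside the loop is O(len(vocab)) per lookup and very slow.
        vocab_index = {word: j for j, word in enumerate(vocab)}
        rows, cols = [], []
        n_rows = 0
        with open(data_path) as f:
            for i, line in enumerate(f):
                n_rows = i + 1
                for word in line.split():
                    j = vocab_index.get(word)
                    if j is not None:
                        rows.append(i)
                        cols.append(j)
        # Only the non-zero cells are stored. csr_matrix sums duplicate
        # (row, col) pairs, so clip the stored values back to 1 to match
        # the original code, which assigns 1 rather than incrementing.
        data = np.ones(len(rows), dtype=np.int64)
        matrix = csr_matrix((data, (rows, cols)), shape=(n_rows, len(vocab)))
        matrix.data[:] = 1
        return matrix

    matrix_tmp = build_sparse_matrix('/users/win/documents/wholedata/x_tr.txt', vocab_temp)

scikit-learn estimators such as LogisticRegression and SGDClassifier accept scipy.sparse matrices directly, so model.fit(matrix_tmp, label_tmp) works unchanged. For scale: a 202180 x 15000 dense int64 array needs roughly 24 GB, while the CSR version stores only the non-zero entries.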