I have a very large CSR matrix with about 100K columns. Many of those columns are 0 in every row, and I want to remove them. How can I do this?

E.g.

X= [[1,0,0,0,1], [1,0,0,1,1],[1,0,0,0,0],[1,0,0,1,1]]

The output should be

X= [[1,0,1], [1,1,1],[1,0,0],[1,1,1]]

1 Answer

Best answer

I am not aware of a single function that does this in one line, but there are several ways to do it. The following code keeps only the columns that have a nonzero value (1 here) in more than 60% of the rows and drops the rest.

>>> X
<3x3 sparse matrix of type '<type 'numpy.int32'>'
        with 6 stored elements in Compressed Sparse Row format>
>>> X.toarray()
array([[1, 0, 1],
       [0, 0, 1],
       [1, 1, 1]])
>>> import numpy as np
>>> row,col=X.nonzero()  #get row and col where value is not zero
>>> row
array([0, 0, 1, 2, 2, 2])
>>> col
array([0, 2, 2, 0, 1, 2])
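>>> t=[]
>>> cols,counts=np.unique(col,return_counts=True) #nonzero count per column (scipy.stats.itemfreq has been removed from SciPy)
>>> for v,vv in zip(cols,counts): #If >60% of rows have a nonzero value, select those cols.
...     if(float(vv)/X.shape[0] > 0.6): #divide by the total number of rows
...             t.append(int(v))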
...
>>> t
[0, 2]
>>> X[:,t]
<3x2 sparse matrix of type '<type 'numpy.int32'>'
        with 5 stored elements in Compressed Sparse Row format>
>>> X[:,t].toarray()
array([[1, 1],
       [0, 1],
       [1, 1]])
>>> X
<3x3 sparse matrix of type '<type 'numpy.int32'>'
        with 6 stored elements in Compressed Sparse Row format>
>>> X.toarray()
array([[1, 0, 1],
       [0, 0, 1],
       [1, 1, 1]])
>>>
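
The same 60% filter can probably be written without the Python loop. The sketch below is one possible way, assuming a SciPy `csr_matrix`; the helper name `keep_dense_columns` and its `min_fraction` parameter are made up for illustration and are not part of SciPy.

```python
import numpy as np
from scipy.sparse import csr_matrix

def keep_dense_columns(X, min_fraction=0.6):
    """Return X restricted to the columns that are nonzero in more than
    min_fraction of all rows, together with the kept column indices.
    (Hypothetical helper; the name and parameter are illustrative only.)"""
    X = csr_matrix(X, copy=True)     # work on a copy so the caller's matrix is untouched
    X.eliminate_zeros()              # drop explicitly stored zeros so getnnz counts true nonzeros
    nnz_per_col = X.getnnz(axis=0)   # number of stored (nonzero) entries in each column
    keep = np.flatnonzero(nnz_per_col > min_fraction * X.shape[0])
    return X[:, keep], keep

X = csr_matrix(np.array([[1, 0, 1],
                         [0, 0, 1],
                         [1, 1, 1]]))
X_kept, kept_cols = keep_dense_columns(X)
print(kept_cols)          # [0 2]
print(X_kept.toarray())
```

Here the fraction is taken over all rows (`X.shape[0]`), and the original `X` is left unchanged, as in the session above.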

If you want to remove columns whose value is 0 in every row (shown here on a dense NumPy array):

>>> X=np.array([[1,0,0,0,1], [1,0,0,1,1],[1,0,0,0,0],[1,0,0,1,1]])
>>> X
array([[1, 0, 0, 0, 1],
       [1, 0, 0, 1, 1],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 1, 1]])
>>> row,col=np.nonzero(X)
>>> c=np.unique(col)
>>> c
array([0, 3, 4], dtype=int64)
>>> X[:,c]
array([[1, 0, 1],
       [1, 1, 1],
       [1, 0, 0],
       [1, 1, 1]])
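
Since the question is about a CSR matrix, here is a minimal sketch of the same idea applied directly to the sparse matrix, without converting it to a dense array. It assumes a SciPy `csr_matrix` with no explicitly stored zeros; the variable names `nonzero_cols` and `X_reduced` are just illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix(np.array([[1, 0, 0, 0, 1],
                         [1, 0, 0, 1, 1],
                         [1, 0, 0, 0, 0],
                         [1, 0, 0, 1, 1]]))

# In CSR format, X.indices holds the column index of every stored entry,
# so its unique values are the columns that contain at least one entry.
nonzero_cols = np.unique(X.indices)
X_reduced = X[:, nonzero_cols]   # still a sparse CSR matrix

print(nonzero_cols)              # [0 3 4]
print(X_reduced.toarray())
```

If the matrix might contain explicitly stored zeros (for example after arithmetic on its data), calling `X.eliminate_zeros()` first should make sure those entries do not keep an otherwise empty column alive. Note that it modifies `X` in place.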


...