Most of the data used for Data Science and Machine Learning is stored as a DataFrame or a NumPy array. In this article, we will be going over NumPy.
NumPy Why
NumPy is an open-source Python library for scientific computing. Machine Learning and Data Analysis need a lot of data to be stored and, more often than not, NumPy is used to store it. Some of its benefits are listed below:
- NumPy uses less space to store data as compared to lists. You can specify the data type of your array to reduce the memory consumed by the array.
- It is faster than lists for numerical work, because operations on whole arrays run in compiled code instead of a Python loop.
- It gives you the ability to perform arithmetic computations such as adding elements of two arrays, multiplying elements of two arrays etc. You can also use functions for EDA (Exploratory Data Analysis) such as min, max, mean etc.
import sys
import numpy as np

l = [1,2,3,4,5,6]
print("Size of list is", sys.getsizeof(l))
a1 = np.array([1,2,3,4,5,6])                 # default integer dtype (int64 here)
print("Size of a1 is", a1.nbytes)
a2 = np.array([1,2,3,4,5,6], dtype="int16")  # 2 bytes per element
print("Size of a2 is", a2.nbytes)
------------------ OUTPUT OF THE CODE-------------------------------
Size of list is 112
Size of a1 is 48
Size of a2 is 12
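As you can see, the list uses more than double the memory of the NumPy array, and sys.getsizeof does not even count the integer objects the list points to. The exact sizes vary with the Python version and platform.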
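The claim that NumPy is faster than lists can be checked with a rough timing sketch. The array size, the squaring operation, and the repeat count below are arbitrary choices made for illustration, and the timings depend on your machine, so no numbers are shown.

```python
import timeit

import numpy as np

n = 100_000                                   # arbitrary size chosen for illustration
values_list = list(range(n))
values_array = np.arange(n, dtype=np.int64)   # explicit dtype so squares cannot overflow

def square_with_list():
    # Pure-Python loop: every element is handled by the interpreter
    return [x * x for x in values_list]

def square_with_numpy():
    # Vectorized: the loop runs inside NumPy's compiled code
    return values_array * values_array

list_time = timeit.timeit(square_with_list, number=20)
numpy_time = timeit.timeit(square_with_numpy, number=20)

print("list comprehension:", list_time, "seconds")
print("NumPy array:", numpy_time, "seconds")
```

Both functions compute the same values, so any difference in the printed times comes from how the loop is executed.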
NumPy How
To install NumPy, type the following command
pip install numpy
To import the NumPy library you type the following code.
import numpy as np
The ‘as np’ part is optional. It creates an alias so that you can write ‘np’ instead of ‘numpy’ when invoking functions.
Creating an Array
You can use NumPy to create multi-dimensional arrays.
# 1D array
d1 = np.array([1,2,3,4,5,6])
print(d1)
# 2D array
d2 = np.array( [ [1,2,3] , [4,5,6] ])
print(d2)
------------------ OUTPUT OF THE CODE-------------------------------
[1 2 3 4 5 6]
[[1 2 3]
 [4 5 6]]
The syntax is
variable = np.array( [ elements/array of elements ])
You can specify the data type by adding the dtype argument, for example
dtype = "int16"
dtype = "int32"
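As a small sketch of what the dtype argument changes, the snippet below stores the same values with two integer types and prints the bytes used per element and in total. The variable names are only illustrative.

```python
import numpy as np

small = np.array([1, 2, 3], dtype="int16")
large = np.array([1, 2, 3], dtype="int64")

print(small.dtype, small.itemsize, small.nbytes)   # int16 2 6
print(large.dtype, large.itemsize, large.nbytes)   # int64 8 24

# A smaller type also has a smaller range, so check the limits
# before choosing it for your data.
limits = np.iinfo(np.int16)
print(limits.min, limits.max)                      # -32768 32767
```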
Shape and Dimension
.ndim returns the number of dimensions of your array, and .shape returns a tuple of length .ndim containing the length of the array along each dimension.
arr = np.array([ [1,2,3],[4,5,6],[6,7,8] ])
print(arr.ndim)
print(arr.shape) #arr is 3x3 matrix
------------------ OUTPUT OF THE CODE-------------------------------
2
(3, 3)
The variable arr is a 2-dimensional array with 9 elements in total. It contains three 1-dimensional arrays with 3 elements each. It is basically a 3×3 matrix.
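To see how .ndim and .shape relate for other dimensions, here is a minimal sketch with a 1D, a 2D and a 3D array. The names and sizes are made up for illustration, and .size (the total number of elements) is printed as well.

```python
import numpy as np

vector = np.array([1, 2, 3, 4])               # 1D
matrix = np.array([[1, 2, 3], [4, 5, 6]])     # 2D: 2 rows, 3 columns
cube = np.zeros((2, 3, 4))                    # 3D

for name, item in [("vector", vector), ("matrix", matrix), ("cube", cube)]:
    # len(item.shape) always equals item.ndim
    print(name, item.ndim, item.shape, item.size)
```

In each case the product of the numbers in .shape gives .size.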
Accessing Elements
Accessing a specific element in a NumPy array is like accessing an element in a list. The length of the tuple returned by .shape, which equals the value returned by .ndim, is the number of indices you need to access a single element of a multidimensional array.
# print a single element
# syntax: arr[row , col]
print(arr[0 , 0] , arr[2 , 2] , arr[1 , 2])
#prints all elements in row 0
print(arr[0 , :])
#prints all elements in col 0
print(arr[: , 0])
#prints all elements in the matrix
print(arr[:,:])
------------------ OUTPUT OF THE CODE-------------------------------
1 8 6
[1 2 3]
[1 4 6]
[[1 2 3]
 [4 5 6]
 [6 7 8]]
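The same indexing rule extends to more dimensions. The sketch below uses a hypothetical 3D array, built with np.arange and .reshape purely for convenience, to show that three indices are needed when .ndim is 3.

```python
import numpy as np

# Hypothetical 3D array: 2 blocks, each with 3 rows and 4 columns,
# filled with the numbers 0 to 23
cube = np.arange(24).reshape(2, 3, 4)

print(cube.ndim)          # 3, so three indices pick out one element
print(cube[1, 2, 3])      # 23, the last element
print(cube[0, 1, :])      # [4 5 6 7], all columns of row 1 in block 0
print(cube[-1, -1, -1])   # 23, negative indices count from the end
```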
Initializing Arrays
1. A multidimensional array with all zeros
print ( np.zeros((2)) ) # A 1D array with 2 elements set to zero
print("------------------------------")
print (np.zeros((3,3)) ) # A 3x3 matrix with all elements set to zero
------------------ OUTPUT OF THE CODE-------------------------------
[0. 0.]
------------------------------
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
Note the syntax, .zeros( ( dimension ) )
2. Multidimensional array with all ones
print ( np.ones((2)) ) # A 1D array with 2 elements set to one
print("------------------------------")
print (np.ones((3,3)) ) # A 3x3 matrix with all elements set to one
------------------ OUTPUT OF THE CODE-------------------------------
[1. 1.]
------------------------------
[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
Note the syntax, .ones( ( dimension ) )
3. A multidimensional array with each element equal to a number
print ( np.full((2) , 6 ) ) # A 1D array with 2 elements set to 6
print("------------------------------")
print (np.full((3,3),77)) #A 3x3 matrix with all elements set to 77
------------------ OUTPUT OF THE CODE-------------------------------
[6 6]
------------------------------
[[77 77 77]
 [77 77 77]
 [77 77 77]]
Note the syntax, .full( ( dimension ),number )
4. Multidimensional array with random decimals/integers
print ( np.random.rand(4,2) )
print("----------------------")
print (np.random.randint(0,100, size = (4,2) ) )
------------------ OUTPUT OF THE CODE-------------------------------
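[[0.97398125 0.86285608]
 [0.84382674 0.76331437]
 [0.71798434 0.2150087 ]
 [0.38535155 0.33849209]]
----------------------
[[ 5 10]
 [73 65]
 [29 38]
 [39 26]]
Note the syntax, .random.rand( dimension 1 , dimension 2 , ... ). Unlike the functions above, the dimensions are passed as separate arguments, not as a tuple, and the values are drawn uniformly from the interval [0, 1).
Note the syntax, .random.randint( min , max , size = ( dimension ) ). The min value is included and the max value is excluded.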
5. Identity Matrix
np.identity(3)
------------------ OUTPUT OF THE CODE-------------------------------
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
Since it is a matrix of size N x N, you only need to give a single value.
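You may have noticed that np.zeros and np.ones printed decimals (0. and 1.) while np.full printed integers. The sketch below, with illustrative variable names, shows where the data type of each initializer comes from and how the dtype argument can override it.

```python
import numpy as np

zeros_float = np.zeros((2, 2))             # default dtype is float64
zeros_int = np.zeros((2, 2), dtype=int)    # ask for integers explicitly
filled = np.full((2, 2), 7.5)              # dtype is inferred from the fill value
eye = np.identity(3, dtype=int)            # integer identity matrix

print(zeros_float.dtype)   # float64
print(zeros_int.dtype)     # a platform-dependent integer type
print(filled.dtype)        # float64
print(eye)
```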
BEWARE OF COPY
There is a correct way to copy a NumPy array and a wrong way. Obviously, you must use the correct way.
Always use the .copy() method to copy an array. Simply setting b = a will make both b and a point to the same array. As a result, all changes in b will also be reflected in a and this is often not desirable.
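The difference between the two ways can be seen in a short sketch. The names alias and independent are chosen here only for illustration.

```python
import numpy as np

a = np.array([1, 2, 3])

alias = a                  # no copy: both names refer to the same array
alias[0] = 100
print(a)                   # [100   2   3], a changed as well

independent = a.copy()     # a real copy with its own data
independent[1] = 200
print(a)                   # [100   2   3], untouched by the edit to the copy
print(independent)         # [100 200   3]

print(alias is a)          # True
print(independent is a)    # False
```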
Basic Arithmetic
You can perform basic arithmetic operations such as addition, multiplication etc on two or more arrays.
You must make sure the shapes of the arrays are the same, or at least compatible under NumPy's broadcasting rules.
The operation will be performed element by element. If we used the addition operation on array a and array b, the first element of a will be added to the first element of b, the second element of a will be added to the second element of b and so on. If we used the subtraction operation on array a and array b, the first element of b will be subtracted from the first element of a, the second element of b will be subtracted from the second element of a and so on.
We can also perform arithmetic operations on arrays with scalars. The operation will be performed element-wise.
a = np.array([1,2,3,4])
b = np.array([5,6,7,8])
print(a+b)
print(a-b)
print(b-a)
print(a*b)
print(a/2)
print(a + 2)
------------------ OUTPUT OF THE CODE-------------------------------
[ 6  8 10 12]
[-4 -4 -4 -4]
[4 4 4 4]
[ 5 12 21 32]
[0.5 1. 1.5 2. ]
[3 4 5 6]
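Arithmetic with a scalar is a special case of what NumPy calls broadcasting: a smaller array is stretched to match a larger one when the shapes are compatible. The following is a minimal sketch with made-up arrays showing one shape combination that works and one that raises an error.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])                # shape (3,)

# The row is added to each row of the matrix
total = matrix + row
print(total)

# Shapes (2, 3) and (2,) do not line up, so NumPy raises a ValueError
bad = np.array([1, 2])
error_message = None
try:
    matrix + bad
except ValueError as err:
    error_message = str(err)
    print("Cannot add:", error_message)
```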
Statistical Functions
NumPy also allows us to compute various statistics used for EDA, such as min, max, sum, mean etc.
a = np.random.randint(0,100,size=(10))
print(a)
print(a.max())
print(a.min())
print(a.sum())
print(a.mean())
------------------ OUTPUT OF THE CODE-------------------------------
[35 73 93 24 14 39 66 96 89 69]
96
14
598
59.8
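On a multidimensional array these functions accept an axis argument, which lets you compute a statistic per column or per row instead of over the whole array. Here is a small sketch with a hypothetical 2×3 array of scores.

```python
import numpy as np

scores = np.array([[35, 73, 93],
                   [24, 14, 39]])

print(scores.max())          # 93, over the whole array
print(scores.max(axis=0))    # [35 73 93], maximum of each column
print(scores.max(axis=1))    # [93 39], maximum of each row
print(scores.sum(axis=0))    # sum of each column
print(scores.mean(axis=1))   # mean of each row
```

A handy way to remember it: the axis you pass is the one that disappears from the shape of the result.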
Reshape
NumPy allows us to change the shape of an array. We can change a 2×3 array to a 3×2 array. We must be very careful while performing reshape operations since it often gets confusing.
Sanity Check: While reshaping, the product of the elements in the reshaped array’s .shape tuple must be equal to the product of the original array’s .shape tuple, i.e. the number of elements must be the same in the original and reshaped array.
You can also specify one of the elements in reshape as -1 and NumPy will calculate the unknown dimension for you. Only one of the elements can be specified as -1; if you set more than one element to -1, you will get an error. We often need to reshape our array when passing it to various functions. It might seem confusing but read the explanation below and things will be clear.
b = np.zeros( (3,2))
print("Original")
print(b)
print(b.shape)
print("---------------")
print("Reshaped")
print(b.reshape(2,3))
print(b.reshape(2,3).shape)
print("---------------")
print("Reshaped")
print(b.reshape(-1,3))
print(b.reshape(-1,3).shape)
print("---------------")
print("Reshaped")
print(b.reshape(2,-1))
print(b.reshape(2,-1).shape)
------------------ OUTPUT OF THE CODE-------------------------------
Original
[[0. 0.]
 [0. 0.]
 [0. 0.]]
(3, 2)
---------------
Reshaped
[[0. 0. 0.]
 [0. 0. 0.]]
(2, 3)
---------------
Reshaped
[[0. 0. 0.]
 [0. 0. 0.]]
(2, 3)
---------------
Reshaped
[[0. 0. 0.]
 [0. 0. 0.]]
(2, 3)
First, we created a 2-dimensional array, or matrix, with 3 rows and 2 columns and all elements set to 0. Next, we convert this array into an array with 2 rows and 3 columns using .reshape(2,3). Note that the product is 6 in both cases.
Next, we pass the values (-1,3). We tell NumPy that we want 3 columns and ask it to calculate the number of rows. The product must still be 6, so NumPy replaces -1 with 2 and we get a 2×3 matrix.
In the last case, we pass the values (2,-1). We tell NumPy that we want 2 rows and ask it to calculate the number of columns. The product must still be 6, so NumPy replaces -1 with 3 and we get a 2×3 matrix.
Below are a few more reshapes of our original array.
b.reshape(-1,8) gives an error since no integer multiplied to 8 will produce a 6. Always remember to do the sanity check before reshaping.
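A few further reshapes of the same 3×2 array, together with the failing case, can be sketched as follows. Only the shapes are printed to keep the output short.

```python
import numpy as np

b = np.zeros((3, 2))             # 6 elements, as in the example above

print(b.reshape(6).shape)        # (6,)
print(b.reshape(1, 6).shape)     # (1, 6)
print(b.reshape(6, 1).shape)     # (6, 1)
print(b.reshape(-1).shape)       # (6,), a single -1 flattens the array

# 6 is not divisible by 8, so no row count can satisfy (-1, 8)
reshape_error = None
try:
    b.reshape(-1, 8)
except ValueError as err:
    reshape_error = str(err)
    print("Reshape failed:", reshape_error)
```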
Conclusion
I hope I have helped you understand the basics of NumPy. There are many more functions available in the NumPy library and all of them are just a Google search away.