Pandas Essentials

Pandas is a Python library for data analysis. It provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive, and it has a user-friendly API.

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in biological sciences, finance, statistics, social science, and many areas of engineering. In a future article, I’ll explain why pandas needs more than one data structure, but for now, let’s get on with the flow.

This article focuses on introducing you to the essential attributes and methods Pandas possesses and how to use them.

List of topics:

pandas has two data structures, namely Series and DataFrame.

Essentially, we can think of a Series as a 1-dimensional container for scalars, and a DataFrame as a 2-dimensional container that holds one or more Series.


Let us create a Series and a DataFrame in the following cells.

A Pandas Series is created by passing a list of values to the Series() constructor and letting Pandas create a default integer index automatically.
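A minimal sketch of creating a Series from a list. The values are illustrative assumptions, not the article's original data.

```python
import numpy as np
import pandas as pd

# Illustrative values (assumed); np.nan marks a missing entry.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

print(s)
print(s.index)  # the default integer index created by pandas
```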

Now let’s create a DataFrame by passing a NumPy array, a DateTime index, and labeled columns to the DataFrame() constructor:
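One way this could look is sketched here. The start date, the random seed, and the use of random values are assumptions made only so the example is reproducible.

```python
import numpy as np
import pandas as pd

# Assumed start date; any 10 consecutive dates would do.
dates = pd.date_range("2021-01-01", periods=10)

# Assumed seed, used only for reproducibility.
rng = np.random.default_rng(0)

df = pd.DataFrame(
    rng.standard_normal((10, 10)),  # 10 x 10 NumPy array
    index=dates,                    # DatetimeIndex
    columns=list("ABCDEFGHIJ"),     # column labels A..J
)

print(df.shape)
```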

The resulting DataFrame has 10 rows and 10 columns, with an index of dates and the column names “A, B, C, D, E, F, G, H, I, J”.

We can also create a DataFrame from a Python dictionary. This is done by passing a dictionary of objects that can be converted to Series-like structures.
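A sketch of the dictionary approach follows. The keys and values are illustrative assumptions; scalars are broadcast to the length of the longest entry.

```python
import numpy as np
import pandas as pd

# Each dictionary value becomes one column (all values here are assumed examples).
df2 = pd.DataFrame(
    {
        "A": 1.0,                                      # scalar, broadcast to every row
        "B": pd.Timestamp("2021-01-02"),               # single timestamp, broadcast
        "C": pd.Series(1, index=range(4), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

print(df2.dtypes)  # each column keeps its own dtype
```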

Pandas has two methods used to view the rows of a dataset: head() and tail().

And two attributes used to view the index and columns of a dataset: index and columns.

These are helpful when you want to have a quick understanding of your dataset.

Let us see how they work in the following cells.
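A small sketch of viewing rows, assuming a stand-in 10 x 10 DataFrame named df filled with the integers 0 to 99 (the data and start date are assumptions).

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): 10 dates, columns A..J, values 0..99.
dates = pd.date_range("2021-01-01", periods=10)
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=dates, columns=list("ABCDEFGHIJ"))

first_rows = df.head()   # first 5 rows by default
last_rows = df.tail(3)   # last 3 rows

print(first_rows)
print(last_rows)
```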


To view the index and columns:
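A sketch of both attributes, again assuming a stand-in DataFrame named df (data and start date are assumptions).

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): 10 dates, columns A..J, values 0..99.
dates = pd.date_range("2021-01-01", periods=10)
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=dates, columns=list("ABCDEFGHIJ"))

print(df.index)    # row labels (a DatetimeIndex here)
print(df.columns)  # column labels
```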

The describe() method provides a quick summary of the numerical data contained in your dataset.
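As a sketch, describe() can be called on a stand-in DataFrame like this (the data and start date are assumptions).

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): 10 dates, columns A..J, values 0..99.
dates = pd.date_range("2021-01-01", periods=10)
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=dates, columns=list("ABCDEFGHIJ"))

summary = df.describe()  # count, mean, std, min, quartiles and max per column
print(summary)
```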


You can transpose your DataFrame with the T attribute, and sort it by an axis with sort_index() or by values with sort_values().

In pandas, rows and columns are known as axis 0 and axis 1 respectively.
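A sketch of transposing and of sorting along each axis, assuming a stand-in DataFrame named df (data and start date are assumptions).

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): 10 dates, columns A..J, values 0..99.
dates = pd.date_range("2021-01-01", periods=10)
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=dates, columns=list("ABCDEFGHIJ"))

transposed = df.T                                     # rows become columns
cols_desc = df.sort_index(axis=1, ascending=False)    # axis 1: column labels J..A
rows_desc = df.sort_index(axis=0, ascending=False)    # axis 0: latest date first

print(transposed.head())
print(cols_desc.head())
```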


Calling sort_values() with column B sorts the DataFrame by the values of that column. The sort is ascending by default, but that can be changed by setting the argument ascending=False.
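A sketch of sorting by column B in both directions. The random data and seed are assumptions, used so the column starts out unsorted.

```python
import numpy as np
import pandas as pd

# Stand-in random data (assumed seed and start date).
dates = pd.date_range("2021-01-01", periods=10)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 10)), index=dates, columns=list("ABCDEFGHIJ"))

ascending_b = df.sort_values(by="B")                    # smallest B first (default)
descending_b = df.sort_values(by="B", ascending=False)  # largest B first

print(ascending_b["B"])
```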

Although standard Python/NumPy expressions for selecting and setting are intuitive and good for interactive work, the optimized Pandas data access methods are highly recommended for production code. These methods include .at, .iat, .loc and .iloc.


Selecting with index and column names can be achieved like so:
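A sketch of label-based selection with .loc, assuming a stand-in DataFrame named df (data and start date are assumptions). Note that label slices include both endpoints.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): 10 dates, columns A..J, values 0..99.
dates = pd.date_range("2021-01-01", periods=10)
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=dates, columns=list("ABCDEFGHIJ"))

# Rows from Jan 2 through Jan 4 (inclusive), columns A and B only.
subset = df.loc["2021-01-02":"2021-01-04", ["A", "B"]]
print(subset)
```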


Selecting a single row label reduces the dimension of the returned object:
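As a sketch, passing one row label instead of a slice returns a Series rather than a DataFrame (the stand-in data is an assumption).

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): 10 dates, columns A..J, values 0..99.
dates = pd.date_range("2021-01-01", periods=10)
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=dates, columns=list("ABCDEFGHIJ"))

# One row label plus a list of columns gives a 1-dimensional Series.
row = df.loc[dates[0], ["A", "B"]]
print(type(row))
print(row)
```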

A scalar value can be obtained by:

For getting fast access to a scalar (equivalent to the method above):
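Both forms can be sketched side by side: .loc with a single row and column label, and .at as the faster scalar accessor (the stand-in data is an assumption).

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): 10 dates, columns A..J, values 0..99.
dates = pd.date_range("2021-01-01", periods=10)
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=dates, columns=list("ABCDEFGHIJ"))

via_loc = df.loc[dates[1], "A"]  # general label-based lookup
via_at = df.at[dates[1], "A"]    # scalar-only lookup, same result

print(via_loc, via_at)
```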

Selection can be done via the position of the passed integers:
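A sketch of selecting one row by integer position with .iloc (the stand-in data is an assumption).

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): 10 dates, columns A..J, values 0..99.
dates = pd.date_range("2021-01-01", periods=10)
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=dates, columns=list("ABCDEFGHIJ"))

fourth_row = df.iloc[3]  # positions start at 0, so this is the fourth row
print(fourth_row)
```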

By integer slices, acting similarly to NumPy/Python:
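A sketch with integer slices, where the end position is excluded as in plain Python (the stand-in data is an assumption).

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): 10 dates, columns A..J, values 0..99.
dates = pd.date_range("2021-01-01", periods=10)
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=dates, columns=list("ABCDEFGHIJ"))

block = df.iloc[3:5, 0:2]  # rows 3 and 4, columns 0 and 1
print(block)
```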


By lists of integer position locations, similar to NumPy/Python:
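A sketch with lists of positions, which need not be contiguous (the stand-in data is an assumption).

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): 10 dates, columns A..J, values 0..99.
dates = pd.date_range("2021-01-01", periods=10)
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=dates, columns=list("ABCDEFGHIJ"))

picked = df.iloc[[1, 2, 4], [0, 2]]  # rows 1, 2, 4 and columns 0, 2
print(picked)
```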


For slicing rows explicitly:

For slicing columns explicitly:
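The two explicit slices can be sketched as follows, using a bare colon to keep every entry on the other axis (the stand-in data is an assumption).

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): 10 dates, columns A..J, values 0..99.
dates = pd.date_range("2021-01-01", periods=10)
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=dates, columns=list("ABCDEFGHIJ"))

some_rows = df.iloc[1:3, :]  # rows 1 and 2, all columns
some_cols = df.iloc[:, 1:3]  # all rows, columns 1 and 2

print(some_rows)
print(some_cols)
```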


For getting a value explicitly:

For fast access to a scalar:
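A sketch of both positional scalar lookups, .iloc and the faster .iat (the stand-in data is an assumption).

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): 10 dates, columns A..J, values 0..99.
dates = pd.date_range("2021-01-01", periods=10)
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=dates, columns=list("ABCDEFGHIJ"))

via_iloc = df.iloc[1, 1]  # explicit positional lookup
via_iat = df.iat[1, 1]    # scalar-only positional lookup, same result

print(via_iloc, via_iat)
```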

Setting a new column automatically aligns the data by index.

Assign a Series to the ‘F’ column of our DataFrame object:
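A sketch of index alignment on assignment. The Series here is hypothetical: it uses the same dates as the DataFrame but in reverse order, so the result shows that values are matched by label rather than by position.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): 10 dates, columns A..J, values 0..99.
dates = pd.date_range("2021-01-01", periods=10)
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=dates, columns=list("ABCDEFGHIJ"))

# Hypothetical Series: values 0..9 indexed by the same dates, latest date first.
s1 = pd.Series(range(10), index=dates[::-1])

df["F"] = s1  # each value lands on the row with the matching date
print(df["F"])
```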

Setting values by label

This sets the value of the first row in column ‘A’ to zero.

Setting values by position

This sets the value of the first row in column ‘B’ to zero.
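The two assignments described above can be sketched with .at (by label) and .iat (by position). The stand-in data, with values 1 to 100, is an assumption chosen so the zeros are visible.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): 10 dates, columns A..J, values 1..100.
dates = pd.date_range("2021-01-01", periods=10)
df = pd.DataFrame(np.arange(1, 101).reshape(10, 10), index=dates, columns=list("ABCDEFGHIJ"))

df.at[dates[0], "A"] = 0  # by label: first row, column A
df.iat[0, 1] = 0          # by position: first row, second column (B)

print(df.head(2))
```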

Setting by assigning a NumPy array
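A sketch of overwriting a whole column with a NumPy array. The choice of column D and the fill value 5 are assumptions.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): 10 dates, columns A..J, values 0..99.
dates = pd.date_range("2021-01-01", periods=10)
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=dates, columns=list("ABCDEFGHIJ"))

# The array length must match the number of rows.
df.loc[:, "D"] = np.array([5] * len(df))

print(df["D"])
```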

View the changes made to our DataFrame object by the above setting operations.

To obtain a particular column in a DataFrame
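A sketch of selecting one column, which returns a Series (the stand-in data is an assumption).

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): 10 dates, columns A..J, values 0..99.
dates = pd.date_range("2021-01-01", periods=10)
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=dates, columns=list("ABCDEFGHIJ"))

col_a = df["A"]  # df.A works too when the name is a valid Python identifier
print(col_a)
```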

Using a single column’s values to select data:
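A sketch of filtering rows by a condition on one column. The threshold of 40 is an assumption that suits the stand-in data.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): 10 dates, columns A..J, values 0..99.
dates = pd.date_range("2021-01-01", periods=10)
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=dates, columns=list("ABCDEFGHIJ"))

# Keep only the rows where column A exceeds 40.
large_a = df[df["A"] > 40]
print(large_a)
```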

Selecting values from a DataFrame where a boolean condition is met.
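A sketch of applying a condition to the whole DataFrame. Entries that fail the condition come back as NaN in the result; the stand-in data, shifted to include negative values, is an assumption.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): 10 dates, columns A..J, values -50..49.
dates = pd.date_range("2021-01-01", periods=10)
df = pd.DataFrame(np.arange(100).reshape(10, 10) - 50, index=dates, columns=list("ABCDEFGHIJ"))

positive = df[df > 0]  # same shape as df; non-positive entries become NaN
print(positive)
```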


Using the isin() method for filtering:
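A sketch of isin(), which keeps the rows whose column value appears in a given list. The column and the values 10 and 30 are assumptions.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): 10 dates, columns A..J, values 0..99.
dates = pd.date_range("2021-01-01", periods=10)
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=dates, columns=list("ABCDEFGHIJ"))

# Keep rows whose value in column A is either 10 or 30.
selected = df[df["A"].isin([10, 30])]
print(selected)
```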


Pandas primarily uses the value np.nan to represent missing data. By default, missing values are not included in computations.

Reindexing allows you to change, add, or delete the index on a specified axis. This returns a copy of the data.
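A sketch of reindex() that keeps a subset of existing rows and columns (the chosen labels are assumptions). Any label not present in the original would be filled with NaN, which this example avoids.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): 10 dates, columns A..J, values 0..99.
dates = pd.date_range("2021-01-01", periods=10)
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=dates, columns=list("ABCDEFGHIJ"))

# Keep the first four dates and three columns; df itself is left unchanged.
df1 = df.reindex(index=dates[0:4], columns=["A", "B", "C"])
print(df1)
```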


To get the boolean mask where values are NaN:
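A sketch of building the mask with pd.isna(). The stand-in data is an assumption and contains no missing values, so every entry of the mask is False.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): 10 dates, columns A..J, values 0..99, nothing missing.
dates = pd.date_range("2021-01-01", periods=10)
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=dates, columns=list("ABCDEFGHIJ"))

mask = pd.isna(df)             # True wherever a value is NaN
has_missing = mask.any().any() # collapse the mask to a single answer

print(mask.head())
print(has_missing)
```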


Doesn’t look like we have any missing data here 😋 But let’s continue with the tutorial anyway 🤓

To drop any rows that have missing data:

Filling missing data:
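A sketch of dropna(). Since a dataset without gaps would make the call a no-op, this example plants one hypothetical NaN in a copy of the stand-in data first.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): 10 dates, columns A..J, values 0..99 as floats.
dates = pd.date_range("2021-01-01", periods=10)
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=dates, columns=list("ABCDEFGHIJ"))

df_missing = df.astype(float)
df_missing.iat[0, 0] = np.nan  # hypothetical gap, added only for the demo

cleaned = df_missing.dropna(how="any")  # drop every row that has at least one NaN
print(cleaned.shape)
```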
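A sketch of fillna(), again with one hypothetical NaN planted in a copy of the stand-in data. The fill value 5 is an assumption.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): 10 dates, columns A..J, values 0..99 as floats.
dates = pd.date_range("2021-01-01", periods=10)
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=dates, columns=list("ABCDEFGHIJ"))

df_missing = df.astype(float)
df_missing.iat[0, 0] = np.nan  # hypothetical gap, added only for the demo

filled = df_missing.fillna(value=5)  # returns a new DataFrame with NaN replaced
print(filled.head(2))
```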

Summary

This is just a quick introduction to the Pandas workflow, as there are many topics and concepts not described here. Full understanding and mastery only come with regular, consistent practice, which will help a great deal on your journey to becoming a Data Scientist.
