Pandas fluency is essential for any Python-based data professional, for people interested in trying a Kaggle challenge, or for anyone seeking to automate a data process. The aim of this post is to help beginners get to grips with the basic data format for Pandas: the DataFrame.

We will examine basic methods for creating data frames, what a DataFrame actually is, renaming and deleting data frame columns and rows, and where to go next to further your skills.

The topics in this post will hopefully enable you to: load your data from a file into a Python Pandas DataFrame, examine the basic statistics of the data, change some values, and finally output the result to a new file.
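To make those four steps concrete, here is a minimal sketch of the whole round trip. The file names, column names, and values are made up for illustration and are not the data set used later in this post; a temporary directory is used so the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical input and output files, created only for this example.
    source = Path(tmp) / "input.csv"
    result = Path(tmp) / "output.csv"
    source.write_text("item,quantity\napples,3\npears,5\nplums,4\n")

    # 1. Load the data from a file into a DataFrame.
    df = pd.read_csv(source)

    # 2. Examine the basic statistics of the numeric columns.
    summary = df.describe()

    # 3. Change some values: set the quantity for the "pears" row.
    df.loc[df["item"] == "pears", "quantity"] = 6

    # 4. Output the result to a new file, then read it back to check it.
    df.to_csv(result, index=False)
    reloaded = pd.read_csv(result)

print(summary)
print(reloaded)
```

Passing `index=False` to `to_csv` keeps the automatically generated row index out of the output file, which is usually what you want when the index carries no meaning.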

What is a Python Pandas DataFrame?

In plain terms, think of a DataFrame as a table of data. There can be multiple rows and columns in the data.

Each row represents a sample of data. Each column contains a different variable that describes the samples (rows). The data in every column is usually of the same type, for example numbers, strings, or dates. Usually, unlike an Excel data set, DataFrames avoid missing values, and there are no gaps or empty values between rows or columns.

By way of example, the following data sets would fit well in a Pandas DataFrame. In a school system DataFrame, each row could represent a single student in the school, and columns may represent the student's name (string), age (number), date of birth (date), and address (string). In an economics DataFrame, each row may represent a single city or geographical area, and columns might include the name of the area (string), the population (number), the average age of the population (number), the number of households (number), the number of schools in each area (number), etc. In a shop or e-commerce system DataFrame, each row may be used to represent a customer, where there are columns for the number of items purchased (number), the date of original registration (date), and the credit card number (string).

Manually entering data

The start of every data science project will include getting useful data into an analysis environment, in this case Python.

Using Python dictionaries and lists to create DataFrames only works for small datasets that you can type out manually.
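As a rough sketch of what manual entry looks like, the example below builds a small school-style DataFrame from a dictionary of lists, and a second one from a list of rows. The student names, ages, and dates are invented for illustration.

```python
import pandas as pd

# Dictionary of lists: each key becomes a column, each list holds that column's values.
students = pd.DataFrame({
    "name": ["Ana", "Ben", "Cara"],
    "age": [14, 15, 14],
    "date_of_birth": pd.to_datetime(["2010-03-01", "2009-07-15", "2010-11-30"]),
})

# List of lists: each inner list is one row, and column names are supplied separately.
rows = [["Ana", 14], ["Ben", 15]]
from_rows = pd.DataFrame(rows, columns=["name", "age"])

print(students)
print(students.dtypes)  # one data type per column
print(from_rows)
```

The dictionary form tends to be easier to read when you think in columns, while the list-of-rows form is closer to how data is typed into a spreadsheet.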

There are other ways to format manually entered data which you can check out here. However, for simplicity, sometimes extracting data directly to CSV and using that is preferable. You can download the CSV file from Kaggle, or directly from here.

The data is nicely formatted, and you can open it in Excel at first to get a preview. The sample data for this post consists of global food production information spanning to The sample data contains 21, rows of data, with each row corresponding to a food source from a specific country.

Some installation instructions are here. Printing is a convenient way to preview your loaded data: you can confirm that column names were imported correctly, that the data formats are as expected, and whether there are missing values anywhere. In a Jupyter notebook, simply typing the name of a data frame will result in neatly formatted output.
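A minimal sketch of loading and previewing a CSV is shown below. To keep it runnable without downloading anything, it reads a tiny made-up CSV from an in-memory string; with a real file you would pass its path to `pd.read_csv` instead. The column names here are placeholders, not those of the food production data set.

```python
import io

import pandas as pd

# Tiny stand-in CSV with one deliberately missing value (hypothetical data).
csv_text = """area,item,year_a,year_b
Testland,Wheat,10,12
Testland,Rice,,7
Otherland,Wheat,5,6
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())            # first rows of the table
print(df.columns.tolist())  # confirm the column names were imported correctly
print(df.dtypes)            # confirm the data formats are as expected

# Count the missing values in each column.
missing = df.isna().sum()
print(missing)
```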

This is an excellent way to preview data; note, however, that by default only a limited number of rows and 20 columns will print. You can see the full set of options available in the official Pandas options and settings documentation.

DataFrame rows and columns with .shape
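If the default limits get in the way, the display options can be read and changed. The sketch below uses `pd.get_option` and `pd.option_context`, which applies new limits only inside the `with` block; the limits of 500 rows and 50 columns are arbitrary values chosen for illustration, and the wide DataFrame is generated on the spot.

```python
import pandas as pd

# A throwaway DataFrame that is wider than the usual column limit.
wide = pd.DataFrame({f"col_{i}": range(200) for i in range(30)})

# Read the current row limit so it can be compared afterwards.
default_rows = pd.get_option("display.max_rows")

# Temporarily raise the limits; they revert when the block ends.
with pd.option_context("display.max_rows", 500, "display.max_columns", 50):
    inside_rows = pd.get_option("display.max_rows")
    print(wide.head(3))  # all 30 columns fit within the raised column limit

print(default_rows, inside_rows)
```

For a change that should last for the whole session, `pd.set_option("display.max_rows", 500)` does the same thing without the automatic reset.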

Get the shape of your DataFrame (the number of rows and columns) using the .shape attribute.

Our food production data contains 21, rows, each with 63 columns, as seen by the output of .shape.
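As a small illustration, pandas exposes the row and column counts through the `shape` attribute, which is a tuple of (rows, columns). The three-row DataFrame below is made up for the example and stands in for the much larger food production data.

```python
import pandas as pd

# Small stand-in DataFrame (hypothetical values).
df = pd.DataFrame({
    "area": ["Testland", "Testland", "Otherland"],
    "item": ["Wheat", "Rice", "Wheat"],
    "value": [10, 7, 5],
})

# shape is an attribute, not a method, so there are no parentheses.
n_rows, n_cols = df.shape
print(df.shape)

# The same counts are available separately.
print(len(df), len(df.columns))
```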

