How to Create Pandas DataFrames: A Hands-On Guide

The first and most important steps in any data science project are creating and loading data into Pandas DataFrames. Pandas is an essential part of the Python data science ecosystem. Data scientists merge, manipulate, and analyze tabular data with Pandas DataFrames. Finally, they prepare data for machine learning. 

Real-world data typically comes unclean, unorganized, and unstructured. The main challenges at the beginning of any project are to: 

  1. Define the data needed for the project.
  2. Find and identify the right data source(s).
  3. Load the data from the data source into Pandas DataFrames.
  4. Clean, process, and manipulate the data.



The Complete Pandas Bootcamp 2020: Data Science with Python <!––>

Last Updated October 2020

Highest Rated

  • 307 lectures
  • All Levels


4.7 (1,483)

Pandas fully explained | 150+ Exercises | Must-have skills for Machine Learning & Finance | + Scikit-Learn and Seaborn | By Alexander Hagmann


Explore Course

In this article, I am going to explain how to load data from the most common data sources into Pandas DataFrames.  You will learn how to combine steps 3 and 4: cleaning and shaping up the data as much as possible when loading it into the row and column format of a DataFrame. This saves a lot of time and extra workflows. Put simply, the smoother the data import, the more efficient the entire project.  

Data sources – an overview

We can categorize data sources into four groups: Python (and its basic data structures), local files, the internet, and databases. Some argue that there is another source: data stored in other Pandas DataFrames and Pandas Series. 

  • Python: You can create a DataFrame from data that is stored in other (basic) Python data structures like lists, dictionaries, and NumPy arrays. 
  • Local Files: Between two Pandas coding sessions, it is a very common workflow to store the data in local files. The most commonly used file types are Excel, comma-separated values) (CSV), and JavaScript object notation (JSON). These file types are used to share and exchange tabular data with others. While CSV files and Excel files are used for simple two-dimensional data, you can use JSON files to store more complex, nested data.      
  • Internet: Commercial data providers allow data downloads via web application programming interfaces(APIs). Web APIs can be free or fee-based. You can pull data from Web APIs by submitting customized HTTP requests. After that, you receive the requested data in CSV files or JSON files. Scraping websites is another way to get data from the web into Python and Pandas. Keep in mind that web scraping is a legal grey area. 
  • Databases: Relational databases are a very common way to store large and complex data. Several Python libraries allow you to connect to local and web-based (cloud) database servers. Once connected, there are a few solutions available to load data from databases into Python and Pandas. 
  • Pandas: You can change, manipulate, aggregate, and merge existing DataFrames to create new DataFrames. Pandas Series can be converted into a DataFrame and multiple Pandas Series can be concatenated into one DataFrame.   

Understanding Pandas DataFrames

Let’s have a look at the following small DataFrame df that contains information on five football players:

A DataFrame is a 2-dimensional labeled data structure. In our example, df has five rows and five columns. Each row is a football player (for example, “Lionel Messi). Each column contains information on the players (for example, height in meters). The column “Name” on the left side isn’t a column. It’s the index of the DataFrame. The index labels the rows. In our example, the rows are labeled by the players´ names. If not specified, DataFrames have a RangeIndex with ascending integers (0, 1, 2, …,). At the top, we can find the column headers (for example, “Country”). It’s best practice to have unique row labels and unique column headers. This allows you to identify rows and columns clearly.

Follow three rules when creating DataFrames: 

  1. Organize observations (for example, football players) in rows
  2. Organize features/variables (for example,  the height of the players) in columns 
  3. Homogeneous data: all elements in a column must have the same data type (dtype)

You can check the data types with the info() method.

df.info()

The dtype object either indicates string/text data or mixed data types. In our example, we have homogeneous string/text data in the columns “Country and “Club_2019”. Therefore, df meets all three conditions.

How to create DataFrames with basic data structures in Python

As a first step, import the Pandas Library with import pandas as pd whenever you work with Pandas.

 import pandas as pd

In case you already have the data in basic Python structures, you can create a Pandas DataFrame object with pd.DataFrame(). The next steps you follow depend on how the data is organized. There are two major scenarios: 1. you already have the columns in lists/arrays. 2. you already have the rows in lists/arrays.  

  1. Having the columns in lists (‘dictionary scenario’)

Let’s assume you have the columns (excl. headers) already stored in the lists: country, club, wc, height, goals. In addition, you have the row labels in the list: names.   

names = ["Lionel Messi", "Cristiano Ronaldo", "Neymar Junior", "Kylian Mbappe",
     	  "Manuel Neuer"]
country = ["Argentina", "Portugal", "Brazil", "France", "Germany"]
club = ["FC Barcelona", "Juventus FC", "Paris SG", "Paris SG", "FC Bayern"]
wc = [False, False, False, True, True]
height = [1.70, 1.87, 1.75, 1.78, 1.93]
goals = [51, 28, 23, 39, 0]

In this scenario, it’s best to create a dictionary with all columns. Each key-value pair in the dictionary consists of an appropriate column header (for example,  Club_2019) as key and the respective list as value (club). Let’s create the dictionary data:  

data = {"Country": country,
   	"Club_2019": club,
   	"WC": wc,
   	"Height_m": height,
   	"Goals_2019": goals
  	}

We are ready to create the DataFrame object df with pd.DataFrame(). Pass the dictionary data to the parameter data and define that names should be the index of the DataFrame with index = names.

df = pd.DataFrame(data = data, index = names)

Finally, you can assign a name for the index with df.index.name =. In our example, we assign “Name”.  

df.index.name = "Name"

Let’s have a final look at df:

  1. Having the rows in lists (‘nested lists scenario’)

Let’s assume you have the selected rows (incl. row labels) already stored in the lists: messi, ronaldo, neymar, mbappe, neuer.

messi = ["Lionel Messi", "Argentina", "FC Barcelona", False, 1.7, 51]
ronaldo = ["Cristiano Ronaldo", "Portugal", "Juventus FC", False, 1.87, 28]
neymar = ["Neymar Junior", "Brazil", "Paris SG", False, 1.75, 23]
mbappe = ["Kylian Mbappe", "France", "Paris SG", True, 1.78, 39]
neuer = ["Manuel Neuer", "Germany", "FC Bayern", True, 1.93, 0]

 In addition, you have the desired column headers in the list “headers.”

headers = ["Name", "Country", "Club_2019", "WC", "Height_m", "Goals_2019" ]

In this scenario, it’s best to create a list of lists. Let’s put all rows into the nested list “data”:

data = [messi, ronaldo, neymar, mbappe, neuer]

We are ready to create the DataFrame object df with pd.DataFrame(). Pass the nested list “data to the parameter data and define that “headers” should be the column headers of the DataFrame with columns = headers.

df = pd.DataFrame(data = data, columns = headers)

Let’s inspect df

In this scenario, we end up with six columns (incl. the names) and a RangeIndex. You can set the index to the column Name with the method set_index(). To change the DataFrame object df with the new index, set inplace = True. Otherwise, the change of the index is not saved in memory.  

players.set_index("Name", inplace = True)

We are finally there:

There is one scenario left: What if we start with a dictionary data that has the wrong data organization: Each key-value pair is a row/observation?

data = {"Lionel Messi": ["Argentina", "FC Barcelona", False, 1.7, 51],
    	"Cristiano Ronaldo": ["Portugal", "Juventus FC", False, 1.87, 28],
    	"Neymar Junior": ["Brazil", "Paris SG", False, 1.75, 23],
    	"Kylian Mbappe": [ "France", "Paris SG", True, 1.78, 39],
    	"Manuel Neuer": ["Germany", "FC Bayern", True, 1.93, 0]
  	}

If you pass the dictionary data to pd.DataFrame(), you’ll end up with a DataFrame where observations are in columns and features are in rows. This isn’t a DataFrame you can work with! You can fix that problem with a few Pandas commands. But there is a better way that allows you to avoid that issue completely. It’s best to reorganize the dictionary data and create a nested list. 

nested_list = []
for key, value in data.items():
	value.insert(0, key)
	nested_list.append(value)

We are back in scenario 2:  

df = pd.DataFrame(data = nested_list, columns = headers)
df.set_index("Name", inplace = True)

How to load datasets from local files into Pandas DataFrames 

You can load datasets from local files on your computer into Pandas with the pd.read_xxx() family:

  • Load CSV files with pd.read_csv()
  • Load Excel files with pd.read_excel()
  • Load JSON files with pd.read_json()
  • and a few more…

pd.read_csv() and pd.read_excel() are very similar and share most of the options and parameters.

5 Things you should know when loading data from CSV and Excel files

  1. Location

The first and most important thing you need to know when loading data from local files: the location of the file. Pass the full file path/name as a string to the parameter filepath_or_buffer. The following is a template to create the DataFrame object df from CSV and Excel files:  

df = pd.read_csv(filepath_or_buffer = full_file_name) #CSV file
df = pd.read_excel(filepath_or_buffer = full_file_name) #Excel file

Note that you can omit ‘filepath_or_buffer =’.

Let’s assume that the CSV file players.csv is located on my desktop. When opening the file, we can see the following structure:

Name,Country,Club_2019,WC,Height_m,Goals_2019
Lionel Messi,Argentina,FC Barcelona,False,1.7,51
Cristiano Ronaldo,Portugal,Juventus FC,False,1.87,28
Neymar Junior,Brazil,Paris SG,False,1.75,23
Kylian Mbappe,France,Paris SG,True,1.78,39
Manuel Neuer,Germany,FC Bayern,True,1.93,0

A CSV file is a delimited text file that uses a comma to separate values. You can still see the tabular data structure. Each line of the file is a data record (a football player). Each record comprises one or more values, separated by commas.

The full file name on Windows could be C:Usersalexdesktopplayers.csv

The full file name on macOS and Linux could be: /Users/alex/desktop/players.csv

Please note that Windows uses the backslash (“”) instead of the slash (“/”). Since backslash is a special character in Python, using the following code will drop an error:

df = pd.read_csv("C:Usersalexdesktopplayers.csv")

There are two ways how to fix this issue:

  • Use forward slash (“/”):
df = pd.read_csv("C:/Users/alex/desktop/players.csv")
  • Use the prefix r:
df = pd.read_csv(r"C:Usersalexdesktopplayers.csv")

On macOS and Linux the single best solution is:

df = pd.read_csv("/Users/alex/desktop/players.csv")

In case the file players.csv is in your current working directory (CWD), it is sufficient to pass the filename players.csv without the full path. Be aware that the CWD can vary and depends on your system and your Python Installation.

df = pd.read_csv("players.csv")

Loading the players dataset from the Excel file players.xlsx works accordingly. 

df = pd.read_excel(r"C:Usersalexdesktopplayers.xlsx") # Windows
  1. Setting an Index

You can select a column as the index of the DataFrame. The column of your choice should contain unique values only (no duplicates). In our example, setting the Name column as the index is reasonable and can be done with index_col = “Name.

df = pd.read_csv(..., index_col = "Name")

Instead of passing the column header, you can also pass the column index position. In our example, Name is at column index position 0.

df = pd.read_csv(..., index_col = 0)

If you do not specify an index, Pandas creates a RangeIndex. 

(Loading the players dataset from the Excel file players.xlsx works the same way.)

  1. Selecting columns

There is no need to load all columns into Pandas. You can select specific columns by passing a list with the column headers to the parameter usecols. As an example, you can load the columns Name, Country, and Goals_2019 with usecols = [“Name”, “Country”, “Goals_2019”]

df = pd.read_csv(..., usecols = ["Name", "Country", "Goals_2019"])

This creates the DataFrame df with a RangeIndex. Of course, you can combine usecols and index_col

df = pd.read_csv(..., usecols = ["Name", "Country", "Goals_2019"],
                 index_col = "Name")

Loading the players dataset from the Excel file players.xlsx works accordingly. But there is one more option. Instead of passing a list with column headers, you can also specify Excel columns (A, B, C,…) in a string: usecols = “A, B, D”. 

df = pd.read_excel(..., usecols = "A, B, D")

This loads the Excel columns A, B and D into Pandas.

  1. Column headers

Sometimes, there are no column headers in the external file. This dataset starts with the first observation (Lionel Messi). Let’s consider the CSV files players.csv without column headers:

Lionel Messi,Argentina,FC Barcelona,False,1.7,51
Cristiano Ronaldo,Portugal,Juventus FC,False,1.87,28
Neymar Junior,Brazil,Paris SG,False,1.75,23
Kylian Mbappe,France,Paris SG,True,1.78,39
Manuel Neuer,Germany,FC Bayern,True,1.93,0

With header = None you specify that there are no column headers in the file. header = None is typically used in combination with the parameter names. You can pass a list of appropriate column headers to names:

df = pd.read_csv(..., header = None, names = ["Name", "Country", "Club_2019",
                                              "WC", "Height_m", "Goals_2019"])

In case the file contains column headers that are not appropriate, you can change those headers with name (don’t use header = None here!). 

df = pd.read_csv(..., names = ["Name", "Country", "Club_2019", "WC",
                               "Height_m", "Goals_2019"])
  1. Dates and Time

Some Datasets have columns that contain date and time information (‘datetime’). The following CSV file stocks.csv contains daily stock prices for Microsoft (MSFT) and Apple (AAPL):

Date,AAPL,MSFT
2020-05-04,293.16,178.84
2020-05-05,297.56,180.76
2020-05-06,300.63,182.54
2020-05-07,303.74,183.60
2020-05-08,310.13,184.68

If not specified, Pandas loads datetime information as string/object data type. Most times, it is desirable to convert the data type of those columns into datetime64 by passing the column header(s) in a list to the parameter parse_dates. This is often used in combination with index_col to create a DatetimeIndex. Managing and analyzing financial data with Pandas is easy with a DatetimeIndex.

df = pd.read_csv(..., parse_dates = ["Date"], index_col = "Date")
df.info()

There are more options to customize the data import with pd.read_csv() and pd.read_excel(). Learn more on how to import data from messy and unclean CSV and Excel files

How to load data from JSON files

The following is a template to create the DataFrame object df from the JSON file players.json with pd.read_json():  

df = pd.read_json(full_file_name)

There are very few options available when loading data from JSON files. JSON files are used to store and transfer complex and nested datasets. Sometimes, you must use the parameter orient or to flatten the data with pd.json_normalize() (learn more). 

How to load datasets from the internet into Pandas DataFrames

Platforms like Twitter, Yahoo Finance, or The Movies Database allow users to retrieve data via their web APIs. The API documentation contains detailed instructions about using the web API. Users need to send HTTP requests (as defined in the API documentation) to the web server and receive the data in CSV or JSON file format. The requests library is the standard for making HTTP requests in Python. Finally, the data can be loaded into Pandas (see 2 examples).

In simple cases, you can directly load CSV files from the web into Pandas with pd.read_csv() by passing the URL as a string to filepath_or_buffer =.

df = pd.read_csv("https://.../players.csv")

With pd.read_html() you can read all tables from a website by passing the URL to io =.

df_list = pd.read_html(io = "https://...")

Note that pd.read_html() returns a list of DataFrames.

How to load datasets from SQL Databases into Pandas DataFrames

You can read tables from SQL Databases like SQLite, MySQL, PostgreSQL, and more with pd.read_sql().

df = pd.read_sql(sql = "SQL query", con = connection)

The sql parameter requires a SQL query in SQL language. Before you can pull data from the database, you must create a connection to the database and pass the connection object to the parameter con.

Depending on the database system you choose, you also have to install and import Python libraries like sqlite3 or SQLAlchemy to create the connection (see an example with SQLite).  

How to create new Pandas DataFrames from other DataFrames and Pandas Series  

Many workflows create new DataFrames from existing DataFrames and Series: filtering, aggregation, manipulation, merging, joining, concatenating, and more. Let me show two examples:

  1. Filtering DataFrames

Let´s filter the players DataFrame df and create a new DataFrame tall with only those players that are taller than 1.75 meters:

tall = df[df.Height_m > 1.75].copy()
tall

To avoid any problems when working with tall and df, chain the copy() method. This creates and saves a new DataFrame object in memory that is independent of the original DataFrame.  

  1. Converting a Pandas Series into a DataFrame

When selecting one column of a DataFrame (for example, “Goals_2019”), Pandas creates a Pandas Series. Let’s create the Series “goals”: 

goals = df.Goals_2019.copy()
goals

A Pandas Series is a one-dimensional labeled array. DataFrame objects and Series objects behave similarly and share many methods. But they are not identical. Sometimes, it’s beneficial to convert a Series into a DataFrame with one column by using the method to_frame().  

goals = goals.to_frame()
goals

Conclusion

You can create Pandas DataFrames in many ways. The first and most important question you have to answer is: Where is the data coming from? Once you know the data source, you can select the appropriate tool to load the data into Pandas. The following table gives an overview: 

The post How to Create Pandas DataFrames: A Hands-On Guide appeared first on Udemy Blog.

Read More

Total
0
Share