
pandas read_csv – Python Pandas Tutorial

In this tutorial, we'll show how to use pandas read_csv to import data into Python, with practical examples.

csv (comma-separated values) files are popular for storing and transferring data. And pandas is the most popular Python package for data analysis/manipulation. These make pandas read_csv a critical first step for many data science projects with Python.

You'll learn pandas read_csv from the basics to advanced usage, including how to:

  • import csv files to a pandas DataFrame.
  • specify data types (low_memory/dtype/converters).
  • use a subset of columns/rows.
  • assign column names with no header.
  • And More!

This pandas tutorial includes all common cases when loading data using pandas read_csv.

Practice along and keep this as a cheat sheet as well!


  1. Before reading the files
  2. Pandas Data Structures: Series? DataFrame?
  3. pandas read_csv Basics
  4. Fix error_bad_lines from extra commas
  5. Specify Data Types: Numeric or String
  6. Specify Data Types: Datetime
  7. Use certain Columns (usecols)
  8. Set Column Names (names/prefix/no header)
  9. Specify Rows/Random Sampling (nrows/skiprows)
  10. pandas read_csv in chunks (chunksize) with summary statistics
  11. Load zip File (compression)
  12. Set Missing Values Strings (na_values)
  13. pandas read_csv Cheat Sheet

Before reading the files

If you have no experience with Python, please take our FREE Python crash course: breaking into data science.

To make it easy to demonstrate, we created 5 small csv files. You can download them to your computer from the GitHub repository pandas-read-csv-practice to practice.

Note: You can open these csv files and view them through Jupyter Notebook. Simply launch a Jupyter Notebook, then look for them within the directory.
Or for Windows users, you can right-click the file, select "Edit" and view it in Notepad. For Mac users, you can right-click and open it with TextEdit.
The comma (,) is used to separate columns, while a new line is used to separate rows.

To start practicing, you can either:

  • open the read_csv_code.ipynb file from Jupyter Notebook.
    or
  • create a new notebook within the same folder as these csv files.
[Image: pandas read_csv practice csv files]

Note: To avoid specifying the whole path/directory for the folder, Python needs both the csv files and the notebook to be in the same directory. This way, Python can locate the files by name in the current working directory.
Otherwise, you will get a FileNotFoundError.

To be able to use the pandas read_csv function, we also need to import the pandas and NumPy packages into Python. Both libraries should already be installed on your computer if you installed the Anaconda Distribution.

Note: pd is the common alias name for pandas, and np is the common alias name for NumPy.
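The standard imports look like this:

    import pandas as pd
    import numpy as np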


Pandas Data Structures: Series? DataFrame?

pandas is the most popular Python data analysis/manipulation package, and it is built upon NumPy.

pandas has two primary data structures: Series and DataFrame.

  • Series: a one-dimensional labeled array that can hold any data type, such as integers, strings, floating points, or Python objects.
    It has row/axis labels as the index.
  • DataFrame: a two-dimensional labeled data structure with columns of potentially different types.
    It also contains row labels as the index.

A DataFrame can be considered a collection of Series; it has a structure like a spreadsheet.

We'll be using the read_csv function to load csv files into Python as pandas DataFrames.

Great!

With all this basic knowledge, we can start practicing pandas read_csv!

pandas read_csv Basics

There is a long list of input parameters for the read_csv function. We'll only be showing the popular ones in this tutorial.

The most basic syntax of read_csv is below.
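A minimal call with only the file specified:

    df = pd.read_csv('test1.csv')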

With only the file specified, read_csv assumes:

  • the delimiter is a comma (,).
    We can change it by using the sep parameter if it's not a comma. For example, df = pd.read_csv('test1.csv', sep=';')
  • the first row of the file contains the headers/column names.
  • read all the data.
  • the quote character is double (").
  • an error will occur if there are bad lines.
    Bad lines happen when there are too many delimiters in the row.

Most structured datasets saved as text-based files can be opened with this method. They often have clear formats and don't need any further specification in read_csv.

The test1.csv file is nice and clean, so the default settings are appropriate for loading the file.

Below you can see the original file (left) and the pandas DataFrame df (right).

Note: you can use the type function to find out that df is a pandas.core.frame.DataFrame.
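For example:

    type(df)  # pandas.core.frame.DataFrame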

As you can see, the first row in the csv file was taken as the header, and the first three lines are straightforward.

But how can we have commas (,) and double quotes (") in the third and fourth rows?

Shouldn't they be special characters?

Take a closer look at the original file of test1.csv, and you will find the tricks:

  • to have the delimiter comma (,) in the data, we need to put quotes around the value.
  • to have the quote (") in the data, we need to type two double quotes ("") to help read_csv understand that we want to read it literally.

Fix error_bad_lines from extra commas

The most common error is when there are extra delimiter characters in a row.

This usually happens when a comma in the data wasn't quoted, which often appears in address or company name fields.

Let's see an example.

Below is the original data for test2.csv. We can see that in the row for CityD, there's an extra comma (,) within the address (58 Fourth Street, Apt 500). But the entry isn't quoted.

[Image: pandas read_csv practice extra comma]

If we read the file using the default settings, read_csv will be confused by this extra delimiter and give an error.

[Image: pandas read_csv error message]

As the error message says, Python expected 3 fields in line 5, but saw 4. This is due to the extra comma.

There are two main methods to fix this error:

  • the best way is to correct the error within the original csv file.
  • when that's not possible, we can also skip the bad lines by setting the error_bad_lines parameter to False, as shown below.
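A minimal sketch (note: error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 in favor of on_bad_lines='skip'/'warn'):

    df = pd.read_csv('test2.csv', error_bad_lines=False)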

This will load the data into Python while skipping the bad lines, but with warnings.

b'Skipping line 5: expected 3 fields, saw 4\n'

We can also suppress this warning by setting warn_bad_lines=False, but we'd like to see the warnings most of the time.

As you can see, the line with CityD was skipped here.

[Image: pandas read_csv error_bad_lines]

Next, let's see another common issue with csv files.

Specify Data Types: Numeric or String

As you know, csv files are plain-text files.

While it is important to specify data types such as numeric or string in Python, we need to rely on pandas read_csv to determine the data types when loading.

By default, if everything in a column is numeric, read_csv will detect that it is a numerical column; if there are any non-numbers in the column, read_csv will set the column to be an object type.

This is clearer with an example.

test3.csv has 3 columns, as shown below. The address column has both numbers and strings.

[Image: pandas read_csv address object]

Let's apply the default settings of read_csv to load its data:
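    df = pd.read_csv('test3.csv')
    df.dtypes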

read_csv specifies the data types of:

  • id as int64 (integer), since it contains only numbers.
  • address as object, since it contains some text, even though most of the lines are numbers.
  • city as object, since it's all text.

This is great, since these are the data types we want these columns to have.

The Problem of Mixed Data Types

But when the file has many rows, we might get a mixed data type column. In this situation, there will be problems.

To show this, let's create a large csv file with two columns and 1,000,010 rows (see the sketch after the list):

  • col1 has the letter 'a' in every row.
  • col2 has 1,000,000 rows with the number 123 and the last 10 rows with the string 'Hello'.
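One way to generate such a file; the filename test_large.csv is our assumption, since the article doesn't name it:

    import pandas as pd

    # 1,000,000 numeric rows followed by 10 text rows.
    df_large = pd.DataFrame({
        'col1': ['a'] * 1000010,
        'col2': [123] * 1000000 + ['Hello'] * 10,
    })
    df_large.to_csv('test_large.csv', index=False)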

If we use read_csv with the default settings, pandas will get confused and give a warning saying there are mixed data types.

/Users/justin/py_envs/DL/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3063: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.   interactivity=interactivity, compiler=compiler, result=result)

So even though Python loaded the data, it has a strange data type. Python considers the first 1,000,000 rows with 123 as integers/numeric, and the last 10 rows with 'Hello' as strings.

We can perform numeric operations on the first 20 rows, which contain 123.

[Image: pandas read_csv mixed data types]

But we can't perform any numerical or string operations on the entire column.

The first "+100" operation returns an error: "TypeError: can only concatenate str (not "int") to str". And the second str.len() operation returns NaN for all the 123 rows.

[Image: pandas read_csv mixed data types]

And we can take out two rows to check their data types. We can see that row 0 with 123 is an integer, while row 1,000,009 with 'Hello' is a string.

This column has mixed data types. This is NOT what we want!

How do we fix this?

There are 3 main ways. Let's look at them one by one.

Method #1: set low_memory = False

By default, low_memory is set to True. That means when the file is large, read_csv loads the file in chunks.

If an entire chunk has all numeric values, read_csv will save it as numeric.
If it encounters some non-numeric values in a chunk, then it will save the values in that chunk as non-numeric (object).
This is why we had the mixed data types in our big file example.

When we set low_memory = False, read_csv reads in all rows together and decides how to store the values based on all of them. If there are any non-numeric values, it will store the whole column as object.

Let's try it out:
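    df = pd.read_csv('test_large.csv', low_memory=False)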

After it's loaded, we can test the data types by taking out one row with 123 and one row with 'Hello' from the DataFrame df again.

It will tell us that Python is now treating them both as str (strings).

And we can perform string operations on the entire column. Try it out!
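A quick check, assuming df was loaded as above:

    # Both values are now stored as strings.
    type(df.loc[0, 'col2']), type(df.loc[1000009, 'col2'])  # (str, str)

    # String operations work on the whole column.
    df['col2'].str.len()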

But using low_memory = False is not memory efficient, which leads to the better method #2.

Method #2: set dtype (data types)

It is more efficient to tell Python the data types (dtype) when loading the data, especially when the dataset is large.

Let's try setting the column to be the string data type, as below:
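    df = pd.read_csv('test_large.csv', dtype={'col2': str})
    df['col2'].str.len()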

We'll see that the string operation str.len() returns the result with no issue.

What if we know that this column should be numeric, and those text entries are typos?

Note that we cannot use a numeric dtype in this situation, because there is text ('Hello') in the column.

But we can use method #3 to convert the values in the column.

Method #3: set converters (functions)

We can define a converter function to convert the values in specific columns.

In the example below, we keep the values that are numeric and change the text values to NaN for col2.

Then we pass this converter_func as a parameter to read_csv.
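The article's converter code didn't survive the copy, so here is a sketch of one plausible converter_func:

    import numpy as np

    # Keep values that parse as numbers; turn everything else into NaN.
    def converter_func(value):
        try:
            return float(value)
        except ValueError:
            return np.nan

    df = pd.read_csv('test_large.csv', converters={'col2': converter_func})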

Numeric and string are the most common data types, and we now know how to deal with them!

What about another common data type: date and time?

Specify Data Types: Datetime

When we have date/time columns, we cannot use the dtype parameter to specify the data type.

We'll need to use the parse_dates, date_parser, dayfirst, and keep_date_col parameters within the read_csv function.
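For instance, a minimal sketch (the file name events.csv and the column name date are hypothetical):

    # Parse the 'date' column as datetimes, reading values like 31/12/2020 as day-first.
    df = pd.read_csv('events.csv', parse_dates=['date'], dayfirst=True)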

We won't go over them here, but please take a look at the official documentation for more details.

And if you do use date_parser, it is good to get familiar with the Python DateTime formats.

Further Reading: How to Manipulate Date And Time in Python Like a Boss

So far we've been loading the entire file; what if we only want a subset?

Use certain Columns (usecols)

When we only want to analyze certain columns from the file, it saves memory to read in only those columns.

We can use the usecols parameter.

For instance, only col2 is loaded with the specification below:
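    df = pd.read_csv('test_large.csv', usecols=['col2'])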

[Image: pandas read_csv usecols]

Since we are looking at columns, let's also see how to name them better.

Set Column Names (names/prefix/no header)

When we don't have a header row in the csv file, we can specify column names.

For example, test5.csv was created without a header row.

[Image: pandas read_csv column names prefix no header test dataset]

If we use the default settings of read_csv, Python will take the first row as the column names.

[Image: pandas read_csv default uses first row as header]

To fix this, we can use the names parameter to set the column names as mycol1, mycol2, mycol3:
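    df = pd.read_csv('test5.csv', names=['mycol1', 'mycol2', 'mycol3'])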

[Image: pandas read_csv column names]

What if we have many columns, and we want to assign names with the same format/prefix to all of them?

We can set header = None and the prefix parameter. read_csv will assign column names using the prefix:
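A minimal sketch (note: the prefix parameter was removed in pandas 2.0, so this applies to older versions):

    # Columns become mycol0, mycol1, mycol2.
    df = pd.read_csv('test5.csv', header=None, prefix='mycol')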

[Image: pandas read_csv column prefix no header]

After looking at columns, let's see how we can deal with rows with more flexibility.

Specify Rows/Random Sampling (nrows/skiprows)

When we have a big dataset, it might not all fit into memory. Rather than importing the entire dataset, we can take a subset/sample of it.

Method #1: use nrows

We can use the nrows parameter to set the number of rows to read from the top of the file.

For example, we can set nrows = 10 to read the first 10 rows:
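    df = pd.read_csv('test_large.csv', nrows=10)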

[Image: pandas read_csv nrows]

Method #2: use skiprows

Using the skiprows parameter gives us more flexibility over which rows to skip.

But we do need to provide more information. We have to input:

  • an integer giving the number of rows to skip at the start of the file.
  • or a list of row numbers (starting at index position 0) to skip.
  • or a callable function that returns either True (skip) or False (keep) for each row number.

Let's start with a simple example.

We can set skiprows=100000 to skip the first 100,000 rows.

If we check the shape of the DataFrame, it returns (900010, 2), which is 100,000 rows less than the original file.
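Putting that together (note that the header line is among the skipped rows, so the first remaining row is used as the header):

    df = pd.read_csv('test_large.csv', skiprows=100000)
    df.shape  # (900010, 2)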

When we set skiprows as a list of numbers, read_csv will skip the row numbers in the list.

For example, to achieve random sampling, we can first create a random list skip_list of size 200,000 out of the 1,000,010 data row numbers. Then we can use skiprows to skip the 200,000 random rows specified by skip_list.

In this case, we know the total number of rows in the file is 1,000,010, so we can draw random row numbers from that range.
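A sketch of one way to build skip_list (we sample from 1 to 1,000,010 so that row 0, the header, is never skipped):

    import random

    skip_list = random.sample(range(1, 1000011), 200000)
    df = pd.read_csv('test_large.csv', skiprows=skip_list)
    df.shape  # (800010, 2)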

What if we don't know that?

We can create a function that randomly samples some of the rows.

For example, the code below uses a lambda function to sample roughly 10% of the rows randomly. The lambda function goes through each row index, and there's a 10% chance that a particular row is included in the new dataset.
Note that we also skip the first row (x == 0) containing the header, since we are using names to specify the column names.
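A sketch of that lambda-based sampling:

    import random

    # Skip the header row (x == 0) and keep each data row with ~10% probability.
    df = pd.read_csv(
        'test_large.csv',
        names=['col1', 'col2'],
        skiprows=lambda x: x == 0 or random.random() > 0.1,
    )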

pandas read_csv in chunks (chunksize) with summary statistics

When we have a really large dataset, another good practice is to use chunksize.

As mentioned earlier, pandas read_csv reads files in chunks by default, but it keeps all the chunks in memory.

With the chunksize setting, Python reads the file in chunks without keeping them in memory until each chunk is called. This is more memory efficient and makes it easier to spot any errors when loading the data.

Let's see how it works.

For example, we use the chunksize setting to create a TextFileReader object reader below.
Note that reader is not a pandas DataFrame anymore. It is a pandas TextFileReader. The data is not in memory until we call for it.
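A sketch (the chunk size of 100,000 is our choice, not from the article):

    reader = pd.read_csv('test_large.csv', chunksize=100000)
    type(reader)  # a pandas TextFileReader, not a DataFrame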

We can use the get_chunk method to fetch chunks from the file.

You may try out the code below to see what it returns.
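    reader.get_chunk(10)  # returns the next 10 rows as a DataFrame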

A more popular way of using chunks is to loop through them and use the aggregating functions of pandas groupby to get summary statistics.

For example, we can iterate through reader to process the file by chunks, grouping by col2 and counting the number of values within each group/chunk.
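A sketch of that loop (we re-create the reader since a TextFileReader is consumed as you iterate, and read col2 as string so the groups are consistent across chunks; both choices are ours):

    reader = pd.read_csv('test_large.csv', chunksize=100000, dtype={'col2': str})

    # One count-per-group Series for each chunk.
    result = [chunk.groupby('col2')['col1'].count() for chunk in reader]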

[Image: pandas read_csv chunk groupby]

Related article: How to GroupBy with Python Pandas Like a Boss
If you are not familiar with pandas GroupBy, take a look at this complete tutorial with examples.

Based on the output above, we can do another GroupBy to return the total count for each group in col2.
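Continuing the sketch above:

    # Combine the per-chunk counts into one total count per group.
    total = pd.concat(result).groupby(level=0).sum()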

[Image: pandas read_csv chunksize groupby]

Load zip File (compression)

read_csv also supports reading compressed files. This is very useful, since we often store csv files compressed to save storage space.

For example, we saved a zip version of test4.csv within the folder. And we can use the compression parameter setting to read it directly.
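A minimal sketch, assuming the archive is named test4.zip:

    df = pd.read_csv('test4.zip', compression='zip')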

Set Missing Values Strings (na_values)

By default, the following values are interpreted as NaN/missing values: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.

We can also specify extra strings within the file to be recognized as NA/NaN by using the na_values parameter.
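A hypothetical example, treating 'missing' and '?' as NaN as well:

    df = pd.read_csv('test1.csv', na_values=['missing', '?'])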


pandas read_csv Cheat Sheet

The code below summarizes everything covered in this tutorial. You've learned a lot of pandas read_csv parameters!
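The original cheat-sheet code block didn't survive the copy; here is a reconstructed sketch (file and column names are placeholders, and some options are alternatives rather than meant to be combined, e.g. dtype vs. converters):

    import pandas as pd

    df = pd.read_csv(
        'file.csv',
        sep=',',                  # delimiter (default is a comma)
        header=None,              # the file has no header row...
        names=['col1', 'col2'],   # ...so supply column names (or prefix='col')
        usecols=['col2'],         # load only certain columns
        dtype={'col2': str},      # set column data types
        nrows=1000,               # read only the first 1,000 rows
        skiprows=[1, 3],          # int, list, or callable of rows to skip
        na_values=['missing'],    # extra strings to treat as NaN
        low_memory=False,         # infer dtypes from the whole file, not per chunk
    )

    # Other parameters covered above:
    #   error_bad_lines=False    skip rows with too many fields (pandas < 1.3)
    #   parse_dates=['date']     parse date/time columns
    #   chunksize=100000         iterate the file in chunks via a TextFileReader
    #   compression='zip'        read compressed files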


That's it for this pandas read_csv tutorial. You've mastered this first step of your data science projects!

Leave a comment for any questions you may have, or anything else.


Related "Interruption into Data Scientific discipline" resources:

Python crash course: Break into Data Science – FREE

How to GroupBy with Python Pandas Like a Boss

Read this pandas tutorial to learn Group By in pandas. It is an essential operation on datasets (DataFrame) when doing data manipulation or analysis.

How to Learn Data Science Online: ALL You Need to Know

Check out this article for a detailed review of online resources, including courses, books, free tutorials, portfolio building, and more.


Source: https://www.justintodata.com/pandas-read_csv-python-pandas-tutorial/