How to Read in Data From .dat Pandas
In [1]: import pandas as pd
-
Titanic data
This tutorial uses the Titanic data set, stored as CSV. The data consists of the following columns:
-
PassengerId: Id of every passenger.
-
Survived: This feature has the value 0 or 1: 0 for not survived and 1 for survived.
-
Pclass: There are 3 classes: Class 1, Class 2 and Class 3.
-
Name: Name of passenger.
-
Sex: Gender of passenger.
-
Age: Age of passenger.
-
SibSp: Number of siblings or spouses the passenger had aboard.
-
Parch: Number of parents or children the passenger had aboard.
-
Ticket: Ticket number of passenger.
-
Fare: The fare paid by the passenger.
-
Cabin: The cabin of passenger.
-
Embarked: The embarkation category (C, Q or S in the data).
-
How do I read and write tabular data?
-
I want to analyze the Titanic passenger data, available as a CSV file.
In [2]: titanic = pd.read_csv("data/titanic.csv")
pandas provides the read_csv() function to read data stored as a csv file into a pandas DataFrame. pandas supports many different file formats or data sources out of the box (csv, excel, sql, json, parquet, …), each of them with the prefix read_*.
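As a minimal sketch of the same call on other delimited text: read_csv() also accepts a file-like object and a sep argument, so whitespace-delimited text can be read the same way. Files with a .dat extension are often laid out like this, but that is an assumption about any particular file and should be checked by opening it first. The sample rows below are typed in by hand for illustration and are not the full Titanic file.

```python
import io

import pandas as pd

# Hand-typed sample standing in for a file on disk (illustrative only).
csv_text = """PassengerId,Survived,Pclass,Fare
1,0,3,7.25
2,1,1,71.2833
3,1,3,7.925
"""

# read_csv accepts a path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

# The same rows, separated by whitespace instead of commas.
dat_text = csv_text.replace(",", " ")
sample_dat = pd.read_csv(io.StringIO(dat_text), sep=r"\s+")

print(sample.shape)
print(sample.equals(sample_dat))
```

For a real file, the io.StringIO(...) argument would be replaced by the file path.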
Make sure to always check the data after reading it in. When displaying a DataFrame, the first and last 5 rows will be shown by default:
In [3]: titanic
Out[3]:
     PassengerId  Survived  Pclass                                               Name  ...            Ticket     Fare  Cabin  Embarked
0              1         0       3                            Braund, Mr. Owen Harris  ...         A/5 21171   7.2500    NaN         S
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  ...          PC 17599  71.2833    C85         C
2              3         1       3                             Heikkinen, Miss. Laina  ...  STON/O2. 3101282   7.9250    NaN         S
3              4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  ...            113803  53.1000   C123         S
4              5         0       3                           Allen, Mr. William Henry  ...            373450   8.0500    NaN         S
..           ...       ...     ...                                                ...  ...               ...      ...    ...       ...
886          887         0       2                              Montvila, Rev. Juozas  ...            211536  13.0000    NaN         S
887          888         1       1                       Graham, Miss. Margaret Edith  ...            112053  30.0000    B42         S
888          889         0       3           Johnston, Miss. Catherine Helen "Carrie"  ...        W./C. 6607  23.4500    NaN         S
889          890         1       1                              Behr, Mr. Karl Howell  ...            111369  30.0000   C148         C
890          891         0       3                                Dooley, Mr. Patrick  ...            370376   7.7500    NaN         Q

[891 rows x 12 columns]
-
I want to see the first 8 rows of a pandas DataFrame.
In [4]: titanic.head(8)
Out[4]:
   PassengerId  Survived  Pclass                                               Name  ...            Ticket     Fare  Cabin  Embarked
0            1         0       3                            Braund, Mr. Owen Harris  ...         A/5 21171   7.2500    NaN         S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  ...          PC 17599  71.2833    C85         C
2            3         1       3                             Heikkinen, Miss. Laina  ...  STON/O2. 3101282   7.9250    NaN         S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  ...            113803  53.1000   C123         S
4            5         0       3                           Allen, Mr. William Henry  ...            373450   8.0500    NaN         S
5            6         0       3                                   Moran, Mr. James  ...            330877   8.4583    NaN         Q
6            7         0       1                            McCarthy, Mr. Timothy J  ...             17463  51.8625    E46         S
7            8         0       3                     Palsson, Master. Gosta Leonard  ...            349909  21.0750    NaN         S

[8 rows x 12 columns]
To see the first N rows of a DataFrame, use the head() method with the required number of rows (in this case 8) as argument.
Note
Interested in the last N rows instead? pandas also provides a tail() method. For example, titanic.tail(10) will return the last 10 rows of the DataFrame.
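A small sketch of head() and tail() side by side may help. The frame below is hypothetical and built by hand so the example runs without the Titanic file.

```python
import pandas as pd

# Small hypothetical frame; the values are illustrative only.
df = pd.DataFrame({"PassengerId": range(1, 21), "Pclass": [1, 2, 3, 3] * 5})

first_eight = df.head(8)   # first 8 rows
last_three = df.tail(3)    # last 3 rows
default_head = df.head()   # with no argument, 5 rows are returned

print(first_eight["PassengerId"].tolist())
print(last_three["PassengerId"].tolist())
print(len(default_head))
```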
A check on how pandas interpreted each of the column data types can be done by requesting the pandas dtypes attribute:
In [5]: titanic.dtypes
Out[5]:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
For each of the columns, the data type used is listed. The data types in this DataFrame are integers (int64), floats (float64) and strings (object).
Note
When asking for the dtypes, no brackets are used! dtypes is an attribute of a DataFrame and Series. Attributes of a DataFrame or Series do not need brackets. Attributes represent a characteristic of a DataFrame/Series, whereas a method (which requires brackets) does something with the DataFrame/Series, as introduced in the first tutorial.
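The attribute versus method distinction can be sketched as follows, using a hypothetical two-column frame rather than the Titanic data. The callable() checks are only an illustration of the difference, not something the tutorial itself relies on.

```python
import pandas as pd

# Hypothetical frame; None becomes a missing value in the float column.
df = pd.DataFrame({"Age": [22.0, 38.0, None], "SibSp": [1, 1, 0]})

# Attribute: accessed without brackets, returns a Series of dtypes.
column_types = df.dtypes
print(column_types)

# Method: called with brackets, does something with the DataFrame.
preview = df.head(2)
print(preview)

# A method is a callable object, the dtypes result is not.
print(callable(df.head))
print(callable(df.dtypes))
```

Writing df.dtypes() with brackets would therefore raise a TypeError, because the returned Series cannot be called.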
-
My colleague requested the Titanic data as a spreadsheet.
In [6]: titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)
Whereas read_* functions are used to read data into pandas, the to_* methods are used to store data. The to_excel() method stores the data as an excel file. In the example here, the sheet_name is named passengers instead of the default Sheet1. By setting index=False the row index labels are not saved in the spreadsheet.
The equivalent read function read_excel() will reload the data into a DataFrame:
In [7]: titanic = pd.read_excel("titanic.xlsx", sheet_name="passengers")
In [8]: titanic.head()
Out[8]:
   PassengerId  Survived  Pclass                                               Name  ...            Ticket     Fare  Cabin  Embarked
0            1         0       3                            Braund, Mr. Owen Harris  ...         A/5 21171   7.2500    NaN         S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  ...          PC 17599  71.2833    C85         C
2            3         1       3                             Heikkinen, Miss. Laina  ...  STON/O2. 3101282   7.9250    NaN         S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  ...            113803  53.1000   C123         S
4            5         0       3                           Allen, Mr. William Henry  ...            373450   8.0500    NaN         S

[5 rows x 12 columns]
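The write-then-reload pattern can be tried without a spreadsheet. The sketch below swaps in to_csv() and read_csv() with an in-memory buffer, because writing Excel files typically needs an additional engine package to be installed. The frame and its values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical frame; the values are illustrative only.
df = pd.DataFrame({"PassengerId": [1, 2, 3], "Fare": [7.25, 30.0, 8.5]})

# Write without the row index labels, mirroring index=False above.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read the text back into a new DataFrame.
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(buffer.getvalue())
print(reloaded.equals(df))
```

Leaving out index=False would write the row labels as an extra unnamed first column, which then shows up as a column when the data is read back.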
-
I'm interested in a technical summary of a DataFrame
In [9]: titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
The method info() provides technical information about a DataFrame, so let's explain the output in more detail:
-
It is indeed a DataFrame.
-
There are 891 entries, i.e. 891 rows.
-
Each row has a row label (aka the index) with values ranging from 0 to 890.
-
The table has 12 columns. Most columns have a value for each of the rows (all 891 values are non-null). Some columns do have missing values and fewer than 891 non-null values.
-
The columns Name, Sex, Ticket, Cabin and Embarked consist of textual data (strings, aka object). The other columns are numerical data, some of them whole numbers (aka integer) and others real numbers (aka float).
-
The kind of data (characters, integers, …) in the different columns is summarized by listing the dtypes.
-
The approximate amount of RAM used to hold the DataFrame is provided as well.
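The non-null counts that info() reports can also be obtained as data. The sketch below assumes a hypothetical three-row frame in which the missing values were placed by hand for illustration. It captures the info() text through the buf argument, then computes the same counts with count() and the complementary missing counts with isna().sum().

```python
import io

import pandas as pd

# Hypothetical frame; the missing values are placed by hand for illustration.
df = pd.DataFrame(
    {
        "Name": ["Braund", "Cumings", "Heikkinen"],
        "Age": [22.0, None, 26.0],
        "Cabin": [None, "C85", None],
    }
)

# info() prints to stdout by default; buf= redirects the summary to a buffer.
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# Non-null counts per column, the same numbers info() reports.
non_null = df.count()
print(non_null)

# Missing values per column are the complement of the non-null counts.
missing = df.isna().sum()
print(missing)
```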
-
REMEMBER
-
Getting data into pandas from many different file formats or data sources is supported by read_* functions.
-
Exporting data out of pandas is provided by different to_* methods.
-
The head/tail/info methods and the dtypes attribute are convenient for a first check.
To user guide
For a complete overview of the input and output possibilities from and to pandas, see the user guide section about reader and writer functions.
Source: https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html