Pandas
Commands in Pandas which I have used or would like to use. This is not meant to be complete, but cover some common usage situations or some tricky or obscure syntax.
See also:
- Pandas Cheat Sheet PDF on Pandas main site.
- Pands on Learn to Code.
Imports
import numpy as np
import pandas as pd
Types
pd.Series
pd.GroupBy
pd.DataFrame
Create dataframe
Column-input
With data
as a dict
where key is column name and value is a list of values.
data = {"a" : [4 ,5, 6], "b" : [7, 8, 9], "c" : [10, 11, 12]}
df = pd.DataFrame(
data,
indexes=[1, 2, 3],
)
Row input
With data
as a 2-D list - a list of rows where each row is a list. Columns must be named.
data = [[4, 7, 10],[5, 8, 11],[6, 9, 12]]
df = pd.DataFrame(
data,
indexes=[1, 2, 3],
columns=['a', 'b', 'c'],
)
Select and filter
Get column.
df['foo']
df.foo
Get columns.
df[['foo', 'bar']]
Get columns of name matching pattern.
df.filter(regex='foo')
Select range of named columns, inclusive.
df.loc[:,'foo':'bar']
Select individual columns by index.
df.iloc[:,[0, 2 ,5]]
Also filter by condition.
df.loc[df['a'] > 10, ['a','c']]
Summarize
Row count.
len(df)
Get pairs of values and frequency, sorted by frequency descending.
df['foo'].value_counts()
e.g.
df = pd.Series([2, 2, 5, 3, 4, np.nan])
df.value_counts()
# 2.0 2
# 4.0 1
# 3.0 1
# 5.0 1
Distinct values. Does not sort.
df['foo'].nunique()
e.g.
> pd.Series([2, 4, 3, 3], name='P').unique()
# => array([2, 4, 3], dtype=int64)
Basic descriptive statistics for each column
df.describe()
Functions
Calculations on a column or possibly dataframe.
.count()
.sum()
.min()
.max()
.mean()
.median()
.var()
.std()
.quantile([0.25,0.75]
.apply(function)
Group by
Get GroupBy
object.
Column.
df.groupby(by="foo")
Index.
df.groupby(level="foo")
Size of group.
.size()
Plot
Histogram for each column.
df.plot.hist()
Scatter chart using pairs of points.
df.plot.scatter(x='foo', y='bar')