This article explains how to get unique values and their counts in a column (= Series) of a DataFrame in pandas.
Use the unique(), value_counts(), and nunique() methods on Series. nunique() is also available as a method on DataFrame.
- pandas.Series.unique() returns unique values as a NumPy array (ndarray)
- pandas.Series.value_counts() returns unique values and their counts as a Series
- pandas.Series.nunique() and pandas.DataFrame.nunique() return the number of unique values as either an int or a Series
This article begins by explaining the basic usage of each method, then shows how to get unique values and their counts, and more.
Contents
- pandas.Series.unique()
- pandas.Series.value_counts()
- pandas.Series.nunique(), pandas.DataFrame.nunique()
- Get the number of unique values
- Get the list of unique values
- Get the counts of each unique value
- Get the dictionary of unique values and their counts
- Get the mode (most frequent value) and its frequency
- value_counts()
- mode()
- describe()
- Get the normalized frequencies
To count values that meet certain conditions, refer to the following article.
- pandas: Count DataFrame/Series elements matching conditions
The describe() method is useful for computing summary statistics, including the mode and its frequency.
- pandas: Get summary statistics for each column with describe()
The pandas version used in this article is as follows. Note that functionality may vary between versions.

The following data is used for the examples. Missing values (NaN) are inserted for explanation purposes.
import pandas as pd

print(pd.__version__)
# 2.1.4

df = pd.read_csv('data/src/sample_pandas_normal.csv')
df.iloc[1] = float('nan')
print(df)
#       name   age state  point
# 0    Alice  24.0    NY   64.0
# 1      NaN   NaN   NaN    NaN
# 2  Charlie  18.0    CA   70.0
# 3     Dave  68.0    TX   70.0
# 4    Ellen  24.0    CA   88.0
# 5    Frank  30.0    NY   57.0
source: pandas_value_counts.py
pandas.Series.unique()
unique() returns unique values as a one-dimensional NumPy array (ndarray). Missing values (NaN) are included, and the values appear in order of first appearance.
print(df['state'].unique())
# ['NY' nan 'CA' 'TX']

print(type(df['state'].unique()))
# <class 'numpy.ndarray'>
source: pandas_value_counts.py
pandas.Series.value_counts()
value_counts() returns a Series where the unique values are the index (labels) and their counts are the values.
print(df['state'].value_counts())
# state
# NY    2
# CA    2
# TX    1
# Name: count, dtype: int64

print(type(df['state'].value_counts()))
# <class 'pandas.core.series.Series'>
source: pandas_value_counts.py
By default, missing values (NaN) are excluded, but if the dropna argument is set to False, they are also counted.
print(df['state'].value_counts(dropna=False))
# state
# NY     2
# CA     2
# NaN    1
# TX     1
# Name: count, dtype: int64
source: pandas_value_counts.py
By default, the values are sorted in descending order of frequency. If the ascending argument is set to True, they are sorted in ascending order. Alternatively, setting sort to False leaves them unsorted, in their original order of appearance.
print(df['state'].value_counts(dropna=False, ascending=True))
# state
# NaN    1
# TX     1
# NY     2
# CA     2
# Name: count, dtype: int64

print(df['state'].value_counts(dropna=False, sort=False))
# state
# NY     2
# NaN    1
# CA     2
# TX     1
# Name: count, dtype: int64
If the normalize argument is set to True, the values are normalized so that they sum to 1. Note that when NaN is present, the result depends on the dropna setting.
print(df['state'].value_counts(normalize=True))
# state
# NY    0.4
# CA    0.4
# TX    0.2
# Name: proportion, dtype: float64

print(df['state'].value_counts(dropna=False, normalize=True))
# state
# NY     0.333333
# CA     0.333333
# NaN    0.166667
# TX     0.166667
# Name: proportion, dtype: float64
source: pandas_value_counts.py
pandas.Series.nunique(), pandas.DataFrame.nunique()
nunique() on Series returns the number of unique values as an integer (int). By default, missing values (NaN) are excluded; setting the dropna argument to False includes them in the count.
print(df['state'].nunique())
# 3

print(type(df['state'].nunique()))
# <class 'int'>

print(df['state'].nunique(dropna=False))
# 4
source: pandas_value_counts.py
nunique() on DataFrame returns the number of unique values for each column as a Series.

As with Series, the nunique() method on DataFrame also accepts the dropna argument. Additionally, while counting is column-wise by default, setting the axis argument to 1 or 'columns' switches the count to row-wise.
print(df.nunique())
# name     5
# age      4
# state    3
# point    4
# dtype: int64

print(type(df.nunique()))
# <class 'pandas.core.series.Series'>

print(df.nunique(dropna=False))
# name     6
# age      5
# state    4
# point    5
# dtype: int64

print(df.nunique(dropna=False, axis='columns'))
# 0    4
# 1    1
# 2    4
# 3    4
# 4    4
# 5    4
# dtype: int64
source: pandas_value_counts.py
Get the number of unique values
The number of unique values can be counted using nunique() on Series and DataFrame.
print(df['state'].nunique())
# 3

print(df.nunique())
# name     5
# age      4
# state    3
# point    4
# dtype: int64
source: pandas_value_counts.py
Get the list of unique values
unique() returns unique values as a NumPy array (ndarray). An ndarray can be converted to a Python built-in list (list) using the tolist() method.
- Convert numpy.ndarray and list to each other
print(df['state'].unique().tolist())
# ['NY', nan, 'CA', 'TX']

print(type(df['state'].unique().tolist()))
# <class 'list'>
source: pandas_value_counts.py
You can call tolist() on the index attribute of the Series returned by value_counts(), or use the values attribute to obtain the data as a NumPy array (ndarray).
print(df['state'].value_counts().index.tolist())
# ['NY', 'CA', 'TX']

print(type(df['state'].value_counts().index.tolist()))
# <class 'list'>

print(df['state'].value_counts().index.values)
# ['NY' 'CA' 'TX']

print(type(df['state'].value_counts().index.values))
# <class 'numpy.ndarray'>
source: pandas_value_counts.py
unique() always includes NaN as a unique value, whereas value_counts() lets you choose whether to count NaN via the dropna argument.
print(df['state'].value_counts(dropna=False).index.tolist())
# ['NY', 'CA', nan, 'TX']
source: pandas_value_counts.py
Get the counts of each unique value
To get the count (frequency, number of occurrences) of each unique value, access the values of the Series returned by value_counts().
vc = df['state'].value_counts()
print(vc)
# state
# NY    2
# CA    2
# TX    1
# Name: count, dtype: int64

print(vc['NY'])
# 2

print(vc['TX'])
# 1
source: pandas_value_counts.py
To extract each unique value and its count in a for loop, use the items() method.
for index, value in df['state'].value_counts().items():
    print(index, value)
# NY 2
# CA 2
# TX 1
source: pandas_value_counts.py
This method was previously named iteritems(), but was renamed to items(); iteritems() was removed in pandas 2.0.
- What’s new in 2.0.0 (April 3, 2023) — pandas 2.1.4 documentation
- DEPR: Series/DataFrame/HDFStore.iteritems() by mroeschke · Pull Request #45321 · pandas-dev/pandas
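As a minimal sketch (using a small hypothetical Series rather than the sample data above), items() yields (label, value) pairs on current pandas versions:

```python
import pandas as pd

s = pd.Series([2, 2, 1], index=['NY', 'CA', 'TX'], name='count')

# items() yields (label, value) pairs and is the supported spelling
print(dict(s.items()))
# {'NY': 2, 'CA': 2, 'TX': 1}

# on pandas >= 2.0, the iteritems attribute no longer exists
print(hasattr(s, 'iteritems'))
```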
Get the dictionary of unique values and their counts
You can call the to_dict() method on the Series returned by value_counts() to convert it into a dictionary (dict).
d = df['state'].value_counts().to_dict()
print(d)
# {'NY': 2, 'CA': 2, 'TX': 1}

print(type(d))
# <class 'dict'>

print(d['NY'])
# 2

print(d['TX'])
# 1
source: pandas_value_counts.py
To extract each unique value and its count in a for loop, use the items() method of the dictionary.
- Iterate through dictionary keys and values in Python
for key, value in d.items():
    print(key, value)
# NY 2
# CA 2
# TX 1
source: pandas_value_counts.py
Get the mode (most frequent value) and its frequency
value_counts()
By default, value_counts() returns a Series sorted by frequency, so its first element gives the mode (most frequent value) and its frequency.
print(df['state'].value_counts())
# state
# NY    2
# CA    2
# TX    1
# Name: count, dtype: int64

print(df['state'].value_counts().index[0])
# NY

print(df['state'].value_counts().iat[0])
# 2
source: pandas_value_counts.py
The values of the original Series are used as the index of the resulting Series. If this index is numeric, accessing it directly with [number] can raise an error, because the number is interpreted as an index label rather than a position. Use iat[number] for positional access instead.
- pandas: Select rows/columns by index (numbers and names)
# print(df['age'].value_counts()[0])
# KeyError: 0

print(df['age'].value_counts().iat[0])
# 2
source: pandas_value_counts.py
You can apply this to each column of a DataFrame using the apply() method.
- pandas: Apply functions to values, rows, columns with map(), apply()
- Lambda expressions in Python
print(df.apply(lambda x: x.value_counts().index[0]))
# name     Alice
# age       24.0
# state       NY
# point     70.0
# dtype: object

print(df.apply(lambda x: x.value_counts().iat[0]))
# name     1
# age      2
# state    2
# point    2
# dtype: int64
source: pandas_value_counts.py
As mentioned above, missing values (NaN) are excluded by default; if the dropna argument is set to False, they are also counted. Also be aware that if there are multiple modes, this approach returns only one of them.
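For example, with a small hypothetical Series containing two tied values, value_counts().index[0] surfaces only one of the tied modes, while mode() (covered next) returns them all:

```python
import pandas as pd

# hypothetical Series where 'NY' and 'CA' are tied as modes
s = pd.Series(['NY', 'CA', 'CA', 'NY', 'TX'])

# value_counts() exposes only one of the tied modes as the first element
print(s.value_counts().index[0])
# 'NY' or 'CA' (which one comes first is not guaranteed for ties)

# mode() returns all tied modes, sorted
print(s.mode().tolist())
# ['CA', 'NY']
```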
mode()
The mode() method on Series returns the mode(s) as a Series. Converting this Series to a list with tolist() gives the modes as a list; even if there is only one mode, the result is still a list.
print(df['state'].mode())
# 0    CA
# 1    NY
# Name: state, dtype: object

print(df['state'].mode().tolist())
# ['CA', 'NY']

print(df['age'].mode().tolist())
# [24.0]
source: pandas_value_counts.py
Applying it with apply() to each column produces a Series whose values are lists of modes.
s_list = df.apply(lambda x: x.mode().tolist())
print(s_list)
# name     [Alice, Charlie, Dave, Ellen, Frank]
# age                                    [24.0]
# state                                [CA, NY]
# point                                  [70.0]
# dtype: object

print(type(s_list))
# <class 'pandas.core.series.Series'>

print(s_list['name'])
# ['Alice', 'Charlie', 'Dave', 'Ellen', 'Frank']

print(type(s_list['name']))
# <class 'list'>
source: pandas_value_counts.py
mode() is also available as a method of DataFrame, where it returns a DataFrame. If the number of modes differs between columns, the empty cells are filled with missing values (NaN).
print(df.mode())
#       name   age state  point
# 0    Alice  24.0    CA   70.0
# 1  Charlie   NaN    NY    NaN
# 2     Dave   NaN   NaN    NaN
# 3    Ellen   NaN   NaN    NaN
# 4    Frank   NaN   NaN    NaN
source: pandas_value_counts.py
By default, missing values (NaN) are excluded; if the dropna argument is set to False, they are also counted. For more details about mode(), refer to the following article.
- pandas: Get the mode (the most frequent value) with mode()
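As a minimal sketch with a hypothetical Series where NaN is the most frequent value, dropna=False changes the result:

```python
import pandas as pd

# hypothetical Series where NaN occurs more often than any other value
s = pd.Series(['NY', float('nan'), float('nan'), 'CA'])

print(s.mode().tolist())
# ['CA', 'NY']  (NaN excluded; the remaining values are tied)

print(s.mode(dropna=False).tolist())
# [nan]  (NaN is the sole mode when counted)
```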
describe()
The describe() method can compute the number of unique values, the mode, and its frequency for each column at once. top represents the mode, and freq represents its frequency. Each item can be obtained with loc[].
print(df.astype('object').describe())
#          name   age state point
# count       5   5.0     5   5.0
# unique      5   4.0     3   4.0
# top     Alice  24.0    NY  70.0
# freq        1   2.0     2   2.0

print(df.astype('object').describe().loc['top'])
# name     Alice
# age       24.0
# state       NY
# point     70.0
# Name: top, dtype: object
source: pandas_value_counts.py
In describe(), the listed statistics depend on the data type (dtype) of the column, so astype() is used here for type conversion.
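To see why the cast matters (a minimal sketch with a hypothetical numeric Series), compare the statistics listed for a numeric dtype with those for object:

```python
import pandas as pd

s = pd.Series([24.0, 24.0, 18.0])

# numeric dtype: count/mean/std/min/quartiles/max, no mode information
print(s.describe().index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

# after casting to object: count/unique/top/freq
print(s.astype('object').describe().index.tolist())
# ['count', 'unique', 'top', 'freq']
```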
describe() excludes missing values (NaN) and, unlike the other methods, has no dropna argument. Note that even if there are several modes, this method returns only one.
Get the normalized frequencies
When the normalize argument of value_counts() is set to True, the returned values are normalized so that they sum to 1. As noted above, the values differ depending on the dropna setting if missing values (NaN) are included.
print(df['state'].value_counts(normalize=True))
# state
# NY    0.4
# CA    0.4
# TX    0.2
# Name: proportion, dtype: float64

print(df['state'].value_counts(dropna=False, normalize=True))
# state
# NY     0.333333
# CA     0.333333
# NaN    0.166667
# TX     0.166667
# Name: proportion, dtype: float64
source: pandas_value_counts.py