pandas: Get unique values and their counts in a column | note.nkmk.me (2024)

This article explains how to get unique values and their counts in a column (= Series) of a DataFrame in pandas.

Use the unique(), value_counts(), and nunique() methods on Series. nunique() is also available as a method on DataFrame.

  • pandas.Series.unique() returns unique values as a NumPy array (ndarray)
  • pandas.Series.value_counts() returns unique values and their counts as a Series
  • pandas.Series.nunique() returns the number of unique values as an int; pandas.DataFrame.nunique() returns the number per column as a Series
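
As a quick self-contained sketch (using a small constructed Series rather than the sample data introduced below), the three methods behave as follows:

import pandas as pd

s = pd.Series(['NY', 'CA', 'NY'])

print(s.unique())        # unique values as an ndarray, in order of appearance
# ['NY' 'CA']

print(s.value_counts())  # unique values and their counts as a Series
# NY    2
# CA    1
# Name: count, dtype: int64

print(s.nunique())       # the number of unique values as an int
# 2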

This article begins by explaining the basic usage of each method, then shows how to get unique values and their counts, and more.

Contents

  • pandas.Series.unique()
  • pandas.Series.value_counts()
  • pandas.Series.nunique(), pandas.DataFrame.nunique()
  • Get the number of unique values
  • Get the list of unique values
  • Get the counts of each unique value
  • Get the dictionary of unique values and their counts
  • Get the mode (most frequent value) and its frequency
    • value_counts()
    • mode()
    • describe()
  • Get the normalized frequencies

To count values that meet certain conditions, refer to the following article.

  • pandas: Count DataFrame/Series elements matching conditions
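
As a brief, hedged aside (a constructed Series, since the sample data is introduced further below), such conditional counts typically come down to summing a boolean Series:

import pandas as pd

s = pd.Series([64.0, 70.0, 70.0, 88.0])

# Comparing a Series to a value yields a boolean Series; True counts as 1 when summed.
print((s >= 70).sum())
# 3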

The describe() method is useful to compute summary statistics including the mode and its frequency.

  • pandas: Get summary statistics for each column with describe()

The pandas version used in this article is as follows. Note that functionality may vary between versions.

The following data is used for the examples. Missing values (NaN) are inserted for explanation purposes.

import pandas as pd

print(pd.__version__)
# 2.1.4

df = pd.read_csv('data/src/sample_pandas_normal.csv')
df.iloc[1] = float('nan')
print(df)
#       name   age state  point
# 0    Alice  24.0    NY   64.0
# 1      NaN   NaN   NaN    NaN
# 2  Charlie  18.0    CA   70.0
# 3     Dave  68.0    TX   70.0
# 4    Ellen  24.0    CA   88.0
# 5    Frank  30.0    NY   57.0

pandas.Series.unique()

unique() returns unique values as a one-dimensional NumPy array (ndarray). Missing values (NaN) are included. The values are arranged in the order of appearance.

print(df['state'].unique())
# ['NY' nan 'CA' 'TX']

print(type(df['state'].unique()))
# <class 'numpy.ndarray'>

pandas.Series.value_counts()

value_counts() returns a Series where the unique values are the index (labels) and their counts are the values.

print(df['state'].value_counts())
# state
# NY    2
# CA    2
# TX    1
# Name: count, dtype: int64

print(type(df['state'].value_counts()))
# <class 'pandas.core.series.Series'>

By default, missing values (NaN) are excluded, but if the dropna argument is set to False, they are also counted.

print(df['state'].value_counts(dropna=False))
# state
# NY     2
# CA     2
# NaN    1
# TX     1
# Name: count, dtype: int64

By default, the values are sorted in descending order of frequency. If the ascending argument is set to True, they are sorted in ascending order. Alternatively, setting sort to False leaves them unsorted, arranged in their original order of appearance.

print(df['state'].value_counts(dropna=False, ascending=True))
# state
# NaN    1
# TX     1
# NY     2
# CA     2
# Name: count, dtype: int64

print(df['state'].value_counts(dropna=False, sort=False))
# state
# NY     2
# NaN    1
# CA     2
# TX     1
# Name: count, dtype: int64

If the normalize argument is set to True, the values are normalized so that their total is 1. Note that the resulting proportions depend on the dropna setting when NaN values are present.

print(df['state'].value_counts(normalize=True))
# state
# NY    0.4
# CA    0.4
# TX    0.2
# Name: proportion, dtype: float64

print(df['state'].value_counts(dropna=False, normalize=True))
# state
# NY     0.333333
# CA     0.333333
# NaN    0.166667
# TX     0.166667
# Name: proportion, dtype: float64

pandas.Series.nunique(), pandas.DataFrame.nunique()

nunique() on Series returns the number of unique values as an integer (int).

By default, missing values (NaN) are excluded; however, setting the dropna argument to False includes them in the count.

print(df['state'].nunique())
# 3

print(type(df['state'].nunique()))
# <class 'int'>

print(df['state'].nunique(dropna=False))
# 4

nunique() on DataFrame returns the number of unique values for each column as a Series.

Similar to Series, the nunique() method on DataFrame also has the dropna argument. Additionally, while the default counting is column-wise, changing the axis argument to 1 or 'columns' switches the count to row-wise.

print(df.nunique())
# name     5
# age      4
# state    3
# point    4
# dtype: int64

print(type(df.nunique()))
# <class 'pandas.core.series.Series'>

print(df.nunique(dropna=False))
# name     6
# age      5
# state    4
# point    5
# dtype: int64

print(df.nunique(dropna=False, axis='columns'))
# 0    4
# 1    1
# 2    4
# 3    4
# 4    4
# 5    4
# dtype: int64

Get the number of unique values

The number of unique values can be counted using nunique() on Series and DataFrame.

print(df['state'].nunique())
# 3

print(df.nunique())
# name     5
# age      4
# state    3
# point    4
# dtype: int64

Get the list of unique values

unique() returns unique values as a NumPy array (ndarray). ndarray can be converted to a Python built-in list (list) using the tolist() method.

  • Convert numpy.ndarray and list to each other
print(df['state'].unique().tolist())
# ['NY', nan, 'CA', 'TX']

print(type(df['state'].unique().tolist()))
# <class 'list'>

You can call tolist() on the index attribute of the Series returned by value_counts(), or use the values attribute to obtain the data as a NumPy array (ndarray).

print(df['state'].value_counts().index.tolist())
# ['NY', 'CA', 'TX']

print(type(df['state'].value_counts().index.tolist()))
# <class 'list'>

print(df['state'].value_counts().index.values)
# ['NY' 'CA' 'TX']

print(type(df['state'].value_counts().index.values))
# <class 'numpy.ndarray'>

unique() always includes NaN among the unique values, whereas value_counts() lets you choose whether to count NaN via the dropna argument.

print(df['state'].value_counts(dropna=False).index.tolist())
# ['NY', 'CA', nan, 'TX']

Get the counts of each unique value

To get the counts (frequency, number of occurrences) of each unique value, access the values of the Series returned by value_counts().

vc = df['state'].value_counts()
print(vc)
# state
# NY    2
# CA    2
# TX    1
# Name: count, dtype: int64

print(vc['NY'])
# 2

print(vc['TX'])
# 1

To extract the unique value and its count in a for loop, use the items() method.

for index, value in df['state'].value_counts().items():
    print(index, value)
# NY 2
# CA 2
# TX 1

This method was previously named iteritems(); it was renamed to items(), and iteritems() was removed in pandas 2.0.
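
For reference, a hedged sketch of what calling the removed name looks like on pandas 2.x (the exact error message may vary by version):

# iteritems() no longer exists in pandas 2.x, so this raises AttributeError; use items() instead.
try:
    for index, value in df['state'].value_counts().iteritems():
        print(index, value)
except AttributeError as e:
    print(e)
# 'Series' object has no attribute 'iteritems'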

Get the dictionary of unique values and their counts

You can call the to_dict() method on the Series returned by value_counts() to convert it into a dictionary (dict).

d = df['state'].value_counts().to_dict()
print(d)
# {'NY': 2, 'CA': 2, 'TX': 1}

print(type(d))
# <class 'dict'>

print(d['NY'])
# 2

print(d['TX'])
# 1

To extract the unique value and its count in a for loop, use the items() method.

  • Iterate through dictionary keys and values in Python
for key, value in d.items():
    print(key, value)
# NY 2
# CA 2
# TX 1

Get the mode (most frequent value) and its frequency

value_counts()

By default, value_counts() returns a Series sorted in order of frequency, so the first element represents the mode (most frequent value) and its frequency.

print(df['state'].value_counts())
# state
# NY    2
# CA    2
# TX    1
# Name: count, dtype: int64

print(df['state'].value_counts().index[0])
# NY

print(df['state'].value_counts().iat[0])
# 2

The original Series values become the index of the resulting Series. If this index is numeric, accessing it directly with [number] can raise errors, so use iat[number] for position-based access instead.

  • pandas: Select rows/columns by index (numbers and names)
# print(df['age'].value_counts()[0])
# KeyError: 0

print(df['age'].value_counts().iat[0])
# 2

You can apply it to each column of a DataFrame using the apply() method.

  • pandas: Apply functions to values, rows, columns with map(), apply()
  • Lambda expressions in Python
print(df.apply(lambda x: x.value_counts().index[0]))
# name     Alice
# age       24.0
# state       NY
# point     70.0
# dtype: object

print(df.apply(lambda x: x.value_counts().iat[0]))
# name     1
# age      2
# state    2
# point    2
# dtype: int64

As mentioned above, by default, missing values (NaN) are excluded. If the dropna argument is set to False, they are also counted.

Be aware that if there are multiple modes, this method returns only one of them.
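
A minimal sketch of both caveats, using a small constructed Series rather than the sample data above:

s = pd.Series(['a', 'a', 'b', 'b', float('nan'), float('nan'), float('nan')])

print(s.value_counts().index[0])
# a   (only one of the tied modes 'a' and 'b'; which comes first may depend on the pandas version)

print(s.value_counts(dropna=False).index[0])
# nan   (with dropna=False, NaN has the highest count and becomes the first entry)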

mode()

The mode() method on Series returns the modes as a Series. Converting this Series to a list with tolist() allows you to obtain the modes as a list. Even if there is only one mode, it will be a list.

print(df['state'].mode())
# 0    CA
# 1    NY
# Name: state, dtype: object

print(df['state'].mode().tolist())
# ['CA', 'NY']

print(df['age'].mode().tolist())
# [24.0]

Applying it with apply() to each column results in a Series with lists of modes as values.

s_list = df.apply(lambda x: x.mode().tolist())
print(s_list)
# name     [Alice, Charlie, Dave, Ellen, Frank]
# age                                    [24.0]
# state                                [CA, NY]
# point                                  [70.0]
# dtype: object

print(type(s_list))
# <class 'pandas.core.series.Series'>

print(s_list['name'])
# ['Alice', 'Charlie', 'Dave', 'Ellen', 'Frank']

print(type(s_list['name']))
# <class 'list'>

mode() is also available as a method of DataFrame. It returns a DataFrame. If the number of modes differs for each column, the empty parts are filled with missing values (NaN).

print(df.mode())
#       name   age state  point
# 0    Alice  24.0    CA   70.0
# 1  Charlie   NaN    NY    NaN
# 2     Dave   NaN   NaN    NaN
# 3    Ellen   NaN   NaN    NaN
# 4    Frank   NaN   NaN    NaN

By default, missing values (NaN) are excluded. If the dropna argument is set to False, they are also counted. For more details about mode(), refer to the following article.

  • pandas: Get the mode (the most frequent value) with mode()
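
A minimal sketch (again with a constructed Series, not the sample data) of how dropna=False can change the result of mode():

s = pd.Series([1.0, float('nan'), float('nan')])

print(s.mode())
# 0    1.0
# dtype: float64

print(s.mode(dropna=False))
# 0   NaN
# dtype: float64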

describe()

The describe() method can calculate the number of unique values, the mode, and its frequency for each column together. top represents the mode, and freq represents its frequency. Each item can be obtained with loc[].

print(df.astype('object').describe())
#          name   age state point
# count       5   5.0     5   5.0
# unique      5   4.0     3   4.0
# top     Alice  24.0    NY  70.0
# freq        1   2.0     2   2.0

print(df.astype('object').describe().loc['top'])
# name     Alice
# age       24.0
# state       NY
# point     70.0
# Name: top, dtype: object

In describe(), the listed items depend on the data type (dtype) of the column, so astype() is used for type conversion.
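
As a minimal sketch of why the conversion matters: without astype('object'), numeric columns get numeric summary statistics, so unique, top, and freq do not appear.

print(df['age'].describe().index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

print(df['age'].astype('object').describe().index.tolist())
# ['count', 'unique', 'top', 'freq']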

describe() excludes missing values (NaN), and unlike other methods, it does not have a dropna argument. Note that even if there are several modes, this method returns only one.

Get the normalized frequencies

When the normalize argument of value_counts() is set to True, the returned values are normalized so that their total is 1. Be aware that the proportions may differ depending on the dropna setting if missing values (NaN) are included.

print(df['state'].value_counts(normalize=True))
# state
# NY    0.4
# CA    0.4
# TX    0.2
# Name: proportion, dtype: float64

print(df['state'].value_counts(dropna=False, normalize=True))
# state
# NY     0.333333
# CA     0.333333
# NaN    0.166667
# TX     0.166667
# Name: proportion, dtype: float64