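pandas Basics: Handling Missing Data
The code below assumes the following imports: import pandas as pd and import numpy as np
Missing data is common in data analysis. pandas uses the floating-point value NaN (Not a Number) to represent missing data in both floating-point and non-floating-point arrays. It is simply a sentinel that is easy to detect. Python's built-in None value is also treated as NA.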
>>> string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
>>> string_data
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
>>> string_data.isnull()
0    False
1    False
2     True
3    False
dtype: bool
>>> string_data[0] = None
>>> string_data.isnull()
0     True
1    False
2     True
3    False
dtype: bool
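The point that both NaN and None count as missing can be checked with a small script. This is a minimal sketch; the variable names `numeric` and `text` are illustrative and not from the original post.

```python
import numpy as np
import pandas as pd

# In a numeric Series, None is converted to NaN, so the dtype becomes float64.
numeric = pd.Series([1, None, 3])

# In an object Series the values stay Python objects,
# but None and NaN are both reported as missing.
text = pd.Series(["aardvark", None, np.nan], dtype=object)

numeric_missing = numeric.isnull().tolist()
text_missing = text.isnull().tolist()
print(numeric.dtype, numeric_missing, text_missing)
```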
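NA handling methods.
Method | Description |
dropna | Filters axis labels based on whether the values for each label contain missing data; a threshold controls how much missing data is tolerated |
fillna | Fills in missing data with a specified value or an interpolation method (such as ffill or bfill) |
isnull | Returns an object of Boolean values indicating which values are missing (NA); the result has the same type as the source object |
notnull | The negation of isnull |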
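As a quick illustration of the detection methods in the table, the sketch below counts missing values per column and confirms that notnull is the element-wise negation of isnull. The frame and its column names `a` and `b` are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one NA in column "a" and two in column "b".
frame = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

mask = frame.isnull()                      # Boolean DataFrame, same shape as frame
missing_per_column = mask.sum().to_dict()  # True counts as 1 when summed
is_negation = mask.equals(~frame.notnull())

print(missing_per_column, is_negation)
```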
1.1 Filtering Out Missing Data
There are several ways to filter out missing data; the most direct is dropna.
>>> from numpy import nan as NA
>>> data = pd.Series([1, NA, 3.5, NA, 7])
>>> data.dropna()
0    1.0
2    3.5
4    7.0
dtype: float64
>>> data[data.isnull()]
1   NaN
3   NaN
dtype: float64
>>> data[data.notnull()]
0    1.0
2    3.5
4    7.0
dtype: float64
With a DataFrame, you may want to drop rows or columns that are entirely NA, or those that contain any NA at all.
>>> data = pd.DataFrame([[1, 1.6, 3], [1, NA, NA], [NA, NA, NA], [NA, 6.5, 3]])
>>> data
     0    1    2
0  1.0  1.6  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
>>> cleaned = data.dropna()  # by default, drops every row that contains a missing value
>>> cleaned
     0    1    2
0  1.0  1.6  3.0
>>> data.dropna(how='all')  # drops only rows that are entirely NA
     0    1    2
0  1.0  1.6  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
>>> data[4] = NA
>>> data
     0    1    2   4
0  1.0  1.6  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN
>>> data.dropna(axis=1, how='all')  # drops columns that are entirely NA
     0    1    2
0  1.0  1.6  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
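One way to read the two `how` modes is as Boolean row masks: the default keeps rows where every value is present, and how='all' keeps rows where at least one value is present. The sketch below checks that reading on the same four-row frame; the names `complete_rows` and `non_empty_rows` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1, 1.6, 3],
                     [1, np.nan, np.nan],
                     [np.nan, np.nan, np.nan],
                     [np.nan, 6.5, 3]])

# Rows in which every value is present (same rows as the default dropna()).
complete_rows = data[data.notnull().all(axis=1)]

# Rows in which at least one value is present (same rows as how='all').
non_empty_rows = data[data.notnull().any(axis=1)]

print(complete_rows.equals(data.dropna()))
print(non_empty_rows.equals(data.dropna(how="all")))
```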
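A related way to filter DataFrame rows often comes up with time series data. If you only want to keep rows that contain at least a certain number of observations, use the thresh argument.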
>>> df = pd.DataFrame(np.random.randn(7, 3))
>>> df
          0         1         2
0  0.752301  1.360969 -0.474561
1  0.466749  0.563536  1.978575
2  0.223606  0.414722  0.094315
3 -1.687511 -0.116227  0.442363
4  0.705580 -0.131169 -0.868425
5 -0.158964 -0.164512 -0.937150
6 -0.281537 -1.579942 -0.562886
>>> df.loc[:4, 1] = NA
>>> df.loc[:2, 2] = NA
>>> df
          0         1         2
0  0.752301       NaN       NaN
1  0.466749       NaN       NaN
2  0.223606       NaN       NaN
3 -1.687511       NaN  0.442363
4  0.705580       NaN -0.868425
5 -0.158964 -0.164512 -0.937150
6 -0.281537 -1.579942 -0.562886
>>> df.dropna(thresh=3)
          0         1         2
5 -0.158964 -0.164512 -0.937150
6 -0.281537 -1.579942 -0.562886
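Because the session above uses random numbers, a fixed example may make the rule easier to see: thresh=n keeps a row only if it has at least n non-NA values. The small frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Rows hold 1, 2 and 3 non-NA values respectively.
df = pd.DataFrame([[1.0, np.nan, np.nan],
                   [2.0, np.nan, 5.0],
                   [3.0, 4.0, 6.0]])

non_na_counts = df.notnull().sum(axis=1).tolist()

kept_two = df.dropna(thresh=2)    # needs at least 2 observations per row
kept_three = df.dropna(thresh=3)  # needs all 3 observations per row

print(non_na_counts, kept_two.index.tolist(), kept_three.index.tolist())
```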
1.2 Filling In Missing Data
The fillna method replaces missing values with a constant.
>>> df
          0         1         2
0  0.752301       NaN       NaN
1  0.466749       NaN       NaN
2  0.223606       NaN       NaN
3 -1.687511       NaN  0.442363
4  0.705580       NaN -0.868425
5 -0.158964 -0.164512 -0.937150
6 -0.281537 -1.579942 -0.562886
>>> df.fillna(0)
          0         1         2
0  0.752301  0.000000  0.000000
1  0.466749  0.000000  0.000000
2  0.223606  0.000000  0.000000
3 -1.687511  0.000000  0.442363
4  0.705580  0.000000 -0.868425
5 -0.158964 -0.164512 -0.937150
6 -0.281537 -1.579942 -0.562886
>>> df.fillna({1: 0.5, 3: -1})  # NA in column 1 becomes 0.5; key 3 matches no column, so column 2 keeps its NaN
          0         1         2
0  0.752301  0.500000       NaN
1  0.466749  0.500000       NaN
2  0.223606  0.500000       NaN
3 -1.687511  0.500000  0.442363
4  0.705580  0.500000 -0.868425
5 -0.158964 -0.164512 -0.937150
6 -0.281537 -1.579942 -0.562886
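The dict form fills each column with its own value, and the same mechanism accepts a Series keyed by column label, which allows filling with per-column means. The sketch below uses a made-up frame; the column names `x` and `y` and the unused key `z` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 4.0, 8.0]})

# Per-column constants; "z" matches no column and is ignored.
by_dict = df.fillna({"x": 0.0, "z": -1.0})

# df.mean() is a Series indexed by column label, so each column
# is filled with its own mean (x: 2.0, y: 6.0).
by_mean = df.fillna(df.mean())

print(by_dict)
print(by_mean)
```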
By default fillna returns a new object, but it can also modify the existing object in place.
>>> _ = df.fillna(0, inplace=True)
>>> df
          0         1         2
0  0.752301  0.000000  0.000000
1  0.466749  0.000000  0.000000
2  0.223606  0.000000  0.000000
3 -1.687511  0.000000  0.442363
4  0.705580  0.000000 -0.868425
5 -0.158964 -0.164512 -0.937150
6 -0.281537 -1.579942 -0.562886
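The difference between the two modes can be seen on a small Series: the default call leaves the original untouched, while inplace=True changes it. This is a minimal sketch with illustrative names.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Default: a new Series is returned and s still contains its NaN.
filled_copy = s.fillna(0)
missing_before = int(s.isnull().sum())

# In place: s itself is modified.
s.fillna(0, inplace=True)
missing_after = int(s.isnull().sum())

print(filled_copy.tolist(), missing_before, missing_after)
```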
The interpolation methods available for reindex can also be used with fillna.
>>> df = pd.DataFrame(np.random.randn(6, 3))
>>> df.loc[2:, 1] = NA
>>> df.loc[4:, 2] = NA
>>> df
          0         1         2
0 -1.433489  0.162951 -0.664600
1  0.033722 -0.478252  0.480072
2 -0.000977       NaN -1.555649
3 -0.947501       NaN  0.089918
4  1.360481       NaN       NaN
5 -0.966030       NaN       NaN
>>> df.fillna(method='ffill')  # recent pandas releases deprecate the method keyword; df.ffill() is the equivalent
          0         1         2
0 -1.433489  0.162951 -0.664600
1  0.033722 -0.478252  0.480072
2 -0.000977 -0.478252 -1.555649
3 -0.947501 -0.478252  0.089918
4  1.360481 -0.478252  0.089918
5 -0.966030 -0.478252  0.089918
>>> df.fillna(method='ffill', limit=2)  # fill at most 2 consecutive NA values
          0         1         2
0 -1.433489  0.162951 -0.664600
1  0.033722 -0.478252  0.480072
2 -0.000977 -0.478252 -1.555649
3 -0.947501 -0.478252  0.089918
4  1.360481       NaN  0.089918
5 -0.966030       NaN  0.089918
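The effect of limit is easier to follow on a fixed Series. The sketch below uses the ffill and bfill methods instead of fillna(method=...); that substitution is an assumption made here so the example also runs on pandas versions where the method keyword is deprecated.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Forward fill: each NA takes the last valid value before it.
forward_all = s.ffill()

# At most 2 consecutive NA values are filled; the third stays NaN.
forward_two = s.ffill(limit=2)

# Backward fill: each NA takes the next valid value; only 1 step here.
backward_one = s.bfill(limit=1)

print(forward_all.tolist())
print(forward_two.tolist())
print(backward_one.tolist())
```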
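The table below summarizes the fillna parameters.
Parameter | Description |
value | Scalar value or dict-like object used to fill missing values |
method | Interpolation method, such as "ffill" or "bfill"; defaults to None, so either value or method must be supplied |
axis | Axis to fill along; defaults to axis=0 |
limit | For forward or backward filling, the maximum number of consecutive values to fill |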
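The axis parameter only matters for directional filling. The sketch below contrasts filling down each column (axis=0) with filling across each row (axis=1). It uses DataFrame.ffill as a stand-in for fillna with a forward-fill method, which is an assumption of this example, and the frame is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, np.nan, 3.0],
                   [np.nan, 5.0, np.nan]])

# axis=0 (default): values propagate down each column.
down = df.ffill(axis=0)

# axis=1: values propagate left to right along each row.
across = df.ffill(axis=1)

print(down)
print(across)
```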