pandas Basics: Essential Functionality

QianYanDai

2020-02-03

This section covers the basic mechanics of working with the data in a Series or DataFrame.

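1.1 Reindexing

The reindex method creates a new object that conforms to a new index. Calling reindex rearranges the data according to the new index and introduces missing values (NaN) for any index labels that were not already present.

>>> import numpy as np
>>> import pandas as pd
>>> obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])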
>>> obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
>>> obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
>>> obj2
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
>>> obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
>>>

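For ordered data such as time series, you may want to interpolate or fill values when reindexing. The method option does this; for example, ffill forward-fills the values.

>>> obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
>>> obj3.reindex(range(6), method='ffill')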
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

The available method options are listed below.

ffill or pad	Fill (or carry) values forward
bfill or backfill	Fill (or carry) values backward
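
As a small sketch of the difference between the two options, the snippet below reindexes the same Series with both ffill and bfill. The variable names forward and backward are illustrative only and do not appear in the original example.

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Forward fill: each new label takes the value of the nearest earlier label.
forward = obj3.reindex(range(6), method='ffill')

# Backward fill: each new label takes the value of the nearest later label.
# Label 5 has no later label in the original index, so it stays missing.
backward = obj3.reindex(range(6), method='bfill')

print(forward)
print(backward)
```

Both options expect the original index to be sorted, which is the case here.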

With a DataFrame, reindex can alter the (row) index, the columns, or both. When passed just a sequence, it reindexes the rows.

>>> frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'])
>>> frame
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
>>> frame2 = frame.reindex(['a', 'b', 'c', 'd'])  # reindex the rows
>>> frame2
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0
>>> states = ['Texas', 'Utah', 'California']
>>> frame.reindex(columns=states)  # reindex the columns with the columns keyword
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8
>>> frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill')
   Ohio  Texas  California
a     0      1           2
b     0      1           2
c     3      4           5
d     6      7           8
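
Rows and columns can also be reindexed in a single call by passing both index and columns. The following is a minimal sketch that rebuilds the frame from above; the name both is made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# Reindex both axes at once: row 'b' and column 'Utah' are new,
# so they are filled with NaN.
both = frame.reindex(index=['a', 'b', 'c', 'd'],
                     columns=['Texas', 'Utah', 'California'])
print(both)
```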

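The table below lists the arguments of reindex.

Argument	Description
index	New sequence to use as the index. Can be an Index instance or any other sequence-like Python data structure. An Index is used exactly as is, without any copying
method	Interpolation (fill) method; see the table above
fill_value	Substitute value to use when reindexing introduces missing data
limit	Maximum number of elements to fill when forward- or backward-filling
level	Match a simple Index on the specified level of a MultiIndex; otherwise select a subset
copy	Defaults to True, which always copies the data; if False, the data is not copied when the new index equals the old one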
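
To illustrate the limit argument from the table, here is a hedged sketch. The Series is invented for this purpose and does not come from the examples above; the name filled is likewise illustrative.

```python
import pandas as pd

# Two known values with a gap of four labels between them.
obj = pd.Series([1.0, 2.0], index=[0, 5])

# Forward-fill, but carry each value across at most 2 consecutive new labels.
filled = obj.reindex(range(8), method='ffill', limit=2)
print(filled)
```

Labels 1 and 2 receive the value from label 0, while labels 3 and 4 are beyond the limit and remain NaN.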

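1.2 Dropping entries from an axis

Dropping one or more entries from an axis is easy if you have an index array or list. Because it can require some data munging and set logic, the drop method returns a new object with the specified values removed from the given axis.

>>> obj = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
>>> new_obj = obj.drop('c')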
>>> new_obj
a    0
b    1
d    3
e    4
dtype: int32
>>> obj.drop(['d', 'c'])
a    0
b    1
e    4
dtype: int32

With a DataFrame, index values can be deleted from either axis.

>>> data = pd.DataFrame(np.arange(16).reshape((4, 4)), index=['Oh', 'Co', 'Ut', 'New'], columns=['one', 'two', 'three', 'four'])
>>> data
     one  two  three  four
Oh     0    1      2     3
Co     4    5      6     7
Ut     8    9     10    11
New   12   13     14    15
>>> data.drop('two', axis=1)
     one  three  four
Oh     0      2     3
Co     4      6     7
Ut     8     10    11
New   12     14    15
>>> data.drop(['two', 'four'], axis=1)
     one  three
Oh     0      2
Co     4      6
Ut     8     10
New   12     14
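
The examples above only drop columns. As a complementary sketch, the snippet below drops rows and also checks that the original object is left untouched. The names rows_dropped and both_dropped are illustrative; the index= and columns= keywords are an alternative to axis= in reasonably recent pandas versions.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Oh', 'Co', 'Ut', 'New'],
                    columns=['one', 'two', 'three', 'four'])

# Drop rows by label (axis=0 is the default).
rows_dropped = data.drop(['Co', 'Oh'])

# Drop a row and a column in one call using keywords instead of axis=.
both_dropped = data.drop(index=['Co'], columns=['two'])

# drop returned new objects, so the original still has all rows and columns.
print(data.shape)
print(rows_dropped)
print(both_dropped)
```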

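1.3 Indexing, selection, and filtering

Series indexing (obj[...]) works analogously to NumPy array indexing, except that you can use the Series's index labels instead of only integers. Note that integer keys such as obj[1] on a label-indexed Series rely on a positional fallback that recent pandas releases deprecate; obj.iloc[1] is the explicit positional form.

>>> obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
>>> obj['b']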
1
>>> obj[1]   
1
>>> obj[2:4] 
c    2
d    3
dtype: int32
>>> obj[['b', 'a', 'd']]
b    1
a    0
d    3
dtype: int32
>>> obj[[1, 3]]          
b    1
d    3
dtype: int32
>>> obj[obj < 2] 
a    0
b    1
dtype: int32
>>>

Slicing with labels differs from ordinary Python slicing in that the endpoint is inclusive.

>>> obj
a    0
b    1
c    2
d    3
dtype: int32
>>> obj['b':'d']
b    1
c    2
d    3
dtype: int32
>>> obj['b':'d'] = 5  # assignment through a label slice
>>> obj
a    0
b    5
c    5
d    5
dtype: int32
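
The same contrast can be made explicit with the loc (label-based) and iloc (position-based) indexers. They are not used in the text above, but they are standard pandas accessors; the variable names below are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

# Label-based slice: the end label 'c' is included.
by_label = obj.loc['b':'c']

# Position-based slice: the end position 2 is excluded, as in plain Python.
by_position = obj.iloc[1:2]

print(by_label)
print(by_position)
```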

Indexing into a DataFrame retrieves one or more columns.

>>> data
     one  two  three  four
Oh     0    1      2     3
Co     4    5      6     7
Ut     8    9     10    11
New   12   13     14    15
>>> data['two']
Oh      1
Co      5
Ut      9
New    13
Name: two, dtype: int32
>>> data[['three', 'one']]
     three  one
Oh       2    0
Co       6    4
Ut      10    8
New     14   12
>>> data[:2]   # select rows by slicing
    one  two  three  four
Oh    0    1      2     3
Co    4    5      6     7
>>> data[data['three'] > 5]  # select rows with a boolean array
     one  two  three  four
Co     4    5      6     7
Ut     8    9     10    11
New   12   13     14    15
>>> data < 5
       one    two  three   four
Oh    True   True   True   True
Co    True  False  False  False
Ut   False  False  False  False
New  False  False  False  False
>>> data[data < 5] = 0  # set values using a boolean DataFrame
>>> data
     one  two  three  four
Oh     0    0      0     0
Co     0    5      6     7
Ut     8    9     10    11
New   12   13     14    15
>>>

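The DataFrame indexing options are summarized in the table below:

Type	Description
obj[val]	Selects a single column or a set of columns from a DataFrame. Convenient in a few special cases: boolean array (filter rows), slice (slice rows), boolean DataFrame (set values based on a condition)
reindex method	Conforms one or more axes to a new index
xs	Selects a single row or column by label and returns a Series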
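
The xs entry in the table has no example above, so here is a minimal sketch using the original (unmodified) data frame. The names row and col are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Oh', 'Co', 'Ut', 'New'],
                    columns=['one', 'two', 'three', 'four'])

# Select a single row by label; the result is a Series indexed by column name.
row = data.xs('Co')

# Select a single column by label by switching to axis=1.
col = data.xs('two', axis=1)

print(row)
print(col)
```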

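1.4 Arithmetic and data alignment

pandas can perform arithmetic between objects with different indexes. When objects are added, if any index pairs differ, the index of the result is the union of the index pairs. Automatic data alignment introduces NaN values at labels that do not overlap, and missing values then propagate through further arithmetic.

>>> s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
>>> s2 = pd.Series([5, 6, 7, 8], index=['a', 'c', 'e', 'f'])
>>> s1 + s2  # addition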
a    6.0
b    NaN
c    9.0
d    NaN
e    NaN
f    NaN
dtype: float64
>>>

For a DataFrame, alignment is performed on both the rows and the columns. Adding two DataFrames returns a new DataFrame whose index and columns are the unions of those of the two operands.

>>> df1 = pd.DataFrame(np.arange(9).reshape((3, 3)), columns=list('bcd'), index=['one', 'two', 'three'])
>>> df2 = pd.DataFrame(np.arange(4).reshape((2, 2)), columns=list('be'), index=['two', 'four'])
>>> df1
       b  c  d
one    0  1  2
two    3  4  5
three  6  7  8
>>> df2
      b  e
two   0  1
four  2  3
>>> df1 + df2  # addition
         b   c   d   e
four   NaN NaN NaN NaN
one    NaN NaN NaN NaN
three  NaN NaN NaN NaN
two    3.0 NaN NaN NaN
>>>

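1.5 Arithmetic methods with fill values

When doing arithmetic between differently indexed objects, you may want to fill in a special value (such as 0) when an axis label is found in one object but not in the other. A cell stays NaN only where the label is missing from both objects.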

>>> df1
       b  c  d
one    0  1  2
two    3  4  5
three  6  7  8
>>> df2
      b  e
two   0  1
four  2  3
>>> df1.add(df2, fill_value=0)
         b    c    d    e
four   2.0  NaN  NaN  3.0
one    0.0  1.0  2.0  NaN
three  6.0  7.0  8.0  NaN
two    3.0  4.0  5.0  1.0
>>> df1.add(df2, fill_value=1) 
         b    c    d    e
four   3.0  NaN  NaN  4.0
one    1.0  2.0  3.0  NaN
three  7.0  8.0  9.0  NaN
two    3.0  5.0  6.0  2.0
>>> df1.reindex(columns=df2.columns, fill_value=0) 
       b  e
one    0  0
two    3  0
three  6  0

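The flexible arithmetic methods are listed below:

Method	Description
add	Method for addition (+)
sub	Method for subtraction (-)
div	Method for division (/)
mul	Method for multiplication (*)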
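
Only add is demonstrated above. As a hedged sketch, the other methods in the table accept fill_value in the same way; the snippet reuses df1 and df2, and the names diff and prod are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                   columns=list('bcd'), index=['one', 'two', 'three'])
df2 = pd.DataFrame(np.arange(4).reshape((2, 2)),
                   columns=list('be'), index=['two', 'four'])

# Subtraction: a value missing on one side is treated as 0.
diff = df1.sub(df2, fill_value=0)

# Multiplication: 1 is the neutral fill value, so the existing value is kept.
prod = df1.mul(df2, fill_value=1)

print(diff)
print(prod)
```

Choosing the fill value to match the operation (0 for add and sub, 1 for mul and div) keeps a value that exists on only one side from being distorted.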

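1.6 Operations between DataFrame and Series

Operations between a DataFrame and a Series are also well defined. As a motivating example, consider the difference between a two-dimensional array and one of its rows.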

>>> arr = np.arange(12).reshape((3, 4))
>>> arr
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> arr[0]
array([0, 1, 2, 3])
>>> arr - arr[0]  # the row is broadcast across all rows
array([[0, 0, 0, 0],
       [4, 4, 4, 4],
       [8, 8, 8, 8]])

By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts down the rows.

>>> frame = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=list('bde'), index=['Ut', 'Oh', 'Te', 'Or'])
>>> frame
    b   d   e
Ut  0   1   2
Oh  3   4   5
Te  6   7   8
Or  9  10  11
>>> series = pd.Series(np.arange(3), index=['b', 'd', 'e'])
>>> series
b    0
d    1
e    2
dtype: int32
>>> frame - series 
    b  d  e
Ut  0  0  0
Oh  3  3  3
Te  6  6  6
Or  9  9  9
>>>

If an index value is not found in either the DataFrame's columns or the Series's index, the objects are reindexed to form the union.

>>> series2 = pd.Series(range(3), index=list('bef'))
>>> series2
b    0
e    1
f    2
dtype: int64
>>> frame
    b   d   e
Ut  0   1   2
Oh  3   4   5
Te  6   7   8
Or  9  10  11
>>> frame + series2
      b   d     e   f
Ut  0.0 NaN   3.0 NaN
Oh  3.0 NaN   6.0 NaN
Te  6.0 NaN   9.0 NaN
Or  9.0 NaN  12.0 NaN
>>>

To instead match on the rows and broadcast across the columns, you must use one of the arithmetic methods.

>>> series3 = frame['d']
>>> frame
    b   d   e
Ut  0   1   2
Oh  3   4   5
Te  6   7   8
Or  9  10  11
>>> series3
Ut     1
Oh     4
Te     7
Or    10
Name: d, dtype: int32
>>> frame.sub(series3, axis=0) 
    b  d  e
Ut -1  0  1
Oh -1  0  1
Te -1  0  1
Or -1  0  1
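
A common use of this pattern is centering each row on its own mean. The sketch below assumes the same frame as above; row_means and centered are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     columns=list('bde'), index=['Ut', 'Oh', 'Te', 'Or'])

# One mean per row, indexed by the row labels.
row_means = frame.mean(axis=1)

# axis=0 matches the Series against the row index and broadcasts across columns.
centered = frame.sub(row_means, axis=0)
print(centered)
```

Writing frame - row_means instead would try to match the row labels against the column names and produce only NaN.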

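1.7 Function application and mapping

NumPy ufuncs (element-wise array methods) also work on pandas objects. In the examples below, frame is the DataFrame from section 1.6 with every value in column d replaced by -3.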

>>> frame
    b  d   e
Ut  0 -3   2
Oh  3 -3   5
Te  6 -3   8
Or  9 -3  11
>>> np.abs(frame) 
    b  d   e
Ut  0  3   2
Oh  3  3   5
Te  6  3   8
Or  9  3  11

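Another frequent operation is applying a function to the one-dimensional arrays formed by each column or row. The DataFrame apply method does this: by default the function is called once per column, and with axis=1 once per row.

>>> f = lambda x: x.max() - x.min()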
>>> frame
    b  d   e
Ut  0 -3   2
Oh  3 -3   5
Te  6 -3   8
Or  9 -3  11
>>> frame.apply(f)
b    9
d    0
e    9
dtype: int64
>>> frame.apply(f, axis=1) 
Ut     5
Oh     8
Te    11
Or    14
dtype: int64
>>>

The function passed to apply need not return a scalar; it can also return a Series with multiple values.

>>> def f(x):
...     return pd.Series([x.min(), x.max()], index=['min', 'max'])
... 
>>> frame
    b  d   e
Ut  0 -3   2
Oh  3 -3   5
Te  6 -3   8
Or  9 -3  11
>>> frame.apply(f)
     b  d   e
min  0 -3   2
max  9 -3  11
>>>

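Element-wise Python functions can be used as well. For example, to compute a formatted string from each value in frame, use applymap. (pandas 2.1 and later deprecate applymap in favor of the equivalent DataFrame.map.)

>>> fmt = lambda x: '%.2f' % x  # named fmt to avoid shadowing the built-in format
>>> frame.applymap(fmt)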
       b      d      e
Ut  0.00  -3.00   2.00
Oh  3.00  -3.00   5.00
Te  6.00  -3.00   8.00
Or  9.00  -3.00  11.00
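
For a single column, the Series map method applies an element-wise function in the same way. This is a small sketch; the frame is rebuilt here with the values shown above, and the names fmt and formatted are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'b': [0, 3, 6, 9],
                      'd': [-3, -3, -3, -3],
                      'e': [2, 5, 8, 11]},
                     index=['Ut', 'Oh', 'Te', 'Or'])

def fmt(x):
    """Format a number with two decimal places."""
    return '%.2f' % x

# Series.map calls fmt once per element of column 'e'.
formatted = frame['e'].map(fmt)
print(formatted)
```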

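1.8 Sorting and ranking

(1) Sorting

Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object.

>>> obj = pd.Series(range(4), index=['d', 'e', 'b', 'c'])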
>>> obj.sort_index()
b    2
c    3
d    0
e    1
dtype: int64

With a DataFrame, you can sort by the index on either axis.

>>> frame = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'e', 'b', 'c'])
>>> frame.sort_index()
       d  e  b  c
one    4  5  6  7
three  0  1  2  3
>>> frame.sort_index(axis=1)  # sort along axis 1 (the column labels)
       b  c  d  e
three  2  3  0  1
one    6  7  4  5
>>> frame.sort_index(axis=1, ascending=False)  # ascending is the default; sort descending instead
       e  d  c  b
three  1  0  3  2
one    5  4  7  6
>>>
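
The examples above sort by index labels only. Sorting by the data values themselves is done with sort_values, which the text does not cover; the following is a supplementary sketch with made-up data and illustrative names.

```python
import pandas as pd

obj = pd.Series([4, 7, -3, 2])

# Sort a Series by its values; the original index labels travel with the values.
sorted_obj = obj.sort_values()

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort a DataFrame by column 'a' first, then by column 'b' within equal 'a'.
by_a_then_b = frame.sort_values(by=['a', 'b'])

print(sorted_obj)
print(by_a_then_b)
```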

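(2) Ranking

Ranking is closely related to sorting: it assigns ranks from 1 up to the number of valid data points in the array. It is done with the rank method, which by default breaks ties by assigning each group of equal values the mean rank.

This can be hard to follow at first; the table below may help. The "Manual rank" column is the rank you would assign by hand, breaking ties by order of appearance.

Index	Value	Manual rank	method='average'	method='min'	method='max'	method='first'
0	7	6	6.5	6	7	6
1	-5	1	1	1	1	1
2	7	7	6.5	6	7	7
3	4	4	4.5	4	5	4
4	2	3	3	3	3	3
5	0	2	2	2	2	2
6	4	5	4.5	4	5	5

The method options are described below.

method	Description
'average'	Default: assign the average rank to each value in a group of equal values
'min'	Use the minimum rank for the whole group
'max'	Use the maximum rank for the whole group
'first'	Assign ranks in the order the values appear in the data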

Example:

>>> obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
>>> obj.rank()
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
>>> obj
0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64
>>> obj.rank(method='first')  # break ties by the order in which values appear in the data
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
>>> obj.rank(ascending=False, method='max')  # rank in descending order
0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
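
To compare the tie-breaking methods side by side, the following sketch collects the result of each method into one DataFrame. The name ranks is illustrative.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method, all computed in ascending order.
methods = ['average', 'min', 'max', 'first']
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})
print(ranks)
```

The methods differ only for the tied values (the two 7s and the two 4s); every other row has the same rank in all four columns.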

A DataFrame can compute ranks over the rows or the columns:

>>> frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1], 'c': [-2, 5, 8, -2.5]})
>>> frame
     b  a    c
0  4.3  0 -2.0
1  7.0  1  5.0
2 -3.0  0  8.0
3  2.0  1 -2.5
>>> frame.rank(axis=1)  
     b    a    c
0  3.0  2.0  1.0
1  3.0  1.0  2.0
2  1.0  2.0  3.0
3  3.0  2.0  1.0

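1.9 Axis indexes with duplicate values

pandas does not require axis labels to be unique. Data selection behaves differently when an index has duplicates: indexing a label that maps to multiple entries returns a Series, while a label that maps to a single entry returns a scalar. The same logic extends to indexing the rows of a DataFrame.

>>> obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])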
>>> obj
a    0
a    1
b    2
b    3
c    4
dtype: int64
>>> obj.index.is_unique
False
>>> obj['a']
a    0
a    1
dtype: int64
>>> obj['c']
4
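
The text mentions DataFrame without showing it, so here is a minimal sketch with a made-up DataFrame whose row index contains a duplicate label. The names df, dup and single are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((4, 3)),
                  index=['a', 'a', 'b', 'c'],
                  columns=['x', 'y', 'z'])

# 'a' appears twice in the index, so selecting it returns a DataFrame.
dup = df.loc['a']

# 'b' appears once, so selecting it returns a Series (one row).
single = df.loc['b']

print(type(dup).__name__, dup.shape)
print(type(single).__name__, single.tolist())
```

Because the return type depends on the data, code that relies on a particular type may want to check index.is_unique first.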
