10 Minutes to pandas

2017-03-21
Python

The official pandas introductory tutorial: 10 Minutes to pandas

The operations below depend on the following packages, so import them before anything else:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

1. Object Creation

  1. Convert an array of values into a Series object

    In [4]: s = pd.Series([1,3,5,np.nan,6,8])
    In [5]: s
    Out[5]:
    0 1.0
    1 3.0
    2 5.0
    3 NaN
    4 6.0
    5 8.0
    dtype: float64
  2. Convert a NumPy array into a DataFrame

    • pandas.date_range

      Create a DatetimeIndex starting at 2017-03-20 and spanning 7 days; the default freq is 'D'. If the periods argument is omitted, the end date must be specified instead.

    In [6]: dates = pd.date_range('20170320', periods=7)
    In [7]: dates
    Out[7]:
    DatetimeIndex(['2017-03-20', '2017-03-21', '2017-03-22', '2017-03-23',
    '2017-03-24', '2017-03-25', '2017-03-26'],
    dtype='datetime64[ns]', freq='D')
    In [8]: dates = pd.date_range('20170320','20170327')
    In [9]: dates
    Out[9]:
    DatetimeIndex(['2017-03-20', '2017-03-21', '2017-03-22', '2017-03-23',
    '2017-03-24', '2017-03-25', '2017-03-26', '2017-03-27'],
    dtype='datetime64[ns]', freq='D')
    # change the freq
    In[10]: dates = pd.date_range('20170320', periods=4,freq='w-mon')
    In[11]: dates
    Out[11]:
    DatetimeIndex(['2017-03-20', '2017-03-27', '2017-04-03', '2017-04-10'], dtype='datetime64[ns]', freq='W-MON')
    • np.random.randn

      np.random.randn(4,4) generates a 4×4 NumPy array whose entries are drawn from the standard normal distribution.

    • index=dates uses dates as the row index.

    • columns specifies the column names.

    In[86]: df = pd.DataFrame(np.random.randn(4,4), index=dates, columns=list('ABCD'))
    In[87]: df
    Out[87]:
    A B C D
    2017-03-20 -0.273931 1.464139 0.511461 0.707071
    2017-03-27 0.290527 0.799572 0.526141 0.696459
    2017-04-03 0.390172 0.171379 -1.321742 0.941358
    2017-04-10 0.963824 -1.089391 0.133069 1.007356
  3. Convert a dict into a DataFrame

    In[88]: df2 = pd.DataFrame({ 'A' : 1.,
    ....: 'B' : pd.Timestamp('20130102'),
    ....: 'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
    ....: 'D' : np.array([3] * 4,dtype='int32'),
    ....: 'E' : pd.Categorical(["test","train","test","train"]),
    ....: 'F' : 'foo' })
    In[89]: df2
    Out[89]:
    A B C D E F
    0 1.0 2013-01-02 1.0 3 test foo
    1 1.0 2013-01-02 1.0 3 train foo
    2 1.0 2013-01-02 1.0 3 test foo
    3 1.0 2013-01-02 1.0 3 train foo

2. Viewing Data

  1. dtypes: view the data type of each column

    In[94]: df2.dtypes
    Out[94]:
    A float64
    B datetime64[ns]
    C float32
    D int32
    E category
    F object
    dtype: object
    In[96]: df['A'].dtype
    Out[96]:
    dtype('float64')
  2. head(), tail(): view the first and last few rows of the DataFrame

    In[97]: df.head()
    Out[97]:
    A B C D
    2017-03-20 -0.273931 1.464139 0.511461 0.707071
    2017-03-27 0.290527 0.799572 0.526141 0.696459
    2017-04-03 0.390172 0.171379 -1.321742 0.941358
    2017-04-10 0.963824 -1.089391 0.133069 1.007356
    In[98]: df.tail(2)
    Out[98]:
    A B C D
    2017-04-03 0.390172 0.171379 -1.321742 0.941358
    2017-04-10 0.963824 -1.089391 0.133069 1.007356
  3. View the index: df.index

    In[99]: df.index
    Out[99]:
    DatetimeIndex(['2017-03-20', '2017-03-27', '2017-04-03', '2017-04-10'], dtype='datetime64[ns]', freq='W-MON')
  4. View the columns: df.columns

    In[100]: df.columns
    Out[100]:
    Index(['A', 'B', 'C', 'D'], dtype='object')
  5. View the underlying NumPy data: df.values

    In[101]: df.values
    Out[101]:
    array([[-0.27393096, 1.46413903, 0.5114613 , 0.70707094],
    [ 0.29052687, 0.79957181, 0.52614137, 0.69645894],
    [ 0.39017244, 0.17137885, -1.32174227, 0.9413581 ],
    [ 0.9638237 , -1.08939135, 0.13306936, 1.00735588]])
  6. df.describe(): show a quick statistical summary of the DataFrame

    In[102]: df.describe()
    Out[102]:
    A B C D
    count 4.000000 4.000000 4.000000 4.000000
    mean 0.342648 0.336425 -0.037768 0.838061
    std 0.506953 1.087265 0.875104 0.159730
    min -0.273931 -1.089391 -1.321742 0.696459
    25% 0.149412 -0.143814 -0.230634 0.704418
    50% 0.340350 0.485475 0.322265 0.824215
    75% 0.533585 0.965714 0.515131 0.957858
    max 0.963824 1.464139 0.526141 1.007356
  7. df.T: transpose the DataFrame; the index and columns are swapped accordingly

    In[103]: df.T
    Out[103]:
    2017-03-20 2017-03-27 2017-04-03 2017-04-10
    A -0.273931 0.290527 0.390172 0.963824
    B 1.464139 0.799572 0.171379 -1.089391
    C 0.511461 0.526141 -1.321742 0.133069
    D 0.707071 0.696459 0.941358 1.007356
    In[104]: df.T.index
    Out[104]:
    Index(['A', 'B', 'C', 'D'], dtype='object')
  8. df.sort_index(): sort by labels. The default axis=0 sorts by the row index; axis=1 sorts by the column labels

    In[119]: df.sort_index(ascending=True)
    Out[119]:
    A B C D
    2017-03-20 -0.273931 1.464139 0.511461 0.707071
    2017-03-27 0.290527 0.799572 0.526141 0.696459
    2017-04-03 0.390172 0.171379 -1.321742 0.941358
    2017-04-10 0.963824 -1.089391 0.133069 1.007356
    In[120]: df.sort_index(ascending=False)
    Out[120]:
    A B C D
    2017-04-10 0.963824 -1.089391 0.133069 1.007356
    2017-04-03 0.390172 0.171379 -1.321742 0.941358
    2017-03-27 0.290527 0.799572 0.526141 0.696459
    2017-03-20 -0.273931 1.464139 0.511461 0.707071
    In[122]: df.sort_index(axis=1, ascending=False)
    Out[122]:
    D C B A
    2017-03-20 0.707071 0.511461 1.464139 -0.273931
    2017-03-27 0.696459 0.526141 0.799572 0.290527
    2017-04-03 0.941358 -1.321742 0.171379 0.390172
    2017-04-10 1.007356 0.133069 -1.089391 0.963824
    In[123]: df.sort_index(axis=1, ascending=True)
    Out[123]:
    A B C D
    2017-03-20 -0.273931 1.464139 0.511461 0.707071
    2017-03-27 0.290527 0.799572 0.526141 0.696459
    2017-04-03 0.390172 0.171379 -1.321742 0.941358
    2017-04-10 0.963824 -1.089391 0.133069 1.007356
  9. df.sort_values(): sort by values

    In[124]: df.sort_values(by='A')
    Out[124]:
    A B C D
    2017-03-20 -0.273931 1.464139 0.511461 0.707071
    2017-03-27 0.290527 0.799572 0.526141 0.696459
    2017-04-03 0.390172 0.171379 -1.321742 0.941358
    2017-04-10 0.963824 -1.089391 0.133069 1.007356
    In[125]: df.sort_values(by='A',ascending=False)
    Out[125]:
    A B C D
    2017-04-10 0.963824 -1.089391 0.133069 1.007356
    2017-04-03 0.390172 0.171379 -1.321742 0.941358
    2017-03-27 0.290527 0.799572 0.526141 0.696459
    2017-03-20 -0.273931 1.464139 0.511461 0.707071

3. Selection

3.1 Getting

  1. df.A, df['A']: select column A; the two forms are equivalent

    In[10]: df.A
    Out[10]:
    2017-03-20 0.387798
    2017-03-27 0.221780
    2017-04-03 -1.585317
    2017-04-10 0.035160
    Freq: W-MON, Name: A, dtype: float64
    In[11]: df['A']
    Out[11]:
    2017-03-20 0.387798
    2017-03-27 0.221780
    2017-04-03 -1.585317
    2017-04-10 0.035160
    Freq: W-MON, Name: A, dtype: float64
  2. Select rows by slicing, either by position or by label

    In[12]: df[0:3]
    Out[12]:
    A B C D
    2017-03-20 0.387798 -0.058203 -0.169525 -0.612571
    2017-03-27 0.221780 -1.170097 0.876672 0.552794
    2017-04-03 -1.585317 2.253459 0.109537 0.028095
    In[14]: df['2017-03-20':'2017-04-03']
    Out[14]:
    A B C D
    2017-03-20 0.387798 -0.058203 -0.169525 -0.612571
    2017-03-27 0.221780 -1.170097 0.876672 0.552794
    2017-04-03 -1.585317 2.253459 0.109537 0.028095

3.2 Selection by Label

  1. df.loc: select data by index label and column name

    df.at: similar to df.loc, but it can only retrieve a single scalar value; for getting one element, at is the faster method

    In[16]: df.loc['2017-03-20']
    Out[16]:
    A 0.387798
    B -0.058203
    C -0.169525
    D -0.612571
    Name: 2017-03-20 00:00:00, dtype: float64
    In[17]: df.loc[dates[0]]
    Out[17]:
    A 0.387798
    B -0.058203
    C -0.169525
    D -0.612571
    Name: 2017-03-20 00:00:00, dtype: float64
    In[34]: df.loc[:,'A']
    Out[34]:
    2017-03-20 0.387798
    2017-03-27 0.221780
    2017-04-03 -1.585317
    2017-04-10 0.035160
    Freq: W-MON, Name: A, dtype: float64
    In[36]: df.loc[dates[0:3],'A']
    Out[36]:
    2017-03-20 0.387798
    2017-03-27 0.221780
    2017-04-03 -1.585317
    Freq: W-MON, Name: A, dtype: float64
    In[37]: df.loc[dates[0:3],['A','B']]
    Out[37]:
    A B
    2017-03-20 0.387798 -0.058203
    2017-03-27 0.221780 -1.170097
    2017-04-03 -1.585317 2.253459
    In[39]: df.loc[:,['A','B']]
    Out[39]:
    A B
    2017-03-20 0.387798 -0.058203
    2017-03-27 0.221780 -1.170097
    2017-04-03 -1.585317 2.253459
    2017-04-10 0.035160 0.143292
    In[74]: df.at[dates[0],'A']
    Out[74]:
    0.38779782953047309

3.3 Selection by Position

  1. df.iloc: select data by integer position, similar to matrix indexing in MATLAB

    df.iat: similar to df.iloc, but it can only retrieve a single scalar value

    In[44]: df.iloc[3]
    Out[44]:
    A 0.035160
    B 0.143292
    C 0.139537
    D -0.823278
    Name: 2017-04-10 00:00:00, dtype: float64
    In[47]: df.iloc[2:4]
    Out[47]:
    A B C D
    2017-04-03 -1.585317 2.253459 0.109537 0.028095
    2017-04-10 0.035160 0.143292 0.139537 -0.823278
    In[43]: df.iloc[2:4,0:2]
    Out[43]:
    A B
    2017-04-03 -1.585317 2.253459
    2017-04-10 0.035160 0.143292
    In[49]: df.iloc[:,2:4]
    Out[49]:
    C D
    2017-03-20 -0.169525 -0.612571
    2017-03-27 0.876672 0.552794
    2017-04-03 0.109537 0.028095
    2017-04-10 0.139537 -0.823278
    In[50]: df.iloc[3,2]
    Out[50]:
    0.13953667147016041
    In[75]: df.iat[0,0]
    Out[75]:
    0.38779782953047309

3.4 Boolean Indexing

  1. Filter rows with a boolean condition; applying a condition to the whole DataFrame masks the cells that fail it with NaN:

    In[52]: df[df.A > 0]
    Out[52]:
    A B C D
    2017-03-20 0.387798 -0.058203 -0.169525 -0.612571
    2017-03-27 0.221780 -1.170097 0.876672 0.552794
    2017-04-10 0.035160 0.143292 0.139537 -0.823278
    In[53]: df[df > 0]
    Out[53]:
    A B C D
    2017-03-20 0.387798 NaN NaN NaN
    2017-03-27 0.221780 NaN 0.876672 0.552794
    2017-04-03 NaN 2.253459 0.109537 0.028095
    2017-04-10 0.035160 0.143292 0.139537 NaN
  2. isin()

    In[67]: df2 = df.copy()
    In[68]: df2['E'] = ['one', 'one','two','three']
    In[69]: df2
    Out[69]:
    A B C D E
    2017-03-20 0.387798 -0.058203 -0.169525 -0.612571 one
    2017-03-27 0.221780 -1.170097 0.876672 0.552794 one
    2017-04-03 -1.585317 2.253459 0.109537 0.028095 two
    2017-04-10 0.035160 0.143292 0.139537 -0.823278 three
    In[70]: df2['E'].isin( ['one', 'one','two'])
    Out[70]:
    2017-03-20 True
    2017-03-27 True
    2017-04-03 True
    2017-04-10 False
    Freq: W-MON, Name: E, dtype: bool
    In[71]: df2[df2['E'].isin(['one', 'one','two'])]
    Out[71]:
    A B C D E
    2017-03-20 0.387798 -0.058203 -0.169525 -0.612571 one
    2017-03-27 0.221780 -1.170097 0.876672 0.552794 one
    2017-04-03 -1.585317 2.253459 0.109537 0.028095 two

4. Missing Data

Missing values are represented by NaN.

df.reindex: change the index; labels that did not exist before get NaN values.

df.dropna(): drop rows containing missing values.

df.fillna(): fill in missing values.

In[78]: df.reindex(index=pd.date_range('2017-03-27',periods=6,freq='W-MON'))
Out[78]:
A B C D
2017-03-27 0.221780 -1.170097 0.876672 0.552794
2017-04-03 -1.585317 2.253459 0.109537 0.028095
2017-04-10 0.035160 0.143292 0.139537 -0.823278
2017-04-17 NaN NaN NaN NaN
2017-04-24 NaN NaN NaN NaN
2017-05-01 NaN NaN NaN NaN
In[79]: df3 = df.reindex(index=pd.date_range('2017-03-27',periods=6,freq='W-MON'))
In[80]: df3.dropna(how='any')
Out[80]:
A B C D
2017-03-27 0.221780 -1.170097 0.876672 0.552794
2017-04-03 -1.585317 2.253459 0.109537 0.028095
2017-04-10 0.035160 0.143292 0.139537 -0.823278
In[81]: df3.fillna(value=1)
Out[81]:
A B C D
2017-03-27 0.221780 -1.170097 0.876672 0.552794
2017-04-03 -1.585317 2.253459 0.109537 0.028095
2017-04-10 0.035160 0.143292 0.139537 -0.823278
2017-04-17 1.000000 1.000000 1.000000 1.000000
2017-04-24 1.000000 1.000000 1.000000 1.000000
2017-05-01 1.000000 1.000000 1.000000 1.000000
In[82]: pd.isnull(df3)
Out[82]:
A B C D
2017-03-27 False False False False
2017-04-03 False False False False
2017-04-10 False False False False
2017-04-17 True True True True
2017-04-24 True True True True
2017-05-01 True True True True

5. Operations

5.1 Stats

  1. df.mean(): by default computes the mean over the index (one value per column); df.mean(1) computes the mean across columns (one value per row)

    In[83]: df.mean(0)
    Out[83]:
    A -0.235145
    B 0.292113
    C 0.239055
    D -0.213740
    dtype: float64
    In[84]: df.mean()
    Out[84]:
    A -0.235145
    B 0.292113
    C 0.239055
    D -0.213740
    dtype: float64
    In[85]: df.mean(1)
    Out[85]:
    2017-03-20 -0.113125
    2017-03-27 0.120287
    2017-04-03 0.201443
    2017-04-10 -0.126322
    Freq: W-MON, dtype: float64
  2. df.sub for subtraction, df.add for addition; axis=0 aligns the Series with the row index

    In[112]: df.sub(df['A'],axis=0)
    Out[112]:
    A B C D
    2017-03-20 0.0 -0.446001 -0.557323 -1.000369
    2017-03-27 0.0 -1.391877 0.654892 0.331014
    2017-04-03 0.0 3.838775 1.694853 1.613412
    2017-04-10 0.0 0.108132 0.104377 -0.858438
    In[113]: df.add(df['A'],axis=0)
    Out[113]:
    A B C D
    2017-03-20 0.775596 0.329595 0.218273 -0.224773
    2017-03-27 0.443560 -0.948317 1.098452 0.774574
    2017-04-03 -3.170634 0.668142 -1.475780 -1.557222
    2017-04-10 0.070320 0.178452 0.174697 -0.788119

5.2 Apply

df.apply(): apply functions to the data

In[90]: df.apply(np.cumsum)
Out[90]:
A B C D
2017-03-20 0.387798 -0.058203 -0.169525 -0.612571
2017-03-27 0.609578 -1.228301 0.707147 -0.059777
2017-04-03 -0.975739 1.025158 0.816683 -0.031682
2017-04-10 -0.940579 1.168450 0.956220 -0.854961
In[91]: df.apply(lambda x: x.max()-x.min())
Out[91]:
A 1.973115
B 3.423556
C 1.046196
D 1.376072
dtype: float64

5.3 Histogramming

value_counts(): count the occurrences of each value

In[114]: s = pd.Series(np.random.randint(0, 7, size=10))
In[115]: s
Out[115]:
0 5
1 4
2 6
3 3
4 4
5 6
6 5
7 0
8 6
9 4
dtype: int64
In[117]: s.value_counts()
Out[117]:
6 3
4 3
5 2
3 1
0 1
dtype: int64

5.4 String Methods

Series.str provides many more vectorized string methods

In[118]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
In[121]: s.str.upper()
Out[121]:
0 A
1 B
2 C
3 AABA
4 BACA
5 NaN
6 CABA
7 DOG
8 CAT
dtype: object

6. Merge

6.1 Concat

By default axis=0 concatenates along rows; with axis=1, concatenation is along columns

In[123]: df = pd.DataFrame(np.random.randn(6,6))
In[124]: df
Out[124]:
0 1 2 3 4 5
0 -0.829274 -0.862832 0.799663 -1.115011 2.055959 -0.569186
1 0.058196 -0.571979 0.124036 1.777956 -1.192992 -1.427277
2 -0.442932 -1.177776 0.669681 -1.244613 -0.213186 -2.433752
3 1.305375 -0.540983 -0.232134 -0.373056 -0.036850 -0.171402
4 -0.574777 -0.632868 -0.268940 0.613305 1.000270 1.800234
5 0.874804 -1.711324 -0.240330 -0.183900 0.154792 -1.423335
In[126]: pd.concat([df[:2],df[3:5]])
Out[126]:
0 1 2 3 4 5
0 -0.829274 -0.862832 0.799663 -1.115011 2.055959 -0.569186
1 0.058196 -0.571979 0.124036 1.777956 -1.192992 -1.427277
3 1.305375 -0.540983 -0.232134 -0.373056 -0.036850 -0.171402
4 -0.574777 -0.632868 -0.268940 0.613305 1.000270 1.800234
In[131]: pd.concat([df.iloc[:,0:2],df.iloc[:,3:5]],axis=1)
Out[131]:
0 1 3 4
0 -0.829274 -0.862832 -1.115011 2.055959
1 0.058196 -0.571979 1.777956 -1.192992
2 -0.442932 -1.177776 -1.244613 -0.213186
3 1.305375 -0.540983 -0.373056 -0.036850
4 -0.574777 -0.632868 0.613305 1.000270
5 0.874804 -1.711324 -0.183900 0.154792

6.2 Join
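
pd.merge gives SQL-style joins. A minimal sketch (the left/right frames and the 'key' column are illustrative examples, not data from earlier in this post):

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
pd.merge(left, right, on='key')
# each 'foo' row on the left is paired with each 'foo' row on the right:
#    key  lval  rval
# 0  foo     1     4
# 1  foo     1     5
# 2  foo     2     4
# 3  foo     2     5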

6.3 Append

Append rows to a DataFrame. See the Appending section of the pandas documentation.

In[132]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
In[133]: df
Out[133]:
A B C D
0 -1.077744 0.751278 0.261896 0.738191
1 0.865350 -1.652052 0.090780 -0.022723
2 -0.205837 0.042363 -0.602793 -1.634198
3 -0.232356 -0.905884 -0.495916 0.865143
4 -1.218397 -0.091938 -2.044627 -0.688541
5 0.735322 -0.331950 -1.392685 -1.059044
6 -1.156664 0.136152 0.112945 -0.965557
7 0.050109 0.071000 -1.793460 0.071480
In[135]: df.append(df[3:5])
Out[135]:
A B C D
0 -1.077744 0.751278 0.261896 0.738191
1 0.865350 -1.652052 0.090780 -0.022723
2 -0.205837 0.042363 -0.602793 -1.634198
3 -0.232356 -0.905884 -0.495916 0.865143
4 -1.218397 -0.091938 -2.044627 -0.688541
5 0.735322 -0.331950 -1.392685 -1.059044
6 -1.156664 0.136152 0.112945 -0.965557
7 0.050109 0.071000 -1.793460 0.071480
3 -0.232356 -0.905884 -0.495916 0.865143
4 -1.218397 -0.091938 -2.044627 -0.688541
In[136]: df.append(df[3:5], ignore_index=True)
Out[136]:
A B C D
0 -1.077744 0.751278 0.261896 0.738191
1 0.865350 -1.652052 0.090780 -0.022723
2 -0.205837 0.042363 -0.602793 -1.634198
3 -0.232356 -0.905884 -0.495916 0.865143
4 -1.218397 -0.091938 -2.044627 -0.688541
5 0.735322 -0.331950 -1.392685 -1.059044
6 -1.156664 0.136152 0.112945 -0.965557
7 0.050109 0.071000 -1.793460 0.071480
8 -0.232356 -0.905884 -0.495916 0.865143
9 -1.218397 -0.091938 -2.044627 -0.688541

6.4 Grouping

df.groupby(by=...): group the data by the given key(s), then apply an aggregation to each group

In[137]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
....: 'foo', 'bar', 'foo', 'foo'],
....: 'B' : ['one', 'one', 'two', 'three',
....: 'two', 'two', 'one', 'three'],
....: 'C' : np.random.randn(8),
....: 'D' : np.random.randn(8)})
In[138]: df
Out[138]:
A B C D
0 foo one -1.378873 -1.289674
1 bar one -0.800759 1.247083
2 foo two -1.172841 0.605887
3 bar three -0.232046 0.328599
4 foo two -0.027785 1.644086
5 bar two -0.456277 -1.737294
6 foo one -0.092326 -0.424813
7 foo three 0.676421 -0.328490
In[140]: df.groupby(by='A').sum()
Out[140]:
C D
A
bar -1.489081 -0.161612
foo -1.995404 0.206996
In[141]: df.groupby(by=['A', 'B']).sum()
Out[141]:
C D
A B
bar one -0.800759 1.247083
three -0.232046 0.328599
two -0.456277 -1.737294
foo one -1.471198 -1.714486
three 0.676421 -0.328490
two -1.200627 2.249972

6.5 Reshaping

  1. Stack

    tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
    ....: 'foo', 'foo', 'qux', 'qux'],
    ....: ['one', 'two', 'one', 'two',
    ....: 'one', 'two', 'one', 'two']]))
    tuples
    Out[27]:
    [('bar', 'one'),
    ('bar', 'two'),
    ('baz', 'one'),
    ('baz', 'two'),
    ('foo', 'one'),
    ('foo', 'two'),
    ('qux', 'one'),
    ('qux', 'two')]
    index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
    index
    Out[29]:
    MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
    labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
    names=['first', 'second'])
    df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
    df
    Out[31]:
    A B
    first second
    bar one 0.398046 0.488732
    two -0.024753 -0.567505
    baz one -0.509408 0.055228
    two 0.936854 0.326829
    foo one 0.390471 0.540038
    two -1.001224 0.858414
    qux one -1.640981 0.103979
    two -0.167781 0.949106
    df2 = df[:4]
    df2
    Out[32]:
    A B
    first second
    bar one 0.398046 0.488732
    two -0.024753 -0.567505
    baz one -0.509408 0.055228
    two 0.936854 0.326829
    stacked = df2.stack()
    stacked
    Out[34]:
    first second
    bar one A 0.398046
    B 0.488732
    two A -0.024753
    B -0.567505
    baz one A -0.509408
    B 0.055228
    two A 0.936854
    B 0.326829
    dtype: float64
    stacked.unstack()
    Out[35]:
    A B
    first second
    bar one 0.398046 0.488732
    two -0.024753 -0.567505
    baz one -0.509408 0.055228
    two 0.936854 0.326829
    stacked.unstack().unstack()
    Out[36]:
    A B
    second one two one two
    first
    bar 0.398046 -0.024753 0.488732 -0.567505
    baz -0.509408 0.936854 0.055228 0.326829
    stacked.unstack(1)
    Out[37]:
    second one two
    first
    bar A 0.398046 -0.024753
    B 0.488732 -0.567505
    baz A -0.509408 0.936854
    B 0.055228 0.326829
    stacked.unstack(0)
    Out[38]:
    first bar baz
    second
    one A 0.398046 -0.509408
    B 0.488732 0.055228
    two A -0.024753 0.936854
    B -0.567505 0.326829
    stacked.unstack(2)
    Out[39]:
    A B
    first second
    bar one 0.398046 0.488732
    two -0.024753 -0.567505
    baz one -0.509408 0.055228
    two 0.936854 0.326829

6.6 Time Series

  1. resample: perform a groupby-like aggregation on a time series (frequency conversion and resampling)

    In[163]: rng = pd.date_range('1/1/2012', periods=10, freq='S')
    In[164]: ts = pd.Series(np.random.randn(10), index=rng)
    In[165]: ts
    Out[165]:
    2012-01-01 00:00:00 0.924984
    2012-01-01 00:00:01 -1.253491
    2012-01-01 00:00:02 -0.417912
    2012-01-01 00:00:03 -0.104731
    2012-01-01 00:00:04 -1.165980
    2012-01-01 00:00:05 -0.535897
    2012-01-01 00:00:06 -0.937104
    2012-01-01 00:00:07 -0.143732
    2012-01-01 00:00:08 -0.292696
    2012-01-01 00:00:09 -0.700354
    Freq: S, dtype: float64
    In[166]: ts.resample('2S').sum()
    Out[166]:
    2012-01-01 00:00:00 -0.328507
    2012-01-01 00:00:02 -0.522643
    2012-01-01 00:00:04 -1.701877
    2012-01-01 00:00:06 -1.080836
    2012-01-01 00:00:08 -0.993050
    Freq: 2S, dtype: float64
  2. to_period, to_timestamp: timestamped data can be converted to PeriodIndex-ed data using to_period, and vice versa using to_timestamp

    In[167]: rng = pd.date_range('1/1/2012', periods=5, freq='M')
    In[168]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
    In[169]: ts
    Out[169]:
    2012-01-31 -0.174429
    2012-02-29 2.073488
    2012-03-31 1.459442
    2012-04-30 0.703198
    2012-05-31 -1.796758
    Freq: M, dtype: float64
    In[175]: ps = ts.to_period()
    In[176]: ps
    Out[176]:
    2012-01 -0.174429
    2012-02 2.073488
    2012-03 1.459442
    2012-04 0.703198
    2012-05 -1.796758
    Freq: M, dtype: float64
    In[177]: ps.to_timestamp()
    Out[177]:
    2012-01-01 -0.174429
    2012-02-01 2.073488
    2012-03-01 1.459442
    2012-04-01 0.703198
    2012-05-01 -1.796758
    Freq: MS, dtype: float64
    # Remember that 's' and 'e' can be used to return the timestamps at the start or end of the period:
    In[179]: ps.to_timestamp('D', 'e')
    Out[179]:
    2012-01-31 -0.174429
    2012-02-29 2.073488
    2012-03-31 1.459442
    2012-04-30 0.703198
    2012-05-31 -1.796758
    Freq: M, dtype: float64

7. Getting Data In/Out

7.1 CSV

df.to_csv()

pd.read_csv()
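
A minimal round-trip sketch (the file name foo.csv is only an example):

df.to_csv('foo.csv')                 # write the DataFrame to a CSV file; the index is written too
pd.read_csv('foo.csv', index_col=0)  # read it back, treating the first column as the index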

7.2 JSON

to_json

read_json

In[181]: dfj = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))
In[182]: dfj
Out[182]:
A B
0 -0.570906 -1.082939
1 -0.181702 0.512260
2 0.267922 -0.495724
3 1.182305 0.792193
4 -0.238487 -0.293700
In[184]: json = dfj.to_json()
In[185]: json
Out[185]:
'{"A":{"0":-0.570906023,"1":-0.1817023093,"2":0.267921959,"3":1.1823048088,"4":-0.2384873419},"B":{"0":-1.082939276,"1":0.5122603476,"2":-0.495723635,"3":0.7921934348,"4":-0.2936996606}}'
In[186]: pd.read_json(json)
Out[186]:
A B
0 -0.570906 -1.082939
1 -0.181702 0.512260
2 0.267922 -0.495724
3 1.182305 0.792193
4 -0.238487 -0.293700

7.3 Excel

df.to_excel

pd.read_excel

In [145]: df.to_excel('foo.xlsx', sheet_name='Sheet1')
In [146]: pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Out[146]:
A B C D
2000-01-01 0.266457 -0.399641 -0.219582 1.186860
2000-01-02 -1.170732 -0.345873 1.653061 -0.282953
2000-01-03 -1.734933 0.530468 2.060811 -0.515536
2000-01-04 -1.555121 1.452620 0.239859 -1.156896
2000-01-05 0.578117 0.511371 0.103552 -2.428202
2000-01-06 0.478344 0.449933 -0.741620 -1.962409
2000-01-07 1.235339 -0.091757 -1.543861 -1.084753
... ... ... ... ...
2002-09-20 -10.628548 -9.153563 -7.883146 28.313940
2002-09-21 -10.390377 -8.727491 -6.399645 30.914107
2002-09-22 -8.985362 -8.485624 -4.669462 31.367740
2002-09-23 -9.558560 -8.781216 -4.499815 30.518439
2002-09-24 -9.902058 -9.340490 -4.386639 30.105593
2002-09-25 -10.216020 -9.480682 -3.933802 29.758560
2002-09-26 -11.856774 -10.671012 -3.216025 29.369368
[1000 rows x 4 columns]