Pandas is a widely used data-analysis tool. This handbook summarizes its usage and works well as a quick-reference cheat sheet.
Edited on 2020-03-26 15:56:34
Pandas Handbook
pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language. By convention, numpy and pandas are imported as:
import numpy as np
import pandas as pd
The example DataFrame df used throughout this handbook:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
Notes:
1. This content is based on the official "10 minutes to pandas" tutorial: https://pandas.pydata.org/docs/getting_started/10min.html
2. Once familiar with this material, follow the links in the "Getting started" section, especially Pandas cheat sheets A and B.
Getting started
Install: conda install pandas / pip install pandas
Official site
Getting started tutorials
API reference
User guide
Pandas cheat sheet A
Pandas cheat sheet B
File I/O
Read/write CSV files: df.to_csv("file_name.csv") / pd.read_csv("file_name.csv")
Read/write HDF5 files: df.to_hdf("file_name.h5", "key") / pd.read_hdf("file_name.h5", "key"); key distinguishes the DataFrames stored in one file and can be omitted when the file contains only one.
read_hdf(path_or_buf, key=None, mode='r', errors='strict', where=None, start=None, stop=None, columns=None, iterator=False, chunksize=None, **kwargs)
Read from the store, closing it if we opened it. Retrieves the pandas object stored in the file, optionally based on a where criterion.
Parameters:
- path_or_buf: str, path object, pandas.HDFStore, or file-like object. Any valid string path is acceptable, including URLs (http, ftp, s3, file); a local file could be ``file://localhost/path/to/table.h5``. pandas also accepts any ``os.PathLike``, an open ``pandas.HDFStore``, or any object with a ``read()`` method (e.g. a file handle or ``StringIO``). Support for the __fspath__ protocol was added in 0.21.0.
- key: object, optional. The group identifier in the store; can be omitted if the HDF file contains a single pandas object.
- mode: {'r', 'r+', 'a'}, default 'r'. Mode used when opening the file; ignored if path_or_buf is an HDFStore.
- where: list, optional. A list of Term (or convertible) objects.
- start, stop: int, optional. Row numbers at which to start/stop the selection.
- columns: list, optional. Column names to return.
- iterator: bool, optional. Return an iterator object.
- chunksize: int, optional. Rows per iteration when using an iterator.
- errors: str, default 'strict'. How encoding/decoding errors are handled; see the errors argument of ``open``.
- **kwargs: additional keyword arguments passed to HDFStore.
Returns: the selected object; the return type depends on what was stored.
See also: DataFrame.to_hdf (write a DataFrame to an HDF file), HDFStore (low-level access to HDF files).
Example:
>>> df = pd.DataFrame([[1, 1.0, 'a']], columns=['x', 'y', 'z'])
>>> df.to_hdf('./store.h5', 'data')
>>> reread = pd.read_hdf('./store.h5')
Read/write Excel files: df.to_excel('foo.xlsx', sheet_name='Sheet1') / pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Object creation
Create a Series from a list of values: pd.Series(list)
Example: create a Series with the default integer index
In [3]: s = pd.Series([1, 3, 5, np.nan, 6, 8])
In [4]: s
Out[4]:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
Create a DataFrame from a NumPy array: pd.DataFrame(np.array)
Example: create a DataFrame with a datetime index and columns A-D
In [5]: dates = pd.date_range('20130101', periods=6)
In [6]: dates
Out[6]: DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D')
In [7]: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
In [8]: df
Out[8]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
Create a DataFrame from a dict: pd.DataFrame(dict)
In [9]: df2 = pd.DataFrame({'A': 1.,
   ...:                     'B': pd.Timestamp('20130102'),
   ...:                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
   ...:                     'D': np.array([3] * 4, dtype='int32'),
   ...:                     'E': pd.Categorical(["test", "train", "test", "train"]),
   ...:                     'F': 'foo'})
In [10]: df2
Out[10]:
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
Viewing data
View the first/last n rows: df.head(n) / df.tail(n); n defaults to 5
In [13]: df.head()
Out[13]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
In [14]: df.tail(3)
Out[14]:
                   A         B         C         D
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
View the index/columns: df.index / df.columns (these are attributes, not methods)
In [15]: df.index
Out[15]: DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D')
In [16]: df.columns
Out[16]: Index(['A', 'B', 'C', 'D'], dtype='object')
View summary statistics: df.describe()
In [19]: df.describe()
Out[19]:
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.073711 -0.431125 -0.687758 -0.233103
std    0.843157  0.922818  0.779887  0.973118
min   -0.861849 -2.104569 -1.509059 -1.135632
25%   -0.611510 -0.600794 -1.368714 -1.076610
50%    0.022070 -0.228039 -0.767252 -0.386188
75%    0.658444  0.041933 -0.034326  0.461706
max    1.212112  0.567020  0.276232  1.071804
Column means / row means: df.mean() / df.mean(1); these operations generally exclude missing data.
In [61]: df.mean()
Out[61]:
A   -0.004474
B   -0.383981
C   -0.687758
D    5.000000
F    3.000000
dtype: float64
In [62]: df.mean(1)
Out[62]:
2013-01-01    0.872735
2013-01-02    1.431621
2013-01-03    0.707731
2013-01-04    1.395042
2013-01-05    1.883656
2013-01-06    1.592306
Freq: D, dtype: float64
Count the occurrences of each value in a Series: Series.value_counts()
In [68]: s = pd.Series(np.random.randint(0, 7, size=10))
In [69]: s
Out[69]:
0    4
1    2
2    1
3    2
4    6
5    4
6    4
7    6
8    4
9    4
dtype: int64
In [70]: s.value_counts()
Out[70]:
4    5
6    2
2    2
1    1
dtype: int64
Transpose: df.T
In [20]: df.T
Out[20]:
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.469112    1.212112   -0.861849    0.721555   -0.424972   -0.673690
B   -0.282863   -0.173215   -2.104569   -0.706771    0.567020    0.113648
C   -1.509059    0.119209   -0.494929   -1.039575    0.276232   -1.478427
D   -1.135632   -1.044236    1.071804    0.271860   -1.087401    0.524988
Convert to a NumPy array: DataFrame.to_numpy()
DataFrame.to_numpy() drops the DataFrame's index and column labels.
In [17]: df.to_numpy()
Out[17]:
array([[ 0.4691, -0.2829, -1.5091, -1.1356],
       [ 1.2121, -0.1732,  0.1192, -1.0442],
       [-0.8618, -2.1046, -0.4949,  1.0718],
       [ 0.7216, -0.7068, -1.0396,  0.2719],
       [-0.425 ,  0.567 ,  0.2762, -1.0874],
       [-0.6737,  0.1136, -1.4784,  0.525 ]])
In [18]: df2.to_numpy()
Out[18]:
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']], dtype=object)
Sort by column labels: df.sort_index(axis=1, ascending=False)
In [21]: df.sort_index(axis=1, ascending=False)
Out[21]:
                   D         C         B         A
2013-01-01 -1.135632 -1.509059 -0.282863  0.469112
2013-01-02 -1.044236  0.119209 -0.173215  1.212112
2013-01-03  1.071804 -0.494929 -2.104569 -0.861849
2013-01-04  0.271860 -1.039575 -0.706771  0.721555
2013-01-05 -1.087401  0.276232  0.567020 -0.424972
2013-01-06  0.524988 -1.478427  0.113648 -0.673690
Sort by a column's values (ascending by default): df.sort_values(by="B", ascending=True)
In [22]: df.sort_values(by='B')
Out[22]:
                   A         B         C         D
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
Selecting data
Select a single column by name (yields a Series): df["A"] / df.A
In [23]: df['A']
Out[23]:
2013-01-01    0.469112
2013-01-02    1.212112
2013-01-03   -0.861849
2013-01-04    0.721555
2013-01-05   -0.424972
2013-01-06   -0.673690
Freq: D, Name: A, dtype: float64
Select rows by position slice (end excluded) or by index-label slice (end included): df[0:3] / df[index1:index3]
In [24]: df[0:3]
Out[24]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
In [25]: df['20130102':'20130104']
Out[25]:
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
Select a row by index label: df.loc[row_index_label]
In [26]: df.loc[dates[0]]
Out[26]:
A    0.469112
B   -0.282863
C   -1.509059
D   -1.135632
Name: 2013-01-01 00:00:00, dtype: float64
Select multiple columns by label: df.loc[:, [column_A, column_B]]
In [27]: df.loc[:, ['A', 'B']]
Out[27]:
                   A         B
2013-01-01  0.469112 -0.282863
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020
2013-01-06 -0.673690  0.113648
Select a region by index-label slice and column labels (the slice end is included): df.loc[index1:index3, [column_A, column_B]]
In [28]: df.loc['20130102':'20130104', ['A', 'B']]
Out[28]:
                   A         B
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
Select a single value by index label and column label: df.loc[index1, column_A] / df.at[index1, column_A]
In [30]: df.loc[dates[0], 'A']
Out[30]: 0.4691122999071863
In [31]: df.at[dates[0], 'A']
Out[31]: 0.4691122999071863
Label-based selection: df.loc
Select a row by position: df.iloc[3]
In [32]: df.iloc[3]
Out[32]:
A    0.721555
B   -0.706771
C   -1.039575
D    0.271860
Name: 2013-01-04 00:00:00, dtype: float64
Select rows by position slice: df.iloc[1:3, :]
In [35]: df.iloc[1:3, :]
Out[35]:
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
Select columns by position slice: df.iloc[:, 1:3]
In [36]: df.iloc[:, 1:3]
Out[36]:
                   B         C
2013-01-01 -0.282863 -1.509059
2013-01-02 -0.173215  0.119209
2013-01-03 -2.104569 -0.494929
2013-01-04 -0.706771 -1.039575
2013-01-05  0.567020  0.276232
2013-01-06  0.113648 -1.478427
Select a region by row and column position slices: df.iloc[3:5, 0:2]
In [33]: df.iloc[3:5, 0:2]
Out[33]:
                   A         B
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020
Select specific cells by lists of row and column positions: df.iloc[[1, 2, 4], [0, 2]]
In [34]: df.iloc[[1, 2, 4], [0, 2]]
Out[34]:
                   A         C
2013-01-02  1.212112  0.119209
2013-01-03 -0.861849 -0.494929
2013-01-05 -0.424972  0.276232
Select a single value by position: df.iloc[1, 1] / df.iat[1, 1]
In [37]: df.iloc[1, 1]
Out[37]: -0.17321464905330858
In [38]: df.iat[1, 1]
Out[38]: -0.17321464905330858
Position-based selection: df.iloc
Filter rows by a column's values: df[df['A'] > 0]
In [39]: df[df['A'] > 0]
Out[39]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
Apply a condition to every value; cells that fail the test become NaN: df[df > 0]
In [40]: df[df > 0]
Out[40]:
                   A         B         C         D
2013-01-01  0.469112       NaN       NaN       NaN
2013-01-02  1.212112       NaN  0.119209       NaN
2013-01-03       NaN       NaN       NaN  1.071804
2013-01-04  0.721555       NaN       NaN  0.271860
2013-01-05       NaN  0.567020  0.276232       NaN
2013-01-06       NaN  0.113648       NaN  0.524988
Filter with isin(): df2[df2['E'].isin(['two', 'four'])]
In [41]: df2 = df.copy()
In [42]: df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
In [43]: df2
Out[43]:
                   A         B         C         D      E
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632    one
2013-01-02  1.212112 -0.173215  0.119209 -1.044236    one
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804    two
2013-01-04  0.721555 -0.706771 -1.039575  0.271860  three
2013-01-05 -0.424972  0.567020  0.276232 -1.087401   four
2013-01-06 -0.673690  0.113648 -1.478427  0.524988  three
In [44]: df2[df2['E'].isin(['two', 'four'])]
Out[44]:
                   A         B         C         D     E
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804   two
2013-01-05 -0.424972  0.567020  0.276232 -1.087401  four
Boolean indexing: a True/False mask selects/excludes data
Column operations
Select one column: df["columns_A"] / df.columns_A
Select several columns: df[["columns_A", "columns_B"]]
Select columns matching a regular expression: df.filter(regex='regex')
Rename columns: df.rename(columns={"y": "year"})
Drop columns: df.drop(columns=["columns_A", "columns_B"])
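A minimal sketch tying these column operations together; the column names ("y", "val_a", "val_b") are illustrative, not from the tutorial:

```python
import pandas as pd

df = pd.DataFrame({"y": [2019, 2020], "val_a": [1, 2], "val_b": [3, 4]})

# Keep only columns whose names match a regex (here: names starting with "val")
subset = df.filter(regex="^val")

# Rename a column; returns a new DataFrame (pass inplace=True to modify df itself)
renamed = df.rename(columns={"y": "year"})

# Drop columns by name; also returns a new DataFrame
dropped = df.drop(columns=["val_a", "val_b"])

print(list(subset.columns))
print(list(renamed.columns))
print(list(dropped.columns))
```

Note that filter, rename, and drop all leave the original df untouched by default.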
Setting data
Set a single value by label: df.at[dates[0], 'A'] = 0
Set a single value by position: df.iat[0, 1] = 0
Set a whole column: df.loc[:, 'D'] = np.array([5] * len(df))
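The three assignment forms above can be combined in one small sketch (the all-ones DataFrame is illustrative, standing in for the tutorial's df):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.ones((6, 4)), index=dates, columns=list("ABCD"))

df.at[dates[0], "A"] = 0                   # one cell, addressed by labels
df.iat[0, 1] = 0                           # one cell, addressed by positions (row 0, col "B")
df.loc[:, "D"] = np.array([5] * len(df))   # replace an entire column

print(df.iloc[0].tolist())
```

Unlike the column operations above, these assignments mutate df in place.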
Missing data
pandas primarily uses the value np.nan to represent missing data.
Drop any rows containing missing values: df1.dropna(how='any')
In [57]: df1
Out[57]:
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000 -1.509059  5  NaN  1.0
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0  NaN
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0  NaN
In [58]: df1.dropna(how='any')
Out[58]:
                   A         B         C  D    F    E
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0
Fill all missing values with a given value: df1.fillna(value=5)
In [59]: df1.fillna(value=5)
Out[59]:
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000 -1.509059  5  5.0  1.0
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0  5.0
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0  5.0
Get a boolean mask of which values are missing: pd.isna(df1)
In [60]: pd.isna(df1)
Out[60]:
                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True
Miscellaneous operations
Reindexing: df.reindex
Reindexing lets you change/add/delete the index on a specified axis; it returns a copy of the data.
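A small sketch of reindex, following the tutorial's pattern; the extra column "E" and the all-ones data are illustrative:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.ones((6, 4)), index=dates, columns=list("ABCD"))

# Keep only the first 4 dates and add a new column E;
# positions that did not exist before are filled with NaN
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0]:dates[1], "E"] = 1

print(df1.shape)
print(df1["E"].tolist())
```

Because reindex returns a copy, the original df still has 6 rows and no column E.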
Subtraction: df.sub(Series, axis='index')
In [63]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
In [51]: df
Out[51]:
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -1.509059  5  NaN
2013-01-02  1.212112 -0.173215  0.119209  5  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0
2013-01-05 -0.424972  0.567020  0.276232  5  4.0
2013-01-06 -0.673690  0.113648 -1.478427  5  5.0
In [64]: s
Out[64]:
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64
In [65]: df.sub(s, axis='index')
Out[65]:
                   A         B         C    D    F
2013-01-01       NaN       NaN       NaN  NaN  NaN
2013-01-02       NaN       NaN       NaN  NaN  NaN
2013-01-03 -1.861849 -3.104569 -1.494929  4.0  1.0
2013-01-04 -2.278445 -3.706771 -4.039575  2.0  0.0
2013-01-05 -5.424972 -4.432980 -4.723768  0.0 -1.0
2013-01-06       NaN       NaN       NaN  NaN  NaN
Apply a function: df.apply(f)
In [66]: df.apply(np.cumsum)
Out[66]:
                   A         B         C   D     F
2013-01-01  0.000000  0.000000 -1.509059   5   NaN
2013-01-02  1.212112 -0.173215 -1.389850  10   1.0
2013-01-03  0.350263 -2.277784 -1.884779  15   3.0
2013-01-04  1.071818 -2.984555 -2.924354  20   6.0
2013-01-05  0.646846 -2.417535 -2.648122  25  10.0
2013-01-06 -0.026844 -2.303886 -4.126549  30  15.0
In [67]: df.apply(lambda x: x.max() - x.min())
Out[67]:
A    2.073961
B    2.671590
C    1.785291
D    0.000000
F    4.000000
dtype: float64
Series string methods: Series.str.lower()
Series is equipped with a set of string-processing methods in its str attribute that make it easy to operate on each element of the array, as in the snippet below. Note that pattern matching in str generally uses regular expressions by default (and in some cases always does).
In [71]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
In [72]: s.str.lower()
Out[72]:
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
Pivot tables: pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])
In [101]: df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
   .....:                    'B': ['A', 'B', 'C'] * 4,
   .....:                    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
   .....:                    'D': np.random.randn(12),
   .....:                    'E': np.random.randn(12)})
In [102]: df
Out[102]:
        A  B    C         D         E
0     one  A  foo -1.202872  0.047609
1     one  B  foo -1.814470 -0.136473
2     two  C  foo  1.018601 -0.561757
3   three  A  bar -0.595447 -1.623033
4     one  B  bar  1.395433  0.029399
5     one  C  bar -0.392670 -0.542108
6     two  A  foo  0.007207  0.282696
7   three  B  foo  1.928123 -0.087302
8     one  C  foo -0.055224 -1.575170
9     one  A  bar  2.395985  1.771208
10    two  B  bar  1.552825  0.816482
11  three  C  bar  0.166599  1.100230
In [103]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[103]:
C             bar       foo
A     B
one   A  2.395985 -1.202872
      B  1.395433 -1.814470
      C -0.392670 -0.055224
three A -0.595447       NaN
      B       NaN  1.928123
      C  0.166599       NaN
two   A       NaN  0.007207
      B  1.552825       NaN
      C       NaN  1.018601
Merging and grouping
Adding a column to a DataFrame is relatively fast, but adding a row requires a copy and can be expensive. Instead of building a DataFrame by iteratively appending records to it, pass a pre-built list of records to the DataFrame constructor.
How the combining operations differ:
concat: concatenates multiple objects along an axis.
merge: joins rows of different DataFrames on one or more keys.
join types: inner takes the intersection of keys, outer takes the union.
"Group by" refers to a process involving one or more of the following steps:
splitting the data into groups based on some criteria
applying a function to each group independently
combining the results into a data structure
Concatenate DataFrames row-wise: pd.concat([df1, df2, df3])
In [73]: df = pd.DataFrame(np.random.randn(10, 4))
In [74]: df
Out[74]:
          0         1         2         3
0 -0.548702  1.467327 -1.015962 -0.483075
1  1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952  0.991460 -0.919069  0.266046
3 -0.709661  1.669052  1.037882 -1.705775
4 -0.919854 -0.042379  1.247642 -0.009920
5  0.290213  0.495767  0.362949  1.548106
6 -1.131345 -0.089329  0.337863 -0.945867
7 -0.932132  1.956030  0.017587 -0.016692
8 -0.575247  0.254161 -1.143704  0.215897
9  1.193555 -0.077118 -0.408530 -0.862495
# break it into pieces
In [75]: pieces = [df[:3], df[3:7], df[7:]]
In [76]: pd.concat(pieces)
Out[76]:
          0         1         2         3
0 -0.548702  1.467327 -1.015962 -0.483075
1  1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952  0.991460 -0.919069  0.266046
3 -0.709661  1.669052  1.037882 -1.705775
4 -0.919854 -0.042379  1.247642 -0.009920
5  0.290213  0.495767  0.362949  1.548106
6 -1.131345 -0.089329  0.337863 -0.945867
7 -0.932132  1.956030  0.017587 -0.016692
8 -0.575247  0.254161 -1.143704  0.215897
9  1.193555 -0.077118 -0.408530 -0.862495
Concatenate DataFrames column-wise: pd.concat([df1, df2, df3], axis=1)
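Column-wise concatenation aligns on the index (an outer join by default); a minimal sketch with toy frames:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({"B": [3, 4]}, index=[0, 1])
df3 = pd.DataFrame({"C": [5, 6]}, index=[1, 2])  # partially overlapping index

# axis=1 places the frames side by side; rows are matched by index label,
# and labels missing from a frame produce NaN in its columns
wide = pd.concat([df1, df2, df3], axis=1)
print(wide)
```

Here row 2 exists only in df3, so its A and B cells come out as NaN.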
SQL-style join on one or more keys: pd.merge(df1, df2, on="key")
Example 1:
In [77]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
In [78]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
In [79]: left
Out[79]:
   key  lval
0  foo     1
1  foo     2
In [80]: right
Out[80]:
   key  rval
0  foo     4
1  foo     5
In [81]: pd.merge(left, right, on='key')
Out[81]:
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
Example 2:
In [82]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
In [83]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
In [84]: left
Out[84]:
   key  lval
0  foo     1
1  bar     2
In [85]: right
Out[85]:
   key  rval
0  foo     4
1  bar     5
In [86]: pd.merge(left, right, on='key')
Out[86]:
   key  lval  rval
0  foo     1     4
1  bar     2     5
Group by one or more columns, then aggregate: df.groupby('A').sum()
In [87]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
   ....:                          'foo', 'bar', 'foo', 'foo'],
   ....:                    'B': ['one', 'one', 'two', 'three',
   ....:                          'two', 'two', 'one', 'three'],
   ....:                    'C': np.random.randn(8),
   ....:                    'D': np.random.randn(8)})
In [88]: df
Out[88]:
     A      B         C         D
0  foo    one  1.346061 -1.577585
1  bar    one  1.511763  0.396823
2  foo    two  1.627081 -0.105381
3  bar  three -0.990582 -0.532532
4  foo    two -0.441652  1.453749
5  bar    two  1.211526  1.208843
6  foo    one  0.268520 -0.080952
7  foo  three  0.024580 -0.264610
In [89]: df.groupby('A').sum()
Out[89]:
            C         D
A
bar  1.732707  1.073134
foo  2.824590 -0.574779
In [90]: df.groupby(['A', 'B']).sum()
Out[90]:
                  C         D
A   B
bar one    1.511763  0.396823
    three -0.990582 -0.532532
    two    1.211526  1.208843
foo one    1.614581 -1.658537
    three  0.024580 -0.264610
    two    1.185429  1.348368
Time series
pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (for example, converting secondly data into 5-minutely data). This is extremely common in financial applications, but not limited to them.
Resampling: Series.resample
In [104]: rng = pd.date_range('1/1/2012', periods=100, freq='S')
In [105]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
In [106]: ts.resample('5Min').sum()
Out[106]:
2012-01-01    24182
Freq: 5T, dtype: int64
Time-zone conversion: Series.tz_convert()
In [107]: rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
In [108]: ts = pd.Series(np.random.randn(len(rng)), rng)
In [109]: ts
Out[109]:
2012-03-06    1.857704
2012-03-07   -1.193545
2012-03-08    0.677510
2012-03-09   -0.153931
2012-03-10    0.520091
Freq: D, dtype: float64
In [110]: ts_utc = ts.tz_localize('UTC')
In [111]: ts_utc
Out[111]:
2012-03-06 00:00:00+00:00    1.857704
2012-03-07 00:00:00+00:00   -1.193545
2012-03-08 00:00:00+00:00    0.677510
2012-03-09 00:00:00+00:00   -0.153931
2012-03-10 00:00:00+00:00    0.520091
Freq: D, dtype: float64
In [112]: ts_utc.tz_convert('US/Eastern')
Out[112]:
2012-03-05 19:00:00-05:00    1.857704
2012-03-06 19:00:00-05:00   -1.193545
2012-03-07 19:00:00-05:00    0.677510
2012-03-08 19:00:00-05:00   -0.153931
2012-03-09 19:00:00-05:00    0.520091
Freq: D, dtype: float64
Categoricals
pandas can include categorical data in a DataFrame.
Convert to a categorical dtype: df["raw_grade"].astype("category")
In [123]: df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
   .....:                    "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
In [124]: df["grade"] = df["raw_grade"].astype("category")
In [125]: df["grade"]
Out[125]:
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
Rename the categories: df["grade"].cat.categories = ["very good", "good", "very bad"]
Rename the categories to more meaningful names (assigning to Series.cat.categories operates in place!):
In [126]: df["grade"].cat.categories = ["very good", "good", "very bad"]
Reorder the categories and add new ones: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
In [127]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium",
   .....:                                               "good", "very good"])
In [128]: df["grade"]
Out[128]:
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
Sort in category order (not lexical order): df.sort_values(by="grade")
In [129]: df.sort_values(by="grade")
Out[129]:
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good
Group by a categorical column (empty categories are shown too): df.groupby("grade").size()
In [130]: df.groupby("grade").size()
Out[130]:
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64
Plotting
Line plot: df.plot()
Histogram: df.plot.hist()
Scatter plot: df.plot.scatter(x="columns_A", y="columns_B")
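A minimal, non-interactive sketch of the three plot calls. It assumes matplotlib is installed and uses the Agg backend so no display window is needed; the column names "columns_A"/"columns_B" follow the placeholders above:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render without a display
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(50, 2), columns=["columns_A", "columns_B"])

ax_line = df.plot()  # line plot: one line per column, index on the x-axis

plt.figure()  # start a fresh figure so the plots do not overlap
ax_hist = df["columns_A"].plot.hist(bins=10)  # histogram of one column

ax_scat = df.plot.scatter(x="columns_A", y="columns_B")  # scatter of A vs B

# Each call returns a matplotlib Axes, which can be customized further
ax_line.set_title("line plot")
```

Each plotting method returns the Axes it drew on, so titles, labels, and limits can be adjusted afterwards with the usual matplotlib API.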