通过group,我们可以: 根据某些条件将数据分组 对每组独立应用一个函数 将结果合并原数据中
In [91]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', ....: 'foo', 'bar', 'foo', 'foo'], ....: 'B' : ['one', 'one', 'two', 'three', ....: 'two', 'two', 'one', 'three'], ....: 'C' : np.random.randn(8), ....: 'D' : np.random.randn(8)}) ....: In [92]: df Out[92]: A B C D 0 foo one -1.202872 -0.055224 1 bar one -1.814470 2.395985 2 foo two 1.018601 1.552825 3 bar three -0.595447 0.166599 4 foo two 1.395433 0.047609 5 bar two -0.392670 -0.136473 6 foo one 0.007207 -0.561757 7 foo three 1.928123 -1.623033对分组结果应用.sum()函数
In [93]: df.groupby('A').sum() Out[93]: C D A bar -2.802588 2.42611 foo 3.146492 -0.63958按多个列分组形成层次索引,然后应用该函数
In [94]: df.groupby(['A','B']).sum() Out[94]: C D A B bar one -1.814470 2.395985 three -0.595447 0.166599 two -0.392670 -0.136473 foo one -1.195665 -0.616981 three 1.928123 -1.623033 two 2.414034 1.600434Stack:
In [95]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', ....: 'foo', 'foo', 'qux', 'qux'], ....: ['one', 'two', 'one', 'two', ....: 'one', 'two', 'one', 'two']])) ....: In [96]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second']) In [97]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B']) In [98]: df2 = df[:4] In [99]: df2 Out[99]: A B first second bar one 0.029399 -0.542108 two 0.282696 -0.087302 baz one -1.575170 1.771208 two 0.816482 1.100230用stack()方法将列索引A、B改变成行索引
In [100]: stacked = df2.stack() In [101]: stacked Out[101]: first second bar one A 0.029399 B -0.542108 two A 0.282696 B -0.087302 baz one A -1.575170 B 1.771208 two A 0.816482 B 1.100230 dtype: float64 In [102]: stacked.unstack() Out[102]: A B first second bar one 0.029399 -0.542108 two 0.282696 -0.087302 baz one -1.575170 1.771208 two 0.816482 1.100230 In [103]: stacked.unstack(1) Out[103]: second one two first bar A 0.029399 0.282696 B -0.542108 -0.087302 baz A -1.575170 0.816482 B 1.771208 1.100230 In [104]: stacked.unstack(0) Out[104]: first bar baz second one A 0.029399 -1.575170 B -0.542108 1.771208 two A 0.282696 0.816482 B -0.087302 1.100230Pivot Tables
In [105]: df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3, .....: 'B' : ['A', 'B', 'C'] * 4, .....: 'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2, .....: 'D' : np.random.randn(12), .....: 'E' : np.random.randn(12)}) .....: In [106]: df Out[106]: A B C D E 0 one A foo 1.418757 -0.179666 1 one B foo -1.879024 1.291836 2 two C foo 0.536826 -0.009614 3 three A bar 1.006160 0.392149 4 one B bar -0.029716 0.264599 5 one C bar -1.146178 -0.057409 6 two A foo 0.100900 -1.425638 7 three B foo -1.035018 1.024098 8 one C foo 0.314665 -0.106062 9 one A bar -0.773723 1.824375 10 two B bar -1.170653 0.595974 11 three C bar 0.648740 1.167115我们可以很容易地从这些数据生成透视表:
In [107]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C']) Out[107]: C bar foo A B one A -0.773723 1.418757 B -0.029716 -1.879024 C -1.146178 0.314665 three A 1.006160 NaN B NaN -1.035018 C 0.648740 NaN two A NaN 0.100900 B -1.170653 NaN C NaN 0.536826pandas有很强大的时间处理功能
创建一系列时间
# 从2011.1.1开始72小时 In [1]: rng = pd.date_range('1/1/2011', periods=72, freq='H') In [2]: rng[:5] Out[2]: DatetimeIndex(['2011-01-01 00:00:00', '2011-01-01 01:00:00', '2011-01-01 02:00:00', '2011-01-01 03:00:00', '2011-01-01 04:00:00'], dtype='datetime64[ns]', freq='H')以时间为索引的pandas里的对象
In [3]: ts = pd.Series(np.random.randn(len(rng)), index=rng) In [4]: ts.head() Out[4]: 2011-01-01 00:00:00 0.469112 2011-01-01 01:00:00 -0.282863 2011-01-01 02:00:00 -1.509059 2011-01-01 03:00:00 -1.135632 2011-01-01 04:00:00 1.212112 Freq: H, dtype: float64改变每次的时间间隔
# to 45 minute frequency and forward fill In [5]: converted = ts.asfreq('45Min', method='pad') In [6]: converted.head() Out[6]: 2011-01-01 00:00:00 0.469112 2011-01-01 00:45:00 0.469112 2011-01-01 01:30:00 -0.282863 2011-01-01 02:15:00 -1.509059 2011-01-01 03:00:00 -1.135632 Freq: 45T, dtype: float64resample和他意思差不多
# Daily means In [7]: ts.resample('D').mean() Out[7]: 2011-01-01 -0.319569 2011-01-02 -0.337703 2011-01-03 0.117258 Freq: D, dtype: float64创建新的一列,类型是category
In [128]: df["grade"] = df["raw_grade"].astype("category") In [129]: df["grade"] Out[129]: 0 a 1 b 2 b 3 a 4 a 5 e Name: grade, dtype: category Categories (3, object): [a, b, e]重新排序类别并同时添加缺少的类别(series.cat下的方法根据默认值返回一个新的Series)
In [131]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"]) In [132]: df["grade"] Out[132]: 0 very good 1 good 2 good 3 very good 4 very good 5 very bad Name: grade, dtype: category Categories (5, object): [very bad, bad, medium, good, very good]排序是每个类别中的顺序,而不是词序。
In [133]: df.sort_values(by="grade") Out[133]: id raw_grade grade 5 6 e very bad 1 2 b good 2 3 b good 0 1 a very good 3 4 a very good 4 5 a very good按类别列分组也显示空类别。
In [134]: df.groupby("grade").size() Out[134]: grade very bad 1 bad 0 medium 0 good 2 very good 3 dtype: int64可以将数据保存为csv文件
In [141]: df.to_csv('foo.csv')再从该文件读取数据
In [142]: pd.read_csv('foo.csv') Out[142]: Unnamed: 0 A B C D 0 2000-01-01 0.266457 -0.399641 -0.219582 1.186860 1 2000-01-02 -1.170732 -0.345873 1.653061 -0.282953 2 2000-01-03 -1.734933 0.530468 2.060811 -0.515536 3 2000-01-04 -1.555121 1.452620 0.239859 -1.156896 4 2000-01-05 0.578117 0.511371 0.103552 -2.428202 5 2000-01-06 0.478344 0.449933 -0.741620 -1.962409 6 2000-01-07 1.235339 -0.091757 -1.543861 -1.084753 .. ... ... ... ... ... 993 2002-09-20 -10.628548 -9.153563 -7.883146 28.313940 994 2002-09-21 -10.390377 -8.727491 -6.399645 30.914107 995 2002-09-22 -8.985362 -8.485624 -4.669462 31.367740 996 2002-09-23 -9.558560 -8.781216 -4.499815 30.518439 997 2002-09-24 -9.902058 -9.340490 -4.386639 30.105593 998 2002-09-25 -10.216020 -9.480682 -3.933802 29.758560 999 2002-09-26 -11.856774 -10.671012 -3.216025 29.369368 [1000 rows x 5 columns]