机器学习数据科学包--pandas第二天（3）

mac2025-04-07 32

group

通过group，我们可以：根据某些条件将数据分组对每组独立应用一个函数将结果合并原数据中

In [91]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', ....: 'foo', 'bar', 'foo', 'foo'], ....: 'B' : ['one', 'one', 'two', 'three', ....: 'two', 'two', 'one', 'three'], ....: 'C' : np.random.randn(8), ....: 'D' : np.random.randn(8)}) ....: In [92]: df Out[92]: A B C D 0 foo one -1.202872 -0.055224 1 bar one -1.814470 2.395985 2 foo two 1.018601 1.552825 3 bar three -0.595447 0.166599 4 foo two 1.395433 0.047609 5 bar two -0.392670 -0.136473 6 foo one 0.007207 -0.561757 7 foo three 1.928123 -1.623033

对分组结果应用.sum()函数

In [93]: df.groupby('A').sum() Out[93]: C D A bar -2.802588 2.42611 foo 3.146492 -0.63958

按多个列分组形成层次索引，然后应用该函数

In [94]: df.groupby(['A','B']).sum() Out[94]: C D A B bar one -1.814470 2.395985 three -0.595447 0.166599 two -0.392670 -0.136473 foo one -1.195665 -0.616981 three 1.928123 -1.623033 two 2.414034 1.600434

重新改变表格结构

Stack:

In [95]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', ....: 'foo', 'foo', 'qux', 'qux'], ....: ['one', 'two', 'one', 'two', ....: 'one', 'two', 'one', 'two']])) ....: In [96]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second']) In [97]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B']) In [98]: df2 = df[:4] In [99]: df2 Out[99]: A B first second bar one 0.029399 -0.542108 two 0.282696 -0.087302 baz one -1.575170 1.771208 two 0.816482 1.100230

用stack()方法将列索引A、B改变成行索引

In [100]: stacked = df2.stack() In [101]: stacked Out[101]: first second bar one A 0.029399 B -0.542108 two A 0.282696 B -0.087302 baz one A -1.575170 B 1.771208 two A 0.816482 B 1.100230 dtype: float64 In [102]: stacked.unstack() Out[102]: A B first second bar one 0.029399 -0.542108 two 0.282696 -0.087302 baz one -1.575170 1.771208 two 0.816482 1.100230 In [103]: stacked.unstack(1) Out[103]: second one two first bar A 0.029399 0.282696 B -0.542108 -0.087302 baz A -1.575170 0.816482 B 1.771208 1.100230 In [104]: stacked.unstack(0) Out[104]: first bar baz second one A 0.029399 -1.575170 B -0.542108 1.771208 two A 0.282696 0.816482 B -0.087302 1.100230

Pivot Tables

In [105]: df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3, .....: 'B' : ['A', 'B', 'C'] * 4, .....: 'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2, .....: 'D' : np.random.randn(12), .....: 'E' : np.random.randn(12)}) .....: In [106]: df Out[106]: A B C D E 0 one A foo 1.418757 -0.179666 1 one B foo -1.879024 1.291836 2 two C foo 0.536826 -0.009614 3 three A bar 1.006160 0.392149 4 one B bar -0.029716 0.264599 5 one C bar -1.146178 -0.057409 6 two A foo 0.100900 -1.425638 7 three B foo -1.035018 1.024098 8 one C foo 0.314665 -0.106062 9 one A bar -0.773723 1.824375 10 two B bar -1.170653 0.595974 11 three C bar 0.648740 1.167115

我们可以很容易地从这些数据生成透视表：

In [107]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C']) Out[107]: C bar foo A B one A -0.773723 1.418757 B -0.029716 -1.879024 C -1.146178 0.314665 three A 1.006160 NaN B NaN -1.035018 C 0.648740 NaN two A NaN 0.100900 B -1.170653 NaN C NaN 0.536826

时间序列

pandas有很强大的时间处理功能

创建一系列时间

# 从2011.1.1开始72小时 In [1]: rng = pd.date_range('1/1/2011', periods=72, freq='H') In [2]: rng[:5] Out[2]: DatetimeIndex(['2011-01-01 00:00:00', '2011-01-01 01:00:00', '2011-01-01 02:00:00', '2011-01-01 03:00:00', '2011-01-01 04:00:00'], dtype='datetime64[ns]', freq='H')

以时间为索引的pandas里的对象

In [3]: ts = pd.Series(np.random.randn(len(rng)), index=rng) In [4]: ts.head() Out[4]: 2011-01-01 00:00:00 0.469112 2011-01-01 01:00:00 -0.282863 2011-01-01 02:00:00 -1.509059 2011-01-01 03:00:00 -1.135632 2011-01-01 04:00:00 1.212112 Freq: H, dtype: float64

改变每次的时间间隔

# to 45 minute frequency and forward fill In [5]: converted = ts.asfreq('45Min', method='pad') In [6]: converted.head() Out[6]: 2011-01-01 00:00:00 0.469112 2011-01-01 00:45:00 0.469112 2011-01-01 01:30:00 -0.282863 2011-01-01 02:15:00 -1.509059 2011-01-01 03:00:00 -1.135632 Freq: 45T, dtype: float64

resample和他意思差不多

# Daily means In [7]: ts.resample('D').mean() Out[7]: 2011-01-01 -0.319569 2011-01-02 -0.337703 2011-01-03 0.117258 Freq: D, dtype: float64

Categoricals

In [127]: df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})

创建新的一列，类型是category

In [128]: df["grade"] = df["raw_grade"].astype("category") In [129]: df["grade"] Out[129]: 0 a 1 b 2 b 3 a 4 a 5 e Name: grade, dtype: category Categories (3, object): [a, b, e]

重新排序类别并同时添加缺少的类别（series.cat下的方法根据默认值返回一个新的Series）

In [131]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"]) In [132]: df["grade"] Out[132]: 0 very good 1 good 2 good 3 very good 4 very good 5 very bad Name: grade, dtype: category Categories (5, object): [very bad, bad, medium, good, very good]

排序是每个类别中的顺序，而不是词序。

In [133]: df.sort_values(by="grade") Out[133]: id raw_grade grade 5 6 e very bad 1 2 b good 2 3 b good 0 1 a very good 3 4 a very good 4 5 a very good

按类别列分组也显示空类别。

In [134]: df.groupby("grade").size() Out[134]: grade very bad 1 bad 0 medium 0 good 2 very good 3 dtype: int64

画出数据

In [135]: ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000)) In [136]: ts = ts.cumsum() In [137]: ts.plot() Out[137]: <matplotlib.axes._subplots.AxesSubplot at 0x7ff2ab2af550>

读写csv文件

可以将数据保存为csv文件

In [141]: df.to_csv('foo.csv')

再从该文件读取数据

In [142]: pd.read_csv('foo.csv') Out[142]: Unnamed: 0 A B C D 0 2000-01-01 0.266457 -0.399641 -0.219582 1.186860 1 2000-01-02 -1.170732 -0.345873 1.653061 -0.282953 2 2000-01-03 -1.734933 0.530468 2.060811 -0.515536 3 2000-01-04 -1.555121 1.452620 0.239859 -1.156896 4 2000-01-05 0.578117 0.511371 0.103552 -2.428202 5 2000-01-06 0.478344 0.449933 -0.741620 -1.962409 6 2000-01-07 1.235339 -0.091757 -1.543861 -1.084753 .. ... ... ... ... ... 993 2002-09-20 -10.628548 -9.153563 -7.883146 28.313940 994 2002-09-21 -10.390377 -8.727491 -6.399645 30.914107 995 2002-09-22 -8.985362 -8.485624 -4.669462 31.367740 996 2002-09-23 -9.558560 -8.781216 -4.499815 30.518439 997 2002-09-24 -9.902058 -9.340490 -4.386639 30.105593 998 2002-09-25 -10.216020 -9.480682 -3.933802 29.758560 999 2002-09-26 -11.856774 -10.671012 -3.216025 29.369368 [1000 rows x 5 columns]

最新回复(0)