The Python Data Analysis Trio: NumPy, Pandas, and Matplotlib in Depth

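Learning Objectives

Master NumPy array operations and numerical computation
Use Pandas fluently for data processing and analysis
Create a range of data visualizations with Matplotlib
Understand how the three libraries work together
Learn the data science workflow
Tackle real-world data analysis problems

Study Plan

NumPy basics: array operations and numerical computation
Pandas core: data structures and data processing
Matplotlib visualization: creating and styling charts
Using the three together: a complete data analysis workflow
Case studies: analysis projects on realistic data
Performance optimization and best practices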

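1. NumPy Basics

1.1 NumPy Overview

NumPy is the foundational library for scientific computing in Python. It provides a high-performance multidimensional array object and the mathematical functions that operate on it.

Core features of NumPy:

Multidimensional arrays (ndarray)
Broadcasting
Linear algebra operations
Random number generation
Fourier transforms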
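
Two of the features listed above, random number generation and Fourier transforms, are not exercised elsewhere in this tutorial. The following is a minimal sketch of both. The seed, the 5 Hz test signal, and the 100 Hz sample rate are assumptions chosen only for illustration.

```python
import numpy as np

# Seeded generator so the random draws are reproducible (the seed is arbitrary).
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

# A pure 5 Hz sine sampled at 100 Hz for one second.
sample_rate = 100
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 5 * t)

# Real-input FFT and the matching frequency axis.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1 / sample_rate)
peak_freq = freqs[np.argmax(spectrum)]

print("sample mean:", samples.mean())
print("dominant frequency (Hz):", peak_freq)
```

Because the window holds exactly five full periods, the spectrum peaks in the 5 Hz bin.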

1.2 Creating Arrays

import numpy as np

# Different ways to create an array
# From a Python list
arr1 = np.array([1, 2, 3, 4, 5])
print("1-D array:", arr1)

# A 2-D array from nested lists
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print("2-D array:\n", arr2)

# Filled with zeros
zeros_arr = np.zeros((3, 4))
print("Zeros array:\n", zeros_arr)

# Filled with ones
ones_arr = np.ones((2, 3))
print("Ones array:\n", ones_arr)

# Evenly stepped values with arange (stop is excluded)
range_arr = np.arange(0, 10, 2)
print("Range array:", range_arr)

# Evenly spaced values with linspace (stop is included)
linspace_arr = np.linspace(0, 1, 5)
print("Linspace array:", linspace_arr)

# Uniform random values in [0, 1)
random_arr = np.random.rand(3, 3)
print("Random array:\n", random_arr)

Expected output:

1-D array: [1 2 3 4 5]
2-D array:
 [[1 2 3]
 [4 5 6]]
Zeros array:
 [[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
Ones array:
 [[1. 1. 1.]
 [1. 1. 1.]]
Range array: [0 2 4 6 8]
Linspace array: [0.   0.25 0.5  0.75 1.  ]
Random array:
 [[... 3x3 random floats in [0, 1) ...]]

1.3 Array Operations

# Shape and dimensions
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Shape:", arr.shape)
print("Dimensions:", arr.ndim)
print("Size:", arr.size)
print("Dtype:", arr.dtype)

# Reshaping (the element count must stay the same)
reshaped = arr.reshape(3, 2)
print("Reshaped:\n", reshaped)

# Indexing and slicing
print("First row:", arr[0])
print("First column:", arr[:, 0])
print("Sub-array:", arr[0:1, 1:3])

# Boolean indexing
bool_mask = arr > 3
print("Boolean mask:\n", bool_mask)
print("Selected by condition:", arr[bool_mask])

Expected output:

Shape: (2, 3)
Dimensions: 2
Size: 6
Dtype: int64
Reshaped:
 [[1 2]
 [3 4]
 [5 6]]
First row: [1 2 3]
First column: [1 4]
Sub-array: [[2 3]]
Boolean mask:
 [[False False False]
 [ True  True  True]]
Selected by condition: [4 5 6]

The default integer dtype is platform-dependent: NumPy 1.x on Windows reports int32 instead of int64.

1.4 Mathematical Operations

# Element-wise arithmetic
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

print("Addition:", a + b)
print("Multiplication:", a * b)
print("Square:", a ** 2)
print("Square root:", np.sqrt(a))

# Statistical functions
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Standard deviation:", np.std(data))  # population std (ddof=0)
print("Variance:", np.var(data))            # population variance (ddof=0)
print("Max:", np.max(data))
print("Min:", np.min(data))

# Linear algebra
matrix = np.array([[1, 2], [3, 4]])
print("Matrix:\n", matrix)
print("Determinant:", np.linalg.det(matrix))
print("Inverse:\n", np.linalg.inv(matrix))
print("Eigenvalues:", np.linalg.eigvals(matrix))

Expected output:

Addition: [ 6  8 10 12]
Multiplication: [ 5 12 21 32]
Square: [ 1  4  9 16]
Square root: [1.         1.41421356 1.73205081 2.        ]
Mean: 5.5
Median: 5.5
Standard deviation: 2.8722813232690143
Variance: 8.25
Max: 10
Min: 1
Matrix:
 [[1 2]
 [3 4]]
Determinant: -2.0000000000000004
Inverse:
 [[-2.   1. ]
 [ 1.5 -0.5]]
Eigenvalues: [-0.37228132  5.37228132]

1.5 Broadcasting

# Broadcasting examples
arr = np.array([[1, 2, 3], [4, 5, 6]])
scalar = 2

# Scalar with array: the scalar is applied to every element
print("Array + scalar:\n", arr + scalar)
print("Array * scalar:\n", arr * scalar)

# Arrays of different shapes: (2, 3) with (3,)
arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([10, 20, 30])

print("Broadcast addition:\n", arr1 + arr2)

Expected output:

Array + scalar:
 [[3 4 5]
 [6 7 8]]
Array * scalar:
 [[ 2  4  6]
 [ 8 10 12]]
Broadcast addition:
 [[11 22 33]
 [14 25 36]]
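
The examples above rely on NumPy's broadcasting rule: shapes are compared from the trailing dimension backwards, and two sizes are compatible when they are equal or one of them is 1. The sketch below illustrates one compatible and one incompatible pairing; the arrays are made up for this purpose.

```python
import numpy as np

row = np.array([10, 20, 30])   # shape (3,)
col = np.array([[1], [2]])     # shape (2, 1)

# (2, 1) and (3,) broadcast to (2, 3): each operand is stretched along its size-1 axis.
grid = col + row
print(grid)

# np.broadcast_shapes reports the result shape without allocating any data.
print(np.broadcast_shapes((2, 1), (3,)))

# Shapes (2, 3) and (2,) are incompatible: the trailing sizes 3 and 2 differ.
try:
    np.ones((2, 3)) + np.ones(2)
    compatible = True
except ValueError:
    compatible = False
print("compatible:", compatible)
```

When a per-row vector needs to be combined with a 2-D array, reshaping it to a column with `vector[:, np.newaxis]` is the usual way to make the shapes line up.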

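2. Pandas Core

2.1 Pandas Overview

Pandas is the core library for data analysis in Python. It provides fast, easy-to-use data structures and analysis tools.

Main Pandas data structures:

Series: a one-dimensional labeled array
DataFrame: a two-dimensional tabular data structure
Panel: a three-dimensional data structure (deprecated in pandas 0.20 and removed in 0.25; use a DataFrame with a MultiIndex instead)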

2.2 Working with Series

import pandas as pd

# Create a Series
s1 = pd.Series([1, 3, 5, 7, 9])
print("Series:", s1)

# Series with an explicit index
s2 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
print("Series with index:\n", s2)

# Series from a dict (keys become the index)
dict_data = {'a': 1, 'b': 2, 'c': 3}
s3 = pd.Series(dict_data)
print("From dict:\n", s3)

# Series attributes
print("Index:", s2.index)
print("Values:", s2.values)
print("Dtype:", s2.dtype)
print("Size:", s2.size)

# Aggregations
print("Sum:", s2.sum())
print("Mean:", s2.mean())
print("Standard deviation:", s2.std())  # sample std (ddof=1), unlike np.std

Expected output:

Series: 0    1
1    3
2    5
3    7
4    9
dtype: int64
Series with index:
 a    1
b    2
c    3
d    4
dtype: int64
From dict:
 a    1
b    2
c    3
dtype: int64
Index: Index(['a', 'b', 'c', 'd'], dtype='object')
Values: [1 2 3 4]
Dtype: int64
Size: 4
Sum: 10
Mean: 2.5
Standard deviation: 1.2909944487358056

2.3 Working with DataFrames

# Create a DataFrame
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'city': ['New York', 'London', 'Paris', 'Tokyo'],
    'salary': [50000, 60000, 70000, 80000]
}
df = pd.DataFrame(data)
print("DataFrame:\n", df)

# Read from a CSV file
# df = pd.read_csv('data.csv')

# Basic attributes
print("Shape:", df.shape)
print("Columns:", df.columns)
print("Index:", df.index)
print("Dtypes:\n", df.dtypes)
# info() prints its summary directly and returns None, so it is not wrapped in print()
print("Info:")
df.info()

# Inspect the data
print("First 3 rows:\n", df.head(3))
print("Last 2 rows:\n", df.tail(2))
print("Descriptive statistics:\n", df.describe())

Expected output:

DataFrame:
       name  age      city  salary
0    Alice   25  New York   50000
1      Bob   30    London   60000
2  Charlie   35     Paris   70000
3    David   40     Tokyo   80000
Shape: (4, 4)
Columns: Index(['name', 'age', 'city', 'salary'], dtype='object')
Index: RangeIndex(start=0, stop=4, step=1)
Dtypes:
 name      object
age        int64
city      object
salary     int64
dtype: object
Info:
(df.info() prints the column names, non-null counts, dtypes, and memory usage here)
First 3 rows:
       name  age      city  salary
0    Alice   25  New York   50000
1      Bob   30    London   60000
2  Charlie   35     Paris   70000
Last 2 rows:
       name  age   city  salary
2  Charlie   35  Paris   70000
3    David   40  Tokyo   80000
Descriptive statistics:
              age        salary
count   4.000000      4.000000
mean   32.500000  65000.000000
std     6.454972  12909.944487
min    25.000000  50000.000000
25%    28.750000  57500.000000
50%    32.500000  65000.000000
75%    36.250000  72500.000000
max    40.000000  80000.000000

2.4 Selecting and Indexing Data

# Column selection
print("Single column:", df['name'])
print("Multiple columns:\n", df[['name', 'age']])

# Row selection by position
print("First row:", df.iloc[0])
print("First two rows:\n", df.iloc[0:2])

# Conditional selection
young_people = df[df['age'] < 35]
print("Younger than 35:\n", young_people)

high_salary = df[df['salary'] > 60000]
print("Salary above 60000:\n", high_salary)

# Combined conditions (each condition needs its own parentheses)
filtered = df[(df['age'] > 30) & (df['salary'] > 60000)]
print("Combined filter:\n", filtered)

Expected output:

Single column: 0      Alice
1        Bob
2    Charlie
3      David
Name: name, dtype: object
Multiple columns:
       name  age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   40
First row: name         Alice
age             25
city      New York
salary       50000
Name: 0, dtype: object
First two rows:
       name  age      city  salary
0    Alice   25  New York   50000
1      Bob   30    London   60000
Younger than 35:
       name  age      city  salary
0    Alice   25  New York   50000
1      Bob   30    London   60000
Salary above 60000:
       name  age   city  salary
2  Charlie   35  Paris   70000
3    David   40  Tokyo   80000
Combined filter:
       name  age   city  salary
2  Charlie   35  Paris   70000
3    David   40  Tokyo   80000
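
The selection code above uses `[]` and `iloc`. As a complement, here is a small sketch of label-based selection with `loc` and string filters with `query`. The string index labels `a` to `d` are an assumption added for this sketch so that the difference between labels and positions is visible.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "age": [25, 30, 35, 40],
    "salary": [50000, 60000, 70000, 80000],
}, index=["a", "b", "c", "d"])  # hypothetical labels, not from the tutorial data

# iloc is position-based and excludes the end position.
by_position = df.iloc[0:2]

# loc is label-based and includes the end label.
by_label = df.loc["a":"c", ["name", "age"]]

# loc also accepts a boolean mask together with a column selection.
older_names = df.loc[df["age"] > 30, "name"]

# query expresses the same kind of filter as a string.
queried = df.query("age > 30 and salary > 60000")

print(len(by_position), len(by_label), list(older_names), len(queried))
```

Note the off-by-one difference: `iloc[0:2]` returns two rows, while `loc["a":"c"]` returns three.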

2.5 Data Processing

# Handling missing values
df_with_na = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, np.nan]
})
print("Original data:\n", df_with_na)

# Count missing values per column
print("Missing value counts:\n", df_with_na.isnull().sum())

# Drop every row that contains a missing value
df_cleaned = df_with_na.dropna()
print("After dropping missing values:\n", df_cleaned)

# Fill missing values
df_filled = df_with_na.fillna(0)
print("After filling with 0:\n", df_filled)

# Sorting
df_sorted = df.sort_values('age', ascending=False)
print("Sorted by age:\n", df_sorted)

# Grouping
grouped = df.groupby('city')['salary'].mean()
print("Mean salary by city:\n", grouped)

Expected output:

Original data:
      A    B     C
0  1.0  5.0   9.0
1  2.0  NaN  10.0
2  NaN  NaN  11.0
3  4.0  8.0   NaN
Missing value counts:
 A    1
B    2
C    1
dtype: int64
After dropping missing values:
      A    B    C
0  1.0  5.0  9.0
After filling with 0:
      A    B     C
0  1.0  5.0   9.0
1  2.0  0.0  10.0
2  0.0  0.0  11.0
3  4.0  8.0   0.0
Sorted by age:
       name  age      city  salary
3    David   40     Tokyo   80000
2  Charlie   35     Paris   70000
1      Bob   30    London   60000
0    Alice   25  New York   50000
Mean salary by city:
 city
London      60000.0
New York    50000.0
Paris       70000.0
Tokyo       80000.0
Name: salary, dtype: float64
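
Filling with a constant and taking a single group mean are the simplest cases. The sketch below shows two common variations: filling each column with its own mean, and computing several statistics per group with `agg`. The small `people` table is made-up data for illustration only.

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [5, np.nan, np.nan, 8],
})

# Fill each column's gaps with that column's mean (NaN is skipped when averaging).
filled = df_na.fillna(df_na.mean())
print(filled)

people = pd.DataFrame({
    "city": ["London", "London", "Paris"],  # hypothetical rows
    "salary": [60000, 80000, 70000],
})

# Several statistics per group in one pass.
summary = people.groupby("city")["salary"].agg(["mean", "max", "count"])
print(summary)
```

Whether mean imputation is appropriate depends on the data; it preserves the column mean but shrinks the variance.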

2.6 Merging and Concatenating Data

# Create two DataFrames
df1 = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David']
})

df2 = pd.DataFrame({
    'id': [1, 2, 3, 5],
    'salary': [50000, 60000, 70000, 90000]
})

# Inner join: only ids present in both tables
merged_inner = pd.merge(df1, df2, on='id', how='inner')
print("Inner join:\n", merged_inner)

# Left join: every row of df1, NaN where df2 has no match
merged_left = pd.merge(df1, df2, on='id', how='left')
print("Left join:\n", merged_left)

# Outer join: every id from either table
merged_outer = pd.merge(df1, df2, on='id', how='outer')
print("Outer join:\n", merged_outer)

# Concatenation: axis=1 places the tables side by side, aligned on the row index (not on id)
concatenated = pd.concat([df1, df2], axis=1)
print("Concatenated:\n", concatenated)

Expected output:

Inner join:
    id     name  salary
0   1    Alice   50000
1   2      Bob   60000
2   3  Charlie   70000
Left join:
    id     name   salary
0   1    Alice  50000.0
1   2      Bob  60000.0
2   3  Charlie  70000.0
3   4    David      NaN
Outer join:
    id     name   salary
0   1    Alice  50000.0
1   2      Bob  60000.0
2   3  Charlie  70000.0
3   4    David      NaN
4   5      NaN  90000.0
Concatenated:
    id     name  id  salary
0   1    Alice   1   50000
1   2      Bob   2   60000
2   3  Charlie   3   70000
3   4    David   5   90000

Because both tables share the same row index 0 to 3, the concatenation introduces no missing values and the integer columns keep their dtype. The last row pairs David (id 4) with the salary for id 5, which shows that concat does not match on keys.
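
When joins produce unexpected rows, it helps to see where each row came from. The following sketch reuses the two small tables from this section and shows the `indicator` and `validate` options of `pd.merge`, plus row-wise `pd.concat`. It is an illustration of those options rather than a recommended pipeline.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3, 4], "name": ["Alice", "Bob", "Charlie", "David"]})
df2 = pd.DataFrame({"id": [1, 2, 3, 5], "salary": [50000, 60000, 70000, 90000]})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'.
audit = pd.merge(df1, df2, on="id", how="outer", indicator=True)
print(audit)

# validate raises MergeError if the key is not unique on either side.
checked = pd.merge(df1, df2, on="id", how="inner", validate="one_to_one")

# Stacking rows with concat; ignore_index rebuilds a clean 0..n-1 index.
stacked = pd.concat([df1, df1], axis=0, ignore_index=True)
print(len(checked), len(stacked))
```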

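3. Matplotlib Visualization

3.1 Matplotlib Overview

Matplotlib is one of the most widely used plotting libraries for Python. It can create static, animated, and interactive figures.

Main features of Matplotlib:

Supports many chart types
Highly customizable
Supports many output formats
Integrates well with NumPy and Pandas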

3.2 Basic Plotting

import matplotlib.pyplot as plt
import numpy as np

# Font setup for Chinese text in labels; only needed when labels contain CJK
# characters, and it requires the SimHei font to be installed
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# Create the data
x = np.linspace(0, 10, 100)
y = np.sin(x)

# Basic line plot
plt.figure(figsize=(10, 6))
plt.plot(x, y, 'b-', linewidth=2, label='sin(x)')
plt.title('Sine function')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.grid(True)
plt.show()

# Scatter plot
x_scatter = np.random.randn(100)
y_scatter = np.random.randn(100)

plt.figure(figsize=(8, 6))
plt.scatter(x_scatter, y_scatter, alpha=0.6)
plt.title('Scatter plot')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

[Expected output (description of the figures)]

Figure 1: line plot titled "Sine function", solid blue line, with grid and legend.
Figure 2: scatter plot of 100 points with alpha 0.6.

3.3 Multiple Subplots

# Create a 2x2 grid of subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# First subplot: sine curve
x = np.linspace(0, 2*np.pi, 100)
axes[0, 0].plot(x, np.sin(x))
axes[0, 0].set_title('Sine function')
axes[0, 0].grid(True)

# Second subplot: cosine curve
axes[0, 1].plot(x, np.cos(x), 'r-')
axes[0, 1].set_title('Cosine function')
axes[0, 1].grid(True)

# Third subplot: scatter plot
x_scatter = np.random.randn(50)
y_scatter = np.random.randn(50)
axes[1, 0].scatter(x_scatter, y_scatter)
axes[1, 0].set_title('Random scatter')

# Fourth subplot: bar chart
categories = ['A', 'B', 'C', 'D']
values = [4, 3, 2, 1]
axes[1, 1].bar(categories, values)
axes[1, 1].set_title('Bar chart')

plt.tight_layout()
plt.show()

[Expected output (description of the figure)]

2x2 subplots: sine curve top left, cosine curve top right, scatter bottom left, bar chart bottom right.

3.4 Statistical Charts

# Bar chart
categories = ['Apple', 'Banana', 'Orange', 'Grape']
sales = [23, 45, 56, 78]

plt.figure(figsize=(8, 6))
bars = plt.bar(categories, sales, color=['red', 'yellow', 'orange', 'purple'])
plt.title('Fruit sales')
plt.xlabel('Fruit')
plt.ylabel('Units sold')

# Add a value label above each bar
for bar, value in zip(bars, sales):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
             str(value), ha='center', va='bottom')

plt.show()

# Pie chart
sizes = [30, 25, 20, 15, 10]
labels = ['Apple', 'Banana', 'Orange', 'Grape', 'Other']
colors = ['red', 'yellow', 'orange', 'purple', 'gray']

plt.figure(figsize=(8, 8))
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
plt.title('Share of fruit sales')
plt.axis('equal')  # equal aspect ratio keeps the pie circular
plt.show()

# Histogram
data = np.random.normal(0, 1, 1000)

plt.figure(figsize=(8, 6))
plt.hist(data, bins=30, alpha=0.7, color='skyblue', edgecolor='black')
plt.title('Histogram of a normal distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)
plt.show()

[Expected output (description of the figures)]

Bar chart with value labels; circular pie chart with percentages; histogram with 30 bins and a visible grid.

3.5 Advanced Visualization

# Heatmap (uses seaborn, a separate library built on Matplotlib)
import seaborn as sns

# Build a correlation matrix
np.random.seed(0)
data = np.random.randn(100, 4)
df_corr = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])
correlation_matrix = df_corr.corr()

plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation heatmap')
plt.show()

# Box plot
data_box = [np.random.normal(0, std, 100) for std in range(1, 4)]
labels = ['Group 1', 'Group 2', 'Group 3']

plt.figure(figsize=(8, 6))
# Matplotlib 3.9+ renames the `labels` parameter to `tick_labels`
plt.boxplot(data_box, labels=labels)
plt.title('Box plot')
plt.ylabel('Value')
plt.show()

# 3D plot
# This import is only required on Matplotlib older than 3.2
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X**2 + Y**2))

surf = ax.plot_surface(X, Y, Z, cmap='viridis')
ax.set_title('3D surface plot')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')

fig.colorbar(surf)
plt.show()

[Expected output (description of the figures)]

Heatmap of the 4x4 correlation matrix; box plot with 3 groups; 3D surface plot with a colormap and colorbar.

4. Using the Three Libraries Together

4.1 A Data Science Workflow

# A complete data analysis example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Generate or load data
np.random.seed(42)
n_samples = 1000

# Simulated data; the columns are drawn independently of each other
data = {
    'age': np.random.normal(35, 10, n_samples),
    'income': np.random.normal(50000, 15000, n_samples),
    'education_years': np.random.normal(16, 3, n_samples),
    'satisfaction': np.random.uniform(1, 10, n_samples)
}

df = pd.DataFrame(data)

# 2. Explore the data
print("Basic information:")
df.info()  # info() prints directly and returns None, so it is not wrapped in print()
print("\nDescriptive statistics:")
print(df.describe())

# 3. Clean the data
# Drop out-of-range values
df_clean = df[(df['age'] > 0) & (df['age'] < 100)]
df_clean = df_clean[(df_clean['income'] > 0) & (df_clean['income'] < 200000)]

print(f"Rows after cleaning: {len(df_clean)}")

# 4. Analyze the data
# Correlation analysis (expect values near 0 because the columns are independent)
correlation = df_clean.corr()
print("\nCorrelation matrix:")
print(correlation)

# 5. Visualize the data
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Age distribution
axes[0, 0].hist(df_clean['age'], bins=30, alpha=0.7, color='skyblue')
axes[0, 0].set_title('Age distribution')
axes[0, 0].set_xlabel('Age')
axes[0, 0].set_ylabel('Frequency')

# Income distribution
axes[0, 1].hist(df_clean['income'], bins=30, alpha=0.7, color='lightgreen')
axes[0, 1].set_title('Income distribution')
axes[0, 1].set_xlabel('Income')
axes[0, 1].set_ylabel('Frequency')

# Age vs. income scatter plot
axes[1, 0].scatter(df_clean['age'], df_clean['income'], alpha=0.6)
axes[1, 0].set_title('Age vs. income')
axes[1, 0].set_xlabel('Age')
axes[1, 0].set_ylabel('Income')

# Correlation heatmap
sns.heatmap(correlation, annot=True, cmap='coolwarm', ax=axes[1, 1])
axes[1, 1].set_title('Correlation heatmap')

plt.tight_layout()
plt.show()

# 6. Summary statistics
print("\nSummary statistics:")
print(f"Mean age: {df_clean['age'].mean():.2f}")
print(f"Mean income: {df_clean['income'].mean():.2f}")
print(f"Age-income correlation: {df_clean['age'].corr(df_clean['income']):.3f}")

4.2 Time Series Analysis

# Working with time series data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create a daily date range covering one year
dates = pd.date_range('2023-01-01', periods=365, freq='D')
np.random.seed(42)

# Simulated sales data: noise plus a one-year seasonal sine component
sales_data = {
    'date': dates,
    'sales': np.random.normal(1000, 200, 365) + 50 * np.sin(np.arange(365) * 2 * np.pi / 365),
    'temperature': np.random.normal(20, 10, 365) + 15 * np.sin(np.arange(365) * 2 * np.pi / 365)
}

df_ts = pd.DataFrame(sales_data)
df_ts.set_index('date', inplace=True)

# Plot the series
plt.figure(figsize=(15, 10))

# Sales trend
plt.subplot(3, 1, 1)
plt.plot(df_ts.index, df_ts['sales'])
plt.title('Sales trend')
plt.ylabel('Sales')

# Temperature trend
plt.subplot(3, 1, 2)
plt.plot(df_ts.index, df_ts['temperature'])
plt.title('Temperature trend')
plt.ylabel('Temperature (°C)')

# Sales vs. temperature scatter plot
plt.subplot(3, 1, 3)
plt.scatter(df_ts['temperature'], df_ts['sales'], alpha=0.6)
plt.title('Temperature vs. sales')
plt.xlabel('Temperature (°C)')
plt.ylabel('Sales')

plt.tight_layout()
plt.show()

# Monthly statistics
# pandas 2.2+ deprecates the 'M' alias in favor of 'ME' (month end)
monthly_sales = df_ts['sales'].resample('M').mean()
monthly_temp = df_ts['temperature'].resample('M').mean()

print("Monthly mean sales:")
print(monthly_sales)
print("\nMonthly mean temperature:")
print(monthly_temp)
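
Beyond monthly resampling, rolling windows and lagged correlations are common next steps on a series like this. The sketch below rebuilds a similar simulated sales series; the seed and the 7-day window are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2023-01-01", periods=365, freq="D")
rng = np.random.default_rng(42)  # seed chosen arbitrarily
sales = pd.Series(
    1000 + 50 * np.sin(np.arange(365) * 2 * np.pi / 365) + rng.normal(0, 200, 365),
    index=dates,
    name="sales",
)

# 7-day rolling mean; the first 6 entries are NaN because the window is incomplete.
weekly_avg = sales.rolling(window=7).mean()

# Weekly totals; "W" bins end on Sundays by default.
weekly_total = sales.resample("W").sum()

# Correlation of the series with itself shifted by one day (lag-1 autocorrelation).
lag1 = sales.autocorr(lag=1)

print(weekly_avg.tail(3))
print(weekly_total.head(3))
print("lag-1 autocorrelation:", round(lag1, 3))
```

A rolling mean smooths day-to-day noise while keeping the daily index, whereas `resample` reduces the series to one value per period.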

5. Case Studies

5.1 Stock Data Analysis

# Stock data analysis example
# This example uses simulated prices, so no market-data package is imported;
# with real data, prices could be loaded with a package such as yfinance instead
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Simulated prices for 252 business days
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=252, freq='B')
stock_data = {
    'Date': dates,
    'Close': 100 + np.cumsum(np.random.randn(252) * 0.02),
    'Volume': np.random.randint(1000000, 5000000, 252)
}

df_stock = pd.DataFrame(stock_data)
df_stock.set_index('Date', inplace=True)

# Technical indicators
df_stock['SMA_20'] = df_stock['Close'].rolling(window=20).mean()
df_stock['SMA_50'] = df_stock['Close'].rolling(window=50).mean()
df_stock['Returns'] = df_stock['Close'].pct_change()

# Visualization
fig, axes = plt.subplots(3, 1, figsize=(15, 12))

# Price with moving averages
axes[0].plot(df_stock.index, df_stock['Close'], label='Close', linewidth=2)
axes[0].plot(df_stock.index, df_stock['SMA_20'], label='20-day SMA', alpha=0.7)
axes[0].plot(df_stock.index, df_stock['SMA_50'], label='50-day SMA', alpha=0.7)
axes[0].set_title('Stock price')
axes[0].set_ylabel('Price')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Volume
axes[1].bar(df_stock.index, df_stock['Volume'], alpha=0.7, color='orange')
axes[1].set_title('Volume')
axes[1].set_ylabel('Volume')

# Distribution of daily returns
axes[2].hist(df_stock['Returns'].dropna(), bins=50, alpha=0.7, color='green')
axes[2].set_title('Distribution of returns')
axes[2].set_xlabel('Return')
axes[2].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

# Summary statistics
print("Stock analysis results:")
print(f"Mean close: {df_stock['Close'].mean():.2f}")
print(f"Highest close: {df_stock['Close'].max():.2f}")
print(f"Lowest close: {df_stock['Close'].min():.2f}")
# Annualized figures assume 252 trading days per year
print(f"Annualized return: {df_stock['Returns'].mean() * 252 * 100:.2f}%")
print(f"Annualized volatility: {df_stock['Returns'].std() * np.sqrt(252) * 100:.2f}%")

5.2 Data Preprocessing for Machine Learning

# Machine learning preprocessing example
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import seaborn as sns

# Load the iris dataset
iris = load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
df_iris['target'] = iris.target

# Explore the data
print("Iris dataset info:")
df_iris.info()  # info() prints directly and returns None
print("\nDescriptive statistics:")
print(df_iris.describe())

# Visualize the data
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Per-class distribution of each feature
for i, feature in enumerate(iris.feature_names):
    row, col = i // 2, i % 2
    for target in range(3):
        data = df_iris[df_iris['target'] == target][feature]
        axes[row, col].hist(data, alpha=0.7, label=f'Class {target}')
    axes[row, col].set_title(f'{feature} distribution')
    axes[row, col].legend()

plt.tight_layout()
plt.show()

# Correlation analysis
plt.figure(figsize=(8, 6))
correlation_matrix = df_iris.drop('target', axis=1).corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature correlation heatmap')
plt.show()

# Standardize the features (zero mean, unit variance)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_iris.drop('target', axis=1))
df_scaled = pd.DataFrame(X_scaled, columns=iris.feature_names)

# PCA dimensionality reduction
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Visualize the PCA result
plt.figure(figsize=(10, 8))
for target in range(3):
    mask = df_iris['target'] == target
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1],
                label=f'Class {target}', alpha=0.7)

plt.xlabel('Principal component 1')
plt.ylabel('Principal component 2')
plt.title('PCA projection')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Cumulative explained variance: {pca.explained_variance_ratio_.sum():.3f}")

6. Performance Optimization and Best Practices

6.1 NumPy Optimization

# NumPy performance example
import time
import numpy as np

# Compare a Python list with a NumPy array
n = 1000000

# Python list (perf_counter is a monotonic, high-resolution timer)
start_time = time.perf_counter()
python_list = list(range(n))
python_result = [x**2 for x in python_list]
python_time = time.perf_counter() - start_time

# NumPy array
start_time = time.perf_counter()
numpy_array = np.arange(n)
numpy_result = numpy_array**2
numpy_time = time.perf_counter() - start_time

print(f"Python list time: {python_time:.4f} s")
print(f"NumPy array time: {numpy_time:.4f} s")
print(f"Speedup: {python_time/numpy_time:.1f}x")

# Vectorized operations
# Slow version: element-by-element Python loop
def slow_function(x):
    result = []
    for i in range(len(x)):
        if x[i] > 0:
            result.append(np.sqrt(x[i]))
        else:
            result.append(0)
    return np.array(result)

# Fast version: clamp negatives to 0 first so np.sqrt never sees a negative input.
# np.where(x > 0, np.sqrt(x), 0) gives the same values but evaluates sqrt on
# every element and emits a RuntimeWarning for the negative ones.
def fast_function(x):
    return np.sqrt(np.maximum(x, 0))

# Timing comparison
x = np.random.randn(100000)
start_time = time.perf_counter()
result1 = slow_function(x)
slow_time = time.perf_counter() - start_time

start_time = time.perf_counter()
result2 = fast_function(x)
fast_time = time.perf_counter() - start_time

print(f"Slow function time: {slow_time:.4f} s")
print(f"Fast function time: {fast_time:.4f} s")
print(f"Speedup: {slow_time/fast_time:.1f}x")
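
Single start/stop measurements like the ones above are sensitive to noise from other processes. As a hedged alternative, the standard-library `timeit` module repeats each measurement; the sketch below compares the same two squaring approaches. The array size and repeat counts are arbitrary choices for this sketch.

```python
import timeit
import numpy as np

n = 100_000
data = np.arange(n, dtype=np.int64)  # explicit dtype avoids overflow on 32-bit defaults

def square_list():
    return [x ** 2 for x in range(n)]

def square_numpy():
    return data ** 2

# repeat() returns one total per run; the minimum is the least noisy estimate.
list_time = min(timeit.repeat(square_list, number=5, repeat=3)) / 5
numpy_time = min(timeit.repeat(square_numpy, number=5, repeat=3)) / 5

print(f"list comprehension: {list_time * 1e3:.3f} ms per call")
print(f"NumPy vectorized:   {numpy_time * 1e3:.3f} ms per call")
```

The absolute numbers depend on the machine, so no expected output is listed here.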

6.2 Pandas Optimization

# Pandas performance example
import time
import pandas as pd

# 1. Use appropriate data types
# Inefficient: repeated strings stored as Python objects
df_inefficient = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B'] * 1000,
    'value': [1, 2, 3, 4] * 1000
})

# Efficient: a categorical stores each distinct string once plus small integer codes
df_efficient = pd.DataFrame({
    'category': pd.Categorical(['A', 'B', 'A', 'B'] * 1000),
    'value': [1, 2, 3, 4] * 1000
})

print("Memory usage comparison:")
print(f"Inefficient: {df_inefficient.memory_usage(deep=True).sum()} bytes")
print(f"Efficient: {df_efficient.memory_usage(deep=True).sum()} bytes")

# 2. Use vectorized operations
# Slow: apply calls a Python function once per element
def slow_apply(df):
    return df['value'].apply(lambda x: x * 2)

# Fast: the multiplication runs on the whole column at once
def fast_apply(df):
    return df['value'] * 2

# Timing comparison
large_df = pd.DataFrame({'value': range(100000)})

start_time = time.perf_counter()
result1 = slow_apply(large_df)
slow_time = time.perf_counter() - start_time

start_time = time.perf_counter()
result2 = fast_apply(large_df)
fast_time = time.perf_counter() - start_time

print(f"apply time: {slow_time:.4f} s")
print(f"Vectorized time: {fast_time:.4f} s")
print(f"Speedup: {slow_time/fast_time:.1f}x")

6.3 Visualization Best Practices

# Visualization best practices example

# Global style (the 'seaborn-v0_8' style name requires Matplotlib 3.6 or later)
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

# Example data
np.random.seed(42)
categories = ['A', 'B', 'C', 'D', 'E']
values1 = np.random.randint(10, 100, 5)
values2 = np.random.randint(10, 100, 5)

# Create the subplots
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. A clear bar chart
bars1 = axes[0, 0].bar(categories, values1, color='skyblue', alpha=0.8)
axes[0, 0].set_title('Clear bar chart', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('Category')
axes[0, 0].set_ylabel('Value')

# Add a value label above each bar
for bar, value in zip(bars1, values1):
    axes[0, 0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
                    str(value), ha='center', va='bottom', fontweight='bold')

# 2. Grouped bar chart
x = np.arange(len(categories))
width = 0.35

bars2 = axes[0, 1].bar(x - width/2, values1, width, label='Group 1', alpha=0.8)
bars3 = axes[0, 1].bar(x + width/2, values2, width, label='Group 2', alpha=0.8)

axes[0, 1].set_title('Grouped bar chart', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('Category')
axes[0, 1].set_ylabel('Value')
axes[0, 1].set_xticks(x)
axes[0, 1].set_xticklabels(categories)
axes[0, 1].legend()

# 3. Scatter plot colored by a third variable
x_scatter = np.random.randn(100)
y_scatter = np.random.randn(100)
colors = np.random.rand(100)

scatter = axes[1, 0].scatter(x_scatter, y_scatter, c=colors, alpha=0.6, cmap='viridis')
axes[1, 0].set_title('Scatter plot', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('X axis')
axes[1, 0].set_ylabel('Y axis')
plt.colorbar(scatter, ax=axes[1, 0])

# 4. Pie chart
sizes = [30, 25, 20, 15, 10]
labels = ['A', 'B', 'C', 'D', 'E']
colors_pie = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99', '#ff99cc']

wedges, texts, autotexts = axes[1, 1].pie(sizes, labels=labels, colors=colors_pie,
                                          autopct='%1.1f%%', startangle=90)
axes[1, 1].set_title('Pie chart', fontsize=14, fontweight='bold')

# Style the percentage labels
for autotext in autotexts:
    autotext.set_color('white')
    autotext.set_fontweight('bold')

plt.tight_layout()
plt.show()

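Summary

Key Points

NumPy: efficient array operations and the foundation for numerical computation
Pandas: powerful tools for data processing and analysis
Matplotlib: a flexible data visualization library
Working together: the three libraries complement each other and form a complete data science toolchain

Study Advice

Build up step by step: learn the NumPy basics first, then Pandas, then go deeper into Matplotlib
Practice first: write code yourself and work with real data
Mind performance: watch code efficiency and prefer vectorized operations
Think visually: learn to express insights from data with charts
Keep learning: follow new features and best practices

Application Areas

Data analysis and exploration
Data preprocessing for machine learning
Scientific computing and numerical analysis
Business intelligence and reporting
Academic research and paper writing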

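Appendix A: Parameters of Commonly Used Functions

NumPy

np.array(object, dtype=None, copy=True, order='K', ndmin=0)
- object: an array-like object or nested sequence
- dtype: element data type (for example np.int64, np.float32)
- copy: whether to copy the data
- order: memory layout ('C' row-major, 'F' column-major; the default 'K' keeps the input's layout where possible)
- ndmin: minimum number of dimensions of the result

np.arange([start,] stop[, step], dtype=None)
- start: first value (default 0)
- stop: end value (excluded)
- step: spacing between values (default 1)
- dtype: element type

np.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None)
- start/stop: first and last values
- num: number of points
- endpoint: whether stop is included
- retstep: whether to also return the step size
- dtype: element type

ndarray.reshape(newshape, order='C')
- newshape: the new shape (one dimension may be -1 to have it inferred)
- order: 'C' row-major, 'F' column-major

np.mean(a, axis=None, dtype=None, keepdims=False)
- a: the array
- axis: axis to average over (None for all elements; 0 gives one mean per column; 1 gives one mean per row)
- dtype: type used for the computation
- keepdims: whether to keep the reduced axis with size 1

np.linalg.inv(a)
- a: an invertible square matrix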
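
A few of the parameters listed above are easier to grasp in use. The sketch below exercises `reshape` with `-1`, `np.mean` with `axis` and `keepdims`, and `np.linspace` with `retstep`; the 12-element array is arbitrary.

```python
import numpy as np

arr = np.arange(12)

# -1 lets NumPy infer that dimension: 12 elements / 3 rows = 4 columns.
grid = arr.reshape(3, -1)

col_means = np.mean(grid, axis=0)                 # collapse rows: one mean per column
row_means = np.mean(grid, axis=1, keepdims=True)  # shape (3, 1) instead of (3,)

# keepdims makes the result broadcast cleanly against the original array.
centered = grid - row_means

# linspace can also return the step it used.
points, step = np.linspace(0, 1, num=5, retstep=True)

print(grid.shape, col_means, row_means.shape, step)
```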

Pandas

pd.Series(data=None, index=None, dtype=None, name=None)
- data: an iterable, ndarray, dict, and so on
- index: index labels
- dtype: data type
- name: name of the Series

pd.DataFrame(data=None, index=None, columns=None, dtype=None)
- data: a dict, 2-D array, sequence of records, and so on
- index: row index
- columns: column names
- dtype: a single data type to force on all columns

pd.read_csv(filepath_or_buffer, sep=',', header='infer', names=None,
            index_col=None, usecols=None, dtype=None, parse_dates=False,
            na_values=None, encoding=None)
- filepath_or_buffer: file path or file-like buffer
- sep: delimiter
- header: row number of the header, or None if the file has no header
- names: column names (used together with header)
- index_col: column(s) to use as the index
- usecols: subset of columns to read
- dtype: type for all columns, or a dict of per-column types
- parse_dates: columns to parse as dates
- na_values: additional strings to treat as missing
- encoding: file encoding

DataFrame.groupby(by, axis=0, as_index=True, sort=True)
- by: a column name, list of names, mapping, or function
- axis: deprecated since pandas 2.1; leave at the default
- as_index: whether the group keys become the index of the result
- sort: whether to sort the group keys

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, suffixes=('_x','_y'))
- left/right: the left and right DataFrames
- how: join type ('left'/'right'/'inner'/'outer')
- on/left_on/right_on: join keys
- left_index/right_index: whether to use the index as the join key
- suffixes: suffixes for overlapping column names

pd.concat(objs, axis=0, join='outer', ignore_index=False)
- objs: a sequence or dict of objects to concatenate
- axis: 0 stacks rows, 1 places columns side by side
- join: 'outer'/'inner'
- ignore_index: whether to rebuild the index as 0..n-1
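
To show how several `pd.read_csv` parameters combine, the sketch below parses a small CSV held in a string rather than a file on disk. The column names, the `;` delimiter, and the `missing` marker are all made up for this illustration.

```python
import io
import pandas as pd

# Inline CSV text stands in for a file; the columns are hypothetical.
csv_text = """id;date;amount
1;2023-01-05;10.5
2;2023-01-06;missing
3;2023-01-07;7.25
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                       # non-default delimiter
    index_col="id",                # use the id column as the row index
    parse_dates=["date"],          # convert this column to datetime64
    na_values=["missing"],         # treat this marker as a missing value
    dtype={"amount": "float64"},   # per-column type
)

print(df.dtypes)
print(df)
```

Passing a path such as `'data.csv'` instead of the `StringIO` object reads from disk with the same options.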

Matplotlib (pyplot)

plt.plot([x], y, [fmt], label=None, linewidth=None, color=None)
- x, y: coordinate sequences (if only one sequence is given, it is used as y and plotted against its index 0..N-1)
- fmt: format string ('b-' for a solid blue line, and so on)
- label: legend label
- linewidth/color: line width and color

plt.scatter(x, y, s=None, c=None, alpha=None, cmap=None, label=None)
- x, y: point coordinates
- s: marker size
- c: a color or a sequence of values (can be combined with cmap)
- alpha: transparency
- cmap: colormap

plt.bar(x, height, width=0.8, color=None, label=None)
- x: categories or positions
- height: bar heights
- width: bar width
- color: color

plt.hist(x, bins=10, range=None, density=False, color=None)
- x: the data
- bins: number of bins or the bin edges
- range: lower and upper range of the bins
- density: whether to normalize to a probability density

plt.pie(x, labels=None, colors=None, autopct=None, startangle=None)
- x: wedge sizes
- labels/colors: labels and colors
- autopct: format of the value labels (for example '%1.1f%%')
- startangle: starting angle

plt.subplots(nrows=1, ncols=1, figsize=None, sharex=False, sharey=False)
- nrows/ncols: rows and columns of the subplot grid
- figsize: figure size in inches
- sharex/sharey: whether subplots share their axes
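
As a usage sketch for `plt.subplots` with `sharex`, the example below draws two stacked panels and renders them to an in-memory PNG instead of opening a window. The non-interactive Agg backend and the in-memory buffer are assumptions made so the sketch runs without a display.

```python
import io
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so no window is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# Two stacked panels sharing the x axis.
fig, (ax_top, ax_bottom) = plt.subplots(nrows=2, ncols=1, figsize=(8, 6), sharex=True)
ax_top.plot(x, np.sin(x), "b-", linewidth=2, label="sin(x)")
ax_top.legend()
ax_bottom.plot(x, np.cos(x), "r-", linewidth=2, label="cos(x)")
ax_bottom.legend()
fig.tight_layout()

# Render to an in-memory PNG.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
n_axes = len(fig.axes)
plt.close(fig)

print(n_axes, buffer.getbuffer().nbytes > 0)
```

With `sharex=True`, zooming or setting limits on one panel applies to the other, and only the bottom panel shows x tick labels.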

Appendix B: Sample Output of Key Code

NumPy Output

# From section 1.2
1-D array: [1 2 3 4 5]
2-D array:
 [[1 2 3]
 [4 5 6]]
Zeros array:
 [[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
Ones array:
 [[1. 1. 1.]
 [1. 1. 1.]]
Range array: [0 2 4 6 8]
Linspace array: [0.   0.25 0.5  0.75 1.  ]

# From section 1.3
Shape: (2, 3)
Dimensions: 2
Size: 6
Dtype: int64
Reshaped:
 [[1 2]
 [3 4]
 [5 6]]
First row: [1 2 3]
First column: [1 4]
Sub-array: [[2 3]]
Boolean mask:
 [[False False False]
 [ True  True  True]]
Selected by condition: [4 5 6]

# From section 1.4 (floating-point digits may differ slightly)
Addition: [ 6  8 10 12]
Multiplication: [ 5 12 21 32]
Square: [ 1  4  9 16]
Square root: [1.         1.41421356 1.73205081 2.        ]
Mean: 5.5
Median: 5.5
Standard deviation: 2.8722813232690143
Variance: 8.25
Max: 10
Min: 1
Matrix:
 [[1 2]
 [3 4]]
Determinant: -2.0000000000000004
Inverse:
 [[-2.   1. ]
 [ 1.5 -0.5]]

Pandas Output

# From section 2.3
DataFrame:
       name  age      city  salary
0    Alice   25  New York   50000
1      Bob   30    London   60000
2  Charlie   35     Paris   70000
3    David   40     Tokyo   80000

Shape: (4, 4)
Columns: Index(['name', 'age', 'city', 'salary'], dtype='object')
Index: RangeIndex(start=0, stop=4, step=1)
Dtypes:
 name      object
age        int64
city      object
salary     int64
dtype: object
First 3 rows:
       name  age      city  salary
0    Alice   25  New York   50000
1      Bob   30    London   60000
2  Charlie   35     Paris   70000
Descriptive statistics:
              age        salary
count   4.000000      4.000000
mean   32.500000  65000.000000
std     6.454972  12909.944487
min    25.000000  50000.000000
25%    28.750000  57500.000000
50%    32.500000  65000.000000
75%    36.250000  72500.000000
max    40.000000  80000.000000

# From section 2.4
Single column: 0      Alice
1        Bob
2    Charlie
3      David
Name: name, dtype: object
Combined filter:
       name  age   city  salary
2  Charlie   35  Paris   70000
3    David   40  Tokyo   80000

# From sections 2.5 and 2.6
Mean salary by city:
 city
London      60000.0
New York    50000.0
Paris       70000.0
Tokyo       80000.0
Name: salary, dtype: float64
Inner join:
    id     name  salary
0   1    Alice   50000
1   2      Bob   60000
2   3  Charlie   70000
Outer join (example):
    id     name   salary
0   1    Alice  50000.0
1   2      Bob  60000.0
2   3  Charlie  70000.0
3   4    David      NaN
4   5      NaN  90000.0

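Matplotlib Output

Because figures are shown in a window or inline, this section describes the typical result instead:

Line plot: titled "Sine function", solid blue line, with grid and a legend entry sin(x)
Scatter plot: 100 points with alpha about 0.6, clustered around the origin because both coordinates are drawn from a standard normal distribution
Subplots: 2x2 layout with sine, cosine, random scatter, and bar chart
Bar chart: one bar per fruit for the four fruits, with the value labeled above each bar
Pie chart: five wedges with percentage labels, drawn as a circle
Histogram: 30-bin histogram of a normal distribution, light blue, with grid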