当前位置：首页 > news >正文

05_数据组合

news 2026/5/12 3:30:40

学习目标

熟练使用Pandas连接数据
熟练使用Pandas合并数据集

1 简介

在动手进行数据分析工作之前，需要进行数据清理工作，数据清理的主要目标是
- 每个观测值成一行
- 每个变量成一列
- 每种观测单元构成一张表格
数据整理好之后，可能需要多张表格组合到一起才能进行某些问题的分析
- 一张表保存公司名称，另一张表保存股票价格
- 单个数据集也可能会分割成多个，比如时间序列数据，每个日期可能在一个单独的文件中

2 连接数据

组合数据的一种方法是使用“连接”（concatenation)
- 连接是指把某行或某列追加到数据中
- 数据被分成了多份可以使用连接把数据拼接起来
- 把计算的结果追加到现有数据集，可以使用连接

2.1 添加行

加载多份数据，并连接起来

importpandasaspd df1=pd.read_csv('data/concat_1.csv')df2=pd.read_csv('data/concat_2.csv')df3=pd.read_csv('data/concat_3.csv')print(df1)

显示结果：

A B C D0a0 b0 c0 d01a1 b1 c1 d12a2 b2 c2 d23a3 b3 c3 d3

print(df2)

显示结果：

A B C D0a4 b4 c4 d41a5 b5 c5 d52a6 b6 c6 d63a7 b7 c7 d7

print(df3)

显示结果：

A B C D0a8 b8 c8 d81a9 b9 c9 d92a10 b10 c10 d103a11 b11 c11 d11

可以使用concat函数将上面3个DataFrame连接起来，需将3个DataFrame放到同一个列表中

row_concat=pd.concat([df1,df2,df3])print(row_concat)

显示结果：

A B C D0a0 b0 c0 d01a1 b1 c1 d12a2 b2 c2 d23a3 b3 c3 d30a4 b4 c4 d41a5 b5 c5 d52a6 b6 c6 d63a7 b7 c7 d70a8 b8 c8 d81a9 b9 c9 d92a10 b10 c10 d103a11 b11 c11 d11

从上面的结果中可以看到，concat函数把3个DataFrame连接在了一起（简单堆叠），可以通过 iloc ,loc等方法取出连接后的数据的子集

row_concat.iloc[3,]

显示结果：

A a3 B b3 C c3 D d3 Name:3, dtype: object

row_concat.loc[3,]

显示结果：

A B C D3a3 b3 c3 d33a7 b7 c7 d73a11 b11 c11 d11

concat可以连接多个对象，如果只需要向现有DataFrame追加一个对象，可以通过append函数来实现

print(df1.append(df2))

显示结果：

A B C D 0 a0 b0 c0 d0 1 a1 b1 c1 d1 2 a2 b2 c2 d2 3 a3 b3 c3 d3 0 a4 b4 c4 d4 1 a5 b5 c5 d5 2 a6 b6 c6 d6 3 a7 b7 c7 d7

使用Python字典添加数据行

data_dict={'A':'n1','B':'n2','C':'n3','D':'n4'}df1.append(data_dict,ignore_index=True)

显示结果：

A B C D0a0 b0 c0 d01a1 b1 c1 d12a2 b2 c2 d23a3 b3 c3 d34n1 n2 n3 n4

上面的例子中，向DataFrame中append一个字典的时候，必须传入ignore_index = True
如果是两个或者多个DataFrame连接，可以通过ignore_index = True参数，忽略后面DataFrame的索引

row_concat_ignore_index=pd.concat([df1,df2,df3],ignore_index=True)print(row_concat_ignore_index)

显示结果：

A B C D0a0 b0 c0 d01a1 b1 c1 d12a2 b2 c2 d23a3 b3 c3 d34a4 b4 c4 d45a5 b5 c5 d56a6 b6 c6 d67a7 b7 c7 d78a8 b8 c8 d89a9 b9 c9 d910a10 b10 c10 d1011a11 b11 c11 d11

2.2 添加列

使用concat函数添加列，与添加行的方法类似，需要多传一个axis参数 axis的默认值是index 按行添加，传入参数 axis = columns 即可按列添加

col_concat=pd.concat([df1,df2,df3],axis=1)print(col_concat)

显示结果：

A B C D A B C D A B C D0a0 b0 c0 d0 a4 b4 c4 d4 a8 b8 c8 d81a1 b1 c1 d1 a5 b5 c5 d5 a9 b9 c9 d92a2 b2 c2 d2 a6 b6 c6 d6 a10 b10 c10 d103a3 b3 c3 d3 a7 b7 c7 d7 a11 b11 c11 d11

通过列名获取子集

print(col_concat['A'])

显示结果：

A A A 0 a0 a4 a8 1 a1 a5 a9 2 a2 a6 a10 3 a3 a7 a11

向DataFrame添加一列，不需要调用函数，通过dataframe[‘列名’] = [‘值’] 即可

col_concat['new_col']=['n1','n2','n3','n4']print(col_concat)

显示结果：

A B C D A B C D A B C D new_col 0 a0 b0 c0 d0 a4 b4 c4 d4 a8 b8 c8 d8 n1 1 a1 b1 c1 d1 a5 b5 c5 d5 a9 b9 c9 d9 n2 2 a2 b2 c2 d2 a6 b6 c6 d6 a10 b10 c10 d10 n3 3 a3 b3 c3 d3 a7 b7 c7 d7 a11 b11 c11 d11 n4

也可以通过dataframe[‘列名’] = Series对象这种方式添加一列

col_concat['new_col_series']=pd.Series(['n1','n2','n3','n4'])col_concat

显示结果：

A B C D A B C D A B C D new_col new_col_series0a0 b0 c0 d0 a4 b4 c4 d4 a8 b8 c8 d8 n1 n11a1 b1 c1 d1 a5 b5 c5 d5 a9 b9 c9 d9 n2 n22a2 b2 c2 d2 a6 b6 c6 d6 a10 b10 c10 d10 n3 n33a3 b3 c3 d3 a7 b7 c7 d7 a11 b11 c11 d11 n4 n4

按列合并数据之后，可以重置列索引，获得有序索引

print(pd.concat([df1,df2,df3],axis='columns',ignore_index=True))

显示结果：

0 1 2 3 4 5 6 7 8 9 10 11 0 a0 b0 c0 d0 a4 b4 c4 d4 a8 b8 c8 d8 1 a1 b1 c1 d1 a5 b5 c5 d5 a9 b9 c9 d9 2 a2 b2 c2 d2 a6 b6 c6 d6 a10 b10 c10 d10 3 a3 b3 c3 d3 a7 b7 c7 d7 a11 b11 c11 d11

2.3 concat连接具有不同行列索引的数据

将上面例子中的数据集做调整，修改列名

df1.columns=['A','B','C','D']df2.columns=['E','F','G','H']df3.columns=['A','C','F','H']

print(df1)

显示结果：

A B C D0a0 b0 c0 d01a1 b1 c1 d12a2 b2 c2 d23a3 b3 c3 d3

print(df2)

显示结果：

E F G H0a4 b4 c4 d41a5 b5 c5 d52a6 b6 c6 d63a7 b7 c7 d7

print(df3)

显示结果：

A C F H0a8 b8 c8 d81a9 b9 c9 d92a10 b10 c10 d103a11 b11 c11 d11

使用concat直接连接，数据会堆叠在一起，列名相同的数据会合并到一列，合并后不存在的数据会用NaN填充

row_concat=pd.concat([df1,df2,df3])print(row_concat)

显示结果：

A B C D E F G H0a0 b0 c0 d0 NaN NaN NaN NaN1a1 b1 c1 d1 NaN NaN NaN NaN2a2 b2 c2 d2 NaN NaN NaN NaN3a3 b3 c3 d3 NaN NaN NaN NaN0NaN NaN NaN NaN a4 b4 c4 d41NaN NaN NaN NaN a5 b5 c5 d52NaN NaN NaN NaN a6 b6 c6 d63NaN NaN NaN NaN a7 b7 c7 d70a8 NaN b8 NaN NaN c8 NaN d81a9 NaN b9 NaN NaN c9 NaN d92a10 NaN b10 NaN NaN c10 NaN d103a11 NaN b11 NaN NaN c11 NaN d11

如果在连接的时候只想保留所有数据集中都有的数据，可以使用join参数，默认是’outer’保留所有数据，如果设置为’inner’ 只保留数据中的共有部分

print(pd.concat([df1,df2,df3],join='inner'))

显示结果：

Empty DataFrame Columns:[]Index:[0,1,2,3,0,1,2,3,0,1,2,3]

print(pd.concat([df1,df3],join='inner'))

显示结果：

A C0a0 c01a1 c12a2 c23a3 c30a8 b81a9 b92a10 b103a11 b11

连接具有不同行索引的数据

df1.index=[0,1,2,3]df2.index=[4,5,6,7]df3.index=[0,2,5,7]

print(df1)

显示结果：

A B C D0a0 b0 c0 d01a1 b1 c1 d12a2 b2 c2 d23a3 b3 c3 d3

print(df2)

显示结果：

E F G H4a4 b4 c4 d45a5 b5 c5 d56a6 b6 c6 d67a7 b7 c7 d7

print(df3)

显示结果：

A C F H0a8 b8 c8 d82a9 b9 c9 d95a10 b10 c10 d107a11 b11 c11 d11

传入axis = ‘columns’ ，连接后的DataFrame按列添加，并匹配各自行索引，缺失值用NaN表示

col_concat=pd.concat([df1,df2,df3],axis='columns')print(col_concat)

显示结果：

A B C D E F G H A C F H0a0 b0 c0 d0 NaN NaN NaN NaN a8 b8 c8 d81a1 b1 c1 d1 NaN NaN NaN NaN NaN NaN NaN NaN2a2 b2 c2 d2 NaN NaN NaN NaN a9 b9 c9 d93a3 b3 c3 d3 NaN NaN NaN NaN NaN NaN NaN NaN4NaN NaN NaN NaN a4 b4 c4 d4 NaN NaN NaN NaN5NaN NaN NaN NaN a5 b5 c5 d5 a10 b10 c10 d106NaN NaN NaN NaN a6 b6 c6 d6 NaN NaN NaN NaN7NaN NaN NaN NaN a7 b7 c7 d7 a11 b11 c11 d11

使用join = 'inner’参数，只保留索引匹配的结果

print(pd.concat([df1,df3],axis=1,join='inner'))

显示结果：

A B C D A C F H0a0 b0 c0 d0 a8 b8 c8 d82a2 b2 c2 d2 a9 b9 c9 d9

3 合并多个数据集

在使用concat连接数据时，涉及到了参数join（join = ‘inner’,join = ‘outer’）
数据库中可以依据共有数据把两个或者多个数据表组合起来，即join操作
DataFrame 也可以实现类似数据库的join操作
Pandas可以通过pd.join命令组合数据，也可以通过pd.merge命令组合数据
- merge更灵活
- 如果想依据行索引来合并DataFrame可以考虑使用join函数

加载数据：

fromsqlalchemyimportcreate_engine#需要安装sqlalchemy pip install sqlalchemyengine=create_engine('sqlite:///data/chinook.db')#连接数据库tracks=pd.read_sql_table('tracks',engine)tracks.head()

显示结果：
TrackId Name AlbumId MediaTypeId GenreId Composer Milliseconds Bytes UnitPrice
0 1 For Those About To Rock (We Salute You) 1 1 1 Angus Young, Malcolm Young, Brian Johnson 343719 11170334 0.99
1 2 Balls to the Wall 2 2 1 None 342562 5510424 0.99
2 3 Fast As a Shark 3 2 1 F. Baltes, S. Kaufman, U. Dirkscneider & W. Ho… 230619 3990994 0.99
3 4 Restless and Wild 3 2 1 F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. D… 252051 4331779 0.99
4 5 Princess of the Dawn 3 2 1 Deaffy & R.A. Smith-Diesel 375418 6290521 0.99

TrackId	Name	AlbumId	MediaTypeId	GenreId	Composer	Milliseconds	Bytes	UnitPrice
0	1	For Those About To Rock (We Salute You)	1	1	1	Angus Young, Malcolm Young, Brian Johnson	343719	11170334	0.99
1	2	Balls to the Wall	2	2	1	None	342562	5510424	0.99
2	3	Fast As a Shark	3	2	1	F. Baltes, S. Kaufman, U. Dirkscneider & W. Ho…	230619	3990994	0.99
3	4	Restless and Wild	3	2	1	F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. D…	252051	4331779	0.99
4	5	Princess of the Dawn	3	2	1	Deaffy & R.A. Smith-Diesel	375418	6290521	0.99

read_sql_table函数可以从数据库中读取表，第一个参数是表名，第二个参数是数据库连接对象

genres=pd.read_sql_table('genres',engine)print(genres)

显示结果：

GenreId Name01Rock12Jazz23Metal34Alternative&Punk45Rock And Roll56Blues67Latin78Reggae89Pop910Soundtrack1011Bossa Nova1112Easy Listening1213Heavy Metal1314R&B/Soul1415Electronica/Dance1516World1617Hip Hop/Rap1718Science Fiction1819TV Shows1920Sci Fi&Fantasy2021Drama2122Comedy2223Alternative2324Classical2425Opera

3.1 一对一合并

最简单的合并只涉及两个DataFrame——把一列与另一列连接，且要连接的列不含任何重复值

先从tracks中提取部分数据，使其不含重复的’GenreId’值

tracks_subset=tracks.loc[[0,62,76,98,110,193,204,281,322,359],]tracks_subset

显示结果：
TrackId Name AlbumId MediaTypeId GenreId Composer Milliseconds Bytes UnitPrice
0 1 For Those About To Rock (We Salute You) 1 1 1 Angus Young, Malcolm Young, Brian Johnson 343719 11170334 0.99
62 63 Desafinado 8 1 2 None 185338 5990473 0.99
76 77 Enter Sandman 9 1 3 Apocalyptica 221701 7286305 0.99
98 99 Your Time Has Come 11 1 4 Cornell, Commerford, Morello, Wilk 255529 8273592 0.99
110 111 Money 12 1 5 Berry Gordy, Jr./Janie Bradford 147591 2365897 0.99
193 194 First Time I Met The Blues 20 1 6 Eurreal Montgomery 140434 4604995 0.99
204 205 Jorge Da Capadócia 21 1 7 Jorge Ben 177397 5842196 0.99
281 282 Girassol 26 1 8 Bino Farias/Da Gama/Lazão/Pedro Luis/Toni Garrido 249808 8327676 0.99
322 323 Dig-Dig, Lambe-Lambe (Ao Vivo) 29 1 9 Cassiano Costa/Cintia Maviane/J.F./Lucas Costa 205479 6892516 0.99
359 360 Vai-Vai 2001 32 1 10 None 276349 9402241 0.99

TrackId	Name	AlbumId	MediaTypeId	GenreId	Composer	Milliseconds	Bytes	UnitPrice
0	1	For Those About To Rock (We Salute You)	1	1	1	Angus Young, Malcolm Young, Brian Johnson	343719	11170334	0.99
62	63	Desafinado	8	1	2	None	185338	5990473	0.99
76	77	Enter Sandman	9	1	3	Apocalyptica	221701	7286305	0.99
98	99	Your Time Has Come	11	1	4	Cornell, Commerford, Morello, Wilk	255529	8273592	0.99
110	111	Money	12	1	5	Berry Gordy, Jr./Janie Bradford	147591	2365897	0.99
193	194	First Time I Met The Blues	20	1	6	Eurreal Montgomery	140434	4604995	0.99
204	205	Jorge Da Capadócia	21	1	7	Jorge Ben	177397	5842196	0.99
281	282	Girassol	26	1	8	Bino Farias/Da Gama/Lazão/Pedro Luis/Toni Garrido	249808	8327676	0.99
322	323	Dig-Dig, Lambe-Lambe (Ao Vivo)	29	1	9	Cassiano Costa/Cintia Maviane/J.F./Lucas Costa	205479	6892516	0.99
359	360	Vai-Vai 2001	32	1	10	None	276349	9402241	0.99

通过’GenreId’列合并数据，how参数指定连接方式
- how = ’left‘ 对应SQL中的left outer保留左侧表中的所有key
- how = ’right‘ 对应SQL中的right outer保留右侧表中的所有key
- how = ‘outer’ 对应SQL中的full outer保留左右两侧侧表中的所有key
- how = ‘inner’ 对应SQL中的inner只保留左右两侧都有的key

genre_track=genres.merge(tracks_subset[['TrackId','GenreId','Milliseconds']],on='GenreId',how='left')print(genre_track)

显示结果：

GenreId Name TrackId Milliseconds01Rock1.0343719.012Jazz63.0185338.023Metal77.0221701.034Alternative&Punk99.0255529.045Rock And Roll111.0147591.056Blues194.0140434.067Latin205.0177397.078Reggae282.0249808.089Pop323.0205479.0910Soundtrack360.0276349.01011Bossa Nova NaN NaN1112Easy Listening NaN NaN1213Heavy Metal NaN NaN1314R&B/Soul NaN NaN1415Electronica/Dance NaN NaN1516World NaN NaN1617Hip Hop/Rap NaN NaN1718Science Fiction NaN NaN1819TV Shows NaN NaN1920Sci Fi&Fantasy NaN NaN2021Drama NaN NaN2122Comedy NaN NaN2223Alternative NaN NaN2324Classical NaN NaN2425Opera NaN NaN

genre_track=genres.merge(tracks_subset[['TrackId','GenreId','Milliseconds']],on='GenreId',how='right')print(genre_track)

显示结果：

GenreId Name TrackId Milliseconds 0 1 Rock 1 343719 1 2 Jazz 63 185338 2 3 Metal 77 221701 3 4 Alternative & Punk 99 255529 4 5 Rock And Roll 111 147591 5 6 Blues 194 140434 6 7 Latin 205 177397 7 8 Reggae 282 249808 8 9 Pop 323 205479 9 10 Soundtrack 360 276349

3.2 多对一合并

计算每名用户的平均消费
- 从三张表中获取数据，用户表获取用户id，姓名
- 发票表，获取发票id，用户id
- 发票详情表，获取发票id,单价，数量

cust=pd.read_sql_table('customers',engine,columns=['CustomerId','FirstName','LastName'])invoice=pd.read_sql_table('invoices',engine,columns=['InvoiceId','CustomerId'])ii=pd.read_sql_table('invoice_items',engine,columns=['InvoiceId','UnitPrice','Quantity'])

根据用户Id(‘CustomerId’)合并用户表和发票表，根据发票Id (‘InvoiceId’)合并发票和发票详情表

cust_inv=cust.merge(invoice,on='CustomerId').merge(ii,on='InvoiceId')print(cust_inv.head())

显示结果：

CustomerId FirstName LastName InvoiceId UnitPrice Quantity01Luís Gonçalves981.99111Luís Gonçalves981.99121Luís Gonçalves1210.99131Luís Gonçalves1210.99141Luís Gonçalves1210.991

计算用户每笔消费的总金额
- DataFrame的assign方法创建新列

total=cust_inv['Quantity']*cust_inv['UnitPrice']cust_inv=cust_inv.assign(Total=total)print(cust_inv.head())

显示结果：

CustomerId FirstName LastName InvoiceId UnitPrice Quantity Total01Luís Gonçalves981.9911.9911Luís Gonçalves981.9911.9921Luís Gonçalves1210.9910.9931Luís Gonçalves1210.9910.9941Luís Gonçalves1210.9910.99

按照用户Id，姓名分组，分组后对总金额求和，并排序

cols=['CustomerId','FirstName','LastName']cust_inv.groupby(cols)['Total'].sum().sort_values(ascending=False).head()

显示结果：

CustomerId FirstName LastName 6 Helena Holý 49.62 26 Richard Cunningham 47.62 57 Luis Rojas 46.62 46 Hugh O'Reilly 45.62 45 Ladislav Kovács 45.62 Name: Total, dtype: float64