A Pandas Journey (5): A First Look at Building Models: Checking Data Consistency

hyderhan

2019-07-01

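How to Build a Simple Model with Pandas as Needed

Hi everyone. In this installment I'd like to share how to build small models with pandas. Let's start with a fairly common scenario:

Every day you have to open N Excel files and perform the same operations on each one, and after wading through a dizzying pile of VBA functions your eyes are giving out.

In this situation, the best approach is to first think carefully about what the business actually needs, then build a small pandas model to match. Once it is in place, you can happily run a Python script when you get to work, go grab a coffee, and come back a few minutes later to find all the tedious steps already done. Life is good.

Enough chit-chat. To get the ball rolling, today I'll briefly introduce a small model I use a lot: a data consistency check (the numbers of records added and removed between the old and new data must reconcile). The article has four parts:

Making fake data
Defining the model's goal
Putting it into practice
Source code and GitHub link

All right, let's go through them one by one.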

1. Making fake data

import os

# These two lines only switch the working directory so I can upload to GitHub more easily; you can ignore them
os.chdir("F:\\Python教程\\segmentfault\\pandas_share\\Pandas之旅_05 如何构建基础模型")
os.getcwd()

'F:\\Python教程\\segmentfault\\pandas_share\\Pandas之旅_05 如何构建基础模型'

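First, let's make some fake data. I'm going to generate fake order data. By the end of the article you may find that the model does not fit this kind of data perfectly, and in practice you will adjust it to your own needs, but at least the basic idea will be in place.

We start by building a dict called fake_product, where the keys are product names and the values are unit prices. For the names we use a CSV dataset of product names found online. It has a single column, Product_Names. Both product_names.csv and the final code are on GitHub, so feel free to download them if you're interested.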

import numpy as np
import pandas as pd
f"Using {pd.__name__},{pd.__version__}"

'Using pandas,0.23.0'

fake_df = pd.read_csv("product_names.csv")
fake_df.head(10)

	Product_Names
0	TrailChef Deluxe Cook Set
1	TrailChef Double Flame
2	Star Dome
3	Star Gazer 2
4	Hibernator Lite
5	Hibernator Extreme
6	Hibernator Camp Cot
7	Firefly Lite
8	Firefly Extreme
9	EverGlow Single

fake_df['Product_Names'].is_unique

True

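As we can see, the dataset is just a list of product names with no duplicates. We now load the names into a dict and give each product a random price between 20 and 100. Since we need to generate random fake data, let's import the random module: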

import random

fake_product = { k:random.randint(20,100) for k in fake_df['Product_Names']}
fake_product

{'TrailChef Deluxe Cook Set': 62,
 'TrailChef Double Flame': 78,
 'Star Dome': 58,
 'Star Gazer 2': 73,
 'Hibernator Lite': 56,
 'Hibernator Extreme': 99,
 'Hibernator Camp Cot': 33,
 'Firefly Lite': 27,
 'Firefly Extreme': 30,
 'EverGlow Single': 44,
 'EverGlow Butane': 33,
 'Husky Rope 50': 59,
 'Husky Rope 60': 81,
 'Husky Rope 100': 71,
 'Husky Rope 200': 81,
 'Granite Climbing Helmet': 86,
 'Husky Harness': 76,
 'Husky Harness Extreme': 73,
 'Granite Signal Mirror': 67,
 'Granite Carabiner': 63,
 'Granite Belay': 49,
 'Granite Pulley': 48,
 'Firefly Climbing Lamp': 47,
 'Firefly Charger': 60,
 'Firefly Rechargeable Battery': 52,
 'Granite Chalk Bag': 22,
 'Granite Ice': 71,
 'Granite Hammer': 50,
 'Granite Shovel': 41,
 'Granite Grip': 74,
 'Granite Axe': 68,
 'Granite Extreme': 74,
 'Mountain Man Extreme': 87,
 'Polar Sun': 82,
 'Polar Ice': 47,
 'Edge Extreme': 53,
 'Bear Survival Edge': 81,
 'Glacier GPS Extreme': 48,
 'BugShield Extreme': 87,
 'Sun Shelter Stick': 42,
 'Compact Relief Kit': 46,
 'Aloe Relief': 24,
 'Infinity': 73,
 'TX': 43,
 'Legend': 100,
 'Kodiak': 44,
 'Capri': 31,
 'Cat Eye': 62,
 'Dante': 71,
 'Fairway': 77,
 'Inferno': 59,
 'Maximus': 38,
 'Trendi': 35,
 'Zone': 87,
 'Max Gizmo': 67,
 'Pocket Gizmo': 73,
 'Ranger Vision': 73,
 'Trail Master': 96,
 'Hailstorm Steel Irons': 79,
 'Hailstorm Titanium Irons': 31,
 'Lady Hailstorm Steel Irons': 91,
 'Lady Hailstorm Titanium Irons': 99,
 'Hailstorm Titanium Woods Set': 74,
 'Hailstorm Steel Woods Set': 30,
 'Lady Hailstorm Titanium Woods Set': 99,
 'Lady Hailstorm Steel Woods Set': 84,
 'Course Pro Putter': 64,
 'Blue Steel Putter': 26,
 'Blue Steel Max Putter': 96,
 'Course Pro Golf and Tee Set': 90,
 'Course Pro Umbrella': 20,
 'Course Pro Golf Bag': 66,
 'Course Pro Gloves': 61,
 'TrailChef Canteen': 60,
 'TrailChef Kitchen Kit': 53,
 'TrailChef Cup': 88,
 'TrailChef Cook Set': 27,
 'TrailChef Single Flame': 45,
 'TrailChef Kettle': 70,
 'TrailChef Utensils': 88,
 'Star Gazer 6': 42,
 'Star Peg': 28,
 'Hibernator': 47,
 'Hibernator Self - Inflating Mat': 66,
 'Hibernator Pad': 89,
 'Hibernator Pillow': 84,
 'Canyon Mule Climber Backpack': 82,
 'Canyon Mule Weekender Backpack': 92,
 'Canyon Mule Journey Backpack': 82,
 'Canyon Mule Cooler': 23,
 'Canyon Mule Carryall': 56,
 'Firefly Mapreader': 77,
 'Firefly 2': 76,
 'Firefly 4': 75,
 'Firefly Multi-light': 91,
 'EverGlow Double': 34,
 'EverGlow Lamp': 28,
 'Mountain Man Analog': 39,
 'Mountain Man Digital': 85,
 'Mountain Man Deluxe': 84,
 'Mountain Man Combination': 40,
 'Venue': 56,
 'Lux': 44,
 'Polar Sports': 20,
 'Polar Wave': 62,
 'Bella': 45,
 'Hawk Eye': 42,
 'Seeker 35': 81,
 'Seeker 50': 90,
 'Opera Vision': 98,
 'Glacier Basic': 63,
 'Glacier GPS': 66,
 'Trail Scout': 32,
 'BugShield Spray': 34,
 'BugShield Lotion Lite': 90,
 'BugShield Lotion': 84,
 'Sun Blocker': 88,
 'Sun Shelter 15': 45,
 'Sun Shelter 30': 100,
 'Sun Shield': 62,
 'Deluxe Family Relief Kit': 43,
 'Calamine Relief': 82,
 'Insect Bite Relief': 72,
 'Star Lite': 32,
 'Star Gazer 3': 95,
 'Single Edge': 87,
 'Double Edge': 20,
 'Bear Edge': 80,
 'Glacier Deluxe': 82,
 'BugShield Natural': 83,
 'TrailChef Water Bag': 99,
 'Canyon Mule Extreme Backpack': 58,
 'EverGlow Kerosene': 78,
 'Sam': 67,
 'Polar Extreme': 34,
 'Seeker Extreme': 43,
 'Seeker Mini': 26,
 'Flicker Lantern': 44,
 'Trail Star': 47,
 'Zodiak': 31,
 'Sky Pilot': 58,
 'Retro': 99,
 'Astro Pilot': 99,
 'Auto Pilot': 20}
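
Because the prices are random, every run produces a different dict. If you want reproducible fake data, one option is to seed a private generator. The sketch below assumes a short made-up product list in place of the Product_Names column; make_prices and its seed parameter are names introduced here for illustration only.

```python
import random

# Hypothetical short product list standing in for the Product_Names column.
product_names = ["Star Dome", "Polar Sun", "Granite Axe"]

def make_prices(names, seed=42):
    """Assign each name a random integer price in [20, 100], reproducibly."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    return {name: rng.randint(20, 100) for name in names}

prices_a = make_prices(product_names)
prices_b = make_prices(product_names)
print(prices_a == prices_b)  # True: same seed, same prices
```

Passing a different seed gives a different but equally repeatable set of prices.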

len(fake_product)

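Here we have a fake_product dict with 144 items, where each key is a product name and each value is its unit price. Next, to save effort, I wrote a simple function, get_fake_data, that returns a fully populated fake dataset, also as a dict: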

def get_fake_data(id_range_start, id_range_end, random_quantity_range=50):
    """Build fake order data for IDs in [id_range_start, id_range_end) as a dict of lists."""
    Id = []
    Quantity = []
    Product_name = []
    Unit_price = []
    Total_price = []

    # Build the (name, price) list once instead of on every iteration
    products = list(fake_product.items())

    for i in range(id_range_start, id_range_end):
        random_quantity = random.randint(1, random_quantity_range)
        name, price = random.choice(products)

        Id.append("A00" + str(i))
        Quantity.append(random_quantity)
        Product_name.append(name)
        Unit_price.append(price)
        Total_price.append(price * random_quantity)

    result = {
        'Product_ID': Id,
        'Product_Name': Product_name,
        'Quantity': Quantity,
        'Unit_price': Unit_price,
        'Total_price': Total_price,
    }

    return result

# The totals could also be computed at the end with a comprehension, just for fun:
# total = [quantity[i] * v for i, v in enumerate(unit_price)]
# total_price = [q * p for q, p in zip(quantity, unit_price)]

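Admittedly this function is not very concise, and you are welcome to tidy it up, but today's focus is the small model. Let's look closely at the dict it returns, which contains the following columns:

Product_ID: the order number, generated in increasing order
Product_Name: the product name, chosen at random
Quantity: the quantity ordered, a random integer between 1 and random_quantity_range
Unit_price: the product's unit price
Total_price: the total price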
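
As one possible way to tidy up the function, here is a sketch that builds one record per order and returns a DataFrame directly. It is only an illustration: get_fake_rows, max_quantity and rng are names made up here, and the three-item fake_product dict is a stand-in for the real one.

```python
import random
import pandas as pd

# Hypothetical stand-in for the fake_product dict built earlier.
fake_product = {"Star Dome": 58, "Polar Sun": 82, "Granite Axe": 68}

COLUMNS = ["Product_ID", "Product_Name", "Quantity", "Unit_price", "Total_price"]

def get_fake_rows(id_start, id_end, max_quantity=50, rng=random):
    """Return a DataFrame with one random order row per ID in [id_start, id_end)."""
    products = list(fake_product.items())  # build the list once, not per row
    rows = []
    for i in range(id_start, id_end):
        name, price = rng.choice(products)
        quantity = rng.randint(1, max_quantity)
        rows.append({
            "Product_ID": f"A00{i}",
            "Product_Name": name,
            "Quantity": quantity,
            "Unit_price": price,
            "Total_price": price * quantity,
        })
    # Passing columns explicitly fixes the column order on any pandas version.
    return pd.DataFrame(rows, columns=COLUMNS)

demo = get_fake_rows(1, 6)
print(demo.shape)  # (5, 5)
```

Keeping each order as a single record means the five columns cannot drift out of step with each other, which is the main risk with five parallel lists.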

Each list has length id_range_end - id_range_start. Now let's generate the fake data:

fake_data= get_fake_data(1,len(fake_product)+1)

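Here we have generated one set of fake data, with IDs running from A001 to A00144.

Let's take a quick look at the keys of the fake data and the length of each list: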

fake_data.keys()

dict_keys(['Product_ID', 'Product_Name', 'Quantity', 'Unit_price', 'Total_price'])

for v in fake_data.values():
    print(len(v))

As you can see, the list under every key has length 144.

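2. Defining the model's goal

We can use pandas' built-in from_dict method to convert the dict into a DataFrame. We will use the fake_data generated above to simulate the stock in January and in February by splitting it into two groups, A001 to A00140 and A008 to A00144, which mimics a realistic situation.

Most products stay the same from one month to the next (IDs 8 to 140), but between January and February, for various reasons, 7 stock items were removed (1 to 7) and 4 were added (141 to 144). The formula we use to verify consistency is:

added + January row count = removed + February row count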
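
To see why the formula balances, note that both sides count every distinct ID that appears in either month. A small sketch with plain Python sets, using the same ID ranges as the article (the variable names here are made up for illustration):

```python
# IDs follow the article's "A00" + number scheme.
jan_ids = {f"A00{i}" for i in range(1, 141)}   # A001 .. A00140
feb_ids = {f"A00{i}" for i in range(8, 145)}   # A008 .. A00144

added = feb_ids - jan_ids      # present only in February
removed = jan_ids - feb_ids    # present only in January

lhs = len(added) + len(jan_ids)      # added + January row count
rhs = len(removed) + len(feb_ids)    # removed + February row count
print(len(added), len(removed), lhs, rhs)  # 4 7 144 144
```

With unique IDs the two sides always agree, so a mismatch in real data usually points to something like duplicated keys in one of the files.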

3. Putting it into practice

Now let's implement this small model. First we create the two DataFrames, stock_jan and stock_fev:

stock= pd.DataFrame.from_dict(fake_data)
stock.head()

	Product_ID	Product_Name	Quantity	Unit_price	Total_price
0	A001	Course Pro Golf Bag	39	66	2574
1	A002	EverGlow Kerosene	18	78	1404
2	A003	Lux	24	44	1056
3	A004	Course Pro Putter	12	64	768
4	A005	Seeker 50	42	90	3780

# Use Product_ID as the index (set_index drops the column by default)
stock.set_index('Product_ID', inplace=True)
stock.head()

	Product_Name	Quantity	Unit_price	Total_price
Product_ID
A001	Course Pro Golf Bag	39	66	2574
A002	EverGlow Kerosene	18	78	1404
A003	Lux	24	44	1056
A004	Course Pro Putter	12	64	768
A005	Seeker 50	42	90	3780

# January stock data: A001 to A00140
stock_jan=stock[:'A00140']
stock_jan.tail()

	Product_Name	Quantity	Unit_price	Total_price
Product_ID
A00136	Flicker Lantern	1	44	44
A00137	BugShield Spray	8	34	272
A00138	Glacier Basic	25	63	1575
A00139	Sun Blocker	23	88	2024
A00140	Granite Carabiner	11	63	693

# February stock data: A008 to A00144
stock_fev=stock['A008':]
stock_fev.tail()

	Product_Name	Quantity	Unit_price	Total_price
Product_ID
A00140	Granite Carabiner	11	63	693
A00141	TrailChef Utensils	24	88	2112
A00142	TrailChef Deluxe Cook Set	9	62	558
A00143	Trail Star	21	47	987
A00144	Ranger Vision	19	73	1387

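Let's pause for a moment and look at these two DataFrames:

stock_jan: all rows from A001 to A00140
stock_fev: all rows from A008 to A00144

The next step is simple: we use the merge function covered in the previous article. The shared keys are the index Product_ID and the column Product_Name, and we do an outer merge. (Merging on a mix of index level names and column names requires pandas 0.23 or later.)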

merge_keys=['Product_ID','Product_Name']

check_corehence = stock_jan.merge(stock_fev,on=merge_keys,how='outer',suffixes=("_jan","_fev"))
check_corehence.head(10)

	Product_Name	Quantity_jan	Unit_price_jan	Total_price_jan	Quantity_fev	Unit_price_fev	Total_price_fev
Product_ID
A001	Course Pro Golf Bag	39.0	66.0	2574.0	NaN	NaN	NaN
A002	EverGlow Kerosene	18.0	78.0	1404.0	NaN	NaN	NaN
A003	Lux	24.0	44.0	1056.0	NaN	NaN	NaN
A004	Course Pro Putter	12.0	64.0	768.0	NaN	NaN	NaN
A005	Seeker 50	42.0	90.0	3780.0	NaN	NaN	NaN
A006	Course Pro Golf Bag	27.0	66.0	1782.0	NaN	NaN	NaN
A007	Husky Rope 100	3.0	71.0	213.0	NaN	NaN	NaN
A008	EverGlow Double	18.0	34.0	612.0	18.0	34.0	612.0
A009	Opera Vision	30.0	98.0	2940.0	30.0	98.0	2940.0
A0010	TX	38.0	43.0	1634.0	38.0	43.0	1634.0

check_corehence.tail()

	Product_Name	Quantity_jan	Unit_price_jan	Total_price_jan	Quantity_fev	Unit_price_fev	Total_price_fev
Product_ID
A00140	Granite Carabiner	11.0	63.0	693.0	11.0	63.0	693.0
A00141	TrailChef Utensils	NaN	NaN	NaN	24.0	88.0	2112.0
A00142	TrailChef Deluxe Cook Set	NaN	NaN	NaN	9.0	62.0	558.0
A00143	Trail Star	NaN	NaN	NaN	21.0	47.0	987.0
A00144	Ranger Vision	NaN	NaN	NaN	19.0	73.0	1387.0

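You can see that the first 7 rows are exactly the stock that was removed, and the last 4 rows are exactly the stock added in February. Now let's extract the removed rows and the added rows separately: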

new_stock = check_corehence.loc[(check_corehence['Quantity_jan'].isnull()) & (check_corehence['Quantity_fev'].notnull())]
num_new = new_stock.shape[0]
num_new

remove_stock = check_corehence.loc[(check_corehence['Quantity_fev'].isnull()) & (check_corehence['Quantity_jan'].notnull())]
num_remove = remove_stock.shape[0]
num_remove
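
As an alternative to testing the Quantity columns for nulls, merge accepts an indicator=True argument that adds a _merge column telling you which side each row came from. A minimal sketch on two tiny made-up frames (jan and feb here are hypothetical stand-ins for stock_jan and stock_fev, with Product_ID kept as a regular column):

```python
import pandas as pd

# Hypothetical miniature versions of the January and February stock.
jan = pd.DataFrame({"Product_ID": ["A001", "A002", "A003"],
                    "Quantity": [5, 3, 8]})
feb = pd.DataFrame({"Product_ID": ["A002", "A003", "A004"],
                    "Quantity": [5, 3, 2]})

merged = jan.merge(feb, on="Product_ID", how="outer",
                   suffixes=("_jan", "_fev"), indicator=True)

# _merge is 'left_only', 'right_only' or 'both' for each row.
removed = merged[merged["_merge"] == "left_only"]   # only in January
added = merged[merged["_merge"] == "right_only"]    # only in February
print(len(added), len(removed))  # 1 1
```

This does not depend on Quantity being non-null in the source data, which the null test above quietly assumes.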

Next, let's look at the row counts for January and February:

# January row count
num_stock_jan = stock_jan.shape[0]
num_stock_jan

# February row count
num_stock_fev = stock_fev.shape[0]
num_stock_fev

Now let's plug the numbers into the formula:

num_stock_jan + num_new

num_stock_fev + num_remove

The two results are equal, so the data passes the consistency check!
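
To run this check every day without retyping the steps, they could be wrapped in a small function. The following is a sketch under assumptions, not the author's code: check_consistency and the sample frames are invented for illustration, and the function expects the key to be a regular column rather than the index.

```python
import pandas as pd

def check_consistency(old, new, keys):
    """Outer-merge two snapshots on `keys` and compare both sides of the formula."""
    merged = old.merge(new, on=keys, how="outer", indicator=True)
    num_added = int((merged["_merge"] == "right_only").sum())
    num_removed = int((merged["_merge"] == "left_only").sum())
    return {
        "old": len(old),
        "new": len(new),
        "added": num_added,
        "removed": num_removed,
        "consistent": len(old) + num_added == len(new) + num_removed,
    }

# Hypothetical snapshots: one clean pair, and one with a duplicated key.
old = pd.DataFrame({"Product_ID": ["A001", "A002", "A003"]})
new_ok = pd.DataFrame({"Product_ID": ["A002", "A003", "A004"]})
new_dup = pd.DataFrame({"Product_ID": ["A002", "A002", "A003", "A004"]})

report_ok = check_consistency(old, new_ok, ["Product_ID"])
report_dup = check_consistency(old, new_dup, ["Product_ID"])
print(report_ok["consistent"], report_dup["consistent"])  # True False
```

The second call shows the kind of problem the formula can catch: a duplicated A002 in the new snapshot inflates its row count, so the two sides no longer match.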

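4. Source code and GitHub link

In this installment I shared a simple pandas model for checking data consistency. The model is still at a very early stage and its functionality is minimal, but you should now be familiar with the basic process of building one. From here you can build models for your own business needs. As long as you work with Excel every day, there is bound to be a model that suits you.

I've put this installment's ipynb and py files, along with the product list used here (Category List), on GitHub. If you'd like to download them, use the link below:

GitHub repository: https://github.com/yaozeliang/pandas_share

I hope you'll keep supporting me. That's a wrap!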
