设计数据密集型应用第三部分：派生数据- 学习笔记- 青岛软件培训-选择一家好的青岛软件培训学校，就要看教学质量和口碑

】
　　event log中只记录uid，而其他属性需要从user database（profile information）读取，这样避免了profile数据的冗余
　　每次通过网路去读取user profile 显然是不切实际的，拖慢批处理速度；而且由于profile 是可变的，导致批处理结果不是确定性的。一个友好的解决办法是：冗余一份数据，放到批处理系统中。
　　
　　下面是一个reduce_side join的例子。称之为sort-merge join，因为不管是User event 还是 User profile都按照userID进行hash，因此都一个用户的event 和 profile会分配到都一个reducer。
　　

批处理使用场景
- 搜索引擎 search index
　　关于增量创建search index，写入新的segment文件，后台批量合并压缩。
　　new segment files and asynchronously merges and compacts segment files in the background.
- 机器学习与推荐
　　一般来说，将数据写入到一个key value store，然后给用户查询
　　怎么讲批处理的结果导入到kvs？直接导入是不大可能的。写入到一个新的db，然后切换。
　　build a brand-new database inside the batch job and write it as files to the job’s output directory in the distributed filesystem
Beyond MapReduce
　　MapReduce的问题
　　（1）比较底层，需要写大量代码：using the raw MapReduce APIs is actually quite hard and laborious
　　解决办法：higher-level programming models (Pig, Hive, Cascading, Crunch) were created as abstractions on top of MapReduce.
　　（2） mapreduce execution model的问题，如下
Materialization of Intermediate State
　　materiallization（物化）是指：每一个MapReduce的输出都需要写入到文件再给下一个MapReduce Task Job。
　　显然，materiallization是提前计算，而不是按需计算。而Unix pipleline 是通过stream按需计算，只占用少量内存空间。
　　MapReduce相比unix pipeline缺陷
1. 　　MapReduce job完成之后才能进行下一个，而unix pipeline是同时执行的
2. 　　Mapper经常是多余的：很多时候仅仅是出去上一个reducer的输出
3. 　　中间状态的存储也是要冗余的，有点浪费
　　dataflow engines如Spark、Tez、Flink试图解决Mapreduce问题的
they handle an entire workflow as one job, rather than breaking it up into independent subjobs.
　　dataflow engines 没有明显的map reduce , 而是一个接一个的operator。其优势：
- 避免了无谓的sort（mr 在map和reduce之间总是要sort）
- 较少非必需的map task
- 由于知道整个流程，可以实现locality optimizations
- 中间状态写入内存或者本地文件，而不是HDFS
- operator流水线工作，不同等到上一个stage完全结束
- 在运行新的operator时，可以复用JVM
Stream Processing
　　上图（a）中的event只需要被任意一个consumer消费即可，而（b）中的每一个event则需要被所有关注该topic的consumer处理
Acknowledgments and redelivery
　　需要consumer的ack来保证消息已被消费，消息可能会被重复投递，因此需要幂等性
　　当 load balancing遇上redeliver，可能会出现messgae 乱序
logbased message brokers
　　一般的消息队列都是一次性消费，基于log的消息队列可以重复消费
The log-based approach trivially supports fan-out messaging, because several consumers can independently read the log without affecting each other—reading a message does not delete it from the log
　　其优点在于：持久化且immutable的日志允许comsumer重新处理所有的事件
This aspect makes log-based messaging more like the batch processes of the last chapter, where derived data is clearly separated from input data through a repeatable transformation process. It allows more experimentation and easier recovery from errors and bugs, making it a good tool for integrating dataflows within an organization

Databases and Streams
　　在log-based message broker有数据库的影子，即数据在log中，那么反过来呢，能否将message的思想应用于db，或者说db中是否本身就有message的思想？
　　其实是有的，在primary-secondary 中，primary写oplog， produce event；secondary读oplog， consume event。
Keep Systems in Sync
　　一份数据以不同的形式保存多分，db、cache、search index、recommend system、OLAP
　　很多都是使用full database dumps（batch process），这个速度太慢，又有滞后；多写（dual write）也是不现实的，增加应用层负担、耦合严重。
Change Data Capture
　　一般来说，应用（db_client）按照db的约束来使用db，而不是直接读取、解析replication log。但如果可以直接读取，则有很多用处，例如用来创建serach index、cache、data warehouse。
　　如下图所示
　　
　　前面是DB（leader），中间是log-based message broker，后面是derived data system(cache, data warehouse) as followers
　　这样做的潜在问题是，日志会越来越多，耗光磁盘，直接删除就的log也是不行的，可以周期性的log compaction：处理对一个key重复的操作，或者说已经被删除的key。这样也能解决新增加一个consumber，且consumber需要所有完整数据的情况。
Event Sourcing
event sourcing involves storing all changes to the application state as a log of change events.
　　CDC在数据层记录，增删改查，一个event可能对应多个data change；mutable
　　event sourcing 在应用层记录，immutable（不应该修改删除）
　　event soucing 一般只记录操作，不记录操作后的结果，因此需要所有数据才能恢复当前的状态
　　周期性的snapshot有助于性能
　　Commands and events: 二者并不等价，Command只是意图（比如想预定座位），只有通过检查，执行成功，才会生成对应的event，event 代表 fact
State, Streams, and Immutability
　　
　　上图非常有意思：state是event stream的累计值，积分的效果，而stream是state的瞬时值，微分的效果
　　Advantages of immutable events
- immutable event log　有利于追溯到任意时间点，也可以更容易从错误中恢复
- immutable event log　比当前状态有更多的信息：用户添加物品到购物篮，然后从购物篮移除；从状态来看，什么都没有发生，但event log却意义丰富
- Deriving several views from the same event log　　当有event log，很容易回放event，产生新的数据视图，而不用冒险修改当前使用的数据视图，做到灰度升级
Processing Streams
　　数据流应用广泛：
1. 写到其他数据系统：db、cache等
2. 推送给用户，或者实时展示
3. 产生其他的数据流，形成链路
　　stream processing 通常用于监控：风控、实时交易系统、机器状态、军事系统
　　CEP（Complex event processing）是对特定事件的监控，对于stream，设置匹配规则，满足条件则触发 complex event
　　In these systems, the relationship between queries and data is reversed compared to normal databases.
　　DB持久化数据，查询是临时的
　　而CEP持久化的是查询语句，数据时源源不断的
Reasoning About Time
　　批处理一般使用event time，而流处理可能采用本地时间（stream processing
关键字：

可能你正在寻找一家靠谱的IT培训机构，渴望突破职业瓶颈，找一份得体的工作。恰巧万码学堂正在寻找像你这样不甘平凡的追光者！我们拒绝纸上谈兵，直接参与真实开发流程！
现在行动，未来可期‌
立即拨打0532-85025005，预约免费职业规划咨询前20名咨询者赠送《2025高薪技术岗位白皮书》!
你不是在报名课程，而是在投资五年后的自己！

申请免费试听课程

万码学堂2025年课程全面升级

设计数据密集型应用第三部分：派生数据

批处理使用场景

Beyond MapReduce

Materialization of Intermediate State

Stream Processing

Acknowledgments and redelivery

logbased message brokers

Databases and Streams

Keep Systems in Sync

Change Data Capture

Event Sourcing

State, Streams, and Immutability

Processing Streams

Reasoning About Time

青岛软件培训

联系我们

电话咨询

扫码添加微信