[case44]聊聊storm trident的operations
序
本文主要研究一下storm trident的operations
function filter projection
Function
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/operation/Function.java
public interface Function extends EachOperation { /** * Performs the function logic on an individual tuple and emits 0 or more tuples. * * @param tuple The incoming tuple * @param collector A collector instance that can be used to emit tuples */ void execute(TridentTuple tuple, TridentCollector collector); }
- Function定义了execute方法,它发射的字段会追加到input tuple中
Filter
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/operation/Filter.java
public interface Filter extends EachOperation { /** * Determines if a tuple should be filtered out of a stream * * @param tuple the tuple being evaluated * @return `false` to drop the tuple, `true` to keep the tuple */ boolean isKeep(TridentTuple tuple); }
- Filter提供一个isKeep方法,用来决定该tuple是否输出
projection
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java
/** * Filters out fields from a stream, resulting in a Stream containing only the fields specified by `keepFields`. * * For example, if you had a Stream `mystream` containing the fields `["a", "b", "c","d"]`, calling" * * ```java * mystream.project(new Fields("b", "d")) * ``` * * would produce a stream containing only the fields `["b", "d"]`. * * * @param keepFields The fields in the Stream to keep * @return */ public Stream project(Fields keepFields) { projectionValidation(keepFields); return _topology.addSourcedNode(this, new ProcessorNode(_topology.getUniqueStreamId(), _name, keepFields, new Fields(), new ProjectedProcessor(keepFields))); }
- 这里使用了ProjectedProcessor来进行projection操作
repartitioning operations
partition
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java
/** * ## Repartitioning Operation * * @param partitioner * @return */ public Stream partition(CustomStreamGrouping partitioner) { return partition(Grouping.custom_serialized(Utils.javaSerialize(partitioner))); }
- 这里使用了CustomStreamGrouping
partitionBy
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java
/** * ## Repartitioning Operation * * @param fields * @return */ public Stream partitionBy(Fields fields) { projectionValidation(fields); return partition(Grouping.fields(fields.toList())); }
- 这里使用Grouping.fields
identityPartition
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java
/** * ## Repartitioning Operation * * @return */ public Stream identityPartition() { return partition(new IdentityGrouping()); }
- 这里使用IdentityGrouping
shuffle
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java
/** * ## Repartitioning Operation * * Use random round robin algorithm to evenly redistribute tuples across all target partitions * * @return */ public Stream shuffle() { return partition(Grouping.shuffle(new NullStruct())); }
- 这里使用Grouping.shuffle
localOrShuffle
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java
/** * ## Repartitioning Operation * * Use random round robin algorithm to evenly redistribute tuples across all target partitions, with a preference * for local tasks. * * @return */ public Stream localOrShuffle() { return partition(Grouping.local_or_shuffle(new NullStruct())); }
- 这里使用Grouping.local_or_shuffle
global
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java
/** * ## Repartitioning Operation * * All tuples are sent to the same partition. The same partition is chosen for all batches in the stream. * @return */ public Stream global() { // use this instead of storm's built in one so that we can specify a singleemitbatchtopartition // without knowledge of storm's internals return partition(new GlobalGrouping()); }
- 这里使用GlobalGrouping
batchGlobal
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java
/** * ## Repartitioning Operation * * All tuples in the batch are sent to the same partition. Different batches in the stream may go to different * partitions. * * @return */ public Stream batchGlobal() { // the first field is the batch id return partition(new IndexHashGrouping(0)); }
- 这里使用IndexHashGrouping,是对整个batch维度的repartition
broadcast
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java
/** * ## Repartitioning Operation * * Every tuple is replicated to all target partitions. This can useful during DRPC – for example, if you need to do * a stateQuery on every partition of data. * * @return */ public Stream broadcast() { return partition(Grouping.all(new NullStruct())); }
- 这里使用Grouping.all
groupBy
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java
/** * ## Grouping Operation * * @param fields * @return */ public GroupedStream groupBy(Fields fields) { projectionValidation(fields); return new GroupedStream(this, fields); }
- 这里返回的是GroupedStream
aggregators
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java
//partition aggregate public Stream partitionAggregate(Aggregator agg, Fields functionFields) { return partitionAggregate(null, agg, functionFields); } public Stream partitionAggregate(CombinerAggregator agg, Fields functionFields) { return partitionAggregate(null, agg, functionFields); } public Stream partitionAggregate(Fields inputFields, CombinerAggregator agg, Fields functionFields) { projectionValidation(inputFields); return chainedAgg() .partitionAggregate(inputFields, agg, functionFields) .chainEnd(); } public Stream partitionAggregate(ReducerAggregator agg, Fields functionFields) { return partitionAggregate(null, agg, functionFields); } public Stream partitionAggregate(Fields inputFields, ReducerAggregator agg, Fields functionFields) { projectionValidation(inputFields); return chainedAgg() .partitionAggregate(inputFields, agg, functionFields) .chainEnd(); } //aggregate public Stream aggregate(Fields inputFields, Aggregator agg, Fields functionFields) { projectionValidation(inputFields); return chainedAgg() .aggregate(inputFields, agg, functionFields) .chainEnd(); } public Stream aggregate(Fields inputFields, CombinerAggregator agg, Fields functionFields) { projectionValidation(inputFields); return chainedAgg() .aggregate(inputFields, agg, functionFields) .chainEnd(); } public Stream aggregate(Fields inputFields, ReducerAggregator agg, Fields functionFields) { projectionValidation(inputFields); return chainedAgg() .aggregate(inputFields, agg, functionFields) .chainEnd(); } //persistent aggregate public TridentState persistentAggregate(StateFactory stateFactory, CombinerAggregator agg, Fields functionFields) { return persistentAggregate(new StateSpec(stateFactory), agg, functionFields); } public TridentState persistentAggregate(StateSpec spec, CombinerAggregator agg, Fields functionFields) { return persistentAggregate(spec, null, agg, functionFields); } public TridentState persistentAggregate(StateFactory stateFactory, Fields inputFields, CombinerAggregator agg, Fields functionFields) { return persistentAggregate(new StateSpec(stateFactory), inputFields, agg, functionFields); } public TridentState persistentAggregate(StateSpec spec, Fields inputFields, CombinerAggregator agg, Fields functionFields) { projectionValidation(inputFields); // replaces normal aggregation here with a global grouping because it needs to be consistent across batches return new ChainedAggregatorDeclarer(this, new GlobalAggScheme()) .aggregate(inputFields, agg, functionFields) .chainEnd() .partitionPersist(spec, functionFields, new CombinerAggStateUpdater(agg), functionFields); } public TridentState persistentAggregate(StateFactory stateFactory, ReducerAggregator agg, Fields functionFields) { return persistentAggregate(new StateSpec(stateFactory), agg, functionFields); } public TridentState persistentAggregate(StateSpec spec, ReducerAggregator agg, Fields functionFields) { return persistentAggregate(spec, null, agg, functionFields); } public TridentState persistentAggregate(StateFactory stateFactory, Fields inputFields, ReducerAggregator agg, Fields functionFields) { return persistentAggregate(new StateSpec(stateFactory), inputFields, agg, functionFields); } public TridentState persistentAggregate(StateSpec spec, Fields inputFields, ReducerAggregator agg, Fields functionFields) { projectionValidation(inputFields); return global().partitionPersist(spec, inputFields, new ReducerAggStateUpdater(agg), functionFields); }
- trident的aggregators主要分为三类,分别是partitionAggregate、aggregate、persistentAggregate;aggregator操作会改变输出
- partitionAggregate其作用的粒度为每个partition,而非整个batch
- aggregrate操作作用的粒度为batch,对每个batch,它先使用global操作将该batch的tuple从所有partition合并到一个partition,最后再对batch进行aggregation操作;这里提供了三类参数,分别是Aggregator、CombinerAggregator、ReducerAggregator;调用stream.aggregrate方法时,相当于一次global aggregation,此时使用Aggregator或ReducerAggregator时,stream会先将tuple划分到一个partition,然后再进行aggregate操作;而使用CombinerAggregator时,trident会进行优化,先对每个partition进行局部的aggregate操作,然后再划分到一个partition,最后再进行aggregate操作,因而相对Aggregator或ReducerAggregator可以节省网络传输耗时
- persistentAggregate操作会对stream上所有batch的tuple进行aggretation,然后将结果存储在state中
Aggregator
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/operation/Aggregator.java
public interface Aggregator<T> extends Operation { T init(Object batchId, TridentCollector collector); void aggregate(T val, TridentTuple tuple, TridentCollector collector); void complete(T val, TridentCollector collector); }
- Aggregator首先会调用init进行初始化,然后通过参数传递给aggregate以及complete方法
- 对于batch partition中的每个tuple执行一次aggregate;当batch partition中的tuple执行完aggregate之后执行complete方法
- 假设自定义Aggregator为累加操作,那么对于[4]、[7]、[8]这批tuple,init为0,对于[4],val=0,0+4=4;对于[7],val=4,4+7=11;对于[8],val=11,11+8=19;然后batch结束,val=19,此时执行complete,可以使用collector发射数据
CombinerAggregator
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/operation/CombinerAggregator.java
public interface CombinerAggregator<T> extends Serializable { T init(TridentTuple tuple); T combine(T val1, T val2); T zero(); }
- CombinerAggregator每收到一个tuple,就调用init获取当前tuple的值,调用combine操作使用前一个combine的结果(
没有的话取zero的值
)与init取得的值进行新的combine操作,如果该partition中没有tuple,则返回zero方法的值 - 假设combine为累加操作,zero返回0,那么对于[4]、[7]、[8]这批tuple,init值分别是4、7、8,对于[4],没有前一个combine结果,于是val1=0,val2=4,combine结果为4;对于[7],val1=4,val2=7,combine结果为11;对于[8],val1为11,val2为8,combine结果为19
- CombinerAggregator操作的网络开销相对较低,因此性能比其他两类aggratator好
ReducerAggregator
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/operation/ReducerAggregator.java
public interface ReducerAggregator<T> extends Serializable { T init(); T reduce(T curr, TridentTuple tuple); }
- ReducerAggregator在对一批tuple进行计算时,先调用一次init获取初始值,然后再执行reduce操作,curr值为前一次reduce操作的值,没有的话,就是init值
- 假设reduce为累加操作,init返回0,那么对于[4]、[7]、[8]这批tuple,对于[4],init为0,然后curr=0,先是0+4=4;对于[7],curr为4,就是4+7=11;对于[8],curr为11,最后就是11+8=19
topology stream operations
join
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/TridentTopology.java
public Stream join(Stream s1, Fields joinFields1, Stream s2, Fields joinFields2, Fields outFields) { return join(Arrays.asList(s1, s2), Arrays.asList(joinFields1, joinFields2), outFields); } public Stream join(List<Stream> streams, List<Fields> joinFields, Fields outFields) { return join(streams, joinFields, outFields, JoinType.INNER); } public Stream join(Stream s1, Fields joinFields1, Stream s2, Fields joinFields2, Fields outFields, JoinType type) { return join(Arrays.asList(s1, s2), Arrays.asList(joinFields1, joinFields2), outFields, type); } public Stream join(List<Stream> streams, List<Fields> joinFields, Fields outFields, JoinType type) { return join(streams, joinFields, outFields, repeat(streams.size(), type)); } public Stream join(Stream s1, Fields joinFields1, Stream s2, Fields joinFields2, Fields outFields, List<JoinType> mixed) { return join(Arrays.asList(s1, s2), Arrays.asList(joinFields1, joinFields2), outFields, mixed); } public Stream join(List<Stream> streams, List<Fields> joinFields, Fields outFields, List<JoinType> mixed) { return join(streams, joinFields, outFields, mixed, JoinOutFieldsMode.COMPACT); } public Stream join(Stream s1, Fields joinFields1, Stream s2, Fields joinFields2, Fields outFields, JoinOutFieldsMode mode) { return join(Arrays.asList(s1, s2), Arrays.asList(joinFields1, joinFields2), outFields, mode); } public Stream join(List<Stream> streams, List<Fields> joinFields, Fields outFields, JoinOutFieldsMode mode) { return join(streams, joinFields, outFields, JoinType.INNER, mode); } public Stream join(Stream s1, Fields joinFields1, Stream s2, Fields joinFields2, Fields outFields, JoinType type, JoinOutFieldsMode mode) { return join(Arrays.asList(s1, s2), Arrays.asList(joinFields1, joinFields2), outFields, type, mode); } public Stream join(List<Stream> streams, List<Fields> joinFields, Fields outFields, JoinType type, JoinOutFieldsMode mode) { return join(streams, joinFields, outFields, repeat(streams.size(), type), mode); } public Stream join(Stream s1, Fields joinFields1, Stream s2, Fields joinFields2, Fields outFields, List<JoinType> mixed, JoinOutFieldsMode mode) { return join(Arrays.asList(s1, s2), Arrays.asList(joinFields1, joinFields2), outFields, mixed, mode); } public Stream join(List<Stream> streams, List<Fields> joinFields, Fields outFields, List<JoinType> mixed, JoinOutFieldsMode mode) { switch (mode) { case COMPACT: return multiReduce(strippedInputFields(streams, joinFields), groupedStreams(streams, joinFields), new JoinerMultiReducer(mixed, joinFields.get(0).size(), strippedInputFields(streams, joinFields)), outFields); case PRESERVE: return multiReduce(strippedInputFields(streams, joinFields), groupedStreams(streams, joinFields), new PreservingFieldsOrderJoinerMultiReducer(mixed, joinFields.get(0).size(), getAllOutputFields(streams), joinFields, strippedInputFields(streams, joinFields)), outFields); default: throw new IllegalArgumentException("Unsupported out-fields mode: " + mode); } }
- 可以看到join最后调用了multiReduce,对于COMPACT类型使用的GroupedMultiReducer是JoinerMultiReducer,对于PRESERVE类型使用的GroupedMultiReducer是PreservingFieldsOrderJoinerMultiReducer
merge
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/TridentTopology.java
public Stream merge(Fields outputFields, Stream... streams) { return merge(outputFields, Arrays.asList(streams)); } public Stream merge(Stream... streams) { return merge(Arrays.asList(streams)); } public Stream merge(List<Stream> streams) { return merge(streams.get(0).getOutputFields(), streams); } public Stream merge(Fields outputFields, List<Stream> streams) { return multiReduce(streams, new IdentityMultiReducer(), outputFields); }
- 可以看到merge最后是调用了multiReduce,使用的MultiReducer是IdentityMultiReducer
multiReduce
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/TridentTopology.java
public Stream multiReduce(Stream s1, Stream s2, MultiReducer function, Fields outputFields) { return multiReduce(Arrays.asList(s1, s2), function, outputFields); } public Stream multiReduce(Fields inputFields1, Stream s1, Fields inputFields2, Stream s2, MultiReducer function, Fields outputFields) { return multiReduce(Arrays.asList(inputFields1, inputFields2), Arrays.asList(s1, s2), function, outputFields); } public Stream multiReduce(GroupedStream s1, GroupedStream s2, GroupedMultiReducer function, Fields outputFields) { return multiReduce(Arrays.asList(s1, s2), function, outputFields); } public Stream multiReduce(Fields inputFields1, GroupedStream s1, Fields inputFields2, GroupedStream s2, GroupedMultiReducer function, Fields outputFields) { return multiReduce(Arrays.asList(inputFields1, inputFields2), Arrays.asList(s1, s2), function, outputFields); } public Stream multiReduce(List<Stream> streams, MultiReducer function, Fields outputFields) { return multiReduce(getAllOutputFields(streams), streams, function, outputFields); } public Stream multiReduce(List<GroupedStream> streams, GroupedMultiReducer function, Fields outputFields) { return multiReduce(getAllOutputFields(streams), streams, function, outputFields); } public Stream multiReduce(List<Fields> inputFields, List<GroupedStream> groupedStreams, GroupedMultiReducer function, Fields outputFields) { List<Fields> fullInputFields = new ArrayList<>(); List<Stream> streams = new ArrayList<>(); List<Fields> fullGroupFields = new ArrayList<>(); for(int i=0; i<groupedStreams.size(); i++) { GroupedStream gs = groupedStreams.get(i); Fields groupFields = gs.getGroupFields(); fullGroupFields.add(groupFields); streams.add(gs.toStream().partitionBy(groupFields)); fullInputFields.add(TridentUtils.fieldsUnion(groupFields, inputFields.get(i))); } return multiReduce(fullInputFields, streams, new GroupedMultiReducerExecutor(function, fullGroupFields, inputFields), outputFields); } public Stream multiReduce(List<Fields> inputFields, List<Stream> streams, MultiReducer function, Fields outputFields) { List<String> names = new ArrayList<>(); for(Stream s: streams) { if(s._name!=null) { names.add(s._name); } } Node n = new ProcessorNode(getUniqueStreamId(), Utils.join(names, "-"), outputFields, outputFields, new MultiReducerProcessor(inputFields, function)); return addSourcedNode(streams, n); }
- multiReduce方法有个MultiReducer参数,join与merge虽然都调用了multiReduce,但是他们传的MultiReducer值不一样
小结
- trident的操作主要有几类,一类是基本的function、filter、projection操作;一类是repartitioning操作,主要是一些grouping;一类是aggregate操作,包括aggregate、partitionAggregate、persistentAggregate;一类是在topology对stream的join、merge操作
- function的话,若有emit字段会追加到原始的tuple上;filter用于过滤tuple;projection用于提取字段
- repartitioning操作有Grouping.local_or_shuffle、Grouping.shuffle、Grouping.all、GlobalGrouping、CustomStreamGrouping、IdentityGrouping、IndexHashGrouping等;partition操作可以理解为将输入的tuple分配到task上,也可以理解为是对stream进行grouping
- aggregate操作的话,普通的aggregate操作有3类接口,分别是Aggregator、CombinerAggregator、ReducerAggregator,其中Aggregator是最为通用的,它继承了Operation接口,而且在方法参数里头可以使用到collector,这是CombinerAggregator与ReducerAggregator所没有的;而CombinerAggregator与Aggregator及ReducerAggregator不同的是,调用stream.aggregrate方法时,trident会优先在partition进行局部聚合,然后再归一到一个partition做最后聚合,相对来说比较节省网络传输耗时,但是如果将CombinerAggregator与非CombinerAggregator的进行chaining的话,就享受不到这个优化;partitionAggregate主要是在partition维度上进行操作;而persistentAggregate则是在整个stream的维度上对所有batch的tuple进行操作,结果持久化在state上
- 对于stream的join及merge操作,其最后都是依赖multiReduce来实现,只是传递的MultiReducer值不一样;join的话join的话需要字段来进行匹配(
字段名可以不一样
),可以选择JoinType,是INNER还是OUTER,不过join是对于spout的small batch来进行join的;merge的话,就是纯粹的几个stream进行tuple的归总。