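Big Data Ecosystem Technology Architecture
This mind map covers the big data ecosystem technology architecture. Its main branches are collection, computing, ML, security, governance, testing, scheduling, architecture, infrastructure, and others.
Edited 2022-05-27 15:57:43. Big data technology stacks in use at large, medium, and small enterprises in China and abroad.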
Collection
Crawlers
Octoparse (八爪鱼)
Houyi Collector (后羿采集器)
Scrapy
Roundup: https://www.cnblogs.com/cy163/p/3869175.html
finndycloud
httrack
Synchronization
Logstash
Flinkx
Cloudera Flume
DataX
Source: https://github.com/alibaba/DataX
Facebook Scribe
Debezium
Canal
Source: https://github.com/alibaba/canal
Sqoop
Maxwell
Source: https://github.com/zendesk/maxwell
Chukwa
CreditEase DBus
Datalink
https://github.com/ucarGroup/DataLink
DataBus
airbyte
https://github.com/airbytehq/airbyte/
Open source
Kettle
Apache Hop
NiFi
StreamSets
https://streamsets.com/
https://github.com/streamsets/datacollector-oss
airbyte
https://airbyte.com/
https://github.com/airbytehq/airbyte
waterdrop
https://github.com/InterestingLab/waterdrop
https://github.com/apache/incubator-seatunnel
Wormhole
https://github.com/edp963/wormhole
Piflow
https://github.com/cas-bigdatalab/piflow
Marmaray
https://github.com/uber/marmaray
Apache Gobblin
https://github.com/apache/incubator-gobblin
meltano
https://github.com/aroder/meltano
https://gitlab.com/meltano/meltano
PipelineWise
https://github.com/transferwise/pipelinewise
singer
https://www.singer.io/
gazette
https://github.com/gazette/core
flow
https://github.com/estuary/flow
Porter
https://gitee.com/sxfad/porter
Commercial
Datastage
Informatica
tapdata
https://tapdata.net/
fivetran
https://www.fivetran.com/
DataPipeline
https://www.datapipeline.com/
Computing
Batch processing
MapReduce (MR)
Spark
DataTorrent
Vespa
https://github.com/vespa-engine
Cubert
https://github.com/linkedin/Cubert
Stream processing
Flink/Blink
Storm
Facebook Puma
Twitter Rainbird
Spark Streaming/Spark Structured Streaming
CreditEase Wormhole
https://github.com/edp963/wormhole
Programming frameworks
Apache Beam
Apache Apex
Iterative computing
Twister
Apache Giraph
Apache Hama
Guagua
https://github.com/ShifuML/guagua
Grid computing
Quantum computing
ML
RapidMiner
Weka
H2O
Apache MADlib
One-stop machine learning platforms
submarine
https://github.com/apache/submarine
Kubeflow
MLflow
SQLFlow
Acumos AI
https://www.acumos.org/platform/
Polyaxon
kedro
PaddlePaddle
Recommended references
https://neptune.ai/blog/the-best-kubeflow-alternatives
https://neptune.ai/blog/the-best-mlflow-alternatives
Open-sourced by ByteDance
klever
https://github.com/kleveross/klever
Open-sourced by WeBank
Prophecis
https://github.com/WeBankFinTech/Prophecis
iQIYI
Deepthought
Open-sourced by DiDi
DLFlow
https://github.com/didi/dlflow
Open-sourced by Guangzhou TipDM (广州泰迪)
https://github.com/GZTipDM/TipDM
DataCanvas (九章云)
https://github.com/DataCanvasIO
https://baijiahao.baidu.com/s?id=1687330658782863165&wfr=spider&for=pc
Intel
https://github.com/intel-analytics/analytics-zoo
Security
Authorization
Apache Ranger
Apache Sentry
Apache Eagle
Authentication
Kerberos
Knox
LDAP
Data masking
https://github.com/armenak/DataDefender
https://github.com/arx-deidentifier/arx
https://github.com/microsoft/presidio
https://github.com/datanymizer/datanymizer
https://github.com/DivanteLtd/anonymizer
https://github.com/Allurx/desensitization
https://github.com/haozhang-x/spring-boot-data-desensitize
https://github.com/aws/aws-encryption-sdk-java
https://github.com/adorsys/datasafe
Governance
Metadata management
Open source
Apache Atlas
WhereHows
https://github.com/akhandoker/WhereHows
https://github.com/linkedin/datahub
Apache Griffin
Netflix Metacat
https://github.com/Netflix/metacat
Lyft Amundsen
https://github.com/lyft/amundsen
openrefine
https://github.com/OpenRefine/OpenRefine
marquez
https://github.com/MarquezProject/marquez
https://blog.csdn.net/xiangwang2206/article/details/109634852
Commercial
Uber Databook
Airbnb Dataportal
Google Catalog
Esensoft (亿信华辰)
https://www.edq.com/
Security
Cloudera Sentry
HDP Ranger
Lineage analysis
Master data management
https://github.com/datacleaner/DataCleaner
Data quality
Apache Griffin
awslabs/deequ
https://gitee.com/dawsongzhao/deequ
datacleaner/DataCleaner
https://github.com/rhiever/datacleaner
https://github.com/datacleaner/DataCleaner
great-expectations/great_expectations
https://github.com/great-expectations/great_expectations
OpenRefine/OpenRefine
pyeve/cerberus
ResidentMario/missingno
WeBankFinTech/Qualitis
mobydq
https://github.com/ubisoft/mobydq
https://github.com/rich-iannone/pointblank
https://github.com/sodadata/soda-sql
https://github.com/topics/data-quality
Testing
Reference documents
Tools
Berkeley BigDataBench
Hadoop benchmarking tools
TestDFSIO
mrbench
nnbench
Data generation
BDGS
Micro workloads
Hadoop GridMix
TeraSort
YCSB
LinkBench
Comprehensive tests
HiBench
BigDataBenchmark
http://prof.ict.ac.cn/publications/
End-to-end tests
BigBench
Paper: BigBench: Towards an Industry Standard Benchmark for Big Data Analytics
Standard:
TPC-DS
Data generation
terrapin
https://github.com/pinterest/terrapin
Scheduling
Oozie
Azkaban
EasyScheduler
Source code URL:
Introductory blog:
Luigi
Airflow
Conductor
Selection analysis:
EasyScheduler VS Azkaban VS Airflow
Blog post URL:
TASKCTL
https://sourceforge.net/projects/taskctl-source/?source=navbar
zeus
xxl-job
https://github.com/xuxueli/xxl-job
Architecture
Lambda
Kappa
Infrastructure
Bare metal
Cloud platforms
DC/OS
Source: https://github.com/dcos/dcos
Virtual machines
Buffering
Kafka
RocketMQ
ActiveMQ
RabbitMQ
ZeroMQ
Pulsar
Mining
Mahout
KNIME
Storage
Distributed file systems
HDFS
KFS
GFS
Ceph
GlusterFS
Alluxio
Databases
MPP
Greenplum
NoSQL
Cassandra
HBase
Secondary indexes
Phoenix
Elasticsearch
Cloud vendors
https://yq.aliyun.com/articles/703153?spm=a2c4e.11155435.0.0.22e14394zqsnXc
Pharos
https://mp.weixin.qq.com/s/iH-RpgdwldNXAeBdrdNp9A
hindex (Huawei)
https://github.com/Huawei-Hadoop/hindex
https://juejin.im/post/5b69698c6fb9a04f92446e62
CockroachDB
Hypertable
asterixdb
https://github.com/apache/asterixdb
GridDB
https://github.com/griddb/griddb_nosql
ScyllaDB
https://github.com/scylladb/scylla
NewSQL
HTAP
Open source
TBase
TiDB
SnappyData
Splice Machine
Kudu
Delta Lake
Iceberg
Hudi
SequoiaDB
https://gitee.com/wangzhonnew/SequoiaDB
Greenplum
Commercial
Google F1/Spanner
Oracle Exadata
SAP HANA
MemSQL
Vertica
ApsaraDB for HBase2.0
Columnar storage
ClickHouse
LevelDB
RocksDB
Vertica
Row storage
KV
Voldemort
Apache Accumulo
Redis
Hibari
Graph data
Neo4j
Dgraph
JanusGraph
Apache TinkerPop Gremlin
Open-sourced by Google
cayley
https://github.com/cayleygraph/cayley
vesoft (欧若数网)
nebula
https://github.com/vesoft-inc/nebula
tigergraph
Chuanglin Technology (创邻科技)
galaxybase
Baidu
BGraph
https://ai.baidu.com/tech/kg/bgraph
HugeGraph
https://github.com/hugegraph
Alibaba
GDB
https://www.aliyun.com/product/gdb?spm=5176.12825654.eofdhaal5.61.23952c4aScJs23&aly_as=7w41-L_h
Ant Financial
GeaBase
https://tech.antfin.com/products/GEABASE
Feima Technology (费马科技)
LightGraph
PandaGraph
Peking University research institute
gStore
https://github.com/pkumod/gStore
Transwarp
StellarDB
Mininglamp Technology
ByteDance
https://zhuanlan.zhihu.com/p/79366551
Huawei
Titan
AWS
Neptune
Microsoft
Graph Engine
Grakn
HGraphDB
https://github.com/rayokota/hgraphdb
Miaoying Technology (妙盈科技)
Miotech
Alibaba
GraphScope
https://github.com/alibaba/GraphScope
Multi-model
ArangoDB
OrientDB
Storage formats
Avro
Parquet
Arrow
CarbonData
ORC
IndexR
https://github.com/shunfei/indexr
In-memory databases
Open source
Alluxio
Ignite
Terracotta
Apache Geode
Commercial
GemFire
Document stores
MongoDB
CouchDB
Couchbase
Time-series databases
OpenTSDB
TDengine
InfluxDB
Prometheus
Graphite
Druid
m3
https://github.com/m3db/m3
iotdb
https://github.com/apache/iotdb
beringei
https://github.com/facebookarchive/beringei
timescaledb
https://github.com/timescale/timescaledb
KairosDB
https://github.com/kairosdb/kairosdb
Crate
https://github.com/crate/crate
questdb
https://github.com/questdb/questdb
Object storage
Open source
minio
https://github.com/minio/minio
Apache Ozone
https://hadoop.apache.org/ozone/
Commercial
AWS S3
Alibaba Cloud OSS
Cloud native
PolarDB
CynosDB
HashDB
Location databases
cartodb
https://github.com/CartoDB/cartodb
Query
Query
Hive
Drill
Phoenix
https://www.cnblogs.com/yulu080808/p/8749056.html
Stinger/Tez
Pig
Shark
SparkSQL
Apache Tajo
Kylin
Surus
Trafodion
https://github.com/apache/trafodion
StreamCQL
Federated query
Quicksql
Cross-data-source query engines
moonbox
https://github.com/edp963/moonbox
xsql
https://github.com/Qihoo360/XSQL
XSQL can be regarded as Presto implemented on Spark. It is easier to deploy on YARN and friendlier to Spark programmers.
Dremio
https://github.com/dremio/dremio-oss
http://www.imooc.com/article/details/id/289835
Presto
Differences between QuickSQL and Presto
https://mp.weixin.qq.com/s?__biz=MjM5MDE0Mjc4MA==&mid=2651019031&idx=2&sn=4cd5490a14e1ed5fb5b9bd735f939591&chksm=bdbeaf448ac92652bc994a5b3ec13b5af3470031ff2226184a9e736c0c9262250c660a3f4c6d&scene=27#wechat_redirect
Drill
openLooKeng
https://openlookeng.io/
octosql
https://github.com/cube2222/octosql
Apache MetaModel
https://github.com/apache/metamodel
Linkis
https://github.com/WeBankFinTech/Linkis
MPP
Presto
Impala
HAWQ
http://hdb.docs.pivotal.io/230/hawq/overview/HAWQOverview.html
AI
OpenCog
TensorFlow
Keras
MXNet
Caffe
PyTorch
Management
Deployment
Linkis
Livy
Coordination
ZooKeeper
Resources
YARN
Mesos
Visualization
Lumify
Davinci
https://github.com/edp963/davinci
Superset
Metabase
https://github.com/metabase/metabase
Hue
Zeppelin
Charting
Carbon
https://github.com/dawnlabs/carbon
D3.js
ECharts
InMaps
GraphX
Graph
Keylines
gephi
Graphviz
Social network graphs
https://en.wikipedia.org/wiki/Social_network_analysis_software
Luigi
GraphBuilder
Kibana
Grafana
Nanocubes
CBoard
https://www.oschina.net/p/cboard
Scriptis
https://github.com/WeBankFinTech/Scriptis
RDB
PageNow
https://www.oschina.net/p/pagenow
metatron
https://github.com/metatron-app/metatron-discovery
RDP
DataGear
https://gitee.com/datagear/datagear
ChartFun
https://github.com/ddiu8081/ChartFun
datawrapper
https://github.com/datawrapper/datawrapper
avuejs
https://data.avuejs.com/
https://www.datawrapper.de/
http://datacolour.cn/
https://gitee.com/DataColour/DashboardClient
Report designer
https://xinglie.github.io/report-designer/
Monitoring
ELK
Grafana+Prometheus
Nagios
Zabbix
Open-Falcon
Ganglia
Operations
Platforms
Ambari
Cloudera Manager
EasyHadoop
Tools
Ansible
Platforms
CDH
HDP
FusionInsight
MapR
Streamsets
https://github.com/streamsets
CDAP
https://github.com/cdapio/cdap
databand
https://gitee.com/475660/databand
BI
Tablend
Jaspersoft
Pentaho
SpagoBI
Search
ElasticSearch
Solr
Lucene
Nutch
Sphinx
SenseiDB
Katta
https://gitee.com/yiidata/katta
Tracing
Dapper
Zipkin
Pinpoint
SkyWalking
Middle platform (中台)
Open source
WeBank DataSphere Studio
Metatron Discovery
dbt
https://github.com/fishtown-analytics/dbt
Commercial
dataiku
dataworks
dataphin
DTStack (袋鼠云)
Shuqi Cloud (数栖云)
Huawei DAYU (华为大禹)
dbt
https://www.getdbt.com/
Domino
DataRobot