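Kafka is a distributed system that stores data directly on disk and reads and writes it linearly, which is fast. The broker does not track what each consumer has consumed, which improves performance. Kafka also avoids copying data between JVM memory and system memory, and reduces costly object creation and garbage collection.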
Topics and Logs
Let’s first dive into the core abstraction Kafka provides for a stream of records: the topic.
A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.
For each topic, the Kafka cluster maintains a partitioned log.
Each partition is an ordered, immutable sequence of records that is continually appended to, forming a structured commit log. Each record in a partition is assigned a sequential id number called the offset, which uniquely identifies the record within that partition.
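The append-only behavior described above can be illustrated with a minimal in-memory sketch. The `PartitionLog` class and its method names are hypothetical, chosen only for illustration; they are not part of any Kafka API, and a real partition lives on disk.

```python
class PartitionLog:
    """Toy in-memory stand-in for one partition: an append-only record list."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return the offset assigned to it."""
        offset = len(self._records)  # offsets are sequential, starting at 0
        self._records.append(record)
        return offset

    def read(self, offset):
        """Return the record stored at the given offset."""
        return self._records[offset]


log = PartitionLog()
offsets = [log.append(r) for r in ("a", "b", "c")]
print(offsets)      # [0, 1, 2]
print(log.read(1))  # b
```

Existing records are never modified or reordered, so an offset keeps pointing at the same record for as long as that record is retained.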
The Kafka cluster retains all published records, whether or not they have been consumed, for a configurable retention period. For example, if the retention policy is set to two days, a record is available for consumption for two days after it is published, after which it is discarded to free up space. Kafka’s performance is effectively constant with respect to data size, so storing data for a long time is not a problem.
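As a rough illustration of time-based retention, the sketch below keeps or discards records purely by age, regardless of whether anyone has read them. The function `apply_retention` and the `(timestamp, value)` record shape are assumptions made for this example. Filtering record by record is also a simplification of how a broker actually reclaims space.

```python
RETENTION_SECONDS = 2 * 24 * 60 * 60  # two days, as in the example above


def apply_retention(records, now, retention=RETENTION_SECONDS):
    """Keep only (timestamp, value) records no older than the retention period."""
    return [(ts, value) for ts, value in records if now - ts <= retention]


day = 24 * 60 * 60
records = [(0, "old"), (2 * day, "recent"), (3 * day, "new")]

# At day 3, the record published at day 0 is three days old and is dropped.
kept = apply_retention(records, now=3 * day)
print([value for _, value in kept])  # ['recent', 'new']
```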
In fact, the only metadata retained on a per-consumer basis is the offset, or position, of that consumer in the log. This offset is controlled by the consumer: normally a consumer advances its offset linearly as it reads records, but because the position is under the consumer’s control, it can consume records in any order it likes. For example, a consumer can reset to an older offset to reprocess data from the past, or skip ahead to the most recent record and start consuming from “now”.
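A small sketch may make the consumer-controlled position more concrete. The `Consumer` class below is a toy model over a plain Python list; the method names `poll`, `seek`, and `seek_to_end` are illustrative and should not be read as the signatures of a real client library.

```python
class Consumer:
    """Toy consumer that owns its own position in a list-backed log."""

    def __init__(self, log):
        self.log = log
        self.position = 0

    def poll(self):
        """Return the next record and advance, or None at the end of the log."""
        if self.position >= len(self.log):
            return None
        record = self.log[self.position]
        self.position += 1
        return record

    def seek(self, offset):
        """Move to an arbitrary offset, for example to reprocess old data."""
        self.position = offset

    def seek_to_end(self):
        """Skip everything already in the log and consume only new records."""
        self.position = len(self.log)


log = ["r0", "r1", "r2", "r3"]
consumer = Consumer(log)
first_pass = [consumer.poll(), consumer.poll()]  # ['r0', 'r1']

consumer.seek(0)            # rewind to reprocess from the start
replayed = consumer.poll()  # 'r0'

consumer.seek_to_end()      # jump to "now"
log.append("r4")
latest = consumer.poll()    # 'r4'
```

The log itself is untouched by any of these moves; only the consumer's own position changes.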
This combination of features means that Kafka consumers are very cheap: they can come and go without much impact on the cluster or on other consumers. For example, you can use the command line tools to “tail” the contents of any topic without changing what is consumed by any existing consumers.
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions, so it can handle an arbitrary amount of data. Second, they act as the unit of parallelism; more on that in a bit.
Distribution
The partitions of the log are distributed over the servers in the Kafka cluster, with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.
Each partition has one server which acts as the “leader” and zero or more servers which act as “followers”. The leader handles all read and write requests for the partition, while the followers passively replicate the leader. If the leader fails, one of the followers automatically becomes the new leader. Each server acts as a leader for some of its partitions and a follower for others, so load is well balanced within the cluster.
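The failover idea can be sketched with a deliberately naive rule: the leader is the first replica in the list that is still alive. This rule, the function `elect_leader`, and the broker names are all assumptions for illustration; the text above only says that one of the followers takes over, not how it is chosen.

```python
def elect_leader(replicas, alive):
    """Return the first replica that is still alive, or None if none is."""
    for broker in replicas:
        if broker in alive:
            return broker
    return None


replicas = ["broker-1", "broker-2", "broker-3"]  # hypothetical broker names

leader_before = elect_leader(replicas, alive={"broker-1", "broker-2", "broker-3"})
print(leader_before)  # broker-1

# broker-1 fails, so a follower is promoted.
leader_after = elect_leader(replicas, alive={"broker-2", "broker-3"})
print(leader_after)   # broker-2
```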
Producers
Producers publish data to the topics of their choice. The producer is responsible for choosing which partition within the topic each record is assigned to. This can be done in a round-robin fashion simply to balance load, or it can be done according to some semantic partition function (say, based on some key in the record). More on the use of partitioning in a second!
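The two strategies mentioned above can be sketched in a few lines. This is an illustrative partitioner, not the one a Kafka client ships with: `make_partitioner` is a hypothetical helper, and CRC32 is used only as a stand-in for "some deterministic hash of the key".

```python
import itertools
import zlib


def make_partitioner(num_partitions):
    """Build a function that picks a partition for each record."""
    counter = itertools.count()

    def choose(key=None):
        if key is None:
            # No key: rotate through the partitions to balance load.
            return next(counter) % num_partitions
        # Keyed record: the same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % num_partitions

    return choose


choose = make_partitioner(4)
round_robin = [choose() for _ in range(5)]
print(round_robin)  # [0, 1, 2, 3, 0]

same_key = {choose("user-42") for _ in range(3)}
print(len(same_key))  # 1
```

Keyed partitioning is what makes per-key ordering possible, since all records with one key end up in one partition.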
Consumers
Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances, so each instance processes only a subset of the records.
If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes, so every instance processes every record.
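The two extremes described above, load balancing within one group and broadcast across separate groups, can be illustrated with a toy delivery function. Note the simplification: `deliver` hands out records one by one in rotation, whereas Kafka divides work by partition. The function and all group and consumer names are hypothetical.

```python
from collections import defaultdict


def deliver(records, groups):
    """Deliver each record to exactly one consumer in every group.

    `groups` maps a group name to the list of consumer names in that group.
    """
    received = defaultdict(list)
    for i, record in enumerate(records):
        for members in groups.values():
            consumer = members[i % len(members)]  # one recipient per group
            received[consumer].append(record)
    return dict(received)


records = ["m0", "m1", "m2", "m3"]

# One shared group: the records are split between the two instances.
shared = deliver(records, {"A": ["a1", "a2"]})
print(shared)  # {'a1': ['m0', 'm2'], 'a2': ['m1', 'm3']}

# One group per instance: every instance sees every record.
broadcast = deliver(records, {"X": ["x1"], "Y": ["y1"]})
print(broadcast["x1"] == broadcast["y1"] == records)  # True
```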
Figure: A two-server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.
More commonly, however, we have found that topics have a small number of consumer groups, one for each “logical subscriber”. Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.
Consumption is implemented in Kafka by dividing up the partitions in the log over the consumer instances, so that each instance is the exclusive consumer of a “fair share” of partitions at any point in time. This process of maintaining membership in the group is handled dynamically by the Kafka protocol. If new instances join the group, they take over some partitions from other members of the group; if an instance dies, its partitions are distributed to the remaining instances.
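One way to picture the “fair share” idea is a function that spreads partitions over the current members and is simply re-run whenever membership changes. This is a sketch under that assumption, not the assignment strategy Kafka itself uses; `assign` and the consumer names `c1` to `c3` are made up for the example.

```python
def assign(partitions, members):
    """Spread partitions over group members in rotation.

    Every partition gets exactly one owner within the group.
    """
    assignment = {member: [] for member in members}
    for i, partition in enumerate(partitions):
        assignment[members[i % len(members)]].append(partition)
    return assignment


partitions = ["P0", "P1", "P2", "P3"]

two = assign(partitions, ["c1", "c2"])
print(two)    # {'c1': ['P0', 'P2'], 'c2': ['P1', 'P3']}

# A new instance joins and takes over some partitions.
three = assign(partitions, ["c1", "c2", "c3"])
print(three)  # {'c1': ['P0', 'P3'], 'c2': ['P1'], 'c3': ['P2']}

# c2 dies; its partitions go to the remaining instances.
after_failure = assign(partitions, ["c1", "c3"])
print(after_failure)  # {'c1': ['P0', 'P2'], 'c3': ['P1', 'P3']}
```

Recomputing from scratch can move more partitions than strictly necessary; it is used here only because it keeps the sketch short.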
Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over all records, this can be achieved with a topic that has only one partition, though this means only one consumer process per consumer group.
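The following sketch shows why partitioning by key is usually enough: records that share a key land in the same partition and therefore keep their relative order, even though nothing orders records across partitions. The event data, the helper `partition_for`, and the use of CRC32 as the hash are assumptions for illustration.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a deterministic hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# (key, sequence number) pairs in the order they were produced.
events = [("alice", 1), ("bob", 1), ("alice", 2), ("bob", 2), ("alice", 3)]

partitions = [[] for _ in range(3)]
for key, seq in events:
    partitions[partition_for(key, 3)].append((key, seq))


def sequence_for(key):
    """Collect one key's sequence numbers in the order a consumer would read them."""
    partition = partitions[partition_for(key, 3)]
    return [seq for k, seq in partition if k == key]


print(sequence_for("alice"))  # [1, 2, 3]
print(sequence_for("bob"))    # [1, 2]
```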