A First Look at Avro
Published: 2019-06-22

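Avro is part of the Hadoop ecosystem. Its development was led by Doug Cutting, the creator of Hadoop, and the latest version at the time of writing is 1.8.2. Avro is a data serialization system designed for applications that exchange large volumes of data. Its main features are: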

  • Rich data structures.
  • A compact, fast, binary data format.
  • A container file, to store persistent data.
  • Remote procedure call (RPC).
  • Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation as an optional optimization, only worth implementing for statically typed languages.
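
To make the "compact binary format" point concrete, here is a minimal sketch of the variable-length zig-zag encoding that the Avro specification uses for int and long values, and of how a string is laid out as a length followed by UTF-8 bytes. The function names are hypothetical helpers written for this illustration; they are not part of the avro package.

```python
def zigzag_encode(n: int) -> bytes:
    """Encode a signed 64-bit integer as a zig-zag varint."""
    # Zig-zag maps small magnitudes to small unsigned numbers:
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    # Emit 7 bits per byte, lowest bits first; the high bit means "more bytes follow".
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def zigzag_decode(data: bytes) -> int:
    """Decode a zig-zag varint from the start of data."""
    z = 0
    shift = 0
    for b in data:
        z |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            break
    return (z >> 1) ^ -(z & 1)


def encode_string(s: str) -> bytes:
    """A string is written as its UTF-8 byte length (a long) followed by the bytes."""
    raw = s.encode("utf-8")
    return zigzag_encode(len(raw)) + raw


for value in (0, -1, 1, 64):
    print(value, zigzag_encode(value).hex())
print(encode_string("Avro").hex())
```

Small numbers take a single byte, which is one reason Avro data files tend to stay compact.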

This post uses Avro 1.8.2, and the language is Python 3.

Working with Avro in Python 3:
1. Create an avsc file, for example province.avsc:

{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "province", "type": "string"},
    {"name": "abbreviation", "type": ["string", "null"]},
    {"name": "capital_city", "type": ["string", "null"]},
    {"name": "area", "type": ["float", "null"]}
  ]
}

Here type says that the Avro data type is record, and fields describes the fields. This schema has four fields: province, of type string; abbreviation, of type string or null; capital_city, of type string or null; and area, of type float or null.
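
As a quick check of that reading of the schema, the following sketch loads the same JSON with the standard library and reports, for each field, its base type and whether it is nullable. It does not use the avro package; describe_field is a hypothetical helper introduced only for this example.

```python
import json

# The schema text from province.avsc, inlined so the example is self-contained.
SCHEMA_JSON = """
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "province", "type": "string"},
    {"name": "abbreviation", "type": ["string", "null"]},
    {"name": "capital_city", "type": ["string", "null"]},
    {"name": "area", "type": ["float", "null"]}
  ]
}
"""

schema = json.loads(SCHEMA_JSON)


def describe_field(field):
    """Return (name, non-null types, nullable) for one field of a record schema."""
    declared = field["type"]
    # A JSON list denotes a union; a plain string denotes a single type.
    branches = declared if isinstance(declared, list) else [declared]
    base = [b for b in branches if b != "null"]
    return field["name"], base, "null" in branches


for f in schema["fields"]:
    name, base, nullable = describe_field(f)
    suffix = " (nullable)" if nullable else ""
    print(f"{name}: {'/'.join(base)}{suffix}")
```

Only province is mandatory; the other three fields are unions with null and may be left empty.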

2. Write Python 3 code that works with Avro, as follows:

# -*- coding: utf-8 -*-
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# Load the schema from the avsc file.
schema = avro.schema.Parse(open("/home/vagrant/province.avsc").read())

# Write one record per province to an Avro container file.
writer = DataFileWriter(open("/home/vagrant/provinces.avro", "wb"), DatumWriter(), schema)
writer.append({"province": "北京市", "abbreviation": "京", "capital_city": "北京", "area": 1.68})
writer.append({"province": "上海市", "abbreviation": "沪", "capital_city": "上海", "area": 0.63})
writer.append({"province": "天津市", "abbreviation": "津", "capital_city": "天津", "area": 1.13})
writer.append({"province": "重庆市", "abbreviation": "渝", "capital_city": "重庆", "area": 8.23})
writer.append({"province": "黑龙江省", "abbreviation": "黑", "capital_city": "哈尔滨", "area": 45.48})
writer.append({"province": "吉林省", "abbreviation": "吉", "capital_city": "长春", "area": 18.74})
writer.append({"province": "辽宁省", "abbreviation": "辽", "capital_city": "沈阳", "area": 14.59})
writer.append({"province": "内蒙古", "abbreviation": "蒙", "capital_city": "呼和浩特", "area": 118.3})
writer.append({"province": "河北省", "abbreviation": "冀", "capital_city": "石家庄", "area": 18.77})
writer.append({"province": "新疆", "abbreviation": "新", "capital_city": "乌鲁木齐", "area": 166})
writer.append({"province": "甘肃省", "abbreviation": "甘", "capital_city": "兰州", "area": 45.44})
writer.append({"province": "青海省", "abbreviation": "青", "capital_city": "西宁", "area": 72.23})
writer.append({"province": "陕西省", "abbreviation": "陕", "capital_city": "西安", "area": 20.56})
writer.append({"province": "宁夏", "abbreviation": "宁", "capital_city": "银川", "area": 6.64})
writer.append({"province": "河南省", "abbreviation": "豫", "capital_city": "郑州", "area": 16.7})
writer.append({"province": "山东省", "abbreviation": "鲁", "capital_city": "济南", "area": 15.38})
writer.append({"province": "山西省", "abbreviation": "晋", "capital_city": "太原", "area": 15.63})
writer.append({"province": "安徽省", "abbreviation": "皖", "capital_city": "合肥", "area": 13.97})
writer.append({"province": "湖北省", "abbreviation": "鄂", "capital_city": "武汉", "area": 18.59})
writer.append({"province": "湖南省", "abbreviation": "湘", "capital_city": "长沙", "area": 21.18})
writer.append({"province": "江苏省", "abbreviation": "苏", "capital_city": "南京", "area": 10.26})
writer.append({"province": "四川省", "abbreviation": "川", "capital_city": "成都", "area": 48.14})
writer.append({"province": "贵州省", "abbreviation": "黔", "capital_city": "贵阳", "area": 17.6})
writer.append({"province": "云南省", "abbreviation": "滇", "capital_city": "昆明", "area": 38.33})
writer.append({"province": "广西省", "abbreviation": "桂", "capital_city": "南宁", "area": 23.6})
writer.append({"province": "西藏", "abbreviation": "藏", "capital_city": "拉萨", "area": 122.8})
writer.append({"province": "浙江省", "abbreviation": "浙", "capital_city": "杭州", "area": 10.2})
writer.append({"province": "江西省", "abbreviation": "赣", "capital_city": "南昌", "area": 16.7})
writer.append({"province": "广东省", "abbreviation": "粤", "capital_city": "广州", "area": 18})
writer.append({"province": "福建省", "abbreviation": "闽", "capital_city": "福州", "area": 12.13})
writer.append({"province": "台湾省", "abbreviation": "台", "capital_city": "台北", "area": 3.6})
writer.append({"province": "海南省", "abbreviation": "琼", "capital_city": "海口", "area": 3.4})
writer.append({"province": "香港", "abbreviation": "港", "capital_city": "香港", "area": 0.1101})
writer.append({"province": "澳门", "abbreviation": "澳", "capital_city": "澳门", "area": 0.00254})
writer.close()

# Read the records back from the file and print each one.
reader = DataFileReader(open("/home/vagrant/provinces.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()

Running the script prints every record that was written, and it also generates the file /home/vagrant/provinces.avro.
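
Before appending records it can help to check them against the schema by hand. The sketch below is a simplified, standard-library-only check for the three primitive types used in province.avsc; it is not the avro library's own validation logic, and matches and check_record are hypothetical helper names. It accepts Python ints for the float field, on the assumption that this is acceptable because the script writes area values such as 166 and 18.

```python
# Field name -> allowed Avro types, copied from province.avsc.
FIELD_TYPES = {
    "province": ["string"],
    "abbreviation": ["string", "null"],
    "capital_city": ["string", "null"],
    "area": ["float", "null"],
}


def matches(avro_type, value):
    """Simplified type check for the primitive types used in this schema."""
    if avro_type == "null":
        return value is None
    if avro_type == "string":
        return isinstance(value, str)
    if avro_type == "float":
        # Accept ints as well (assumption of this sketch), but not booleans.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    raise ValueError(f"unsupported type in this sketch: {avro_type}")


def check_record(record):
    """Return a list of problems; an empty list means the record looks fine."""
    problems = []
    for name, allowed in FIELD_TYPES.items():
        value = record.get(name)  # a missing key is treated like None
        if not any(matches(t, value) for t in allowed):
            problems.append(f"{name}: {value!r} is not one of {allowed}")
    for name in record:
        if name not in FIELD_TYPES:
            problems.append(f"{name}: not a field of the schema")
    return problems


good = {"province": "新疆", "abbreviation": "新", "capital_city": "乌鲁木齐", "area": 166}
bad = {"province": "新疆", "abbreviation": "新", "capital_city": "乌鲁木齐", "area": "166"}
print(check_record(good))
print(check_record(bad))
```

A check like this catches a mistyped value before the writer reaches it, which makes the failing record easier to locate.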

Next, we work with the Avro file in Hive.

First, the province.avsc and provinces.avro files need to be put on HDFS:

hdfs dfs -put ~/province.avsc /user/hive/warehouse/
hdfs dfs -put ~/provinces.avro /user/hive/warehouse/

Enter Hive and create the provinces table, whose structure is described by province.avsc.

hive> CREATE TABLE provinces
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    > STORED AS INPUTFORMAT
    > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    > OUTPUTFORMAT
    > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    > TBLPROPERTIES (
    > 'avro.schema.url'='hdfs:///user/hive/warehouse/province.avsc');

Inspect the provinces table:

hive> desc provinces;
OK
province                string
abbreviation            string
capital_city            string
area                    float
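
The column types that Hive reports follow from the Avro schema: each primitive Avro type maps to a Hive type, and a union of a type with null becomes a nullable column of that type. The sketch below reproduces that mapping for primitive types only; hive_type and hive_columns are hypothetical helpers for illustration, not part of Hive or of the avro package.

```python
import json

# Primitive Avro types and the Hive types they are exposed as.
AVRO_TO_HIVE = {
    "boolean": "boolean",
    "int": "int",
    "long": "bigint",
    "float": "float",
    "double": "double",
    "bytes": "binary",
    "string": "string",
}

SCHEMA_JSON = """
{"namespace": "example.avro", "type": "record", "name": "User",
 "fields": [
   {"name": "province", "type": "string"},
   {"name": "abbreviation", "type": ["string", "null"]},
   {"name": "capital_city", "type": ["string", "null"]},
   {"name": "area", "type": ["float", "null"]}
 ]}
"""


def hive_type(avro_type):
    """Map a primitive Avro type, or a union of one type with null, to a Hive type."""
    if isinstance(avro_type, list):
        non_null = [t for t in avro_type if t != "null"]
        if len(non_null) == 1 and len(non_null) < len(avro_type):
            return hive_type(non_null[0])
        raise ValueError("this sketch only handles unions of one type with null")
    if not isinstance(avro_type, str) or avro_type not in AVRO_TO_HIVE:
        raise ValueError(f"type not covered by this sketch: {avro_type!r}")
    return AVRO_TO_HIVE[avro_type]


def hive_columns(schema):
    """Return [(column name, Hive type), ...] for a record schema."""
    return [(f["name"], hive_type(f["type"])) for f in schema["fields"]]


for name, column_type in hive_columns(json.loads(SCHEMA_JSON)):
    print(f"{name:<24}{column_type}")
```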

Load the data from HDFS:

load data inpath 'hdfs:///user/hive/warehouse/provinces.avro' overwrite into table provinces;

View the data from provinces.avro:

hive> select * from provinces;
OK
北京市    京    北京    1.68
上海市    沪    上海    0.63
天津市    津    天津    1.13
重庆市    渝    重庆    8.23
黑龙江省    黑    哈尔滨    45.48
吉林省    吉    长春    18.74
辽宁省    辽    沈阳    14.59
内蒙古    蒙    呼和浩特    118.3
河北省    冀    石家庄    18.77
新疆    新    乌鲁木齐    166.0
甘肃省    甘    兰州    45.44
青海省    青    西宁    72.23
陕西省    陕    西安    20.56
宁夏    宁    银川    6.64
河南省    豫    郑州    16.7
山东省    鲁    济南    15.38
山西省    晋    太原    15.63
安徽省    皖    合肥    13.97
湖北省    鄂    武汉    18.59
湖南省    湘    长沙    21.18
江苏省    苏    南京    10.26
四川省    川    成都    48.14
贵州省    黔    贵阳    17.6
云南省    滇    昆明    38.33
广西省    桂    南宁    23.6
西藏    藏    拉萨    122.8
浙江省    浙    杭州    10.2
江西省    赣    南昌    16.7
广东省    粤    广州    18.0
福建省    闽    福州    12.13
台湾省    台    台北    3.6
海南省    琼    海口    3.4
香港    港    香港    0.1101
澳门    澳    澳门    0.00254

That is all for this post. Comments and discussion are welcome.


