Paper title
An Efficient Spark-based Adaptive Windowing for Entity Matching
Notes
SNM: Sorted Neighborhood Method
DCS++: Duplicate Count Strategy
Edit the MySQL configuration to enable binlog
```
vim /etc/my.cnf
```

```
[mysqld]
```
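The body of the `[mysqld]` section was cut off above. Maxwell requires a row-format binlog with a unique server id; following the Maxwell quickstart, a minimal fragment (the `server_id` value and log name are arbitrary examples) looks like:

```ini
[mysqld]
# required by Maxwell: row-based binlog with a unique server id
server_id     = 1
log-bin       = master
binlog_format = row
```

Restart mysqld after changing these settings so the binlog options take effect.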
Grant privileges to the maxwell user
```
mysql -h localhost -u root -p
```

```
mysql> GRANT ALL on maxwell.* to 'maxwell'@'%' identified by 'XXXXXX';
```
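Besides full access to its own `maxwell` schema, the Maxwell quickstart also grants the user replication privileges on all databases so it can read the binlog; with the same user and host as above, that is:

```sql
mysql> GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'maxwell'@'%';
```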
Here, start Maxwell together with Kafka.
First, start ZooKeeper:
```
cd zookeeper-3.4.10
./bin/zkServer.sh start
```
Then switch to the Kafka and Maxwell directories:

```
cd kafka_2.11-1.1.0
```

```
cd maxwell-1.17.1
```
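The actual start commands were omitted in my notes; assuming default configs and the Kafka producer setup from Maxwell's documentation, they would look roughly like this (user and password as granted above):

```
# in kafka_2.11-1.1.0: start the broker
bin/kafka-server-start.sh config/server.properties

# in maxwell-1.17.1: stream binlog events into the "maxwell" topic
bin/maxwell --user='maxwell' --password='XXXXXX' \
    --producer=kafka --kafka.bootstrap.servers=localhost:9092
```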
At the same time, start a Kafka console consumer:
```
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic maxwell --from-beginning
```
```
mysql> create database test;
mysql> insert into test values(1, 'dkey', 'Shanghai');
```
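The `create table` step between these two statements isn't shown; judging from the insert and the column names in the JSON event that Maxwell emits (`id`, `name`, `address`), a hypothetical schema would be something like (the column types are my guess):

```sql
mysql> create table test.test (id int, name varchar(32), address varchar(64));
```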
The Kafka console consumer then receives the following JSON:
```
{"database":"test","table":"test","type":"insert","ts":1531214750,"xid":2143,"commit":true,"data":{"id":1,"name":"dkey","address":"Shanghai"}}
```
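Each Maxwell message is plain JSON, so a downstream consumer can decode it with nothing but the standard library; a minimal sketch against the event above:

```python
import json

# one Maxwell change event, copied from the console consumer output above
msg = '{"database":"test","table":"test","type":"insert","ts":1531214750,"xid":2143,"commit":true,"data":{"id":1,"name":"dkey","address":"Shanghai"}}'

event = json.loads(msg)
# route on the operation type; "data" holds the new row values
print(event["type"], event["data"]["name"])   # insert dkey
```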
List all namespaces
```
> list_namespace
NAMESPACE
default
hbase
```
List all tables in a namespace
```
> list_namespace_tables 'default'
TABLE
users
```
Count the rows of a table

```
> count 'users'
```
The airflow 1.9.0 release has a timezone bug which, according to Stack Overflow, has already been fixed on the git master branch, so clone airflow's master branch and install from the local source:
```
git clone https://github.com/apache/incubator-airflow.git
cd incubator-airflow
```
Once inside the project directory, `pip install .` installs from the local source; the `all` extra installs every component that might be needed, including hive, mysql, etc.
```
pip install ".[all]"
```
Errors may come up during installation; resolve them one by one according to the error messages and your environment. Alternatively, install only the components you actually need:
```
pip install ".[mysql, hive, celery]"
```
According to the timezone documentation: "Support for time zones is enabled by default. Airflow stores datetime information in UTC internally and in the database. It allows you to run your DAGs with time zone dependent schedules. At the moment Airflow does not convert them to the end user's time zone in the user interface. There it will always be displayed in UTC. Also templates used in Operators are not converted. Time zone information is exposed and it is up to the writer of DAG what do with it." In other words, times shown in the Web UI are not converted to the local time zone; you simply attach a time zone when writing the DAG.
```
cd airflow
```
Edit the configuration and set:

```
default_timezone = Asia/Shanghai
```
Check that the setting takes effect:
```
>>> from airflow.utils import timezone
>>> a_date = timezone.datetime(2018,6,1)
>>> a_date
```
Output:
```
datetime.datetime(2018, 6, 1, 0, 0, tzinfo=<Timezone [Asia/Shanghai]>)
```
```
sql_alchemy_conn = mysql://airflow:airflow@localhost:3306/airflow
```
```
pip install cryptography
```
Generate a Fernet key for airflow
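The key can be produced with the cryptography package just installed; this one-off snippet prints a key whose value goes into `fernet_key` in `airflow.cfg`:

```python
from cryptography.fernet import Fernet

# a Fernet key is 32 random bytes, urlsafe-base64 encoded (44 characters)
key = Fernet.generate_key()
print(key.decode())
```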
celery requires redis:
```
pip install redis
```
```
vim airflow.cfg
```

```
executor = CeleryExecutor
```
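With CeleryExecutor, the broker and result backend also have to point at redis. On the master branch installed above, the keys live in the `[celery]` section of `airflow.cfg`; the hosts and db numbers below are example values:

```
[celery]
broker_url = redis://localhost:6379/0
result_backend = redis://localhost:6379/1
```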
$L^p$ norm: ${\left\| {\mathbf{x}} \right\|_p} = {\left(\sum\limits_i {\left| {x_i} \right|}^p \right)^{\frac{1}{p}}}$
Inner product: $\left\langle \mathbf{x},\mathbf{y} \right\rangle = \sum\limits_i {x_i \cdot y_i} $
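Both definitions translate directly into code; a quick sanity check in plain Python (no numpy assumed):

```python
def lp_norm(x, p):
    """L^p norm: (sum_i |x_i|^p)^(1/p)."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

def inner(x, y):
    """Inner product: sum_i x_i * y_i."""
    return sum(xi * yi for xi, yi in zip(x, y))

print(lp_norm([3, 4], 2))     # 5.0  (p=2 gives the Euclidean norm)
print(inner([1, 2], [3, 4]))  # 11
```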
Situation: we may need to develop under both Python 2 and Python 3 at the same time, or different projects may need different packages installed, and we want the packages installed by different projects not to interfere with one another; pyenv can set up isolated Python virtual environments for this.
The simplest approach is pyenv-installer; just follow the official documentation, so I won't repeat it here.
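With pyenv-virtualenv installed, the per-project isolation described above comes down to a few commands; the version numbers and the `proj-a` name are just examples:

```
pyenv install 2.7.15              # a python2 interpreter
pyenv install 3.6.5               # a python3 interpreter
pyenv virtualenv 3.6.5 proj-a     # an isolated env for one project
cd proj-a && pyenv local proj-a   # auto-activate inside the project dir
```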