Twein



S-DCS++ Paper Notes

Posted on 2018-10-09

Paper title

An Efficient Spark-based Adaptive Windowing for Entity Matching

Notes

SNM: Sorted Neighborhood Method
DCS++: Duplicate Count Strategy
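
To make the SNM idea concrete, here is a minimal sketch of the classic fixed-window Sorted Neighborhood Method: sort the records by a key, then compare each record only with the records that follow it inside a sliding window. The function name, the sample names, and the choice of the whole string as sorting key are assumptions for illustration. The sketch does not implement the adaptive windowing of DCS++ or the Spark-based variant from the paper.

```python
from typing import Callable, Iterator, Sequence, Tuple


def sorted_neighborhood_pairs(
    records: Sequence[str],
    key: Callable[[str], str],
    window: int,
) -> Iterator[Tuple[str, str]]:
    """Yield candidate pairs using a fixed-size sliding window.

    Each record is paired with the next (window - 1) records in sort order.
    """
    ordered = sorted(records, key=key)
    for i, left in enumerate(ordered):
        for right in ordered[i + 1 : i + window]:
            yield left, right


# Hypothetical sample data; the sorting key is simply the whole string.
names = ["smith john", "smyth john", "jones mary", "smith jon", "jonas mary"]
pairs = list(sorted_neighborhood_pairs(names, key=lambda s: s, window=2))

for left, right in pairs:
    print(left, "<->", right)
```

With a window of 2 only adjacent records in sort order are compared, so the number of candidate pairs grows linearly with the number of records instead of quadratically.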

data skew problem

pandas-tricks

Posted on 2018-08-30

https://realpython.com/python-pandas-tricks/

Configuring and using Maxwell

Posted on 2018-07-10

Official documentation, reference. These notes record how to use Maxwell:

Preparation

Edit the MySQL configuration to enable the binlog

vim /etc/my.cnf

[mysqld]
server_id=1
log-bin=master
binlog_format=row

Grant privileges to the maxwell user

mysql -h localhost -u root -p

mysql> GRANT ALL on maxwell.* to 'maxwell'@'%' identified by 'XXXXXX';
mysql> GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE on *.* to 'maxwell'@'%';

Start and run

Here Maxwell is run together with Kafka. Start ZooKeeper first:

cd zookeeper-3.4.10
./bin/zkServer.sh start

Then start Kafka:

cd kafka_2.11-1.1.0
./bin/kafka-server-start.sh config/server.properties

Then start Maxwell:

cd maxwell-1.17.1
bin/maxwell --user='maxwell' --password='XXXXXX' --host='127.0.0.1' \
--producer=kafka --kafka.bootstrap.servers=localhost:9092 --kafka_topic=maxwell

At the same time, start a Kafka console consumer:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic maxwell --from-beginning

Insert data

mysql> create database test;
mysql> use test;
mysql> create table test(id int, name varchar(10), address varchar(20));

mysql> insert into test values(1, 'dkey', 'Shanghai');

The Kafka console consumer then prints the JSON event:

{"database":"test","table":"test","type":"insert","ts":1531214750,"xid":2143,"commit":true,"data":{"id":1,"name":"dkey","address":"Shanghai"}}
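
As a rough illustration of how a downstream consumer might handle such an event, the sketch below parses the exact JSON line shown above with Python's standard json module. The describe helper is hypothetical and not part of Maxwell or Kafka; reading the message from Kafka itself is left out.

```python
import json

# The event exactly as printed by the console consumer above.
raw = (
    '{"database":"test","table":"test","type":"insert","ts":1531214750,'
    '"xid":2143,"commit":true,'
    '"data":{"id":1,"name":"dkey","address":"Shanghai"}}'
)


def describe(event: dict) -> str:
    """Summarize a change event as '<type> <database>.<table> <row data>'."""
    return "{} {}.{} {}".format(
        event["type"], event["database"], event["table"], event["data"]
    )


event = json.loads(raw)
print(describe(event))
```

The row itself sits under the data key, while database, table, and type say where the change came from and what kind of change it was.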

Common HBase shell commands

Posted on 2018-06-30

List all namespaces

> list_namespace
NAMESPACE
default
hbase

List all tables in a namespace

> list_namespace_tables 'default'
TABLE
users

Count the rows in a table

> count 'users'

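Installing Airflow and the time zone issue

Posted on 2018-06-19

Install from source

The Airflow 1.9.0 release has a time zone problem that, according to Stack Overflow, is already fixed on the git master branch. So clone Airflow's master branch and install from the local source.

git clone https://github.com/apache/incubator-airflow.git
cd incubator-airflow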

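Once inside the project directory, "pip install ." installs from the local source. The all option installs every component that might be needed, including hive, mysql, and so on.

pip install ".[all]"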

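Errors may come up during installation; resolve them one by one based on the error messages and your environment.

You can also install only the components you need

pip install ".[mysql,hive,celery]"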

Time zone configuration

According to the timezone documentation:

“Support for time zones is enabled by default. Airflow stores datetime information in UTC internally and in the database. It allows you to run your DAGs with time zone dependent schedules. At the moment Airflow does not convert them to the end user’s time zone in the user interface. There it will always be displayed in UTC. Also templates used in Operators are not converted. Time zone information is exposed and it is up to the writer of DAG what do with it.”

So the times shown in the web UI are not converted to the local time zone; it is enough to attach the time zone when writing a DAG.

cd airflow
vim airflow.cfg

Change

default_timezone = Asia/Shanghai

Check that the setting took effect:

>>> from airflow.utils import timezone
>>> a_date = timezone.datetime(2018,6,1)
>>> a_date

Output:

datetime.datetime(2018, 6, 1, 0, 0, tzinfo=<Timezone [Asia/Shanghai]>)
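
To see what "stored in UTC" means for the date above, here is a small sketch using only the Python standard library, not Airflow. It assumes Asia/Shanghai can be modelled as a fixed UTC+8 offset, which holds for 2018 because the zone has no daylight saving time; the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC+8 offset stands in for the Asia/Shanghai zone.
SHANGHAI = timezone(timedelta(hours=8), name="Asia/Shanghai")

# Midnight on 2018-06-01 local time, as in the check above.
local_start = datetime(2018, 6, 1, tzinfo=SHANGHAI)

# The same instant expressed in UTC, which is how Airflow stores it.
utc_start = local_start.astimezone(timezone.utc)

print(local_start.isoformat())  # 2018-06-01T00:00:00+08:00
print(utc_start.isoformat())    # 2018-05-31T16:00:00+00:00
```

This is why a DAG scheduled for local midnight shows up in the web UI at 16:00 on the previous day.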

Configure MySQL

sql_alchemy_conn = mysql://airflow:airflow@localhost:3306/airflow

pip install cryptography

Generate a Fernet key for Airflow
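
The note above stops before showing how the key is produced. A Fernet key is 32 random bytes encoded as URL-safe base64, so a minimal standard-library sketch can produce a key in that format. The function name is an assumption; with the cryptography package installed, the usual call is Fernet.generate_key() from cryptography.fernet.

```python
import base64
import os


def generate_fernet_style_key() -> str:
    """Return 32 random bytes as URL-safe base64 text (the Fernet key format)."""
    return base64.urlsafe_b64encode(os.urandom(32)).decode("ascii")


key = generate_fernet_style_key()
print(key)
```

The printed value is what would go into the fernet_key setting in airflow.cfg. Treat it as a secret and generate it once per deployment.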

Configure Celery mode

Celery requires Redis

pip install redis

vim airflow.cfg

executor = CeleryExecutor

broker_url = redis://localhost:6379/0
celery_result_backend = redis://localhost:6379/1

Math basics

Posted on 2018-06-13

$L^p$ norm: $\left\| \mathbf{x} \right\|_p = \left( \sum\limits_i \left| x_i \right|^p \right)^{\frac{1}{p}}$

Inner product: $\left\langle \mathbf{x}, \mathbf{y} \right\rangle = \sum\limits_i x_i \cdot y_i$
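
The two definitions translate directly into code. The following is a plain-Python sketch; the function names lp_norm and inner_product are chosen here for illustration.

```python
from typing import Sequence


def lp_norm(x: Sequence[float], p: float) -> float:
    """L^p norm: (sum_i |x_i|^p)^(1/p), for p >= 1."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


def inner_product(x: Sequence[float], y: Sequence[float]) -> float:
    """Inner product: sum_i x_i * y_i for vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(xi * yi for xi, yi in zip(x, y))


print(lp_norm([3, 4], 2))                   # 5.0
print(lp_norm([3, -4], 1))                  # 7.0
print(inner_product([1, 2, 3], [4, 5, 6]))  # 32
```

For p = 2 the norm is the usual Euclidean length, and the inner product of a vector with itself is the square of that length.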

Introduction to Airflow

Posted on 2018-06-03

Airflow is a workflow scheduler open-sourced by Airbnb that makes it easy to handle complex task dependencies.

Clear the state of a DAG

airflow clear dag_id
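
To illustrate what resolving task dependencies involves, here is a small sketch that uses graphlib from the Python standard library (Python 3.9 or later) rather than Airflow itself. The task names are hypothetical, and this is not how Airflow is implemented; it only shows the ordering problem a scheduler has to solve.

```python
from graphlib import TopologicalSorter

# Each hypothetical task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load", "transform"},
}

# static_order() yields every task after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

A workflow scheduler layers scheduling, retries, and state tracking on top of this kind of dependency ordering.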

Install from source

End-to-End Machine Learning Project

Posted on 2017-11-20

fetch the data:


graph-databases

Posted on 2017-11-01


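Setting up a multi-version Python development environment with pyenv

Posted on 2017-10-16

Situation: we may need to develop under both Python 2 and Python 3 at the same time, or different projects may need different packages, and we want the packages installed for one project not to interfere with another. In that case pyenv can be used to set up Python virtual environments.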

Quick Start

Installation

The simplest approach is to use pyenv-installer and follow the official documentation, so the details are not repeated here.


Twein Yan

© 2018 Twein Yan