Multi-threading

Create sub-thread

import threading

# target function executed by each sub-thread
def run(name):
    print('Current task is', name)

if __name__ == '__main__':
    # create sub-threads
    t1 = threading.Thread(target=run, args=('Thread 1',))
    t2 = threading.Thread(target=run, args=('Thread 2',))

    # start the sub-threads (without start(), join() returns immediately)
    t1.start()
    t2.start()

    # main thread waits for the sub-threads to finish
    t1.join()
    t2.join()

    print('Exit')
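A common follow-up need is collecting results from sub-threads. A minimal sketch using a shared dict (the `results` dict and its keys are illustrative, not from the original post):

```python
import threading

results = {}  # shared dict; each thread writes to its own distinct key

def run(name):
    # each thread records its own result under its own key
    results[name] = 'done: ' + name

threads = [threading.Thread(target=run, args=(n,)) for n in ('Thread 1', 'Thread 2')]
for t in threads:
    t.start()   # start before join, or join returns immediately
for t in threads:
    t.join()    # wait for all sub-threads to finish
print(results)
```

For anything more elaborate than writes to distinct keys, protect shared state with a `threading.Lock` or pass results through a `queue.Queue`.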

Open CMD with administrator rights.

reg.exe add "HKCU\Software\Classes\CLSID\{86ca1aa0-34aa-4e8b-a509-50c905bae2a2}\InprocServer32" /f /ve # switch to the classic Win10-style menu
reg.exe delete "HKCU\Software\Classes\CLSID\{86ca1aa0-34aa-4e8b-a509-50c905bae2a2}\InprocServer32" /va /f # switch back to the Win11 menu

taskkill /f /im explorer.exe & start explorer.exe # restart Explorer to apply (or reboot the machine)

why42

Please set random seed to 42.
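Setting the seed makes pseudo-random sequences reproducible: re-seeding restarts the generator at the same point. A minimal sketch with the standard library (the same idea extends to `numpy.random.seed` or `torch.manual_seed` if you use those libraries):

```python
import random

random.seed(42)
a = random.random()

random.seed(42)   # re-seeding restarts the generator from the same state
b = random.random()

print(a == b)  # -> True: the two draws are identical
```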

Install WSL2 in Win10

  1. Open PowerShell as administrator.
  2. Run winver to check your Windows version.
  3. Run wsl -l -o to list all available distributions.
  4. Run wsl --install -d <distribution> to install the specified Linux distribution.
  5. (Optional) If you want to connect to your WSL via VS Code, check out the Remote-WSL plugin.

Install Anaconda in WSL2

  1. Update packages: sudo apt update; sudo apt upgrade
  2. Check the download link for the version you want
  3. Download the installer via wget, e.g. wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
  4. Run the installer script to install Anaconda (remember to run conda init afterwards).

  1. modify the last commit message: git commit --amend --message="message" --author="my_name <name@email.com>"

  2. modify a historical commit message:

    1. view your commit history: git log

    2. rebase: git rebase -i <commit id> or git rebase -i HEAD~n

      1. note: you have to rebase onto the commit just before the one you want to modify.
    3. change the flag:

      • pick: use commit (do nothing)
      • reword: use commit, but edit the commit message
      • edit: use commit, but stop for amending (you can modify files)
      • squash: use commit, but meld into previous commit
      • fixup: like "squash", but discard this commit's log message
      • exec: run command (the rest of the line) using shell
      • drop: remove commit

      If you just want to edit the commit message, choose reword.

    4. modify the message

    5. push: git push <origin> <branch> -f. You have to use the -f flag to force-push the rewritten history.

Knapsack Problem

The knapsack problem is a constrained optimization problem: you have to find an optimal answer under certain constraints (the capacity of the knapsack). There are three variants: 0-1 knapsack, fractional knapsack, and unbounded knapsack. We will discuss them one by one.

0-1 Knapsack Problem

What is a 0-1 Knapsack Problem?

In the 0-1 knapsack problem, we can either keep an item or not. There is no way to split an item. Each item has a value and a weight. The knapsack, or the bag, has a weight capacity. Our target is to fill the bag with maximum value while not exceeding its capacity.

Input: bag weight capacity \(W\), \(n\) items with value value[] and weight weight[].

Output: maximum feasible value
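The 0-1 variant has a standard dynamic-programming solution in \(O(nW)\) time. A sketch with a 1-D DP array (the sample weights and values are made up for illustration):

```python
def knapsack_01(W, weight, value):
    # dp[w] = max value achievable with total weight <= w
    dp = [0] * (W + 1)
    for wt, val in zip(weight, value):
        # iterate capacity downwards so each item is taken at most once
        for w in range(W, wt - 1, -1):
            dp[w] = max(dp[w], dp[w - wt] + val)
    return dp[W]

print(knapsack_01(10, [2, 3, 5, 7], [1, 5, 2, 4]))  # -> 9 (take items of weight 3 and 7)
```

Iterating the capacity upwards instead would allow an item to be reused, which is exactly the unbounded-knapsack variant.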


HDFS (Hadoop Distributed File System)

Structure

NameNode: controls the system

StandbyNode: handles the NameNode's logs and serves as a backup of the NameNode

DataNode: stores data

Write:

The NameNode gets a write request from the client -> the data is split into blocks -> the StandbyNode tells the NameNode which nodes should store the data (workload balancing) -> the NameNode passes the data to a DataNode -> that DataNode passes the data on to the next DataNode (the same block is replicated and stored on multiple DataNodes)

Read:

The NameNode gets a read request from the client -> the StandbyNode tells the client where to find the data -> the client reads the data from the nearest DataNode

How to handle node failure:

DataNode: DataNodes send signals to the NameNode via a heartbeat mechanism; a DataNode that stops sending heartbeats is treated as failed.

NameNode: the StandbyNode serves as a backup.
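The heartbeat idea can be sketched as a timestamp table: the NameNode records each DataNode's last heartbeat and declares a node dead once it misses the window (the 10-second timeout and node ids below are hypothetical, not HDFS defaults):

```python
HEARTBEAT_TIMEOUT = 10.0  # seconds; hypothetical threshold

last_heartbeat = {}  # DataNode id -> time of last heartbeat

def receive_heartbeat(node_id, now):
    # called whenever a DataNode reports in
    last_heartbeat[node_id] = now

def dead_nodes(now):
    # nodes whose last heartbeat is older than the timeout
    return [n for n, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]

receive_heartbeat('dn1', now=0.0)
receive_heartbeat('dn2', now=5.0)
print(dead_nodes(12.0))  # dn1 missed the window
```

When a DataNode is declared dead, HDFS re-replicates its blocks from the surviving copies so the replication factor is restored.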

Difference between Database, Data Lake and Data Warehouse

A database supports an application. Any interactive application needs a database to store streaming data, so databases are good at real-time, streaming data ingestion.

A data lake stores raw, untouched data from multiple applications. It can hold structured, semi-structured, and unstructured data. A data lake is not designed for real-time data ingestion; how often it is updated depends on the frequency of the data ETL (Extract, Transform, Load) jobs. A data lake provides data for data mining and machine learning.

A data warehouse is designed for business intelligence. It stores predefined, processed data and supports only structured and semi-structured data. A data warehouse provides data for business reports and dashboards.

SQL

Fast Reference

  1. select: select columns

  2. select distinct: remove duplicate rows in the selected columns (reduces the number of rows)

  3. limit: limit the number of results to be displayed

  4. where: filter rows with conditions

    1. like: match strings with a certain pattern (works with wildcards)

    2. in: select from several values

      1. eg

        where column in (value1, value2,...)
    3. between: select values/strings/dates between two values

      1. eg

        where column between value1 and value2
  5. alias: set alias (used in select, where)

  6. order by: show result in certain order

  7. insert into: insert a row with certain columns

    1. eg

      insert into table_name (column1, column2, column3,...)
      values (value1, value2, value3,...)

      You do not need to name the columns if the inserted row contains values for all of them.

  8. update: update certain rows

    1. eg

      update table_name
      set column1=value1, column2=value2,...
      where <conditions>

      You have to set conditions when using update.

  9. delete: delete certain rows

    1. eg

      delete from table_name
      where <conditions>;

      You have to set conditions when using delete.

  10. join: join two or more tables

    1. eg

      select <columns>
      from table1_name
      join table2_name
      on table1_name.key=table2_name.key;
  11. union: combine the result sets of several selects (they must have the same number of columns, compatible column types, and the same column order)

    1. eg

      select <columns> from table1_name
      union <all>
      select <columns> from table2_name;

      Use union all to allow duplicate rows in the result of the union.

  12. insert into select: copy data from a table and insert to another

    1. eg

      insert into table2_name <columns>
      select <columns>
      from table1_name
  13. create table: create a new table

    1. eg

      create table table_name
      (
      column1 data_type(size) <constraint_name>,
      column2 data_type(size) <constraint_name>,
      ...
      )
    2. constraints: add constraints to enforce rules on the table

      1. Table of constraints

        • not null: no NULL values allowed in this column
        • unique: each value in this column is unique
        • primary key: set the primary key (not null & unique)
        • foreign key: set a foreign key
        • check: each value must satisfy a certain condition
        • default: set a default value
      2. primary key: every table should have one and only one primary key.

        1. eg

          create table table_name
          (
          column1 data_type(size),
          column2 data_type(size),
          ...
          primary key (column1)
          )
      3. foreign key: a foreign key references the primary key of another table

      4. check: check certain condition

        1. eg

          create table table_name
          (
          column1 data_type(size),
          column2 data_type(size),
          ...
          check <condition>
          )

          or

          create table table_name
          (
          column1 data_type(size),
          column2 data_type(size),
          ...
          constraint constraint_name check <conditions>
          )
  14. index: create index

    1. eg

      create <unique> index index_name
      on table_name (column1, column2,...)
  15. group by: group the results; it works together with an aggregate function.

    1. eg

      select column1, aggregate_function(column_name)
      from table_name
      group by column1
  16. having: aggregate functions cannot be used in where, which is why we need having.

    1. the order of evaluation is where -> group by -> having.
  17. exists: return true if the subquery returns any rows
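The having and exists entries above can be demonstrated with Python's built-in sqlite3 (the tables, columns, and data are made up for illustration):

```python
import sqlite3

# in-memory toy database
conn = sqlite3.connect(':memory:')
conn.execute('create table orders (customer_id int, product int)')
conn.executemany('insert into orders values (?, ?)',
                 [(1, 10), (1, 11), (1, 12), (2, 10)])

# having filters on the aggregate, which where cannot do
frequent = conn.execute('''
    select customer_id, count(*) as n_orders
    from orders
    group by customer_id
    having count(*) > 2
''').fetchall()
print(frequent)  # only customer 1 has more than 2 orders

# exists is true when the subquery returns at least one row
bought_11 = conn.execute('''
    select distinct customer_id from orders o
    where exists (select 1 from orders x
                  where x.customer_id = o.customer_id and x.product = 11)
''').fetchall()
print(bought_11)  # only customer 1 ever ordered product 11
```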

Basics

A fact table, also called a subject table, essentially records transactions and usually has many dimensions.

A dimension table, also called a dim table, is essentially a base table, similar to a foreign key of the fact table; it usually has few dimensions and high redundancy.

For example, a product price table (product name, price, etc.) is a dimension table, while a sales table (product name, customer name, quantity, etc.) is a fact table. Likewise, a customer profile table (user nickname, user IP, etc.) is a dimension table, while a customer transaction table (product name, quantity, etc.) is a fact table.

Why do we need dimension tables?

Fact tables are usually very large (often wide tables) and are not suited to ordinary query methods, so some dimensions are split out into dimension tables. Dimension tables exist to reduce redundancy in the fact table and improve storage efficiency.

Dimension tables are joined to fact tables (they may contain redundant information themselves, which helps the fact table save space) and are used for data retrieval. Every fact table is associated with one or more dimension tables. The content of a dimension table usually describes the data in the fact table, so it is mostly static data.

The information of dimension tables and fact tables can be aggregated into one comprehensive table, called a metric table or wide table. Using wide tables in a data middle platform (middle-end) relieves pressure on the business systems and provides fast data queries for front-end services (effectively offloading the data pressure into a separate system).

The star schema and snowflake schema often mentioned in big data are essentially different degrees of table splitting for different scenarios, aimed at improving storage or query efficiency.
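The fact/dimension split maps directly to a star-schema join, sketched here with Python's built-in sqlite3 (all table, column, and data values are illustrative):

```python
import sqlite3

# product is a dimension table (descriptions); sales is the fact table (ids + measures)
conn = sqlite3.connect(':memory:')
conn.execute('create table product (product_id int, product_name text, price real)')
conn.execute('create table sales (product_id int, customer text, quantity int)')
conn.executemany('insert into product values (?, ?, ?)',
                 [(1, 'pen', 2.0), (2, 'book', 10.0)])
conn.executemany('insert into sales values (?, ?, ?)',
                 [(1, 'alice', 3), (2, 'bob', 1)])

# joining fact and dimension tables yields the comprehensive "wide table" view
rows = conn.execute('''
    select p.product_name, s.customer, s.quantity, s.quantity * p.price as amount
    from sales s
    join product p on s.product_id = p.product_id
    order by s.customer
''').fetchall()
print(rows)
```

Note how the fact table stores only the product id; the name and price are looked up in the dimension table at query time, which is exactly the redundancy saving described above.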

