Multi-threading

Create sub-thread

import threading

# target function executed by each sub-thread
def run(name):
    print('Current task is', name)

if __name__ == '__main__':
    # create sub-threads
    t1 = threading.Thread(target=run, args=('Thread 1',))
    t2 = threading.Thread(target=run, args=('Thread 2',))

    # start the sub-threads (without start(), join() returns immediately)
    t1.start()
    t2.start()

    # main thread waits for the sub-threads to finish
    t1.join()
    t2.join()

    print('Exit')
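A common follow-up need is collecting results from sub-threads. A minimal sketch using a shared dict (the `results` dict and its keys are illustrative, not from the original post):

```python
import threading

results = {}  # shared dict; each thread writes to its own distinct key

def run(name):
    # each thread records its own result under its own key
    results[name] = 'done: ' + name

threads = [threading.Thread(target=run, args=(n,)) for n in ('Thread 1', 'Thread 2')]
for t in threads:
    t.start()   # start before join, or join returns immediately
for t in threads:
    t.join()    # wait for all sub-threads to finish
print(results)
```

For anything more elaborate than writes to distinct keys, protect shared state with a `threading.Lock` or pass results through a `queue.Queue`.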

Open CMD with administrator rights.

reg.exe add "HKCU\Software\Classes\CLSID\{86ca1aa0-34aa-4e8b-a509-50c905bae2a2}\InprocServer32" /f /ve # switch to the classic Win10-style menu
reg.exe delete "HKCU\Software\Classes\CLSID\{86ca1aa0-34aa-4e8b-a509-50c905bae2a2}\InprocServer32" /va /f # switch back to the Win11 menu

taskkill /f /im explorer.exe & start explorer.exe # restart Explorer to apply (or reboot the machine)

why42

Please set random seed to 42.
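Setting the seed makes pseudo-random sequences reproducible: re-seeding restarts the generator at the same point. A minimal sketch with the standard library (the same idea extends to `numpy.random.seed` or `torch.manual_seed` if you use those libraries):

```python
import random

random.seed(42)
a = random.random()

random.seed(42)   # re-seeding restarts the generator from the same state
b = random.random()

print(a == b)  # -> True: the two draws are identical
```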

Install WSL2 in Win10

  1. Open PowerShell as administrator.
  2. Run winver to check your Windows version.
  3. Run wsl -l -o to list all available distributions.
  4. Run wsl --install -d <distribution> to install the specified Linux distribution.
  5. (Optional) If you want to connect to your WSL via VS Code, check out the Remote-WSL plugin.

Install Anaconda in WSL2

  1. Update packages: sudo apt update; sudo apt upgrade
  2. Check the download link for the version you want
  3. Download the installer via wget, e.g. wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
  4. Run the installer script to install Anaconda (remember to run conda init afterwards).

  1. modify the last commit message: git commit --amend --message="message" --author="my_name <name@email.com>"

  2. modify a historical commit message:

    1. view your commit history: git log

    2. rebase: git rebase -i <commit id> or git rebase -i HEAD~n

      1. note: you have to rebase onto the commit just before the one you want to modify.
    3. change the flag:

      • pick: use commit (do nothing)
      • reword: use commit, but edit the commit message
      • edit: use commit, but stop for amending (you can modify files)
      • squash: use commit, but meld into previous commit
      • fixup: like "squash", but discard this commit's log message
      • exec: run command (the rest of the line) using shell
      • drop: remove commit

      If you just want to edit the commit message, choose reword.

    4. modify the message

    5. push: git push <origin> <branch> -f. You have to use the -f flag to force-push the rewritten history.

Knapsack Problem

The knapsack problem is a constrained optimization problem: you have to find an optimal answer under certain constraints (the capacity of the knapsack). There are three variants: 0-1 knapsack, fractional knapsack, and unbounded knapsack. We will discuss them one by one.

0-1 Knapsack Problem

What is a 0-1 Knapsack Problem?

In the 0-1 knapsack problem, we can either keep an item or not. There is no way to split an item. Each item has a value and a weight. The knapsack, or the bag, has a weight capacity. Our target is to fill the bag with maximum value while not exceeding its capacity.

Input: bag weight capacity \(W\), \(n\) items with value value[] and weight weight[].

Output: maximum feasible value
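The 0-1 variant has a standard dynamic-programming solution in \(O(nW)\) time. A sketch with a 1-D DP array (the sample weights and values are made up for illustration):

```python
def knapsack_01(W, weight, value):
    # dp[w] = max value achievable with total weight <= w
    dp = [0] * (W + 1)
    for wt, val in zip(weight, value):
        # iterate capacity downwards so each item is taken at most once
        for w in range(W, wt - 1, -1):
            dp[w] = max(dp[w], dp[w - wt] + val)
    return dp[W]

print(knapsack_01(10, [2, 3, 5, 7], [1, 5, 2, 4]))  # -> 9 (take items of weight 3 and 7)
```

Iterating the capacity upwards instead would allow an item to be reused, which is exactly the unbounded-knapsack variant.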


HDFS (Hadoop Distributed File System)

Structure

NameNode: controls the system

StandbyNode: handles the NameNode's logs and serves as a backup of the NameNode

DataNode: stores data

Write:

The NameNode gets a write request from the client -> the data is split into blocks -> the StandbyNode tells the NameNode which nodes should store the data (workload balancing) -> the NameNode passes the data to a DataNode -> that DataNode passes the data on to the next DataNode (the same block is replicated and stored on multiple DataNodes)

Read:

The NameNode gets a read request from the client -> the StandbyNode tells the client where to find the data -> the client reads the data from the nearest DataNode

How to handle node failure:

DataNode: DataNodes send signals to the NameNode via a heartbeat mechanism; a DataNode that stops sending heartbeats is treated as failed.

NameNode: the StandbyNode serves as a backup.
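The heartbeat idea can be sketched as a timestamp table: the NameNode records each DataNode's last heartbeat and declares a node dead once it misses the window (the 10-second timeout and node ids below are hypothetical, not HDFS defaults):

```python
HEARTBEAT_TIMEOUT = 10.0  # seconds; hypothetical threshold

last_heartbeat = {}  # DataNode id -> time of last heartbeat

def receive_heartbeat(node_id, now):
    # called whenever a DataNode reports in
    last_heartbeat[node_id] = now

def dead_nodes(now):
    # nodes whose last heartbeat is older than the timeout
    return [n for n, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]

receive_heartbeat('dn1', now=0.0)
receive_heartbeat('dn2', now=5.0)
print(dead_nodes(12.0))  # dn1 missed the window
```

When a DataNode is declared dead, HDFS re-replicates its blocks from the surviving copies so the replication factor is restored.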

Difference between Database, Data Lake and Data Warehouse

A database supports an application. Any interactive application needs a database to store streaming data, so databases are good at real-time, streaming data ingestion.

A data lake stores raw, untouched data from multiple applications. It can hold structured, semi-structured, and unstructured data. A data lake is not designed for real-time data ingestion; how often it is updated depends on the frequency of the data ETL (Extract, Transform, Load) jobs. A data lake provides data for data mining and machine learning.

A data warehouse is designed for business intelligence. It stores predefined, processed data and supports only structured and semi-structured data. A data warehouse provides data for business reports and dashboards.

SQL

Fast Reference

  1. select: select columns

  2. select distinct: remove duplicate rows in the selected columns (reduces the number of rows)

  3. limit: limit the number of results to be displayed

  4. where: filter rows with conditions

    1. like: match strings with a certain pattern (works with wildcards)

    2. in: select from several values

      1. eg

        where column in (value1, value2,...)
    3. between: select values/strings/dates between two values

      1. eg

        where column between value1 and value2
  5. alias: set alias (used in select, where)

  6. order by: show result in certain order

  7. insert into: insert a row with certain columns

    1. eg

      insert into table_name (column1, column2, column3,...)
      values (value1, value2, value3,...)

      You do not need to name the columns if the inserted row contains values for all of them.

  8. update: update certain rows

    1. eg

      update table_name
      set column1=value1, column2=value2,...
      where <conditions>

      You have to set conditions when using update.

  9. delete: delete certain rows

    1. eg

      delete from table_name
      where <conditions>;

      You have to set conditions when using delete.

  10. join: join two or more tables

    1. eg

      select <columns>
      from table1_name
      join table2_name
      on table1_name.key=table2_name.key;
  11. union: combine the result sets of several selects (they must have the same number of columns, compatible column types, and the same column order)

    1. eg

      select <columns> from table1_name
      union <all>
      select <columns> from table2_name;

      Use union all to allow duplicate rows in the result of the union.

  12. insert into select: copy data from a table and insert to another

    1. eg

      insert into table2_name <columns>
      select <columns>
      from table1_name
  13. create table: create a new table

    1. eg

      create table table_name
      (
      column1 data_type(size) <constraint_name>,
      column2 data_type(size) <constraint_name>,
      ...
      )
    2. constraints: add constraints to enforce rules on the table

      1. Table of constraints

        • not null: no NULL values allowed in this column
        • unique: each value in this column is unique
        • primary key: set the primary key (not null & unique)
        • foreign key: set a foreign key
        • check: each value must satisfy a certain condition
        • default: set a default value
      2. primary key: every table should have one and only one primary key.

        1. eg

          create table table_name
          (
          column1 data_type(size),
          column2 data_type(size),
          ...
          primary key (column1)
          )
      3. foreign key: a foreign key references the primary key of another table

      4. check: check certain condition

        1. eg

          create table table_name
          (
          column1 data_type(size),
          column2 data_type(size),
          ...
          check <condition>
          )

          or

          create table table_name
          (
          column1 data_type(size),
          column2 data_type(size),
          ...
          constraint constraint_name check <conditions>
          )
  14. index: create index

    1. eg

      create <unique> index index_name
      on table_name (column1, column2,...)
  15. group by: group the results; it works together with an aggregate function.

    1. eg

      select column1, aggregate_function(column_name)
      from table_name
      group by column1
  16. having: aggregate functions cannot be used in where, which is why we need having.

    1. the order of evaluation is where -> group by -> having.
  17. exists: return true if the subquery returns any rows
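The having and exists entries above can be demonstrated with Python's built-in sqlite3 (the tables, columns, and data are made up for illustration):

```python
import sqlite3

# in-memory toy database
conn = sqlite3.connect(':memory:')
conn.execute('create table orders (customer_id int, product int)')
conn.executemany('insert into orders values (?, ?)',
                 [(1, 10), (1, 11), (1, 12), (2, 10)])

# having filters on the aggregate, which where cannot do
frequent = conn.execute('''
    select customer_id, count(*) as n_orders
    from orders
    group by customer_id
    having count(*) > 2
''').fetchall()
print(frequent)  # only customer 1 has more than 2 orders

# exists is true when the subquery returns at least one row
bought_11 = conn.execute('''
    select distinct customer_id from orders o
    where exists (select 1 from orders x
                  where x.customer_id = o.customer_id and x.product = 11)
''').fetchall()
print(bought_11)  # only customer 1 ever ordered product 11
```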

Basics

A fact table, also called a subject table, essentially records transactions and usually has many dimensions.

A dimension table, also called a dim table, is essentially a base table, similar to a foreign key of the fact table; it usually has few dimensions and high redundancy.

For example, a product price table (product name, price, etc.) is a dimension table, while a sales table (product name, customer name, quantity, etc.) is a fact table. Likewise, a customer profile table (user nickname, user IP, etc.) is a dimension table, while a customer transaction table (product name, quantity, etc.) is a fact table.

Why do we need dimension tables?

Fact tables are usually very large (often wide tables) and are not suited to ordinary query methods, so some dimensions are split out into dimension tables. Dimension tables exist to reduce redundancy in the fact table and improve storage efficiency.

Dimension tables are joined to fact tables (they may contain redundant information themselves, which helps the fact table save space) and are used for data retrieval. Every fact table is associated with one or more dimension tables. The content of a dimension table usually describes the data in the fact table, so it is mostly static data.

The information of dimension tables and fact tables can be aggregated into one comprehensive table, called a metric table or wide table. Using wide tables in a data middle platform (middle-end) relieves pressure on the business systems and provides fast data queries for front-end services (effectively offloading the data pressure into a separate system).

The star schema and snowflake schema often mentioned in big data are essentially different degrees of table splitting for different scenarios, aimed at improving storage or query efficiency.
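The fact/dimension split maps directly to a star-schema join, sketched here with Python's built-in sqlite3 (all table, column, and data values are illustrative):

```python
import sqlite3

# product is a dimension table (descriptions); sales is the fact table (ids + measures)
conn = sqlite3.connect(':memory:')
conn.execute('create table product (product_id int, product_name text, price real)')
conn.execute('create table sales (product_id int, customer text, quantity int)')
conn.executemany('insert into product values (?, ?, ?)',
                 [(1, 'pen', 2.0), (2, 'book', 10.0)])
conn.executemany('insert into sales values (?, ?, ?)',
                 [(1, 'alice', 3), (2, 'bob', 1)])

# joining fact and dimension tables yields the comprehensive "wide table" view
rows = conn.execute('''
    select p.product_name, s.customer, s.quantity, s.quantity * p.price as amount
    from sales s
    join product p on s.product_id = p.product_id
    order by s.customer
''').fetchall()
print(rows)
```

Note how the fact table stores only the product id; the name and price are looked up in the dimension table at query time, which is exactly the redundancy saving described above.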

