CS503 Note

These are notes for CS503.

Lectures (Theory)

Front End

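  1. Web 1.0: static HTML, later joined by JavaScript and CSS.
  2. CGI (Common Gateway Interface): a program on the server handles the request and returns its output to the browser. Each request needs its own program to handle it, so it cannot handle concurrent requests well.
  3. Servlet: binds HTML to a program, so the program can change the page content (Java + HTML).
  4. ASP, JSP, PHP: combine HTML with program code. JSP = HTML + Java; supports syntax checking.
  5. Ajax & jQuery: asynchronous loading. MVC: a design pattern. (Front end and back end still have to cooperate; the work cannot be separated.)
  6. Node: client-side MVC; separates front end and back end so that only data is transferred, which led to Angular and React.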

React: 1. One-way data flow; 2. Virtual DOM; 3. JSX

Node/Express

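Switching context between multiple threads is expensive; a single thread avoids this overhead.

Node is an environment for running JavaScript on the server side; Express is a framework that simplifies building a server.

Express is built around the idea of middleware, which connects different components.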
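The middleware idea can be sketched in a few lines. This is a minimal Python illustration of the pattern, not Express itself; the names logger, auth, route, and build_chain are hypothetical. Each middleware receives the request and a "next" handler, and decides whether to pass control on.

```python
def logger(request, next_handler):
    # Record the path, then hand the request to the next component.
    request["log"].append("saw " + request["path"])
    return next_handler(request)


def auth(request, next_handler):
    # Stop the chain early when no user is attached to the request.
    if not request.get("user"):
        return "401 Unauthorized"
    return next_handler(request)


def route(request):
    # Final handler: produces the response.
    return "200 Hello, " + request["user"]


def build_chain(middlewares, final_handler):
    # Wrap the final handler from the inside out, so the first
    # middleware in the list runs first.
    handler = final_handler
    for mw in reversed(middlewares):
        handler = (lambda req, mw=mw, nxt=handler: mw(req, nxt))
    return handler


app = build_chain([logger, auth], route)
```

Calling app({"path": "/news", "user": "alice", "log": []}) runs logger, then auth, then route; leaving out "user" makes auth short-circuit the chain.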

NoSQL

  • Key-value Store: Redis,Dynamo
  • Document-based Store: MongoDB
  • Column-based Store: BigTable, HBase, Cassandra

CAP Theorem: Consistency, Availability, Partition tolerance

MongoDB

Document-oriented Database

  • Documents map nicely to programming language data types
  • Embedded documents reduce the need for joins
  • Dynamic schema
  • Stored as BSON

MongoDB -> databases -> collection -> documents

PROs: Simple to use/Faster/Easier and faster integration
CONs: Not suited to heavy, complex transactional systems.

Use PyMongo to connect Python to MongoDB; it manages connections automatically.

Message Queue

  • An application framework for sending and receiving messages
  • A way to communicate between applications
  • A way to decouple components
  • A way to offload work

Message Queue Protocol

  1. AMQP - RabbitMQ
  2. STOMP - ActiveMQ
  3. XMPP

Use pika to connect Python to RabbitMQ (CloudAMQP).

SOA(Service-oriented Architecture)

PROs: Isolation/Ownership/Scalability
CONs: Complexity/Latency/Test effort/DevOps, on-call

API design

  1. Web Service APIs: REST/JSON-RPC/XML-RPC
  2. Library-based APIs: JavaScript
  3. Class-based APIs: Java API, Android API
  4. OS Functions and Routines: File system
  5. Hardware APIs
  • RPC(Action oriented)
  • REST(Resource oriented)

Use python-jsonrpc in Python to create RPC functions.
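To make the action-oriented RPC style concrete, here is a minimal sketch of a JSON-RPC 2.0 style dispatcher using only the standard json module. It does not use python-jsonrpc; rpc_method, METHODS, and handle_request are hypothetical names for this illustration, and only positional params are handled.

```python
import json

METHODS = {}


def rpc_method(func):
    # Register a function under its own name so clients can call it.
    METHODS[func.__name__] = func
    return func


@rpc_method
def add(a, b):
    return a + b


def handle_request(raw):
    # Decode the request, look up the named action, and call it.
    request = json.loads(raw)
    method = METHODS.get(request.get("method"))
    if method is None:
        response = {
            "jsonrpc": "2.0",
            "error": {"code": -32601, "message": "Method not found"},
            "id": request.get("id"),
        }
    else:
        result = method(*request.get("params", []))
        response = {"jsonrpc": "2.0", "result": result, "id": request.get("id")}
    return json.dumps(response)
```

The client names an action ("add") and passes arguments, whereas a REST client would address a resource by URL and act on it with an HTTP verb.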

Good API:

  1. Easy to learn and use, with documentation
  2. Hard to misuse
  3. Easy to read and maintain
  4. Sufficiently powerful to meet the requirements
  5. Easy to extend
  • API should do one thing and do it well
  • API should be as small as possible
  • Implementation should not impact API(eg.Name)
  • Minimize accessibility of everything
  • Names Matter
  • Documentation Matters
  • Consider performance consequences of API design decisions
  • Coexist peacefully with platform

Web Scraping

Application

  • Data source
  • Indexer & Crawler
  • Test

Tools:

  1. XPath
  2. Regular Expression
  3. Beautiful Soup

Basic flow

  1. Request web server to retrieve HTML -> requests
  2. Parse the HTML into structured data -> lxml
  3. Use XPath or Regex to extract information -> lxml, re
  4. Store the information -> pymongo
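The four steps above can be sketched end to end. To keep the example runnable without a network or a database, it swaps in stand-ins: an inline HTML string instead of a requests call, the standard html.parser instead of lxml, and a plain list instead of pymongo. SAMPLE_HTML, TitleParser, and scrape are hypothetical names.

```python
import re
from html.parser import HTMLParser

# Step 1 stand-in: HTML that would normally come from the web server.
SAMPLE_HTML = """
<html><body>
  <h2 class="title">Markets rally</h2>
  <h2 class="title">Rain expected on 2017-05-01</h2>
</body></html>
"""


class TitleParser(HTMLParser):
    # Step 2: turn the HTML into structured data (a list of titles).
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())


def scrape(html):
    parser = TitleParser()
    parser.feed(html)
    store = []  # Step 4 stand-in: a list instead of a MongoDB collection.
    for title in parser.titles:
        # Step 3: extract details with a regular expression.
        dates = re.findall(r"\d{4}-\d{2}-\d{2}", title)
        store.append({"title": title, "dates": dates})
    return store
```

Each stored record is a plain dict, which is the same shape a document store would accept.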

Integration with RabbitMQ

  • Store scraping tasks temporarily
  • Keep the scraper running continuously
  • Let the scraper feed itself
  • Coordinate multiple scrapers working together

Avoid Blocking

  • Limit scraping rate
  • Follow the website's robots.txt
  • User Agent
  • Proxy
  • TOR(The Onion Router)
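Two of the points above, following robots.txt and limiting the scraping rate, can be checked with Python's standard urllib.robotparser. This is a small sketch; the robots.txt content, the "my-scraper" user agent, and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def make_parser(robots_text):
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_urls(parser, user_agent, urls):
    # Keep only the URLs this user agent is permitted to fetch.
    return [url for url in urls if parser.can_fetch(user_agent, url)]
```

parser.crawl_delay(user_agent) returns the requested delay in seconds (or None when the file sets none), which a scraper could pass to time.sleep between requests.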

NLP(Natural Language Processing)

TF-IDF (Term Frequency-Inverse Document Frequency)
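A minimal sketch of TF-IDF in plain Python. It assumes one common variant: term frequency is the count divided by the document length, and inverse document frequency is ln(N / df). Libraries often use smoothed variants, so exact numbers may differ. The function name tf_idf and the whitespace tokenizer are assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores
```

A term that appears in every document gets a score of 0, while a term unique to one document scores highest there.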

Machine Learning Basics

  • Supervised Learning
  • Unsupervised Learning
  • Reinforcement Learning
  1. Classification
  2. Regression
  3. Similarity
  4. Ranking
  5. Sequence Prediction

TensorFlow

  • A deep learning library open-sourced by Google
  • TensorFlow provides primitives for defining functions on tensors and automatically computing their derivatives.

Labs (Practice)

Create React App

sudo npm install -g create-react-app
create-react-app tap-news

npm start

Express Generator

var, this, variables, {} and []
apply

cookie
Stores information on the browser side

email:
index
password:

bcrypt salt
passport
mongoose
validator: checks whether an email is valid
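The "bcrypt salt" item above is about salted password hashing. As a rough sketch of the idea only, the following uses PBKDF2 from Python's standard hashlib as a stand-in; it is not bcrypt, and the function names and the iteration count are assumptions.

```python
import hashlib
import hmac
import os


def hash_password(password, salt=None):
    # A fresh random salt makes equal passwords hash differently.
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, 100_000
    )
    return salt, digest


def verify_password(password, salt, expected_digest):
    # Re-hash with the stored salt and compare in constant time.
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)
```

Only the salt and the digest are stored; the plain-text password never is.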

Auth0

pickle
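A short sketch of what pickle does: it serializes a Python object to bytes and restores it later. The preference_model dictionary is a made-up example object.

```python
import pickle

# Hypothetical object to persist, e.g. a per-user preference model.
preference_model = {"user_1": {"sports": 0.5, "tech": 0.5}}

blob = pickle.dumps(preference_model)  # object -> bytes
restored = pickle.loads(blob)          # bytes -> equal copy of the object
```

Only unpickle data from a trusted source, since loading a pickle can execute arbitrary code.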

Time decay model

Selected: p = (1 - a) * p + a

Not selected: p = (1 - a) * p
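The two update rules can be sketched as one function. This assumes p is a per-category preference value and a (alpha here) is a decay rate between 0 and 1; the function name, the dictionary layout, and the default alpha of 0.2 are assumptions for illustration.

```python
def update_preferences(preferences, selected, alpha=0.2):
    # Apply the time decay update after one selection.
    updated = {}
    for category, p in preferences.items():
        if category == selected:
            # Selected: p = (1 - a) * p + a
            updated[category] = (1 - alpha) * p + alpha
        else:
            # Not selected: p = (1 - a) * p
            updated[category] = (1 - alpha) * p
    return updated
```

If the values sum to 1 before the update, they still sum to 1 afterwards, because the total becomes (1 - a) * 1 + a.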