Beijing Airbnb Visualisation

Step 1: Get the data

Housing Price Beijing [Lianjia]

Beijing regional housing prices from Lianjia.com
Lianjia

Based on the mean price of second-hand houses.
Start date: 25/3/2020
Ruoyu Wang

Open Lianjia's Beijing second-hand housing site, bring up the browser's developer tools, and click the Network tab. You'll be pleasantly surprised to find a request for something called priceMap. Let's grab it with a quick bit of Python.

[Screenshot: the priceMap request shown in the browser's Network panel]

1.Import Libs

import requests
import json
import time
import pandas as pd

2.Define method

def get_page(url, header):
    # wrap the request so a network error doesn't kill the script
    my_headers = {"User-Agent": header}
    try:
        page = requests.get(url, headers=my_headers)
        return page
    except Exception as error:
        print(str(error))
        return None

3.Make Request

url = "https://bj.lianjia.com/fangjia/priceMap/"
my_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132; Ruoyu/For study use'

housing_price = get_page(url, my_agent)
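
Since get_page returns None when the request throws, a quick sanity check before parsing is cheap insurance (a minimal sketch):

# minimal sanity check before parsing the response
if housing_price is None or housing_price.status_code != 200:
    raise SystemExit("Request failed; check the URL or the network")
print(housing_price.headers.get("Content-Type"))  # expect a JSON content type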

4.To_csv

data = housing_price.json()
df = pd.DataFrame(data)
# transpose, then drop the coordinate entries; only the prices are needed here
df = df.T.drop(["longitude", "latitude"])
df.to_csv('dataset/housing_price_bj.csv', index=False, encoding="utf-8")
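
To confirm the export worked, the file can be read straight back with pandas (a minimal check; the exact columns depend on what the priceMap payload contains):

check = pd.read_csv('dataset/housing_price_bj.csv', encoding="utf-8")
print(check.shape)
print(check.head())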

Tourism Attractions Beijing [Qunar.com]

Beijing tourist attraction information scraping
Qunar.com
Start date: 20/3/2020
Start URL: https://piao.qunar.com/ticket/list.htm?keyword=%E5%8C%97%E4%BA%AC&region=%E5%8C%97%E4%BA%AC&from=mpshouye_hotcity&sort=&page=1
Ruoyu Wang

Next, let's find out what there is to do in Beijing. Since childhood my impression has been that a Beijing trip means seeing Tiananmen and the Forbidden City and climbing the Great Wall, and not much else.
This time let's see what the popular attractions on Qunar.com are, scraping the pages with Python as usual.

1.Import Libs

import requests
from bs4 import BeautifulSoup as bs
import time
import pandas as pd

2.Make Request

def get_page(url, header):
    # same helper as before: wrap the request so errors don't kill the script
    my_headers = {"User-Agent": header}
    try:
        page = requests.get(url, headers=my_headers)
        return page
    except Exception as error:
        print(str(error))
        return None

3.Scraping

The popular-attractions listing supposedly ran to 274 pages, and crawling them all took 30 minutes.
The result, however, was mostly duplicate data. shit!!!

- Note for next time: always test which pages actually return distinct data before a full crawl, to avoid wasting time (see the sketch after the scraping code below).

my_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132; Ruoyu/For study use'

def get_list(my_agent):
    # start page: 1, valid pages: 15
    page = 1
    sight_list = []
    while page <= 15:
        the_url = 'https://piao.qunar.com/ticket/list.htm?keyword=%E5%8C%97%E4%BA%AC&region=%E5%8C%97%E4%BA%AC&from=mpshouye_hotcity&sort=&page={}'
        resp = get_page(the_url.format(page), my_agent)  # fetch the response via get_page
        soup = bs(resp.content, 'html.parser')
        print('Current working page: {}'.format(page))
        page += 1
        # parse the results with Beautiful Soup
        sight_items = soup.find_all('div', class_="sight_item")
        # loop over each tourist attraction on the page
        for i in sight_items:
            sight_id = i.attrs['data-id']
            sight_name = i.attrs['data-sight-name']
            sight_address = i.attrs['data-address']
            sight_lalg = i.attrs['data-point']
            sight_lo, sight_la = sight_lalg.split(",")
            sight_area = i.attrs['data-districts']
            sight_des = i.find('div', class_='sight_item_info').find('div', class_='intro').text
            sight_hot = i.find('div', class_="sight_item_hot").find('em').text.replace("热度", "").strip()
            # skip attractions with zero popularity
            if float(sight_hot) == 0.0:
                continue
            # store the record as a dictionary
            sight_dic = {"id": sight_id, "name": sight_name, "address": sight_address,
                         "latitude": sight_la, "longitude": sight_lo, "area": sight_area,
                         "description": sight_des, "popularity": sight_hot}
            sight_list.append(sight_dic)
        # wait 5s between pages to go easy on the server
        time.sleep(5)
    print("Mission completed, got {} pages of data".format(page - 1))
    return sight_list
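
To avoid another 274-page waste, a short probe can find where the listing starts repeating before committing to a full crawl. This is a minimal sketch of the idea rather than the crawl I actually ran; the stopping rule (the first data-id on a page has been seen before) is an assumption about how the duplicates show up:

def find_valid_pages(my_agent, max_pages=30):
    # probe pages until the first sight id repeats, which suggests
    # the site is recycling results past the last real page
    the_url = 'https://piao.qunar.com/ticket/list.htm?keyword=%E5%8C%97%E4%BA%AC&region=%E5%8C%97%E4%BA%AC&from=mpshouye_hotcity&sort=&page={}'
    seen_first_ids = set()
    for page in range(1, max_pages + 1):
        resp = get_page(the_url.format(page), my_agent)
        soup = bs(resp.content, 'html.parser')
        items = soup.find_all('div', class_="sight_item")
        if not items:
            return page - 1  # empty page: the previous page was the last
        first_id = items[0].attrs['data-id']
        if first_id in seen_first_ids:
            return page - 1  # repeated content: the previous page was the last
        seen_first_ids.add(first_id)
        time.sleep(2)
    return max_pages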

4.To_csv

def toCsv(data, name):
    # convert the list of dictionaries to a DataFrame and save it
    df = pd.DataFrame.from_dict(data)
    df.to_csv('{}.csv'.format(name), index=False, header=True, encoding="utf-8")
    return df.sample(10)  # return a random sample of 10 rows as a preview
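
Putting the pieces together (the output name dataset/sights_bj is my own choice, mirroring the dataset folder used for the housing prices):

sights = get_list(my_agent)
preview = toCsv(sights, 'dataset/sights_bj')
print(preview)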

Airbnb Dataset Download

Airbnb Listings Beijing [21 January, 2020]
Airbnb
Start date: 21/1/2020
Ruoyu Wang
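
Once the snapshot's location is known, the same get_page helper can fetch it. A minimal sketch; the URL below is a placeholder for the 21 January 2020 Beijing listings file, not a verified link:

# NOTE: placeholder URL; substitute the real link to the
# 21 January 2020 Beijing listings snapshot
listings_url = 'https://example.com/beijing/2020-01-21/listings.csv'

resp = get_page(listings_url, my_agent)
if resp is not None and resp.status_code == 200:
    with open('dataset/airbnb_listings_bj.csv', 'wb') as f:
        f.write(resp.content)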

^ The sun is up, so that's it for today! To be continued tomorrow.
