What is an ORM

04 Jul 2018 | python pycharm

This post summarizes my personal practice notes.
If you know a better approach, or spot something wrong, feel free to leave a comment. :)


session and cookie

25 Jun 2018 | python pycharm

This post summarizes my personal practice notes.
If you know a better approach, or spot something wrong, feel free to leave a comment. :)

session

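HTTP is stateless: even if we send several requests over HTTP, the server does not know that they came from the same sender. More precisely, the IP address may be the same, but that does not mean the user is the same. The server therefore needs a way to know which user keeps sending requests, and the client needs a way to tell the server who it is. On the server side, the technique for doing this is called a session.

Maintaining a session means maintaining continuity across requests.

-> cookie-based user session

A cookie is part of the HTTP specification: a small piece of record data that a website's server stores on the user's computer when the user visits that site. The server receives an HTTP request and returns an HTTP response to the browser, and in that response (in the HTTP message) it can include an instruction telling the browser to store a cookie.

So when a request arrives, the server can decide that this user is now ABC (and keep that on the server side), then tell the browser to save the text ABC in its cookie storage. The browser saves it in its own cookie storage together with the URL of the site it came from.

For example, if a response from localhost:8000 told the browser to store ABC, then the next time the browser sends a request to the same localhost:8000, it sends along all the cookies it has stored locally for that site.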
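The exchange described above can be sketched with Python's standard `http.cookies` module. This is only a rough illustration, not how any particular framework does it: the cookie name `sessionid` and the value `ABC` are placeholders I picked, and the browser side is simulated with a plain string.

```python
from http.cookies import SimpleCookie

# Server side: build the Set-Cookie header that asks the browser to store a value.
# "sessionid" and "ABC" are placeholder names for this sketch.
response_cookie = SimpleCookie()
response_cookie["sessionid"] = "ABC"
response_cookie["sessionid"]["path"] = "/"
set_cookie_header = response_cookie.output()

# Browser side (simulated): on the next request to the same site,
# the stored value comes back in a Cookie request header.
cookie_header = "sessionid=ABC"

# Server side again: parse the incoming Cookie header and read the value.
request_cookie = SimpleCookie()
request_cookie.load(cookie_header)
received_value = request_cookie["sessionid"].value

print(set_cookie_header)
print(received_value)
```

The point is that the server only ever sees header text: one header going out that says "store this", and one header coming back that repeats what was stored.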

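browser -> login request (sends username, password) -> server
                                                      -> authentication
                                                      check whether a user matching the given username/password exists
                                                      -> if authentication succeeds, store a "specific value" for that user in the DB
                                                      -> send the "specific value" in the HTTP response as a Set-Cookie header
                                                        -> the browser receives the response and saves the contents of the Set-Cookie header in its cookie storage

-> afterwards, every request from browser -> server carries the specific value from cookie storage
  -> the server checks whether the received request contains the specific value; if it matches a particular user, the server treats the request as coming from that same user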
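The flow above can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: `USERS` and `SESSIONS` are in-memory dictionaries standing in for the DB tables, `login` and `user_for_request` are hypothetical helper names, and the password is compared in plain text only to keep the sketch short (a real system would store a hash).

```python
import secrets
from http.cookies import SimpleCookie

# Hypothetical in-memory stand-ins for the user table and the session table.
USERS = {"jihye": "pw1234"}
SESSIONS = {}


def login(username, password):
    """Authenticate, store a session value, and return the Set-Cookie header line."""
    if USERS.get(username) != password:
        return None  # authentication failed, no cookie is issued
    session_value = secrets.token_hex(16)  # the "specific value" for this user
    SESSIONS[session_value] = username     # store it on the server side
    cookie = SimpleCookie()
    cookie["sessionid"] = session_value
    return cookie.output()


def user_for_request(cookie_header):
    """Find which user a request belongs to from its Cookie header."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("sessionid")
    if morsel is None:
        return None
    return SESSIONS.get(morsel.value)


header = login("jihye", "pw1234")
value = header.split("sessionid=", 1)[1]
print(user_for_request(f"sessionid={value}"))
```

The random value itself carries no user information; it is just a key into the server-side table, which is why the server can treat any request carrying it as coming from the same user.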

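Explanation

When the browser sends a login request, it sends the username and password to the server. When the server receives them, it first goes through authentication.

(Going through authentication) means the server checks whether a user matching the given username/password exists. If authentication succeeds, the server stores a specific value for that user in the DB. (This is what "maintaining a session" refers to: storing a session value.)

After storing it in the DB, the server puts the specific value in the HTTP response as a Set-Cookie header (the Set-Cookie header tells the browser "set this cookie") and sends it. The browser receives the response and saves the contents in its cookie storage. From then on, every request from the browser to that server carries the specific value from cookie storage.

The server then checks whether the request it received contains the specific value, and if that value matches a particular user, it treats the request as coming from that same user.

This is why, on the admin page, we stay logged in after logging in once, even though we never configured anything ourselves.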


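Completed webtoon crawler in Python - utils.py

10 Jun 2018 | python crawling

I took the Fast Campus web programming course and summarized the important parts.
These notes are kept for my own study, so they may contain errors.
This post explains how to crawl. Please use this script for learning purposes only.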

import sys

from crawl import Webtoon


def ini():
    """
    Entry point: search for a webtoon by title and open its menu
    :return:
    """
    while True:
        # Search for a webtoon by title
        print('---------------------------------')
        keyword = input('Enter the webtoon title to search for: ')
        result_lists = Webtoon.search_webtoon(keyword)
        # No results for the search keyword
        if not result_lists:
            print('No matching webtoon found. Please try again.\n')
            continue
        break

    # Print the webtoons matching the keyword and let the user pick one
    select_webtoon = print_search_list(result_lists)

    # After a webtoon is chosen, run the info/save menu for it
    webtoon_select(select_webtoon)


def webtoon_select(webtoon):
    """
    Menu for showing info about, or saving, the selected webtoon
    :param webtoon: the Webtoon instance selected in ini()
    :return:
    """
    while True:
        print('--------------------------------------------')
        print(f'"{webtoon.title}" is currently selected.')
        print(f'1. Show info for {webtoon.title}')
        print(f'2. Save {webtoon.title}')
        print('3. Search for and select another webtoon')
        print('0. Quit')
        select = input('Select: ')

        # Print webtoon info: author, description, total number of episodes
        # (strings are compared with ==; `is` tests object identity)
        if select == '1':
            print('--------------------------------------------')
            print(webtoon.info)

        # Save the webtoon
        elif select == '2':
            select_episode_download(webtoon)

        # Start a new search; the new menu takes over from here
        elif select == '3':
            return ini()

        elif select == '0':
            print('Exiting the webtoon search tool.')
            sys.exit(0)
        else:
            print('The number you entered is not valid.\n')
            continue


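def print_search_list(result_lists):
    """
    Print the webtoons matching the keyword and let the user pick one
    :param result_lists: list of Webtoon instances
    :return: the selected Webtoon instance
    """
    while True:
        # Print the search results
        for num, webtoon in enumerate(result_lists):
            # Lists are zero-based, so show num+1 for the user's convenience
            print(f'{num+1}: {webtoon.title}')

        # input() returns a string, so convert it to int
        select_str = input('Select a webtoon: ')
        try:
            select = int(select_str) - 1
        except ValueError:
            print('The number you entered is not valid.\n')
            continue

        # The chosen number is outside the valid indexes of the list
        if select not in range(len(result_lists)):
            print('The number you entered is not valid.\n')
            continue

        # Pick the webtoon that matches the chosen number
        select_webtoon = result_lists[select]
        break
    return select_webtoon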


def select_episode_download(webtoon):
    """
    Menu for saving episodes of the selected webtoon
    :param webtoon:
    :return:
    """
    while True:
        print('--------------------------------------------')
        print(f'"{webtoon.title}" is currently selected.')
        print('1. Save all episodes')
        print('2. Save a single episode')
        print('3. Go back')
        select_download = input('Select: ')
        # Reverse the episode list with reversed() so that
        # downloads start from episode 1 (ascending order)
        reverse_webtoon_episode_list = list(reversed(webtoon.episode_list))
        if select_download == '1':
            # Download every episode
            for episode in reverse_webtoon_episode_list:
                episode.download_all_images()
                print(f'Episode {episode.no} has been downloaded.')
        elif select_download == '2':
            while True:
                # Download a single episode
                for num, episode in enumerate(reverse_webtoon_episode_list):
                    print(f'{num+1}.{episode.title}')
                select_episode = input('Select: ')

                try:
                    index = int(select_episode) - 1
                except ValueError:
                    index = -1
                # Valid choices are the numbers printed above: 1..len(list)
                if index not in range(len(reverse_webtoon_episode_list)):
                    print('The number you entered is not valid.\n')
                    continue
                e = reverse_webtoon_episode_list[index]
                e.download_all_images()
                print(f'Episode {e.no} has been downloaded.')
                break
        elif select_download == '3':
            break
        else:
            print('The number you entered is not valid.\n')


if __name__ == '__main__':
    ini()
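
The menus above all do the same thing: read a 1-based number from `input()` and turn it into a list index. As a rough sketch, that step could be pulled into one small helper. `parse_choice` is a hypothetical name of mine and is not part of the script above.

```python
def parse_choice(raw, count):
    """Return a zero-based index for a 1-based menu choice, or None if invalid."""
    try:
        number = int(raw)
    except ValueError:
        # Non-numeric input such as 'abc' or an empty string
        return None
    if 1 <= number <= count:
        return number - 1
    return None


# Usage with a three-item menu
for raw in ['1', '3', '0', '4', 'abc']:
    print(repr(raw), '->', parse_choice(raw, 3))
```

Keeping the conversion and the range check in one place means every menu rejects `0`, out-of-range numbers, and non-numeric input the same way.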


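Completed webtoon crawler in Python - crawl.py

10 Jun 2018 | python crawling

I took the Fast Campus web programming course and summarized the important parts.
These notes are kept for my own study, so they may contain errors.
This post explains how to crawl. Please use this script for learning purposes only.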

import os
from urllib import parse

import requests
from bs4 import BeautifulSoup


class EpisodeImage:
    def __init__(self, episode, url):
        self.episode = episode
        self.url = url


class Episode:
    def __init__(self, webtoon, no, url_thumbnail,
                 title, rating, created_date):
        self.webtoon = webtoon
        self.no = no
        self.url_thumbnail = url_thumbnail
        self.title = title
        self.rating = rating
        self.created_date = created_date
        self.image_list = list()

    @property
    def url(self):
        """
        Build and return the actual episode page URL
        from self.webtoon and self.no
        :return:
        """
        url = 'http://comic.naver.com/webtoon/detail.nhn?'
        params = {
            'titleId': self.webtoon.webtoon_id,
            'no': self.no,
        }

        episode_url = url + parse.urlencode(params)
        return episode_url

    def get_image_url_list(self):
        # The episode page is cached as a local HTML file
        file_path = f'./data/{self.webtoon.webtoon_id}-{self.no}.html'

        if not os.path.exists(file_path):
            # Request first, then write, so a failed request leaves no empty file
            response = requests.get(self.url)
            html = response.text
            os.makedirs('data', exist_ok=True)
            with open(file_path, 'wt') as f:
                f.write(html)
        else:
            with open(file_path, 'rt') as f:
                html = f.read()

        soup = BeautifulSoup(html, 'lxml')
        wt_viewer = soup.select_one('div.wt_viewer').select('img')

        # Rebuild the list on every call so repeated calls do not add duplicates
        self.image_list = list()
        for img in wt_viewer:
            new_ep_image = EpisodeImage(
                episode=self.no,
                url=img.get('src')
            )
            self.image_list.append(new_ep_image)
        return self.image_list

    def download_all_images(self):
        for url in self.get_image_url_list():
            self.download(url)

    def download(self, url_img):
        """
        :param url_img: EpisodeImage holding the actual image URL
        :return:
        """
        # Fill in the 'Referer' HTTP header so the server does not reject the request
        url_referer = f'http://comic.naver.com/webtoon/list.nhn?titleId={self.webtoon.webtoon_id}'
        headers = {
            'Referer': url_referer,
        }
        response = requests.get(url_img.url, headers=headers)
        # Take the image file name from the image URL
        file_name = url_img.url.rsplit('/', 1)[-1]

        # Directory where the image is saved; create it if it does not exist
        dir_path = f'data/{self.webtoon.webtoon_id}/{self.no}'
        os.makedirs(dir_path, exist_ok=True)

        # File path for the image; open in 'wb' mode to write binary data
        file_path = f'{dir_path}/{file_name}'
        with open(file_path, 'wb') as f:
            f.write(response.content)

        # Append an <img> tag to an HTML file so the saved images can be viewed in a browser
        with open('data/{}/{}.html'.format(self.webtoon.webtoon_id, self.no), 'a') as f:
            f.write('<img src="{}/{}">'.format(self.no, file_name))


class Webtoon:
    def __init__(self, webtoon_id):
        self.webtoon_id = webtoon_id
        self._title = None
        self._author = None
        self._description = None
        self._episode_list = list()
        self._html = ''
        self.page = 1

    def _get_info(self, attr_name):
        if not getattr(self, attr_name):
            self.set_info()
        return getattr(self, attr_name)

    @property
    def episode_list(self):
        if not self._episode_list:
            self.crawl_episode_list()
        return self._episode_list

    @property
    def title(self):
        return self._get_info('_title')

    @property
    def author(self):
        return self._get_info('_author')

    @property
    def description(self):
        return self._get_info('_description')

    @property
    def html(self):
        # Path where the HTML file for the current page is saved or loaded.
        # The page number is part of the path, so each page gets its own file.
        file_path = 'data/_episode_list-{webtoon_id}-{page}.html'.format(webtoon_id=self.webtoon_id,
                                                                         page=self.page)
        # URL to send the HTTP request to
        url_episode_list = 'http://comic.naver.com/webtoon/list.nhn'
        # GET parameters to send with the HTTP request
        params = {
            'titleId': self.webtoon_id,
            'page': self.page,
        }
        # Check whether the HTML file is already saved locally
        if os.path.exists(file_path):
            # If it is, read the file and assign it to html
            with open(file_path, 'rt') as f:
                html = f.read()
        else:
            # If it is not, send an HTTP GET request with requests
            response = requests.get(url_episode_list, params)
            # Assign the text attribute of the response object to html
            html = response.text
            # Save the received text as an HTML file (create data/ if missing)
            os.makedirs('data', exist_ok=True)
            with open(file_path, 'wt') as f:
                f.write(html)
        self._html = html
        return self._html

    @property
    def info(self):

        return '{title} \n' \
               'Author : {author} \n' \
               'Description : {description} \n' \
               'Total episodes : {episode_list}'.format(title=self.title,
                                                        author=self.author,
                                                        description=self.description,
                                                        episode_list=len(self.episode_list))

    @classmethod
    def all_webtoon_crawler(cls, keyword):
        url = 'https://comic.naver.com/webtoon/weekday.nhn'
        response = requests.get(url)

        soup = BeautifulSoup(response.text, 'lxml')
        all_webtoon_list = soup.select('div.col_inner > ul > li > a')

        result = list()
        for webtoon in all_webtoon_list:
            title = webtoon.get_text()
            if keyword in title:
                href = webtoon.get('href', '')
                query_string = parse.urlsplit(href).query
                query_dict = dict(parse.parse_qsl(query_string))
                titleId = query_dict['titleId']
                check = [item for item in result if item['titleId'] == titleId]
                if not check:
                    result.append({
                        'titleId': titleId,
                        'title': title,
                    })
        return result

    @classmethod
    def search_webtoon(cls, keyword):
        """
        Search for webtoons whose title contains keyword
        :param keyword:
        :return: list of Webtoon instances
        """
        search_result = cls.all_webtoon_crawler(keyword)
        result_list = list()
        if search_result:
            for webtoon in search_result:
                webtoon_title_id = cls(webtoon_id=webtoon['titleId'])
                result_list.append(webtoon_title_id)
        return result_list

    def set_info(self):
        """
        Parse this instance's html attribute and use the result to
        set its own title, author and description attributes
        :return: None
        """
        # Create a BeautifulSoup object and assign it to soup
        soup = BeautifulSoup(self.html, 'lxml')

        h2_title = soup.select_one('div.detail > h2')
        title = h2_title.contents[0].strip()
        author = h2_title.contents[1].get_text(strip=True)
        # div.detail > p (description)
        description = soup.select_one('div.detail > p').get_text(strip=True)

        # Set this instance's attributes from its own html data
        # (either fetched from the web or read from a file)
        self._title = title
        self._author = author
        self._description = description

    def crawl_episode_list(self):
        """
        Build the Episode list from the HTML documents for this webtoon_id
        :return:
        """
        while True:
            # Create a BeautifulSoup object for the current page
            soup = BeautifulSoup(self.html, 'lxml')

            # table that holds the episode list
            table = soup.select_one('table.viewList')
            # all tr elements inside the table
            tr_list = table.select('tr')
            # number of the last episode parsed on this page
            no = None
            # The first tr belongs to thead, so iterate over tr_list[1:]
            for tr in tr_list[1:]:
                # Episode rows have no class attribute,
                # so skip the current tr if it has one
                if tr.get('class'):
                    continue

                # 'src' of the img tag under the first td of the current tr
                url_thumbnail = tr.select_one('td:nth-of-type(1) img').get('src')
                # 'href' of the child a tag of the first td of the current tr
                url_detail = tr.select_one('td:nth-of-type(1) > a').get('href')
                query_string = parse.urlsplit(url_detail).query
                query_dict = parse.parse_qs(query_string)
                no = query_dict['no'][0]

                # Text of the child a element of the second td of the current tr
                title = tr.select_one('td:nth-of-type(2) > a').get_text(strip=True)
                # Text of the strong tag under the third td of the current tr
                rating = tr.select_one('td:nth-of-type(3) strong').get_text(strip=True)
                # Text of the fourth td of the current tr
                created_date = tr.select_one('td:nth-of-type(4)').get_text(strip=True)

                # Create an Episode instance for each episode
                new_episode = Episode(
                    webtoon=self,
                    no=no,
                    url_thumbnail=url_thumbnail,
                    title=title,
                    rating=rating,
                    created_date=created_date,
                )
                # Add the Episode instance to _episode_list
                self._episode_list.append(new_episode)
            # Stop at episode 1, or when the page had no episode rows at all
            if no is None or no == '1':
                break
            # Otherwise add 1 to page and crawl the next page of the episode list
            self.page += 1
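
The crawler reads `titleId` and `no` out of link URLs in two slightly different ways (`parse_qsl` in one place, `parse_qs` in the other). A small sketch of the difference, using a made-up `href` value rather than one taken from the real page:

```python
from urllib import parse

# Sample href, made up for this sketch
href = '/webtoon/detail.nhn?titleId=679519&no=12&weekday=thu'

# Take only the part after '?'
query_string = parse.urlsplit(href).query

# parse_qs: every value is a list, because a key may appear more than once
as_lists = parse.parse_qs(query_string)

# parse_qsl: a list of (key, value) pairs, which dict() turns into plain strings
as_dict = dict(parse.parse_qsl(query_string))

print(as_lists['no'])
print(as_dict['no'])
```

With `parse_qs` the value has to be read as `query_dict['no'][0]`, while the `dict(parse_qsl(...))` form gives the string directly but keeps only the last value if a key repeats.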


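Webtoon crawling with Python - reviewing my own code (needs revision)

03 Jun 2018 | python crawling

I took the Fast Campus web programming course and summarized the important parts.
These notes are kept for my own study, so they may contain errors.
This post explains how to crawl. Please use this script for learning purposes only.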

import os
from urllib import parse

import requests
from bs4 import BeautifulSoup


class Webtoon:
    def __init__(self, webtoon_id):
        self.webtoon_id = webtoon_id
        self._title = None
        self._author = None
        self._description = None
        self._html = ''

    @property
    def html(self):
        if not self._html:
            file_path = 'data/_episode_list-{webtoon_id}.html'.format(webtoon_id=self.webtoon_id)
            url_episode_path = 'https://comic.naver.com/webtoon/list.nhn'
            params = {
                'titleId': self.webtoon_id,
            }

            if os.path.exists(file_path):
                # Reuse the locally saved HTML file
                with open(file_path, 'rt') as f:
                    html = f.read()

            else:
                # Fetch the page, then save it (create data/ if missing)
                response = requests.get(url_episode_path, params)
                html = response.text
                os.makedirs('data', exist_ok=True)
                with open(file_path, 'wt') as f:
                    f.write(html)
            self._html = html
        return self._html

    def set_info(self):

        soup = BeautifulSoup(self.html, 'lxml')

        h2_title = soup.select_one('div.detail > h2')
        title = h2_title.contents[0].strip()
        author = h2_title.contents[2].get_text(strip=True)
        description = soup.select_one('div.detail > p').get_text(strip=True)

        self._title = title
        self._author = author
        self._description = description

    @property
    def title(self):
        if not self._title:
            self.set_info()
        return self._title

    @property
    def author(self):
        if not self._author:
            self.set_info()
        return self._author

    @property
    def description(self):
        if not self._description:
            self.set_info()
        return self._description


class Episode(Webtoon):
    def __init__(self, webtoon_id, title, url):
        # Webtoon.__init__ expects a webtoon_id, not the instance itself
        super().__init__(webtoon_id)
        self._title = title
        self.url = url
        self._episode_list = list()

    def crawl_episode_list(self):

        soup = BeautifulSoup(self.html, 'lxml')

        table = soup.select_one('table.viewList')
        tr_list = table.select('tr')

        for tr in tr_list[1:]:
            # Episode rows have no class attribute, so skip rows that have one
            if tr.get('class'):
                continue

            url_thumbnail = tr.select_one('td:nth-of-type(1) img').get('src')

            url_detail = tr.select_one('td:nth-of-type(1) > a').get('href')
            query_string = parse.urlsplit(url_detail).query
            query_dict = parse.parse_qs(query_string)
            no = query_dict['no'][0]
            title = tr.select_one('td:nth-of-type(2) a').get_text(strip=True)
            rating = tr.select_one('td:nth-of-type(3) strong').get_text(strip=True)
            create_date = tr.select_one('td:nth-of-type(4)').get_text(strip=True)

            # Keep each episode's fields on the instance so the returned list is not empty
            self._episode_list.append([url_thumbnail, no, title, rating, create_date])
        return self._episode_list

    @property
    def episode_list(self):
        if not self._episode_list:
            self.crawl_episode_list()
        return self._episode_list

    def episode_url(self):
        url = 'http://comic.naver.com/webtoon/detail.nhn?'
        params = {
            'titleId': self.webtoon_id,
        }
        episode_url = url + parse.urlencode(params)
        return episode_url


class EpisodeImage:
    def __init__(self, webtoon_id, no, episode, url, file_path):
        self.episode = episode
        self.url = url
        self.file_path = file_path
        self.webtoon_id = webtoon_id
        self.no = no

    def get_image_url_list(self):
        file_path = 'data/episode_detail-{webtoon_id}-{episode_no}.html'.format(
            webtoon_id=self.webtoon_id,
            episode_no=self.no,
        )

        if os.path.exists(file_path):
            html = open(file_path, 'rt').read()

        else:
            response = requests.get(self.url)
            html = response.text
            open(file_path, 'wt').write(html)

        soup = BeautifulSoup(html, 'lxml')
        # select() returns a list of tags, so 'src' is read from each tag below
        img_list = soup.select('div.wt_viewer > img')

        return [img.get('src') for img in img_list]

    def download_image_file_path(self, url_img):
        url_referer = f'http://comic.naver.com/webtoon/list.nhn?titleId={self.webtoon_id}'
        headers = {
            'Referer': url_referer,
        }
        response = requests.get(url_img, headers=headers)
        # rsplit from the right so only the last path segment (the file name) is kept
        file_name = url_img.rsplit('/', 1)[-1]

        dir_path = f'data/{self.webtoon_id}/{self.no}'
        os.makedirs(dir_path, exist_ok=True)

        file_path = f'{dir_path}/{file_name}'
        open(file_path, 'wb').write(response.content)

    def download_image_episode(self):
        for url in self.get_image_url_list():
            self.download_image_file_path(url)


if __name__ == '__main__':
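    # Runs conditionally: __name__ is a special variable that Python uses internally,
    # so this code runs only when we execute this file directly.
    webtoon1 = Webtoon(679519)
    print(webtoon1.title)
    print(webtoon1.author)
    print(webtoon1.description)
    # Webtoon does not define episode_list (it lives on Episode), so the two lines
    # below raise AttributeError as written
    e1 = webtoon1.episode_list[0]
    e1.download_image_episode()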
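
`download_image_file_path` needs a file name taken from the image URL. A quick sketch of how `split` and `rsplit` behave for that, using a made-up URL on `example.com` rather than a real image address:

```python
# Made-up image URL for this sketch
url_img = 'https://example.com/webtoon/679519/1/sample_IMAG01_1.jpg'

# split('/', 1) cuts at the FIRST slash, so the "file name" is almost the whole URL
from_left = url_img.split('/', 1)[-1]

# rsplit('/', 1) cuts at the LAST slash, leaving only the final path segment
from_right = url_img.rsplit('/', 1)[-1]

print(from_left)
print(from_right)
```

Only the `rsplit` result is usable as a file name, since the `split` result still contains slashes and would be treated as nested directories when joined to `dir_path`.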


Jihye's Development Study Log
