Getting post titles from one Naver blog page (iframe crawling)

2022. 7. 10. 23:29

By 수알치 오상문

Even after you load a page, elements that live inside an iframe are not found when you search the top-level document.
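To see why, it may help to look at what the top-level document actually contains: only the iframe element itself, not the document loaded inside it. The following is a minimal sketch using the standard-library html.parser on a made-up page source. The HTML string, the `/inner-page.html` path, and the `IframeLister` class name are assumptions for illustration and are not taken from the real blog page.

```python
from html.parser import HTMLParser


class IframeLister(HTMLParser):
    """Collects the attributes of every <iframe> tag in a page source."""

    def __init__(self):
        super().__init__()
        self.iframes = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lowercase
        if tag == "iframe":
            self.iframes.append(dict(attrs))


# Hypothetical outer page: the post list lives in the framed document,
# so none of its markup appears in this HTML.
outer_html = """
<html><body>
  <h1>Blog</h1>
  <iframe id="mainFrame" src="/inner-page.html"></iframe>
</body></html>
"""

lister = IframeLister()
lister.feed(outer_html)
for frame in lister.iframes:
    print(frame.get("id"), frame.get("src"))
```

The outer source only tells you which frame exists and where it points. Its contents have to be reached through the frame itself.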

To crawl the content of an iframe, switch to that frame first and then search.

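from selenium.webdriver.common.by import By

# Switch into the iframe
# (find_element_by_tag_name() is not available in current Selenium 4 releases)
content = browser.find_element(By.TAG_NAME, "iframe")
browser.switch_to.frame(content)

# Crawl inside the frame...
...

# When the work inside the frame is done, switch back to the top-level page
browser.switch_to.default_content()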
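One way to make sure the switch back always happens, even when the crawling step raises an exception, is to wrap the two calls in a context manager. This is only a sketch under stated assumptions: `inside_frame` is a hypothetical helper name, and `FakeSwitchTo` / `FakeDriver` are stand-ins that exist solely so the example runs without a browser. The helper relies on nothing but the `switch_to.frame()` and `switch_to.default_content()` calls shown above.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame):
    """Switch into `frame`, then always switch back to the top-level page."""
    driver.switch_to.frame(frame)
    try:
        yield driver
    finally:
        driver.switch_to.default_content()


class FakeSwitchTo:
    """Stand-in that records calls instead of driving a browser."""

    def __init__(self):
        self.calls = []

    def frame(self, frame_reference):
        self.calls.append(("frame", frame_reference))

    def default_content(self):
        self.calls.append(("default_content",))


class FakeDriver:
    def __init__(self):
        self.switch_to = FakeSwitchTo()


demo = FakeDriver()
try:
    with inside_frame(demo, "mainFrame"):
        raise RuntimeError("simulated crawling failure")
except RuntimeError:
    pass

# Both calls were made even though the body raised.
print(demo.switch_to.calls)
```

With a real driver the usage would presumably look like `with inside_frame(browser, content): ...`, where `browser` and `content` are the names from the snippet above.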

------------------------------------------------------------------------------------------------------------------------
The following source code fetches the post titles from the most recent page of the Sualchi blog on Naver.

from bs4 import BeautifulSoup  # only needed if the page source is parsed with BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time

# Chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")

url = r"https://blog.naver.com/sualchi"
driver = None

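try:
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),
                              options=chrome_options)
    # Creating a WebDriverWait object without calling .until() does not wait,
    # so an implicit wait is set instead: later find_element calls retry
    # for up to 10 seconds.
    driver.implicitly_wait(10)
    driver.get(url)
    time.sleep(2)  # crude pause to let the page finish loading

    ##    html = driver.page_source
    ##    soup = BeautifulSoup(html, "lxml")
    ##    time.sleep(1)

    if "No results found." in driver.page_source:
        print("Page not found")
        driver.quit()
        raise SystemExit(1)
except Exception as e:
    print("Page load error:", e)
    if driver is not None:  # driver is still None if Chrome failed to start
        driver.quit()
    raise SystemExit(1)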

try:
    content = driver.find_element(By.ID, "mainFrame")  # the iframe
    driver.switch_to.frame(content)
except Exception as e:
    print("iframe selection error:", e)
    driver.quit()  # shut down the web driver
    raise SystemExit(1)

try:
    table = driver.find_element(By.XPATH, '//*[@id="listTopForm"]/table')
except Exception as e:
    print("Table selection error:", e)
    driver.quit()  # shut down the web driver
    raise SystemExit(1)

# Fetch the post list
print("Start of first-page post list -------------------------------")

tbody = table.find_elements(By.TAG_NAME, "tbody")
for tr in tbody:
    count = 0
    for td in tr.text.split('\n'):
        if count % 2 == 0:  # even-indexed lines hold the titles
            print(count // 2 + 1, td)
        count += 1

# The following also works

#tbody = table.find_elements(By.TAG_NAME, "tbody")
#tr = tbody[0]
#for n, td in enumerate(tr.text.split('\n')[::2], start=1):
#    print(n, td)

print("End of list -------------------------------------------------")

# input("Finish?: ")  # wait before exiting
driver.quit()  # shut down the web driver

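[Output] (post titles translated from Korean)

Start of first-page post list -------------------------------
1 BeautifulSoup basics
2 Selenium: difference between text and innerHTML
3 Selenium: checking, setting, and removing element attributes
4 Selenium: getting attribute values with get_attribute()
5 JQuery basic syntax
6 Selenium: reading an HTML table structure
7 Selenium: XPath expressions for finding elements
8 Selenium: option for crawling with the web browser window off
9 selenium AttributeError 'Webdriver' object has no attribute 'find_element_by ... error
10 Selenium: waiting for a page to load (specific conditions)
11 Automatic Chrome driver download and DeprecationWarning: executable_path has been deprecated, please pass in a Service
12 Selenium: searching Google Images and downloading the results
13 Selenium: getting the title of the number 1 bestseller on Yes24
14 Python: crawling libraries
15 Selenium: how to use quit() and close()
16 Accessing sites that block bots
17 Selenium basic example 2
18 Try this setting when WebView fonts are broken
19 Military equipment that uses AI to detect objects behind walls is developed
20 Python: removing the 주식회사 (corporation) designation and whitespace from company names (without regex)
End of list -------------------------------------------------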
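
The numbering in this output comes from keeping every other line of the table body's text. The sketch below isolates that step in plain Python. The sample text, the `extract_titles` name, and the assumption that the skipped lines are dates are all made up for illustration and may not match the real table.

```python
def extract_titles(tbody_text):
    """Return every other line of the text, starting with the first."""
    return tbody_text.split("\n")[::2]


# Hypothetical text of a table body: title lines alternating with other lines
sample_text = "\n".join([
    "First post title", "2022. 7. 10.",
    "Second post title", "2022. 7. 9.",
])

for n, title in enumerate(extract_titles(sample_text), start=1):
    print(n, title)
```

If the table layout changes so that titles no longer sit on alternating lines, this slicing would silently pick the wrong lines, so selecting the title cells directly is likely more robust.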

[Reference] An example that reads an iframe table from another site and saves it as CSV (the site currently returns an error when accessed)

# Source: https://codemath.github.io/Selenium-%ED%99%9C%EC%9A%A9%EA%B8%B0(1)/
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By
import csv
urls = "http://61.73.100.30/kr/bbs/iframe/list.jsp?bbsid=1273209703694"
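# Open the CSV file (newline='' keeps the csv module from writing blank rows on Windows)
f = open('output.csv', 'w', newline='', encoding='utf-8')
wr = csv.writer(f)
# Header columns: #, company name, website, representative, mobile
wr.writerow(['#', '상호', '홈페이지', '대표자', '모바일'])
# Python 2:
# wr.writerow([u'#', u'상호'.encode('utf-8'), u'홈페이지'.encode('utf-8'),
#              u'대표자'.encode('utf-8'),u'모바일'.encode('utf-8')])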
driver = webdriver.Chrome()
driver.get(urls)
a_link = 3
# There are 152 pages to crawl in total, and the last page
# is assumed to always hold exactly 4 posts
for each in range(1,153):
    if each != 152:
        sections = range(1,11)
    else:
        sections = range(1,5)
    for li in sections:
        # XPath of row `li`; the five cells are read from td[1]..td[5]
        row = '//*[@id="content"]/div[2]/table/tbody/tr['+str(li)+']'
        ids = driver.find_element(By.XPATH, row + '/td[1]').text
        title = driver.find_element(By.XPATH, row + '/td[2]').text
        webs = driver.find_element(By.XPATH, row + '/td[3]').text
        name = driver.find_element(By.XPATH, row + '/td[4]').text
        mobile = driver.find_element(By.XPATH, row + '/td[5]').text
        wr.writerow([ids, title, webs, name, mobile])
        # python 2
        # wr.writerow([ids.encode('utf-8'), title.encode('utf-8'), webs.encode('utf-8'),
        #              name.encode('utf-8'),mobile.encode('utf-8')])
        print(ids, title, webs, name, mobile)
    if a_link == 13:
        a_link = 4
    else:
        a_link += 1        
    page_btn = driver.find_element(By.XPATH, '//*[@id="content"]/div[3]/div/a[%s]' %(a_link))
    page_btn.click()
print("----- The End. -----")
f.close()
driver.quit()
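
The CSV-writing part of this example can be tried on its own, without the site being reachable. The sketch below writes a header and one made-up row to a temporary file and reads them back. The row data, the temporary path, and the choice of `utf-8-sig` (which tends to help spreadsheet programs detect UTF-8) are assumptions, not part of the original script.

```python
import csv
import os
import tempfile

# Hypothetical rows; the header matches the one used in the script above
rows = [
    ["#", "상호", "홈페이지", "대표자", "모바일"],
    ["1", "예시상사", "http://example.com", "홍길동", "010-0000-0000"],
]

path = os.path.join(tempfile.mkdtemp(), "output.csv")

# newline="" lets the csv module control line endings itself;
# the with-block closes the file even if writing fails.
with open(path, "w", newline="", encoding="utf-8-sig") as f:
    csv.writer(f).writerows(rows)

# Read the file back to confirm the Korean text survived the round trip
with open(path, newline="", encoding="utf-8-sig") as f:
    read_back = list(csv.reader(f))

print(read_back == rows)
```

Using `with` instead of a bare `open()` / `close()` pair means the file is flushed and closed even if the crawl stops partway through.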

License: Attribution, NonCommercial, NoDerivatives
