Why doesn't my spider find body text?

sigalizer · Oct-30-2019, 08:54 AM

So I'm trying to work with a Scrapy project that I found, called RISJbot (GitHub), to extract the contents of news articles for my research. However, I ran into a problem whose source I can't find and that I don't know how to fix: the spider can't find body texts (in my case, most importantly, the articles themselves). I've tried two of these spiders so far: the CNN one gave mixed results, and the Washington Post one couldn't find a single body text.

It gives me this error message:

Error:
ERROR:RISJbot.pipelines.checkcontent:No bodytext: https://www.washingtonpost.com/world/europe/russia-and-cuba-rebuild-ties-that-frayed-after-cold-war/2019/10/29/d046cc0a-fa09-11e9-9e02-1d45cb3dfa8f_story.html

It also returns this error message; I'm not sure whether it is related to my problem:

Error:
ERROR:scrapy.utils.signal:Error caught on signal handler: >
Traceback (most recent call last):
  File "C:\Users\sigalizer\Anaconda3\envs\scrapyenv\lib\site-packages\twisted\internet\defer.py", line 151, in maybeDeferred
    result = f(*args, **kw)
  File "C:\Users\sigalizer\Anaconda3\envs\scrapyenv\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "C:\Users\sigalizer\Anaconda3\envs\scrapyenv\lib\site-packages\scrapy\extensions\feedexport.py", line 243, in item_scraped
    slot = self.slot
AttributeError: 'FeedExporter' object has no attribute 'slot'

When it doesn't find the body text, as a fallback it generates a gzipped, Base64-encoded version of the whole page. I managed to turn off this function to check whether the page contains the part I'm looking for, and it does indeed contain the body text (albeit a very distorted version, with all the HTML markup, but I found a couple of words). So the page loads, and the text doesn't depend on JavaScript.
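For anyone who would rather inspect the fallback output than switch it off, here is a minimal sketch of reversing that encoding. It assumes the stored value is plain Base64 wrapping gzip-compressed UTF-8 HTML; the exact field name and encoding details in RISJbot are not confirmed here, and `decode_page`, `encode_page` and the sample HTML are hypothetical names made up for the illustration.

```python
import base64
import gzip


def decode_page(encoded: str) -> str:
    """Reverse a gzip-then-Base64 encoding and return the page as text."""
    compressed = base64.b64decode(encoded)
    # errors="replace" keeps going if the page is not clean UTF-8 (assumption).
    return gzip.decompress(compressed).decode("utf-8", errors="replace")


def encode_page(html: str) -> str:
    """Build a sample payload the same way, so the round trip can be checked."""
    return base64.b64encode(gzip.compress(html.encode("utf-8"))).decode("ascii")


# Hypothetical stand-in for a stored page.
sample = "<html><body><p>Russia and Cuba rebuild ties.</p></body></html>"
payload = encode_page(sample)

# Search the decoded HTML for a phrase that should be in the article.
print("rebuild ties" in decode_page(payload))
```

Searching the decoded markup for a sentence from the article is a quick way to tell "the text was never downloaded" apart from "the text was downloaded but the selectors missed it".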

What would you recommend me to do? I couldn't find a way so far to fix it.

Here's the spider itself (although as you can see, it imports a lot of strings from other files, that's why I shared the GitHub page with you):

# -*- coding: utf-8 -*-
from RISJbot.spiders.newssitemapspider import NewsSitemapSpider
from RISJbot.loaders import NewsLoader
# Note: mutate_selector_del_xpath is somewhat naughty. Read its docstring.
from RISJbot.utils import mutate_selector_del_xpath
from scrapy.loader.processors import Identity, TakeFirst
from scrapy.loader.processors import Join, Compose, MapCompose
import re

class WashingtonPostSpider(NewsSitemapSpider):
    name = 'washingtonpost'
    # allowed_domains = ['washingtonpost.com']
    # A list of XML sitemap files, or suitable robots.txt files with pointers.
    sitemap_urls = ['https://www.washingtonpost.com/news-sitemaps/index.xml']

    def parse_page(self, response):
        """@url http://www.washingtonpost.com/business/2019/10/25/us-deficit-hit-billion-marking-nearly-percent-increase-during-trump-era/?hpid=hp_hp-top-table-main_deficit-210pm%3Ahomepage%2Fstory-ans
        @returns items 1
        @scrapes bodytext bylines fetchtime firstpubtime headline source url 
        @noscrapes modtime
        """
        s = response.selector
        # Remove any content from the tree before passing it to the loader.
        # There aren't native scrapy loader/selector methods for this.        
        #mutate_selector_del_xpath(s, '//*[@style="display:none"]')

        l = NewsLoader(selector=s)

        # WaPo's ISO date/time strings are invalid: <datetime>-500 instead of
        # <datetime>-05:00. Note that the various standardised l.add_* methods
        # will generate 'Failed to parse data' log items. We've got it properly
        # here, so they aren't important.
        l.add_xpath('firstpubtime',
                    '//*[@itemprop="datePublished" or '
                        '@property="datePublished"]/@content',
                    MapCompose(self.fix_iso_date)) # CreativeWork

        # These are duplicated in the markup, so uniquise them.
        l.add_xpath('bylines',
                    '//div[@itemprop="author-names"]/span/text()',
                    set)
        l.add_xpath('section',
                    '//*[contains(@class, "headline-kicker")]//text()')


        # Add a number of items of data that should be standardised across
        # providers. Can override these (for TakeFirst() fields) by making
        # l.add_* calls above this line, or supplement gaps by making them
        # below.
        l.add_fromresponse(response)
        l.add_htmlmeta()
        l.add_schemaorg(response)
        l.add_opengraph()
        l.add_scrapymeta(response)

        return l.load_item()

    def fix_iso_date(self, s):
        return re.sub(r'^([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}[+-])'
                            '([0-9])([0-9]{2})$',
                      r'\g<1>0\g<2>:\g<3>',
                      s)
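
To make the date fix-up above easier to follow, here is the same substitution pulled out as a standalone function and run on a few made-up timestamps (the sample strings are assumptions, not values taken from washingtonpost.com). It only shows what the regular expression does; it is not a fix for the bodytext problem.

```python
import re


def fix_iso_date(s: str) -> str:
    # Same pattern and replacement as the spider's method, without `self`.
    return re.sub(r'^([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}[+-])'
                  r'([0-9])([0-9]{2})$',
                  r'\g<1>0\g<2>:\g<3>',
                  s)


# Three-digit offset: gets padded and a colon inserted.
print(fix_iso_date('2019-10-29T18:30-500'))      # 2019-10-29T18:30-05:00
# Already valid offset: the pattern does not match, so it is returned as-is.
print(fix_iso_date('2019-10-29T18:30-05:00'))    # 2019-10-29T18:30-05:00
# Timestamp with seconds: also not matched, so the bad offset stays.
print(fix_iso_date('2019-10-29T18:30:15-500'))   # 2019-10-29T18:30:15-500
```

Note that the pattern expects the offset right after `HH:MM`, so a timestamp that includes seconds passes through unchanged.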

sigalizer · (This post was last modified: Oct-30-2019, 01:35 PM by sigalizer.)

Update: I also checked the NY Times and FOX spiders, and they didn't find the bodytext either, so apparently it's a systematic issue and those few CNN articles are the outliers (for example this one).

Does anyone have any idea why this might be, and why the CNN one might be different?

Edit: The CBS one also found the bodytext everywhere (for example here), which makes me even more confused.

**Larz60+** · Oct-30-2019, 05:02 PM

two different programming teams -- two different methods

sigalizer · Oct-30-2019, 09:02 PM

(Oct-30-2019, 05:02 PM)Larz60+ Wrote: two different programming teams -- two different methods

Could you elaborate on that? I guess I left out an important thing: these guys created a separate spider for each media outlet. So technically all of these should work, yet it's not the case. Furthermore, those that don't work have the exact same problem. I just can't find the root cause of the issue.

**Larz60+** · Oct-30-2019, 09:51 PM

What I was trying to say is that web sites can be put together in many different ways, so a spider for one won't necessarily work for another, especially if JavaScript is being used.

sigalizer · Oct-30-2019, 11:35 PM

That's why I pointed out that these are actually separate spiders, slightly modified for each website, yet with the same issue. And I can confirm that in a couple of cases (like the Post), the site doesn't use JavaScript for the bodytext, so the issue is somewhere else.
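
One way to back up the "no JavaScript" claim is to look only at the static HTML, without running any scripts, and see whether paragraph text is there. Below is a rough sketch using just the standard library. The `ParagraphText` class, the `paragraph_text` helper and the sample markup are all hypothetical, and real article pages will be far messier than this.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements of static HTML."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # number of <p> elements currently open
        self.chunks = []   # text fragments seen inside paragraphs

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Script contents arrive here as raw text with depth 0, so they are skipped.
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def paragraph_text(html: str) -> str:
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up page: two real paragraphs plus markup that only exists inside a script.
raw = """<html><body>
<article><p>First paragraph.</p><p>Second paragraph.</p></article>
<script>var x = "<p>not real</p>";</script>
</body></html>"""
print(paragraph_text(raw))
```

If the article text shows up this way in the downloaded HTML, then the page is being fetched fine and the place to look is the XPath or metadata rules that are supposed to fill `bodytext`.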
